Ancestry Determination via Genetic Variant Analysis
Sat, 18 Apr 2020
Ancestry, Bioinformatics, Data Science, Genome Sequencing, Medicine, Snp, Statistics
Introduction
Sequencing of the human genome began in 1990 as part of the Human Genome Project. With the technology available at the time, the project was a substantial undertaking. The human genome contains two sets of 23 chromosomes, each set comprising roughly 3.2 billion base pairs. A number of institutions, in countries around the world, participated in the project. Thirteen years later, the project was completed at a cost of roughly three billion US dollars. The result was the first reference human genome.
Rapid advances in the field of genomics have dramatically lowered the cost of genetic sequencing and have ushered in the age of the once fabled "$1000 genome." Now, a growing list of companies offer whole genome sequencing for hundreds of dollars, with turnaround times measured in weeks. This technology enables introspection into the sequences of nucleobases that comprise the DNA, and thus the genes, of anyone curious enough to take the plunge.
Genetic Sequencing
With a reference human genome in place, one motivation for performing genome sequencing on a subject is to identify differences between the DNA of the subject and that of the reference. In this way, differences corresponding to genetic characteristics, diseases, or risk factors can be identified.
First, the sequencing data from the subject is aligned to the reference genome. This computationally intensive process involves aligning matching portions of the sample genome to the reference. Once alignment is complete, the i-th base pair in a given reference sequence corresponds to the i-th base pair in the corresponding sequence in the sample genome. Alignment allows for easier comparison of the sample to the reference.
Next, the two aligned genomes are compared. There are a variety of ways in which these two sequences of DNA can differ. Bases can be inserted, deleted, or replaced resulting in different types of variants. Several online resources catalogue known variants according to their ID. Many variants have no known clinical significance, however a growing number have been shown to be important in genetic traits, diseases, and predispositions [3].
The Dataset
The International Genome Sample Resource (IGSR) provides a collection of several thousand genomes in various file formats [1]. In addition to the full sequencing data, the project also provides files encoding the genetic variants each of the subjects possesses. Variant call format (VCF) files are used to store information about these genetic variants. An example of several rows from a VCF file is shown in Detail 1.
#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO             FORMAT  Sample001
20      14370  rs6054257  G    A    29    PASS    DP=14;AF=0.5;DB  GT:DP   0/1:1
20      17330  .          T    A    3     q10     DP=11;AF=0.017   GT:DP   1/1:3
Detail 1: The VCF File Format
The first row of the file contains the column names, while the next two rows contain information about two genetic variants. Both variants occur on chromosome 20, one at position 14370 and one at position 17330. The first variant has ID rs6054257, while the second has no ID assigned to it. The file contains information for a single sample: Sample001.
In addition to the sequencing data, the IGSR also has information about the gender and ethnic origins of each subject [1]. This data is employed later to construct a mathematical model.
Data Setup
Scikit-allel is used to read the VCF file into memory [2]. The data in the VCF file is loaded into a three-dimensional array where each row corresponds to a variant, each column corresponds to a subject, and the third axis corresponds to ploidy. Since human cells are diploid, the third axis has a length of two.
[[[0, 1]],
[[1, 1]]]
Detail 2: The Loaded VCF Genotype Array
Each element in this array is the ID of the allele, with 0 being the reference allele. The third axis is aggregated by counting the number of non-reference alleles across both copies of each chromosome. Thus, each element in the reduced matrix is in the set \(\{0, 1, 2\}\). Missing allele calls are assumed to match the reference and are treated as zero.
Now, a new matrix \(\mathbf{A}\) is constructed where every row of the matrix corresponds to a subject and every column corresponds to a variant ID. Due to the large number of possible variants, a list of common variants is obtained from the NCBI, and this list is further reduced using random sampling and feature selection. Each cell \(\mathbf{A}_{ij}\) of the matrix contains the number of copies of the j-th variant, regardless of allele ID, that subject i possesses. With this arrangement, humans, being diploid, can have a maximum value of two in any cell.
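As a concrete sketch of the counting and matrix construction (assuming scikit-allel's genotype array layout; the sample data here mirrors Detail 2, and the real array would be loaded from a VCF file), this can be expressed in numpy:

```python
import numpy as np

# Genotype array in scikit-allel's layout: (variants, samples, ploidy),
# where 0 is the reference allele and -1 marks a missing call.
# In practice this would come from something like:
#   gt = allel.GenotypeArray(allel.read_vcf(path)["calldata/GT"])
gt = np.array([
    [[0, 1]],   # Sample001 is heterozygous for the first variant
    [[1, 1]],   # and homozygous non-reference for the second
])

# Aggregate the ploidy axis by counting non-reference alleles per call.
# Missing calls (-1) are treated as reference, so each entry is 0, 1, or 2.
n_alt = (gt > 0).sum(axis=2)

# Transpose to get A: rows are subjects, columns are variants.
A = n_alt.T
print(A)  # [[1 2]]
```

scikit-allel's GenotypeArray also provides a to_n_alt() method that performs this same aggregation directly.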
Transformation
The high-dimensional sparse encoding matrix is transformed into a reduced dense matrix using truncated singular value decomposition. Using the singular vectors, the data is projected onto a lower-dimensional subspace that preserves a large amount of the information content found in the original matrix. Several features are then added, including the total number of variants, the total number of heterozygous variants, and the total number of homozygous variants.
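A minimal numpy sketch of this transformation, using a random stand-in for the real subjects-by-variants count matrix (in practice scikit-learn's TruncatedSVD would serve the same purpose on a sparse input):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the subjects × variants count matrix (values in {0, 1, 2}).
A = rng.integers(0, 3, size=(50, 200)).astype(float)

# Truncated SVD: keep only the k largest singular values/vectors and
# project the data onto the corresponding low-dimensional subspace.
k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_reduced = U[:, :k] * s[:k]          # dense (subjects × k) representation

# Append the summary features described in the text.
n_variants = (A > 0).sum(axis=1)      # total variants per subject
n_het = (A == 1).sum(axis=1)          # heterozygous variants
n_hom = (A == 2).sum(axis=1)          # homozygous non-reference variants
X = np.column_stack([A_reduced, n_variants, n_het, n_hom])
print(X.shape)  # (50, 13)
```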
Next, a 26-class classification problem is constructed. A random forest classifier is trained to predict the race of a subject from the dense feature vector, that is, from information about the genetic variants the subject possesses.
Results
The performance of the model is then estimated. Due to the small number of samples in the dataset, leave-one-out cross-validation is employed to evaluate the performance of the model. Table 1 lists the training and validation performance results as well as the number of nodes in the random forest.
Field          | Value
Train Accuracy | 98.52%
Test Accuracy  | 85.94%
Node Count     | 5974
Table 1: Model Evaluation Results
Both the training score and the node count are derived from a model trained on all samples, while the validation score is the average over all leave-one-out validation runs. Given the number and granularity of the classes, discrete accuracy is quite an unforgiving measure of model performance. Despite this, the model still performs ably.
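The leave-one-out evaluation described above can be sketched with scikit-learn (again with random stand-in data; the real run uses the full feature matrix and the 26 population labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))     # stand-in feature vectors
y = rng.integers(0, 3, size=30)  # stand-in class labels

# Each of the n runs trains on n - 1 samples and validates on the one
# held-out sample; the reported score is the mean over all runs.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X, y, cv=LeaveOneOut(),
)
print(round(scores.mean(), 2))
```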
Figure 1 shows the confusion matrix for the model, created by combining the results of the validation runs. The color of the cell in row i and column j indicates the percentage of samples of class i predicted to be of class j.
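A row-normalized confusion matrix of this kind can be computed with scikit-learn; the labels below are a small hypothetical stand-in for the combined validation predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]  # stand-in true class labels
y_pred = [0, 1, 1, 1, 2, 2, 0]  # stand-in predicted labels

# normalize='true' divides each row by its class count, so cell (i, j)
# is the fraction of class-i samples predicted as class j.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm.round(2))
```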
Figure 1: Model Confusion Matrix
As can be seen, the mistakes made by the model are frequently between two races that are more genetically similar. For example, the model makes several mistakes in differentiating Han Chinese in southern China from Han Chinese in northern China and in differentiating residents of the US from residents of the UK.
Finally, the model is evaluated by viewing a low-dimensional projection of the class probabilities. Principal component analysis is performed on the 26-column matrix containing the predicted class probabilities. The two largest components are preserved, yielding a two-dimensional projection.
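This projection step can be sketched as follows, assuming scikit-learn's PCA and using random probability vectors in place of the model's real predict_proba output:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Stand-in for the subjects × 26 matrix of predicted class probabilities;
# each row is a probability vector summing to one.
P = rng.dirichlet(np.ones(26), size=100)

# Keep the two largest principal components for a 2-D scatter plot.
coords = PCA(n_components=2).fit_transform(P)
print(coords.shape)  # (100, 2)
```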
Figure 2: Projected Model Class Prediction Scores
The spatial relationships among the projected samples corroborate the positive results of the model. In the plot, races that are intuitively more genetically similar to each other are located closer together. Table 2 provides the definitions for each of the annotations, taken from the IGSR website [1].
Label | Count | Description
GBR   | 91    | British in England and Scotland
FIN   | 99    | Finnish in Finland
CHS   | 105   | Southern Han Chinese, China
PUR   | 104   | Puerto Rican in Puerto Rico
CDX   | 93    | Chinese Dai in Xishuangbanna, China
CLM   | 94    | Colombian in Medellin, Colombia
IBS   | 107   | Iberian populations in Spain
PEL   | 85    | Peruvian in Lima, Peru
PJL   | 96    | Punjabi in Lahore, Pakistan
KHV   | 99    | Kinh in Ho Chi Minh City, Vietnam
ACB   | 96    | African Caribbean in Barbados
GWD   | 113   | Gambian in Western Division, The Gambia
ESN   | 99    | Esan in Nigeria
BEB   | 86    | Bengali in Bangladesh
MSL   | 85    | Mende in Sierra Leone
STU   | 102   | Sri Lankan Tamil in the UK
ITU   | 102   | Indian Telugu in the UK
CEU   | 99    | Utah residents of Northern & Western European ancestry
YRI   | 108   | Yoruba in Ibadan, Nigeria
CHB   | 103   | Han Chinese in Beijing, China
JPT   | 104   | Japanese in Tokyo, Japan
LWK   | 99    | Luhya in Webuye, Kenya
ASW   | 61    | African Ancestry in Southwest US
MXL   | 64    | Mexican Ancestry in Los Angeles, California
TSI   | 107   | Toscani in Italy
GIH   | 103   | Gujarati Indian in Houston, TX
Table 2: Annotation Label Definitions
Asians form a cluster near the bottom left of the graph. Those of African descent form a long arm off to the right. The third arm stretching off to the top left contains mostly subjects of northern and western European ancestry.
Conclusion
In the modern era, genetic sequencing is faster and cheaper than ever. A number of companies offer genetic sequencing services that look for known genetic variants. Further, a number of these companies provide access to raw VCF files. This post shows that VCF files can be used to construct a model that accurately predicts ethnic background. Empirical results suggest that this model is reasonably accurate and captures meaningful variance within the data.
References
[1] https://www.internationalgenome.org/
[2] https://scikit-allel.readthedocs.io/en/stable/
[3] https://www.snpedia.com/