Figure 1: Distribution of Missing Values in OpenSNP Data
Figure 1 shows a log-scale histogram of the percentage of missing values per user in the openSNP data over the selected SNPs. Most users fall in the smallest bin, although most of these still have at least some missing values. Over 75% of users are missing fewer than 1% of the selected SNPs, but only 115 users (~2.5%) have no missing values.
Figure 2: Performance Degradation from Missing Values using Zero-Filling
Empirically, the different imputation methods perform similarly, so zero-filling is selected as the simplest. Figure 2 shows the impact on model performance as the average percentage of missing values per sample increases. Model performance degrades roughly linearly with the percentage of missing values. For samples with fewer than 1% of values missing, the degradation is likely negligible.
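As a rough illustration of the two quantities discussed here, the sketch below computes the percentage of missing values per sample and applies zero-filling. It assumes missing calls are represented as None and that encoded genotype values are numeric; the function names and the toy data are hypothetical and are not taken from the original pipeline.

```python
from typing import Dict, List, Optional

Row = List[Optional[float]]


def missing_percent(row: Row) -> float:
    """Percent of entries in one sample that are missing (None)."""
    if not row:
        return 0.0
    return 100.0 * sum(v is None for v in row) / len(row)


def zero_fill(row: Row) -> List[float]:
    """Replace every missing entry with 0.0; observed values are unchanged."""
    return [0.0 if v is None else float(v) for v in row]


# Toy data (hypothetical): three samples over four SNPs.
samples: Dict[str, Row] = {
    "a": [1.0, 0.0, 2.0, 1.0],
    "b": [1.0, None, 2.0, 1.0],
    "c": [None, None, 0.0, 1.0],
}

percents = {name: missing_percent(row) for name, row in samples.items()}
filled = {name: zero_fill(row) for name, row in samples.items()}

# Keep only samples at or below a missing-value threshold (in percent).
threshold = 25.0
kept = [name for name, pct in percents.items() if pct <= threshold]
```

The same per-sample percentage can serve as a filter, for example to keep only samples with 1% or fewer of values missing.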
Figure 3: Population Counts and Probability Frequencies
Figure 3 shows both the counts of the most dominant population per sample and the average class probabilities computed over the openSNP data. The populations corresponding to Caucasians from the United States and Great Britain are the most common, accounting for roughly 30% of samples. This suggests that the user base of openSNP is primarily of Caucasian ancestry.
Figure 4: OpenSNP Genotype Data Embedding
The encoded matrix and inferred class probabilities are mapped into four dimensions using CCA and shown in a scatter plot. In Figure 4, the x- and y-axes correspond to the encoded matrix, and the color and size axes correspond to the inferred class probabilities. Plot markers identify the most likely class for each sample. A table containing the definitions of the population abbreviations is available in an earlier blog post [4]. The large cluster in the top left consists primarily of American and British individuals.
Figure 5: OpenSNP World Map Embedding
Note how the CEU (United States), GBR (Great Britain), and IBS (Iberian) populations are drawn towards each other in Figure 5. As a result, the embedding distorts the physical distances between these populations on a world map. This reflects the fact that these groups are more genetically similar than the spatial distance between them might suggest.
Figure 6: OpenSNP PCA Projection
Figure 6 shows a PCA projection of the openSNP samples with predicted class labels identified. Only samples with 1% or fewer of values missing are included in the plot. The clusters are perhaps less tight than those of the 1000 Genomes data, possibly due to the presence of more mixed individuals, but similar centroids still group together [4].
Figure 7: Population Similarity Map
Next, the population genetic similarity for a specific user, user 154, is displayed in Figure 7 as a choropleth heatmap. The model reports a maximum population similarity with the MXL population, corresponding to Mexican ancestry. This matches the ethnicity reported by the user: "Hispanic or Mexican American."
User 154: 22917 99.9% (- 22)
-User Report:
--Anc: Mexican Sephardic Jewish
--Eth: Hispanic or mexican american
-Predicted Ancestry:
--MXL 83.93% Mexican Ancestry in Los Angeles, California
--PEL 6.04% Peruvian in Lima, Peru
--CLM 2.65% Colombian in Medellin, Colombia
--PUR 2.18% Puerto Rican in Puerto Rico
User 920: 22910 99.9% (- 29)
-User Report:
--Anc: Southern European (mostly Iberian)
--Eth: Southern European (mostly Iberian)
-Predicted Ancestry:
--IBS 73.65% Iberian Populations in Spain
--TSI 19.56% Toscani in Italy
--CEU 0.85% Utah Residents of Northern & Western European Ancestry
--PUR 0.76% Puerto Rican in Puerto Rico
User 943: 22737 99.1% (- 202)
-User Report:
--Anc: British isles, northern europe
--Eth: Mixed
-Predicted Ancestry:
--GBR 62.47% British in England and Scotland
--CEU 21.85% Utah Residents of Northern & Western European Ancestry
--FIN 7.51% Finnish in Finland
--TSI 2.45% Toscani in Italy
User 2675: 22903 99.8% (- 36)
-User Report:
--Anc: East asian
--Eth: [...]
-Predicted Ancestry:
--JPT 98.82% Japanese in Tokyo, Japan
--CHB 0.36% Han Chinese in Beijing, China
--CHS 0.23% Southern Han Chinese, China
--CDX 0.08% Chinese Dai in Xishuangbanna, China
User 3050: 22781 99.3% (- 158)
-User Report:
--Anc: Irish italian german
--Eth: Italian
-Predicted Ancestry:
--TSI 60.66% Toscani in Italy
--GBR 24.25% British in England and Scotland
--CEU 6.51% Utah Residents of Northern & Western European Ancestry
--IBS 2.09% Iberian Populations in Spain
User 4034: 22892 99.8% (- 47)
-User Report:
--Anc: West African
--Eth: US Black
-Predicted Ancestry:
--ASW 26.01% African Ancestry in Southwest US
--ACB 23.14% African Caribbean in Barbados
--YRI 13.33% Yoruba in Ibadan, Nigeria
--MSL 11.59% Mende in Sierra Leone
User 4577: 22931 100.0% (- 8)
-User Report:
--Anc: South european
--Eth: Iberian
-Predicted Ancestry:
--IBS 40.84% Iberian Populations in Spain
--TSI 31.00% Toscani in Italy
--GBR 14.90% British in England and Scotland
--CEU 2.70% Utah Residents of Northern & Western European Ancestry
User 4832: 22649 98.7% (- 290)
-User Report:
--Anc: Northern european
--Eth: Finnish
-Predicted Ancestry:
--FIN 97.82% Finnish in Finland
--CEU 0.56% Utah Residents of Northern & Western European Ancestry
--ASW 0.28% African Ancestry in Southwest US
--CLM 0.27% Colombian in Medellin, Colombia
User 5526: 22888 99.8% (- 51)
-User Report:
--Anc: Northwest European + Southeast Asian
--Eth: Northwestern European & Southeast Asian
-Predicted Ancestry:
--KHV 29.49% Kinh in Ho Chi Minh City, Vietnam
--GBR 21.86% British in England and Scotland
--CEU 18.68% Utah Residents of Northern & Western European Ancestry
--CDX 6.24% Chinese Dai in Xishuangbanna, China
User 5573: 22929 100.0% (- 10)
-User Report:
--Anc: Mixed european (north, central, southern and eastern)
--Eth: Caucasian
-Predicted Ancestry:
--CEU 83.99% Utah Residents of Northern & Western European Ancestry
--GBR 5.41% British in England and Scotland
--TSI 3.65% Toscani in Italy
--FIN 2.07% Finnish in Finland
User 5675: 22899 99.8% (- 40)
-User Report:
--Anc: Italian
--Eth: Italian
-Predicted Ancestry:
--TSI 83.78% Toscani in Italy
--IBS 10.98% Iberian Populations in Spain
--CEU 1.91% Utah Residents of Northern & Western European Ancestry
--GBR 0.74% British in England and Scotland
User 6824: 22680 98.9% (- 259)
-User Report:
--Anc: R-Y944
--Eth: Gujarati
-Predicted Ancestry:
--GIH 46.70% Gujarati Indian in Houston, TX
--PJL 16.76% Punjabi in Lahore, Pakistan
--ITU 13.38% Indian Telugu in the UK
--STU 11.25% Sri Lankan Tamil in the UK
User 5168: 22828 99.5% (- 111)
-User Report:
--Anc: Swedish
--Eth: Swedish
-Predicted Ancestry:
--FIN 42.20% Finnish in Finland
--CEU 26.80% Utah Residents of Northern & Western European Ancestry
--GBR 17.27% British in England and Scotland
--IBS 3.08% Iberian Populations in Spain
User 5596: 22324 97.3% (- 615)
-User Report:
--Anc: Russian+ashkenazi
--Eth: Russian
-Predicted Ancestry:
--FIN 74.60% Finnish in Finland
--CEU 9.01% Utah Residents of Northern & Western European Ancestry
--GBR 5.95% British in England and Scotland
--CDX 2.26% Chinese Dai in Xishuangbanna, China
User 8915: 22795 99.4% (- 144)
-User Report:
--Anc: Danish, Norwegian, Swedish, Anglo-Frisian, Lower German, East German, Sorbian, Polish
--Eth: North Sea Germanic, Skandinavian, Germanic, Balto-Slavic, Slavic
-Predicted Ancestry:
--CEU 67.07% Utah Residents of Northern & Western European Ancestry
--FIN 15.43% Finnish in Finland
--GBR 7.16% British in England and Scotland
--ACB 1.55% African Caribbean in Barbados
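The header of each report above appears to list the number of selected SNPs present for the user, that count as a percentage of all selected SNPs, and the number missing. This reading is inferred from the listed values (for example, 22917 + 22 and 22910 + 29 both give 22939) and is not stated explicitly. Under that assumption, a minimal Python sketch that reproduces the header and the top-four listing might look as follows. The function names are hypothetical, and the trailing CEU value in the example is made up for illustration.

```python
from typing import Dict, List


def report_header(user_id: int, present: int, missing: int) -> str:
    """Format a header line, assuming total SNPs = present + missing."""
    total = present + missing
    pct = 100.0 * present / total
    return f"User {user_id}: {present} {pct:.1f}% (- {missing})"


def top_populations(probs: Dict[str, float], k: int = 4) -> List[str]:
    """Return the k most probable population codes as report lines."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [f"--{code} {100.0 * p:.2f}%" for code, p in ranked]


# Values for user 154 as listed in the report above; the CEU entry is a
# hypothetical extra class added only to show the top-k cut-off.
probs_154 = {"PEL": 0.0604, "MXL": 0.8393, "CEU": 0.0100,
             "PUR": 0.0218, "CLM": 0.0265}

header = report_header(154, 22917, 22)
lines = top_populations(probs_154)
```

Printing header followed by each entry of lines gives the header and the four population codes in the same order as the report for user 154, without the population descriptions.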
Figure 8: Individual Genetic Kaleidoscopes
Figure 8 shows several heatmaps computed using different users and color schemes. The image is treated as a toroidal array to simulate an atlas.
[1] https://opensnp.org
[2] https://twitter.com/openSNPorg
[3] Posts/59.html
[4] Posts/48.html
[5] https://www.internationalgenome.org/home