Follow-up: Aftermarket genetic information services

I recently described my experiences using Geno 2.0, a National Geographic-branded genetic testing kit that traces ancestry and relatives based on information gleaned from haplogroups. At the suggestion of a comment on that post, I discovered that Nat Geo now makes the results of the test downloadable as a .CSV file containing a list of the SNPs detected in the cheek swab that I sent them. The advantage of having the raw data from the test is that I can now use several third-party services that specialize in processing data from these tests in order to reveal additional specifics of my genetic code:
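As a rough illustration, here is how such a raw-data file might be parsed in Python. The column names and rsIDs below are invented for the example; the actual Geno 2.0 export may label its columns differently:

```python
import csv
import io

# Hypothetical Geno 2.0-style export: a simple table of SNP calls.
# (These rsIDs and genotypes are made up for illustration.)
raw = """SNP,Chromosome,Position,Genotype
rs4988235,2,136608646,CT
rs1805007,16,89986117,CC
rs17822931,16,48258198,GA
"""

# Parse the CSV into a {rsid: genotype} lookup table.
calls = {}
for row in csv.DictReader(io.StringIO(raw)):
    calls[row["SNP"]] = row["Genotype"]

print(len(calls))          # 3
print(calls["rs4988235"])  # CT
```

A table like this is all that the third-party services below need as input.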

1. Promethease

Promethease was the first service I investigated because the company is among the few who directly accept data from Geno 2.0 (as opposed to 23andMe and other, more popular kits). I like this particular service because it emphasizes medical traits, rather than ancestry information, making it complementary to the results reported by the National Geographic program. The processing is done by a privately-held biotechnology LLC, and the analysis cost me $5.00 via an online payment.

promethease output

The top results for the Promethease screening output. The left column is a list of all SNPs recognized from the Geno 2.0 results, and the right column is a live-editable search specification box that can be used to refine search results.

The figure shows several example results listings from Promethease. The results specify a full list of SNPs identified with the software, as well as annotations associated with those SNPs that are provided by Promethease’s community of scientists and users. The founder notes that the results tend to favor negative traits (outlined in red boxes in the figure) because there are more studies showing that “Trait X causes cancer” than there are showing that “Trait Y causes superpowers.” The results are sorted by the “magnitude” of the effect, which is a subjective ranking of how severe or telling the presence of the SNP ought to be. More telling, however, is the “frequency” tab (highlighted in red in the results), which gives a rough estimate of the fraction of the US population that shares that particular SNP. For example, I was a little concerned to see that I had at least two SNPs associated with male-pattern baldness, but then I noticed that the relative proportion of the population bearing each of the two SNPs was 50% and 84%—dispiriting news, certainly, but hardly a uniquely bad outcome given that at least half of US males go bald by middle age.
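The interplay between magnitude and frequency can be captured in a toy sketch. The rsIDs, annotations, and numbers below are invented, not my actual results, and the thresholds are arbitrary:

```python
# Hypothetical annotated SNP hits, mimicking the two fields described
# above: a subjective "magnitude" and a US-population "frequency".
hits = [
    {"rsid": "rs9939609", "note": "obesity risk",          "magnitude": 1.2, "frequency": 0.45},
    {"rsid": "rs6152",    "note": "male-pattern baldness", "magnitude": 2.0, "frequency": 0.84},
    {"rsid": "rs1385699", "note": "male-pattern baldness", "magnitude": 2.1, "frequency": 0.50},
]

# Sort by magnitude, most severe first, as the Promethease listing does...
hits.sort(key=lambda h: h["magnitude"], reverse=True)

# ...but only flag results that are both high-magnitude and rare, since a
# trait shared by half the population is hardly a uniquely bad outcome.
flagged = [h for h in hits if h["magnitude"] >= 2.0 and h["frequency"] < 0.25]
print([h["rsid"] for h in flagged])  # [] — both baldness SNPs are common
```

The empty result is the point: high-magnitude hits that most of the population also carries are not worth losing sleep over.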

The only good news I received was that I have an increased inflammatory response to mosquito bites, which is apparently associated with increased resistance to malaria.

Overall, I was satisfied with the results provided by the service, although it was clear that the total information provided by the Geno 2.0 test (~1,000 SNPs) was lacking in comparison to that provided by other genetic testing kits (~17,000 SNPs are analyzed for 23andMe data). The medical results are more interesting and specific than those provided by the Genographic database, and they definitely made me more interested in using another testing service in the future in order to gain even more information about my genome.

2. FTDNA Geno Data Transfer. Family Tree DNA is a Texas-based company that specializes in high-throughput DNA testing—they have been around long enough that even National Geographic outsources some of its DNA kit screening to their facilities. As a result, it’s pretty easy to transfer Geno 2.0 data to their service, although immediately after doing so I was bombarded with offers of additional tests and analyses at outrageous prices. However, the additional analysis of the Geno data offered by their service is pretty worthwhile, particularly a section of their website that allows different SNPs associated with different paternal haplogroups to be traced to specific geographic regions, as shown here:

The prevalence of my paternal ancestors bearing two specific SNPs within Haplogroup I.

The lineage tracing overwhelmingly favors European and East-Coast U.S. features, which I suspect mirrors the demographics of people whose genomes have been widely studied and mapped using FTDNA’s services. Despite the fact that almost all of the heat maps indicated heavy clusters in Europe and the eastern United States, the graphs did offer a few interesting tidbits, particularly this graph indicating that my paternal haplogroups contain at least one SNP strongly associated with Finland of all places:

SNP Finland

The locations of known ancestral groups bearing the SNP N-M178 in the male haplogroup.

Overall, the service didn’t tell me much more than Genographic’s existing analysis, although it was cool to see the localization of various SNPs on heat maps. The service did very little with information about my maternal line, which is arguably more interesting because it would contain only Southern Asian traits. Instead, the only other information offered about my maternal heritage was a vague assurance that a kit-purchaser in the United Arab Emirates had received results similar to mine, against which I could compare my own for an additional fee.

I think this service is useful, although it is unapologetically and often annoyingly commercial, persistently attempting to upsell my results with further testing. My belief in the credibility of the service is further undermined by the availability of additional vanity tests, like a test for the “warrior gene” or “confidence gene.”

3. Morley terminal subclade predictor. This is a quick tool that finds the terminal subclade associated with a list of SNPs from the Y chromosome. It operates using the list of SNPs detected by FTDNA, which can be copied and pasted into the simple interface. It predicted that my terminal subclade is I2a1c, which is associated with the Alpine region in Western Europe. No information about my maternal ancestry was available.
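The underlying idea, as I understand it, is to walk the Y haplotree and report the deepest subclade whose defining SNPs were all detected. That logic fits in a few lines of Python; the tree below is a toy stand-in with invented SNP assignments, not the actual ISOGG tree that Morley’s tool uses:

```python
# Toy haplotree: each subclade is listed with the full set of SNPs on
# the path from the root.  Names and SNP assignments are illustrative.
tree = {
    "I":     {"M170"},
    "I2":    {"M170", "M438"},
    "I2a":   {"M170", "M438", "L460"},
    "I2a1":  {"M170", "M438", "L460", "P37"},
    "I2a1c": {"M170", "M438", "L460", "P37", "L233"},
}

def predict_subclade(positive_snps):
    """Return the deepest subclade whose defining SNPs were all detected."""
    matches = [name for name, snps in tree.items() if snps <= positive_snps]
    return max(matches, key=lambda n: len(tree[n])) if matches else None

print(predict_subclade({"M170", "M438", "L460", "P37", "L233", "M178"}))
# → I2a1c
```

Extra detected SNPs (like M178 above) that don’t define a deeper branch are simply ignored, which matches how the predictor tolerates the noisy FTDNA SNP list.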

4. Interpretome. This Stanford-based tool provides a remarkably broad range of results and analyses for free. While the tool technically supports only results from 23andMe and FTDNA’s testing services, I was able to upload my Geno 2.0 data without a problem, although results should probably be taken with a grain of salt. While the website lacks fancy graphs and visualizations of my genetic data, it includes several unique tools for gaining additional information from the limited raw data available from National Geographic’s Geno website. For example, there’s a tool that takes my autosomal DNA, combined with my age and parental heights, and predicts my actual height.
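For reference, a baseline height prediction needs no genetic data at all: the classic Tanner mid-parental formula averages the parents’ heights and shifts by 6.5 cm depending on sex. Presumably Interpretome adjusts a baseline like this using height-associated SNPs; the sketch below is only the non-genetic part:

```python
def midparental_height_cm(father_cm, mother_cm, male=True):
    """Classic Tanner mid-parental target height: average the parents'
    heights, then add 6.5 cm for a son or subtract 6.5 cm for a daughter."""
    base = (father_cm + mother_cm) / 2
    return base + 6.5 if male else base - 6.5

print(midparental_height_cm(180, 165))              # 179.0
print(midparental_height_cm(180, 165, male=False))  # 166.0
```

Any SNP-based predictor has to beat this two-input formula to be worth running, which is a surprisingly high bar.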

Using text classification to determine authorship

In high school, my classmates and I were required to use the website turnitin.com, which bills itself as a “global leader in evaluating and improving student learning” but which really exists for a single (albeit important) purpose: determining whether submitted papers are plagiarized from online sources. The use of the website was perhaps an overzealous response to the sparsely documented phenomenon of high schoolers stealing their papers from online sources, particularly because the website can do little to detect whether a paper has been purchased from a “paper mill,” where students can hire others to write their essays for them. Instead, the website appears to use a strangely rudimentary regular-expression search to compare all possible substrings (down to some minimal length) in a document against a database of text compiled from online sources like Google Books and Wikipedia. I remain skeptical of the sophistication of the website’s methods because it would regularly flag phrases like “Just as when” or “at odds with,” suggesting that it didn’t contain any sort of whitelist for common phrases. I also hold a personal grudge because I once turned in an essay containing (MLA formatted!) quotes from Heart of Darkness, and the program declared that I had plagiarized parts of my essay from a certain book by Joseph Conrad.
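A less brittle version of this substring search is to compare word n-grams and subtract a whitelist of stock phrases, which is exactly what the site seemed to lack. A minimal sketch, with the whitelist seeded from the phrases above and invented sample sentences:

```python
def ngrams(text, n=3):
    """The set of word n-grams in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# A whitelist of stock phrases that shouldn't count as plagiarism.
COMMON = {("at", "odds", "with"), ("just", "as", "when")}

def flag_overlaps(essay, source, n=3):
    """n-grams shared between essay and source, minus common phrases."""
    return (ngrams(essay, n) & ngrams(source, n)) - COMMON

essay  = "his view was at odds with the horror he described"
source = "a policy at odds with the horror of the time"
print(len(flag_overlaps(essay, source)))  # 2
```

The two sentences share three trigrams, but “at odds with” is whitelisted away, so only the genuinely suspicious overlap survives. A real system would also need stemming and a much larger whitelist, but the principle is the same.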

How can plagiarism be detected more effectively? I have spent this summer reading a little about text classification, and it seems apparent that a website like Turnitin could benefit from implementing a procedure that looks for whether the style of a student’s writing is consistent, rather than simply looking for copied text. It would be harder for such a site to present definitive evidence of cheating, but it would at least allow teachers to get a better sense of when several students turn in similar papers, or when one of a student’s papers demonstrates a markedly different tone and style from xyr others.

A classic paper that outlines many of the concepts and principles associated with text classification is “Inference in an authorship problem” by Mosteller and Wallace (1963). The authors apply statistical techniques to determine whether several of the Federalist Papers (1788) were written by Hamilton or Madison—at the time their article was published, no consensus existed among historians regarding which author had written which papers. The basic technique used by Mosteller and Wallace is strikingly similar to techniques used today after the rise of modern computerized data mining: they implement a linear discriminant analysis based on the frequencies with which the authors use certain common words (“by,” “from,” “to,” and the like). The technique looks for a set of words, rather than a single word, whose usage frequencies strongly differ between the two authors’ bodies of work. This set is found by applying various weighted combinations of words to known samples of each author’s writing, then keeping the combination of words and weights that yields high sums for works by one author (Madison in the paper) and low sums for the other. One advanced technique that Mosteller and Wallace implement builds on the observation that certain context-independent “function” words—like prepositions or common adverbs—have frequencies that roughly obey a Poisson distribution, allowing them to apply Bayes’ theorem when selecting their word set in order to determine the likelihood of an observed word frequency under the assumption that the frequency distribution is Poissonian.
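The core of the discriminant is simple enough to sketch: score a document by a weighted sum of function-word rates, with weights chosen so that known texts by one author score high and the other’s score low. The words, weights, and sample sentences below are invented for illustration and are not the values from the paper:

```python
# Weights per function word, fitted (in the real method) on known texts
# so that Madison-like usage scores positive and Hamilton-like negative.
WEIGHTS = {"by": 2.0, "from": 0.5, "to": -1.5, "upon": -3.0}

def score(text):
    """Weighted sum of function-word rates (occurrences per 1000 words)."""
    words = text.lower().split()
    rate = lambda w: 1000 * words.count(w) / len(words)
    return sum(wt * rate(w) for w, wt in WEIGHTS.items())

# Invented sentences exaggerating each author's habits ("upon" was a
# famously Hamiltonian word; Madison favored "by").
hamiltonish = "upon reflection it seems to me that to govern is to decide upon much"
madisonish  = "by the powers derived from the states and by the people from below"

print(score(hamiltonish) < 0 < score(madisonish))  # True
```

In the actual paper the weights come from fitting the discriminant on papers of undisputed authorship, and the disputed papers are then scored against that fixed rule.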

The authors find, surprisingly, that Madison wrote every single one of the disputed papers. This contradicts various claims made by contemporaries of the original Federalists, but it corroborates several historical analyses that came out around the time that Mosteller and Wallace published their paper. Because the method only requires known samples of an author’s work in order to develop a model, a website like Turnitin could implement a similar technique by archiving a student’s initial submissions or past work. Presumably, the website would then have little difficulty discerning whether a history essay was written by an adolescent or Joseph Conrad.