Follow-up: Aftermarket genetic information services

I recently described my experiences with Geno 2.0, a National Geographic-branded genetic testing kit that traces ancestry and relatives using information gleaned from haplogroups. At the suggestion of a commenter on that post, I discovered that Nat Geo now makes the results of the test downloadable as a .csv file containing a list of the SNPs detected in the cheek swab that I sent to them. The advantage of having the raw data from the test is that I can now use several third-party services that specialize in processing data from these tests to reveal additional details of my genetic code:

1. Promethease

Promethease was the first service I investigated because the company is among the few who directly accept data from Geno 2.0 (as opposed to 23andMe and other, more popular kits). I like this particular service because it emphasizes medical traits, rather than ancestry information, making it complementary to the results reported by the National Geographic program. The processing is done by a privately-held biotechnology LLC, and the analysis cost me $5.00 via an online payment.

promethease output

The top results for the Promethease screening output. The left column is a list of all SNPs recognized from the Geno 2.0 results, and the right column is a live-editable search specification box that can be used to refine search results.

The figure shows several example results listings from Promethease. The results specify a full list of SNPs identified by the software, as well as annotations associated with those SNPs that are provided by Promethease’s community of scientists and users. The founder notes that the results tend to favor negative traits (outlined in red boxes in the figure) because there are more studies showing that “Trait X causes cancer” than there are showing that “Trait Y causes superpowers.” The results are sorted by the “magnitude” of the effect, which is a subjective ranking of how severe or telling the presence of the SNP ought to be. More telling, however, is the “frequency” tab (highlighted in red in the results), which gives a rough estimate of the fraction of the US population that shares that particular SNP. For example, I was a little concerned to see that I had at least two SNPs associated with male-pattern baldness, but then I noticed that the relative proportion of the population bearing each of the two SNPs was 50% and 84%—dispiriting news, certainly, but hardly a uniquely bad outcome given that at least half of US males go bald by middle age.

The only good news I received was that I have an increased inflammatory response to mosquito bites, which is apparently associated with increased resistance to malaria.

Overall, I was satisfied with the results provided by the service, although it was clear that the total information provided by the Geno 2.0 test (~1,000 SNPs) was lacking in comparison to that provided by other genetic testing kits (~17,000 SNPs are analyzed for 23andMe data). The medical results are more interesting and specific than those provided by the Genographic database, and they definitely made me more interested in using another testing service in the future in order to gain even more information about my genome.

2. FTDNA Geno Data Transfer. Family Tree DNA is a Texas-based company that specializes in high-throughput DNA testing—they have been around long enough that even National Geographic outsources some of its DNA kit screening to their facilities. As a result, it’s pretty easy to transfer Geno 2.0 data to their service, although immediately after doing so I was bombarded with offers of additional tests and analyses at outrageous prices. However, the additional analysis of the Geno data offered by their service is pretty worthwhile, particularly a section of their website that allows different SNPs associated with different paternal haplogroups to be traced to specific geographic regions, as shown here:

The prevalence of my paternal ancestors bearing two specific SNPs with Haplogroup I.

The lineage tracing overwhelmingly favors European and East-Coast U.S. features, which I suspect mirrors the demographics of people whose genomes have been widely studied and mapped using FTDNA’s services. Despite the fact that almost all of the heat maps indicated heavy clusters in Europe and the eastern United States, the graphs did offer a few interesting tidbits, particularly this graph indicating that my paternal haplogroups contain at least one SNP strongly associated with Finland of all places:

SNP Finland

The locations of known ancestral groups bearing the SNP N-M178 in the male haplogroup.

Overall, the service didn’t tell me much more than Genographic’s existing analysis, although it was cool to see the localization of various SNPs on heat maps. The service did very little with information about my maternal line, which is arguably more interesting because it would contain only Southern Asian traits. Instead, the only other information offered about my maternal heritage was a vague assurance that a kit purchaser in the United Arab Emirates had received results similar to mine, which I could compare against in detail for an additional fee.

I think this service is useful, although it is unapologetically and often annoyingly commercial, persistently attempting to upsell my results with further testing. My belief in the credibility of the service is further undermined by the availability of additional vanity tests, like a test for the “warrior gene” or “confidence gene.”

3. Morley terminal subclade predictor. This is a quick tool that finds the terminal subclade associated with a list of SNPs from the Y chromosome. It operates using the list of SNPs detected by FTDNA, which can be copied and pasted into the simple interface. It predicted that my terminal subclade is I2a1c, which is associated with the Alpine region in Western Europe. No information about my maternal ancestry was available.

4. Interpretome. This Stanford-based tool provides a remarkably broad range of results and analyses for free. While the tool technically supports only results from 23andMe and FTDNA’s testing services, I was able to upload my Geno 2.0 data without a problem, although the results should probably be taken with a grain of salt. While the website lacks fancy graphs and visualizations of my genetic data, it includes several unique tools for gaining additional information from the limited raw data available from National Geographic’s Geno website. For example, there’s a tool that takes my autosomal DNA, combined with my age and parental heights, and predicts my actual height.
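Interpretome doesn’t document its exact model, but a plausible baseline is the classic Tanner mid-parental height heuristic, nudged by per-SNP effect sizes. The SNP IDs and effect values in this sketch are invented purely for illustration:

```python
# Toy height predictor in the spirit of the Interpretome tool.  The
# mid-parental baseline is the classic Tanner heuristic; the SNP IDs
# and effect sizes below are invented purely for illustration.

def midparental_height_cm(father_cm, mother_cm, male=True):
    """Average the parents' heights, then add 6.5 cm for males
    or subtract 6.5 cm for females (the Tanner estimate)."""
    base = (father_cm + mother_cm) / 2
    return base + 6.5 if male else base - 6.5

EFFECTS = {"rs0000001": +0.4, "rs0000002": -0.3}  # hypothetical SNPs, cm

def predict_height(father_cm, mother_cm, genotype, male=True):
    """Nudge the baseline by the summed effects of whichever
    illustrative SNPs appear in the genotype set."""
    baseline = midparental_height_cm(father_cm, mother_cm, male)
    return baseline + sum(EFFECTS.get(snp, 0.0) for snp in genotype)

print(predict_height(178, 163, {"rs0000001"}))  # prints 177.4
```

The real tool presumably uses many more SNPs with effect sizes drawn from published association studies, but the structure (a family baseline plus small genetic adjustments) is the same idea.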



Phytoplankton are colorful

A few summers ago I had the chance to work at Mote Marine Laboratory, in the Phytoplankton Ecology Group led by Dr. Gary Kirkpatrick. This was the first time that I really considered doing research outside of traditional physics (in high school I had always imagined that my research career would consist of aligning lasers and soldering circuit boards), and so it was a really eye-opening experience that showed me how elegant the intersections between biology, computer science, and physics can really be.

The image at the head of this post is a scalogram. It’s the result of performing a mathematical operation called a wavelet transform on a one-dimensional data set. This particular data set was an absorption spectrum from a sample of live phytoplankton (spectrum shown below), which shows the amount of light that a sample of plankton absorbs as a function of the wavelength of that light. This is essentially a really precise way to determine the color of the plankton, a useful metric because different species of phytoplankton tend to absorb different colors. Part of the reason for this is that different species of plankton inhabit different depths in the ocean, and the color of sunlight (and thus the wavelength at which maximal photosynthesis occurs) changes with depth due to the preferential absorption of long-wavelength light by seawater.

Absorption spectrum

The absorption spectrum of a mixed sample of phytoplankton. Each peak comprising the spectrum is the result of a single pigment with a Gaussian absorption spectrum, and the observed overall spectrum is a linear combination of the spectra of these individual pigments.

The scalogram is the result of taking short sections of the spectrum, and comparing them to short pieces of a reference function called a wavelet, here represented by a normal distribution that’s been truncated outside of some interval (more generally, any finite section of a smooth function will work as a wavelet—it just happens that a bell-shaped distribution is a convenient first choice in many problems). A large similarity between the wavelet and the section of the spectrum results in a dark color being shown on the plot—this indicates that that portion of the spectrum was shaped like the wavelet. As one moves along the vertical axis of the plot, the width of the wavelet is gradually widened and the transform is repeated, resulting in a plot with two explanatory variables (the width of the wavelet and the location in the spectrum), and one response variable (how much that part of the spectrum matched the shape of a wavelet of that size).
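The transform described above is easy to sketch by hand. The snippet below builds a small scalogram using a Ricker ("Mexican hat") wavelet, a common bell-derived choice of reference function, applied to a synthetic two-peak spectrum. This is a minimal illustration, not the exact pipeline we used at Mote:

```python
import numpy as np

def ricker(points, width):
    """Ricker ("Mexican hat") wavelet: a bell-derived reference
    function sampled over a finite window of `points` samples."""
    t = np.arange(points) - (points - 1) / 2
    amp = 2 / (np.sqrt(3 * width) * np.pi ** 0.25)
    return amp * (1 - (t / width) ** 2) * np.exp(-t ** 2 / (2 * width ** 2))

def scalogram(signal, widths):
    """Correlate the signal with progressively wider wavelets.
    Rows correspond to wavelet widths, columns to positions in the
    spectrum; large values mark where the data matches the wavelet."""
    out = np.empty((len(widths), len(signal)))
    for i, w in enumerate(widths):
        n = min(10 * int(w), len(signal))  # truncate the wavelet window
        out[i] = np.convolve(signal, ricker(n, w), mode="same")
    return out

# A synthetic "absorption spectrum": two Gaussian pigment peaks,
# one narrow and tall, one broad and shallow.
x = np.linspace(0, 100, 500)
spectrum = np.exp(-(x - 30) ** 2 / 8) + 0.6 * np.exp(-(x - 70) ** 2 / 40)

coeffs = scalogram(spectrum, np.arange(1, 31))
row, col = np.unravel_index(np.abs(coeffs).argmax(), coeffs.shape)
print(row, col)  # the strongest response sits near the narrow peak at x = 30
```

Plotting `coeffs` as an image (widths on the vertical axis, position on the horizontal) reproduces the tree-trunk shapes described below: each pigment peak shows up at its own location and its own characteristic width.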

Why are scalograms interesting? The plot shows where the data had peaks (corresponding to pigments absorbing colors), but it also shows the characteristic scale of those peaks. Most plankton spectra have the large feature on the right-hand side of this plot, which corresponds to the absorption of light by Chlorophyll A (the primary photosynthetic pigment found in most species of phytoplankton). The vertical axis allows one to see that Chlorophyll A absorption peaks also have a characteristic width of 6 (arbitrary units), identified by where the tree-trunk-like shape was darkest. The advantage of looking at a plot that separates both the width and location of absorption peaks is that one can also identify more subtle features (like the faint, wide patterns at the top of the plot) which would be almost impossible to identify by looking at the absorption spectrum alone. Scalograms are thus used to increase the information that one can extract from a spectrum, by allowing features in the data to be identified by both their characteristic absorption color (the location of the peak) and their characteristic width. For a sample of water containing many different species of plankton, the added information conferred by taking a wavelet transform can be used to better separate out the many species that can be contributing to the overall observed spectrum.

Genetic testing kit

Much to my surprise and suspicion, for Christmas this year I received a DNA testing kit. The Geno 2.0 kit from National Geographic is one of a handful of “test-at-home” kits currently on the market that allow you to send a sample of your DNA to a private lab to be sequenced and analyzed. While some kits specialize in finding unknown relatives or examining your medical risks, the Geno kit promises an “unprecedented view of your ancestral journey”—it focuses on comparing your DNA to stock samples from reference populations throughout the world, and using this information to determine the migration path your distant ancestors took. It also claims to determine your relationship to various hominids, like Neanderthals and Denisovans, whose DNA samples have recently become available.

The kit contains two cheek swabs and two vials of preservatives—I rubbed the swabs inside my cheek, deposited the cotton ends inside the vials, and sent them to Texas for six weeks of processing. The Geno kit appears to be processed at a different type of facility than other kits (such as 23andMe) because it looks for a different set of genetic markers. Due to its emphasis on determining distant lineages and ancestral relationships, the Geno kit focuses on haplogroups, genetic material that remains largely unchanged over many generations. These portions of our DNA are less useful for diagnostic purposes because they are less unique from person to person (i.e., they cannot be easily used for genetic fingerprinting), but they are instead useful for determining the relationships among large populations who share similar traits. The two haplogroups of particular interest for the kit are the male Y chromosome and maternal mitochondrial DNA. The idea is that Y chromosomes don’t change much from generation to generation—my father’s Y is the same as his father’s Y, which is the same as that of every male ancestor in his family. Compare this to X and other chromosomes, which get jumbled and reshuffled in every generation by meiosis, in which chromosomes from the father and mother that serve similar functions (such as carrying the genes for eye color) get mixed and matched to build a final offspring genome consisting of traits that are a mixture of those of the father and mother. Since males have only one Y chromosome (there is not another similar chromosome from the mother’s genome), this shuffling doesn’t happen during reproduction, and so the Y chromosome can remain largely identical over thousands of generations, save for the occasional random mutation. A similar idea applies to mitochondrial DNA, which is passed down on the mother’s side only.
The idea is that mitochondria, the individual organelles within cells that provide energy, contain their own genome, distinct from the rest of the cell’s, which may be a leftover from a time when mitochondria lived independently outside of cells. Since the first cell of every human arises from a mother’s egg, our mitochondria are copies of our mother’s mitochondria, and so the mitochondrial DNA of a single person traces her maternal ancestry.

This basically means that, if you’re male, the Geno 2.0 kit can isolate DNA unique to your father and his male ancestors, as well as your mother and her female ancestors, allowing it to greatly simplify the analysis necessary to determine your ancestry—without haplogroups, it would be impossible to determine whether a given mutation or pattern in the DNA occurred recently or thousands of generations ago, or which parts of the genome came from which side of the family. This means that it can generate custom plots like this one, showing the migration route of known early humans who carried my maternal haplogroup (first image) and known groups who bore my paternal haplogroup (second image). My mother happens to have ancestors from more parts of the world than my father, and so it is unsurprising that her genes are better-traveled according to the plot—the original humans who carried her mitochondria traveled widely, and so they have descendants in parts of the world ranging from Paraguay to Mongolia. My father has a more direct ancestry, which is reflected in the comparatively small range in which his Y chromosome is found.

maternal heatmap

A heatmap showing the density of my maternal haplogroup among the current world population. My maternal ancestors travelled widely, as people with my maternal traits are found in high concentrations throughout the world.

paternal heatmap

A heatmap showing the density of my paternal haplogroup among the current population of Eurasia. People with my paternal lineage tend to be concentrated in Northern Europe.

However, isolating the DNA itself would be useless without other sets of genomes to which to compare it, a task aided by the Genographic Project, a research effort which has accumulated a database of thousands of haplogroup DNA samples for very specific subpopulations of current humans. Representatives of the Genographic Project have gone into certain parts of the world that have relatively homogeneous populations, such as Mongolia or New Zealand, and collected hundreds of DNA samples from various individuals. Assuming that the individuals tested are descended from people who also lived in the area for many generations (i.e., no recent immigrants), then many of the individuals tested will have similarities in their haplogroups that suggest common ancestors on their male and female sides. From this pool of results, researchers can then construct a sort of “average” genome that is taken as representative of members of that population. For example, all people who are descended directly from the original Oceanic settlers in New Zealand might have a certain genetic pattern (let’s call it “Pattern A” to be pedantic) that has been passed down to all of their descendants in modern New Zealand. However, along the way, Dutch or British settlers likely intermarried with the descendants of the original Oceanians, which would introduce another Pattern B that would also be present in modern-day New Zealanders. A researcher looking at the sequences of many modern-day New Zealanders might notice that all have Pattern A, but only some have Pattern B—this would allow her to infer that Pattern B likely comes from a more recent ancestor (like the later settlers), while Pattern A comes from the original Oceanians. This means that the “average genome” taken as representative of the entire population would likely include Pattern A but not Pattern B, since Pattern A is a more common trait that represents earlier ancestors.
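The Pattern A/Pattern B reasoning can be made concrete with a toy example. In this sketch (the sample data is invented), a pattern carried by every sampled individual is a better candidate for an old, ancestral trait than one carried by only a subset:

```python
# A toy version of the Pattern A / Pattern B argument.  Each "person"
# is just the set of patterns found in their haplogroup; the sample
# data below is invented for illustration.
samples = [
    {"A", "B"},
    {"A"},
    {"A", "B"},
    {"A"},
    {"A", "C"},
]

def carrier_fraction(pattern, population):
    """Fraction of sampled individuals carrying the pattern."""
    return sum(pattern in person for person in population) / len(population)

# Patterns carried by everyone are the best candidates for the oldest
# common ancestors; rarer patterns point to more recent arrivals.
patterns = sorted({p for s in samples for p in s},
                  key=lambda p: -carrier_fraction(p, samples))
for p in patterns:
    print(p, carrier_fraction(p, samples))
```

Here Pattern A appears in 100% of the sample and sorts first, so it would be folded into the "average genome," while B (40%) and C (20%) would be read as later additions. The real analysis works with vastly more patterns and has to untangle correlated groups of them, as described below.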

Naturally this process gets incredibly noisy when one has all sorts of different waves of settlers from different countries, as well as millions of possible patterns that may or may not represent ancestral traits. While occasionally scientists will get lucky and find actual DNA from an early human to which they can compare modern DNA, most of the data available to construct ancestry trees comes from present populations, which already have thousands of generations of genetic intermingling behind them. The Genographic Project thus uses masterful statistical analysis and data processing to automate searching for common patterns in the genomes it samples—instead of looking for just single patterns, it also looks for relationships among groups of patterns. The one advantage in this guessing game is the sheer number of people available for testing today, which allows reasonably strong statistical confidence to be established in the most dominant trends. My results, shown below, correctly place my ancestry as dominated by a single group, Southern Indian, which agrees with my mother’s family history. At the same time, no genetic group is represented in more than 50% of my genome, in part because my parents are from two very different parts of the world and thus have very different reference populations.

A breakdown of the percent my haploid groups share in common with reference populations being studied for the Genographic Project. The dominant component, Southwest Asian, matches my mother’s family’s origin in Southern India.

In general, I am impressed by the unique analysis and processing that the Geno 2.0 kit offers, and I am glad to know that my test results will help improve the Genographic Project’s massive database of reference populations. However, my main concern with the test results is the sparseness of final information given the supposed rigor of the analysis. While I am hoping that more analysis of my results will arrive as better data processing technology becomes available, I would appreciate it very much if projects like these made raw sequence data available–it would be easy to do, and it would allow either after-market companies or savvy individuals to further process and analyze the information that the Genographic team has gathered.

Entropy and cellular automata

Here’s a few frames of a simple simulation of The Game of Life I wrote in MATLAB:

To me, it’s pretty unintuitive that biological processes, like DNA translation or bird flock motion, work so well given that they are often very far from “equilibrium” in the sense we learn in chemistry class. I was taught in high school to think of “equilibrium” as the most stable, least interesting, but most likely outcome of a chemical reaction—vinegar and baking soda eventually fizzle out into brown goo, and even nuclear fusion in stars eventually stops as clumps of iron form in the stellar core.

I think the supposed intuition for the idea of unavoidable equilibration comes from the second law of thermodynamics: entropy constantly increases in the universe, and so there is no spontaneous physical process that can occur on a large enough scale to reverse this tendency. The universe is like a deck of cards: it is always easier to shuffle it than to arrange it in a particular order; thus large-scale processes tend to favor disordered outcomes rather than neat patterns. This idea appears throughout the sciences in various forms: one of the axioms of cosmology is that the universe at large scales is homogeneous and isotropic—it has no definite structure or patterns, but rather looks like a well-mixed soup of randomly arranged galaxies and gas clouds.

Biological systems can locally violate this rule–they exist as well-ordered clockworks within a universe otherwise characterized by collision and diffusion. While the second law still holds on the large scale, the law of averages allows some leeway on the cosmically insignificant scale of the earth—for every sequoia tree or giant squid there is a much larger disordered element, such as a cloud of gas or a stellar explosion, to compensate. But it still seems surprising that systems as orderly as living beings, with their underlying ability to replicate and evolve repeatedly over millennia, can spontaneously have emerged from the noisy background of the cosmos. This raises the question of whether there is some fundamental physical property that makes “living” things unique.

In 1970, the mathematician John Conway proposed “The Game of Life,” a simple mathematical model that sheds light on how “designed” systems can emerge from chaos. In Conway’s version of the game, a black grid is drawn on which some random pattern of squares or tiles is filled in with white. If these white tiles are taken to be some sort of initial colony of “living” cells against an otherwise dead (black) backdrop, then simple rules can be defined for updating the grid to represent colony growth:

1. If a black tile has exactly three white tiles adjacent or immediately diagonal to it, then in the next generation it will become white (as if by reproduction).

2. If a white tile has more than three white tiles surrounding it (up to 8 total), then it will become black in the next generation, as if by overcrowding; if a white tile has fewer than two white neighbors, it will die in the next generation due to starvation.

3. Any white cell with exactly 2 or 3 white neighbors will persist to the next generation.
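These rules translate into just a few lines of array code. Here is a minimal Python/NumPy version (my original simulation was in MATLAB) using the standard birth-on-exactly-three rule, stepping a glider through four generations:

```python
import numpy as np

def step(grid):
    """One generation of Conway's Game of Life on a 2D 0/1 array.
    Neighbors are counted by summing the eight shifted copies of the
    grid; edges wrap around, i.e. a toroidal playing field."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth: dead cell with exactly 3 neighbors.  Survival: live cell
    # with 2 or 3 neighbors.  Everything else dies or stays dead.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# A "glider": it reappears, shifted one cell diagonally, every 4 steps.
grid = np.zeros((8, 8), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1

for _ in range(4):
    grid = step(grid)
print(grid.sum())  # prints 5: a glider always has exactly 5 live cells
```

After the four steps, the same five-cell shape sits one row down and one column right of where it started, which is exactly the "moving structure" behavior discussed below.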

These simple rules create an extremely efficient simulation that can be run over many time steps to create life-like simulations of colony growth. What makes Conway’s game uncanny is that even the most random initial patterns can create dynamic, predictable colonies—the second law of thermodynamics does not mean that all, or even most, versions of the game will create a chaotic mass of cells. For example, here’s a pattern that one set of initial conditions will create when the game is run for many time steps (click the image to see the animation):

Click to see the gif on Wikimedia.

The animation shows several important structures that can emerge within the game: gliders are groups of cells that continuously move across the playing field, returning to their original shape (but in a different location) within a few generations. Some cells cluster together and form stable, immobile structures that persist indefinitely until they interact with a wayward glider or another structure.

Conway’s game provides a simple illustration of how life-like systems can emerge from random initial conditions, implying a decrease in entropy in the limited “universe” of the simulation. The game and its variants with different rules, tiling patterns, and so on are collectively known as cellular automata, which form the basis of a lot of important research currently occurring in image processing and computational biology. Several noted scientists, including Turing, von Neumann, and Wolfram, have investigated the implications of these simple models—Wolfram in particular has devoted several decades of research, thousands of textbook pages, and a particularly unusual Reddit AMA to the theory that cellular automata provide the basis of the most fundamental physical laws of the universe.

But the Game of Life also connects to many more general mathematical concepts. Markov models, which mathematically characterize systems in which individuals undergo transitions, arrivals, or departures among a finite set of well-defined states, are an alternative way of representing the same information as Conway’s tiles. The defining principle of Markov models is that the next state is determined purely by the present state: a population ecologist who uses Markov models would assume that the next change in population size can be predicted purely from information about the present population (for example during exponential growth, in which the growth rate of a group of organisms is proportional to the size of the group). An ecologist would keep track of all the possible state changes in the population using a transition matrix, which contains empirical estimates of the rate at which new individuals are born (arrival event), old individuals die (departure), and existing individuals survive (transition). The parallel with Conway’s three rules is clear, but Markov models can be easily represented with matrices, and so they represent a natural limiting case for any system in which a physical entity evolves based on a limited subset of its past states.
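As a concrete sketch of such a transition matrix, here is a two-state population projection in that style. The rates are invented for illustration, and the long-run growth rate falls out of the dominant eigenvalue of the matrix:

```python
import numpy as np

# A minimal transition-matrix population model with two states,
# "juvenile" and "adult".  All rates are invented for illustration:
# column j of T says where state j's individuals (and their
# offspring) end up next year.
T = np.array([[0.0, 1.2],   # juveniles next year: 1.2 offspring per adult
              [0.5, 0.9]])  # adults: 0.5 of juveniles mature, 0.9 of adults survive

population = np.array([100.0, 50.0])  # [juveniles, adults]
for _ in range(50):
    population = T @ population       # one "generation" of transitions

# After many steps, growth settles to the dominant eigenvalue of T.
growth = np.abs(np.linalg.eigvals(T)).max()
print(round(growth, 3))  # prints 1.346
```

The next state depends only on the present population vector, which is exactly the Markov property described above; each application of `T` plays the role of one update of Conway's grid.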

If Conway’s tiled grid is replaced with a continuous set of points, and the survival status of a given point depends on a weighted integral of the “brightness” of points within a given radius of it, then the transition matrices for many continuous cellular automata will become the solution of a differential equation in space and time. Certain types of diffusion equations, for example, use integration over neighboring points as a continuous approximation of the rules of the Game of Life. One set of differential equations that illustrates the well-defined structures that can emerge from an otherwise disordered system is the reaction-diffusion equations, which model the strange patterns that can be observed when a homogeneous solution of potassium bromate and cerium sulfate is mixed with several acids:

Taken from Wikipedia


Thus diffusive differential equations, Markov models, and cellular automata all describe essentially the same process, in which local interactions cause ordered structures and patterns to emerge and aggregate within an otherwise random system.
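The chemistry above is hard to simulate faithfully, but the Gray-Scott model, a standard two-chemical reaction-diffusion system, shows the same flavor of pattern formation in a few lines. The parameter values below are conventional pattern-forming choices, not fitted to any real reaction:

```python
import numpy as np

def laplacian(Z):
    """Discrete Laplacian with wrap-around edges: the continuous
    analogue of summing over a cell's neighbors in the Game of Life."""
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

# Gray-Scott model: chemical U is consumed by V (U + 2V -> 3V),
# U is fed in at rate `feed`, and V is removed at rate `feed + kill`.
n = 64
U = np.ones((n, n))
V = np.zeros((n, n))
U[28:36, 28:36], V[28:36, 28:36] = 0.5, 0.25  # seed a small disturbance

Du, Dv, feed, kill = 0.16, 0.08, 0.055, 0.062
for _ in range(2000):
    reaction = U * V * V
    U += Du * laplacian(U) - reaction + feed * (1 - U)
    V += Dv * laplacian(V) + reaction - (feed + kill) * V

# The uniform background is unstable around the seed: instead of
# diffusing away, V organizes into persistent spots and worms.
print(round(float(V.max()), 2), round(float(V.std()), 3))
```

Rendering `V` as an image every few hundred steps produces growing spot-and-stripe patterns much like the photographs of the real reaction; note how the discrete Laplacian plays exactly the neighbor-counting role of Conway's rules.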

Laser Microscopy in 20 Minutes


Protozoans in a water droplet, projected with a laser pointer beam.

Using a sketchy and cheap Chinese laser pointer, a decent mirror (here provided by an old hard drive platter), and some water from a disgusting aquarium tank, you can create a powerful projection microscope at home. The water droplet itself provides the magnifying optics—using a smaller droplet will increase the magnification, but make focusing the laser a lot more frustrating. The image size can be increased just by increasing the throw distance of the laser. Here’s my setup: I used the straw to make perfectly round droplets by dipping the end in the aquarium, and I used the microfiber cloth to keep smudges off the mirror:
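For a rough sense of the numbers: a spherical droplet behaves like a ball lens, whose effective focal length follows the standard thick-lens result f = nD/(4(n-1)). The droplet size and throw distance below are just example values:

```python
# Back-of-envelope optics for the droplet "objective".  A spherical
# droplet acts as a ball lens; the standard thick-lens result for a
# ball lens of diameter D and refractive index n is
#     f = n * D / (4 * (n - 1)),
# measured from the center of the sphere.

def ball_lens_focal_mm(diameter_mm, n=1.333):  # n = refractive index of water
    return n * diameter_mm / (4 * (n - 1))

def projection_magnification(throw_mm, diameter_mm, n=1.333):
    """Rough magnification with the sample near the focal plane and
    the image thrown `throw_mm` onto a wall: M ~ throw / f."""
    return throw_mm / ball_lens_focal_mm(diameter_mm, n)

# Example numbers: for water, f comes out very close to the droplet
# diameter, so a 3 mm droplet with a 2 m throw magnifies ~670x.
print(round(projection_magnification(2000, 3)))  # prints 666
```

This matches the two tricks mentioned above: shrinking the droplet shortens f (more magnification, touchier focusing), and lengthening the throw scales the image up directly.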


My materials

Laser Microscope Schematic

My setup of the laser pointer microscope. I used a hard drive platter as my mirror.

With the water from the aquarium, I can easily see amoebas and paramecia swimming around and interacting. Obviously the diffraction limit heavily applies to the quality of the image, but some sub-cellular structures are definitely visible within the amoebas:

A good thing to note is that some of the more geometric bodies that you see moving are actually very small organisms Rayleigh- and Mie-scattering the laser light—the bodies themselves are too small to see, but they create a geometric interference pattern that appears to move with them through the water.

Scattering microscopy can also be applied to other transparent materials, such as glass and crystals, to reveal internal structures. A good one to try is a clear marble, preferably one with cracks on the inside. Here is a photograph of the laser through Icelandic spar, a variant of the common calcite crystal that exhibits complex double refracting behavior in standard lighting. The laser reveals the cleavage plane of the crystal quite nicely:

Calcite Interference Pattern

The “Icelandic Spar” variety of Calcite exhibits double refraction when held against a sheet of text; the laser light reveals that this is due to its orderly internal lattice structure.

What is a laser?

Laser light is coherent, meaning that all the photons that comprise it march in phase—they never fall out of step with one another because they all take the same steps at the same time. This is incredibly useful because it means not only that all the photons have the same frequency—they take the same number of steps per minute—but also that their steps line up exactly (they are in phase).

The reason it matters to scientists is that, while light is indeed broken up into little particles of energy, these photons happen to act like waves in that they can interfere with one another and overlap, just as ocean waves give rise to complex eddies and lulls. Most light sources, like incandescent light bulbs, simply toss off photons with whatever phase and frequency happens to be most convenient, but lasers are designed to produce barrages of photons that are coherent (have the same phase) and monochromatic (have the same frequency). This allows physicists to start out with no interference at all, and then to introduce various substances into the laser beam to see how they cause interference. Often the interference properties of a substance provide fundamental details about its microscopic structure.
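The wave picture is easy to demonstrate numerically: add two identical sine waves in phase and the amplitudes double; shift one by half a cycle and they cancel.

```python
import numpy as np

# Two identical waves: in phase they reinforce; half a wavelength
# out of step they cancel almost perfectly.
t = np.linspace(0, 2 * np.pi, 1000)
wave = np.sin(t)

in_phase = wave + np.sin(t)              # phase difference of 0
out_of_phase = wave + np.sin(t + np.pi)  # phase difference of pi

print(round(in_phase.max(), 3))              # prints 2.0 (amplitudes add)
print(round(np.abs(out_of_phase).max(), 3))  # prints 0.0 (total cancellation)
```

A substance placed in a coherent beam shifts the phase of part of the wavefront, turning some of that perfect reinforcement into partial cancellation, and the resulting fringe pattern encodes the structure that caused the shift.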

The basic idea of lasers is that an electric field causes many atoms in a gas to reach an excited state, one in which their electrons have reconfigured in order to hold additional energy. Most atoms would prefer not to hold this additional energy for long, and so after some time they will decay into their ground (normal) state, releasing the energy that they were hoarding as a photon. The trick to this process is that only certain changes in electron arrangement around the nucleus are physically possible, and so only certain changes in energy are possible. This means that gases are predisposed to emit photons with identical energies, because their electrons spontaneously absorb and emit only photons that correspond to the allowed variations in electron distributions. The energy of a photon is directly proportional to its frequency (and thus color) via Planck’s law, which is why we know that the blue part of a candle flame is much hotter than the yellow (low-frequency) part. Most elements have characteristic colors at which they emit light, as each element has a unique atomic geometry and thus a unique set of acceptable electronic configurations about its nucleus. The study of the characteristic colors, or spectra, of chemicals is the basis of spectroscopy, which I discuss in my post on incandescence.
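Planck's relation is simple enough to check with real numbers: E = hf = hc/λ. Using CODATA constants, a red helium-neon laser photon carries about 2 eV, and a blue photon noticeably more:

```python
# Planck's relation with real numbers: E = h * f = h * c / wavelength.
# Constants are CODATA values; 632.8 nm is the familiar red line of a
# helium-neon laser (a cheap pointer is probably a ~650 nm diode).
h = 6.62607015e-34    # Planck constant, J*s
c = 2.99792458e8      # speed of light, m/s
eV = 1.602176634e-19  # joules per electron-volt

def photon_energy_eV(wavelength_nm):
    return h * c / (wavelength_nm * 1e-9) / eV

print(round(photon_energy_eV(632.8), 2))  # red HeNe line: prints 1.96
print(round(photon_energy_eV(450.0), 2))  # blue light: prints 2.76
```

The blue photon's higher energy corresponds to a larger allowed jump between electron configurations, which is why each element's fixed set of jumps yields its characteristic spectral colors.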

So lasers already have the monochromatic issue taken care of—they simply use a mixture of gases that ensures that atoms only spontaneously absorb and emit light at the desired color. But lasers are so powerful because they use stimulated emission—as a photon in a laser passes by an excited atom that has not yet released its energy, it can provoke it to release a photon that is moving perfectly in step with it. So in addition to being monochromatic, the light emitted from a laser is always coherent (in phase).

There’s a reasonable explanation for how this occurs: according to the Pauli Exclusion Principle (or rather its absence for bosons, of which photons are a subclass), it is impossible to tell photons with the same set of properties apart. The basis of this is that photons, unlike objects we encounter in the macroscopic world, are able to overlap like waves. So if I place two photons in the same place, and everything about the two photons is the same, then I can never tell them apart. If a photon in a laser flies by an excited atom and stimulates a photon with a random phase to be released, however, there are two possible ways for the new photon to be in phase, but only one way for them to be out of phase. The reason for this is that there are two axes along which the photons can agree or disagree, but by the Pauli Principle all the disagreements appear to be the exact same, single state. This is rather unintuitive, but minutephysics provides a nice example with a quantum coin flip. The phenomenon is known as the consolidation of eigenstates (eigen is a German prefix that means “terrible algebra”), meaning that there end up being more options for the photons to stay in phase than to go out of phase, resulting in the former being statistically favorable.

As a result, the number of in-phase photons gradually builds up until eventually the laser output is dominated by coherent light.