Algorithmic interactions on social networks

I recently built a simple program that attempts to impersonate users of an online forum using a combination of a Markov model and a context-free grammar.

Recently, I’ve been using Python to scrape data from the content-aggregation website Reddit, where users can submit links and vote on each other’s submissions. Links that generate large numbers of upvotes in the community may eventually make it to the front page of Reddit, where they will be viewed by millions of casual visitors as they first enter the site. Reddit’s voting algorithm and ranking method are closely guarded in order to prevent marketing companies from spamming the front page with links.

A really great Python utility for working with Reddit data is PRAW, which provides an interface between Reddit’s API and Python. The module not only allows easy scraping of information like the popularity and content of top articles, the number of upvotes, and the number of comments; it also simplifies the creation of “bots,” automated users in the Reddit community that comment on posts. The function of bots ranges from practical—one bot posts a text-mined summary of every Wikipedia article that makes it to the front page, while another posts metric conversions of every Reddit headline containing imperial units—to whimsical: one bot periodically scans comment threads for rhyming words and adds a comment to the thread demanding that users cease deploying puns.
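
As a point of reference, scraping the kind of information mentioned above takes only a few lines with PRAW. This is a minimal sketch using the current PRAW interface (which may differ from the version available when I wrote my bot), with placeholder credentials and an arbitrary subreddit:

```python
import praw

# Placeholder credentials; a real script needs an app registered with Reddit.
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    user_agent="comment-scraper by /u/your_bot_account",
)

# Walk the current hot submissions and pull the fields mentioned above.
for submission in reddit.subreddit("all").hot(limit=10):
    print(submission.score, submission.num_comments, submission.title)

    # Flatten the comment tree and collect the comment text,
    # e.g. as training data for a text model.
    submission.comments.replace_more(limit=0)
    comments = [c.body for c in submission.comments.list()]
    print(f"  scraped {len(comments)} comments")
```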

For my bot, I wanted to create a user who could convincingly act like a normal “human” user. My immediate idea was to scrape comments on an article, create a word-as-token Markov model, and then automatically post a comment containing a fixed-length output generated by the model. Markov text generators are an incredibly common opening exercise in introductory computer science courses—they are relatively easy to prepare, yet can produce surprisingly realistic (and occasionally humorous) output if trained on the right corpus of text, like this sample output from a 3-word-token Markov model I trained using Mary Shelley’s Frankenstein; or, The Modern Prometheus:

This towering above all, thy mountains, obscured in darkness and cast my eyes were shut to the conversations he held with his parents ever since our infancy. We were in my enterprise. I have confessed myself guilty of a peculiar and overpowering nature ; nor do the floating sheets of ice , which was one which attracted my attention suddenly grew despicable. By one of your sentiments of this wretched mockery of a different conclusion

Because of the relatively small size of the training text compared to the token size, many of the phrases, like “his parents ever since our infancy” or “floating sheets of ice,” are lifted directly from the novel. Nonetheless, randomly picking three-word phrases based on their frequency in the corpus captures the style of the novel remarkably well, even though the model is given no explicit specification of sentence structure when it is trained on the corpus. If I reduce the order of the model to 2, I get output text that’s significantly more “original,” but also marred by more grammatical and stylistic errors:

This apprehensions as he had hitherto been present to your health rendered , no disaster is murdered , on your duty towards the most beautiful that you could I opened my, I, the reflections determined thenceforth to a direction, and my appetite. One by Elizabeth. She was not unfolded to recollect what has been adduced against me was free last night; such a fiend can not describe their native town of great crime , in death shall be the beginning of food or take their inquiries clear conception of the dashing waves continually renewed violence.
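
For anyone curious, a bare-bones word-level Markov generator of this kind can be sketched in a couple dozen lines. The sketch below is an illustration rather than my exact code, and frankenstein.txt is a stand-in for a local plain-text copy of the novel:

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each run of `order` consecutive words to the words observed after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, n_words=80):
    """Random-walk through the model, starting from a randomly chosen key."""
    key = random.choice(list(model.keys()))
    output = list(key)
    for _ in range(n_words):
        followers = model.get(key)
        if not followers:                       # dead end: jump to a new random key
            key = random.choice(list(model.keys()))
            continue
        output.append(random.choice(followers))
        key = tuple(output[-len(key):])
    return " ".join(output)

if __name__ == "__main__":
    corpus = open("frankenstein.txt").read()    # placeholder path to the training text
    print(generate(build_model(corpus, order=3)))
```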

I originally intended to have my bot write pseudoscience or new-age philosophy, which I hoped would lead other users to dismiss its frequent lapses in grammar as kooky derailment. I quickly learned, however, that the output of a Markov model is distinctive enough that other, tech-savvy users could infer that the comments were algorithmic. Looking to refine my approach, I instead investigated the approach used by SCIgen, a well-known automatic scientific paper generator that gained attention in 2005 when its authors successfully submitted SCIgen papers with titles like “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” to several less-reputable journals and conferences. Contrary to what I expected, SCIgen does not use an augmented Markov model, but rather a context-free grammar, a token-based approach to generating text that takes into account the natural hierarchies in sentence structure first described by Chomsky. The method is outlined in much better detail elsewhere, but the essential concept is that sentences contain a natural hierarchy of information that motivates their representation as a tree data structure. For example, the sentence “The dog who was dirty ate the slippers” can be split first into two parts—the subject clause about the dirty dog, and the verb clause about his crime—and each of these parts can be further subdivided into adjectival clauses, predicates, and finally individual words like articles, nouns, and adjectives. In a context-free grammar, the non-terminal nodes of the tree (clauses, etc.) are governed by production rules (verb clause -> verb + article + direct object) that state the next level of decomposition, and each of those symbols has its own production rule in turn (article -> “the” OR “a” OR “an”). A more advanced grammar tree (for a Victorian sentence from “Frankenstein”) looks like this:

A context-free grammar for a sentence from Frankenstein, parsed using the NLTK module for Python.


A CFG text generator starts with the start symbol (‘S’) and then moves down the tree from left to right, outputting a terminal symbol each time it sees one and then moving back up to the lowest incomplete branch. In order to get truly random text, the CFG is trained on many sentences, so the generator has multiple possible options for each symbol—after it expands S, it could choose to move to a sentence that looks like NP VP (like the one above), or one that looks like NN PN (subject noun – predicate noun, like “the man is a dog”). After it randomly makes that decision, it has many options for each of the subsequent nonterminal nodes, and finally it has a choice of many possible terminal symbols (once it reaches an NN, it can pick from any of the NNs used in the training set, since they are syntactically equivalent).
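
The random top-down walk itself is easy to sketch in code. The toy grammar below is made up (it is not the grammar from the figure), but it shows the basic recipe: pick a random production for each nonterminal and recurse until only terminal words remain.

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of possible
# right-hand sides; symbols that never appear as keys are terminals.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["dirty"], ["weary"]],
    "N":   [["dog"], ["slippers"], ["professor"]],
    "V":   [["ate"], ["hated"]],
}

def expand(symbol):
    """Expand a symbol depth-first, choosing a random production at each step."""
    if symbol not in GRAMMAR:                  # terminal: emit the word itself
        return [symbol]
    words = []
    for child in random.choice(GRAMMAR[symbol]):
        words.extend(expand(child))
    return words

print(" ".join(expand("S")))    # e.g. "the dirty dog ate a professor"
```

Adding a self-referential rule such as NP -> NP “who” VP to a grammar like this is exactly what produces the runaway sentences discussed below.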

One noteworthy detail of the sentence shown in the figure is the presence of repeated units, like noun phrases (NP) that link to other noun phrases, which capture the fractal nature of language, in which sentence-like clauses are used to modify individual components of a sentence. For this reason, recursion is very easy to implement when generating text from a CFG, but it’s also very easy for that recursion to get stuck in an effectively infinite loop, where a lower symbol links back to a higher symbol, resulting in repetition:

The professor hated the man who hated the professor who hated the man who hated the…

The sentence is grammatical but undesirable. For large grammars, the incidence of these runaway sentences seems to increase, since the number of possible interlinking loops and knots in the grammar tree becomes large.

In order to use a CFG to generate grammatically correct gibberish, I made use of the famous NLTK module for Python, which contains many tools for processing and cleaning natural-language data sets, as well as corpora in which human experts have tagged individual words as nouns, verbs, etc. This makes it possible to use pre-built functions in the module to tag individual words in a user-chosen sample text based on their similarity to words in the expert-tagged text, while simultaneously identifying the overall relations among words in order to identify non-terminal symbols like clauses. In my particular code, I scan a text for all terminal symbols first, which is a comparatively fast operation. I then pick sentences at random and parse their full grammar (which can take up to a minute), but keep only their nonterminal symbols (discarding the individual words). I then concatenate the word rules with the nonterminal production rules, resulting in a grammar with the full vocabulary of my corpus but only a subset of its grammatical structures. Since there are multiple possible terminal symbols for a given word class (i.e., noun -> ‘Frankenstein’ or ‘professor’ or ‘night’ or [any other noun in the entire book]), the generated text is structured yet completely random in its choice of the specific words that fulfill a given function in the sentence. Restricting the nonterminal grammar rules also lets me monitor whether a specific sentence structure tends to cause infinite recursion or other problems. Running this model on the Frankenstein corpus resulted in occasionally eerie text:

All delight not, spoke for tears unknown and conceived. I and a mankind , and in the place , close wonder paid to saw passed this behaviour. My beautiful not hired inquietude of immeasurable serenity these beautiful care, my variable indeed enslaved sorrow this fellow, that remained eyes with this willow endeavour in the courage of the truth before interested daylight.

The past innocence not, was I of feeble horror. A invader near a white slave which a loss answered with this truth: Man broken in the considerable atmosphere. Misery. her brother, my remorse, the world.

The text may be less convincing than that generated by the 3-gram Markov model, but the Markov text borrowed entire phrases from the source, whereas this model only borrows sentence structures. The words used to fulfill the various functions in the sentence, such as the subject or verb, are chosen entirely at random from all of the nouns and verbs found in the source text, making it unlikely that a specific chain of distinct words from the corpus, like “floating sheets of ice,” would recur here. Thus, while the 3-gram Markov text might better fool a human reader (who doesn’t have the novel memorized), the context-free grammar model is more likely to fool a computer program that detects plagiarism by searching for phrases in an online database.
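
For reference, the rule-mixing pipeline described above can be roughed out with NLTK along the following lines. This is a simplified sketch, not my actual code: the nonterminal skeletons are borrowed from NLTK’s bundled Penn Treebank sample rather than parsed from the novel itself, frankenstein.txt is a placeholder filename, and the relevant NLTK data packages are assumed to be downloaded.

```python
import random
from collections import defaultdict

import nltk
from nltk import Nonterminal, pos_tag, word_tokenize
from nltk.corpus import treebank

# Assumes the required corpora/models have been fetched, e.g.:
# nltk.download("treebank"); nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def build_rules(corpus_text, n_parsed=50):
    """Mix terminal rules from the corpus with nonterminal rules from treebank parses."""
    rules = defaultdict(list)

    # Terminal rules: every corpus word becomes a candidate expansion of its
    # POS tag, e.g. NN -> 'monster', VBD -> 'shuddered'.
    for word, tag in pos_tag(word_tokenize(corpus_text)):
        rules[Nonterminal(tag)].append([word])

    # Nonterminal rules: grammatical skeletons (S -> NP VP, NP -> DT NN, ...)
    # taken from hand-parsed sentences, with their own words discarded.
    for tree in treebank.parsed_sents()[:n_parsed]:
        for prod in tree.productions():
            if prod.is_nonlexical():
                rules[prod.lhs()].append(list(prod.rhs()))
    return rules

def expand(symbol, rules, depth=0, max_depth=12):
    """Randomly expand a symbol, giving up on branches that recurse too deeply."""
    if not isinstance(symbol, Nonterminal):
        return [symbol]
    options = rules.get(symbol)
    if not options or depth > max_depth:
        return []
    words = []
    for child in random.choice(options):
        words.extend(expand(child, rules, depth + 1, max_depth))
    return words

if __name__ == "__main__":
    corpus = open("frankenstein.txt").read()      # placeholder corpus file
    rules = build_rules(corpus)
    print(" ".join(expand(Nonterminal("S"), rules)))
```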

Because of the lyrical quality of the Frankenstein CFG output, I tried posting a few (re-formatted) outputs in the poetry criticism subreddit, /r/OCpoetry. The verses received generally positive feedback until one user noticed that my account name and post history suggested that the text was random gibberish. In retrospect, this wasn’t a particularly surprising outcome, given the unusual training corpus and the need for the bot to fool human users, so in future work I may restrict the bot to more specific types of comments or posts with formulaic structures. However, for the doodles and occasional flashes of brilliant nonsense that I come across while training the bot, I created a subreddit, /r/markovpoetry, where my bots (and other users) can post amusing output from their text-generation programs.

The full code for this project, along with sample corpora, can be found on my GitHub page.

Scratching holograms into bulletproof glass

A few months ago I came across an old piece of bulletproof glass I bought in high school when I was trying to make a camera shield for a thermite project. I had a little free time, and I was reminded of William Beatty’s famous method for making holograms, in which a virtual image is created by literally scratching a piece of acrylic glass. I tried to make a simple star pattern, and the resulting video does not really do the effect justice:

 

A much better video comes from the author himself:

 

A detailed tutorial is available on Bill Beatty’s colorful website, amasci.com. Because the depth below the surface at which the virtual image appears scales with the radius of curvature of the arcs scratched into the glass, more sophisticated images can be generated by varying the radius of curvature in order to create images at multiple depths—allowing one to create a real 3D rendering rather than a 2D image that appears to sit below a surface.

Identifying fossils using machine learning

This weekend I wrote an image processing routine that uses machine learning methods to classify fossil shark teeth from my collection.

Some of my favorite early childhood memories involve wandering up and down the beach in Venice, FL, searching for the fossilized shark teeth for which the region is known:

Over the years, my collection has grown to roughly 10,000 full or partial teeth, which are roughly sorted by morphology or, by proxy, species. Sorting the teeth by eye is not entirely trivial, particularly because of various subspecies and close relatives that have large variations in tooth shape, and because the shape of teeth from a particular species of shark will vary depending on their location in the mouth. However, because I already have a large set of teeth pre-classified, I thought I would use my collection as an opportunity to play with Python’s scikit-learn library for machine learning, to see if algorithmic methods might identify patterns or distinctions that I am missing or neglecting when I sort the teeth by eye.

My manual classification is based on the guides to each shark available on the University of Florida website, augmented by additional photos available on a now 404’d website I used to access when I was younger and first initiated the collection:

Generation of image outlines for classifier

I first drew a 1″ x 1″ grid on a piece of white paper and placed around 54 teeth within the resulting 9 x 6 grid, taking care to space them as evenly as possible. I took a quick image of the array with my iPhone, taking care to square the corners of the viewing frame (which conveniently has the same aspect ratio as the page) with the edges of the paper, allowing me to both minimize tilting/rotation and enforce the same scale for all images without using a tripod.


Fossilized sand shark teeth arranged in a 1″ x 1″ grid.

I processed the resulting images through imagesplitter, which let me divide the grid into a separate image for each shark tooth. There are probably fancier ways of creating separate segmented images for each tooth that don’t involve aligning them in a grid and splitting the photograph (MATLAB’s bwlabel() function comes to mind), but I didn’t mind having a separate image for each tooth in case they come in handy for a later project.
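
The Python analogue of that MATLAB shortcut would be connected-component labeling with scipy.ndimage; a rough sketch (with a placeholder filename and a crude fixed threshold) might look like this:

```python
import numpy as np
from PIL import Image
from scipy import ndimage

# Threshold the whole photograph and let ndimage.label find connected regions,
# the Python counterpart to MATLAB's bwlabel(). Filename and threshold are placeholders.
img = np.asarray(Image.open("teeth_sheet.jpg").convert("L"), dtype=float)
binary = img < 100                               # dark teeth on white paper

labels, n_regions = ndimage.label(binary)
print(f"found {n_regions} connected regions")    # includes any specks of noise

# Save a cropped bounding box around each labeled region as its own image.
for i, box in enumerate(ndimage.find_objects(labels), start=1):
    tooth = binary[box]
    Image.fromarray((tooth * 255).astype(np.uint8)).save(f"tooth_{i:03d}.png")
```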

I took the resulting image series for each species and opened them in Fiji (Fiji Is Just ImageJ) as a stack. While most operations for which ImageJ/Fiji is commonly used can be done easily without a GUI in MATLAB, I like using a GUI for stacks because it’s easier to preview the effect that operations have across the entire stack. For other projects, whenever I’ve needed to do something fancier than a standard segmentation or tracking, I’ve found it easier to write a Java program to run in Fiji than to go down the rabbit hole of opening and processing stacks of images in MATLAB, which never quite performs as smoothly as I would expect.

For the teeth that have left or right crooks, such as tiger shark teeth, I went through and manually flipped the images of teeth pointing the other direction, since nothing I’ve read suggests that teeth of a given chirality have other distinct features that would be useful to the classifier—the orientation just depends on which side of the shark’s mouth the tooth originally came from. (A quick look confirms that I appear to have roughly equal numbers of left- and right-pointing teeth—apparently ancient sharks didn’t preferentially chew on one side.)

I then performed rigid registration (rotation/scaling/translation) of the binary images for each species onto one another using the “StackReg” module that comes built into the Fiji “registration” toolkit. I cropped the resulting images to a square in order to simplify resizing and stacking. The sequences of registered images end up looking like this:

Different morphologies for fossilized tiger shark teeth.


Different morphologies for fossilized bull shark teeth.


Classification

In a sense, I am already passing the classifier much more information than it needs, since the boundary of the segmented region is the only feature in these images that carries information.

For this project, I thought I would try both a supervised and an unsupervised classification approach. Since I already have approximate classifications for the teeth based on my manual sorting (by shape) over the years, I can label each set of segmented images with a candidate species, train a classifier, and then apply the resulting model to a few new images of teeth to see if the classifier agrees with my best guess.

The more intriguing aspect of the project is determining whether the code can work out the distinctions itself, given an unlabeled set of teeth from multiple species. This would give me an idea of just how distinctive the different morphologies really are, and it could reveal other species that I’ve been mis-classifying simply because their teeth look so similar.
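
A bare-bones sketch of both approaches is shown below, assuming the registered, cropped binary images have been resized to a common size and sorted into per-species folders; the folder names, image size, and number of principal components are all placeholders:

```python
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def load_folder(folder, size=(64, 64)):
    """Flatten each registered binary tooth image in a folder into a feature vector."""
    vectors = []
    for path in sorted(glob.glob(f"{folder}/*.png")):
        img = Image.open(path).convert("L").resize(size)
        vectors.append(np.asarray(img, dtype=float).ravel() / 255.0)
    return np.array(vectors)

species = ["teeth/tiger", "teeth/bull", "teeth/dusky"]   # placeholder folders
X_parts, y_parts = [], []
for label, folder in enumerate(species):
    imgs = load_folder(folder)
    X_parts.append(imgs)
    y_parts.append(np.full(len(imgs), label))
X, y = np.vstack(X_parts), np.concatenate(y_parts)

# Supervised: train a Naive Bayes classifier on my manual labels and check how
# well it agrees with the manual sorting on held-out teeth.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
print("held-out agreement with manual sorting:", clf.score(X_test, y_test))

# Unsupervised: compress the outlines with PCA and let k-means look for natural
# groupings without ever seeing the labels.
X_low = PCA(n_components=10).fit_transform(X)
clusters = KMeans(n_clusters=len(species), n_init=10).fit_predict(X_low)
print("cluster assignments:", clusters)
```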

Performance

Perhaps unsurprisingly, even basic Naive Bayesian classification performed extremely well for this image set, even for similar tooth morphologies (the incisors of bull sharks look very similar to those of dusky sharks, yet the classifier was miraculously proficient at discerning them). I’d estimate an accuracy of about 96% for the collection of teeth I processed today.

Update 1/4/2015: I recently made this website, which features images (with scale bars!) of all the pieces in my fossil collection, including these shark teeth.

Follow-up: Aftermarket genetic information services

I recently described my experiences using Geno 2.0, a National Geographic-branded genetic testing kit that focuses on tracing ancestry and relatives based on information gleaned from haplogroups. At the suggestion of a comment on that post, I discovered that Nat Geo now makes the results of the test downloadable as a .CSV file containing a list of the SNPs detected in the cheek swab that I sent to them. The advantage of having the raw data from the test is that I can now use several third-party services that specialize in processing data from these tests in order to reveal additional information about the specifics of my genetic code:

1. Promethease

Promethease was the first service I investigated because the company is among the few who directly accept data from Geno 2.0 (as opposed to 23andMe and other, more popular kits). I like this particular service because it emphasizes medical traits, rather than ancestry information, making it complementary to the results reported by the National Geographic program. The processing is done by a privately-held biotechnology LLC, and the analysis cost me $5.00 via an online payment.


The top results for the Promethease screening output. The left column is a list of all SNPs recognized from the Geno 2.0 results, and the right column is a live-editable search specification box that can be used to refine search results.

The figure shows several example results listings from Promethease. The results specify a full list of SNPs identified by the software, as well as annotations associated with those SNPs that are provided by Promethease’s community of scientists and users. The founder notes that the results tend to favor negative traits (outlined in red boxes in the figure) because there are more studies showing that “Trait X causes cancer” than there are showing that “Trait Y causes superpowers.” The results are sorted by the “magnitude” of the effect, which is a subjective ranking of how severe or telling the presence of the SNP ought to be. More telling, however, is the “frequency” tab (highlighted in red in the results), which gives a rough estimate of the fraction of the US population that shares that particular SNP. For example, I was a little concerned to see that I had at least two SNPs associated with male-pattern baldness, but then I noticed that the relative proportions of the population bearing the two SNPs were 50% and 84%—dispiriting news, certainly, but hardly a uniquely bad outcome given that at least half of US males go bald by middle age.

The only good news I received was that I have an increased inflammatory response to mosquito bites, which is apparently associated with increased resistance to malaria.

Overall, I was satisfied with the results provided by the service, although it was clear that the total information provided by the Geno 2.0 test (~1,000 SNPs) was lacking in comparison to that provided by other genetic testing kits (~17,000 SNPs are analyzed for 23andMe data). The medical results are more interesting and specific than those provided by the Genographic database, and they definitely made me more interested in using another testing service in the future in order to gain even more information about my genome.

2. FTDNA Geno Data Transfer. Family Tree DNA is a Texas-based company that specializes in high-throughput DNA testing—they have been around long enough that even the National Geographic Genographic Project outsources some of its DNA kit screening to their facilities. As a result, it’s pretty easy to transfer Geno 2.0 data to their service, although immediately after doing so I was bombarded with offers of additional tests and analyses at outrageous prices. However, the additional analysis of the Geno data offered by their service is pretty worthwhile, particularly a section of their website that allows different SNPs associated with different paternal haplogroups to be traced to specific geographic regions, as shown here:


The prevalence of my paternal ancestors bearing two specific SNPs with Haplogroup I.

The lineage tracing overwhelmingly favors Europe and the East Coast of the United States, which I suspect mirrors the demographics of the people whose genomes have been widely studied and mapped using FTDNA’s services. Despite the fact that almost all of the heat maps indicated heavy clusters in Europe and the eastern United States, the graphs did offer a few interesting tidbits, particularly this graph indicating that my paternal haplogroups contain at least one SNP strongly associated with Finland, of all places:


The locations of known ancestral groups bearing the SNP N-M178 in the male haplogroup.

Overall, the service didn’t tell me much more than Genographic’s existing analysis, although it was cool to see the localization of various SNPs on heat maps. The service did very little with information about my maternal line, which is arguably more interesting because it would contain only Southern Asian traits. Instead, the only other information offered about my maternal heritage was a vague assurance that a kit purchaser in the United Arab Emirates had received results similar to mine, against which I could compare my own for an additional fee.

I think this service is useful, although it is unapologetically and often annoyingly commercial, persistently attempting to upsell my results with further testing. My belief in the credibility of the service is further undermined by the availability of additional vanity tests, like tests for the “warrior gene” or “confidence gene.”

3. Morley terminal subclade predictor. This is a quick tool that finds the terminal subclade associated with a list of SNPs from the Y chromosome. It operates using the list of SNPs detected by FTDNA, which can be copied and pasted into the simple interface. It predicted that my terminal subclade is I2a1c, which is associated with the Alpine region in Western Europe. No information about my maternal ancestry was available.

4. Interpretome. This Stanford-based tool provides a remarkably broad range of results and analyses for free. While the tool technically supports only results from 23andMe and FTDNA’s testing services, I was able to upload my Geno 2.0 data without a problem, although the results should probably be taken with a grain of salt. While the website lacks fancy graphs and visualizations of my genetic data, it includes several unique tools for gaining additional information from the limited raw data available from National Geographic’s Geno website. For example, there’s a tool that takes my autosomal DNA, combined with my age and parental heights, and predicts my actual height.


 

Using text classification to determine authorship

In high school, my classmates and I were required to use the website turnitin.com, which bills itself as a “global leader in evaluating and improving student learning” but which really exists for a single (albeit important) purpose: determining whether submitted papers are plagiarized from online sources. The use of the website was perhaps an overzealous response to the sparsely documented phenomenon of high schoolers stealing their papers from online sources, particularly because the website can do little to detect whether a paper has been purchased from a “paper mill,” where students hire others to write their essays for them. Instead, the website appears to use a strangely rudimentary regular-expression search to compare all possible substrings (down to some minimal length) in a document against a database of text compiled from online sources like Google Books and Wikipedia. I remain skeptical of the sophistication of the website’s methods because it would regularly flag phrases like “Just as when” or “at odds with,” suggesting that it doesn’t whitelist common phrases. I also hold a personal grudge because I once turned in an essay containing (MLA-formatted!) quotes from Heart of Darkness, and the program declared that I had plagiarized parts of my essay from a certain book by Joseph Conrad.

How can plagiarism be detected more effectively? I have spent this summer reading a little about text classification, and it seems apparent that a website like Turnitin could benefit from implementing a procedure that looks for whether the style of a student’s writing is consistent, rather than simply looking for copied text. It would be harder for such a site to present definitive evidence of cheating, but it would at least allow teachers to get a better idea when several students turn in similar papers or when a single one of a student’s papers demonstrates a markedly different tone and style from xyr others.

A classic paper that outlines many of the concepts and principles associated with text classification is “Inference in an authorship problem” by Mosteller and Wallace (1963). The authors apply statistical techniques to determine whether several of the Federalist Papers (1788) were written by Hamilton or Madison—at the time their article was published, no consensus existed among historians regarding which author had written which papers. The basic technique used by Mosteller and Wallace is strikingly similar to techniques used today, after the rise of modern computerized data mining: they implement a linear discriminant analysis that looks at the frequency with which the authors use certain common words (“by,” “from,” “to,” and the like). The technique works by looking for sets of words, rather than a single word, whose usage frequencies strongly differ between the two authors’ bodies of work—this set is determined by taking various weighted combinations of words, applying them to known samples of each author’s writing, and finding a combination of words and weights that yields high sums for works by one author (Madison in the paper) and low sums for the other. One advanced technique that Mosteller and Wallace implement exploits the observation that certain context-independent “function” words—like prepositions or common adverbs—have frequencies that roughly obey a Poisson distribution, allowing them to apply Bayes’ theorem when selecting their set of words in order to determine the likelihood of an observed word frequency under the assumption that the distribution of frequencies should look Poissonian.
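
The spirit of the method translates almost directly into modern Python. The sketch below substitutes scikit-learn’s linear discriminant analysis for Mosteller and Wallace’s hand-tuned Bayesian machinery, uses an arbitrary handful of function words, and assumes hypothetical text files holding the essays of known and disputed authorship:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer

# Context-independent "function" words; the exact list here is illustrative.
FUNCTION_WORDS = ["by", "from", "to", "upon", "also", "an", "of", "on", "there", "this"]

def features(texts):
    """Frequencies of the function words, per 1,000 words of each text."""
    counts = CountVectorizer(vocabulary=FUNCTION_WORDS).fit_transform(texts).toarray()
    lengths = np.array([len(t.split()) for t in texts])[:, None]
    return 1000.0 * counts / lengths

# Hypothetical files containing essays of known authorship and the disputed ones.
hamilton = [open(f"hamilton_{i}.txt").read() for i in range(1, 20)]
madison  = [open(f"madison_{i}.txt").read() for i in range(1, 20)]
disputed = [open(f"disputed_{i}.txt").read() for i in range(1, 13)]

X = features(hamilton + madison)
y = [0] * len(hamilton) + [1] * len(madison)

lda = LinearDiscriminantAnalysis().fit(X, y)    # learns the weighted word combination
print(lda.predict(features(disputed)))          # 0 = Hamilton-like, 1 = Madison-like
```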

The authors find, surprisingly, that Madison wrote every single one of the disputed papers. This contradicts various claims made by contemporaries of the original Federalists, but it corroborates several historical analyses that came out around the time Mosteller and Wallace published their work. Because the method only requires known samples of an author’s writing in order to develop a model, a website like Turnitin could implement a similar technique by archiving a student’s initial submissions or past work. Presumably, the website would then have little difficulty discerning whether a history essay was written by an adolescent or by Joseph Conrad.

Coilgun with disposable camera capacitors

In high school, one of my favorite quick projects was this disposable camera coilgun.

The concept behind the coilgun is pretty simple: disposable cameras are very, very good at charging capacitors and then rapidly discharging them through the flash tube. The spark that passes through the flash tube is governed by the voltage and capacitance of a large component in the charging circuit called the “photoflash capacitor,” a special type of electrolytic capacitor that has been optimized to release almost all of its stored charge nearly instantaneously. The output current can be amplified by taking multiple disposable cameras apart and soldering their capacitors together in parallel, producing a capacitor bank whose capacitance equals the number of capacitors times their individual rated capacitances. Because the voltage of the assembly remains the same when capacitors are connected in parallel, the capacitor bank can be charged using the standard DC charging circuit from one of the cameras, which even contains a special LED indicator that alerts you when the bank is fully charged.
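
To put rough numbers on it: photoflash capacitors in disposable cameras are typically on the order of 100–160 µF rated at about 330 V (the exact values depend on the camera model), so even a modest parallel bank stores a surprising amount of energy:

```python
# Back-of-the-envelope numbers for a parallel photoflash capacitor bank.
# Per-capacitor values are typical for disposable cameras, not measured ones.
n_caps = 10          # number of salvaged capacitors wired in parallel
C_each = 120e-6      # farads (about 120 uF per photoflash capacitor)
V = 330.0            # volts (all capacitors share the same charging voltage)

C_bank = n_caps * C_each           # parallel capacitances simply add
energy = 0.5 * C_bank * V ** 2     # E = 1/2 * C * V^2, in joules

print(f"bank capacitance: {C_bank * 1e6:.0f} uF, stored energy: {energy:.0f} J")
```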

Since the output current upon discharge scales roughly in proportion to the number of capacitors, a very large magnetic field can be generated by discharging the capacitor bank through a hollow coil of low-gauge wire (to minimize resistance). If the coil is kept short and positioned carefully, a magnetic projectile like a nail will be rapidly pulled towards the center of the coil by the giant magnetic field produced by the discharging capacitors. However (and this is where the initial position and the geometry of the coil become important), if the discharge finishes right as the projectile reaches the center of the coil, there will be no restoring force to pull the nail back towards the center when its inertia carries it past. As a result, the projectile keeps going, past the center of the coil and out the other end, exiting the device and flying across the room.

Phytoplankton are colorful

A few summers ago I had the chance to work at Mote Marine Laboratory, in the Phytoplankton Ecology Group led by Dr. Gary Kirkpatrick. This was the first time that I really considered doing research outside of traditional physics (in high school I had always imagined that my research career would consist of aligning lasers and soldering circuit boards), and so it was a really eye-opening experience that showed me how elegant the intersections between biology, computer science, and physics can be.

The image at the head of this post is a scalogram. It’s the result of performing a mathematical operation called a wavelet transform on a one-dimensional data set. This particular data set was an absorption spectrum from a sample of live phytoplankton (shown below), which gives the amount of light that the sample absorbs as a function of the wavelength of that light. This is essentially a really precise way to determine the color of the plankton, a useful metric because different species of phytoplankton tend to absorb different colors. Part of the reason for this is that different species of plankton inhabit different depths in the ocean, and the color of sunlight (and thus the wavelength at which maximal photosynthesis occurs) changes with depth due to the wavelength-dependent absorption and scattering of light by seawater.


The absorption spectrum of a mixed sample of phytoplankton. Each peak comprising the spectrum is the result of a single pigment with a Gaussian absorption spectrum, and the observed overall spectrum is a linear combination of the spectra of these individual pigments.

The scalogram is the result of taking short sections of the spectrum and comparing them to a short reference function called a wavelet, here represented by a normal distribution that’s been truncated outside of some interval (more generally, any finite section of a smooth function will work as a wavelet—it just happens that a bell-shaped distribution is a convenient first choice in many problems). A large similarity between the wavelet and a section of the spectrum results in a dark color on the plot—this indicates that that portion of the spectrum was shaped like the wavelet. Moving along the vertical axis of the plot, the width of the wavelet is gradually increased and the transform is repeated, resulting in a plot with two explanatory variables (the width of the wavelet and the location in the spectrum) and one response variable (how much that part of the spectrum matches the shape of a wavelet of that size).
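
A scalogram like this one takes only a few lines to generate. The sketch below uses the PyWavelets package and a made-up spectrum of overlapping Gaussian peaks rather than the actual plankton data (both are my own stand-ins, not what we used at the lab):

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

# Fake "absorption spectrum": a few overlapping Gaussian pigment peaks.
wavelength = np.linspace(400, 700, 600)            # nm
def peak(center, width, height):
    return height * np.exp(-((wavelength - center) / width) ** 2)
spectrum = peak(440, 15, 1.0) + peak(490, 25, 0.6) + peak(675, 10, 0.9)

# Continuous wavelet transform: compare the spectrum to a bell-shaped
# ("Mexican hat") wavelet over a range of widths.
widths = np.arange(1, 60)
coeffs, _ = pywt.cwt(spectrum, widths, "mexh")

plt.imshow(np.abs(coeffs), aspect="auto", cmap="Greys",
           extent=[wavelength[0], wavelength[-1], widths[-1], widths[0]])
plt.xlabel("wavelength (nm)")
plt.ylabel("wavelet width (scale)")
plt.title("Scalogram of a synthetic absorption spectrum")
plt.show()
```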

Why are scalograms interesting? The plot shows where the data had peaks (corresponding to pigments absorbing particular colors), but it also shows the characteristic scale of those peaks. Most plankton spectra have the large feature on the right-hand side of this plot, which corresponds to the absorption of light by chlorophyll a (the primary photosynthetic pigment found in most species of phytoplankton). The vertical axis shows that chlorophyll a absorption peaks also have a characteristic width of 6 (arbitrary units), identified by where the tree-trunk-like shape is darkest. The advantage of a plot that separates both the width and the location of absorption peaks is that one can also identify more subtle features (like the faint, wide patterns at the top of the plot) which would be almost impossible to pick out from the absorption spectrum alone. Scalograms are thus used to increase the information that can be extracted from a spectrum, by allowing features in the data to be identified by both their characteristic absorption color (the location of the peak) and their characteristic width. For a sample of water containing many different species of plankton, the added information conferred by the wavelet transform can be used to better separate out the many species contributing to the overall observed spectrum.