Revision history of a scientific manuscript


We just published a paper on the nonlinear dynamics of Neanderthal extinction in this month’s PNAS. Once we started writing the manuscript, I started keeping all of the files under what I’ll carefully call “manual version control” (…we copied and renamed each revision). This makes it easy to use Python’s difflib to track the changes in the paper from one iteration to the next. The graph above shows changes that brought the manuscript closer to the final manuscript version, with some of the major events labelled. You can see that, while the reviews caused us to rewrite many portions of the final paper (about 40% of all of the characters in the file were touched), these edits were concentrated in two bursts: the edits to get the paper accepted, and then an equal number of edits in the proofs stage (the last data point). A slightly more analog version of this data set looks like this:


Eight months of paper drafts of our manuscript, ordered from left to right. Penny for scale.

Another interesting question might be the degree to which revisions altered the text, replaced the text, or reverted previous changes. The graph below shows changes in the number of characters from each revision to the next (this is sort of like taking the time derivative of the graph above). By comparing the red and blue curves, it can be seen that most long-term oscillations involved us adding and removing entire sections to the manuscript, but shorter term oscillations (like the zig-zags in July) can be attributed to us adding and then removing the same set of text.


The code I used to parse the file names, analyze the LaTeX source files, and extract the dates is all here on GitHub. I added axes labels and annotations using Adobe Illustrator.


Building a high power voltage multiplier


A simple high-voltage circuitry project is the Cockroft-Walton voltage multiplier. I first created this as a demonstration for a class in high school, but I’ve altered it over the years in order to improve its performance. The nice thing about this project is that it can be created entirely using cheap, store-bought components—diodes and capacitors—and so it is thus relatively easy to ensure that it will work and perform at the voltage estimated. I got my components from West Florida Components, and the total cost of everything was under $10. A good tutorial that gives possible specifications for the components can be found on Instructables.

The total voltage drop across many capacitors in series is equal to the sum of the voltage drop across each component—this is a consequence of Kirchoff’s circuit laws, and can be mentally visualized as a charge on one end of a capacitor displacing and equal but opposite charge on the opposite plate, which in turn displaces an opposite charge on any other capacitor plates to which it connects, and so on. The Cockroft-Walton multiplier can be visualized as a fancy way of putting a bunch of capacitors in series and charging them so that they each have a voltage drop of 120V, resulting in a total discharge voltage of 120 V times the number of capacitors. This output is roughly DC, and it has a much lower current than the device draws from the mains, hence preserving energy conservation since power = (voltage)*(current). A simple diagram of the half-wave CW multiplier looks like this:

A circuit schematic for a half-wave Cockroft-Walton voltage multiplier.

A circuit schematic for a half-wave Cockroft-Walton voltage multiplier.

The manner by which the CW multiplier can charge each capacitor separately to 120 V is essentially by charging them in parallel and discharging them in series. The concept borrows from the design of a basic half-wave rectifier, which uses a diode and smoothing capacitor to convert the positive portions of AC sine waves to a smooth-ish DC current. The idea is that the first stage in the circuit (capacitor 1 and diode 1) converts the AC to an approximately constant DC signal, which then gets fed forward through diode 2 to the right plate of capacitor 2. During the first positive cycle, that capacitor charges to +120V. During the “off” cycle (the negative portion of the AC sine wave gets blocked by the first diode), the second capacitor discharges through diode 3 into capacitor 3 because, during the off cycle, there’s now -120V on the bottom plate of that capacitor, leading to a potential difference that allows charging. During the next “on” cycle, the current ignores capacitor 3 because it is fully charged (and so it essentially acts like a break in the circuit there), and so now capacitor 4 gets charged instead. During the next off cycle, capacitor 4 discharges through diode 4 to charge capacitor 5, and the cycle repeats itself until, after (number of capacitors)x(charging time) all the capacitors are charged.

There are several equivalent ways of visualizing what is going on in the CW circuit, but the key things to remember are that the capacitors store the charge (differential) and the diodes force the AC to feed forward and charge each capacitor in sequence. The charging time can be adjusted by adjusting the time constants for each capacitor in the circuit relative to the AC cycle frequency (60 hz in the US).

The device would build up charge twice as quickly if one instead uses a full-wave design (which is analogous to a full-wave bridge rectifier), because it would then take advantage of the negative swing of the AC sine wave, which gets lopped off by the first diode in the half-wave version.

I the video above, I have added a switch and fuse for safety reasons (visible in the upper-left hand portion of the screen; I used a plastic lid as a base for the two components). In the first cut, the ~1 mm spark regularly produced by the device is visible. This spark can be used to drive continuously an 8 inch fluorescent tube (shown in the second section), but, curiously, the frequency of the pulses through the fluorescence in the tube depends on the proximity of other conducting objects—in the fourth clip, it is apparent that touching pliers to the glass reduces the frequency of the pulses, rather than increasing them as I would have expected. My best guess for the cause of this effect is charge build-up on the glass interior beneath the metal, leading to low-frequency discharge for the same reason that high-voltage capacitors decrease the sparking frequency in the primary circuit of a spark-gap Tesla coil. The last clip shows the device discharging through a 1 inch xenon flash tube salvaged from a disposable camera. The firing frequency is low due to the relatively large distance that the spark has to cover, despite the low dielectric constant of xenon gas. In other tests, I’ve noticed that large spark gaps that require charge build-up over periods longer than the ~3-5 s for the flash tube will generally result in short circuits occurring upstream in the capacitors in the CW, which probably cause damage to the solder joints and possible the capacitor ceramic due to dielectric breakdown.

Cockroft-Walton generators have a special significance in the history of physics because they were used to generate steering currents in one of the earliest particle accelerators, enabling their creators to win the first Nobel Prize in Physics ever given to a collider project. For this reason, one of the first large-scale CW multipliers (manufactured by Philips Co.) is prominently displayed in the National Science Museum in London:


A Cockroft-Walton generator built in 1937 by Philips of Eindhoven. National Science Museum, London, England.

A Cockroft-Walton generator built in 1937 by Philips of Eindhoven. National Science Museum, London, England. Image from Wikimedia Commons.



Algorithmic interactions on social networks

I recently built a simple program that attempts to impersonate users of an online forum using a combination of a Markov model and a context-free grammar.

Recently, I’ve been using Python to scrape data from the content-aggregation website Reddit, at which users can submit links and vote on each other’s submissions. Links that generate large numbers of upvotes in the community may eventually make it to the front page of Reddit, where they will be viewed by millions of individual casual browsers as they first enter the site. Reddit’s voting algorithm and method for determining the ranking is closely guarded in order to prevent marketing companies from spamming the front page with links.

A really great Python utility for working with Reddit data is PRAW, which provides an interface between Reddit’s API and Python. The module not only allows easy scraping of information like the popularity and content top articles, the number of upvotes, and the number of comments; it also simplifies the creation of “bots,” or automated users in the Reddit community that comment on posts. The function of bots ranges from practical—one bot posts a text-mined summary of every Wikipedia article that makes it to the front page, while another posts metric conversions of every reddit headline containing imperial units—to whimsical: one bot periodically scans comment threads for rhyming words, and adds a comment to the thread demanding users cease deploying puns.

For my bot, I wanted to create a user who could convincingly act like a normal “human” user. My immediate idea was to scrape comments on an article, create a word-as-token Markov model, and then automatically post a comment containing a fixed-length output generated by the model. Markov text generators are an incredibly common opening exercise in introductory computer science courses—they are relatively easy to prepare, yet can produce surprisingly realistic (and occasionally humorous) output if trained on the right corpus of text, like this sample output from a 3-word-token Markov model I trained using Mary Shelley’s Frankenstein; or, The Modern Prometheus:

This towering above all, thy mountains, obscured in darkness and cast my eyes were shut to the conversations he held with his parents ever since our infancy. We were in my enterprise. I have confessed myself guilty of a peculiar and overpowering nature ; nor do the floating sheets of ice , which was one which attracted my attention suddenly grew despicable. By one of your sentiments of this wretched mockery of a different conclusion

Because of the relatively small size of the training text in comparison to the token size, a lot of the phrases, like “his parents ever since our infancy” or “floating sheets of ice,” are complete phrases taken directly from the novel. Nonetheless, randomly picking three word phrases based on their frequency in the corpus captures the style of the novel remarkably well, despite there being no explicit specification of sentence structure when the code is trained on the corpus. If I reduce the order of the model to 2, I get an output text that’s significantly more “original,” but also marred with more grammatical and stylistic errors:

This apprehensions as he had hitherto been present to your health rendered , no disaster is murdered , on your duty towards the most beautiful that you could I opened my, I, the reflections determined thenceforth to a direction, and my appetite. One by Elizabeth. She was not unfolded to recollect what has been adduced against me was free last night; such a fiend can not describe their native town of great crime , in death shall be the beginning of food or take their inquiries clear conception of the dashing waves continually renewed violence.

I originally intended to have my bot write pseudoscience or new-age philosophy, which would convincingly allow other users to overlook its frequent lapses in grammar as kooky derailment. I quickly learned, however, that the output of a Markov model is distinctive enough that other, tech-savvy users could infer that the comments were algorithmic. Looking to refine my approach, I instead investigated to approach used by SCIgen, a well-known automatic scientific paper generated that gained attention in 2005 when its authors successfully submitted SCIgen papers with titles like “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” to several less-reputable journals and conferences. Contrary to what I expected, SCIgen does not use an augmented Markov model, but rather context-free grammar, a token-based approach to generating text that takes into account the natural hierarchies observed in sentence structure first described by Chomsky. The method is outlined in much better detail elsewhere, but the essential concept is that sentences contain a natural hierarchy of information that motivates their representation as a tree data structure. For example, the sentence, “The dog who was dirty ate the slippers” can be split first into two parts—the subject cause about the dirty dog, and the verb clause about his crime—and each of these parts can be further subdivided into adjectival clauses, predicates, and finally individual words like articles, nouns, and adjectives. In a context-free grammar, the non-terminal nodes of the tree (clauses, etc) are governed by production rules (verb clause ->  verb + article + direct object) that state the next levels of decomposition, each of which has its own production rule (article -> “the” OR “a” OR “an”). A more advanced grammar tree (for a Victorian sentence from “Frankenstein”) looks like this:

A context-free grammar for a sentence from Frankenstein, parsed using the NLTK module for Python.

A context-free grammar for a sentence from Frankenstein, parsed using the NLTK module for Python.

A CFG text generator would start with the start symbol (‘S’) and then move down the tree from left to right, outputting a terminal symbol each time it sees one and then moving back up the the lowest incomplete branch. In order to get truly random text, the CFG is trained with many sentences, and so the parser will have multiple possible options for each symbol—after it parses S, it could choose to move to a sentence that looks like NP VP (like the one above), or one that looks like NN PN (subject noun – predicate noun, like “the man is a dog”). After it randomly makes that decision, it then has many options for each of the subsequent nonterminal nodes, and then it finally has a choice of many possible terminal symbols (once it reaches an NN, it can pick from any of the NN used in the training set, since they are syntactically equivalent).

One noteworthy detail of the sentence shown in the figure is the presence of repeated units, like  noun phrases (NP) that link to other noun phrases, which capture the fractal nature of language constructions in which sentence-like clauses are used to modify individual components of a sentence. For this reason, when generating text from a CFG, it is very easy to implement recursion, but it’s also very easy for that recursion to get stuck in an infinite loop, where a lower symbol links to a higher symbol, resulting in repetition:

The professor hated the man who hated the professor who hated the man who hated the…

The sentence is grammatical but undesirable. For large grammars, the incidence of long sentences seems to increase, since the number of possible interlinking loops and knots in the grammar tree becomes large.

In order to use a CFG to generate grammatically-correct gibberish, I made use of the famous NLTK module for Python, which contains many tools for processing and cleaning natural language data sets, as well as text in which human experts have tagged individual words as nouns, verbs, etc. This makes it possible to use pre-built functions in the module to tag individual words in a user-chosen sample text based on their similarity to words in the text used by the experts, while simultaneously identifying the overall relations among words in a data set in order to identify non-terminal symbols like clauses. In my particular code, I scan a text for all terminal symbols first, which is a compratively fast operation. I then pick sentences at random and parse their full grammar (which can take up to a minute), but then keep only their nonterminal symbols (so I discard individual words). I then concatenate the word rules with the nonterminal production rules, resulting in a grammar with the full vocabulary of my corpus, but only a subset of its grammatical structures. Since there are multiple possible terminal symbols for a given word (ie, noun -> ‘Frankenstein’ or ‘professor’ or ‘night’ or [any other noun in the entire book]), the generated text is structured yet completely random in its choice of specific word to fulfill a given function in the sentence. But restricting the nonterminal grammar rules also allows me to monitor whether a specific sentence structure tends to cause infinite recursion or other problems. Running this model with the Frankenstein corpus resulted in occasionally eerie text:

All delight not, spoke for tears unknown and conceived. I and a mankind , and in the place , close wonder paid to saw passed this behaviour. My beautiful not hired inquietude of immeasurable serenity these beautiful care, my variable indeed enslaved sorrow this fellow, that remained eyes with this willow endeavour in the courage of the truth before interested daylight.

The past innocence not, was I of feeble horror. A invader near a white slave which a loss answered with this truth: Man broken in the considerable atmosphere. Misery. her brother, my remorse, the world.

The text may be less convincing than that generated by the 3-gram Markov model, but that text borrowed entire phrases from the source text whereas this model only borrows specific sentence structures: the words used to fulfill the various functions in the sentence, such as the nominative or verb, are chosen entirely at random from all nouns and verbs found in the source text, making it pretty unlikely that a specific chain of multiple distinct words in the corpus, like “floating sheets of ice,” would recur here. Thus while the 3-gram Markov text might better fool a human reader (who doesn’t have the novel memorized), the context-free grammar model is more likely to fool a computer program that detects plagiarism by searching for phrases in an online database.

Because of the lyrical quality of the Frankenstein CFG output, I tried to post a few (re-formatted) outputs in the poetry criticism subreddit, /r/OCpoetry. The verses received generally positive feedback until one user noticed that my account name and post history suggested that the text was random gibberish. In retrospect, this wasn’t a particularly surprising outcome, given the unusual training corpus and the need for the bot to fool human users, and so in future work I may try to restrict the bot to more specific types of comments or posts with formulaic structures. However, for the doodles or occasional flashes of brilliant nonsense that I come across while training the bot, I created a subreddit /r/markovpoetry where my bots (and other users) can post amusing grammar generated by their text-generation programs.

The full code for this project, along with sample corpora, can be found on my GitHub page.

Scratching holograms into bulletproof glass

A few months ago I came across an old piece of bulletproof glass I bought in high school when I was trying to make a camera shield for a thermite project. I had a little bit of free time, and so I was reminded of William Beatty’s famous method for making holograms, in which a virtual image can be created by literally scratching a pice of acrylic glass. I tried to make a simple star pattern, and the resulting video does not really do the effect justice:


A much better video comes from the author himself:


A detailed tutorial is available on Bill Beatty’s colorful website, Because the depth below the surface at which the virtual image appears scales with the radius of curvature of the arcs scratched into the glass, more sophisticated images can be generated by varying the radius of curvature in order to create images at multiple depths—allowing one to create a real 3D rendering rather than a 2D image that appears to sit below a surface.

Identifying fossils using machine learning.

This weekend I wrote an image processing routine that uses machine learning methods to classify fossil shark teeth from my collection.

Some of my favorite early childhood memories involve wandering up and down a the beach in Venice, FL, searching for fossilized shark teeth for which the region is known:

Over the years, my collection has grown to roughly 10,000 full or partial teeth, which are roughly sorted by morphology or, by proxy, species. Sorting the teeth by eye is not entirely trivial, particularly because of various subspecies and close relatives that have large variations in tooth shape, and because the shape of teeth from a particular species of shark will vary depending on their location in the mouth. However, because I already have a large set of teeth pre-classified, I thought I would use my collection as an opportunity to play with Python’s scikit-learn library for machine learning, to see if algorithmic methods might identify patterns or distinctions that I am missing or neglecting when I sort the teeth by eye.

My manual classification is based on the guides to each shark available on the University of Florida website, augmented by additional photos available on a now 404’d website I used to access when I was younger and first initiated the collection:

Generation of image outlines for classifier

I first drew a 1″ x 1′ grid on a piece of white paper and placed around 54 teeth within the resulting 9 x 6 grid, taking care to space them as evenly as possible. I took a quick image of this array with my iPhone, taking care to square the corners of the viewing frame with the page, which conveniently has the same aspect ratio as the page, allowing me to both minimize tilting/rotation and to enforce the same scale for all images without using a tripod.

Fossilized sand shark teeth.

Fossilized sand shark teeth arranged in a 1″ x 1″ grid.

I processed the resulting images through imagesplitter, allowing me to divide the grid into separate images for each shark tooth. There are probably fancier ways of creating separated segmented images for each tooth that don’t involve aligning them in a grid or splitting the images (MATLAB’s bwlabel() function comes to mind), but I didn’t mind having separate images for each tooth in case they come in handy for a later project.

I took the resulting image series for each species and opened them in Fiji (Fiji is just ImageJ) as a stack. While most operations for which ImageJ/Fiji are commonly used can be done easily without a GUI in MATLAB, I like using a GUI for stacks because it’s easier to preview the effect that operations have across the entire stack. For other projects, whenever I’ve needed to something fancier than a standard segmentation or tracking I’ve found it easier to write a Java program to run in Fiji than to go down the rabbit hole of opening and processing stacks of images in MATLAB, which never quite performs as smoothly as I would expect.

For the teeth that have left or right crooks, such as tiger shark teeth, I went through and manually flipped the images of teeth pointing the other direction, since nothing I’ve read suggests that teeth with a given chirality have other distinct features that would be useful to the classifier—the orientation just depends on which side of the shark’s mouth the tooth originally came from (a quick lookover confirms that I appear to have roughly equal numbers of left and right pointing teeth—apparently ancient sharks didn’t preferentially chew on one side)

I then performed rigid registration (rotation/scaling/translation) of the binary images for each species onto one another using the “StackReg” module that comes built-into the Fiji “registration” toolkit. I cropped the resulting images to a square in order to simplify resizing and stacking. The sequences registered images end up looking like this:

Different morphologies for fossilized tiger shark teeth.

Different morphologies for fossilized tiger shark teeth.

Different morphologies for fossilized bull shark teeth.

Different morphologies for fossilized bull shark teeth.


In a sense, I am already passing the classifier much more information than it needs to know, since the boundary of the segmented region is the only feature that contains information in the images.

For this project, I thought I would try both a supervised and unsupervised classification approach. Since I already have approximate classifications for the teeth based on my manual sorting (by shape) over the years, I can label each set of segmented images with a candidate species, train a classifier, and then apply the resulting regression to a couple of new images of teeth to see if the classifier agrees with my best guess.

The more intriguing aspect of the project is determining whether the code can work out distinctions itself given an unlabeled set of teeth from multiple species. This would give me an idea of just how distinctive the different morphologies really are, and it could reveal the presence of other species with similar looking teeth that I’ve been mis-classifying because they look so much like other species.


Perhaps unsurprisingly, even basic Naive Bayesian classification performed extremely well for this image set, even for similar tooth morphologies (in incisors of bull shark teeth look very similar to those of dusky sharks, yet the classifier was miraculously proficient at discerning them. I’d estimate an accuracy of about 96% for the collection of teeth I processed today.

Update 1/4/2015: I recently made this website, which features images (with scale bars!) of all the pieces in my fossil collection, including these shark teeth.

Using text classification to determine authorship

In high school, my classmates and I were required to use the website, which bills itself as a “global leader in evaluating and improving student learning” but which really exists for a single (albeit important) purpose: determining whether submitted papers are plagiarized from online sources. The use of the website was perhaps an overzealous response to the sparsely-documented phenomenon of high schoolers stealing their papers from online sources, particularly because the website can do little to detect whether a paper has been purchased from a “paper mill,” where students can hire others to write their essays for them. Instead, the website appears to use a strangely rudimentary regular expressions search capability to compare all possible substrings (down to some minimal length) in a document against a database of text compiled from online sources like Google Books and Wikipedia. I remain skeptical of the sophistication of the website’s methods because it would regularly flag phrases like “Just as when” or “at odds with,” suggesting that it didn’t contain any sort of whitelist for common phrases. I also hold a personal grudge because I once turned in an essay containing (MLA formatted!) quotes from Heart of Darkness, and the program declared that I had plagiarized parts of my essay from a certain book by Joseph Conrad.

How can plagiarism be detected more effectively? I have spent this summer reading a little about text classification, and it seems apparent that a website like Turnitin could benefit from implementing a procedure that looks for whether the style of a student’s writing is consistent, rather than simply looking for copied text. It would be harder for such a site to present definitive evidence of cheating, but it would at least allow teachers to get a better idea when several students turn in similar papers or when a single one of a student’s papers demonstrates a markedly different tone and style from xyr others.

A classic paper that outlines a lot of the concepts and principals associated with text classification is “Inference in an authorship problem” by Mosteller and Wallace (1963). The authors apply statistical techniques to analyze whether several the Federalist Papers (1788) were written by Hamilton or Madison—at the time of the publication of their article, no consensus existed among historians regarding which author had written which papers. The basic technique used by Mosteller and Wallace is strikingly similar to techniques used today after the rise of modern computerized data mining: they implement a linear discriminant analysis that looks at the frequency that the authors use certain common words (“by”,”from”,”to,” and the like). The technique works by looking for usage frequencies for sets of words, rather than a single word, that strongly differ between the two authors bodies of work—this set of words is determined by taking various weighted combinations of words and applying them to known samples of each author’s works, and finding a combination of words and weights that yields the high sums for works by one author (Madison in the paper) and low sums for the other author. Once advanced technique that Mosteller and Wallace implement involves the observation that certain, context independent “function” words—like prepositions or common adverbs—have frequencies that roughly obey a Poisson distribution, allowing them to apply Bayes theorem when selecting their set of words in order to determine the likelihood of an observed word frequency given the assumption that the distribution of frequencies should look Poissonian.

The authors find, surprisingly, that Madison wrote every single one of the disputed papers. This contradicts various claims made by contemporaries of the original federalists, but it corroborates several historical analyses that came out around the time that Mosteller and Wallace published their original papers. Because the method only requires known samples of an author’s work in order to develop a model, a website like Turnitin could implement a similar technique by archiving a student’s initial submissions or past work. Presumably, the website would then have little difficulty discerning whether a history essay was written by an adolescent or Joseph Conrad.

Coilgun with disposable camera capacitors

In high school, one of my favorite quick projects was this disposable camera coilgun.

The concept behind the coilgun is pretty simple: disposable cameras are very, very good at charging capacitors, and then rapidly discharging them through the flash tube. The spark that passes through the flash tube is related to the voltage and capacity of a large component in the charging circuit called the “photoflash capacitor,” which is a special type of electrolytic capacitor that has been optimized to ensure that it can release almost all of its stored charge nearly instantaneously. The output current of the photoflash capacitor can be amplified by taking multiple disposable cameras apart and soldering their capacitors together in parallel to produce a capacitor bank with capacitance equal to the number of capacitors times their individual rated capacities. Because the voltage of the assembly remains the same when capacitors are connected in parallel, the capacitor bank can be charged using the standard DC charging circuit from one of the cameras, which even contains a special LED indicator that alerts when the bank is fully charged.

Since the output current upon discharge scales roughly proportionately to the number of capacitors, a very large magnetic field can be generated by discharging the capacitor bank through a hollow coil of wire with low gauge (to minimize resistance). If the coil is kept short and positioned carefully, a magnetic projectile like a nail will be very rapidly pulled towards the center of the coil by the giant magnetic field produced by the capacitors discharging. However (and this is where the initial position and the geometry of the coil become important) if the discharge finishes right as the projectile reaches the center of the coil, there will be no restoring force to pull the nail back towards the center when it overshoots due to its inertia from rapidly being pulled into the coil. As a result, the projectile will keep going, past the center of the coil and out the other end, exiting the device and flying across the room.