Revision history of a scientific manuscript


We just published a paper on the nonlinear dynamics of Neanderthal extinction in this month’s PNAS. Once we started writing the manuscript, I started keeping all of the files under what I’ll carefully call “manual version control” (…we copied and renamed each revision). This makes it easy to use Python’s difflib to track the changes in the paper from one iteration to the next. The graph above shows changes that brought the manuscript closer to the final manuscript version, with some of the major events labelled. You can see that, while the reviews caused us to rewrite many portions of the final paper (about 40% of all of the characters in the file were touched), these edits were concentrated in two bursts: the edits to get the paper accepted, and then an equal number of edits in the proofs stage (the last data point). A slightly more analog version of this data set looks like this:


Eight months of paper drafts of our manuscript, ordered from left to right. Penny for scale.

Another interesting question might be the degree to which revisions altered the text, replaced the text, or reverted previous changes. The graph below shows changes in the number of characters from each revision to the next (this is sort of like taking the time derivative of the graph above). By comparing the red and blue curves, it can be seen that most long-term oscillations involved us adding and removing entire sections to the manuscript, but shorter term oscillations (like the zig-zags in July) can be attributed to us adding and then removing the same set of text.
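As a rough illustration of how difflib can produce both kinds of measurements, here is a minimal sketch. It is not the code from the repository: the function names and the three toy "drafts" are made up for the example, and a real run would read each LaTeX revision from disk instead.

```python
import difflib

def similarity_to_final(revisions):
    """For each revision, return the difflib similarity ratio to the last one."""
    final = revisions[-1]
    return [
        difflib.SequenceMatcher(None, rev, final, autojunk=False).ratio()
        for rev in revisions
    ]

def chars_added_removed(old, new):
    """Count characters inserted and deleted when going from `old` to `new`."""
    added = removed = 0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # characters of `old` that did not survive
        if tag in ("replace", "insert"):
            added += j2 - j1    # characters of `new` that were not in `old`
    return added, removed

# Hypothetical stand-ins for successive manuscript files.
drafts = [
    "Neanderthals went extinct.",
    "Neanderthals went extinct because of competition.",
    "Neanderthals went extinct, possibly because of competition.",
]

print(similarity_to_final(drafts))
for old, new in zip(drafts, drafts[1:]):
    print(chars_added_removed(old, new))
```

Tracking the added and removed counts separately is what makes it possible to tell a net change in length apart from text that was added and then taken out again.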


The code I used to parse the file names, analyze the LaTeX source files, and extract the dates is all here on GitHub. I added axis labels and annotations using Adobe Illustrator.


Two-inch sparks using a television flyback

Following the guidance of one of my favorite Instructables tutorials, this weekend I used a broken CFL bulb and the transformer from an old television to generate a reliable stream of two-inch sparks:

The concept here is pretty similar to that described in my ignition coil relay project: the circuit board at the base of a CFL bulb transforms mains power into a high-frequency, high-voltage signal that serves as the “spark” that illuminates the argon/mercury vapor inside the fluorescent envelope. If this signal is instead routed to the terminals of the primary circuit of a large step-up transformer, a high-frequency signal at an even higher voltage can be attained (at the expense of a lower current). In this case, the transformer is salvaged from an old CRT-type television screen or monitor, in which a high-voltage beam of electrons is selectively fired at a phosphor coating on the screen in order to create the image.

Scaling behavior of popular reddit posts

A plot of the score versus hour since posting for 53 different posts in /r/pics over the span of one month

I wrote a program that algorithmically monitors viral links in the subreddit /r/pics in order to see whether I can identify traits of posts that tend to become popular through the site. Users can give a thumbs up or thumbs down vote to a picture posted on the /r/pics subdomain of the site, and the total “score” of a post is given by the number of “up” votes minus the number of “down” votes. The individual pages of reddit look essentially like long lists showing various posted images sorted by a combination of their total score and how recently they were posted. This system ensures that users can easily view very popular, fresh material generated by other users, like funny pictures of pets or beautiful pictures of the sunset. A very popular image can make the “front page,” or the default landing page that a user visiting will see, and this can result in an image accruing millions of unique pageviews over just a few short hours.

In the image at the top of this post, I’ve plotted the score as a function of time since posting for 53 links posted during the month of July. The content of these image links is remarkably varied, ranging from a beautiful image of Dubai to an adorable three-legged cat. They all appear to display qualitatively similar behavior—rapid exponential growth followed by a plateau at a final total number of upvotes, resulting in an overall sigmoidal shape—but their traits like initial growth rate, maximum growth rate, and position of asymptote vary widely. One aspect of this behavior I want to investigate further in the future is the relationship between these voting patterns and logarithmic random walks, which may reveal meaningful correlations between properties like the growth rate and the final score.
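One common way to describe a curve with this shape is a logistic function. The sketch below is only an illustration of that idea: I have not fitted it to the reddit data, and the parameter names and values are invented for the example.

```python
import math

def logistic_score(t, final_score, growth_rate, t_mid):
    """Logistic (sigmoidal) curve: near-exponential growth early on, then a plateau.

    final_score -- the asymptote the score approaches at long times
    growth_rate -- sets how steep the rise is (units of 1/hour)
    t_mid       -- hours after posting at which the score reaches half its final value
    """
    return final_score / (1.0 + math.exp(-growth_rate * (t - t_mid)))

# Illustrative values only, not taken from any real post.
for hour in range(0, 13, 2):
    print(hour, round(logistic_score(hour, final_score=3000, growth_rate=1.0, t_mid=5)))
```

In this parameterization the fastest growth happens at `t_mid`, where the slope equals `final_score * growth_rate / 4`, so the traits mentioned above map onto the three parameters.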

The key feature of reddit that I was hoping to study by tracking posts is the degree to which reddit and its admins “fuzz,” or artificially interfere with the score of posts in order to ensure that viral posts are not subject to vote manipulation by computer programs or automated voting software. This practice allows the admins to alter (or diminish) the reported score of popular posts in order to prevent users from manipulating the website. For this reason, many “viral” posts that accumulate large numbers of upvotes (and which subsequently make it to the front page of the website) will report a score of 1000-5000, even though their actual score may be an order of magnitude greater than that. This practice is clearly responsible for the asymptotes in the score versus time graph, but it also manifests as the dramatic drops in score observed in many of the most popular posts on the graph. Although the links were not all posted at the same time (I’ve shifted the data in order to display it as “time since posting”), these sudden drops in score appear to occur at the same time relative to the original posting date, since the discontinuities line up at regular intervals in the graph. I have not yet been able to find a definitive explanation for why this occurs, but I would guess that it serves to prevent posts that reach the front page extremely rapidly from staying there for days due to their high initial priority. This would suggest that a user who is looking to maximize the final score of their posts ought to post links that are popular, but not so popular that they activate this automatic vote culling.

A semilog plot of the number of comments versus hours since posting for 53 different posts in /r/pics over the course of one month.

Another interesting feature of reddit lies in the ability of users to comment on one another’s posts. I made a plot similar to the score versus time plot (above), and the results are similarly sigmoidal for all posts, although the two time constants dictating initial growth rate and approach to saturation both appear to be much slower, presumably because reddit doesn’t “fuzz” comments. I used a semilog plot to display this data, since otherwise the field was too crowded to see the data properly.

A log-log plot of the number of comments versus the score of posts.

Another interesting property is the correlation between the score at a given timepoint and the number of comments: presumably more popular links generate new comments at a faster rate, leading to a superlinear correlation between the number of comments and the score. This behavior is confirmed in the log-log plot shown above, which appears linear for many of the posts with a surprisingly narrow range of slopes. Since lines on a log-log plot correspond to power-law behavior, this plot suggests that there may be a universal power law (with a narrow range of critical exponents) that dictates the growth of comments over time for popular posts. It’s worth noting that popular reddit posts appear to accrue comments indefinitely, even after their score has stabilized, and so the above graph doesn’t fully capture the saturation region for comments over long timescales (in which comments increase but score remains the same, captured by the “uptick” at high score for some of the trendlines).
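To make the link between straight lines on a log-log plot and power laws concrete, here is a small sketch that estimates the exponent with an ordinary least-squares fit in log space. The function name is my own and the data are synthetic (generated from a known power law), so this only shows the method, not a result from the reddit posts.

```python
import math

def power_law_fit(scores, comments):
    """Fit comments = prefactor * score**exponent by least squares on the logs."""
    xs = [math.log(s) for s in scores]
    ys = [math.log(c) for c in comments]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    exponent = sxy / sxx                      # slope of the line on the log-log plot
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor

# Synthetic data that follows comments = 0.5 * score**1.3 exactly.
scores = [10, 50, 100, 500, 1000, 3000]
comments = [0.5 * s ** 1.3 for s in scores]

exponent, prefactor = power_law_fit(scores, comments)
print(exponent, prefactor)
```

An exponent greater than 1 corresponds to the superlinear case, where comments grow faster than the score.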

While it’s probably tedious to keep track of 53 different colors, here is a legend showing the post ID for all the posts depicted here. The original link can be generated by inserting the 5-character string into the appropriate place in the reddit URL:

This analysis was made possible using the PRAW and pandas packages for Python.

The post ID for each of the 53 posts in /r/pics shown in the previous figures. Individual posts can be viewed by inserting the post ID into a URL of the form:

Remnants of my Tesla coil

My first year of high school I tried to build a functioning, high frequency Tesla coil entirely from scrap parts. This project is almost a cliche nowadays; thousands of dedicated hardware hackers have successfully created ominous and occasionally dangerous coils, and so-called “singing” Tesla coils are the new trend among hobbyists. But the project was one of my first earnest attempts to learn about something on my own and apply that knowledge to a non-scholastic project, and so I wanted to link to a few resources here that I found invaluable when I was first starting out:


The Powerlabs Tesla coil page. This is the most “professional” Tesla coil I have found that was built by a hobbyist. The craftsmanship is impeccable, from the precision of the secondary coil winding to the care with which the capacitor bank was assembled. The care is reflected in the results; I am very confident that this is one of the most efficient Tesla coils I’ve come across, as it appears to regularly generate 18-inch streamers despite its compact size.

The trashy Tesla coil. I like this project because the author defiantly avoids using any store-bought components or parts, using piping and wiring entirely scavenged from his local rubbish yard. This site is also home to one of my favorite anecdotes from a hobbyist:

For some funky reason every time I switched on the power, the sprinkler system in the yard turned on. I’m not kidding here. The yard gets watered every time I fire it up.

Primary and Secondary Coil

The long, red coil in the image at the top of this post is the secondary coil from my own Tesla coil, which took me about a week of winding 28 gauge enamel-coated wire over oven-baked PVC pipe. That the toroid is a doorknob is a good tip-off that the payload isn’t resonantly coupled. The pancake-spiral in the foreground is a remnant of my original primary coil design, which I based on a tutorial found on this page.


I first realized how attainable a homemade Tesla coil would be when I saw just how simple it can be to make high-voltage capacitors at home in the form of Leyden jars, which can be made from a film canister or bottle and some aluminum foil. Using a CD jewel case and some foil, I’ve even made capacitors that can be charged from a CRT screen but which will produce 3-inch sparks upon discharge—although predicting the discharge rate and stability of Leyden jars against dielectric breakdown is almost an art when one is using plastics and glass with aluminum foil. The best page for an intro to Leyden jars and their uses can be found here.

Primary transformer

Most Tesla coils use a step-up transformer even before the current reaches the primary circuit. This allows shielding of the electrical mains from sparks and shorts in the primary circuit, and it also allows one to get by using capacitors made from beer bottles, air gap discharges, etc. because a higher voltage primary circuit requires less finicky specifications (it would also be very difficult to use a spark gap to modulate the frequency if one was only using mains voltage). I originally ran my coil off of car batteries by using an electromagnetic buzzer and a pair of ignition coils in my primary circuit; however, if I were rebuilding it today I would instead use a neon sign transformer, which I believe offers much more reliable and safe performance despite running on mains power. Here’s a buying guide for NSTs for Tesla coils.

Spark Gap

When I was in high school, I always found the spark gap to be the most mysterious component in the Tesla coil primary circuit. After all, the primary circuit is already an AC circuit, and it seems like forcing the current to regularly jump an air gap would induce significant power losses that would reduce the efficiency of the transformer. The latter point is correct, but it turns out that the spark gap is still worthwhile because the AC coming out of the HV transformer that drives the primary circuit typically oscillates at the mains frequency, which is far too slow to excite most Tesla coil designs at their resonant frequency, given the dimensions and couplings of the primary and secondary coils. The spark gap allows the capacitors to charge until the gap breaks down and then discharge rapidly through the primary coil, at a rate set by their time constants and the properties of the spark gap itself (since things like pointed electrodes can create corona discharge, reducing the effective breakdown voltage of the gap). As a result, the spark gap acts like a high-power switch that is fast enough to allow effective transfer of energy between the primary and secondary coils. A good description of the idea behind using a spark gap (instead of a high-power relay and integrated circuit or other solid-state switch) can be found here and here.

Computer-generated prophecies of the apocalypse

Who said ever to no crown to was the listener and SECOND: my man, and to the wheat,
and had eat in this angel.

Their pieces kill the sort the angel come up to another translucent and weep any stone.
Her timeless will measure them to the day, hold created with earth noises and hurled every nation.

There shown out upon the voice
It be in seventh which is to trample, I.

This tampering opened not for its time.

The land to their moment Who threw their glory to cherish that art.

The glory to the speaking, and at that white appearance, and say given the thousand for the sake.
And said show in myself. And it of no sweet victory whose gateways enemies was loathe to the bowl
and it for them and worked out as my hast to every vision.

Their noise erase me.

I’ve been working on improving my context-free grammar parser for the automated generation of sentences. My current heuristic involves picking a sentence at random from a text, parsing it, and then swapping all terminal symbols in the sentence with other, equivalent symbols drawn randomly from the pool of all valid terminal symbols found in the text. This means that the structure of every sentence generated by my program will match the structure of at least one sentence in the text, but the specific words or terminal symbols are entirely unconstrained aside from their syntactic function. I tried this approach on the 1957 translation of the Book of Revelation, and I ended up with the spooky (albeit occasionally ungrammatical) prophecies at the top of this post.
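The swapping step can be sketched in a few lines of Python. This is a simplified stand-in rather than my actual parser: it assumes each sentence has already been reduced to a flat list of (word, tag) pairs instead of a full parse tree, and the tiny hand-tagged corpus and tag names below are made up for the example.

```python
import random
from collections import defaultdict

# Hypothetical hand-tagged corpus; a real run would get the tags from a parser.
TAGGED_CORPUS = [
    [("the", "DET"), ("angel", "NOUN"), ("opened", "VERB"), ("the", "DET"), ("seal", "NOUN")],
    [("a", "DET"), ("voice", "NOUN"), ("measured", "VERB"), ("every", "DET"), ("nation", "NOUN")],
    [("their", "DET"), ("glory", "NOUN"), ("fell", "VERB")],
]

def build_terminal_pool(corpus):
    """Group every word in the corpus by its syntactic tag."""
    pool = defaultdict(list)
    for sentence in corpus:
        for word, tag in sentence:
            pool[tag].append(word)
    return pool

def generate_sentence(corpus, rng):
    """Pick a random sentence as a template and swap each word for one with the same tag."""
    pool = build_terminal_pool(corpus)
    template = rng.choice(corpus)
    return [rng.choice(pool[tag]) for _, tag in template]

rng = random.Random(0)
print(" ".join(generate_sentence(TAGGED_CORPUS, rng)))
```

Because words are drawn from the pool with repetition, common words stay common in the output, while the sequence of tags always matches one of the source sentences.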

Re-sorting Pitchfork’s top albums of 2010-2014

The difference between the official and expected ranking of Pitchfork's top 100 albums of the decade so far.

Pitchfork just released their rankings for the best albums of the decade so far. As any longtime reader of Pitchfork would expect, favorites like Vampire Weekend and Kanye West won out. Surprisingly, several relatively unknown artists or lesser-known albums by famous artists sneaked onto the list, including Earl Sweatshirt’s debut Earl and Frank Ocean’s first mixtape Nostalgia, Ultra. Pitchfork has previously described its tendency to modify its editorial opinions in order to adjust to current trends in music, and so I was curious about the degree to which the assigned ranking matched an equivalent, “expected” ranking generated by comparing the numerical score that Pitchfork gave to each album at the time of its release. The above figure is a graph of the difference in ranking of the top 100 given by the “official” Pitchfork ranking, and a ranking generated by looking up the numerical score given to each album (in the list) upon its release and sorting the albums from highest to lowest score. The order of the vertical axis is the official Pitchfork ranking, from position 1 at the top to position 100 at the bottom. The bars indicate the difference in ranking for each album, which was generated by subtracting from the official Pitchfork ranking the expected ranking based on its numerical score after release. Large differences in the position on the list thus indicate Pitchfork’s relative opinion of the album changing substantially by the time the “official” top albums ranking was compiled.
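The calculation behind the bars can be sketched as follows. This is a plain-Python illustration rather than the pandas code I actually ran; it assumes the highest review score gets expected rank 1, breaks ties by official order, and uses four invented scores in place of the real top 100.

```python
def expected_ranks(scores_in_official_order):
    """Rank albums by review score, with the highest score receiving rank 1.

    The input is ordered by official ranking; ties keep that order
    because Python's sort is stable.
    """
    n = len(scores_in_official_order)
    order = sorted(range(n), key=lambda i: -scores_in_official_order[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def rank_differences(scores_in_official_order):
    """Official rank minus expected rank for each album."""
    expected = expected_ranks(scores_in_official_order)
    return [official - exp for official, exp in enumerate(expected, start=1)]

# Made-up review scores for the albums officially ranked 1 through 4.
scores = [9.0, 9.5, 8.2, 9.7]
diffs = rank_differences(scores)
mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
print(diffs, mean_abs_diff)  # [-2, 0, -1, 3] 1.5
```

The mean of the absolute differences gives a single number summarizing how far the official list strays from the score-sorted one.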

At least two of the albums that made the list, Earl Sweatshirt’s Earl and Jai Paul’s eponymous album, were so obscure at the time of their release that Pitchfork didn’t even assign them a score. In recognition of this fact, Pitchfork placed them both near the bottom of the top 100, and so their difference in ranking doesn’t seem that large on the graph. But the honorific inclusion of these two albums underscores a more general trend apparent in the Pitchfork list: an emphasis on contemporary R&B and hip-hop at the expense of electronica and downtempo. In a list sorted purely by numerical score, Beyonce’s Beyonce would not have placed as absurdly high as it does on the Pitchfork official ranking, nor would have Thundercat’s debut Apocalypse, which is the biggest winner in the rankings. These won out over albums like Reflektor or To Be Kind, which both showed relatively large drops relative to their expected positions on the list.

Pitchfork undoubtedly sees itself as a ratings site capable of setting the zeitgeist for a given decade, and so the emphasis on new artists and movements over indie staples like Arcade Fire or Swans suggests that the website sees the newer artists as representative of the next major movement in indie music. To this end, it’s worth noting that the most recent album declared “Best New Music” by Pitchfork before the creation of the ranking was FKA twigs’ outstanding LP1, which stands at a healthy position on the official list and which generally represents many of the stylistic frontiers of emerging indie music.

The relatively large change in Pitchfork’s opinion of albums is well-captured by a scatterplot of the numerical, review-based ranking versus the official ranking released by Pitchfork (shown below, concept originally suggested by reddit user gkyshr). Surprisingly, there seems to be barely any correlation between the two variables (the line y = x, corresponding to the case where Pitchfork’s released ranking coincides with the sorted ranking, is underlaid). This variation is captured by the mean of the absolute value of the differences reported in the bar chart, which came out to 20 (a surprisingly high value, given that the maximum change in ranking for a given album is 99). It’s almost as if Pitchfork deliberately attempted to make its rankings differ from expectations, with the only albums really falling on the line corresponding to very highly rated albums, like the number 1 album, My Beautiful Dark Twisted Fantasy:

Scatterplot of rankings

In order to make these plots, I made use of Python’s outstanding BeautifulSoup and pandas libraries.



Building a high power voltage multiplier


A simple high-voltage circuitry project is the Cockcroft-Walton voltage multiplier. I first created this as a demonstration for a class in high school, but I’ve altered it over the years in order to improve its performance. The nice thing about this project is that it can be created entirely using cheap, store-bought components (diodes and capacitors), and so it is relatively easy to ensure that it will work and perform at the voltage estimated. I got my components from West Florida Components, and the total cost of everything was under $10. A good tutorial that gives possible specifications for the components can be found on Instructables.

The total voltage drop across many capacitors in series is equal to the sum of the voltage drop across each component; this is a consequence of Kirchhoff’s circuit laws, and can be mentally visualized as a charge on one end of a capacitor displacing an equal but opposite charge on the opposite plate, which in turn displaces an opposite charge on any other capacitor plates to which it connects, and so on. The Cockcroft-Walton multiplier can be visualized as a fancy way of putting a bunch of capacitors in series and charging them so that they each have a voltage drop of 120 V, resulting in a total discharge voltage of 120 V times the number of capacitors. This output is roughly DC, and it has a much lower current than the device draws from the mains, consistent with energy conservation since power = (voltage)*(current). A simple diagram of the half-wave CW multiplier looks like this:

A circuit schematic for a half-wave Cockcroft-Walton voltage multiplier.

The manner by which the CW multiplier can charge each capacitor separately to 120 V is essentially by charging them in parallel and discharging them in series. The concept borrows from the design of a basic half-wave rectifier, which uses a diode and smoothing capacitor to convert the positive portions of AC sine waves to a smooth-ish DC current. The idea is that the first stage in the circuit (capacitor 1 and diode 1) converts the AC to an approximately constant DC signal, which then gets fed forward through diode 2 to the right plate of capacitor 2. During the first positive cycle, that capacitor charges to +120V. During the “off” cycle (the negative portion of the AC sine wave gets blocked by the first diode), the second capacitor discharges through diode 3 into capacitor 3 because, during the off cycle, there’s now -120V on the bottom plate of that capacitor, leading to a potential difference that allows charging. During the next “on” cycle, the current ignores capacitor 3 because it is fully charged (and so it essentially acts like a break in the circuit there), and so now capacitor 4 gets charged instead. During the next off cycle, capacitor 4 discharges through diode 4 to charge capacitor 5, and the cycle repeats itself until, after (number of capacitors)x(charging time) all the capacitors are charged.

There are several equivalent ways of visualizing what is going on in the CW circuit, but the key things to remember are that the capacitors store the charge (differential) and the diodes force the AC to feed forward and charge each capacitor in sequence. The charging time can be adjusted by adjusting the time constants for each capacitor in the circuit relative to the AC cycle frequency (60 Hz in the US).
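As a rough numerical companion to the series-capacitor picture, the sketch below works out the idealized output voltage and the largest output current that energy conservation would allow. It follows the simplified model used in this post (every capacitor charged to the same voltage, no losses or ripple), and the component count and input current are hypothetical values chosen for the example.

```python
def stacked_voltage(volts_per_capacitor, n_capacitors):
    """Idealized output: identical capacitors in series, each charged to the same voltage."""
    return volts_per_capacitor * n_capacitors

def max_output_current(input_volts, input_amps, output_volts):
    """Upper bound on output current from power = voltage * current (lossless case)."""
    return input_volts * input_amps / output_volts

# Hypothetical build: 10 capacitors at 120 V each, drawing 0.1 A from a 120 V supply.
v_out = stacked_voltage(120, 10)
i_out = max_output_current(120, 0.1, v_out)
print(v_out, "V at no more than", i_out, "A")
```

A real multiplier falls short of these numbers under load, so they are best read as upper limits.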

The device would build up charge twice as quickly if one instead used a full-wave design (which is analogous to a full-wave bridge rectifier), because it would then take advantage of the negative swing of the AC sine wave, which gets lopped off by the first diode in the half-wave version.

In the video above, I have added a switch and fuse for safety reasons (visible in the upper-left hand portion of the screen; I used a plastic lid as a base for the two components). In the first cut, the ~1 mm spark regularly produced by the device is visible. This spark can be used to continuously drive an 8 inch fluorescent tube (shown in the second section), but, curiously, the frequency of the pulses through the fluorescence in the tube depends on the proximity of other conducting objects; in the fourth clip, it is apparent that touching pliers to the glass reduces the frequency of the pulses, rather than increasing it as I would have expected. My best guess for the cause of this effect is charge build-up on the glass interior beneath the metal, leading to low-frequency discharge for the same reason that high-voltage capacitors decrease the sparking frequency in the primary circuit of a spark-gap Tesla coil. The last clip shows the device discharging through a 1 inch xenon flash tube salvaged from a disposable camera. The firing frequency is low due to the relatively large distance that the spark has to cover, despite the low breakdown voltage of xenon gas. In other tests, I’ve noticed that large spark gaps that require charge build-up over periods longer than the ~3-5 s for the flash tube will generally result in short circuits occurring upstream in the capacitors in the CW, which probably cause damage to the solder joints and possibly the capacitor ceramic due to dielectric breakdown.

Cockcroft-Walton generators have a special significance in the history of physics because one was used to generate the accelerating voltage in one of the earliest particle accelerators, work for which John Cockcroft and Ernest Walton shared the 1951 Nobel Prize in Physics. For this reason, one of the first large-scale CW multipliers (manufactured by Philips Co.) is prominently displayed in the National Science Museum in London:


A Cockcroft-Walton generator built in 1937 by Philips of Eindhoven. National Science Museum, London, England. Image from Wikimedia Commons.