Scaling behavior of popular reddit posts

A plot of the score versus hours since posting for 53 different posts in /r/pics over the span of one month.

I wrote a program that algorithmically monitors viral links in the subreddit /r/pics in order to see whether I can identify traits of posts that tend to become popular on the site. Users can give an “up” or “down” vote to a picture posted to the /r/pics subreddit, and the total “score” of a post is the number of up votes minus the number of down votes. The individual pages of reddit are essentially long lists of posted images sorted by a combination of their total score and how recently they were posted. This system ensures that users can easily view very popular, fresh material generated by other users, like funny pictures of pets or beautiful pictures of the sunset. A very popular image can make the “front page,” the default landing page that a user visiting reddit.com sees, and this can result in an image accruing millions of unique pageviews over just a few hours.
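For reference, here is a minimal sketch of this kind of monitoring loop, written against the modern PRAW interface (the credentials, post count, and polling interval are placeholders, not the values I actually used):

import time
import praw

# Placeholder credentials; a real script needs an app registered with reddit.
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="pics-tracker")

# Grab the IDs of the current hottest posts in /r/pics.
post_ids = [s.id for s in reddit.subreddit("pics").hot(limit=25)]

# Poll each post's score and comment count once per hour.
while True:
    for post_id in post_ids:
        s = reddit.submission(id=post_id)
        print(time.time(), s.id, s.score, s.num_comments)
    time.sleep(3600)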

In the image at the top of this post, I’ve plotted the score as a function of time since posting for 53 links posted during the month of July. The content of these image links is remarkably varied, ranging from a beautiful image of Dubai to an adorable three-legged cat. They all appear to display qualitatively similar behavior (rapid exponential growth followed by a plateau at a final total number of upvotes, resulting in an overall sigmoidal shape), but traits like the initial growth rate, the maximum growth rate, and the position of the asymptote vary widely. One aspect of this behavior I want to investigate further in the future is the relationship between these voting patterns and logarithmic random walks, which may reveal meaningful correlations between properties like the growth rate and the final score.
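Each trace can be summarized quantitatively by fitting a logistic (sigmoid) curve to it; here is a quick sketch with scipy, using synthetic data in place of a real score trace:

import numpy as np
from scipy.optimize import curve_fit

def logistic(t, k, r, t0):
    # Sigmoid with asymptote k, growth rate r, and midpoint t0.
    return k / (1.0 + np.exp(-r * (t - t0)))

# Synthetic stand-in for one post: hours since posting and noisy scores.
t = np.arange(0, 48, 0.5)
score = logistic(t, 3000, 0.4, 10) + np.random.normal(0, 50, t.size)

# p0 seeds the optimizer with rough guesses for (k, r, t0).
params, _ = curve_fit(logistic, t, score, p0=[score.max(), 0.1, 12])
print("asymptote=%.0f, rate=%.2f, midpoint=%.1f h" % tuple(params))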

The key feature of reddit that I was hoping to study by tracking posts is the degree to which reddit and its admins “fuzz,” or artificially alter, the scores of posts in order to ensure that viral posts are not subject to vote manipulation by bots or automated voting software. This practice allows the admins to diminish the reported score of popular posts in order to prevent users from gaming the website. For this reason, many “viral” posts that accumulate large numbers of upvotes (and subsequently make it to the front page of the website) will report a score of 1000-5000, even though their actual score may be an order of magnitude greater. This practice is clearly responsible for the asymptotes in the score versus time graph, but it also manifests as the dramatic drops in score observed in many of the most popular posts on the graph. Because the links were not all posted at the same time (I’ve shifted the data in order to display it as “time since posting”), the fact that the discontinuities line up at regular intervals in the graph indicates that these sudden drops occur at consistent times relative to the original posting date. I have not yet found a definitive explanation for why this occurs, but I would guess that it prevents posts that reach the front page extremely rapidly from staying there for days due to their high initial priority. This suggests that a user looking to maximize the final score of a post ought to submit links that are popular, but not so popular that they trigger this automatic vote culling.

A semilog plot of the number of comments versus hours since posting for 53 different posts in /r/pics over the course of one month.

Another interesting feature of reddit lies in the ability of users to comment on one another’s posts. I made a plot of comments versus time analogous to the score plot above, and the results are similarly sigmoidal for all posts, although the two time constants dictating the initial growth rate and the approach to saturation both appear to be much slower, presumably because reddit doesn’t “fuzz” comment counts. I used a semilog plot to display this data, since otherwise the field was too crowded to see the individual traces properly.
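The semilog display itself is a one-liner in matplotlib; a sketch using made-up comment traces in place of the real data:

import numpy as np
import matplotlib.pyplot as plt

# Made-up traces mapping hours since posting to cumulative comment counts.
hours = np.arange(1, 100)
traces = {"2bgjsv": 20 * np.log1p(hours), "2byz7g": 35 * np.log1p(hours)}

for post_id, comments in traces.items():
    plt.semilogy(hours, comments, label=post_id)  # log-scaled y axis

plt.xlabel("hours since posting")
plt.ylabel("number of comments")
plt.legend()
plt.show()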

A log-log plot of the number of comments versus the score of posts.

Another interesting property is the correlation between the score at a given timepoint and the number of comments: presumably more popular links generate new comments at a faster rate, leading to a superlinear correlation between the number of comments and the score. This behavior is confirmed in the log-log plot shown above, which appears linear for many of the posts, with a surprisingly narrow range of slopes. Since lines on a log-log plot correspond to power-law behavior, this plot suggests that there may be a universal power law (with a narrow range of critical exponents) dictating the growth of comments over time for popular posts. It’s worth noting that popular reddit posts appear to accrue comments indefinitely, even after their score has stabilized, so the above graph doesn’t fully capture the saturation region for comments over long timescales, in which comments increase while the score remains the same (captured by the “uptick” at high score in some of the trendlines).
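Since a power law of the form comments ~ A * score^alpha is a straight line in log-log space, the exponent for each post can be estimated with an ordinary least-squares fit; a sketch with numpy on placeholder data:

import numpy as np

# Placeholder observations for one post: score and comment count sampled
# at the same timepoints.
score = np.array([50, 200, 800, 2500, 4000])
comments = np.array([5, 30, 180, 900, 1600])

# The slope of the least-squares line in log-log space is the exponent.
alpha, log_A = np.polyfit(np.log(score), np.log(comments), 1)
print("estimated exponent: %.2f" % alpha)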

While it’s probably tedious to keep track of 53 different colors, here is a legend showing the post ID for all the posts depicted here. The original link can be generated by inserting the five-character string into the appropriate place in the reddit URL: http://www.reddit.com/r/pics/comments/2byz7g

This analysis was made possible using the PRAW and pandas packages for Python.

The post ID for each of the 53 posts in /r/pics shown in the previous figures. Individual posts can be viewed by inserting the post ID into a URL of the form: http://www.reddit.com/r/pics/comments/2bgjsv


Computer-generated prophecies of the apocalypse

Who said ever to no crown to was the listener and SECOND: my man, and to the wheat,
and had eat in this angel.

Their pieces kill the sort the angel come up to another translucent and weep any stone.
Her timeless will measure them to the day, hold created with earth noises and hurled every nation.

There shown out upon the voice
It be in seventh which is to trample, I.

This tampering opened not for its time.

The land to their moment Who threw their glory to cherish that art.

The glory to the speaking, and at that white appearance, and say given the thousand for the sake.
And said show in myself. And it of no sweet victory whose gateways enemies was loathe to the bowl
and it for them and worked out as my hast to every vision.

Their noise erase me.

I’ve been working on improving my context-free grammar parser for the automated generation of sentences. My current heuristic involves picking a sentence at random from a text, parsing it, and then swapping all terminal symbols in the sentence with other, equivalent symbols drawn randomly from the pool of all valid terminal symbols found in the text. This means that the structure of every sentence generated by my program will match the structure of at least one sentence in the text, but the specific words (terminal symbols) are entirely unconstrained aside from their syntactic function. I tried this approach on the 1957 translation of the Book of Revelation, and I ended up with the spooky (albeit occasionally ungrammatical) prophecies at the top of this post.
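For the curious, here is a simplified sketch of the swapping step. It uses part-of-speech tags in place of the full CFG parse that my program performs, so it only approximates the method described above, and the corpus filename is a placeholder:

import random
from collections import defaultdict

import nltk  # requires the punkt and averaged_perceptron_tagger data

def generate(text):
    # Pick a random template sentence and swap every word for a random
    # word with the same part-of-speech tag drawn from the whole text.
    sentences = nltk.sent_tokenize(text)
    pool = defaultdict(list)
    for sent in sentences:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            pool[tag].append(word)
    template = nltk.pos_tag(nltk.word_tokenize(random.choice(sentences)))
    return " ".join(random.choice(pool[tag]) for _, tag in template)

print(generate(open("revelation.txt").read()))  # placeholder corpus file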

Re-sorting Pitchfork’s top albums of 2010-2014

The difference between the official and expected ranking of Pitchfork's top 100 albums of the decade so far.

Pitchfork just released their rankings for the best albums of the decade so far. Unsurprisingly, hipster favorites like Vampire Weekend and Kanye West won out, but several relatively unknown artists or lesser-known albums by famous artists sneaked onto the list, including Earl Sweatshirt’s debut Earl and Frank Ocean’s first mixtape Nostalgia, Ultra. I’ve read before that Pitchfork tends to modify its editorial opinions to adjust to current trends in music, and so I was curious about the degree to which the assigned ranking matched an equivalent, “expected” ranking generated from the numerical score that Pitchfork gave each album at the time of its release.

The figure above graphs, for the top 100, the difference between the “official” Pitchfork ranking and an expected ranking generated by looking up the numerical score given to each album on the list upon its release and sorting the albums from highest to lowest score. The vertical axis follows the official Pitchfork ranking, from position 1 at the top to position 100 at the bottom. The bars indicate the difference in ranking for each album, computed by subtracting the expected ranking (based on its numerical score at release) from the official Pitchfork ranking. Large differences in position thus indicate that Pitchfork’s relative opinion of an album changed substantially by the time the “official” top albums ranking was compiled.
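The computation behind the bar chart is straightforward in pandas; here is a sketch with placeholder rows standing in for the scraped list:

import pandas as pd

# Placeholder rows: the official list position and the numerical score
# each album received when it was originally reviewed.
df = pd.DataFrame({
    "album": ["Album A", "Album B", "Album C"],
    "official_rank": [1, 2, 3],
    "review_score": [10.0, 8.0, 9.2],
})

# "Expected" rank: the best review score gets rank 1, and so on down.
df["expected_rank"] = df["review_score"].rank(ascending=False).astype(int)

# Negative values mean the official list placed the album higher (at a
# smaller rank number) than its review score alone would predict.
df["difference"] = df["official_rank"] - df["expected_rank"]
print(df)
print("mean absolute difference:", df["difference"].abs().mean())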

At least two of the albums that made the list, Earl Sweatshirt’s Earl and Jai Paul’s eponymous album, were so obscure at the time of their release that Pitchfork didn’t even score them. In recognition of this fact, Pitchfork placed them both near the bottom of the top 100, and so their difference in ranking doesn’t look that large on the graph. But the honorific inclusion of these two albums underscores a more general trend apparent in the Pitchfork list: an emphasis on contemporary R&B and hip-hop at the expense of electronica and downtempo. In a list sorted purely by numerical score, Beyonce’s Beyonce would not have placed as absurdly high as it does on the official Pitchfork ranking, nor would Thundercat’s debut Apocalypse, the biggest winner in the rankings. These won out over albums like Reflektor or To Be Kind, both of which showed relatively large drops relative to their expected positions on the list.

Pitchfork undoubtedly sees itself as a ratings site capable of setting the zeitgeist for a given decade, and so the emphasis on new artists and movements over indie staples like Arcade Fire or Swans suggests that the website sees the newer artists as representative of the next major movement in indie music. To this end, it’s worth noting that the most recent album declared “Best New Music” by Pitchfork before the creation of the ranking was FKA Twig’s outstanding LP1, which stands at a healthy position on the official list and which generally represents many of the stylistic frontiers of emerging indie music.

The relatively large change in Pitchfork’s opinion of albums is well captured by a scatterplot of the numerical, review-based ranking versus the official ranking released by Pitchfork (shown below; concept originally suggested by reddit user gkyshr). Surprisingly, there seems to be barely any correlation between the two variables (the line y = x, corresponding to the case where Pitchfork’s released ranking coincides with the sorted ranking, is underlaid). This variation is captured by the mean of the absolute value of the differences reported in the bar chart, which came out to 20 (a surprisingly high value, given that the maximum possible change in ranking for a given album is 99). It’s almost as if Pitchfork deliberately attempted to make its rankings differ from expectations, with the only albums really falling on the line being very highly rated albums, like the number 1 album, My Beautiful Dark Twisted Fantasy:

Scatterplot of rankings

In order to make these plots, I made use of Python’s outstanding BeautifulSoup and pandas libraries.

 


Computer tricks I use for research

My research involves a combination of coding, bench work, and pencil-and-paper equation solving, and there are a handful of programs and tricks that I find myself constantly using to make my life easier when storing and manipulating data:

1. Mathematica + Dropbox. One of the more annoying features of Mathematica is its lack of a simple “undo” feature for typing errors: if you accidentally delete a cell containing an important code snippet, it is pretty much gone forever. A really cool workaround I found involves storing all of my Mathematica notebooks in a Dropbox folder, which gives them automatic versioning. I’ve inserted the snippet SetOptions[SelectedNotebook[], NotebookAutoSave -> True] into the preamble of all of my notebooks (that is, the evaluation block that I always run before I begin working on the notebook, which does things like set the colors of plots and control the scope of new variables). This tells Mathematica to perform a save every time I evaluate some portion of my code, which Dropbox in turn sees as an opportunity to create a new version of the notebook online. I found this method much simpler than the options for integrating Mathematica with GitHub (since it really just consists of remembering to dump all of my notebooks into Dropbox), and so I use this trick almost every day.

2. NameChanger. There are a number of ways to change the names of a large batch of files using native utilities on OSX: you can create a smart folder, write a Python script, or use Automator. NameChanger is a simple, free program that allows you to change the names of hundreds of files using regular expressions and a GUI. (For comparison, a minimal scripted version of this task is sketched at the end of this list.)

3. Caffeine. This is a great utility for Keynote presentations and running long SSH sessions. All that this program does is prevent OSX from going to sleep for some user-specified amount of time.

4. Illustrator+Photoshop files in Keynote+InDesign. Whenever I’m designing a talk or a poster, I make all of my figures and diagrams using Illustrator and Photoshop. It turns out that you can embed links to these files in Keynote presentations or InDesign files (I use InDesign for all my posters), and if you update the original source files, the host files can automatically update the poster or talk to show the newest figures. This is very helpful when your supervisor reminds you that you forgot scale bars on all your plots, since after modifying the images you won’t have to re-export all your figures and then re-import them into your presentation or poster.

5. IFTTT (if this then that). This is a high-level web app and smartphone app that lets me pass data and actions between common websites, like most major social networks, my Google Docs account, and my email. The design is pretty similar to Yahoo Pipes, in that it doesn’t serve one specific purpose so much as it allows people to perform a lot of actions that would otherwise require programming web scrapers. For example, I use a script (the site calls them “recipes”) that records my GPS coordinates to a Google document every time I send an email from my smartphone; this data is part of an ongoing machine learning project I’ve been working on. There are also scripts for a lot of common tasks, like transferring all tagged Facebook photos to a Dropbox folder, or mirroring flickr collections in a 500px account. I haven’t yet exhausted the range of functions accessible with this site, but I can see it coming in handy for little tasks that would otherwise require a custom Python script that accesses the APIs of two different web applications. IFTTT has essentially made a high-level wrapper for these APIs, allowing simple data transfers and operations to be performed without any programming.

6. IPython Notebooks. These are what convinced me to switch to Python from MATLAB. The killer feature of the cell interface is that it allows you to run or re-run only portions of a script, making it competitive with MATLAB’s own cell-mode interface for scripting. But unlike in MATLAB, importing new functions and packages is really painless. However, I still prefer Mathematica notebooks for symbolic math over IPython+SymPy, since the former has special syntax for typing symbolic expressions that formats things like fractions and square roots as you type, making it easier to check that you are entering a formula correctly. (A small SymPy example is sketched below.)
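As promised in item 2, here is a minimal sketch of the scripted alternative to NameChanger, using only the Python standard library (the directory and renaming pattern are placeholders):

import re
from pathlib import Path

# Placeholder rule: "01 plot.png" -> "fig_01_plot.png" for every .png
# in the target directory whose name starts with a run of digits.
directory = Path("~/data/figures").expanduser()
pattern = re.compile(r"^(\d+)\s+(.*)\.png$")

for path in directory.iterdir():
    match = pattern.match(path.name)
    if match:
        new_name = "fig_%s_%s.png" % (match.group(1), match.group(2))
        path.rename(path.with_name(new_name))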
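And for item 6, a small example of the SymPy workflow: integrate an arbitrary expression, then verify the result by differentiating it back.

import sympy as sp

x = sp.symbols("x")

# An arbitrary expression: integrate, then check by differentiating back.
expr = x**2 * sp.exp(-x)
F = sp.integrate(expr, x)
assert sp.simplify(sp.diff(F, x) - expr) == 0

sp.pprint(F)  # SymPy's text pretty-printer; Mathematica formats as you type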