Scaling behavior of popular reddit posts

A plot of the score versus hours since posting for 53 different posts in /r/pics over the span of one month.

I wrote a program that algorithmically monitors viral links in the subreddit /r/pics in order to see whether I can identify traits of posts that tend to become popular on the site. Users can give a thumbs up or thumbs down vote to a picture posted on the /r/pics subreddit, and the total “score” of a post is given by the number of “up” votes minus the number of “down” votes. The individual pages of reddit look essentially like long lists showing various posted images sorted by a combination of their total score and how recently they were posted. This system ensures that users can easily view very popular, fresh material generated by other users, like funny pictures of pets or beautiful pictures of the sunset. A very popular image can make the “front page,” the default landing page that a user visiting reddit.com will see, and this can result in an image accruing millions of unique pageviews over just a few hours.
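
As a rough illustration of the data-collection step (not my exact script), a polling loop using PRAW and pandas might look like the following. The credentials, post IDs, and polling interval are placeholders, and the calls assume a recent version of PRAW; older releases exposed a different interface.

```python
import time
import praw
import pandas as pd

# Placeholder credentials; recent PRAW versions require a registered app.
reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="score-tracker by u/your_username")

post_ids = ["2bgjsv", "2byz7g"]        # IDs of the posts being tracked
records = []

for _ in range(48):                    # poll once an hour for two days
    now = time.time()
    for pid in post_ids:
        post = reddit.submission(id=pid)
        records.append({
            "id": pid,
            "hours_since_post": (now - post.created_utc) / 3600.0,
            "score": post.score,
            "num_comments": post.num_comments,
        })
    time.sleep(3600)

scores = pd.DataFrame(records)         # one row per (post, timepoint)
```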

In the image at the top of this post, I’ve plotted the score as a function of time since posting for 53 links posted during the month of July. The content of these image links is remarkably varied, ranging from a beautiful image of Dubai to an adorable three-legged cat. They all appear to display qualitatively similar behavior—rapid exponential growth followed by a plateau at a final total number of upvotes, resulting in an overall sigmoidal shape—but traits such as the initial growth rate, the maximum growth rate, and the height of the final plateau vary widely. One aspect of this behavior I want to investigate further in the future is the relationship between these voting patterns and logarithmic random walks, which may reveal meaningful correlations between properties like the growth rate and the final score.
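
To put rough numbers on those traits, one option (purely illustrative, and not how the figure above was generated) is to fit each trace to a logistic curve; the variable names below assume the DataFrame from the earlier sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, k, r, t0):
    """Sigmoid with final plateau k, growth rate r, and midpoint t0."""
    return k / (1.0 + np.exp(-r * (t - t0)))

one_post = scores[scores["id"] == "2bgjsv"]          # hypothetical post ID
t = one_post["hours_since_post"].values
y = one_post["score"].values

(k, r, t0), _ = curve_fit(logistic, t, y, p0=[y.max(), 1.0, t.mean()])
print(f"plateau ~{k:.0f} points, growth rate ~{r:.2f} per hour")
```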

The key feature of reddit that I was hoping to study by tracking posts is the degree to which reddit and its admins “fuzz,” or artificially interfere with, the score of posts in order to ensure that viral posts are not subject to vote manipulation by computer programs or automated voting software. This practice allows the admins to alter (or diminish) the reported score of popular posts in order to prevent users from manipulating the website. For this reason, many “viral” posts that accumulate large numbers of upvotes (and which subsequently make it to the front page of the website) will report a score of 1000-5000, even though their actual score may be an order of magnitude greater than that. This practice is clearly responsible for the asymptotes in the score versus time graph, but it also manifests as the dramatic drops in score observed in many of the most popular posts on the graph. The links were not all posted at the same time—I’ve shifted the data in order to display it as “time since posting”—so the fact that the discontinuities line up at regular intervals in the graph suggests that these sudden drops occur at fixed times relative to the original posting date. I have not yet been able to find a definitive explanation for why this occurs, but I would guess that it serves to prevent posts that reach the front page extremely rapidly from staying there for days due to their high initial ranking. This would suggest that a user who is looking to maximize the final score of their posts ought to post links that are popular, but not so popular that they activate this automatic vote culling.

A semilog plot of the number of comments versus hours since posting for 53 different posts in /r/pics over the course of one month.

Another interesting feature of reddit lies in the ability of users to comment on one another’s posts. I made a plot similar to the score-versus-time plot above, and the results are similarly sigmoidal for all posts, although the two time constants dictating the initial growth rate and the approach to saturation both appear to be much longer, presumably because reddit doesn’t “fuzz” comments. I used a semilog plot to display this data, since otherwise the field was too crowded to see the individual traces properly.

A log-log plot of the number of comments versus the score of posts.

Another interesting property is the correlation between the score at a given timepoint and the number of comments—presumably more popular links generate new comments at a faster rate, leading to a superlinear correlation between the number of comments and the score. This behavior is confirmed in the log-log plot shown above, which appears linear for many of the posts, with a surprisingly narrow range of slopes. Since straight lines on a log-log plot correspond to power-law behavior, this plot suggests that there may be a universal power law (with a narrow range of critical exponents) that relates the number of comments to the score of popular posts. It’s worth noting that popular reddit posts appear to accrue comments indefinitely, even after their score has stabilized, so the above graph doesn’t fully capture the saturation region for comments over long timescales (in which comments increase but the score remains the same, visible as the “uptick” at high score for some of the trendlines).
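
A quick way to estimate that exponent for a single post (again just a sketch, reusing the hypothetical one_post slice from the earlier fitting example) is a straight-line fit in log-log space:

```python
import numpy as np

# Paired score/comment samples for one post; zeros are dropped so the
# logarithm is defined.
score = one_post["score"].values
comments = one_post["num_comments"].values
mask = (score > 0) & (comments > 0)

slope, intercept = np.polyfit(np.log10(score[mask]),
                              np.log10(comments[mask]), 1)
# comments ~ 10**intercept * score**slope, so slope is the apparent exponent
print(f"estimated power-law exponent: {slope:.2f}")
```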

While it’s probably tedious to keep track of 53 different colors, here is a legend showing the post ID for all of the posts depicted here. The original link can be generated by inserting the 5-character string into the appropriate place in the reddit URL: http://www.reddit.com/r/pics/comments/2byz7g

This analysis was made possible using the PRAW and pandas packages for Python.

The post ID for each of the 53 posts in /r/pics shown in the previous figures. Individual posts can be viewed by inserting the post ID into a URL of the form: http://www.reddit.com/r/pics/comments/2bgjsv

Algorithmic interactions on social networks

I recently built a simple program that attempts to impersonate users of an online forum using a combination of a Markov model and a context-free grammar.

Recently, I’ve been using Python to scrape data from the content-aggregation website Reddit, where users can submit links and vote on each other’s submissions. Links that generate large numbers of upvotes in the community may eventually make it to the front page of Reddit, where they will be viewed by millions of casual browsers as they first enter the site. Reddit’s voting algorithm and its method for determining rankings are closely guarded in order to prevent marketing companies from spamming the front page with links.

A really great Python utility for working with Reddit data is PRAW, which provides an interface between Reddit’s API and Python. The module not only allows easy scraping of information like the popularity and content of top articles, the number of upvotes, and the number of comments; it also simplifies the creation of “bots,” or automated users in the Reddit community that comment on posts. The function of bots ranges from practical—one bot posts a text-mined summary of every Wikipedia article that makes it to the front page, while another posts metric conversions of every reddit headline containing imperial units—to whimsical: one bot periodically scans comment threads for rhyming words and adds a comment demanding that users cease deploying puns.
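
As a minimal example of the kind of scraping PRAW makes easy (a sketch against a recent version of the library, with placeholder credentials; older PRAW releases used a different interface), here is one way to pull the comment text from the current front-page posts, which is exactly the sort of corpus a comment bot needs:

```python
import praw

# Placeholder credentials; newer PRAW versions require a registered app.
reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="comment-scraper by u/your_username")

corpus = []
for submission in reddit.front.hot(limit=25):       # current front-page posts
    submission.comments.replace_more(limit=0)       # drop "load more comments" stubs
    for comment in submission.comments.list():      # flatten the comment tree
        corpus.append(comment.body)
```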

For my bot, I wanted to create a user who could convincingly act like a normal “human” user. My immediate idea was to scrape comments on an article, create a word-as-token Markov model, and then automatically post a comment containing a fixed-length output generated by the model. Markov text generators are an incredibly common opening exercise in introductory computer science courses—they are relatively easy to prepare, yet can produce surprisingly realistic (and occasionally humorous) output if trained on the right corpus of text, like this sample output from a 3-word-token Markov model I trained using Mary Shelley’s Frankenstein; or, The Modern Prometheus:

This towering above all, thy mountains, obscured in darkness and cast my eyes were shut to the conversations he held with his parents ever since our infancy. We were in my enterprise. I have confessed myself guilty of a peculiar and overpowering nature ; nor do the floating sheets of ice , which was one which attracted my attention suddenly grew despicable. By one of your sentiments of this wretched mockery of a different conclusion

Because the training text is relatively small compared to the order of the model, a lot of the phrases, like “his parents ever since our infancy” or “floating sheets of ice,” are complete phrases taken directly from the novel. Nonetheless, randomly picking three-word phrases based on their frequency in the corpus captures the style of the novel remarkably well, despite there being no explicit specification of sentence structure when the code is trained on the corpus. If I reduce the order of the model to 2, I get output text that’s significantly more “original,” but also marred by more grammatical and stylistic errors (a bare-bones sketch of this kind of generator follows the sample below):

This apprehensions as he had hitherto been present to your health rendered , no disaster is murdered , on your duty towards the most beautiful that you could I opened my, I, the reflections determined thenceforth to a direction, and my appetite. One by Elizabeth. She was not unfolded to recollect what has been adduced against me was free last night; such a fiend can not describe their native town of great crime , in death shall be the beginning of food or take their inquiries clear conception of the dashing waves continually renewed violence.
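
For reference, a bare-bones word-token Markov generator of order n might look like the following. This is a sketch rather than my exact code, and the corpus path is a placeholder.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, length=60):
    """Random-walk through the model, starting from a randomly chosen state."""
    state = random.choice(list(model.keys()))
    output = list(state)
    for _ in range(length):
        followers = model.get(state)
        if not followers:                         # dead end: jump to a new state
            state = random.choice(list(model.keys()))
            followers = model[state]
        output.append(random.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)

with open("frankenstein.txt") as f:               # hypothetical corpus path
    model = build_model(f.read(), order=3)
print(generate(model))
```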

I originally intended to have my bot write pseudoscience or new-age philosophy, which would convincingly allow other users to overlook its frequent lapses in grammar as kooky derailment. I quickly learned, however, that the output of a Markov model is distinctive enough that other, tech-savvy users could infer that the comments were algorithmic. Looking to refine my approach, I instead investigated the approach used by SCIgen, a well-known automatic scientific paper generator that gained attention in 2005 when its authors successfully submitted SCIgen papers with titles like “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” to several less-reputable journals and conferences. Contrary to what I expected, SCIgen does not use an augmented Markov model, but rather a context-free grammar, a token-based approach to generating text that takes into account the natural hierarchies in sentence structure first described by Chomsky. The method is outlined in much better detail elsewhere, but the essential concept is that sentences contain a natural hierarchy of information that motivates their representation as a tree data structure. For example, the sentence “The dog who was dirty ate the slippers” can be split first into two parts—the subject clause about the dirty dog, and the verb clause about his crime—and each of these parts can be further subdivided into adjectival clauses, predicates, and finally individual words like articles, nouns, and adjectives. In a context-free grammar, the non-terminal nodes of the tree (clauses, etc.) are governed by production rules (verb clause -> verb + article + direct object) that state the next levels of decomposition, each of which has its own production rule (article -> “the” OR “a” OR “an”). A more advanced grammar tree (for a Victorian sentence from “Frankenstein”) looks like this:

A context-free grammar for a sentence from Frankenstein, parsed using the NLTK module for Python.
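
To make the production-rule notation concrete, here is a toy grammar, written out in NLTK’s notation, that covers the dog-and-slippers example above. The symbols are my own simplified choices, not the tag set used for the figure.

```python
import nltk

# A toy grammar that covers "the dog who was dirty ate the slippers".
toy_grammar = nltk.CFG.fromstring("""
    S    -> NP VP
    NP   -> Det N | Det N RelC
    RelC -> 'who' VP
    VP   -> V Adj | V NP
    Det  -> 'the' | 'a'
    N    -> 'dog' | 'slippers'
    V    -> 'was' | 'ate'
    Adj  -> 'dirty'
""")

# Verify that the grammar actually parses the example sentence.
parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("the dog who was dirty ate the slippers".split()):
    print(tree)
```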

A CFG text generator would start with the start symbol (‘S’) and then move down the tree from left to right, outputting a terminal symbol each time it sees one and then moving back up to the lowest incomplete branch. In order to get truly random text, the CFG is trained with many sentences, so the generator will have multiple possible options for each symbol—after it expands S, it could choose to move to a sentence that looks like NP VP (like the one above), or one that looks like NN PN (subject noun – predicate noun, like “the man is a dog”). After it randomly makes that decision, it then has many options for each of the subsequent nonterminal nodes, and it finally has a choice of many possible terminal symbols (once it reaches an NN, it can pick from any of the NNs used in the training set, since they are syntactically equivalent).
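
That descent can be written in a few lines; the sketch below uses NLTK’s grammar objects and the toy grammar defined earlier, with a uniformly random choice of production standing in for anything learned from training data.

```python
import random
from nltk.grammar import Nonterminal

def generate_random(grammar, symbol=None):
    """Expand `symbol` depth-first, left to right, picking productions at random."""
    if symbol is None:
        symbol = grammar.start()
    if not isinstance(symbol, Nonterminal):
        return [symbol]                            # terminal symbol: emit the word
    production = random.choice(grammar.productions(lhs=symbol))
    words = []
    for child in production.rhs():
        words.extend(generate_random(grammar, child))
    return words

print(" ".join(generate_random(toy_grammar)))
```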

One noteworthy detail of the sentence shown in the figure is the presence of repeated units, like noun phrases (NP) that link to other noun phrases, which capture the fractal nature of language constructions in which sentence-like clauses are used to modify individual components of a sentence. For this reason, when generating text from a CFG, it is very easy to implement recursion, but it’s also very easy for that recursion to get stuck in an infinite loop, where a lower symbol links to a higher symbol, resulting in repetition:

The professor hated the man who hated the professor who hated the man who hated the…

The sentence is grammatical but undesirable. For large grammars, the incidence of long sentences seems to increase, since the number of possible interlinking loops and knots in the grammar tree becomes large.
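
One blunt workaround (a sketch, not a principled fix) is to cap the recursion depth and abandon branches that run too deep, at the cost of occasionally truncating a clause:

```python
import random
from nltk.grammar import Nonterminal

def generate_bounded(grammar, symbol=None, depth=12):
    """Like generate_random above, but gives up on branches deeper than `depth`."""
    if symbol is None:
        symbol = grammar.start()
    if not isinstance(symbol, Nonterminal):
        return [symbol]
    if depth <= 0:
        return []                                  # abandon runaway recursion
    production = random.choice(grammar.productions(lhs=symbol))
    words = []
    for child in production.rhs():
        words.extend(generate_bounded(grammar, child, depth - 1))
    return words
```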

In order to use a CFG to generate grammatically-correct gibberish, I made use of the famous NLTK module for Python, which contains many tools for processing and cleaning natural language data sets, as well as text in which human experts have tagged individual words as nouns, verbs, etc. This makes it possible to use pre-built functions in the module to tag individual words in a user-chosen sample text based on their similarity to words in the text tagged by the experts, while simultaneously identifying the overall relations among words in a data set in order to identify non-terminal symbols like clauses. In my particular code, I scan the text for all terminal symbols first, which is a comparatively fast operation. I then pick sentences at random and parse their full grammar (which can take up to a minute), but keep only their nonterminal symbols (so I discard individual words). I then concatenate the word rules with the nonterminal production rules, resulting in a grammar with the full vocabulary of my corpus, but only a subset of its grammatical structures. Since a given part of speech has many possible terminal expansions (e.g., noun -> ‘Frankenstein’ or ‘professor’ or ‘night’ or any other noun in the entire book), the generated text is structured yet completely random in its choice of the specific word used to fulfill a given function in the sentence. Restricting the nonterminal grammar rules also allows me to monitor whether a specific sentence structure tends to cause infinite recursion or other problems; a rough sketch of the terminal-rule step appears after the samples below. Running this model with the Frankenstein corpus resulted in occasionally eerie text:

All delight not, spoke for tears unknown and conceived. I and a mankind , and in the place , close wonder paid to saw passed this behaviour. My beautiful not hired inquietude of immeasurable serenity these beautiful care, my variable indeed enslaved sorrow this fellow, that remained eyes with this willow endeavour in the courage of the truth before interested daylight.

The past innocence not, was I of feeble horror. A invader near a white slave which a loss answered with this truth: Man broken in the considerable atmosphere. Misery. her brother, my remorse, the world.
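
For reference, here is a rough sketch of the fast terminal-rule step mentioned above, using NLTK’s tokenizer and part-of-speech tagger. The corpus path is a placeholder, and the slow full-sentence parse that supplies the nonterminal rules is a separate, parser-dependent step not shown here.

```python
import nltk
from nltk.grammar import Nonterminal, Production

# May require nltk.download(...) calls first to fetch the tokenizer and tagger models.
with open("frankenstein.txt") as f:                # hypothetical corpus path
    tokens = nltk.word_tokenize(f.read())

# Turn each (word, tag) pair into a terminal production rule such as NN -> 'professor'.
terminal_rules = set()
for word, tag in nltk.pos_tag(tokens):
    terminal_rules.add(Production(Nonterminal(tag), (word.lower(),)))

# These word rules are later concatenated with the nonterminal production rules
# harvested from a handful of fully parsed sentences.
print(len(terminal_rules), "distinct terminal rules")
```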

The text may be less convincing than that generated by the 3-gram Markov model, but that text borrowed entire phrases from the source text whereas this model only borrows specific sentence structures: the words used to fulfill the various functions in the sentence, such as the nominative or verb, are chosen entirely at random from all nouns and verbs found in the source text, making it pretty unlikely that a specific chain of multiple distinct words in the corpus, like “floating sheets of ice,” would recur here. Thus while the 3-gram Markov text might better fool a human reader (who doesn’t have the novel memorized), the context-free grammar model is more likely to fool a computer program that detects plagiarism by searching for phrases in an online database.

Because of the lyrical quality of the Frankenstein CFG output, I tried to post a few (re-formatted) outputs in the poetry criticism subreddit, /r/OCpoetry. The verses received generally positive feedback until one user noticed that my account name and post history suggested that the text was random gibberish. In retrospect, this wasn’t a particularly surprising outcome, given the unusual training corpus and the need for the bot to fool human users, and so in future work I may try to restrict the bot to more specific types of comments or posts with formulaic structures. However, for the doodles or occasional flashes of brilliant nonsense that I come across while training the bot, I created a subreddit /r/markovpoetry where my bots (and other users) can post amusing grammar generated by their text-generation programs.

The full code for this project, along with sample corpora, can be found on my GitHub page.