I recently built a simple program that attempts to impersonate users of an online forum using a combination of a Markov model and a context-free grammar.
Recently, I’ve been using Python to scrape data from the content-aggregation website Reddit, at which users can submit links and vote on each other’s submissions. Links that generate large numbers of upvotes in the community may eventually make it to the front page of Reddit, where they will be viewed by millions of individual casual browsers as they first enter the site. Reddit’s voting algorithm and method for determining the ranking is closely guarded in order to prevent marketing companies from spamming the front page with links.
A really great Python utility for working with Reddit data is PRAW, which provides an interface between Reddit’s API and Python. The module not only allows easy scraping of information like the popularity and content top articles, the number of upvotes, and the number of comments; it also simplifies the creation of “bots,” or automated users in the Reddit community that comment on posts. The function of bots ranges from practical—one bot posts a text-mined summary of every Wikipedia article that makes it to the front page, while another posts metric conversions of every reddit headline containing imperial units—to whimsical: one bot periodically scans comment threads for rhyming words, and adds a comment to the thread demanding users cease deploying puns.
For my bot, I wanted to create a user who could convincingly act like a normal “human” user. My immediate idea was to scrape comments on an article, create a word-as-token Markov model, and then automatically post a comment containing a fixed-length output generated by the model. Markov text generators are an incredibly common opening exercise in introductory computer science courses—they are relatively easy to prepare, yet can produce surprisingly realistic (and occasionally humorous) output if trained on the right corpus of text, like this sample output from a 3-word-token Markov model I trained using Mary Shelley’s Frankenstein; or, The Modern Prometheus:
This towering above all, thy mountains, obscured in darkness and cast my eyes were shut to the conversations he held with his parents ever since our infancy. We were in my enterprise. I have confessed myself guilty of a peculiar and overpowering nature ; nor do the floating sheets of ice , which was one which attracted my attention suddenly grew despicable. By one of your sentiments of this wretched mockery of a different conclusion
Because of the relatively small size of the training text in comparison to the token size, a lot of the phrases, like “his parents ever since our infancy” or “floating sheets of ice,” are complete phrases taken directly from the novel. Nonetheless, randomly picking three word phrases based on their frequency in the corpus captures the style of the novel remarkably well, despite there being no explicit specification of sentence structure when the code is trained on the corpus. If I reduce the order of the model to 2, I get an output text that’s significantly more “original,” but also marred with more grammatical and stylistic errors:
This apprehensions as he had hitherto been present to your health rendered , no disaster is murdered , on your duty towards the most beautiful that you could I opened my, I, the reflections determined thenceforth to a direction, and my appetite. One by Elizabeth. She was not unfolded to recollect what has been adduced against me was free last night; such a fiend can not describe their native town of great crime , in death shall be the beginning of food or take their inquiries clear conception of the dashing waves continually renewed violence.
I originally intended to have my bot write pseudoscience or new-age philosophy, which would convincingly allow other users to overlook its frequent lapses in grammar as kooky derailment. I quickly learned, however, that the output of a Markov model is distinctive enough that other, tech-savvy users could infer that the comments were algorithmic. Looking to refine my approach, I instead investigated to approach used by SCIgen, a well-known automatic scientific paper generated that gained attention in 2005 when its authors successfully submitted SCIgen papers with titles like “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” to several less-reputable journals and conferences. Contrary to what I expected, SCIgen does not use an augmented Markov model, but rather context-free grammar, a token-based approach to generating text that takes into account the natural hierarchies observed in sentence structure first described by Chomsky. The method is outlined in much better detail elsewhere, but the essential concept is that sentences contain a natural hierarchy of information that motivates their representation as a tree data structure. For example, the sentence, “The dog who was dirty ate the slippers” can be split first into two parts—the subject cause about the dirty dog, and the verb clause about his crime—and each of these parts can be further subdivided into adjectival clauses, predicates, and finally individual words like articles, nouns, and adjectives. In a context-free grammar, the non-terminal nodes of the tree (clauses, etc) are governed by production rules (verb clause -> verb + article + direct object) that state the next levels of decomposition, each of which has its own production rule (article -> “the” OR “a” OR “an”). A more advanced grammar tree (for a Victorian sentence from “Frankenstein”) looks like this:
A CFG text generator would start with the start symbol (‘S’) and then move down the tree from left to right, outputting a terminal symbol each time it sees one and then moving back up the the lowest incomplete branch. In order to get truly random text, the CFG is trained with many sentences, and so the parser will have multiple possible options for each symbol—after it parses S, it could choose to move to a sentence that looks like NP VP (like the one above), or one that looks like NN PN (subject noun – predicate noun, like “the man is a dog”). After it randomly makes that decision, it then has many options for each of the subsequent nonterminal nodes, and then it finally has a choice of many possible terminal symbols (once it reaches an NN, it can pick from any of the NN used in the training set, since they are syntactically equivalent).
One noteworthy detail of the sentence shown in the figure is the presence of repeated units, like noun phrases (NP) that link to other noun phrases, which capture the fractal nature of language constructions in which sentence-like clauses are used to modify individual components of a sentence. For this reason, when generating text from a CFG, it is very easy to implement recursion, but it’s also very easy for that recursion to get stuck in an infinite loop, where a lower symbol links to a higher symbol, resulting in repetition:
The professor hated the man who hated the professor who hated the man who hated the…
The sentence is grammatical but undesirable. For large grammars, the incidence of long sentences seems to increase, since the number of possible interlinking loops and knots in the grammar tree becomes large.
In order to use a CFG to generate grammatically-correct gibberish, I made use of the famous NLTK module for Python, which contains many tools for processing and cleaning natural language data sets, as well as text in which human experts have tagged individual words as nouns, verbs, etc. This makes it possible to use pre-built functions in the module to tag individual words in a user-chosen sample text based on their similarity to words in the text used by the experts, while simultaneously identifying the overall relations among words in a data set in order to identify non-terminal symbols like clauses. In my particular code, I scan a text for all terminal symbols first, which is a compratively fast operation. I then pick sentences at random and parse their full grammar (which can take up to a minute), but then keep only their nonterminal symbols (so I discard individual words). I then concatenate the word rules with the nonterminal production rules, resulting in a grammar with the full vocabulary of my corpus, but only a subset of its grammatical structures. Since there are multiple possible terminal symbols for a given word (ie, noun -> ‘Frankenstein’ or ‘professor’ or ‘night’ or [any other noun in the entire book]), the generated text is structured yet completely random in its choice of specific word to fulfill a given function in the sentence. But restricting the nonterminal grammar rules also allows me to monitor whether a specific sentence structure tends to cause infinite recursion or other problems. Running this model with the Frankenstein corpus resulted in occasionally eerie text:
All delight not, spoke for tears unknown and conceived. I and a mankind , and in the place , close wonder paid to saw passed this behaviour. My beautiful not hired inquietude of immeasurable serenity these beautiful care, my variable indeed enslaved sorrow this fellow, that remained eyes with this willow endeavour in the courage of the truth before interested daylight.
The past innocence not, was I of feeble horror. A invader near a white slave which a loss answered with this truth: Man broken in the considerable atmosphere. Misery. her brother, my remorse, the world.
The text may be less convincing than that generated by the 3-gram Markov model, but that text borrowed entire phrases from the source text whereas this model only borrows specific sentence structures: the words used to fulfill the various functions in the sentence, such as the nominative or verb, are chosen entirely at random from all nouns and verbs found in the source text, making it pretty unlikely that a specific chain of multiple distinct words in the corpus, like “floating sheets of ice,” would recur here. Thus while the 3-gram Markov text might better fool a human reader (who doesn’t have the novel memorized), the context-free grammar model is more likely to fool a computer program that detects plagiarism by searching for phrases in an online database.
Because of the lyrical quality of the Frankenstein CFG output, I tried to post a few (re-formatted) outputs in the poetry criticism subreddit, /r/OCpoetry. The verses received generally positive feedback until one user noticed that my account name and post history suggested that the text was random gibberish. In retrospect, this wasn’t a particularly surprising outcome, given the unusual training corpus and the need for the bot to fool human users, and so in future work I may try to restrict the bot to more specific types of comments or posts with formulaic structures. However, for the doodles or occasional flashes of brilliant nonsense that I come across while training the bot, I created a subreddit /r/markovpoetry where my bots (and other users) can post amusing grammar generated by their text-generation programs.
The full code for this project, along with sample corpora, can be found on my GitHub page.