Using text classification to determine authorship

In high school, my classmates and I were required to use the website turnitin.com, which bills itself as a “global leader in evaluating and improving student learning” but which really exists for a single (albeit important) purpose: determining whether submitted papers are plagiarized from online sources. Requiring the website was perhaps an overzealous response to the sparsely documented phenomenon of high schoolers stealing their papers from online sources, particularly because the website can do little to detect whether a paper has been purchased from a “paper mill,” where students hire others to write their essays for them. Instead, the website appears to use a strangely rudimentary regular-expression search that compares all possible substrings (down to some minimal length) in a document against a database of text compiled from online sources like Google Books and Wikipedia. I remain skeptical of the sophistication of the website’s methods because it would regularly flag phrases like “Just as when” or “at odds with,” suggesting that it doesn’t whitelist common phrases. I also hold a personal grudge because I once turned in an essay containing (MLA-formatted!) quotes from Heart of Darkness, and the program declared that I had plagiarized parts of my essay from a certain book by Joseph Conrad.

How can plagiarism be detected more effectively? I have spent this summer reading a little about text classification, and it seems apparent that a website like Turnitin could benefit from a procedure that checks whether the style of a student’s writing is consistent, rather than simply searching for copied text. Such a site would have a harder time presenting definitive evidence of cheating, but it would at least give teachers a better sense of when several students turn in similar papers, or when a single one of a student’s papers shows a markedly different tone and style from xyr others.
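To make this concrete, here is a minimal sketch of what such a consistency check might look like; it assumes nothing about Turnitin’s actual internals, and the word list, the per-thousand-token normalization, and the distance measure are all illustrative choices.

```python
# Sketch: flag stylistic inconsistency between two papers attributed to
# the same student by comparing function-word frequency profiles.
# The word list and distance measure are illustrative, not Turnitin's.
import math
import re
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "by", "from", "but", "on", "while", "upon"]

def profile(text):
    """Frequency of each function word per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [1000.0 * counts[w] / n for w in FUNCTION_WORDS]

def style_distance(paper_a, paper_b):
    """Euclidean distance between the two profiles; unusually large
    values (relative to known same-author pairs) hint at different
    authors."""
    pa, pb = profile(paper_a), profile(paper_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

A real system would calibrate a threshold on pairs of papers known to share an author; a single distance, on its own, proves nothing.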

A classic paper that outlines many of the concepts and principles of text classification is “Inference in an Authorship Problem” by Mosteller and Wallace (1963). The authors apply statistical techniques to determine whether several of the Federalist Papers (1788) were written by Hamilton or Madison; at the time their article appeared, no consensus existed among historians about which of the two had written the disputed papers. Their basic technique is strikingly similar to techniques used today, after the rise of modern computerized data mining: they implement a linear discriminant analysis based on the frequencies with which the authors use certain common words (“by,” “from,” “to,” and the like). The technique looks for a set of words, rather than any single word, whose usage frequencies differ strongly between the two authors’ bodies of work. This set is found by applying various weighted combinations of word frequencies to known samples of each author’s writing and choosing the combination of words and weights that yields high sums for works by one author (Madison, in the paper) and low sums for works by the other. One advanced technique that Mosteller and Wallace employ rests on the observation that certain context-independent “function” words (prepositions, common adverbs, and the like) occur with frequencies that roughly obey a Poisson distribution. This lets them apply Bayes’ theorem when selecting their word set, computing the likelihood of an observed word frequency under the assumption that the underlying distribution is Poissonian.
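As a toy illustration of the Poisson/Bayes idea (a sketch, not the authors’ actual word list, estimates, or procedure): model each author’s rate for a handful of marker words as Poisson, fit the rates from known writings, and score a disputed text by the log-likelihood ratio between the two authors. Because the log of a Poisson probability is linear in the observed count, the resulting score is itself a weighted combination of word counts, the same discriminant structure described above.

```python
# Toy version of Poisson/Bayes authorship scoring in the spirit of
# Mosteller and Wallace. The marker words and the simple rate
# estimates are illustrative; the actual study uses a longer word
# list and considerably more statistical care.
import math
import re
from collections import Counter

MARKER_WORDS = ["by", "from", "to", "upon", "while", "whilst"]

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def fit_rates(known_texts):
    """Estimate each marker word's Poisson rate (occurrences per
    1,000 tokens) from an author's known writings, with add-one
    smoothing so no rate is exactly zero."""
    word_totals = Counter()
    token_total = 0
    for text in known_texts:
        tokens = tokenize(text)
        token_total += len(tokens)
        word_totals.update(t for t in tokens if t in MARKER_WORDS)
    return {w: 1000.0 * (word_totals[w] + 1) / (token_total + 1)
            for w in MARKER_WORDS}

def log_odds(disputed_text, rates_a, rates_b):
    """Log-likelihood ratio: positive favors author A, negative
    favors author B. For each word the Poisson pmf ratio reduces to
    k * (log lam_a - log lam_b) - (lam_a - lam_b); the k! terms cancel."""
    tokens = tokenize(disputed_text)
    thousands = max(len(tokens), 1) / 1000.0
    counts = Counter(t for t in tokens if t in MARKER_WORDS)
    score = 0.0
    for w in MARKER_WORDS:
        k = counts[w]
        lam_a = rates_a[w] * thousands  # expected count under author A
        lam_b = rates_b[w] * thousands  # expected count under author B
        score += k * (math.log(lam_a) - math.log(lam_b)) - (lam_a - lam_b)
    return score
```

Trained on known Hamilton and Madison essays, scoring each disputed paper with log_odds would reproduce the flavor, if not the rigor, of the original analysis.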

The authors find, surprisingly, that Madison wrote every single one of the disputed papers. This contradicts various claims made by contemporaries of the original Federalists, but it corroborates several historical analyses that appeared around the time Mosteller and Wallace published their paper. Because the method requires only known samples of an author’s work to build a model, a website like Turnitin could implement a similar technique by archiving a student’s past submissions. Presumably, the website would then have little difficulty discerning whether a history essay was written by an adolescent or by Joseph Conrad.


