Keywords and Signatures
Warren R. Johnson


Text Analysis. When we evaluate writing with an eye to how much was written by whom, we are engaging in text analysis. Text analysis ranges from the easy task of counting words to see which ones occur most often, to the much more difficult task of identifying authorship itself.

Perhaps the most elaborate instance of authorship identification was undertaken more than a century ago. Nineteenth-century scholars sought an answer to a question that seemed to have an answer already: "Who wrote the Five Books of Moses?" The Graf-Wellhausen Documentary Hypothesis pointed to four different authors and is, of course, not without its critics (1). More recent instances of text analysis can be found in forensic linguistics. The Automated Linguistic Identification of Authorship System, or ALIAS®, created by Carole Chaski, has proved most useful in identifying the right individuals in legal cases (2).

For our aims we need neither Graf-Wellhausen, for we are doing nothing of Biblical dimensions, nor ALIAS®, because crimes are not at issue here. Most often we run into students' essays that make us question their authenticity. For that task a handful of clues is sufficient, either to add to our despair or to give hope to the weary.

A nice little device at http://textalyser.net/ analyzes text and calculates both complexity and readability scores. Complexity refers to lexical density, or the percentage of different words per sentence. Readability combines sentence length and syllables per word. Not always, but often, the same writer gets the same results. Better yet, two different writers generate entirely different results. In the two cases I looked at, one from Winston Churchill and the other from Mark Twain, Churchill got a complexity score of 48 and a readability score of 46. Mark Twain got a complexity score of 61 and a readability score of 63. Twain's score is on a level with Sports Illustrated, and Churchill's is a tad higher than the Wall Street Journal (3).
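For readers who want to experiment, both measures can be approximated in a few lines. The sketch below is an assumption on my part: textalyser.net does not publish its exact formulas, so lexical density is estimated here as unique words over total words, and readability uses Flesch's published formula with a crude vowel-run syllable counter.

```python
import re

def lexical_density(text):
    """Rough lexical density: unique words as a percentage of all words.
    (An approximation; the site's exact formula is not published.)"""
    words = re.findall(r"[a-z']+", text.lower())
    return 100 * len(set(words)) / len(words)

def count_syllables(word):
    """Crude syllable count: runs of vowels, minimum one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch's formula from How to Write Plain English:
    206.835 - 1.015*(words per sentence) - 84.6*(syllables per word)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[a-z']+", text.lower())
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

sample = "The cat sat on the mat. It was a very plain mat."
print(round(lexical_density(sample), 1))
print(round(flesch_reading_ease(sample), 1))
```

The syllable counter is the weak link; any serious comparison should use the same counter on all texts so that errors at least cancel out.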

Catching up with Shakespeare. Much more useful than counting words and scoring them is looking at the shape of the vocabulary an author uses in a particular piece. For writers of the caliber of Mark Twain and Winston Churchill, the words they have in mind take on a distinguished bell-shaped form when they are distributed by length. Twain seems to have favored four-letter words (away, best, high, know, more, said, very, this, with, well) while Churchill leaned toward seven-letter words (already, Britain, devoted, Europe, Germany, injured, members, subject). In their actual writing, where vocabulary words have to be repeated in order to provide appropriate syntax, three-letter words did most of the heavy lifting (all, but, for, not, was). In that regard Churchill and Twain are very much like the rest of us. Unlike the rest of us, a graphic representation of their writing turns into a nicely shaped histogram similar to this one:

[Histogram: bell-shaped distribution of word counts by word length, from 1 to 21+ letters. Horizontal axis: Number of Letters in Word. Source: adapted from 10ticks.co.uk/s_codebreaker_letter]



In other words, not only does an address by Churchill read well, or a letter to a friend by Twain sound fine; both look good at a deeper level, too (4). By contrast, one of the standard features of academe, the résumé, looks like a hacked-off cry for help when its words are arranged by length on a bar graph.
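A word-length histogram of this kind takes only a few lines to build. This is a minimal sketch, not the tool cited above; the Churchill passage is used purely as a familiar illustration.

```python
import re
from collections import Counter

def length_histogram(text):
    """Tally the words of running text by letter count."""
    words = re.findall(r"[a-zA-Z]+", text)
    return Counter(len(w) for w in words)

passage = ("We shall fight on the beaches, we shall fight on the landing "
           "grounds, we shall fight in the fields and in the streets.")
hist = length_histogram(passage)

# Print a crude text-mode bar graph, shortest words first.
for length in sorted(hist):
    print(f"{length:2d} {'#' * hist[length]}")
```

Even on a passage this short, the short function words pile up at lengths two and three, exactly the heavy lifting described above.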

A few cases, of course, are not conclusive; they are merely suggestive. But there were more instances. The bell-shaped curves were characteristic of several long texts from Shakespeare, the Sermon on the Mount, the second of the prophets named Isaiah, and 1 Samuel 28 (the Witch of Endor). With a sufficiently capable program, larger samples might be investigated.

Finding a basis for contrast. A particularly useful tool created by Joyce Maeda and Hobara Yuu is the Frequency Level Checker (5). The various forms of be (am, are, been, being, is, was, were, etc.) are counted as one category. In combination with three additional words, the program led me to a conclusion curious for our purposes: four keywords beg to be called an author's signature. The words are "that", "as", "and", "be". They are empty of content and therefore used often for reasons of syntax (6). On average, such words occur at specific rates for each writer. Churchill and Twain, for instance, almost never use "as". On the other hand, they use "be" in one of its forms about once every 20-30 words. In more than 100 samples of writing, ranging from Shakespeare and the Bible to Mark Twain, Saul Bellow, and private communications, that rate occurred more often than any other. It was uncommon to find the combination of "that" and "as" occurring more often than "be".
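The four-keyword signature can be tallied mechanically. The sketch below is my own reconstruction, not the Frequency Level Checker itself; the forms of be are pooled as the tool does, and the per-100-words normalization is my choice for readability (a rate of one "be" every 20-30 words corresponds to roughly 3-5 per 100).

```python
import re

# Forms of "be" pooled into one category, as Frequency Level Checker does.
BE_FORMS = {"be", "am", "are", "is", "was", "were", "been", "being"}

def keyword_rates(text):
    """Occurrences of the four signature keywords per 100 words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {
        "that": words.count("that"),
        "as": words.count("as"),
        "and": words.count("and"),
        "be": sum(1 for w in words if w in BE_FORMS),
    }
    return {k: 100 * v / len(words) for k, v in counts.items()}

sample = "It is what it is, and that is that."
rates = keyword_rates(sample)
for k in ("that", "as", "and", "be"):
    print(f"{k}: {rates[k]:.1f} per 100 words")
```

To compare two essays, run both through the same function and set the four rates side by side; it is the profile across all four keywords, not any single rate, that acts as the signature.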

Almost at that uncommon level, though, were two articles I recently found on the internet. They were nearly identical in their use of the four keywords. One was signed, one was not. Whether it would be possible for two persons to write with that much similarity is doubtful. Conversely, over time and depending on the subject about which an author writes, there can be broad deviations in the rate of keywords. Like signatures, keywords are not fingerprints. Like signatures, they give us a fair indication of who wrote what, whether the authors want to reveal themselves or not.

Watching our Ps and Qs. Another sort of analysis contrasts how often each letter of the alphabet is used. Letters that have to do with a writer's topic tend to recur, if only because the topic word, like a tonic note in music, occurs regularly throughout the manuscript. For instance, if you write about religion, every time you repeat the word, the eight letters that make it up are counted anew. If you write about war, those three letters will occur more often than expected for the same reason. What is striking in English is the low likelihood of the letters p and q. According to 10ticks.co.uk, the letter p appears in newspapers about 1.9% of the time and the letter q about 0.1% of the time. Words that would count high for the letter p include perhaps, people, preparation, and presupposed. Iraq, quality, requirements, and unique would drive the q score up. Someone who consistently has higher-than-average p and q scores might well be writing a technical paper that calls upon particular words repeatedly.
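Letter-frequency counting is the simplest analysis of the lot. Here is a minimal sketch; the sample sentence is deliberately loaded with the p- and q-heavy words mentioned above, so both letters land far above their newspaper baselines of 1.9% and 0.1%.

```python
from collections import Counter

def letter_percentages(text):
    """Percentage of each letter a-z among all letters in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    return {c: 100 * counts[c] / len(letters) for c in sorted(counts)}

sample = ("Perhaps people presupposed the unique quality "
          "of Iraq's requirements.")
pct = letter_percentages(sample)
print(f"p: {pct.get('p', 0):.1f}%  q: {pct.get('q', 0):.1f}%")
```

On a real manuscript, the interesting signal is not the raw percentages but how far they sit from the published newspaper averages.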

Conclusion. Were two instances of writing identical, we would not have to concern ourselves with text analysis. It is rather the dishonest writers who seek to hide their identities lest they be caught who concern us. One of the criticisms of distance education is that students may not be doing their own work. The discrepancy, if any, between proctored midterms and classwork can be analyzed with text analysis tools. Counting letters and words is probably not the best way. Looking at word distributions may well be more helpful. That text analysis might reveal a similarity between two individuals' work, or demonstrate that one essay and another were written by two different hands, makes its practical application worth its weight in gold.

 


1. A summary of their answer can be found at http://ccat.sas.upenn.edu/rs/2/Judaism/jepd.html. The authors were J, E, D, and P. In 1990 David Rosenberg and Harold Bloom followed the same line of reasoning in The Book of J, New York: Grove Weidenfeld.

2. Chaski, Carole E. 2001. "Empirical Evaluations of Language-Based Author Identification Techniques". The International Journal of Speech, Language and the Law: Forensic Linguistics. Vol. 8, No. 1.

3. For the sake of comparison, Churchill's address to the House of Commons on June 18, 1940 was one text; Twain's letter to Howells about his friend John T. Lewis was the second. Lexical density is the percentage of different words per sentence. Readability is calculated by the same means Rudolf Flesch described in How to Write Plain English, Harper and Row (1979), Chapter 2, at http://pages.stern.nyu.edu/~wstarbuc/Writing/Flesch.htm. The funnies earn a score of 92 and the tax code a score of minus 6.

4. Though not necessary for the analysis of text, http://www.10ticks.co.uk/s_codebreaker_word.asp is visually very useful.

5. Maeda, Joyce and Hobara Yuu (2000) "Frequency Level Checker".
http://language.tiu.ac.jp/flc/tool.html

6. Thanks are due to David Marshall for calling the term syncategorematic to my attention. During the Middle Ages the syncategoremata were discussed extensively by Peter of Spain. See, for instance,
http://setis.library.usyd.edu.au/stanford/archives/fall2001/entries/peter-spain/#4