Keywords and Signatures
Warren R. Johnson
Text Analysis. When we evaluate writing with an eye to how much was written
by whom, we are actually engaging in text analysis. Text analysis ranges
from the easy task of counting words to see which ones occur most often to
the much more difficult task of identifying authorship itself.
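That easy end of the range fits in a few lines. Here is a minimal sketch of a word counter (the word pattern and the toy sample are my own assumptions):

```python
from collections import Counter
import re

def word_frequencies(text, top=5):
    """Return the `top` most frequent words in `text`, case-insensitive."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

# A toy sample; any plain-text file would do.
sample = "the cat sat on the mat and the dog sat on the cat"
print(word_frequencies(sample, top=3))  # [('the', 4), ('cat', 2), ('sat', 2)]
```

Even this crude tally already shows what the rest of the article exploits: the most frequent words are the little syntactic ones, not the topic words.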
Perhaps the most elaborate instance of authorship identification was
undertaken more than a century ago. Nineteenth-century scholars sought an
answer to a question that seemed to have an answer already: "Who
wrote the Five Books of Moses?" The Graf-Wellhausen
Documentary Hypothesis suggested an answer that pointed to four different
authors and is, of course, not without its critics (1). More
recent instances of text analysis can be found in forensic linguistics.
The Automated Linguistic Identification of Authorship System or ALIAS®
created by Carole Chaski has been most useful in identifying the right
individuals in legal cases (2).
For our aims we need neither Graf-Wellhausen, for we are doing nothing of
Biblical dimensions, nor ALIAS®, because crimes are not at issue here.
Most often we run into students' essays that cause us to question their
authenticity. To do so we can use a handful of clues sufficient to
add to our despair or give hope to the weary.
A nice little device at http://textalyser.net/
analyzes text and calculates both complexity and readability scores.
Complexity refers to lexical density or the percentage of different words
per sentence. Readability combines sentence length and syllables per word.
Not always, but often, the same writer gets the same results. Better yet,
two different writers generate entirely different results. In the two
cases I looked at, one from Winston Churchill and the other from Mark
Twain, Churchill got a complexity score of 48 and a readability score of
46. Mark Twain got a complexity score of 61 and a readability score of 63.
Twain's score is on a level with Sports Illustrated and Churchill's
is a tad higher than the Wall Street Journal (3).
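The exact formulas textalyser.net uses are not published alongside its scores, but the two measures as defined here can be approximated. The sketch below computes lexical density over a whole sample and the Flesch Reading Ease score from note 3; the vowel-group syllable counter is a crude assumption of mine, so the numbers will not match the site's exactly:

```python
import re

def syllables(word):
    """Crude syllable estimate: count vowel groups (a rough assumption;
    true syllabification is more involved)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def complexity_and_readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Lexical density: distinct words as a percentage of all words.
    density = 100 * len({w.lower() for w in words}) / len(words)
    # Flesch Reading Ease, as described in note 3.
    flesch = (206.835
              - 1.015 * len(words) / len(sentences)
              - 84.6 * sum(syllables(w) for w in words) / len(words))
    return round(density, 1), round(flesch, 1)
```

On real prose both numbers drop quickly: longer sentences and longer words pull the Flesch score down, while repeated vocabulary pulls the density down.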
Catching up with Shakespeare. Much more useful than counting words
and establishing their scores is looking at the shape of the vocabulary an
author uses in a particular article. For writers of the caliber of
Mark Twain and Winston Churchill, the words they have in mind take on a
distinctly bell-shaped form when they are distributed by length.
Twain seems to have favored four-letter words (away, best, high, know,
more, said, very, this, with, well) while Churchill leaned towards
seven-letter words (already, Britain, devoted, Europe, Germany, injured,
members, subject). In their actual writing, where vocabulary words
have to be repeated in order to provide appropriate syntax, three-letter
words did most of the heavy lifting (all, but, for, not, was). In that
regard Churchill and Twain are very much like the rest of us. Unlike the
rest of us, a graphic representation of their writing turns into a nicely
shaped histogram similar to this one:

[Bar-graph: word frequency distributed by number of letters in word, 1 to 21+.
Source: Adapted from 10ticks.co.uk/s_codebreaker_letter]
In other words, not only does an address by Churchill read well, or a
letter to a friend by Twain sound fine; both look good at a deeper level,
too (4). By contrast, that standard feature of academe, the résumé, looks
like a hacked-off cry for help when its words are arranged by length on a
bar-graph.
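The distribution itself is easy to compute. A minimal sketch, printing a crude text bar-graph (the sample words are Churchill's seven-letter favorites quoted above):

```python
from collections import Counter
import re

def length_distribution(text):
    """Count words by length -- the 'shape of the vocabulary'."""
    return Counter(len(w) for w in re.findall(r"[A-Za-z]+", text))

dist = length_distribution(
    "already Britain devoted Europe Germany injured members subject")
for length in sorted(dist):
    print(f"{length:2d} {'#' * dist[length]}")
```

Run on a long enough text, the bars rise and fall in the bell shape described above; run on a résumé, they come out ragged.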
A few cases, of course, are not conclusive; they are merely suggestive.
But there were more instances. The bell-shaped curves were characteristic
of several long texts from Shakespeare, the Sermon on the Mount, the
second of the prophets named Isaiah, and 1 Samuel 28 (the Witch of Endor).
With a sufficiently capable program, larger samples might be
investigated.
Finding a basis for contrast. A particularly useful tool created by
Joyce Maeda and Hobara Yuu is called Frequency Level Checker (5).
The various forms of be (am, are, been, being, is, was, were, etc.) are
counted in one category. In combination with three additional words the
program led me to a curious conclusion for our purposes: four keywords beg
to be called an author's signature. The words are "that",
"as", "and", "be". They are empty of content
and therefore used often for reasons of syntax (6). On average, such words
occur at specific rates for each writer. Churchill and Twain, for
instance, almost never use "as". On the other hand, they use
"be" in one of its forms about once every 20-30 words. In more
than 100 samples of writing ranging from Shakespeare, the Bible, Mark
Twain, Saul Bellow and private communications, that rate occurred more
often than any other. It was uncommon to find the combination of
"that" and "as" occurring more often than "be".
Almost at the uncommon level, though, were two articles I recently found
on the internet. They were nearly identical in their use of the four
keywords. One was signed, one was not. Whether it would be
possible for two persons to write with that much similarity is doubtful.
Conversely, over time and depending on the subject about which an author
writes, there can be broad deviations in the rate of keywords. Like
signatures, keywords are not fingerprints. But like signatures, they give
us a fair indication of who wrote what, whether the authors want to
reveal themselves or not.
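On the assumption that the Frequency Level Checker pools the forms of "be" into one category as described, a keyword-signature counter can be sketched like this (the rate per 100 words is my own normalization, not the tool's):

```python
import re

BE_FORMS = {"be", "am", "are", "is", "was", "were", "been", "being"}

def keyword_rates(text, per=100):
    """Occurrences of the four signature keywords per `per` words,
    with all forms of 'be' pooled into one category."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = {"that": 0, "as": 0, "and": 0, "be": 0}
    for w in words:
        if w in BE_FORMS:
            counts["be"] += 1
        elif w in counts:
            counts[w] += 1
    return {k: round(per * v / len(words), 2) for k, v in counts.items()}
```

Comparing the four rates for two essays gives a profile; by the pattern noted above, a "be" rate of roughly 3 to 5 per 100 words would be typical.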
Watching our Ps and Qs. Another sort of analysis contrasts how often each
letter of the alphabet is used. Letters that have to do with a writer's
topic tend to recur, if only because the topic word, like a tonic note in
music, occurs regularly throughout the manuscript. For instance, if you
write about religion, every time you repeat the word the eight letters
that make up that word are counted anew. If you write about war, those
three letters will occur more often than expected for the same reason.
What is striking in English is the low likelihood of the letters p
and q. According to 10ticks.co.uk, the letter p appears
in newspapers about 1.9% of the time and the letter q about 0.1% of
the time. Words that would count high for the letter p
include perhaps, people, preparation, and presupposed. Iraq,
quality, requirements, and unique would drive the q score up.
Someone who consistently has higher than average p and q
scores might well be writing a technical paper that calls upon particular
words repeatedly.
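Letter percentages are just as easy to tally as word frequencies. A minimal sketch, using a q-heavy phrase built from the example words above (the 0.1% newspaper baseline for q is the 10ticks.co.uk figure cited earlier):

```python
from collections import Counter

def letter_percentages(text):
    """Percentage of each letter a-z in `text`, ignoring case and
    non-letter characters."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    return {c: 100 * n / len(letters) for c, n in counts.items()}

pcts = letter_percentages("unique quality requirements for Iraq")
# This technical phrase scores far above the 0.1% newspaper baseline for q.
print(round(pcts["q"], 1))  # prints 12.5
```

Comparing a writer's p and q percentages against such baselines flags exactly the topic-driven surpluses the paragraph above describes.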
Conclusion. Were two instances of writing identical, we would not have to
concern ourselves with text analysis. It is, rather, dishonest writers
who seek to hide their identities lest they be caught who concern us. One
of the criticisms of distance education is that students may not be doing
their own work. The discrepancy, if any, between proctored midterms and
classwork can be analyzed with text analysis tools. Counting letters and
words is probably not the best way; looking at word distributions may
well be more helpful. That text analysis might reveal a similarity
between two individuals' work, or demonstrate that one essay and another
were written by two different hands, makes the practical application of
text analysis worth its weight in gold.
1. A summary of their answer can be found at
http://ccat.sas.upenn.edu/rs/2/Judaism/jepd.html
The authors were J, E, D, and P. In 1990 David Rosenberg and Harold Bloom
followed the same line of reasoning in The Book of J, New York: Grove
Weidenfeld.
2. Chaski, Carole E. 2001. "Empirical Evaluations of Language-Based
Author Identification Techniques." The International Journal of
Speech, Language and the Law: Forensic Linguistics, Vol. 8, No. 1.
3. For the sake of comparison Churchill's address to the House of Commons
on June 18, 1940 was one text. Twain's letter to Howell about his friend
John T. Lewis was the second text. Lexical density is the percentage of
different words per sentence. Readability is calculated by the same means
Rudolph Flesch described in How to Write Plain English, Harper and
Row (1979), Chapter 2, at
http://pages.stern.nyu.edu/~wstarbuc/Writing/Flesch.htm
The funnies earn a score of 92 and the tax code a score of minus 6.
4. Though not necessary for the analysis of text,
http://www.10ticks.co.uk/s_codebreaker_word.asp is visually very useful.
5. Maeda, Joyce and Hobara Yuu (2000) "Frequency Level Checker".
http://language.tiu.ac.jp/flc/tool.html
6. Thanks are due to David Marshall for calling the term syncategorematic
to my attention. During the Middle Ages the syncategoremata were discussed
extensively by Peter of Spain. See for instance, http://setis.library.usyd.edu.au/stanford/archives/fall2001/entries/peter-spain/#4