Definitions and Text Operations:
Text atoms: Uppercase letters (L), lowercase letters (u), punctuation marks (?), numerals (4).
Accent Marks: None in English, just consider accented letters as independent letters?
Whitespace: space, tab, newline, end-of-line, end-of-file.
Word: A sequence of text atoms; in text, preceded and succeeded by whitespace.
Text: A sequence of text atoms and whitespace. Equivalently, a sequence of words separated by whitespace. Text space is a linear space.
Corpus: A collection of texts.
Word list: A list of words; order usually does not matter. It is generated either from a text or from another list.
Dictionary word: A word found in a dictionary or in one of a set of dictionaries.
Parse: An operation that acts on text and generates a list of words. (By default it splits on whitespace, but a parse map can be defined for any sequence of words.) Because its result is a list rather than text, parse is not in fact a text operation: it maps a point from one space into another.
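A minimal sketch of the default whitespace parse in Python. The function name parse is my own choice for illustration; it only relies on the built-in str.split.

```python
def parse(text):
    """Split text into a list of words on any run of whitespace."""
    # str.split() with no argument splits on spaces, tabs and newlines
    # and never returns empty strings.
    return text.split()

words = parse("To be\tor not\nto be")
print(words)  # ['To', 'be', 'or', 'not', 'to', 'be']
```

The input is a string (text) and the output is a list, which is the sense in which parse leaves text space.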
Various operations can be carried out on text (or on lists, since each element of a list is text). The ones relevant for our purposes are:
Lower: Acts on any piece of text, converts all uppercase letters to the corresponding lowercase ones.
lower(upper's and(?) LOWER) = upper's and(?) lower
DePunctuate: Almost all punctuation marks in normal English text are juxtaposed with whitespace (call these WJPMs). A WJPM can either be deleted or replaced with whitespace; in either case only whitespace is left at that position. The two exceptions are the hyphen '-' and the apostrophe “'”. The hyphen occurs in hyphenated words like 'super-talented', which are composed of independent words. Hence the hyphen should be replaced by whitespace. The apostrophe occurs in possessives and contractions, e.g. “dog's”, “she'd” and “isn't”. In order to not create single-letter words and peculiarities like “isn”, the apostrophe should be deleted. Note that stemming leaves all three of 'isn', 'isnt' and 't' unchanged, and hence can't be used to make the choice. Whichever rule we choose for the WJPMs, one of the two exceptions (the hyphen or the apostrophe) will need the other rule. For simplicity, let us delete the WJPMs without replacement, delete the apostrophe without replacement as well, and replace the hyphen with whitespace.
DePunctuate(“The dog's bone was super-tasty! Wasn't it?” she asked.) = The dogs bone was super tasty Wasnt it she asked
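The rule chosen above can be sketched in a few lines of Python. This is only one way to write it: the function name is mine, and I am assuming that ASCII punctuation plus the four curly quote characters covers everything that needs deleting.

```python
import string

def depunctuate(text):
    """Replace hyphens with a space and delete every other punctuation mark."""
    # Hyphens join independent words, so they become whitespace.
    text = text.replace("-", " ")
    # Everything else, the apostrophe included, is deleted outright.
    # Curly quotes are listed by hand because string.punctuation is ASCII only.
    to_delete = string.punctuation.replace("-", "") + "“”‘’"
    return text.translate(str.maketrans("", "", to_delete))

sample = "“The dog's bone was super-tasty! Wasn't it?” she asked."
print(depunctuate(sample))
# The dogs bone was super tasty Wasnt it she asked
```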
Chop[n]: Eliminates words of length n or less, with possible exceptions like 'i' and 'a' for n = 1.
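Chop[2](To be or not to be, that is the question.) = not be, that the question.
Note that the second 'be' survives here because its trailing comma makes 'be,' three characters long.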
I thought of this as a shortcut to stop. With the choice made for depunctuate above, where punctuation marks are simply deleted (except for '-', which is replaced by whitespace), there shouldn't be any single-letter words other than 'i' and 'a'. Having rethought stop, chop became superfluous.
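For reference, here is a rough sketch of chop in Python. The signature, and the keep parameter used for the 'i' and 'a' exceptions, are my own assumptions about how one might write it.

```python
def chop(text, n, keep=("i", "a")):
    """Drop whitespace-delimited words of length n or less, except those in keep."""
    words = text.split()
    # Length is measured on the raw word, so attached punctuation counts.
    kept = [w for w in words if len(w) > n or w.lower() in keep]
    return " ".join(kept)

print(chop("To be or not to be, that is the question.", 2))
# not be, that the question.
```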
Stem: Acts on a word and returns its stem. Conjugations of verbs, plurals, adjectives, adverbs and other modifiers often have the same stem. Note that the result of stem on a regular English word is not necessarily a dictionary word – it is only the sequence of letters identified as the stem. There are many stemming algorithms. I've used Porter2 from the Python implementation by Matt Chaput.
I had misunderstood the purpose of stem, and when I saw words like veget and readabl I wrote ifstem so it would only replace a word by its stem if the stem was a dictionary word.
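The idea behind ifstem can be sketched as follows. To keep the example self-contained, a small lookup table stands in for a real stemmer and a small set stands in for a dictionary; both are hypothetical and only illustrative.

```python
def ifstem(word, stem, dictionary):
    """Return stem(word) only if that stem is itself a dictionary word."""
    candidate = stem(word)
    return candidate if candidate in dictionary else word

# Illustrative stand-ins, not real stemmer output or a real dictionary.
TOY_STEMS = {"dogs": "dog", "vegetables": "veget", "readable": "readabl"}
TOY_DICTIONARY = {"dog", "dogs", "vegetables", "readable"}

def toy_stem(word):
    """Hypothetical stemmer: look the word up, or return it unchanged."""
    return TOY_STEMS.get(word, word)

print([ifstem(w, toy_stem, TOY_DICTIONARY) for w in ["dogs", "vegetables", "readable"]])
# ['dog', 'vegetables', 'readable']
```

Passing the stemmer in as an argument means any real stemming function that maps a word to a string could be substituted for toy_stem.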
Stop[rank]: Eliminates words whose tokens rank higher (are more frequent) than rank. See StopWordsToBeOrNotToBeZipf (post to appear in the next day or two).
DeRarify[rank from bottom]: Eliminates words whose tokens are very low-ranking by frequency. See
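Both rank-based filters, stop and derarify, can be sketched with collections.Counter. I am assuming here that the ranking is computed from the same word list that is being filtered (it could equally come from a whole corpus), and that ties in frequency are broken by first appearance. The function names are mine.

```python
from collections import Counter

def ranked_tokens(words):
    """Distinct tokens, most frequent first; ties keep first-seen order."""
    return [token for token, _ in Counter(words).most_common()]

def stop(words, rank):
    """Drop words whose token is among the `rank` most frequent tokens."""
    banned = set(ranked_tokens(words)[:rank])
    return [w for w in words if w not in banned]

def derarify(words, rank_from_bottom):
    """Drop words whose token is among the `rank_from_bottom` least frequent."""
    ranked = ranked_tokens(words)
    cut = max(0, len(ranked) - rank_from_bottom)
    banned = set(ranked[cut:])
    return [w for w in words if w not in banned]

words = "the cat and the dog and the bird".split()
print(stop(words, 1))      # ['cat', 'and', 'dog', 'and', 'bird']
print(derarify(words, 3))  # ['the', 'and', 'the', 'and', 'the']
```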
SpellCheck: I have not thought about this yet. Eliminating the least frequent tokens (DeRarifying) should take care of some misspellings.
ImProper: Identify and eliminate proper nouns. While any one given text may have a few names that occur with high frequency (for example in Alice in Wonderland the name Alice is the 10th ranked word), over the whole corpus most proper nouns should be relative rarities and can be eliminated by DeRarifying the text. The most common proper nouns can perhaps be listed and eliminated separately.
Histogram: Again, this is not a text operation. It acts on a word list and returns a list of tuples or a dictionary (a list of key:value pairs). The first elements (or keys) are the members of the set constructed from the elements of the list, and the second elements (or values) are the number of times each occurs in the list. The space of word histograms is linear, has a natural origin and is non-negative; it is a whole-number valued “vector space”. So let's call these things “word vectors”. For any given text, then, one can construct its word vector.
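In Python, collections.Counter already behaves like such a word vector, so a sketch of histogram is very short. The wrapper function name is my own.

```python
from collections import Counter

def histogram(words):
    """Map each distinct word in the list to the number of times it occurs."""
    return Counter(words)

v1 = histogram("to be or not to be".split())
v2 = histogram("to be is to do".split())

print(v1["to"], v1["not"], v1["zebra"])  # 2 1 0 (absent words count as zero)
total = v1 + v2                          # component-wise sum of the two vectors
print(total["to"], total["be"])          # 4 3
print(v1 + Counter() == v1)              # True: the empty histogram is the origin
```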
The above operations do not all commute with each other, hence the order in which they are carried out will affect the result.
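A quick way to see the order dependence is to apply chop and depunctuate in both orders to the same sentence. The two functions below are simplified sketches written just for this comparison (ASCII punctuation only, no exceptions in chop).

```python
import string

def depunctuate(text):
    """Hyphen becomes a space; all other ASCII punctuation is deleted."""
    text = text.replace("-", " ")
    return text.translate(str.maketrans("", "", string.punctuation))

def chop(text, n):
    """Drop whitespace-delimited words of length n or less."""
    return " ".join(w for w in text.split() if len(w) > n)

s = "To be or not to be, that is the question."
a = chop(depunctuate(s), 2)
b = depunctuate(chop(s, 2))
print(a)  # not that the question
print(b)  # not be that the question
```

In the second ordering, 'be,' is three characters long when chop runs, so it survives and only later loses its comma.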
Note for further thought: the above operations are linear operators on a linear space, and they do have eigenvalues and eigenvectors. But the operators are not all symmetric, and some of their eigenvectors lie outside text space. E.g. any purely lowercase text is an eigenvector of lower with eigenvalue 1. A word vector in which the lowercase word low has a frequency equal and opposite to the total frequency of the other words whose lowercase is low (e.g. lOw, LOW) is an eigenvector with eigenvalue 0; since it needs negative frequencies, it lies outside text space. (See my future post on how to extend text space to make some of these operators symmetric.)
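To make the two eigenvector examples concrete, here is a small sketch that treats a word vector as a plain dict from token to a signed count and applies lower to it. Allowing negative counts is the assumption that takes us outside text space; the function name lower_op is mine.

```python
from collections import defaultdict

def lower_op(vector):
    """Apply lower to a word vector: move each count onto the lowercased token."""
    result = defaultdict(int)
    for token, count in vector.items():
        result[token.lower()] += count
    # Drop zero entries so that the zero vector is the empty dict.
    return {t: c for t, c in result.items() if c != 0}

v1 = {"low": 3, "tide": 1}              # purely lowercase
v0 = {"low": 2, "lOw": -1, "LOW": -1}   # counts cancel after lowering

print(lower_op(v1) == v1)  # True: eigenvalue 1
print(lower_op(v0))        # {} (the zero vector): eigenvalue 0
```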