Guide to all Text Analysis Posts
Prev: Sample Text
Definitions and Text Operations:
Text atoms: Uppercase letters (e.g. 'L'), lowercase letters (e.g. 'u'), punctuation marks (e.g. '?'), numerals (e.g. '4').
Accent Marks: None in English; just consider accented letters as independent letters?
Whitespace: space, tab, newline,
end-of-line, end-of-file.
Word: A sequence of text atoms; in
text, preceded and succeeded by whitespace.
Text: A sequence of text atoms and
whitespace. Equivalently, a sequence of words separated by
whitespace. Text space is a linear space.
Corpus: A collection of texts.
Word list: A list of words; order usually does not matter. It is generated either from a text or from another list.
Dictionary word: A word found in a
dictionary or in one of a set of dictionaries.
Parse: An operation that acts on text and generates a list of words (by default it parses on whitespace, but a parse map can be defined for any sequence of words). Parse is therefore not in fact a text operation; it maps a point from one space into another.
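A minimal sketch of parse in Python, splitting on whitespace only (the function name is mine, not from this post):

    def parse(text):
        """Split a text on whitespace and return the resulting list of words."""
        return text.split()

    parse("To be or not to be")
    # ['To', 'be', 'or', 'not', 'to', 'be']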
Various operations can be carried out on text (or on lists, since each element of a list is text); the ones relevant for our purposes are:
Lower: Acts on any piece of text, converting all uppercase letters to the corresponding lowercase ones.
lower(upper's and(?) LOWER) = upper's and(?) lower
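In Python this is just the built-in str.lower, applied here to the example above:

    "upper's and(?) LOWER".lower()
    # "upper's and(?) lower"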
DePunctuate: Almost all punctuation in normal English text is juxtaposed with whitespace (WJPMs). Punctuation marks which are normally juxtaposed with whitespace can either be deleted or replaced with whitespace; in either case one will be left with just whitespace. The two exceptions are the hyphen '-' and the apostrophe “'”. '-' occurs in hyphenated words like 'super-talented', which are composed of independent words; hence the hyphen should be replaced by a whitespace. The apostrophe occurs in possessives and contractions, e.g. “dog's”, “she'd” and “isn't”. In order not to create single-letter words and peculiarities like “isn”, the apostrophe should be deleted. Note that stemming leaves all three of 'isn', 'isnt' and 't' unchanged, and hence can't be used to make a choice. With either choice for eliminating the WJPMs, we will have one exception, either the hyphen or the apostrophe. For simplicity, let us delete the WJPMs without replacement, delete the apostrophe without replacement as well, and replace the hyphen with whitespace.
DePunctuate(“The dog's bone was super-tasty! Wasn't it?” she asked.) = The dogs bone was super tasty Wasnt it she asked
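A minimal Python sketch of DePunctuate under the choices above; the function name and the use of string.punctuation (which covers only ASCII punctuation) are my own assumptions:

    import string

    def depunctuate(text):
        """Replace hyphens with spaces; delete all other punctuation, apostrophes included."""
        text = text.replace("-", " ")           # hyphenated words become independent words
        for mark in string.punctuation:
            if mark != "-":
                text = text.replace(mark, "")   # delete without replacement
        return text

    depunctuate("The dog's bone was super-tasty! Wasn't it? she asked.")
    # 'The dogs bone was super tasty Wasnt it she asked'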
Chop[n]: Eliminates words of length n or less, with possible exceptions like 'i' and 'a' for n = 1.
Chop[2](To be or not to be, that is the question.) = not be, that the question.
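A quick sketch of chop, treating each whitespace-separated token, punctuation and all, as a word (which is why 'be,' survives above); the names are mine:

    def chop(text, n, keep=("i", "a")):
        """Drop whitespace-separated tokens of length n or less, with listed exceptions."""
        return " ".join(w for w in text.split() if len(w) > n or w in keep)

    chop("To be or not to be, that is the question.", 2)
    # 'not be, that the question.'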
I thought of this as a shortcut to stop. With the choice of DePunctuate above, in which punctuation marks (except for '-', which is replaced by whitespace) are simply deleted, there shouldn't be single-letter words other than 'i' and 'a'. Having rethought stop, chop became superfluous.
Stem:
Acts on a word and returns its stem. Conjugations of verbs, plurals,
adjectives, adverbs and other modifiers often have the same stem.
Note that the result of stem
on a regular English word is not necessarily a dictionary word – it
is only the sequence of letters identified as the stem. There are
many stemming algorithms. I've used Porter2 from the Python
implementation by Matt Chaput.
I had misunderstood the purpose of stem, and when I saw words like veget and readabl I wrote ifstem so it would only replace a word by its stem if the stem was a dictionary word.
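A sketch of stem and ifstem, substituting NLTK's SnowballStemmer("english") (which implements the same Porter2 algorithm) for Matt Chaput's implementation; dictionary_words is a stand-in for whatever dictionary check one prefers:

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")   # Snowball English = Porter2

    def stem(word):
        return stemmer.stem(word)

    def ifstem(word, dictionary_words):
        """Replace a word by its stem only if the stem is a dictionary word."""
        s = stem(word)
        return s if s in dictionary_words else word

    stem("vegetables")   # 'veget'   -- not a dictionary word
    stem("readable")     # 'readabl'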
Stop[rank]: Eliminates words whose tokens rank higher (are more frequent) than rank. See the StopWordsToBeOrNotToBeZipf post, to appear in the next day or two.
Derarify[rank from bottom]: Eliminates words whose tokens are very low-ranking by frequency. See StopWordsToBeOrNotToBeZipf.
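A rough sketch of both rank-based filters; the Counter-based ranking and the helper names are my assumptions, not the definitions from the StopWordsToBeOrNotToBeZipf post:

    from collections import Counter

    def rank_by_frequency(words):
        """Map each distinct word to its frequency rank: 1 = most frequent."""
        counts = Counter(words)
        return {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}

    def stop(words, rank):
        """Eliminate words ranked above (more frequent than) the given rank."""
        ranks = rank_by_frequency(words)
        return [w for w in words if ranks[w] >= rank]

    def derarify(words, rank_from_bottom):
        """Eliminate the rarest words: those within rank_from_bottom of the lowest rank."""
        ranks = rank_by_frequency(words)
        cutoff = max(ranks.values()) - rank_from_bottom
        return [w for w in words if ranks[w] <= cutoff]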
SpellCheck: Not thought about this yet. Eliminating the least frequent tokens (DeRarifying) should take care of some misspellings.
ImProper:
Identify and eliminate proper nouns. While any one given text may
have a few names that occur with high frequency (for example in Alice
in Wonderland the name
Alice is the 10th
ranked word), over the whole corpus most proper nouns should be
relative rarities and can be eliminated by DeRarifying the text. The
most common proper nouns can perhaps be listed and eliminated
separately.
Histogram: Again, this is not a text operation; it acts on a word list and returns a list of tuples or a dictionary (a list of key:value pairs) whose first elements or keys are the members of the set constructed from the elements of the list, and whose second elements or values are the number of times each occurs in the list. The space of word histograms is linear, has a natural origin and is non-negative; it is a whole-number-valued “vector space”. So let's call these things “word vectors”. For any given text, then, one can construct its word vector.
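In Python, collections.Counter gives exactly this kind of word vector (a sketch, reusing the parse function from earlier; the name word_vector is mine):

    from collections import Counter

    def word_vector(text):
        """Count how many times each word occurs in the parsed text."""
        return Counter(parse(text))

    word_vector("to be or not to be")
    # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})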
The above operations don't commute with each other; hence the order in which they are carried out will affect the result.
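For instance, with the sketches above (my functions, not the post's), chop[2] keeps 'be,' when it runs before depunctuate but not after it:

    chop(depunctuate("To be or not to be, that is the question."), 2)
    # 'not that the question'

    depunctuate(chop("To be or not to be, that is the question.", 2))
    # 'not be that the question'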
Note for further thought: the above operations are linear operators on a linear space, and they do have eigenvalues and eigenvectors. But the operators are not all symmetric, and some of their eigenvectors lie outside text space. E.g. purely lowercase texts are eigenvectors of lower with eigenvalue 1. A word vector containing the lowercase text low together with, in equal and opposite total frequency, texts whose lowercase is low (e.g. lOw, LOW) is an eigenvector with eigenvalue 0. (See my future post on how to extend text space to make some of these operators symmetric.)
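A quick numerical check of both claims, extending lower linearly to signed word vectors (plain Counters here; this is just my illustration):

    from collections import Counter

    def lower_vector(v):
        """Apply lower to each word of a signed word vector, summing coefficients."""
        out = Counter()
        for word, freq in v.items():
            out[word.lower()] += freq
        return dict(out)

    lower_vector({"low": 3})             # {'low': 3}  -> eigenvalue 1
    lower_vector({"low": 1, "lOw": -1})  # {'low': 0}  -> the zero vector, eigenvalue 0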