Monday, March 26, 2012

NTA3: StopAndRareWordsToBeOrNotToBeZipf

Exactly, that is the question!

Numerical Text Analysis 3

Guide to All Text Analysis Posts

Prev:  Text Analysis with Illustrative Example

Let's stop and think about stop words. What is the reason to remove them before doing numerical text analysis? Surely the reason is not that they account for 50% of all the words in a text (they do) and that hence removing them will result in data savings. These are words like 'the', 'and', 'to' and 'of' that occur with very high frequency in the corpus of all texts and whose relative frequencies across texts are approximately constant. As such, stop words do not serve to distinguish one text from another. (I was about to say that they do not have any semantic content, but I'll carefully ... step ... around ... that ... mine.) So we can now choose to drop some number of the most frequent words. Note that this decision should absolutely be corpus-based! and not based on some canonical list of stop words that some grad student came up with. For example, the most frequent words in corpus(Wm. Shakespeare's works), corpus(Charles L. Dodgson's works) or corpus(LinkedIn profiles) will likely be different.

So if there isn't a canonical set (nor even number) of stop words, how should one choose the cut-off? For this let's consider another, more numerical reason to drop stop words.
In Numerical Text Analysis, there is a law that captures the general distribution of word frequencies in texts. For any text or corpus, rank all the word tokens in decreasing order of frequency, so 'the' is ranked 1 (relative frequency 0.05) and something like 'zitterbewegung' –with a relative frequency of 10^-9 even within physics texts– is ranked 250,000. Zipf's Law essentially states that the product of the rank and the frequency is approximately constant, independent of rank. Mandelbrot –who got his start in science doing numerical text analysis until he tried to analyse Jorge Luis Borges' works and switched to chaos theory– modified this law so that the exponent of frequency as a function of rank can be less than -1, and took deviations near the ends into account. In any case, we then expect a plot of log10(frequency * rank) vs. log10(rank) to approximate a horizontal straight line. So let's look at a sample Zipf plot for Alice:

Alice in Wonderland original text

Note that the way I am plotting the data is slightly different from Zipf's. Typically (meaning every instance I've seen so far) log(frequency) is plotted vs. log(rank), and one looks for a slope of -1. What that implies is that frequency * rank = constant; hence if I plot log(frequency * rank), data that satisfies Zipf's law should appear as a flat line, and deviations from Zipf's law will be magnified.
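The transformation is easy to sketch. Here is a minimal Python illustration (the function name and the toy text are mine, not from the post): it counts tokens, ranks them by decreasing frequency, and emits the (log10(rank), log10(frequency * rank)) pairs that should trace a flat line for Zipfian data.

```python
import math
from collections import Counter

def zipf_points(text):
    """Rank tokens by decreasing frequency and return
    (log10(rank), log10(frequency * rank)) pairs; under Zipf's law
    the second coordinate is roughly constant."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)  # rank 1 = most frequent
    return [(math.log10(rank), math.log10((count / total) * rank))
            for rank, count in enumerate(ranked, start=1)]

# Toy text with counts (6, 3, 2, 1), which follow 1/rank exactly
# for the top three tokens
points = zipf_points("the the the the the the of of of to to and")
```

For the first three ranks frequency * rank is exactly 0.5, so the curve is flat there; the fourth point dips below, just as real texts deviate at the rare-word end.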

"What's with the discontinuities?" The text has only 50K words and, after stemming, about 2,000 distinct tokens. A very large number of words (~700) occur only once in the text. As we move up this (arbitrarily ordered) subset, the frequency stays the same while the numerical rank keeps decreasing, until we reach the top end and enter the subset of words that occur twice: at this point the rank changes by only 1, but the frequency doubles, adding ~0.3 (= log10(2)) on the log10 scale. Note that this discontinuity is almost half the y-axis range! For a large corpus the number of genuine words that occur only a few times will shrink, and their ranks will be pushed to larger numbers on a wider x-axis range, so even though the first discontinuity will still be log10(2), the plot will appear smoother.
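The size of the jump is simple arithmetic, worth verifying. In this sketch the numbers are made up for illustration: the last token occurring twice sits at rank 100 and the first token occurring once at rank 101, so the jump in log10(frequency * rank) between adjacent points is log10(2 * 100/101), already very close to log10(2).

```python
import math

total = 10_000  # total word count in a hypothetical text

# Adjacent points straddling the count-2 / count-1 boundary:
y_double = math.log10((2 / total) * 100)  # last token occurring twice, rank 100
y_single = math.log10((1 / total) * 101)  # first token occurring once, rank 101

# The rank barely moves but the frequency doubles, so the jump is ~log10(2)
jump = y_double - y_single
print(round(jump, 3))
```

The larger the ranks at which the boundary sits, the closer the rank-ratio factor is to 1 and the closer the jump is to exactly log10(2) ≈ 0.301.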

We clearly see the deviations from the expected power-law behaviour at low ranks, i.e. for frequently occurring words. Roughly speaking, from the 100th token onwards a fit is an almost zero-slope line, whereas the tokens ranked above that deviate from this behaviour. If we wanted to be aggressive about removing stop words, we would drop the tokens ranked 1 through 358. If we wanted to be less aggressive, we would perhaps choose a cut-off rank of 99. See the application of this idea to a corpus.
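One way to automate this eyeballing is to find where the curve first settles onto its plateau. This is a sketch under my own assumptions, not the post's method: the function name `stopword_cutoff`, the tolerance `tol`, and the use of the median as a crude plateau estimate are all inventions for illustration.

```python
import math
from statistics import median

def stopword_cutoff(ranked_counts, tol=0.15):
    """Given token counts sorted by decreasing frequency, return the
    smallest rank at which log10(freq * rank) falls within `tol` of
    the plateau level (crudely estimated by the median of the curve).
    Tokens ranked above the returned rank are stop-word candidates."""
    total = sum(ranked_counts)
    ys = [math.log10((c / total) * r)
          for r, c in enumerate(ranked_counts, start=1)]
    plateau = median(ys)
    for r, y in enumerate(ys, start=1):
        if abs(y - plateau) <= tol:
            return r
    return 1

# Two inflated top counts grafted onto an otherwise Zipfian tail
counts = [5000, 3000] + [1000 // r for r in range(3, 100)]
cutoff = stopword_cutoff(counts)
```

On this toy input the first two ranks sit well above the plateau and rank 3 is the first on it, so the two most frequent tokens would be dropped as stop words.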

For the full corpus of all texts, we should also expect deviations from linearity at the low-frequency end of the plot, and this can be used to find a cut-off for the rare words.

Conclusion for stop and rare words

For the corpus of texts under consideration:
  1. Do a Cummings (lowercase everything) and stem the words in the texts
  2. Count the frequencies of occurrence of all the word tokens in the corpus, and rank the word tokens in decreasing order of frequency.
  3. Plot log(frequency * rank) vs. log(rank) of the word tokens – the Zipf Plot for the corpus.
  4. Look for deviations from the expected straight line behaviour and choose a cut-off rank at both ends.
  5. The set of stop words comprises the word tokens ranked above (more frequent than) the corresponding cut-off, and the rare words are those ranked below (less frequent than) theirs.
  6. Remove these sets of stop and rare words from all texts under consideration.
  7. The resulting truncated histograms are the word-histograms for the texts that one should then proceed to use for further numerical text analysis.
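The recipe can be sketched end to end in a few lines of Python. Everything here is illustrative: the function name, the cut-off arguments (which in practice come from eyeballing the Zipf plot of step 3, which the code skips), and the crude trailing-'s' stripper standing in for a real stemmer such as Porter's.

```python
from collections import Counter

def word_histograms(texts, stop_cutoff, rare_cutoff):
    """Steps 1-2 and 4-7 of the recipe above; the Zipf plot of step 3
    is where stop_cutoff and rare_cutoff actually come from."""
    def tokenize(text):
        # step 1: do a Cummings (lowercase) and "stem" very crudely
        return [w[:-1] if w.endswith('s') else w
                for w in text.lower().split()]

    # step 2: corpus-wide frequencies, ranked by decreasing count
    tokenized = [tokenize(t) for t in texts]
    corpus_counts = Counter(tok for toks in tokenized for tok in toks)
    ranked = [w for w, _ in corpus_counts.most_common()]

    # steps 4-5: stop words above stop_cutoff, rare words below rare_cutoff
    drop = set(ranked[:stop_cutoff]) | set(ranked[rare_cutoff:])

    # steps 6-7: truncated per-text histograms
    return [Counter(w for w in toks if w not in drop)
            for toks in tokenized]

hists = word_histograms(["The cats sat", "The dogs sat on the mat"],
                        stop_cutoff=1, rare_cutoff=6)
```

With these toy cut-offs only 'the' (rank 1) is dropped, and each text's remaining tokens form its truncated histogram.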

    Next: Word Histograms
