Numerical Text Analysis 3
Guide to All Text Analysis Posts
Prev: Text Analysis with Illustrative Example
So if there isn't a canonical set (or even a canonical number) of stop words, how should one choose the cut-off? For that, let's consider another, more numerical reason to drop stop words.
In numerical text analysis there is a law that captures the general distribution of word frequencies in texts. For any text or corpus, rank all the word tokens in decreasing order of frequency, so that 'the' is ranked 1 (relative frequency about 0.05) and something like 'zitterbewegung' (relative frequency of order 10^-9 even within physics texts) is ranked 250,000. Zipf's law essentially states that the product of the rank and the frequency is constant, i.e. independent of rank. Mandelbrot (who got his start in science doing numerical text analysis, until he tried to analyse Jorge Luis Borges' works and switched to chaos theory) modified the law so that the exponent of frequency as a function of rank is less than -1, and accounted for deviations near the two ends: roughly, frequency ∝ 1/(rank + b)^a with a somewhat greater than 1. In any case, we then expect a plot of log10(frequency * rank) vs. log10(rank) to approximate a straight line. So let's look at a sample Zipf plot for Alice:
[Figure: Zipf plot for the original text of Alice in Wonderland]
Note that the way I am plotting the data is slightly different from Zipf's. Typically (meaning every instance I've seen so far) log(frequency) is plotted vs. log(rank), and one looks for a slope of -1. A slope of -1 implies frequency * rank = constant, so if I instead plot log(frequency * rank), data that satisfies Zipf's law should appear as a flat line, and deviations from Zipf's law are magnified.
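As an aside, here is a minimal sketch of how such a plot can be produced. The filename alice.txt is an assumption, and the crude regex tokeniser stands in for the lowercasing-plus-stemming used for the actual plot, so the numbers will differ slightly.

```python
# Minimal Zipf plot sketch: rank tokens by frequency and plot
# log10(relative frequency * rank) against log10(rank).
import re
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

with open("alice.txt", encoding="utf-8") as f:          # assumed filename
    tokens = re.findall(r"[a-z']+", f.read().lower())   # crude tokeniser

counts = Counter(tokens)

# Rank 1 is the most frequent token; relative frequency = count / total tokens.
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float) / len(tokens)
ranks = np.arange(1, len(freqs) + 1)

# Zipf's law predicts this curve is flat as a function of log10(rank).
plt.plot(np.log10(ranks), np.log10(freqs * ranks), ".")
plt.xlabel("log10(rank)")
plt.ylabel("log10(relative frequency * rank)")
plt.title("Zipf plot")
plt.show()
```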
"What's with the discontinuities?". The text has only 50K words and, after stemming, 2,000 tokens. There are a very large number of words (~700) that occur only once in the text. As we go up this (unsorted) subset, the frequency remains the same but the numerical rank keeps reducing, until we get to the top end and enter the subset of words which occur twice: at this point the rank only changes by '1', but the frequency doubles, adding the 0.3 on the log10 scale. Note that this discontinuity is almost half the y-axis range! For a large corpus the number of genuine words that occur only a few times will reduce and their rank will be pushed to larger numbers on a wider y-axis range, so even though the first discontinuity will still be log10(2), the plot will appear smoother.
"What's with the discontinuities?". The text has only 50K words and, after stemming, 2,000 tokens. There are a very large number of words (~700) that occur only once in the text. As we go up this (unsorted) subset, the frequency remains the same but the numerical rank keeps reducing, until we get to the top end and enter the subset of words which occur twice: at this point the rank only changes by '1', but the frequency doubles, adding the 0.3 on the log10 scale. Note that this discontinuity is almost half the y-axis range! For a large corpus the number of genuine words that occur only a few times will reduce and their rank will be pushed to larger numbers on a wider y-axis range, so even though the first discontinuity will still be log10(2), the plot will appear smoother.
We clearly see the deviations from the expected power-law behaviour at low ranks, i.e. for frequently occurring words. Roughly speaking, from the 100th token onwards a fit is an almost zero-slope line, whereas the more highly ranked (more frequent) tokens deviate from that behaviour. If we wanted to be aggressive about removing stop words, we would drop words ranked lower than 359; if we wanted to be less aggressive, we would perhaps choose a cut-off rank of 99. See the application of this idea to a corpus.
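In code, once the tokens are ranked, applying a chosen cut-off is just a slice. A small sketch, where `counts` is the Counter built for the Zipf plot above and the cut-off values are the ones eyeballed from the plot:

```python
from collections import Counter

def stop_words_by_rank(counts: Counter, cutoff_rank: int) -> set:
    """Tokens ranked above the cut-off, i.e. with rank < cutoff_rank."""
    ranked = [token for token, _ in counts.most_common()]
    return set(ranked[:cutoff_rank - 1])

# Less aggressive: drop ranks 1-98.   More aggressive: drop ranks 1-358.
# stop_words = stop_words_by_rank(counts, 99)
# stop_words = stop_words_by_rank(counts, 359)
```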
For the full corpus of all texts, we should also expect deviations from linearity at the low-frequency end of the plot, and this should be used to find a cut-off for the rare words.
Conclusion for stop and rare words
For the corpus of texts under consideration (a code sketch of the full recipe follows the list):
- Do a Cummings (convert everything to lowercase) and stem the words in the texts.
- Count the frequencies of occurrence of all the word tokens in the corpus, and rank the word tokens in decreasing order of frequency.
- Plot log(frequency * rank) vs. log(rank) of the word tokens – the Zipf Plot for the corpus.
- Look for deviations from the expected straight line behaviour and choose a cut-off rank at both ends.
- The set of stop words are the word tokens ranked above (i.e. more frequent than) the corresponding cut-off, and the rare words are those ranked below (less frequent than) the other cut-off.
- Remove these sets of stop and rare words from all texts under consideration.
- The resulting truncated histograms are the word histograms for the texts, which one should then use for further numerical text analysis.
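Putting the recipe together, here is a minimal end-to-end sketch. The file names, the choice of NLTK's PorterStemmer, and the two cut-off ranks are all placeholders; in practice the cut-offs come from inspecting the corpus Zipf plot as described above.

```python
import re
from collections import Counter

from nltk.stem.porter import PorterStemmer  # any stemmer will do here

stemmer = PorterStemmer()

def tokenize(text):
    """'Do a Cummings' (lowercase everything) and stem the words."""
    return [stemmer.stem(w) for w in re.findall(r"[a-z']+", text.lower())]

# 1. Tokenise every text in the (placeholder) corpus.
filenames = ["alice.txt", "looking_glass.txt"]
texts = {}
for name in filenames:
    with open(name, encoding="utf-8") as f:
        texts[name] = tokenize(f.read())

# 2. Count and rank all word tokens by corpus-wide frequency.
corpus_counts = Counter(tok for toks in texts.values() for tok in toks)
ranked = [tok for tok, _ in corpus_counts.most_common()]

# 3. Cut-offs chosen from the corpus Zipf plot (placeholder values).
stop_cutoff, rare_cutoff = 100, 2000
stop_words = set(ranked[:stop_cutoff - 1])   # too frequent
rare_words = set(ranked[rare_cutoff:])       # too rare

# 4. Truncated word histograms, one per text.
histograms = {
    name: Counter(t for t in toks if t not in stop_words and t not in rare_words)
    for name, toks in texts.items()
}
```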
Next: Word Histograms