Monday, March 26, 2012

NTA5: WordPlay

Numerical Text Analysis 5

Guide to All Text Analysis Posts

Prev: Constructing Word Histograms

This is not serious stuff, so indulge me here. I am just playing with a cute idea which probably has zipf practical applications.
Read this after reading about the construction of word histogram space.

Since I'll be analysing text, and writing text to do so, let me distinguish text to be analysed by putting it on a gray background. For example, the result of the action of the lower-casing operator on some text can be shown as
low:
ThiS is tHe TeXt fOR makiNG LowER case
=
this is the text for making lower case
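In case it helps to see the operator concretely, here is a minimal sketch in Python (my choice of language here; the function name low is just borrowed from the notation above):

```python
# The lower-casing operator "low", acting directly on raw text
def low(text):
    return text.lower()

print(low("ThiS is tHe TeXt fOR makiNG LowER case"))
# → this is the text for making lower case
```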

Before proceeding, I just want to point out that it is not really the text I am interested in so much as the histogram of the text. So
LOWER case
and
case LOWER
are the same as vectors or word-histograms, as are:
text text text
=
3 *
text
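As a sketch of this (assuming Python and whitespace-split words), a word histogram can be built with collections.Counter, which already forgets word order:

```python
from collections import Counter

def histogram(text):
    """Word histogram of a text, splitting on whitespace."""
    return Counter(text.split())

# Word order does not matter in histogram space:
assert histogram("LOWER case") == histogram("case LOWER")

# Repetition acts like scalar multiplication:
assert histogram("text text text") == Counter({"text": 3})
```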

Eigen Wha?: Just to refresh our memories: a linear operator on a vector space acts on any vector and gives you back another vector in the space. If the action of the operator on a specific vector returns that same vector multiplied by a scalar, that vector is called an eigenvector of the operator, and the corresponding scalar is an eigenvalue of the operator. The determinant of the operator is the product of its eigenvalues, and the eigenvalues are the solutions of det(Op − λ·Identity) = 0.

An eigenvector of the low operator with eigenvalue 1 is
upper
, since
low:
upper
= 1*
upper

Let's just look at the low operator in a bit more detail. If I restrict my word space to only 'a', both lower and upper case, then it is two-dimensional. Let me represent text in this space by the ordered pair (x, y) of the number of occurrences of 'a' and of 'A'. So the text
a A A a a
is represented by (3,2). Now the action of the lowering operator is low(x,y) = (x+y,0), and the eigenvector with eigenvalue 1 is (1,0). However, if you calculate the determinant of low, it turns out to be 0, so we know that it has an eigenvector with eigenvalue 0. In the above representation, this is (1,-1). But what does '-1' occurrences mean? We know that a word can't occur -1 times in a text!
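These claims are easy to check numerically; a sketch using numpy, writing low as a matrix in the ('a', 'A') basis:

```python
import numpy as np

# low(x, y) = (x + y, 0) written as a matrix in the ('a', 'A') basis
low = np.array([[1.0, 1.0],
                [0.0, 0.0]])

print(np.linalg.det(low))          # determinant is 0
evals, evecs = np.linalg.eig(low)  # eigenvalues are 1 and 0

# (1, -1) is annihilated by low: the 0-eigenvalue eigenvector
print(low @ np.array([1.0, -1.0]))  # → [0. 0.]
```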

Or, do we?
My word but it can

So what are the rules? Well, there is only one: a word and its negative cancel,
word
+
-word
=
(the empty text)
So the 0-eigenvalue eigenvector of low is the following text!
a -A

When we read text like this, we are only sensitive to the magnitude of each count, not its sign.
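A signed histogram can be sketched in Python with collections.Counter, which happily holds negative counts; applying low to the text above then gives the zero vector, as an eigenvalue of 0 demands:

```python
from collections import Counter

def low_hist(hist):
    """Apply the lowering operator to a signed word histogram."""
    out = Counter()
    for word, n in hist.items():
        out[word.lower()] += n
    return dict(out)

# The 0-eigenvector of low: one 'a' plus minus-one 'A'
eigvec = Counter({"a": 1, "A": -1})
print(low_hist(eigvec))  # → {'a': 0}, the zero vector
```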

What we've accomplished is to extend word-histogram space (a linear space, but one confined to the positive orthant) to a genuine, full-fledged, authentic VECTOR space!

This raises the possibility (which, since I am a self-acknowledged ignoramus in this field, I am sure has already been explored) of using vector space and operator analysis to look at text. However, I doubt that there is anything new there beyond what is already known from standard statistical text analysis.

In any case, I hope this was as good for you as it was for me. 

Word Histograms for the Corpus 

Next: Cosine Theta, Murdabad!
