Guide to All Text Analysis Posts

Prev: Constructing Word Histograms

This is not serious stuff, so indulge me here. I am just playing with a cute idea which probably has zipf practical applications.

Read this after reading about the construction of word histogram space.

Since I'll be analysing text, and writing text to do so, let me distinguish the text to be analysed by putting it on a gray background. For example, the result of the action of the lower-casing operator on some text can be shown as

**low**:

> ThiS is tHe TeXt fOR makiNG LowER case

=

> this is the text for making lower case
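A minimal sketch of this operator in Python (the function name `low` is mine, chosen to match the notation above):

```python
def low(text: str) -> str:
    """The lower-casing operator: maps a text to its lower-cased version."""
    return text.lower()

print(low("ThiS is tHe TeXt fOR makiNG LowER case"))
# -> this is the text for making lower case
```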

Before proceeding I just want to point out that it is not really the *text* I am interested in so much as the histogram of the text. So

> LOWER case

and

> case LOWER

are the same as vectors or word-histograms, as are:

> text text text

=

3 *

> text
*Eigen Wha?*: Just to refresh our memories, a linear operator on a vector space acts on any vector and gives you back another vector in the space. If the result of the action of the operator on a specific vector is to return that same vector multiplied by a scalar, that vector is called an **eigenvector** of the operator and the corresponding scalar is an **eigenvalue** of the operator. The determinant of the operator is the product of its eigenvalues. The eigenvalues are the solutions of det(O − λI) = 0, where O is the operator and I the identity operator.

An eigenvector of the **low** operator with eigenvalue 1 is

> upper

since

**low**:

> upper

= 1 *

> upper

Let's just look at the **low** operator in a bit more detail. If I restrict my word space to only 'a', both lower and upper case, then it is two-dimensional. Let me represent text in this space by the ordered pair of the number of occurrences of 'a' and 'A', (*x*, *y*). So the text

> a A A a a

is represented by (3, 2). Now the action of the lowering operator is **low**(*x*, *y*) = (*x* + *y*, 0), and the eigenvector with eigenvalue 1 is (1, 0). However, if you calculate the determinant of **low**, it turns out to be 0! So we know that it has an eigenvector with eigenvalue 0. In the above representation, this is (1, −1). But what does '−1' occurrences mean? We *know* that a word can't occur −1 times in a text! Or, do we?

My

> word

but it can!
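Indeed, in the two-dimensional (*x*, *y*) representation, **low** is just the matrix [[1, 1], [0, 0]], and NumPy happily hands back the negative-count eigenvector (a sketch; the variable names are mine):

```python
import numpy as np

# low(x, y) = (x + y, 0), written as a matrix acting on column vectors (x, y).
low = np.array([[1.0, 1.0],
                [0.0, 0.0]])

print(np.linalg.det(low))   # determinant is 0
vals, vecs = np.linalg.eig(low)
print(vals)                 # eigenvalues 1 and 0 (in some order)

# The 0-eigenvector (1, -1): one 'a' and minus one 'A'.
v = np.array([1.0, -1.0])
print(low @ v)              # -> [0. 0.]
```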

So what are the rules? Well, there is only one:

> word

+

> word

=

> word word

= 2 *

> word
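The rule "addition is concatenation" is easy to state on histograms: add the counts word by word. A quick sketch (again using `Counter` as my representation):

```python
from collections import Counter

def histogram(text: str) -> Counter:
    """Word histogram: maps each word to its number of occurrences."""
    return Counter(text.split())

# Adding two texts = concatenating them = adding their histograms.
lhs = histogram("word") + histogram("word")
rhs = histogram("word word")
print(lhs == rhs)                   # -> True
print(rhs == Counter({"word": 2}))  # -> True
```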

So the 0-eigenvalue eigenvector of **low** is the following text!

> a A

When we *read* text like this, we are only sensitive to the magnitude. What we've accomplished is that we have extended word histogram space (a linear space, but only in the positive 2^*n*-ant) to be a genuine, full-fledged, authentic VECTOR space! This raises the possibility (which, since I am a self-acknowledged ignoramus in this field, I am sure has already been explored) of using vector space and operator analysis to look at text. However, I doubt that there is anything new there beyond what is already known from standard statistical text analysis.

Next: Cosine Theta, Murdabad!
