Guide to All Text Analysis Posts
Prev: Constructing Word Histograms
This is not serious stuff, so indulge me here. I am just playing with a cute idea which probably has zipf practical applications.
Read this after reading about the construction of word histogram space.
Since I'll be analysing text, and writing text to do so, let me distinguish text to be analysed by putting it on a gray background. For example, the result of the action of the lower-casing operator on some text can be shown as

low: ThiS is tHe TeXt fOR makiNG LowER case
= this is the text for making lower case
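This operator is easy to sketch in code. A minimal illustration (my own, not from any library; low is just a name matching the text):

```python
# A minimal sketch of a lower-casing operator acting on text;
# "low" is just an illustrative name for Python's built-in str.lower.
def low(text: str) -> str:
    return text.lower()

print(low("ThiS is tHe TeXt fOR makiNG LowER case"))
# → this is the text for making lower case
```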
Before proceeding I just want to point out that it is not really the text I am interested in so much as the histogram of the text. So

LOWER case

and

case LOWER

are the same as vectors or word-histograms, as are

text text text = 3 * text
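These equivalences are easy to check if we model word histograms with Python's collections.Counter (a sketch under that modelling choice, not anything from the post itself):

```python
from collections import Counter

def histogram(text: str) -> Counter:
    """Word histogram: maps each word to its number of occurrences."""
    return Counter(text.split())

# Word order is invisible to the histogram:
assert histogram("LOWER case") == histogram("case LOWER")

# Repetition acts as scalar multiplication: "text text text" = 3 * "text"
assert histogram("text text text") == Counter({"text": 3})
print("both equivalences hold")
```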
Eigen Wha?: Just to refresh our memories, a linear operator on a vector space acts on any vector and gives you back another vector in the space. If the result of the action of the operator on a specific vector is to return that same vector multiplied by a scalar, that vector is called an eigenvector of the operator and the corresponding scalar is an eigenvalue of the operator. The determinant of the operator is the product of its eigenvalues. The eigenvalues are the solutions to det(Op - lambda * Identity) = 0.
An eigenvector of the low operator with eigenvalue 1 is
upper,

since

low: upper = 1 * upper
Let's just look at the low operator in a bit more detail. If I restrict my word space to only 'a', both lower and upper case, then it is two dimensional. Let me represent text in this space by the ordered pair of the number of occurrences of 'a' and 'A', (x, y).
So the text

a A A a a

is represented by (3, 2).
Now the action of the lowering operator is low(x, y) = (x + y, 0), and the eigenvector with eigenvalue 1 is (1, 0). However, if you calculate the determinant of low, it turns out to be 0! So we know that it has an eigenvector with eigenvalue 0. In the above representation, this is (1, -1).
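In the (x, y) representation, low is the matrix [[1, 1], [0, 0]], so its spectrum can be checked numerically; a sketch using NumPy:

```python
import numpy as np

# low(x, y) = (x + y, 0) as a matrix on (count of 'a', count of 'A')
M = np.array([[1.0, 1.0],
              [0.0, 0.0]])

# Determinant 0 forces a zero eigenvalue.
assert abs(np.linalg.det(M)) < 1e-12

eigvals, eigvecs = np.linalg.eig(M)
assert np.allclose(np.sort(eigvals), [0.0, 1.0])

# The eigenvector for eigenvalue 0 is proportional to (1, -1).
v = eigvecs[:, np.argmin(np.abs(eigvals))]
assert np.isclose(v[0] / v[1], -1.0)
```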
But what does '-1' occurrences mean? We know that a word can't occur -1 times in a text! Or, do we?

My word, but it can!
So what are the rules? Well, there is only one:

word + word = word word

From this,

word - word

is the empty text.
So the 0-eigenvalue eigenvector of low is the following text!

a -A
When we read text like this, we are only sensitive to the magnitude of each count, not its sign.
What we've accomplished is that we have extended word-histogram space (a linear space, but only in the positive 2^n-ant) to be a genuine, full-fledged, authentic VECTOR space!
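The extension itself is simple to model: just allow negative counts in the histogram. A sketch (low_hist is a hypothetical helper name, not from the post) showing that the signed histogram a - A is annihilated by low:

```python
from collections import Counter

def low_hist(hist: Counter) -> Counter:
    """Hypothetical helper: apply the lower-casing operator to a
    signed word histogram (negative counts allowed)."""
    out = Counter()
    for word, count in hist.items():
        out[word.lower()] += count
    return out

# The 0-eigenvalue eigenvector: one 'a' minus one 'A'.
v = Counter({"a": 1, "A": -1})

# low sends it to the zero vector: all counts cancel.
assert all(count == 0 for count in low_hist(v).values())
```

Note that Counter's own + and - operators silently drop non-positive counts, which is exactly the positive-orthant restriction of plain word-histogram space; keeping signed counts by hand, as above, is what makes it a genuine vector space.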
This raises the possibility (which, since I am a self-acknowledged ignoramus in this field, I am sure has already been explored) of using vector space and operator analysis to look at text. However, I doubt that there is anything new there beyond what is already known from standard statistical text analysis.
Next: Cosine Theta, Murdabad!