theta_etc: NTA4: WordHistogramsOrHamlet{question:1}

Numerical Text Analysis 4

Guide to All text Analysis posts

Prev: Selecting Stop Tokens

WordHistogramsOrHamlet{question:1}

Consider the space of word-vectors:

If there are D different word tokens in the entire collection of texts under consideration, then word-vector space is D-dimensional.

It behaves like a linear space, but it is not exactly a vector space: though there is an origin or 0-vector (the null-text), word-vectors do not have additive inverses that I can interpret, so no subtraction is defined on the space of word-vectors. Hence word-vector space is only the positive orthant of D-dimensional space, the positive 1/2^D-ant: the 1/2^D fraction of the space in which every coordinate is non-negative.

Two texts are equivalent if they have the same word histogram, i.e. the same distribution of the frequencies of occurrence of the stemmed, non-stop word-tokens. Word-vectors are then equivalence classes of texts.

For example, of the following 3 texts, Text A and B are equivalent to each other, but not to Text C.

Text A: This “teXt” is equivalent to the following text, Not the preceding one?

Text B: The preceding text IS equivalent to “this one”, but not to the following texts.

Text C: “This text” is not the same as the preceding one, and has no following text.

After removing punctuation, lowercasing, stemming and removing stop words, their histograms are

A: {text: 2, preced: 1, follow: 1, equival: 1},

B: {text: 2, preced: 1, follow: 1, equival: 1} and

C: {text: 2, preced: 1, follow: 1, equival: 0}.
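The equivalence of Text A and Text B can be checked with a short Python sketch. The stop-word list and the suffix-stripping `toy_stem` below are assumptions made up for this example; they are just enough to reproduce the three histograms above, and a real pipeline would use a properly selected stop-token list and a real stemmer.

```python
import re
from collections import Counter

# Hypothetical stop-word list: just enough for these three texts.
STOP_WORDS = {"this", "is", "to", "the", "not", "one", "but",
              "same", "as", "and", "has", "no"}

def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ent", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def word_histogram(text):
    """Lowercase, drop punctuation, drop stop words, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOP_WORDS)

text_a = 'This "teXt" is equivalent to the following text, Not the preceding one?'
text_b = 'The preceding text IS equivalent to "this one", but not to the following texts.'
text_c = '"This text" is not the same as the preceding one, and has no following text.'

hist_a = word_histogram(text_a)
hist_b = word_histogram(text_b)
hist_c = word_histogram(text_c)

print(hist_a == hist_b)   # A and B fall in the same equivalence class
print(hist_a == hist_c)   # C does not: it has no 'equival' token
```

A `Counter` reports a count of 0 for any token it has never seen, which matches the `equival: 0` entry in the histogram for Text C.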

Two word-vectors can be added to each other, and they can be scaled by positive integers. If we define the '+' of two texts to be the text formed by writing one text followed by the other, then

vec(Text D) + vec(Text E) = vec(Text D '+' Text E)

You can effectively divide a word-vector V by a positive integer q (and hence scale by positive rationals) by leaving V alone and scaling all the other word-vectors by q.
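As a minimal sketch of addition and integer scaling, word-vectors can be modelled as Python `Counter` objects. The helper names `vec`, `plus` and `scale` are hypothetical, and the tokenizer is deliberately naive (whitespace-separated lowercase words, no stemming or stop-word removal), which is an assumption of this example.

```python
from collections import Counter

def vec(text):
    """Word-vector of a text. Assumption: tokens are whitespace-separated
    lowercase words, with no stemming and no stop-word removal."""
    return Counter(text.lower().split())

def plus(text_d, text_e):
    """The '+' of two texts: one text written after the other."""
    return text_d + " " + text_e

def scale(vector, k):
    """Scale a word-vector by a positive integer k."""
    return Counter({word: k * count for word, count in vector.items()})

text_d = "to be or not to be"
text_e = "that is the question"

# vec(Text D) + vec(Text E) = vec(Text D '+' Text E)
print(vec(text_d) + vec(text_e) == vec(plus(text_d, text_e)))

# Scaling by 3 gives the same vector as writing the text three times.
print(scale(vec(text_d), 3) == vec(" ".join([text_d] * 3)))
```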

If we normalize all word-vectors to unit magnitude (i.e. as vectors, so the sum of the squares of the frequencies is 1), then the resulting space is the positive 1/2^D portion of the (D-1)-sphere, and texts are now points on this patch of the sphere.

However, we could also normalize the word-histograms, so that the sum of the frequencies themselves (not the sum of their squares) is 1. In this case the resulting space is the (D-1)-simplex, whose vertices (1, 0, 0, …), (0, 1, 0, …), … lie on the unit sphere and which otherwise sits inside it, passing closest to the origin at (1/D, 1/D, 1/D, …). Note that the space of probabilities for the occurrence of D mutually exclusive events is also a (D-1)-simplex, and that word-histograms have a natural probabilistic interpretation but not an obvious interpretation as vectors. So in most cases we will prefer to live on the simplex and not the sphere.
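The two normalizations can be sketched in a few lines of Python. The function names are hypothetical, and histograms are represented as plain dicts from token to frequency, which is an assumption of this example rather than anything prescribed in this series.

```python
import math

def normalize_as_vector(hist):
    """Scale so the sum of the squared frequencies is 1 (unit sphere)."""
    norm = math.sqrt(sum(c * c for c in hist.values()))
    return {w: c / norm for w, c in hist.items()}

def normalize_as_histogram(hist):
    """Scale so the frequencies themselves sum to 1 (unit simplex)."""
    total = sum(hist.values())
    return {w: c / total for w, c in hist.items()}

hist = {"text": 2, "preced": 1, "follow": 1, "equival": 1}

on_sphere = normalize_as_vector(hist)
on_simplex = normalize_as_histogram(hist)

# Up to floating-point rounding, both of these print 1.
print(sum(x * x for x in on_sphere.values()))
print(sum(on_simplex.values()))
```

On the simplex each entry reads directly as a probability: here 'text' gets 2/5, the chance that a randomly chosen non-stop token of the text is 'text'.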

What does text-space look like? I want you to imagine yourself at the origin, at the point corresponding to the null-text. Small texts with few words are close by and large texts are further away. All 'normal' texts are in the positive 1/2^D-ant (π/2 radians in 2D, π/2 steradians in 3D, etc.). Similar texts are along the same line of sight and very dissimilar texts will be at large angles to each other, up to a maximum of 90 degrees. Whether we've normalized the texts as vectors (on the unit sphere) or as histograms (on the unit simplex), the angles between them as viewed from the origin are the same.
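The angle between two texts as seen from the origin can be computed from the cosine of their word-vectors. The sketch below uses hypothetical helper names and assumes histograms are dicts from token to frequency and that neither text is the null-text (whose zero length would cause a division by zero).

```python
import math

def angle_degrees(u, v):
    """Angle between two non-empty word-vectors, viewed from the origin."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def to_simplex(hist):
    """Normalize as a histogram: frequencies sum to 1."""
    total = sum(hist.values())
    return {w: c / total for w, c in hist.items()}

hist_a = {"text": 2, "preced": 1, "follow": 1, "equival": 1}
hist_c = {"text": 2, "preced": 1, "follow": 1}
hamlet = {"question": 1}

print(angle_degrees(hist_a, hist_c))   # similar texts: a small angle
print(angle_degrees(hist_a, hamlet))   # no shared tokens: a right angle

# Normalizing onto the simplex rescales each vector but not the angle.
print(angle_degrees(to_simplex(hist_a), to_simplex(hist_c)))
```

Because all frequencies are non-negative, the dot product is never negative, which is why the angle can never exceed 90 degrees.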

The word histogram for the first line of Hamlet's soliloquy at the beginning of Act 3 Scene 1 of the eponymous play is {question: 1}.

All of this still needs only fairly simple undergrad or even high school CS skills (based on the really smart interns at LinkedIn during the summer of 2011), and perhaps a graduate numerical text analysis course.

Next: An aside Word Play
or see the word histogram for the corpus


Thursday, March 22, 2012
