Guide to All Text Analysis posts

Prev: Selecting Stop Tokens

Word Histograms, or Hamlet {question: 1}

Consider the space of word-vectors:

- If there are D different word tokens in the entire collection of texts under consideration, then word-vector space is D-dimensional.

- It is a linear space, but it is not exactly a vector space: though there is an origin or 0-vector (the null text), word-vectors do not have additive inverses that I can interpret, so there is no subtraction defined on the space of word-vectors. Hence word-vector space is the D-dimensional positive 1/2^D-ant.

- Two texts are equivalent if they have the same word histogram, i.e. the same distribution of the frequencies of occurrence of the stemmed, non-stop word-tokens. Word-vectors are then equivalence classes of texts.

For example, of the following 3 texts,
Text A and B are equivalent to each other, but not to Text C.

Text A: This “teXt” is equivalent
to the following text, Not the preceding one?

Text B: The preceding text IS
equivalent to “this one”, but not to the following texts.

Text C: “This text” is not the same
as the preceding one, and has no following text.

After removing punctuation, lowercasing, stemming and removing stop words, their histograms are

A: {text: 2, preced: 1, follow: 1, equival: 1},

B: {text: 2, preced: 1, follow: 1, equival: 1} and

C: {text: 2, preced: 1, follow: 1, equival: 0}.
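These histograms can be reproduced in a few lines of Python. The stop list and the suffix-stripping "stemmer" below are toy stand-ins chosen to match the example (a real pipeline would use a proper stemmer such as Porter's), so treat this as a sketch rather than a reference implementation:

```python
import re
from collections import Counter

# Toy stop list -- just enough to cover the function words in the example.
STOP = {"this", "is", "to", "the", "not", "one", "but",
        "as", "and", "has", "no", "same"}

def stem(word):
    # Toy suffix stripping; a real pipeline would use e.g. a Porter stemmer.
    for suffix in ("ing", "ent", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def histogram(text):
    # Strip punctuation and lowercase, then drop stop words and stem.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP)

a = 'This "teXt" is equivalent to the following text, Not the preceding one?'
b = 'The preceding text IS equivalent to "this one", but not to the following texts.'
c = '"This text" is not the same as the preceding one, and has no following text.'

# Counter omits zero counts, so C simply has no "equival" key.
print(histogram(a) == histogram(b))   # True
print(histogram(a) == histogram(c))   # False
```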

- Two word-vectors can be added to each other, and they can be scaled by positive integers. If we define the '+' of two texts to be the text formed by writing one text followed by the other, then vec(Text D) + vec(Text E) = vec(Text D '+' Text E).
- You can effectively *divide* a word-vector V by a positive integer *q* (and hence scale by positive rationals) by leaving V alone and scaling all the rest by *q*.
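The addition and scaling rules are easy to check with `collections.Counter`, which implements exactly this histogram addition. The minimal tokenizer here deliberately skips stemming and stop words, since they don't affect the algebra:

```python
import re
from collections import Counter

def histogram(text):
    # Minimal tokenizer: lowercase word tokens only.
    return Counter(re.findall(r"[a-z]+", text.lower()))

d = "the cat sat"
e = "the cat ran"

# vec(D) + vec(E) == vec(D '+' E), i.e. the histogram of the concatenation.
print(histogram(d) + histogram(e) == histogram(d + " " + e))   # True

# Scaling by a positive integer q is the same as concatenating q copies.
tripled = histogram(" ".join([d] * 3))
print(tripled == Counter({k: 3 * v for k, v in histogram(d).items()}))   # True
```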

If we normalize all word-vectors to unit magnitude (i.e. as vectors, so that the sum of the squares of the frequencies is 1), then the resulting space is the positive 1/2^D-ant of the (D-1)-sphere, and texts are now points on this piece of the sphere. Alternatively, we can normalize word-vectors as histograms, so that the sum of the frequencies themselves, *not* the sum of their squares, is 1. In this case the resulting space is the (D-1)-simplex, which is tangent to the sphere along the (1, 1, 1, …) direction. Note that the space of probabilities for the occurrence of D mutually exclusive events is a (D-1)-simplex, and that word-histograms have a natural probabilistic interpretation but not an obvious interpretation as vectors. So in most cases we will prefer to live on the simplex and not the sphere.
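A minimal sketch of the two normalizations, applied to the histogram of Text A from the example above:

```python
import math
from collections import Counter

hist = Counter({"text": 2, "preced": 1, "follow": 1, "equival": 1})

# Normalize as a vector: divide by the 2-norm, so the sum of the
# squared frequencies becomes 1 (a point on the unit sphere).
norm2 = math.sqrt(sum(v * v for v in hist.values()))   # sqrt(7)
on_sphere = {k: v / norm2 for k, v in hist.items()}

# Normalize as a histogram: divide by the 1-norm (the total word count),
# so the sum of the frequencies becomes 1 (a point on the simplex).
# This is just the empirical word distribution of the text.
norm1 = sum(hist.values())                             # 5
on_simplex = {k: v / norm1 for k, v in hist.items()}

print(on_simplex["text"])   # 0.4
```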

What does text-space look like? I want you to imagine yourself at the origin, at the point corresponding to the null text. Small texts with few words are close by and large texts are further away. All 'normal' texts are in the positive 1/2^D-ant (π/2 radians in 2D, π/2 steradians in 3D, etc.). Similar texts lie along the same line of sight, and very dissimilar texts are at large angles to each other, up to a maximum of 90 degrees. Whether we've normalized the texts as vectors (on the unit sphere) or as histograms (on the unit simplex), the angles between them *as viewed from the origin* are the same.
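This invariance is easy to verify: the cosine of the angle at the origin is unchanged when either vector is scaled by any positive constant, so it doesn't matter which normalization we pick. A small sketch, using the histograms of Texts A and C from the example above:

```python
import math

def cosine(u, v):
    # Cosine of the angle at the origin between two word-vectors,
    # treating missing words as zero frequency.
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def scale(h, s):
    return {k: v * s for k, v in h.items()}

a = {"text": 2, "preced": 1, "follow": 1, "equival": 1}
c = {"text": 2, "preced": 1, "follow": 1}

raw = cosine(a, c)
# Sphere normalization: divide each by its 2-norm (sqrt(7) and sqrt(6)).
on_sphere = cosine(scale(a, 1 / math.sqrt(7)), scale(c, 1 / math.sqrt(6)))
# Simplex normalization: divide each by its 1-norm (5 and 4 words).
on_simplex = cosine(scale(a, 1 / 5), scale(c, 1 / 4))

# All three agree: the angle from the origin is normalization-independent.
```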

The word histogram for the first line of Hamlet's soliloquy at the beginning of Act 3 Scene 1 of the eponymous play is {question: 1}.
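This is straightforward to check, assuming a stop list that covers the function words in the line (the stop list below is an illustrative assumption, not a canonical one):

```python
import re
from collections import Counter

# Assumed stop list -- every word in the line except "question".
STOP = {"to", "be", "or", "not", "that", "is", "the"}

line = "To be, or not to be, that is the question:"
hist = Counter(w for w in re.findall(r"[a-z]+", line.lower()) if w not in STOP)
print(hist)   # Counter({'question': 1})
```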

This still requires only fairly simple undergrad or even high-school CS skills (calibrated against the really smart interns at LinkedIn during the summer of 2011), and perhaps a graduate numerical text analysis course.

Next: An aside, Word Play

or see the word histogram for the corpus
