Guide to all Text Analysis posts
Prev: Selecting Stop Tokens
Word Histograms, or Hamlet {question: 1}
Consider the space of word-vectors:
- If there are D different word tokens in the entire collection of texts under consideration, then word-vector space is D-dimensional.
- It is a linear space, but it is not exactly a vector space: though there is an origin or 0-vector (the null text), word-vectors do not have additive inverses that I can interpret. Hence there is no subtraction defined on the space of word-vectors, and word-vector space is the D-dimensional positive 1/2^D-ant (the generalization of the quadrant and the octant).
- Two texts are equivalent if they have the same word histogram, i.e. the same distribution of the frequencies of occurrence of the stemmed, non-stop word-tokens. Word-vectors are then equivalence classes of texts.
  For example, of the following 3 texts, Text A and Text B are equivalent to each other, but not to Text C.

  Text A: This “teXt” is equivalent to the following text, Not the preceding one?
  Text B: The preceding text IS equivalent to “this one”, but not to the following texts.
  Text C: “This text” is not the same as the preceding one, and has no following text.

  After removing punctuation, lowercasing, stemming and removing stop words, their histograms are
  A: {text: 2, preced: 1, follow: 1, equival: 1},
  B: {text: 2, preced: 1, follow: 1, equival: 1} and
  C: {text: 2, preced: 1, follow: 1, equival: 0}.
- Two word-vectors can be added to each other, and they can be scaled by positive integers. If we define the '+' of two texts to be the text formed by writing one text followed by the other, then vec(Text D) + vec(Text E) = vec(Text D '+' Text E).
- You can effectively divide a word-vector V by a positive integer q (and hence scale by positive rationals) by leaving V alone and scaling all the other word-vectors by q.
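The equivalence and addition properties above can be sketched in Python. The stop-word list and the crude suffix-stripping stemmer below are illustrative assumptions only; the post doesn't name a particular stemmer or stop list, and a real pipeline would use something like the Porter stemmer.

```python
import re
from collections import Counter

# Toy stop list and stemmer, for illustration only (assumptions, not the
# post's actual choices).
STOP_WORDS = {"this", "is", "to", "the", "not", "one", "but", "and",
              "has", "no", "as", "same"}

def stem(word):
    # Crude suffix stripping; a real implementation would use e.g. Porter's
    # algorithm.
    for suffix in ("ing", "ent", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def histogram(text):
    """Word histogram: lowercase, drop punctuation, drop stop words, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(w) for w in words if w not in STOP_WORDS)

a = 'This "teXt" is equivalent to the following text, Not the preceding one?'
b = 'The preceding text IS equivalent to "this one", but not to the following texts.'

print(histogram(a) == histogram(b))  # True: Texts A and B are equivalent

# Concatenating texts adds their word-vectors:
# vec(Text D) + vec(Text E) == vec(Text D '+' Text E)
print(histogram(a) + histogram(b) == histogram(a + " " + b))  # True
```

With this toy stemmer, histogram(a) comes out as {text: 2, equival: 1, follow: 1, preced: 1}, matching the histograms listed above (Counter simply omits zero-count entries such as C's equival: 0).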
If we normalize all word-vectors to unit magnitude (i.e. as vectors, so that the sum of the squares of the frequencies is 1), then the resulting space is the positive 1/2^D portion of the (D-1)-sphere, and texts are now points on this patch of the sphere.
What does text-space look like? I want you to imagine yourself at the origin, at the point corresponding to the null text. Small texts with few words are close by and large texts are further away. All 'normal' texts are in the positive 1/2^D-ant (which subtends π/2 radians in 2D, π/2 steradians in 3D, etc.). Similar texts lie along the same line of sight, and very dissimilar texts are at large angles to each other, up to a maximum of 90 degrees. Whether we've normalized the texts as vectors (on the unit sphere) or as histograms (on the unit simplex), the angles between them as viewed from the origin are the same.
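A minimal sketch of this line-of-sight picture: the angle between two word-vectors computed from raw counts. The vectors here are made-up illustrations; note that scaling a vector (e.g. normalizing it to unit magnitude) leaves all angles unchanged, which is why e below sits at the same angle as d.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two word-vectors (dicts of counts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (nu * nv)))))

d = {"text": 2, "preced": 1, "follow": 1, "equival": 1}
e = {"text": 4, "preced": 2, "follow": 2, "equival": 2}  # d scaled by 2
f = {"hamlet": 1, "question": 1}                         # disjoint vocabulary

print(angle_deg(d, e))  # ~0 degrees: same line of sight
print(angle_deg(d, f))  # 90 degrees: maximally dissimilar
```

Since all counts are non-negative, the dot product can never be negative, which is why 90 degrees is the maximum possible angle in the positive 1/2^D-ant.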
The word histogram for the first line of Hamlet's soliloquy at the beginning of Act 3 Scene 1 of the eponymous play is {question: 1}.
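This is easy to check, assuming a conventional English stop-word list that covers every other word of the line (actual stop lists vary by library, so the exact set below is an assumption):

```python
import re
from collections import Counter

# Assumed stop list covering every word of the line except "question".
STOP_WORDS = {"to", "be", "or", "not", "that", "is", "the"}

line = "To be, or not to be, that is the question:"
hist = Counter(w for w in re.findall(r"[a-z]+", line.lower())
               if w not in STOP_WORDS)
print(hist)  # Counter({'question': 1})
```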
This still needs only fairly simple undergrad or even high-school CS skills (judging by the really smart interns at LinkedIn in the summer of 2011), and perhaps a graduate numerical text analysis course.
Next: An aside: Word Play
or see the word histogram for the corpus