Tuesday, March 27, 2012

Guide to Posts on Text Analysis

The posts make most logical sense in the following order:

NTA0: Sample Text for Analysis

NTA1: Preliminaries

NTA2: Text Analysis Illustrative Example

NTA3: Selecting Stop Tokens

NTA4: Word Histograms

NTA5: WordPlay

NTA6: Word Histogram for the Corpus

NTA7: What Tokens to Use, Dimensional Sparseness


  1. Tate,

    All good stuff. Stemming is really cool - I wonder how Porter came up with it. I don't think it applies to other languages.

    About 5-6 years back, to understand search technologies better, I implemented a search engine. It's still active - you can try it out at Line Spout, www.linespout.com. For obvious reasons of resources and time, it's a "limited" search engine, in that it looks only at "new" content from a set of socially curated sources (aka real-time search).

    Apart from text analysis, search involves things like relevance ranking, accounting for social feedback, and relevant-advertisement auctions, all of which I implemented in Line Spout.

    In terms of text analysis alone, there are a few further "hard" problems:
    (a) conflations introduced by stemming - e.g., "Federer" and "Federal" stem to the same thing, so whenever I search for Federer on Line Spout, I get news about the Federal Govt. Resolving this needs a dictionary (probably socially curated). The dictionary can also come in handy for synonyms etc. for query expansion.
    (b) phrase match - needs a lot more computation and storage.
    (c) grep (which I have implemented - see pattern search under options) - can get really slow. An online MapReduce could help here (MapReduce is generally used on the offline indexing side).
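    The conflation in (a) is easy to reproduce. Here is a toy suffix-stripping stemmer - a deliberately simplified sketch of the idea, not the actual Porter algorithm - that shows how "Federer" and "Federal" collapse into the same index term:

    ```python
    # Toy suffix-stripper: a simplified illustration of stemming
    # conflation, NOT the real Porter algorithm.
    SUFFIXES = ("ation", "ing", "ed", "er", "al", "s")

    def toy_stem(word):
        """Strip the first matching suffix, keeping a stem of >= 3 letters."""
        word = word.lower()
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    # Both words reduce to the same term, so an index keyed on stems
    # cannot tell a tennis query from a government query:
    print(toy_stem("Federer"))  # feder
    print(toy_stem("Federal"))  # feder
    ```

    (The real Porter stemmer produces the same collision on this pair, which is exactly why a curated dictionary of protected terms helps.)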

    I am reasonably convinced that "dumb" index search is probably just as effective as any more sophisticated approach because
    (i) getting the best top-ten search results does not have a life-or-death, or even a financial-accounting, level of importance, and
    (ii) users retry searches with different queries until they get roughly what they want.

    My suggestion to you would be to focus on your new non-cosine approach, which can make matching the query to the content more robust, or on any other new "angles" that improve user experience and/or reduce implementation complexity - emphasis on new and different.


  2. Hi Ranjeet! It's so cool that you're getting into this field. I do similar work but from a much more 'symbolic', as opposed to 'statistical', vantage point.

    I just read the first post and it made me wonder if you've ever read anything on formal language theory. Here's a primer: http://www.helsinki.fi/esslli/courses/readers/K10.pdf

    Also, since you mention Python, there is an old book called Text Processing in Python by David Mertz that you might find interesting. Mertz favors a *very* functional approach to text processing (his Python looks as much like Haskell/ML as anything I've seen). The whole book is online for free here: http://gnosis.cx/TPiP/
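    To give a flavor of that functional style (this is my own illustration, not code from the book): text processing becomes a pipeline of small pure functions composed together, with no mutable state threaded through.

    ```python
    # A functional-style word-histogram pipeline: tokenize, count,
    # and take the top-N, expressed as composed pure functions.
    import re
    from collections import Counter
    from functools import reduce

    tokenize = lambda text: re.findall(r"[a-z']+", text.lower())
    count = lambda tokens: Counter(tokens)
    top = lambda n: lambda hist: hist.most_common(n)

    def compose(*fns):
        """Right-to-left function composition: compose(f, g)(x) == f(g(x))."""
        return reduce(lambda f, g: lambda x: f(g(x)), fns)

    top_words = compose(top(2), count, tokenize)
    print(top_words("the cat sat on the mat and the cat slept"))
    # [('the', 3), ('cat', 2)]
    ```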

    Also, you might find NLTK useful: http://www.nltk.org/