May 30, 2011

Data mining, text mining, topic modeling

From today's NY Times - http://opinionator.blogs.nytimes.com/2011/05/29/of-monsters-men-and-topic-modeling/

"Topic modeling is a probabilistic, statistical method that can uncover themes and categories in amounts of text so large that they cannot be read by any individual human being."

Researchers at the Digital Scholarship Lab at the University of Richmond have applied topic modeling techniques to articles published in the Richmond Daily Dispatch from 1861 through 1864. They found that certain article topics cluster at particular points in time during the Civil War. The purpose of these topics was to motivate Southerners to join the Confederate Army and risk their lives and families. Among other things, the researchers found an association between anti-Northern diatribes and articles on patriotism and poetry.

"The remarkable similarity in the signatures of these two topics is striking. If one appeared relatively frequently in the paper at any given moment, the other would appear frequently too. The close association of these two topics — which would be difficult, maybe impossible, to see without this topic model — suggests that they were always two sides of the same coin ...

... those appeals needed to be made at particular points in time. We see the steady rise of patriotic poetry and vitriolic attacks during the secession crisis and the early months of the war. Later we see the sharp jump following the implementation of the draft in April 1862. And we see the last gasp of Confederate patriotism and nationalism at the very end of the war, as Southerners made one final attempt to rally the troops and salvage their cause.

These were the moments of great stress for the Confederate army: as it was just being built, as it struggled for adequate manpower, and as it faced final defeat."
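
One way to put a number on the "similar signatures" idea in the quote is to correlate the two topics' shares of the paper over time. The sketch below is only an illustration, not the researchers' actual method: the monthly shares are made-up values (not the Dispatch data), and the `pearson` helper and the variable names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up monthly topic shares, for illustration only (NOT the Dispatch data).
anti_northern = [0.02, 0.05, 0.04, 0.09, 0.03, 0.08]
patriotic_poetry = [0.01, 0.04, 0.03, 0.07, 0.02, 0.06]
unrelated_topic = [0.05, 0.01, 0.06, 0.02, 0.05, 0.01]

# Two topics that rise and fall together score close to +1.
print(round(pearson(anti_northern, patriotic_poetry), 2))
# A topic that moves differently scores much lower.
print(round(pearson(anti_northern, unrelated_topic), 2))
```

A value near +1 would mean the two topics tend to show up in the paper at the same moments, which is the pattern the quoted passage describes.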

All from data/text mining.....