Data Mining vs. Text Mining for Business Applications

I think I’m hung up on semantics again. In this era of connected systems, businesses rely heavily on knowledge management to ‘know thy visitor’. All the consumer-specific data they can get their hands on, and all the customer trends that could be derived from it, are deemed an asset, or at times simply a given. The knowledge management process comprises several activities, including but not limited to summarization, filtering, visualization, searching, categorization, mining & extraction, and clustering. In this arena of digging up clues, text mining is the creation, discovery, derivation or deduction of new and previously unknown patterns from text documents.

As the Hearst paper puts it, “Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999). Data and text mining are crucial to business process orchestration today. The process gets variously labeled as unsupervised learning, lexical analysis, information extraction, live classification, self annotation, hierarchical text classification, semantic web, etc. From an executive’s point of view, it’s merely CRM. However, the core idea gets mixed up in the plethora of buzzwords. What is the difference between text and data mining, and where does the line need to be drawn? Wikipedia defines text mining as:

“Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics”.

The semantic problem arises when “data mining” and “text mining” are used interchangeably, which is a fallacy. With applications like Riya and multimedia content filtering going mainstream (read TiVo), trend analysis is not text bound anymore. Re-routing your help desk ticket to the right correspondent using Bayesian inference is one thing, but if you are matching up your interactive voice response (IVR) logs with customers’ demographics, banner clicks and web hits to evaluate the business requirements, it’s beyond mere text mining. Thuraisingham makes the point: “The difference between text mining and information retrieval is analogous to the difference between data mining and database management” (Thuraisingham, 1999). Also, the idea of intelligent text mining vs. standard text mining echoes the contrast between mere statistical clustering and the application of heuristics or specialized learning algorithms to text streams.
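To make the help desk example a bit more concrete, here is a minimal sketch of routing a ticket with Bayesian inference, written as a tiny multinomial naive Bayes classifier with add-one smoothing. Everything in it is my own assumption for illustration: the class name `NaiveBayesRouter`, the queue names, and the six made-up training tickets. A real help desk would need far more data and proper tokenization.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesRouter:
    """Hypothetical ticket router: multinomial naive Bayes, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # queue -> word -> count
        self.ticket_counts = Counter()           # queue -> number of tickets
        self.vocab = set()

    def train(self, tickets):
        for text, queue in tickets:
            self.ticket_counts[queue] += 1
            for word in tokenize(text):
                self.word_counts[queue][word] += 1
                self.vocab.add(word)

    def route(self, text):
        total_tickets = sum(self.ticket_counts.values())
        best_queue, best_score = None, float("-inf")
        for queue in self.ticket_counts:
            # log P(queue) + sum of log P(word | queue)
            score = math.log(self.ticket_counts[queue] / total_tickets)
            total_words = sum(self.word_counts[queue].values())
            for word in tokenize(text):
                count = self.word_counts[queue][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_queue, best_score = queue, score
        return best_queue


# Made-up training tickets, purely illustrative.
TICKETS = [
    ("cannot log in password reset", "accounts"),
    ("forgot my password locked out", "accounts"),
    ("invoice charged twice refund", "billing"),
    ("refund for wrong invoice amount", "billing"),
    ("printer driver crash on install", "hardware"),
    ("laptop screen crash and reboot", "hardware"),
]

router = NaiveBayesRouter()
router.train(TICKETS)
print(router.route("please reset my password"))
```

Note that this only ever looks at the words in the ticket. The moment you start joining in IVR logs, demographics and click data, you have left this kind of text-only model behind, which is exactly the distinction being drawn here.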

There are several upcoming conferences dedicated to business applications of text mining, for instance the 2nd Annual Text Analytics Summit, 22-23 June 2006, in Boston, Massachusetts, which has several interesting tracks. Some of the ones that seem particularly interesting to me include:

·        Understand, predict and act by Olivier Jouve, VP Text mining, SPSS

·        Enabling your enterprise with Text Analytics – A financial perspective by John Anthony – Director, P&C Innovation Lab, The Hartford Financial Services Group, Inc.

·        How HP perform needs-based customer segmentation using text mining by Randy Collica – Sr. Business Analyst Hewlett Packard

·        Methodology for Defining Text Enabled Business Intelligence Applications by Jay Henderson - Director of Product Marketing, ClearForest

·        High performance text analysis architectures & applications by Ramana Rao – CTO, Inxight

·        Visualising textual data by Bill Inmon – CEO Inmon Data Systems

Categorization, structuring and cleanup of text are discussed in much detail in both (Hearst, 1999) and (Jan H. Kroeze et al, 2003). There are counter-opinions as well, such as “It is a fallacy that text data are unstructured.” (Nasukawa et al, 2001), so this discussion will go further in both camps, but IMHO Nasukawa derives his point from Google PageRank.

Further Reading

·        Differentiating data- and text-mining terminology, Jan H. Kroeze, Machdel C. Matthee, Theo J. D. Bothma, Proceedings of the 2003 annual research conference of the South African institute of computer scientists and information technologists on Enablement through technology (SAICSIT '03)

·        Untangling Text Data Mining – (Hearst, 99) ACM

·        Data Mining 2005 Sixth International Conference on Data Mining, Text Mining and their Business Applications

·        Business Intelligence Text Mining

·        UT ML Group: Text Data Mining

·        Experimental study of discovering essential information from customer inquiry – ACM

·        Mining concept associations for knowledge discovery in large textual databases – ACM

·        Generating association graphs of non-cooccurring text objects using transitive methods – ACM

·        Unsupervised Learning of Soft Patterns for Generating Definitions from Online News -Cui, H., Kan, M-Y. and Chua, T-S

·        Information Retrieval and Text Mining: A domain independent environment for creating information extraction modules – ACM

·        Text mining as integration of several related research areas: report on KDD's workshop on text mining 2000

·        Evaluating the novelty of text-mined rules using lexical knowledge

·        Artificial intelligence #2: Topic-based clustering of news articles