The subjectivity of interestingness measures has long been a subject of discussion in the data mining and machine learning communities. In their SIGKDD paper titled “A framework for mining interesting pattern sets”, De Bie, Kontonasios and Spyropoulou propose a framework that approaches the problem on two fronts:
- treating the prior model as an encoding of the researcher’s own understanding, which makes the data miner fundamentally part of the equation, and
- iteratively searching for the pattern set that conveys the most information in the most efficient way.
This research is motivated by real-life applications of pattern mining and the problems encountered in them. The authors claim that adding ever more interestingness measures will not bring clarity to the domain, and that the focus should instead be on quantifying the intuitive, subjective notion of interestingness. The paper discusses prior approaches that use probabilistic graphical models to reflect the user's bias, along with their shortcomings, but falls short of providing concrete examples of these flaws. The authors introduce their framework by formalizing prior information in a probabilistic model, shifting the focus from the data alone to the data miner, who is now part of the model. After formalizing the underlying patterns, they use information theory to quantify subjective interestingness, while acknowledging that one of the bottlenecks in this line of research is the empirical assessment of subjective interestingness measures. Tiling databases and KRIMP are cited as prior research.
As part of the definition of an interesting pattern, the authors quantify the interestingness of a pattern π(D) as
interestingness(π, π’) = I(π, π’) / D(π, π’)
i.e., the interestingness of a pattern is defined as the ratio of its self-information to its description length. The design follows a cryptography-style model of communication between Alice, who holds the data, and Bob, the data miner: the goal is to help Alice convey the data to Bob as efficiently as possible. Like the cryptographic setting, this design takes into account what Bob already knows (or thinks he knows), as well as the syntactic form of the patterns he believes the data may contain. The model is interesting, but its disregard of existing techniques based on probabilistic graphical models and probabilistic interestingness measures is hard to understand.
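To make the ratio concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that the self-information of a pattern is −log2 of the probability Bob's prior model assigns to it, and that the description length is given in bits. The function names, candidate patterns, probabilities and lengths are all hypothetical values chosen for illustration.

```python
import math


def self_information(prior_prob: float) -> float:
    """Self-information, in bits, of a pattern whose probability under
    the prior (background) model is prior_prob. Assumed form: -log2(p)."""
    if not 0.0 < prior_prob <= 1.0:
        raise ValueError("prior_prob must be in (0, 1]")
    return -math.log2(prior_prob)


def interestingness(prior_prob: float, description_length: float) -> float:
    """Ratio of self-information to description length (both in bits)."""
    if description_length <= 0:
        raise ValueError("description_length must be positive")
    return self_information(prior_prob) / description_length


# Hypothetical candidates: (name, probability under Bob's prior model,
# description length in bits). These numbers are made up.
candidates = [
    ("A", 0.5, 4.0),     # unsurprising, cheap to describe
    ("B", 0.01, 8.0),    # surprising, fairly cheap to describe
    ("C", 0.001, 50.0),  # very surprising, expensive to describe
]

# Rank patterns from most to least interesting under this ratio.
ranked = sorted(
    candidates,
    key=lambda c: interestingness(c[1], c[2]),
    reverse=True,
)

for name, prob, length in ranked:
    print(f"{name}: {interestingness(prob, length):.3f} bits per bit of description")
```

In this toy setting, pattern B ranks first: it is surprising to Bob yet short to describe. Pattern C is even more surprising, but its long description pushes the ratio down.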
The paper concludes by citing computational tractability as one of the major issues for future research. It is an important paper for researchers looking into the current state of mining interesting patterns, as it gives a critical overview of current techniques and their (in some cases supposed) shortcomings.