ADIOS caught my eye in the December issue of Dr. Dobb's Journal, and I decided to try it out. It's a really cool idea (and implementation): a not entirely novel, but relatively new, approach to unsupervised learning. The algorithm learns autonomously by regenerating sentences, repeatedly aligning them, and checking which parts overlap.
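As a rough illustration of the "align sentences and check the overlapping parts" idea, here is a toy Python sketch. It is not ADIOS or its MEX step, and it is not taken from the ADIOS-Lite code; the function name `shared_frames` and the tiny corpus are made up for this example. It only assumes that two sentences of equal length which differ in exactly one word can be treated as aligned, and that the differing words can then be grouped as interchangeable in that context.

```python
from collections import defaultdict
from itertools import combinations


def shared_frames(sentences):
    """Group words that fill the same one-word slot in otherwise identical sentences.

    Toy illustration only (hypothetical helper, not part of ADIOS):
    two sentences of equal length that differ in exactly one position
    are treated as aligned, and the differing words are collected
    under the shared (prefix, suffix) frame.
    """
    frames = defaultdict(set)
    tokenized = [s.split() for s in sentences]
    for a, b in combinations(tokenized, 2):
        if len(a) != len(b):
            continue  # this sketch only aligns sentences of equal length
        diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        if len(diffs) == 1:
            i = diffs[0]
            frame = (tuple(a[:i]), tuple(a[i + 1:]))
            frames[frame].update((a[i], b[i]))
    return dict(frames)


# Made-up corpus for demonstration purposes.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat sat on the rug",
    "a bird flew away",
]

for (prefix, suffix), words in shared_frames(corpus).items():
    print(" ".join(prefix), sorted(words), " ".join(suffix))
```

The real algorithm works on a graph of the whole corpus and uses statistical criteria to decide which overlaps are significant; the sketch above skips all of that and keeps only the intuition that shared context around a differing slot hints at an equivalence class.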
More, from their website:
"The ADIOS project addresses the problem, fundamental to linguistics, bioinformatics and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, nucleotide base pairs, amino acid sequence data, musical notation, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (Automatic DIstillation of Structure) algorithm relies on a statistical method for pattern extraction (the MEX algorithm) and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, on coding regions in DNA sequences, and on protein data correlating sequence with function. This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical novel sentences, scoring well in standard language proficiency tests, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics."
For further details, see Zach Solan's thesis.
Download ADIOS-Lite 1.0 Linux Version
Download ADIOS-Lite 1.0 Cygwin Version