At a recent hackathon we chose to implement a simple tool for visualizing expenses, e.g. from a business trip, and classifying those expenses into pre-defined classes such as food, transport, hotel, etc.
The hackathon lasted 1.5 days, and in that time we managed to conduct a small-scale user study (with other participants of the hackathon), design a solution and the architecture for the quick-and-dirty implementation, collect data for the natural language understanding module of the tool, train a Maximum Entropy classifier with the collected and pre-labeled data, and visualize the classified inputs with d3.js.
The Maximum Entropy (ME) classifier was chosen because it had recently been used in some experiments on multimodal integration, and thus it was easy to adapt for language understanding. ME is a feature-based classifier that works well with tens, hundreds, or even thousands of features. A feature is a logical statement about the relation between the incoming data and the desired label. For our task we defined only two features:
- the so-called bag-of-words, the set of all the words in the incoming sentence, and
- all the bigrams, that is, all the pairs of adjacent words as they occur in the input sentence.
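The two features can be sketched in a few lines of Python. This is only an illustration, not the hackathon code: the function name `extract_features`, the whitespace tokenization, and the sample sentence are all assumptions made for the example.

```python
def extract_features(sentence):
    """Return bag-of-words and bigram features for one input sentence.

    Hypothetical helper: lowercasing and whitespace splitting are
    simplifying assumptions, not the original tool's tokenizer.
    """
    tokens = sentence.lower().split()
    features = set()
    # Bag-of-words: one feature per distinct word in the sentence.
    for word in tokens:
        features.add(("word", word))
    # Bigrams: one feature per pair of adjacent words.
    for first, second in zip(tokens, tokens[1:]):
        features.add(("bigram", first, second))
    return features


# Made-up input, purely for illustration.
feats = extract_features("Taxi to the airport")
for feat in sorted(feats):
    print(feat)
```

Each feature is later paired with a candidate label, so that the classifier can learn, for instance, that the word "taxi" speaks for the transport class.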
Even with these two minimal (and pretty obvious) features we achieved nearly 70% recognition accuracy on the small data set. There were only four classes.
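To make the idea concrete, here is a minimal sketch of a Maximum Entropy classifier (equivalently, multinomial logistic regression) over word and bigram features, trained with plain stochastic gradient ascent. It is not the implementation used at the hackathon: the function names, the learning rate, the number of epochs, and the six training sentences are all invented for illustration.

```python
import math
from collections import defaultdict


def featurize(sentence):
    # Words plus adjacent word pairs (assumed whitespace tokenization).
    tokens = sentence.lower().split()
    return set(tokens) | {a + "_" + b for a, b in zip(tokens, tokens[1:])}


def predict_proba(weights, labels, feats):
    # Score each label by summing the weights of its (feature, label) pairs,
    # then normalize with a numerically stable softmax.
    scores = {y: sum(weights[(f, y)] for f in feats) for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train(examples, epochs=100, rate=0.2):
    labels = sorted({y for _, y in examples})
    weights = defaultdict(float)
    data = [(featurize(x), y) for x, y in examples]
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, labels, feats)
            for y in labels:
                # Log-likelihood gradient: observed minus expected feature count.
                grad = (1.0 if y == gold else 0.0) - probs[y]
                for f in feats:
                    weights[(f, y)] += rate * grad
    return weights, labels


def classify(weights, labels, sentence):
    probs = predict_proba(weights, labels, featurize(sentence))
    return max(probs, key=probs.get)


# Made-up training data, not the data collected at the hackathon.
examples = [
    ("lunch at the cafe", "food"),
    ("dinner with client", "food"),
    ("taxi to the airport", "transport"),
    ("train ticket to berlin", "transport"),
    ("hotel room two nights", "hotel"),
    ("hotel minibar charge", "hotel"),
]
weights, labels = train(examples)
print(classify(weights, labels, "taxi to the station"))
```

A production setup would more likely rely on an existing toolkit and add regularization; the sketch only shows how feature and label pairs receive weights that are turned into class probabilities.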
On the output side, the results were visualized with Sankey diagrams, a very powerful tool for displaying data with multiple classes and how their relative volumes change over time. We used the d3.js framework by Mike Bostock and the Sankey plugin developed by Jason Davies and Mike Bostock.
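As a rough sketch of the data preparation step, classified expenses can be aggregated into the `nodes` and `links` structure that the d3 Sankey plugin consumes, where each link refers to its source and target node by index and carries a `value`. The day-to-category layout, the function name `build_sankey`, and the sample amounts below are assumptions for illustration; the original tool may have arranged its diagram differently.

```python
import json
from collections import defaultdict


def build_sankey(expenses):
    """Aggregate (day, category, amount) records into Sankey nodes and links.

    Hypothetical helper: one node per day and per category, one link per
    (day, category) pair with the summed amount as its value.
    """
    totals = defaultdict(float)
    for day, category, amount in expenses:
        totals[(day, category)] += amount

    names = []
    index = {}

    def node_id(name):
        # Assign node indices in order of first appearance.
        if name not in index:
            index[name] = len(names)
            names.append(name)
        return index[name]

    links = []
    for (day, category), value in sorted(totals.items()):
        links.append(
            {"source": node_id(day), "target": node_id(category), "value": value}
        )
    return {"nodes": [{"name": n} for n in names], "links": links}


# Made-up classified expenses, purely for illustration.
expenses = [
    ("Mon", "food", 12.5),
    ("Mon", "transport", 30.0),
    ("Mon", "food", 7.5),
    ("Tue", "hotel", 120.0),
]
graph = build_sankey(expenses)
print(json.dumps(graph, indent=2))
```

The resulting JSON can then be loaded on the JavaScript side and handed to the Sankey layout for rendering.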
Below are a couple of snapshots from the ME classifier training log and from the visualization of the expense tracker for multiple days.
Regarding ME classification with a larger number of features: a publication is in preparation in which the initial feature set is reduced to a smaller set through a heuristic optimization procedure. The topic is multimodal integration, and the optimization framework is based on genetic algorithms. More news on this soon.