Multimodal integration is an essential part of any multimodal system. By multimodal input we refer to explicit input modalities such as speech, gesture, and touch, as well as implicit modalities such as sensor input, contextual information, and even bio-signals. While multimodal interaction has been on the research agenda for more than three decades, multimodal integration itself is an often overlooked topic and thus leaves ample room for research.
Typical solutions are ad-hoc approaches in which the integration logic is hard-coded into the application. These are usually rule-based and therefore do not scale. More sophisticated approaches are either linguistically motivated (unification grammar-based, case frames) or originate from classification- and modelling-based implementations (voting using agents, finite-state transducers). Several statistical approaches are available as well, and new ones have been introduced recently. Good overviews of the various integration approaches can be found in:
Delgado, R.L.C. and M. Araki (2005), Spoken, Multilingual and Multimodal Dialogue Systems: Development and Assessment. John Wiley & Sons.
Lalanne, Denis, Laurence Nigay, Philippe Palanque, Peter Robinson, Jean Vanderdonckt, and Jean-François Ladry (2009), “Fusion engines for multimodal input: a survey.” ICMI-MLMI ’09, http://doi.acm.org/10.1145/1647314.1647343.
This project addressed the integration problem from a linguistic perspective and used Maximum Entropy based classification to fuse the speech and gesture modalities with the available contextual information. The classification method is embedded in a Genetic Algorithm based iteration that searches for the optimal feature set for the classifier.
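To illustrate the fusion step, the following is a minimal sketch of a Maximum Entropy (logistic regression) classifier trained by gradient ascent on toy fused features. The feature names (speech confidence, gesture overlap, context flag) and the binary "do these modality events belong together" framing are illustrative assumptions, not the project's actual feature set or label scheme.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, lr=0.5, epochs=200):
    """Binary maximum-entropy model (logistic regression) via gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)  # bias followed by one weight per feature
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = yi - p  # gradient of the log-likelihood w.r.t. the score
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def predict(w, xi):
    """Probability that the speech and gesture events should be fused."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

# Hypothetical fused feature vectors: [speech confidence, gesture overlap, context flag]
X = [[0.9, 0.8, 1.0], [0.2, 0.1, 0.0], [0.8, 0.9, 1.0], [0.1, 0.3, 0.0]]
y = [1, 0, 1, 0]  # 1 = modalities refer to the same entity, 0 = unrelated

w = train_maxent(X, y)
```

In a real system the features would be extracted from recognizer output and context, but the training loop itself is the standard MaxEnt update.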
The work is ongoing; results on a larger speech-gesture database indicate that an accuracy of over 90% can be achieved with Maximum Entropy based classification using only 3-5 features.
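The Genetic Algorithm wrapper that searches for such a small feature subset can be sketched as below. For self-containment the fitness function here is a synthetic stand-in that rewards recovering a known set of informative features and penalizes subset size; in the actual setup the fitness would be the classifier's held-out accuracy. Population size, mutation rate, and the `GOOD` set are all illustrative assumptions.

```python
import random

random.seed(0)
N_FEATURES = 10
GOOD = {0, 3, 7}  # stand-in for the truly informative features

def fitness(mask):
    """Score a feature subset: reward informative features, penalize size.
    In practice this would be classifier accuracy on a development set."""
    chosen = {i for i, b in enumerate(mask) if b}
    return len(chosen & GOOD) - 0.1 * len(chosen)

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.05):
    return [1 - b if random.random() < rate else b for b in mask]

def evolve(pop_size=30, generations=60):
    """Evolve bitmasks over the feature set; the fittest half survives each round."""
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

Because survivors are carried over unmutated, the best subset found never degrades between generations, which matches the iterative refinement described above.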
Figures: (c) 2012 Péter. All rights reserved.