Human Activity Recognition: Fusing Modalities For Better Classification

Why Use Multimodal Learning for Activity Recognition?

As Baltrušaitis et al. observe, our experience of the world is multimodal. Humans have several senses: we touch, see, smell, hear, and taste, and combining them helps us understand the world around us. Most parents will remember how their child's understanding of “what a dog is” keeps improving after seeing actual dogs, videos of dogs, photographs of dogs, and cartoon dogs, and being told that they are all dogs. Seeing a single video of an actual dog does not help the child identify the character “Goofy” as a dog. It is the same for machines: multimodal machine learning models can process and relate information from multiple modalities, and so learn in a more holistic way.

Multimodal Learning for Human Activity Recognition — Our Recipe

Our goal was to recognize 10 activities: basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. We built the multimodal models by fusing two unimodal models, one image-based and one video-based, with an ensemble method, aiming to improve on what either classifier achieves alone.

One Modality at a Time

There are several approaches to multimodal learning, as described by Song et al., 2016, Tzirakis et al., 2017, and Yoon et al., 2018. One of them is ensemble learning, in which a 2D CNN model and a 3D CNN model are trained separately and their final softmax probabilities are combined to get predictions. Other techniques include joint representation and coordinated representation. A detailed overview is available in Baltrušaitis et al., 2018.
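To make the ensemble idea concrete, here is a minimal sketch of combining the softmax probabilities of two separately trained models. It is an illustration under stated assumptions, not our actual pipeline: the function names `softmax` and `late_fusion`, the logits, and the choice of a plain average as the combination step are all hypothetical.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(image_logits, video_logits):
    """Combine two models at the probability level.

    Assumption: the combination step is a plain average of the two
    softmax vectors; other combination rules are possible.
    """
    p_image = softmax(image_logits)
    p_video = softmax(video_logits)
    combined = [(a + b) / 2 for a, b in zip(p_image, p_video)]
    # Index of the class with the highest combined probability.
    label = max(range(len(combined)), key=combined.__getitem__)
    return combined, label

# Made-up logits over three classes, for illustration only.
image_logits = [2.0, 1.0, 0.1]   # hypothetical output of the image model
video_logits = [0.5, 2.5, 0.3]   # hypothetical output of the video model
combined, label = late_fusion(image_logits, video_logits)
print(label, [round(p, 3) for p in combined])
```

Because the models are only joined at the output, each one can be trained, tuned, and replaced independently of the other.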


Our multimodal models performed better than the unimodal ones. Among the multimodal engines, Averaging and Maximum Pooling performed better than Maximum Vote, with an overall accuracy of 87% versus 84%. The reason is that Averaging and Maximum Pooling take the confidence of each predicted label into account, whereas Maximum Vote considers only which label has the maximum probability.
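The difference between the three fusion rules is easier to see on a toy case. The sketch below rests on assumed definitions, since the exact formulas are not spelled out here: Averaging takes the mean probability per class, Maximum Pooling takes the highest probability any model gives each class, and Maximum Vote lets each model vote for its top label. The function names, the probabilities, and the tie-break (lowest class index wins) are hypothetical.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def fuse_average(probs):
    """Mean probability per class across models, then pick the best class."""
    n_models = len(probs)
    return argmax([sum(col) / n_models for col in zip(*probs)])

def fuse_max_pool(probs):
    """Highest probability any model assigns to each class, then pick the best."""
    return argmax([max(col) for col in zip(*probs)])

def fuse_max_vote(probs):
    """Each model votes for its top label; confidence values are discarded.

    Assumption: ties are broken in favour of the lowest class index.
    """
    votes = Counter(argmax(p) for p in probs)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Made-up softmax outputs over three classes, for illustration only.
probs = [
    [0.40, 0.35, 0.25],  # image model: weakly prefers class 0
    [0.05, 0.90, 0.05],  # video model: strongly prefers class 1
]
print(fuse_average(probs), fuse_max_pool(probs), fuse_max_vote(probs))
```

In this toy case the two confidence-aware rules follow the model that is sure of itself and pick class 1, while the vote is a one-to-one tie that only the arbitrary tie-break resolves, here in favour of class 0.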

Ashwini Gupta

Ashwini is an engineering graduate and a lifelong learner. Skilled in AI practices like Computer Vision, Deep Neural Networks and NLP, Ashwini also has hands-on experience with AWS and Azure Cloud Services.

Dr. Monika Singh

Affine is a leading AWS Select Consulting Partner renowned for providing cutting-edge services on the AWS platform.




