Multimodal Deep Learning

A tutorial of MMM 2019

Thessaloniki, Greece (8th January 2019)

Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research that exploits the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures used to encode and decode vision, text and audio, and then review the models that have successfully translated information across modalities.



Xavier Giro-i-Nieto is an associate professor at the Universitat Politecnica de Catalunya (UPC) in Barcelona, a member of the Intelligent Data Science and Artificial Intelligence Research Center (IDEAI-UPC), and a visiting researcher at the Barcelona Supercomputing Center (BSC). He works closely with the Insight Centre for Data Analytics at Dublin City University and the Digital Video and MultiMedia Lab at Columbia University, as well as with his industrial partners at Vilynx, Mediapro, and Crisalix. He is the director of the postgraduate degree on Artificial Intelligence with Deep Learning at UPC School, coordinates the deep learning courses at UPC TelecomBCN, and was general chair of the Deep Learning Barcelona Symposium 2018. He serves as an associate editor of IEEE Transactions on Multimedia and reviews for top-tier conferences in machine learning, computer vision and multimedia.


The Deep Basics

  • Motivation
  • Network topologies
  • Basic loss functions

Multimodal Deep Learning

  • Language and Vision
  • Audio and Vision
  • Speech and Vision
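As a taste of the cross-modal models covered in the second part, the sketch below shows the common idea behind many language-and-vision systems: project each modality into a shared embedding space and train with a ranking loss so that matched image–caption pairs score higher than mismatched ones. This is an illustrative NumPy toy, not code from the tutorial; the random projections stand in for learned encoders, and all array shapes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "features": 4 images and their 4 matching captions, in different spaces.
image_feats = rng.normal(size=(4, 512))  # e.g. CNN activations (hypothetical dims)
text_feats = rng.normal(size=(4, 300))   # e.g. averaged word embeddings

# Linear projections into a shared 128-d space (random here; learned in practice).
W_img = rng.normal(size=(512, 128))
W_txt = rng.normal(size=(300, 128))

def embed(x, W):
    """Project features and L2-normalize each row onto the unit sphere."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_emb = embed(image_feats, W_img)
txt_emb = embed(text_feats, W_txt)

# Cosine similarity between every image and every caption.
sim = img_emb @ txt_emb.T  # shape (4, 4); diagonal = matched pairs

# Hinge-based ranking loss: each matched pair should beat every mismatched
# pair by a margin. Diagonal terms contribute exactly `margin`, so we
# subtract them out to keep only the mismatched comparisons.
margin = 0.2
pos = np.diag(sim)
loss = np.maximum(0.0, margin - pos[:, None] + sim).sum() - 4 * margin
print(round(float(loss), 3))
```

With trained encoders in place of the random projections, minimizing this loss pulls matched pairs together and pushes mismatched ones apart, enabling cross-modal retrieval in either direction.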



Please share your public comments and photos of the tutorial with the hashtag #DLMM and the handles @DocXavi @mmm2019_conf.


Reach the instructor by e-mail or Twitter.


Previous editions

Related courses


Industrial doctorate from the Government of Catalonia and Vilynx


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X GPUs used in this work.

25th International Conference on MultiMedia Modeling