Multimodal Deep Learning

A tutorial at MMM 2019

Thessaloniki, Greece (8th January 2019)

Deep neural networks have accelerated the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research that exploits the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures used to encode and decode vision, text and audio, and then review the models that have successfully translated information across modalities.

Instructor

Xavier Giró

Xavier Giro-i-Nieto is an associate professor at the Universitat Politecnica de Catalunya (UPC) in Barcelona, a member of the Intelligent Data Science and Artificial Intelligence Research Center (IDEAI-UPC), and a visiting researcher at the Barcelona Supercomputing Center (BSC). He works closely with the Insight Centre for Data Analytics at Dublin City University and the Digital Video and Multimedia Lab at Columbia University, as well as with his industrial partners at Vilynx, Mediapro, and Crisalix. He is the director of the postgraduate degree on Artificial Intelligence with Deep Learning at UPC School, coordinates the deep learning courses at UPC TelecomBCN, and served as general chair of the Deep Learning Barcelona Symposium 2018. He serves as associate editor at IEEE Transactions on Multimedia and reviews for top-tier conferences in machine learning, computer vision and multimedia.

The Deep Basics

Motivation
Network topologies
Basic loss functions