Multimodal Deep Learning
A tutorial at MMM 2019
Thessaloniki, Greece (8th January 2019)
Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are among the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures used to encode and decode vision, text and audio, and then review those models that have successfully translated information across modalities.
Xavier Giro-i-Nieto is an associate professor at the Universitat Politecnica de Catalunya (UPC) in Barcelona, a member of the Intelligent Data Science and Artificial Intelligence Research Center (IDEAI-UPC), and a visiting researcher at the Barcelona Supercomputing Center (BSC). He works closely with the Insight Centre for Data Analytics at Dublin City University and the Digital Video and MultiMedia Lab at Columbia University, as well as with his industrial partners at Vilynx, Mediapro, and Crisalix. He is the director of the postgraduate degree on Artificial Intelligence with Deep Learning at UPC School, coordinates the deep learning courses at UPC TelecomBCN, and was general chair of the Deep Learning Barcelona Symposium 2018. He serves as associate editor of IEEE Transactions on Multimedia and reviews for top-tier conferences in machine learning, computer vision and multimedia.
The Deep Basics
- Network topologies
- Basic loss functions
Multimodal Deep Learning
- Language and Vision
- Audio and Vision
- Speech and Vision
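A core idea behind these topics is mapping different modalities into a shared embedding space, where cross-modal translation becomes nearest-neighbor retrieval. The sketch below is a toy illustration of that idea, not a model from the tutorial: the projection matrices `W_img` and `W_txt` are hypothetical stand-ins for what a real network would learn (e.g. with a contrastive or ranking loss over image-caption pairs).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned projections into a common 2-D embedding space.
W_img = rng.normal(size=(4, 2))   # maps 4-D "image" features
W_txt = rng.normal(size=(3, 2))   # maps 3-D "text" features

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = x @ W
    return z / np.linalg.norm(z)

# One "image" and two candidate "captions" (random stand-in features).
image = embed(rng.normal(size=4), W_img)
captions = [embed(rng.normal(size=3), W_txt) for _ in range(2)]

# Cross-modal retrieval: pick the caption whose embedding has the
# highest cosine similarity (dot product of unit vectors) with the image.
scores = [float(image @ c) for c in captions]
best = int(np.argmax(scores))
print("best caption:", best, "scores:", scores)
```

With learned projections instead of random ones, the same dot-product retrieval is what underlies image captioning, cross-modal search, and similar tasks covered in the tutorial.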
- Deep Learning for Computer Vision. UPC TelecomBCN.
- Deep Learning for Speech and Language. UPC TelecomBCN.
- Deep Learning for Video. Master in Computer Vision Barcelona.
- Deep Learning for Multimedia. Insight Dublin City University 2017.  
- Introduction to Deep Learning. UPC TelecomBCN 2018.
- Deep Learning for Artificial Intelligence. UPC TelecomBCN 2017.
- Amaia Salvador and Santiago Pascual. “Hands on Keras and TensorFlow”. Persontyle 2017.
- Fei-Fei Li, Andrej Karpathy, Justin Johnson, “CS231n: Convolutional Neural Networks for Visual Recognition”. Stanford University, Spring 2016.
- Sanja Fidler, “Deep Learning in Computer Vision”. University of Toronto, Winter 2016.
- Dhruv Batra, “ECE 6504: Deep learning for perception”. Virginia Tech, Fall 2015.