Foundational Speech Models and Their Efficient Training with NVIDIA NeMo {AI Talks with Coffee/Tea #30}
Virtual: https://events.vtools.ieee.org/m/495161
Registration: https://landing.signalprocessingsociety.org/ieee-sps-webinars-27-aug-2025

The intersection of speech and language models offers unique opportunities and challenges. This talk provides a comprehensive walkthrough of speech-language model research from NVIDIA NeMo. We cover several types of models, such as the attention-encoder-decoder Canary-1B and LLM-based architectures such as SALM and BESTOW. In particular, we highlight the challenges in training and inference efficiency of such models and propose robust solutions via 2D bucketing and the batch size OOMptimizer. Finally, we highlight the difficulty of preserving text-domain capabilities in speech-augmented training and present several possible solutions: EMMeTT, VoiceTextBlender, and Canary-Qwen-2.5B.

About the Presenter: Piotr Żelasko received his B.S. and M.Sc. degrees in acoustic engineering and his Ph.D. in electronic engineering from AGH University of Krakow, Poland, in 2013, 2014, and 2019, respectively. He is currently a research scientist at NVIDIA NeMo, building multitask and multimodal models and efficient training infrastructure. He previously held a research scientist position at JHU's CLSP and developed speech technology at several companies (Techmo, Avaya, Meaning.Team). Dr. Żelasko is a co-author of the next-generation Kaldi toolkit (k2) and the maintainer of Lhotse.

Agenda: https://landing.signalprocessingsociety.org/ieee-sps-webinars-27-aug-2025
Please register both at the agenda link above and on vTools: https://events.vtools.ieee.org/m/495161