Talk

Speeding up Whisper for speech recognition by 10x - Optimisations and walkthrough

Friday, May 24

11:40 - 12:25
Room: Spaghetti
Language: English
Audience level: Advanced
Elevator pitch

Late last year, we sped up Whisper (https://github.com/Vaibhavs10/insanely-fast-whisper) by 10x. It involved a lot of hacking (flash attention, batching, torch.compile, etc.) and tricks that can be applied to other models, too. Join me as I walk through those optimisations and tricks.

Abstract

In this talk, we’ll touch on optimisations that can help you speed up your PyTorch models. We’ll primarily look at the following:

  1. Flash Attention - faster attention (1 & 2)
  2. Torch.compile()
  3. Chunking/ batching your inputs
  4. Torchaudio pre-processing
  5. half-precison inference

By the end of this talk, you’ll have learnt how to make your existing LLMs/PyTorch models faster and better.
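
As a taste of what the talk covers, here is a minimal sketch of how several of these optimisations combine in the Hugging Face transformers ASR pipeline. The checkpoint name, audio file, chunk length, and batch size below are illustrative placeholders, not the exact settings from insanely-fast-whisper.

    import torch
    from transformers import pipeline

    # Half-precision inference + Flash Attention 2 at model load time.
    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",  # illustrative checkpoint
        torch_dtype=torch.float16,        # half-precision inference
        device="cuda:0",
        model_kwargs={"attn_implementation": "flash_attention_2"},
    )

    # torch.compile can additionally be applied to the forward pass, e.g.
    # pipe.model.forward = torch.compile(pipe.model.forward)

    # Chunk long audio into 30 s windows and transcribe the chunks in batches.
    outputs = pipe(
        "sample.wav",           # illustrative input file
        chunk_length_s=30,
        batch_size=16,
        return_timestamps=True,
    )
    print(outputs["text"])

The talk walks through why each of these knobs matters and how much speed-up each contributes.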

Tags: Machine-Learning, Deep Learning
Participant

Vaibhav Srivastav

I’m an ML Developer Advocate at Hugging Face focusing on extracting better insights from Text + Audio.

I’ve been a freelancer, tax analyst, consultant, tech speaker, and advisor for five years. Over the past three years, I’ve invested significant time volunteering for open-source and open-science organisations such as Hugging Face, EuroPython, PyCons across APAC, Google Cloud Delhi, and Facebook Developer Circles.

A Delhi native, I now live in Germany!