Channel: Speech Technology
This talk was posted here already, but I watched it again recently and can recommend revisiting it.

Hearing the AGI: from GMM-HMM to GPT-4o (Yu Zhang)
November 15th LTI Colloquium Speaker

https://www.youtube.com/watch?v=pRUrO0x637A

Highly recommended:

1. Importance of scale
2. Importance of self-supervised learning for training on dirty data
3. A very tricky case with the dither seed and self-supervised learning (see the sketch below)
4. Voice search data is useless
5. Importance of multi-objective training (again)
6. Why readable transcripts (Whisper) are better than good WER (RNN-T)
7. Discussion of the balance of audio and text data for audio LLM training
8. Size of the decoder vs. size of the encoder

Not always relevant for us GPU-poor guys, but very nice overall.
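On point 3, the gist as I understood it: dither is the tiny random noise added before log-mel extraction to avoid log(0), and if its RNG seed is fixed, that "noise" is identical on every pass, so a masked-prediction SSL objective can partially learn the dither pattern itself instead of real structure. A toy repro of the failure mode; the feature function and numbers are mine, not from the talk:

```python
import numpy as np

def log_mel_energy(wave: np.ndarray, dither: float, rng: np.random.Generator) -> np.ndarray:
    """Toy 'feature': log energy per 400-sample frame, with dither."""
    noisy = wave + dither * rng.standard_normal(wave.shape)
    frames = noisy[: len(noisy) // 400 * 400].reshape(-1, 400)
    return np.log((frames ** 2).mean(axis=1) + 1e-10)

wave = np.zeros(16000)  # pure silence: any feature content is pure dither

# Fixed seed: two "independent" passes see byte-identical dither.
a = log_mel_energy(wave, 1e-3, np.random.default_rng(0))
b = log_mel_energy(wave, 1e-3, np.random.default_rng(0))
print(np.allclose(a, b))  # True: the SSL target is deterministic noise, learnable

# Fresh seed per pass: dither is actually random, nothing to memorize.
c = log_mel_energy(wave, 1e-3, np.random.default_rng())
print(np.allclose(a, c))  # False
```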
https://github.com/zhai-lw/SQCodec

https://arxiv.org/abs/2504.04949

One Quantizer is Enough: Toward a Lightweight Audio Codec

Linwei Zhai, Han Ding, Cui Zhao, Fei Wang, Ge Wang, Wang Zhi, Wei Xi
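The pitch is a single codebook instead of a residual-VQ stack. For intuition, a minimal single-codebook VQ bottleneck with a straight-through estimator; this is generic VQ machinery, not SQCodec's actual architecture:

```python
import torch

class SingleVQ(torch.nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 128):
        super().__init__()
        self.codebook = torch.nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):  # z: (batch, frames, dim)
        # Nearest codeword per frame; one index stream is the whole bitstream.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = d.argmin(dim=-1)                                       # (B, T)
        q = self.codebook(idx)                                       # quantized
        return z + (q - z).detach(), idx  # straight-through gradient

vq = SingleVQ()
q, idx = vq(torch.randn(2, 50, 128))
# One 1024-entry codebook at 50 frames/s costs 50 * log2(1024) = 500 bit/s;
# illustrative arithmetic, not the paper's numbers.
```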
Uniform steps are definitely a problem in speech LLMs. A couple of attempts to solve that have appeared together recently; the idea is to apply text/speech alignment before feeding the data into the LLM:

https://github.com/FreedomIntelligence/Soundwave

https://github.com/mtkresearch/TASTE-SpokenLM

https://arxiv.org/abs/2502.12900

Soundwave: Less is More for Speech-Text Alignment in LLMs

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at this https URL.
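To make the shrinking step concrete, here is a sketch of alignment-based pooling: per-frame encoder outputs are averaged over each aligned token's span, so the LLM sees one embedding per text token instead of dozens of frames per second. The alignment source and names are my assumptions, not either paper's exact recipe:

```python
import torch

def shrink_by_alignment(frames: torch.Tensor, frame2tok: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) encoder outputs; frame2tok: (T,) token id per frame
    (monotonic, -1 for blanks/silence), e.g. from a CTC forced alignment."""
    keep = frame2tok >= 0
    frames, frame2tok = frames[keep], frame2tok[keep]
    num_tok = int(frame2tok.max()) + 1
    pooled = torch.zeros(num_tok, frames.size(1))
    pooled.index_add_(0, frame2tok, frames)                 # sum frames per token
    counts = torch.bincount(frame2tok, minlength=num_tok).clamp(min=1)
    return pooled / counts.unsqueeze(1)                     # mean per token span

enc = torch.randn(12, 8)                                    # 12 frames, dim 8
ali = torch.tensor([0, 0, 0, -1, 1, 1, 2, 2, 2, 2, -1, 3])  # 4 tokens
print(shrink_by_alignment(enc, ali).shape)                  # torch.Size([4, 8])
```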
The second baseline from https://x.com/xueyao_98 is now available!

Check out their technical blog and open-sourced code:
Blog: https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a
Code: https://github.com/open-mmlab/Amphion/tree/main/models/svc/vevosing

Training data will be distributed starting April 28th.
Register for SVCC here: https://forms.gle/GZGAWJAZvgDK6QKcA
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

https://arxiv.org/abs/2503.01174

https://x.com/Sid_Arora_18/status/1897315720205328593

As usual, the baseline cascaded system is intentionally weak. Whisper tiny as a baseline???
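For context, the standard quantity in this literature is the floor-transfer offset (FTO): the time from one speaker's turn end to the next speaker's turn start, negative meaning overlap/interruption. A toy scorer over diarized segments; my simplification, not the paper's exact protocol:

```python
from statistics import mean

turns = [  # (speaker, start_s, end_s), sorted by start time
    ("A", 0.0, 1.8), ("B", 2.1, 3.0), ("A", 2.9, 4.5), ("B", 5.2, 6.0),
]

ftos = [
    nxt[1] - cur[2]                      # next turn start minus current turn end
    for cur, nxt in zip(turns, turns[1:])
    if cur[0] != nxt[0]                  # only count speaker changes
]
print([round(f, 2) for f in ftos])       # [0.3, -0.1, 0.7]
print("mean FTO:", round(mean(ftos), 2),
      "| overlap rate:", round(sum(f < 0 for f in ftos) / len(ftos), 2))
```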
Dataset generated with OpenAI

https://huggingface.co/datasets/laion/laions_got_talent

"LAION's Got Talent" is a generated dataset comprising voice acting samples that exhibit a wide range of emotions, vocal bursts, topics, and content. This dataset is a component of the BUD-E project, spearheaded by LAION with support from Intel.
517 pages of instrument/vocals separation

https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/

Instrumental and vocal & stems separation & mastering:
- UVR 5 GUI: VR/MDX-Net/MDX23C/Demucs 1-4, and BS/Mel-Roformer in beta
- MVSEP-MDX23-Colab / KaraFan / drumsep / LarsNet / SCNet
- x-minus.pro (uvronline.app) / mvsep.com
- GSEP / Dango.ai / Audioshake / Music.ai
18B speech recognition models

https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d

https://arxiv.org/abs/2502.10373

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
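The practical upshot: fit a power law to the small end of the sweep and extrapolate before committing compute to the next size up. A sketch with invented WER numbers; only the fitting recipe is the point:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):   # n = model size in billions of parameters
    return a * n ** (-b) + c

sizes = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 9.0])      # OWLS-like size sweep
wer = np.array([20.0, 16.5, 14.0, 12.2, 11.0, 10.0])   # invented dev WERs

(a, b, c), _ = curve_fit(power_law, sizes, wer, p0=(5.0, 0.5, 5.0))
print(f"extrapolated WER at 18B: {power_law(18.0, a, b, c):.2f}")
```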
The paper itself

Using Voicebox-based Synthetic Speech for ASR Adaptation

https://www.isca-archive.org/syndata4genai_2024/dhamyal24_syndata4genai.pdf
https://arxiv.org/abs/2411.18803

TS3-Codec: Transformer-Based Simple Streaming Single Codec

Haibin Wu, Naoyuki Kanda, Sefik Emre Eskimez, Jinyu Li

Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NAC models are predominantly convolution-based, the performance of NACs with a purely transformer-based, and convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers with a few linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, the proposed TS3-Codec achieves comparable or superior performance compared to the codec with state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.
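The abstract's recipe in miniature: a linear "patchify" of the waveform instead of a convolutional frontend, a causally masked transformer stack (hence streamable), and a linear head into the quantizer. Sizes and framing below are my guesses, not the paper's configuration:

```python
import torch

class ConvFreeStreamingEncoder(torch.nn.Module):
    def __init__(self, frame: int = 320, dim: int = 256, layers: int = 4):
        super().__init__()
        self.frame = frame
        self.embed = torch.nn.Linear(frame, dim)   # linear frontend, no convs
        block = torch.nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True, norm_first=True)
        self.encoder = torch.nn.TransformerEncoder(block, num_layers=layers)
        self.head = torch.nn.Linear(dim, dim)      # projection into the quantizer

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, samples)
        x = self.embed(wav.unfold(-1, self.frame, self.frame))  # non-overlapping frames
        mask = torch.nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.encoder(x, mask=mask))  # causal mask => streaming

enc = ConvFreeStreamingEncoder()
print(enc(torch.randn(1, 16000)).shape)  # (1, 50, 256): 50 frames for 1 s at 16 kHz
```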