(2024-08-01) ReazonSpeech v2.1: Setting a New Standard in Japanese ASR

Today, we're excited to announce ReazonSpeech v2.1. In this release, we publish ReazonSpeech-k2-v2, an open-source Japanese ASR model that sets new records on benchmark tests. It is built on the Next-gen Kaldi framework and distributed in the platform-neutral Open Neural Network Exchange (ONNX) format. ReazonSpeech-k2-v2 excels in accuracy, compactness, and inference speed, and can run on-device without a GPU.

The ReazonSpeech-k2-v2 model is released under the Apache 2.0 license. The model files and inference code are readily available on Hugging Face and GitHub.

Figure 1: ReazonSpeech v2.1 on common Japanese ASR benchmark tests

What is ReazonSpeech v2.1?

ReazonSpeech v2.1 represents the latest iteration of Reazon Human Interaction Lab's ASR research. This release introduces a new Japanese ASR model that:

  • Outperforms existing Japanese ASR models on JSUT-BASIC5000 [1], Common Voice v8.0 [2], and TEDxJP-10K [3] benchmark sets (see the chart above).

  • Excels in compactness, with only 159M parameters.

  • Excels in inference speed, ranking among the fastest models at transcribing short audio inputs.

What enables this outstanding performance is Zipformer [4], a state-of-the-art Transformer variant. We trained this novel architecture on the 35,000-hour ReazonSpeech v2.0 corpus, which yielded best-in-class accuracy.

Tip

For further details about the ReazonSpeech-k2-v2 model, see the full training recipe available on k2-fsa/icefall.

Easy deployment with ONNX

The ReazonSpeech-k2-v2 model is available in the ONNX format, which significantly enhances its versatility across a wide range of platforms. Because the ONNX Runtime is independent of the PyTorch framework, setup is simpler and integration into diverse environments is straightforward. This adaptability makes the model practical to run, even without a GPU, on Linux, macOS, Windows, embedded systems, Android, and iOS.

For more details about the supported platforms, please refer to the Sherpa-ONNX documentation.
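
As a rough illustration of how little setup this takes, the sketch below runs offline transcription through the sherpa-onnx Python API (installed with pip install sherpa-onnx soundfile). The file names are placeholders for the encoder, decoder, joiner, and token files shipped with ReazonSpeech-k2-v2; check the release on Hugging Face for the exact names.

    # Minimal sketch: offline Japanese ASR with a transducer model via sherpa-onnx.
    # File names below are illustrative placeholders, not the exact release names.
    import sherpa_onnx
    import soundfile as sf

    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder="encoder.onnx",   # Zipformer encoder (fp32 or int8)
        decoder="decoder.onnx",   # transducer decoder
        joiner="joiner.onnx",     # transducer joiner
        tokens="tokens.txt",      # token table distributed with the model
        num_threads=2,
        decoding_method="greedy_search",
    )

    # Read a mono recording as float32 samples.
    samples, sample_rate = sf.read("speech.wav", dtype="float32")

    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    print(stream.result.text)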

Reduce memory footprint with quantization

We also released an int8-quantized version of the ReazonSpeech-k2-v2 model. The quantized model has a significantly smaller footprint, as shown in the following table.

Table 1: The effects of quantization on model size

File       File size (fp32)   File size (int8)
Encoder    565 MB             148 MB
Decoder    12 MB              3 MB
Joiner     11 MB              3 MB

These quantized models are up to 10x smaller than comparable ASR models such as Whisper Large-v3, enabling deployment on a wide range of devices with computational constraints. Notably, when used with a non-quantized decoder, the quantized model maintains accuracy comparable to its non-quantized counterpart, which makes it possible to deploy our model even on devices with very limited computational capacity.
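
For reference, dynamic int8 weight quantization of an ONNX model is commonly produced with onnxruntime's quantization utilities. The snippet below is a generic sketch of that procedure with placeholder file names, not necessarily the exact recipe used for this release; the authoritative export steps are in the icefall recipe.

    # Hedged sketch: dynamic int8 weight quantization of an exported ONNX encoder.
    # File names are placeholders; the official export recipe is in k2-fsa/icefall.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input="encoder.onnx",        # fp32 model exported from training
        model_output="encoder.int8.onnx",  # weights stored as signed 8-bit integers
        weight_type=QuantType.QInt8,
    )

Storing weights as 8-bit integers rather than 32-bit floats accounts for the roughly 4x size reduction in Table 1; pairing such an encoder with the non-quantized decoder is presumably what the int8-fp32 configuration below refers to.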

Table 2: The effects of quantization on accuracy (character error rate in %; lower is better)

Model name                       JSUT   Common Voice   TEDxJP-10K
ReazonSpeech-k2-v2               6.45   7.85           9.09
ReazonSpeech-k2-v2 (int8)        6.63   8.19           9.86
ReazonSpeech-k2-v2 (int8-fp32)   6.45   7.87           9.15
Whisper Large-v3                 7.18   8.18           9.96
ReazonSpeech-NeMo-v2             7.31   8.81           10.42
ReazonSpeech-ESPnet-v2           6.89   8.27           9.28

Future goals

With this release, we have significantly enhanced both the speed and accuracy of our Japanese ASR models. By making our model open-source on the K2 Sherpa-ONNX platform, we have greatly improved accessibility for a broad range of users and developers across various platforms.

Looking ahead, we are committed to further advancing our models by expanding our dataset, developing streaming ASR capabilities, and incorporating multilingual data to create an exceptional bilingual English-Japanese ASR model.

This release represents a major milestone, and we are excited to continue pushing the boundaries of Japanese speech processing technology in the future. Currently, ReazonSpeech-k2-v2 can process longer segments of audio with the help of voice activity detection (VAD). In the future, we plan to release a streaming version of this model which can innately support real-time transcription.
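
As an illustration of that workflow, the sketch below uses the Silero VAD bundled with sherpa-onnx to split a long recording into speech segments and decodes each one with the offline recognizer. The model paths, window size, and buffer length are assumptions chosen for clarity, not values prescribed by this release.

    # Hedged sketch: long-form transcription by combining Silero VAD with the
    # offline recognizer. All file names and parameters here are illustrative.
    import sherpa_onnx
    import soundfile as sf

    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder="encoder.onnx", decoder="decoder.onnx",
        joiner="joiner.onnx", tokens="tokens.txt",
    )

    samples, sample_rate = sf.read("long_recording.wav", dtype="float32")  # mono, 16 kHz

    vad_config = sherpa_onnx.VadModelConfig()
    vad_config.silero_vad.model = "silero_vad.onnx"   # VAD model, distributed separately
    vad_config.sample_rate = sample_rate
    vad = sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=60)

    def transcribe(segment):
        # Decode one VAD-detected speech segment and print it with its start time.
        stream = recognizer.create_stream()
        stream.accept_waveform(sample_rate, segment.samples)
        recognizer.decode_stream(stream)
        print(f"{segment.start / sample_rate:8.2f}s  {stream.result.text}")

    window = 512  # samples fed to the VAD per step (Silero's default at 16 kHz)
    for start in range(0, len(samples), window):
        vad.accept_waveform(samples[start:start + window])
        while not vad.empty():       # drain completed speech segments as we go
            transcribe(vad.front)
            vad.pop()

    vad.flush()                      # close the final segment, if any
    while not vad.empty():
        transcribe(vad.front)
        vad.pop()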

Footnotes