# WER are we?

WER are we? An attempt at tracking state-of-the-art results, and their corresponding code, on speech recognition benchmarks. Feel free to correct! (Inspired by wer_are_we.)

## HKUST

(Possibly trained on more data than HKUST.)

| CER Test | Paper | Published | Notes | Codes |
| -------- | ----- | --------- | ----- | ----- |
| 21.2% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC pre-training on 10,000 hours of unlabeled speech | athena-team/Athena |
| 22.75% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
| 23.09% | CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition | February 2020 | CIF + SAN-based models (AM + LM) + speed perturbation + SpecAugment | None |
| 23.5% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
| 23.67% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |
| 24.12% | Self-Attention Aligner: A Latency-Control End-to-End Model for ASR Using Self-Attention Network and Chunk-Hopping | February 2019 | SAA model + SAN-LM (joint training) + speed perturbation | None |
| 27.67% | Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin | February 2019 | Extended-RNA + RNN-LM (joint training) | None |
| 28.0% | Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM | June 2017 | CTC-attention MTL + joint decoding (one-pass) + VGG net + RNN-LM (separate) + speed perturbation | espnet/espnet |
| 29.9% | Joint CTC/attention decoding for end-to-end speech recognition | 2017 | CTC-attention MTL-large + joint decoding (one-pass) + speed perturbation | espnet/espnet |

## AISHELL-1

| CER Dev | CER Test | Paper | Published | Notes | Codes |
| ------- | -------- | ----- | --------- | ----- | ----- |
| None | 6.6% | Improving Transformer-based Speech Recognition Using Unsupervised Pre-training | October 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation + MPC self-data pre-training | athena-team/Athena |
| None | 6.34% | CAT: CRF-Based ASR Toolkit | November 2019 | VGG + BLSTM + CTC-CRF + 3-gram LM + speed perturbation | thu-spmi/CAT |
| 6.0% | 6.7% | A Comparative Study on Transformer vs RNN in Speech Applications | September 2019 | Transformer-CTC MTL + RNN-LM + speed perturbation | espnet/espnet |
| None | 7.43% | Purely sequence-trained neural networks for ASR based on lattice-free MMI | 2016 | TDNN/HMM, lattice-free MMI + speed perturbation | kaldi-asr/kaldi |

## LibriSpeech

(Possibly trained on more data than LibriSpeech.)

| WER test-clean | WER test-other | Paper | Published | Notes | Codes |
| -------------- | -------------- | ----- | --------- | ----- | ----- |
| 5.83% | 12.69% | Humans (reported in Deep Speech 2: End-to-End Speech Recognition in English and Mandarin) | December 2015 | Humans | None |
| 2.0% | 4.1% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring + 60k hours unlabeled | facebookresearch/wav2letter |
| 2.3% | 4.9% | Transformer-based Acoustic Modeling for Hybrid Speech Recognition | October 2019 | Transformer AM (chenones) + 4-gram LM + neural LM rescore (data augmentation: speed perturbation and SpecAugment) | None |
| 2.3% | 5.0% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | HMM-DNN + lattice-based sMBR + LSTM LM + Transformer LM rescoring (no data augmentation) | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 2.3% | 5.2% | End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures | November 2019 | Conv+Transformer AM (10k word pieces) with ConvLM decoding and Transformer rescoring | facebookresearch/wav2letter |
| 2.2% | 5.8% | State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions | October 2019 | Multi-stream self-attention in hybrid ASR + 4-gram LM + neural LM rescore (no data augmentation) | s-omranpour/ConvolutionalSpeechRecognition (not official) |
| 2.5% | 5.8% | SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition | April 2019 | Listen Attend Spell | DemisEom/SpecAugment (not official) |
| 3.2% | 7.6% | From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition | October 2019 | LC-BLSTM AM (chenones) + 4-gram LM (data augmentation: speed perturbation and SpecAugment) | None |
| 3.19% | 7.64% | The CAPIO 2017 Conversational Speech Recognition System | April 2018 | TDNN + TDNN-LSTM + CNN-bLSTM + dense TDNN-LSTM across two kinds of trees + N-gram LM + neural LM rescore | None |
| 2.44% | 8.29% | Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System | September 2019, Interspeech | encoder-attention-decoder + Transformer LM | None |
| 3.80% | 8.76% | Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks | September 2018, Interspeech | 17-layer TDNN-F + iVectors | kaldi-asr/kaldi |
| 2.8% | 9.3% | RWTH ASR Systems for LibriSpeech: Hybrid vs Attention | September 2019, Interspeech | encoder-attention-decoder + BPE + Transformer LM (no data augmentation) | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 3.26% | 10.47% | Fully Convolutional Speech Recognition | December 2018 | End-to-end CNN on the waveform + conv LM | None |
| 3.82% | 12.76% | Improved training of end-to-end attention models for speech recognition | September 2018, Interspeech | encoder-attention-decoder end-to-end model | rwth-i6/returnn, rwth-i6/returnn-experiments |
| 4.28% | None | Purely sequence-trained neural networks for ASR based on lattice-free MMI | September 2016 | HMM-TDNN trained with MMI + data augmentation (speed) + iVectors + 3 regularizations | kaldi-asr/kaldi |
| 4.83% | None | A time delay neural network architecture for efficient modeling of long temporal contexts | 2015 | HMM-TDNN + iVectors | kaldi-asr/kaldi |
| 5.15% | 12.73% | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | December 2015 | 9-layer model with 2 layers of 2D-invariant convolution and 7 recurrent layers, 100M parameters, trained on 11,940 hours | PaddlePaddle/DeepSpeech |
| 5.51% | 13.97% | LibriSpeech: an ASR Corpus Based on Public Domain Audio Books | 2015 | HMM-DNN + pNorm* | kaldi-asr/kaldi |
| 4.8% | 14.5% | Letter-Based Speech Recognition with Gated ConvNets | December 2017 | (Gated) ConvNet for AM going to letters + 4-gram LM | None |
| 8.01% | 22.49% | same, Kaldi | 2015 | HMM-(SAT)GMM | kaldi-asr/kaldi |
| None | 12.51% | Audio Augmentation for Speech Recognition | 2015 | TDNN + pNorm + speed up/down speech | kaldi-asr/kaldi |

## Lexicon

* WER: word error rate (a minimal sketch of the computation appears after this list)

* CER: character error rate

* PER: phone error rate

* LM: language model

* HMM: hidden Markov model

* GMM: Gaussian mixture model

* DNN: deep neural network

* CNN: convolutional neural network

* DBN: deep belief network (RBM-based DNN)

* TDNN-F: a factored form of time delay neural networks (TDNN)

* RNN: recurrent neural network

* LSTM: long short-term memory

* CTC: connectionist temporal classification

* MMI: maximum mutual information

* MPE: minimum phone error

* sMBR: state-level minimum Bayes risk

* SAT: speaker adaptive training

* MLLR: maximum likelihood linear regression

* LDA: (in this context) linear discriminant analysis

* MFCC: Mel frequency cepstral coefficients

* FB/FBANKS/MFSC: Mel frequency spectral coefficients

* IFCC: instantaneous frequency cosine coefficients (https://github.com/siplabiith/IFCC-Feature-Extraction)

* VGG: very deep convolutional neural network from the Visual Geometry Group; the VGG architecture is two 3x3 convolutions followed by one pooling layer, repeated
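
Since WER and CER drive every table above, here is a minimal sketch of the metric: the word-level (for CER, character-level) Levenshtein distance between reference and hypothesis, divided by the reference length. The `wer` helper below is illustrative only and is not taken from any toolkit listed above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.

    For CER, replace .split() with list(...) to compare characters.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.333...
```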
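
For the VGG entry, a rough PyTorch sketch of one such block (two 3x3 convolutions, then one pooling layer) is shown below; channel sizes, activation, and the number of repeats are illustrative assumptions, not the configuration of any system in the tables.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """One VGG-style block: two 3x3 convolutions followed by one pooling layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),  # halves time and frequency
    )

# Repeated blocks over (batch, 1, time, mel-bins) feature maps,
# as used as an encoder front end in some of the systems above.
frontend = nn.Sequential(vgg_block(1, 64), vgg_block(64, 128))
x = torch.randn(4, 1, 100, 80)  # 4 utterances, 100 frames, 80 mel bins
print(frontend(x).shape)        # torch.Size([4, 128, 25, 20])
```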