athena.models.asr.speech_u2

This code is modified from https://github.com/wenet-e2e/wenet.git.

Module Contents

Classes

SpeechU2

Base model for U2

SpeechTransformerU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechConformerU2

Conformer-U2

class athena.models.asr.speech_u2.SpeechU2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Base model for U2

default_config
enable_tf_funtion()
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

Used to get the logit length of the encoder output.
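As a rough illustration of what this length mapping typically does, the sketch below assumes a U2-style front end with two stride-2 convolutions (overall subsampling by 4). The helper name and formula are assumptions for illustration; the exact mapping in this module depends on its x_net configuration.

    import tensorflow as tf

    def compute_logit_length_sketch(input_length: tf.Tensor) -> tf.Tensor:
        # Hypothetical re-implementation: map feature-frame lengths to encoder
        # output lengths, assuming two stride-2 convolutions in the front end.
        length = tf.cast(input_length, tf.float32)
        length = tf.math.ceil(length / 2)   # first stride-2 convolution
        length = tf.math.ceil(length / 2)   # second stride-2 convolution
        return tf.cast(length, tf.int32)

    print(compute_logit_length_sketch(tf.constant([100, 37])))  # -> [25 10]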

_forward_transformer_encoder(x, x_mask, training=None)
_forward_encoder(speech: tensorflow.Tensor, speech_length: tensorflow.Tensor, decoding_chunk_size: int = -1, num_decoding_left_chunks: int = -1, simulate_streaming: bool = False, use_dynamic_left_chunk: bool = False, static_chunk_size: int = -1, training: bool = None) Tuple[tensorflow.Tensor, tensorflow.Tensor]
_encoder_forward_chunk(xs: tensorflow.Tensor, offset: int, required_cache_size: int, subsampling_cache: tensorflow.Tensor, elayers_output_cache: tensorflow.Tensor, conformer_cnn_cache: tensorflow.Tensor) Tuple[tensorflow.Tensor, tensorflow.Tensor, List[tensorflow.Tensor], List[tensorflow.Tensor]]

Forward just one chunk

Parameters
  • xs (tf.Tensor) – chunk input, shape=[B(1), T, feat_dim]

  • offset (int) – current offset in encoder output time stamp

  • required_cache_size (int) – cache size required for next chunk computation; >=0: actual cache size, <0: all history cache is required

  • subsampling_cache (tf.Tensor) – subsampling cache, shape=[B(1), T, d_model]

  • elayers_output_cache (tf.Tensor) – transformer/conformer encoder layers output cache, shape=[num_layers, 1, T, d_model]

  • conformer_cnn_cache (tf.Tensor) – conformer cnn cache, shape=[num_layers, 1, T, d_model]

Returns

tf.Tensor: output of current input xs
tf.Tensor: subsampling cache required for next chunk computation
tf.Tensor: encoder layers output cache required for next chunk computation
tf.Tensor: conformer cnn cache

Return type

Tuple[tf.Tensor, tf.Tensor, List[tf.Tensor], List[tf.Tensor]]
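The required_cache_size convention above can be illustrated with a small helper. This is a sketch of the semantics only, not code from this module, and it assumes a cache laid out as [1, T, d_model].

    import tensorflow as tf

    def trim_cache_sketch(cache: tf.Tensor, required_cache_size: int) -> tf.Tensor:
        # Keep only the frames the next chunk needs, following the convention
        # described above: >= 0 keeps that many trailing frames, < 0 keeps all.
        if required_cache_size < 0:
            return cache                    # all history cache is required
        if required_cache_size == 0:
            return cache[:, :0, :]          # no cache carried over
        return cache[:, -required_cache_size:, :]

    cache = tf.zeros([1, 32, 256])
    print(trim_cache_sketch(cache, 16).shape)   # (1, 16, 256)
    print(trim_cache_sketch(cache, -1).shape)   # (1, 32, 256)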

_encoder_forward_chunk_by_chunk(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor]
Forward the input chunk by chunk, with chunk_size, in a streaming fashion.

In the streaming-style chunk-by-chunk forward pass, special attention must be paid to the computation cache. Three things in the current network need to be taken into account:

  1. transformer/conformer encoder layers output cache

  2. convolution in conformer

  3. convolution in subsampling

However, we do not implement a subsampling cache, for three reasons:
  1. We can make the subsampling module output the right result by overlapping the input instead of caching the left context. This wastes some computation, but subsampling accounts for only a very small fraction of the computation in the whole model.

  2. Typically there are several convolution layers with subsampling in the subsampling module; it is tricky and complicated to cache across convolution layers with different subsampling rates.

  3. Currently, nn.Sequential is used to stack all the convolution layers in subsampling; we would need to rewrite it to make it work with a cache, which is not preferred.

Parameters
  • speech (tf.Tensor) – (1, max_len, feat_dim)

  • decoding_chunk_size (int) – decoding chunk size
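A hypothetical driver for this behaviour is sketched below: it feeds the utterance to _encoder_forward_chunk one window at a time and carries the caches between calls. The subsampling rate, the context length, and the unpacking of get_encoder_init_input() are assumptions, so treat this as a sketch of the procedure rather than the module's exact code.

    import tensorflow as tf

    def stream_encode_sketch(model, speech, decoding_chunk_size,
                             num_decoding_left_chunks=-1,
                             subsampling_rate=4, context=7):
        # Assumed geometry of the front end (not read from the model):
        stride = subsampling_rate * decoding_chunk_size                   # input frames consumed per chunk
        window = (decoding_chunk_size - 1) * subsampling_rate + context   # input frames needed per chunk
        required_cache_size = decoding_chunk_size * num_decoding_left_chunks

        # Assumption: get_encoder_init_input() yields the three initial caches.
        subsampling_cache, elayers_output_cache, conformer_cnn_cache = \
            model.get_encoder_init_input()

        offset, outputs = 0, []
        num_frames = int(speech.shape[1])
        for start in range(0, num_frames - context + 1, stride):
            chunk = speech[:, start:start + window, :]
            y, subsampling_cache, elayers_output_cache, conformer_cnn_cache = \
                model._encoder_forward_chunk(chunk, offset, required_cache_size,
                                             subsampling_cache, elayers_output_cache,
                                             conformer_cnn_cache)
            outputs.append(y)
            offset += int(y.shape[1])        # advance the encoder-output time stamp
        return tf.concat(outputs, axis=1)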

get_encoder_init_input()
_forward_decoder(encoder_out, encoder_mask, hyps_pad, output_mask)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored with the attention decoder on the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]
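The second, rescoring pass described above can be illustrated in isolation. The sketch below takes n-best hypotheses and CTC scores (as produced by CTC prefix beam search) and keeps the hypothesis with the best combined score. The attention_scorer callback stands in for running the attention decoder over the shared encoder output, and the ctc_weight interpolation is an assumption about how the two scores are combined here.

    def rescore_nbest_sketch(nbest, ctc_scores, attention_scorer, ctc_weight=0.5):
        # Rescore each CTC-prefix-beam-search hypothesis with the attention decoder
        # and keep the one with the best combined score.
        best_hyp, best_score = None, float("-inf")
        for hyp, ctc_score in zip(nbest, ctc_scores):
            att_score = attention_scorer(hyp)            # decoder log-prob of this hypothesis
            score = att_score + ctc_weight * ctc_score   # assumed interpolation of the two passes
            if score > best_score:
                best_hyp, best_score = hyp, score
        return best_hyp

    # Toy usage with a stand-in decoder scorer that prefers shorter hypotheses.
    nbest = [[3, 7, 7, 2], [3, 7, 2]]
    ctc_scores = [-4.1, -3.8]
    print(rescore_nbest_sketch(nbest, ctc_scores, attention_scorer=lambda h: -0.5 * len(h)))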

Beam search for the frozen model; only supports batch=1.

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

Batch beam search for the transformer model.

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

Restore from a pretrained model.

class athena.models.asr.speech_u2.SpeechTransformerU2(data_descriptions, config=None)

Bases: SpeechU2

U2 implementation of a SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself.

default_config
class athena.models.asr.speech_u2.SpeechConformerU2(data_descriptions, config=None)

Bases: SpeechU2

Conformer-U2

default_config