athena.models.asr.speech_u2

This code is modified from https://github.com/wenet-e2e/wenet.git.

Module Contents

Classes

SpeechU2

Base model for U2

SpeechTransformerU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechConformerU2

Conformer-U2

class athena.models.asr.speech_u2.SpeechU2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Base model for U2

default_config
enable_tf_funtion()
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

Used to get the logit length of the encoder output.
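As a rough illustration of what this length mapping typically does, the sketch below assumes a U2-style front end with two stride-2 convolutions (overall subsampling by 4). The helper name and formula are assumptions for illustration; the exact mapping in this module depends on its x_net configuration.

    import tensorflow as tf

    def compute_logit_length_sketch(input_length: tf.Tensor) -> tf.Tensor:
        # Hypothetical re-implementation: map feature-frame lengths to encoder
        # output lengths, assuming two stride-2 convolutions in the front end.
        length = tf.cast(input_length, tf.float32)
        length = tf.math.ceil(length / 2)   # first stride-2 convolution
        length = tf.math.ceil(length / 2)   # second stride-2 convolution
        return tf.cast(length, tf.int32)

    print(compute_logit_length_sketch(tf.constant([100, 37])))  # -> [25 10]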

_forward_transformer_encoder(x, x_mask, training=None)
_forward_encoder(speech: tensorflow.Tensor, speech_length: tensorflow.Tensor, decoding_chunk_size: int = -1, num_decoding_left_chunks: int = -1, simulate_streaming: bool = False, use_dynamic_left_chunk: bool = False, static_chunk_size: int = -1, training: bool = None) Tuple[tensorflow.Tensor, tensorflow.Tensor]
_encoder_forward_chunk(xs: tensorflow.Tensor, offset: int, required_cache_size: int, subsampling_cache: tensorflow.Tensor, elayers_output_cache: tensorflow.Tensor, conformer_cnn_cache: tensorflow.Tensor) Tuple[tensorflow.Tensor, tensorflow.Tensor, List[tensorflow.Tensor], List[tensorflow.Tensor]]

Forward just one chunk

Parameters
  • xs (tf.Tensor) – chunk input, shape=[B(1), T, feat_dim]

  • offset (int) – current offset in encoder output time stamp

  • required_cache_size (int) – cache size required for next chunk computation; >=0: actual cache size, <0: all history cache is required

  • subsampling_cache (tf.Tensor) – subsampling cache, shape=[B(1), T, d_model]

  • elayers_output_cache (tf.Tensor) – transformer/conformer encoder layers output cache, shape=[num_layers, 1, T, d_model]

  • conformer_cnn_cache (tf.Tensor) – conformer cnn cache, shape=[num_layers, 1, T, d_model]

Returns

tf.Tensor: output of current input xs
tf.Tensor: subsampling cache required for next chunk computation
tf.Tensor: encoder layers output cache required for next chunk computation
tf.Tensor: conformer cnn cache

Return type

Tuple[tf.Tensor, tf.Tensor, List[tf.Tensor], List[tf.Tensor]]
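The required_cache_size convention above can be illustrated with a small helper. This is a sketch of the semantics only, not code from this module, and it assumes a cache laid out as [1, T, d_model].

    import tensorflow as tf

    def trim_cache_sketch(cache: tf.Tensor, required_cache_size: int) -> tf.Tensor:
        # Keep only the frames the next chunk needs, following the convention
        # described above: >= 0 keeps that many trailing frames, < 0 keeps all.
        if required_cache_size < 0:
            return cache                    # all history cache is required
        if required_cache_size == 0:
            return cache[:, :0, :]          # no cache carried over
        return cache[:, -required_cache_size:, :]

    cache = tf.zeros([1, 32, 256])
    print(trim_cache_sketch(cache, 16).shape)   # (1, 16, 256)
    print(trim_cache_sketch(cache, -1).shape)   # (1, 32, 256)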

_encoder_forward_chunk_by_chunk(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor]
Forward the input chunk by chunk, with chunk_size, in a streaming fashion.

In the streaming-style chunk-by-chunk forward pass, special attention must be paid to the computation cache. Three things in the current network need to be taken into account:

  1. transformer/conformer encoder layers output cache

  2. convolution in conformer

  3. convolution in subsampling

However, we do not implement a subsampling cache, for three reasons:
  1. We can make the subsampling module output the right result by overlapping the input instead of caching the left context. This wastes some computation, but subsampling accounts for only a very small fraction of the computation in the whole model.

  2. Typically there are several convolution layers with subsampling in the subsampling module; it is tricky and complicated to cache across convolution layers with different subsampling rates.

  3. Currently, nn.Sequential is used to stack all the convolution layers in subsampling; we would need to rewrite it to make it work with a cache, which is not preferred.

Parameters
  • speech (tf.Tensor) – (1, max_len, feat_dim)

  • decoding_chunk_size (int) – decoding chunk size
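A hypothetical driver for this behaviour is sketched below: it feeds the utterance to _encoder_forward_chunk one window at a time and carries the caches between calls. The subsampling rate, the context length, and the unpacking of get_encoder_init_input() are assumptions, so treat this as a sketch of the procedure rather than the module's exact code.

    import tensorflow as tf

    def stream_encode_sketch(model, speech, decoding_chunk_size,
                             num_decoding_left_chunks=-1,
                             subsampling_rate=4, context=7):
        # Assumed geometry of the front end (not read from the model):
        stride = subsampling_rate * decoding_chunk_size                   # input frames consumed per chunk
        window = (decoding_chunk_size - 1) * subsampling_rate + context   # input frames needed per chunk
        required_cache_size = decoding_chunk_size * num_decoding_left_chunks

        # Assumption: get_encoder_init_input() yields the three initial caches.
        subsampling_cache, elayers_output_cache, conformer_cnn_cache = \
            model.get_encoder_init_input()

        offset, outputs = 0, []
        num_frames = int(speech.shape[1])
        for start in range(0, num_frames - context + 1, stride):
            chunk = speech[:, start:start + window, :]
            y, subsampling_cache, elayers_output_cache, conformer_cnn_cache = \
                model._encoder_forward_chunk(chunk, offset, required_cache_size,
                                             subsampling_cache, elayers_output_cache,
                                             conformer_cnn_cache)
            outputs.append(y)
            offset += int(y.shape[1])        # advance the encoder-output time stamp
        return tf.concat(outputs, axis=1)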

get_encoder_init_input()
_forward_decoder(encoder_out, encoder_mask, hyps_pad, output_mask)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored with the attention decoder on the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]
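The second, rescoring pass described above can be illustrated in isolation. The sketch below takes n-best hypotheses and CTC scores (as produced by CTC prefix beam search) and keeps the hypothesis with the best combined score. The attention_scorer callback stands in for running the attention decoder over the shared encoder output, and the ctc_weight interpolation is an assumption about how the two scores are combined here.

    def rescore_nbest_sketch(nbest, ctc_scores, attention_scorer, ctc_weight=0.5):
        # Rescore each CTC-prefix-beam-search hypothesis with the attention decoder
        # and keep the one with the best combined score.
        best_hyp, best_score = None, float("-inf")
        for hyp, ctc_score in zip(nbest, ctc_scores):
            att_score = attention_scorer(hyp)            # decoder log-prob of this hypothesis
            score = att_score + ctc_weight * ctc_score   # assumed interpolation of the two passes
            if score > best_score:
                best_hyp, best_score = hyp, score
        return best_hyp

    # Toy usage with a stand-in decoder scorer that prefers shorter hypotheses.
    nbest = [[3, 7, 7, 2], [3, 7, 2]]
    ctc_scores = [-4.1, -3.8]
    print(rescore_nbest_sketch(nbest, ctc_scores, attention_scorer=lambda h: -0.5 * len(h)))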

Beam search for the frozen model; only supports batch=1.

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

Batch beam search for the transformer model.

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

Restore from a pretrained model.

class athena.models.asr.speech_u2.SpeechTransformerU2(data_descriptions, config=None)

Bases: SpeechU2

U2 implementation of a SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself.

default_config
class athena.models.asr.speech_u2.SpeechConformerU2(data_descriptions, config=None)

Bases: SpeechU2

Conformer-U2

default_config