athena.models.asr.av_conformer

Audio-visual Conformer implementation

Module Contents

Classes

AudioVideoConformer

Audio and video multimodal Conformer.

class athena.models.asr.av_conformer.AudioVideoConformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Audio and video multimodal Conformer. The model mainly consists of four parts: the a_net for input audio fbank feature preparation, the v_net for input video feature preparation, the y_net for output preparation, and the transformer itself.
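As a rough sketch of the data flow described above (the stub projections and the fusion-by-concatenation step here are illustrative assumptions, not Athena's actual implementation):

```python
import numpy as np

def a_net(fbank, d_model=4):
    # Stub audio front-end: project fbank features (T, F) to (T, d_model).
    rng = np.random.default_rng(0)
    w = rng.standard_normal((fbank.shape[1], d_model))
    return fbank @ w

def v_net(video, d_model=4):
    # Stub video front-end: project flattened visual features (T, V) to (T, d_model).
    rng = np.random.default_rng(1)
    w = rng.standard_normal((video.shape[1], d_model))
    return video @ w

def fuse(audio_feat, video_feat):
    # One plausible fusion: align the two streams in time and concatenate
    # along the feature axis, giving the encoder a (T, 2 * d_model) input.
    t = min(audio_feat.shape[0], video_feat.shape[0])
    return np.concatenate([audio_feat[:t], video_feat[:t]], axis=-1)

fbank = np.ones((10, 80))   # 10 frames of 80-dim fbank
video = np.ones((10, 512))  # 10 frames of flattened visual features
encoder_input = fuse(a_net(fbank), v_net(video))
print(encoder_input.shape)  # (10, 8)
```

The fused sequence is what the Conformer encoder would consume; the y_net would analogously embed the shifted target tokens for the decoder.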

default_config
call(samples, training: bool = None)

Call the model on a batch of samples.

compute_logit_length(input_length)

Compute the logit length (encoder output length) from the input feature length.
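A minimal sketch of how such a logit length is typically derived, assuming two stride-2 convolutional subsampling layers in the front end (the actual kernel/stride configuration belongs to the model and is not shown here):

```python
import math

def compute_logit_length(input_length, num_subsample_layers=2):
    """Halve the frame count once per assumed stride-2 conv layer (ceil division)."""
    length = input_length
    for _ in range(num_subsample_layers):
        length = math.ceil(length / 2)
    return length

print(compute_logit_length(100))  # 25
print(compute_logit_length(7))    # 2
```

This length is what CTC-style losses and decoders need, since the encoder emits one logit per subsampled frame rather than per input frame.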

_forward_encoder(samples, training: bool = None)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) → List[int]

Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored by the attention decoder using the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – final dense layer applied to the encoder output to produce CTC probabilities.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]
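The two-pass scheme can be sketched as follows; the score combination weight and the stub attention-decoder scorer are illustrative assumptions, not Athena's actual scoring code:

```python
def attention_rescoring(nbest, attn_score_fn, ctc_weight=0.5):
    """Pick the hypothesis maximizing a weighted sum of CTC and attention scores.

    nbest: list of (token_ids, ctc_score) pairs from CTC prefix beam search.
    attn_score_fn: scores token_ids with the attention decoder (stubbed below).
    """
    best_hyp, best_score = None, float("-inf")
    for token_ids, ctc_score in nbest:
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * attn_score_fn(token_ids)
        if score > best_score:
            best_hyp, best_score = token_ids, score
    return best_hyp

# Stub attention decoder: penalizes shorter hypotheses.
nbest = [([1, 2], -3.0), ([1, 2, 3], -3.5)]
result = attention_rescoring(nbest, attn_score_fn=lambda ids: -1.0 * (4 - len(ids)))
print(result)  # [1, 2, 3]
```

Here the second hypothesis wins despite its worse CTC score because the attention decoder prefers it, which is exactly the kind of correction the second pass is for.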

Batch beam search for the transformer model.

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – RNN language model used for beam search
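Beam search itself is the usual expand-score-prune loop; the toy `next_scores` function below is a stand-in assumption for the transformer decoder (plus optional LM), not the real scoring path:

```python
import math

def beam_search(next_scores, beam_size, max_len, eos=0):
    """Toy beam search: next_scores(prefix) -> {token: log_prob}."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished beam carries over
                continue
            for tok, logp in next_scores(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        # Keep only the beam_size highest-scoring prefixes.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Stand-in decoder: favours token 1 for two steps, then end-of-sequence.
def next_scores(prefix):
    if len(prefix) < 2:
        return {1: math.log(0.9), 2: math.log(0.1)}
    return {0: math.log(0.95), 2: math.log(0.05)}

print(beam_search(next_scores, beam_size=2, max_len=3))  # [1, 1, 0]
```

In the real model the per-step scores would come from the decoder conditioned on the encoder output, with the LM log-probabilities added in with a shallow-fusion weight.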

restore_from_pretrained_model(pretrained_model, model_type='')

Restore weights from a pretrained model.
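A common way to implement such restoration is to copy only the variables whose names and shapes match between the two models; this name-matching loop over plain dicts is a generic sketch, not Athena's actual checkpoint logic:

```python
def restore_from_pretrained_model(model_weights, pretrained_weights):
    """Copy pretrained values into model_weights wherever names and shapes agree."""
    restored = []
    for name, value in pretrained_weights.items():
        if name in model_weights and len(model_weights[name]) == len(value):
            model_weights[name] = list(value)
            restored.append(name)
    return restored

model = {"encoder/w": [0.0, 0.0], "decoder/w": [0.0]}
pretrained = {"encoder/w": [1.5, -0.5], "lm/w": [9.9]}
restored = restore_from_pretrained_model(model, pretrained)
print(restored)            # ['encoder/w']
print(model["encoder/w"])  # [1.5, -0.5]
```

Variables that exist only in the pretrained checkpoint (here the hypothetical `lm/w`) are skipped, so a model can reuse an encoder trained under a different head without errors.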