athena

module

Subpackages

Submodules

Package Contents

Classes

SpeechDatasetBuilder

SpeechDatasetBuilder

LanguageDatasetBuilder

LanguageDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

AudioVedioRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

AudioVedioRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

MpcSpeechDatasetBuilder

SpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder

MpcSpeechDatasetKaldiIOBuilder

SpeechSynthesisDatasetBuilder

SpeechSynthesisDatasetBuilder

SpeechFastspeech2DatasetBuilder

SpeechSynthesisDatasetBuilder

FeatureNormalizer

Feature Normalizer

FS2FeatureNormalizer

Fastspeech2 Feature Normalizer

VoiceActivityDetectionDatasetKaldiIOBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

SpeechWakeupFramewiseDatasetKaldiIOBuilder

Dataset builder for the CNN model. The builder treats every spliced frame as one image.

SpeechWakeupDatasetKaldiIOBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension

SpeechWakeupDatasetKaldiIOBuilderAVCE

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension

TextFeaturizer

The main text featurizer interface

TextTokenizer

TextTokenizer

PositionalEncoding

positional encoding can be used in transformer

Collapse4D

collapse4d can be used in cnn-lstm for speech processing

TdnnLayer

An implementation of Tdnn Layer

Gelu

Gaussian Error Linear Unit.

MultiHeadAttention

Multi-head attention consists of four parts:

BahdanauAttention

the Bahdanau Attention

HanAttention

Refer to [Hierarchical Attention Networks for Document Classification]

MatchAttention

Refer to [Learning Natural Language Inference with LSTM]

Transformer

A transformer model. User is able to modify the attributes as needed.

TransformerEncoder

TransformerEncoder is a stack of N encoder layers

TransformerDecoder

TransformerDecoder is a stack of N decoder layers

TransformerEncoderLayer

TransformerEncoderLayer is made up of self-attn and feedforward network.

TransformerDecoderLayer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.

ResnetBasicBlock

Basic block of resnet

BaseModel

Base class for model.

MaskedPredictCoding

implementation for MPC pretrain model

AV_MtlTransformer

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to

SpeechConformer

Standard implementation of a SpeechConformer. Model mainly consists of three parts:

SpeechConformerCTC

Standard implementation of a SpeechConformerCTC. Model mainly consists of two parts:

SpeechTransformer

Standard implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechTransformerU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechConformerU2

Conformer-U2

MtlTransformerCtc

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to

AudioVideoConformer

Audio and video multimode Conformer. Model mainly consists of three parts:

VadMarbleNet

implementation of frame-level or segment-level speech classification

VadDnn

implementation of frame-level or segment-level speech classification

RNNLM

Standard implementation of an RNNLM. Model mainly consists of an embedding layer,

TransformerLM

Standard implementation of a Transformer-based LM. Model mainly consists of an embedding layer,

FastSpeech

Reference: Fastspeech: Fast, robust and controllable text to speech

FastSpeech2

Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Tacotron2

An implementation of Tacotron2

TTSTransformer

TTS version of SpeechTransformer. Model mainly consists of three parts:

CnnModel

CNN model for kws

KWSConformer

Standard implementation of a KWSConformer. Model mainly consists of three parts:

CRnnModel

CRNN model for e2e kws

DnnModel

implementation of frame-level or segment-level speech classification

MISPModel

MISP challenge KWS baseline model for e2e kws

KWSTransformer_2Dense

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformer

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSAVTransformer

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformerRESNET

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformer_FocalLoss

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

BaseSolver

Base Training Solver.

HorovodSolver

A multi-process solver based on Horovod

DecoderSolver

ASR DecoderSolver

AVSolver

Base Solver.

AVHorovodSolver

A multi-process solver based on Horovod

AVDecoderSolver

DecoderSolver

VadSolver

VadSolver

SynthesisSolver

SynthesisSolver (TTS Solver)

CTCLoss

CTC LOSS

Seq2SeqSparseCategoricalCrossentropy

Seq2SeqSparseCategoricalCrossentropy LOSS

CTCAccuracy

CTCAccuracy

Seq2SeqSparseCategoricalAccuracy

Seq2SeqSparseCategoricalAccuracy

Checkpoint

A wrapper for the TensorFlow checkpoint

WarmUpLearningSchedule

WarmUp Learning rate schedule for Adam

WarmUpAdam

WarmUpAdam Implementation

WarmUpLearningSchedule1

WarmUp Learning rate schedule for Adam and can initialize a learning rate

WarmUpAdam1

WarmUpAdam Implementation

ExponentialDecayLearningRateSchedule

ExponentialDecayLearningRateSchedule

ExponentialDecayAdam

WarmUpAdam Implementation

HParams

Class to hold a set of hyperparameters as name-value pairs.

CTCPrefixScoreTH

Batch processing of CTCPrefixScore

Functions

make_positional_encoding(position, d_model)

generate a positional encoding list

collapse4d(x[, name])

reshape from [N T D C] -> [N T D*C]

gelu(x)

Gaussian Error Linear Unit.

register_and_parse_hparams(default_config[, config])

register default config and parse

generate_square_subsequent_mask(size)

Generate a square mask for the sequence. The masked positions are filled with float(1.0).

generate_square_subsequent_mask_u2(size)

Generate a square mask for the sequence. The masked positions are filled with bool(True).

get_wave_file_length(wave_file)

get the wave file length (duration) in ms

set_default_summary_writer([summary_directory])

get_dict_from_scp(vocab[, func])

Attributes

__version__

class athena.SpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict
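A minimal usage sketch of the builder interface documented above (the config keys and the csv path are assumptions for illustration; the exact schema lives in the json configs shipped with Athena's examples):

>>> from athena import SpeechDatasetBuilder
>>> config = {"data_csv": "examples/asr/aishell/data/train.csv",                  # assumed key/path
>>>           "audio_config": {"type": "Fbank", "filterbank_channel_count": 40}}  # assumed key
>>> builder = SpeechDatasetBuilder(config)
>>> sample = builder[0]                      # __getitem__ returns the dict shown above
>>> sample["input"].shape, sample["input_length"]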

class athena.LanguageDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

LanguageDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property input_vocab_size

@property

Returns

the input vocab size

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.int32,
    "input_length": tf.int32,
    "output": tf.int32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

load csv file

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_labels,
    "input_length": input_length,
    "output": output_labels,
    "output_length": output_length,
}

Return type

dict

class athena.SpeechRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

class athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)

Generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.SpeechRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_data(file_path)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.
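A sketch of the pipeline formed by the methods listed above (the config contents are assumptions, and whether shard/batch_wise_shuffle return a new builder or shuffle in place is also assumed here):

>>> builder = SpeechRecognitionDatasetBatchBinsBuilder(config)    # config contents assumed
>>> builder = builder.shard(num_shards=4, index=0)                # keep 1/4 of the data on this worker
>>> builder = builder.batch_wise_shuffle(batch_size=1, epoch=0)   # batch_size defaults to 1 in batch_bins mode
>>> dataset = builder.as_dataset(batch_size=16, num_threads=1)    # a tf.data.Dataset, per the doc above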

class athena.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)
read_shape_file(file_dir=None)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.AudioVedioRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

image_normalizer(image)
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
}

Return type

dict

class athena.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "video":tf.TensorShape([None, None, high, wide]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
    "utt_id": tf.TensorShape([None]),
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "video": tf.TensorShape([None, None, None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.MpcSpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

MpcSpeechDatasetBuilder: this data builder is an online feature extractor and is used for MPC training

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.MpcSpeechDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder: this data builder is an offline feature data builder and is used for MPC training

default_config
preprocess_data(file_path, apply_sort_filter=True)

generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.SpeechSynthesisDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "speaker": tf.TensorShape([])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)
class athena.SpeechFastspeech2DatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32,
    "duration": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "f0": tf.TensorShape([None]),
    "energy": tf.TensorShape([None]),
    "speaker": tf.TensorShape([]),
    "duration": tf.TensorShape([None])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
load_duration(duration)
preprocess_data(file_path)

generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).

load_audio_feature(audio_feature_file)
__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.FeatureNormalizer(cmvn_file=None)

Feature Normalizer

__call__(feat_data, speaker, reverse=False)
apply_cmvn(feat_data, speaker, reverse=False)

transform original feature to normalized feature

compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)

compute cmvn for filtered entries

compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)

because of memory issues, we use an incremental approximation for the calculation of cmvn

compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)

compute cmvn for filtered entries using kaldi-format data

load_cmvn()

load mean and var

save_cmvn(variable_list)

save cmvn variables determined by variable_list to file

Parameters

variable_list (list) – e.g. [“speaker”, “mean”, “var”]
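A minimal sketch of the CMVN workflow documented above (the cmvn file path, the "global" speaker key and the shape of feat are assumptions; feat stands for a [time, dim, channel] feature tensor):

>>> from athena import FeatureNormalizer
>>> normalizer = FeatureNormalizer(cmvn_file="examples/asr/aishell/data/cmvn")   # assumed path
>>> normalizer.load_cmvn()                                     # load mean and var
>>> normalized = normalizer(feat, "global")                    # __call__ applies cmvn
>>> restored = normalizer(normalized, "global", reverse=True)  # reverse=True undoes it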

class athena.FS2FeatureNormalizer(cmvn_file=None)

Bases: FeatureNormalizer

Fastspeech2 Feature Normalizer

__call__(feat_data, speaker, feature_type='mel', reverse=False)
compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)

compute cmvn of mel-spectrogram, f0 and energy

apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)

transform original feature to normalized feature

load_cmvn()

load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var

class athena.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(data_scps_dir)

generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).

splice_feature(feature)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat
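The splicing described above can be pictured with the following standalone sketch (illustrative only, not the library's implementation): each frame is stacked with its left and right context, repeating the first or last frame at the boundaries.

import numpy as np

def splice_frames(feature, left_context, right_context):
    # feature: [time, dim, 1]; returns [time, left_context + 1 + right_context, dim, 1]
    time_steps = feature.shape[0]
    spliced = []
    for t in range(time_steps):
        context = [feature[min(max(t + k, 0), time_steps - 1)]
                   for k in range(-left_context, right_context + 1)]
        spliced.append(np.stack(context, axis=0))
    return np.stack(spliced, axis=0)

feat = np.random.randn(100, 40, 1).astype("float32")
print(splice_frames(feat, 10, 10).shape)   # (100, 21, 40, 1)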

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt": utt
}

Return type

dict

class athena.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the CNN model. The builder treats every spliced frame as one image, for example (21, 63). The input data format is (batch, timestep, height, width, channel), for example (b, t, 21, 63, 1); unbatching is used to split it, so the output data format is, for example, (b, 21, 63, 1).

property sample_type

example types

property sent_sample_shape
property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.SpeechWakeupDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.TextFeaturizer(config=None)

The main text featurizer interface

property model_type

@property

Returns

the model type

property unk_index

@property

Returns

the unk index

Return type

int

supported_model
default_config
load_model(model_file)

load model

delete_punct(tokens)

delete punctuation tokens

__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(sequences, ignored_id=[])
class athena.TextTokenizer(text=None)

TextTokenizer

load_model(text)

load model

save_vocab(vocab_file)
load_csv(csv_file)
__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
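A minimal encode/decode sketch for the featurizer interface documented above (the config keys and the vocab path passed to TextFeaturizer are assumptions for illustration):

>>> from athena import TextFeaturizer
>>> featurizer = TextFeaturizer({"type": "vocab", "model": "examples/asr/aishell/data/vocab"})  # assumed config
>>> ids = featurizer.encode("hello athena")   # sentence -> list of ids, special tokens added
>>> featurizer.decode(ids)                    # list of ids -> sentence
>>> len(featurizer), featurizer.unk_index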
athena.make_positional_encoding(position, d_model)

generate a positional encoding list
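The encoding referred to here is the sinusoidal scheme from "Attention Is All You Need"; the sketch below is illustrative and not necessarily byte-identical to Athena's implementation:

import numpy as np

def sinusoidal_positional_encoding(position, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(position)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    pe = np.zeros((position, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe   # shape [position, d_model]

print(sinusoidal_positional_encoding(800, 512).shape)   # (800, 512)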

athena.collapse4d(x, name=None)

reshape from [N T D C] -> [N T D*C] using tf.shape(x), which generates a tensor, instead of x.shape

athena.gelu(x)

Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters

x – float Tensor to perform activation.

Returns

x with the GELU activation applied.
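For reference, the widely used tanh approximation from the cited paper looks as follows (a standalone sketch; athena.gelu itself may use the exact erf form instead):

import numpy as np

def gelu_approx(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))

print(gelu_approx(np.array([-1.0, 0.0, 1.0])))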

class athena.PositionalEncoding(d_model, max_position=800, scale=False)

Bases: tensorflow.keras.layers.Layer

positional encoding can be used in transformer

call(x)

call function

class athena.Collapse4D

Bases: tensorflow.keras.layers.Layer

collapse4d can be used in cnn-lstm for speech processing; it reshapes from [N T D C] -> [N T D*C]

call(x)
class athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)

Bases: tensorflow.keras.layers.Layer

An implementation of Tdnn Layer

Parameters
  • context – an int of left and right context, or a list of context indexes, e.g. (-2, 0, 2).

  • output_dim – the dim of the linear transform

call(x, training=None, mask=None)
class athena.Gelu

Bases: tensorflow.keras.layers.Layer

Gaussian Error Linear Unit.

This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters

x – float Tensor to perform activation.

Returns

x with the GELU activation applied.

call(x)
class athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Multi-head attention consists of four parts:

  • Linear layers and split into heads.

  • Scaled dot-product attention.

  • Concatenation of heads.

  • Final linear layer.

Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose, and tf.reshape) and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.

split_heads(x, batch_size)

Split the last dimension into (num_heads, depth).

Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)

call(v, k, q, mask)

call function
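The scaled dot-product step described above, as a standalone sketch (illustrative, not Athena's exact code):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += mask * -1e9                     # push masked positions towards -inf
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights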

class athena.BahdanauAttention(units, input_dim=1024)

Bases: tensorflow.keras.Model

the Bahdanau Attention

call(query, values)

call function

class athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, features)
build(input_shape)

build in keras layer

call(inputs, training=None, mask=None)

call function in keras

compute_output_shape(input_shape)

compute output shape

_masked_softmax(logits, mask, axis)

Compute softmax with input mask.

class athena.MatchAttention(config, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, steps, features)
call(tensors)

Attention layer.

class athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None, conv_module_kernel_size=0)

Bases: tensorflow.keras.layers.Layer

A transformer model. User is able to modify the attributes as needed.

Parameters
  • d_model – the number of expected features in the encoder/decoder inputs (default=512).

  • nhead – the number of heads in the multiheadattention models (default=8).

  • num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).

  • num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the encoder/decoder intermediate layer, relu or gelu (default=gelu).

  • custom_encoder – custom encoder (default=None).

  • custom_decoder – custom decoder (default=None).

Examples

>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_model(src, tgt)
call(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)

Take in and process masked source/target sequences.

Parameters
  • src – the sequence to the encoder (required).

  • tgt – the sequence to the decoder (required).

  • src_mask – the additive mask for the src sequence (optional).

  • tgt_mask – the additive mask for the tgt sequence (optional).

  • memory_mask – the additive mask for the encoder output (optional).

  • src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).

  • tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).

  • memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).

Shape:
  • src: \((N, S, E)\).

  • tgt: \((N, T, E)\).

  • src_mask: \((N, S)\).

  • tgt_mask: \((N, T)\).

  • memory_mask: \((N, S)\).

Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float(‘-inf’) and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.

  • output: \((N, T, E)\).

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number

Examples

>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
class athena.TransformerEncoder(encoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerEncoder is a stack of N encoder layers

Parameters
  • encoder_layer – an instance of the TransformerEncoderLayer() class (required).

  • num_layers – the number of sub-encoder-layers in the encoder (required).

  • norm – the layer normalization component (optional).

Examples

>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
>>>                    for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.normal((10, 32, 512))
>>> out = transformer_encoder(src)
call(src, src_mask=None, training=None)

Pass the input through the encoder layers in turn.

Parameters
  • src – the sequence to the encoder (required).

  • mask – the mask for the src sequence (optional).

set_unidirectional(uni=False)

whether to apply triangular masks to make the transformer unidirectional
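An illustrative way to build the triangular (square subsequent) mask mentioned here and in generate_square_subsequent_mask above, with masked positions marked by 1.0 (a sketch, not the library's exact implementation):

import tensorflow as tf

def square_subsequent_mask(size):
    # 1.0 marks a masked (future) position, 0.0 an allowed one
    return 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(square_subsequent_mask(4).numpy())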

class athena.TransformerDecoder(decoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerDecoder is a stack of N decoder layers

Parameters
  • decoder_layer – an instance of the TransformerDecoderLayer() class (required).

  • num_layers – the number of sub-decoder-layers in the decoder (required).

  • norm – the layer normalization component (optional).

Examples

>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
>>>                     for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_decoder(tgt, memory)
call(tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)

Pass the inputs (and mask) through the decoder layer in turn.

Parameters
  • tgt – the sequence to the decoder (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

class athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None, conv_module_kernel_size=0)

Bases: tensorflow.keras.layers.Layer

TransformerEncoderLayer is made up of self-attn and feedforward network.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multiheadattention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, relu or gelu (default=gelu).

Examples

>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
call(src, src_mask=None, training=None)

Pass the input through the encoder layer.

Parameters
  • src – the sequence to the encoder layer (required).

  • mask – the mask for the src sequence (optional).

set_unidirectional(uni=False)

whether to apply triangular masks to make the transformer unidirectional

class athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')

Bases: tensorflow.keras.layers.Layer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.

Reference:

“Attention Is All You Need”.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multiheadattention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, relu or gelu (default=gelu).

Examples

>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
call(tgt, memory, tgt_mask=None, memory_mask=None, training=None)

Pass the inputs (and mask) through the decoder layer.

Parameters
  • tgt – the sequence to the decoder layer (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

class athena.ResnetBasicBlock(num_filter, stride=1)

Bases: tensorflow.keras.layers.Layer

Basic block of resnet Reference to paper “Deep residual learning for image recognition”

call(inputs)

call model

make_downsample_layer(num_filter, stride)

perform downsampling using conv layer with stride != 1

class athena.BaseModel(**kwargs)

Bases: tensorflow.keras.Model

Base class for model.

abstract call(samples, training=None)

call model

get_loss(outputs, samples, training=None)

get loss

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

prepare_samples(samples)

prepare special data carefully: do not change the shape of samples

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

decode(samples, hparams, decoder)

decode interface
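A skeletal sketch of what subclassing BaseModel looks like (the layer choice and the get_loss return layout are assumptions beyond the interface documented above):

import tensorflow as tf
from athena import BaseModel

class ToyModel(BaseModel):
    # Illustrative subclass only; real Athena models wire in encoders, losses and metrics.
    def __init__(self, num_classes, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, samples, training=None):
        # samples follows the dataset builders' sample dict: input/input_length/output/output_length
        return self.dense(samples["input"])

    def compute_logit_length(self, input_length):
        return input_length                  # no subsampling in this toy model

    def get_loss(self, outputs, samples, training=None):
        # real models plug CTC / seq2seq losses and metrics in here (return layout assumed)
        return tf.constant(0.0), {}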

class athena.MaskedPredictCoding(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation for MPC pretrain model

Parameters
  • num_filters – an int, i.e. the number of filters in the cnn

  • d_model – an int, i.e. the dimension of the model

  • num_heads – number of heads in the transformer

  • num_encoder_layers – number of layers in the encoder

  • dff – an int, i.e. the dimension of the feed-forward network

  • rate – rate of the dropout layers

  • chunk_size – number of consecutive masks, i.e. 1 or 3

  • keep_probability – probability not to be masked

  • mode – train mode, i.e. MPC: pretrain

  • max_pool_layers – index of max pool layers in encoder, default is -1

default_config
call(samples, training: bool = None)

used for training

Parameters
  • samples – a dict including keys ‘input’, ‘input_length’, ‘output_length’, ‘output’; input: acoustic features, Tensor, shape is (batch, time_len, dim, 1), i.e. f-bank

Return:

MPC outputs to fit acoustic features
    encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
get_loss(logits, samples, training=None)

get MPC loss

Parameters

logits – MPC output

Return:

MPC L1 loss and metrics
compute_logit_length(samples)

compute the logit length

generate_mpc_mask(input_data)

generate mask for pretraining

Parameters

input_data – acoustic features, i.e. F-bank

Return:

mask tensor
prepare_samples(samples)

prepare special data carefully: do not change the shape of samples

class athena.AV_MtlTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.

SUPPORTED_MODEL
default_config
call(samples, training=None)

call function in keras layers

get_loss(outputs, samples, training=None)

get loss used for training

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

restore_from_pretrained_model(pretrained_model, model_type='')

A more general-purpose interface for pretrained model restoration

Parameters
  • pretrained_model – checkpoint path of mpc model

  • model_type – the type of pretrained model to restore

decode(samples, hparams=None, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
class athena.SpeechConformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechConformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training: bool = None)
_forward_encoder_log_ctc(samples, final_layer, training: bool = None)
freeze_ctc_probs(samples, ctc_final_layer, hparams=None, beam_size=None) List[int]
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechConformerCTC(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechConformerCTC. Model mainly consists of two parts: the x_net for input preparation and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training=None)
_forward_encoder_log_ctc(samples, training: bool = None)
decode(samples, hparams, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
argmax(samples, hparams)

argmax for the Conformer CTC model

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

Returns:

predictions: the corresponding decoding results

merge_ctc_sequence(seqs, blank=-1)

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training: bool = None)
_forward_encoder_log_ctc(samples, final_layer, training: bool = None)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechTransformerU2(data_descriptions, config=None)

Bases: SpeechU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself

default_config
class athena.SpeechConformerU2(data_descriptions, config=None)

Bases: SpeechU2

Conformer-U2

default_config
class athena.MtlTransformerCtc(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.

SUPPORTED_MODEL
default_config
call(samples, training=None)

call function in keras layers

get_loss(outputs, samples, training=None)

get loss used for training

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

restore_from_pretrained_model(pretrained_model, model_type='')

A more general-purpose interface for pretrained model restoration

Parameters
  • pretrained_model – checkpoint path of mpc model

  • model_type – the type of pretrained model to restore

_forward_encoder_log_ctc(samples, training: bool = None)
decode(samples, hparams, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
enable_tf_funtion()
ctc_forward_chunk_freeze(encoder_out)
encoder_ctc_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)
encoder_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)
get_subsample_rate()
get_init()
encoder_forward_chunk_by_chunk_freeze(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor]
Forward input chunk by chunk with chunk_size, in a streaming fashion.

Here we should pay special attention to computation cache in the streaming style forward chunk by chunk. Three things should be taken into account for computation in the current network:

  1. transformer/conformer encoder layers output cache

  2. convolution in conformer

  3. convolution in subsampling

However, we don’t implement subsampling cache because:
  1. We can control the subsampling module to output the right result by overlapping the input instead of caching left context; even though this wastes some computation, subsampling only takes a very small fraction of the computation in the whole model.

  2. Typically, there are several convolution layers with subsampling in the subsampling module; it is tricky and complicated to handle caches for different convolution layers with different subsampling rates.

  3. Currently, nn.Sequential is used to stack all the convolution layers in subsampling; we would need to rewrite it to make it work with a cache, which is not preferred.

Parameters
  • speech (tf.Tensor) – (1, max_len, dim)

  • chunk_size (int) – decoding chunk size
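A minimal sketch of calling the documented chunk-by-chunk helper (naming the two returned tensors encoder_out and encoder_mask is an assumption; the signature above only promises Tuple[tf.Tensor, tf.Tensor]):

>>> encoder_out, encoder_mask = model.encoder_forward_chunk_by_chunk_freeze(
>>>     speech,                        # tf.Tensor of shape (1, max_len, dim)
>>>     decoding_chunk_size=16,        # chunk size used for streaming-style decoding
>>>     num_decoding_left_chunks=-1)   # defaults to -1, per the signature above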

class athena.AudioVideoConformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Audio and video multimodal Conformer. Model mainly consists of four parts: the a_net for input audio fbank feature preparation, the v_net for video feature preparation, the y_net for output preparation, and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(samples, training: bool = None)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.VadMarbleNet(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation of frame-level or segment-level speech classification

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
get_loss(outputs, samples, training=None)

get loss

class athena.VadDnn(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation of frame-level or segment-level speech classification

default_config
call(samples, training=None)

call model

get_loss(outputs, samples, training=None)

get loss

class athena.RNNLM(data_descriptions, config=None)

Bases: athena.models.lm.nn_lm.NNLM

Standard implementation of an RNNLM. Model mainly consists of an embedding layer, RNN layers (with dropout), and the fully connected layer, which are all included in self.model_for_rnn

default_config
forward(inputs, inputs_length=None, training: bool = None)

do NN LM forward computation, for both train and decode.

class athena.TransformerLM(data_descriptions, config=None)

Bases: athena.models.lm.nn_lm.NNLM

Standard implementation of a Transformer-based LM. Model mainly consists of an embedding layer, Transformer layers (with dropout), and the fully connected layer, which are all included in self.model_for_rnn

default_config
forward(inputs, input_lengths, training: bool = None)

do NN LM forward computation, for both train and decode.

class athena.FastSpeech(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Reference: Fastspeech: Fast, robust and controllable text to speech

(http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)

default_config
set_teacher_model(teacher_model, teacher_type)

set teacher model and initialize duration_calculator before training

Parameters
  • teacher_model – the loaded teacher model

  • teacher_type – the model type, e.g., tacotron2, tts_transformer

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

Parameters
  • pretrained_model – the loaded pretrained model

  • model_type – the model type, e.g: tts_transformer

get_loss(outputs, samples, training=None)

get loss used for training

_feedforward_decoder(encoder_output, duration_indexes, duration_sequences, output_length, training)

feed-forward decoder

Parameters
  • encoder_output – encoder outputs, shape: [batch, x_steps, d_model]

  • duration_indexes – argmax weights calculated from duration_calculator. It is used for training only, shape: [batch, y_steps]

  • duration_sequences – It contains duration information for each phoneme, shape: [batch, x_steps]

  • output_length – the real output length

  • training – if it is in the training stage

Returns:

before_outs: the outputs before postnet calculation
after_outs: the outputs after postnet calculation
call(samples, training: bool = None)

call model

synthesize(samples)
class athena.FastSpeech2(data_descriptions, config=None)

Bases: athena.models.tts.fastspeech.FastSpeech

Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

default_config
call(samples, training: bool = None)

call model

synthesize(samples)
class athena.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2 Reference: NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

default_config
_pad_and_reshape(outputs, ori_lens, reverse=False)
Parameters
  • outputs – true labels, shape: [batch, y_steps, feat_dim]

  • ori_lens – scalar

Returns:

reshaped_outputs: it has to be reshaped to match reduction_factor
    shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
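Example

An illustrative reshape (a sketch, not the library code) that produces the shapes described above, assuming y_steps is already a multiple of reduction_factor:

>>> import tensorflow as tf
>>> batch, y_steps, feat_dim, reduction_factor = 2, 6, 80, 3
>>> outputs = tf.zeros([batch, y_steps, feat_dim])
>>> reshaped = tf.reshape(
>>>     outputs, [batch, y_steps // reduction_factor, feat_dim * reduction_factor])
>>> reshaped.shape  # ==> (2, 2, 240)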
call(samples, training: bool = None)

call model

initialize_input_y(y)
Parameters

y – the true label, shape: [batch, y_steps, feat_dim]

Returns:

y0: zeros will be padded as one step to the start step,
[batch, y_steps+1, feat_dim]
initialize_states(encoder_output, input_length)
Parameters
  • encoder_output – encoder outputs, shape: [batch, x_step, eunits]

  • input_length – shape: [batch]

Returns:

prev_rnn_states: initial states of rnns in decoder
    [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
concat_speaker_embedding(encoder_output, speaker_embedding)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits)

  • speaker_embedding – speaker embedding (batch, embedding_dim)

Returns

the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
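Example

One straightforward way to obtain the stated output shape (a sketch, not necessarily the library's exact implementation): tile the speaker embedding across x_steps and concatenate on the last axis:

>>> import tensorflow as tf
>>> encoder_output = tf.zeros([2, 5, 8])   # (batch, x_steps, eunits)
>>> speaker_embedding = tf.ones([2, 3])    # (batch, embedding_dim)
>>> tiled = tf.tile(speaker_embedding[:, tf.newaxis, :], [1, encoder_output.shape[1], 1])
>>> tf.concat([encoder_output, tiled], axis=-1).shape  # ==> (2, 5, 11)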

time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits).

  • input_length – (batch,)

  • prev_y – one step of true labels or predicted labels (batch, feat_dim).

  • prev_rnn_states – previous rnn states [layers, 2, states] for lstm

  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]

  • prev_context – previous context vector: [batch, attn_dim]

  • training – if it is training mode

Returns:

out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
get_loss(outputs, samples, training=None)

get loss

synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
_synthesize_post_net(before_outs, logits_stack)
Parameters
  • before_outs – the outputs before postnet

  • logits_stack – the logits of all steps

Returns:

after_outs: the corresponding synthesized acoustic features
class athena.TTSTransformer(data_descriptions, config=None)

Bases: athena.models.tts.tacotron2.Tacotron2

TTS version of SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network

default_config
call(samples, training: bool = None)
time_propagate(encoder_output, memory_mask, outs, step)

Synthesize one step frames

Parameters
  • encoder_output – the encoder output, shape: [batch, x_steps, eunits]

  • memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]

  • outs (TensorArray) – previous outputs

  • step – the current step number

Returns:

out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights,
    each element in the list represents the attention weights of one decoder layer
    shape: [batch, num_heads, seq_len_q, seq_len_k]
synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
class athena.CnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

CNN model for kws

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSConformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSConformer. Model mainly consists of the x_net for input preparation and the conformer itself

default_config
call(samples, training=None)
build_model(data_descriptions)
class athena.CRnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

CRNN model for e2e kws

default_config
input_features

_, _, w, c = input_features.get_shape().as_list()
output_dim = w * c
inner = layers.Reshape((-1, output_dim))(input_features)
inner = PCENLayer()(inner)
inner = layers.Reshape((-1, w, c))(inner)

call(samples, training=None)

call model

build_model(data_descriptions)
class athena.DnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

implementation of a frame-level or segment-level speech classification model

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.MISPModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

MISP challenge KWS baseline model for e2e kws

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSTransformer_2Dense(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)
build_model(data_descriptions)
class athena.KWSTransformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSAVTransformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
inner

v_net

call(samples, training=None)
build_model(data_descriptions)
class athena.KWSTransformerRESNET(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSTransformer_FocalLoss(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
get_loss(outputs, samples, training=None)

get loss

class athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: tensorflow.keras.Model

Base Training Solver.

default_config
static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

static clip_by_norm(grads, norm)

clip norm using tf.clip_by_norm
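Example

A small sketch of the underlying tf.clip_by_norm behavior this helper relies on:

>>> import tensorflow as tf
>>> grad = tf.constant([3.0, 4.0])   # L2 norm 5.0
>>> tf.clip_by_norm(grad, 1.0)       # rescaled to norm 1.0 ==> [0.6, 0.8]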

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

save_checkpointer(checkpointer, devset, epoch)
evaluate_step(samples)

evaluate the model 1 step

evaluate(dataset, epoch)

evaluate the model

class athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: BaseSolver

A multi-process solver based on Horovod

static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

For example, if you have two machines and each of them contains 4 gpus:

  1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; then the first gpu and the last gpu on machine1 and all gpus on machine2 will be used.

  2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; then the first 2 gpus on machine1 and all gpus on machine2 will be used.

Parameters

solver_gpus ([list]) – a list to specify gpus being used.

Raises

ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate(dataset, epoch=0)

evaluate the model

class athena.DecoderSolver(model, data_descriptions=None, config=None)

Bases: BaseSolver

ASR DecoderSolver

default_config
inference(dataset_builder, rank_size=1, conf=None)

decode the model

inference_saved_model(dataset_builder, rank_size=1, conf=None)

decode the model

class athena.AVSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: tensorflow.keras.Model

Base Solver.

default_config
static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

static clip_by_norm(grads, norm)

clip norm using tf.clip_by_norm

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate_step(samples)

evaluate the model 1 step

evaluate(dataset, epoch)

evaluate the model

class athena.AVHorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: AVSolver

A multi-process solver based on Horovod

static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

For example, if you have two machines and each of them contains 4 gpus:

  1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; then the first gpu and the last gpu on machine1 and all gpus on machine2 will be used.

  2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; then the first 2 gpus on machine1 and all gpus on machine2 will be used.

Parameters

solver_gpus ([list]) – a list to specify gpus being used.

Raises

ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate(dataset, epoch=0)

evaluate the model

class athena.AVDecoderSolver(model, data_descriptions=None, config=None)

Bases: AVSolver

DecoderSolver

default_config
inference(dataset_builder, rank_size=1, conf=None)

decode the model

inference_freeze(dataset_builder, rank_size=1, conf=None)

decode the model

inference_argmax(dataset_builder, rank_size=1, conf=None)

decode the model

class athena.VadSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, data_descriptions=None, config=None)

Bases: BaseSolver

VadSolver

default_config
inference(dataset, rank_size=1, conf=None)

decode the model

class athena.SynthesisSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, config=None, **kwargs)

Bases: BaseSolver

SynthesisSolver (TTS Solver)

default_config
inference(dataset_builder, rank_size=1, conf=None)

synthesize using vocoder on dataset

inference_saved_model(dataset_builder, rank_size=1, conf=None)

synthesize using vocoder on dataset

class athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')

Bases: tensorflow.keras.losses.Loss

CTC loss implemented with TensorFlow

__call__(logits, samples, logit_length=None)
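Example

A sketch of the underlying tf.nn.ctc_loss call; the blank_index=-1 and batch-major logits mirror the constructor defaults above, while the exact layout of athena's samples dict is not shown:

>>> import tensorflow as tf
>>> logits = tf.random.normal([2, 50, 30])         # [batch, time, num_classes]
>>> labels = tf.constant([[1, 2, 3], [4, 5, 0]])
>>> loss = tf.nn.ctc_loss(labels, logits,
>>>                       label_length=tf.constant([3, 2]),
>>>                       logit_length=tf.constant([50, 50]),
>>>                       logits_time_major=False, blank_index=-1)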
class athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)

Bases: tensorflow.keras.losses.CategoricalCrossentropy

Seq2SeqSparseCategoricalCrossentropy loss: CategoricalCrossentropy calculated at each character for each sequence in a batch

__call__(logits, samples, logit_length=None)
class athena.CTCAccuracy(name='CTCAccuracy')

Bases: CharactorAccuracy

CTCAccuracy Inherits CharactorAccuracy and implements CTC accuracy calculation

__call__(logits, samples, logit_length=None)

Accumulate errors and counts, logit_length is the output length of encoder

class athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')

Bases: CharactorAccuracy

Seq2SeqSparseCategoricalAccuracy Inherits CharactorAccuracy and implements Attention accuracy calculation

__call__(logits, samples, logit_length=None)

Accumulate errors and counts

class athena.Checkpoint(checkpoint_directory=None, use_dev_loss=True, model=None, **kwargs)

Bases: tensorflow.train.Checkpoint

A wrapper for Tensorflow checkpoint

Parameters
  • checkpoint_directory – the directory for checkpoint

  • summary_directory – the directory for summary used in Tensorboard

  • __init__ – provide the optimizer and model

  • __call__ – save the model

Example

>>> transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
>>> optimizer = tf.keras.optimizers.Adam()
>>> ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
>>>        transformer=transformer, optimizer=optimizer)
>>> solver = BaseSolver(transformer)
>>> for epoch in dataset:
>>>    ckpt()
_file_compatible(use_dev_loss)

Convert the n_best file to a CSV file.

Add "index" and "Accuracy" columns for n_best files that are not in CSV format.

_compare_and_save_best(loss, metrics, save_path, training=False)

compare and save the best model with best_loss and N best metrics

compute_nbest_avg(model_avg_num, sort_by=None, sort_by_time=False, reverse=True)

Restore n-best avg checkpoint,

If 'sort_by_time' is False, the n-best order is sorted by 'sort_by'; if 'sort_by_time' is True, the newest models are selected; if 'reverse' is True, the largest models in the sorted order are selected.

__call__(loss=None, metrics=None, training=False)
restore_from_best()

restore from the best model

class athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

WarmUp Learning rate schedule for Adam

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = WarmUpLearningSchedule(512),
>>>        beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Idea from the paper: Attention Is All You Need
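For reference, a sketch of the warmup rule from that paper; how athena combines it with the k and decay_steps/decay_rate parameters above may differ:

>>> import tensorflow as tf
>>> def noam_lr(step, model_dim=512, warmup_steps=4000, k=1.0):
>>>     step = tf.cast(step, tf.float32)
>>>     # ramps up for the first warmup_steps, then decays as step ** -0.5
>>>     return k * model_dim ** -0.5 * tf.minimum(step ** -0.5, step * warmup_steps ** -1.5)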

__call__(step)
class athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.WarmUpLearningSchedule1(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0, lr=None)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

WarmUp learning rate schedule for Adam that can also be initialized with a given learning rate

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = WarmUpLearningSchedule1(512),
>>>        beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Idea from the paper: Attention Is All You Need

__call__(step)
class athena.WarmUpAdam1(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

ExponentialDecayLearningRateSchedule

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = ExponentialDecayLearningRateSchedule(0.01, 100))
Parameters
  • initial_lr

  • decay_steps

Returns

initial_lr * (0.5 ** (step // decay_steps))
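A quick sanity check of that formula with the constructor defaults (a sketch only; the actual schedule also applies start_decay_steps and final_lr, which are ignored here):

>>> initial_lr, decay_steps = 0.005, 10000
>>> [initial_lr * (0.5 ** (step // decay_steps)) for step in (0, 10000, 20000)]
>>> # ==> [0.005, 0.0025, 0.00125]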

__call__(step)
class athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.HParams(model_structure=None, **kwargs)

Bases: object

Class to hold a set of hyperparameters as name-value pairs.

A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.

You first create a HParams object by specifying the names and values of the hyperparameters.

To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:

```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate  # ==> 0.1
hparams.num_hidden_units  # ==> 100
```

Hyperparameters have type, which is inferred from the type of their value passed at construction time. The currently supported types are: integer, float, boolean, string, and list of integer, float, boolean, or string.

You can override hyperparameter values by calling the [parse()](#HParams.parse) method, passing a string of comma separated name=value pairs. This is intended to make it possible to override any hyperparameter values from a single command-line flag to which the user passes ‘hyper-param=value’ pairs. It avoids having to define one flag for each hyperparameter.

The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.

Example:

```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse
parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = tf.HParams(learning_rate=0.1, num_hidden_units=100,
                         activations=['relu', 'tanh'])

    # Override hyperparameters values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed --hparams=learning_rate=0.3 on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate  # ==> 0.3
    hparams.num_hidden_units  # ==> 100
    hparams.activations  # ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```

_HAS_DYNAMIC_ATTRIBUTES = True
add_hparam(name, value)

Adds {name, value} pair to hyperparameters.

Parameters
  • name – Name of the hyperparameter.

  • value – Value of the hyperparameter. Can be one of the following types: int, float, string, int list, float list, or string list.

Raises

ValueError – if one of the arguments is invalid.
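Example

A small sketch using the behavior described above:

>>> hparams = HParams(learning_rate=0.1)
>>> hparams.add_hparam("batch_size", 32)
>>> hparams.batch_size  # ==> 32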

set_hparam(name, value)

Set the value of an existing hyperparameter.

This function verifies that the type of the value matches the type of the existing hyperparameter.

Parameters
  • name – Name of the hyperparameter.

  • value – New value of the hyperparameter.

Raises
  • KeyError – If the hyperparameter doesn’t exist.

  • ValueError – If there is a type mismatch.

del_hparam(name)

Removes the hyperparameter with key ‘name’.

Does nothing if it isn’t present.

Parameters

name – Name of the hyperparameter.

parse(values, ignore_unknown=False)

Override existing hyperparameter values, parsing new values from a string.

See parse_values for more detail on the allowed format for values.

Parameters
  • values – String. Comma separated list of name=value pairs where 'value' must follow the syntax described above.

Returns

The HParams instance.

Raises
  • ValueError – If values cannot be parsed or a hyperparameter in values doesn't exist.
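Example

A short sketch of overriding a value from a comma-separated string, following the class-level example above:

>>> hparams = HParams(learning_rate=0.1, num_hidden_units=100)
>>> hparams.parse("learning_rate=0.3")
>>> hparams.learning_rate  # ==> 0.3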

override_from_dict(values_dict)

Override existing hyperparameter values, parsing new values from a dictionary.

Parameters

values_dict – Dictionary of name:value pairs.

Returns

The HParams instance.

Raises
  • KeyError – If a hyperparameter in values_dict doesn’t exist.

  • ValueError – If values_dict cannot be parsed.

set_model_structure(model_structure)
get_model_structure()
to_json(indent=None, separators=None, sort_keys=False)

Serializes the hyperparameters into JSON.

Parameters
  • indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.

  • separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').

  • sort_keys – If True, the output dictionaries will be sorted by key.

Returns

A JSON string.

parse_json(values_json)

Override existing hyperparameter values, parsing new values from a json object.

Parameters

values_json – String containing a json object of name:value pairs.

Returns

The HParams instance.

Raises
  • KeyError – If a hyperparameter in values_json doesn’t exist.

  • ValueError – If values_json cannot be parsed.

values()

Return the hyperparameter values as a Python dictionary.

Returns

A dictionary with hyperparameter names as keys. The values are the hyperparameter values.

get(key, default=None)

Returns the value of key if it exists, else default.

__contains__(key)
__str__()

Return str(self).

__repr__()

Return repr(self).

static _get_kind_name(param_type, is_list)

Returns the field name given parameter type and is_list.

Parameters
  • param_type – Data type of the hparam.

  • is_list – Whether this is a list.

Returns

A string representation of the field name.

Raises

ValueError – If parameter type is not recognized.

instantiate()
append(hp)
athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)

register default config and parse
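A hedged usage sketch, assuming the returned object exposes the merged values as attributes in the same way HParams does; the config keys here are purely illustrative:

>>> default_config = {"num_filters": 512, "dropout_rate": 0.1}
>>> p = athena.register_and_parse_hparams(default_config, config={"dropout_rate": 0.2})
>>> p.dropout_rate  # expected ==> 0.2 (overridden)
>>> p.num_filters   # expected ==> 512 (default kept)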

athena.generate_square_subsequent_mask(size)

Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
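One way to build such a mask in TensorFlow (a sketch; the library's exact orientation of the mask may differ):

>>> import tensorflow as tf
>>> size = 3
>>> mask = 1.0 - tf.linalg.band_part(tf.ones([size, size]), -1, 0)
>>> # ==> [[0., 1., 1.],
>>> #      [0., 0., 1.],
>>> #      [0., 0., 0.]]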

athena.generate_square_subsequent_mask_u2(size)

Generate a square mask for the sequence. The masked positions are filled with bool(True). Unmasked positions are filled with bool(False).

athena.get_wave_file_length(wave_file)

get the wave file length (duration) in ms

Parameters

wave_file – the path of wave file

Returns

the length (ms) of the wave file

Return type

wav_length
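Example

A sketch of how such a duration is typically computed with the standard wave module (the path is hypothetical, and athena's own implementation may differ):

>>> import wave
>>> with wave.open("example.wav", "rb") as w:
>>>     wav_length = w.getnframes() / w.getframerate() * 1000  # duration in ms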

athena.set_default_summary_writer(summary_directory=None)
athena.get_dict_from_scp(vocab, func=lambda x: ...)
class athena.CTCPrefixScoreTH(x, xlens, blank, eos, margin=0)

Bases: object

Batch processing of CTCPrefixScore

which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al. “Vectorized Beam Search for CTC-Attention-Based Speech Recognition,” In INTERSPEECH (pp. 3825-3829), 2019.

__call__(y, state, scoring_ids=None, att_w=None)

Compute CTC prefix scores for next labels

Parameters
  • y – tensor(shape=[W, L]), prefix label sequences

  • state (tuple) –

    previous CTC state tuple(

    tensor(shape=[T , 2, W]), tensor(shape=[W, O]), 0, 0

    )

  • scoring_ids (torch.Tensor) – scores for pre-selection of hypotheses [Beam, Beam * pre_beam_ratio]

  • att_w (torch.Tensor) – attention weights to decide CTC window

Returns

new_state, ctc_local_scores (BW, O)

index_select_state(state, best_ids)

Select CTC states according to best ids

Parameters
  • state – CTC state

  • best_ids – index numbers selected by beam pruning (B, W)

Returns

selected_state

athena.__version__ = 2.0