athena module¶
Subpackages¶
athena.data
athena.layers
athena.models
athena.models.asr
athena.models.kws
athena.models.kws.base
athena.models.kws.cnn_wakeup
athena.models.kws.conformer_wakeup
athena.models.kws.crnn_wakeup
athena.models.kws.dnn_wakeup
athena.models.kws.misp_wakeup
athena.models.kws.transformer_av_wakeup
athena.models.kws.transformer_wakeup
athena.models.kws.transformer_wakeup_2dense
athena.models.kws.transformer_wakeup_average_pooling
athena.models.kws.transformer_wakeup_focal_loss
athena.models.kws.transformer_wakeup_resnet
athena.models.lm
athena.models.tts
athena.models.vad
athena.models.base
athena.models.masked_pc
athena.tools
athena.transform
athena.transform.feats
athena.transform.feats.ops
athena.transform.feats.add_rir_noise_aecres
athena.transform.feats.add_rir_noise_aecres_test
athena.transform.feats.base_frontend
athena.transform.feats.cmvn
athena.transform.feats.cmvn_test
athena.transform.feats.fbank
athena.transform.feats.fbank_pitch
athena.transform.feats.fbank_pitch_test
athena.transform.feats.fbank_test
athena.transform.feats.framepow
athena.transform.feats.framepow_test
athena.transform.feats.mel_spectrum
athena.transform.feats.mel_spectrum_test
athena.transform.feats.mfcc
athena.transform.feats.mfcc_test
athena.transform.feats.pitch
athena.transform.feats.pitch_test
athena.transform.feats.read_wav
athena.transform.feats.read_wav_test
athena.transform.feats.spectrum
athena.transform.feats.spectrum_test
athena.transform.feats.write_wav
athena.transform.feats.write_wav_test
athena.transform.audio_featurizer
athena.utils
Submodules¶
Package Contents¶
Classes¶
SpeechDatasetBuilder | SpeechDatasetBuilder
LanguageDatasetBuilder | LanguageDatasetBuilder
SpeechRecognitionDatasetBuilder | SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder | SpeechRecognitionDatasetKaldiIOBuilder
SpeechRecognitionDatasetBatchBinsBuilder | SpeechRecognitionDatasetBatchBinsBuilder
SpeechRecognitionDatasetBatchBinsKaldiIOBuilder | SpeechRecognitionDatasetBatchBinsKaldiIOBuilder
AudioVedioRecognitionDatasetBuilder | SpeechRecognitionDatasetBuilder
AudioVedioRecognitionDatasetBatchBinsBuilder | SpeechRecognitionDatasetBatchBinsBuilder
MpcSpeechDatasetBuilder | SpeechDatasetBuilder
MpcSpeechDatasetKaldiIOBuilder | MpcSpeechDatasetKaldiIOBuilder
SpeechSynthesisDatasetBuilder | SpeechSynthesisDatasetBuilder
SpeechFastspeech2DatasetBuilder | SpeechSynthesisDatasetBuilder
FeatureNormalizer | Feature Normalizer
FS2FeatureNormalizer | Fastspeech2 Feature Normalizer
VoiceActivityDetectionDatasetKaldiIOBuilder | VoiceActivityDetectionDatasetKaldiIOBuilder
SpeechWakeupFramewiseDatasetKaldiIOBuilder | Dataset builder for CNN model. The builder treats every spliced frame as one image.
SpeechWakeupDatasetKaldiIOBuilder | Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.
SpeechWakeupDatasetKaldiIOBuilderAVCE | Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.
TextFeaturizer | The main text featurizer interface
TextTokenizer | TextTokenizer
PositionalEncoding | positional encoding can be used in transformer
Collapse4D | collapse4d can be used in cnn-lstm for speech processing
TdnnLayer | An implementation of a TDNN layer
Gelu | Gaussian Error Linear Unit.
MultiHeadAttention | Multi-head attention consists of four parts: linear layers and split into heads, scaled dot-product attention, concatenation of heads, and a final linear layer.
BahdanauAttention | the Bahdanau Attention
HanAttention | Refer to [Hierarchical Attention Networks for Document Classification]
MatchAttention | Refer to [Learning Natural Language Inference with LSTM]
Transformer | A transformer model. User is able to modify the attributes as needed.
TransformerEncoder | TransformerEncoder is a stack of N encoder layers
TransformerDecoder | TransformerDecoder is a stack of N decoder layers
TransformerEncoderLayer | TransformerEncoderLayer is made up of self-attn and feedforward network.
TransformerDecoderLayer | TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
ResnetBasicBlock | Basic block of ResNet
BaseModel | Base class for model.
MaskedPredictCoding | implementation for the MPC pretrain model
AV_MtlTransformer | In speech recognition, adding CTC loss to an Attention-based seq-to-seq model is known to help convergence.
SpeechConformer | Standard implementation of a SpeechTransformer.
SpeechConformerCTC | Standard implementation of a SpeechTransformer.
SpeechTransformer | Standard implementation of a SpeechTransformer.
SpeechTransformerU2 | U2 implementation of a SpeechTransformer.
SpeechConformerU2 | Conformer-U2
MtlTransformerCtc | In speech recognition, adding CTC loss to an Attention-based seq-to-seq model is known to help convergence.
AudioVideoConformer | Audio and video multimode Conformer.
VadMarbleNet | implementation of a frame level or segment speech classification
VadDnn | implementation of a frame level or segment speech classification
RNNLM | Standard implementation of a RNNLM.
TransformerLM | Standard implementation of a TransformerLM.
FastSpeech | Reference: Fastspeech: Fast, robust and controllable text to speech
FastSpeech2 | Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Tacotron2 | An implementation of Tacotron2
TTSTransformer | TTS version of SpeechTransformer.
CNN model for kws
Standard implementation of a KWSConformer.
CRNN model for e2e kws
implementation of a frame level or segment speech classification
MISP challenge KWS baseline model for e2e kws
Standard implementation of a KWSTransformer.
Standard implementation of a KWSTransformer.
Standard implementation of a KWSTransformer.
Standard implementation of a KWSTransformer.
Standard implementation of a KWSTransformer.
Base Training Solver.
A multi-process solver based on Horovod
ASR DecoderSolver
Base Solver.
A multi-process solver based on Horovod
DecoderSolver
VadSolver
SynthesisSolver (TTS Solver)
CTC LOSS
Seq2SeqSparseCategoricalCrossentropy LOSS
CTCAccuracy
Seq2SeqSparseCategoricalAccuracy
A wrapper for TensorFlow checkpoint
WarmUp learning rate schedule for Adam
WarmUpAdam implementation
WarmUp learning rate schedule for Adam that can initialize a learning rate
WarmUpAdam implementation
ExponentialDecayLearningRateSchedule
WarmUpAdam implementation
Class to hold a set of hyperparameters as name-value pairs.
Batch processing of CTCPrefixScore
Functions¶
make_positional_encoding | generate a positional encoding list
collapse4d | reshape from [N T D C] -> [N T D*C]
gelu | Gaussian Error Linear Unit.
register default config and parse
Generate a square mask for the sequence. The masked positions are filled with float(1.0).
Generate a square mask for the sequence. The masked positions are filled with bool(True).
get the wave file length (duration) in ms
Attributes¶
- class athena.SpeechDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechDatasetBuilder
- property num_class¶
@property
- Returns
the target dim
- Return type
int
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output": tf.float32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape( [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels] ), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None, None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec( shape=(None, None, None, None), dtype=tf.float32 ), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_data, "input_length": input_data.shape[0], "output": output_data, "output_length": output_data.shape[0], }
- Return type
dict
- class athena.LanguageDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.BaseDatasetBuilder
LanguageDatasetBuilder
- property num_class¶
@property
- Returns
the max_index of the vocabulary
- Return type
int
- property input_vocab_size¶
@property
- Returns
the input vocab size
- Return type
int
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.int32, "input_length": tf.int32, "output": tf.int32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
load csv file
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_labels, "input_length": input_length, "output": output_labels, "output_length": output_length, }
- Return type
dict
- class athena.SpeechRecognitionDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, "utt_id": tf.string, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt_id": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- storage_features_offline()¶
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt_id": utt_id }
- Return type
dict
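A rough usage sketch, assuming only the constructor and members documented above (the csv path is hypothetical):
>>> builder = athena.SpeechRecognitionDatasetBuilder(config=None)  # default_config is used
>>> builder.preprocess_data("train.csv")  # hypothetical path to the transcripts csv
>>> sample = builder[0]                   # __getitem__ returns the sample dict
>>> sample["input"]                       # acoustic features, see sample_shape above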
- class athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶
Generate a list of tuples (feat_key, speaker).
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.SpeechRecognitionDatasetBatchBinsBuilder(config=None)¶
Bases:
athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetBatchBinsBuilder
- property sample_shape_batch_bins¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
- __getitem__(index)¶
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).
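The idea behind batch-wise shuffling is to shuffle whole batches rather than single entries, so length-sorted utterances stay grouped. A minimal sketch of the idea (not the builder's actual code):
import random

def batch_wise_shuffle(entries, batch_size=1, seed=917):
    """Shuffle the order of batches while keeping entries inside a batch together."""
    batches = [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
    random.Random(seed).shuffle(batches)
    return [entry for batch in batches for entry in batch]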
- class athena.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder
SpeechRecognitionDatasetBatchBinsKaldiIOBuilder
- property sample_shape_batch_bins¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), }
- Return type
dict
- default_config¶
- preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶
- read_shape_file(file_dir=None)¶
- __getitem__(index)¶
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).
- class athena.AudioVedioRecognitionDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
- Return type
dict
- default_config¶
- video_scp_loader(scp_dir)¶
load the video list from an scp file and return a dict
- image_normalizer(image)¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- storage_features_offline()¶
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, }
- Return type
dict
- class athena.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)¶
Bases:
athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetBatchBinsBuilder
- property sample_shape_batch_bins¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "video":tf.TensorShape([None, None, high, wide]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), "utt_id": tf.TensorShape([None]), }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "video": tf.TensorShape([None, None, None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt_id": tf.TensorShape([]), }
- Return type
dict
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, "utt_id": tf.string, }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt_id": utt_id }
- Return type
dict
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).
- class athena.MpcSpeechDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechDatasetBuilder. This data builder is an online feature extractor and is used for MPC training
- property num_class¶
@property
- Returns
the target dim
- Return type
int
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output": tf.float32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape( [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels] ), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None, None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec( shape=(None, None, None, None), dtype=tf.float32 ), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_data, "input_length": input_data.shape[0], "output": output_data, "output_length": output_data.shape[0], }
- Return type
dict
- class athena.MpcSpeechDatasetKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder
MpcSpeechDatasetKaldiIOBuilder. This data builder is an offline feature data builder and is used for MPC training
- default_config¶
- preprocess_data(file_path, apply_sort_filter=True)¶
generate a list of tuples (feat_key, speaker).
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.SpeechSynthesisDatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechSynthesisDatasetBuilder
- property num_class¶
@property
- Returns
the max_index of the vocabulary
- Return type
int
- property feat_dim¶
return the number of feature dims
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "utt_id": tf.string, "input": tf.int32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.float32, "speaker": tf.int32 }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "utt_id": tf.TensorShape([]), "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None, feature_dim]), "speaker": tf.TensorShape([]) }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, feature_dim), dtype=tf.float32), "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32) }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- __getitem__(index)¶
- class athena.SpeechFastspeech2DatasetBuilder(config=None)¶
Bases:
athena.data.datasets.base.BaseDatasetBuilder
SpeechSynthesisDatasetBuilder
- property num_class¶
@property
- Returns
the max_index of the vocabulary
- Return type
int
- property feat_dim¶
return the number of feature dims
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "utt_id": tf.string, "input": tf.int32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.float32, "speaker": tf.int32, "duration": tf.int32 }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "utt_id": tf.TensorShape([]), "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None, feature_dim]), "f0": tf.TensorShape([None]), "energy": tf.TensorShape([None]), "speaker": tf.TensorShape([]), "duration": tf.TensorShape([None]) }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, feature_dim), dtype=tf.float32), "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32), "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32), "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32) }
- Return type
dict
- default_config¶
- load_duration(duration)¶
- preprocess_data(file_path)¶
generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).
- load_audio_feature(audio_feature_file)¶
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.FeatureNormalizer(cmvn_file=None)¶
Feature Normalizer
- __call__(feat_data, speaker, reverse=False)¶
- apply_cmvn(feat_data, speaker, reverse=False)¶
transform original feature to normalized feature
- compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)¶
compute cmvn for filtered entries
- compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)¶
because of memory issues, we use an incremental approximation to calculate cmvn
- compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)¶
compute cmvn for filtered entries using kaldi-format data
- load_cmvn()¶
load mean and var
- save_cmvn(variable_list)¶
save cmvn variables determined by variable_list to file
- Parameters
variable_list (list) – e.g. [“speaker”, “mean”, “var”]
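apply_cmvn performs per-speaker mean and variance normalization; a sketch of the transform it describes, assuming per-speaker mean and var arrays (reverse undoes the normalization):
import numpy as np

def apply_cmvn(feat_data, mean, var, reverse=False, eps=1e-10):
    """Normalize features with speaker-level statistics, or invert the transform."""
    if reverse:
        return feat_data * np.sqrt(var + eps) + mean
    return (feat_data - mean) / np.sqrt(var + eps)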
- class athena.FS2FeatureNormalizer(cmvn_file=None)¶
Bases:
FeatureNormalizer
Fastspeech2 Feature Normalizer
- __call__(feat_data, speaker, feature_type='mel', reverse=False)¶
- compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)¶
compute cmvn of mel-spec, f0 and energy
- apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)¶
transform original feature to normalized feature
- load_cmvn()¶
load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var
- class athena.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.base.SpeechBaseDatasetBuilder
VoiceActivityDetectionDatasetKaldiIOBuilder
- property sample_type¶
@property
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
@property
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
@property
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), utt": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(data_scps_dir)¶
generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).
- splice_feature(feature)¶
splice features according to input_left_context and input_right_context.
input_left_context: the left frames to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right frames to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
the spliced features
- Return type
splice_feat
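A sketch of the splicing described above, clamping context indexes so the first/last frame is repeated out of range:
import numpy as np

def splice_feature(feature, left_context, right_context):
    """feature: [timestamp, dim, 1] -> [timestamp, dim * (left + right + 1), 1]."""
    steps = feature.shape[0]
    spliced = []
    for offset in range(-left_context, right_context + 1):
        idx = np.clip(np.arange(steps) + offset, 0, steps - 1)  # repeat edge frames
        spliced.append(feature[idx])
    return np.concatenate(spliced, axis=1)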
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt": utt }
- Return type
dict
- class athena.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for CNN model. The builder treats every spliced frame as one image, e.g. (21, 63). The input data format is (batch, timestep, height, width, channel), e.g. (b, t, 21, 63, 1); unbatch is used to split it, so the output data format is, for example, (b, 21, 63, 1)
- property sample_type¶
example types
- property sent_sample_shape¶
- property sample_shape¶
example shapes
- property sample_signature¶
example signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
splice features according to input_left_context and input_right_context.
input_left_context: the left frames to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right frames to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
the spliced features
- Return type
splice_feat
- class athena.SpeechWakeupDatasetKaldiIOBuilder(config=None)¶
Bases:
athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, e.g. (1, 1323). The input data format is (batch, t, dim, channel), e.g. (b, t, 1323, 1). The output data format is (batch, timestep)
- property sample_type¶
example types
- property sample_shape¶
example shapes
- property sample_signature¶
example signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
splice features according to input_left_context and input_right_context.
input_left_context: the left frames to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right frames to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
the spliced features
- Return type
splice_feat
- class athena.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)¶
Bases:
athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, e.g. (1, 1323). The input data format is (batch, t, dim, channel), e.g. (b, t, 1323, 1). The output data format is (batch, timestep)
- property sample_type¶
example types
- property sample_shape¶
example shapes
- property sample_signature¶
example signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- video_scp_loader(scp_dir)¶
load video list from scp file return a dic
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
splice features according to input_left_context and input_right_context.
input_left_context: the left frames to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right frames to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
the spliced features
- Return type
splice_feat
- class athena.TextFeaturizer(config=None)¶
The main text featurizer interface
- property model_type¶
@property
- Returns
the model type
- property unk_index¶
@property
- Returns
the unk index
- Return type
int
- supported_model¶
- default_config¶
- load_model(model_file)¶
load model
- delete_punct(tokens)¶
delete punctuation tokens
- __len__()¶
- encode(texts)¶
convert a sentence to a list of ids, with special tokens added.
- decode(sequences)¶
convert a list of ids to a sentence
- decode_to_list(sequences, ignored_id=[])¶
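A round-trip sketch using only the members above (the vocabulary file is hypothetical):
>>> featurizer = athena.TextFeaturizer(config=None)
>>> featurizer.load_model("vocab.txt")      # hypothetical vocabulary/model file
>>> ids = featurizer.encode("hello world")  # sentence -> list of ids
>>> featurizer.decode(ids)                  # list of ids -> sentence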
- class athena.TextTokenizer(text=None)¶
TextTokenizer
- load_model(text)¶
load model
- save_vocab(vocab_file)¶
- load_csv(csv_file)¶
- __len__()¶
- encode(texts)¶
convert a sentence to a list of ids, with special tokens added.
- decode(sequences)¶
convert a list of ids to a sentence
- decode_to_list(ids, ignored_id=[])¶
- athena.make_positional_encoding(position, d_model)¶
generate a positional encoding list
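A sketch of the standard sinusoidal encoding such a helper generates (following “Attention Is All You Need”; the exact scaling used here may differ):
import numpy as np

def make_positional_encoding(position, d_model):
    """Sinusoidal positional encoding of shape (position, d_model)."""
    pos = np.arange(position)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    encoding = np.zeros((position, d_model), dtype=np.float32)
    encoding[:, 0::2] = np.sin(angle[:, 0::2])  # even indexes: sine
    encoding[:, 1::2] = np.cos(angle[:, 1::2])  # odd indexes: cosine
    return encoding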
- athena.collapse4d(x, name=None)¶
reshape from [N T D C] -> [N T D*C] using tf.shape(x), which generates a tensor, instead of x.shape
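The dynamic reshape it performs, sketched with tf.shape so it also works when static dims are unknown:
import tensorflow as tf

def collapse4d(x):
    """Reshape [N, T, D, C] -> [N, T, D*C] with a dynamic shape tensor."""
    shape = tf.shape(x)
    return tf.reshape(x, [shape[0], shape[1], shape[2] * shape[3]])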
- athena.gelu(x)¶
Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415
- Parameters
x – float Tensor to perform activation.
- Returns
x with the GELU activation applied.
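A sketch of the tanh approximation from the cited paper:
import math
import tensorflow as tf

def gelu(x):
    """Gaussian Error Linear Unit, tanh approximation."""
    cdf = 0.5 * (1.0 + tf.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
    return x * cdf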
- class athena.PositionalEncoding(d_model, max_position=800, scale=False)¶
Bases:
tensorflow.keras.layers.Layer
positional encoding can be used in transformer
- call(x)¶
call function
- class athena.Collapse4D¶
Bases:
tensorflow.keras.layers.Layer
collapse4d can be used in cnn-lstm for speech processing; reshapes from [N T D C] -> [N T D*C]
- call(x)¶
- class athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
An implementation of a TDNN layer.
- Parameters
context – an int of left and right context, or a list of context indexes, e.g. (-2, 0, 2).
output_dim – the dim of the linear transform
- call(x, training=None, mask=None)¶
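A TDNN layer is a linear transform over a window of context frames. A sketch of the gather-and-project idea under the parameters above (edge handling here is clamping, which may differ from the layer's actual padding):
import tensorflow as tf

def tdnn_step(x, context, kernel):
    """x: (batch, time, dim); context: e.g. (-2, 0, 2); kernel: (len(context)*dim, output_dim)."""
    steps = tf.shape(x)[1]
    frames = []
    for offset in context:
        idx = tf.clip_by_value(tf.range(steps) + offset, 0, steps - 1)  # clamp at edges
        frames.append(tf.gather(x, idx, axis=1))
    stacked = tf.concat(frames, axis=-1)              # (batch, time, len(context)*dim)
    return tf.einsum('btd,do->bto', stacked, kernel)  # per-frame linear transform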
- class athena.Gelu¶
Bases:
tensorflow.keras.layers.Layer
Gaussian Error Linear Unit.
This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415
- Parameters
x – float Tensor to perform activation.
- Returns
x with the GELU activation applied.
- call(x)¶
- class athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)¶
Bases:
tensorflow.keras.layers.Layer
Multi-head attention consists of four parts:
Linear layers and split into heads.
Scaled dot-product attention.
Concatenation of heads.
Final linear layer.
Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose, and tf.reshape) and put through a final Dense layer.
Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.
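A sketch of the head split described above:
import tensorflow as tf

def split_heads(x, batch_size, num_heads, depth):
    """(batch, seq_len, num_heads*depth) -> (batch, num_heads, seq_len, depth)."""
    x = tf.reshape(x, (batch_size, -1, num_heads, depth))  # split the last dimension
    return tf.transpose(x, perm=[0, 2, 1, 3])              # move heads before seq_len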
- split_heads(x, batch_size)¶
Split the last dimension into (num_heads, depth).
Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
- call(v, k, q, mask)¶
call function
- class athena.BahdanauAttention(units, input_dim=1024)¶
Bases:
tensorflow.keras.Model
the Bahdanau Attention
- call(query, values)¶
call function
- class athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
Input shape: (Batch size, steps, features)
Output shape: (Batch size, features)
- build(input_shape)¶
build in keras layer
- call(inputs, training=None, mask=None)¶
call function in keras
- compute_output_shape(input_shape)¶
compute output shape
- _masked_softmax(logits, mask, axis)¶
Compute softmax with input mask.
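A common way to implement a masked softmax, sketched here (the layer's exact masking constant may differ):
import tensorflow as tf

def masked_softmax(logits, mask, axis=-1):
    """Softmax that assigns ~zero probability to positions where mask == 0."""
    if mask is not None:
        logits += (1.0 - tf.cast(mask, logits.dtype)) * -1e9  # suppress masked positions
    return tf.nn.softmax(logits, axis=axis)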
- class athena.MatchAttention(config, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)
Input shape: (Batch size, steps, features)
Output shape: (Batch size, steps, features)
- call(tensors)¶
Attention layer.
- class athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None, conv_module_kernel_size=0)¶
Bases:
tensorflow.keras.layers.Layer
A transformer model. User is able to modify the attributes as needed.
- Parameters
d_model – the number of expected features in the encoder/decoder inputs (default=512).
nhead – the number of heads in the multiheadattention models (default=8).
num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of encoder/decoder intermediate layer, relu or gelu (default=gelu).
custom_encoder – custom encoder (default=None).
custom_decoder – custom decoder (default=None).
Examples
>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_model(src, tgt)
- call(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)¶
Take in and process masked source/target sequences.
- Parameters
src – the sequence to the encoder (required).
tgt – the sequence to the decoder (required).
src_mask – the additive mask for the src sequence (optional).
tgt_mask – the additive mask for the tgt sequence (optional).
memory_mask – the additive mask for the encoder output (optional).
src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).
- Shape:
src: \((N, S, E)\).
tgt: \((N, T, E)\).
src_mask: \((N, S)\).
tgt_mask: \((N, T)\).
memory_mask: \((N, S)\).
Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float(‘-inf’) and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.
output: \((N, T, E)\).
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
- class athena.TransformerEncoder(encoder_layers)¶
Bases:
tensorflow.keras.layers.Layer
TransformerEncoder is a stack of N encoder layers
- Parameters
encoder_layer – an instance of the TransformerEncoderLayer() class (required).
num_layers – the number of sub-encoder-layers in the encoder (required).
norm – the layer normalization component (optional).
Examples
>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.normal((10, 32, 512))
>>> out = transformer_encoder(src)
- call(src, src_mask=None, training=None)¶
Pass the input through the encoder layers in turn.
- Parameters
src – the sequence to the encoder (required).
mask – the mask for the src sequence (optional).
- set_unidirectional(uni=False)¶
whether to apply triangular masks to make the transformer unidirectional
- class athena.TransformerDecoder(decoder_layers)¶
Bases:
tensorflow.keras.layers.Layer
TransformerDecoder is a stack of N decoder layers
- Parameters
decoder_layer – an instance of the TransformerDecoderLayer() class (required).
num_layers – the number of sub-decoder-layers in the decoder (required).
norm – the layer normalization component (optional).
Examples
>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_decoder(tgt, memory)
- call(tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)¶
Pass the inputs (and mask) through the decoder layer in turn.
- Parameters
tgt – the sequence to the decoder (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
- class athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None, conv_module_kernel_size=0)¶
Bases:
tensorflow.keras.layers.Layer
TransformerEncoderLayer is made up of self-attn and feedforward network.
- Parameters
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
- call(src, src_mask=None, training=None)¶
Pass the input through the encoder layer.
- Parameters
src – the sequence to the encoder layer (required).
mask – the mask for the src sequence (optional).
- set_unidirectional(uni=False)¶
whether to apply triangular masks to make the transformer unidirectional
- class athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')¶
Bases:
tensorflow.keras.layers.Layer
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
- Reference:
“Attention Is All You Need”.
- Parameters
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples
>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
- call(tgt, memory, tgt_mask=None, memory_mask=None, training=None)¶
Pass the inputs (and mask) through the decoder layer.
- Parameters
tgt – the sequence to the decoder layer (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
- class athena.ResnetBasicBlock(num_filter, stride=1)¶
Bases:
tensorflow.keras.layers.Layer
Basic block of ResNet. Reference: “Deep residual learning for image recognition”
- call(inputs)¶
call model
- make_downsample_layer(num_filter, stride)¶
perform downsampling using conv layer with stride != 1
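A sketch of the standard two-convolution residual block from the cited paper (hyperparameters are illustrative; the shortcut assumes matching channels unless strided):
import tensorflow as tf

class BasicBlock(tf.keras.layers.Layer):
    """Two 3x3 convs plus a shortcut, as in 'Deep residual learning for image recognition'."""
    def __init__(self, num_filter, stride=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_filter, 3, stride, padding="same")
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.conv2 = tf.keras.layers.Conv2D(num_filter, 3, 1, padding="same")
        self.bn2 = tf.keras.layers.BatchNormalization()
        # downsample the shortcut with a strided 1x1 conv when stride != 1
        self.downsample = (tf.keras.layers.Conv2D(num_filter, 1, stride)
                           if stride != 1 else None)

    def call(self, inputs):
        residual = inputs if self.downsample is None else self.downsample(inputs)
        x = tf.nn.relu(self.bn1(self.conv1(inputs)))
        x = self.bn2(self.conv2(x))
        return tf.nn.relu(x + residual)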
- class athena.BaseModel(**kwargs)¶
Bases:
tensorflow.keras.Model
Base class for model.
- abstract call(samples, training=None)¶
call model
- get_loss(outputs, samples, training=None)¶
get loss
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- prepare_samples(samples)¶
prepare special data carefully: do not change the shape of samples
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- decode(samples, hparams, decoder)¶
decode interface
- class athena.MaskedPredictCoding(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
implementation for MPC pretrain model
- Parameters
num_filters – an int, the number of filters in the cnn
d_model – an int, the dimension of the model
num_heads – the number of heads in the transformer
num_encoder_layers – the number of layers in the encoder
dff – an int, the dimension of the feed-forward layer
rate – the rate of the dropout layers
chunk_size – the number of consecutive masks, i.e. 1 or 3
keep_probability – the probability of not being masked
mode – train mode, i.e. MPC: pretrain
max_pool_layers – index of max pool layers in the encoder, default is -1
- default_config¶
- call(samples, training: bool = None)¶
used for training
- Parameters
samples (dict) – a dict including the keys ‘input’, ‘input_length’, ‘output_length’, ‘output’; input: acoustic features, Tensor, shape is (batch, time_len, dim, 1), i.e. f-bank
Return:
MPC outputs to fit acoustic features encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
- get_loss(logits, samples, training=None)¶
get MPC loss
- Parameters
logits – MPC output
Return:
MPC L1 loss and metrics
- compute_logit_length(samples)¶
compute the logit length
- generate_mpc_mask(input_data)¶
generate mask for pretraining
- Parameters
acoustic features – i.e. F-bank
Return:
mask tensor
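A sketch of chunk-wise masking consistent with the chunk_size and keep_probability parameters above (the model's actual masking may be more involved):
import numpy as np

def mpc_chunk_mask(num_frames, chunk_size=3, keep_probability=0.85, seed=0):
    """Return a frame mask: 1.0 = keep, 0.0 = masked; whole chunks are masked together."""
    rng = np.random.default_rng(seed)
    mask = np.ones(num_frames, dtype=np.float32)
    for start in range(0, num_frames, chunk_size):
        if rng.random() > keep_probability:  # mask this chunk of consecutive frames
            mask[start:start + chunk_size] = 0.0
    return mask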
- prepare_samples(samples)¶
prepare special data carefully: do not change the shape of samples
- class athena.AV_MtlTransformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.
- SUPPORTED_MODEL¶
- default_config¶
- call(samples, training=None)¶
call function in keras layers
- get_loss(outputs, samples, training=None)¶
get loss used for training
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- restore_from_pretrained_model(pretrained_model, model_type='')¶
A more general-purpose interface for pretrained model restoration
- Parameters
pretrained_model – checkpoint path of the mpc model
model_type – the type of pretrained model to restore
- decode(samples, hparams=None, lm_model=None)¶
Initialization of the model for decoding; the decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
- class athena.SpeechConformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training: bool = None)¶
- _forward_encoder_log_ctc(samples, final_layer, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int] ¶
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=None) List[int] ¶
- freeze_ctc_probs(samples, ctc_final_layer, hparams=None, beam_size=None) List[int] ¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int] ¶
- Apply attention rescoring decoding: CTC prefix beam search
is applied first to get the nbest, then we rescore the nbest with the attention decoder on the corresponding encoder output
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
- freeze_beam_search(samples, beam_size)¶
beam search for the frozen model; only supports batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that used for beam search
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechConformerCTC(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training=None)¶
- _forward_encoder_log_ctc(samples, training: bool = None)¶
- decode(samples, hparams, lm_model=None)¶
Initialization of the model for decoding; the decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
- argmax(samples, hparams)¶
argmax for the Conformer CTC model
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
- Returns:
predictions: the corresponding decoding results
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int] ¶
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=1) List[int] ¶
- merge_ctc_sequence(seqs, blank=-1)¶
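No description is attached to merge_ctc_sequence; the usual CTC collapse it presumably performs is sketched below: merge repeated labels, then drop blanks:
def merge_ctc_sequence(seq, blank=-1):
    """Collapse a frame-level CTC path: dedupe consecutive repeats, then remove blanks."""
    merged, prev = [], None
    for token in seq:
        if token != prev and token != blank:
            merged.append(token)
        prev = token
    return merged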
- freeze_beam_search(samples, beam_size)¶
beam search for the frozen model; only supports batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechTransformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training: bool = None)¶
- _forward_encoder_log_ctc(samples, final_layer, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int] ¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int] ¶
- Apply attention rescoring decoding: CTC prefix beam search
is applied first to get the nbest, then we rescore the nbest with the attention decoder on the corresponding encoder output
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
- freeze_beam_search(samples, beam_size=1)¶
beam search for the frozen model; only supports batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that used for beam search
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=1) List[int] ¶
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechTransformerU2(data_descriptions, config=None)¶
Bases:
SpeechU2
U2 implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- class athena.SpeechConformerU2(data_descriptions, config=None)¶
Bases:
SpeechU2
Conformer-U2
- default_config¶
- class athena.MtlTransformerCtc(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.
- SUPPORTED_MODEL¶
- default_config¶
- call(samples, training=None)¶
call function in keras layers
- get_loss(outputs, samples, training=None)¶
get loss used for training
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- restore_from_pretrained_model(pretrained_model, model_type='')¶
A more general-purpose interface for pretrained model restoration
- Parameters
pretrained_model – checkpoint path of the mpc model
model_type – the type of pretrained model to restore
- _forward_encoder_log_ctc(samples, training: bool = None)¶
- decode(samples, hparams, lm_model=None)¶
Initialization of the model for decoding; the decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
- enable_tf_funtion()¶
- ctc_forward_chunk_freeze(encoder_out)¶
- encoder_ctc_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)¶
- encoder_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)¶
- get_subsample_rate()¶
- get_init()¶
- encoder_forward_chunk_by_chunk_freeze(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor] ¶
- Forward input chunk by chunk with chunk_size, in a streaming fashion
Here we should pay special attention to computation cache in the streaming style forward chunk by chunk. Three things should be taken into account for computation in the current network:
transformer/conformer encoder layers output cache
convolution in conformer
convolution in subsampling
- However, we don’t implement subsampling cache for:
We can control the subsampling module to output the right result by overlapping the input instead of caching the left context; even though this wastes some computation, subsampling takes only a very small fraction of the computation in the whole model.
Typically, there are several convolution layers with subsampling in the subsampling module; it is tricky and complicated to handle caches for different convolution layers with different subsampling rates.
Currently, nn.Sequential is used to stack all the convolution layers in subsampling; we would need to rewrite it to make it work with a cache, which is not preferred.
- Parameters
speech (tf.Tensor) – (1, max_len, dim)
chunk_size (int) – decoding chunk size
- ctc_prefix_beam_search(samples, hparams, decoding_chunk_size, num_decoding_left_chunks) List[int] ¶
- class athena.AudioVideoConformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Audio and video multimode Conformer. Model mainly consists of three parts: the a_net for input audio fbank feature preparation, the v_net, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(samples, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int] ¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int] ¶
- Apply attention rescoring decoding: CTC prefix beam search
is applied first to get the nbest, then we rescore the nbest with the attention decoder on the corresponding encoder output
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that used for beam search
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.VadMarbleNet(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
implementation of a frame level or segment speech classification
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.VadDnn(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
implementation of a frame level or segment speech classification
- default_config¶
- call(samples, training=None)¶
call model
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.RNNLM(data_descriptions, config=None)¶
Bases:
athena.models.lm.nn_lm.NNLM
Standard implementation of a RNNLM. Model mainly consists of an embedding layer, rnn layers (with dropout), and a fully connected layer, which are all included in self.model_for_rnn
- default_config¶
- forward(inputs, inputs_length=None, training: bool = None)¶
do NN LM forward computation, for both train and decode.
- class athena.TransformerLM(data_descriptions, config=None)¶
Bases:
athena.models.lm.nn_lm.NNLM
Standard implementation of a TransformerLM. Model mainly consists of an embedding layer, transformer layers (with dropout), and a fully connected layer
- default_config¶
- forward(inputs, input_lengths, training: bool = None)¶
do NN LM forward computation, for both train and decode.
- class athena.FastSpeech(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
- Reference: Fastspeech: Fast, robust and controllable text to speech
(http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)
- default_config¶
- set_teacher_model(teacher_model, teacher_type)¶
set teacher model and initialize duration_calculator before training
- Parameters
teacher_model – the loaded teacher model
teacher_type – the model type, e.g., tacotron2, tts_transformer
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- Parameters
pretrained_model – the loaded pretrained model
model_type – the model type, e.g: tts_transformer
- get_loss(outputs, samples, training=None)¶
get loss used for training
- _feedforward_decoder(encoder_output, duration_indexes, duration_sequences, output_length, training)¶
feed-forward decoder
- Parameters
encoder_output – encoder outputs, shape: [batch, x_steps, d_model]
duration_indexes – argmax weights calculated from duration_calculator. It is used for training only, shape: [batch, y_steps]
duration_sequences – It contains duration information for each phoneme, shape: [batch, x_steps]
output_length – the real output length
training – if it is in the training stage
Returns:
before_outs: the outputs before postnet calculation after_outs: the outputs after postnet calculation
- call(samples, training: bool = None)¶
call model
- synthesize(samples)¶
- class athena.FastSpeech2(data_descriptions, config=None)¶
Bases:
athena.models.tts.fastspeech.FastSpeech
Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- default_config¶
- call(samples, training: bool = None)¶
call model
- synthesize(samples)¶
- class athena.Tacotron2(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
An implementation of Tacotron2. Reference: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- default_config¶
- _pad_and_reshape(outputs, ori_lens, reverse=False)¶
- Parameters
outputs – true labels, shape: [batch, y_steps, feat_dim]
ori_lens – scalar
Returns:
reshaped_outputs: outputs reshaped to match reduction_factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
- call(samples, training: bool = None)¶
call model
- initialize_input_y(y)¶
- Parameters
y – the true label, shape: [batch, y_steps, feat_dim]
Returns:
y0: zeros padded as one step at the start, [batch, y_steps+1, feat_dim]
- initialize_states(encoder_output, input_length)¶
- Parameters
encoder_output – encoder outputs, shape: [batch, x_step, eunits]
input_length – shape: [batch]
Returns:
prev_rnn_states: initial states of rnns in decoder, [rnn_layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
- concat_speaker_embedding(encoder_output, speaker_embedding)¶
- Parameters
encoder_output – encoder output (batch, x_steps, eunits)
speaker_embedding – speaker embedding (batch, embedding_dim)
- Returns
the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
- time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)¶
- Parameters
encoder_output – encoder output (batch, x_steps, eunits).
input_length – (batch,)
prev_y – one step of true labels or predicted labels (batch, feat_dim).
prev_rnn_states – previous rnn states [layers, 2, states] for lstm
prev_attn_weight – previous attention weights, shape: [batch, x_steps]
prev_context – previous context vector: [batch, attn_dim]
training – if it is training mode
Returns:
out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
- get_loss(outputs, samples, training=None)¶
get loss
- synthesize(samples)¶
Synthesize acoustic features from the input texts
- Parameters
samples – the data source to be synthesized
Returns:
after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
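A hedged inference sketch; `model` and `samples` are assumed to be a trained Tacotron2 and a prepared sample dict from a dataset builder:
```python
# Hypothetical call: returns synthesized features and attention weights.
after_outs, attn_weights_stack = model.synthesize(samples)
mel = after_outs[0]  # acoustic features for the first utterance
```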
- _synthesize_post_net(before_outs, logits_stack)¶
- Parameters
before_outs – the outputs before postnet
logits_stack – the logits of all steps
Returns:
after_outs: the corresponding synthesized acoustic features
- class athena.TTSTransformer(data_descriptions, config=None)¶
Bases:
athena.models.tts.tacotron2.Tacotron2
TTS version of SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network
- default_config¶
- call(samples, training: bool = None)¶
- time_propagate(encoder_output, memory_mask, outs, step)¶
Synthesize one step of frames
- Parameters
encoder_output – the encoder output, shape: [batch, x_steps, eunits]
memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]
outs (TensorArray) – previous outputs
step – the current step number
Returns:
out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights; each element in the list represents the attention weights of one decoder layer, shape: [batch, num_heads, seq_len_q, seq_len_k]
- synthesize(samples)¶
Synthesize acoustic features from the input texts
- Parameters
samples – the data source to be synthesized
Returns:
after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
- class athena.CnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
CNN model for kws
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSConformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSConformer. Model mainly consists of two parts: the x_net for input preparation and the conformer itself
- default_config¶
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.CRnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
CRNN model for e2e kws
- default_config¶
- input_features¶
```python
_, _, w, c = input_features.get_shape().as_list()
output_dim = w * c
inner = layers.Reshape((-1, output_dim))(input_features)
inner = PCENLayer()(inner)
inner = layers.Reshape((-1, w, c))(inner)
```
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.DnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
implementation of frame-level or segment-level speech classification
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.MISPModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
MISP challenge KWS baseline model for e2e kws
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSTransformer_2Dense(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of two parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.KWSTransformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of two parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSAVTransformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of two parts: the x_net for input preparation and the transformer itself
- default_config¶
- inner¶
v_net
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.KWSTransformerRESNET(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of two parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSTransformer_FocalLoss(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of two parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
tensorflow.keras.Model
Base Training Solver.
- default_config¶
- static initialize_devices(solver_gpus=None)¶
initialize devices, should be called first
- static clip_by_norm(grads, norm)¶
clip norm using tf.clip_by_norm
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- save_checkpointer(checkpointer, devset, epoch)¶
- evaluate_step(samples)¶
evaluate the model 1 step
- evaluate(dataset, epoch)¶
evaluate the model
- class athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
BaseSolver
A multi-process solver based on Horovod
- static initialize_devices(solver_gpus=None)¶
initialize horovod devices, should be called first
For example, if you have two machines and each of them contains 4 GPUs:
1. run with command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]: then the first and the last GPU on machine1 and all GPUs on machine2 will be used.
2. run with command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []: then the first 2 GPUs on machine1 and all GPUs on machine2 will be used.
- Parameters
solver_gpus ([list]) – a list to specify gpus being used.
- Raises
ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.
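The rank-to-GPU mapping described above can be made concrete with a small stand-alone illustration (plain Python, independent of Horovod):
```python
# Reproduces the documented example: horovodrun -np 6 -H ip1:2,ip2:4
solver_gpus = [0, 3, 0, 1, 2, 3]   # one entry per horovod rank
hosts = [("ip1", 2), ("ip2", 4)]   # parsed from -H ip1:2,ip2:4
rank = 0
for host, nproc in hosts:
    for _ in range(nproc):
        print(f"rank {rank} -> {host} GPU {solver_gpus[rank]}")
        rank += 1
# rank 0 -> ip1 GPU 0, rank 1 -> ip1 GPU 3, ranks 2-5 -> ip2 GPUs 0-3
```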
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate(dataset, epoch=0)¶
evaluate the model
- class athena.DecoderSolver(model, data_descriptions=None, config=None)¶
Bases:
BaseSolver
ASR DecoderSolver
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_saved_model(dataset_builder, rank_size=1, conf=None)¶
decode the model
- class athena.AVSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
tensorflow.keras.Model
Base Solver.
- default_config¶
- static initialize_devices(solver_gpus=None)¶
initialize devices, should be called first
- static clip_by_norm(grads, norm)¶
clip norm using tf.clip_by_norm
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate_step(samples)¶
evaluate the model 1 step
- evaluate(dataset, epoch)¶
evaluate the model
- class athena.AVHorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
AVSolver
A multi-process solver based on Horovod
- static initialize_devices(solver_gpus=None)¶
initialize horovod devices, should be called first
For example, if you have two machines and each of them contains 4 GPUs:
1. run with command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]: then the first and the last GPU on machine1 and all GPUs on machine2 will be used.
2. run with command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []: then the first 2 GPUs on machine1 and all GPUs on machine2 will be used.
- Parameters
solver_gpus ([list]) – a list to specify gpus being used.
- Raises
ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate(dataset, epoch=0)¶
evaluate the model
- class athena.AVDecoderSolver(model, data_descriptions=None, config=None)¶
Bases:
AVSolver
Audio-visual DecoderSolver
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_freeze(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_argmax(dataset_builder, rank_size=1, conf=None)¶
decode the model
- class athena.VadSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, data_descriptions=None, config=None)¶
Bases:
BaseSolver
VadSolver
- default_config¶
- inference(dataset, rank_size=1, conf=None)¶
decode the model
- class athena.SynthesisSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
BaseSolver
SynthesisSolver (TTS Solver)
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
synthesize using vocoder on dataset
- inference_saved_model(dataset_builder, rank_size=1, conf=None)¶
synthesize using vocoder on dataset
- class athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')¶
Bases:
tensorflow.keras.losses.Loss
CTC loss implemented with TensorFlow
- __call__(logits, samples, logit_length=None)¶
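TensorFlow's built-in op is the natural basis for such a loss; a minimal sketch under that assumption (the class's actual handling of the sample dict is not shown here):
```python
import tensorflow as tf

def ctc_loss(logits, labels, logit_length, label_length, blank_index=-1):
    # logits: [batch, time, num_classes]; labels: [batch, max_label_len]
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_length,
        logit_length=logit_length,
        logits_time_major=False,
        blank_index=blank_index,
    )
    return tf.reduce_mean(loss)
```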
- class athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)¶
Bases:
tensorflow.keras.losses.CategoricalCrossentropy
Seq2SeqSparseCategoricalCrossentropy loss: CategoricalCrossentropy calculated at each character for each sequence in a batch
- __call__(logits, samples, logit_length=None)¶
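A sketch of per-character cross entropy with label smoothing, built on the same Keras base class; masking of padding/eos positions is omitted for brevity and the shapes are illustrative:
```python
import tensorflow as tf

num_classes = 30
cce = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=0.1,
    reduction=tf.keras.losses.Reduction.NONE)

labels = tf.constant([[5, 2, 7]])                 # [batch, seq_len] token ids
logits = tf.random.normal([1, 3, num_classes])    # [batch, seq_len, num_classes]
per_token_loss = cce(tf.one_hot(labels, num_classes), logits)  # [batch, seq_len]
```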
- class athena.CTCAccuracy(name='CTCAccuracy')¶
Bases:
CharactorAccuracy
CTCAccuracy: inherits CharactorAccuracy and implements CTC accuracy calculation
- __call__(logits, samples, logit_length=None)¶
Accumulate errors and counts, logit_length is the output length of encoder
- class athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')¶
Bases:
CharactorAccuracy
Seq2SeqSparseCategoricalAccuracy: inherits CharactorAccuracy and implements attention accuracy calculation
- __call__(logits, samples, logit_length=None)¶
Accumulate errors and counts
- class athena.Checkpoint(checkpoint_directory=None, use_dev_loss=True, model=None, **kwargs)¶
Bases:
tensorflow.train.Checkpoint
A wrapper for TensorFlow checkpoint
- Parameters
checkpoint_directory – the directory for checkpoint
summary_directory – the directory for summary used in Tensorboard
__init__ – provide the optimizer and model
__call__ – save the model
Example
>>> transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
>>> optimizer = tf.keras.optimizers.Adam()
>>> ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
>>>                   transformer=transformer, optimizer=optimizer)
>>> solver = BaseSolver(transformer)
>>> for epoch in dataset:
>>>     ckpt()
- _file_compatible(use_dev_loss)¶
Convert the n-best file to a CSV file.
Add "index" and "Accuracy" columns for non-CSV n-best files.
- _compare_and_save_best(loss, metrics, save_path, training=False)¶
compare and save the best model with best_loss and N best metrics
- compute_nbest_avg(model_avg_num, sort_by=None, sort_by_time=False, reverse=True)¶
Restore the n-best average checkpoint.
If 'sort_by_time' is False, the n-best order is sorted by 'sort_by'; if 'sort_by_time' is True, the newest models are selected; if 'reverse' is True, the largest models in the sorted order are selected.
- __call__(loss=None, metrics=None, training=False)¶
- restore_from_best()¶
restore from the best model
- class athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
WarmUp Learning rate schedule for Adam
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule(512),
>>>                                      beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Idea from the paper: Attention Is All You Need
- __call__(step)¶
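The underlying rule is the warm-up schedule from Attention Is All You Need; a minimal sketch of that rule (the class's k, decay_steps, and decay_rate parameters extend it):
```python
import tensorflow as tf

def noam_lr(step, model_dim=512, warmup_steps=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = tf.cast(step, tf.float32)
    return model_dim ** -0.5 * tf.minimum(
        step ** -0.5, step * warmup_steps ** -1.5)
```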
- class athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
- default_config¶
- class athena.WarmUpLearningSchedule1(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0, lr=None)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
WarmUp learning rate schedule for Adam that can also be initialized with a given learning rate
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule1(512),
>>>                                      beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Idea from the paper: Attention Is All You Need
- __call__(step)¶
- class athena.WarmUpAdam1(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
- default_config¶
- class athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
ExponentialDecayLearningRateSchedule
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=ExponentialDecayLearningRateSchedule(0.01, 100))
- Parameters
initial_lr –
decay_steps –
- Returns
initial_lr * (0.5 ** (step // decay_steps))
- __call__(step)¶
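The documented rule can be sketched directly; the start_decay_steps warm-up and final_lr floor are omitted here:
```python
def exponential_decay_lr(step, initial_lr=0.005, decay_steps=10000, decay_rate=0.5):
    # matches the documented return value:
    # initial_lr * (decay_rate ** (step // decay_steps))
    return initial_lr * decay_rate ** (step // decay_steps)

assert exponential_decay_lr(25000) == 0.005 * 0.25  # two decay intervals elapsed
```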
- class athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
ExponentialDecayAdam implementation
- default_config¶
- class athena.HParams(model_structure=None, **kwargs)¶
Bases:
object
Class to hold a set of hyperparameters as name-value pairs.
A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.
You first create a HParams object by specifying the names and values of the hyperparameters.
To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:
```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate ==> 0.1
hparams.num_hidden_units ==> 100
```
Hyperparameters have a type, which is inferred from the type of the value passed at construction time. The currently supported types are: integer, float, boolean, string, and lists of integer, float, boolean, or string.
You can override hyperparameter values by calling the parse() method, passing a string of comma separated name=value pairs. This makes it possible to override any hyperparameter values from a single command-line flag to which the user passes 'hyper-param=value' pairs. It avoids having to define one flag for each hyperparameter.
The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.
Example:
```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse
parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = tf.HParams(learning_rate=0.1, num_hidden_units=100,
                         activations=['relu', 'tanh'])

    # Override hyperparameter values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed --hparams=learning_rate=0.3 on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate ==> 0.3
    hparams.num_hidden_units ==> 100
    hparams.activations ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```
- _HAS_DYNAMIC_ATTRIBUTES = True¶
- add_hparam(name, value)¶
Adds {name, value} pair to hyperparameters.
- Parameters
name – Name of the hyperparameter.
value – Value of the hyperparameter. Can be one of the following types: int, float, string, or a list of int, float, or string values.
- Raises
ValueError – if one of the arguments is invalid.
- set_hparam(name, value)¶
Set the value of an existing hyperparameter.
This function verifies that the type of the value matches the type of the existing hyperparameter.
- Parameters
name – Name of the hyperparameter.
value – New value of the hyperparameter.
- Raises
KeyError – If the hyperparameter doesn’t exist.
ValueError – If there is a type mismatch.
- del_hparam(name)¶
Removes the hyperparameter with key ‘name’.
Does nothing if it isn’t present.
- Parameters
name – Name of the hyperparameter.
- parse(values, ignore_unknown=False)¶
Override existing hyperparameter values, parsing new values from a string.
See parse_values for more detail on the allowed format for values.
- Parameters
values – String. Comma separated list of name=value pairs where 'value' must follow the syntax described above.
- Returns
The HParams instance.
- Raises
ValueError – If values cannot be parsed or a hyperparameter in values doesn't exist.
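Usage follows directly from the description above:
```python
from athena import HParams

hparams = HParams(learning_rate=0.1, num_hidden_units=100)
hparams.parse("learning_rate=0.3,num_hidden_units=200")
assert hparams.learning_rate == 0.3
assert hparams.num_hidden_units == 200
```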
- override_from_dict(values_dict)¶
Override existing hyperparameter values, parsing new values from a dictionary.
- Parameters
values_dict – Dictionary of name:value pairs.
- Returns
The HParams instance.
- Raises
KeyError – If a hyperparameter in values_dict doesn’t exist.
ValueError – If values_dict cannot be parsed.
- set_model_structure(model_structure)¶
- get_model_structure()¶
- to_json(indent=None, separators=None, sort_keys=False)¶
Serializes the hyperparameters into JSON.
- Parameters
indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.
separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').
sort_keys – If True, the output dictionaries will be sorted by key.
- Returns
A JSON string.
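For example, serializing and re-parsing the same parameters:
```python
from athena import HParams

hparams = HParams(learning_rate=0.1)
json_str = hparams.to_json(indent=2, sort_keys=True)
hparams.parse_json(json_str)  # round-trips the same values back
```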
- parse_json(values_json)¶
Override existing hyperparameter values, parsing new values from a json object.
- Parameters
values_json – String containing a json object of name:value pairs.
- Returns
The HParams instance.
- Raises
KeyError – If a hyperparameter in values_json doesn’t exist.
ValueError – If values_json cannot be parsed.
- values()¶
Return the hyperparameter values as a Python dictionary.
- Returns
A dictionary with hyperparameter names as keys. The values are the hyperparameter values.
- get(key, default=None)¶
Returns the value of key if it exists, else default.
- __contains__(key)¶
- __str__()¶
Return str(self).
- __repr__()¶
Return repr(self).
- static _get_kind_name(param_type, is_list)¶
Returns the field name given parameter type and is_list.
- Parameters
param_type – Data type of the hparam.
is_list – Whether this is a list.
- Returns
A string representation of the field name.
- Raises
ValueError – If parameter type is not recognized.
- instantiate()¶
- append(hp)¶
- athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)¶
register default config and parse
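A hedged sketch of the intended call pattern; the exact merge semantics of config and kwargs are an assumption, not documented here:
```python
from athena import register_and_parse_hparams

default_config = {"d_model": 512, "num_heads": 8, "dropout": 0.1}
hparams = register_and_parse_hparams(default_config, config={"num_heads": 4})
# hparams is expected to carry the defaults with num_heads overridden (assumed)
```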
- athena.generate_square_subsequent_mask(size)¶
Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
- athena.generate_square_subsequent_mask_u2(size)¶
Generate a square mask for the sequence. The masked positions are filled with bool(True). Unmasked positions are filled with bool(False).
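A minimal re-implementation sketch of the two mask semantics described above (not the library source):
```python
import tensorflow as tf

def square_subsequent_mask(size):
    """Float mask: future positions -> 1.0, visible positions -> 0.0."""
    lower = tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return 1.0 - lower

def square_subsequent_mask_u2(size):
    """Boolean mask: future positions -> True, visible positions -> False."""
    return tf.cast(square_subsequent_mask(size), tf.bool)

print(square_subsequent_mask(3).numpy())
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
```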
- athena.get_wave_file_length(wave_file)¶
get the wave file length (duration) in ms
- Parameters
wave_file – the path of wave file
- Returns
the length (ms) of the wave file
- Return type
wav_length
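A stand-alone sketch of the described behavior using the standard wave module (the library may compute this differently):
```python
import wave

def wave_file_length_ms(wave_file):
    # duration in milliseconds = frames / sample_rate * 1000
    with wave.open(wave_file, "rb") as f:
        return 1000.0 * f.getnframes() / f.getframerate()
```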
- athena.set_default_summary_writer(summary_directory=None)¶
- athena.get_dict_from_scp(vocab, func=lambda x: ...)¶
- class athena.CTCPrefixScoreTH(x, xlens, blank, eos, margin=0)¶
Bases:
object
Batch processing of CTCPrefixScore
which is based on Algorithm 2 in Watanabe et al., "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al., "Vectorized Beam Search for CTC-Attention-Based Speech Recognition," in INTERSPEECH (pp. 3825-3829), 2019.
- __call__(y, state, scoring_ids=None, att_w=None)¶
Compute CTC prefix scores for next labels
- Parameters
y – tensor(shape=[W, L]), prefix label sequences
state (tuple) –
previous CTC state: tuple(tensor(shape=[T, 2, W]), tensor(shape=[W, O]), 0, 0)
scoring_ids (torch.Tensor) – scores for pre-selection of hypotheses [Beam, Beam * pre_beam_ratio]
att_w (torch.Tensor) – attention weights to decide CTC window
- Returns
new_state, ctc_local_scores (BW, O)
- index_select_state(state, best_ids)¶
Select CTC states according to best ids
- Parameters
state – CTC state
best_ids – index numbers selected by beam pruning (B, W)
- Returns
selected_state
- athena.__version__ = 2.0¶