athena¶
module
Subpackages¶
- athena.data
- athena.layers
- athena.models
- athena.models.asr
- athena.models.kws
- athena.models.kws.base
- athena.models.kws.cnn_wakeup
- athena.models.kws.conformer_wakeup
- athena.models.kws.crnn_wakeup
- athena.models.kws.dnn_wakeup
- athena.models.kws.misp_wakeup
- athena.models.kws.transformer_av_wakeup
- athena.models.kws.transformer_wakeup
- athena.models.kws.transformer_wakeup_2dense
- athena.models.kws.transformer_wakeup_average_pooling
- athena.models.kws.transformer_wakeup_focal_loss
- athena.models.kws.transformer_wakeup_resnet
- athena.models.lm
- athena.models.tts
- athena.models.vad
- athena.models.base
- athena.models.masked_pc
- athena.tools
- athena.transform
- athena.transform.feats
- athena.transform.feats.ops
- athena.transform.feats.add_rir_noise_aecres
- athena.transform.feats.add_rir_noise_aecres_test
- athena.transform.feats.base_frontend
- athena.transform.feats.cmvn
- athena.transform.feats.cmvn_test
- athena.transform.feats.fbank
- athena.transform.feats.fbank_pitch
- athena.transform.feats.fbank_pitch_test
- athena.transform.feats.fbank_test
- athena.transform.feats.framepow
- athena.transform.feats.framepow_test
- athena.transform.feats.mel_spectrum
- athena.transform.feats.mel_spectrum_test
- athena.transform.feats.mfcc
- athena.transform.feats.mfcc_test
- athena.transform.feats.pitch
- athena.transform.feats.pitch_test
- athena.transform.feats.read_wav
- athena.transform.feats.read_wav_test
- athena.transform.feats.spectrum
- athena.transform.feats.spectrum_test
- athena.transform.feats.write_wav
- athena.transform.feats.write_wav_test
- athena.transform.audio_featurizer
- athena.utils
Submodules¶
Package Contents¶
Classes¶
SpeechDatasetBuilder
LanguageDatasetBuilder
SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder
SpeechRecognitionDatasetBatchBinsBuilder
SpeechRecognitionDatasetBatchBinsKaldiIOBuilder
SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetBatchBinsBuilder
SpeechDatasetBuilder
MpcSpeechDatasetKaldiIOBuilder
SpeechSynthesisDatasetBuilder
SpeechSynthesisDatasetBuilder
Feature Normalizer
Fastspeech2 Feature Normalizer
VoiceActivityDetectionDatasetKaldiIOBuilder
Dataset builder for CNN model. The builder treats every spliced frame as one image.
Dataset builder for RNN model. The builder mixes the spliced frames into one dim
Dataset builder for RNN model. The builder mixes the spliced frames into one dim
The main text featurizer interface
TextTokenizer
Positional encoding that can be used in a transformer
collapse4d can be used in cnn-lstm for speech processing
An implementation of Tdnn Layer
Gaussian Error Linear Unit.
Multi-head attention consists of four parts:
The Bahdanau attention
Refer to [Hierarchical Attention Networks for Document Classification]
Refer to [Learning Natural Language Inference with LSTM]
A transformer model. Users can modify the attributes as needed.
TransformerEncoder is a stack of N encoder layers
TransformerDecoder is a stack of N decoder layers
TransformerEncoderLayer is made up of self-attn and feedforward network.
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
Basic block of resnet
Base class for model.
Implementation for MPC pretrain model
In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to
Standard implementation of a SpeechTransformer. Model mainly consists of three parts:
Standard implementation of a SpeechTransformer. Model mainly consists of three parts:
Standard implementation of a SpeechTransformer. Model mainly consists of three parts:
U2 implementation of a SpeechTransformer. Model mainly consists of three parts:
Conformer-U2
In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to
Audio and video multimode Conformer. Model mainly consists of three parts:
Implementation of a frame level or segment speech classification
Implementation of a frame level or segment speech classification
Standard implementation of a RNNLM. Model mainly consists of embedding layer,
Standard implementation of a RNNLM. Model mainly consists of embedding layer,
Reference: Fastspeech: Fast, robust and controllable text to speech
Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
An implementation of Tacotron2
TTS version of SpeechTransformer. Model mainly consists of three parts:
CNN model for kws
Standard implementation of a KWSConformer. Model mainly consists of three parts:
CRNN model for e2e kws
Implementation of a frame level or segment speech classification
MISP challenge KWS baseline model for e2e kws
Standard implementation of a KWSTransformer. Model mainly consists of three parts:
Standard implementation of a KWSTransformer. Model mainly consists of three parts:
Standard implementation of a KWSTransformer. Model mainly consists of three parts:
Standard implementation of a KWSTransformer. Model mainly consists of three parts:
Standard implementation of a KWSTransformer. Model mainly consists of three parts:
Base Training Solver.
A multi-processor solver based on Horovod
ASR DecoderSolver
Base Solver.
A multi-processor solver based on Horovod
DecoderSolver
VadSolver
SynthesisSolver (TTS Solver)
CTC LOSS
Seq2SeqSparseCategoricalCrossentropy LOSS
CTCAccuracy
Seq2SeqSparseCategoricalAccuracy
A wrapper for Tensorflow checkpoint
WarmUp Learning rate schedule for Adam
WarmUpAdam Implementation
WarmUp Learning rate schedule for Adam and can initialize a learning rate
WarmUpAdam Implementation
ExponentialDecayLearningRateSchedule
WarmUpAdam Implementation
Class to hold a set of hyperparameters as name-value pairs.
Batch processing of CTCPrefixScore
Functions¶
generate a positional encoding list
reshape from [N T D C] -> [N T D*C]
Gaussian Error Linear Unit.
register default config and parse
Generate a square mask for the sequence. The masked positions are filled with float(1.0).
Generate a square mask for the sequence. The masked positions are filled with bool(True).
get the wave file length (duration) in ms
Attributes¶
- class athena.SpeechDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechDatasetBuilder
- property num_class¶
- Returns
the target dim
- Return type
int
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output": tf.float32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape( [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels] ), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None, None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec( shape=(None, None, None, None), dtype=tf.float32 ), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_data, "input_length": input_data.shape[0], "output": output_data, "output_length": output_data.shape[0], }
- Return type
dict
- class athena.LanguageDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.BaseDatasetBuilder
LanguageDatasetBuilder
- property num_class¶
- Returns
the max_index of the vocabulary
- Return type
int
- property input_vocab_size¶
- Returns
the input vocab size
- Return type
int
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.int32, "input_length": tf.int32, "output": tf.int32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
load csv file
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_labels, "input_length": input_length, "output": output_labels, "output_length": output_length, }
- Return type
dict
- class athena.SpeechRecognitionDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, "utt_id": tf.string, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt_id": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- storage_features_offline()¶
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt_id": utt_id }
- Return type
dict
- class athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetKaldiIOBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶
Generate a list of tuples (feat_key, speaker).
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.SpeechRecognitionDatasetBatchBinsBuilder(config=None)¶
Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetBatchBinsBuilder
- property sample_shape_batch_bins¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
- __getitem__(index)¶
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
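The sharding described above can be sketched without TensorFlow. This is a minimal illustration that assumes a round-robin split over a plain list of entries, similar in spirit to `tf.data.Dataset.shard`. `shard_entries` is a hypothetical helper, not Athena's implementation, which may select entries differently.

```python
def shard_entries(entries, num_shards, index):
    """Keep every num_shards-th entry, starting at position `index`."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    # Slicing with a stride gives each worker a disjoint 1/num_shards subset.
    return entries[index::num_shards]


entries = ["utt0", "utt1", "utt2", "utt3", "utt4", "utt5"]
print(shard_entries(entries, num_shards=3, index=1))  # ['utt1', 'utt4']
```

In a multi-worker setup, each worker would typically pass its own rank as `index` so that no two workers read the same entries.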
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size (in batch_bins mode). Defaults to 1.
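One way to picture batch-wise shuffling is sketched below, assuming that whole batches are permuted while the order inside each batch is kept. `shuffle_batches` is a hypothetical helper, and the way `seed` and `epoch` are combined here is an assumption rather than Athena's documented behavior.

```python
import random


def shuffle_batches(entries, batch_size=1, epoch=-1, seed=917):
    """Permute whole batches; entries inside a batch keep their order."""
    num_batches = len(entries) // batch_size
    order = list(range(num_batches))
    # Assumption: mixing the epoch into the seed gives a different but
    # reproducible batch order for every epoch.
    random.Random(seed + epoch).shuffle(order)
    shuffled = []
    for b in order:
        shuffled.extend(entries[b * batch_size:(b + 1) * batch_size])
    # A trailing partial batch, if any, stays at the end.
    shuffled.extend(entries[num_batches * batch_size:])
    return shuffled
```

Keeping batches intact matters when entries were sorted by length beforehand, because each batch then still contains utterances of similar duration.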
- class athena.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder
SpeechRecognitionDatasetBatchBinsKaldiIOBuilder
- property sample_shape_batch_bins¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), }
- Return type
dict
- default_config¶
- preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶
- read_shape_file(file_dir=None)¶
- __getitem__(index)¶
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size (in batch_bins mode). Defaults to 1.
- class athena.AudioVedioRecognitionDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechRecognitionDatasetBuilder
- property num_class¶
return the max_index of the vocabulary + 1
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), }
- Return type
dict
- default_config¶
- video_scp_loader(scp_dir)¶
Load the video list from an scp file and return a dict.
- image_normalizer(image)¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- storage_features_offline()¶
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, }
- Return type
dict
- class athena.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)¶
Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder
SpeechRecognitionDatasetBatchBinsBuilder
- property sample_shape_batch_bins¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, None, dim, nc]), "video":tf.TensorShape([None, None, high, wide]), "input_length": tf.TensorShape([None]), "output_length": tf.TensorShape([None]), "output": tf.TensorShape([None, None]), "utt_id": tf.TensorShape([None]), }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "video": tf.TensorShape([None, None, None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt_id": tf.TensorShape([]), }
- Return type
dict
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, "utt_id": tf.string, }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt_id": utt_id }
- Return type
dict
- __len__()¶
- as_dataset(batch_size=16, num_threads=1)¶
return tf.data.Dataset object
- shard(num_shards, index)¶
creates a Dataset that includes only 1/num_shards of this dataset
- batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶
Batch-wise shuffling of the data entries.
- Parameters
batch_size (int, optional) – an integer for the batch size (in batch_bins mode). Defaults to 1.
- class athena.MpcSpeechDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechDatasetBuilder. This data builder is an online feature extractor used for MPC training.
- property num_class¶
- Returns
the target dim
- Return type
int
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output": tf.float32, "output_length": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape( [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels] ), "input_length": tf.TensorShape([]), "output": tf.TensorShape([None, None]), "output_length": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec( shape=(None, None, None, None), dtype=tf.float32 ), "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32), "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32), }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, speaker).
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": input_data, "input_length": input_data.shape[0], "output": output_data, "output_length": output_data.shape[0], }
- Return type
dict
- class athena.MpcSpeechDatasetKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder
MpcSpeechDatasetKaldiIOBuilder. This data builder is an offline feature data builder used for MPC training.
- default_config¶
- preprocess_data(file_path, apply_sort_filter=True)¶
generate a list of tuples (feat_key, speaker).
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.SpeechSynthesisDatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
SpeechSynthesisDatasetBuilder
- property num_class¶
- Returns
the max_index of the vocabulary
- Return type
int
- property feat_dim¶
return the number of feature dims
- property sample_type¶
- Returns
sample_type of the dataset:
{ "utt_id": tf.string, "input": tf.int32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.float32, "speaker": tf.int32 }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "utt_id": tf.TensorShape([]), "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None, feature_dim]), "speaker": tf.TensorShape([]) }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, feature_dim), dtype=tf.float32), "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32) }
- Return type
dict
- default_config¶
- preprocess_data(file_path)¶
generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
- __getitem__(index)¶
- class athena.SpeechFastspeech2DatasetBuilder(config=None)¶
Bases: athena.data.datasets.base.BaseDatasetBuilder
SpeechSynthesisDatasetBuilder
- property num_class¶
- Returns
the max_index of the vocabulary
- Return type
int
- property feat_dim¶
return the number of feature dims
- property sample_type¶
- Returns
sample_type of the dataset:
{ "utt_id": tf.string, "input": tf.int32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.float32, "speaker": tf.int32, "duration": tf.int32 }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "utt_id": tf.TensorShape([]), "input": tf.TensorShape([None]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None, feature_dim]), "f0": tf.TensorShape([None]), "energy": tf.TensorShape([None]), "speaker": tf.TensorShape([]), "duration": tf.TensorShape([None]) }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string), "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None, feature_dim), dtype=tf.float32), "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32), "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32), "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32) }
- Return type
dict
- default_config¶
- load_duration(duration)¶
- preprocess_data(file_path)¶
generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).
- load_audio_feature(audio_feature_file)¶
- __getitem__(index)¶
- compute_cmvn_if_necessary(is_necessary=True)¶
compute cmvn file
- class athena.FeatureNormalizer(cmvn_file=None)¶
Feature Normalizer
- __call__(feat_data, speaker, reverse=False)¶
- apply_cmvn(feat_data, speaker, reverse=False)¶
transform original feature to normalized feature
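As a rough illustration of what applying and reversing CMVN means, the sketch below normalizes each feature dimension with a given mean and variance. `normalize_frames` is a hypothetical helper. It assumes the per-speaker statistics have already been looked up, and it omits details such as variance flooring that a real implementation may include.

```python
import math


def normalize_frames(frames, mean, var, reverse=False):
    """Apply (x - mean) / sqrt(var) per dimension, or undo it when reverse=True."""
    out = []
    for frame in frames:
        if reverse:
            # Map normalized features back to the original scale.
            out.append([x * math.sqrt(v) + m for x, m, v in zip(frame, mean, var)])
        else:
            out.append([(x - m) / math.sqrt(v) for x, m, v in zip(frame, mean, var)])
    return out


frames = [[1.0, 10.0], [3.0, 14.0]]
normalized = normalize_frames(frames, mean=[2.0, 12.0], var=[1.0, 4.0])
restored = normalize_frames(normalized, mean=[2.0, 12.0], var=[1.0, 4.0], reverse=True)
```

The reverse direction is useful in synthesis, where a model predicts normalized features that must be mapped back before vocoding.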
- compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)¶
compute cmvn for filtered entries
- compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)¶
Because of memory constraints, an incremental approximation is used to calculate cmvn.
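A common way to compute such statistics without holding all features in memory is to keep running sums per dimension. The sketch below assumes that approach. `running_mean_var` is a hypothetical name, and the exact update rule Athena uses is not stated here.

```python
def running_mean_var(chunks, feature_dim):
    """Accumulate per-dimension sums over chunks, then derive mean and variance."""
    total = [0.0] * feature_dim
    total_sq = [0.0] * feature_dim
    count = 0
    for chunk in chunks:  # each chunk is a list of frames
        for frame in chunk:
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
            count += 1
    if count == 0:
        raise ValueError("no frames were provided")
    mean = [s / count for s in total]
    # Var[x] = E[x^2] - E[x]^2, computed per dimension.
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    return mean, var
```

Only the two accumulators and the frame count survive between chunks, so memory use is independent of the dataset size.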
- compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)¶
compute cmvn for filtered entries using kaldi-format data
- load_cmvn()¶
load mean and var
- save_cmvn(variable_list)¶
save cmvn variables determined by variable_list to file
- Parameters
variable_list (list) – e.g. [“speaker”, “mean”, “var”]
- class athena.FS2FeatureNormalizer(cmvn_file=None)¶
Bases: FeatureNormalizer
Fastspeech2 Feature Normalizer
- __call__(feat_data, speaker, feature_type='mel', reverse=False)¶
- compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)¶
compute cmvn of mel-spec, f0 and energy
- apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)¶
transform original feature to normalized feature
- load_cmvn()¶
load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var
- class athena.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder
VoiceActivityDetectionDatasetKaldiIOBuilder
- property sample_type¶
- Returns
sample_type of the dataset:
{ "input": tf.float32, "input_length": tf.int32, "output_length": tf.int32, "output": tf.int32, }
- Return type
dict
- property sample_shape¶
- Returns
sample_shape of the dataset:
{ "input": tf.TensorShape([None, dim, nc]), "input_length": tf.TensorShape([]), "output_length": tf.TensorShape([]), "output": tf.TensorShape([None]), "utt": tf.TensorShape([]), }
- Return type
dict
- property sample_signature¶
- Returns
sample_signature of the dataset:
{ "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32), "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32), "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32), "utt": tf.TensorSpec(shape=(None), dtype=tf.string), }
- Return type
dict
- default_config¶
- preprocess_data(data_scps_dir)¶
generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).
- splice_feature(feature)¶
Splice features according to input_left_context and input_right_context.
input_left_context: the left-context features to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right-context features to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
splice_feat – the spliced features
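The splicing with edge repetition described above can be sketched on plain lists. `splice_with_edge_repeat` is a hypothetical helper that treats each frame as a flat list of `dim` values. The context sizes used in the example are assumptions, chosen to match the (21, 63) and 1323 figures quoted in the wake-up dataset builders.

```python
def splice_with_edge_repeat(frames, left_context, right_context):
    """Concatenate each frame with its neighbors, repeating edge frames."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left_context, right_context + 1):
            # Clamp the index so out-of-range positions reuse the first/last frame.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


print(splice_with_edge_repeat([[1], [2], [3]], 1, 1))  # [[1, 1, 2], [1, 2, 3], [2, 3, 3]]
```

With 10 frames of context on each side and 63-dimensional features, each spliced frame holds 21 * 63 = 1323 values.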
- __getitem__(index)¶
get a sample
- Parameters
index (int) – index of the entries
- Returns
sample:
{ "input": feat, "input_length": feat_length, "output_length": label_length, "output": label, "utt": utt }
- Return type
dict
- class athena.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for CNN model. The builder treats every spliced frame as one image, for example (21, 63). The input data format is (batch, timestep, height, width, channel), for example (b, t, 21, 63, 1). unbatch is used to split the data. The output data format is, for example, (b, 21, 63, 1).
- property sample_type¶
example types
- property sent_sample_shape¶
- property sample_shape¶
examples shapes
- property sample_signature¶
examples signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
Splice features according to input_left_context and input_right_context.
input_left_context: the left-context features to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right-context features to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
splice_feat – the spliced features
- class athena.SpeechWakeupDatasetKaldiIOBuilder(config=None)¶
Bases: athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for RNN model. The builder mixes the spliced frames into one dim, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).
- property sample_type¶
example types
- property sample_shape¶
examples shapes
- property sample_signature¶
examples signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
Splice features according to input_left_context and input_right_context.
input_left_context: the left-context features to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right-context features to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
splice_feat – the spliced features
- class athena.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)¶
Bases: athena.data.datasets.base.BaseDatasetBuilder
Dataset builder for RNN model. The builder mixes the spliced frames into one dim, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).
- property sample_type¶
example types
- property sample_shape¶
examples shapes
- property sample_signature¶
examples signature
- default_config¶
- preprocess_data(data_dir='')¶
loading data
- video_scp_loader(scp_dir)¶
Load the video list from an scp file and return a dict.
- __getitem__(index)¶
- splice_feature(feature, input_left_context, input_right_context)¶
Splice features according to input_left_context and input_right_context.
input_left_context: the left-context features to be spliced; the first frame is repeated when the context runs out of range.
input_right_context: the right-context features to be spliced; the last frame is repeated when the context runs out of range.
- Parameters
feature – the input features, shape may be [timestamp, dim, 1]
- Returns
splice_feat – the spliced features
- class athena.TextFeaturizer(config=None)¶
The main text featurizer interface
- property model_type¶
- Returns
the model type
- property unk_index¶
- Returns
the unk index
- Return type
int
- supported_model¶
- default_config¶
- load_model(model_file)¶
load model
- delete_punct(tokens)¶
delete punctuation tokens
- __len__()¶
- encode(texts)¶
convert a sentence to a list of ids, with special tokens added.
- decode(sequences)¶
convert a list of ids to a sentence
- decode_to_list(sequences, ignored_id=[])¶
- class athena.TextTokenizer(text=None)¶
TextTokenizer
- load_model(text)¶
load model
- save_vocab(vocab_file)¶
- load_csv(csv_file)¶
- __len__()¶
- encode(texts)¶
convert a sentence to a list of ids, with special tokens added.
- decode(sequences)¶
convert a list of ids to a sentence
- decode_to_list(ids, ignored_id=[])¶
- athena.make_positional_encoding(position, d_model)¶
generate a positional encoding list
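For orientation, the sketch below builds a sinusoidal positional encoding table in plain Python. It assumes the standard formulation from "Attention Is All You Need", with sine on even indices and cosine on odd indices. `sinusoidal_table` is a hypothetical name, and the exact layout Athena produces is not confirmed by this page.

```python
import math


def sinusoidal_table(position, d_model):
    """Return a `position` x `d_model` table of sinusoidal encodings."""
    table = []
    for pos in range(position):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_table(position=4, d_model=6)
```

Row `pos` of the table would be added to the embedding at time step `pos`, giving the model access to order information.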
- athena.collapse4d(x, name=None)¶
Reshape from [N T D C] to [N T D*C] using tf.shape(x), which generates a tensor, instead of x.shape.
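The effect of this reshape can be shown on nested lists, without TensorFlow. `collapse_last_two` is a hypothetical stand-in that merges the last two axes in row-major order, which is the order a reshape would also produce.

```python
def collapse_last_two(x):
    """Flatten the last two axes of a nested [N][T][D][C] list into [N][T][D*C]."""
    return [
        [[value for row in frame for value in row] for frame in seq]
        for seq in x
    ]


x = [[[[1, 2], [3, 4], [5, 6]]]]  # N=1, T=1, D=3, C=2
print(collapse_last_two(x))  # [[[1, 2, 3, 4, 5, 6]]]
```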
- athena.gelu(x)¶
Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415
- Parameters
x – float Tensor to perform activation.
- Returns
x with the GELU activation applied.
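For reference, GELU on a scalar can be written in two common ways: the exact form based on the Gaussian CDF, and the tanh approximation given in the original paper. Which variant Athena implements is not stated here, so both are shown as a sketch with hypothetical function names.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU from the original paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

Both variants approach `x` for large positive inputs and approach zero for large negative inputs, which is what makes GELU a smooth relative of ReLU.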
- class athena.PositionalEncoding(d_model, max_position=800, scale=False)¶
Bases: tensorflow.keras.layers.Layer
Positional encoding that can be used in a transformer
- call(x)¶
call function
- class athena.Collapse4D¶
Bases: tensorflow.keras.layers.Layer
collapse4d can be used in cnn-lstm for speech processing; it reshapes from [N T D C] to [N T D*C].
- call(x)¶
- class athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)¶
Bases: tensorflow.keras.layers.Layer
An implementation of Tdnn Layer
- Parameters
context – an int of left and right context, or a list of context indexes, e.g. (-2, 0, 2).
output_dim – the dim of the linear transform
- call(x, training=None, mask=None)¶
- class athena.Gelu¶
Bases: tensorflow.keras.layers.Layer
Gaussian Error Linear Unit.
This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415
- Parameters
x – float Tensor to perform activation.
- Returns
x with the GELU activation applied.
- call(x)¶
- class athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)¶
Bases: tensorflow.keras.layers.Layer
Multi-head attention consists of four parts:
Linear layers and split into heads.
Scaled dot-product attention.
Concatenation of heads.
Final linear layer.
Each multi-head attention block gets three inputs: Q (query), K (key), and V (value). These are put through linear (Dense) layers and split into multiple heads. Scaled dot-product attention is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output of each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.
Instead of one single attention head, Q, K, and V are split into multiple heads because this allows the model to jointly attend to information at different positions from different representational spaces. After the split, each head has a reduced dimensionality, so the total computation cost is the same as that of single-head attention with full dimensionality.
- split_heads(x, batch_size)¶
Split the last dimension into (num_heads, depth).
Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
- call(v, k, q, mask)¶
call function
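The head-splitting step described for split_heads can be sketched on nested lists as follows. This standalone helper takes `num_heads` as an argument instead of the `batch_size` argument of the real method, so it is an illustration of the shape change only, not the layer's implementation.

```python
def split_heads(x, num_heads):
    """[batch][seq_len][d_model] -> [batch][num_heads][seq_len][depth]."""
    d_model = len(x[0][0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    depth = d_model // num_heads
    return [
        [
            # Head h sees features h*depth .. (h+1)*depth of every token.
            [token[h * depth:(h + 1) * depth] for token in sequence]
            for h in range(num_heads)
        ]
        for sequence in x
    ]


# batch=1, seq_len=3, d_model=8, split into 2 heads of depth 4.
x = [[[float(t * 8 + i) for i in range(8)] for t in range(3)]]
heads = split_heads(x, num_heads=2)
```

Each head works on `depth = d_model / num_heads` features, which is why the total cost matches single-head attention at full dimensionality.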
- class athena.BahdanauAttention(units, input_dim=1024)¶
Bases:
tensorflow.keras.Model
The Bahdanau attention.
- call(query, values)¶
call function
- class athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, features)
- build(input_shape)¶
build in keras layer
- call(inputs, training=None, mask=None)¶
call function in keras
- compute_output_shape(input_shape)¶
compute output shape
- _masked_softmax(logits, mask, axis)¶
Compute softmax with input mask.
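A masked softmax normalizes only over the positions the mask keeps. The sketch below works on a single 1-D list and assumes the convention that a mask value of 1 means "valid" and 0 means "ignore"; the real method also takes an `axis` argument and its mask convention is not documented here.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over positions where mask is truthy; masked positions get 0."""
    kept = [value for value, keep in zip(logits, mask) if keep]
    if not kept:
        return [0.0] * len(logits)
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) if keep else 0.0
            for value, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# The last step is padding and must not receive attention weight.
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], [1, 1, 1, 0])
```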
- class athena.MatchAttention(config, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Learning Natural Language Inference with LSTM](https://www.aclweb.org/anthology/N16-1170)
>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, steps, features)
- call(tensors)¶
Attention layer.
- class athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None, conv_module_kernel_size=0)¶
Bases:
tensorflow.keras.layers.Layer
A transformer model. Users can modify the attributes as needed.
- Parameters
d_model – the number of expected features in the encoder/decoder inputs (default=512).
nhead – the number of heads in the multiheadattention models (default=8).
num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of encoder/decoder intermediate layer, relu or gelu (default=gelu).
custom_encoder – custom encoder (default=None).
custom_decoder – custom decoder (default=None).
Examples
>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((32, 10, 512))
>>> tgt = tf.random.normal((32, 20, 512))
>>> out = transformer_model(src, tgt)
- call(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)¶
Take in and process masked source/target sequences.
- Parameters
src – the sequence to the encoder (required).
tgt – the sequence to the decoder (required).
src_mask – the additive mask for the src sequence (optional).
tgt_mask – the additive mask for the tgt sequence (optional).
memory_mask – the additive mask for the encoder output (optional).
src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).
- Shape:
src: \((N, S, E)\).
tgt: \((N, T, E)\).
src_mask: \((N, S)\).
tgt_mask: \((N, T)\).
memory_mask: \((N, S)\).
Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float(‘-inf’) and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.
output: \((N, T, E)\).
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number
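The note on masks says that True positions are masked with float('-inf'). The sketch below shows, for one row of attention scores, how such a boolean mask becomes an additive mask and why the masked positions end up with zero weight after the softmax. The helper names are hypothetical, and a real implementation may add a large negative constant instead of an actual infinity.

```python
import math


def to_additive_mask(bool_mask):
    """True (masked) -> -inf, False (visible) -> 0.0."""
    return [float("-inf") if masked else 0.0 for masked in bool_mask]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 1.5, -0.3, 0.9]          # raw attention scores for one query
padding = [False, False, True, True]    # last two source frames are padding
masked_scores = [s + m for s, m in zip(scores, to_additive_mask(padding))]
attention = softmax(masked_scores)
```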
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
- class athena.TransformerEncoder(encoder_layers)¶
Bases:
tensorflow.keras.layers.Layer
TransformerEncoder is a stack of N encoder layers.
- Parameters
encoder_layer – an instance of the TransformerEncoderLayer() class (required).
num_layers – the number of sub-encoder-layers in the encoder (required).
norm – the layer normalization component (optional).
Examples
>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.normal((10, 32, 512))
>>> out = transformer_encoder(src)
- call(src, src_mask=None, training=None)¶
Pass the input through the encoder layers in turn.
- Parameters
src – the sequence to the encoder (required).
src_mask – the mask for the src sequence (optional).
- set_unidirectional(uni=False)¶
whether to apply triangular masks to make the transformer unidirectional
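A triangular mask of the kind mentioned for set_unidirectional can be sketched as below. The helper `unidirectional_mask` is hypothetical, and treating `look_ahead` as the number of future positions that stay visible is an assumption based on the parameter name, not something this page confirms.

```python
def unidirectional_mask(seq_len, look_ahead=0):
    """mask[i][j] is True when position i must NOT attend to position j."""
    return [
        [j > i + look_ahead for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = unidirectional_mask(4)                        # strictly causal
with_lookahead = unidirectional_mask(4, look_ahead=1)  # one future frame visible
```

With `look_ahead=0` the mask is the usual strictly upper-triangular pattern, so each frame attends only to itself and to earlier frames.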
- class athena.TransformerDecoder(decoder_layers)¶
Bases:
tensorflow.keras.layers.Layer
TransformerDecoder is a stack of N decoder layers.
- Parameters
decoder_layer – an instance of the TransformerDecoderLayer() class (required).
num_layers – the number of sub-decoder-layers in the decoder (required).
norm – the layer normalization component (optional).
Examples
>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
...                  for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.normal((32, 10, 512))
>>> tgt = tf.random.normal((32, 20, 512))
>>> out = transformer_decoder(tgt, memory)
- call(tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)¶
Pass the inputs (and mask) through the decoder layers in turn.
- Parameters
tgt – the sequence to the decoder (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
- class athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None, conv_module_kernel_size=0)¶
Bases:
tensorflow.keras.layers.Layer
TransformerEncoderLayer is made up of self-attn and feedforward network.
- Parameters
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples
>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
- call(src, src_mask=None, training=None)¶
Pass the input through the encoder layer.
- Parameters
src – the sequence to the encoder layer (required).
src_mask – the mask for the src sequence (optional).
- set_unidirectional(uni=False)¶
whether to apply triangular masks to make the transformer unidirectional
- class athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')¶
Bases:
tensorflow.keras.layers.Layer
TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
- Reference:
“Attention Is All You Need”.
- Parameters
d_model – the number of expected features in the input (required).
nhead – the number of heads in the multiheadattention models (required).
dim_feedforward – the dimension of the feedforward network model (default=2048).
dropout – the dropout value (default=0.1).
activation – the activation function of intermediate layer, relu or gelu (default=gelu).
Examples
>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((32, 10, 512))
>>> tgt = tf.random.normal((32, 20, 512))
>>> out = decoder_layer(tgt, memory)
- call(tgt, memory, tgt_mask=None, memory_mask=None, training=None)¶
Pass the inputs (and mask) through the decoder layer.
- Parameters
tgt – the sequence to the decoder layer (required).
memory – the sequence from the last layer of the encoder (required).
tgt_mask – the mask for the tgt sequence (optional).
memory_mask – the mask for the memory sequence (optional).
- class athena.ResnetBasicBlock(num_filter, stride=1)¶
Bases:
tensorflow.keras.layers.Layer
Basic block of ResNet. Reference: “Deep residual learning for image recognition”.
- call(inputs)¶
call model
- make_downsample_layer(num_filter, stride)¶
perform downsampling using conv layer with stride != 1
- class athena.BaseModel(**kwargs)¶
Bases:
tensorflow.keras.Model
Base class for model.
- abstract call(samples, training=None)¶
call model
- get_loss(outputs, samples, training=None)¶
get loss
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- prepare_samples(samples)¶
for special data prepare carefully: do not change the shape of samples
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- decode(samples, hparams, decoder)¶
decode interface
- class athena.MaskedPredictCoding(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Implementation of the MPC pretraining model.
- Parameters
num_filters – a int type number, i.e the number of filters in cnn
d_model – a int type number, i.e dimension of model
num_heads – number of heads in transformer
num_encoder_layers – number of layer in encoder
dff – a int type number, i.e dimension of model
rate – rate of dropout layers
chunk_size – number of consecutive masks, i.e 1 or 3
keep_probability – probability not to be masked
mode – train mode, i.e MPC: pretrain
max_pool_layers – index of max pool layers in encoder, default is -1
- default_config¶
- call(samples, training: bool = None)¶
used for training
- Parameters
samples – a dict including the keys ‘input’, ‘input_length’, ‘output_length’, ‘output’. input: acoustic features, Tensor, shape is (batch, time_len, dim, 1), e.g. f-bank
Return:
MPC outputs to fit acoustic features
encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
- get_loss(logits, samples, training=None)¶
get MPC loss
- Parameters
logits – MPC output
Return:
MPC L1 loss and metrics
- compute_logit_length(samples)¶
compute the logit length
- generate_mpc_mask(input_data)¶
generate mask for pretraining
- Parameters
input_data – acoustic features, e.g. F-bank
Return:
mask tensor
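The class parameters `chunk_size` (number of consecutive masks) and `keep_probability` (probability not to be masked) suggest a chunk-wise random mask over time frames. The sketch below is one plausible reading of that description in plain Python; the function `generate_chunk_mask`, its `seed` argument, and the 1.0/0.0 encoding of keep/mask are assumptions, not Athena's implementation.

```python
import random


def generate_chunk_mask(num_frames, chunk_size, keep_probability, seed=None):
    """Return a per-frame mask: 1.0 keeps the frame, 0.0 masks it.

    Frames are decided in groups of chunk_size consecutive frames.
    """
    rng = random.Random(seed)
    mask = []
    for start in range(0, num_frames, chunk_size):
        keep = 1.0 if rng.random() < keep_probability else 0.0
        # The final chunk may be shorter than chunk_size.
        mask.extend([keep] * min(chunk_size, num_frames - start))
    return mask


# Illustrative values: 12 frames, masked in chunks of 3.
mask = generate_chunk_mask(num_frames=12, chunk_size=3, keep_probability=0.85, seed=0)
```

Multiplying the input features by such a mask zeroes out the masked chunks, and the model is then trained to reconstruct the original features.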
- prepare_samples(samples)¶
for special data prepare carefully: do not change the shape of samples
- class athena.AV_MtlTransformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
In speech recognition, adding CTC loss to an attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.
- SUPPORTED_MODEL¶
- default_config¶
- call(samples, training=None)¶
call function in keras layers
- get_loss(outputs, samples, training=None)¶
get loss used for training
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- restore_from_pretrained_model(pretrained_model, model_type='')¶
A more general-purpose interface for pretrained model restoration
- Parameters
pretrained_model – checkpoint path of mpc model
model_type – the type of pretrained model to restore
- decode(samples, hparams=None, lm_model=None)¶
Initialization of the model for decoding, decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
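The class description says that a CTC loss is added to the attention-based loss. A common way to combine the two is a weighted interpolation, sketched below. The function `multitask_loss` and the parameter name `ctc_weight` are hypothetical, and whether Athena interpolates the losses exactly this way is not stated on this page.

```python
def multitask_loss(ctc_loss, attention_loss, ctc_weight):
    """Interpolate a CTC loss and an attention loss with weight ctc_weight."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Illustrative loss values only.
total = multitask_loss(ctc_loss=4.0, attention_loss=2.0, ctc_weight=0.25)
```

Setting `ctc_weight` to 0 recovers pure attention training, and setting it to 1 recovers pure CTC training.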
- class athena.SpeechConformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training: bool = None)¶
- _forward_encoder_log_ctc(samples, final_layer, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int]¶
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=None) List[int]¶
- freeze_ctc_probs(samples, ctc_final_layer, hparams=None, beam_size=None) List[int]¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]¶
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best list, then the n-best hypotheses are rescored by the attention decoder with the corresponding encoder output.
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
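The two-pass idea behind attention rescoring can be sketched as follows: the first pass supplies n-best hypotheses with CTC scores, the second pass supplies attention-decoder scores, and the best combined score wins. The helper `rescore_nbest`, the `ctc_weight` parameter, and all scores below are illustrative assumptions; the real method obtains its scores from the model.

```python
def rescore_nbest(hypotheses, ctc_scores, attention_scores, ctc_weight=0.5):
    """Pick the hypothesis with the best combined log score."""
    best_index = max(
        range(len(hypotheses)),
        key=lambda i: attention_scores[i] + ctc_weight * ctc_scores[i],
    )
    return hypotheses[best_index]


# Made-up n-best list (token ids) with made-up log scores.
nbest = [[7, 3, 9], [7, 3], [7, 8, 9]]
ctc = [-1.2, -1.0, -2.5]
att = [-0.4, -1.9, -0.6]
best = rescore_nbest(nbest, ctc, att)
```

In this example the CTC pass alone would prefer the second hypothesis, but the attention scores move the first one to the top.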
- freeze_beam_search(samples, beam_size)¶
beam search for freeze only support batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that is used for beam search
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechConformerCTC(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training=None)¶
- _forward_encoder_log_ctc(samples, training: bool = None)¶
- decode(samples, hparams, lm_model=None)¶
Initialization of the model for decoding, decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
- argmax(samples, hparams)¶
argmax for the Conformer CTC model
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
- Returns
predictions: the corresponding decoding results
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int]¶
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=1) List[int]¶
- merge_ctc_sequence(seqs, blank=-1)¶
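The method above has no description, but the standard CTC post-processing step that its name suggests is to merge repeated labels and then drop blanks. The sketch below shows that rule for a single frame-level label sequence; the function `collapse_ctc_path` is hypothetical, and only the default `blank=-1` is taken from the signature.

```python
def collapse_ctc_path(frame_labels, blank=-1):
    """Merge consecutive repeats, then remove blank labels."""
    merged = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            merged.append(label)
        previous = label
    return merged


decoded = collapse_ctc_path([1, 1, -1, 1, 2, 2, -1, -1, 3])
```

The blank between the two runs of `1` is what allows the output to contain the same label twice in a row.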
- freeze_beam_search(samples, beam_size)¶
beam search for freeze only support batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechTransformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(speech, speech_length, training: bool = None)¶
- _forward_encoder_log_ctc(samples, final_layer, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int]¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]¶
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best list, then the n-best hypotheses are rescored by the attention decoder with the corresponding encoder output.
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
- freeze_beam_search(samples, beam_size=1)¶
beam search for freeze only support batch=1
- Parameters
samples – the data source to be decoded
beam_size – beam size
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that is used for beam search
- freeze_ctc_prefix_beam_search(samples, ctc_final_layer, hparams=None, beam_size=1) List[int]¶
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.SpeechTransformerU2(data_descriptions, config=None)¶
Bases:
SpeechU2
U2 implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself
- default_config¶
- class athena.SpeechConformerU2(data_descriptions, config=None)¶
Bases:
SpeechU2
Conformer-U2
- default_config¶
- class athena.MtlTransformerCtc(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
In speech recognition, adding CTC loss to an attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.
- SUPPORTED_MODEL¶
- default_config¶
- call(samples, training=None)¶
call function in keras layers
- get_loss(outputs, samples, training=None)¶
get loss used for training
- compute_logit_length(input_length)¶
compute the logit length
- reset_metrics()¶
reset the metrics
- restore_from_pretrained_model(pretrained_model, model_type='')¶
A more general-purpose interface for pretrained model restoration
- Parameters
pretrained_model – checkpoint path of mpc model
model_type – the type of pretrained model to restore
- _forward_encoder_log_ctc(samples, training: bool = None)¶
- decode(samples, hparams, lm_model=None)¶
Initialization of the model for decoding, decoder is called here to create predictions
- Parameters
samples – the data source to be decoded
hparams – decoding configs are included here
lm_model – lm model
Returns:
predictions: the corresponding decoding results
- enable_tf_funtion()¶
- ctc_forward_chunk_freeze(encoder_out)¶
- encoder_ctc_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)¶
- encoder_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)¶
- get_subsample_rate()¶
- get_init()¶
- encoder_forward_chunk_by_chunk_freeze(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor]¶
Forward the input chunk by chunk, with chunk size decoding_chunk_size, in a streaming fashion.
Here we should pay special attention to the computation cache in the streaming-style, chunk-by-chunk forward pass. Three things should be taken into account for computation in the current network:
transformer/conformer encoder layers output cache
convolution in conformer
convolution in subsampling
However, we don’t implement a subsampling cache, because:
We can make the subsampling module output the right result by overlapping the input instead of caching the left context. This wastes some computation, but subsampling accounts for only a very small fraction of the computation in the whole model.
Typically, the subsampling module contains several convolution layers with subsampling, and caching across convolution layers with different subsampling rates is tricky and complicated.
Currently, nn.Sequential is used to stack all the convolution layers in subsampling; it would need to be rewritten to work with a cache, which is not preferred.
- Parameters
speech (tf.Tensor) – (1, max_len, dim)
decoding_chunk_size (int) – decoding chunk size
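The "overlap the input instead of caching" strategy described above can be illustrated by computing the input frame window for each chunk. In the sketch below, the helper `chunk_windows`, the subsampling rate of 4, and the context of 7 input frames are assumed values for illustration; the real values depend on the subsampling module.

```python
def chunk_windows(num_frames, decoding_chunk_size, subsampling_rate, context):
    """Return (start, end) input-frame windows, one per decoding chunk.

    Consecutive windows overlap so that the subsampling convolutions see
    enough context without any cached state.
    """
    stride = subsampling_rate * decoding_chunk_size
    window = (decoding_chunk_size - 1) * subsampling_rate + context
    windows = []
    for start in range(0, num_frames - context + 1, stride):
        windows.append((start, min(start + window, num_frames)))
    return windows


windows = chunk_windows(
    num_frames=100, decoding_chunk_size=16, subsampling_rate=4, context=7
)
```

With these values each window spans 67 input frames while the start advances by 64, so neighbouring windows share 3 frames. That shared span is the small amount of recomputation the description accepts in exchange for not caching inside the subsampling module.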
- ctc_prefix_beam_search(samples, hparams, decoding_chunk_size, num_decoding_left_chunks) List[int]¶
- class athena.AudioVideoConformer(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Audio and video multimodal Conformer. Model mainly consists of three parts: the a_net for input audio fbank feature preparation, the v_net, the y_net for output preparation and the transformer itself
- default_config¶
- call(samples, training: bool = None)¶
call model
- compute_logit_length(input_length)¶
used to get the logit length
- _forward_encoder(samples, training: bool = None)¶
- ctc_prefix_beam_search(samples, hparams, ctc_final_layer) List[int]¶
- attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]¶
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best list, then the n-best hypotheses are rescored by the attention decoder with the corresponding encoder output.
- Parameters
samples –
hparams – inference_config
ctc_final_layer – encoder final dense layer to output ctc prob.
lm_model –
- Returns
Attention rescoring result
- Return type
List[int]
- beam_search(samples, hparams, lm_model=None)¶
batch beam search for transformer model
- Parameters
samples – the data source to be decoded
beam_size – beam size
lm_model – rnnlm that is used for beam search
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- class athena.VadMarbleNet(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Implementation of a frame-level or segment-level speech classification model.
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.VadDnn(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Implementation of a frame-level or segment-level speech classification model.
- default_config¶
- call(samples, training=None)¶
call model
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.RNNLM(data_descriptions, config=None)¶
Bases:
athena.models.lm.nn_lm.NNLM
Standard implementation of an RNNLM. The model mainly consists of an embedding layer, RNN layers (with dropout), and a fully connected layer, which are all included in self.model_for_rnn
- default_config¶
- forward(inputs, inputs_length=None, training: bool = None)¶
do NN LM forward computation, for both train and decode.
- class athena.TransformerLM(data_descriptions, config=None)¶
Bases:
athena.models.lm.nn_lm.NNLM
Standard implementation of an RNNLM. The model mainly consists of an embedding layer, RNN layers (with dropout), and a fully connected layer, which are all included in self.model_for_rnn
- default_config¶
- forward(inputs, input_lengths, training: bool = None)¶
do NN LM forward computation, for both train and decode.
- class athena.FastSpeech(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
Reference: FastSpeech: Fast, Robust and Controllable Text to Speech (http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)
- default_config¶
- set_teacher_model(teacher_model, teacher_type)¶
set teacher model and initialize duration_calculator before training
- Parameters
teacher_model – the loaded teacher model
teacher_type – the model type, e.g., tacotron2, tts_transformer
- restore_from_pretrained_model(pretrained_model, model_type='')¶
restore from pretrained model
- Parameters
pretrained_model – the loaded pretrained model
model_type – the model type, e.g: tts_transformer
- get_loss(outputs, samples, training=None)¶
get loss used for training
- _feedforward_decoder(encoder_output, duration_indexes, duration_sequences, output_length, training)¶
feed-forward decoder
- Parameters
encoder_output – encoder outputs, shape: [batch, x_steps, d_model]
duration_indexes – argmax weights calculated from duration_calculator. It is used for training only, shape: [batch, y_steps]
duration_sequences – It contains duration information for each phoneme, shape: [batch, x_steps]
output_length – the real output length
training – if it is in the training stage
Returns:
before_outs: the outputs before postnet calculation
after_outs: the outputs after postnet calculation
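The role of duration_sequences, which hold duration information for each phoneme, can be illustrated with a length-regulator style expansion from x_steps phoneme vectors to y_steps frame vectors. The helper `expand_by_duration` below is a hypothetical plain-Python sketch of that idea for one utterance, not the decoder's actual code.

```python
def expand_by_duration(encoder_output, durations):
    """Repeat each phoneme-level vector by its predicted number of frames."""
    frames = []
    for vector, repeat in zip(encoder_output, durations):
        frames.extend([vector] * repeat)
    return frames


# Three phonemes with d_model=2; the second phoneme gets zero frames.
phonemes = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = expand_by_duration(phonemes, durations=[2, 0, 3])
```

The output length equals the sum of the durations, which is how a feed-forward decoder can produce all frames in parallel.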
- call(samples, training: bool = None)¶
call model
- synthesize(samples)¶
- class athena.FastSpeech2(data_descriptions, config=None)¶
Bases:
athena.models.tts.fastspeech.FastSpeech
Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
- default_config¶
- call(samples, training: bool = None)¶
call model
- synthesize(samples)¶
- class athena.Tacotron2(data_descriptions, config=None)¶
Bases:
athena.models.base.BaseModel
An implementation of Tacotron2.
Reference: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- default_config¶
- _pad_and_reshape(outputs, ori_lens, reverse=False)¶
- Parameters
outputs – true labels, shape: [batch, y_steps, feat_dim]
ori_lens – scalar
Returns:
reshaped_outputs: it has to be reshaped to match reduction_factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
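The shape change from [batch, y_steps, feat_dim] to [batch, y_steps / reduction_factor, feat_dim * reduction_factor] amounts to grouping consecutive frames. The sketch below does this for a single utterance in plain Python; the helper `group_frames` and the zero padding used when y_steps is not a multiple of reduction_factor are assumptions.

```python
def group_frames(frames, reduction_factor, pad_value=0.0):
    """[y_steps][feat_dim] -> [ceil(y_steps / r)][feat_dim * r]."""
    feat_dim = len(frames[0])
    padded = list(frames)
    remainder = len(frames) % reduction_factor
    if remainder:
        # Pad with constant frames so the length divides evenly (assumed).
        padded += [[pad_value] * feat_dim] * (reduction_factor - remainder)
    return [
        [v for frame in padded[i:i + reduction_factor] for v in frame]
        for i in range(0, len(padded), reduction_factor)
    ]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grouped = group_frames(frames, reduction_factor=2)
```

Each decoder step then predicts `reduction_factor` frames at once, which shortens the sequence the decoder has to generate.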
- call(samples, training: bool = None)¶
call model
- initialize_input_y(y)¶
- Parameters
y – the true label, shape: [batch, y_steps, feat_dim]
Returns:
y0: zeros will be padded as one step to the start step, [batch, y_steps+1, feat_dim]
- initialize_states(encoder_output, input_length)¶
- Parameters
encoder_output – encoder outputs, shape: [batch, x_step, eunits]
input_length – shape: [batch]
Returns:
prev_rnn_states: initial states of rnns in decoder, [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
- concat_speaker_embedding(encoder_output, speaker_embedding)¶
- Parameters
encoder_output – encoder output (batch, x_steps, eunits)
speaker_embedding – speaker embedding (batch, embedding_dim)
- Returns
the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
- time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)¶
- Parameters
encoder_output – encoder output (batch, x_steps, eunits).
input_length – (batch,)
prev_y – one step of true labels or predicted labels (batch, feat_dim).
prev_rnn_states – previous rnn states [layers, 2, states] for lstm
prev_attn_weight – previous attention weights, shape: [batch, x_steps]
prev_context – previous context vector: [batch, attn_dim]
training – if it is training mode
Returns:
out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
- get_loss(outputs, samples, training=None)¶
get loss
- synthesize(samples)¶
Synthesize acoustic features from the input texts
- Parameters
samples – the data source to be synthesized
Returns:
after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
- _synthesize_post_net(before_outs, logits_stack)¶
- Parameters
before_outs – the outputs before postnet
logits_stack – the logits of all steps
Returns:
after_outs: the corresponding synthesized acoustic features
- class athena.TTSTransformer(data_descriptions, config=None)¶
Bases:
athena.models.tts.tacotron2.Tacotron2
TTS version of SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself.
Reference: Neural Speech Synthesis with Transformer Network
- default_config¶
- call(samples, training: bool = None)¶
- time_propagate(encoder_output, memory_mask, outs, step)¶
Synthesize one step frames
- Parameters
encoder_output – the encoder output, shape: [batch, x_steps, eunits]
memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]
outs (TensorArray) – previous outputs
step – the current step number
Returns:
out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights, each element in the list represents the attention weights of one decoder layer, shape: [batch, num_heads, seq_len_q, seq_len_k]
- synthesize(samples)¶
Synthesize acoustic features from the input texts
- Parameters
samples – the data source to be synthesized
Returns:
after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
- class athena.CnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
CNN model for KWS
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSConformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSConformer. Model mainly consists of three parts: the x_net for input preparation, the conformer itself
- default_config¶
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.CRnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
CRNN model for e2e KWS
- default_config¶
- input_features¶
_, _, w, c = input_features.get_shape().as_list() output_dim = w * c inner = layers.Reshape((-1, output_dim))(input_features) inner = PCENLayer()(inner) inner = layers.Reshape((-1, w, c))(inner)
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.DnnModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Implementation of a frame-level or segment-level speech classification model.
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.MISPModel(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
MISP challenge KWS baseline model for e2e KWS
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSTransformer_2Dense(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of three parts: the x_net for input preparation, the transformer itself
- default_config¶
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.KWSTransformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of three parts: the x_net for input preparation, the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSAVTransformer(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of three parts: the x_net for input preparation, the transformer itself
- default_config¶
- inner¶
v_net
- call(samples, training=None)¶
- build_model(data_descriptions)¶
- class athena.KWSTransformerRESNET(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of three parts: the x_net for input preparation, the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- class athena.KWSTransformer_FocalLoss(data_descriptions, config=None)¶
Bases:
athena.models.kws.base.BaseModel
Standard implementation of a KWSTransformer. Model mainly consists of three parts: the x_net for input preparation, the transformer itself
- default_config¶
- call(samples, training=None)¶
call model
- build_model(data_descriptions)¶
- get_loss(outputs, samples, training=None)¶
get loss
- class athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
tensorflow.keras.Model
Base Training Solver.
- default_config¶
- static initialize_devices(solver_gpus=None)¶
Initialize Horovod (hvd) devices; this should be called first.
- static clip_by_norm(grads, norm)¶
clip norm using tf.clip_by_norm
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- save_checkpointer(checkpointer, devset, epoch)¶
- evaluate_step(samples)¶
evaluate the model 1 step
- evaluate(dataset, epoch)¶
evaluate the model
- class athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
BaseSolver
A multi-processor solver based on Horovod
- static initialize_devices(solver_gpus=None)¶
Initialize Horovod (hvd) devices; this should be called first.
For example, if you have two machines and each of them has 4 GPUs:
1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; the first and the last GPU on machine1 and all GPUs on machine2 will be used.
2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; the first 2 GPUs on machine1 and all GPUs on machine2 will be used.
- Parameters
solver_gpus ([list]) – a list to specify gpus being used.
- Raises
ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.
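The mapping described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper name select_gpu_index is hypothetical, and it assumes Horovod assigns ranks in host order (ranks 0-1 on machine1, ranks 2-5 on machine2) and that an empty solver_gpus falls back to the process's local rank.

```python
def select_gpu_index(solver_gpus, rank, local_rank, world_size):
    """Return the index of the local GPU that process `rank` should use.

    Hypothetical helper: a non-empty `solver_gpus` is indexed by global rank,
    an empty one falls back to the local rank on each machine.
    """
    if solver_gpus:
        if len(solver_gpus) < world_size:
            raise ValueError(
                "solver_gpus must list at least one GPU per Horovod process"
            )
        return solver_gpus[rank]
    return local_rank


# horovodrun -np 6 -H ip1:2,ip2:4 -> assumed (rank, local_rank) pairs
processes = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 2), (5, 3)]

explicit = [select_gpu_index([0, 3, 0, 1, 2, 3], r, lr, 6) for r, lr in processes]
default = [select_gpu_index([], r, lr, 6) for r, lr in processes]

print(explicit)  # [0, 3, 0, 1, 2, 3]: GPUs 0 and 3 on machine1, all four on machine2
print(default)   # [0, 1, 0, 1, 2, 3]: first two GPUs on machine1, all four on machine2
```

A list shorter than the number of processes triggers the ValueError documented under Raises.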
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate(dataset, epoch=0)¶
evaluate the model
- class athena.DecoderSolver(model, data_descriptions=None, config=None)¶
Bases:
BaseSolver
ASR DecoderSolver
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_saved_model(dataset_builder, rank_size=1, conf=None)¶
decode the model
- class athena.AVSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
tensorflow.keras.Model
Base Solver.
- default_config¶
- static initialize_devices(solver_gpus=None)¶
Initialize Horovod (hvd) devices; this should be called first.
- static clip_by_norm(grads, norm)¶
clip norm using tf.clip_by_norm
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate_step(samples)¶
evaluate the model 1 step
- evaluate(dataset, epoch)¶
evaluate the model
- class athena.AVHorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
AVSolver
A multi-processor solver based on Horovod
- static initialize_devices(solver_gpus=None)¶
Initialize Horovod (hvd) devices; this should be called first.
For example, if you have two machines and each of them has 4 GPUs:
1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; the first and the last GPU on machine1 and all GPUs on machine2 will be used.
2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; the first 2 GPUs on machine1 and all GPUs on machine2 will be used.
- Parameters
solver_gpus ([list]) – a list to specify gpus being used.
- Raises
ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.
- train_step(samples)¶
train the model 1 step
- train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)¶
Update the model in 1 epoch
- evaluate(dataset, epoch=0)¶
evaluate the model
- class athena.AVDecoderSolver(model, data_descriptions=None, config=None)¶
Bases:
AVSolver
DecoderSolver
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_freeze(dataset_builder, rank_size=1, conf=None)¶
decode the model
- inference_argmax(dataset_builder, rank_size=1, conf=None)¶
decode the model
- class athena.VadSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, data_descriptions=None, config=None)¶
Bases:
BaseSolver
VadSolver
- default_config¶
- inference(dataset, rank_size=1, conf=None)¶
decode the model
- class athena.SynthesisSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, config=None, **kwargs)¶
Bases:
BaseSolver
SynthesisSolver (TTS Solver)
- default_config¶
- inference(dataset_builder, rank_size=1, conf=None)¶
synthesize using vocoder on dataset
- inference_saved_model(dataset_builder, rank_size=1, conf=None)¶
synthesize using vocoder on dataset
- class athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')¶
Bases:
tensorflow.keras.losses.Loss
CTC loss implemented with TensorFlow
- __call__(logits, samples, logit_length=None)¶
- class athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)¶
Bases:
tensorflow.keras.losses.CategoricalCrossentropy
Seq2SeqSparseCategoricalCrossentropy loss: categorical cross-entropy calculated at each character for each sequence in a batch
- __call__(logits, samples, logit_length=None)¶
- class athena.CTCAccuracy(name='CTCAccuracy')¶
Bases:
CharactorAccuracy
CTCAccuracy: inherits CharactorAccuracy and implements CTC accuracy calculation
- __call__(logits, samples, logit_length=None)¶
Accumulate errors and counts, logit_length is the output length of encoder
- class athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')¶
Bases:
CharactorAccuracy
Seq2SeqSparseCategoricalAccuracy: inherits CharactorAccuracy and implements Attention accuracy calculation
- __call__(logits, samples, logit_length=None)¶
Accumulate errors and counts
- class athena.Checkpoint(checkpoint_directory=None, use_dev_loss=True, model=None, **kwargs)¶
Bases:
tensorflow.train.Checkpoint
A wrapper for TensorFlow checkpoint
- Parameters
checkpoint_directory – the directory for checkpoint
summary_directory – the directory for summary used in Tensorboard
__init__ – provide the optimizer and model
__call__ – save the model
Example
>>> transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
>>> optimizer = tf.keras.optimizers.Adam()
>>> ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
...                   transformer=transformer, optimizer=optimizer)
>>> solver = BaseSolver(transformer)
>>> for epoch in dataset:
...     ckpt()
- _file_compatible(use_dev_loss)¶
Convert n_best file to CSV file
Add “index” and “Accuracy” for no csv n_best file.
- _compare_and_save_best(loss, metrics, save_path, training=False)¶
compare and save the best model with best_loss and N best metrics
- compute_nbest_avg(model_avg_num, sort_by=None, sort_by_time=False, reverse=True)¶
Restore the n-best averaged checkpoint.
If ‘sort_by_time’ is False, the n-best order is sorted by ‘sort_by’. If ‘sort_by_time’ is True, the newest few models are selected. If ‘reverse’ is True, the largest models in the sorted order are selected.
- __call__(loss=None, metrics=None, training=False)¶
- restore_from_best()¶
restore from the best model
- class athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
WarmUp Learning rate schedule for Adam
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule(512),
...                                      beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Idea from the paper: Attention Is All You Need
- __call__(step)¶
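For orientation, the warm-up rule from "Attention Is All You Need" that this schedule cites can be sketched without TensorFlow. The function name warmup_lr is hypothetical, and the sketch covers only the paper's formula with the scale factor k; how decay_steps and decay_rate enter the library's schedule is not described here, so they are left out.

```python
def warmup_lr(step, model_dim=512, warmup_steps=4000, k=1.0):
    """Learning rate of the paper's warm-up rule (illustrative sketch only).

    lr = k * model_dim^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return k * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate grows linearly during warm-up, peaks at warmup_steps,
# then decays with the inverse square root of the step.
for step in (1000, 2000, 4000, 16000):
    print(step, warmup_lr(step))
```

With the defaults, the rate at step 2000 is half the peak reached at step 4000, and the rate at step 16000 has fallen back to half the peak again.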
- class athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
- default_config¶
- class athena.WarmUpLearningSchedule1(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0, lr=None)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
WarmUp Learning rate schedule for Adam that can also be initialized with a learning rate
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUpLearningSchedule(512),
...                                      beta_1=0.9, beta_2=0.98, epsilon=1e-9)
Idea from the paper: Attention Is All You Need
- __call__(step)¶
- class athena.WarmUpAdam1(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
- default_config¶
- class athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)¶
Bases:
tensorflow.keras.optimizers.schedules.LearningRateSchedule
ExponentialDecayLearningRateSchedule
Example
>>> optimizer = tf.keras.optimizers.Adam(learning_rate=ExponentialDecayLearningRateSchedule(0.01, 100))
- Parameters
initial_lr –
decay_steps –
- Returns
initial_lr * (0.5 ** (step // decay_steps))
- __call__(step)¶
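The value given under Returns can be traced with a small stand-alone function. This is a sketch of the documented expression only: the name stepwise_decay_lr is hypothetical, decay_rate is assumed to play the role of the 0.5 in the formula, and start_decay_steps and final_lr from the constructor are not modelled because their handling is not described here.

```python
def stepwise_decay_lr(step, initial_lr=0.005, decay_steps=10000, decay_rate=0.5):
    """Evaluate initial_lr * decay_rate ** (step // decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)


# The rate stays flat within each window of decay_steps steps
# and is multiplied by decay_rate at every window boundary.
for step in (0, 9999, 10000, 25000):
    print(step, stepwise_decay_lr(step))
```

Because of the integer division, the schedule is a staircase rather than a smooth exponential curve.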
- class athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Adam
WarmUpAdam Implementation
- default_config¶
- class athena.HParams(model_structure=None, **kwargs)¶
Bases:
object
Class to hold a set of hyperparameters as name-value pairs.
A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.
You first create a HParams object by specifying the names and values of the hyperparameters.
To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:
```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate  # ==> 0.1
hparams.num_hidden_units  # ==> 100
```
Hyperparameters have a type, which is inferred from the type of the value passed at construction time. The currently supported types are: integer, float, boolean, string, and list of integer, float, boolean, or string.
You can override hyperparameter values by calling the parse() method, passing a string of comma-separated name=value pairs. This is intended to make it possible to override any hyperparameter values from a single command-line flag to which the user passes ‘hyper-param=value’ pairs. It avoids having to define one flag for each hyperparameter.
The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.
Example:
```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse

from athena import HParams

parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = HParams(learning_rate=0.1, num_hidden_units=100,
                      activations=['relu', 'tanh'])

    # Override hyperparameters values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed `--hparams=learning_rate=0.3` on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate  # ==> 0.3
    hparams.num_hidden_units  # ==> 100
    hparams.activations  # ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```
- _HAS_DYNAMIC_ATTRIBUTES = True¶
- add_hparam(name, value)¶
Adds {name, value} pair to hyperparameters.
- Parameters
name – Name of the hyperparameter.
value – Value of the hyperparameter. Can be one of the following types: int, float, string, int list, float list, or string list.
- Raises
ValueError – if one of the arguments is invalid.
- set_hparam(name, value)¶
Set the value of an existing hyperparameter.
This function verifies that the type of the value matches the type of the existing hyperparameter.
- Parameters
name – Name of the hyperparameter.
value – New value of the hyperparameter.
- Raises
KeyError – If the hyperparameter doesn’t exist.
ValueError – If there is a type mismatch.
- del_hparam(name)¶
Removes the hyperparameter with key ‘name’.
Does nothing if it isn’t present.
- Parameters
name – Name of the hyperparameter.
- parse(values, ignore_unknown=False)¶
Override existing hyperparameter values, parsing new values from a string.
See parse_values for more detail on the allowed format for values.
- Parameters
values – String. Comma separated list of name=value pairs where ‘value’ must follow the syntax described above.
- Returns
The HParams instance.
- Raises
ValueError – If values cannot be parsed or a hyperparameter in values doesn’t exist.
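To make the override behaviour concrete, here is a deliberately simplified stand-in for this kind of parsing. It is not the library's parser: parse_overrides is a hypothetical helper that works on a plain dict, handles only scalar int, float, bool and str values (no lists), and infers the target type from the existing default.

```python
def parse_overrides(defaults, values):
    """Return a copy of `defaults` updated from a 'name=value,name=value' string."""
    result = dict(defaults)
    for pair in filter(None, (p.strip() for p in values.split(","))):
        name, sep, raw = pair.partition("=")
        name = name.strip()
        if not sep or name not in result:
            # Mirrors the documented contract: unparsable or unknown names fail.
            raise ValueError(f"cannot parse or unknown hyperparameter: {pair!r}")
        kind = type(result[name])
        if kind is bool:
            result[name] = raw.strip().lower() in ("true", "1")
        else:
            result[name] = kind(raw.strip())  # int, float or str
    return result


defaults = {"learning_rate": 0.1, "num_hidden_units": 100}
overridden = parse_overrides(defaults, "learning_rate=0.3")
print(overridden)  # {'learning_rate': 0.3, 'num_hidden_units': 100}
```

Casting with the type of the existing value is what keeps an override such as num_hidden_units=200 an integer rather than a string.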
- override_from_dict(values_dict)¶
Override existing hyperparameter values, parsing new values from a dictionary.
- Parameters
values_dict – Dictionary of name:value pairs.
- Returns
The HParams instance.
- Raises
KeyError – If a hyperparameter in values_dict doesn’t exist.
ValueError – If values_dict cannot be parsed.
- set_model_structure(model_structure)¶
- get_model_structure()¶
- to_json(indent=None, separators=None, sort_keys=False)¶
Serializes the hyperparameters into JSON.
- Parameters
indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.
separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').
sort_keys – If True, the output dictionaries will be sorted by key.
- Returns
A JSON string.
- parse_json(values_json)¶
Override existing hyperparameter values, parsing new values from a json object.
- Parameters
values_json – String containing a json object of name:value pairs.
- Returns
The HParams instance.
- Raises
KeyError – If a hyperparameter in values_json doesn’t exist.
ValueError – If values_json cannot be parsed.
- values()¶
Return the hyperparameter values as a Python dictionary.
- Returns
A dictionary with hyperparameter names as keys. The values are the hyperparameter values.
- get(key, default=None)¶
Returns the value of key if it exists, else default.
- __contains__(key)¶
- __str__()¶
Return str(self).
- __repr__()¶
Return repr(self).
- static _get_kind_name(param_type, is_list)¶
Returns the field name given parameter type and is_list.
- Parameters
param_type – Data type of the hparam.
is_list – Whether this is a list.
- Returns
A string representation of the field name.
- Raises
ValueError – If parameter type is not recognized.
- instantiate()¶
- append(hp)¶
- athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)¶
register default config and parse
- athena.generate_square_subsequent_mask(size)¶
Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
- athena.generate_square_subsequent_mask_u2(size)¶
Generate a square mask for the sequence. The masked positions are filled with bool(True). Unmasked positions are filled with bool(False).
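The two mask variants can be pictured with a framework-free sketch. The layout is an assumption: the descriptions above only state the fill values, so this sketch uses the conventional causal arrangement in which a position may not attend to any later position (the strict upper triangle is masked). The name subsequent_mask is hypothetical.

```python
def subsequent_mask(size, as_bool=False):
    """Build a size x size causal mask as nested lists.

    Entry [row][col] is masked when col > row, i.e. when `col` lies in the future
    of `row`. Masked entries are 1.0 (or True), unmasked entries 0.0 (or False).
    """
    masked, unmasked = (True, False) if as_bool else (1.0, 0.0)
    return [
        [masked if col > row else unmasked for col in range(size)]
        for row in range(size)
    ]


for row in subsequent_mask(3):
    print(row)
# [0.0, 1.0, 1.0]
# [0.0, 0.0, 1.0]
# [0.0, 0.0, 0.0]
```

Passing as_bool=True yields the same pattern with True and False, matching the fill values described for the _u2 variant.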
- athena.get_wave_file_length(wave_file)¶
get the wave file length(duration) in ms
- Parameters
wave_file – the path of wave file
- Returns
the length(ms) of the wave file
- Return type
wav_length
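One way to obtain such a duration, shown here with Python's standard wave module, is to divide the frame count by the sample rate. This is a sketch of the idea rather than the library's implementation; the function name wave_length_ms is hypothetical and truncating to an integer number of milliseconds is an assumption.

```python
import wave


def wave_length_ms(wave_file):
    """Return the duration of a PCM WAV file in milliseconds (truncated)."""
    with wave.open(wave_file, "rb") as wav:
        # duration in seconds = number of frames / frames per second
        return int(wav.getnframes() * 1000 / wav.getframerate())
```

For example, a file holding 8000 frames sampled at 16000 Hz is reported as 500 ms.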
- athena.set_default_summary_writer(summary_directory=None)¶
- athena.get_dict_from_scp(vocab, func=lambda x: ...)¶
- class athena.CTCPrefixScoreTH(x, xlens, blank, eos, margin=0)¶
Bases:
object
Batch processing of CTCPrefixScore, based on Algorithm 2 in Watanabe et al., “Hybrid CTC/Attention Architecture for End-to-End Speech Recognition,” but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al., “Vectorized Beam Search for CTC-Attention-Based Speech Recognition,” in INTERSPEECH (pp. 3825-3829), 2019.
- __call__(y, state, scoring_ids=None, att_w=None)¶
Compute CTC prefix scores for next labels
- Parameters
y – tensor(shape=[W, L]), prefix label sequences
state (tuple) – previous CTC state: tuple(tensor(shape=[T, 2, W]), tensor(shape=[W, O]), 0, 0)
scoring_ids (torch.Tensor) – scores for pre-selection of hypotheses [Beam, Beam * pre_beam_ratio]
att_w (torch.Tensor) – attention weights to decide CTC window
- Returns
new_state, ctc_local_scores (BW, O)
- index_select_state(state, best_ids)¶
Select CTC states according to best ids
- Parameters
state – CTC state
best_ids – index numbers selected by beam pruning (B, W)
- Returns
selected_state
- athena.__version__ = 2.0¶