athena

module

Subpackages

Submodules

Package Contents

Classes

SpeechDatasetBuilder

SpeechDatasetBuilder

LanguageDatasetBuilder

LanguageDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

AudioVedioRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

AudioVedioRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

MpcSpeechDatasetBuilder

SpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder

MpcSpeechDatasetKaldiIOBuilder

SpeechSynthesisDatasetBuilder

SpeechSynthesisDatasetBuilder

SpeechFastspeech2DatasetBuilder

SpeechSynthesisDatasetBuilder

FeatureNormalizer

Feature Normalizer

FS2FeatureNormalizer

Fastspeech2 Feature Normalizer

VoiceActivityDetectionDatasetKaldiIOBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

SpeechWakeupFramewiseDatasetKaldiIOBuilder

Dataset builder for the CNN model. The builder treats every spliced frame as one image.

SpeechWakeupDatasetKaldiIOBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension

SpeechWakeupDatasetKaldiIOBuilderAVCE

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension

TextFeaturizer

The main text featurizer interface

TextTokenizer

TextTokenizer

PositionalEncoding

positional encoding can be used in transformer

Collapse4D

collapse4d can be used in cnn-lstm for speech processing

TdnnLayer

An implementation of Tdnn Layer

Gelu

Gaussian Error Linear Unit.

MultiHeadAttention

Multi-head attention consists of four parts:

BahdanauAttention

the Bahdanau Attention

HanAttention

Refer to [Hierarchical Attention Networks for Document Classification]

MatchAttention

Refer to [Learning Natural Language Inference with LSTM]

Transformer

A transformer model. User is able to modify the attributes as needed.

TransformerEncoder

TransformerEncoder is a stack of N encoder layers

TransformerDecoder

TransformerDecoder is a stack of N decoder layers

TransformerEncoderLayer

TransformerEncoderLayer is made up of self-attn and feedforward network.

TransformerDecoderLayer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.

ResnetBasicBlock

Basic block of resnet

BaseModel

Base class for model.

MaskedPredictCoding

implementation for MPC pretrain model

AV_MtlTransformer

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to

SpeechConformer

Standard implementation of a SpeechConformer. Model mainly consists of three parts:

SpeechConformerCTC

Standard implementation of a SpeechConformerCTC. Model mainly consists of two parts:

SpeechTransformer

Standard implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechTransformerU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts:

SpeechConformerU2

Conformer-U2

MtlTransformerCtc

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to

AudioVideoConformer

Audio and video multimode Conformer. Model mainly consists of three parts:

VadMarbleNet

implementation of frame-level or segment-level speech classification

VadDnn

implementation of frame-level or segment-level speech classification

RNNLM

Standard implementation of an RNNLM. Model mainly consists of an embedding layer,

TransformerLM

Standard implementation of a Transformer-based LM. Model mainly consists of an embedding layer,

FastSpeech

Reference: Fastspeech: Fast, robust and controllable text to speech

FastSpeech2

Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Tacotron2

An implementation of Tacotron2

TTSTransformer

TTS version of SpeechTransformer. Model mainly consists of three parts:

CnnModel

CNN model for kws

KWSConformer

Standard implementation of a KWSConformer. Model mainly consists of three parts:

CRnnModel

CRNN model for e2e kws

DnnModel

implementation of frame-level or segment-level speech classification

MISPModel

MISP challenge KWS baseline model for e2e kws

KWSTransformer_2Dense

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformer

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSAVTransformer

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformerRESNET

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

KWSTransformer_FocalLoss

Standard implementation of a KWSTransformer. Model mainly consists of three parts:

BaseSolver

Base Training Solver.

HorovodSolver

A multi-process solver based on Horovod

DecoderSolver

ASR DecoderSolver

AVSolver

Base Solver.

AVHorovodSolver

A multi-process solver based on Horovod

AVDecoderSolver

DecoderSolver

VadSolver

VadSolver

SynthesisSolver

SynthesisSolver (TTS Solver)

CTCLoss

CTC LOSS

Seq2SeqSparseCategoricalCrossentropy

Seq2SeqSparseCategoricalCrossentropy LOSS

CTCAccuracy

CTCAccuracy

Seq2SeqSparseCategoricalAccuracy

Seq2SeqSparseCategoricalAccuracy

Checkpoint

A wrapper for the TensorFlow checkpoint

WarmUpLearningSchedule

WarmUp Learning rate schedule for Adam

WarmUpAdam

WarmUpAdam Implementation

WarmUpLearningSchedule1

WarmUp Learning rate schedule for Adam and can initialize a learning rate

WarmUpAdam1

WarmUpAdam Implementation

ExponentialDecayLearningRateSchedule

ExponentialDecayLearningRateSchedule

ExponentialDecayAdam

WarmUpAdam Implementation

HParams

Class to hold a set of hyperparameters as name-value pairs.

CTCPrefixScoreTH

Batch processing of CTCPrefixScore

Functions

make_positional_encoding(position, d_model)

generate a positional encoding list

collapse4d(x[, name])

reshape from [N T D C] -> [N T D*C]

gelu(x)

Gaussian Error Linear Unit.

register_and_parse_hparams(default_config[, config])

register default config and parse

generate_square_subsequent_mask(size)

Generate a square mask for the sequence. The masked positions are filled with float(1.0).

generate_square_subsequent_mask_u2(size)

Generate a square mask for the sequence. The masked positions are filled with bool(True).

get_wave_file_length(wave_file)

get the wave file length (duration) in ms

set_default_summary_writer([summary_directory])

get_dict_from_scp(vocab[, func])

Attributes

__version__

class athena.SpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict
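A minimal usage sketch of the builder interface documented above (the config keys and the csv path are assumptions for illustration; the exact schema lives in the json configs shipped with Athena's examples):

>>> from athena import SpeechDatasetBuilder
>>> config = {"data_csv": "examples/asr/aishell/data/train.csv",                  # assumed key/path
>>>           "audio_config": {"type": "Fbank", "filterbank_channel_count": 40}}  # assumed key
>>> builder = SpeechDatasetBuilder(config)
>>> sample = builder[0]                      # __getitem__ returns the dict shown above
>>> sample["input"].shape, sample["input_length"]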

class athena.LanguageDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

LanguageDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property input_vocab_size

@property

Returns

the input vocab size

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.int32,
    "input_length": tf.int32,
    "output": tf.int32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

load csv file

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_labels,
    "input_length": input_length,
    "output": output_labels,
    "output_length": output_length,
}

Return type

dict

class athena.SpeechRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

class athena.SpeechRecognitionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)

Generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.SpeechRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_data(file_path)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.
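A sketch of the pipeline formed by the methods listed above (the config contents are assumptions, and whether shard/batch_wise_shuffle return a new builder or shuffle in place is also assumed here):

>>> builder = SpeechRecognitionDatasetBatchBinsBuilder(config)    # config contents assumed
>>> builder = builder.shard(num_shards=4, index=0)                # keep 1/4 of the data on this worker
>>> builder = builder.batch_wise_shuffle(batch_size=1, epoch=0)   # batch_size defaults to 1 in batch_bins mode
>>> dataset = builder.as_dataset(batch_size=16, num_threads=1)    # a tf.data.Dataset, per the doc above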

class athena.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)
read_shape_file(file_dir=None)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.AudioVedioRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

image_normalizer(image)
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
}

Return type

dict

class athena.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "video":tf.TensorShape([None, None, high, wide]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
    "utt_id": tf.TensorShape([None]),
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "video": tf.TensorShape([None, None, None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters
  • batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.MpcSpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

MpcSpeechDatasetBuilder: this data builder is an online feature extractor and is used for MPC training

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.MpcSpeechDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder: this data builder is an offline feature data builder and is used for MPC training

default_config
preprocess_data(file_path, apply_sort_filter=True)

generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.SpeechSynthesisDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "speaker": tf.TensorShape([])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)
class athena.SpeechFastspeech2DatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32,
    "duration": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "f0": tf.TensorShape([None]),
    "energy": tf.TensorShape([None]),
    "speaker": tf.TensorShape([]),
    "duration": tf.TensorShape([None])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
load_duration(duration)
preprocess_data(file_path)

generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).

load_audio_feature(audio_feature_file)
__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.FeatureNormalizer(cmvn_file=None)

Feature Normalizer

__call__(feat_data, speaker, reverse=False)
apply_cmvn(feat_data, speaker, reverse=False)

transform original feature to normalized feature

compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)

compute cmvn for filtered entries

compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)

because of memory issues, we use an incremental approximation for the calculation of cmvn

compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)

compute cmvn for filtered entries using kaldi-format data

load_cmvn()

load mean and var

save_cmvn(variable_list)

save cmvn variables determined by variable_list to file

Parameters

variable_list (list) – e.g. [“speaker”, “mean”, “var”]
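A minimal sketch of the CMVN workflow documented above (the cmvn file path, the "global" speaker key and the shape of feat are assumptions; feat stands for a [time, dim, channel] feature tensor):

>>> from athena import FeatureNormalizer
>>> normalizer = FeatureNormalizer(cmvn_file="examples/asr/aishell/data/cmvn")   # assumed path
>>> normalizer.load_cmvn()                                     # load mean and var
>>> normalized = normalizer(feat, "global")                    # __call__ applies cmvn
>>> restored = normalizer(normalized, "global", reverse=True)  # reverse=True undoes it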

class athena.FS2FeatureNormalizer(cmvn_file=None)

Bases: FeatureNormalizer

Fastspeech2 Feature Normalizer

__call__(feat_data, speaker, feature_type='mel', reverse=False)
compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)

compute cmvn of mel-spectrogram, f0 and energy

apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)

transform original feature to normalized feature

load_cmvn()

load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var

class athena.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(data_scps_dir)

generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).

splice_feature(feature)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat
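The splicing described above can be pictured with the following standalone sketch (illustrative only, not the library's implementation): each frame is stacked with its left and right context, repeating the first or last frame at the boundaries.

import numpy as np

def splice_frames(feature, left_context, right_context):
    # feature: [time, dim, 1]; returns [time, left_context + 1 + right_context, dim, 1]
    time_steps = feature.shape[0]
    spliced = []
    for t in range(time_steps):
        context = [feature[min(max(t + k, 0), time_steps - 1)]
                   for k in range(-left_context, right_context + 1)]
        spliced.append(np.stack(context, axis=0))
    return np.stack(spliced, axis=0)

feat = np.random.randn(100, 40, 1).astype("float32")
print(splice_frames(feat, 10, 10).shape)   # (100, 21, 40, 1)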

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt": utt
}

Return type

dict

class athena.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the CNN model. The builder treats every spliced frame as one image, for example (21, 63). The input data format is (batch, timestep, height, width, channel), for example (b, t, 21, 63, 1); unbatching is used to split it, so the output data format is, for example, (b, 21, 63, 1).

property sample_type

example types

property sent_sample_shape
property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.SpeechWakeupDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for the RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context.

input_left_context: the left context frames to be spliced; the first frame is repeated in case the context runs out of range.

input_right_context: the right context frames to be spliced; the last frame is repeated in case the context runs out of range.

Parameters

feature – the input features, shape may be [timestamp, dim, 1]

Returns

the spliced features

Return type

splice_feat

class athena.TextFeaturizer(config=None)

The main text featurizer interface

property model_type

@property

Returns

the model type

property unk_index

@property

Returns

the unk index

Return type

int

supported_model
default_config
load_model(model_file)

load model

delete_punct(tokens)

delete punctuation tokens

__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(sequences, ignored_id=[])
class athena.TextTokenizer(text=None)

TextTokenizer

load_model(text)

load model

save_vocab(vocab_file)
load_csv(csv_file)
__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
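A minimal encode/decode sketch for the featurizer interface documented above (the config keys and the vocab path passed to TextFeaturizer are assumptions for illustration):

>>> from athena import TextFeaturizer
>>> featurizer = TextFeaturizer({"type": "vocab", "model": "examples/asr/aishell/data/vocab"})  # assumed config
>>> ids = featurizer.encode("hello athena")   # sentence -> list of ids, special tokens added
>>> featurizer.decode(ids)                    # list of ids -> sentence
>>> len(featurizer), featurizer.unk_index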
athena.make_positional_encoding(position, d_model)

generate a positional encoding list
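The encoding referred to here is the sinusoidal scheme from "Attention Is All You Need"; the sketch below is illustrative and not necessarily byte-identical to Athena's implementation:

import numpy as np

def sinusoidal_positional_encoding(position, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(position)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    pe = np.zeros((position, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe   # shape [position, d_model]

print(sinusoidal_positional_encoding(800, 512).shape)   # (800, 512)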

athena.collapse4d(x, name=None)

reshape from [N T D C] -> [N T D*C] using tf.shape(x), which generates a tensor, instead of x.shape

athena.gelu(x)

Gaussian Error Linear Unit. This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters

x – float Tensor to perform activation.

Returns

x with the GELU activation applied.
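For reference, the widely used tanh approximation from the cited paper looks as follows (a standalone sketch; athena.gelu itself may use the exact erf form instead):

import numpy as np

def gelu_approx(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))

print(gelu_approx(np.array([-1.0, 0.0, 1.0])))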

class athena.PositionalEncoding(d_model, max_position=800, scale=False)

Bases: tensorflow.keras.layers.Layer

positional encoding can be used in transformer

call(x)

call function

class athena.Collapse4D

Bases: tensorflow.keras.layers.Layer

collapse4d can be used in cnn-lstm for speech processing; it reshapes from [N T D C] -> [N T D*C]

call(x)
class athena.TdnnLayer(context, output_dim, use_bias=False, **kwargs)

Bases: tensorflow.keras.layers.Layer

An implementation of Tdnn Layer

Parameters
  • context – an int of left and right context, or a list of context indexes, e.g. (-2, 0, 2).

  • output_dim – the dim of the linear transform

call(x, training=None, mask=None)
class athena.Gelu

Bases: tensorflow.keras.layers.Layer

Gaussian Error Linear Unit.

This is a smoother version of the RELU. Original paper: https://arxiv.org/abs/1606.08415

Parameters

x – float Tensor to perform activation.

Returns

x with the GELU activation applied.

call(x)
class athena.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Multi-head attention consists of four parts:

  • Linear layers and split into heads.

  • Scaled dot-product attention.

  • Concatenation of heads.

  • Final linear layer.

Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose, and tf.reshape) and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.

split_heads(x, batch_size)

Split the last dimension into (num_heads, depth).

Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)

call(v, k, q, mask)

call function
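The scaled dot-product step described above, as a standalone sketch (illustrative, not Athena's exact code):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    matmul_qk = tf.matmul(q, k, transpose_b=True)        # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += mask * -1e9                     # push masked positions towards -inf
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)
    return tf.matmul(attention_weights, v), attention_weights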

class athena.BahdanauAttention(units, input_dim=1024)

Bases: tensorflow.keras.Model

the Bahdanau Attention

call(query, values)

call function

class athena.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, features)
build(input_shape)

build in keras layer

call(inputs, training=None, mask=None)

call function in keras

compute_output_shape(input_shape)

compute output shape

_masked_softmax(logits, mask, axis)

Compute softmax with input mask.

class athena.MatchAttention(config, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, steps, features)
call(tensors)

Attention layer.

class athena.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, custom_encoder=None, custom_decoder=None, conv_module_kernel_size=0)

Bases: tensorflow.keras.layers.Layer

A transformer model. User is able to modify the attributes as needed.

Parameters
  • d_model – the number of expected features in the encoder/decoder inputs (default=512).

  • nhead – the number of heads in the multiheadattention models (default=8).

  • num_encoder_layers – the number of sub-encoder-layers in the encoder (default=6).

  • num_decoder_layers – the number of sub-decoder-layers in the decoder (default=6).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the encoder/decoder intermediate layer, relu or gelu (default=gelu).

  • custom_encoder – custom encoder (default=None).

  • custom_decoder – custom decoder (default=None).

Examples

>>> transformer_model = Transformer(nhead=16, num_encoder_layers=12)
>>> src = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_model(src, tgt)
call(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, return_encoder_output=False, return_attention_weights=False, training=None)

Take in and process masked source/target sequences.

Parameters
  • src – the sequence to the encoder (required).

  • tgt – the sequence to the decoder (required).

  • src_mask – the additive mask for the src sequence (optional).

  • tgt_mask – the additive mask for the tgt sequence (optional).

  • memory_mask – the additive mask for the encoder output (optional).

  • src_key_padding_mask – the ByteTensor mask for src keys per batch (optional).

  • tgt_key_padding_mask – the ByteTensor mask for tgt keys per batch (optional).

  • memory_key_padding_mask – the ByteTensor mask for memory keys per batch (optional).

Shape:
  • src: \((N, S, E)\).

  • tgt: \((N, T, E)\).

  • src_mask: \((N, S)\).

  • tgt_mask: \((N, T)\).

  • memory_mask: \((N, S)\).

Note: [src/tgt/memory]_mask should be a ByteTensor where True values are positions that should be masked with float(‘-inf’) and False values will be unchanged. This mask ensures that no information will be taken from position i if it is masked, and has a separate mask for each sequence in a batch.

  • output: \((N, T, E)\).

Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.

where S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number

Examples

>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
class athena.TransformerEncoder(encoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerEncoder is a stack of N encoder layers

Parameters
  • encoder_layer – an instance of the TransformerEncoderLayer() class (required).

  • num_layers – the number of sub-encoder-layers in the encoder (required).

  • norm – the layer normalization component (optional).

Examples

>>> encoder_layer = [TransformerEncoderLayer(d_model=512, nhead=8)
>>>                    for _ in range(num_layers)]
>>> transformer_encoder = TransformerEncoder(encoder_layer)
>>> src = tf.random.normal((10, 32, 512))
>>> out = transformer_encoder(src)
call(src, src_mask=None, training=None)

Pass the input through the encoder layers in turn.

Parameters
  • src – the sequence to the encoder (required).

  • mask – the mask for the src sequence (optional).

set_unidirectional(uni=False)

whether to apply triangular masks to make the transformer unidirectional
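An illustrative way to build the triangular (square subsequent) mask mentioned here and in generate_square_subsequent_mask above, with masked positions marked by 1.0 (a sketch, not the library's exact implementation):

import tensorflow as tf

def square_subsequent_mask(size):
    # 1.0 marks a masked (future) position, 0.0 an allowed one
    return 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(square_subsequent_mask(4).numpy())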

class athena.TransformerDecoder(decoder_layers)

Bases: tensorflow.keras.layers.Layer

TransformerDecoder is a stack of N decoder layers

Parameters
  • decoder_layer – an instance of the TransformerDecoderLayer() class (required).

  • num_layers – the number of sub-decoder-layers in the decoder (required).

  • norm – the layer normalization component (optional).

Examples

>>> decoder_layer = [TransformerDecoderLayer(d_model=512, nhead=8)
>>>                     for _ in range(num_layers)]
>>> transformer_decoder = TransformerDecoder(decoder_layer)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = transformer_decoder(tgt, memory)
call(tgt, memory, tgt_mask=None, memory_mask=None, return_attention_weights=False, training=None)

Pass the inputs (and mask) through the decoder layer in turn.

Parameters
  • tgt – the sequence to the decoder (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

class athena.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu', unidirectional=False, look_ahead=0, ffn=None, conv_module_kernel_size=0)

Bases: tensorflow.keras.layers.Layer

TransformerEncoderLayer is made up of self-attn and feedforward network.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multiheadattention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, relu or gelu (default=gelu).

Examples

>>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = tf.random.normal((10, 32, 512))
>>> out = encoder_layer(src)
call(src, src_mask=None, training=None)

Pass the input through the encoder layer.

Parameters
  • src – the sequence to the encoder layer (required).

  • mask – the mask for the src sequence (optional).

set_unidirectional(uni=False)

whether to apply triangular masks to make the transformer unidirectional

class athena.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='gelu')

Bases: tensorflow.keras.layers.Layer

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.

Reference:

“Attention Is All You Need”.

Parameters
  • d_model – the number of expected features in the input (required).

  • nhead – the number of heads in the multiheadattention models (required).

  • dim_feedforward – the dimension of the feedforward network model (default=2048).

  • dropout – the dropout value (default=0.1).

  • activation – the activation function of the intermediate layer, relu or gelu (default=gelu).

Examples

>>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = tf.random.normal((10, 32, 512))
>>> tgt = tf.random.normal((20, 32, 512))
>>> out = decoder_layer(tgt, memory)
call(tgt, memory, tgt_mask=None, memory_mask=None, training=None)

Pass the inputs (and mask) through the decoder layer.

Parameters
  • tgt – the sequence to the decoder layer (required).

  • memory – the sequence from the last layer of the encoder (required).

  • tgt_mask – the mask for the tgt sequence (optional).

  • memory_mask – the mask for the memory sequence (optional).

class athena.ResnetBasicBlock(num_filter, stride=1)

Bases: tensorflow.keras.layers.Layer

Basic block of resnet Reference to paper “Deep residual learning for image recognition”

call(inputs)

call model

make_downsample_layer(num_filter, stride)

perform downsampling using conv layer with stride != 1

class athena.BaseModel(**kwargs)

Bases: tensorflow.keras.Model

Base class for model.

abstract call(samples, training=None)

call model

get_loss(outputs, samples, training=None)

get loss

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

prepare_samples(samples)

prepare special data carefully: do not change the shape of samples

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

decode(samples, hparams, decoder)

decode interface
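A skeletal sketch of what subclassing BaseModel looks like (the layer choice and the get_loss return layout are assumptions beyond the interface documented above):

import tensorflow as tf
from athena import BaseModel

class ToyModel(BaseModel):
    # Illustrative subclass only; real Athena models wire in encoders, losses and metrics.
    def __init__(self, num_classes, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, samples, training=None):
        # samples follows the dataset builders' sample dict: input/input_length/output/output_length
        return self.dense(samples["input"])

    def compute_logit_length(self, input_length):
        return input_length                  # no subsampling in this toy model

    def get_loss(self, outputs, samples, training=None):
        # real models plug CTC / seq2seq losses and metrics in here (return layout assumed)
        return tf.constant(0.0), {}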

class athena.MaskedPredictCoding(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation for MPC pretrain model

Parameters
  • num_filters – an int, i.e. the number of filters in the cnn

  • d_model – an int, i.e. the dimension of the model

  • num_heads – number of heads in the transformer

  • num_encoder_layers – number of layers in the encoder

  • dff – an int, i.e. the dimension of the feed-forward network

  • rate – rate of the dropout layers

  • chunk_size – number of consecutive masks, i.e. 1 or 3

  • keep_probability – probability not to be masked

  • mode – train mode, i.e. MPC: pretrain

  • max_pool_layers – index of max pool layers in encoder, default is -1

default_config
call(samples, training: bool = None)

used for training

Parameters
  • samples – a dict including keys ‘input’, ‘input_length’, ‘output_length’, ‘output’; input: acoustic features, Tensor, shape is (batch, time_len, dim, 1), i.e. f-bank

Return:

MPC outputs to fit acoustic features
    encoder_outputs: Transformer encoder outputs, Tensor, shape is (batch, seqlen, dim)
get_loss(logits, samples, training=None)

get MPC loss

Parameters

logits – MPC output

Return:

MPC L1 loss and metrics
compute_logit_length(samples)

compute the logit length

generate_mpc_mask(input_data)

generate mask for pretraining

Parameters

input_data – acoustic features, i.e. F-bank

Return:

mask tensor
prepare_samples(samples)

prepare special data carefully: do not change the shape of samples

class athena.AV_MtlTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.

SUPPORTED_MODEL
default_config
call(samples, training=None)

call function in keras layers

get_loss(outputs, samples, training=None)

get loss used for training

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

restore_from_pretrained_model(pretrained_model, model_type='')

A more general-purpose interface for pretrained model restoration

Parameters
  • pretrained_model – checkpoint path of mpc model

  • model_type – the type of pretrained model to restore

decode(samples, hparams=None, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
class athena.SpeechConformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechConformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training: bool = None)
_forward_encoder_log_ctc(samples, final_layer, training: bool = None)
freeze_ctc_probs(samples, ctc_final_layer, hparams=None, beam_size=None) List[int]
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechConformerCTC(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechConformerCTC. Model mainly consists of two parts: the x_net for input preparation and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training=None)
_forward_encoder_log_ctc(samples, training: bool = None)
decode(samples, hparams, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
argmax(samples, hparams)

argmax for the Conformer CTC model

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

Returns:

predictions: the corresponding decoding results

merge_ctc_sequence(seqs, blank=-1)

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechTransformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Standard implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(speech, speech_length, training: bool = None)
_forward_encoder_log_ctc(samples, final_layer, training: bool = None)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

beam search for freeze mode; only supports batch=1

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.SpeechTransformerU2(data_descriptions, config=None)

Bases: SpeechU2

U2 implementation of a SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself

default_config
class athena.SpeechConformerU2(data_descriptions, config=None)

Bases: SpeechU2

Conformer-U2

default_config
class athena.MtlTransformerCtc(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

In speech recognition, adding CTC loss to Attention-based seq-to-seq model is known to help convergence. It usually gives better results than using attention alone.

SUPPORTED_MODEL
default_config
call(samples, training=None)

call function in keras layers

get_loss(outputs, samples, training=None)

get loss used for training

compute_logit_length(input_length)

compute the logit length

reset_metrics()

reset the metrics

restore_from_pretrained_model(pretrained_model, model_type='')

A more general-purpose interface for pretrained model restoration

Parameters
  • pretrained_model – checkpoint path of mpc model

  • model_type – the type of pretrained model to restore

_forward_encoder_log_ctc(samples, training: bool = None)
decode(samples, hparams, lm_model=None)

Initialization of the model for decoding, decoder is called here to create predictions

Parameters
  • samples – the data source to be decoded

  • hparams – decoding configs are included here

  • lm_model – lm model

Returns:

predictions: the corresponding decoding results
enable_tf_funtion()
ctc_forward_chunk_freeze(encoder_out)
encoder_ctc_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)
encoder_forward_chunk_freeze(chunk_xs, offset, required_cache_size, subsampling_cache, elayers_output_cache, conformer_cnn_cache)
get_subsample_rate()
get_init()
encoder_forward_chunk_by_chunk_freeze(speech: tensorflow.Tensor, decoding_chunk_size: int, num_decoding_left_chunks: int = -1) Tuple[tensorflow.Tensor, tensorflow.Tensor]
Forward input chunk by chunk with chunk_size, in a streaming fashion.

Here we should pay special attention to computation cache in the streaming style forward chunk by chunk. Three things should be taken into account for computation in the current network:

  1. transformer/conformer encoder layers output cache

  2. convolution in conformer

  3. convolution in subsampling

However, we don’t implement subsampling cache because:
  1. We can control the subsampling module to output the right result by overlapping the input instead of caching left context; even though this wastes some computation, subsampling only takes a very small fraction of the computation in the whole model.

  2. Typically, there are several convolution layers with subsampling in the subsampling module; it is tricky and complicated to handle caches for different convolution layers with different subsampling rates.

  3. Currently, nn.Sequential is used to stack all the convolution layers in subsampling; we would need to rewrite it to make it work with a cache, which is not preferred.

Parameters
  • speech (tf.Tensor) – (1, max_len, dim)

  • chunk_size (int) – decoding chunk size
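A minimal sketch of calling the documented chunk-by-chunk helper (naming the two returned tensors encoder_out and encoder_mask is an assumption; the signature above only promises Tuple[tf.Tensor, tf.Tensor]):

>>> encoder_out, encoder_mask = model.encoder_forward_chunk_by_chunk_freeze(
>>>     speech,                        # tf.Tensor of shape (1, max_len, dim)
>>>     decoding_chunk_size=16,        # chunk size used for streaming-style decoding
>>>     num_decoding_left_chunks=-1)   # defaults to -1, per the signature above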

class athena.AudioVideoConformer(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Audio and video multimodal Conformer. Model mainly consists of four parts: the a_net for input audio fbank feature preparation, the v_net for video feature preparation, the y_net for output preparation, and the conformer itself

default_config
call(samples, training: bool = None)

call model

compute_logit_length(input_length)

used for get logit length

_forward_encoder(samples, training: bool = None)
attention_rescoring(samples, hparams, ctc_final_layer: tensorflow.keras.layers.Dense, lm_model: athena.models.base.BaseModel = None) List[int]
Apply attention rescoring decoding: CTC prefix beam search is applied first to get the n-best hypotheses, then the n-best are rescored on the attention decoder with the corresponding encoder output.

Parameters
  • samples

  • hparams – inference_config

  • ctc_final_layer – encoder final dense layer to output ctc prob.

  • lm_model

Returns

Attention rescoring result

Return type

List[int]

batch beam search for transformer model

Parameters
  • samples – the data source to be decoded

  • beam_size – beam size

  • lm_model – rnnlm that used for beam search

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

class athena.VadMarbleNet(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation of frame-level or segment-level speech classification

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
get_loss(outputs, samples, training=None)

get loss

class athena.VadDnn(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

implementation of frame-level or segment-level speech classification

default_config
call(samples, training=None)

call model

get_loss(outputs, samples, training=None)

get loss

class athena.RNNLM(data_descriptions, config=None)

Bases: athena.models.lm.nn_lm.NNLM

Standard implementation of an RNNLM. Model mainly consists of an embedding layer, RNN layers (with dropout), and the fully connected layer, which are all included in self.model_for_rnn

default_config
forward(inputs, inputs_length=None, training: bool = None)

do NN LM forward computation, for both train and decode.

class athena.TransformerLM(data_descriptions, config=None)

Bases: athena.models.lm.nn_lm.NNLM

Standard implementation of a Transformer-based LM. Model mainly consists of an embedding layer, Transformer layers (with dropout), and the fully connected layer, which are all included in self.model_for_rnn

default_config
forward(inputs, input_lengths, training: bool = None)

do NN LM forward computation, for both train and decode.

class athena.FastSpeech(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

Reference: Fastspeech: Fast, robust and controllable text to speech

(http://papers.nips.cc/paper/8580-fastspeech-fast-robust-and-controllable-text-to-speech.pdf)

default_config
set_teacher_model(teacher_model, teacher_type)

set teacher model and initialize duration_calculator before training

Parameters
  • teacher_model – the loaded teacher model

  • teacher_type – the model type, e.g., tacotron2, tts_transformer

restore_from_pretrained_model(pretrained_model, model_type='')

restore from pretrained model

Parameters
  • pretrained_model – the loaded pretrained model

  • model_type – the model type, e.g: tts_transformer

get_loss(outputs, samples, training=None)

get loss used for training

_feedforward_decoder(encoder_output, duration_indexes, duration_sequences, output_length, training)

feed-forward decoder

Parameters
  • encoder_output – encoder outputs, shape: [batch, x_steps, d_model]

  • duration_indexes – argmax weights calculated from duration_calculator. It is used for training only, shape: [batch, y_steps]

  • duration_sequences – It contains duration information for each phoneme, shape: [batch, x_steps]

  • output_length – the real output length

  • training – if it is in the training stage

Returns:

before_outs: the outputs before postnet calculation
after_outs: the outputs after postnet calculation
call(samples, training: bool = None)

call model

synthesize(samples)
class athena.FastSpeech2(data_descriptions, config=None)

Bases: athena.models.tts.fastspeech.FastSpeech

Reference: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

default_config
call(samples, training: bool = None)

call model

synthesize(samples)
class athena.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2 Reference: NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS

default_config
_pad_and_reshape(outputs, ori_lens, reverse=False)
Parameters
  • outputs – true labels, shape: [batch, y_steps, feat_dim]

  • ori_lens – scalar

Returns:

reshaped_outputs: it has to be reshaped to match reduction_factor
    shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
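Example

An illustrative reshape (a sketch, not the library code) that produces the shapes described above, assuming y_steps is already a multiple of reduction_factor:

>>> import tensorflow as tf
>>> batch, y_steps, feat_dim, reduction_factor = 2, 6, 80, 3
>>> outputs = tf.zeros([batch, y_steps, feat_dim])
>>> reshaped = tf.reshape(
>>>     outputs, [batch, y_steps // reduction_factor, feat_dim * reduction_factor])
>>> reshaped.shape  # ==> (2, 2, 240)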
call(samples, training: bool = None)

call model

initialize_input_y(y)
Parameters

y – the true label, shape: [batch, y_steps, feat_dim]

Returns:

y0: zeros will be padded as one step to the start step,
[batch, y_steps+1, feat_dim]
initialize_states(encoder_output, input_length)
Parameters
  • encoder_output – encoder outputs, shape: [batch, x_step, eunits]

  • input_length – shape: [batch]

Returns:

prev_rnn_states: initial states of rnns in decoder
    [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
concat_speaker_embedding(encoder_output, speaker_embedding)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits)

  • speaker_embedding – speaker embedding (batch, embedding_dim)

Returns

the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
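Example

One straightforward way to obtain the stated output shape (a sketch, not necessarily the library's exact implementation): tile the speaker embedding across x_steps and concatenate on the last axis:

>>> import tensorflow as tf
>>> encoder_output = tf.zeros([2, 5, 8])   # (batch, x_steps, eunits)
>>> speaker_embedding = tf.ones([2, 3])    # (batch, embedding_dim)
>>> tiled = tf.tile(speaker_embedding[:, tf.newaxis, :], [1, encoder_output.shape[1], 1])
>>> tf.concat([encoder_output, tiled], axis=-1).shape  # ==> (2, 5, 11)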

time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits).

  • input_length – (batch,)

  • prev_y – one step of true labels or predicted labels (batch, feat_dim).

  • prev_rnn_states – previous rnn states [layers, 2, states] for lstm

  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]

  • prev_context – previous context vector: [batch, attn_dim]

  • training – if it is training mode

Returns:

out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
get_loss(outputs, samples, training=None)

get loss

synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
_synthesize_post_net(before_outs, logits_stack)
Parameters
  • before_outs – the outputs before postnet

  • logits_stack – the logits of all steps

Returns:

after_outs: the corresponding synthesized acoustic features
class athena.TTSTransformer(data_descriptions, config=None)

Bases: athena.models.tts.tacotron2.Tacotron2

TTS version of SpeechTransformer. Model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network

default_config
call(samples, training: bool = None)
time_propagate(encoder_output, memory_mask, outs, step)

Synthesize one step frames

Parameters
  • encoder_output – the encoder output, shape: [batch, x_steps, eunits]

  • memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]

  • outs (TensorArray) – previous outputs

  • step – the current step number

Returns:

out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights,
    each element in the list represents the attention weights of one decoder layer
    shape: [batch, num_heads, seq_len_q, seq_len_k]
synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
class athena.CnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

CNN model for kws

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSConformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSConformer. Model mainly consists of the x_net for input preparation and the conformer itself

default_config
call(samples, training=None)
build_model(data_descriptions)
class athena.CRnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

CRNN model for e2e kws

default_config
input_features

_, _, w, c = input_features.get_shape().as_list()
output_dim = w * c
inner = layers.Reshape((-1, output_dim))(input_features)
inner = PCENLayer()(inner)
inner = layers.Reshape((-1, w, c))(inner)

call(samples, training=None)

call model

build_model(data_descriptions)
class athena.DnnModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

implementation of a frame-level or segment-level speech classification model

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.MISPModel(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

MISP challenge KWS baseline model for e2e kws

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSTransformer_2Dense(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)
build_model(data_descriptions)
class athena.KWSTransformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSAVTransformer(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
inner

v_net

call(samples, training=None)
build_model(data_descriptions)
class athena.KWSTransformerRESNET(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
class athena.KWSTransformer_FocalLoss(data_descriptions, config=None)

Bases: athena.models.kws.base.BaseModel

Standard implementation of a KWSTransformer. Model mainly consists of the x_net for input preparation and the transformer itself

default_config
call(samples, training=None)

call model

build_model(data_descriptions)
get_loss(outputs, samples, training=None)

get loss

class athena.BaseSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: tensorflow.keras.Model

Base Training Solver.

default_config
static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

static clip_by_norm(grads, norm)

clip norm using tf.clip_by_norm
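Example

A small sketch of the underlying tf.clip_by_norm behavior this helper relies on:

>>> import tensorflow as tf
>>> grad = tf.constant([3.0, 4.0])   # L2 norm 5.0
>>> tf.clip_by_norm(grad, 1.0)       # rescaled to norm 1.0 ==> [0.6, 0.8]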

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

save_checkpointer(checkpointer, devset, epoch)
evaluate_step(samples)

evaluate the model 1 step

evaluate(dataset, epoch)

evaluate the model

class athena.HorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: BaseSolver

A multi-process solver based on Horovod

static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

For example, if you have two machines and each of them contains 4 gpus:

  1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; then the first gpu and the last gpu on machine1 and all gpus on machine2 will be used.

  2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; then the first 2 gpus on machine1 and all gpus on machine2 will be used.

Parameters

solver_gpus ([list]) – a list to specify gpus being used.

Raises

ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate(dataset, epoch=0)

evaluate the model

class athena.DecoderSolver(model, data_descriptions=None, config=None)

Bases: BaseSolver

ASR DecoderSolver

default_config
inference(dataset_builder, rank_size=1, conf=None)

decode the model

inference_saved_model(dataset_builder, rank_size=1, conf=None)

decode the model

class athena.AVSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: tensorflow.keras.Model

Base Solver.

default_config
static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

static clip_by_norm(grads, norm)

clip norm using tf.clip_by_norm

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate_step(samples)

evaluate the model 1 step

evaluate(dataset, epoch)

evaluate the model

class athena.AVHorovodSolver(model, optimizer, sample_signature, eval_sample_signature=None, config=None, **kwargs)

Bases: AVSolver

A multi-process solver based on Horovod

static initialize_devices(solver_gpus=None)

initialize hvd devices; should be called first

For example, if you have two machines and each of them contains 4 gpus:

  1. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to [0,3,0,1,2,3]; then the first gpu and the last gpu on machine1 and all gpus on machine2 will be used.

  2. Run with the command horovodrun -np 6 -H ip1:2,ip2:4 and set solver_gpus to []; then the first 2 gpus on machine1 and all gpus on machine2 will be used.

Parameters

solver_gpus ([list]) – a list to specify gpus being used.

Raises

ValueError – If the list of solver gpus is not empty, its size should not be smaller than that of horovod configuration.

train_step(samples)

train the model 1 step

train(trainset, devset, checkpointer, pbar, epoch, total_batches=-1)

Update the model in 1 epoch

evaluate(dataset, epoch=0)

evaluate the model

class athena.AVDecoderSolver(model, data_descriptions=None, config=None)

Bases: AVSolver

DecoderSolver

default_config
inference(dataset_builder, rank_size=1, conf=None)

decode the model

inference_freeze(dataset_builder, rank_size=1, conf=None)

decode the model

inference_argmax(dataset_builder, rank_size=1, conf=None)

decode the model

class athena.VadSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, data_descriptions=None, config=None)

Bases: BaseSolver

VadSolver

default_config
inference(dataset, rank_size=1, conf=None)

decode the model

class athena.SynthesisSolver(model, optimizer=None, sample_signature=None, eval_sample_signature=None, config=None, **kwargs)

Bases: BaseSolver

SynthesisSolver (TTS Solver)

default_config
inference(dataset_builder, rank_size=1, conf=None)

synthesize using vocoder on dataset

inference_saved_model(dataset_builder, rank_size=1, conf=None)

synthesize using vocoder on dataset

class athena.CTCLoss(logits_time_major=False, blank_index=-1, name='CTCLoss')

Bases: tensorflow.keras.losses.Loss

CTC loss implemented with TensorFlow

__call__(logits, samples, logit_length=None)
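Example

A sketch of the underlying tf.nn.ctc_loss call; the blank_index=-1 and batch-major logits mirror the constructor defaults above, while the exact layout of athena's samples dict is not shown:

>>> import tensorflow as tf
>>> logits = tf.random.normal([2, 50, 30])         # [batch, time, num_classes]
>>> labels = tf.constant([[1, 2, 3], [4, 5, 0]])
>>> loss = tf.nn.ctc_loss(labels, logits,
>>>                       label_length=tf.constant([3, 2]),
>>>                       logit_length=tf.constant([50, 50]),
>>>                       logits_time_major=False, blank_index=-1)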
class athena.Seq2SeqSparseCategoricalCrossentropy(num_classes, eos=-1, by_token=False, by_sequence=True, from_logits=True, label_smoothing=0.0)

Bases: tensorflow.keras.losses.CategoricalCrossentropy

Seq2SeqSparseCategoricalCrossentropy loss: CategoricalCrossentropy calculated at each character for each sequence in a batch

__call__(logits, samples, logit_length=None)
class athena.CTCAccuracy(name='CTCAccuracy')

Bases: CharactorAccuracy

CTCAccuracy Inherits CharactorAccuracy and implements CTC accuracy calculation

__call__(logits, samples, logit_length=None)

Accumulate errors and counts, logit_length is the output length of encoder

class athena.Seq2SeqSparseCategoricalAccuracy(eos, name='Seq2SeqSparseCategoricalAccuracy')

Bases: CharactorAccuracy

Seq2SeqSparseCategoricalAccuracy Inherits CharactorAccuracy and implements Attention accuracy calculation

__call__(logits, samples, logit_length=None)

Accumulate errors and counts

class athena.Checkpoint(checkpoint_directory=None, use_dev_loss=True, model=None, **kwargs)

Bases: tensorflow.train.Checkpoint

A wrapper for Tensorflow checkpoint

Parameters
  • checkpoint_directory – the directory for checkpoint

  • summary_directory – the directory for summary used in Tensorboard

  • __init__ – provide the optimizer and model

  • __call__ – save the model

Example

>>> transformer = SpeechTransformer(target_vocab_size=dataset_builder.target_dim)
>>> optimizer = tf.keras.optimizers.Adam()
>>> ckpt = Checkpoint(checkpoint_directory='./train', summary_directory='./event',
>>>        transformer=transformer, optimizer=optimizer)
>>> solver = BaseSolver(transformer)
>>> for epoch in dataset:
>>>    ckpt()
_file_compatible(use_dev_loss)

Convert the n_best file to a CSV file.

Add "index" and "Accuracy" columns for n_best files that are not in CSV format.

_compare_and_save_best(loss, metrics, save_path, training=False)

compare and save the best model with best_loss and N best metrics

compute_nbest_avg(model_avg_num, sort_by=None, sort_by_time=False, reverse=True)

Restore n-best avg checkpoint,

If 'sort_by_time' is False, the n-best order is sorted by 'sort_by'; if 'sort_by_time' is True, the newest models are selected; if 'reverse' is True, the largest models in the sorted order are selected.

__call__(loss=None, metrics=None, training=False)
restore_from_best()

restore from the best model

class athena.WarmUpLearningSchedule(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

WarmUp Learning rate schedule for Adam

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = WarmUpLearningSchedule(512),
>>>        beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Idea from the paper: Attention Is All You Need
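For reference, a sketch of the warmup rule from that paper; how athena combines it with the k and decay_steps/decay_rate parameters above may differ:

>>> import tensorflow as tf
>>> def noam_lr(step, model_dim=512, warmup_steps=4000, k=1.0):
>>>     step = tf.cast(step, tf.float32)
>>>     # ramps up for the first warmup_steps, then decays as step ** -0.5
>>>     return k * model_dim ** -0.5 * tf.minimum(step ** -0.5, step * warmup_steps ** -1.5)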

__call__(step)
class athena.WarmUpAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.WarmUpLearningSchedule1(model_dim=512, warmup_steps=4000, k=1.0, decay_steps=99999999, decay_rate=1.0, lr=None)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

WarmUp learning rate schedule for Adam that can also be initialized with a given learning rate

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = WarmUpLearningSchedule1(512),
>>>        beta_1=0.9, beta_2=0.98, epsilon=1e-9)

Idea from the paper: Attention Is All You Need

__call__(step)
class athena.WarmUpAdam1(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.ExponentialDecayLearningRateSchedule(initial_lr=0.005, decay_steps=10000, decay_rate=0.5, start_decay_steps=30000, final_lr=1e-05)

Bases: tensorflow.keras.optimizers.schedules.LearningRateSchedule

ExponentialDecayLearningRateSchedule

Example

>>> optimizer = tf.keras.optimizers.Adam(learning_rate = ExponentialDecayLearningRateSchedule(0.01, 100))
Parameters
  • initial_lr

  • decay_steps

Returns

initial_lr * (0.5 ** (step // decay_steps))
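A quick sanity check of that formula with the constructor defaults (a sketch only; the actual schedule also applies start_decay_steps and final_lr, which are ignored here):

>>> initial_lr, decay_steps = 0.005, 10000
>>> [initial_lr * (0.5 ** (step // decay_steps)) for step in (0, 10000, 20000)]
>>> # ==> [0.005, 0.0025, 0.00125]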

__call__(step)
class athena.ExponentialDecayAdam(config=None, beta_1=0.9, beta_2=0.999, epsilon=1e-06, amsgrad=False, name='WarmUpAdam', **kwargs)

Bases: tensorflow.keras.optimizers.Adam

WarmUpAdam Implementation

default_config
class athena.HParams(model_structure=None, **kwargs)

Bases: object

Class to hold a set of hyperparameters as name-value pairs.

A HParams object holds hyperparameters used to build and train a model, such as the number of hidden units in a neural net layer or the learning rate to use when training.

You first create a HParams object by specifying the names and values of the hyperparameters.

To make them easily accessible the parameter names are added as direct attributes of the class. A typical usage is as follows:

```python
# Create a HParams object specifying names and values of the model
# hyperparameters:
hparams = HParams(learning_rate=0.1, num_hidden_units=100)

# The hyperparameters are available as attributes of the HParams object:
hparams.learning_rate  # ==> 0.1
hparams.num_hidden_units  # ==> 100
```

Hyperparameters have type, which is inferred from the type of their value passed at construction time. The currently supported types are: integer, float, boolean, string, and list of integer, float, boolean, or string.

You can override hyperparameter values by calling the [parse()](#HParams.parse) method, passing a string of comma separated name=value pairs. This is intended to make it possible to override any hyperparameter values from a single command-line flag to which the user passes ‘hyper-param=value’ pairs. It avoids having to define one flag for each hyperparameter.

The syntax expected for each value depends on the type of the parameter. See parse() for a description of the syntax.

Example:

```python
# Define a command line flag to pass name=value pairs.
# For example using argparse:
import argparse
parser = argparse.ArgumentParser(description='Train my model.')
parser.add_argument('--hparams', type=str,
                    help='Comma separated list of "name=value" pairs.')
args = parser.parse_args()
...
def my_program():
    # Create a HParams object specifying the names and values of the
    # model hyperparameters:
    hparams = tf.HParams(learning_rate=0.1, num_hidden_units=100,
                         activations=['relu', 'tanh'])

    # Override hyperparameters values by parsing the command line
    hparams.parse(args.hparams)

    # If the user passed --hparams=learning_rate=0.3 on the command line
    # then 'hparams' has the following attributes:
    hparams.learning_rate  # ==> 0.3
    hparams.num_hidden_units  # ==> 100
    hparams.activations  # ==> ['relu', 'tanh']

    # If the hyperparameters are in json format use parse_json:
    hparams.parse_json('{"learning_rate": 0.3, "activations": "relu"}')
```

_HAS_DYNAMIC_ATTRIBUTES = True
add_hparam(name, value)

Adds {name, value} pair to hyperparameters.

Parameters
  • name – Name of the hyperparameter.

  • value – Value of the hyperparameter. Can be one of the following types: int, float, string, int list, float list, or string list.

Raises

ValueError – if one of the arguments is invalid.
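Example

A small sketch using the behavior described above:

>>> hparams = HParams(learning_rate=0.1)
>>> hparams.add_hparam("batch_size", 32)
>>> hparams.batch_size  # ==> 32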

set_hparam(name, value)

Set the value of an existing hyperparameter.

This function verifies that the type of the value matches the type of the existing hyperparameter.

Parameters
  • name – Name of the hyperparameter.

  • value – New value of the hyperparameter.

Raises
  • KeyError – If the hyperparameter doesn’t exist.

  • ValueError – If there is a type mismatch.

del_hparam(name)

Removes the hyperparameter with key ‘name’.

Does nothing if it isn’t present.

Parameters

name – Name of the hyperparameter.

parse(values, ignore_unknown=False)

Override existing hyperparameter values, parsing new values from a string.

See parse_values for more detail on the allowed format for values.

Parameters
  • values – String. Comma separated list of name=value pairs where 'value' must follow the syntax described above.

Returns

The HParams instance.

Raises
  • ValueError – If values cannot be parsed or a hyperparameter in values doesn't exist.
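Example

A short sketch of overriding a value from a comma-separated string, following the class-level example above:

>>> hparams = HParams(learning_rate=0.1, num_hidden_units=100)
>>> hparams.parse("learning_rate=0.3")
>>> hparams.learning_rate  # ==> 0.3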

override_from_dict(values_dict)

Override existing hyperparameter values, parsing new values from a dictionary.

Parameters

values_dict – Dictionary of name:value pairs.

Returns

The HParams instance.

Raises
  • KeyError – If a hyperparameter in values_dict doesn’t exist.

  • ValueError – If values_dict cannot be parsed.

set_model_structure(model_structure)
get_model_structure()
to_json(indent=None, separators=None, sort_keys=False)

Serializes the hyperparameters into JSON.

Parameters
  • indent – If a non-negative integer, JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, or negative, will only insert newlines. None (the default) selects the most compact representation.

  • separators – Optional (item_separator, key_separator) tuple. Default is (', ', ': ').

  • sort_keys – If True, the output dictionaries will be sorted by key.

Returns

A JSON string.

parse_json(values_json)

Override existing hyperparameter values, parsing new values from a json object.

Parameters

values_json – String containing a json object of name:value pairs.

Returns

The HParams instance.

Raises
  • KeyError – If a hyperparameter in values_json doesn’t exist.

  • ValueError – If values_json cannot be parsed.

values()

Return the hyperparameter values as a Python dictionary.

Returns

A dictionary with hyperparameter names as keys. The values are the hyperparameter values.

get(key, default=None)

Returns the value of key if it exists, else default.

__contains__(key)
__str__()

Return str(self).

__repr__()

Return repr(self).

static _get_kind_name(param_type, is_list)

Returns the field name given parameter type and is_list.

Parameters
  • param_type – Data type of the hparam.

  • is_list – Whether this is a list.

Returns

A string representation of the field name.

Raises

ValueError – If parameter type is not recognized.

instantiate()
append(hp)
athena.register_and_parse_hparams(default_config: dict, config=None, **kwargs)

register default config and parse
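A hedged usage sketch, assuming the returned object exposes the merged values as attributes in the same way HParams does; the config keys here are purely illustrative:

>>> default_config = {"num_filters": 512, "dropout_rate": 0.1}
>>> p = athena.register_and_parse_hparams(default_config, config={"dropout_rate": 0.2})
>>> p.dropout_rate  # expected ==> 0.2 (overridden)
>>> p.num_filters   # expected ==> 512 (default kept)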

athena.generate_square_subsequent_mask(size)

Generate a square mask for the sequence. The masked positions are filled with float(1.0). Unmasked positions are filled with float(0.0).
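One way to build such a mask in TensorFlow (a sketch; the library's exact orientation of the mask may differ):

>>> import tensorflow as tf
>>> size = 3
>>> mask = 1.0 - tf.linalg.band_part(tf.ones([size, size]), -1, 0)
>>> # ==> [[0., 1., 1.],
>>> #      [0., 0., 1.],
>>> #      [0., 0., 0.]]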

athena.generate_square_subsequent_mask_u2(size)

Generate a square mask for the sequence. The masked positions are filled with bool(True). Unmasked positions are filled with bool(False).

athena.get_wave_file_length(wave_file)

get the wave file length (duration) in ms

Parameters

wave_file – the path of wave file

Returns

the length (ms) of the wave file

Return type

wav_length
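Example

A sketch of how such a duration is typically computed with the standard wave module (the path is hypothetical, and athena's own implementation may differ):

>>> import wave
>>> with wave.open("example.wav", "rb") as w:
>>>     wav_length = w.getnframes() / w.getframerate() * 1000  # duration in ms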

athena.set_default_summary_writer(summary_directory=None)
athena.get_dict_from_scp(vocab, func=lambda x: ...)
class athena.CTCPrefixScoreTH(x, xlens, blank, eos, margin=0)

Bases: object

Batch processing of CTCPrefixScore

which is based on Algorithm 2 in WATANABE et al. “HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,” but extended to efficiently compute the label probabilities for multiple hypotheses simultaneously. See also Seki et al. “Vectorized Beam Search for CTC-Attention-Based Speech Recognition,” In INTERSPEECH (pp. 3825-3829), 2019.

__call__(y, state, scoring_ids=None, att_w=None)

Compute CTC prefix scores for next labels

Parameters
  • y – tensor(shape=[W, L]), prefix label sequences

  • state (tuple) –

    previous CTC state tuple(

    tensor(shape=[T , 2, W]), tensor(shape=[W, O]), 0, 0

    )

  • scoring_ids (torch.Tensor) – scores for pre-selection of hypotheses [Beam, Beam * pre_beam_ratio]

  • att_w (torch.Tensor) – attention weights to decide CTC window

Returns

new_state, ctc_local_scores (BW, O)

index_select_state(state, best_ids)

Select CTC states according to best ids

Parameters
  • state – CTC state

  • best_ids – index numbers selected by beam pruning (B, W)

Returns

selected_state

athena.__version__ = 2.0