athena.data

Package Contents

Classes

SpeechDatasetBuilder

SpeechDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

AudioVedioRecognitionDatasetBuilder

SpeechRecognitionDatasetBuilder

AudioVedioRecognitionDatasetBatchBinsBuilder

SpeechRecognitionDatasetBatchBinsBuilder

SpeechSynthesisDatasetBuilder

SpeechSynthesisDatasetBuilder

SpeechFastspeech2DatasetBuilder

SpeechSynthesisDatasetBuilder

SpeechSynthesisTestDatasetBuilder

SpeechSynthesisDatasetBuilder

SpeechWakeupFramewiseDatasetKaldiIOBuilder

Dataset builder for CNN model. The builder treats every spliced frame as one image.

SpeechWakeupDatasetKaldiIOBuilder

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.

SpeechWakeupDatasetKaldiIOBuilderAVCE

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.

LanguageDatasetBuilder

LanguageDatasetBuilder

MpcSpeechDatasetBuilder

SpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder

MpcSpeechDatasetKaldiIOBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

FeatureNormalizer

Feature Normalizer

FS2FeatureNormalizer

Fastspeech2 Feature Normalizer

TextFeaturizer

The main text featurizer interface

SentencePieceFeaturizer

SentencePieceFeaturizer using tensorflow-text api

TextTokenizer

TextTokenizer

class athena.data.SpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.data.SpeechRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1
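As a minimal sketch of the rule above, assuming a hypothetical vocabulary that maps tokens to integer ids (the names `vocab` and `num_class_from_vocab` are illustrative and not part of Athena):

```python
def num_class_from_vocab(vocab):
    """Return the largest id plus one, i.e. the number of output classes."""
    return max(vocab.values()) + 1


# Hypothetical vocabulary; the ids do not have to be contiguous.
vocab = {"<unk>": 0, "a": 1, "b": 2, "<eos>": 5}
print(num_class_from_vocab(vocab))  # 6
```

Using the largest id rather than the number of entries keeps the output layer large enough even when some ids are unused.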

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict
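Comparing sample_shape and sample_signature for this class, each signature entry appears to be the per-sample shape with an extra leading batch dimension of None. The sketch below illustrates that relationship with plain tuples instead of TensorFlow objects; the helper `add_batch_dim` and the values 40 and 1 standing in for dim and nc are assumptions made for illustration.

```python
def add_batch_dim(sample_shape):
    """Prepend an unknown (None) batch dimension to every per-sample shape."""
    return {key: (None, *shape) for key, shape in sample_shape.items()}


# Per-sample shapes written as tuples; 40 and 1 stand in for dim and nc.
sample_shape = {
    "input": (None, 40, 1),
    "input_length": (),
    "output_length": (),
    "output": (None,),
    "utt_id": (),
}
batched = add_batch_dim(sample_shape)
print(batched["input"])   # (None, None, 40, 1)
print(batched["utt_id"])  # (None,)
```

Note that `shape=(None)` as written in the listing evaluates to `shape=None` in Python, which TensorFlow treats as an unknown rank rather than as a one-dimensional shape.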

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
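A minimal sketch of building such a list, assuming a tab-separated file whose header names the four columns. The file layout, the `load_entries` helper, the fallback speaker label and the sample rows are assumptions for illustration, not Athena's actual implementation.

```python
import csv
import io


def load_entries(handle):
    """Read tab-separated rows into
    (wav_filename, wav_length_ms, transcript, speaker) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    entries = []
    for row in reader:
        entries.append((
            row["wav_filename"],
            int(row["wav_length_ms"]),
            row["transcript"],
            row.get("speaker") or "global",  # fallback label is an assumption
        ))
    # Sorting by duration is a common choice for batching; assumed here.
    entries.sort(key=lambda entry: entry[1])
    return entries


text = (
    "wav_filename\twav_length_ms\ttranscript\tspeaker\n"
    "b.wav\t2300\thello world\tspk2\n"
    "a.wav\t1200\thi\tspk1\n"
)
entries = load_entries(io.StringIO(text))
print(entries[0])  # ('a.wav', 1200, 'hi', 'spk1')
```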

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

class athena.data.SpeechRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_data(file_path)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset
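A minimal sketch of what keeping 1/num_shards of the data can look like, assuming the same round-robin selection that `tf.data.Dataset.shard` uses (element i is kept when i % num_shards == index). Whether Athena selects entries in exactly this way is not stated here; `shard_entries` is a hypothetical helper.

```python
def shard_entries(entries, num_shards, index):
    """Keep every num_shards-th entry, starting at position `index`."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return entries[index::num_shards]


entries = list(range(10))
print(shard_entries(entries, 3, 0))  # [0, 3, 6, 9]
print(shard_entries(entries, 3, 1))  # [1, 4, 7]
```

Each worker in a distributed job would call this with its own index, so the shards are disjoint and together cover the whole dataset.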

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.
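A minimal sketch of batch-wise shuffling on a plain Python list: the entries are cut into consecutive batches, the order of the batches is shuffled, and the entries inside each batch stay together. Deriving the random state from seed + epoch and leaving the incomplete last batch at the end are assumptions of this sketch, not confirmed details of the Athena method.

```python
import random


def batch_wise_shuffle(entries, batch_size=1, epoch=-1, seed=917):
    """Shuffle the order of whole batches while keeping each batch intact."""
    num_full = len(entries) // batch_size
    order = list(range(num_full))
    # Mixing the epoch into the seed is an assumption of this sketch.
    random.Random(seed + epoch).shuffle(order)
    shuffled = []
    for i in order:
        shuffled.extend(entries[i * batch_size:(i + 1) * batch_size])
    # Entries that do not fill a whole batch stay at the end.
    shuffled.extend(entries[num_full * batch_size:])
    return shuffled


entries = list(range(10))
result = batch_wise_shuffle(entries, batch_size=4, epoch=0)
print(result)
```

Shuffling whole batches keeps utterances of similar length together when the entries were sorted by length beforehand, which limits padding inside each batch.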

class athena.data.SpeechRecognitionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)

Generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.data.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config
preprocess_kaldi_data(file_dir, apply_sort_filter=True)
read_shape_file(file_dir=None)
__getitem__(index)
__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.data.AudioVedioRecognitionDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class

return the max_index of the vocabulary + 1

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config
video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

image_normalizer(image)
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()
__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
}

Return type

dict

class athena.data.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "video":tf.TensorShape([None, None, high, wide]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
    "utt_id": tf.TensorShape([None]),
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "video": tf.TensorShape([None, None, None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

__len__()
as_dataset(batch_size=16, num_threads=1)

return tf.data.Dataset object

shard(num_shards, index)

creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 in batch_bins mode.

class athena.data.SpeechSynthesisDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "speaker": tf.TensorShape([])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)
class athena.data.SpeechFastspeech2DatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property feat_dim

return the number of feature dims

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32,
    "duration": tf.int32
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "f0": tf.TensorShape([None]),
    "energy": tf.TensorShape([None]),
    "speaker": tf.TensorShape([]),
    "duration": tf.TensorShape([None])
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
load_duration(duration)
preprocess_data(file_path)

generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).

load_audio_feature(audio_feature_file)
__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.data.SpeechSynthesisTestDatasetBuilder(data_csv, config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "speaker": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "speaker": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (utt_id, transcript).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.data.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for CNN model. The builder treats every spliced frame as one image, for example of shape (21, 63). The input data format is (batch, timestep, height, width, channel), for example (b, t, 21, 63, 1). Unbatch is used to split the data into individual frames; the output data format is, for example, (b, 21, 63, 1).

property sample_type

example types

property sent_sample_shape
property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context

Parameters
  • feature – the input features, shape may be [timestep, dim, 1]

  • input_left_context – the left features to be spliced; the first frame is repeated when the context is out of range

  • input_right_context – the right features to be spliced; the last frame is repeated when the context is out of range

Returns

splice_feat, the spliced features
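A minimal sketch of context splicing with edge repetition, written for plain Python lists rather than tensors. Concatenating the context frames into one flat vector per time step is an assumption of this sketch; the real method works on features of shape [timestep, dim, 1] and may arrange the context differently.

```python
def splice_feature(feature, input_left_context, input_right_context):
    """Concatenate each frame with its left and right neighbours.

    `feature` is a list of frames, each frame a list of floats.
    Indices outside the utterance are clamped, which repeats the
    first or the last frame.
    """
    num_frames = len(feature)
    spliced = []
    for t in range(num_frames):
        window = []
        for offset in range(-input_left_context, input_right_context + 1):
            src = min(max(t + offset, 0), num_frames - 1)
            window.extend(feature[src])
        spliced.append(window)
    return spliced


frames = [[1.0], [2.0], [3.0]]
print(splice_feature(frames, 1, 1))
# [[1.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 3.0]]
```

The number of frames is unchanged; only the size of each frame grows by a factor of input_left_context + input_right_context + 1.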

class athena.data.SpeechWakeupDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).
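The example sizes suggest that the 1323-dimensional vector is the (21, 63) spliced frame of the frame-wise builder flattened into one dimension, since 21 × 63 = 1323. That reading is an inference from the numbers, and `flatten_spliced_frame` below is a hypothetical helper that only illustrates the arithmetic.

```python
def flatten_spliced_frame(frame):
    """Flatten a (context, dim) spliced frame into a single vector."""
    return [value for row in frame for value in row]


context, dim = 21, 63  # example sizes quoted in the builder descriptions
frame = [[0.0] * dim for _ in range(context)]
flat = flatten_spliced_frame(frame)
print(len(flat))  # 1323
```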

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context

Parameters
  • feature – the input features, shape may be [timestep, dim, 1]

  • input_left_context – the left features to be spliced; the first frame is repeated when the context is out of range

  • input_right_context – the right features to be spliced; the last frame is repeated when the context is out of range

Returns

splice_feat, the spliced features

class athena.data.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type

example types

property sample_shape

examples shapes

property sample_signature

examples signature

default_config
preprocess_data(data_dir='')

loading data

video_scp_loader(scp_dir)

load the video list from an scp file and return a dict

__getitem__(index)
splice_feature(feature, input_left_context, input_right_context)

splice features according to input_left_context and input_right_context

Parameters
  • feature – the input features, shape may be [timestep, dim, 1]

  • input_left_context – the left features to be spliced; the first frame is repeated when the context is out of range

  • input_right_context – the right features to be spliced; the last frame is repeated when the context is out of range

Returns

splice_feat, the spliced features

class athena.data.LanguageDatasetBuilder(config=None)

Bases: athena.data.datasets.base.BaseDatasetBuilder

LanguageDatasetBuilder

property num_class

@property

Returns

the max_index of the vocabulary

Return type

int

property input_vocab_size

@property

Returns

the input vocab size

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.int32,
    "input_length": tf.int32,
    "output": tf.int32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

load csv file

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_labels,
    "input_length": input_length,
    "output": output_labels,
    "output_length": output_length,
}

Return type

dict

class athena.data.MpcSpeechDatasetBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder. This data builder is an online feature extractor and is used for MPC training.

property num_class

@property

Returns

the target dim

Return type

int

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config
preprocess_data(file_path)

generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.data.MpcSpeechDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder. This data builder is an offline feature data builder and is used for MPC training.

default_config
preprocess_data(file_path, apply_sort_filter=True)

generate a list of tuples (feat_key, speaker).

__getitem__(index)
compute_cmvn_if_necessary(is_necessary=True)

compute cmvn file

class athena.data.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

property sample_type

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt": tf.TensorShape([]),
}

Return type

dict

property sample_signature

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config
preprocess_data(data_scps_dir)

generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).

splice_feature(feature)

splice features according to input_left_context and input_right_context:

  • input_left_context – the left features to be spliced; the first frame is repeated when the context is out of range

  • input_right_context – the right features to be spliced; the last frame is repeated when the context is out of range

Parameters

feature – the input features, shape may be [timestep, dim, 1]

Returns

splice_feat, the spliced features

__getitem__(index)

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt": utt
}

Return type

dict

class athena.data.FeatureNormalizer(cmvn_file=None)

Feature Normalizer

__call__(feat_data, speaker, reverse=False)
apply_cmvn(feat_data, speaker, reverse=False)

transform original feature to normalized feature
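A minimal sketch of mean and variance normalization and its inverse, assuming per-dimension statistics and plain Python lists. The real method also takes a speaker argument, presumably to select that speaker's statistics; that lookup is omitted here, and passing `mean` and `var` directly is an assumption of this sketch.

```python
import math


def apply_cmvn(feat_data, mean, var, reverse=False):
    """Normalize frames with per-dimension mean and variance, or undo it."""
    std = [math.sqrt(v) for v in var]
    if reverse:
        # Map normalized features back to the original scale.
        return [[x * s + m for x, m, s in zip(frame, mean, std)]
                for frame in feat_data]
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)]
            for frame in feat_data]


mean, var = [1.0, 10.0], [4.0, 25.0]
feats = [[3.0, 0.0], [1.0, 15.0]]
normed = apply_cmvn(feats, mean, var)
print(normed)  # [[1.0, -2.0], [0.0, 1.0]]
restored = apply_cmvn(normed, mean, var, reverse=True)
print(restored)  # [[3.0, 0.0], [1.0, 15.0]]
```

The reverse direction is what a synthesis model needs when it predicts normalized features and has to return them to the original scale.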

compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)

compute cmvn for filtered entries

compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)

because of memory constraints, cmvn is calculated chunk by chunk with an incremental approximation
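A minimal sketch of one standard way to compute mean and variance without holding all features in memory: keep running sums of the values and of their squares, and combine them at the end. Whether Athena accumulates exactly these quantities is an assumption; `accumulate_cmvn` and the chunk layout are hypothetical.

```python
def accumulate_cmvn(chunks, feature_dim):
    """Compute per-dimension mean and variance from a stream of chunks."""
    count = 0
    total = [0.0] * feature_dim
    total_sq = [0.0] * feature_dim
    for chunk in chunks:  # each chunk is a list of frames
        for frame in chunk:
            count += 1
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
    mean = [s / count for s in total]
    # Var[x] = E[x^2] - (E[x])^2
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    return mean, var


chunks = [[[1.0, 2.0], [3.0, 2.0]], [[5.0, 2.0]]]
mean, var = accumulate_cmvn(chunks, feature_dim=2)
print(mean)  # [3.0, 2.0]
```

Only two vectors of size feature_dim and one counter are kept, regardless of how many frames are processed.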

compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)

compute cmvn for filtered entries using kaldi-format data

load_cmvn()

load mean and var

save_cmvn(variable_list)

save cmvn variables determined by variable_list to file

Parameters

variable_list (list) – e.g. ["speaker", "mean", "var"]

class athena.data.FS2FeatureNormalizer(cmvn_file=None)

Bases: FeatureNormalizer

Fastspeech2 Feature Normalizer

__call__(feat_data, speaker, feature_type='mel', reverse=False)
compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)

compute cmvn of mel-spec, f0 and energy

apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)

transform original feature to normalized feature

load_cmvn()

load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var

class athena.data.TextFeaturizer(config=None)

The main text featurizer interface

property model_type

@property

Returns

the model type

property unk_index

@property

Returns

the unk index

Return type

int

supported_model
default_config
load_model(model_file)

load model

delete_punct(tokens)

delete punctuation tokens

__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence
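A minimal sketch of the encode and decode round trip, using a hypothetical character-level featurizer. `ToyVocabFeaturizer` is not Athena's implementation: it adds no special tokens and has no model loading, and it only mirrors the method names listed for this class.

```python
class ToyVocabFeaturizer:
    """Hypothetical character-level featurizer; not Athena's implementation."""

    def __init__(self, symbols, unk="<unk>"):
        self.itos = [unk] + list(symbols)
        self.stoi = {s: i for i, s in enumerate(self.itos)}
        self.unk_index = 0

    def __len__(self):
        return len(self.itos)

    def encode(self, text):
        """Map each character to its id; unknown characters get unk_index."""
        return [self.stoi.get(ch, self.unk_index) for ch in text]

    def decode(self, ids):
        """Map ids back to characters and join them into a sentence."""
        return "".join(self.itos[i] for i in ids)

    def decode_to_list(self, ids, ignored_id=()):
        """Like decode, but return a list of tokens and skip ignored ids."""
        return [self.itos[i] for i in ids if i not in ignored_id]


featurizer = ToyVocabFeaturizer("abc ")
ids = featurizer.encode("a cab")
print(ids)                     # [1, 4, 3, 1, 2]
print(featurizer.decode(ids))  # a cab
```

The sketch uses an immutable tuple as the default for `ignored_id`, which avoids the shared-state pitfall of a mutable default argument.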

decode_to_list(sequences, ignored_id=[])
class athena.data.SentencePieceFeaturizer(spm_file)

SentencePieceFeaturizer using tensorflow-text api

load_model(model_file)

load sentence piece model

__len__()
encode(sentence)

convert a sentence to a list of ids by sentence piece model

decode(ids)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])
class athena.data.TextTokenizer(text=None)

TextTokenizer

load_model(text)

load model

save_vocab(vocab_file)
load_csv(csv_file)
__len__()
encode(texts)

convert a sentence to a list of ids, with special tokens added.

decode(sequences)

convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])