`athena.data`¶

data

Subpackages¶

athena.data.datasets

Submodules¶

Package Contents¶

Classes¶

`SpeechDatasetBuilder`	SpeechDatasetBuilder
`SpeechRecognitionDatasetBuilder`	SpeechRecognitionDatasetBuilder
`SpeechRecognitionDatasetBatchBinsBuilder`	SpeechRecognitionDatasetBatchBinsBuilder
`SpeechRecognitionDatasetKaldiIOBuilder`	SpeechRecognitionDatasetKaldiIOBuilder
`SpeechRecognitionDatasetBatchBinsKaldiIOBuilder`	SpeechRecognitionDatasetBatchBinsKaldiIOBuilder
`AudioVedioRecognitionDatasetBuilder`	SpeechRecognitionDatasetBuilder
`AudioVedioRecognitionDatasetBatchBinsBuilder`	SpeechRecognitionDatasetBatchBinsBuilder
`SpeechSynthesisDatasetBuilder`	SpeechSynthesisDatasetBuilder
`SpeechFastspeech2DatasetBuilder`	SpeechSynthesisDatasetBuilder
`SpeechSynthesisTestDatasetBuilder`	SpeechSynthesisDatasetBuilder
`SpeechWakeupFramewiseDatasetKaldiIOBuilder`	Dataset builder for CNN model. The builder treats every spliced frame as one image.
`SpeechWakeupDatasetKaldiIOBuilder`	Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.
`SpeechWakeupDatasetKaldiIOBuilderAVCE`	Dataset builder for RNN model. The builder mixes the spliced frames into one dimension.
`LanguageDatasetBuilder`	LanguageDatasetBuilder
`MpcSpeechDatasetBuilder`	SpeechDatasetBuilder
`MpcSpeechDatasetKaldiIOBuilder`	MpcSpeechDatasetKaldiIOBuilder
`VoiceActivityDetectionDatasetKaldiIOBuilder`	VoiceActivityDetectionDatasetKaldiIOBuilder
`FeatureNormalizer`	Feature Normalizer
`FS2FeatureNormalizer`	Fastspeech2 Feature Normalizer
`TextFeaturizer`	The main text featurizer interface
`SentencePieceFeaturizer`	SentencePieceFeaturizer using tensorflow-text api
`TextTokenizer`	TextTokenizer

class athena.data.SpeechDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder

property num_class¶

@property

Returns: the target dim
Return type: int

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict
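Comparing the listings above, each entry of sample_signature looks like the matching sample_shape entry with one extra leading dimension for the batch. The following is a minimal sketch of that relationship using plain Python lists instead of TensorFlow objects; the helper `batched_shapes` and the values 40 and 1 are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional

Shape = List[Optional[int]]


def batched_shapes(sample_shape: Dict[str, Shape]) -> Dict[str, Shape]:
    """Prepend an unknown (None) batch dimension to every per-sample shape."""
    return {key: [None] + list(dims) for key, dims in sample_shape.items()}


# Per-sample shapes in the style of sample_shape. 40 and 1 are made-up
# stand-ins for audio_featurizer.dim and audio_featurizer.num_channels.
sample_shape = {
    "input": [None, 40, 1],
    "input_length": [],
    "output": [None, None],
    "output_length": [],
}

batched = batched_shapes(sample_shape)
print(batched["input"])         # [None, None, 40, 1]
print(batched["input_length"])  # [None]
```

The documented signature leaves the feature and channel dimensions as None as well, which is looser than what this sketch produces.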

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.data.SpeechRecognitionDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class¶: return the max_index of the vocabulary + 1

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).
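As a rough illustration of the tuple list that preprocess_data describes, the sketch below parses a small tab-separated table into (wav_filename, wav_length_ms, transcript, speaker) tuples. The tab-separated layout with a header row and the helper name `read_entries` are assumptions made for this example; they are not taken from the library.

```python
import csv
import io


def read_entries(handle):
    """Read rows into (wav_filename, wav_length_ms, transcript, speaker) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    entries = []
    for row in reader:
        entries.append(
            (
                row["wav_filename"],
                int(row["wav_length_ms"]),  # lengths are kept as integers
                row["transcript"],
                row["speaker"],
            )
        )
    return entries


# Hypothetical file content, held in memory so the sketch is self-contained.
example = io.StringIO(
    "wav_filename\twav_length_ms\ttranscript\tspeaker\n"
    "a.wav\t1200\thello world\tspk1\n"
    "b.wav\t800\tgood morning\tspk2\n"
)
entries = read_entries(example)
print(entries[0])
```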

storage_features_offline()¶

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

class athena.data.SpeechRecognitionDatasetBatchBinsBuilder(config=None)¶

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config¶

preprocess_data(file_path)¶

__getitem__(index)¶

__len__()¶

as_dataset(batch_size=16, num_threads=1)¶: return tf.data.Dataset object

shard(num_shards, index)¶: creates a Dataset that includes only 1/num_shards of this dataset
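One common way to keep only 1/num_shards of a list of entries is a strided slice, sketched below. The function name `shard_entries` is hypothetical, and the library may select entries differently (for example, by dropping a remainder so that all shards have equal size).

```python
def shard_entries(entries, num_shards, index):
    """Keep every num_shards-th entry, starting at position index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return entries[index::num_shards]


all_entries = list(range(10))
shard_1 = shard_entries(all_entries, num_shards=3, index=1)
print(shard_1)  # [1, 4, 7]
```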

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).
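The sketch below shows one plausible reading of batch-wise shuffling: the entries are cut into consecutive batches, the order of the batches is shuffled, and the members of each batch stay together. The function name `shuffle_by_batch` and the way `epoch` is folded into the seed are assumptions for illustration, not the library's documented behaviour.

```python
import random


def shuffle_by_batch(entries, batch_size=1, epoch=-1, seed=917):
    """Shuffle the order of fixed-size batches; members of a batch stay together."""
    batches = [
        entries[i:i + batch_size] for i in range(0, len(entries), batch_size)
    ]
    # Assumption: derive a per-epoch seed so that each epoch gets a different
    # but reproducible order.
    rng = random.Random(seed + epoch)
    rng.shuffle(batches)
    return [entry for batch in batches for entry in batch]


entries = list(range(12))
shuffled = shuffle_by_batch(entries, batch_size=4, epoch=0)
print(shuffled)
```

Keeping batch members together preserves any length-based grouping of the entries while still varying the order in which batches are seen.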

class athena.data.SpeechRecognitionDatasetKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetKaldiIOBuilder

property num_class¶: return the max_index of the vocabulary + 1

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config¶

preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶: Generate a list of tuples (feat_key, speaker).

__getitem__(index)¶

compute_cmvn_if_necessary(is_necessary=True)¶: compute cmvn file

class athena.data.SpeechRecognitionDatasetBatchBinsKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.asr.speech_recognition_kaldiio.SpeechRecognitionDatasetKaldiIOBuilder

SpeechRecognitionDatasetBatchBinsKaldiIOBuilder

property sample_shape_batch_bins¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
}

Return type

dict

default_config¶

preprocess_kaldi_data(file_dir, apply_sort_filter=True)¶

read_shape_file(file_dir=None)¶

__getitem__(index)¶

__len__()¶

as_dataset(batch_size=16, num_threads=1)¶: return tf.data.Dataset object

shard(num_shards, index)¶: creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).

class athena.data.AudioVedioRecognitionDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechRecognitionDatasetBuilder

property num_class¶: return the max_index of the vocabulary + 1

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
}

Return type

dict

default_config¶

video_scp_loader(scp_dir)¶: load the video list from an scp file and return a dict

image_normalizer(image)¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

storage_features_offline()¶

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
}

Return type

dict

class athena.data.AudioVedioRecognitionDatasetBatchBinsBuilder(config=None)¶

Bases: athena.data.datasets.asr.speech_recognition.SpeechRecognitionDatasetBuilder

SpeechRecognitionDatasetBatchBinsBuilder

property sample_shape_batch_bins¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, None, dim, nc]),
    "video":tf.TensorShape([None, None, high, wide]),
    "input_length": tf.TensorShape([None]),
    "output_length": tf.TensorShape([None]),
    "output": tf.TensorShape([None, None]),
    "utt_id": tf.TensorShape([None]),
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "video": tf.TensorShape([None, None, None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt_id": tf.TensorShape([]),
}

Return type

dict

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
    "utt_id": tf.string,
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "video": tf.TensorSpec(shape=(None, None, None, None), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt_id": utt_id
}

Return type

dict

__len__()¶

as_dataset(batch_size=16, num_threads=1)¶: return tf.data.Dataset object

shard(num_shards, index)¶: creates a Dataset that includes only 1/num_shards of this dataset

batch_wise_shuffle(batch_size=1, epoch=-1, seed=917)¶

Batch-wise shuffling of the data entries.

Parameters

batch_size (int, optional) – an integer for the batch size. Defaults to 1 (in batch_bins mode).

class athena.data.SpeechSynthesisDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class¶

@property

Returns: the max_index of the vocabulary
Return type: int

property feat_dim¶: return the number of feature dims

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "speaker": tf.TensorShape([])
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, transcript, speaker).

__getitem__(index)¶

class athena.data.SpeechFastspeech2DatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class¶

@property

Returns: the max_index of the vocabulary
Return type: int

property feat_dim¶: return the number of feature dims

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.float32,
    "speaker": tf.int32,
    "duration": tf.int32
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, feature_dim]),
    "f0": tf.TensorShape([None]),
    "energy": tf.TensorShape([None]),
    "speaker": tf.TensorShape([]),
    "duration": tf.TensorShape([None])
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, feature_dim),
                            dtype=tf.float32),
    "f0": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "energy": tf.TensorSpec(shape=(None, None), dtype=tf.float32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config¶

load_duration(duration)¶

preprocess_data(file_path)¶: generate a list of tuples (audio_feature, wav_length_ms, transcript, duration, speaker).

load_audio_feature(audio_feature_file)¶

__getitem__(index)¶

compute_cmvn_if_necessary(is_necessary=True)¶: compute cmvn file

class athena.data.SpeechSynthesisTestDatasetBuilder(data_csv, config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

SpeechSynthesisDatasetBuilder

property num_class¶

@property

Returns: the max_index of the vocabulary
Return type: int

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "utt_id": tf.string,
    "input": tf.int32,
    "input_length": tf.int32,
    "speaker": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "utt_id": tf.TensorShape([]),
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "speaker": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "utt_id": tf.TensorSpec(shape=(None), dtype=tf.string),
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "speaker": tf.TensorSpec(shape=(None), dtype=tf.int32)
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (utt_id, transcript).

__getitem__(index)¶

compute_cmvn_if_necessary(is_necessary=True)¶: compute cmvn file

class athena.data.SpeechWakeupFramewiseDatasetKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for CNN model. The builder treats every spliced frame as one image, for example (21, 63). The input data format is (batch, timestep, height, width, channel), for example (b, t, 21, 63, 1). unbatch is used to split the data, so the output data format is, for example, (b, 21, 63, 1).
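The shape change described above can be pictured with NumPy: the batch and timestep axes are merged so that every spliced frame becomes its own image. This is only an illustration of the shapes, with made-up sizes (2 utterances, 5 timesteps); in the library the splitting is attributed to unbatch, after which frames are presumably regrouped into new batches.

```python
import numpy as np

# Hypothetical input: 2 utterances, 5 timesteps, each spliced frame is a
# 21 x 63 image with a single channel.
batch = np.zeros((2, 5, 21, 63, 1), dtype=np.float32)

# Merge the batch and timestep axes so every frame is one example.
frames = batch.reshape((-1,) + batch.shape[2:])
print(frames.shape)  # (10, 21, 63, 1)
```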

property sample_type¶: example types

property sent_sample_shape¶

property sample_shape¶: examples shapes

property sample_signature¶: examples signature

default_config¶

preprocess_data(data_dir='')¶: loading data

__getitem__(index)¶

splice_feature(feature, input_left_context, input_right_context)¶

Splice features according to input_left_context and input_right_context.

Parameters

feature – the input features; the shape may be [timestamp, dim, 1]
input_left_context – the left features to be spliced; the first frame is repeated when the context falls out of range
input_right_context – the right features to be spliced; the last frame is repeated when the context falls out of range

Returns

splice_feat: the spliced features
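A minimal NumPy sketch of this kind of context splicing follows. It assumes that the two context arguments are frame counts and that out-of-range positions are filled by repeating the first or last frame, as the description suggests; the function name `splice_frames` and the output layout [time, window, dim] are choices made for the example, not the library's implementation.

```python
import numpy as np


def splice_frames(feature, left_context, right_context):
    """Stack each frame with its neighbours.

    feature: array of shape [time, dim]
    returns: array of shape [time, left_context + 1 + right_context, dim]
    """
    num_frames = feature.shape[0]
    offsets = np.arange(-left_context, right_context + 1)
    # index[t, k] is the source frame for window slot k at time t.
    index = np.arange(num_frames)[:, None] + offsets[None, :]
    # Clamp to the valid range, which repeats the first and last frame.
    index = np.clip(index, 0, num_frames - 1)
    return feature[index]


feature = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 frames, dim 2
spliced = splice_frames(feature, left_context=1, right_context=1)
print(spliced.shape)  # (4, 3, 2)
```

At the first timestep the left neighbour does not exist, so `spliced[0, 0]` equals `feature[0]`; the same holds for the last frame on the right side.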

class athena.data.SpeechWakeupDatasetKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).
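Assuming the 1323 in the example comes from flattening a 21 x 63 spliced window (21 * 63 = 1323), the mixing into one dimension can be pictured as a reshape. The sizes below are illustrative only.

```python
import numpy as np

# Hypothetical spliced features: 5 timesteps, a window of 21 frames, 63 dims.
spliced = np.zeros((5, 21, 63), dtype=np.float32)

# Flatten the window and feature axes into one, then add a channel axis.
flat = spliced.reshape(spliced.shape[0], -1, 1)
print(flat.shape)  # (5, 1323, 1)
```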

property sample_type¶: example types

property sample_shape¶: examples shapes

property sample_signature¶: examples signature

default_config¶

preprocess_data(data_dir='')¶: loading data

__getitem__(index)¶

splice_feature(feature, input_left_context, input_right_context)¶

Splice features according to input_left_context and input_right_context.

Parameters

feature – the input features; the shape may be [timestamp, dim, 1]
input_left_context – the left features to be spliced; the first frame is repeated when the context falls out of range
input_right_context – the right features to be spliced; the last frame is repeated when the context falls out of range

Returns

splice_feat: the spliced features

class athena.data.SpeechWakeupDatasetKaldiIOBuilderAVCE(config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

Dataset builder for RNN model. The builder mixes the spliced frames into one dimension, for example (1, 1323). The input data format is (batch, t, dim, channel), for example (b, t, 1323, 1). The output data format is (batch, timestep).

property sample_type¶: example types

property sample_shape¶: examples shapes

property sample_signature¶: examples signature

default_config¶

preprocess_data(data_dir='')¶: loading data

video_scp_loader(scp_dir)¶: load the video list from an scp file and return a dict

__getitem__(index)¶

splice_feature(feature, input_left_context, input_right_context)¶

Splice features according to input_left_context and input_right_context.

Parameters

feature – the input features; the shape may be [timestamp, dim, 1]
input_left_context – the left features to be spliced; the first frame is repeated when the context falls out of range
input_right_context – the right features to be spliced; the last frame is repeated when the context falls out of range

Returns

splice_feat: the spliced features

class athena.data.LanguageDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.BaseDatasetBuilder

LanguageDatasetBuilder

property num_class¶

@property

Returns: the max_index of the vocabulary
Return type: int

property input_vocab_size¶

@property

Returns: the input vocab size
Return type: int

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.int32,
    "input_length": tf.int32,
    "output": tf.int32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None]),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: load csv file

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_labels,
    "input_length": input_length,
    "output": output_labels,
    "output_length": output_length,
}

Return type

dict

class athena.data.MpcSpeechDatasetBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

SpeechDatasetBuilder. This data builder is an online feature extractor and is used for MPC training.

property num_class¶

@property

Returns: the target dim
Return type: int

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output": tf.float32,
    "output_length": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape(
        [None, self.audio_featurizer.dim, self.audio_featurizer.num_channels]
    ),
    "input_length": tf.TensorShape([]),
    "output": tf.TensorShape([None, None]),
    "output_length": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(
        shape=(None, None, None, None), dtype=tf.float32
    ),
    "input_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None, None), dtype=tf.float32),
    "output_length": tf.TensorSpec(shape=([None]), dtype=tf.int32),
}

Return type

dict

default_config¶

preprocess_data(file_path)¶: generate a list of tuples (wav_filename, wav_length_ms, speaker).

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": input_data,
    "input_length": input_data.shape[0],
    "output": output_data,
    "output_length": output_data.shape[0],
}

Return type

dict

class athena.data.MpcSpeechDatasetKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.mpc.mpc_speech_set.MpcSpeechDatasetBuilder

MpcSpeechDatasetKaldiIOBuilder. This data builder is an offline feature data builder and is used for MPC training.

default_config¶

preprocess_data(file_path, apply_sort_filter=True)¶: generate a list of tuples (feat_key, speaker).

__getitem__(index)¶

compute_cmvn_if_necessary(is_necessary=True)¶: compute cmvn file

class athena.data.VoiceActivityDetectionDatasetKaldiIOBuilder(config=None)¶

Bases: athena.data.datasets.base.SpeechBaseDatasetBuilder

VoiceActivityDetectionDatasetKaldiIOBuilder

property sample_type¶

@property

Returns

sample_type of the dataset:

{
    "input": tf.float32,
    "input_length": tf.int32,
    "output_length": tf.int32,
    "output": tf.int32,
}

Return type

dict

property sample_shape¶

@property

Returns

sample_shape of the dataset:

{
    "input": tf.TensorShape([None, dim, nc]),
    "input_length": tf.TensorShape([]),
    "output_length": tf.TensorShape([]),
    "output": tf.TensorShape([None]),
    "utt": tf.TensorShape([]),
}

Return type

dict

property sample_signature¶

@property

Returns

sample_signature of the dataset:

{
    "input": tf.TensorSpec(shape=(None, None, dim, nc), dtype=tf.float32),
    "input_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output_length": tf.TensorSpec(shape=(None), dtype=tf.int32),
    "output": tf.TensorSpec(shape=(None, None), dtype=tf.int32),
    "utt": tf.TensorSpec(shape=(None), dtype=tf.string),
}

Return type

dict

default_config¶

preprocess_data(data_scps_dir)¶: generate a list of tuples (wav_filename, wav_offset, wav_length_ms, transcript, label).

splice_feature(feature)¶

Splice features according to input_left_context and input_right_context.

input_left_context: the left features to be spliced; the first frame is repeated when the context falls out of range.

input_right_context: the right features to be spliced; the last frame is repeated when the context falls out of range.

Parameters

feature – the input features; the shape may be [timestamp, dim, 1]

Returns

splice_feat: the spliced features

__getitem__(index)¶

get a sample

Parameters

index (int) – index of the entries

Returns

sample:

{
    "input": feat,
    "input_length": feat_length,
    "output_length": label_length,
    "output": label,
    "utt": utt
}

Return type

dict

class athena.data.FeatureNormalizer(cmvn_file=None)¶

Feature Normalizer

__call__(feat_data, speaker, reverse=False)¶

apply_cmvn(feat_data, speaker, reverse=False)¶: transform original feature to normalized feature
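Cepstral mean and variance normalization (CMVN) in its usual form subtracts a mean and divides by a standard deviation, and the reverse direction undoes that. The sketch below shows this with NumPy; the function name `normalize`, the explicit `mean` and `var` arguments (in place of a per-speaker lookup), and the small `eps` term are assumptions made for the example.

```python
import numpy as np


def normalize(feat, mean, var, reverse=False, eps=1e-9):
    """Apply CMVN, or undo it when reverse is True."""
    std = np.sqrt(var + eps)  # eps guards against zero variance (assumption)
    if reverse:
        return feat * std + mean
    return (feat - mean) / std


feat = np.array([[1.0, 2.0], [3.0, 6.0]])
mean = feat.mean(axis=0)
var = feat.var(axis=0)

normed = normalize(feat, mean, var)
restored = normalize(normed, mean, var, reverse=True)
print(np.allclose(restored, feat))
```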

compute_cmvn(entries, speakers, featurizer, feature_dim, num_cmvn_workers=1)¶: compute cmvn for filtered entries

compute_cmvn_by_chunk_for_all_speaker(feature_dim, speakers, featurizer, entries)¶: because of memory constraints, an incremental approximation is used to calculate the cmvn
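One way to compute mean and variance without holding all features in memory is to accumulate running sums chunk by chunk, as sketched below. This is a generic technique shown under the assumption that features arrive as [frames, dim] arrays; it is not necessarily the approximation the library uses, and `cmvn_by_chunk` is a hypothetical name.

```python
import numpy as np


def cmvn_by_chunk(chunks):
    """Accumulate per-dimension mean and variance over an iterable of chunks."""
    total = None
    total_sq = None
    count = 0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if total is None:
            total = np.zeros(chunk.shape[1])
            total_sq = np.zeros(chunk.shape[1])
        total += chunk.sum(axis=0)
        total_sq += (chunk ** 2).sum(axis=0)
        count += chunk.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2  # E[x^2] - (E[x])^2
    return mean, var


frames = np.arange(24, dtype=np.float64).reshape(8, 3)
mean, var = cmvn_by_chunk([frames[:3], frames[3:]])
print(mean, var)
```

Only the two sum vectors and a counter are kept between chunks, so memory use does not grow with the number of frames.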

compute_cmvn_kaldiio(entries, speakers, kaldi_io_feats, feature_dim)¶: compute cmvn for filtered entries using kaldi-format data

load_cmvn()¶: load mean and var

save_cmvn(variable_list)¶

save cmvn variables determined by variable_list to file

Parameters: variable_list (list) – e.g. ["speaker", "mean", "var"]

class athena.data.FS2FeatureNormalizer(cmvn_file=None)¶

Bases: FeatureNormalizer

Fastspeech2 Feature Normalizer

__call__(feat_data, speaker, feature_type='mel', reverse=False)¶

compute_fs2_cmvn(entries, speakers, num_cmvn_workers=1)¶: compute cmvn of mel-spec, f0 and energy

apply_cmvn(feat_data, speaker, feature_type='mel', reverse=False)¶: transform original feature to normalized feature

load_cmvn()¶: load mel_mean, mel_var, f0_mean, f0_var and energy_mean, energy_var

class athena.data.TextFeaturizer(config=None)¶

The main text featurizer interface

property model_type¶

@property

Returns: the model type

property unk_index¶

@property

Returns: the unk index
Return type: int

supported_model¶

default_config¶

load_model(model_file)¶: load model

delete_punct(tokens)¶: delete punctuation tokens

__len__()¶

encode(texts)¶: convert a sentence to a list of ids, with special tokens added.

decode(sequences)¶: convert a list of ids to a sentence

decode_to_list(sequences, ignored_id=[])¶
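To make the roles of encode, decode, decode_to_list, unk_index and __len__ concrete, here is a toy character-level vocabulary with the same method names. The class `ToyVocab` is entirely hypothetical: it does not add special tokens and is not how TextFeaturizer is implemented.

```python
class ToyVocab:
    """Character-level vocabulary with an unknown-token fallback."""

    def __init__(self, tokens, unk="<unk>"):
        self.itos = [unk] + list(tokens)  # id -> token
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # token -> id
        self.unk_index = 0

    def __len__(self):
        return len(self.itos)

    def encode(self, text):
        # Unknown characters map to unk_index.
        return [self.stoi.get(ch, self.unk_index) for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

    def decode_to_list(self, ids, ignored_id=()):
        # Drop ids listed in ignored_id, keep the rest as separate tokens.
        return [self.itos[i] for i in ids if i not in ignored_id]


vocab = ToyVocab("abc")
ids = vocab.encode("abz")
print(ids)  # [1, 2, 0]
```

The sketch uses an empty tuple as the default for `ignored_id` to avoid a shared mutable default argument.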

class athena.data.SentencePieceFeaturizer(spm_file)¶

SentencePieceFeaturizer using tensorflow-text api

load_model(model_file)¶: load sentence piece model

__len__()¶

encode(sentence)¶: convert a sentence to a list of ids by sentence piece model

decode(ids)¶: convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])¶

class athena.data.TextTokenizer(text=None)¶

TextTokenizer

load_model(text)¶: load model

save_vocab(vocab_file)¶

load_csv(csv_file)¶

__len__()¶

encode(texts)¶: convert a sentence to a list of ids, with special tokens added.

decode(sequences)¶: convert a list of ids to a sentence

decode_to_list(ids, ignored_id=[])¶
