athena.transform.feats

Package Contents

Classes

ReadWav

Read audio sample from wav file, return sample data and sample rate. The operation is based on tensorflow.audio.decode_wav.

Spectrum

Compute spectrum features of every frame in speech.

MelSpectrum

Computing filter banks is applying triangular filters on a Mel-scale to the magnitude spectrum to extract frequency bands, based on the MelSpectrum of Librosa.

Framepow

Compute power of every frame in speech.

Pitch

Compute pitch features of every frame in speech.

Mfcc

Compute MFCC features of every frame in speech.

WriteWav

Encode audio data (input) using sample rate (input), return a write wav operation. The operation is based on tensorflow.audio.encode_wav.

Fbank

Computing filter banks is applying triangular filters on a Mel-scale to the power spectrum to extract frequency bands.

CMVN

Do CMVN on features.

FbankPitch

Compute Fbank and Pitch features respectively, and concatenate them.

Add_rir_noise_aecres

Add a random signal-to-noise ratio noise or impulse response to clean speech.

Functions

compute_cmvn(audio_feature[, mean, variance, local_cmvn])

class athena.transform.feats.ReadWav(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Read audio sample from wav file, return sample data and sample rate. The operation is based on tensorflow.audio.decode_wav.

Parameters

config – a dictionary containing optional parameters of ReadWav.

Examples

>>> config = {'audio_channels': 1}
>>> read_wav_op = ReadWav.params(config).instantiate()
>>> audio_data, sample_rate = read_wav_op('test.wav')

Note: The range of audio data is -32768 to 32767 (for 16 bits), not -1 to 1.
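
For example, if a downstream component expects waveforms in the [-1, 1] range, the samples can be rescaled by the int16 full scale. This is a minimal sketch building on the example above; the rescaling step is our addition, not part of ReadWav itself.

>>> audio_data, sample_rate = read_wav_op('test.wav')
>>> normalized = audio_data / 32768.0  # scale int16-range samples to [-1, 1]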

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following two optional parameters

  • 'type' – ‘ReadWav’.

  • 'audio_channels' – index of the desired channel. (default=1)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(wavfile, speed=1.0)

Get audio data and sample rate from a wavfile.

Parameters
  • wavfile – filepath of wav.

  • speed – The desired speed factor for the samples. (default=1.0)

Shape:

Note: Return audio data and sample rate.

  • audio_data: \((L)\) with tf.float32 dtype

  • sample_rate: tf.int32

class athena.transform.feats.Spectrum(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute spectrum features of every frame in speech.

Parameters

config – contains ten optional parameters.

Shape:
  • output: \((T, F)\).

Examples

>>> config = {'window_length': 0.25, 'window_type': 'hann'}
>>> spectrum_op = Spectrum.params(config).instantiate()
>>> spectrum_out = spectrum_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following ten optional parameters:

  • 'window_length' – Window length in seconds. (float, default = 0.025),

  • 'frame_length' – Hop length in seconds. (float, default = 0.010),

  • 'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the signal. (int, default = 1),

  • 'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97),

  • 'window_type' – Type of window (“hamm”|”hann”|”povey”|”rect”|”blac”|”tria”). (string, default = “povey”)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)

  • 'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = false)

  • 'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 2)

  • 'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

  • 'dither' – Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
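
For instance, to obtain a magnitude spectrum instead of the default log-power spectrum, 'output_type' can be set to 3 as documented above. A hedged sketch (the file name is a placeholder):

>>> config = {'output_type': 3}
>>> spectrum_op = Spectrum.params(config).instantiate()
>>> magnitude_out = spectrum_op('test.wav', 16000)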

call(audio_data, sample_rate=None)

Calculate the power spectrum or log-power spectrum of audio data.

Parameters
  • audio_data – the audio signal from which to compute spectrum.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
class athena.transform.feats.MelSpectrum(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Computing filter banks is applying triangular filters on a Mel-scale to the magnitude spectrum to extract frequency bands, which is based on the MelSpectrum of Librosa.

Parameters

config – contains twelve optional parameters.

Shape:
  • output: \((T, F)\).

Examples

>>> config = {'output_type': 3, 'filterbank_channel_count': 23}
>>> melspectrum_op = MelSpectrum.params(config).instantiate()
>>> melspectrum_out = melspectrum_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following twelve optional parameters:

  • 'window_length' – Window length in seconds. (float, default = 0.025),

  • 'frame_length' – Hop length in seconds. (float, default = 0.010),

  • 'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the signal. (int, default = 1),

  • 'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.0),

  • 'window_type' – Type of window (“hamm”|”hann”|”povey”|”rect”|”blac”|”tria”). (string, default = “hann”)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = False)

  • 'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)

  • 'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)

  • 'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

  • 'lower_frequency_limit' – Low cutoff frequency for mel bins (float, default = 60)

  • 'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 40)

  • 'dither' – Dithering constant (0.0 means no dither). (float, default = 0.0) [adds robustness to training]

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_data, sample_rate)

Calculate the log Mel spectrum of audio data.

Parameters
  • audio_data – the audio signal from which to compute spectrum.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
num_channels()
class athena.transform.feats.Framepow(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute power of every frame in speech.

Parameters

config – contains four optional parameters.

Shape:
  • output: \((T, 1)\).

Examples

>>> config = {'window_length': 0.25, 'remove_dc_offset': True}
>>> framepow_op = Framepow.params(config).instantiate()
>>> framepow_out = framepow_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following four optional parameters:

  • 'window_length' – Window length in seconds. (float, default = 0.025)

  • 'frame_length' – Hop length in seconds. (float, default = 0.010)

  • 'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the signal. (int, default = 1)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_data, sample_rate)

Calculate the power of every frame in speech.

Parameters
  • audio_data – the audio signal from which to compute spectrum.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
class athena.transform.feats.Pitch(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute pitch features of every frame in speech.

Parameters

config – contains nineteen optional parameters.

Shape:
  • output: \((T, 2)\).

Examples

>>> config = {'resample_frequency': 4000, 'lowpass_cutoff': 1000}
>>> pitch_op = Pitch.params(config).instantiate()
>>> pitch_out = pitch_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following nineteen optional parameters:

  • 'delta_pitch' – Smallest relative change in pitch that our algorithm measures (float, default = 0.005)

  • 'window_length' – Frame length in seconds (float, default = 0.025)

  • 'frame_length' – Frame shift in seconds (float, default = 0.010)

  • 'frames-per-chunk' – Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats), you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm) (int, default = 0)

  • 'lowpass-cutoff' – cutoff frequency for LowPass filter (Hz). (float, default = 1000)

  • 'lowpass-filter-width' – Integer that determines filter width of lowpass filter, more gives sharper filter (int, default = 1)

  • 'max-f0' – max. F0 to search for (Hz) (float, default = 400)

  • 'max-frames-latency' – Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)

  • 'min-f0' – min. F0 to search for (Hz) (float, default = 50)

  • 'nccf-ballast' – Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)

  • 'nccf-ballast-online' – This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)

  • 'penalty-factor' – cost factor for F0 change. (float, default = 0.1)

  • 'preemphasis-coefficient' – Coefficient for use in signal preemphasis (deprecated). (float, default = 0)

  • 'recompute-frame' – Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)

  • 'resample-frequency' – Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff (float, default = 4000)

  • 'simulate-first-pass-online' – If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)

  • 'snip-edges' – If this is set to false, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)

  • 'soft-min-f0' – Minimum f0, applied in soft way, must not exceed min-f0. (float, default = 10)

  • 'upsample-filter-width' – Integer that determines filter width when upsampling NCCF. (int, default = 5)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
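
As an illustration, the F0 search range can be narrowed with the limits listed above. The dictionary key names below follow the underscore convention of the class example and are an assumption, not confirmed by the source:

>>> config = {'min_f0': 60, 'max_f0': 350}
>>> pitch_op = Pitch.params(config).instantiate()
>>> pitch_out = pitch_op('test.wav', 16000)  # output shape (T, 2): pitch and POV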

call(audio_data, sample_rate)

Calculate pitch and POV features of audio data.

Parameters
  • audio_data – the audio signal from which to compute pitch features.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
class athena.transform.feats.Mfcc(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute MFCC features of every frame in speech.

Parameters

config – contains fifteen optional parameters.

Shape:
  • output: \((C, T, F)\).

Examples

>>> config = {'cepstral_lifter': 22.0, 'coefficient_count': 13}
>>> mfcc_op = Mfcc.params(config).instantiate()
>>> mfcc_out = mfcc_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following fifteen optional parameters:

  • 'window_length' – Window length in seconds. (float, default = 0.025),

  • 'frame_length' – Hop length in seconds. (float, default = 0.010),

  • 'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the signal. (int, default = 1),

  • 'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97),

  • 'window_type' – Type of window (“hamm”|”hann”|”povey”|”rect”|”blac”|”tria”). (string, default = “povey”)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)

  • 'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)

  • 'coefficient_count' – Number of cepstra in MFCC computation. (int, default = 13)

  • 'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)

  • 'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

  • 'lower_frequency_limit' – Low cutoff frequency for mel bins. (float, default = 20)

  • 'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)

  • 'dither' – Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]

  • 'cepstral_lifter' – Constant that controls scaling of MFCCs. (float, default = 22)

  • 'use_energy' – Use energy (not C0) in MFCC computation. (bool, default = True)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
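
For example, to use C0 rather than the frame energy, 'use_energy' can be switched off. A hedged sketch using only parameters documented above (the file name is a placeholder):

>>> config = {'coefficient_count': 13, 'use_energy': False}
>>> mfcc_op = Mfcc.params(config).instantiate()
>>> mfcc_out = mfcc_op('test.wav', 16000)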

call(audio_data, sample_rate)

Calculate MFCC features of audio data.

Parameters
  • audio_data – the audio signal from which to compute mfcc.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
class athena.transform.feats.WriteWav(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Encode audio data (input) using sample rate (input), return a write wav operation. The operation is based on tensorflow.audio.encode_wav.

Parameters

config – a dictionary containing optional parameters of WriteWav.

Example

>>> config = {'sample_rate': 16000}
>>> write_wav_op = WriteWav.params(config).instantiate()
>>> write_wav_op('test_new.wav', audio_data, 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following one optional parameter:

  • 'sample_rate' – the sample rate of the signal. (default=16000)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(filename, audio_data, sample_rate)

Write wav using audio_data.

Parameters
  • filename – filepath of wav.

  • audio_data – a tensor containing data of a wav.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • filename: string

  • audio_data: \((L)\)

  • sample_rate: float

Note: Return a write-wav op. Call it when writing a file.
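
A simple read-then-write round trip might look like the following sketch; the file names are placeholders, and whether the returned op must be run explicitly depends on the execution mode (eager vs. graph):

>>> read_wav_op = ReadWav.params().instantiate()
>>> audio_data, sample_rate = read_wav_op('test.wav')
>>> write_wav_op = WriteWav.params({'sample_rate': 16000}).instantiate()
>>> write_wav_op('test_copy.wav', audio_data, 16000)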

class athena.transform.feats.Fbank(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Computing filter banks is applying triangular filters on a Mel-scale to the power spectrum to extract frequency bands.

Parameters

config – contains thirteen optional parameters.

Shape:
  • output: \((T, F, C)\).

Examples

>>> config = {'filterbank_channel_count': 40, 'remove_dc_offset': True}
>>> fbank_op = Fbank.params(config).instantiate()
>>> fbank_out = fbank_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following thirteen optional parameters:

  • 'window_length' – Window length in seconds. (float, default = 0.025)

  • 'frame_length' – Hop length in seconds. (float, default = 0.010)

  • 'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, 1/2 frame_length of data will be padded to the signal. (int, default = 1)

  • 'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97)

  • 'window_type' – Type of window (“hamm”|”hann”|”povey”|”rect”|”blac”|”tria”). (string, default = “povey”)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)

  • 'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)

  • 'is_log10' – If true, use log10 for the fbank. If false, use the natural log. (bool, default = false)

  • 'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)

  • 'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

  • 'lower_frequency_limit' – Low cutoff frequency for mel bins (float, default = 20)

  • 'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)

  • 'dither' – Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_data, sample_rate)

Calculate fbank features of audio data.

Parameters
  • audio_data – the audio signal from which to compute spectrum.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
num_channels()
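
A typical front end decodes the audio once and then feeds the samples to the feature op. This is a hedged pipeline sketch, assuming the decoded samples can be passed directly as audio_data (config values are illustrative):

>>> read_wav_op = ReadWav.params({'audio_channels': 1}).instantiate()
>>> audio_data, sample_rate = read_wav_op('test.wav')
>>> fbank_op = Fbank.params({'filterbank_channel_count': 40}).instantiate()
>>> fbank_out = fbank_op(audio_data, sample_rate)
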
class athena.transform.feats.CMVN(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Do CMVN on features.

Parameters

config – contains four optional parameters.

Shape:
  • output: \((T, F)\).

Examples

>>> config = {'global_mean': 0.0, 'global_variance': 1.0}
>>> cmvn_op = CMVN.params(config).instantiate()
>>> cmvn_out = cmvn_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following four optional parameters:

  • 'type' – Type of Operation. (string, default = ‘CMVN’)

  • 'global_mean' – Global mean of features. (float, default = 0.0)

  • 'global_variance' – Global variance of features. (float, default = 1.0)

  • 'local_cmvn' – If true, local CMVN will be done on features. (bool, default = False)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_feature, speed=1.0)

Compute CMVN on features.

dim()
athena.transform.feats.compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False)
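
For reference, CMVN amounts to per-dimension mean subtraction and variance normalization of the feature matrix. The following NumPy sketch shows only the arithmetic, not the library implementation (the epsilon is our addition for numerical safety):

>>> import numpy as np
>>> feats = np.random.randn(100, 40).astype(np.float32)  # (T, F) features
>>> mean, var = feats.mean(axis=0), feats.var(axis=0)
>>> cmvn_feats = (feats - mean) / np.sqrt(var + 1e-6)
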
class athena.transform.feats.FbankPitch(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Compute Fbank and Pitch features respectively, and concatenate them.

Shape:
  • output: \((T, F)\).

Examples

>>> config = {'raw_energy': 1, 'lowpass-cutoff': 1000}
>>> fbankpitch_op = FbankPitch.params(config).instantiate()
>>> fbankpitch_out = fbankpitch_op('test.wav', 16000)

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following twenty-nine optional parameters:

  • 'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97),

  • 'window_type' – Type of window (“hamm”|”hann”|”povey”|”rect”|”blac”|”tria”). (string, default = “povey”)

  • 'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)

  • 'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)

  • 'is_log10' – If true, use log10 for the fbank. If false, use the natural log. (bool, default = false)

  • 'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)

  • 'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)

  • 'lower_frequency_limit' – Low cutoff frequency for mel bins (float, default = 20)

  • 'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)

  • 'dither' – Dithering constant (0.0 means no dither). (float, default = 1) [adds robustness to training]

  • 'delta_pitch' – Smallest relative change in pitch that our algorithm measures (float, default = 0.005)

  • 'window_length' – Frame length in seconds (float, default = 0.025)

  • 'frame_length' – Frame shift in seconds (float, default = 0.010)

  • 'frames-per-chunk' – Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats), you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm) (int, default = 0)

  • 'lowpass-cutoff' – cutoff frequency for LowPass filter (Hz). (float, default = 1000)

  • 'lowpass-filter-width' – Integer that determines filter width of lowpass filter, more gives sharper filter (int, default = 1)

  • 'max-f0' – max. F0 to search for (Hz) (float, default = 400)

  • 'max-frames-latency' – Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)

  • 'min-f0' – min. F0 to search for (Hz) (float, default = 50)

  • 'nccf-ballast' – Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)

  • 'nccf-ballast-online' – This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)

  • 'penalty-factor' – cost factor for F0 change. (float, default = 0.1)

  • 'preemphasis-coefficient' – Coefficient for use in signal preemphasis (deprecated). (float, default = 0)

  • 'recompute-frame' – Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)

  • 'resample-frequency' – Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff (float, default = 4000)

  • 'simulate-first-pass-online' – If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)

  • 'snip-edges' – If this is set to false, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)

  • 'soft-min-f0' – Minimum f0, applied in soft way, must not exceed min-f0. (float, default = 10)

  • 'upsample-filter-width' – Integer that determines filter width when upsampling NCCF. (int, default = 5)

Note

Return an object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_data, sample_rate)

Calculate concatenated fbank and pitch features of the wav.

Parameters
  • audio_data – the audio signal from which to compute features.

  • sample_rate – the sample rate of the signal we are working with.

Shape:
  • audio_data: \((1, N)\)

  • sample_rate: float

dim()
class athena.transform.feats.Add_rir_noise_aecres(config: dict)

Bases: athena.transform.feats.base_frontend.BaseFrontend

Add a random signal-to-noise ratio noise or impulse response to clean speech.

classmethod params(config=None)

Set params.

Parameters
  • config – contains the following nine optional parameters:

  • 'sample_rate' – Sample frequency of waveform data. (int, default = 16000)

  • 'if_add_rir' – If true, add rir to audio data. (bool, default = False)

  • 'rir_filelist' – FileList path of rir. (string, default = ‘rirlist.scp’)

  • 'if_add_noise' – If true, add random noise to audio data. (bool, default = False)

  • 'snr_min' – Minimum SNR added to the signal. (float, default = 0)

  • 'snr_max' – Maximum SNR added to the signal. (float, default = 30)

  • 'noise_filelist' – FileList path of noise. (string, default = ‘noiselist.scp’)

  • 'if_add_aecres' – If true, add aecres to audio data. (bool, default = False)

  • 'aecres_filelist' – FileList path of aecres. (string, default = ‘aecreslist.scp’)

Returns

An object of class HParams, which is a set of hyperparameters as name-value pairs.

call(audio_data, sample_rate=None)

Add random noise or a room impulse response to the audio data.

Parameters
  • audio_data – the audio signal to be augmented. Should be a (1, N) tensor.

  • sample_rate – [optional] the sample rate of the signal we are working with. (default = 16000)

Returns

A float tensor of size N containing the noise-added audio.
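
Unlike the other classes, no example is given in the original documentation. The following sketch shows one plausible configuration; the dictionary key names (underscore form) and the instantiation pattern are assumptions borrowed from the other classes:

>>> config = {'if_add_noise': True, 'noise_filelist': 'noiselist.scp', 'snr_min': 5, 'snr_max': 20}
>>> add_noise_op = Add_rir_noise_aecres.params(config).instantiate()
>>> noisy_audio = add_noise_op(audio_data, 16000)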