athena.transform.feats
Submodules
athena.transform.feats.add_rir_noise_aecres
athena.transform.feats.add_rir_noise_aecres_test
athena.transform.feats.base_frontend
athena.transform.feats.cmvn
athena.transform.feats.cmvn_test
athena.transform.feats.fbank
athena.transform.feats.fbank_pitch
athena.transform.feats.fbank_pitch_test
athena.transform.feats.fbank_test
athena.transform.feats.framepow
athena.transform.feats.framepow_test
athena.transform.feats.mel_spectrum
athena.transform.feats.mel_spectrum_test
athena.transform.feats.mfcc
athena.transform.feats.mfcc_test
athena.transform.feats.pitch
athena.transform.feats.pitch_test
athena.transform.feats.read_wav
athena.transform.feats.read_wav_test
athena.transform.feats.spectrum
athena.transform.feats.spectrum_test
athena.transform.feats.write_wav
athena.transform.feats.write_wav_test
Package Contents
Classes
ReadWav – Read audio samples from a wav file and return the sample data and sample rate. The operation is based on tensorflow.audio.decode_wav.
Spectrum – Compute spectrum features of every frame in speech.
MelSpectrum – Compute filter banks by applying triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands.
Framepow – Compute power of every frame in speech.
Pitch – Compute pitch features of every frame in speech.
Mfcc – Compute MFCC features of every frame in speech.
WriteWav – Encode audio data (input) using sample rate (input) and return a write wav operation.
Fbank – Compute filter banks by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands.
CMVN – Do CMVN on features.
FbankPitch – Compute Fbank and Pitch features separately, and concatenate them.
Add_rir_noise_aecres – Add noise at a random signal-to-noise ratio, or an impulse response, to clean speech.
Functions
compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False)
- class athena.transform.feats.ReadWav(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Read audio samples from a wav file and return the sample data and sample rate. The operation is based on tensorflow.audio.decode_wav.
- Parameters
config – a dictionary containing the optional parameters of read wav.
Examples
>>> config = {'audio_channels': 1}
>>> read_wav_op = ReadWav.params(config).instantiate()
>>> audio_data, sample_rate = read_wav_op('test.wav')
Note: The audio data range from -32768 to 32767 (for 16 bits), not from -1 to 1.
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following two optional parameters
'type' – ‘ReadWav’.
'audio_channels' – index of the desired channel. (default=1)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(wavfile, speed=1.0)
Get audio data and sample rate from a wavfile.
- Parameters
wavfile – filepath of wav.
speed – Speed of sample channels wanted. (default=1.0)
- Shape:
Note: Return audio data and sample rate.
audio_data: \((L)\) with tf.float32 dtype
sample_rate: tf.int32
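The note on the sample range can be illustrated without TensorFlow. tensorflow.audio.decode_wav yields floats in [-1.0, 1.0]; the sketch below assumes the 16-bit range is obtained by multiplying by 32768, which is an assumption about the implementation rather than something this page states. `to_int16_range` is a hypothetical helper and not part of athena.

```python
def to_int16_range(samples):
    """Scale normalized floats in [-1.0, 1.0] to the 16-bit sample range.

    Hypothetical helper: assumes a scale factor of 32768 and clips the
    result to [-32768, 32767].
    """
    scaled = (s * 32768.0 for s in samples)
    return [max(-32768.0, min(32767.0, v)) for v in scaled]


normalized = [0.0, 0.5, -1.0, 1.0]
print(to_int16_range(normalized))  # [0.0, 16384.0, -32768.0, 32767.0]
```

Features computed downstream of ReadWav therefore see values on the integer PCM scale, which matters when comparing against tools that work on normalized audio.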
- class athena.transform.feats.Spectrum(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute spectrum features of every frame in speech.
- Parameters
config – contains ten optional parameters.
- Shape:
output: \((T, F)\).
- Examples:
>>> config = {'window_length': 0.25, 'window_type': 'hann'}
>>> spectrum_op = Spectrum.params(config).instantiate()
>>> spectrum_out = spectrum_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following ten optional parameters:
'window_length' – Window length in seconds. (float, default = 0.025)
'frame_length' – Hop length in seconds. (float, default = 0.010)
'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, half a frame_length of data will be padded to the data. (int, default = 1)
'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
'window_type' – Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)
'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = false)
'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. If 3, return magnitude spectrum. (int, default = 2)
'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
'dither' – Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
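How 'window_length', 'frame_length' and 'snip_edges' interact can be sketched with a frame counter. The sketch assumes Kaldi-style framing; whether 'snip_edges' = 2 in athena matches Kaldi's snip-edges=false exactly is not confirmed by this page, and `num_frames` is a hypothetical helper.

```python
def num_frames(num_samples, sample_rate, window_length=0.025,
               frame_length=0.010, snip_edges=1):
    """Count frames, assuming Kaldi-style framing (not confirmed for athena)."""
    window = int(round(window_length * sample_rate))  # samples per window
    hop = int(round(frame_length * sample_rate))      # samples per hop
    if snip_edges == 1:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // hop
    # Otherwise the signal is padded, giving roughly one frame per hop.
    return (num_samples + hop // 2) // hop


print(num_frames(16000, 16000))                # 98
print(num_frames(16000, 16000, snip_edges=2))  # 100
```

With the defaults, one second of 16 kHz audio uses a 400-sample window and a 160-sample hop, so T in the output shape is close to 100.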
- call(audio_data, sample_rate=None)
Calculate the power spectrum or log-power spectrum of audio data.
- Parameters
audio_data – the audio signal from which to compute the spectrum.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
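The three 'output_type' values of Spectrum can be illustrated on a single frame with a plain DFT. This is a dependency-free sketch, not the athena kernel: it omits dithering, pre-emphasis and windowing, and the natural log and the 1e-10 floor are assumptions. `frame_spectrum` is a hypothetical name.

```python
import cmath
import math


def frame_spectrum(frame, output_type=2):
    """Return the one-sided spectrum of one frame.

    output_type: 1 = power, 2 = log-power, 3 = magnitude.
    """
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        # Naive DFT bin k; an FFT would be used in practice.
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        power = abs(acc) ** 2
        if output_type == 1:
            bins.append(power)
        elif output_type == 2:
            bins.append(math.log(max(power, 1e-10)))  # floor avoids log(0)
        elif output_type == 3:
            bins.append(abs(acc))
        else:
            raise ValueError("output_type must be 1, 2 or 3")
    return bins


# A cosine with two cycles per 8-sample frame peaks in bin 2.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
magnitude = frame_spectrum(frame, output_type=3)
print(max(range(len(magnitude)), key=magnitude.__getitem__))  # 2
```

A frame of N samples yields N // 2 + 1 bins, which is the F dimension of the \((T, F)\) output when no zero-padding is applied.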
- class athena.transform.feats.MelSpectrum(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute filter banks by applying triangular filters on a Mel scale to the magnitude spectrum to extract frequency bands; this is based on the MelSpectrum of Librosa.
- Parameters
config – contains twelve optional parameters.
- Shape:
output: \((T, F)\).
- Examples:
>>> config = {'output_type': 3, 'filterbank_channel_count': 23}
>>> melspectrum_op = MelSpectrum.params(config).instantiate()
>>> melspectrum_out = melspectrum_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following twelve optional parameters:
'window_length' – Window length in seconds. (float, default = 0.025)
'frame_length' – Hop length in seconds. (float, default = 0.010)
'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, half a frame_length of data will be padded to the data. (int, default = 1)
'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.0)
'window_type' – Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "hann")
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = False)
'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
'lower_frequency_limit' – Low cutoff frequency for mel bins. (float, default = 60)
'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 40)
'dither' – Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 0.0)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate the log mel spectrum of audio data.
- Parameters
audio_data – the audio signal from which to compute the spectrum.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
- num_channels()
- class athena.transform.feats.Framepow(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute power of every frame in speech.
- Parameters
config – contains four optional parameters.
- Shape:
output: \((T, 1)\).
- Examples:
>>> config = {'window_length': 0.25, 'remove_dc_offset': True}
>>> framepow_op = Framepow.params(config).instantiate()
>>> framepow_out = framepow_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following four optional parameters:
'window_length' – Window length in seconds. (float, default = 0.025)
'frame_length' – Hop length in seconds. (float, default = 0.010)
'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, half a frame_length of data will be padded to the data. (int, default = 1)
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate the power of every frame in speech.
- Parameters
audio_data – the audio signal from which to compute the frame power.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
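A rough picture of per-frame power and of 'remove_dc_offset' follows. It assumes power means the sum of squared samples in each frame and that only complete frames are kept; the actual op may apply a log or other scaling that this page does not describe. `frame_power` is a hypothetical helper that takes window and hop sizes in samples.

```python
def frame_power(samples, window, hop, remove_dc_offset=True):
    """Sum of squared samples for every complete frame (illustrative only)."""
    powers = []
    for start in range(0, len(samples) - window + 1, hop):
        frame = samples[start:start + window]
        if remove_dc_offset:
            # Subtract the frame mean so a constant offset carries no power.
            mean = sum(frame) / window
            frame = [x - mean for x in frame]
        powers.append(sum(x * x for x in frame))
    return powers


constant = [1.0] * 10
print(frame_power(constant, window=4, hop=2))                          # [0.0, 0.0, 0.0, 0.0]
print(frame_power(constant, window=4, hop=2, remove_dc_offset=False))  # [4.0, 4.0, 4.0, 4.0]
```

One value is produced per frame, which matches the \((T, 1)\) output shape listed for Framepow.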
- class athena.transform.feats.Pitch(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute pitch features of every frame in speech.
- Parameters
config – contains nineteen optional parameters.
- Shape:
output: \((T, 2)\).
- Examples:
>>> config = {'resample_frequency': 4000, 'lowpass_cutoff': 1000}
>>> pitch_op = Pitch.params(config).instantiate()
>>> pitch_out = pitch_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following nineteen optional parameters:
'delta_pitch' – Smallest relative change in pitch that our algorithm measures (float, default = 0.005)
'window_length' – Frame length in seconds (float, default = 0.025)
'frame_length' – Frame shift in seconds (float, default = 0.010)
'frames-per-chunk' – Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats), you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm) (int, default = 0)
'lowpass-cutoff' – cutoff frequency for LowPass filter (Hz). (float, default = 1000)
'lowpass-filter-width' – Integer that determines filter width of lowpass filter, more gives sharper filter (int, default = 1)
'max-f0' – max. F0 to search for (Hz) (float, default = 400)
'max-frames-latency' – Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
'min-f0' – min. F0 to search for (Hz) (float, default = 50)
'nccf-ballast' – Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
'nccf-ballast-online' – This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)
'penalty-factor' – Cost factor for F0 change. (float, default = 0.1)
'preemphasis-coefficient' – Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
'recompute-frame' – Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
'resample-frequency' – Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff (float, default = 4000)
'simulate-first-pass-online' – If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
'snip-edges' – If this is set to false, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)
'soft-min-f0' – Minimum f0, applied in soft way, must not exceed min-f0. (float, default = 10)
'upsample-filter-width' – Integer that determines filter width when upsampling NCCF. (int, default = 5)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate pitch and POV features of audio data.
- Parameters
audio_data – the audio signal from which to compute pitch.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
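The role of 'nccf-ballast' can be sketched with a simplified normalized cross-correlation function (NCCF). The sketch adds a constant ballast under the square root; the Kaldi pitch extractor scales the ballast by signal energy and works on a resampled signal, so this is an illustration of the idea and not the algorithm used here. `nccf` is a hypothetical helper.

```python
import math


def nccf(frame, lag, ballast=0.0):
    """Simplified NCCF between a frame and itself shifted by `lag` samples."""
    n = len(frame) - lag
    head, tail = frame[:n], frame[lag:lag + n]
    dot = sum(a * b for a, b in zip(head, tail))
    e_head = sum(a * a for a in head)
    e_tail = sum(b * b for b in tail)
    # The ballast enlarges the denominator, which matters most when the
    # frame energy is small.
    denom = math.sqrt(e_head * e_tail + ballast)
    return dot / denom if denom > 0 else 0.0


# A sine with a 20-sample period correlates strongly at lag 20.
loud = [math.sin(2 * math.pi * t / 20) for t in range(200)]
quiet = [0.01 * x for x in loud]
print(round(nccf(loud, 20), 3), round(nccf(quiet, 20), 3))
print(round(nccf(loud, 20, ballast=1.0), 3), round(nccf(quiet, 20, ballast=1.0), 3))
```

Without ballast both frames score the same; with ballast the quiet frame's NCCF drops much further than the loud one's, which is the behaviour the 'nccf-ballast' description refers to.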
- class athena.transform.feats.Mfcc(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute MFCC features of every frame in speech.
- Parameters
config – contains fifteen optional parameters.
- Shape:
output: \((C, T, F)\).
- Examples:
>>> config = {'cepstral_lifter': 22.0, 'coefficient_count': 13}
>>> mfcc_op = Mfcc.params(config).instantiate()
>>> mfcc_out = mfcc_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following fifteen optional parameters:
'window_length' – Window length in seconds. (float, default = 0.025)
'frame_length' – Hop length in seconds. (float, default = 0.010)
'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, half a frame_length of data will be padded to the data. (int, default = 1)
'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
'window_type' – Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)
'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
'coefficient_count' – Number of cepstra in MFCC computation. (int, default = 13)
'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
'lower_frequency_limit' – Low cutoff frequency for mel bins. (float, default = 20)
'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)
'dither' – Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
'cepstral_lifter' – Constant that controls scaling of MFCCs. (float, default = 22)
'use_energy' – Use energy (not C0) in MFCC computation. (bool, default = True)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate MFCC features of audio data.
- Parameters
audio_data – the audio signal from which to compute MFCC.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
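What 'cepstral_lifter' does to the coefficients can be sketched with the liftering weights used by Kaldi, 1 + (Q / 2) sin(pi n / Q). That athena applies exactly this formula is an assumption based on the Kaldi-style parameter names; `lifter_weights` is a hypothetical helper.

```python
import math


def lifter_weights(coefficient_count=13, cepstral_lifter=22.0):
    """Per-coefficient scaling factors for cepstral liftering (Kaldi formula)."""
    if cepstral_lifter == 0:
        # A lifter of zero is taken to mean no liftering at all.
        return [1.0] * coefficient_count
    q = cepstral_lifter
    return [1.0 + 0.5 * q * math.sin(math.pi * n / q)
            for n in range(coefficient_count)]


weights = lifter_weights()
print(len(weights), weights[0])
print([round(w, 2) for w in weights[:4]])
```

Each MFCC coefficient n would be multiplied by weight n, so C0 is left unchanged and higher-order coefficients are boosted.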
- class athena.transform.feats.WriteWav(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Encode audio data (input) using sample rate (input) and return a write wav operation. The operation is based on tensorflow.audio.encode_wav.
- Parameters
config – a dictionary containing the optional parameters of write wav.
- Example:
>>> config = {'sample_rate': 16000}
>>> write_wav_op = WriteWav.params(config).instantiate()
>>> write_wav_op('test_new.wav', audio_data, 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following one optional parameter:
'sample_rate' – the sample rate of the signal. (default=16000)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(filename, audio_data, sample_rate)
Write a wav file using audio_data.
- Parameters
filename – filepath of wav.
audio_data – a tensor containing data of a wav.
sample_rate – the sample rate of the signal we are working with.
- Shape:
filename: string
audio_data: \((L)\)
sample_rate: float
Note: Returns a write wav op. Call it when writing a file.
- class athena.transform.feats.Fbank(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute filter banks by applying triangular filters on a Mel scale to the power spectrum to extract frequency bands.
- Parameters
config – contains thirteen optional parameters.
- Shape:
output: \((T, F, C)\).
- Examples:
>>> config = {'filterbank_channel_count': 40, 'remove_dc_offset': True}
>>> fbank_op = Fbank.params(config).instantiate()
>>> fbank_out = fbank_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following thirteen optional parameters:
'window_length' – Window length in seconds. (float, default = 0.025)
'frame_length' – Hop length in seconds. (float, default = 0.010)
'snip_edges' – If 1, the last frame (shorter than window_length) will be cut off. If 2, half a frame_length of data will be padded to the data. (int, default = 1)
'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
'window_type' – Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)
'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
'is_log10' – If true, apply log10 to fbank. If false, apply the natural log (loge). (bool, default = false)
'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
'lower_frequency_limit' – Low cutoff frequency for mel bins. (float, default = 20)
'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)
'dither' – Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate fbank features of audio data.
- Parameters
audio_data – the audio signal from which to compute fbank features.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
- num_channels()
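How 'lower_frequency_limit', 'upper_frequency_limit' and 'filterbank_channel_count' place the triangular filters can be sketched by computing the filter centre frequencies. The sketch assumes the Kaldi/HTK mel formula mel = 1127 ln(1 + f / 700) and centres equally spaced in mel; the helper names are hypothetical and the exact formula used by athena is not stated on this page.

```python
import math


def hz_to_mel(hz):
    return 1127.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_center_frequencies(sample_rate, channel_count=23,
                           lower=20.0, upper=0.0):
    """Centre frequencies (Hz) of triangular filters spaced evenly in mel."""
    nyquist = sample_rate / 2.0
    # An upper limit <= 0 is interpreted as an offset from Nyquist.
    high = upper if upper > 0 else nyquist + upper
    mel_low, mel_high = hz_to_mel(lower), hz_to_mel(high)
    step = (mel_high - mel_low) / (channel_count + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(channel_count)]


centers = mel_center_frequencies(16000)
print(len(centers))
print([round(c) for c in centers[:3]], [round(c) for c in centers[-3:]])
```

The centres are dense at low frequencies and sparse near the upper limit, which is what gives fbank features their perceptually motivated resolution.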
- class athena.transform.feats.CMVN(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Do CMVN on features.
- Parameters
config – contains four optional parameters.
- Shape:
output: \((T, F)\).
- Examples:
>>> config = {'global_mean': 0.0, 'global_variance': 1.0}
>>> cmvn_op = CMVN.params(config).instantiate()
>>> cmvn_out = cmvn_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following four optional parameters:
'type' – Type of operation. (string, default = ‘CMVN’)
'global_mean' – Global mean of features. (float, default = 0.0)
'global_variance' – Global variance of features. (float, default = 1.0)
'local_cmvn' – If true, local CMVN will be done on features. (bool, default = False)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_feature, speed=1.0)
Compute CMVN on features.
- dim()
- athena.transform.feats.compute_cmvn(audio_feature, mean=None, variance=None, local_cmvn=False)
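Cepstral mean and variance normalization (CMVN) can be sketched in a few lines. The sketch mirrors the argument names of compute_cmvn, but the order of global and local normalization, the per-dimension lists and the `eps` term are assumptions; `cmvn_sketch` is a hypothetical stand-in and not the athena function.

```python
import math


def cmvn_sketch(features, mean=None, variance=None, local_cmvn=False, eps=1e-9):
    """Normalize a T x F feature matrix given as a list of rows."""
    out = [list(row) for row in features]
    dims = len(out[0])
    if mean is not None and variance is not None:
        # Global CMVN: statistics estimated beforehand over a whole corpus.
        out = [[(row[d] - mean[d]) / math.sqrt(variance[d] + eps)
                for d in range(dims)] for row in out]
    if local_cmvn:
        # Local CMVN: statistics computed over the frames of this utterance.
        frames = len(out)
        mu = [sum(row[d] for row in out) / frames for d in range(dims)]
        var = [sum((row[d] - mu[d]) ** 2 for row in out) / frames
               for d in range(dims)]
        out = [[(row[d] - mu[d]) / math.sqrt(var[d] + eps)
                for d in range(dims)] for row in out]
    return out


feats = [[1.0, 10.0], [3.0, 30.0]]
print([[round(v, 3) for v in row] for row in cmvn_sketch(feats, local_cmvn=True)])
```

After local CMVN every feature dimension has approximately zero mean and unit variance over the utterance, regardless of its original scale.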
- class athena.transform.feats.FbankPitch(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Compute Fbank and Pitch features separately, and concatenate them.
- Shape:
output: \((T, F)\).
- Examples:
>>> config = {'raw_energy': 1, 'lowpass-cutoff': 1000}
>>> fbankpitch_op = FbankPitch.params(config).instantiate()
>>> fbankpitch_out = fbankpitch_op('test.wav', 16000)
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following twenty-nine optional parameters:
'preEph_coeff' – Coefficient for use in frame-signal preemphasis. (float, default = 0.97)
'window_type' – Type of window ("hamm"|"hann"|"povey"|"rect"|"blac"|"tria"). (string, default = "povey")
'remove_dc_offset' – Subtract mean from waveform on each frame. (bool, default = true)
'is_fbank' – If true, compute the power spectrum without frame energy. If false, use the frame energy instead of the square of the constant component of the signal. (bool, default = true)
'is_log10' – If true, apply log10 to fbank. If false, apply the natural log (loge). (bool, default = false)
'output_type' – If 1, return power spectrum. If 2, return log-power spectrum. (int, default = 1)
'upper_frequency_limit' – High cutoff frequency for mel bins (if <= 0, offset from Nyquist). (float, default = 0)
'lower_frequency_limit' – Low cutoff frequency for mel bins. (float, default = 20)
'filterbank_channel_count' – Number of triangular mel-frequency bins. (float, default = 23)
'dither' – Dithering constant (0.0 means no dither); adds robustness to training. (float, default = 1)
'delta_pitch' – Smallest relative change in pitch that our algorithm measures (float, default = 0.005)
'window_length' – Frame length in seconds (float, default = 0.025)
'frame_length' – Frame shift in seconds (float, default = 0.010)
'frames-per-chunk' – Only relevant for offline pitch extraction (e.g. compute-kaldi-pitch-feats), you can set it to a small nonzero value, such as 10, for better feature compatibility with online decoding (affects energy normalization in the algorithm) (int, default = 0)
'lowpass-cutoff' – cutoff frequency for LowPass filter (Hz). (float, default = 1000)
'lowpass-filter-width' – Integer that determines filter width of lowpass filter, more gives sharper filter (int, default = 1)
'max-f0' – max. F0 to search for (Hz) (float, default = 400)
'max-frames-latency' – Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if --frames-per-chunk > 0 and --simulate-first-pass-online=true). (int, default = 0)
'min-f0' – min. F0 to search for (Hz) (float, default = 50)
'nccf-ballast' – Increasing this factor reduces NCCF for quiet frames. (float, default = 7000)
'nccf-ballast-online' – This is useful mainly for debug; it affects how the NCCF ballast is computed. (bool, default = false)
'penalty-factor' – Cost factor for F0 change. (float, default = 0.1)
'preemphasis-coefficient' – Coefficient for use in signal preemphasis (deprecated). (float, default = 0)
'recompute-frame' – Only relevant for online pitch extraction, or for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if --frames-per-chunk > 0. (int, default = 500)
'resample-frequency' – Frequency that we down-sample the signal to. Must be more than twice lowpass-cutoff (float, default = 4000)
'simulate-first-pass-online' – If true, compute-kaldi-pitch-feats will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if --frames-per-chunk > 0. (bool, default = false)
'snip-edges' – If this is set to false, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (bool, default = true)
'soft-min-f0' – Minimum f0, applied in soft way, must not exceed min-f0. (float, default = 10)
'upsample-filter-width' – Integer that determines filter width when upsampling NCCF. (int, default = 5)
Note
Return an object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate)
Calculate fbank and pitch features of a wav and concatenate them.
- Parameters
audio_data – the audio signal from which to compute the features.
sample_rate – the sample rate of the signal we are working with.
- Shape:
audio_data: \((1, N)\)
sample_rate: float
- dim()
- class athena.transform.feats.Add_rir_noise_aecres(config: dict)
Bases: athena.transform.feats.base_frontend.BaseFrontend
Add noise at a random signal-to-noise ratio, or an impulse response, to clean speech.
- classmethod params(config=None)
Set params.
- Parameters
config – contains the following nine optional parameters:
'sample_rate' – Sample frequency of waveform data. (int, default = 16000)
'if_add_rir' – If true, add rir to audio data. (bool, default = False)
'rir_filelist' – File list path of rir. (string, default = ‘rirlist.scp’)
'if_add_noise' – If true, add random noise to audio data. (bool, default = False)
'snr_min' – Minimum SNR at which noise is added to the signal. (float, default = 0)
'snr_max' – Maximum SNR at which noise is added to the signal. (float, default = 30)
'noise_filelist' – File list path of noise. (string, default = ‘noiselist.scp’)
'if_add_aecres' – If true, add aecres to audio data. (bool, default = False)
'aecres_filelist' – File list path of aecres. (string, default = ‘aecreslist.scp’)
- Returns
An object of class HParams, which is a set of hyperparameters as name-value pairs.
- call(audio_data, sample_rate=None)
Add rir, noise, or aecres to audio data, as enabled in the config.
- Parameters
audio_data – the audio signal to which rir, noise, or aecres is added. Should be a (1, N) tensor.
sample_rate – [optional] the sample rate of the signal we are working with; default is 16 kHz.
- Returns
A float tensor of size N containing the noise-added audio.
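Mixing noise at a random signal-to-noise ratio between 'snr_min' and 'snr_max' can be sketched as follows. The sketch assumes the SNR is expressed in dB and drawn uniformly, and that the noise is scaled to reach the target power ratio; none of this is confirmed by this page, and `mix_at_snr` is a hypothetical helper rather than the athena op.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power equals snr_db, then add it."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Solve 10 * log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
snr_db = rng.uniform(0.0, 30.0)  # a draw between snr_min and snr_max
noisy = mix_at_snr(speech, noise, snr_db)
print(len(noisy), round(snr_db, 1))
```

A lower SNR gives a larger noise gain, so 'snr_min' bounds how heavily an utterance can be corrupted.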