athena.data.datasets.preprocess

Preprocessing for speech features.

This code is modified from https://github.com/wenet-e2e/wenet.git.

Module Contents

Classes

SpecAugment

Implementation of SpecAugment from the paper "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition"

class athena.data.datasets.preprocess.SpecAugment(preprocess_config)

Implementation of SpecAugment from the paper “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”

Parameters

preprocess_config

a config object containing the entries below:

time_warping: time-warp parameter W, should be in (0, time_steps / 2);
    a random horizontal center point in (W, time_steps - W) is warped
    either left or right by a distance chosen randomly from [0, W).
time_masking: time-mask parameter T, should be in (0, time_steps);
    the masked time steps are [t_0, t_0 + t), where t is drawn randomly
    from [0, T) and t_0 is drawn randomly from [0, time_steps - t).
frequency_masking: frequency-mask parameter F, should be in (0, dimension);
    the masked frequencies are [f_0, f_0 + f), where f is drawn randomly
    from [0, F) and f_0 is drawn randomly from [0, dimension - f).
mask_cols: the masking operation is applied mask_cols times on each axis.
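A config following the entries above might look like the sketch below; the key names mirror the parameter descriptions here, and the numeric values are illustrative assumptions, not Athena defaults.

```python
# Hypothetical preprocess_config for SpecAugment; values are examples only.
preprocess_config = {
    "time_warping": 5,        # W: warp distance drawn from [0, W)
    "time_masking": 30,       # T: time-mask width t drawn from [0, T)
    "frequency_masking": 15,  # F: freq-mask width f drawn from [0, F)
    "mask_cols": 2,           # apply each mask twice per axis
}
```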

default_config
__call__(feat)

Apply SpecAugment preprocessing to audio features.

Parameters

feat – audio features, shape should be [time_steps, dimension, channels]

Returns

processed features
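To illustrate the time-warping step described in the config (a random center point shifted left or right by up to W frames), here is a minimal NumPy sketch; it is not Athena's or WeNet's implementation, and the helper name time_warp is assumed for illustration.

```python
import numpy as np

def time_warp(x, max_w=5, rng=None):
    """Warp a [time_steps, dimension] feature along the time axis (sketch).

    A center point is chosen in (W, time_steps - W) and moved left or
    right by a distance in [0, W); the two segments are resampled by
    nearest-neighbor index remapping so the output shape is unchanged.
    """
    rng = rng if rng is not None else np.random.default_rng()
    t = x.shape[0]
    if max_w < 1 or t - 2 * max_w <= 0:
        return x  # feature too short to warp
    center = int(rng.integers(max_w, t - max_w))
    delta = int(rng.integers(0, max_w))       # warp distance in [0, W)
    if rng.random() < 0.5:
        delta = -delta                        # warp left or right
    new_center = center + delta
    # Monotone index map: [0, center] -> [0, new_center],
    # [center, t-1] -> [new_center, t-1]; then round to frame indices.
    src = np.concatenate([
        np.linspace(0, center, new_center + 1),
        np.linspace(center, t - 1, t - new_center)[1:],
    ])
    idx = np.clip(np.round(src).astype(int), 0, t - 1)
    return x[idx]
```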

spec_augmentation(x)

Deep-copy x, apply spec augmentation to the copy, and return the result.

Parameters
  • x – input feature, a 2-D array of shape T * F

  • num_t_mask – number of time mask to apply

  • num_f_mask – number of freq mask to apply

  • max_t – max width of time mask

  • max_f – max width of freq mask

  • max_w – max width of time warp

Returns

augmented feature
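The masking behavior described above (deep copy, then num_t_mask time masks of width up to max_t and num_f_mask frequency masks of width up to max_f) can be sketched with NumPy as follows; this is a hedged illustration of the technique, not Athena's actual code, and the function name is assumed.

```python
import numpy as np

def spec_augmentation_sketch(x, num_t_mask=2, num_f_mask=2,
                             max_t=50, max_f=10, rng=None):
    """Apply time and frequency masking to a [T, F] feature (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    y = x.copy()  # deep copy so the input feature is left untouched
    time_steps, dim = y.shape
    # Time masks: zero out [t_0, t_0 + t), t drawn from [0, max_t)
    for _ in range(num_t_mask):
        t = int(rng.integers(0, max_t))
        t0 = int(rng.integers(0, max(1, time_steps - t)))
        y[t0:t0 + t, :] = 0.0
    # Frequency masks: zero out [f_0, f_0 + f), f drawn from [0, max_f)
    for _ in range(num_f_mask):
        f = int(rng.integers(0, max_f))
        f0 = int(rng.integers(0, max(1, dim - f)))
        y[:, f0:f0 + f] = 0.0
    return y
```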