athena.models.tts.tts_transformer

Speech transformer implementation.

Module Contents

Classes

TTSTransformer

TTS version of SpeechTransformer.

class athena.models.tts.tts_transformer.TTSTransformer(data_descriptions, config=None)

Bases: athena.models.tts.tacotron2.Tacotron2

TTS version of SpeechTransformer. The model mainly consists of three parts: the x_net for input preparation, the y_net for output preparation, and the transformer itself. Reference: Neural Speech Synthesis with Transformer Network
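As a rough sketch of that three-part layout (the layer choices below are illustrative stand-ins, not Athena's actual x_net/y_net/transformer implementations):

    import tensorflow as tf

    class TTSTransformerSketch(tf.keras.Model):
        """Toy stand-in mirroring the three-part layout described above."""

        def __init__(self, num_tokens=100, d_model=256, feat_dim=80, num_heads=4):
            super().__init__()
            # x_net: prepares text inputs (here, a plain embedding lookup)
            self.x_net = tf.keras.layers.Embedding(num_tokens, d_model)
            # y_net: prepares previous acoustic frames as decoder inputs
            self.y_net = tf.keras.layers.Dense(d_model, activation="relu")
            # "the transformer itself": one cross-attention block as a stand-in
            self.attn = tf.keras.layers.MultiHeadAttention(
                num_heads=num_heads, key_dim=d_model // num_heads)
            self.out_proj = tf.keras.layers.Dense(feat_dim)

        def call(self, text_ids, prev_frames):
            enc = self.x_net(text_ids)                # [batch, x_steps, d_model]
            dec_in = self.y_net(prev_frames)          # [batch, y_steps, d_model]
            dec = self.attn(query=dec_in, value=enc)  # attend over encoder output
            return self.out_proj(dec)                 # [batch, y_steps, feat_dim]

    model = TTSTransformerSketch()
    frames = model(tf.constant([[3, 7, 2]]), tf.zeros([1, 5, 80]))  # -> [1, 5, 80]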

default_config
call(samples, training: bool = None)
time_propagate(encoder_output, memory_mask, outs, step)

Synthesize one step of output frames

Parameters
  • encoder_output – the encoder output, shape: [batch, x_steps, eunits]

  • memory_mask – the encoder output mask, shape: [batch, 1, 1, x_steps]

  • outs (TensorArray) – previous outputs

  • step – the current step number

Returns:

out: new frame outputs, shape: [batch, feat_dim * reduction_factor]
logit: new stop token prediction logit, shape: [batch, reduction_factor]
attention_weights (list): the corresponding attention weights,
    each element in the list represents the attention weights of one decoder layer
    shape: [batch, num_heads, seq_len_q, seq_len_k]
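For illustration, the loop below shows how a step function with this interface is typically driven, accumulating frames in a tf.TensorArray. The time_propagate stub here only mirrors the documented signature; it is not Athena's actual decoder step:

    import tensorflow as tf

    batch, x_steps, eunits = 1, 10, 256
    feat_dim, reduction_factor, max_steps = 80, 1, 50

    # Stand-in: the real method runs the decoder stack for one step;
    # this one emits zeros just to show the wiring.
    def time_propagate(encoder_output, memory_mask, outs, step):
        out = tf.zeros([batch, feat_dim * reduction_factor])
        logit = tf.zeros([batch, reduction_factor])
        attention_weights = []
        return out, logit, attention_weights

    encoder_output = tf.zeros([batch, x_steps, eunits])
    memory_mask = tf.zeros([batch, 1, 1, x_steps])

    # Accumulate previous outputs in a TensorArray, as the outs parameter suggests.
    outs = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
    outs = outs.write(0, tf.zeros([batch, feat_dim * reduction_factor]))  # "go" frame
    for step in range(1, max_steps):
        out, logit, _ = time_propagate(encoder_output, memory_mask, outs, step)
        outs = outs.write(step, out)
        if tf.reduce_any(tf.sigmoid(logit) > 0.5):  # stop token fires: end synthesis
            break
    frames = tf.transpose(outs.stack(), [1, 0, 2])  # [batch, steps, feat_dim]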
synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
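A sketch of how the two return values are typically consumed. Since building the real model requires data_descriptions from a dataset pipeline, synthesize is stubbed here with placeholder shapes that follow the documented return signature:

    import tensorflow as tf

    # Stand-in for TTSTransformer.synthesize with hypothetical shapes.
    def synthesize(samples):
        y_steps, x_steps, feat_dim, num_heads = 120, 40, 80, 4
        after_outs = tf.zeros([1, y_steps, feat_dim])      # mel-like features
        attn = tf.zeros([1, num_heads, y_steps, x_steps])  # one layer's weights
        return after_outs, [attn]

    samples = {"input": tf.constant([[3, 7, 2, 9]])}  # hypothetical tokenized text
    after_outs, attn_weights_stack = synthesize(samples)

    # Common diagnostic: average heads to get a text-to-frame alignment matrix.
    alignment = tf.reduce_mean(attn_weights_stack[-1], axis=1)[0]  # [y_steps, x_steps]

The synthesized features would then normally be passed to a vocoder to produce a waveform, and the alignment matrix inspected to check that the decoder attends to the text roughly monotonically.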