athena.models.tts.tacotron2

tacotron2 implementation

Module Contents

Classes

Tacotron2

An implementation of Tacotron2

class athena.models.tts.tacotron2.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2. Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Shen et al., 2018)

default_config

_pad_and_reshape(outputs, ori_lens, reverse=False)
Parameters
  • outputs – true labels, shape: [batch, y_steps, feat_dim]

  • ori_lens – the original length of the labels (a scalar)

Returns:

reshaped_outputs: the outputs reshaped so that every reduction_factor frames form one decoder step,
    shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
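
For intuition, this frame grouping can be sketched with plain TensorFlow ops (an illustration assuming statically known shapes, not the library code itself):

    import tensorflow as tf

    def pad_and_reshape(outputs, reduction_factor):
        """[batch, y_steps, feat_dim] -> [batch, ceil(y_steps / r), feat_dim * r]."""
        batch, y_steps, feat_dim = outputs.shape              # assumes static shapes
        # zero-pad the time axis so it becomes divisible by reduction_factor
        pad_len = (reduction_factor - y_steps % reduction_factor) % reduction_factor
        outputs = tf.pad(outputs, [[0, 0], [0, pad_len], [0, 0]])
        # fold every reduction_factor consecutive frames into the feature axis
        return tf.reshape(outputs, [batch, -1, feat_dim * reduction_factor])

    frames = tf.random.normal([2, 10, 80])        # batch=2, y_steps=10, feat_dim=80
    grouped = pad_and_reshape(frames, reduction_factor=3)
    print(grouped.shape)                          # (2, 4, 240)
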
call(samples, training: bool = None)

Run the forward pass of the model on a batch of samples.

initialize_input_y(y)
Parameters

y – the true label, shape: [batch, y_steps, feat_dim]

Returns:

y0: y with one step of zeros padded at the start (the initial go-frame),
    shape: [batch, y_steps+1, feat_dim]
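
A minimal sketch of the go-frame padding with a single tf.pad (illustrative; the library implementation may differ in details):

    import tensorflow as tf

    def initialize_input_y(y):
        """Prepend one all-zero frame so that step t is predicted from frame t-1."""
        # pad a single step of zeros at the front of the time axis
        return tf.pad(y, [[0, 0], [1, 0], [0, 0]])

    y = tf.random.normal([2, 5, 80])     # [batch, y_steps, feat_dim]
    y0 = initialize_input_y(y)
    print(y0.shape)                      # (2, 6, 80)
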
initialize_states(encoder_output, input_length)
Parameters
  • encoder_output – encoder outputs, shape: [batch, x_steps, eunits]

  • input_length – shape: [batch]

Returns:

prev_rnn_states: initial states of rnns in decoder
    [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
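
A shape-level sketch of such an initialization (num_rnn_layers, dunits and the uniform attention initialization below are illustrative assumptions, not the library defaults):

    import tensorflow as tf

    def initialize_states(encoder_output, input_length, num_rnn_layers=2, dunits=1024):
        """Shape-level illustration of the decoder's initial state tensors."""
        batch, x_steps, eunits = encoder_output.shape          # assumes static shapes
        # one (h, c) pair of zeros per decoder LSTM layer
        prev_rnn_states = tf.zeros([num_rnn_layers, 2, batch, dunits])
        # attention spread uniformly over the valid (unpadded) encoder frames;
        # the library may initialize these weights differently
        mask = tf.sequence_mask(input_length, x_steps, dtype=tf.float32)
        prev_attn_weight = mask / tf.cast(input_length, tf.float32)[:, None]
        prev_context = tf.zeros([batch, eunits])                # empty context vector
        return prev_rnn_states, prev_attn_weight, prev_context

    enc = tf.random.normal([2, 7, 512])              # [batch, x_steps, eunits]
    lens = tf.constant([7, 5])
    states, attn, ctx = initialize_states(enc, lens)
    print(states.shape, attn.shape, ctx.shape)       # (2, 2, 2, 1024) (2, 7) (2, 512)
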
concat_speaker_embedding(encoder_output, speaker_embedding)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits)

  • speaker_embedding – speaker embedding (batch, embedding_dim)

Returns:

the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
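
The tile-and-concatenate step can be sketched as follows (illustrative only, assuming statically known shapes):

    import tensorflow as tf

    def concat_speaker_embedding(encoder_output, speaker_embedding):
        """Repeat the utterance-level speaker embedding over time and append it
        to every encoder frame."""
        x_steps = encoder_output.shape[1]             # assumes a static length
        tiled = tf.tile(speaker_embedding[:, None, :], [1, x_steps, 1])
        return tf.concat([encoder_output, tiled], axis=-1)

    enc = tf.random.normal([2, 7, 512])   # [batch, x_steps, eunits]
    spk = tf.random.normal([2, 64])       # [batch, embedding_dim]
    print(concat_speaker_embedding(enc, spk).shape)   # (2, 7, 576)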

time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits).

  • input_length – (batch,)

  • prev_y – one step of true labels or predicted labels (batch, feat_dim).

  • prev_rnn_states – previous LSTM states of the decoder, shape: [rnn layers, 2, batch, dunits]

  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]

  • prev_context – previous context vector: [batch, attn_dim]

  • training – if it is training mode

Returns:

out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
get_loss(outputs, samples, training=None)

Compute the loss from the model outputs and the corresponding samples.

synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
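
Schematically, synthesis runs an autoregressive loop of time_propagate steps. The sketch below only reproduces the bookkeeping of such a loop with a toy stand-in step function; toy_step, max_steps and stop_threshold are illustrative assumptions, not the library's names or defaults:

    import tensorflow as tf

    def toy_step(prev_y, prev_attn_weight, reduction_factor=1):
        """Toy stand-in for time_propagate: reproduces shapes only."""
        batch, feat_dim = prev_y.shape
        out = tf.zeros([batch, feat_dim])
        logit = tf.zeros([batch, reduction_factor])
        return out, logit, prev_attn_weight

    def greedy_decode(batch=1, x_steps=7, feat_dim=80, max_steps=50, stop_threshold=0.5):
        prev_y = tf.zeros([batch, feat_dim])           # the initial go-frame
        prev_attn_weight = tf.zeros([batch, x_steps])
        accum_attn_weight = tf.zeros([batch, x_steps])
        outs, attn_weights = [], []
        for _ in range(max_steps):
            out, logit, attn_weight = toy_step(prev_y, prev_attn_weight)
            outs.append(out)
            attn_weights.append(attn_weight)
            # location-sensitive attention keeps a running sum of past alignments
            accum_attn_weight += attn_weight
            prev_y, prev_attn_weight = out, attn_weight
            # stop once every stop-token probability crosses the threshold
            if tf.reduce_all(tf.sigmoid(logit) > stop_threshold):
                break
        return tf.stack(outs, axis=1), tf.stack(attn_weights, axis=1)

    before_outs, attn_stack = greedy_decode()
    print(before_outs.shape, attn_stack.shape)          # (1, 50, 80) (1, 50, 7)

The real loop also threads prev_rnn_states, prev_context and accum_attn_weight into every time_propagate call; they are omitted from toy_step for brevity.
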
_synthesize_post_net(before_outs, logits_stack)
Parameters
  • before_outs – the outputs before postnet

  • logits_stack – the stop-token logits of all decoder steps

Returns:

after_outs: the corresponding synthesized acoustic features
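
The post net follows the standard Tacotron2 convention of predicting a residual on top of the decoder output, while the stop-token logits indicate where the valid output ends. A hedged sketch of that idea (the toy postnet, threshold and length logic are illustrative assumptions, not the library's exact behavior):

    import tensorflow as tf

    def synthesize_post_net(before_outs, logits_stack, postnet, stop_threshold=0.5):
        """Apply the residual postnet and locate the first predicted stop token."""
        # residual refinement, the standard Tacotron2 convention
        after_outs = before_outs + postnet(before_outs)
        # index of the first frame whose stop probability exceeds the threshold;
        # a full implementation would fall back to the maximum length when the
        # threshold is never crossed
        probs = tf.sigmoid(logits_stack)                      # [batch, steps]
        stopped = tf.cast(probs > stop_threshold, tf.int32)
        out_len = tf.argmax(stopped, axis=1, output_type=tf.int32)
        return after_outs, out_len

    # a single conv layer stands in for the real multi-layer postnet
    postnet = tf.keras.layers.Conv1D(80, kernel_size=5, padding="same")
    before = tf.random.normal([1, 20, 80])
    logits = tf.random.normal([1, 20])
    after, length = synthesize_post_net(before, logits, postnet)
    print(after.shape, length.numpy())    # (1, 20, 80)  e.g. [3]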