athena.models.tts.tacotron2
Tacotron2 implementation
Module Contents
Classes
Tacotron2 – An implementation of Tacotron2
- class athena.models.tts.tacotron2.Tacotron2(data_descriptions, config=None)
Bases: athena.models.base.BaseModel
An implementation of Tacotron2. Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Shen et al., 2018)
- default_config
- _pad_and_reshape(outputs, ori_lens, reverse=False)
- Parameters
outputs – true labels, shape: [batch, y_steps, feat_dim]
ori_lens – scalar, the original number of steps
Returns:
reshaped_outputs: the outputs reshaped to match the reduction factor, shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
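For intuition, a minimal sketch of the forward reshape (illustrative only, not the class's actual code; the real method also takes ori_lens and a reverse flag, which presumably undoes the grouping):

```python
import tensorflow as tf

def pad_and_reshape(outputs, reduction_factor=2):
    # Pad y_steps up to a multiple of reduction_factor, then group
    # reduction_factor consecutive frames into one decoder step.
    batch = tf.shape(outputs)[0]
    y_steps = tf.shape(outputs)[1]
    feat_dim = outputs.shape[-1]
    pad = tf.math.floormod(-y_steps, reduction_factor)
    padded = tf.pad(outputs, [[0, 0], [0, pad], [0, 0]])
    return tf.reshape(padded, [batch, -1, feat_dim * reduction_factor])

# [4, 5, 80] -> [4, 3, 160] with reduction_factor=2
print(pad_and_reshape(tf.random.normal([4, 5, 80])).shape)
```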
- call(samples, training: bool = None)
Run the forward pass of the model.
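A hypothetical forward-pass usage, assuming the common Athena convention of a samples dict keyed by "input"/"output" and their lengths (the exact keys come from data_descriptions and may differ):

```python
model = Tacotron2(data_descriptions, config=None)  # config=None falls back to default_config

# Key names below are assumptions, not confirmed by this page.
samples = {
    "input": text_ids,          # [batch, x_steps] token IDs
    "input_length": text_lens,  # [batch]
    "output": mel_targets,      # [batch, y_steps, feat_dim]
    "output_length": mel_lens,  # [batch]
}
outputs = model(samples, training=True)  # teacher-forced forward pass
```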
- initialize_input_y(y)
- Parameters
y – the true labels, shape: [batch, y_steps, feat_dim]
Returns:
y0: y with one all-zero frame padded at the start as the initial step, shape: [batch, y_steps + 1, feat_dim]
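A minimal sketch of that padding (illustrative, not the class's actual code):

```python
import tensorflow as tf

def initialize_input_y(y):
    # Prepend an all-zero "go" frame so decoding at step t can
    # condition on the output of step t - 1.
    go_frame = tf.zeros_like(y[:, :1, :])     # [batch, 1, feat_dim]
    return tf.concat([go_frame, y], axis=1)   # [batch, y_steps + 1, feat_dim]
```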
- initialize_states(encoder_output, input_length)
- Parameters
encoder_output – encoder outputs, shape: [batch, x_steps, eunits]
input_length – shape: [batch]
Returns:
prev_rnn_states: initial states of the decoder RNNs, shape: [rnn_layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, shape: [batch, x_steps]
prev_context: initial context vector, shape: [batch, eunits]
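A sketch of what zero-initialized versions of these tensors look like (zero initialization is an assumption; num_layers, dunits, and eunits stand in for the decoder configuration):

```python
import tensorflow as tf

num_layers, dunits, eunits = 2, 1024, 512             # illustrative sizes
encoder_output = tf.random.normal([8, 100, eunits])   # [batch, x_steps, eunits]

batch, x_steps = tf.shape(encoder_output)[0], tf.shape(encoder_output)[1]
prev_rnn_states = tf.zeros([num_layers, 2, batch, dunits])  # (h, c) per LSTM layer
prev_attn_weight = tf.zeros([batch, x_steps])               # no alignment yet
prev_context = tf.zeros([batch, eunits])                    # empty context vector
```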
- concat_speaker_embedding(encoder_output, speaker_embedding)
- Parameters
encoder_output – encoder output, shape: (batch, x_steps, eunits)
speaker_embedding – speaker embedding, shape: (batch, embedding_dim)
Returns:
the concatenation of encoder_output and speaker_embedding, shape: (batch, x_steps, eunits + embedding_dim)
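Given those shapes, the utterance-level embedding is presumably tiled across the time axis before concatenation; a minimal sketch:

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_output, speaker_embedding):
    # Repeat the per-utterance speaker embedding at every encoder step,
    # then concatenate along the feature axis.
    x_steps = tf.shape(encoder_output)[1]
    tiled = tf.tile(speaker_embedding[:, tf.newaxis, :], [1, x_steps, 1])
    return tf.concat([encoder_output, tiled], axis=-1)  # (batch, x_steps, eunits + embedding_dim)
```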
- time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
- Parameters
encoder_output – encoder output, shape: (batch, x_steps, eunits)
input_length – shape: (batch,)
prev_y – one step of true labels or predicted labels, shape: (batch, feat_dim)
prev_rnn_states – previous RNN states of the LSTM layers, shape: [layers, 2, states]
accum_attn_weight – accumulated attention weights, shape: [batch, x_steps]
prev_attn_weight – previous attention weights, shape: [batch, x_steps]
prev_context – previous context vector, shape: [batch, attn_dim]
training – whether in training mode
Returns:
out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: shape: [rnn_layers, 2, batch, dunits]
attn_weight: shape: [batch, x_steps]
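A hedged sketch of how a caller could drive one decoding step with the four documented return values, continuing from the initialize_states sketch above (the real method may also return an updated context vector; check the source):

```python
out, logit, prev_rnn_states, attn_weight = model.time_propagate(
    encoder_output, input_length, prev_y, prev_rnn_states,
    accum_attn_weight, prev_attn_weight, prev_context, training=False)
accum_attn_weight += attn_weight  # cumulative weights for location-sensitive attention
prev_attn_weight = attn_weight
prev_y = out                      # feed the prediction back as the next input frame
```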
- get_loss(outputs, samples, training=None)
Compute the loss from the model outputs and the ground-truth samples.
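Combined with call, a minimal training-step sketch (the optimizer choice is an assumption, and get_loss is assumed to return a scalar loss; adapt if it returns extra values):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

with tf.GradientTape() as tape:
    outputs = model(samples, training=True)
    loss = model.get_loss(outputs, samples, training=True)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```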
- synthesize(samples)
Synthesize acoustic features from the input texts.
- Parameters
samples – the data source to be synthesized
Returns:
after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
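An inference sketch (the contents of samples depend on data_descriptions, so the indexing is illustrative):

```python
after_outs, attn_weights_stack = model.synthesize(samples)
mel = after_outs[0].numpy()                # acoustic features for the first utterance
alignment = attn_weights_stack[0].numpy()  # attention alignment, e.g. for plotting
```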
- _synthesize_post_net(before_outs, logits_stack)
- Parameters
before_outs – the outputs before the postnet
logits_stack – the stop-token logits of all decoding steps
Returns:
after_outs: the synthesized acoustic features refined by the postnet
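In the standard Tacotron2 design the post-net adds a residual correction and the stop-token logits bound the usable length; a hedged sketch of that step (postnet here is a hypothetical callable, not necessarily how this method is wired):

```python
import tensorflow as tf

def synthesize_post_net(before_outs, logits_stack, postnet):
    # Residual refinement: the post-net predicts a correction that is
    # added on top of the coarse decoder outputs.
    after_outs = before_outs + postnet(before_outs)
    # Stop-token probabilities; a caller could truncate the output after
    # the first step whose probability crosses a threshold.
    stop_probs = tf.sigmoid(logits_stack)
    return after_outs, stop_probs
```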