athena.models.tts.tacotron2

tacotron2 implementation

Module Contents

Classes

Tacotron2

An implementation of Tacotron2

class athena.models.tts.tacotron2.Tacotron2(data_descriptions, config=None)

Bases: athena.models.base.BaseModel

An implementation of Tacotron2. Reference: "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" (Shen et al., 2018)

default_config

_pad_and_reshape(outputs, ori_lens, reverse=False)
Parameters
  • outputs – true labels, shape: [batch, y_steps, feat_dim]

  • ori_lens – the original length of the labels (a scalar)

Returns:

reshaped_outputs: the outputs reshaped so that every reduction_factor frames form one decoder step,
    shape: [batch, y_steps / reduction_factor, feat_dim * reduction_factor]
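
For intuition, this frame grouping can be sketched with plain TensorFlow ops (an illustration assuming statically known shapes, not the library code itself):

    import tensorflow as tf

    def pad_and_reshape(outputs, reduction_factor):
        """[batch, y_steps, feat_dim] -> [batch, ceil(y_steps / r), feat_dim * r]."""
        batch, y_steps, feat_dim = outputs.shape              # assumes static shapes
        # zero-pad the time axis so it becomes divisible by reduction_factor
        pad_len = (reduction_factor - y_steps % reduction_factor) % reduction_factor
        outputs = tf.pad(outputs, [[0, 0], [0, pad_len], [0, 0]])
        # fold every reduction_factor consecutive frames into the feature axis
        return tf.reshape(outputs, [batch, -1, feat_dim * reduction_factor])

    frames = tf.random.normal([2, 10, 80])        # batch=2, y_steps=10, feat_dim=80
    grouped = pad_and_reshape(frames, reduction_factor=3)
    print(grouped.shape)                          # (2, 4, 240)
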
call(samples, training: bool = None)

Run the forward pass of the model on a batch of samples.

initialize_input_y(y)
Parameters

y – the true label, shape: [batch, y_steps, feat_dim]

Returns:

y0: y with one step of zeros padded at the start (the initial go-frame),
    shape: [batch, y_steps+1, feat_dim]
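
A minimal sketch of the go-frame padding with a single tf.pad (illustrative; the library implementation may differ in details):

    import tensorflow as tf

    def initialize_input_y(y):
        """Prepend one all-zero frame so that step t is predicted from frame t-1."""
        # pad a single step of zeros at the front of the time axis
        return tf.pad(y, [[0, 0], [1, 0], [0, 0]])

    y = tf.random.normal([2, 5, 80])     # [batch, y_steps, feat_dim]
    y0 = initialize_input_y(y)
    print(y0.shape)                      # (2, 6, 80)
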
initialize_states(encoder_output, input_length)
Parameters
  • encoder_output – encoder outputs, shape: [batch, x_steps, eunits]

  • input_length – shape: [batch]

Returns:

prev_rnn_states: initial states of rnns in decoder
    [rnn layers, 2, batch, dunits]
prev_attn_weight: initial attention weights, [batch, x_steps]
prev_context: initial context, [batch, eunits]
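
A shape-level sketch of such an initialization (num_rnn_layers, dunits and the uniform attention initialization below are illustrative assumptions, not the library defaults):

    import tensorflow as tf

    def initialize_states(encoder_output, input_length, num_rnn_layers=2, dunits=1024):
        """Shape-level illustration of the decoder's initial state tensors."""
        batch, x_steps, eunits = encoder_output.shape          # assumes static shapes
        # one (h, c) pair of zeros per decoder LSTM layer
        prev_rnn_states = tf.zeros([num_rnn_layers, 2, batch, dunits])
        # attention spread uniformly over the valid (unpadded) encoder frames;
        # the library may initialize these weights differently
        mask = tf.sequence_mask(input_length, x_steps, dtype=tf.float32)
        prev_attn_weight = mask / tf.cast(input_length, tf.float32)[:, None]
        prev_context = tf.zeros([batch, eunits])                # empty context vector
        return prev_rnn_states, prev_attn_weight, prev_context

    enc = tf.random.normal([2, 7, 512])              # [batch, x_steps, eunits]
    lens = tf.constant([7, 5])
    states, attn, ctx = initialize_states(enc, lens)
    print(states.shape, attn.shape, ctx.shape)       # (2, 2, 2, 1024) (2, 7) (2, 512)
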
concat_speaker_embedding(encoder_output, speaker_embedding)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits)

  • speaker_embedding – speaker embedding (batch, embedding_dim)

Returns:

the concat result of encoder_output and speaker_embedding (batch, x_steps, eunits+embedding_dim)
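
The tile-and-concatenate step can be sketched as follows (illustrative only, assuming statically known shapes):

    import tensorflow as tf

    def concat_speaker_embedding(encoder_output, speaker_embedding):
        """Repeat the utterance-level speaker embedding over time and append it
        to every encoder frame."""
        x_steps = encoder_output.shape[1]             # assumes a static length
        tiled = tf.tile(speaker_embedding[:, None, :], [1, x_steps, 1])
        return tf.concat([encoder_output, tiled], axis=-1)

    enc = tf.random.normal([2, 7, 512])   # [batch, x_steps, eunits]
    spk = tf.random.normal([2, 64])       # [batch, embedding_dim]
    print(concat_speaker_embedding(enc, spk).shape)   # (2, 7, 576)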

time_propagate(encoder_output, input_length, prev_y, prev_rnn_states, accum_attn_weight, prev_attn_weight, prev_context, training=False)
Parameters
  • encoder_output – encoder output (batch, x_steps, eunits).

  • input_length – (batch,)

  • prev_y – one step of true labels or predicted labels (batch, feat_dim).

  • prev_rnn_states – previous LSTM states of the decoder, shape: [rnn layers, 2, batch, dunits]

  • prev_attn_weight – previous attention weights, shape: [batch, x_steps]

  • prev_context – previous context vector: [batch, attn_dim]

  • training – if it is training mode

Returns:

out: shape: [batch, feat_dim]
logit: shape: [batch, reduction_factor]
current_rnn_states: [rnn_layers, 2, batch, dunits]
attn_weight: [batch, x_steps]
get_loss(outputs, samples, training=None)

Compute the loss from the model outputs and the corresponding samples.

synthesize(samples)

Synthesize acoustic features from the input texts

Parameters

samples – the data source to be synthesized

Returns:

after_outs: the corresponding synthesized acoustic features
attn_weights_stack: the corresponding attention weights
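
Schematically, synthesis runs an autoregressive loop of time_propagate steps. The sketch below only reproduces the bookkeeping of such a loop with a toy stand-in step function; toy_step, max_steps and stop_threshold are illustrative assumptions, not the library's names or defaults:

    import tensorflow as tf

    def toy_step(prev_y, prev_attn_weight, reduction_factor=1):
        """Toy stand-in for time_propagate: reproduces shapes only."""
        batch, feat_dim = prev_y.shape
        out = tf.zeros([batch, feat_dim])
        logit = tf.zeros([batch, reduction_factor])
        return out, logit, prev_attn_weight

    def greedy_decode(batch=1, x_steps=7, feat_dim=80, max_steps=50, stop_threshold=0.5):
        prev_y = tf.zeros([batch, feat_dim])           # the initial go-frame
        prev_attn_weight = tf.zeros([batch, x_steps])
        accum_attn_weight = tf.zeros([batch, x_steps])
        outs, attn_weights = [], []
        for _ in range(max_steps):
            out, logit, attn_weight = toy_step(prev_y, prev_attn_weight)
            outs.append(out)
            attn_weights.append(attn_weight)
            # location-sensitive attention keeps a running sum of past alignments
            accum_attn_weight += attn_weight
            prev_y, prev_attn_weight = out, attn_weight
            # stop once every stop-token probability crosses the threshold
            if tf.reduce_all(tf.sigmoid(logit) > stop_threshold):
                break
        return tf.stack(outs, axis=1), tf.stack(attn_weights, axis=1)

    before_outs, attn_stack = greedy_decode()
    print(before_outs.shape, attn_stack.shape)          # (1, 50, 80) (1, 50, 7)

The real loop also threads prev_rnn_states, prev_context and accum_attn_weight into every time_propagate call; they are omitted from toy_step for brevity.
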
_synthesize_post_net(before_outs, logits_stack)
Parameters
  • before_outs – the outputs before postnet

  • logits_stack – the stop-token logits of all decoder steps

Returns:

after_outs: the corresponding synthesized acoustic features
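
The post net follows the standard Tacotron2 convention of predicting a residual on top of the decoder output, while the stop-token logits indicate where the valid output ends. A hedged sketch of that idea (the toy postnet, threshold and length logic are illustrative assumptions, not the library's exact behavior):

    import tensorflow as tf

    def synthesize_post_net(before_outs, logits_stack, postnet, stop_threshold=0.5):
        """Apply the residual postnet and locate the first predicted stop token."""
        # residual refinement, the standard Tacotron2 convention
        after_outs = before_outs + postnet(before_outs)
        # index of the first frame whose stop probability exceeds the threshold;
        # a full implementation would fall back to the maximum length when the
        # threshold is never crossed
        probs = tf.sigmoid(logits_stack)                      # [batch, steps]
        stopped = tf.cast(probs > stop_threshold, tf.int32)
        out_len = tf.argmax(stopped, axis=1, output_type=tf.int32)
        return after_outs, out_len

    # a single conv layer stands in for the real multi-layer postnet
    postnet = tf.keras.layers.Conv1D(80, kernel_size=5, padding="same")
    before = tf.random.normal([1, 20, 80])
    logits = tf.random.normal([1, 20])
    after, length = synthesize_post_net(before, logits, postnet)
    print(after.shape, length.numpy())    # (1, 20, 80)  e.g. [3]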