athena.layers.attention

Attention layers.

Module Contents

Classes

ScaledDotProductAttention

Calculate the attention weights.

MultiHeadAttention

Multi-head attention consists of four parts:

BahdanauAttention

The Bahdanau attention.

HanAttention

Refer to [Hierarchical Attention Networks for Document Classification]

MatchAttention

Refer to [Learning Natural Language Inference with LSTM]

LocationAttention

Location-aware attention

StepwiseMonotonicAttention

Stepwise monotonic attention

class athena.layers.attention.ScaledDotProductAttention(unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Calculate the attention weights. q, k, v must have matching leading dimensions. k, v must have a matching penultimate dimension, i.e. seq_len_k = seq_len_v. The mask has different shapes depending on its type (padding or look-ahead), but it must be broadcastable for addition.

Parameters
  • q – query shape == (…, seq_len_q, depth)

  • k – key shape == (…, seq_len_k, depth)

  • v – value shape == (…, seq_len_v, depth_v)

  • mask – Float tensor with shape broadcastable to (…, seq_len_q, seq_len_k). Defaults to None.

Returns

output, attention_weights

call(q, k, v, mask)

Compute the scaled dot-product attention for the given query, key, value, and mask.
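
The layer implements the standard scaled dot-product attention, softmax(q k^T / sqrt(d_k) + mask) v. A minimal sketch of that formula in plain TensorFlow, for illustration only (the additive mask value of -1e9 is an assumption, not necessarily what this layer uses):

    import tensorflow as tf

    def scaled_dot_product_attention_sketch(q, k, v, mask=None):
        # similarity of every query with every key: (..., seq_len_q, seq_len_k)
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        # scale by sqrt(depth) so dot products do not grow with dimensionality
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            # convention assumed here: mask == 1 marks positions to be masked out
            scaled_logits += mask * -1e9
        attention_weights = tf.nn.softmax(scaled_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
        output = tf.matmul(attention_weights, v)                   # (..., seq_len_q, depth_v)
        return output, attention_weights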

class athena.layers.attention.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)

Bases: tensorflow.keras.layers.Layer

Multi-head attention consists of four parts:

  • Linear layers and split into heads.

  • Scaled dot-product attention.

  • Concatenation of heads.

  • Final linear layer.

Each multi-head attention block gets three inputs: Q (query), K (key), and V (value). These are put through linear (Dense) layers and split into multiple heads. The scaled_dot_product_attention defined above is applied to each head (broadcast for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose and tf.reshape) and put through a final Dense layer.

Instead of a single attention head, Q, K, and V are split into multiple heads because this allows the model to jointly attend to information at different positions from different representational spaces. After the split, each head has a reduced dimensionality, so the total computation cost is the same as that of single-head attention with full dimensionality. A sketch of the head split and recombination is given below.
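
A minimal sketch of the head split and recombination described above, assuming d_model is divisible by num_heads (the helper names are illustrative, not the layer's actual methods):

    import tensorflow as tf

    def split_heads_sketch(x, batch_size, num_heads, depth):
        # (batch, seq_len, d_model) -> (batch, seq_len, num_heads, depth)
        x = tf.reshape(x, (batch_size, -1, num_heads, depth))
        # -> (batch, num_heads, seq_len, depth) so attention runs per head
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def combine_heads_sketch(x, batch_size, d_model):
        # (batch, num_heads, seq_len_q, depth) -> (batch, seq_len_q, num_heads, depth)
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        # concatenate the heads back into the model dimension
        return tf.reshape(x, (batch_size, -1, d_model))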

split_heads(x, batch_size)

Split the last dimension into (num_heads, depth).

Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)

call(v, k, q, mask)

Apply multi-head attention to the given value, key, query, and mask.

class athena.layers.attention.BahdanauAttention(units, input_dim=1024)

Bases: tensorflow.keras.Model

The Bahdanau attention.

call(query, values)

Compute the attention over values for the given query.
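
For reference, Bahdanau (additive) attention scores each value against the query with a small feed-forward network and returns the softmax-weighted sum of the values. A minimal sketch, assuming W1 and W2 are Dense(units) projections and v is a Dense(1) layer; this illustrates the standard formulation rather than this class's exact implementation:

    import tensorflow as tf

    def bahdanau_attention_sketch(query, values, W1, W2, v):
        # query: (batch, dunits); values: (batch, steps, eunits)
        # W1, W2: Dense(units) projections; v: Dense(1) scoring layer
        query_with_time = tf.expand_dims(query, 1)                     # (batch, 1, dunits)
        score = v(tf.nn.tanh(W1(query_with_time) + W2(values)))        # (batch, steps, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(attention_weights * values, axis=1)    # (batch, eunits)
        return context, attention_weights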

class athena.layers.attention.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Hierarchical Attention Networks for Document Classification] (https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, features)
build(input_shape)

Build the layer weights (Keras build method).

call(inputs, training=None, mask=None)

Call function of the Keras layer.

compute_output_shape(input_shape)

Compute the output shape from the input shape.

_masked_softmax(logits, mask, axis)

Compute softmax with input mask.
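
A masked softmax typically pushes the logits of padded positions toward negative infinity before normalizing, so they receive near-zero weight. A minimal sketch, assuming mask is 1 for valid positions and 0 for padding (the layer's exact convention may differ):

    import tensorflow as tf

    def masked_softmax_sketch(logits, mask, axis=-1):
        if mask is not None:
            # padded positions (mask == 0) get a large negative logit
            logits += (1.0 - tf.cast(mask, logits.dtype)) * -1e9
        return tf.nn.softmax(logits, axis=axis)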

class athena.layers.attention.MatchAttention(config, **kwargs)

Bases: tensorflow.keras.layers.Layer

Refer to [Learning Natural Language Inference with LSTM] (https://www.aclweb.org/anthology/N16-1170)

>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, steps, features)
call(tensors)

Attention layer.

class athena.layers.attention.LocationAttention(attn_dim, conv_channel, aconv_filts, scaling=1.0)

Bases: tensorflow.keras.layers.Layer

Location-aware attention

Reference: Attention-Based Models for Speech Recognition (https://arxiv.org/pdf/1506.07503.pdf)

compute_score(value, value_length, query, accum_attn_weight)
Parameters
  • value – encoder outputs, shape: [batch, x_steps, eunits]

  • value_length – the length of value, shape: [batch]

  • query – previous rnn state, shape: [batch, dunits]

  • accum_attn_weight – accumulated attention weights, shape: [batch, x_steps]

Returns

attention scores over the encoder steps, shape: [batch, x_steps]

Return type

attn_weight
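
The location-aware score of the referenced paper extends additive attention with convolutional features of the accumulated alignment: f = F * accum_attn_weight (a 1-D convolution) and e = w^T tanh(W q + V value + U f). A sketch of that formula, where W_query, W_value, W_location stand in for Dense(attn_dim) projections, conv for a Conv1D(conv_channel, ..., padding='same') layer, and w for a Dense(1) layer; these names are illustrative, not the layer's actual attributes:

    import tensorflow as tf

    def location_score_sketch(value, query, accum_attn_weight,
                              W_query, W_value, W_location, conv, w):
        # convolutional features of the alignment history:
        # (batch, x_steps) -> (batch, x_steps, conv_channel)
        f = conv(tf.expand_dims(accum_attn_weight, -1))
        # additive energy with a location term; shapes broadcast to (batch, x_steps, attn_dim)
        energy = w(tf.nn.tanh(W_query(tf.expand_dims(query, 1))
                              + W_value(value) + W_location(f)))
        return tf.squeeze(energy, -1)    # scores, (batch, x_steps)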

initialize_weights(value_length, max_len)
Parameters
  • value_length – the length of value, shape: [batch]

  • max_len – the maximum length

Returns

weights initialized to a uniform distribution, shape: [batch, max_len]

Return type

initialized_weights

call(attn_inputs, prev_states, training=True)
Parameters
  • attn_inputs (tuple) – contains two elements: value, shape: [batch, x_steps, eunits]; value_length, shape: [batch]

  • prev_states (tuple) – contains three elements: query (previous rnn state), shape: [batch, dunits]; accum_attn_weight (previous accumulated attention weights), shape: [batch, x_steps]; prev_attn_weight (previous attention weights), shape: [batch, x_steps]

  • training – whether this is a training step

Returns

attn_c: attended vector, shape: [batch, eunits]; attn_weight: attention scores, shape: [batch, x_steps]

Return type

(attn_c, attn_weight)
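
A hedged usage sketch of the stateful interface documented above. The constructor values are arbitrary, the (attn_c, attn_weight) unpacking follows the Returns description, and the caller-side running sum for accum_attn_weight is an assumption about how the decoder maintains the alignment history:

    import tensorflow as tf
    from athena.layers.attention import LocationAttention

    batch, x_steps, eunits, dunits = 2, 50, 256, 512
    value = tf.random.normal([batch, x_steps, eunits])        # encoder outputs
    value_length = tf.constant([50, 35])                      # valid lengths per utterance
    attn = LocationAttention(attn_dim=128, conv_channel=32, aconv_filts=15)

    # start from the uniform alignment returned by initialize_weights
    prev_attn_weight = attn.initialize_weights(value_length, x_steps)
    accum_attn_weight = prev_attn_weight

    for _ in range(3):                                        # a few decoder steps
        query = tf.random.normal([batch, dunits])             # previous rnn state
        attn_c, attn_weight = attn((value, value_length),
                                   (query, accum_attn_weight, prev_attn_weight))
        accum_attn_weight += attn_weight                      # running sum of alignments
        prev_attn_weight = attn_weight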

class athena.layers.attention.StepwiseMonotonicAttention(attn_dim, conv_channel, aconv_filts, sigmoid_noise=2.0, score_bias_init=0.0, mode='soft')

Bases: LocationAttention

Stepwise monotonic attention

Reference: Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS (https://arxiv.org/pdf/1906.00672.pdf)

build(_)

A modified energy function is used and its parameters are defined here. Reference: Online and Linear-Time Attention by Enforcing Monotonic Alignments (https://arxiv.org/pdf/1704.00784.pdf).

initialize_weights(value_length, max_len)
Parameters
  • value_length – the length of value, shape: [batch]

  • max_len – the maximum length

Returns

weights initialized to a Dirac (one-hot) distribution, shape: [batch, max_len]

Return type

initialized_weights

Examples

An initialized_weights of shape [2, 4]:

>>> [[1, 0, 0, 0],
>>> [1, 0, 0, 0]]
step_monotonic_function(sigmoid_probs, prev_weights)

Hard mode can only be used in the synthesis step.

Parameters
  • sigmoid_probs – sigmoid probabilities, shape: [batch, x_steps]

  • prev_weights – previous attention weights, shape: [batch, x_steps]

Returns

new attention weights, shape: [batch, x_steps]

Return type

weights
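
The stepwise monotonic recursion from the referenced paper lets each unit of attention mass either stay at its encoder position or advance by exactly one step: alpha[i, j] = alpha[i-1, j] * p[i, j] + alpha[i-1, j-1] * (1 - p[i, j-1]). A vectorized sketch of that update, assuming sigmoid_probs holds the per-position probability of staying; this illustrates the equation, not necessarily the layer's exact code:

    import tensorflow as tf

    def stepwise_monotonic_update_sketch(sigmoid_probs, prev_weights):
        # mass that stays at position j
        stay = prev_weights * sigmoid_probs
        # mass at position j-1 that moves forward to position j
        move = prev_weights[:, :-1] * (1.0 - sigmoid_probs[:, :-1])
        pad = tf.zeros([tf.shape(prev_weights)[0], 1], dtype=prev_weights.dtype)
        return stay + tf.concat([pad, move], axis=1)    # (batch, x_steps)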

call(attn_inputs, prev_states, training=True)
Parameters
  • attn_inputs (tuple) – contains two elements: value, shape: [batch, x_steps, eunits]; value_length, shape: [batch]

  • prev_states (tuple) – contains three elements: query (previous rnn state), shape: [batch, dunits]; accum_attn_weight (previous accumulated attention weights), shape: [batch, x_steps]; prev_attn_weight (previous attention weights), shape: [batch, x_steps]

  • training – whether this is a training step

Returns

attn_c: attended vector, shape: [batch, eunits]; attn_weight: attention scores, shape: [batch, x_steps]

Return type

(attn_c, attn_weight)