athena.layers.attention¶
Attention layers.
Module Contents¶
Classes¶
| Class | Description |
| --- | --- |
| ScaledDotProductAttention | Calculate the attention weights. |
| MultiHeadAttention | Multi-head attention consists of four parts. |
| BahdanauAttention | The Bahdanau attention. |
| HanAttention | Refer to [Hierarchical Attention Networks for Document Classification]. |
| MatchAttention | Refer to [Learning Natural Language Inference with LSTM]. |
| LocationAttention | Location-aware attention. |
| StepwiseMonotonicAttention | Stepwise monotonic attention. |
- class athena.layers.attention.ScaledDotProductAttention(unidirectional=False, look_ahead=0)¶
Bases:
tensorflow.keras.layers.Layer
Calculate the attention weights. q, k, v must have matching leading dimensions. k, v must have matching penultimate dimensions, i.e. seq_len_k = seq_len_v. The mask has different shapes depending on its type (padding or look ahead), but it must be broadcastable for addition.
- Parameters
q – query shape == (…, seq_len_q, depth)
k – key shape == (…, seq_len_k, depth)
v – value shape == (…, seq_len_v, depth_v)
mask – Float tensor with shape broadcastable to (…, seq_len_q, seq_len_k). Defaults to None.
- Returns
output, attention_weights
- call(q, k, v, mask)¶
Compute the scaled dot-product attention output and attention weights.
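A minimal sketch of the computation described above, assuming the mask marks positions to ignore with 1 (the standard Transformer convention); this is an illustration rather than the exact Athena implementation, which additionally supports the unidirectional and look_ahead options:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(q k^T / sqrt(d_k)) v, returning (output, attention_weights)."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)              # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # assumed convention: mask == 1 at padded / disallowed positions
        scaled_logits += mask * -1e9
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)                   # (..., seq_len_q, depth_v)
    return output, attention_weights
```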
- class athena.layers.attention.MultiHeadAttention(d_model, num_heads, unidirectional=False, look_ahead=0)¶
Bases:
tensorflow.keras.layers.Layer
Multi-head attention consists of four parts:
- Linear layers and split into heads.
- Scaled dot-product attention.
- Concatenation of heads.
- Final linear layer.
Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (Dense) layers and split up into multiple heads. The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated (using tf.transpose, and tf.reshape) and put through a final Dense layer.
Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information at different positions from different representational spaces. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.
- split_heads(x, batch_size)¶
Split the last dimension into (num_heads, depth).
Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth).
- call(v, k, q, mask)¶
Apply multi-head attention to the query, key, and value inputs with the given mask.
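A condensed sketch of the four parts listed above (attribute names are hypothetical, and the unidirectional/look_ahead handling is omitted):

```python
import tensorflow as tf

class MultiHeadAttentionSketch(tf.keras.layers.Layer):
    """Illustration of: linear + split heads, per-head attention, concat, final linear."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)     # 1) linear layers ...
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)  # 4) final linear layer

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)  # ... and split into heads
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # 2) scaled dot-product attention, broadcast over all heads at once
        logits = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(self.depth, tf.float32))
        if mask is not None:
            logits += mask * -1e9
        attn_weights = tf.nn.softmax(logits, axis=-1)
        attn = tf.matmul(attn_weights, v)             # (batch, num_heads, seq_len_q, depth)
        # 3) concatenation of heads
        attn = tf.transpose(attn, perm=[0, 2, 1, 3])
        concat = tf.reshape(attn, (batch_size, -1, self.num_heads * self.depth))
        return self.dense(concat), attn_weights
```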
- class athena.layers.attention.BahdanauAttention(units, input_dim=1024)¶
Bases:
tensorflow.keras.Model
The Bahdanau attention.
- call(query, values)¶
Compute attention over the values conditioned on the query.
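A minimal sketch of additive (Bahdanau) attention, score = v^T tanh(W1 query + W2 values); the layer names below are hypothetical, and the actual Athena layer (which also takes input_dim) may differ in detail:

```python
import tensorflow as tf

class BahdanauAttentionSketch(tf.keras.Model):
    """Additive attention over a sequence of values given a single query vector."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: (batch, hidden) -> (batch, 1, hidden) so it broadcasts over time steps
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)                    # (batch, steps, 1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)  # (batch, features)
        return context_vector, attention_weights
```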
- class athena.layers.attention.HanAttention(W_regularizer=None, u_regularizer=None, b_regularizer=None, W_constraint=None, u_constraint=None, b_constraint=None, use_bias=True, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, features)
- build(input_shape)¶
Create the layer's weights (Keras build).
- call(inputs, training=None, mask=None)¶
Apply the attention over the time steps (Keras call).
- compute_output_shape(input_shape)¶
compute output shape
- _masked_softmax(logits, mask, axis)¶
Compute softmax with input mask.
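A hedged sketch of what a masked softmax such as _masked_softmax typically computes, normalizing only over unmasked time steps (the exact Athena implementation may differ):

```python
import tensorflow as tf

def masked_softmax(logits, mask, axis=-1):
    """Softmax over `axis`, assigning ~0 probability to positions where mask == 0."""
    if mask is not None:
        mask = tf.cast(mask, logits.dtype)
        logits = logits + (1.0 - mask) * -1e9   # push masked positions toward -inf
    return tf.nn.softmax(logits, axis=axis)
```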
- class athena.layers.attention.MatchAttention(config, **kwargs)¶
Bases:
tensorflow.keras.layers.Layer
Refer to [Learning Natural Language Inference with LSTM](https://www.aclweb.org/anthology/N16-1170)
>>> Input shape: (Batch size, steps, features)
>>> Output shape: (Batch size, steps, features)
- call(tensors)¶
Apply match attention to the input tensors.
- class athena.layers.attention.LocationAttention(attn_dim, conv_channel, aconv_filts, scaling=1.0)¶
Bases:
tensorflow.keras.layers.Layer
Location-aware attention.
Reference: Attention-Based Models for Speech Recognition (https://arxiv.org/pdf/1506.07503.pdf)
- compute_score(value, value_length, query, accum_attn_weight)¶
Compute the location-aware attention scores for the current decoding step.
- Parameters
value – shape: [batch, x_steps, eunits]
value_length – the length of value, shape: [batch]
query – previous rnn state, shape: [batch, dunits]
accum_attn_weight – accumulated attention weights, shape: [batch, x_steps]
- Returns
attention scores, shape: [batch, x_steps]
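A sketch of the location-aware energy behind compute_score, following the referenced paper: the accumulated attention weights are run through a 1-D convolution and combined with projections of the query and value in an additive energy function. Layer names and the kernel size are assumptions, not the exact Athena code:

```python
import tensorflow as tf

class LocationScoreSketch(tf.keras.layers.Layer):
    """e_j = w^T tanh(W q + V h_j + U f_j), where f = conv1d(accum_attn_weight)."""
    def __init__(self, attn_dim, conv_channel, aconv_filts):
        super().__init__()
        self.query_layer = tf.keras.layers.Dense(attn_dim, use_bias=False)
        self.value_layer = tf.keras.layers.Dense(attn_dim, use_bias=False)
        self.location_conv = tf.keras.layers.Conv1D(
            conv_channel, kernel_size=2 * aconv_filts + 1, padding="same", use_bias=False)
        self.location_layer = tf.keras.layers.Dense(attn_dim, use_bias=False)
        self.energy = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, value, query, accum_attn_weight):
        # value: [batch, x_steps, eunits], query: [batch, dunits], accum: [batch, x_steps]
        location_feats = self.location_conv(tf.expand_dims(accum_attn_weight, -1))
        energy = self.energy(tf.nn.tanh(
            self.query_layer(tf.expand_dims(query, 1))   # broadcast the query over x_steps
            + self.value_layer(value)
            + self.location_layer(location_feats)))
        return tf.squeeze(energy, -1)                    # scores, shape [batch, x_steps]
```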
- initialize_weights(value_length, max_len)¶
- Parameters
value_length – the length of value, shape: [batch]
max_len – the maximum length
- Returns
attention weights initialized to a uniform distribution, shape: [batch, max_len]
- Return type
initialized_weights
- call(attn_inputs, prev_states, training=True)¶
- Parameters
attn_inputs (tuple) – it contains 2 params:
value, shape: [batch, x_steps, eunits]
value_length, shape: [batch]
prev_states (tuple) – it contains 3 params:
query: previous rnn state, shape: [batch, dunits]
accum_attn_weight: previous accumulated attention weights, shape: [batch, x_steps]
prev_attn_weight: previous attention weights, shape: [batch, x_steps]
training – whether this call is in the training step
- Returns
attended vector, shape: [batch, eunits]
attn_weight: attention scores, shape: [batch, x_steps]
- Return type
attn_c
- class athena.layers.attention.StepwiseMonotonicAttention(attn_dim, conv_channel, aconv_filts, sigmoid_noise=2.0, score_bias_init=0.0, mode='soft')¶
Bases:
LocationAttention
Stepwise monotonic attention
Reference: Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS (https://arxiv.org/pdf/1906.00672.pdf)
- build(_)¶
A modified energy function is used and its parameters are defined here. Reference: Online and Linear-Time Attention by Enforcing Monotonic Alignments (https://arxiv.org/pdf/1704.00784.pdf).
- initialize_weights(value_length, max_len)¶
- Parameters
value_length – the length of value, shape: [batch]
max_len – the maximum length
- Returns
attention weights initialized to a Dirac distribution (all mass on the first step), shape: [batch, max_len]
- Return type
initialized_weights
Examples
An initialized_weights tensor of shape [2, 4]:
>>> [[1, 0, 0, 0],
>>>  [1, 0, 0, 0]]
- step_monotonic_function(sigmoid_probs, prev_weights)¶
Apply the stepwise monotonic update; hard mode can only be used in the synthesis step.
- Parameters
sigmoid_probs – sigmoid probabilities, shape: [batch, x_steps]
prev_weights – previous attention weights, shape: [batch, x_steps]
- Returns
new attention weights, shape: [batch, x_steps]
- Return type
weights
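A sketch of the soft stepwise-monotonic update from the referenced paper, where sigmoid_probs is read as the probability of staying at the current position (attention mass moves at most one step forward per decoder step); the Athena implementation may differ in details such as noise handling:

```python
import tensorflow as tf

def stepwise_monotonic_update(sigmoid_probs, prev_weights):
    """weights_j = prev_j * p_j + prev_{j-1} * (1 - p_{j-1}); all shapes [batch, x_steps]."""
    pad = tf.zeros_like(prev_weights[:, :1])
    stay = prev_weights * sigmoid_probs
    move = tf.concat([pad, (prev_weights * (1.0 - sigmoid_probs))[:, :-1]], axis=1)
    return stay + move
```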
- call(attn_inputs, prev_states, training=True)¶
- Parameters
attn_inputs (tuple) – it contains 2 params:
value, shape: [batch, x_steps, eunits]
value_length, shape: [batch]
prev_states (tuple) – it contains 3 params:
query: previous rnn state, shape: [batch, dunits]
accum_attn_weight: previous accumulated attention weights, shape: [batch, x_steps]
prev_attn_weight: previous attention weights, shape: [batch, x_steps]
training – whether this call is in the training step
- Returns
attended vector, shape: [batch, eunits]
attn_weight: attention scores, shape: [batch, x_steps]
- Return type
attn_c