Bahdanau Attention

March 17, 2025 | 9 min read

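Additive attention, also known as Bahdanau attention, is a mechanism used in neural network architectures, particularly in neural machine translation and other sequence-to-sequence models. Dzmitry Bahdanau and colleagues introduced it in their 2015 paper "Neural Machine Translation by Jointly Learning to Align and Translate".

Bahdanau attention was developed mainly to overcome a shortcoming of conventional sequence-to-sequence models, which often make errors on long sequences and struggle to capture long-range dependencies. In machine translation, for example, where input and output sequences can differ in length, such models find it hard to learn meaningful representations and word alignments.

To address this, Bahdanau attention lets the model generate each word of the output sequence while selectively focusing on different parts of the input sequence. With this mechanism, the model determines at every decoding step how important each input token is, so it can concentrate on the relevant information and rely less on a single fixed-length context vector.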

Components of Bahdanau Attention

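Alignment scores: At each decoding step, the model computes an alignment score between every encoder hidden state and the current decoder hidden state. Each score expresses how relevant that encoder hidden state is to the current decoding step.
Attention weights: A softmax function is applied to the alignment scores to obtain the attention weights. The weights are non-negative and sum to one, so they form a probability distribution over the input positions.
Context vector: Finally, the attention weights are used to compute a weighted sum of the encoder hidden states, which serves as the context vector. The context vector carries the relevant input information and is combined with the current decoder state to produce the output.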
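
The three components above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the TensorFlow layer used later in this tutorial: the helper names `matvec`, `softmax`, and `additive_attention`, the toy matrices `W1` and `W2`, and the scoring vector `v` are hypothetical values chosen only for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(query, values, W1, W2, v):
    """Return (context_vector, attention_weights) for one decoding step.

    query  : current decoder hidden state
    values : encoder hidden states, one vector per input position
    W1, W2 : hypothetical projection matrices for the query and the values
    v      : hypothetical scoring vector
    """
    projected_query = matvec(W1, query)
    scores = []
    for h in values:
        projected_h = matvec(W2, h)
        # Alignment score: v . tanh(W1 q + W2 h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_query, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Attention weights: softmax over the input positions
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states
    dim = len(values[0])
    context = [sum(w * h[i] for w, h in zip(weights, values)) for i in range(dim)]
    return context, weights


# Toy example: 3 input positions, hidden size 2, 3 attention units
query = [0.5, -0.2]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W1 = [[0.1, 0.3], [0.2, -0.1], [0.0, 0.4]]
W2 = [[0.2, 0.1], [-0.3, 0.2], [0.1, 0.1]]
v = [0.7, -0.5, 0.3]

context, weights = additive_attention(query, values, W1, W2, v)
print("attention weights:", weights)
print("context vector:", context)
```

The weights sum to one, and the context vector has the same dimensionality as the encoder states.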

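Advantages of Bahdanau Attention

Some of the main advantages of Bahdanau attention are:

During decoding, the model can dynamically focus on different parts of the input sequence. Unlike a plain encoder-decoder that compresses the whole input into one fixed vector, Bahdanau attention computes a separate attention weight for every input element at every decoding step, which gives the model more flexibility in capturing relevant information.
Bahdanau attention improves the quality of the generated output by dynamically focusing on different parts of the input sequence. With this dynamic approach, the model can capture long-range relationships more accurately and align input and output sequences effectively, which improves translation quality and overall performance.
By showing which parts of the input sequence the model attends to at each decoding step, Bahdanau attention improves interpretability. This is useful for debugging, error analysis, and understanding model behavior, and it gives practitioners deeper insight into how the input sequence influences the generated output.
Bahdanau attention handles input sequences of different lengths. Because the attention weights are computed from the content of the input sequence rather than from a preset alignment, the mechanism itself does not require a fixed input length. In batched training, sequences are still usually padded to a common length, as the tokenization code in this tutorial does.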
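
When sequences of different lengths are padded into one batch, a common refinement is to exclude the padded positions from the softmax so they receive zero attention. The sketch below illustrates that idea in plain Python. It is an assumption added for illustration: the attention layer in this tutorial applies softmax over all positions, padding included, and the function name `masked_softmax` is hypothetical.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, ignoring positions where `mask` is 0 (padding)."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one position")
    peak = max(kept)
    # Padded positions contribute exactly zero weight
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Three real tokens followed by two padding positions
scores = [2.0, 1.0, 0.5, 0.0, 0.0]
mask = [1, 1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print(weights)
```

The real tokens share all of the attention mass, and the two padded positions get a weight of exactly zero.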

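Challenges of Bahdanau Attention

Although Bahdanau attention has many advantages for sequence-to-sequence models, it also faces some challenges:

At every decoding step, Bahdanau attention computes a separate attention weight for each element of the input sequence. The cost of each step therefore grows linearly with the input length, and the total cost grows with the product of the input and output lengths. This can lead to longer training and inference times and to scalability problems on long input sequences.
Designing an effective attention mechanism requires careful attention to several choices, including the regularization strategy, parameter initialization, and the form of the attention function. Choosing the right hyperparameters and attention mechanism has a significant effect on the model's convergence and performance.
Bahdanau attention models are prone to overfitting, especially when trained on a limited dataset or when the design is overly complex. Regularization strategies such as weight decay and dropout may be needed to reduce overfitting and improve generalization.
Bahdanau attention models can struggle with tokens that are not in the training vocabulary. Handling out-of-vocabulary (OOV) tokens requires dedicated methods, such as character-level modeling, subword tokenization, or external knowledge sources, to improve the coverage and handling of unseen tokens.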
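
The simplest way to keep a model from failing on unseen words is to reserve one vocabulary entry for unknown tokens. The sketch below shows that fallback in plain Python. The function names `build_vocab` and `encode` and the `<unk>` token are assumptions made for this illustration; the Keras `Tokenizer` used later in this tutorial offers a comparable `oov_token` argument, which the tutorial code does not set.

```python
def build_vocab(sentences, unk_token="<unk>"):
    """Map each word to an integer id; 0 is left free for padding, 1 is the unknown token."""
    vocab = {unk_token: 1}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Convert a sentence to ids, mapping unseen words to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(word, unk_id) for word in sentence.split()]


vocab = build_vocab(["hi , how are you ?", "i am fine ."])
ids = encode("how are they ?", vocab)
print(ids)  # [4, 5, 1, 7]  ("they" was never seen, so it maps to 1)
```

Subword tokenization goes further by splitting rare words into known pieces, but even this fallback avoids a lookup error at inference time.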

To understand it better, we will apply Bahdanau attention to a conversational chatbot.

Code

Importing Libraries

import tensorflow as tf

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time

import warnings
warnings.filterwarnings('ignore')

Preparing the Data

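# Read the tab-separated question/answer pairs and close the file afterwards
with open('../input/simple-dialogs-for-chatbot/dialogs.txt', 'r') as f:
    lines = f.read().split('\n')

# Keep only well-formed lines that contain both a question and an answer
qna_list = [line.split('\t') for line in lines if '\t' in line]

questions = [x[0] for x in qna_list]
answers = [x[1] for x in qna_list]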


print("Question: ", questions[0])
print("Answer: ", answers[0])


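Preprocessing the Sentences

Here we preprocess the text data to prepare it for the later steps, such as tokenization, vectorization, and model training.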

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')


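def preprocess_sentence(w):
    # Lowercase, trim, and strip accents
    w = unicode_to_ascii(w.lower().strip())

    # Put spaces around punctuation so each mark becomes its own token,
    # then collapse runs of spaces and double quotes into a single space
    w = re.sub(r"([?.!,])", r" \1 ", w)
    w = re.sub(r'[" ]+', " ", w)

    # Replace everything except letters and basic punctuation with a space
    w = re.sub(r"[^a-zA-Z?.!,]+", " ", w)
    w = w.strip()

    # Add start and end tokens so the model knows where a sentence begins and ends
    w = '<start> ' + w + ' <end>'
    return w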


print(preprocess_sentence(questions[0]))
print(preprocess_sentence(answers[0]))

pre_questions = [preprocess_sentence(w) for w in questions]
pre_answers = [preprocess_sentence(w) for w in answers]


Tokenization

# Fit a tokenizer on the sentences, convert them to integer sequences, and pad them to equal length
def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

    return tensor, lang_tokenizer


# Load and tokenize the dataset. `data` is a (target sentences, input sentences) tuple;
# the optional `num_examples` limits how many sentence pairs are used.
def load_dataset(data, num_examples=None):
    # Split the tuple into the target and input sentence lists
    targ_lang, inp_lang = data
    if num_examples is not None:
        # Slice each list; slicing the tuple itself would not limit the examples
        targ_lang, inp_lang = targ_lang[:num_examples], inp_lang[:num_examples]

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer


num_examples = 30000
data = pre_answers, pre_questions
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(data, num_examples)

# Calculate the maximum sequence lengths of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]


# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))
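
To make the tokenization step less opaque, the sketch below mimics in plain Python what fitting a tokenizer and post-padding the sequences amounts to. It is a simplified illustration under stated assumptions, not the Keras implementation: the function names `fit_vocab` and `pad_post` are hypothetical, and ids are assigned by descending word frequency with 0 left free for padding.

```python
def fit_vocab(sentences):
    """Assign ids starting at 1, most frequent word first; 0 is left free for padding."""
    counts = {}
    for sentence in sentences:
        for word in sentence.split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ordered)}


def pad_post(sequences, pad_id=0):
    """Append pad_id to each sequence until all have the length of the longest one."""
    longest = max(len(s) for s in sequences)
    return [s + [pad_id] * (longest - len(s)) for s in sequences]


sentences = ["<start> hi <end>", "<start> how are you <end>"]
vocab = fit_vocab(sentences)
sequences = [[vocab[w] for w in s.split()] for s in sentences]
padded = pad_post(sequences)

print(vocab)
print(padded)  # [[1, 3, 2, 0, 0], [1, 4, 5, 6, 2]]
```

Every row of `padded` has the same length, which is what allows the sentences to be stacked into a single tensor.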


Index-to-Word Mapping

# Print each non-padding token id next to the word it maps to in the tokenizer vocabulary
def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))


print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])


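Creating a TensorFlow Dataset

Now we put the dataset into a format suitable for training a neural network in TensorFlow, specifically a sequence-to-sequence model of the kind used for machine translation or text summarization.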

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

# Fetch one batch and print its shapes as a sanity check
example_input_batch, example_target_batch = next(iter(dataset))
print(example_input_batch.shape, example_target_batch.shape)

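Encoder

Now we define the encoder component of the sequence-to-sequence model, which processes the input sequence during both training and inference.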

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))


encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))


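Bahdanau Attention

Now we define a reusable layer that implements the Bahdanau attention mechanism, so that it can easily be integrated into sequence-to-sequence models for a variety of tasks.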

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query hidden state shape == (batch_size, hidden size)
        # query_with_time_axis shape == (batch_size, 1, hidden size)
        # values shape == (batch_size, max_len, hidden size)
        # We are doing this to broadcast addition along the time axis to calculate the score
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1)
        # The last axis is 1 because self.V projects each position to a single score
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights


attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))


Decoder

# This decoder, along with the Bahdanau Attention mechanism, forms the core of a sequence-to-sequence model for tasks like machine translation, where the model takes an input sequence (source language) and generates an output sequence (target language) based on the learned representations.
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights


decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))


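Training

We now build a training pipeline that follows the standard procedure for sequence-to-sequence models, including teacher forcing during training. It minimizes the cross-entropy loss between the predicted and the true target sequences and updates the model parameters by backpropagation. The optimizer, loss function, and training parameters can be adjusted to the requirements of the task and the characteristics of the dataset.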
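
Teacher forcing means that at each step the decoder receives the true previous token as input, rather than its own previous prediction, and is trained to predict the true next token. The sketch below shows how one target sequence breaks down into such pairs. The function name `teacher_forcing_pairs` and the token ids are hypothetical values chosen for this illustration.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder_input, expected_output) pairs for one target sequence."""
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]


# Hypothetical ids: 1 = <start>, 2 = <end>, the rest are ordinary words
target = [1, 7, 4, 9, 2]

pairs = teacher_forcing_pairs(target)
for decoder_input, expected in pairs:
    print("input:", decoder_input, "-> expected:", expected)
```

The first decoder input is the `<start>` token and the last expected output is the `<end>` token, which matches the loop over `range(1, targ.shape[1])` in the training step.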

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)


@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden

        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss


EPOCHS = 40

for epoch in range(1, EPOCHS + 1):
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

    if(epoch % 4 == 0):
        print('Epoch:{:3d} Loss:{:.4f}'.format(epoch,
                                          total_loss / steps_per_epoch))
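
The `loss_function` above zeroes the loss at padded positions and then takes the mean over the whole batch, so padded positions still count in the denominator. A commonly used alternative is to average over the real tokens only. The plain-Python sketch below illustrates that alternative. It is an assumption added for comparison, not what the tutorial code does, and the function name `masked_cross_entropy` and the toy probabilities are hypothetical.

```python
import math


def masked_cross_entropy(target_ids, probs_per_example, pad_id=0):
    """Average -log p(target) over non-padding positions only."""
    total, count = 0.0, 0
    for target, probs in zip(target_ids, probs_per_example):
        if target == pad_id:
            continue  # padded position: contributes to neither the sum nor the count
        total += -math.log(probs[target])
        count += 1
    return total / count if count else 0.0


# One time step for a batch of three examples; the second target is padding
targets = [2, 0, 1]
probs = [
    [0.1, 0.2, 0.7],
    [0.9, 0.05, 0.05],
    [0.2, 0.5, 0.3],
]

loss = masked_cross_entropy(targets, probs)
print(round(loss, 4))
```

With this variant, the loss value does not shrink just because a batch happens to contain many padded positions.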


Evaluation

def remove_tags(sentence):
    return sentence.split("<start>")[-1].split("<end>")[0]

def evaluate(sentence):
    sentence = preprocess_sentence(sentence)

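    # Look up each word's id, skipping words that are not in the training vocabulary
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ') if i in inp_lang.word_index]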
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)

        # flatten this step's attention weights (they could be collected here for plotting)
        attention_weights = tf.reshape(attention_weights, (-1, ))

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return remove_tags(result), remove_tags(sentence)

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return remove_tags(result), remove_tags(sentence)
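
The `evaluate` function uses greedy decoding: at each step it picks the highest-scoring token and feeds it back as the next input until the end token appears or the maximum length is reached. The sketch below isolates that loop in plain Python. The names `greedy_decode` and `toy_step` are hypothetical, and `toy_step` is only a stand-in for the trained decoder.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Repeatedly pick the highest-scoring token until end_id or max_len is reached."""
    output, current, state = [], start_id, None
    for _ in range(max_len):
        scores, state = step_fn(current, state)
        # argmax over the vocabulary
        current = max(range(len(scores)), key=scores.__getitem__)
        if current == end_id:
            break
        output.append(current)
    return output


def toy_step(token_id, state):
    """Hypothetical stand-in for the decoder: always favors token_id + 1."""
    scores = [0.0] * 5
    scores[min(token_id + 1, 4)] = 1.0
    return scores, state


decoded = greedy_decode(toy_step, start_id=0, end_id=4, max_len=10)
print(decoded)  # [1, 2, 3]
```

Greedy decoding is simple and fast, but it commits to one token per step; beam search is a common alternative when output quality matters more than speed.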

Asking Questions

def ask(sentence):
    result, sentence = evaluate(sentence)

    print('Question: {}'.format(sentence))
    print('Predicted answer: {}'.format(result))


ask(questions[1])

