Seq2Seq Model in Machine Learning

June 24, 2025 | 11 min read

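Sequence-to-sequence (seq2seq) models have become a standard deep learning approach for tasks whose data is inherently sequential, such as machine translation, text summarization, and speech recognition. The most common seq2seq design is an encoder-decoder architecture: one recurrent neural network (RNN), the encoder, reads the input sequence, and a second RNN, the decoder, produces the output sequence, which may differ from the input in length. The two are linked by a context vector, a fixed-size abstract representation of the input that the encoder produces and the decoder conditions its output on.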

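The encoder's only job is to compress the input into the context vector, summarizing the whole sequence in a single fixed-size representation. This final hidden state is passed to the decoder, which generates the target sequence one token at a time. Because the context vector initializes the decoder, decoding starts from a state that already reflects the input. The decoder has no direct access to the input tokens, so it predicts each word from the context vector and its own hidden state alone. Teacher forcing helps during training: with some probability, the decoder is fed the ground-truth target token as its next input instead of its own previous prediction, so that early mistakes do not compound along the sequence. At inference time there is no ground truth, so the decoder feeds back its own predictions until it emits the end-of-sequence token. During training, the predictions are compared against the actual target sequence, and the resulting loss is used to update the model's parameters.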
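To make the teacher forcing idea concrete, here is a minimal sketch in plain Python. The helper `decoder_inputs` and the made-up `predictions` list are illustrative assumptions, not part of the model built later; the sketch only shows which token would be fed to the decoder at each step.

```python
import random

def decoder_inputs(target, predictions, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at each step.

    target:      ground-truth tokens, starting with <sos>
    predictions: what a (hypothetical) model predicted at each step
    """
    inputs = [target[0]]  # the first input is always <sos>
    for t in range(1, len(target) - 1):
        use_truth = rng.random() < teacher_forcing_ratio
        # Teacher forcing feeds the ground-truth token; otherwise the
        # model's own previous prediction is fed back in.
        inputs.append(target[t] if use_truth else predictions[t - 1])
    return inputs

target = ["<sos>", "miêu_tả", "khoa_học", "đầu_tiên", "<eos>"]
predictions = ["miêu_tả", "khoa", "đầu_tiên", "<eos>"]  # made-up model outputs

rng = random.Random(0)
always = decoder_inputs(target, predictions, 1.0, rng)  # full teacher forcing
never = decoder_inputs(target, predictions, 0.0, rng)   # free running, as at inference
print(always)
print(never)
```

With a ratio of 1.0 the decoder never sees its own mistake ("khoa"); with a ratio of 0.0 the mistake is fed back in, which is exactly the situation the model faces at inference time.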

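We will implement the model in PyTorch and use TorchText for the preprocessing. Tokenization in the code below is handled by pyvi's `ViTokenizer`, since the dataset is Vietnamese.

Code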

import os
for name_dir, _, name_files in os.walk('/kaggle/input'):
    for name_f in name_files:
        print(os.path.join(name_dir, name_f))


Code

import torch
import torch.nn as nn
import torch.optim as optim

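# Field, BucketIterator and TabularDataset are part of the legacy TorchText API
# (moved to torchtext.legacy in 0.9 and removed in 0.12), so an older release is needed.
from torchtext.datasets import TranslationDataset
from torchtext.data import Field, BucketIterator, TabularDataset, Example, Dataset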
import math
import time
!pip install pyvi
from pyvi import ViTokenizer, ViUtils
import dill
import pickle
import random


SEED = 1212

# Setting the random seeds for reproducible results
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Next, we implement the tokenizer function that is passed to TorchText.

def tknizer(text):
    # ViTokenizer joins the syllables of a multi-syllable word with underscores
    text = ViTokenizer.tokenize(text)
    return text.split()


# `init_token` and `eos_token` are the keyword names of the legacy Field API.
the_SRC = Field(tokenize=tknizer,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)

the_TRG = Field(tokenize=tknizer,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)

# Next, we load the training and validation data from the CSV files.
fields_data = [('vi_no_accents', the_SRC), ('vi', the_TRG)]
data_train, data_val = TabularDataset.splits(path='/kaggle/input/tone-prediction', train='train_40k.csv',
                        validation='test_40k.csv', format='csv', fields=fields_data, skip_header=True)

# We can confirm that the correct quantity of examples has been loaded.
print(f"Number of training examples: {len(data_train.examples)}")
print(f"Number of validation examples: {len(data_val.examples)}")
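As a rough sketch of what a field configured this way does to one sentence, the snippet below tokenizes, lowercases, and wraps the tokens in `<sos>`/`<eos>`. The `preprocess` helper is hypothetical, and whitespace splitting stands in for `ViTokenizer` so that the example runs without pyvi.

```python
def preprocess(text, tokenize, lower=True, init_token="<sos>", eos_token="<eos>"):
    """Tokenize, optionally lowercase, then add the boundary tokens."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return [init_token] + tokens + [eos_token]

# str.split is an assumption standing in for the real tokenizer, which also
# joins multi-syllable words with underscores (e.g. "khoa_học").
example = preprocess("Mieu ta khoa hoc dau tien nam 1928 .", str.split)
print(example)
```

In TorchText itself the boundary tokens are added when batches are built rather than stored in each example, so a printed example does not show them.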


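We can also print a single example, for instance by calling `vars()` on an entry of `data_train.examples`, to check that both fields were tokenized as expected.

Output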

 
{'vi_no_accents': ['mieu', 'ta', 'khoa', 'hoc', 'dau', 'tien', 'nam', '1928', '.'], 'vi': ['miêu_tả', 'khoa_học', 'đầu_tiên', 'năm', '1928', '.']}

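The source sentence is Vietnamese with the diacritics stripped and the target is the same sentence with the diacritics restored, so the example looks right. Next, we build the vocabularies for the source and target sides. Each token in a vocabulary is assigned a unique integer index; conceptually this corresponds to a one-hot vector that is zero everywhere except at the token's position, although in practice the embedding layer simply looks the index up. The source and target vocabularies are kept separate. A minimum-frequency threshold controls which tokens enter the vocabulary: with `min_freq = 2`, any token that appears only once would be mapped to the unknown token `<unk>`. The code below uses `min_freq = 1`, which keeps every token seen in training. The vocabulary must be built from the training set only, never from the validation or test sets. Otherwise information leaks: validation or test scores are artificially inflated because the model has seen data it should not have had access to during training.

Code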

the_SRC.build_vocab(data_train, min_freq = 1)
the_TRG.build_vocab(data_train, min_freq = 1)
print(f"Unique tokens in source (vi_no_accents) vocabulary: {len(the_SRC.vocab)}")
print(f"Unique tokens in target (vi) vocabulary: {len(the_TRG.vocab)}")
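The effect of the frequency threshold can be sketched without TorchText. The `build_vocab` and `numericalize` helpers below are simplified stand-ins written for this illustration (the ordering of tokens and the set of special tokens are assumptions), not the library's implementation.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=1,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent first, ties broken alphabetically (a choice of this sketch).
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(sentence, stoi):
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in sentence]  # unseen tokens map to <unk>

train = [["nam", "1928", "."], ["nam", "nay", "."]]
stoi_1, _ = build_vocab(train, min_freq=1)
stoi_2, _ = build_vocab(train, min_freq=2)

print(len(stoi_1), len(stoi_2))                     # 8 6
print(numericalize(["nam", "1928", "."], stoi_2))   # [5, 0, 4]
```

With `min_freq=2`, the tokens "1928" and "nay" occur only once, so they are dropped from the vocabulary and numericalized as index 0, the `<unk>` token.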


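The last step of data preparation is to create the iterators, which serve the data in batches. Each batch exposes one attribute per field, named after its column: `batch.vi_no_accents` is a PyTorch tensor holding the numericalized source sentences, and `batch.vi` holds the numericalized target sentences. Numericalized means that the readable token sequence has been converted into the corresponding sequence of vocabulary indices. We also define a `torch.device`, which specifies whether computation runs on the GPU or the CPU; `torch.cuda.is_available()` tells us whether a GPU is present, and the iterators are placed on the chosen device.

All sentences in a batch must have the same length, so shorter sentences are padded to the length of the longest sentence in the batch. TorchText iterators do this automatically. We use `BucketIterator` rather than the plain `Iterator` because it groups sentences of similar length into the same batch, which minimizes the amount of padding and speeds up processing, especially on large datasets. The iterators themselves are created at the start of the encoder listing below.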
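The padding saved by grouping similar lengths can be illustrated in a few lines of plain Python. This is a simplified sketch: the helpers `pad_batch` and `padding_needed` are hypothetical, and a full sort stands in for the bucketing that `BucketIterator` performs.

```python
def pad_batch(batch, pad_idx=1):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]

def padding_needed(sequences, batch_size):
    """Total number of pad positions when batching sequences in the given order."""
    total = 0
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total

lengths = [2, 9, 3, 10, 2, 8]
seqs = [[5] * n for n in lengths]  # dummy token indices

unsorted_pad = padding_needed(seqs, batch_size=2)
bucketed_pad = padding_needed(sorted(seqs, key=len), batch_size=2)

print(pad_batch([[5, 6, 7], [8]]))   # [[5, 6, 7], [8, 1, 1]]
print(unsorted_pad, bucketed_pad)    # 20 6
```

In this toy case, batching in arrival order wastes 20 positions on padding, while batching by length wastes only 6.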

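Building the Seq2Seq Model

A sequence-to-sequence model has three main components: the encoder, the decoder, and a Seq2Seq wrapper that ties them together. The wrapper passes the input through the encoder, hands the resulting states to the decoder, and collects the decoder's outputs.

Encoder

The first component to build is the encoder. It reads the input sequence token by token and compresses it into a fixed-size representation. We use a 2-layer LSTM here, whereas the original paper used a 4-layer LSTM; the number of layers can be adjusted to the available compute and the complexity of the task. In a multi-layer RNN, the first (bottom) layer receives the embedded input sentence and produces a sequence of hidden states. These hidden states are the input to the layer above, so each layer refines the representation learned by the one below. Because an LSTM maintains a cell state in addition to the hidden state, which helps it retain long-range dependencies, every layer carries both a hidden state and a cell state from one time step to the next. Both are initialized to zero at the start of the sequence. Once the whole sequence has been fed through the encoder, the final hidden and cell states of all layers together form the context vector, which is what the decoder receives.

Code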

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 16

# Batch examples of similar target length together to minimize padding.
iterator_train = BucketIterator(data_train, batch_size=BATCH_SIZE,
                                sort_key=lambda x: len(x.vi), shuffle=True, device=device)
iterator_valid = BucketIterator(data_val, batch_size=BATCH_SIZE,
                                sort_key=lambda x: len(x.vi), shuffle=True, device=device)

class Encoder(nn.Module):  
    def __init__(self, dim_input, dim_emb, hid_dim, layer_n, dropout):  
        super().__init__()  

        self.hid_dim = hid_dim  
        self.layer_n = layer_n  

        self.embedding = nn.Embedding(dim_input, dim_emb)  

        self.rnn = nn.LSTM(dim_emb, hid_dim, layer_n, dropout=dropout)  

        self.dropout = nn.Dropout(dropout)  

    def forward(self, the_SRC):  
        # the_SRC has dimensions: [sequence length, batch size]  

        embedded = self.dropout(self.embedding(the_SRC))  

        # Embedded representation: [sequence length, batch size, embedding size]  

        outputs, (hidden, cell) = self.rnn(embedded)  

        # outputs: [sequence length, batch size, hidden size * number of directions]  
        # hidden: [number of layers * number of directions, batch size, hidden size]  
        # cell: [number of layers * number of directions, batch size, hidden size]  

        # The output tensor consists of the top layer’s hidden states  

        return hidden, cell   
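The shapes noted in the comments can be checked on a bare `nn.LSTM`. The sizes below are arbitrary values chosen for this sketch, and a random tensor stands in for the embedded source tokens.

```python
import torch
import torch.nn as nn

seq_len, batch_size, emb_dim, hid_dim, n_layers = 7, 4, 100, 256, 2

rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=0.5)
embedded = torch.randn(seq_len, batch_size, emb_dim)  # stand-in for embedded tokens

# No initial state is passed, so PyTorch starts from zero hidden and cell states.
outputs, (hidden, cell) = rnn(embedded)

print(outputs.shape)  # top-layer hidden state at every time step
print(hidden.shape)   # final hidden state of every layer
print(cell.shape)     # final cell state of every layer

# The last time step of `outputs` is the top layer's final hidden state.
print(torch.allclose(outputs[-1], hidden[-1]))
```

Only `hidden` and `cell` are returned by the encoder, because `outputs` covers the top layer alone, while the decoder needs a starting state for each of its layers.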

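Decoder

The second key component of the Seq2Seq model is the decoder, which generates the target sequence from the encoded context vector. Like the encoder, it is a 2-layer LSTM here, whereas the original paper used 4 layers. While the encoder processes the whole input sequence in one call, the decoder works step by step, predicting one token at a time from the previous token and its current states. Its initial hidden and cell states are the encoder's final hidden and cell states. At each step the decoder receives the previous token (the ground-truth target token when teacher forcing is applied during training, otherwise its own previous prediction), embeds it, passes it through the LSTM layers to update the states, and projects the top-layer output onto the target vocabulary with a linear layer.

Code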

class Decoder(nn.Module):  
    def __init__(self, output_dim, dim_emb, hid_dim, layer_n, dropout):  
        super().__init__()  

        self.output_dim = output_dim  
        self.hid_dim = hid_dim  
        self.layer_n = layer_n  

        self.embedding = nn.Embedding(output_dim, dim_emb)  

        self.rnn = nn.LSTM(dim_emb, hid_dim, layer_n, dropout=dropout)  

        self.out = nn.Linear(hid_dim, output_dim)  

        self.dropout = nn.Dropout(dropout)  

    def forward(self, input, hidden, cell):  
        # input shape: [batch size]  
        # hidden shape: [number of layers * directions, batch size, hidden size]  
        # cell shape: [number of layers * directions, batch size, hidden size]  

        # Since the decoder always has one direction, we can simplify:  
        # hidden shape: [number of layers, batch size, hidden size]  
        # cell shape: [number of layers, batch size, hidden size]  

        input = input.unsqueeze(0)  # Adding sequence length dimension  

        # input shape after unsqueeze: [1, batch size]  

        embedded = self.dropout(self.embedding(input))  

        # embedded shape: [1, batch size, embedding size]  

        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))  

        # output shape: [sequence length, batch size, hidden size * directions]  
        # hidden shape: [number of layers * directions, batch size, hidden size]  
        # cell shape: [number of layers * directions, batch size, hidden size]  

        # Since sequence length and directions are always 1 in the decoder:  
        # output shape: [1, batch size, hidden size]  
        # hidden shape: [number of layers, batch size, hidden size]  
        # cell shape: [number of layers, batch size, hidden size]  

        prediction = self.out(output.squeeze(0))  # Removing sequence length dimension  

        # prediction shape: [batch size, output dimension]  

        return prediction, hidden, cell  

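The Seq2Seq model combines the encoder and decoder: the encoder's final states serve as the context vector, and the decoder generates the output sequence from it, with teacher forcing applied during training. How much supervision the decoder receives in the forward pass is controlled by `teacher_forcing_ratio`, the probability of feeding the ground-truth token rather than the model's own prediction at each step.

Code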

class Seq2Seq(nn.Module):  
    def __init__(self, encoder, decoder, device):  
        super().__init__()  

        self.encoder = encoder  
        self.decoder = decoder  
        self.device = device  

        # Ensure encoder and decoder have matching dimensions  
        assert encoder.hid_dim == decoder.hid_dim, "Encoder and decoder hidden sizes must match!"  
        assert encoder.layer_n == decoder.layer_n, "Encoder and decoder must have the same number of layers!"  

    def forward(self, the_SRC, the_TRG, teacher_forcing_ratio=0.5):  
        # the_SRC shape: [sequence length, batch size]  
        # the_TRG shape: [sequence length, batch size]  
        # teacher_forcing_ratio: Probability of using the actual target word instead of the model’s prediction  

        batch_size = the_TRG.shape[1]  
        max_len = the_TRG.shape[0]  
        vocab_size = self.decoder.output_dim  

        # Initialize a tensor to store decoder outputs  
        outputs = torch.zeros(max_len, batch_size, vocab_size).to(self.device)  

        # Use the encoder’s final hidden and cell states as the decoder’s initial states  
        hidden, cell = self.encoder(the_SRC)  

        # The first input to the decoder is always the <sos> token  
        input = the_TRG[0, :]  

        for t in range(1, max_len):  
            # Pass the current input, hidden, and cell state to the decoder  
            output, hidden, cell = self.decoder(input, hidden, cell)  

            # Store the output prediction  
            outputs[t] = output  

            # Determine if teacher forcing should be applied  
            teacher_force = random.random() < teacher_forcing_ratio  

            # Get the most likely predicted token  
            top1 = output.argmax(1)  

            # Choose the next input:  
            # If using teacher forcing, use the actual target word from the dataset  
            # Otherwise, use the model's predicted token  
            input = the_TRG[t] if teacher_force else top1  

        return outputs  
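The forward method above needs the target sequence, which is not available at inference time. One possible way to generate output without it is greedy decoding, sketched below. `greedy_decode`, `TinyEncoder` and `TinyDecoder` are hypothetical names introduced for this example; the tiny modules only mimic the call signatures of the Encoder and Decoder above so that the snippet runs on its own, and their weights are untrained.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, vocab, emb, hid, layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, layers)

    def forward(self, src):                      # src: [src len, batch]
        _, (hidden, cell) = self.rnn(self.embedding(src))
        return hidden, cell

class TinyDecoder(nn.Module):
    def __init__(self, vocab, emb, hid, layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hid, layers)
        self.out = nn.Linear(hid, vocab)

    def forward(self, token, hidden, cell):      # token: [batch]
        embedded = self.embedding(token.unsqueeze(0))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        return self.out(output.squeeze(0)), hidden, cell

def greedy_decode(encoder, decoder, src, sos_idx, eos_idx, max_len=20):
    """Decode one source sentence (src: [src len, 1]) token by token."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        hidden, cell = encoder(src)
        token = torch.tensor([sos_idx], device=src.device)
        result = []
        for _ in range(max_len):
            logits, hidden, cell = decoder(token, hidden, cell)
            token = logits.argmax(1)             # most likely next token
            if token.item() == eos_idx:          # stop at end-of-sequence
                break
            result.append(token.item())
    return result

torch.manual_seed(0)
tiny_enc = TinyEncoder(vocab=12, emb=8, hid=16, layers=2)
tiny_dec = TinyDecoder(vocab=12, emb=8, hid=16, layers=2)
src = torch.tensor([[2], [5], [7], [3]])         # made-up token indices
decoded = greedy_decode(tiny_enc, tiny_dec, src, sos_idx=2, eos_idx=3)
print(decoded)
```

Because the stand-in modules share the interface of the classes above, the same function should also work with `model.encoder` and `model.decoder` after training. The `max_len` cap guards against a model that never emits `<eos>`.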

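With the model implemented, we can move on to training. First we initialize the model. The input and output dimensions are determined by the sizes of the source and target vocabularies. The embedding dimensions and dropout rates of the encoder and decoder may differ, but the number of layers and the size of the hidden and cell states must match.

Code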

# Define model parameters  
INPUT_DIM = len(the_SRC.vocab)  
OUTPUT_DIM = len(the_TRG.vocab)  
ENC_EMB_DIM = 100  # Embedding dimension for encoder  
DEC_EMB_DIM = 100  # Embedding dimension for decoder  
HID_DIM = 256  # Size of hidden states  
N_LAYERS = 2  # Number of layers in both encoder and decoder  
ENC_DROPOUT = 0.5  # Dropout rate for encoder  
DEC_DROPOUT = 0.5  # Dropout rate for decoder  

# Initialize encoder and decoder  
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)  
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)  

# Create the sequence-to-sequence model and move it to the specified device  
model = Seq2Seq(enc, dec, device).to(device)  

# Next, initialize the model's weights. The paper draws all weights from a
# uniform distribution between -0.08 and 0.08. In PyTorch we define an
# initialization function and pass it to `apply`, which calls it on every
# module and sub-module of the model. Inside the function we loop over the
# parameters and fill each one with `nn.init.uniform_`.

def weights_init(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(weights_init) 
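A quick way to confirm that the initialization took effect is to inspect the range of the parameters afterwards. The sketch below repeats `weights_init` on a small throwaway module (`toy`, an assumption made so the snippet is self-contained) rather than on the full model.

```python
import torch.nn as nn

def weights_init(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

# A small container of modules, used only to hold parameters for this check.
toy = nn.Sequential(nn.Embedding(10, 4), nn.LSTM(4, 8, 2))
toy.apply(weights_init)

low = min(p.min().item() for p in toy.parameters())
high = max(p.max().item() for p in toy.parameters())
print(low, high)  # both values lie within [-0.08, 0.08]
```

Note that `named_parameters()` also returns the parameters of sub-modules, so with `apply` a nested parameter is redrawn once for every enclosing module. This is harmless, since each draw comes from the same distribution.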


We also define a function that counts the model's trainable parameters.

Code

def parameters_count(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {parameters_count(model):,} trainable parameters')   

Output

 
The model has 20,330,651 trainable parameters
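The total can be understood by counting the parameters of each component by hand. The sketch below does this for the architecture above; the helper names are hypothetical, and the vocabulary sizes in the example call are made-up values, since the actual sizes are not shown here.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional nn.LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        total += 4 * hidden_size * (in_size + hidden_size) + 8 * hidden_size
    return total

def seq2seq_params(src_vocab, trg_vocab, emb_dim=100, hid_dim=256, n_layers=2):
    encoder = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
    decoder = (trg_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
               + hid_dim * trg_vocab + trg_vocab)  # output projection weights + bias
    return encoder + decoder

print(lstm_params(100, 256, 2))                        # 892928
print(seq2seq_params(src_vocab=1000, trg_vocab=2000))  # 2599856
```

The two LSTMs contribute a fixed 892,928 parameters each, so for large vocabularies most of the total comes from the embedding tables and the output projection, which grow linearly with vocabulary size.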

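In the training loop, an optimizer updates the model's parameters; here we use Adam with its default settings. The loss is cross-entropy, configured to ignore padding positions in the target, and gradients are clipped to a maximum norm before each optimizer step.

Code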

optimizer = optim.Adam(model.parameters())

# Padding positions in the target must not contribute to the loss.
PAD_IDX = the_TRG.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)


def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    loss_epoch = 0
    
    for i, batch in enumerate(iterator):
        
        the_SRC = batch.vi_no_accents
        the_TRG = batch.vi
        
        optimizer.zero_grad()
        
        output = model(the_SRC, the_TRG)
        
        #the_TRG = [the_TRG sent len, batch size]
        #output = [the_TRG sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        the_TRG = the_TRG[1:].view(-1)
        
        #the_TRG = [(the_TRG sent len - 1) * batch size]
        #output = [(the_TRG sent len - 1) * batch size, output dim]
        
        loss = criterion(output, the_TRG)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        loss_epoch += loss.item()
        
    return loss_epoch / len(iterator)


def evaluate(model, iterator, criterion):
    
    model.eval()
    
    loss_epoch = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            the_SRC = batch.vi_no_accents
            the_TRG = batch.vi

            output = model(the_SRC, the_TRG, 0) #turn off teacher forcing

            #the_TRG = [the_TRG sent len, batch size]
            #output = [the_TRG sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            the_TRG = the_TRG[1:].view(-1)

            #the_TRG = [(the_TRG sent len - 1) * batch size]
            #output = [(the_TRG sent len - 1) * batch size, output dim]

            loss = criterion(output, the_TRG)
            
            loss_epoch += loss.item()
        
    return loss_epoch / len(iterator)


def epoch_time(time_start, time_end):
    time_elapsed = time_end - time_start
    mins_elapsed = int(time_elapsed / 60)
    secs_elapsed = int(time_elapsed - (mins_elapsed * 60))
    return mins_elapsed, secs_elapsed


N_EPOCHS = 10
CLIP = 1  # maximum gradient norm
loss_traines = []
loss_valides = []
best_loss_valid = float('inf')

for epoch in range(N_EPOCHS):
    
    time_start = time.time()
    
    loss_train = train(model, iterator_train, optimizer, criterion, CLIP)
    loss_valid = evaluate(model, iterator_valid, criterion)
    loss_traines.append(loss_train)
    loss_valides.append(loss_valid)
    time_end = time.time()
    
    epoch_mins, epoch_secs = epoch_time(time_start, time_end)
    
    if loss_valid < best_loss_valid:
        best_loss_valid = loss_valid
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {loss_train:.3f} | Train PPL: {math.exp(loss_train):7.3f}')
    print(f'\t Val. Loss: {loss_valid:.3f} |  Val. PPL: {math.exp(loss_valid):7.3f}')
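The loss and the perplexity (PPL) printed by the loop can be reproduced by hand on a toy case. The sketch below is a plain-Python illustration of a mean cross-entropy that skips padding targets, followed by the exponentiation that turns it into perplexity; `masked_cross_entropy` and the toy values are assumptions made for this example.

```python
import math

def masked_cross_entropy(logits, targets, pad_idx):
    """Mean negative log-likelihood over positions whose target is not padding."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padding contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))  # log of the softmax denominator
        total += log_norm - row[target]
        count += 1
    return total / count

PAD = 1
logits = [[0.0, 0.0, 0.0, 0.0],   # uniform over 4 tokens
          [0.0, 0.0, 0.0, 0.0],
          [9.0, 0.0, 0.0, 0.0]]   # ignored because its target is padding
targets = [2, 3, PAD]

loss = masked_cross_entropy(logits, targets, PAD)
perplexity = math.exp(loss)
print(round(loss, 4), round(perplexity, 4))  # 1.3863 4.0
```

A perplexity of 4 here means the toy model is as uncertain as a uniform guess among 4 tokens, which is why a falling validation perplexity is a convenient signal of progress.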
