Introduction to Text Summarization

June 21, 2025 | 11 min read

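Text summarization is an important task in natural language processing: it condenses a large body of text into a shorter version that preserves the key meaning. With the explosion of digital content, summarization gives individuals and businesses a way to quickly extract insights from lengthy documents, news articles, research papers, and reports. Broadly, there are two types of text summarization: extractive and abstractive. Extractive summarization selects relevant sentences or phrases verbatim from the source text, using statistical or linguistic criteria (for example, word frequency, sentence importance, or semantic relevance). Because the sentences are copied unchanged, the summary is grammatical and concise, but in most cases it does not read as a logically connected, fluent whole. Abstractive summarization, on the other hand, generates new sentences based on an understanding of the text, rephrasing it much as a human would. This approach relies on deep learning models, such as sequence-to-sequence architectures built from recurrent neural networks, or transformer models such as BERT and T5, to produce fluent, coherent summaries.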
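To make the extractive idea concrete, here is a minimal sketch that scores sentences by summed word frequency and returns the top ones in document order. The function name `extractive_summary`, the naive sentence splitter, and the scoring rule are illustrative assumptions, not part of the tutorial's model; note that this simple score favors longer sentences.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # real systems use a proper sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # A sentence scores the summed document frequency of its words.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Pick the top-scoring sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "Summarization shortens long text. "
    "Extractive summarization copies sentences from the text. "
    "Cats sleep a lot."
)
print(extractive_summary(doc, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is exactly why extractive output is grammatical but may not flow well from one sentence to the next.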

We will now perform text summarization with a Seq2Seq LSTM model.

Code

import numpy as np
import pandas as pd

# Load the two news datasets (paths as given in the original)
brief_intro = pd.read_csv('/news_brief_intro.csv', encoding='iso-8859-1')
un_edited = pd.read_csv('/news_brief_intro_more.csv', encoding='iso-8859-1')

# First two columns of the larger file: headlines and text
prior_val_1 = un_edited.iloc[:, 0:2].copy()
# prior_val_1['head + text'] = prior_val_1['headlines'].str.cat(prior_val_1['text'], sep=" ")

# First six columns of the smaller file; join author, date, read_more, text and ctext into one text field
prior_val_2 = brief_intro.iloc[:, 0:6].copy()
prior_val_2['text'] = prior_val_2['author'].str.cat(
    [prior_val_2['date'], prior_val_2['read_more'], prior_val_2['text'], prior_val_2['ctext']],
    sep=" ")

# Stack both sources: 'text' is the model input, 'brief_intro' (the headline) is the target summary
prior = pd.DataFrame()
prior['text'] = pd.concat([prior_val_1['text'], prior_val_2['text']], ignore_index=True)
prior['brief_intro'] = pd.concat([prior_val_1['headlines'], prior_val_2['headlines']], ignore_index=True)
prior.head(2)

Output

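This is an LSTM with attention. The keras-self-attention package appears below only as a commented-out install command; the model built later in this tutorial does not add an attention layer.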

Code

 
#pip install keras-self-attention

prior['text'][:10]

Output

Now let's clean the data.

Code

import re

# Clean raw text: normalise whitespace, drop punctuation runs and special characters
def clean_text(data_column):
    for ent in data_column:
        
        # The order of the regex operations matters
        ent = str(ent).lower()  # Lowercase once; all replacement tokens below are lowercase
        
        ent = re.sub(r"[\t\r\n]", ' ', ent)  # Replace tabs, carriage returns and newlines
        
        ent = re.sub(r"(__+)", ' ', ent)  # Replace runs of underscores with a space
        ent = re.sub(r"(--+)", ' ', ent)  # Replace runs of hyphens with a space
        ent = re.sub(r"(~~+)", ' ', ent)  # Replace runs of tildes with a space
        ent = re.sub(r"(\+\++)", ' ', ent)  # Replace runs of plus signs with a space
        ent = re.sub(r"(\.\.+)", ' ', ent)  # Replace runs of periods with a space
        
        ent = re.sub(r"[<>()|&©ø\[\]\'\",;?~*!]", ' ', ent)  # Remove special characters
        
        ent = re.sub(r"(mailto:)", ' ', ent)  # Remove 'mailto:' text
        ent = re.sub(r"(\\x9\d)", ' ', ent)  # Remove literal \x9<digit> escape sequences
        ent = re.sub(r"(inc\d+)", 'inc_num', ent)  # Replace INC numbers with a standard token
        ent = re.sub(r"(cm\d+)|(chg\d+)", 'cm_num', ent)  # Replace CM and CHG numbers with a standard token
        
        ent = re.sub(r"(\.\s+)", ' ', ent)  # Remove periods at word endings
        ent = re.sub(r"(-\s+)", ' ', ent)  # Remove hyphens at word endings
        ent = re.sub(r"(:\s+)", ' ', ent)  # Remove colons at word endings
        
        ent = re.sub(r"(\s+\.\s+)", ' ', ent)  # Remove isolated periods surrounded by whitespace
        
        # Replace full URLs with the domain of the first URL found in the entry
        url_pattern = r'((https*:\/*)([^\/\s]+))(.[^\s]+)'
        url_match = re.search(url_pattern, ent)
        if url_match:  # Skip entries without a URL
            domain = url_match.group(3)
            ent = re.sub(url_pattern, lambda m: domain, ent)
        
        ent = re.sub(r"(\s+)", ' ', ent)  # Collapse extra whitespace
        
        # This should always be the final cleanup step
        ent = re.sub(r"(\s+\.\s+)", ' ', ent)  # Remove any remaining isolated periods
        
        yield ent

brief_cleaning1 = clean_text(prior['text'])
brief_cleaning2 = clean_text(prior['brief_intro'])

from time import time
import spacy

# Assumption: the small English model is installed (python -m spacy download en_core_web_sm);
# the 'en' shortcut from older spaCy releases no longer works in spaCy 3
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])  # Skip NER and parsing for faster processing

# spaCy's .pipe() processes texts in batches, which is faster than calling nlp() on each text
# If the output has fewer entries than the input, consider lowering batch_size

t_start = time()

# Process text in batches of 5000 (the old n_threads argument is omitted; spaCy 3 no longer accepts it)
cleaned_text = [str(doc) for doc in nlp.pipe(brief_cleaning1, batch_size=5000)]

# Approximate execution time
print('Time to clean up everything: {} mins'.format(round((time() - t_start) / 60, 2)))

Output

 
Time to clean up everything: 7.68 mins
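The cleaning step is easier to follow on a single string. The sketch below applies a reduced subset of the rules; `clean_one` and `URL_PATTERN` are hypothetical names, and unlike the tutorial's code this version replaces each URL with its own domain and handles URLs before special characters are stripped.

```python
import re

# Matches http:// or https:// URLs and captures the domain part.
URL_PATTERN = re.compile(r"https?://([^/\s]+)\S*")


def clean_one(text):
    """Apply a reduced version of the cleaning steps to a single string."""
    text = str(text).lower()
    text = re.sub(r"[\t\r\n]", " ", text)               # tabs, carriage returns, newlines
    text = URL_PATTERN.sub(r"\1", text)                 # keep only the domain of each URL
    text = re.sub(r"(--+|\.\.+|__+)", " ", text)        # runs of -, . or _
    text = re.sub(r"[<>()|&\[\]'\",;?~*!]", " ", text)  # special characters
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()


sample = "Read more at https://example.com/news/123 -- it's\tGREAT!!!"
print(clean_one(sample))
```

Running the URL rule before the special-character rule matters here: characters such as `?` inside a URL would otherwise be turned into spaces and split the link before it can be matched.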

Using spaCy's .pipe() method to process the summaries efficiently

Code

t = time()  

# Process the summaries in batches of 5000 and wrap each one in _START_ / _END_ markers  
brief_intro = ['_START_ ' + str(doc) + ' _END_' for doc in nlp.pipe(brief_cleaning2, batch_size=5000)]  

# The summaries are short, so this step is much faster than cleaning the full texts  
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))     

Output

 
Time to clean up everything: 1.91 mins

Let's look at the first cleaned text.

Code

cleaned_text[0]

Output

Now let's look at the corresponding summary as well.

Code

brief_intro[0]

Output

 
'_START_ upgrad learner switches to a career in ml al with 90% salary hike _END_'

Plot the word-count distributions of the texts and the summaries.

Code

prior['cleaned_text'] = pd.Series(cleaned_text)
prior['cleaned_brief_intro'] = pd.Series(brief_intro)

# Count the words in every text and every summary
text_count = [len(sent.split()) for sent in prior['cleaned_text']]
brief_intro_count = [len(sent.split()) for sent in prior['cleaned_brief_intro']]

graph_dFrame = pd.DataFrame()
graph_dFrame['text'] = text_count
graph_dFrame['brief_intro'] = brief_intro_count

import matplotlib.pyplot as plt

graph_dFrame.hist(bins=5)
plt.show()

Output

Determine the proportion of 'cleaned_brief_intro' entries that contain 15 words or fewer

Code

 
count = 0  
for ent in prior['cleaned_brief_intro']:  
    if len(ent.split()) <= 15:  
        count += 1  
        
# Calculating and displaying the proportion of short entries  
print(count / len(prior['cleaned_brief_intro']))     

Output

 
0.9978234465335472

Check what proportion of the texts contain 0-100 words

Code

 
cnt=0
for i in prior['cleaned_text']:
    if(len(i.split())<=100):
        cnt=cnt+1
print(cnt/len(prior['cleaned_text']))   

Output

 
0.9578389933440218
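The two checks above are the same calculation with different limits. As a small sketch, the helper `coverage` below computes the fraction of entries within a word limit, and `smallest_limit` finds the smallest limit that reaches a target coverage. Both names and the toy data are assumptions made for illustration.

```python
def coverage(texts, max_words):
    """Fraction of texts whose whitespace-separated word count is <= max_words."""
    if not texts:
        return 0.0
    within = sum(1 for t in texts if len(t.split()) <= max_words)
    return within / len(texts)


def smallest_limit(texts, target=0.95):
    """Smallest word limit whose coverage reaches the target fraction."""
    lengths = sorted(len(t.split()) for t in texts)
    for limit in lengths:
        if coverage(texts, limit) >= target:
            return limit
    return lengths[-1] if lengths else 0


toy = ["a b c", "a b", "a b c d e", "a", "a b c d e f g h"]
print(coverage(toy, 5))
print(smallest_limit(toy, 0.8))
```

Picking a limit that covers most of the data keeps the padded sequences short while discarding only a small share of the examples.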

Define the maximum word limits for the text and the summary.

Code

# Maximum number of words allowed in a text and in a summary  
mtl = 100  
max_brief_intro_len = 15  

# Extracting relevant text and summaries within the specified word limits  

processed_text = np.array(prior['cleaned_text'])  
processed_brief_intro = np.array(prior['cleaned_brief_intro'])  

filtered_text = []  
filtered_brief_intro = []  

# Filtering entries where both text and brief_intro meet the defined length constraints  
for i in range(len(processed_text)):  
    if (len(processed_brief_intro[i].split()) <= max_brief_intro_len and  
            len(processed_text[i].split()) <= mtl):  
        filtered_text.append(processed_text[i])  
        filtered_brief_intro.append(processed_brief_intro[i])  

# Creating a DataFrame with the refined dataset  
post_prior = pd.DataFrame({'text': filtered_text, 'brief_intro': filtered_brief_intro})  
post_prior.head(2)     

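Output

Add the start token (sostok) and the end token (uostoke) to each summary

Code

# The Keras Tokenizer lowercases text by default, so both marker tokens are written in lowercase
post_prior['brief_intro'] = post_prior['brief_intro'].apply(lambda x: 'sostok ' + x + ' uostoke')
post_prior.head(2)

Output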

Building the Seq2Seq Model

Code

 
from sklearn.model_selection import train_test_split  

# Splitting the dataset into training and validation sets  
tr_x, val_x, tr_y, val_y = train_test_split(  
    np.array(post_prior['text']),  
    np.array(post_prior['brief_intro']),  
    test_size=0.1,  # Allocating 10% of data for validation  
    random_state=0,  # Ensuring reproducibility  
    shuffle=True  # Shuffling data before splitting  
)  

# Importing necessary modules for tokenization and padding  
from keras.preprocessing.text import Tokenizer  
from keras.preprocessing.sequence import pad_sequences  

# Initializing a tokenizer for processing text data  
tokenizer_x = Tokenizer()  
tokenizer_x.fit_on_texts(list(tr_x))  # Learning the vocabulary from training text  

# Defining the threshold frequency for rare words  
threshold = 4  

# Initializing counters for vocabulary statistics  
rare_word_count = 0  
total_word_count = 0  
rare_word_frequency = 0  
total_word_frequency = 0  

# Analyzing word distribution in the vocabulary  
for word, occurrence in tokenizer_x.word_counts.items():  
    total_word_count += 1  
    total_word_frequency += occurrence  
    if occurrence < threshold:  
        rare_word_count += 1  
        rare_word_frequency += occurrence  

# Displaying statistics on rare words  
print("% of rare words in vocabulary:", (rare_word_count / total_word_count) * 100)  
print("Total Coverage of rare words:", (rare_word_frequency / total_word_frequency) * 100)     

Output

Here we drop the rare words from the text vocabulary and convert the texts into padded integer sequences.

Code

 
# Initializing a tokenizer for processing text data with limited vocabulary  
tokenizer_x = Tokenizer(num_words=total_word_count - rare_word_count)  
tokenizer_x.fit_on_texts(list(tr_x))  # Learning the vocabulary from training text  

# Converting text data into numerical sequences (tokenized representations)  
tr_x_seq = tokenizer_x.texts_to_sequences(tr_x)  
val_x_seq = tokenizer_x.texts_to_sequences(val_x)  

# Applying padding to ensure uniform sequence length  
tr_x = pad_sequences(tr_x_seq, maxlen=mtl, padding='post')  
val_x = pad_sequences(val_x_seq, maxlen=mtl, padding='post')  

# Determining vocabulary size (including an extra token for padding)  
voc_x = tokenizer_x.num_words + 1  

# Displaying the vocabulary size  
print("Size of vocabulary in X = {}".format(voc_x))     

Output

 
Size of vocabulary in X = 33412
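The rare-word analysis and the reduced vocabulary can be reproduced on a toy corpus with the standard library alone. This is a simplified sketch of the counting logic, not the Keras Tokenizer itself; `rare_word_stats` and `build_index` are hypothetical helper names.

```python
from collections import Counter


def rare_word_stats(texts, threshold):
    """Return (share of vocabulary that is rare, share of tokens that are rare)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    rare = {w: c for w, c in counts.items() if c < threshold}
    vocab_share = len(rare) / len(counts)
    token_share = sum(rare.values()) / sum(counts.values())
    return vocab_share, token_share


def build_index(texts, threshold):
    """Map words seen at least `threshold` times to ids 1..N (0 is left for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    kept = [w for w, c in counts.most_common() if c >= threshold]
    return {w: i for i, w in enumerate(kept, start=1)}


toy = ["the cat sat", "the cat ran", "the dog sat", "a bird flew"]
vocab_share, token_share = rare_word_stats(toy, threshold=2)
index = build_index(toy, threshold=2)
print(vocab_share, token_share, index)
```

In the toy corpus most distinct words are rare, yet they account for a smaller share of all tokens, which is the usual argument for dropping them from the vocabulary.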

Here we analyze the rare words in the summaries.

Code

 
# Initializing a tokenizer for processing text summaries  
tokenizer_y = Tokenizer()   
tokenizer_y.fit_on_texts(list(tr_y))  # Learning the vocabulary from training summaries  

# Defining a threshold for rare word occurrences  
thresh = 6  

# Initializing counters for vocabulary statistics  
cnt = 0  
tot_cnt = 0  
freq = 0  
tot_freq = 0  

# Iterating through the word counts to determine rare words  
for key, value in tokenizer_y.word_counts.items():  
    tot_cnt += 1  # Total number of unique words  
    tot_freq += value  # Total frequency of all words  
    if value < thresh:  # Checking if word occurrence is below the threshold  
        cnt += 1  # Counting rare words  
        freq += value  # Summing frequencies of rare words  

# Displaying the percentage of rare words in the vocabulary  
print("% of rare words in vocabulary:", (cnt / tot_cnt) * 100)  

# Displaying the total frequency coverage of rare words  
print("Total Coverage of rare words:", (freq / tot_freq) * 100)     

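Output

Next we rebuild the summary tokenizer without the rare words and find the size of the summary vocabulary.

Code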

 
# Initializing a tokenizer for processing summary text with a limited vocabulary  
tokenizer_y = Tokenizer(num_words=tot_cnt - cnt)  
tokenizer_y.fit_on_texts(list(tr_y))  # Learning the vocabulary from training summaries  

# Converting text summaries into integer sequences (encoding words into numeric format)  
tr_y_seq = tokenizer_y.texts_to_sequences(tr_y)  
val_y_seq = tokenizer_y.texts_to_sequences(val_y)  

# Applying zero-padding to ensure uniform sequence length  
tr_y = pad_sequences(tr_y_seq, maxlen=max_brief_intro_len, padding='post')  
val_y = pad_sequences(val_y_seq, maxlen=max_brief_intro_len, padding='post')  

# Determining vocabulary size (+1 for padding token)  
voc_y = tokenizer_y.num_words + 1  
print("Size of vocabulary in Y = {}".format(voc_y))     

Output

 
Size of vocabulary in Y = 11581
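Padding gives every sequence the same length so the sequences can be stacked into one array. The pure-Python sketch below mimics what `pad_sequences(..., padding='post')` does with Keras's default `truncating='pre'`; `pad_post` is a hypothetical helper that assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence on the right to `maxlen`; longer ones lose their first items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # truncating='pre' keeps the tail
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post' appends zeros
    return padded


print(pad_post([[5, 8], [3, 9, 4, 7, 2]], maxlen=4))
# [[5, 8, 0, 0], [9, 4, 7, 2]]
```

Because index 0 is reserved for padding, the vocabulary sizes above are computed as `num_words + 1`.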

Some summaries (Y) in the training and validation sets now contain only the START and END tokens. We will remove those rows.

Code

 
# Identifying and removing sequences in Y that contain only two non-zero tokens  
indices_to_remove = []  
for i in range(len(tr_y)):  
    token_count = sum(1 for j in tr_y[i] if j != 0)  
    if token_count == 2:  
        indices_to_remove.append(i)  

# Removing the identified sequences from training data  
tr_y = np.delete(tr_y, indices_to_remove, axis=0)  
tr_x = np.delete(tr_x, indices_to_remove, axis=0)  

# Identifying and removing sequences in validation data following the same logic  
indices_to_remove = []  
for i in range(len(val_y)):  
    token_count = sum(1 for j in val_y[i] if j != 0)  
    if token_count == 2:  
        indices_to_remove.append(i)  

# Removing the identified sequences from validation data  
val_y = np.delete(val_y, indices_to_remove, axis=0)  
val_x = np.delete(val_x, indices_to_remove, axis=0)  

# Importing the layers, model class and callback used below  
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed  
from tensorflow.keras.models import Model  
from tensorflow.keras.callbacks import EarlyStopping  
from tensorflow.keras import backend as K  

import warnings  

# Display settings and warnings suppression  
pd.set_option("display.max_colwidth", 200)  
warnings.filterwarnings("ignore")  

# Displaying the source vocabulary size used by the encoder embedding  
print("Size of vocabulary in X = {}".format(voc_x))  

# Clearing any previous session state  
K.clear_session()  
# Setting dimensions for embedding and latent space  
latent_dim = 300  
embedding_dim = 200  

# Encoder input  
inputs_encoder = Input(shape=(mtl,))  

# Embedding layer for encoder  
enc_emb = Embedding(voc_x, embedding_dim, trainable=True)(inputs_encoder)  

# First LSTM layer in encoder  
lstm_encoder1 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)  
encoder_output1, state_h1, c1_state = lstm_encoder1(enc_emb)  

# Second LSTM layer in encoder  
lstm_encoder2 = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.4)  
encoder_output2, h2_state, c2_state = lstm_encoder2(encoder_output1)  

# Third LSTM layer in the encoder  
lstm_encoder3 = LSTM(latent_dim, return_state=True, return_sequences=True, dropout=0.4, recurrent_dropout=0.4)  
output_encoder, state_h, c_state = lstm_encoder3(encoder_output2)  

# Decoder input  
input_decoder = Input(shape=(None,))  

# Embedding layer for decoder  
emb_dec_layer = Embedding(voc_y, embedding_dim, trainable=True)  
emb_dec = emb_dec_layer(input_decoder)  

# LSTM layer in decoder  
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.4, recurrent_dropout=0.2)  
decoder_outputs, decoder_fwd_state, destate_coder_back = decoder_lstm(emb_dec, initial_state=[state_h, c_state])  

# Fully connected layer with softmax activation for output  
dense_decoder = TimeDistributed(Dense(voc_y, activation='softmax'))  
decoder_outputs = dense_decoder(decoder_outputs)  

# Defining the encoder-decoder model  
model = Model([inputs_encoder, input_decoder], decoder_outputs)  

# Displaying model architecture  
model.summary()     

Output
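As a rough cross-check of the model summary, the trainable parameter count can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers and assumes the vocabulary sizes printed earlier in this tutorial (33412 and 11581); the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def embedding_params(vocab_size, dim):
    """One dense vector of size `dim` per vocabulary entry."""
    return vocab_size * dim


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output unit."""
    return input_dim * output_dim + output_dim


embedding_dim, latent_dim = 200, 300
voc_x, voc_y = 33412, 11581   # vocabulary sizes printed earlier in the tutorial

total = (
    embedding_params(voc_x, embedding_dim)      # encoder embedding
    + lstm_params(embedding_dim, latent_dim)    # encoder LSTM 1
    + 2 * lstm_params(latent_dim, latent_dim)   # encoder LSTM 2 and 3
    + embedding_params(voc_y, embedding_dim)    # decoder embedding
    + lstm_params(embedding_dim, latent_dim)    # decoder LSTM
    + dense_params(latent_dim, voc_y)           # softmax output layer
)
print(total)
```

Most of the parameters sit in the two embedding layers and the softmax layer, so trimming rare words from the vocabularies shrinks the model far more than changing the LSTM size would.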

Compile the model with the RMSprop optimizer and sparse categorical cross-entropy loss

Code

 
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')  

# Implementing early stopping to monitor validation loss and stop training if it does not improve  
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)  

# Training the model using the training dataset  
history = model.fit(  
    [tr_x, tr_y[:, :-1]],  # Encoder input and decoder input (excluding last token)  
    tr_y.reshape(tr_y.shape[0], tr_y.shape[1], 1)[:, 1:],  # Decoder output (excluding first token)  
    epochs=50,  
    batch_size=128,  
    validation_data=(  
        [val_x, val_y[:, :-1]],  # Validation data input  
        val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]  # Validation data output  
    ),  
    callbacks=[es]  # Applying early stopping callback  
)     

Output
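The slices `tr_y[:, :-1]` and `tr_y[:, 1:]` implement teacher forcing: at every step the decoder receives the true previous token and is trained to predict the next one. A minimal sketch on a single sequence, with made-up token ids, shows the alignment.

```python
def teacher_forcing_pair(target_ids):
    """Split one padded target sequence into (decoder_input, decoder_target)."""
    decoder_input = target_ids[:-1]   # everything except the last position
    decoder_target = target_ids[1:]   # the same sequence shifted left by one
    return decoder_input, decoder_target


# Hypothetical ids: 1 = start token, 2 = end token, 0 = padding.
y = [1, 17, 42, 8, 2, 0, 0]
dec_in, dec_out = teacher_forcing_pair(y)
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, given, "->", expected)
```

At step 0 the decoder sees the start token and must predict the first real word; at the last real step it sees the final word and must predict the end token.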

Code

# Plot the training and validation loss per epoch
from matplotlib import pyplot
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()

Output
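The early-stopping rule used during training can be simulated on a list of validation losses. This is a simplified sketch of the patience logic only (no minimum delta, no weight restoring); `stopping_epoch` and the loss values are invented for illustration.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: remember it and reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


losses = [2.9, 2.6, 2.5, 2.55, 2.52, 2.4]
print(stopping_epoch(losses, patience=2))
```

With `patience=2`, training halts after two consecutive epochs without a new best validation loss, even though a later epoch in this toy list would have improved again.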

Let's build dictionaries that convert indices back to words for the target and source vocabularies

Code

# Creating reverse word index mappings for both source and target sequences  
word_index_reverse_target = tokenizer_y.index_word  # Maps token indices to words for the target sequence  
reverse_source_word_index = tokenizer_x.index_word  # Maps token indices to words for the source sequence  
word_index_targ = tokenizer_y.word_index  # Retrieves word-to-index mapping for target vocabulary  

# Defining the encoder model to obtain feature representations  
encoder_model = Model(inputs=inputs_encoder, outputs=[output_encoder, state_h, c_state])  

# Setting up the decoder  
# Input placeholders to store previous time step states  
inputh_decoder_state = Input(shape=(latent_dim,))  # Hidden state input  
inputc_decoder_state = Input(shape=(latent_dim,))  # Cell state input  
decoder_hidden_state_input = Input(shape=(mtl, latent_dim))  # Encoder output sequence (accepted as an input but not used by this attention-free decoder)  

# Obtaining word embeddings for the decoder input sequence  
emb_dec2 = emb_dec_layer(input_decoder)  

# Predicting the next word in the sequence using the previous state information  
output2_decoder, h2_state, c2_state = decoder_lstm(  
    emb_dec2, initial_state=[inputh_decoder_state, inputc_decoder_state]  
)  

# Applying a dense softmax layer to compute a probability distribution over the target vocabulary  
output2_decoder = dense_decoder(output2_decoder)  

# Constructing the final decoder model  
model_decoder = Model(  
    [input_decoder, decoder_hidden_state_input, inputh_decoder_state, inputc_decoder_state],  
    [output2_decoder, h2_state, c2_state]  
)  

# Below, we define a function that implements the inference process.
def decode_sequence(input_seq):
    """
    Generates a decoded output sequence from an input sequence using the trained encoder-decoder model.
    """

    # Encode the input sequence to obtain feature vectors and initial states.
    out_e, e_h, e_c = encoder_model.predict(input_seq)

    # Initialize target sequence with a single timestep.
    seq_targ = np.zeros((1, 1))

    # Set the first word of the target sequence as the start token.
    seq_targ[0, 0] = word_index_targ['sostok']

    cond_stop = False
    sentence_decoded = ''

    while not cond_stop:
        # Predict the next token using the decoder model.
        token_output, h, c = model_decoder.predict([seq_targ] + [out_e, e_h, e_c])

        # Get the token with the highest probability (greedy decoding).
        token_index_sampled = np.argmax(token_output[0, -1, :])
        token_sampled = word_index_reverse_target[token_index_sampled]

        # Append the token to the generated sequence if it's not the end token.
        # The tokenizer lowercases text, so the end token is stored as 'uostoke'.
        if token_sampled != 'uostoke':
            sentence_decoded += ' ' + token_sampled

        # Stop decoding if the end token is reached or the max sequence length is exceeded.
        if token_sampled == 'uostoke' or len(sentence_decoded.split()) >= (max_brief_intro_len - 1):
            cond_stop = True

        # Update the target sequence for the next iteration.
        seq_targ = np.zeros((1, 1))
        seq_targ[0, 0] = token_index_sampled

        # Update the decoder's internal states.
        e_h, e_c = h, c

    return sentence_decoded   

For the text and the summary, let's define functions that convert integer sequences back into word sequences.

Code

def seq2brief_intro(input_seq):
    """
    Converts a sequence of token indices into a readable brief introduction string,
    ignoring padding, start, and end tokens.
    """
    new_string = ''
    for i in input_seq:
        if i != 0 and i != word_index_targ['sostok'] and i != word_index_targ['uostoke']:
            new_string += word_index_reverse_target[i] + ' '
    return new_string.strip()


def seq2text(input_seq):
    """
    Converts a sequence of token indices into a readable text string, ignoring padding tokens.
    """
    new_string = ''
    for i in input_seq:
        if i != 0:
            new_string += reverse_source_word_index[i] + ' '
    return new_string.strip()


# To view the outcomes, run the model over the first 100 training examples.
for i in range(0, 100):
    print("Text:", seq2text(tr_x[i]))
    print("Original brief_intro:", seq2brief_intro(tr_y[i]))
    print("Predicted brief_intro:", decode_sequence(tr_x[i].reshape(1, mtl)))
    print("\n")   

Output
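The inference loop above is greedy decoding: at each step it takes the single most probable next token and feeds it back in. The sketch below isolates that loop with a hypothetical lookup table in place of the decoder network; the table, the `eostok` end marker and the function name are assumptions. A real LSTM decoder also conditions on its hidden state, not only on the previous token.

```python
def greedy_decode(next_token_probs, start="sostok", end="eostok", max_len=10):
    """Repeatedly pick the most probable next token until `end` or `max_len`."""
    token, output = start, []
    while len(output) < max_len:
        probs = next_token_probs(token)
        token = max(probs, key=probs.get)      # argmax over the candidate tokens
        if token == end:
            break
        output.append(token)
    return " ".join(output)


# Hypothetical stand-in for the decoder: a fixed table of next-token probabilities.
TABLE = {
    "sostok": {"markets": 0.6, "stocks": 0.4},
    "markets": {"rally": 0.7, "fall": 0.2, "eostok": 0.1},
    "rally": {"eostok": 0.8, "again": 0.2},
}

print(greedy_decode(lambda tok: TABLE[tok]))
```

Greedy decoding is simple and fast, but it commits to the locally best token at each step; beam search is a common alternative when summary quality matters more than speed.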
