Analytical Solutions in Machine Learning

February 3, 2025 | 10 min read

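Machine learning thrives on discovering patterns in data and making predictions, and in that sense it is usually practiced empirically. Most machine learning techniques rely on numerical methods and approximation to solve complex problems, but a subset of problems admit analytical solutions that can be derived directly from the underlying mathematical equations, without iterative algorithms or heavy computation. Understanding analytical solutions gives insight into the mathematical foundations of the field and helps bridge traditional statistical methods and modern computational ones.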

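An analytical solution is a closed-form expression that solves a mathematical problem directly. A numerical solution, by contrast, is the result of an iterative procedure that approximates the answer step by step, whereas an analytical solution yields the result in a finite number of steps. In machine learning, analytical solutions are usually derived from a well-defined mathematical model, and so they expose the underlying relationships between variables.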

Applications of Analytical Solutions in Machine Learning

Analytical solutions appear in several places, for example:

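Linear regression: One of the best-known classical analytical solutions in machine learning is linear regression. A linear regression model finds the best-fit line, the one that minimizes the squared difference between predicted and true values. Given an input feature matrix X and a target variable y, the normal equation gives an analytical solution for the regression coefficients: $\theta = (X^T X)^{-1} X^T y$. Here θ is the coefficient vector, X is the input feature matrix, and y is the target variable. Provided $X^T X$ is invertible, this equation gives the parameters exactly, with no iteration of the kind used by methods such as gradient descent.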
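Principal component analysis (PCA): PCA is a dimensionality-reduction technique that projects a dataset into a lower-dimensional space while retaining as much variance as possible. Its analytical solution comes from computing the eigenvectors and eigenvalues of the data's covariance matrix and selecting the eigenvectors that correspond to the largest eigenvalues. The principal components are then formed by a linear transformation onto those eigenvectors. This requires no iterative optimization, which makes PCA an example of an analytical solution.
Naive Bayes classifier: The naive Bayes classifier is based on Bayes' theorem and assumes that the features are conditionally independent given the class. Despite this very simple assumption, the posterior probability of each class has an analytical form, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$. A new data point can therefore be classified directly from the observed feature likelihoods of each class.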
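Exactness and interpretability: Analytical solutions give exact answers that are highly interpretable. In linear regression, for example, the coefficients obtained from the normal equation can be read directly as the relationship between each feature and the target variable.
Computational efficiency: Because analytical solutions do not depend on an iterative process, they avoid the cost of running many iterations to convergence and are very efficient, especially on small and medium-sized datasets. This matters when an exact answer is needed quickly and the overhead of an optimization algorithm would itself be significant.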
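
The normal equation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `fit_normal_equation` and the synthetic data are assumptions made for the example, and the system is solved with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return the least-squares coefficients theta solving (X^T X) theta = X^T y."""
    # Prepend a column of ones so that theta[0] acts as the intercept.
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    # Solve the linear system rather than computing the matrix inverse.
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Synthetic, noise-free data generated from y = 4 + 3x (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 2.0, size=(50, 1))
y_demo = 4.0 + 3.0 * X_demo[:, 0]

theta = fit_normal_equation(X_demo, y_demo)
print(theta)  # intercept and slope, recovered in a single step
```

Because the data are noise-free, the recovered intercept and slope match the generating values up to floating-point error, and no learning rate or iteration count has to be chosen.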

Importing Libraries

 
import os
import time
import math

import numpy as np  # linear algebra
import pandas as pd  # data processing and CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script.
%matplotlib inline
plt.style.use("fivethirtyeight")
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn import metrics

# Note: the CuDNN* layers were provided by older Keras releases and run only on a CUDA GPU.
from keras import backend as K
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Layer
from keras.layers import Input, Embedding, Dense, Dropout, SpatialDropout1D, Activation
from keras.layers import Reshape, Flatten, Concatenate, concatenate
from keras.layers import Conv1D, Conv2D, MaxPool2D
from keras.layers import Bidirectional, CuDNNLSTM, CuDNNGRU
from keras.layers import GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

print(os.listdir("../input"))

We need a few configuration values.

 
## configuration values
size_of_embed = 300  # dimensionality of each word vector
features_maximum = 120000  # number of unique words to use (the number of rows in the embedding matrix)
maximum_length = 70  # maximum number of words kept per question

Next we surround every punctuation mark with a space on each side, so that tokenization splits punctuation off from the neighbouring words.

 
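import string
import wordninja

# Assumption: the original punctuation list is not shown, so ASCII punctuation is used as a stand-in.
punctuations_list = list(string.punctuation)

def clean_text(x):
    # Surround each punctuation mark with spaces so it becomes its own token.
    x = str(x)
    for punct in punctuations_list:
        x = x.replace(punct, f' {punct} ')
    return x

def split_text(x):
    # Split run-together words (e.g. "helloworld") and rejoin the pieces with hyphens.
    x = wordninja.split(x)
    return '-'.join(x)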
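
To see what the punctuation spacing achieves, here is a small self-contained usage sketch. It assumes the standard ASCII punctuation set from `string.punctuation` as a stand-in for the article's punctuation list, and the sample sentence is made up for illustration.

```python
import string

# Assumption: ASCII punctuation stands in for the article's punctuation list.
punctuations_list = list(string.punctuation)

def clean_text(x):
    # Surround each punctuation mark with spaces.
    x = str(x)
    for punct in punctuations_list:
        x = x.replace(punct, f' {punct} ')
    return x

raw = "Is this real?Yes,it is."
tokens_before = raw.split()
tokens_after = clean_text(raw).split()

print(tokens_before)  # punctuation stays glued to the words
print(tokens_after)   # punctuation marks become separate tokens
```

Without the spacing step, a whitespace split leaves "real?Yes,it" as a single token that would not match any entry in a word-embedding vocabulary.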

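Now we prepare the text for the machine learning pipeline by converting the raw text into a format that can be used as model input, in particular for neural-network-based methods.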

 
def loading_and_preparing():
    train_dataframe = pd.read_csv("train.csv")
    test_dataframe = pd.read_csv("test.csv")
    
    train_dataframe["question_text"] = train_dataframe["question_text"].str.lower()
    test_dataframe["question_text"] = test_dataframe["question_text"].str.lower()
    
    train_dataframe["question_text"] = train_dataframe["question_text"].apply(lambda x: clean_text(x))
    test_dataframe["question_text"] = test_dataframe["question_text"].apply(lambda x: clean_text(x))
    
    print("Train shape : ",train_dataframe.shape)
    print("Test shape : ",test_dataframe.shape)
    
    ## split into training and validation sets (only 0.1% is held out for validation)
    train_dataframe, val_df = train_test_split(train_dataframe, test_size=0.001, random_state=2018)

    ## fill in the missing values
    X_train = train_dataframe["question_text"].fillna("_##_").values
    X_validation = val_df["question_text"].fillna("_##_").values
    X_test = test_dataframe["question_text"].fillna("_##_").values

    ## Tokenize the sentences
    tokenizer = Tokenizer(num_words=features_maximum)
    tokenizer.fit_on_texts(list(X_train))
    X_train = tokenizer.texts_to_sequences(X_train)
    X_validation = tokenizer.texts_to_sequences(X_validation)
    X_test = tokenizer.texts_to_sequences(X_test)

    ## Pad the sentences 
    X_train = pad_sequences(X_train, maxlen=maximum_length)
    X_validation = pad_sequences(X_validation, maxlen=maximum_length)
    X_test = pad_sequences(X_test, maxlen=maximum_length)

    ## Get the target values
    y_train = train_dataframe['target'].values
    y_val = val_df['target'].values  
    
    #shuffling the data
    np.random.seed(2018)
    trn_idx = np.random.permutation(len(X_train))
    val_idx = np.random.permutation(len(X_validation))

    X_train = X_train[trn_idx]
    X_validation = X_validation[val_idx]
    y_train = y_train[trn_idx]
    y_val = y_val[val_idx]    
    
    return X_train, X_validation, X_test, y_train, y_val, tokenizer.word_index   
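
The tokenize-then-pad steps can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras implementation: the helper names `build_word_index` and `texts_to_padded` are hypothetical, words are split on whitespace only, and padding and truncation are applied at the front of each sequence, which is the default behaviour of Keras `pad_sequences`.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Map the most frequent words to integer ids starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(num_words))}

def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to fixed-length id lists, padding and truncating at the front."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[-maxlen:]                              # keep the last maxlen ids
        padded.append([0] * (maxlen - len(ids)) + ids)   # left-pad with zeros
    return padded

texts = ["how are you", "how old are you today"]
word_index = build_word_index(texts, num_words=10)
sequences = texts_to_padded(texts, word_index, maxlen=4)
print(sequences)
```

Every question ends up as a list of exactly `maxlen` integers, which is what lets a batch of questions be stacked into a single array for the embedding layer.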

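Next, we load pre-trained word embeddings to plug into the model, which is particularly useful for NLP tasks. They let the model draw on prior knowledge about word meanings and the relationships between words, which tends to improve performance and speed up convergence during training.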

 
def load_glove(word_index):
    FILE_EMBEDDING= '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    index_embeddings = dict(get_coefs(*o.split(" ")) for o in open(FILE_EMBEDDING))

    all_the_embedding = np.stack(list(index_embeddings.values()))
    emb_mean,emb_std = -0.005838499,0.48782197
    size_of_embed = all_the_embedding.shape[1]

    # word_index = tokenizer.word_index
    words_nb = min(features_maximum, len(word_index))
    matrix_embedding = np.random.normal(emb_mean, emb_std, (words_nb, size_of_embed))
    for word, i in word_index.items():
        if i >= features_maximum: continue
        vector_embedding = index_embeddings.get(word)
        if vector_embedding is not None: matrix_embedding[i] = vector_embedding
            
    return matrix_embedding 
    
def load_fasttext(word_index):    
    FILE_EMBEDDING= '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    index_embeddings = dict(get_coefs(*j.split(" ")) for j in open(FILE_EMBEDDING) if len(j)>100)

    all_the_embedding = np.stack(list(index_embeddings.values()))
    emb_mean,emb_std = all_the_embedding.mean(), all_the_embedding.std()
    size_of_embed = all_the_embedding.shape[1]

    # word_index = tokenizer.word_index
    words_nb = min(features_maximum, len(word_index))
    matrix_embedding = np.random.normal(emb_mean, emb_std, (words_nb, size_of_embed))
    for word, i in word_index.items():
        if i >= features_maximum: continue
        vector_embedding = index_embeddings.get(word)
        if vector_embedding is not None: matrix_embedding[i] = vector_embedding

    return matrix_embedding

def load_para(word_index):
    FILE_EMBEDDING= '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
    index_embeddings = dict(get_coefs(*j.split(" ")) for j in open(FILE_EMBEDDING, encoding="utf8", errors='ignore') if len(j)>100)

    all_the_embedding = np.stack(list(index_embeddings.values()))
    emb_mean,emb_std = -0.0053247833,0.49346462
    size_of_embed = all_the_embedding.shape[1]
    print(emb_mean,emb_std,"para")

    # word_index = tokenizer.word_index
    words_nb = min(features_maximum, len(word_index))
    matrix_embedding = np.random.normal(emb_mean, emb_std, (words_nb, size_of_embed))
    for word, i in word_index.items():
        if i >= features_maximum: continue
        vector_embedding = index_embeddings.get(word)
        if vector_embedding is not None: matrix_embedding[i] = vector_embedding
    
    return matrix_embedding   
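
The core idea shared by the three loaders, random initialization followed by overwriting the rows of known words, can be tried on toy data. In this sketch the function name `build_embedding_matrix`, the three-dimensional toy vectors and the extra row reserved for the padding index 0 are all assumptions made for illustration; the loaders above size the matrix slightly differently.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, max_features, seed=0):
    """Copy pre-trained vectors for known words; leave the remaining rows random."""
    vectors = np.stack(list(pretrained.values()))
    emb_mean, emb_std = vectors.mean(), vectors.std()
    dim = vectors.shape[1]
    # Assumption: row 0 is reserved for padding, so one extra row is allocated.
    n_rows = min(max_features, len(word_index) + 1)
    rng = np.random.default_rng(seed)
    matrix = rng.normal(emb_mean, emb_std, (n_rows, dim))
    for word, i in word_index.items():
        if i >= n_rows:
            continue  # word falls outside the vocabulary limit
        vector = pretrained.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Toy stand-ins for a real embedding file and a tokenizer word index.
toy_pretrained = {
    "cat": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "dog": np.array([0.0, 1.0, 0.0], dtype="float32"),
}
toy_word_index = {"cat": 1, "dog": 2, "zebra": 3}

toy_matrix = build_embedding_matrix(toy_word_index, toy_pretrained, max_features=10)
print(toy_matrix.shape)
```

Words missing from the pre-trained file ("zebra" here) keep a random vector drawn with the same mean and standard deviation as the known vectors, so their scale matches that of the rest of the matrix.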

CNN Model

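This CNN is designed for binary text classification. Convolutional layers with several kernel heights extract features from n-grams of different lengths. The features are then max-pooled and passed to a fully connected layer that predicts a binary output (for example, whether a question is sincere or not). The embedding layer is initialized with pre-trained embeddings, which gives the model a good grasp of word semantics from the start and improves performance and convergence.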

 
def cnn_model(matrix_embedding):
    filter_sizes = [1,2,3,5]
    num_filters = 36

    inp = Input(shape=(maximum_length,))
    x = Embedding(features_maximum, size_of_embed, weights=[matrix_embedding])(inp)
    x = Reshape((maximum_length, size_of_embed, 1))(x)

    maxpool_pool = []
    for i in range(len(filter_sizes)):
        conv = Conv2D(num_filters, kernel_size=(filter_sizes[i], size_of_embed),
                                     kernel_initializer='he_normal', activation='elu')(x)
        maxpool_pool.append(MaxPool2D(pool_size=(maximum_length - filter_sizes[i] + 1, 1))(conv))

    z = Concatenate(axis=1)(maxpool_pool)   
    z = Flatten()(z)
    z = Dropout(0.1)(z)

    outp = Dense(1, activation="sigmoid")(z)

    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model   
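
To make the n-gram convolution and max-pooling shapes concrete, the following NumPy sketch slides a single kernel of each size over one random "sentence". It is a simplified illustration under stated assumptions: the helper `ngram_max_pool` is hypothetical, the input is random rather than real embeddings, and the bias term and ELU activation of the Keras layer are omitted.

```python
import numpy as np

def ngram_max_pool(embedded, kernel):
    """Slide one kernel over the sentence and max-pool the resulting activations.

    embedded: (length, embed_dim) word vectors; kernel: (k, embed_dim) filter.
    Returns the pooled value and the number of window positions.
    """
    k = kernel.shape[0]
    n_windows = embedded.shape[0] - k + 1
    # One activation per window of k consecutive word vectors.
    activations = np.array(
        [np.sum(embedded[i:i + k] * kernel) for i in range(n_windows)]
    )
    return activations.max(), n_windows

rng = np.random.default_rng(0)
sentence = rng.normal(size=(70, 300))  # maximum_length x size_of_embed, random stand-in

window_counts, pooled_features = [], []
for k in [1, 2, 3, 5]:  # same filter sizes as cnn_model
    pooled, n_windows = ngram_max_pool(sentence, rng.normal(size=(k, 300)))
    window_counts.append(n_windows)
    pooled_features.append(pooled)

print(window_counts)  # [70, 69, 68, 66]
```

Each kernel of height k produces maximum_length - k + 1 activations, which is why the pooling window in `cnn_model` has that height. With 36 filters for each of the four sizes, the flattened vector fed to the dense layer has 4 × 36 = 144 features.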

Attention Layer

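In essence, the attention layer computes a weighted sum of its inputs using learned attention scores, so the model can focus on the parts of the sequence that are relevant to the task. This is especially useful in tasks such as natural language processing, where not all words are equally important.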

 
class Attention(Layer):
    def __init__(self, step_dim,
                 Weighted_regularizer=None, bias_regularizer=None,
                 W_constraint=None, bias_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.Weighted_regularizer = regularizers.get(Weighted_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.Weighted_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.bias_regularizer,
                                     constraint=self.bias_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        input_weighted = x * a
        return K.sum(input_weighted, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim   
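
The arithmetic inside `call` can be reproduced in plain NumPy, which makes the weighted sum easier to inspect. This is a sketch of the same computation without masking; the function name `attention_pool` and the random inputs are assumptions for illustration, and `1e-7` stands in for the default value of `K.epsilon()`.

```python
import numpy as np

def attention_pool(x, W, b):
    """Attention-weighted sum over the time axis.

    x: (batch, steps, features); W: (features,); b: (steps,).
    Returns an array of shape (batch, features).
    """
    eij = np.tanh(x @ W + b)                        # (batch, steps) raw scores
    a = np.exp(eij)
    a = a / (a.sum(axis=1, keepdims=True) + 1e-7)   # normalise over the steps
    return (x * a[..., None]).sum(axis=1)           # weighted sum of the steps

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(2, 5, 4))

# With zero weights every step receives the same score, so the layer
# reduces to a plain average over the time axis.
pooled = attention_pool(x_demo, W=np.zeros(4), b=np.zeros(5))
print(pooled.shape)
```

Training moves `W` and `b` away from zero, so steps whose features align with `W` receive larger weights than the uniform average used in this check.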

LSTM

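This model is designed for binary classification of sequence data, such as the sentiment of a sentence or whether a question is sincere. The bidirectional LSTM layers extract context from both directions of the sequence, and the attention mechanism helps the model focus on the most informative parts of the sequence, which can improve the final prediction.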

 
def lstm_atten_model(matrix_embedding):
    inp = Input(shape=(maximum_length,))
    x = Embedding(features_maximum, size_of_embed, weights=[matrix_embedding], trainable=False)(inp)
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(64, return_sequences=True))(x)
    x = Attention(maximum_length)(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model   

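The GRU layer captures temporal dependencies in the sequence, and the added attention mechanism lets the model focus on the parts of the sequence that matter for classification accuracy.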

 
def gru_ap_atten_model(matrix_embedding):
    inp = Input(shape=(maximum_length,))
    x = Embedding(features_maximum, size_of_embed, weights=[matrix_embedding])(inp)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    x = Attention(maximum_length)(x) # New
    x = Dense(16, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model       

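A bidirectional GRU combined with global pooling merges information about long-range dependencies with the most salient features of the sequence before making a prediction. Note that, despite its name, lstm_du_model uses a GRU layer.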

 
def lstm_du_model(matrix_embedding):
    inp = Input(shape=(maximum_length,))
    x = Embedding(features_maximum, size_of_embed, weights=[matrix_embedding])(inp)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = concatenate([avg_pool, max_pool])
    conc = Dense(64, activation="relu")(conc)
    conc = Dropout(0.1)(conc)
    outp = Dense(1, activation="sigmoid")(conc)
    
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model   
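
The pooling step in this model can be illustrated with NumPy alone. The sketch below uses a random array as a stand-in for the bidirectional GRU output (64 units per direction, concatenated to 128 features per step); the batch size of 2 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the recurrent output: (batch, steps, 2 * 64 features).
hidden = rng.normal(size=(2, 70, 128))

avg_pool = hidden.mean(axis=1)   # global average pooling over the time steps
max_pool = hidden.max(axis=1)    # global max pooling over the time steps
conc = np.concatenate([avg_pool, max_pool], axis=1)

print(conc.shape)  # (2, 256)
```

Average pooling summarizes the whole sequence, while max pooling keeps the strongest activation of each feature, so the concatenated vector carries both views into the dense layers.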

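Because this model stacks several bidirectional GRU layers and applies attention on top, it can learn complex patterns while giving greater weight to the key information in the input sequence, which can improve predictive performance.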

 
def gru_atten_3_model(matrix_embedding):
    inp = Input(shape=(maximum_length,))
    x = Embedding(features_maximum, size_of_embed, weights=[matrix_embedding], trainable=False)(inp)
    x = Bidirectional(CuDNNGRU(128, return_sequences=True))(x)
    x = Bidirectional(CuDNNGRU(100, return_sequences=True))(x)
    x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
    x = Attention(maximum_length)(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model   

Training and Prediction

 
def train_pred(model, epochs=2):
    for e in range(epochs):
        model.fit(X_train, y_train, batch_size=512, epochs=1, validation_data=(X_validation, y_val))
        prediction_validation_y = model.predict([X_validation], batch_size=1024, verbose=0)
    prediction_test_y = model.predict([X_test], batch_size=1024, verbose=0)
    # Plot training & validation accuracy values
#     plt.plot(model.history.history['acc'])
#     plt.plot(model.history.history['val_acc'])
#     plt.title('Model accuracy')
#     plt.ylabel('Accuracy')
#     plt.xlabel('Epoch')
#     plt.legend(['Train', 'Test'], loc='upper left')
#     plt.show()

#     # Plot training & validation loss values
#     plt.plot(model.history.history['loss'])
#     plt.plot(model.history.history['val_loss'])
#     plt.title('Model loss')
#     plt.ylabel('Loss')
#     plt.xlabel('Epoch')
#     plt.legend(['Train', 'Test'], loc='upper left')
#     plt.show()
    return prediction_validation_y, prediction_test_y   

 
X_train, X_validation, X_test, y_train, y_val, word_index = loading_and_preparing()
vocab = []
for w,k in word_index.items():
    vocab.append(w)
    if k >= features_maximum:
        break
matrix_embedding_1 = load_glove(word_index)
# matrix_embedding_2 = load_fasttext(word_index)
matrix_embedding_3 = load_para(word_index)   

 
# Averaging is used here as a meta-embedding technique: it has been argued to be valid, with experimental performance very close to, or in some cases better than, concatenation, and with the added benefit of reduced dimensionality.

# The weakness of concatenating the embeddings and feeding them to an RNN encoder is that the network becomes relatively inefficient as more embeddings are combined.
  
# matrix_embedding = np.mean([matrix_embedding_1, matrix_embedding_2, matrix_embedding_3], axis = 0)
matrix_embedding = np.mean([matrix_embedding_1, matrix_embedding_3], axis = 0)
np.shape(matrix_embedding)   
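
The difference between averaging and concatenating embedding matrices is easiest to see on tiny arrays. The two 2×2 matrices below are made-up values used only to show the resulting shapes.

```python
import numpy as np

# Two toy embedding matrices for the same 2-word vocabulary (illustrative values).
emb_a = np.array([[1.0, 2.0], [3.0, 4.0]])
emb_b = np.array([[3.0, 0.0], [1.0, 2.0]])

averaged = np.mean([emb_a, emb_b], axis=0)             # element-wise mean, same shape
concatenated = np.concatenate([emb_a, emb_b], axis=1)  # width grows with each source

print(averaged.shape, concatenated.shape)  # (2, 2) (2, 4)
```

Averaging keeps the embedding width fixed however many sources are combined, whereas concatenation widens the input to every downstream layer.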

 
outputs = []  # collects [validation predictions, test predictions, model label] for each model
prediction_validation_y, prediction_test_y = train_pred(gru_atten_3_model(matrix_embedding), epochs = 3)
outputs.append([prediction_validation_y, prediction_test_y, '3 GRU w/ atten'])

 
prediction_validation_y, prediction_test_y = train_pred(gru_ap_atten_model(matrix_embedding), epochs = 3)
outputs.append([prediction_validation_y, prediction_test_y, 'gru atten ap'])   

 
prediction_validation_y, prediction_test_y = train_pred(cnn_model(matrix_embedding_1), epochs = 3) # GloVe only
outputs.append([prediction_validation_y, prediction_test_y, '2d CNN GloVe'])   

 
prediction_validation_y, prediction_test_y = train_pred(lstm_du_model(matrix_embedding), epochs = 3)
outputs.append([prediction_validation_y, prediction_test_y, 'LSTM DU'])   

 
prediction_validation_y, prediction_test_y = train_pred(lstm_atten_model(matrix_embedding), epochs = 3)
outputs.append([prediction_validation_y, prediction_test_y, '2 LSTM w/ attention'])   

 
prediction_validation_y, prediction_test_y = train_pred(lstm_atten_model(matrix_embedding_1), epochs = 3) # Only GloVe
outputs.append([prediction_validation_y, prediction_test_y, '2 LSTM w/ attention GloVe'])   

 
prediction_validation_y, prediction_test_y = train_pred(lstm_atten_model(matrix_embedding_3), epochs = 3) # Only Para
outputs.append([prediction_validation_y, prediction_test_y, '2 LSTM w/ attention Para'])   
