Protein Folding in Machine Learning

March 17, 2025 | 21 min read

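Proteins are like the superheroes of our bodies: they play a vital role in supporting the function of our tissues, organs, and body-wide processes. These molecules are built from 20 different building blocks called amino acids. Our bodies contain an enormous variety of proteins, each with its own unique sequence that ranges from a few dozen to several thousand amino acids.

The most interesting part is that the specific sequence of amino acids in a protein acts like a secret code that determines its superpower, that is, its function. The sequence determines the protein's 3D structure and how it behaves under different conditions, and that 3D structure in turn defines the protein's role in biological processes. So this is no ordinary code; it is a blueprint that shapes the protein's form and unlocks its function, which makes it an essential part of how our bodies work.

But that is not all. Let's take a closer look at protein folding. A protein starts out as a long chain of amino acids, and its 3D structure is the key to what it can do. Folding is like an intricate dance in which the chain folds, gracefully and precisely, into its functional shape. Only then does the protein take on its true identity and gain the abilities it needs to do its job in the body.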

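Understanding protein folding is no easy task, however. The process is enormously complex and a protein chain can adopt a vast number of conformations, which makes folding a hard puzzle to solve. Scientists keep working on it because folding is directly tied to predicting protein structure, and structure prediction has far-reaching implications for drug discovery, disease research, and bioengineering.

Machine learning algorithms can be trained on existing protein folding data to learn the patterns and relationships between protein sequences and their corresponding structures. The trained algorithms can then predict the structure of a new protein from its amino acid sequence. By analyzing large datasets of known protein structures, machine learning models can uncover hidden patterns and principles that govern protein folding.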

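Advantages of Machine Learning in Protein Folding

Here are some of the advantages of machine learning for understanding protein folding:

In recent years, machine learning methods have proven very useful in protein folding research. They let researchers study the connection between a protein's structure and its specific function, particularly for disease-related proteins. That information helps explain the molecular basis of various diseases and opens the way to targeted, effective treatment strategies.
Machine learning has become a powerful ally in protein research. Drawing on extensive protein folding data, scientists can now use machine learning models to predict a protein's 3D structure from its amino acid sequence alone. Compared with the traditional approach of determining structures through time-consuming and costly experiments, this is a major advance.
Knowing a protein's 3D shape is crucial in drug discovery. Creating a new drug requires identifying proteins that the drug can interact with and whose function it can alter. With machine learning, researchers can predict protein structures, use those predictions to find potential drug targets, and design new drugs that interact with these proteins effectively.
Combining protein engineering with machine learning has great potential in biotechnology and synthetic biology. Engineered proteins find applications in enzyme production, where they catalyze essential reactions; in biofuel production, where they can offer greener and more sustainable alternatives; and in bioremediation (using organisms to clean up pollutants).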

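Disadvantages of Using Machine Learning to Predict Protein Folding

Although predicting protein folding with machine learning has many advantages, the approach also comes with challenges and drawbacks, namely:

Protein folding is a highly complex process that involves many interactions and shapes. Predicting a protein's 3D structure from its amino acid sequence is a challenging and resource-intensive task. It demands substantial computing power and can lead to long processing times, especially for larger protein sequences.
Despite significant progress, current machine learning models for protein folding still face accuracy limits. The complexity of protein structures and the huge range of possible conformations make it difficult to predict folding patterns precisely. Machine learning models provide valuable insight, but experimental methods such as X-ray crystallography and NMR spectroscopy remain essential for obtaining highly accurate protein structures.
Even a small change in the amino acid sequence can change how a protein folds. Machine learning models may struggle to capture this inherent biological variability, which leads to differences between predicted structures and experimental observations.
Proteins are highly dynamic and can take on different shapes depending on their environment and their interactions with other molecules. Incorporating this dynamic information into machine learning models for protein folding is a difficult task.
Machine learning models sometimes overfit, especially when trained on small datasets. In protein folding prediction, an overfitted model performs well on the training data but loses accuracy on new, unseen protein sequences. Making models robust enough to generalize across different protein structures remains a major challenge in the field.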

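Protein Folding Prediction in Machine Learning Using Python

About the Dataset

The dataset contains protein information retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB). The PDB archive is a large collection of data that includes atomic coordinates and other details of proteins and other important biological macromolecules. To determine the position of each atom in a molecule, structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Once they have this information, they deposit it in the archive, where the wwPDB annotates it and makes it publicly available.

The PDB archive keeps growing as research progresses in laboratories around the world, which makes it an exciting resource for researchers and educators. It provides structures for many of the proteins and nucleic acids involved in key life processes, including ribosomes, oncogenes, drug targets, and even whole viruses. Because the database is so large, however, navigating it and finding specific information can be difficult. Often several structures are available for one molecule, or a structure is incomplete, modified, or different from its native form.

Despite these challenges, the PDB archive remains a valuable data source for the scientific community, with a wealth of information on the structures of many biomolecules. Researchers and educators can explore it to gain insight into proteins and other macromolecules and to support progress in structural biology.

Content

There are two data files. Both are organized by the protein's "structureId".

pdb_data_no_dups.csv contains protein metadata, including details such as the protein classification and extraction method.
data_seq.csv contains more than 400,000 protein structure sequences.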
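Because both files share the structureId key, they can be joined into one record per sequence row. The following is a minimal sketch that uses only Python's standard library. Apart from structureId, the column names (classification, sequence) and all sample values are assumptions made for illustration, so check them against the real file headers before relying on them.

```python
import csv
import io

# Tiny in-memory stand-ins for the two CSV files. Only "structureId" is
# taken from the text; the other column names and all values are assumed.
META_CSV = """structureId,classification
1ABC,HYDROLASE
2XYZ,TRANSFERASE
"""
SEQ_CSV = """structureId,sequence
1ABC,MKTAYIAK
2XYZ,GSHMLEDP
3QQQ,AAAA
"""

def join_on_structure_id(meta_file, seq_file):
    """Inner-join metadata rows and sequence rows on structureId."""
    meta = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = []
    for row in csv.DictReader(seq_file):
        key = row["structureId"]
        if key in meta:  # skip sequences that have no metadata row
            joined.append({**meta[key], **row})
    return joined

records = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
for rec in records:
    print(rec["structureId"], rec["classification"], len(rec["sequence"]))
```

With the real files, the two io.StringIO objects would be replaced by open file handles. An inner join is used here so that every record has both metadata and a sequence; a different join may suit other analyses better.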

Now we will try to build a model that predicts protein structure.

Code

Importing Libraries

import random
import os
import torch
from torch import nn, einsum

import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import gc

from einops import rearrange, repeat, reduce
from einops.layers.torch import Rearrange
from inspect import isfunction

import sidechainnet as scn
from sidechainnet.examples import losses, models
from sidechainnet.structure.structure import inverse_trig_transform
from sidechainnet.structure.build_info import NUM_ANGLES
import py3Dmol

seed = 0

random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

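We use SidechainNet to train a machine learning model that predicts protein structure (angles or coordinates) from a given amino acid sequence. The examples here are close to the minimum needed for complete model training.

By default, the code trains on the debug dataset. You are free to modify the scn.load call and choose another SidechainNet dataset, such as CASP12, for further experiments and training.

Here we train two small networks that predict a protein's angle representation from its amino acid sequence. Although this setup is often described as a recurrent neural network (RNN), the Net_Protein class defined below contains no recurrent layer: it embeds the input, applies a feed-forward block, and then applies a self-attention layer.

The sequence + PSSM Net_Protein model takes the amino acid sequence (as one-hot vectors), the position-specific scoring matrix (PSSM), and the information content as input.
The sequence-only Net_Protein model takes the amino acid sequence as a tensor of integers.

The network processes the amino acid sequence and produces a vector of angles for every amino acid. While other models use only 3 angles, here we can predict all 12 angles that SidechainNet provides.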
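The two input encodings mentioned above, integers and one-hot vectors, can be illustrated with a short standard-library sketch. The alphabet order used here is an assumption for illustration only, since SidechainNet defines its own residue-to-index mapping; the padding value 20 matches the one used later in this tutorial.

```python
# The alphabet order is an assumption for illustration; SidechainNet
# defines its own residue-to-index mapping.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = 20  # integer used for padding, as in this tutorial

def to_int(seq, length):
    """Integer-encode a sequence and right-pad it to `length` with PAD."""
    ids = [AMINO_ACIDS.index(aa) for aa in seq]
    return ids + [PAD] * (length - len(ids))

def to_one_hot(ids):
    """One-hot encode integer ids; padding positions become all-zero rows."""
    return [[1 if i == j else 0 for j in range(20)] for i in ids]

batch = ["MKV", "ACDEF"]
max_len = max(len(s) for s in batch)
int_batch = [to_int(s, max_len) for s in batch]
one_hot_batch = [to_one_hot(ids) for ids in int_batch]

# Sequence lengths can be recovered by counting non-padding positions.
lengths = [sum(1 for i in ids if i != PAD) for ids in int_batch]
print(int_batch)
print(lengths)
```

The shorter sequence is padded up to the length of the longest one in the batch, which is why the model later needs a way to recover each sequence's true length.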

Data Access with PyTorch

When you request DataLoaders, you receive a dictionary that maps each split name to its data loader.

#Prepare the data in a suitable format for training.
load_data = scn.load(
             with_pytorch="dataloaders",
             batch_size=4, 
             dynamic_batching=False)
print("Available Dataloaders =", list(load_data.keys()))

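When batches are yielded, each DataLoader returns a namedtuple object called Batch with the following attributes:

pids: a tuple of the ProteinNet/SidechainNet IDs of the proteins in this batch.
seqs: a tensor of encoded sequences, represented as integers or one-hot vectors depending on the scn.load(...seq_as_onehot) argument.
msks: a tensor of missing-residue masks, which may overlap with the padding in the data.
evos: a tensor of PSSMs (position-specific scoring matrices) plus information content.
secs: a tensor of secondary structure, represented as integers or one-hot vectors depending on the scn.load(...seq_as_onehot) argument.
angs: a tensor of angles.
crds: a tensor of coordinates.
ress: a tuple of X-ray crystallographic resolutions.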

batch = next(iter(load_data['train']))
print("Protein IDs\n   ", batch.pids)
print("Sequences\n   ", batch.seqs.shape)
print("Evolutionary Data\n   ", batch.evos.shape)
print("Secondary Structure\n   ", batch.secs.shape)
print("Angle Data\n   ", batch.angs.shape)
print("Coordinate Data\n   ", batch.crds.shape)
print("X-ray Resolution\n   ", batch.resolutions)
print("Concatenated Data (seq/evo/2ndary)\n   ", batch.seq_evo_sec.shape)
print("Integer sequence")
print("\tShape:", batch.int_seqs.shape)
print("\tEx:", batch.int_seqs[0,:3])

print("1-hot sequence")
print("\tShape:", batch.seqs.shape)
print("\tEx:\n", batch.seqs[0,:3])

# In the default integer sequence representation, padding uses the integer value 20. For instance, the last 15 amino-acid "characters" of sequence #1 end in a run of 20s, which pad the sequence to the length of the longest sequence in the batch.

example = 0 # 308, note many indices point to structures that have gaps, and thus cannot be visualized/constructed from angles
seq, ang, crd, mask, sec = ( batch.str_seqs[example],batch.angs[example], 
                             batch.crds[example], batch.msks[example], 
                             batch.secs[example]
                            )
name = batch.pids[example]

print(f"\nExample using {name}.\n")
print(f"Sequence, Mask, and Secondary Structure:\n{seq}\n{mask}\n{sec}\n")
print(f"Angles:\n{ang[:3]} ...\n")
print(f"Coordinates:\n{crd[:3]} ...\n")


#sb = scn.StructureBuilder(seq,crd)
#sb.to_3Dmol(width=600, height=300)
# If you want to train with a GPU, navigate to Runtime > Change runtime type
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using {device} for training.")

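Helper Functions

Helper functions are small, reusable pieces of code that each perform one specific task. They simplify complex operations, improve readability, and avoid duplicated code. Breaking complex tasks into smaller, manageable units keeps the main code organized and easier to maintain.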

# helpers

def exists(val):
    return val is not None

def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d

def cast_tuple(val, depth = 1):
    return val if isinstance(val, tuple) else (val,) * depth

def init_zero_(layer):
    nn.init.constant_(layer.weight, 0.)
    if exists(layer.bias):
        nn.init.constant_(layer.bias, 0.)
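To make the behavior of these helpers concrete, here is a short usage sketch. It repeats the three pure-Python helpers so that it runs on its own; the argument values are arbitrary examples.

```python
from inspect import isfunction

def exists(val):
    return val is not None

def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d

def cast_tuple(val, depth=1):
    return val if isinstance(val, tuple) else (val,) * depth

# A missing value (None) falls back to the default.
heads = default(None, 8)
# If the default is a function, it is called only when it is needed.
context = default(None, lambda: [0.0] * 3)
# A value that is present is returned unchanged, even if it is falsy.
kept = default(0, 8)
# A scalar is repeated `depth` times; an existing tuple is left alone.
dims = cast_tuple(64, depth=3)
same = cast_tuple((1, 2), depth=3)

print(heads, context, kept, dims, same)
```

Note that default tests for None rather than truthiness, so a legitimate value such as 0 is kept instead of being replaced.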

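Attention Layer

Attention layers matter in deep learning models because they help the model focus on the most relevant parts of the data. They work a little like human attention: while learning, some things deserve more weight than others.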
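The core computation can be sketched in a few lines of plain Python. This is a simplified, single-head illustration of scaled dot-product attention with a key mask, written for this explanation; it leaves out the learned projections, multiple heads, gating, and dropout that a full PyTorch layer would add.

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, key_mask=None):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    scale = d ** -0.5
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        if key_mask is not None:
            # Masked keys get a very negative score, so softmax gives them 0.
            scores = [s if keep else -1e30 for s, keep in zip(scores, key_mask)]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(wi * v[c] for wi, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Self-attention over three 2-dimensional vectors; the third one is masked
# out, for example because it is padding.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = attention(x, x, x, key_mask=[True, True, False])
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of attn sums to 1 and gives no weight to the masked position, which is how padded residues are kept from influencing the real ones.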

class Attention(nn.Module):
    def __init__(
        self,
        dim,
        len_seq = None,
        heads = 8,
        dim_head = 64,
        dropout = 0.0,
        gating = True
    ):
        super().__init__()
        inner_dim = dim_head * heads
        self.len_seq = len_seq
        self.heads= heads
        self.scale = dim_head ** -0.5

        self.to_q = nn.Linear(dim, inner_dim, bias = False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias = False)
        self.to_out = nn.Linear(inner_dim, dim)

        self.gating = nn.Linear(dim, inner_dim)
        nn.init.constant_(self.gating.weight, 0.)
        nn.init.constant_(self.gating.bias, 1.)

        self.dropout = nn.Dropout(dropout)
        init_zero_(self.to_out)

    def forward(self, x, mask = None, attn_bias = None, context = None, mask_context = None, tie_dim = None):
        device, h, has_context = x.device, self.heads, exists(context)
        context = default(context, x)

        # project to queries, keys and values, then split out the heads
        q, k, v = (self.to_q(x), *self.to_kv(context).chunk(2, dim = -1))
        j = k.shape[-2]
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k, v))

        # scale
        q = q * self.scale

        # query / key similarities
        if exists(tie_dim):
            # In accordance with the paper, for the additional multiple sequence alignments (MSAs)
            # the queries are averaged along the rows of the MSA.
            # The paper refers to this module as MSAColumnGlobalAttention.
            q, k = map(lambda t: rearrange(t, '(b r) ... -> b r ...', r = tie_dim), (q, k))
            q = q.mean(dim = 1)

            dots = einsum('b h i d, b r h j d -> b r h i j', q, k)
            dots = rearrange(dots, 'b r ... -> (b r) ...')
        else:
            dots = einsum('b h i d, b h j d -> b h i j', q, k)

        # If provided, add the attention bias to enable communication from pairwise to MSA attention.
        if exists(attn_bias):
            dots = dots + attn_bias

        # masking: positions where the mask is False receive a large negative score
        if exists(mask):
            mask_context = mask if not has_context else default(mask_context, lambda: torch.ones(1, j, device = device))
            mask_value = -torch.finfo(dots.dtype).max
            mask = mask[:, None, :, None].bool() & mask_context[:, None, None, :].bool()
            dots = dots.masked_fill(~mask, mask_value)

        # attention (subtract the row maximum for numerical stability)
        dots = dots - dots.max(dim = -1, keepdim = True).values
        attn = dots.softmax(dim = -1)
        attn = self.dropout(attn)
        # aggregate
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        # merge heads
        out = rearrange(out, 'b h n d -> b n (h d)')
        # gating
        gates = self.gating(x)
        out = out * gates.sigmoid()
        # combine to out
        out = self.to_out(out)
        return out

class Net_Protein(nn.Module):
    """A model for predicting protein angles from integer-encoded sequences or from per-residue feature vectors."""
    def __init__(self,
                 d_hidden,
                 dim,
                 d_in=21,
                 d_embedding=32,
                 heads = 8,
                 integer_sequence=True,
                 n_angles=scn.structure.build_info.NUM_ANGLES):
        
        super(Net_Protein, self).__init__()
        # Width of the hidden layer in the feed-forward block (this model has no RNN)
        self.d_hidden = d_hidden
      
        self.attn = Attention(dim = dim, heads = heads)
        # Output vector dimensionality (per amino acid)
        self.d_out = n_angles * 2
        # Output projection block (embedding -> hidden -> target tensor)
        self.hidden2out = nn.Sequential(
                            nn.Linear(d_embedding, d_hidden),
                            nn.GELU(),
                            nn.Linear(d_hidden, self.d_out)
                                    )
        self.out2attn = nn.Linear(self.d_out, dim)
        self.final = nn.Sequential(
                            nn.GELU(),
                            nn.Linear(dim, self.d_out))
        self.norm_0 = nn.LayerNorm([dim])
        self.norm_1 = nn.LayerNorm([dim])
        self.activation_0 = nn.GELU()
        self.activation_1 = nn.GELU()

        # The activation function used for the output values is designed to bind them within the range of [-1, 1].                            
        self.output_activation = torch.nn.Tanh()

        # The way we embed the input of our model varies depending on the type of input it receives.
        self.integer_sequence = integer_sequence
        if self.integer_sequence:
            self.input_embedding = torch.nn.Embedding(d_in, d_embedding, padding_idx=20)
        else:
            self.input_embedding = torch.nn.Linear(d_in, d_embedding)
    def get_lengths(self, sequence):
        """Calculate the lengths of each sequence in the batch."""
        if self.integer_sequence:
            lengths = sequence.shape[-1] - (sequence == 20).sum(axis=1)
        else:
            lengths = sequence.shape[1] - (sequence == 0).all(axis=-1).sum(axis=1)
        return lengths.cpu()

    def forward(self, sequence, mask=None):
        """Perform a single forward pass of the model."""
        # Compute the unpadded length of each sequence in the batch.
        lengths = self.get_lengths(sequence)

        # Embed the input tensors, giving (batch, length, d_embedding).
        sequence = self.input_embedding(sequence)

        # Pack and immediately unpack the batch. No recurrent layer runs in
        # between, so the round trip only zeroes the embeddings at padded
        # positions and trims the batch to its longest sequence.
        sequence = torch.nn.utils.rnn.pack_padded_sequence(sequence,
                                                         lengths,
                                                         batch_first=True,
                                                         enforce_sorted=False)
        output, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(sequence,
                                                                      batch_first=True)

        # Feed-forward projection to (batch, length, d_out), then up to the
        # attention width `dim`.
        output = self.hidden2out(output)
        output = self.out2attn(output)
        output = self.activation_0(output)
        output = self.norm_0(output)

        # Self-attention over the residues, followed by the final projection
        # back to (batch, length, d_out); d_out is 24 for 12 angles.
        output = self.attn(output, mask=mask)
        output = self.activation_1(output)
        output = self.norm_1(output)
        output = self.final(output)

        # Bound the outputs to [-1, 1], the valid range of sine and cosine values.
        output = self.output_activation(output)

        # Reshape to (batch, length, n_angles, 2), where the last axis holds
        # the sin/cos values of each angle.
        output = output.view(output.shape[0], output.shape[1], self.d_out // 2, 2)

        return output

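Training

Here we train the model with the sequence, PSSM, secondary structure, and information content as input.

Model Inputs

The model input is enriched with the PSSM, the secondary structure, and the information content, all of which are available through the batch.seq_evo_sec attribute.
The dataset described for this experiment is the smallest version of the CASP12 dataset, thinned to 30% to reduce its size. The scn.load call shown earlier does not request it explicitly, so that call has to be adjusted to use it.
The model can be enlarged by raising the hidden dimension to 1024 for better performance; the code below uses d_hidden=512.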

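PSSM

A PSSM, short for position-specific scoring matrix and also known as a position weight matrix in the context of DNA, is a matrix that assigns a score or probability to each possible symbol at each position of a sequence.

Think of it as a special code that tells us how likely each letter (amino acid) is to appear at each position of a secret message (the protein sequence). Scientists build this code by comparing many similar secret messages from different organisms. The PSSM shows which letters are important and which can change without altering the meaning of the message. It works like a decoder that helps scientists learn more about the messages hidden in proteins and how they work.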
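The idea can be sketched with a toy alignment. The code below counts how often each amino acid appears in each column, converts the frequencies to log-odds scores against a uniform background, and computes an information content per column. The uniform background, the pseudocount, and the four example sequences are simplifying assumptions for this sketch; the PSSMs shipped with real datasets are built from large alignments with more careful weighting.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_from_alignment(aligned, pseudocount=1.0):
    """Return (scores, information) per column for equal-length sequences.

    scores[pos][aa] is log2(frequency / background) with a uniform
    background of 1/20; information[pos] is the column's relative entropy
    in bits. Both are simplifications chosen for this sketch.
    """
    background = 1.0 / len(AMINO_ACIDS)
    scores, information = [], []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        freqs = {aa: (column.count(aa) + pseudocount) / total
                 for aa in AMINO_ACIDS}
        scores.append({aa: math.log2(f / background)
                       for aa, f in freqs.items()})
        information.append(sum(f * math.log2(f / background)
                               for f in freqs.values()))
    return scores, information

# Four aligned toy sequences: column 0 is fully conserved (always M),
# column 1 varies between K and R.
alignment = ["MKVL", "MRVI", "MKAL", "MKVL"]
scores, info = pssm_from_alignment(alignment)
print(round(scores[0]["M"], 2), round(scores[0]["A"], 2))
print([round(i, 3) for i in info])
```

Conserved columns get a high score for the conserved residue and a higher information content, which is the signal the model can use to tell important positions from variable ones.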

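The sequence and the PSSM each contribute 20 values per residue, the secondary structure has 8 possible classes, and the information content is a single number per position. Putting them all together therefore takes 20 + 20 + 8 + 1 = 49 values per residue.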
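The arithmetic can be checked by assembling the feature vector of a single residue. In this sketch the order of the blocks (sequence, PSSM, information content, secondary structure) is an assumption inferred from the attribute name seq_evo_sec, and all values are made up.

```python
N_AMINO_ACIDS = 20
N_SECONDARY = 8

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def residue_features(aa_index, pssm_row, information, sec_index):
    """Concatenate the per-residue inputs into one flat feature vector."""
    if len(pssm_row) != N_AMINO_ACIDS:
        raise ValueError("a PSSM row needs one value per amino acid")
    return (one_hot(aa_index, N_AMINO_ACIDS)      # 20 values: sequence
            + list(pssm_row)                      # 20 values: PSSM
            + [information]                       #  1 value: information content
            + one_hot(sec_index, N_SECONDARY))    #  8 values: secondary structure

features = residue_features(aa_index=3, pssm_row=[0.05] * 20,
                            information=0.7, sec_index=2)
print(len(features))  # 20 + 20 + 1 + 8 = 49
```

This is why the model that consumes these features is created with d_in=49.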

model_pssms = Net_Protein(d_hidden=512,
                           dim=256,
                           d_in=49,
                           d_embedding=32,
                           integer_sequence=False)
model_pssms = model_pssms.to(device)
model_pssms


def init_loss_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters())
    batch_losses = []
    epoch_training_losses = []
    epoch_validation10_losses = []
    epoch_validation90_losses = []
    mse_loss = torch.nn.MSELoss()
    
    return optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss
optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss = init_loss_optimizer(model_pssms)

def validation(model, datasplit, mode):
    """Evaluate the model on one data split and return the RMSE between the predicted and true angles, both represented as sin/cos values between -1 and 1."""
    total = 0.0
    n = 0
    with torch.no_grad():
        for batch in datasplit:
            # Select the model input for the chosen mode.
            if mode == 'seqs':
                seqs = batch.int_seqs.to(device).long()
            elif mode == 'pssms':
                seqs = batch.seq_evo_sec.to(device)
            else:
                raise ValueError(f"Unknown mode: {mode}")
            mask_ = batch.msks.to(device)
            true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
            # Missing angles are padded with zeros. Build a mask of the known angles
            # and repeat it along the last dimension to match the sin/cos representation.
            mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2).to(device)

            # Generate predictions and score only the known angles.
            predicted_angles = model(seqs, mask = mask_)
            loss = mse_loss(predicted_angles[mask], true_angles_sincos[mask])

            total += loss
            n += 1

    return torch.sqrt(total/n)

def train(model, mode, n_epoch):
    for epoch in range(n_epoch):
        print(f'Epoch {epoch}')
        bar_progress = tqdm(total=len(load_data['train']), smoothing=0)
        for batch in load_data['train']:
            # Select the model input for the chosen mode.
            if mode == 'seqs':
                seqs = batch.int_seqs.to(device).long()
            elif mode == 'pssms':
                seqs = batch.seq_evo_sec.to(device)
            else:
                raise ValueError(f"Unknown mode: {mode}")
            mask_ = batch.msks.to(device)
            true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
            # Missing angles are padded with zeros. Build a mask of the known angles
            # and repeat it along the last dimension to match the sin/cos representation.
            mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2).to(device)

            # Clear the gradients left over from the previous step, then predict,
            # compute the loss on the known angles, and update the weights.
            optimizer.zero_grad()
            predicted_angles = model(seqs, mask = mask_)
            loss = mse_loss(predicted_angles[mask], true_angles_sincos[mask])
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 2)
            optimizer.step()

            # Housekeeping
            batch_losses.append(float(loss))
            bar_progress.update(1)
            bar_progress.set_description(f"\rRMSE Loss = {np.sqrt(float(loss)):.4f}")
        bar_progress.close()
        # Assess the model's performance on the train-eval set, which has been downsampled for efficiency reasons.
        epoch_training_losses.append(validation(model, load_data['train-eval'], mode))
        # Assess the model's performance on the validation sets.
        epoch_validation10_losses.append(validation(model, load_data['valid-10'], mode))
        epoch_validation90_losses.append(validation(model, load_data['valid-90'], mode))
        print(f"     Train-eval loss = {epoch_training_losses[-1]:.4f}")
        print(f"     Valid-10   loss = {epoch_validation10_losses[-1]:.4f}")
        print(f"     Valid-90   loss = {epoch_validation90_losses[-1]:.4f}")
    # Finally, evaluate the model on the test set
    print(f"Test loss = {validation(model, load_data['test'], mode):.4f}")

train(model_pssms, 'pssms', 10)
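The role of the mask in the loss can be seen in a small standalone sketch. It computes an RMSE only over entries whose true angle is known, following the same convention as the training code, where a stored angle of exactly 0 means the angle is missing. The numbers are made up for illustration, and only the sine component is shown to keep the example short.

```python
import math

def masked_rmse(predicted, target, mask):
    """RMSE over entries whose mask is True; masked-out entries are ignored."""
    squared = [(p - t) ** 2 for p, t, m in zip(predicted, target, mask) if m]
    if not squared:
        raise ValueError("mask selects no entries")
    return math.sqrt(sum(squared) / len(squared))

# Two known angles followed by two missing ones (stored as 0).
true_angles = [0.5, -1.2, 0.0, 0.0]
mask = [a != 0 for a in true_angles]
target = [math.sin(a) for a in true_angles]
predicted = [0.4, -0.9, 0.8, -0.3]

with_mask = masked_rmse(predicted, target, mask)
without_mask = masked_rmse(predicted, target, [True] * len(target))
print(round(with_mask, 4), round(without_mask, 4))
```

Without the mask, the arbitrary predictions at the missing positions dominate the error. One side effect of this convention is that an angle that is genuinely 0 is also treated as missing.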


# Create a plot showing the loss of each batch over time.
plt.plot(np.sqrt(np.asarray(batch_losses)), label='batch loss')
plt.ylabel("RMSE")
plt.xlabel("Step")
plt.title("Training Loss over Time")
plt.show()


# The previous plot illustrates the loss for each batch during the training process. However, the next plot presents the model's performance on various data splits at the end of each epoch.
plt.plot([x.cpu().detach().numpy() for x in epoch_training_losses], label='train-eval')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation10_losses], label='valid-10')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation90_losses], label='valid-90')
plt.ylabel("RMSE")
plt.xlabel("Epoch")
plt.title("Training and Validation Losses over Time")
plt.legend()
plt.show()

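Visualizing Predictions

To visualize predictions we use scn.BatchedStructureBuilder, which needs two things:

A tensor of integers representing the protein sequences in a batch. These come from the data we iterate over during training or testing.
A tensor of predicted angles for every residue of each protein. The values should lie between -π and +π.

Our model predicts the sine and cosine of each angle, but the builder needs the angles themselves. We therefore use scn.structure.inverse_trig_transform to convert the sine and cosine values back into angles, which can then be passed to BatchedStructureBuilder.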

def build_visualizable_structures(model, data, mode=None):
    """Build visualizable structures for one batch of data: one set from the model's predictions and one from the true coordinates."""
    with torch.no_grad():
        for batch in data:
            if mode == "seqs":
                model_input = batch.int_seqs.to(device)
            elif mode == "pssms":
                model_input = batch.seq_evo_sec.to(device)
            else:
                raise ValueError(f"Unknown mode: {mode}")
            mask_ = batch.msks.to(device)
            # Predict the sin/cos values of the angles for every residue.
            predicted_angles_sincos = model(model_input, mask = mask_)
            # Convert the predicted sin/cos values back into angles.
            predicted_angles = inverse_trig_transform(predicted_angles_sincos)

            # Build a whole batch of structures at once: the predicted ones from
            # the angles and the true ones from the stored coordinates.
            sb_pred = scn.BatchedStructureBuilder(batch.int_seqs, predicted_angles.cpu())
            sb_true = scn.BatchedStructureBuilder(batch.int_seqs, batch.crds.cpu())
            # Only the first batch is needed for visualization.
            break
    return sb_pred, sb_true

def protein_plot(exp1, exp2):
    p = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', viewergrid=(2,1))
    p.addModel(open(exp1,'r').read(),'pdb', viewer=(0,0))
    p.addModel(open(exp2,'r').read(),'pdb', viewer=(1,0))
    p.setStyle({'cartoon': {'color':'spectrum'}})
    p.zoomTo()
    p.show()

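Inference

Here we compare the protein structures predicted by the model with the actual protein structures, using 3D views for a visual comparison. Each example has two views: the upper one shows the model's prediction and the lower one shows the actual structure. This makes it easy to see how closely the prediction matches the real protein.

Example (01)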

pred_s, true_s = build_visualizable_structures(model_pssms, load_data["train"], mode="pssms")
z_idx = 0
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (02)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (03)

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (04)

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (05)

pred_s, true_s = build_visualizable_structures(model_pssms, load_data["train"], mode="pssms")
z_idx = 1
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (06)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (07)

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (08)

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

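Training (Sequence → Angles)

Now we train the model with only the protein sequence as input.

Information flow: the sequence-only model is a simple attention model over sequence data. The input, one integer per residue drawn from 21 symbols (20 amino acids plus padding), passes through an embedding layer to give [L × d_embedding]. A feed-forward block then maps it through [L × d_hidden] to [L × d_output], and an attention layer refines this representation before the final output layer. Note that Net_Protein as defined above contains no LSTM or other recurrent layer; the attention layer is what lets each residue draw on information from the rest of the sequence.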

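Handling the circular nature of angles: to help the model understand that the angles π and -π are the same, we use a simple trick. Instead of predicting each angle directly, we predict two values per angle, its sine and its cosine, and then combine them with the atan2 function to recover the angle. The model output therefore has shape L × 12 × 2, where L is the length of the protein sequence, with values between -1 and 1. This lets the model treat angles correctly and improves prediction accuracy.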
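A few lines of standard-library Python show why this representation helps. The two angles below sit just on either side of the ±π boundary: as raw numbers they are almost 2π apart, but as (sin, cos) pairs they are nearly identical. The function names encode and decode are chosen for this sketch only.

```python
import math

def encode(angle):
    """Represent an angle by its (sin, cos) pair."""
    return math.sin(angle), math.cos(angle)

def decode(sin_value, cos_value):
    """Recover the angle in [-pi, pi] from a (sin, cos) pair."""
    return math.atan2(sin_value, cos_value)

a = math.pi - 0.01    # just below +pi
b = -math.pi + 0.01   # just above -pi

raw_gap = abs(a - b)                          # looks like a large error
sincos_gap = math.dist(encode(a), encode(b))  # reflects the true closeness
recovered = decode(*encode(0.75))

# atan2 depends only on the direction of the (sin, cos) pair, so a
# prediction that is not exactly on the unit circle still decodes cleanly.
s, c = encode(0.75)
recovered_scaled = decode(0.5 * s, 0.5 * c)

print(round(raw_gap, 3), round(sincos_gap, 3), round(recovered, 3))
```

A loss computed on the raw angles would heavily penalize a prediction of a when the truth is b, even though the two describe almost the same geometry; a loss on the sin/cos pairs does not.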

model_seqonly = Net_Protein(d_hidden=512,
                           dim=256,
                           d_in=21,
                           d_embedding=32,
                           integer_sequence=True)
model_seqonly = model_seqonly.to(device)
optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss = init_loss_optimizer(model_seqonly)
train(model_seqonly, 'seqs', 9)

# Plot the loss of each batch over the course of training. The x-axis is the training step (one batch per step) and the y-axis is the batch RMSE, which shows whether the loss is converging or fluctuating.
plt.plot(np.sqrt(np.asarray(batch_losses)), label='batch loss')
plt.ylabel("RMSE")
plt.xlabel("Step")
plt.title("Training Loss over Time")
plt.show()


plt.plot([x.cpu().detach().numpy() for x in epoch_training_losses], label='train-eval')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation10_losses], label='valid-10')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation90_losses], label='valid-90')
plt.ylabel("RMSE")
plt.xlabel("Epoch")
plt.title("Training and Validation Losses over Time")
plt.legend()
plt.show()

Inference (Sequence → Angles)

Example (09)

pred_s, true_s = build_visualizable_structures(model_seqonly, load_data["train"], mode="seqs")
z_idx = 2
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (10)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (11)

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Example (12)

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

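We have built an attention-based model that predicts protein structure in the form of angles. We trained it in two ways: one takes the sequence together with the PSSM, secondary structure, and information content as input, and the other takes only the protein sequence. Both approaches gave promising results.
One potential improvement is to train on multiple sequence alignments (MSAs). MSAs provide additional information about the evolutionary conservation of amino acids, which may help improve the model's performance.
At present the model uses angles as the prediction target. We could also explore using inter-coordinate distances and the coordinates themselves as targets, which may lead to more precise and accurate structure predictions.
Overall, the model shows potential for protein structure prediction, and further experiments with different training data and target variables could improve its performance.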

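Future Outlook for Protein Folding with Machine Learning

Machine learning has great potential in the field of protein folding and could change how we understand protein structure and function. By combining machine learning with interdisciplinary approaches, researchers are on the threshold of new opportunities and wider scientific exploration. As the mysteries of protein folding continue to be unraveled, the resulting research and applications could have a far-reaching impact on human well-being and beyond.

Conclusion

Protein folding is a critical and complex process that profoundly affects how proteins behave and function. The combination of machine learning and bioinformatics offers an exciting way into this complex world and makes it possible to predict protein structures with unprecedented precision. The work promises discoveries that could transform medicine and biotechnology. As the field moves forward, the puzzle of protein folding will gradually be solved, revealing more of the complexity of life itself. With machine learning as an ally, we are getting closer to understanding protein folding and its wide-ranging implications.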
