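Aho-Corasick Algorithm for Pattern Searching Using Python

August 29, 2024 | 4 min read

The Aho-Corasick algorithm is a dictionary-matching algorithm. It finds every occurrence of the words from a given keyword set in an input text, and it reports both the matched words and their positions quickly and efficiently. The algorithm builds on the concept of a trie.

The technique is carried out on a tree data structure. Once the trie has been built, it is converted into an automaton, which allows the search to run in linear time.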

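Time Complexity of the Aho-Corasick Algorithm

The Aho-Corasick algorithm searches for the words in O(X + Y + Z) time, where X is the length of the text, Y is the total length of all keywords, and Z is the number of keyword occurrences in the text.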
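
To see why the number of occurrences Z has to appear in the bound, it helps to look at a case where the matches overlap heavily. The snippet below is only an illustrative brute-force count, not part of the algorithm; the function name count_occurrences and the sample input are assumptions made for this sketch.

```python
def count_occurrences(text, words):
    """Count every (possibly overlapping) occurrence of each word in text."""
    total = 0
    for word in words:
        start = text.find(word)
        while start != -1:
            total += 1
            # Advance by one character so overlapping matches are counted too.
            start = text.find(word, start + 1)
    return total


text = "aaaa"
words = ["a", "aa", "aaa"]
z = count_occurrences(text, words)
print(len(text), sum(len(w) for w in words), z)  # X = 4, Y = 6, Z = 9
```

Here Z is larger than both X and Y, so simply reporting the matches already costs more than reading the input.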

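Problem Statement for the Aho-Corasick Algorithm

Suppose we have an input text and an array a[ ] containing m words. We need to find all occurrences of these words in the input text.

Let x be the length of the text and y be the total number of characters in all the words. That is, y = len(a[0]) + len(a[1]) + len(a[2]) + ... + len(a[m - 1]), where m is the number of input words.

We will understand the algorithm through an example in which we take an input string and a set of words to search for in that string.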

Example

Input

Text string: txt= "hellotheirshere"
The sample set of words: a[ ] = {"he", "hello", "she", "here", "their", "the"}

Output

The Word "he" is found at index 0 to 1
The Word "he" is found at index 6 to 7
The Word "he" is found at index 11 to 12
The Word "hello" is found at index 0 to 4
The Word "the" is found at index 5 to 7
The Word "their" is found at index 5 to 9
The Word "she" is found at index 10 to 12
The Word "here" is found at index 11 to 14
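
The positions for this example can be double-checked with a plain brute-force scan. This is a slow reference check rather than the Aho-Corasick algorithm itself, and the helper name brute_force_matches is an assumption introduced for this sketch.

```python
def brute_force_matches(text, words):
    """Return {word: [start indices]} by testing every position for every word."""
    matches = {}
    for word in words:
        starts = [i for i in range(len(text) - len(word) + 1)
                  if text.startswith(word, i)]
        if starts:
            matches[word] = starts
    return matches


txt = "hellotheirshere"
a = ["he", "hello", "she", "here", "their", "the"]
expected = brute_force_matches(txt, a)
for word, starts in expected.items():
    for s in starts:
        print(word, s, s + len(word) - 1)
```

Any faster implementation should report the same start indices as this reference, for instance 5 for "their" and 11 for "here".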

Algorithm Preprocessing

Preprocessing in the Aho-Corasick algorithm builds three functions:

Go To (transition function)
Output
Failure

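Go To: This phase builds the trie from the keywords given to the algorithm. Starting at the root (state 0), it follows the edges of the trie for every word in a[ ] and creates a new state whenever an edge is missing. The two-dimensional array gt[ ][ ] represents the transition function: gt[s][c] stores the next state for the current state s and the character c.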
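
A minimal sketch of how such a table could be filled is shown below. It assumes lowercase letters a to z only; the function name build_goto and the helper dictionary end_state are hypothetical names used for illustration.

```python
def build_goto(words, alphabet_size=26):
    """Build the trie part of the automaton as a 2-D table.

    gt[s][c] is the state reached from state s on letter index c,
    or -1 when the trie has no such edge. State 0 is the root.
    """
    # Worst case: no two words share a prefix, plus one state for the root.
    max_states = sum(len(w) for w in words) + 1
    gt = [[-1] * alphabet_size for _ in range(max_states)]
    end_state = {}   # word -> state at which the word ends
    next_free = 1    # next unused state number
    for word in words:
        s = 0
        for ch in word:
            c = ord(ch) - ord("a")   # assumption: lowercase a-z only
            if gt[s][c] == -1:
                gt[s][c] = next_free
                next_free += 1
            s = gt[s][c]
        end_state[word] = s
    return gt, end_state, next_free


a = ["he", "hello", "she", "here", "their", "the"]
gt, end_state, used = build_goto(a)
print(used)        # 16 states are created, including the root
print(end_state)
```

The words contain 22 characters in total, but shared prefixes such as "he" in "hello" and "here" mean that only 16 states are needed.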

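Output: This function stores the indices of all words that end at the current state. The one-dimensional array op[ ] represents the output function: op[s] holds a bitmap in which bit j is set when word a[j] ends at state s.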
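
The bitmap encoding can be illustrated in isolation. In this small sketch the helper names encode and decode are assumptions, and the chosen indices simply stand for a state at which two words end.

```python
a = ["he", "hello", "she", "here", "their", "the"]


def encode(indices):
    """Pack a collection of word indices into one integer bitmap."""
    mask = 0
    for j in indices:
        mask |= 1 << j
    return mask


def decode(mask, words):
    """Return the words whose bit is set in the bitmap."""
    return [w for j, w in enumerate(words) if mask & (1 << j)]


# Suppose both "he" (index 0) and "she" (index 2) end at some state.
mask = encode([0, 2])
print(bin(mask))        # 0b101
print(decode(mask, a))  # ['he', 'she']
```

Because Python integers have arbitrary precision, the bitmap works for any number of words, although decoding scans all of them.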

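Failure: This function records the edge to follow when the current character has no edge in the trie. The one-dimensional array fl[ ] represents the failure function: fl[s] is the state to fall back to from state s, which corresponds to the longest proper suffix of the text matched so far that is also a prefix of some keyword.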
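
The following sketch shows one way the failure links could be computed with a breadth-first traversal. For brevity it swaps the fixed-size arrays described above for dictionaries and sets, so it is an illustration of the idea rather than the array-based layout; the names build_automaton and state_of are assumptions.

```python
from collections import deque


def build_automaton(words):
    """Return (goto, fail, out) for the given words using dict-based states."""
    goto = [{}]    # goto[s] maps a character to the next state
    out = [set()]  # out[s] holds the words recognized on reaching s
    for word in words:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            # Follow failure links until a state with an edge on ch is found.
            while f != 0 and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the fallback state also end here.
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def state_of(goto, prefix):
    """Walk the trie from the root along prefix and return the state reached."""
    s = 0
    for ch in prefix:
        s = goto[s][ch]
    return s


a = ["he", "hello", "she", "here", "their", "the"]
goto, fail, out = build_automaton(a)
she = state_of(goto, "she")
print(fail[she] == state_of(goto, "he"))  # True
print(sorted(out[she]))                   # ['he', 'she']
```

The state for "she" falls back to the state for "he" because "he" is the longest suffix of "she" that is also a prefix of a keyword, which is why reaching "she" reports both words.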

Implementation of the Aho-Corasick Algorithm in Python for Pattern Searching

Here is an implementation of the Aho-Corasick algorithm in Python.

Code

from collections import defaultdict, deque


class Aho_Corasick:
    def __init__(self, words):
        # Upper bound on the number of non-root states: one per pattern character.
        self.maxStates = sum(len(word) for word in words)
        # Alphabet size: the lowercase letters 'a' to 'z'.
        self.maxChar = 26
        # out[s] is a bitmask; bit j is set if words[j] ends at state s.
        self.out = [0] * (self.maxStates + 1)
        # fail[s] is the state to fall back to on a mismatch in state s.
        self.fail = [-1] * (self.maxStates + 1)
        # goto[s][c] is the next state from s on letter c, or -1 if undefined.
        self.goto = [[-1] * self.maxChar for _ in range(self.maxStates + 1)]

        # Work on a lowercased copy so the caller's list is not modified.
        self.words = [word.lower() for word in words]
        self.states_count = self.__build_matching_machine()

    def __build_matching_machine(self):
        state = 1  # state 0 is the root

        # Phase 1: insert every word into the trie (goto function).
        for i, word in enumerate(self.words):
            currentState = 0

            for char in word:
                c = ord(char) - 97  # map 'a'..'z' to 0..25

                if self.goto[currentState][c] == -1:
                    self.goto[currentState][c] = state
                    state += 1
                currentState = self.goto[currentState][c]
            # Record that word i ends at this state (output function).
            self.out[currentState] |= (1 << i)

        # Letters without an edge from the root loop back to the root.
        for c in range(self.maxChar):
            if self.goto[0][c] == -1:
                self.goto[0][c] = 0

        # Phase 2: compute the failure function breadth-first.
        queue = deque()
        for c in range(self.maxChar):
            if self.goto[0][c] != 0:
                self.fail[self.goto[0][c]] = 0
                queue.append(self.goto[0][c])

        while queue:
            states = queue.popleft()

            for c in range(self.maxChar):
                if self.goto[states][c] != -1:
                    failure = self.fail[states]

                    # Fall back until a state with a transition on c is found.
                    while self.goto[failure][c] == -1:
                        failure = self.fail[failure]
                    failure = self.goto[failure][c]
                    self.fail[self.goto[states][c]] = failure
                    # Inherit the words that end at the failure state.
                    self.out[self.goto[states][c]] |= self.out[failure]
                    queue.append(self.goto[states][c])
        return state

    def findNextState(self, currentState, next_inp):
        ans = currentState
        c = ord(next_inp) - 97  # assumes next_inp is a letter 'a'..'z'
        # Follow failure links until a transition on c exists.
        while self.goto[ans][c] == -1:
            ans = self.fail[ans]

        return self.goto[ans][c]

    def search(self, txt):
        txt = txt.lower()
        currentState = 0
        res = defaultdict(list)

        for i in range(len(txt)):
            currentState = self.findNextState(currentState, txt[i])

            if self.out[currentState] == 0:
                continue

            # Report every word whose bit is set; i is the index of its last character.
            for j in range(len(self.words)):
                if (self.out[currentState] & (1 << j)) > 0:
                    word = self.words[j]
                    res[word].append(i - len(word) + 1)
        return res


if __name__ == "__main__":
    words = ["he", "hello", "she", "here", "their", "the"]
    txt = "hellotheirshere"
    aho_corasick = Aho_Corasick(words)
    res = aho_corasick.search(txt)

    for word in res:
        for i in res[word]:
            print(f'The Word "{word}" is found at index {i} to {i + len(word) - 1}')

Output

The Word "he" is found at index 0 to 1
The Word "he" is found at index 6 to 7
The Word "he" is found at index 11 to 12
The Word "hello" is found at index 0 to 4
The Word "the" is found at index 5 to 7
The Word "their" is found at index 5 to 9
The Word "she" is found at index 10 to 12
The Word "here" is found at index 11 to 14

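Explanation

Here, we use a defaultdict to store the results. We set upper limits on the alphabet size (maxChar) and on the number of states (maxStates). Preprocessing builds the three functions of the algorithm: the goto table, the output bitmasks, and the failure links, which are computed level by level using loops and a queue. With the automaton in place, the search scans the text once and records the starting index of every word from the set that it finds.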
