Python 中的哈希表 - 冲突、负载因子和再哈希

2024 年 8 月 29 日 | 阅读 12 分钟

引言

在本教程中，我们将学习 Python 中的哈希表，包括冲突、负载因子和再哈希。哈希表是一种索引数据结构。它以键值对的形式存储数据。在这种数据结构中，数据存储在特殊的索引中。哈希函数创建索引。这个哈希函数可以是内置的或生成的，它使用特定的键来创建索引以存储数据。键必须是不可变的且唯一的。通过哈希，插入、删除和搜索操作的平均时间复杂度接近常数 O(1)，最坏情况下也可能是 O(n) 或 O(1)。Python 中的哈希表包含一些用于设置值、获取值和从索引中删除值的函数。这些函数如下：

set_value(key, value): 根据从哈希键获得的唯一参数（键值）的值插入数据。如果键已经存在，则此函数将更新值。
get_value(key): 通过使用此函数，我们可以从哈希表或索引中获取给定键的值。但是当它没有找到键或哈希表中没有可用时，此函数会返回一个适当的消息。
del_value(key): 通过使用此函数，我们可以从哈希表或索引中删除给定键的值。但是当未找到需要删除的键或哈希表中没有可用时，此函数会返回一个适当的消息。

哈希 - 哈希函数

此函数将键转换为一个小值，用作哈希表中的索引。键值可以是整数或字符串。哈希函数有两种类型，如下所示：

a. 内置哈希函数

在内置哈希函数中，hash() 函数返回对象的哈希值。

b. 用户定义的哈希函数

开发人员创建用户定义的哈希函数。

冲突

冲突是 Python 中哈希表的一部分。当您想将数据插入哈希表，但其他键已经占用了该索引时，就会发生冲突。一个好的哈希表的特点是冲突数量较少。通过使用两种方法，我们可以轻松避免冲突，如下所示：

1. 分离链接

分离链接也称为开放哈希。如果两个键的哈希值相同，则将第二个键的数据以链式模式添加到第一个键的末尾并继续。链接可以借助其他数据结构完成，例如链表、列表或自平衡 BST（AVL 树、红黑树）。连接中的大多数键都分配在哈希表外部。以下数据库的访问时间值对于链表和列表范围从 O(1) 到 O(n)，对于树范围从 O(logn)。

程序代码

现在我们给出 Python 中开放哈希或分离链接的程序代码，如下所示：

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None
  
  
class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 0
        self.table = [None] * capacity
  
    def _hash(self, key):
        return hash(key) % self.capacity
  
    def insert(self, key, value):
        index = self._hash(key)
  
        if self.table[index] is None:
            self.table[index] = Node(key, value)
            self.size += 1
        else:
            current = self.table[index]
            while current:
                if current.key == key:
                    current.value = value
                    return
                current = current.next
            new_node = Node(key, value)
            new_node.next = self.table[index]
            self.table[index] = new_node
            self.size += 1
  
    def search(self, key):
        index = self._hash(key)
  
        current = self.table[index]
        while current:
            if current.key == key:
                return current.value
            current = current.next
  
        raise KeyError(key)
  
    def remove(self, key):
        index = self._hash(key)
  
        previous = None
        current = self.table[index]
  
        while current:
            if current.key == key:
                if previous:
                    previous.next = current.next
                else:
                    self.table[index] = current.next
                self.size -= 1
                return
            previous = current
            current = current.next
  
        raise KeyError(key)
  
    def __len__(self):
        return self.size
  
    def __contains__(self, key):
        try:
            self.search(key)
            return True
        except KeyError:
            return False
  

if __name__ == '__main__':
  
    h = HashTable(5)
  
    
    h.insert("Rose", 4)
    h.insert("Lotus", 5)
    h.insert("Dog", 3)
  
    
    print("Rose" in h)  
    print("Red" in h)  
  
    
    print(h.search("Lotus"))  
  
   
    h.insert("Rose", 2)
    print(h.search("Rose"))  
  
    h.remove("Dog")
    
    print(len(h))  

输出

现在，我们编译并运行上述程序。结果如下：

True
False
5
2
2

2. 开放寻址

开放寻址也称为闭合哈希。如果另一个键分配了索引，它将检查哈希表中的下一个可用空间以给出新的键值。探测可以通过多种方式完成。探测有三种类型：线性探测、二次探测和双重哈希。

a. 线性探测

循环地，搜索从起始索引到前一个索引以查找下一个空槽。线性探测中的缓存性能非常好，这是一个优势。但也有一个缺点。线性探测的缺点是会发生主要聚类和次要聚类。

b. 二次探测

在二次探测中，以二次方式搜索索引中的下一个空槽。因此，需要第 n 次迭代才能找到第 (n²) 个槽。没有主要聚类，这是二次探测的一个优势。但也有一个缺点。二次探测的缺点是会发生次要聚类和非常差的缓存性能。

c. 双重哈希

在双重哈希中，我们使用第二个哈希函数来查找索引的下一个可用空间。此索引是从第一个哈希函数获得的。没有任何主要和次要聚类，这是双重哈希的一个优势。但也有一个缺点。双重哈希的缺点是缓存性能非常差。

什么是聚类？

在聚类中，当不同的数据（键）放置在相同的哈希索引时，连续的键或索引会迅速包含在内。

负载因子

负载因子定义为 (m/n)。其中，m 表示在增加哈希表大小之前可以在哈希表中找到的元素数量，n 表示哈希表的大小。负载因子是一个参数，它决定何时增加哈希表大小，以便以尽可能低的时间复杂度 O(1) 进行查找和添加。Java 中哈希表的默认负载因子是哈希表大小的 0.75f。当当前负载因子超过预定义负载时，哈希表的大小必须加倍。负载因子的一些步骤如下：

第一步，执行时间取决于 K 和哈希函数。例如，如果我们取键字符串 "abcd"，其哈希函数将取决于字符串的长度。然而，对于非常大的 n 值，哈希计算可以被认为是常数时间，即 O(1)，因为映射中的条目数量和键的长度与 n 相比几乎可以忽略不计。
第二步，必须遍历在刻度中找到的 K-V 对列表。因此，最坏情况是所有 n 个输入都在相同的刻度上。那么，时间复杂度是 O(n)。然而，已经进行了足够的研究，以确保哈希函数将键均匀地分布在数组中，从而很少发生这种情况。
负载因子可以定义为 (m/n)。其中，m 表示在增加哈希表大小之前可以在哈希表中找到的元素数量，n 表示哈希表的大小。
这个负载因子应该保持较低，以便索引条目数量较少，并且复杂度几乎是常数，例如 O(1) 时间复杂度。

Rehash

负载因子是再哈希的一部分。它是一种特殊的避免冲突并使时间复杂度达到 O(1) 的方法。当哈希表大小达到或超过某个称为负载因子的点时，这是增加或加倍哈希表大小的过程。进行再哈希是为了在哈希表满时减少从哈希表访问元素的难度。

访问哈希表元素的时间复杂度在最佳情况下为 O(1)。在这里，所有键值对都在一个索引中，在最坏情况下为 O(n)，当所有键值对都存储在一个索引中。索引中项目按顺序输入。因此，重复操作是为了减少访问时间并计算哈希表中的所有键点。如果哈希表中的点数超过起始指针或哈希表的大小，请使用重用方法。再哈希是根据哈希表的负载因子完成的。它将决定何时增加哈希表大小。

为什么哈希表需要再哈希？

为了防止冲突，哈希表需要再哈希。它还保持数据结构的效率。如果将元素添加到哈希表中，负载因子会增加。负载因子是内容数量与桶数量之比。当它大于某个阈值（通常设置为 0.75）时，由于冲突增加，哈希表变得不可用。为了避免这种情况，可以修改哈希表。此外，内容在新的桶中重新创建，从而降低了负载和冲突。这被称为再哈希。它可能需要时间和空间，但它应该使哈希表高效。再哈希时，您需要从哈希函数中检查索引，以将旧哈希表中的内容添加到新哈希表中。如果哈希函数使用哈希表的大小来查找索引，则在再哈希时始终会考虑新的大小。

再哈希的步骤是什么？

再哈希可以通过以下步骤完成：

检查插入哈希表的每个项目的当前负载因子。
如果超过预定义的负载因子，则进行再哈希。
要进行再哈希，创建一个比原始哈希表大两倍的新哈希表。
然后，遍历旧哈希表的每个元素并将其添加到新哈希表。
再哈希完成后，删除旧哈希表以节省空间。

例如，假设一个 5 维哈希表，哈希函数是 key % size (key % 5)。我们需要向哈希表添加 5 个元素，例如 11、12、13、14 和 15，这使得时间复杂度为 O(1)。0.75 是初始负载因子。

所以，_hash(key) 等于

_hash(11) = 1；这里，元素 11 存储在索引 1 中。所以，它的当前负载因子是 1/5 = 0.2。
_hash(12) = 2，这里，元素 12 存储在索引 2 中。所以，它的当前负载因子是 2/5 = 0.4。
_hash(13) = 3，这里，元素 13 存储在索引 3 中。所以，它的当前负载因子是 3/5 = 0.6。
_hash(14) = 4，这里，元素 14 存储在索引 4 中。所以，它的当前负载因子是 4/5 = 0.8。
_hash(15) = 5，这里，元素 15 存储在索引 5 中。所以，它的当前负载因子是 15/5 = 1。

程序代码

现在我们给出 Python 中再哈希哈希表的程序代码，如下所示：

# Create the class Map here
class Map:

# Here, we create another class MapNode with in the class Map
    class MapNode:

# Here, we define the function _init_ with three parameter self, key and value
        def __init__(self, key, value):
            self.key = key
            self.value = value
            self.next = None
    buckets = list()
 
    # size is also initialized here
    size = 0
 
    # The number of buckets initialized here
    num_Buckets = 0
 
    # Here, we initialized the default load Factor
    DEFAULT_LOAD_FACTOR = 0.75
 
# Here, we define the function _init_
    def __init__(self):
        Map.num_Buckets = 7
 
        Map.buckets = [None]*Map.num_Buckets
 
        print("The Hash Map is created")
        print("The total number of pairs in the Map: " + str(Map.size))
        print("The Map size is: " + str(Map.num_Buckets))
        print("Default Load Factor is: " + str(Map.DEFAULT_LOAD_FACTOR) + "\n")
 
    def getBucketInd(self,key):
        hash_Code = hash(key)
        return (hash_Code % Map.num_Buckets)
 
    def insert(self,key,value):
 
        #  Here, we are getting the index at which we need to be inserted
        bucketInd = self.getBucketInd(key)
        head = Map.buckets[bucketInd]
 
        # Here, we check that all nodes are present already in the index or not
        while (head != None):
 
            # If the node is already present, the value is updated
            if (head.key==key):
                head.value = value
                return
            head = head.next
 
        newElementNode = Map.MapNode(key, value)
 
        # Initialize the head node at the index
        head = Map.buckets[bucketInd]
 
        newElementNode.next = head
        Map.buckets[bucketInd]= newElementNode
 
        print("Pair(\" {} \", \" {} \") insertion is successfully done.".format(key,value))
 
        # Incrementing size of the map
        Map.size+=1
 
        # Here, we calculated the Load factor 
        loadFactor = (1* Map.size) / Map.num_Buckets
 
       # Print the current load factor
        print("Current Load factor = " + str(loadFactor))
 
        # The rehashing is done if the load factor is > 0.75, 
        if (loadFactor > Map.DEFAULT_LOAD_FACTOR):
            print(str(loadFactor) + " is greater than " + str(Map.DEFAULT_LOAD_FACTOR))
            print("Therefore Rehashing will be done.")
 
            self.rehash()
 
            print("The new Map size is: " + str(Map.num_Buckets))
 
        print("The number of pairs in the Map: " + str(Map.size))
        print("The Map size is: " + str(Map.num_Buckets))
 
    def rehash(self):
 
        print("\n***Rehashing Started***\n")
 
        # The temp is made by the present bucket list
        temp = Map.buckets
 
        # bucket is calculated here
        buckets = (2 * Map.num_Buckets)
 
        for i in range(2 * Map.num_Buckets):
            # Initialised to null
            Map.buckets.append(None)
 
        # Now the map size is made zero
        Map.size = 0
        Map.num_Buckets *= 2
 
        for i in range(len(temp)):
 
            # The head of the chain 
            head = temp[i]
 
            while (head != None):
                key = head.key
                value = head.value
 
                self.insert(key, value)
                head = head.next
 
        print("\n***Rehashing Ended***")
 
    def printMap(self):
 
        # The temp is made by the present bucket 
        temp = Map.buckets
 
        print("Current HashMap:")
        for i in range(len(temp)):
            head = temp[i]
 
            while (head != None):
                print("key = \" {} \", val = {}" .format(head.key,head.value))
 
                head = head.next
        print()
 
# Define the main function
 
if __name__ == '__main__':
    # Creating the Map
    map = Map()
 
    # The elements are inserted here
    map.insert(1, "Java")
    map.printMap()
 
    map.insert(2, "Tpoint")
    map.printMap()
 
    map.insert(3, "Is")
    map.printMap()
 
    map.insert(4, "A")
    map.printMap()
 
    map.insert(5, "Online")
    map.printMap()
    
    map.insert(6, "Learning")
    map.printMap()
    
    map.insert(7, "Website")
    map.printMap()

输出

现在，我们编译并运行上述程序。结果如下：

The Hash Map is created
The total number of pairs in the Map: 0
The Map size is: 7
Default Load Factor is: 0.75

Pair(" 1 ", " Java ") insertion is successfully done.
Current Load factor = 0.14285714285714285
The number of pairs in the Map: 1
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java

Pair(" 2 ", " Tpoint ") insertion is successfully done.
Current Load factor = 0.2857142857142857
The number of pairs in the Map: 2
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint

Pair(" 3 ", " Is ") insertion is successfully done.
Current Load factor = 0.42857142857142855
The number of pairs in the Map: 3
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is

Pair(" 4 ", " A ") insertion is successfully done.
Current Load factor = 0.5714285714285714
The number of pairs in the Map: 4
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A

Pair(" 5 ", " Online ") insertion is successfully done.
Current Load factor = 0.7142857142857143
The number of pairs in the Map: 5
The Map size is: 7
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online

Pair(" 6 ", " Learning ") insertion is successfully done.
Current Load factor = 0.8571428571428571
0.8571428571428571 is greater than 0.75
Therefore, Rehashing will be done.

***Rehashing Started***
***Rehashing Ended***
The new Map size is: 14
The number of pairs in the Map: 0
The Map size is: 14
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online
key = " 6 ", val = Learning

Pair(" 7 ", " Website ") insertion is successfully done.
Current Load factor = 0.07142857142857142
The number of pairs in the Map: 1
The Map size is: 14
Current HashMap:
key = " 1 ", val = Java
key = " 2 ", val = Tpoint
key = " 3 ", val = Is
key = " 4 ", val = A
key = " 5 ", val = Online
key = " 6 ", val = Learning
key = " 7 ", val = Website

下一个主题如何将图像转换为 NumPy 数组？

← 上一个下一个 →

Python 中的哈希表 - 冲突、负载因子和再哈希

引言

哈希 - 哈希函数

冲突

1. 分离链接

2. 开放寻址

什么是聚类？

负载因子

Rehash

为什么哈希表需要再哈希？

再哈希的步骤是什么？

联系信息

关注我们

教程

面试题

在线编译器

Python

Java

.Net Framework

AI, ML and Data Science

Cloud Technology

B.Tech and MCA

Web Technology

PHP

Software Testing

Technical Interview

Java Interview

Python

Web Interview

Database Interview

B.Tech / MCA

Important Interview

Software Testing Interview

Company Interviews

Online Compilers

Multiple Choice Questions

Python 问题

Python 中的哈希表 - 冲突、负载因子和再哈希

引言

哈希 - 哈希函数

冲突

1. 分离链接

2. 开放寻址

什么是聚类？

负载因子

Rehash

为什么哈希表需要再哈希？

再哈希的步骤是什么？

相关帖子

如何使用 Python 清空回收站

如何在 Python 中四舍五入数字

Python 中的语法和拼写检查器

Python 中的 8 拼图问题

Python lxml 模块

Python Tkinter 中的 Tree view 小部件和 Tree view 滚动条

如何将 Python 列表或可迭代对象分割成块

Python 统计学

Python 中的函数注解

在 Python 中对具有多维系数的 Legendre Series 进行微分

订阅 Tpoint Tech

联系信息

关注我们

教程

面试题

在线编译器