深入理解Python之Dictionaries和Sets

Posted on 2017-04-15 Views:

Dictionaries

Dictionary是Python中最常用的数据结构之一

推导式

DIAL_CODES = [
    (86, 'China'),
    (91, 'India'),
    (1, 'United States'),
    (62, 'Indonesia'),
    (55, 'Brazil'),
    (92, 'Pakistan'),
    (880, 'Bangladesh'),
    (234, 'Nigeria'),
    (7, 'Russia'),
    (81, 'Japan'),
]

country_code = {country: code for code, country in DIAL_CODES}
# print(country_code)
print({code: country.upper() for country, code in country_code.items() if code < 66})

使用`setdefault`处理无`key`

import sys
import re

WORD_RE = re.compile('\w+')

index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            column_no = match.start() + 1
            location = (line_no, column_no)
            occurrences = index.get(word, [])
            occurrences.append(location)
            index[word] = occurrences
for word in sorted(index, key=str.upper):
    print(word, index[word])

可以使用get(key, defaultValue)和setdefault(key, defaultValue)来处理key不存在的情况,也可以使用collections.defaultdict(value)来设置创建默认new-key的字典。采用这方式，如果dd是一个defaultdict,那么当dd[k]不存在时，会调用default_factory来创建一个默认值，但返回结果依然是None

`missing`

这个方法默认是没有定义的，当dict的子类提供了这个方法的实现，那么当key不存在时，不会抛出异常KeyError，__getitem__会调用__missing__

class StrKeyDict0(dict):

def __missing__(self, key):
    if isinstance(key, str):
        raise KeyError(key)
    return self[str(key)]

def get(self, k, d=None):
    try:
        return self[k]
    except KeyError:
        return default

def __contains__(self, item):
    return key in self.keys() or str(key) in self.keys()

这里需要注意isintance检查key是否为str,否则会陷入死循环。使用k in dict.keys()效率很高

各种各样的`dict`

除了defaultdict之外，还会有很多别的dict

collections.OrderedDict

key有顺序的dict，类似Java中的TreeMap

collections.ChainMap

将多个dict当成一个dict，按顺序搜索key，只要搜索到key，结果返回成功

collections.Counter

持有key的计数

ct = collections.Counter('dsakfjdjsfakdjfs')
print(ct)

result: Counter({'d': 3, 's': 3, 'f': 3, 'j': 3, 'a': 2, 'k': 2})

userDict

纯Python实现的标准dict

一般来说userDict被用来集成，其他几个直接使用

UserDict派生

UserDict内部提供了很多默认的实现方法，因此直接从UserDict继承会很方便。UserDict不是继承dict，但其内部有个dict实例。这是真正数据存储的地方

class StrKeyDict0(collections.UserDict):
def __missing__(self, key):
    if isinstance(key, str):
        raise KeyError(key)
    return self[str(key)]

def __contains__(self, item):
    return str(item) in self.data

def __setitem__(self, key, value):
    self.data[str(key)] = value

不可改Mapping

标准库提供的映射类型都是可以修改的。你可能想让你的映射不能够被修改。Python3.3之后，标准库提供了只读的MappingProxy

from types import MappingProxyType
d = {1: 'A'}
d_proxy = MappingProxyType(d)
print(d_proxy)
d_proxy[2] = 'x'

上面的代码会抛出异常，因为d_proxy是不允许被修改的

Set

l = ['spam', 'spam', 'eggs', 'spam']
print(set(l))

result: {'spam', 'eggs'}

需要注意Set的元素必须是能哈希的
支持&(intersetion)

Set语法

Python3中，set标准表示方法s = {1}。使用这个方式会比使用set([1])效率高

from dis import dis
print(dis('{1}'))

result:

1           0 LOAD_CONST               0 (1)
            2 BUILD_SET                1
            4 RETURN_VALUE
            

print(dis('set([1])')

result:
1           0 LOAD_NAME                0 (set)
            2 LOAD_CONST               0 (1)
            4 BUILD_LIST               1
            6 CALL_FUNCTION            1
            8 RETURN_VALUE

使用dis来查看两种方式的字节码操作。第一种方式效率明显高于第二种

Set推导式

1 2	from unicodedata import name print({chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')})

其他

性能

数据量	因子	dict时间	因子
1000	1x	0.000202s	1.00x
10000	10x	0.000140s	0.69x
100000	100x	0.000228s	1.13x
1000000	1000x	0.000290s	1.44x
10000000	10000x	0.000337s	1.67x

dict的高性能多亏了hashtable

dict_hash_algorithm

好处和坏处

key必须要能hash,需要满足
- 支持hash()，同一个对象返回的值总是相同
- 支持eq()
- 如果a==b，那么hash(a)==hash(b)
- 用户定义的类型的hash值是id()
内存开销大
key搜索非常快
- dict是典型的以空间换取时间
key的顺序依赖插入顺序
添加数据会影响已存在key的顺序