Compressed Sparse Graph Routines (`scipy.sparse.csgraph`)

Example: Word Ladders

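A word ladder is a word game invented by Lewis Carroll in which players find a path between two words by changing one letter at a time. For example, "ape" and "man" can be linked as follows: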

\[{\rm ape \to apt \to ait \to bit \to big \to bag \to mag \to man}\]

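Note that each step changes exactly one letter of the word. This is just one possible path from "ape" to "man", but is it the shortest one? If we want to find the shortest word-ladder path between two given words, the sparse graph submodule can help.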

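First, we need a list of valid words. Many operating systems ship with such a list. For example, on Linux, a word list can often be found in one of the following locations: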

/usr/share/dict
/var/lib/dict

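Another easy source of words is the Scrabble word lists available on various websites (search with your favorite search engine). We'll first create this list. A system word list is a single file with one word per line. The following should be adapted to the particular word list you have available: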

>>> with open('/usr/share/dict/words') as f:
...    word_list = f.readlines()
>>> word_list = map(str.strip, word_list)

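We want to look at words of length 3, so we select only words of that length. We also eliminate words that start with an uppercase letter (proper nouns) or that contain non-alphanumeric characters such as apostrophes and hyphens. Finally, we make sure everything is lowercase for later comparison: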

>>> word_list = [word for word in word_list if len(word) == 3]
>>> word_list = [word for word in word_list if word[0].islower()]
>>> word_list = [word for word in word_list if word.isalpha()]
>>> word_list = list(map(str.lower, word_list))
>>> len(word_list)
586    # may vary

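We now have a list of 586 valid three-letter words (the exact number depends on the list used). Each of these words becomes a node in our graph, and we create an edge between the nodes of every pair of words that differ by exactly one letter.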
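
To make the edge rule concrete before vectorizing it, here is a minimal pure-Python sketch. The helper `differs_by_one` and the four-word `sample` list are hypothetical names introduced only for illustration; they are not part of SciPy or of the tutorial's data.

```python
def differs_by_one(a, b):
    """Return True if equal-length words a and b differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


sample = ["ape", "apt", "opt", "man"]  # hypothetical mini word list

# Compare every unordered pair once; this is O(n^2) pure-Python work.
edges = [(u, v)
         for i, u in enumerate(sample)
         for v in sample[i + 1:]
         if differs_by_one(u, v)]
print(edges)
```

This brute-force loop is fine for a handful of words but scales poorly, which is why a vectorized approach is preferable for the full list.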

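There are efficient ways to build these edges and inefficient ones. To do it as efficiently as possible, we'll use some sophisticated NumPy array manipulation: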

>>> import numpy as np
>>> word_list = np.asarray(word_list)
>>> word_list.dtype   # these are unicode characters in Python 3
dtype('<U3')
>>> word_list.sort()  # sort for quick searching later

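We have an array in which each entry is three Unicode characters long. We'd like to find all pairs that differ in exactly one character. We'll start by converting each word to a three-dimensional vector: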

>>> word_bytes = np.ndarray((word_list.size, word_list.itemsize),
...                         dtype='uint8',
...                         buffer=word_list.data)
>>> # each unicode character is four bytes long. We only need first byte
>>> # we know that there are three characters in each word
>>> word_bytes = word_bytes[:, ::word_list.itemsize//3]
>>> word_bytes.shape
(586, 3)    # may vary
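
As an alternative sketch of the same conversion, the fixed-width Unicode array can be reinterpreted as 32-bit code points with `ndarray.view`. This assumes a contiguous one-dimensional array in native byte order; the two-word `words` array is a hypothetical stand-in for `word_list`.

```python
import numpy as np

words = np.asarray(["ape", "man"])  # hypothetical stand-in, fixed-width unicode

# Each character occupies 4 bytes, so viewing the buffer as uint32 yields
# one integer code point per character; reshape to one row per word.
codes = words.view(np.uint32).reshape(words.size, -1)
print(codes)
```

The result holds full code points rather than only the first byte of each character, which makes no difference for ASCII letters.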

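Now we'll use the Hamming distance between each pair of points to determine which pairs of words are connected. The Hamming distance measures the fraction of entries that differ between two vectors: any two words with a Hamming distance equal to \(1/N\), where \(N\) is the number of letters, are connected in the word ladder: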

>>> from scipy.spatial.distance import pdist, squareform
>>> from scipy.sparse import csr_matrix
>>> hamming_dist = pdist(word_bytes, metric='hamming')
>>> # there are three characters in each word
>>> graph = csr_matrix(squareform(hamming_dist < 1.5 / 3))
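
The following small sketch shows the Hamming distances for three words and how the `1.5 / 3` threshold selects the one-letter neighbors. The three-word list is a hypothetical example, and the characters are encoded with `ord` purely for brevity.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical mini example: encode each word as a row of code points.
codes = np.array([[ord(c) for c in w] for w in ["ape", "apt", "opt"]])

# Condensed distances in the order (0,1), (0,2), (1,2).
condensed = pdist(codes, metric="hamming")
print(condensed)  # fractions of differing positions

# Threshold the condensed form first so the diagonal of the square
# matrix stays False (no self-loops).
adjacent = squareform(condensed < 1.5 / 3)
print(adjacent)
```

Here "ape" and "opt" differ in two of three positions, so their distance of 2/3 falls above the threshold and no edge is created between them.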

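When comparing the distances to build the graph, we use the threshold `1.5 / 3` rather than an equality test, because equality can be unreliable for floating-point values. The threshold lies between \(1/3\) (one differing letter) and \(2/3\) (two differing letters), so the inequality gives the desired result as long as no two entries of the word list are identical. Now that our graph is set up, we'll use a shortest-path search to find the path between any two words in the graph: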

>>> i1 = word_list.searchsorted('ape')
>>> i2 = word_list.searchsorted('man')
>>> word_list[i1]
'ape'
>>> word_list[i2]
'man'
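
Because `searchsorted` returns an insertion position even when the word is absent, the lookup can be wrapped in a small helper that verifies the match. The function `find_word` and the `words` array below are hypothetical names used only for this sketch.

```python
import numpy as np


def find_word(sorted_words, word):
    """Return the index of word in a sorted array, or raise KeyError if absent."""
    i = int(np.searchsorted(sorted_words, word))
    # An index equal to the array length, or a mismatched entry, means
    # the word is not in the list.
    if i == len(sorted_words) or sorted_words[i] != word:
        raise KeyError(word)
    return i


words = np.asarray(["ape", "apt", "man"])  # hypothetical, already sorted
print(find_word(words, "man"))
```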

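We need to check that the words at the indices returned by `searchsorted` match the ones we searched for, because they will not if a word is missing from the list. Now all we need is the shortest path between these two indices in the graph. We'll use Dijkstra's algorithm, because it lets us compute the shortest paths from a single source node: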

>>> from scipy.sparse.csgraph import dijkstra
>>> distances, predecessors = dijkstra(graph, indices=i1,
...                                    return_predecessors=True)
>>> print(distances[i2])
5.0    # may vary

So the shortest path between "ape" and "man" contains only five steps. We can use the predecessors returned by the algorithm to reconstruct this path:

>>> path = []
>>> i = i2
>>> while i != i1:
...     path.append(word_list[i])
...     i = predecessors[i]
>>> path.append(word_list[i1])
>>> print(path[::-1])
['ape', 'apt', 'opt', 'oat', 'mat', 'man']    # may vary

This is two fewer links than our initial example, which took seven steps: the path from "ape" to "man" needs only five.
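
The reconstruction loop can be wrapped in a reusable helper that also copes with unreachable targets, for which `dijkstra` stores the sentinel `-9999` in the predecessor array. This is a sketch on a hypothetical four-node graph; `reconstruct_path` is an illustrative name, not a SciPy function.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra


def reconstruct_path(predecessors, start, end):
    """Walk a 1-D predecessor array back from end to start.

    Returns the node indices from start to end, or [] if end is unreachable.
    """
    if start != end and predecessors[end] < 0:
        return []  # negative sentinel: no path from start to end
    path = [int(end)]
    while path[-1] != start:
        path.append(int(predecessors[path[-1]]))
    return path[::-1]


# Hypothetical graph: a chain 0-1-2 plus an isolated node 3.
adjacency = csr_matrix(np.array([[0, 1, 0, 0],
                                 [1, 0, 1, 0],
                                 [0, 1, 0, 0],
                                 [0, 0, 0, 0]]))
dist, pred = dijkstra(adjacency, indices=0, directed=False,
                      return_predecessors=True)
print(reconstruct_path(pred, 0, 2))
print(reconstruct_path(pred, 0, 3))
```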

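Using other tools in the module, we can answer other questions. For example, are there three-letter words that are not linked to the others by any word ladder? This is a question of connected components in the graph: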

>>> from scipy.sparse.csgraph import connected_components
>>> N_components, component_list = connected_components(graph)
>>> print(N_components)
15    # may vary

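In this particular sample of three-letter words, there are 15 connected components: that is, 15 distinct sets of words with no paths between the sets. How many words are in each set? We can learn this from the list of components: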

>>> [np.sum(component_list == i) for i in range(N_components)]
[571, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]    # may vary

There is one large connected set and 14 smaller ones. Let's look at the words in the smaller ones:

>>> [list(word_list[np.nonzero(component_list == i)]) for i in range(1, N_components)]
[['aha'],    # may vary
 ['chi'],
 ['ebb'],
 ['ems', 'emu'],
 ['gnu'],
 ['ism'],
 ['khz'],
 ['nth'],
 ['ova'],
 ['qua'],
 ['ugh'],
 ['ups'],
 ['urn'],
 ['use']]

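These are all the three-letter words that are not connected to the main set via a word ladder. Note that "ems" and "emu" are connected to each other but to nothing else.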
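
Component sizes can also be tallied in a single call with `np.bincount`, since the labels returned by `connected_components` are consecutive integers starting at zero. The four-node `adjacency` matrix below is a hypothetical example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical graph: a chain 0-1-2 plus an isolated node 3.
adjacency = csr_matrix(np.array([[0, 1, 0, 0],
                                 [1, 0, 1, 0],
                                 [0, 1, 0, 0],
                                 [0, 0, 0, 0]]))
n_components, labels = connected_components(adjacency, directed=False)

# bincount counts how many nodes carry each component label.
sizes = np.bincount(labels, minlength=n_components)
print(n_components, sizes)
```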

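We might also be curious about which words are maximally separated. Which two words take the most links to connect? We can determine this by computing the matrix of all shortest paths. Note that, by convention, the distance between two unconnected points is reported as infinity, so we need to remove these entries before finding the maximum: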

>>> distances, predecessors = dijkstra(graph, return_predecessors=True)
>>> max_distance = np.max(distances[~np.isinf(distances)])
>>> print(max_distance)
13.0    # may vary

So there is at least one pair of words that takes 13 steps to get from one to the other! Let's determine which words these are:

>>> i1, i2 = np.nonzero(distances == max_distance)
>>> list(zip(word_list[i1], word_list[i2]))
[('imp', 'ohm'),    # may vary
 ('imp', 'ohs'),
 ('ohm', 'imp'),
 ('ohm', 'ump'),
 ('ohs', 'imp'),
 ('ohs', 'ump'),
 ('ump', 'ohm'),
 ('ump', 'ohs')]

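We see that two groups of words are maximally separated from each other: "imp" and "ump" on the one hand, and "ohm" and "ohs" on the other. Each word in the first group is 13 steps from each word in the second, and every pair is listed in both directions because the distance matrix is symmetric. We can find the connecting list in the same way as above: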

>>> path = []
>>> i = i2[0]
>>> while i != i1[0]:
...     path.append(word_list[i])
...     i = predecessors[i1[0], i]
>>> path.append(word_list[i1[0]])
>>> print(path[::-1])
['imp', 'amp', 'asp', 'ass', 'ads', 'add', 'aid', 'mid', 'mod', 'moo', 'too', 'tho', 'oho', 'ohm']    # may vary

This gives us the path we wanted to see.
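
Since the all-pairs distance matrix of an undirected graph is symmetric, each maximally separated pair appears twice in the listing above. A sketch of one way to report each pair once is to keep only the entries above the diagonal with `np.triu`; the four-node graph here is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

# Hypothetical graph: a chain 0-1-2 plus an isolated node 3.
adjacency = csr_matrix(np.array([[0, 1, 0, 0],
                                 [1, 0, 1, 0],
                                 [0, 1, 0, 0],
                                 [0, 0, 0, 0]]))
distances = dijkstra(adjacency, directed=False)

# Ignore infinite entries (unconnected pairs) when taking the maximum.
max_distance = distances[np.isfinite(distances)].max()

# Keep only entries strictly above the diagonal so each pair shows up once.
rows, cols = np.nonzero(np.triu(distances == max_distance, k=1))
pairs = list(zip(rows.tolist(), cols.tolist()))
print(max_distance, pairs)
```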

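Word ladders are just one potential application of SciPy's fast graph algorithms for sparse matrices. Graph theory shows up in many areas of mathematics, data analysis, and machine learning. The sparse graph tools are flexible enough to handle many of these situations.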
