2.4. 一个字谜检测例子¶

2.4. An Anagram Detection Example

中文英文

展示不同数量级算法的一个很好例子是经典的字符串字谜检测问题。如果一个字符串是另一个字符串的字谜，那么第二个字符串只是第一个字符串的重新排列。例如，heart 和 earth 是字谜。字符串 python 和 typhon 也是字谜。为了简单起见，我们假设这两个字符串的长度相等，并且它们由 26 个小写字母组成。我们的目标是编写一个布尔函数，该函数接受两个字符串并返回它们是否是字谜。

A good example problem for showing algorithms with different orders of magnitude is the classic anagram detection problem for strings. One string is an anagram of another if the second is simply a rearrangement of the first. For example, heart and earth are anagrams. The strings python and typhon are anagrams as well. For the sake of simplicity, we will assume that the two strings in question are of equal length and that they are made up of symbols from the set of 26 lowercase alphabetic characters. Our goal is to write a boolean function that will take two strings and return whether they are anagrams.

2.4.1. 解决方案1：字谜检测逐一检查¶

2.4.1. Solution 1: Anagram Detection Checking Off

中文英文

我们对字谜问题的第一个解决方案将检查字符串的长度，然后检查第一个字符串中的每个字符是否确实出现在第二个字符串中。如果可以勾选每个字符，那么这两个字符串必须是字谜。勾选字符将通过将其替换为特殊的 Python 值 None 来完成。然而，由于 Python 中的字符串是不可变的，因此处理的第一步是将第二个字符串转换为列表。然后可以将第一个字符串中的每个字符与列表中的字符进行比较，如果找到，则通过替换来勾选。ActiveCode 1 显示了这个函数。

Activity: 2.4.1.1 Checking Off
def anagram_solution_1(s1, s2):
    still_ok = True
    if len(s1) != len(s2):
        still_ok = False

    a_list = list(s2)
    pos_1 = 0

    while pos_1 < len(s1) and still_ok:
        pos_2 = 0
        found = False
        while pos_2 < len(a_list) and not found:
            if s1[pos_1] == a_list[pos_2]:
                found = True
            else:
                pos_2 = pos_2 + 1
        if found:
            a_list[pos_2] = None
        else:
            still_ok = False
        pos_1 = pos_1 + 1

    return still_ok


print(anagram_solution_1("apple", "pleap"))  # 预期结果: True
print(anagram_solution_1("abcd", "dcba"))  # 预期结果: True
print(anagram_solution_1("abcd", "dcda"))  # 预期结果: False

要分析这个算法，我们需要注意 s1 中的每个 n 个字符都会导致对 s2 列表中的最多 n 个字符的迭代。列表中的每个 n 个位置将被访问一次以匹配来自 s1 的字符。访问的次数将成为从 1 到 n 的整数的总和。我们之前提到，这可以写成：

\[\begin{align} \sum_{i=1}^{n} i &= \frac {n(n+1)}{2} \\ &= \frac {1}{2}n^{2} + \frac {1}{2}n \end{align}\]

随着 $n$ 变大，$n^{2}$ 项将占主导地位，而 $n$ 项和 $\frac {1}{2}$ 可以忽略。因此，这个解决方案的时间复杂度是 $O(n^{2})$。

Our first solution to the anagram problem will check the lengths of the strings and then check to see that each character in the first string actually occurs in the second. If it is possible to check off each character, then the two strings must be anagrams. Checking off a character will be accomplished by replacing it with the special Python value None. However, since strings in Python are immutable, the first step in the process will be to convert the second string to a list. Each character from the first string can be checked against the characters in the list and if found, checked off by replacement. ActiveCode 1 shows this function.

Activity: 2.4.1.1 Checking Off
def anagram_solution_1(s1, s2):
    still_ok = True
    if len(s1) != len(s2):
        still_ok = False

    a_list = list(s2)
    pos_1 = 0

    while pos_1 < len(s1) and still_ok:
        pos_2 = 0
        found = False
        while pos_2 < len(a_list) and not found:
            if s1[pos_1] == a_list[pos_2]:
                found = True
            else:
                pos_2 = pos_2 + 1
        if found:
            a_list[pos_2] = None
        else:
            still_ok = False
        pos_1 = pos_1 + 1

    return still_ok


print(anagram_solution_1("apple", "pleap"))  # expected: True
print(anagram_solution_1("abcd", "dcba"))  # expected: True
print(anagram_solution_1("abcd", "dcda"))  # expected: False

To analyze this algorithm, we need to note that each of the n characters in s1 will cause an iteration through up to n characters in the list from s2. Each of the n positions in the list will be visited once to match a character from s1. The number of visits then becomes the sum of the integers from 1 to n. We stated earlier that this can be written as

\[\begin{align} \sum_{i=1}^{n} i &= \frac {n(n+1)}{2} \\ &= \frac {1}{2}n^{2} + \frac {1}{2}n \end{align}\]

As $n$ gets large, the $n^{2}$ term will dominate the $n term and the $\frac {1}{2}$ can be ignored. Therefore, this solution is $O(n^{2})$.

2.4.2. 解决方案2：排序和比较¶

2.4.2. Anagram Detection Solution 2: Sort and Compare

中文英文

另一种解决字谜问题的方案利用了这样一个事实，即虽然 s1 和 s2 不同，但它们只有在由完全相同的字符组成时才是字谜。因此，如果我们先按字母顺序将每个字符串从 a 排到 z，那么如果原始的两个字符串是字谜，最终它们会得到相同的字符串。ActiveCode 2 显示了这个解决方案。同样，在 Python 中，我们可以通过将每个字符串转换为列表来使用内置的 sort 方法进行排序。

Activity: 2.4.2.1 Sort and Compare
def anagram_solution_2(s1, s2):
    a_list_1 = list(s1)
    a_list_2 = list(s2)

    a_list_1.sort()
    a_list_2.sort()

    pos = 0
    matches = True

    while pos < len(s1) and matches:
        if a_list_1[pos] == a_list_2[pos]:
            pos = pos + 1
        else:
            matches = False

    return matches


print(anagram_solution_2("apple", "pleap"))  # 预期结果: True
print(anagram_solution_2("abcd", "dcba"))  # 预期结果: True
print(anagram_solution_2("abcd", "dcda"))  # 预期结果: False

乍一看，你可能会认为该算法是 $O(n)$，因为在排序过程之后，只需进行一次简单的迭代来比较 n 个字符。然而，对 Python sort 方法的两次调用也并非没有代价。正如我们将在第 5 章看到的那样，排序通常是 $O(n^{2})$ 或 $O(n\log n)$，因此排序操作占主导地位。最终，这个算法的数量级与排序过程相同。

Another solution to the anagram problem will make use of the fact that even though s1 and s2 are different, they are anagrams only if they consist of exactly the same characters. So if we begin by sorting each string alphabetically from a to z, we will end up with the same string if the original two strings are anagrams. ActiveCode 2 shows this solution. Again, in Python we can use the built-in sort method on lists by simply converting each string to a list at the start.

Activity: 2.4.2.1 Sort and Compare
def anagram_solution_2(s1, s2):
    a_list_1 = list(s1)
    a_list_2 = list(s2)

    a_list_1.sort()
    a_list_2.sort()

    pos = 0
    matches = True

    while pos < len(s1) and matches:
        if a_list_1[pos] == a_list_2[pos]:
            pos = pos + 1
        else:
            matches = False

    return matches


print(anagram_solution_2("apple", "pleap"))  # expected: True
print(anagram_solution_2("abcd", "dcba"))  # expected: True
print(anagram_solution_2("abcd", "dcda"))  # expected: False

At first glance you may be tempted to think that this algorithm is $O(n)$, since there is one simple iteration to compare the n characters after the sorting process. However, the two calls to the Python sort method are not without their own cost. As we will see in Chapter 5, sorting is typically either $O(n^{2})$ or $O(n\log n)$, so the sorting operations dominate the iteration. In the end, this algorithm will have the same order of magnitude as that of the sorting process.

2.4.3. 解决方案3：暴力破解¶

2.4.3. Anagram Detection Solution 3: Brute Force

中文英文

暴力求解方法通常尝试穷尽所有可能性。对于字谜检测问题，我们可以简单地生成使用 s1 中字符的所有可能字符串列表，然后检查 s2 是否出现。然而，这种方法存在问题。生成 s1 的所有可能字符串时，第一位有 n 种可能，第二位有 $n - 1$ 种可能，第三位有 $n - 2$ 种可能，依此类推。候选字符串的总数是 $n \cdot (n - 1) \cdot (n - 2) \cdot ... \cdot 3 \cdot 2 \cdot 1$，即 $n!$。虽然某些字符串可能是重复的，但程序无法提前知道这一点，因此它仍然会生成 $n!$ 个不同的字符串。

事实上，随着 n 的增大，$n!$ 的增长速度甚至比 $2^{n}$ 还快。实际上，如果 s1 有 20 个字符，那么可能的候选字符串数为 $20! = 2,432,902,008,176,640,000$。如果我们每秒处理一个可能的字符串，遍历整个列表仍然需要 77,146,816,596 年。这可能不是一个好的解决方案。

A brute force technique for solving a problem typically tries to exhaust all possibilities. For the anagram detection problem, we can simply generate a list of all possible strings using the characters from s1 and then see if s2 occurs. However, there is a problem with this approach. When generating all possible strings from s1, there are n possible first characters, $n - 1$ possible characters for the second position, $n - 2$ for the third, and so on. The total number of candidate strings is $n \cdot (n - 1) \cdot (n - 2) \cdot ... \cdot 3 \cdot 2 \cdot 1$, which is $n!$. Although some of the strings may be duplicates, the program cannot know this ahead of time and so it will still generate $n!$ different strings.

It turns out that $n!$ grows even faster than $2^{n}$ as n gets large. In fact, if s1 were 20 characters long, there would be $20! = 2,432,902,008,176,640,000$ possible candidate strings. If we processed one possibility every second, it would still take us 77,146,816,596 years to go through the entire list. This is probably not going to be a good solution.

2.4.4. 解决方案4：计数和比较¶

2.4.4. Anagram Detection Solution 4: Count and Compare

中文英文

我们对字谜问题的最终解决方案利用了这样一个事实：任何两个字谜都会有相同数量的字母 a，相同数量的字母 b，相同数量的字母 c，以此类推。为了判断两个字符串是否是字谜，我们首先要计算每个字符出现的次数。由于有 26 个可能的字符，我们可以使用一个包含 26 个计数器的列表，每个计数器对应一个字符。每次看到一个特定字符时，我们将增加该位置的计数。最终，如果这两个计数字符列表相同，那么这两个字符串必须是字谜。ActiveCode 3 显示了该解决方案。

Activity: 2.4.4.1 Count and Compare
def anagram_solution_4(s1, s2):
    c1 = [0] * 26
    c2 = [0] * 26

    for i in range(len(s1)):
        pos = ord(s1[i]) - ord("a")
        c1[pos] = c1[pos] + 1

    for i in range(len(s2)):
        pos = ord(s2[i]) - ord("a")
        c2[pos] = c2[pos] + 1

    j = 0
    still_ok = True
    while j < 26 and still_ok:
        if c1[j] == c2[j]:
            j = j + 1
        else:
            still_ok = False

    return still_ok


print(anagram_solution_4("apple", "pleap"))  # 预期结果: True
print(anagram_solution_4("abcd", "dcba"))  # 预期结果: True
print(anagram_solution_4("abcd", "dcda"))  # 预期结果: False

同样，这个解决方案也有多次迭代。然而，不同于第一个解决方案，迭代并未嵌套。前两次用于计算字符的迭代基于 n。第三次迭代用于比较两个计数列表，由于字符串中有 26 个可能的字符，这次迭代总共需要 26 步。总共计算得到的步数为 $T(n)=2n+26$。即 $O(n)$。我们已经找到了一个线性数量级的算法来解决这个问题。

在结束这个例子之前，我们需要讨论一下空间需求。虽然最后的解决方案能够在线性时间内运行，但它只能通过使用额外的存储空间来保存两个字符计数列表来实现。换句话说，这个算法牺牲了空间来换取时间。

这种情况非常常见。在许多情况下，你需要在时间和空间的权衡之间做出决策。在这个例子中，额外的空间需求并不大。然而，如果底层的字母表包含数百万个字符，那么这个问题将会更加令人担忧。作为一名计算机科学家，当你有选择算法的机会时，你需要根据具体问题决定如何最有效地利用计算资源。

自检

活动：2.4.4.2 多项选择活动：2.4.4.3 多项选择活动：2.4.4.4 多项选择

Q-4：给定以下代码片段，它的 Big O 运行时间是什么？

test = 0
for i in range(n):
    for j in range(n):
        test = test + i * j

A. $O(n)$
B. $O(n^2)$
C. $O(log n)$
D. $O(n^3)$

Q-5：给定以下代码片段，它的 Big O 运行时间是什么？

test = 0
for i in range(n):
    test = test + 1

for j in range(n):
    test = test - 1

A. $O(n)$
B. $O(n^2)$
C. $O(log n)$
D. $O(n^3)$

Q-6：给定以下代码片段，它的 Big O 运行时间是什么？

i = n
while i > 0:
    k = 2 + 2
    i = i // 2

A. $O(n)$
B. $O(n^2)$
C. $O(log n)$
D. $O(n^3)$

Our final solution to the anagram problem takes advantage of the fact that any two anagrams will have the same number of a’s, the same number of b’s, the same number of c’s, and so on. In order to decide whether two strings are anagrams, we will first count the number of times each character occurs. Since there are 26 possible characters, we can use a list of 26 counters, one for each possible character. Each time we see a particular character, we will increment the counter at that position. In the end, if the two lists of counters are identical, the strings must be anagrams. ActiveCode 3 shows this solution.

Activity: 2.4.4.1 Count and Compare
def anagram_solution_4(s1, s2):
    c1 = [0] * 26
    c2 = [0] * 26

    for i in range(len(s1)):
        pos = ord(s1[i]) - ord("a")
        c1[pos] = c1[pos] + 1

    for i in range(len(s2)):
        pos = ord(s2[i]) - ord("a")
        c2[pos] = c2[pos] + 1

    j = 0
    still_ok = True
    while j < 26 and still_ok:
        if c1[j] == c2[j]:
            j = j + 1
        else:
            still_ok = False

    return still_ok


print(anagram_solution_4("apple", "pleap"))  # expected: True
print(anagram_solution_4("abcd", "dcba"))  # expected: True
print(anagram_solution_4("abcd", "dcda"))  # expected: False

Again, the solution has a number of iterations. However, unlike the first solution, none of them are nested. The first two iterations used to count the characters are both based on n. The third iteration, comparing the two lists of counts, always takes 26 steps since there are 26 possible characters in the strings. Adding it all up gives us $T(n)=2n+26$ steps. That is $O(n)$. We have found a linear order of magnitude algorithm for solving this problem.

Before leaving this example, we need to say something about space requirements. Although the last solution was able to run in linear time, it could only do so by using additional storage to keep the two lists of character counts. In other words, this algorithm sacrificed space in order to gain time.

This is a common occurrence. On many occasions you will need to make decisions between time and space trade-offs. In this case, the amount of extra space is not significant. However, if the underlying alphabet had millions of characters, there would be more concern. As a computer scientist, when given a choice of algorithms, it will be up to you to determine the best use of computing resources given a particular problem.

Self Check [Activity: 2.4.4.2 Multiple Choice]

Q-4: Given the following code fragment, what is its Big O running time?

test = 0
for i in range(n):
   for j in range(n):
      test = test + i * j

A. O(n)
B. O(n^2)
C. O(log n)
D. O(n^3)

Self Check [Activity: 2.4.4.3 Multiple Choice]

Q-5: Given the following code fragment what is its Big O running time?

test = 0
for i in range(n):
   test = test + 1

for j in range(n):
   test = test - 1

A. O(n)
B. O(n^2)
C. O(log n)
D. O(n^3)

Self Check [Activity: 2.4.4.4 Multiple Choice]

Q-6: Given the following code fragment what is its Big O running time?

i = n
while i > 0:
   k = 2 + 2
   i = i // 2

A. O(n)
B. O(n^2)
C. O(log n)
D. O(n^3)

最后更新: 2024年9月12日
创建日期: 2024年9月9日