当前位置：首页 > article >正文

算法基础 -- Trie压缩树原理

article 2025/3/4 8:49:53

Trie 压缩树原理

一、Trie 与压缩 Trie 的基本概念

1. 普通 Trie（字典树）

Trie（前缀树或字典树）是一种用于高效存储和检索字符串的树形数据结构。它的关键特性包括：

节点（Node）不直接存储完整字符串，而是根据字符串的前缀共用节点。
每条边代表可能是一个字符（在最常见的实现里，每一层对应一个字符），也可能是一个字符集的索引。

例如，将字符串集合 ["he", "hello", "hi"] 插入普通 Trie 时，结构如下：

        (root)
        /    \
      'h'    ...
       |
       v
      (node1)
     /      \
   'e'      'i'
    |        |
   (node2)  (node3)
   /   \
 'l'   (单词结束标记：he)
  |
 (node4)
  |
 'l'
  |
 (node5)
  |
 'o'
  |
 (node6) (单词结束标记：hello)

普通 Trie 的优点是结构清晰，查询和插入的时间复杂度理想情况下为 O(k)，其中 k 是字符串长度。但其缺点是当字符串有较多相同前缀时，会出现大量“只有一个孩子的链状节点”，浪费了存储空间。

2. 压缩 Trie（Compressed Trie）

压缩 Trie 对普通 Trie 进行了优化，当一个节点只有唯一分支时，将这些连续的节点合并到一条边上，存储为一个更长的“公共子串”。这样可以显著减少冗余节点。

例如，字符串集合 ["he", "hello", "hi"] 的压缩 Trie：

        (root)
         |
        "h"
         |
       (节点A)
       /    \
    "e"     "i"
     |
   (节点B)
     |
   "llo"
     |
   (节点C) (单词结束标记：hello)

压缩 Trie 的操作要点

插入：
1. 找到要插入字符串与现有边的公共前缀；
2. 如果完全匹配，进入下一层节点继续处理；
3. 如果部分匹配，需要拆分边，将公共前缀与剩余部分分别存储；
4. 如果没有匹配，新建一条边。
搜索：沿着边逐字符匹配。如果匹配完整且到达结束标记，则表示字符串存在；否则表示搜索失败。

这种方式能在空间和查询效率之间取得较好平衡。

二、C 语言实现

1. 数据结构定义

在实现中，我们定义以下结构体：

Edge 结构体

表示一条边，包含：

char *label：存储边上的字符串片段（子串）。
struct TrieNode *child：指向后续节点。

TrieNode 结构体

表示 Trie 中的一个节点，包含：

bool isEndOfWord：标记是否为单词结束节点。
int numChildren：子边的数量。
Edge *children：动态数组形式存储子边。

结构关系如下：

(root节点)  ->  Edge[label="abc"] -> (nodeA)
                       \
                        ...

2. 基础函数

创建节点

TrieNode *createNode() {
    TrieNode *node = (TrieNode *)malloc(sizeof(TrieNode));
    node->isEndOfWord = false;
    node->numChildren = 0;
    node->children = NULL;
    return node;
}

字符串工具函数

截取子串和计算公共前缀长度：

char *substr(const char *src, int start, int end) {
    int length = end - start;
    char *dest = (char *)malloc(sizeof(char) * (length + 1));
    strncpy(dest, src + start, length);
    dest[length] = '\0';
    return dest;
}

int commonPrefixLength(const char *s1, const char *s2) {
    int i = 0;
    while (s1[i] && s2[i] && s1[i] == s2[i]) {
        i++;
    }
    return i;
}

3. 插入字符串

插入函数的逻辑：

遍历当前节点的子边，查找公共前缀。
如果需要，拆分边并插入新字符串。
如果没有匹配，新建一条边。

完整代码：

void insertString(TrieNode *root, const char *key) {
    if (*key == '\0') {
        root->isEndOfWord = true;
        return;
    }

    for (int i = 0; i < root->numChildren; i++) {
        Edge *edge = &root->children[i];
        int prefixLen = commonPrefixLength(key, edge->label);

        if (prefixLen == 0) {
            continue;
        }

        if (prefixLen == (int)strlen(edge->label)) {
            insertString(edge->child, key + prefixLen);
            return;
        } else {
            char *commonPart = substr(edge->label, 0, prefixLen);
            char *oldLabelRem = substr(edge->label, prefixLen, strlen(edge->label));

            TrieNode *newNode = createNode();

            newNode->children = (Edge *)malloc(sizeof(Edge));
            newNode->children[0].label = oldLabelRem;
            newNode->children[0].child = edge->child;
            newNode->numChildren = 1;

            free(edge->label);
            edge->label = commonPart;
            edge->child = newNode;

            const char *keyRem = key + prefixLen;
            if (*keyRem == '\0') {
                newNode->isEndOfWord = true;
            } else {
                TrieNode *anotherChild = createNode();
                newNode->children = (Edge *)realloc(newNode->children,
                    sizeof(Edge) * (newNode->numChildren + 1));
                Edge *e = &newNode->children[newNode->numChildren];
                e->label = strdup(keyRem);
                e->child = anotherChild;
                newNode->numChildren++;
                anotherChild->isEndOfWord = true;
            }
            return;
        }
    }

    root->children = (Edge *)realloc(root->children,
            sizeof(Edge) * (root->numChildren + 1));
    Edge *newEdge = &root->children[root->numChildren];
    newEdge->label = strdup(key);
    newEdge->child = createNode();
    root->numChildren++;
    newEdge->child->isEndOfWord = true;
}

4. 搜索字符串

搜索函数逻辑与插入类似，逐步匹配字符串，直至找到或失败：

bool searchString(TrieNode *root, const char *key) {
    if (*key == '\0') {
        return root->isEndOfWord;
    }

    for (int i = 0; i < root->numChildren; i++) {
        Edge *edge = &root->children[i];
        int prefixLen = commonPrefixLength(key, edge->label);

        if (prefixLen == (int)strlen(edge->label)) {
            return searchString(edge->child, key + prefixLen);
        }
    }

    return false;
}

5. 测试示例

完整测试代码：

int main() {
    TrieNode *root = createNode();

    const char *words[] = {"he", "hello", "hi", "cat", "cater", "cart", "dog"};
    int n = sizeof(words) / sizeof(words[0]);

    for (int i = 0; i < n; i++) {
        insertString(root, words[i]);
        printf("Inserted: %s\n", words[i]);
    }

    const char *tests[] = {"he", "hello", "hi", "cat", "cater", "cart", "do", "dog", "c", "hea", "car"};
    int t = sizeof(tests) / sizeof(tests[0]);

    printf("\nSearch tests:\n");
    for (int i = 0; i < t; i++) {
        bool found = searchString(root, tests[i]);
        printf("'%s' -> %s\n", tests[i], found ? "FOUND" : "NOT FOUND");
    }

    return 0;
}