当前位置：首页 > article >正文

Python实现MySQL数据库对象的血缘分析

article 2025/3/29 2:31:43

Python控制台的程序，实现遍历MySQL中所有的SQL对象（表、视图、用户定义函数、存储过程和触发器等），并取得它们之间之前的依赖性关系，并列出三张表，第一张表的第一列是所有的SQL对象名称，第二列是它的数据的生成路径，路径中的相邻SQL对象之间用“->”隔开，如果有多条路径，就存储多条记录，第二张表是根据依赖性生成所有表的列表，依懒性从上到下依次递增，第三张表是根据依赖性生成所有第二张表中表数据的存储过程列表，依懒性从上到下依次递增，存储在Excel文件中。

程序总体设计

1. 系统架构

+-----------------+     +-----------------+     +-----------------+     +-----------------+
| MySQL元数据采集 | --> | 依赖关系分析引擎 | --> | 数据路径生成器 | --> | Excel输出模块 |
+-----------------+     +-----------------+     +-----------------+     +-----------------+

2. 模块划分

元数据采集模块：通过information_schema获取所有数据库对象
依赖分析模块：解析SQL定义获取对象依赖关系
路径生成模块：构建依赖图谱并生成所有可能路径
拓扑排序模块：为表和存储过程生成依赖顺序
Excel输出模块：使用pandas生成多Sheet的Excel文件

3. 技术选型

Python 3.8+
数据库驱动：mysql-connector-python
SQL解析：sqlparse + 正则表达式
数据处理：pandas
Excel输出：openpyxl

4. 数据结构设计

class DBObject:
    def __init__(self, name, obj_type, definition):
        self.name = name          # 对象名称
        self.obj_type = obj_type # 对象类型（TABLE/VIEW/PROCEDURE等）
        self.definition = definition # 原始定义SQL
        self.dependencies = []    # 直接依赖对象列表

详细设计方案

1. 元数据采集（示例SQL）

# 获取所有表/视图
SELECT table_name, table_type 
FROM information_schema.tables 
WHERE table_schema = DATABASE()

# 获取存储过程/函数
SELECT routine_name, routine_type 
FROM information_schema.routines 
WHERE routine_schema = DATABASE()

# 获取触发器
SELECT trigger_name, event_object_table 
FROM information_schema.triggers 
WHERE trigger_schema = DATABASE()

2. 依赖关系分析算法

def parse_dependencies(obj: DBObject):
    if obj.obj_type == 'VIEW':
        # 使用正则匹配FROM和JOIN子句
        pattern = r'\b(?:FROM|JOIN)\s+([\w`]+)'
    elif obj.obj_type in ('PROCEDURE', 'FUNCTION'):
        # 匹配表引用和CALL语句
        pattern = r'\b(?:FROM|INTO|UPDATE|JOIN|CALL)\s+([\w`]+)'
    elif obj.obj_type == 'TRIGGER':
        # 触发器关联表已在元数据中
        pattern = None
    
    if pattern:
        matches = re.findall(pattern, obj.definition, re.IGNORECASE)
        obj.dependencies = [m.strip('`') for m in matches]

3. 依赖图谱构建

class DependencyGraph:
    def __init__(self):
        self.graph = defaultdict(list)  # 邻接表：{节点: [依赖项]}
        self.reverse_graph = defaultdict(list)  # 逆邻接表
    
    def add_edge(self, from_node, to_node):
        self.graph[to_node].append(from_node)
        self.reverse_graph[from_node].append(to_node)

4. 路径生成算法（DFS实现）

def find_all_paths(graph, start):
    paths = []
    
    def dfs(node, path):
        if node not in graph or not graph[node]:
            paths.append('->'.join(path[::-1]))
            return
        for neighbor in graph[node]:
            dfs(neighbor, path + [neighbor])
    
    dfs(start, [start])
    return paths

5. 拓扑排序算法（Kahn算法）

def topological_sort(graph):
    in_degree = {u:0 for u in graph}
    for u in graph:
        for v in graph[u]:
            in_degree[v] += 1
    
    queue = deque([u for u in in_degree if in_degree[u] == 0])
    sorted_list = []
    
    while queue:
        u = queue.popleft()
        sorted_list.append(u)
        for v in graph[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    
    return sorted_list

6. Excel输出结构

with pd.ExcelWriter('output.xlsx') as writer:
    # Sheet1: 对象路径表
    path_df.to_excel(writer, sheet_name='对象路径', index=False)
    
    # Sheet2: 表依赖顺序
    table_order_df.to_excel(writer, sheet_name='表依赖顺序', index=False)
    
    # Sheet3: 存储过程依赖顺序
    sp_order_df.to_excel(writer, sheet_name='存储过程顺序', index=False)

执行流程

初始化数据库连接
采集所有数据库对象元数据
构建依赖关系图谱
验证并处理循环依赖
生成路径数据（Sheet1）
分别对表和存储过程进行拓扑排序（Sheet2、Sheet3）
格式化输出到Excel文件

异常处理设计

数据库连接失败：捕获OperationalError，提示检查连接参数
循环依赖检测：使用Tarjan算法检测强连通分量
SQL解析失败：记录解析错误日志，跳过当前对象
文件写入异常：捕获PermissionError，提示关闭已打开的Excel文件

优化策略

缓存机制：对已解析对象建立缓存字典
并行处理：使用ThreadPoolExecutor并行解析不同对象
增量更新：记录最后处理时间戳，支持增量分析
批处理优化：使用executemany批量查询对象定义

示例输出格式

表1：对象路径表

对象名称	生成路径
view1	table_a->view1
view1	table_b->view1
proc1	table_a->view1->proc1

表2：表依赖顺序

表名
table_a
table_b
table_c

表3：存储过程顺序

存储过程名
proc_base
proc_main

这个设计方案可以实现对MySQL数据库对象的全面依赖分析，最终输出结构清晰的Excel报告。实际开发时需要特别注意SQL解析的准确性和异常处理的完备性。

此代码实现了对MySQL数据库对象的元数据采集、依赖关系分析、路径生成和拓扑排序，并将结果输出到Excel文件中。同时包含了基本的异常处理和优化策略。

import mysql.connector
import sqlparse
import re
from collections import defaultdict, deque
import pandas as pd
from tarjan import tarjan as tarjan_algorithm
import traceback
from concurrent.futures import ThreadPoolExecutor
import time


class DBObject:
    def __init__(self, name, obj_type, definition):
        self.name = name
        self.obj_type = obj_type
        self.definition = definition
        self.dependencies = []


def get_metadata(cnx):
    cursor = cnx.cursor()

    tables = []
    cursor.execute("""
        SELECT table_name, table_type 
        FROM information_schema.tables 
        WHERE table_schema = DATABASE()
    """)
    for row in cursor:
        name, obj_type = row
        cursor2 = cnx.cursor(dictionary=True)
        cursor2.execute(f"SHOW CREATE {obj_type.lower()} {name}")
        result = cursor2.fetchone()
        tables.append(DBObject(name, obj_type, result[f'Create {obj_type}']))
        cursor2.close()

    routines = []
    cursor.execute("""
        SELECT routine_name, routine_type 
        FROM information_schema.routines 
        WHERE routine_schema = DATABASE()
    """)
    for row in cursor:
        name, obj_type = row
        cursor2 = cnx.cursor(dictionary=True)
        cursor2.execute(f"SHOW CREATE {obj_type.lower()} {name}")
        result = cursor2.fetchone()
        routines.append(DBObject(name, obj_type, result[f'Create {obj_type}']))
        cursor2.close()

    triggers = []
    cursor.execute("""
        SELECT trigger_name, event_object_table 
        FROM information_schema.triggers 
        WHERE trigger_schema = DATABASE()
    """)
    for row in cursor:
        name, table = row
        triggers.append(DBObject(name, 'TRIGGER', table))

    all_objects = tables + routines + triggers
    cursor.close()
    return all_objects


def parse_dependencies(obj: DBObject):
    if obj.obj_type == 'VIEW':
        pattern = r'\b(?:FROM|JOIN)\s+([\w`]+)'
    elif obj.obj_type in ('PROCEDURE', 'FUNCTION'):
        pattern = r'\b(?:FROM|INTO|UPDATE|JOIN|CALL)\s+([\w`]+)'
    elif obj.obj_type == 'TRIGGER':
        pattern = None

    if pattern:
        matches = re.findall(pattern, obj.definition, re.IGNORECASE)
        obj.dependencies = [m.strip('`') for m in matches]


class DependencyGraph:
    def __init__(self):
        self.graph = defaultdict(list)
        self.reverse_graph = defaultdict(list)

    def add_edge(self, from_node, to_node):
        self.graph[to_node].append(from_node)
        self.reverse_graph[from_node].append(to_node)


def find_all_paths(graph, start):
    paths = []

    def dfs(node, path):
        if node not in graph or not graph[node]:
            paths.append('->'.join(path[::-1]))
            return
        for neighbor in graph[node]:
            dfs(neighbor, path + [neighbor])

    dfs(start, [start])
    return paths


def topological_sort(graph):
    in_degree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            in_degree[v] += 1

    queue = deque([u for u in in_degree if in_degree[u] == 0])
    sorted_list = []

    while queue:
        u = queue.popleft()
        sorted_list.append(u)
        for v in graph[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)

    return sorted_list


def detect_cycles(graph):
    components = tarjan_algorithm(graph)
    for component in components:
        if len(component) > 1:
            return True
    return False


def main():
    try:
        cnx = mysql.connector.connect(
            user='your_username',
            password='your_password',
            host='your_host',
            database='your_database'
        )

        all_objects = get_metadata(cnx)

        cache = {}
        graph = DependencyGraph()

        def parse_object(obj):
            if obj.name in cache:
                return cache[obj.name]
            parse_dependencies(obj)
            for dep in obj.dependencies:
                graph.add_edge(obj.name, dep)
            cache[obj.name] = obj
            return obj

        with ThreadPoolExecutor() as executor:
            all_objects = list(executor.map(parse_object, all_objects))

        if detect_cycles(graph.graph):
            print("检测到循环依赖，请检查数据库对象的定义。")
            return

        path_data = []
        for obj in all_objects:
            paths = find_all_paths(graph.graph, obj.name)
            for path in paths:
                path_data.append({'对象名称': obj.name, '生成路径': path})

        path_df = pd.DataFrame(path_data)

        tables = [obj.name for obj in all_objects if obj.obj_type == 'TABLE']
        table_graph = {k: v for k, v in graph.graph.items() if k in tables}
        table_order = topological_sort(table_graph)
        table_order_df = pd.DataFrame({'表名': table_order})

        procs = [obj.name for obj in all_objects if obj.obj_type in ('PROCEDURE', 'FUNCTION')]
        proc_graph = {k: v for k, v in graph.graph.items() if k in procs}
        proc_order = topological_sort(proc_graph)
        sp_order_df = pd.DataFrame({'存储过程名': proc_order})

        with pd.ExcelWriter('output.xlsx') as writer:
            path_df.to_excel(writer, sheet_name='对象路径', index=False)
            table_order_df.to_excel(writer, sheet_name='表依赖顺序', index=False)
            sp_order_df.to_excel(writer, sheet_name='存储过程顺序', index=False)

        cnx.close()
    except mysql.connector.OperationalError as e:
        print(f"数据库连接失败: {e}")
    except Exception as e:
        print(f"发生错误: {e}")
        traceback.print_exc()


if __name__ == "__main__":
    main()