
Traversing Folders with Python

article 2024/10/7 4:00:32

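Python's standard library is enough to solve this problem. The task is to walk a directory tree, find every .txt file, read its contents, remove blank lines and surrounding whitespace, and merge the cleaned contents into a single new file.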

Overall approach:

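Traverse subdirectories: the os module can list every file under a directory. os.walk is the usual choice, because it recursively visits all files and subdirectories under the directory it is given.
Read and process each file: for every .txt file, read the contents and remove blank lines and surrounding whitespace. The string method strip() removes leading and trailing whitespace from a line, and any line that is empty afterwards is dropped.
Merge the contents: once each file has been processed, collect all the cleaned lines together, ready to be written to the new file.
Write the new file: finally, write the merged contents to a new text file.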

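Implementation Steps in Python

We start with the file traversal. Once walking the subdirectories works, the remaining details can be added one step at a time.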

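Step 1: Traverse the subdirectories

In Python, os.walk recursively traverses every subdirectory and file under a given directory. It returns a generator that yields 3-tuples (dirpath, dirnames, filenames): the path of the current directory, the names of its subdirectories, and the names of its files.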

import os

def list_text_files(root_dir):
    text_files = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for file in filenames:
            if file.endswith(".txt"):
                text_files.append(os.path.join(dirpath, file))
    return text_files

This function walks root_dir and all of its subdirectories, and appends the path of every .txt file to the text_files list.
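To make the shape of those 3-tuples concrete, here is a minimal sketch that builds a small throwaway tree and collects what os.walk yields. The helper names show_walk and make_demo_tree, and the file names in the tree, are made up for illustration. The in-place dirnames.sort() is an addition: os.walk does not promise any particular order, so sorting makes the output predictable.

```python
import os
import tempfile

def show_walk(root_dir):
    """Return the (dirpath, dirnames, filenames) triples, with dirpath relative to root_dir."""
    triples = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # sorting in place fixes the order in which subdirectories are visited
        rel = os.path.relpath(dirpath, root_dir)
        triples.append((rel, list(dirnames), sorted(filenames)))
    return triples

def make_demo_tree(root):
    """Create a hypothetical layout: two subdirectories with a few small files."""
    for rel in ("subdir1/file1.txt", "subdir1/notes.md", "subdir2/file3.txt"):
        parts = rel.split("/")
        os.makedirs(os.path.join(root, *parts[:-1]), exist_ok=True)
        with open(os.path.join(root, *parts), "w", encoding="utf-8") as f:
            f.write("demo\n")

with tempfile.TemporaryDirectory() as tmp:
    make_demo_tree(tmp)
    triples = show_walk(tmp)
    for triple in triples:
        print(triple)
```

Each triple describes one directory only, which is why list_text_files joins dirpath with each file name to get a full path.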

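Step 2: Read each file and remove blank lines and surrounding whitespace

To remove blank lines and surrounding whitespace, we call strip() on each line and skip any line that ends up empty. Example code: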

def clean_text_file(file_path):
    cleaned_lines = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            cleaned_line = line.strip()  # Remove leading and trailing whitespace
            if cleaned_line:  # Skip blank lines
                cleaned_lines.append(cleaned_line)
    return cleaned_lines

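This function opens a .txt file and reads it line by line. strip() removes the leading and trailing whitespace of each line, including the trailing newline. Lines that are empty after stripping are skipped, so only lines with content are kept.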
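A small sketch of what strip() does and does not remove may help here. The helper clean_lines and the sample strings are made up for illustration; the loop mirrors the logic of clean_text_file but works on a list of strings instead of a file.

```python
def clean_lines(lines):
    """Strip each line and keep only those that still have content."""
    cleaned = []
    for line in lines:
        stripped = line.strip()  # no argument: removes spaces, tabs and newlines at both ends
        if stripped:             # an empty string is falsy, so whitespace-only lines are dropped
            cleaned.append(stripped)
    return cleaned

sample = ["  Hello World \n", "\n", "\t \n", "\tkeep  inner  spaces\r\n"]
print(clean_lines(sample))
```

Whitespace inside a line is left untouched. If leading indentation matters for your files, rstrip() would remove only the trailing whitespace.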

Step 3: Merge the contents of all files

Next, we combine the cleaned contents of every file. We call clean_text_file() for each file and append the lines it returns to one large list.

def merge_cleaned_files(file_paths):
    all_cleaned_lines = []
    for file_path in file_paths:
        cleaned_lines = clean_text_file(file_path)
        all_cleaned_lines.extend(cleaned_lines)
    return all_cleaned_lines

This function loops over all the file paths, cleans each file with clean_text_file(), and extends the all_cleaned_lines list with the result.

Step 4: Write the new file

The merged contents now need to be written to a new .txt file. Python's built-in open() function handles this.

def write_to_new_file(new_file_path, cleaned_content):
    with open(new_file_path, 'w', encoding='utf-8') as new_file:
        for line in cleaned_content:
            new_file.write(line + '\n')

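This function opens a new file and writes the cleaned contents to it line by line. Because strip() removed the original line endings, we append \n to each line so that every entry ends up on its own line.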

Complete implementation

Putting the steps above together gives the complete Python script:

import os

# Step 1: List all text files in the directory and its subdirectories
def list_text_files(root_dir):
    text_files = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for file in filenames:
            if file.endswith(".txt"):
                text_files.append(os.path.join(dirpath, file))
    return text_files

# Step 2: Clean text files by removing blank lines and extra spaces
def clean_text_file(file_path):
    cleaned_lines = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            cleaned_line = line.strip()  # Remove leading and trailing spaces
            if cleaned_line:  # Ignore blank lines
                cleaned_lines.append(cleaned_line)
    return cleaned_lines

# Step 3: Merge the cleaned content of all files
def merge_cleaned_files(file_paths):
    all_cleaned_lines = []
    for file_path in file_paths:
        cleaned_lines = clean_text_file(file_path)
        all_cleaned_lines.extend(cleaned_lines)
    return all_cleaned_lines

# Step 4: Write merged content to a new file
def write_to_new_file(new_file_path, cleaned_content):
    with open(new_file_path, 'w', encoding='utf-8') as new_file:
        for line in cleaned_content:
            new_file.write(line + '\n')

# Main function to orchestrate the process
def process_text_files(root_dir, new_file_path):
    # Step 1: Get all text files
    text_files = list_text_files(root_dir)
    # Step 2 and 3: Clean and merge the content
    cleaned_content = merge_cleaned_files(text_files)
    # Step 4: Write to the new file
    write_to_new_file(new_file_path, cleaned_content)

# Example usage:
root_directory = '/path/to/your/directory'
output_file = '/path/to/your/output_file.txt'
process_text_files(root_directory, output_file)

Explanation of the code

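list_text_files: walks the directory and its subdirectories and finds every file whose name ends in .txt. The full path of each file is stored in the text_files list for later processing.
clean_text_file: reads the given file line by line and uses strip() to remove leading and trailing whitespace. It then checks whether cleaned_line is empty to filter out blank lines. A line that still has content is appended to the cleaned_lines list.
merge_cleaned_files: merges the contents of all files. It loops over the file paths, calls clean_text_file to get the cleaned lines of each file, and collects them in one large list.
write_to_new_file: writes the merged contents to a new file. Each line is written as line + '\n' so that it ends with a newline.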
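One case the script does not cover: if the output file is placed inside root_dir and its name ends in .txt, a second run would pick up the previous output as input. A possible guard is sketched below. The function name list_text_files_excluding and its exclude_path parameter are hypothetical additions, not part of the original script.

```python
import os
import tempfile

def list_text_files_excluding(root_dir, exclude_path=None):
    """Like list_text_files, but skips one path (for example the output file)."""
    excluded = os.path.abspath(exclude_path) if exclude_path else None
    text_files = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            full_path = os.path.join(dirpath, name)
            if excluded and os.path.abspath(full_path) == excluded:
                continue  # do not feed the merged output back in as input
            text_files.append(full_path)
    return text_files

# Demo on a throwaway directory: merged.txt stands in for an output left by an earlier run.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "merged.txt"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write("demo\n")
    output_path = os.path.join(root, "merged.txt")
    found = [os.path.basename(p)
             for p in list_text_files_excluding(root, exclude_path=output_path)]
    print(found)
```

Writing the output to a location outside root_dir avoids the problem without any code change.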

Example

Suppose we have the following directory structure:

/example_directory
    /subdir1
        file1.txt
        file2.txt
    /subdir2
        file3.txt
        file4.txt

The .txt files might contain something like the following:

file1.txt
```
Hello World

This is a test.
  
```
file2.txt
```
Python is fun!
      
```
file3.txt
```
The quick brown fox.
```

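After processing, blank lines and leading and trailing whitespace are removed from each file. For the three files shown, and assuming they are visited in name order (which os.walk does not guarantee), the merged result is: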

Hello World
This is a test.
Python is fun!
The quick brown fox.

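Finally, all of the processed content is written to a new file. That file contains every non-blank line from the .txt files, with leading and trailing whitespace removed.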
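The example can be reproduced with a short script. This is a sketch, not the script from the article: merge_sorted is a hypothetical compact variant that sorts directory and file names so the merge order matches the listing above. file4.txt is left out because the example does not show its contents.

```python
import os
import tempfile

def merge_sorted(root_dir):
    """Collect cleaned lines from all .txt files, visiting names in sorted order."""
    merged = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # in-place sort controls the order of the subdirectories
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(dirpath, name), "r", encoding="utf-8") as f:
                for line in f:
                    stripped = line.strip()
                    if stripped:
                        merged.append(stripped)
    return merged

# File contents taken from the example above.
files = {
    ("subdir1", "file1.txt"): "Hello World\n\nThis is a test.\n  \n",
    ("subdir1", "file2.txt"): "Python is fun!\n      \n",
    ("subdir2", "file3.txt"): "The quick brown fox.\n",
}

with tempfile.TemporaryDirectory() as root:
    for (subdir, name), text in files.items():
        os.makedirs(os.path.join(root, subdir), exist_ok=True)
        with open(os.path.join(root, subdir, name), "w", encoding="utf-8") as f:
            f.write(text)
    result = merge_sorted(root)
    print("\n".join(result))
```

Without the two sorts, the same lines would still appear, but their order could differ between file systems.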

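Performance considerations

If there are very many files, or the files are very large, some optimization may be needed. For example, processing the files incrementally instead of holding the contents of every file in memory at once keeps memory usage low. Some possible directions: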

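Write the output incrementally: write each file's cleaned lines to the new file as soon as that file has been processed, instead of waiting until every file is done. This avoids keeping large amounts of data in memory.
Parallel processing: using threads or processes (for example the threading or multiprocessing modules) to handle several files at the same time may speed things up, although the gain depends on whether the work is limited by disk I/O or by CPU.
Generators: processing files through generators uses memory more efficiently, especially when the files are very large.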
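The first and third ideas can be combined. The sketch below is one possible way to do it, not a tuned implementation: the names iter_text_files, iter_clean_lines and stream_merge are hypothetical, and the returned line count is added only to make the result easy to check. Only the list of paths is held in memory; lines go to the output file as they are read.

```python
import os
import tempfile

def iter_text_files(root_dir):
    """Yield the path of every .txt file under root_dir."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(".txt"):
                yield os.path.join(dirpath, name)

def iter_clean_lines(file_path):
    """Yield the non-blank lines of a file with surrounding whitespace removed."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped:
                yield stripped

def stream_merge(root_dir, new_file_path):
    """Write cleaned lines to new_file_path one at a time; return the number written."""
    # Collect the paths before creating the output, and skip the output itself
    # in case an older copy of it already sits inside root_dir.
    file_paths = list(iter_text_files(root_dir))
    out_abs = os.path.abspath(new_file_path)
    count = 0
    with open(new_file_path, "w", encoding="utf-8") as out:
        for path in file_paths:
            if os.path.abspath(path) == out_abs:
                continue
            for line in iter_clean_lines(path):
                out.write(line + "\n")
                count += 1
    return count

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "a.txt"), "w", encoding="utf-8") as f:
        f.write("  first \n\n second\n")
    out_path = os.path.join(root, "merged.txt")
    written = stream_merge(root, out_path)
    with open(out_path, "r", encoding="utf-8") as f:
        merged_text = f.read()
    print(written, repr(merged_text))
```

Compared with the original script, peak memory no longer grows with the total size of the input, because at most one line is held at a time.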