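Detecting file encoding with chardet and reading a file line by line with a generator
The detect_encoding function uses chardet to detect a file's encoding, and process_large_file then opens the file with the detected encoding. Detection is heuristic, but this lets you handle files in different encodings more reliably than assuming a single fixed encoding.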
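import chardet

def detect_encoding(file_path, sample_size=1024 * 1024):
    # Detect from a leading sample (1 MiB by default) so that a large file
    # is never read into memory in full.
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(sample_size))
    # chardet reports None when it cannot decide; falling back to UTF-8 here
    # is an assumption, adjust it to your data.
    return result['encoding'] or 'utf-8'

def process_line_with_line_number(line, line_number):
    # Placeholder function: define your own logic here.
    # For example, print the line together with its line number.
    print(f"{line_number}: {line.strip()}")

def process_large_file(input_file_path, output_file_path):
    encoding = detect_encoding(input_file_path)
    print(f"Detected encoding: {encoding}")
    # Open the output in text mode with the same encoding, so that encoding
    # (and any BOM) is handled once by the file object rather than per line.
    with open(input_file_path, "r", encoding=encoding) as input_file, \
            open(output_file_path, "w", encoding=encoding) as output_file:
        for line_number, line in enumerate(input_file, start=1):
            # Process each line with the placeholder function.
            process_line_with_line_number(line, line_number)
            # Write the numbered line to the output file. `line` already ends
            # with a newline, so none is appended.
            output_file.write(f"{line_number}: {line}")

if __name__ == "__main__":
    input_file_path = "input_large_file.txt"
    output_file_path = "output_large_file.txt"
    process_large_file(input_file_path, output_file_path)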
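For very large inputs, the detection step itself can be made incremental. The sketch below is one possible approach, not part of the original code: it feeds the file to chardet's UniversalDetector in chunks and stops as soon as the detector reports that it is done. The function name detect_encoding_incrementally, the chunk_size default, and the optional detector parameter are assumptions made for illustration.

```python
def detect_encoding_incrementally(file_path, detector=None, chunk_size=64 * 1024):
    """Return the detected encoding name, or None if detection fails.

    `detector` is a hypothetical injection point: any object exposing feed(),
    done, close() and result, as chardet's UniversalDetector does.
    """
    if detector is None:
        # Imported lazily so the function can also be exercised with a
        # stand-in detector when chardet is not installed.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    with open(file_path, "rb") as f:
        # Read fixed-size chunks until EOF (f.read returns b"" at the end).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # the detector is confident; stop reading early
                break
    detector.close()  # finalizes detector.result
    return detector.result["encoding"]
```

Accepting the detector as a parameter is a design choice for testability. In normal use you would call detect_encoding_incrementally(path) and let it create a UniversalDetector itself.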
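When processing large text files, you can reduce memory usage by reading the file line by line with a generator. A generator yields the file one line at a time instead of loading the whole file into memory at once. The following example reads a large text file line by line with a generator: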
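import chardet

def detect_encoding(file_path, sample_size=1024 * 1024):
    # Detect from a leading sample (1 MiB by default); reading the whole file
    # here would defeat the purpose of streaming it afterwards.
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(sample_size))
    # chardet reports None when it cannot decide; falling back to UTF-8 here
    # is an assumption, adjust it to your data.
    return result['encoding'] or 'utf-8'

def read_large_text_file(file_path):
    encoding = detect_encoding(file_path)
    print(f"Detected encoding: {encoding}")
    with open(file_path, 'r', encoding=encoding) as file:
        # Yield one (line_number, line) pair at a time; the file stays open
        # until the generator is exhausted or closed.
        for line_number, line in enumerate(file, start=1):
            yield line_number, line

if __name__ == "__main__":
    input_file_path = "large_text_file.txt"
    # Read the large text file line by line through the generator.
    line_generator = read_large_text_file(input_file_path)
    # Process each line, for example print its number and content.
    for line_number, line in line_generator:
        print(f"Line {line_number}: {line.strip()}")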
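Because the reader is a generator, further processing steps can be written as generators too and chained together, so that only one line is held in memory at a time. The following is a minimal sketch of that idea using only the standard library. The helper names numbered_lines and skip_blank and the temporary sample file are hypothetical, and the encoding is passed in explicitly here rather than detected.

```python
import os
import tempfile

def numbered_lines(file_path, encoding="utf-8"):
    """Yield (line_number, line) pairs lazily, one line at a time."""
    # errors="replace" is an assumed policy: undecodable bytes become U+FFFD
    # instead of raising UnicodeDecodeError partway through the file.
    with open(file_path, "r", encoding=encoding, errors="replace") as f:
        for line_number, line in enumerate(f, start=1):
            yield line_number, line.rstrip("\n")

def skip_blank(pairs):
    """Pass through only the pairs whose line contains non-whitespace text."""
    for line_number, line in pairs:
        if line.strip():
            yield line_number, line

if __name__ == "__main__":
    # Build a small sample file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                     delete=False) as tmp:
        tmp.write("first\n\nthird\n")
    try:
        # Nothing is read until the for loop starts pulling from the chain.
        for line_number, line in skip_blank(numbered_lines(tmp.name)):
            print(f"Line {line_number}: {line}")
    finally:
        os.remove(tmp.name)
```

Line numbers refer to positions in the original file, so the blank second line is skipped while the third line keeps the number 3.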