
File IO: Notes on Reading Files Efficiently

article 2024/11/25 14:33:47

Keep character encodings consistent

If you ignore encoding when reading a file, you will run into baffling garbled text.

For example, write "你好hello" to a file named hello.txt using GBK encoding, then read the file back as a byte array and log it as a hexadecimal string.

static void init() throws Exception{
    Files.deleteIfExists(Paths.get("hello.txt"));
    Files.write(Paths.get("hello.txt"),"你好hello".getBytes(Charset.forName("GBK")));
    log.info("bytes:{}", Hex.encodeHexString(Files.readAllBytes(Paths.get("hello.txt"))).toUpperCase());
 }

The output is:

16:07:25.253 [main] INFO com.demo.jvm.io.IOTest - bytes:C4E3BAC368656C6C6F

When we open this file, we are even warned that it is being loaded with the wrong character encoding, UTF-8.

Following the suggestion and switching to GBK, the text is no longer garbled.


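Internally, a computer stores a text file in binary form: each character is converted to a specific sequence of bits according to some rule. That rule is the character set (strictly speaking, the character encoding), which defines the mapping from every supported character to its binary representation.

When we process a text file at the byte level, character encoding does not come into play.

When we need to read or write at the character level, however, the encoding (character set) has to be stated explicitly.

For example, consider the following code: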

static void wrong() throws Exception{
    char[] bytes = new char[10];

    String content = "";

    try(FileReader fileReader = new FileReader("hello.txt")) {

        int len;

        while ((len = fileReader.read(bytes)) != -1){

            content += new String(bytes,0,len);
        }

    }

    log.info("content:{}",content);
}

Output:

16:22:12.557 [main] INFO com.demo.jvm.io.IOTest - content:���hello

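As you can see, reading the file as characters with FileReader produces garbled output.

No charset is specified for decoding the file here, so which one is used? According to the official documentation, FileReader reads the file with the machine's default charset; to specify a charset yourself, use an InputStreamReader on top of a FileInputStream.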

Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
FileReader is meant for reading streams of characters. For reading streams of raw bytes, consider using a FileInputStream.

So let's first check the default charset of the current machine.

static void defaultCharset() throws Exception{

    log.info("default charset:{}", Charset.defaultCharset());

    Files.write(Paths.get("hello2.txt"),"你好hello".getBytes(Charsets.UTF_8));

    log.info("bytes:{}", Hex.encodeHexString(Files.readAllBytes(Paths.get("hello2.txt"))).toUpperCase());

 }

Output:

16:29:13.037 [main] INFO com.demo.jvm.io.IOTest - default charset:UTF-8
16:29:13.099 [main] INFO com.demo.jvm.io.IOTest - bytes:E4BDA0E5A5BD68656C6C6F

As you can see, the machine's default charset is UTF-8, so the GBK-encoded Chinese characters cannot be read correctly.

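UTF-8 uses 3 bytes for each of these Chinese characters, whereas GBK uses only 2. Because the two encodings map the same character to different byte sequences, text saved as GBK and then decoded as UTF-8 cannot display the Chinese characters correctly.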
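To make the byte-length difference concrete, here is a minimal sketch that encodes the same string with both charsets and then decodes the GBK bytes as UTF-8. The class name EncodingMismatchDemo and its helper methods are illustrative and not part of the original code; the Chinese characters are written as Unicode escapes so that the source compiles regardless of the file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // "你好hello", written with Unicode escapes to be independent of the source file encoding
    static final String TEXT = "\u4F60\u597Dhello";

    // Encodes TEXT with the given charset and returns the number of bytes produced
    static int encodedLength(Charset charset) {
        return TEXT.getBytes(charset).length;
    }

    // Encodes TEXT as GBK, then (wrongly) decodes those bytes as UTF-8
    static String decodeGbkBytesAsUtf8() {
        byte[] gbkBytes = TEXT.getBytes(Charset.forName("GBK"));
        return new String(gbkBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("GBK bytes:   " + encodedLength(Charset.forName("GBK")));
        System.out.println("UTF-8 bytes: " + encodedLength(StandardCharsets.UTF_8));
        String garbled = decodeGbkBytesAsUtf8();
        // Malformed byte sequences are replaced with U+FFFD, the replacement character
        System.out.println("contains U+FFFD: " + (garbled.indexOf('\uFFFD') >= 0));
    }
}
```

The ASCII part of the string survives the mismatch because GBK and UTF-8 encode ASCII characters identically; only the multi-byte characters are lost.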

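So, as the official documentation suggests, we now obtain the byte stream with FileInputStream and read it as characters through an InputStreamReader, specifying GBK as the charset.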

static void right1() throws Exception{

    char[] bytes = new char[10];

     String content = "";

     try(FileInputStream fileInputStream = new FileInputStream("hello.txt")) {

         InputStreamReader reader = new InputStreamReader(fileInputStream, Charset.forName("GBK"));

         int len;

         while ((len = reader.read(bytes)) != -1) {
             content += new String(bytes,0,len);
         }
     }

     log.info("content:{}",content);
 }

As you can see, the text is now printed correctly:

16:35:28.623 [main] INFO com.demo.jvm.io.IOTest - content:你好hello
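As a variation on the same idea, the sketch below uses Files.newBufferedReader, which also takes an explicit charset and handles opening the file. The class name ExplicitCharsetReader and the readAll helper are hypothetical names chosen for illustration; the demo writes its own temporary file instead of relying on hello.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetReader {

    // Reads the whole file as text, decoding it with the charset the caller names
    static String readAll(Path path, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[1024];
        // Files.newBufferedReader opens the file and decodes its bytes with the given charset
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            int len;
            while ((len = reader.read(buffer)) != -1) {
                content.append(buffer, 0, len);
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK");
        Path file = Files.createTempFile("hello", ".txt");
        try {
            // "你好hello" encoded as GBK (Unicode escapes keep the source encoding-neutral)
            Files.write(file, "\u4F60\u597Dhello".getBytes(gbk));
            System.out.println(readAll(file, gbk));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Using a StringBuilder instead of repeated string concatenation avoids creating a new String on every loop iteration, which matters once files get larger than this toy example.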

If the approach above feels too complicated, you can also use Files.readAllLines, which does the job in one line of code.

static void right2() throws Exception{
    List<String> allLines = Files.readAllLines(Paths.get("hello.txt"), Charset.forName("GBK"));
    allLines.stream().forEach(line -> System.out.println(line));
}

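As its source shows, Files.readAllLines reads the entire content of the file into a List and returns it; if the List does not fit in memory, the result is an OOM (OutOfMemoryError).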

 public static List<String> readAllLines(Path path, Charset cs) throws IOException {
    try (BufferedReader reader = newBufferedReader(path, cs)) {
        List<String> result = new ArrayList<>();
        for (;;) {
            String line = reader.readLine();
            if (line == null)
                break;
            result.add(line);
        }
        return result;
    }
}

For example, let's build a large file with 3 million lines, a little over 3 GB in size.

 /**
  * Initialize the large file
  * @throws Exception
  */
 private static void initLarge() throws Exception{
     String payload = IntStream.rangeClosed(1, 1000)
             .mapToObj(__ -> "a")
             .collect(Collectors.joining(""));


     Files.deleteIfExists(Paths.get("large.txt"));
     IntStream.rangeClosed(1, 10).forEach(__ -> {
         try {
             Files.write(Paths.get("large.txt"),
                     IntStream.rangeClosed(1, 300000).mapToObj(i -> payload).collect(Collectors.toList())
                     , UTF_8, CREATE, APPEND);
         } catch (IOException e) {
             e.printStackTrace();
         }
     });
 }

Set the VM options -Xmx512m -Xms512m and read the large file directly with Files.readAllLines:

Files.readAllLines(Paths.get("large.txt"), Charsets.UTF_8).stream().findFirst().orElse("000");

It fails immediately with an OOM (OutOfMemoryError).


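As this result shows, reading the entire content of a file into memory can easily trigger an OOM. Is there a way to read on demand, in a streaming fashion?

For example, read a line only when it needs to be consumed, instead of loading the whole file into memory at once.

That is what the Files.lines method does.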

Remember to release file handles

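Unlike Files.readAllLines, which returns a List, Files.lines returns a Stream.

This lets us keep reading and consuming the file's content as needed instead of loading everything into memory at once, which avoids the OOM.

Using the same large file, we use Files.lines to compare the time taken to read 100,000 lines and 200,000 lines (the second task is labeled "read 100w lines" in the code, but its limit is 200,000), and finally read the file line by line to count its total number of lines.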

private static void readLarge() throws Exception{
    log.info("large file size:{}", Files.size(Paths.get("large.txt")));

    StopWatch stopWatch = new StopWatch();

    stopWatch.start("read 10w lines");

    log.info("lines:{}",Files.lines(Paths.get("large.txt")).limit(100000).collect(Collectors.toList()).size());

    stopWatch.stop();


    stopWatch.start("read 100w lines");

    log.info("lines:{}",Files.lines(Paths.get("large.txt")).limit(200000).collect(Collectors.toList()).size());

    stopWatch.stop();

    log.info(stopWatch.prettyPrint());

    AtomicLong atomicLong = new AtomicLong();

    Files.lines(Paths.get("large.txt")).forEach(line -> atomicLong.incrementAndGet());

    log.info("total lines:{}", atomicLong.get());
}

The output is:

17:08:25.110 [main] INFO com.demo.jvm.io.IOTest - large file size:3003000000
17:08:25.695 [main] INFO com.demo.jvm.io.IOTest - lines:100000
17:08:26.449 [main] INFO com.demo.jvm.io.IOTest - lines:200000
17:08:26.452 [main] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 1333180616 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
579611462  043%  read 10w lines
753569154  057%  read 100w lines

17:08:33.106 [main] INFO com.demo.jvm.io.IOTest - total lines:3000000

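As you can see, the whole file was read and its lines were counted without any OOM.

Reading 200,000 lines (the task labeled "read 100w lines") took about 754 ms, and reading 100,000 lines took about 580 ms.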

All of this indicates that Files.lines does not read the whole file at once but reads it on demand.
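The on-demand behavior can also be observed directly, without timing anything. The following minimal sketch counts how many lines actually flow through the stream when only the first few are requested. LazyLinesDemo, firstLines, and the peek-based counter are illustrative names and devices, not part of the original code, and the demo uses a small temporary file rather than large.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class LazyLinesDemo {

    // Returns the first `limit` lines; `seen` counts how many lines passed through the stream
    static List<String> firstLines(Path path, int limit, AtomicInteger seen) throws IOException {
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.peek(line -> seen.incrementAndGet())
                        .limit(limit)
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lazy", ".txt");
        try {
            // Build a 10,000-line sample file
            List<String> content = IntStream.rangeClosed(1, 10_000)
                    .mapToObj(i -> "line-" + i)
                    .collect(Collectors.toList());
            Files.write(file, content, StandardCharsets.UTF_8);

            AtomicInteger seen = new AtomicInteger();
            List<String> head = firstLines(file, 3, seen);
            System.out.println("returned " + head.size() + " lines; stream produced "
                    + seen.get() + " of 10000");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because limit short-circuits the pipeline, the counter should stay far below the total number of lines in the file, which is consistent with the timings above.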

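We might normally assume that a static method call needs no manual resource management, because the API should release its resources automatically once the method returns.

However, this assumption does not hold for the methods of the Files class that return a Stream, and it is an easy problem to overlook.

Take the following code, which reads the file demo.txt 1 million times with Files.lines and increments a counter for every line read.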

static void handle() throws Exception{

     LongAdder longAdder = new LongAdder();
     IntStream.rangeClosed(1, 1000000).forEach(i -> {
         try {
             Files.lines(Paths.get("demo.txt")).forEach(line -> longAdder.increment());
         } catch (IOException e) {
             e.printStackTrace();
         }
     });
     log.info("total : {}", longAdder.longValue());
}

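Running this code fails with an error. Using the lsof command to list the files opened by the process shows that demo.txt has been opened more than 10,000 times.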

zhangwenwen@zhangwenwendeMacBook-Pro jvm % lsof -p 63622  | grep demo.txt | wc -l
   10173

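The official documentation states that the try-with-resources construct should be used to ensure that the stream's close method is invoked and its resources are released.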

Following that advice, we modify the code:

static void handle() throws Exception{

    LongAdder longAdder = new LongAdder();
    IntStream.rangeClosed(1, 1000000).forEach(i -> {

        try(Stream<String> lines = Files.lines(Paths.get("demo.txt"))) {

            lines.forEach(line -> longAdder.increment());

        } catch (IOException e) {
            e.printStackTrace();
        }
        
    });
    log.info("total : {}", longAdder.longValue());

}
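To see why the try-with-resources form matters, the sketch below registers a close handler on the stream and checks when it runs. StreamCloseDemo and its two methods are hypothetical names used only for this illustration; the point is that a terminal operation such as forEach consumes the stream but does not close it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

final class StreamCloseDemo {

    // Consumes the stream inside try-with-resources and reports whether its close handler ran
    static boolean closedByTryWithResources(Path path) throws IOException {
        AtomicBoolean closed = new AtomicBoolean(false);
        try (Stream<String> lines = Files.lines(path).onClose(() -> closed.set(true))) {
            lines.forEach(line -> { });
        }
        return closed.get();
    }

    // Consumes the stream with a terminal operation only and reports whether that alone closed it
    static boolean closedByTerminalOperation(Path path) throws IOException {
        AtomicBoolean closed = new AtomicBoolean(false);
        Stream<String> lines = Files.lines(path).onClose(() -> closed.set(true));
        try {
            lines.forEach(line -> { });
            return closed.get(); // evaluated before the finally block runs
        } finally {
            lines.close(); // clean up so the demo itself does not leak a file handle
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        try {
            Files.write(file, Arrays.asList("a", "b", "c"), StandardCharsets.UTF_8);
            System.out.println("closed by try-with-resources: " + closedByTryWithResources(file));
            System.out.println("closed by terminal operation: " + closedByTerminalOperation(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```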

Use a buffer

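When BufferedReader is used to read a character stream, buffering comes into play: a region of memory serves as an intermediary for the actual IO operations.

First, initialize a file.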

/**
  * Initialize a 35 MB file
  * @throws Exception
  */
 static void initBufferFile() throws Exception{
     Files.deleteIfExists(Paths.get("src.txt"));

     Files.write(Paths.get("src.txt"),
             IntStream.rangeClosed(1, 1000000).mapToObj(i -> UUID.randomUUID().toString()).collect(Collectors.toList())
             , UTF_8, CREATE, TRUNCATE_EXISTING);

     log.info("src file size:{}", Files.size(Paths.get("src.txt")));

 }

Next, obtain a file input stream with FileInputStream and call its read method to read one byte at a time, writing each byte to another file through a FileOutputStream.

private static void fileCopyNoBuffer() throws IOException {

    StopWatch stopWatch = new StopWatch();

    stopWatch.start("read 35MB file");


    try (FileInputStream fileInputStream = new FileInputStream("src.txt");
         FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {
        int i;
        while ((i = fileInputStream.read()) != -1) {
            fileOutputStream.write(i);
        }
    }

    stopWatch.stop();
    log.info(stopWatch.prettyPrint());

}

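Then use a 100-byte buffer as an intermediary: each read fills the buffer with a chunk of data from the source file, and each write sends that chunk from the buffer to the target file.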

private static void fileCopWith100Buffer() throws Exception{
     StopWatch stopWatch = new StopWatch();

     stopWatch.start("read 35MB file by 100 Buffer");




     try (FileInputStream fileInputStream = new FileInputStream("src.txt");
          FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")) {

         byte[] buffer = new byte[100];

         int len;
         while ((len = fileInputStream.read(buffer)) != -1) {
             fileOutputStream.write(buffer,0,len);
         }
     }

     stopWatch.stop();
     log.info(stopWatch.prettyPrint());
 }

Output:

17:44:45.726 [Thread-1] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 3193869135 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
3193869135  100%  read 35MB file by 100 Buffer

17:47:59.001 [Thread-0] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 196472736818 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
196472736818  100%  read 35MB file

Comparing the results:

Approach             Execution time (s)
No buffer            196
100-byte buffer      3.2

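Clearly, performing one IO operation for every byte read and every byte written is far too expensive.

With nothing more than a 100-byte buffer as an intermediary, the copy becomes roughly 60 times faster.

What about a 1000-byte buffer? The results are as follows: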

18:40:57.798 [Thread-0] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 492754355 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
492754355  100%  read 35MB file by 1000 Buffer

18:40:58.970 [Thread-1] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 1666319049 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
1666319049  100%  read 35MB file by 100 Buffer

As you can see, the copy with the 1000-byte buffer finished in roughly 490 ms.

From this we can conclude that using an appropriately sized buffer for file IO can significantly improve performance.
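The measurements above differ only in the buffer size, so the copy can be written once with the size as a parameter. The following is a minimal sketch under that assumption: BufferedCopyDemo and its copy method are illustrative names, System.nanoTime stands in for the StopWatch used in the article, and the demo copies a 1 MB temporary file rather than the 35 MB src.txt.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class BufferedCopyDemo {

    // Copies src to dest through a byte[] of the given size; returns the elapsed time in nanoseconds
    static long copy(Path src, Path dest, int bufferSize) throws IOException {
        long start = System.nanoTime();
        try (FileInputStream in = new FileInputStream(src.toFile());
             FileOutputStream out = new FileOutputStream(dest.toFile())) {
            byte[] buffer = new byte[bufferSize];
            int len;
            // read() returns how many bytes were placed in the buffer, or -1 at end of file
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".txt");
        Path dest = Files.createTempFile("dest", ".txt");
        try {
            byte[] data = new byte[1_000_000]; // 1 MB of sample data
            Arrays.fill(data, (byte) 'a');
            Files.write(src, data);
            for (int size : new int[] {100, 1000, 8192}) {
                long nanos = copy(src, dest, size);
                System.out.printf("buffer %5d B: %d ms%n", size, nanos / 1_000_000);
            }
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dest);
        }
    }
}
```

Absolute timings depend on the machine, the disk, and the operating system's file cache, so the numbers printed by this sketch should be read as relative comparisons only.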

What if we use BufferedInputStream and BufferedOutputStream directly? After all, they maintain an internal buffer with a default size of 8 KB.

Let's test that and see how they actually perform.

 /**
  * Approach 1: use BufferedInputStream and BufferedOutputStream directly.
  * @throws Exception
  */
 private static void bufferJust() throws Exception{
     StopWatch stopWatch = new StopWatch();

     stopWatch.start("read 35MB file by BufferedStream");


     try (BufferedInputStream bufferInputStream = new BufferedInputStream(new FileInputStream("src.txt"));
          BufferedOutputStream bufferOutputStream = new BufferedOutputStream(new FileOutputStream("dest.txt"))) {

         int len;
         while ((len = bufferInputStream.read()) != -1) {
             bufferOutputStream.write(len);
         }
     }

     stopWatch.stop();
     log.info(stopWatch.prettyPrint());
 }

 /**
  * Approach 2: use BufferedInputStream and BufferedOutputStream, plus an additional 8 KB buffer.
  * @throws Exception
  */
 private static void bufferWith8KB() throws Exception{
     StopWatch stopWatch = new StopWatch();

     stopWatch.start("read 35MB file by BufferedStream and 8KB");


     try (BufferedInputStream bufferInputStream = new BufferedInputStream(new FileInputStream("src.txt"));
          BufferedOutputStream bufferOutputStream = new BufferedOutputStream(new FileOutputStream("dest.txt"))) {


         byte[] buffer = new byte[8192];

         int len;
         while ((len = bufferInputStream.read(buffer)) != -1) {
             bufferOutputStream.write(buffer,0,len);
         }
     }

     stopWatch.stop();
     log.info(stopWatch.prettyPrint());
 }

 /**
  * Approach 3: use FileInputStream and FileOutputStream directly, with an 8 KB buffer.
  * @throws Exception
  */
 private static void fileStreamWith8KB() throws Exception{
     StopWatch stopWatch = new StopWatch();

     stopWatch.start("read 35MB file by FileStream and 8KB");

     try (FileInputStream fileInputStream = new FileInputStream("src.txt");
          FileOutputStream fileOutputStream = new FileOutputStream("dest.txt")){

         byte[] buffer = new byte[8192];

         int len;
         while ((len = fileInputStream.read(buffer)) != -1) {
             fileOutputStream.write(buffer,0,len);
         }
     }

     stopWatch.stop();
     log.info(stopWatch.prettyPrint());
 }

The results are as follows:

18:52:40.416 [Thread-1] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 311713168 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
311713168  100%  read 35MB file by BufferedStream and 8KB

18:52:40.416 [Thread-2] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 311210649 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
311210649  100%  read 35MB file by FileStream and 8KB

18:52:41.638 [Thread-0] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 1539085701 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
1539085701  100%  read 35MB file by BufferedStream

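As you can see, approach 1 takes the longest (about 1.5 seconds), while the other two both finish in about 311 ms. A likely reason is that approach 1 still makes one read call and one write call per byte, even though those calls are served from the internal buffer.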

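If you want even higher performance, you can also copy the data with the FileChannel.transferTo method.

On some operating systems (for example, recent versions of Linux and UNIX), transferTo can be backed by zero-copy techniques: the kernel moves the data to the target, typically with the help of DMA (direct memory access), without first copying it into the application's own buffers, so the CPU does not have to shuttle every byte through user space.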

private static void fileChannelOperation() throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start("read 35MB file by FileChannel");
    
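    // try-with-resources closes both channels, releasing their file handles
    try (FileChannel in = FileChannel.open(Paths.get("src.txt"), StandardOpenOption.READ);
         FileChannel out = FileChannel.open(Paths.get("dest.txt"), CREATE, WRITE, TRUNCATE_EXISTING)) {
        // transferTo may move fewer bytes than requested, so loop until everything is copied
        long position = 0;
        long size = in.size();
        while (position < size) {
            position += in.transferTo(position, size - position, out);
        }
    }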
    
    
    stopWatch.stop();
    log.info(stopWatch.prettyPrint());
}

Output:

19:00:26.467 [Thread-3] INFO com.demo.jvm.io.IOTest - StopWatch '': running time = 175960468 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
175960468  100%  read 35MB file by FileChannel