Mac Java 使用 tesseract 进行 ORC 识别
在 Java 开发中使用图片转文字时,难免会遇到问题,比如我使用 Mac (M1 芯片) 系统进行开发,就出现报错。
博主博客
- https://blog.uso6.com
- https://blog.csdn.net/dxk539687357
一、直接使用
1. 使用 brew 进行安装
brew install tesseract
如果是其他系统的, 建议看 官方文档 进行安装。
2. 查看版本
nukix@nukixPC ~ % tesseract --version
tesseract 5.5.0
leptonica-1.85.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.2.12 : libwebp 1.4.0 : libopenjp2 2.5.2
Found NEON
Found libarchive 3.7.7 zlib/1.2.12 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6
Found libcurl/8.7.1 SecureTransport (LibreSSL/3.3.6) zlib/1.2.12 nghttp2/1.62.0
3. 查看安装路径
brew list tesseract
比如我安装路径是 /opt/homebrew/Cellar/tesseract/5.5.0
, 下面会用到。
4. 在图片中识别文字(英文)
tesseract abc.jpg out.txt -l eng
命令 tesseract 图片路径 输出文件 -l 语言码
, 我上面的命令识别出来的结果会在 out.txt
文件中。
语言码可以在 这里 找到。
5. 语言包下载
语言包传送门: tessdata
根据自己需要下载对应的语言包, 比如中文的语言包是 chi_tra.traineddata。
下载后放到 tessdata
目录下面, 比如我的目录在 /opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata
。
二、Java API 调用
2.1 导入 Maven 库
因为 tesseract
是一个 C
语言库, 所以不能直接使用,官方推荐了其他语言一些第三方封装库可以在 这里 查看, 而 Java
是 tess4j。
Gradle(short)
implementation 'net.sourceforge.tess4j:tess4j:5.13.0'
Maven
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>5.13.0</version>
</dependency>
2.2 Java 调用
根据 文档 的案例,可以这样调用。
package tess4j.example;
import java.io.File;
import net.sourceforge.tess4j.*;
public class TesseractExample {
public static void main(String[] args) {
// System.setProperty("jna.library.path", "32".equals(System.getProperty("sun.arch.data.model")) ? "lib/win32-x86" : "lib/win32-x86-64");
File imageFile = new File("eurotext.tif");
ITesseract instance = new Tesseract(); // JNA Interface Mapping
// ITesseract instance = new Tesseract1(); // JNA Direct Mapping
// File tessDataFolder = LoadLibs.extractTessResources("tessdata"); // Maven build bundles English data
// instance.setDatapath(tessDataFolder.getPath());
try {
String result = instance.doOCR(imageFile);
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
}
}
2.3 java.lang.UnsatisfiedLinkError: Unable to load library ‘tesseract’
问题
不出意外的话会有这一个异常,运行库中没有 tesseract
库。
java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(libtesseract.dylib, 0x0009): tried: 'libtesseract.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OSlibtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/./libtesseract.dylib' (no such file), '/Users/nukix/Library/Java/JavaVirtualMachines/openjdk-18.0.2.1/Contents/Home/bin/../lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache), 'libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file, not in dyld cache)
dlopen(/Users/nukix/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Users/nukix/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
dlopen(/System/Library/Frameworks/tesseract.framework/tesseract, 0x0009): tried: '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/System/Library/Frameworks/tesseract.framework/tesseract' (no such file), '/System/Library/Frameworks/tesseract.framework/tesseract' (no such file, not in dyld cache)
Native library (darwin-aarch64/libtesseract.dylib) not found in resource path (/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/classes/java/main:/Users/nukix/AndroidStudioProjects/nukix/TestWeb/build/resources/main)
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:325) ~[jna-5.14.0.jar:5.14.0 (b0)]
at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:481) ~[jna-5.14.0.jar:5.14.0 (b0)]
at com.sun.jna.Library$Handler.<init>(Library.java:197) ~[jna-5.14.0.jar:5.14.0 (b0)]
at com.sun.jna.Native.load(Native.java:618) ~[jna-5.14.0.jar:5.14.0 (b0)]
at com.sun.jna.Native.load(Native.java:592) ~[jna-5.14.0.jar:5.14.0 (b0)]
at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83) ~[tess4j-5.13.0.jar:5.13.0]
at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42) ~[tess4j-5.13.0.jar:5.13.0]
at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:353) ~[tess4j-5.13.0.jar:5.13.0]
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:202) ~[tess4j-5.13.0.jar:5.13.0]
at net.sourceforge.tess4j.ITesseract.doOCR(ITesseract.java:59) ~[tess4j-5.13.0.jar:5.13.0]
解决
根据报错的路径, 把文件复制过去, 比如我报错的路径是 /Users/nukix/Library/Frameworks/tesseract.framework/tesseract
。
先创建文件夹
mkdir -p /Users/nukix/Library/Frameworks/tesseract.framework
复制文件
cp /opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata /Users/nukix/Library/Frameworks/tesseract.framework
2.4 Error opening data file ./eng.traineddata
问题
不出意外的话,会提示找不到语言包。
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
解决
这时候就需要配置环境变量,如果是使用 IDEA 进行开发,可以在菜单栏找到 Run->Edit Configurations...->Environment variables
进行设置。
比如我需要设置 TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract/5.5.0/share/tessdata
,路径上面有提,这里就不在赘述。
再次运行正常情况就能打印出识别的结果。
参考文献
- Tesseract documentation#Introduction
- Tesseract documentation#tessdoc
- Development with Tess4J
- mac os 使用tesseract 进行ORC识别