Integrating Tess4J with Spring Boot for OCR Text Recognition (Images + PDF)
1. Prerequisites
- JDK 8 or later
- Maven 3.6 or later
- Spring Boot 2.4 or later
- Tesseract OCR engine
- Tess4J library
2. Install the Tesseract OCR engine
Download: Home · UB-Mannheim/tesseract Wiki · GitHub
On Linux, install it directly: sudo apt-get install tesseract-ocr
3. Add the dependency to pom.xml
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.4</version>
</dependency>
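4. Implement the OcrService class; it currently supports png, jpeg, jpg, and pdf
Note: two settings need attention here. First, set the Tesseract data path so that it points to the tessdata directory. This example has Tesseract installed on drive D; on a Linux server, use the corresponding path to tessdata instead. Second, set the recognition language. This example recognizes Chinese (chi_sim); for any other language, download its language file into the tessdata folder under the installation directory. Language packs can be downloaded from: GitHub - tesseract-ocr/tessdata: Trained models with fast variant of the "best" LSTM models + legacy models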
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.springframework.stereotype.Service;
import java.awt.image.BufferedImage;
import java.io.File;
import org.apache.commons.io.FilenameUtils;
import javax.imageio.ImageIO;
import java.io.IOException;
@Service
public class OcrService {

    public String recognizeText(File imageFile) {
        ITesseract instance = new Tesseract();
        instance.setDatapath("D:\\anzhuang\\ocr\\tessdata"); // path to the Tesseract tessdata directory
        instance.setLanguage("chi_sim"); // recognition language

        // Get the file extension
        String extension = FilenameUtils.getExtension(imageFile.getName());

        // PDFs are rendered page by page and then recognized
        if ("pdf".equalsIgnoreCase(extension)) {
            try {
                return processPDF(imageFile, instance);
            } catch (Exception e) {
                e.printStackTrace();
                return "Error: " + e.getMessage();
            }
        }

        // Anything other than PNG or JPEG is rejected
        boolean isImage = "png".equalsIgnoreCase(extension)
                || "jpg".equalsIgnoreCase(extension)
                || "jpeg".equalsIgnoreCase(extension);
        if (!isImage) {
            return "Unsupported file format: " + extension;
        }

        try {
            // PNG and JPEG files are passed to doOCR as-is
            return instance.doOCR(imageFile);
        } catch (TesseractException e) {
            e.printStackTrace();
            return "Error: " + e.getMessage();
        }
    }
    private String processPDF(File pdfFile, ITesseract instance) throws IOException, TesseractException {
        StringBuilder result = new StringBuilder();
        // try-with-resources closes the document even if rendering or OCR fails
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // Render the page at 800 DPI; a lower value such as 300 renders faster
                BufferedImage image = renderer.renderImageWithDPI(i, 800);
                File tempImageFile = File.createTempFile("page" + i, ".png");
                try {
                    ImageIO.write(image, "png", tempImageFile);
                    result.append(instance.doOCR(tempImageFile));
                } finally {
                    tempImageFile.delete(); // delete the temp file even if OCR fails
                }
            }
        }
        return result.toString();
    }
}
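The note on the data path and language files can be turned into a quick check: before OCR runs, confirm that the tessdata directory holds a `.traineddata` file for each requested language. The following is only a sketch under stated assumptions. `TessdataCheck` and its methods are hypothetical names, not part of Tess4J, and the `+` separator reflects Tesseract's convention for combining languages such as `chi_sim+eng`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4J): checks that language files exist in tessdata
final class TessdataCheck {
    private TessdataCheck() {}

    /** Expected location of the trained-data file for one language code, e.g. "chi_sim". */
    static Path languageFile(Path tessdataDir, String language) {
        return tessdataDir.resolve(language + ".traineddata");
    }

    /** Returns the language codes from a string like "chi_sim+eng" whose files are missing. */
    static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String language : languages.split("\\+")) {
            if (!Files.isRegularFile(languageFile(tessdataDir, language))) {
                missing.add(language);
            }
        }
        return missing;
    }
}
```

Calling `TessdataCheck.missingLanguages(Paths.get("D:\\anzhuang\\ocr\\tessdata"), "chi_sim")` before `setDatapath` and reporting a non-empty result would surface a missing language pack as a readable message rather than as an OCR failure.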
5. Example OcrController implementation:
package com.fan.ocr.controller;

import com.fan.ocr.serivce.OcrService;
import org.apache.commons.io.FilenameUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.File;
import java.io.IOException;

@RestController
@RequestMapping("/ocr")
public class OcrController {

    @Autowired
    private OcrService ocrService;

    @PostMapping("/upload")
    public ResponseEntity<String> uploadImage(@RequestParam("file") MultipartFile file) {
        if (file.isEmpty()) {
            return new ResponseEntity<>("File is empty", HttpStatus.BAD_REQUEST);
        }
        File convFile = null;
        try {
            // Save the upload to a uniquely named temp file, keeping only the extension
            // of the client-supplied name so that the name cannot influence the target path
            String extension = FilenameUtils.getExtension(file.getOriginalFilename());
            convFile = File.createTempFile("ocr-upload-", "." + extension);
            file.transferTo(convFile);
            // Call the OCR service to recognize the text
            String result = ocrService.recognizeText(convFile);
            return new ResponseEntity<>(result, HttpStatus.OK);
        } catch (IOException e) {
            return new ResponseEntity<>("File upload error: " + e.getMessage(), HttpStatus.INTERNAL_SERVER_ERROR);
        } finally {
            if (convFile != null) {
                convFile.delete(); // remove the temp copy once OCR has finished
            }
        }
    }
}
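The controller saves every non-empty upload and leaves it to the service to reject unsupported extensions. A check before saving could look like the sketch below. `UploadTypes` is a hypothetical helper, and its allow-list simply mirrors the four formats the service handles (png, jpg, jpeg, pdf).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: validates a client-supplied file name before the upload is saved
final class UploadTypes {
    // Mirrors the formats handled by OcrService
    private static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("png", "jpg", "jpeg", "pdf"));

    private UploadTypes() {}

    /** Lower-cased extension of the name, ignoring any directory part; "" if there is none. */
    static String extensionOf(String originalFilename) {
        if (originalFilename == null) {
            return "";
        }
        int slash = Math.max(originalFilename.lastIndexOf('/'), originalFilename.lastIndexOf('\\'));
        String name = originalFilename.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True when the extension is one the OCR service can process. */
    static boolean isSupported(String originalFilename) {
        return SUPPORTED.contains(extensionOf(originalFilename));
    }
}
```

In the controller, a guard such as `if (!UploadTypes.isSupported(file.getOriginalFilename()))` could return HttpStatus.BAD_REQUEST before anything is written to disk.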
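Note: in initial testing, recognizing a single PDF page took 30-40 s. It is best not to submit PDFs longer than one page, since longer documents may leave the request unresponsive or waiting too long.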
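One way to keep a slow recognition from blocking a request indefinitely is to run it on a worker thread and stop waiting after a fixed time. The sketch below uses only java.util.concurrent. `TimedTask` and `runWithTimeout` are hypothetical names, and the fallback string is an assumption. Cancelling only requests interruption, so a native OCR call that is already running may still finish in the background.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a slow task
final class TimedTask {
    private TimedTask() {}

    /**
     * Runs the task on a worker thread and returns its result, or the fallback
     * if the task fails or does not finish within the timeout.
     */
    static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption; the task may not stop immediately
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as `TimedTask.runWithTimeout(() -> ocrService.recognizeText(convFile), 60, TimeUnit.SECONDS, "OCR timed out")` would cap the wait at one minute. A production service would more likely share one bounded thread pool than create an executor per request.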