langchain4j + PDFBox: A First Try
Preface
This article looks at how to parse PDF files with langchain4j and Apache PDFBox.
Steps
pom.xml
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-document-parser-apache-pdfbox</artifactId>
    <version>1.0.0-beta1</version>
</dependency>
example
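import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import lombok.extern.slf4j.Slf4j;

@Slf4j // assumption: Lombok supplies the log field, which the original snippet left undeclared
public class PDFBoxTest {
    public static void main(String[] args) {
        String path = System.getProperty("user.home") + "/downloads/deepseek.pdf";
        // No-arg constructor: PDF document information is not copied into the metadata
        DocumentParser parser = new ApachePdfBoxDocumentParser();
        Document document = FileSystemDocumentLoader.loadDocument(path, parser);
        log.info("textSegment:{}", document.toTextSegment());
        log.info("meta data:{}", document.metadata().toMap());
        log.info("text:{}", document.text());
    }
}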
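Given a file path, ApachePdfBoxDocumentParser parses the PDF and the loader returns the result as a Document object. The Document can be converted to a TextSegment, which can then be embedded and stored in a vector database: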
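// Assumption: an in-memory store; the original snippet does not show how embeddingStore is created
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
TextSegment segment1 = document.toTextSegment();
Embedding embedding1 = embeddingModel.embed(segment1).content();
embeddingStore.add(embedding1, segment1);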
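To make the idea of "storing a segment next to its embedding" more concrete, here is a minimal sketch in plain Java with no langchain4j dependency. The class TinyVectorStoreSketch and its methods are hypothetical names invented for illustration; a real embedding store is far more capable, and this only shows the general principle of keeping each vector with its text and ranking by cosine similarity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not langchain4j's EmbeddingStore.
class TinyVectorStoreSketch {
    // Parallel lists: vectors.get(i) is the embedding of texts.get(i).
    private final List<float[]> vectors = new ArrayList<>();
    private final List<String> texts = new ArrayList<>();

    /** Stores an embedding together with the text it was computed from. */
    void add(float[] embedding, String text) {
        vectors.add(embedding);
        texts.add(text);
    }

    /** Cosine similarity of two equal-length vectors (NaN if either is all zeros). */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the stored text whose embedding is closest to the query, or null if the store is empty. */
    String mostSimilar(float[] query) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < vectors.size(); i++) {
            double score = cosine(query, vectors.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = texts.get(i);
            }
        }
        return best;
    }
}
```

In the article's flow, the float[] would come from the embedding model and the text from the segment produced by document.toTextSegment(); the two-dimensional vectors one might use to try this sketch are toy values.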
Source code
document-parsers/langchain4j-document-parser-apache-pdfbox/src/main/java/dev/langchain4j/data/document/parser/apache/pdfbox/ApachePdfBoxDocumentParser.java
public class ApachePdfBoxDocumentParser implements DocumentParser {

    private final boolean includeMetadata;

    public ApachePdfBoxDocumentParser() {
        this(false);
    }

    public ApachePdfBoxDocumentParser(boolean includeMetadata) {
        this.includeMetadata = includeMetadata;
    }

    @Override
    public Document parse(InputStream inputStream) {
        try (PDDocument pdfDocument = PDDocument.load(inputStream)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(pdfDocument);
            if (isNullOrBlank(text)) {
                throw new BlankDocumentException();
            }
            return includeMetadata
                    ? Document.from(text, toMetadata(pdfDocument))
                    : Document.from(text);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private Metadata toMetadata(PDDocument pdDocument) {
        PDDocumentInformation documentInformation = pdDocument.getDocumentInformation();
        Metadata metadata = new Metadata();
        for (String metadataKey : documentInformation.getMetadataKeys()) {
            String value = documentInformation.getCustomMetadataValue(metadataKey);
            if (value != null) metadata.put(metadataKey, value);
        }
        return metadata;
    }
}
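ApachePdfBoxDocumentParser implements DocumentParser, and includeMetadata defaults to false. Its parse method first loads the PDF with PDDocument.load(inputStream), then extracts the text with PDFTextStripper, and throws BlankDocumentException if that text is null or blank. If includeMetadata is true, the metadata is read from pdDocument.getDocumentInformation() and attached to the returned Document; otherwise the Document carries the text only.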
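The control flow of parse can be sketched in plain Java without PDFBox or langchain4j on the classpath. Everything in this sketch is a hypothetical stand-in: ParsedDoc replaces Document, IllegalStateException replaces BlankDocumentException, and a Map replaces PDDocumentInformation. It only mirrors the three decisions visible in the source above: reject blank text, attach metadata only when asked, and skip null metadata values.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the parser's branching; not the library's code.
class ParseFlowSketch {

    /** Stand-in for Document: extracted text plus (possibly empty) metadata. */
    static final class ParsedDoc {
        final String text;
        final Map<String, String> metadata;

        ParsedDoc(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Same idea as the isNullOrBlank guard on the extracted text. */
    static boolean isNullOrBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Copies only entries with non-null values, as toMetadata does. */
    static Map<String, String> copyNonNull(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            if (e.getValue() != null) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Mirrors parse(): blank text is an error; metadata is attached only on request. */
    static ParsedDoc parse(String extractedText, Map<String, String> rawInfo, boolean includeMetadata) {
        if (isNullOrBlank(extractedText)) {
            // Stand-in for BlankDocumentException
            throw new IllegalStateException("blank document");
        }
        return includeMetadata
                ? new ParsedDoc(extractedText, copyNonNull(rawInfo))
                : new ParsedDoc(extractedText, Collections.<String, String>emptyMap());
    }
}
```

With the real parser, the equivalent switch is the constructor argument: new ApachePdfBoxDocumentParser(true) requests the metadata, while the no-arg constructor used in the example omits it.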
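Summary
langchain4j provides the langchain4j-document-parser-apache-pdfbox module for reading PDF files and parsing them into Document objects. A Document can be converted to a TextSegment, which can then be embedded and stored in a vector database.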
doc
- document-parsers/apache-pdfbox