I am playing with Google Document AI, but I am unsure what the possibilities are. Has anyone created a model that can read a PDF and split it into appropriate DITA topics, or into separate JSON files for each identified DITA topic? Any tips or help are appreciated.
2 Answers
To split documents, you can use the Document Splitter processor in Document AI.
Splitter output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field for representing document splits.
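As a rough illustration of reading those splits, here is a minimal sketch that walks the `entities` list of a Document JSON object that has already been loaded into a Python dict. It assumes the camelCase field names of the JSON serialization (`pageAnchor`, `pageRefs`, `page`), that page numbers are zero-based and may arrive as strings, and that `page` can be missing for the first page. The function name `split_page_ranges` and the sample types `invoice` and `receipt` are made up for the example.

```python
def split_page_ranges(document: dict) -> list:
    """Return one record per split entity: type, confidence, first and last page."""
    ranges = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: "page" is zero-based, may be a string, and may be
        # omitted entirely when it refers to page 0.
        pages = sorted(int(ref.get("page", 0)) for ref in refs)
        if not pages:
            continue  # entity without page references: nothing to split on
        ranges.append({
            "type": entity.get("type", ""),
            "confidence": entity.get("confidence", 0.0),
            "first_page": pages[0],
            "last_page": pages[-1],
        })
    return ranges


# Hypothetical splitter output, trimmed to the fields used above.
sample = {
    "entities": [
        {"type": "invoice", "confidence": 0.98,
         "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
        {"type": "receipt", "confidence": 0.91,
         "pageAnchor": {"pageRefs": [{"page": "2"}]}},
    ]
}

for split in split_page_ranges(sample):
    print(split)
```

The confidence score is carried along so that low-confidence splits can be routed to manual review instead of being trusted blindly.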
The splitter is not designed to split logical documents that are more than 30 pages long. A longer logical document (e.g. a 40-page bank statement) may be split into two or more documents that are classified separately.
Splitters identify page boundaries, but do not actually split the input document for you. The Document AI PDF Splitter Sample is a code sample that physically splits a PDF file by using those page boundaries.
For more information about Document Splitter, you can refer to this document.
To create a custom classification processor, you can follow this documentation.
Slight clarification for https://stackoverflow.com/a/76021683/6216983
The general Document Splitter processor isn't recommended for production use cases.
Instead, use the Custom Document Splitter (currently requires allowlisting), the Procurement Splitter & Classifier, or the Lending Splitter & Classifier, depending on the types of documents.
Splitters identify page boundaries, but do not actually split the input document for you.
You can use the Document AI Toolbox SDK to split the original PDF based on page boundaries identified.
Document AI doesn’t currently have built-in support for DITA topics. If you can provide more context for the use case, I can report this as a feature request to the product development team.
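In the meantime, a possible workaround for the "separate JSON files" part of the question is to post-process the splitter output yourself. The sketch below assumes a Document JSON object loaded into a dict, with camelCase field names (`pageAnchor`, `pageRefs`, `textAnchor`, `textSegments`, `startIndex`, `endIndex`) and index values that may be strings or omitted when zero. The helper names and the `split_NNN.json` file naming are hypothetical. It writes one JSON file per split entity; mapping those files to DITA topic types would still be up to you.

```python
import json
from pathlib import Path


def _to_int(value, default=0):
    """Convert an index that may be a string, an int, or missing."""
    return default if value is None else int(value)


def page_text(document: dict, page_index: int) -> str:
    """Collect the text of one page from the document-level text field."""
    page = document["pages"][page_index]
    segments = page.get("layout", {}).get("textAnchor", {}).get("textSegments", [])
    full_text = document.get("text", "")
    return "".join(
        full_text[_to_int(seg.get("startIndex")):_to_int(seg.get("endIndex"))]
        for seg in segments
    )


def write_split_json(document: dict, out_dir: str) -> list:
    """Write one JSON file per split entity and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, entity in enumerate(document.get("entities", [])):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        pages = sorted(_to_int(ref.get("page")) for ref in refs)
        if not pages:
            continue  # no page references, so there is nothing to export
        record = {
            "type": entity.get("type", ""),
            "confidence": entity.get("confidence", 0.0),
            "pages": pages,
            "text": "".join(page_text(document, p) for p in pages),
        }
        path = out / f"split_{n:03d}.json"  # hypothetical naming scheme
        path.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        written.append(str(path))
    return written
```

Calling `write_split_json(document, "out")` on a processed document would then leave one file per detected split in the `out` directory, each holding the split type, confidence, page list, and text.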