
I am experimenting with Google Document AI, but I am unsure what its possibilities are. Has anyone created a model that can read a PDF and split it into appropriate DITA topics, or into a separate JSON file for each identified DITA topic? Any tips or help are appreciated.

2 Answers


  1. To split documents you can use Document Splitter in Document AI.

    Splitter output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field for representing document splits.

    The splitter is not designed to handle logical documents longer than 30 pages. A logical document that exceeds 30 pages (e.g. a 40-page bank statement) may be split into two or more documents, each classified separately.

    Splitters identify page boundaries, but do not actually split the input document for you. Here is a code sample that physically splits a PDF file by using the page boundaries:

    Document AI PDF Splitter Sample.

    For more information about Document Splitter, refer to the Document AI documentation.

    To create a custom classification processor, follow the Document AI documentation on custom classifiers.
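
Since the splitter reports page boundaries as entities rather than producing separate files, a first step is usually to turn those entities into page ranges. The following is a minimal sketch, not the official sample: it assumes the Document JSON has already been loaded into a Python dict and that each split entity carries `type`, `confidence`, and `pageAnchor.pageRefs[].page` (zero-based, often encoded as a string, and omitted for page 0). The function name `split_ranges` and the sample entity types are hypothetical.

```python
from typing import Any


def split_ranges(document: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect one zero-based page range per split entity in a Document JSON dict."""
    ranges = []
    for entity in document.get("entities", []):
        page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: page indices may arrive as strings, and index 0 may be omitted.
        pages = sorted(int(ref.get("page", 0)) for ref in page_refs)
        if not pages:
            continue  # Entity without page references: nothing to split on.
        ranges.append({
            "type": entity.get("type", ""),
            "confidence": float(entity.get("confidence", 0.0)),
            "first_page": pages[0],
            "last_page": pages[-1],
        })
    return ranges


# Hypothetical splitter output for a three-page PDF.
sample = {
    "entities": [
        {"type": "invoice", "confidence": 0.98,
         "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
        {"type": "receipt", "confidence": 0.87,
         "pageAnchor": {"pageRefs": [{"page": "2"}]}},
    ]
}

for page_range in split_ranges(sample):
    print(page_range)
```

The resulting ranges can then be passed to whatever PDF library you use to write each range to its own file, and the confidence score can be used to flag low-confidence splits for manual review.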

  2. Slight clarification for https://stackoverflow.com/a/76021683/6216983


    The general Document Splitter processor isn't recommended for production use cases.

    Instead, use the Custom Document Splitter (currently requires allowlisting), the Procurement Splitter & Classifier, or the Lending Splitter & Classifier, depending on the types of documents.

    Splitters identify page boundaries, but do not actually split the input document for you.

    You can use the Document AI Toolbox SDK to split the original PDF based on page boundaries identified.

    Document AI doesn’t currently have built-in support for DITA topics. If you can provide more context for the use case, I can report this as a feature request to the product development team.
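
For the Document AI Toolbox approach mentioned above, the call could look roughly like the sketch below. It assumes the `google-cloud-documentai-toolbox` package is installed and that the splitter's Document JSON output has been saved locally; the wrapper name `split_with_toolbox` and its parameter names are hypothetical, and the Toolbox calls should be checked against the version you have installed.

```python
def split_with_toolbox(document_json_path: str, pdf_path: str, output_dir: str) -> list[str]:
    """Split the original PDF using page boundaries from a saved Document JSON file."""
    # Imported inside the function so the module still loads when the
    # Toolbox package (google-cloud-documentai-toolbox) is not installed.
    from google.cloud.documentai_toolbox import document

    # Wrap the splitter output so the Toolbox can read its split entities.
    wrapped_document = document.Document.from_document_path(
        document_path=document_json_path
    )
    # Writes one PDF per identified subdocument and returns the created file paths.
    return wrapped_document.split_pdf(pdf_path=pdf_path, output_path=output_dir)
```

A call would pass the splitter's saved JSON, the original PDF, and a target directory, for example `split_with_toolbox("splitter_output.json", "input.pdf", "out/")` (placeholder paths). Mapping each resulting file to a DITA topic would still be a separate step, since Document AI has no built-in notion of DITA.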
