
I’ve been trying to get Document AI batch submission working and am having some difficulty. I have single-file submission working using RawDocument, and I suppose I could just iterate over my data set (27k images), but I chose batch since it seems like the more appropriate technique.
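
For context, the single-file submission that works for me looks roughly like this (the file name and MIME type are placeholders):

    use Google\Cloud\DocumentAI\V1\DocumentProcessorServiceClient;
    use Google\Cloud\DocumentAI\V1\RawDocument;

    $client = new DocumentProcessorServiceClient();
    $name = 'projects/######/locations/us/processors/#######';
    $rawDocument = new RawDocument([
        'content'   => file_get_contents('image-0001.jpg'), // placeholder file
        'mime_type' => 'image/jpeg',
    ]);
    $response = $client->processDocument($name, ['rawDocument' => $rawDocument]);
    $client->close();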

When I run my code I am seeing an error: "Failed to process all documents". The first few lines of the debug information are:

O:17:"GoogleRpcStatus":5:{
s:7:"*code";i:3;s:10:"*message";s:32:"Failed to process all documents.";
s:26:"GoogleRpcStatusdetails";
O:38:"GoogleProtobufInternalRepeatedField":4:{
s:49:"GoogleProtobufInternalRepeatedFieldcontainer";a:0:{}s:44:"GoogleProtobufInternalRepeatedFieldtype";i:11;s:45:"GoogleProtobufInternalRepeatedFieldklass";s:19:"GoogleProtobufAny";s:52:"GoogleProtobufInternalRepeatedFieldlegacy_klass";s:19:"GoogleProtobufAny";}s:38:"GoogleProtobufInternalMessagedesc";O:35:"GoogleProtobufInternalDescriptor":13:{s:46:"GoogleProtobufInternalDescriptorfull_name";s:17:"google.rpc.Status";s:42:"GoogleProtobufInternalDescriptorfield";a:3:{i:1;O:40:"GoogleProtobufInternalFieldDescriptor":14:{s:46:"GoogleProtobufInternalFieldDescriptorname";s:4:"code";“`

The support documentation for this error states that the reason is:

The gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to begin with gs:// and end with a trailing slash character (/). Check the configuration for the Bucket URIs.

I am not using gcsUriPrefix (should I? my bucket contains more files than the max batch limit), but my gcsOutputConfig.gcsUri is within these limits. The file list I’ve provided gives file names (pointed at the right bucket), so those entries should not end with a trailing slash.
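
To make the formats concrete, my understanding of the expected URI shapes (the object name is hypothetical) is:

    gs://my-output-bucket/                                -> gcsOutputConfig.gcsUri (prefix, trailing slash required)
    gs://my-input-bucket/the-bucket-path/                 -> gcsUriPrefix (prefix, trailing slash required)
    gs://my-input-bucket/the-bucket-path/image-0001.jpg   -> individual gcsDocuments entry (full object path, no trailing slash)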

Advice welcome

    // imports assumed from the Document AI and Cloud Storage PHP client libraries
    use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
    use Google\Cloud\DocumentAI\V1\DocumentOutputConfig;
    use Google\Cloud\DocumentAI\V1\DocumentOutputConfig\GcsOutputConfig;
    use Google\Cloud\DocumentAI\V1\DocumentProcessorServiceClient;
    use Google\Cloud\DocumentAI\V1\GcsDocument;
    use Google\Cloud\DocumentAI\V1\GcsDocuments;
    use Google\Cloud\Storage\StorageClient;

    function filesFromBucket( $directoryPrefix ) {
        // nb: a prefix-only listing (no delimiter) returns every object under
        // the prefix, including nested "directories"
        $gcsDocumentList = [];
    
        // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
        $bucketName = 'my-input-bucket';
        $storage = new StorageClient();
        $bucket = $storage->bucket($bucketName);
        $options = ['prefix' => $directoryPrefix];
        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            $doc->setGcsUri('gs://'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            array_push( $gcsDocumentList, $doc );
        }
    
        $gcsDocuments = new GcsDocuments();
        $gcsDocuments->setDocuments($gcsDocumentList);
        return $gcsDocuments;
    }
    
    function batchJob() {
        $inputConfig = new BatchDocumentsInputConfig( ['gcs_documents'=>filesFromBucket('the-bucket-path/')] );
    
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
        // nb: all uri paths must end with / or an error will be generated.
        $outputConfig = new DocumentOutputConfig( 
            [ 'gcs_output_config' =>
                   new GcsOutputConfig( ['gcs_uri'=>'gs://my-output-bucket/'] ) ]
        );
     
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
        $documentProcessorServiceClient = new DocumentProcessorServiceClient();
        try {
            // derived from the prediction endpoint
            $name = 'projects/######/locations/us/processors/#######';
            $operationResponse = $documentProcessorServiceClient->batchProcessDocuments($name, ['inputDocuments'=>$inputConfig, 'documentOutputConfig'=>$outputConfig]);
            $operationResponse->pollUntilComplete();
            if ($operationResponse->operationSucceeded()) {
                $result = $operationResponse->getResult();
                printf('<br>result: %s<br>',serialize($result));
            // doSomethingWith($result)
            } else {
                $error = $operationResponse->getError();
                printf('<br>error: %s<br>', serialize($error));
                // handleError($error)
            }
        } finally {
            $documentProcessorServiceClient->close();
        }    
    }

2 Answers


  1. Chosen as BEST ANSWER

    This turns out to be an ID-10-T error, with definite PEBKAC overtones.

    $object->name() does not return the bucket name as part of the path.

    Changing $doc->setGcsUri('gs://'.$object->name()); to $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name()); resolves the issue.
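
    For anyone landing here later, the corrected loop in filesFromBucket is:

        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            // $object->name() is only the object path; prepend the bucket name
            $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            array_push( $gcsDocumentList, $doc );
        }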


  2. Usually, the reason for the error "Failed to process all documents" is incorrect syntax for the input files or the output bucket: an incorrectly formatted path might still be a "valid" path for Cloud Storage, just not one that points to the files you’re expecting. (Thank you for checking the error messages page first!)

    You don’t have to use gcsUriPrefix if you’re providing a list of specific documents to process. However, based on your code, it looks like you’re adding all of the files from a GCS directory to the BatchDocumentsInputConfig.gcs_documents field anyway, so it would make sense to send the prefix in BatchDocumentsInputConfig.gcs_uri_prefix instead of a list of individual files.
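
    For example, a minimal sketch of the prefix-based input config, assuming the bucket and path from your question:

        use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
        use Google\Cloud\DocumentAI\V1\GcsPrefix;

        // one prefix replaces the per-file list; the service picks up every
        // file under gs://my-input-bucket/the-bucket-path/
        $inputConfig = new BatchDocumentsInputConfig([
            'gcs_prefix' => new GcsPrefix([
                'gcs_uri_prefix' => 'gs://my-input-bucket/the-bucket-path/',
            ]),
        ]);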

    Note: There is a maximum number of files (1000) that can be sent in an individual batch processing request, and specific processors have their own limits for pages.

    https://cloud.google.com/document-ai/quotas#content_limits

    You can try splitting the files across multiple batch requests to avoid hitting this limit. The Document AI Toolbox Python SDK has built-in functions for this, and you could re-implement the same logic in PHP for your use case: https://github.com/googleapis/python-documentai-toolbox/blob/ba354d8af85cbea0ad0cd2501e041f21e9e5d765/google/cloud/documentai_toolbox/utilities/gcs_utilities.py#L213
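
    As a rough, untested sketch of that batching idea in PHP (reusing the classes from your question and assuming the corrected bucket URIs):

        // collect every object under the prefix, as in filesFromBucket()
        $storage = new StorageClient();
        $bucket = $storage->bucket('my-input-bucket');
        $allDocs = [];
        foreach ($bucket->objects(['prefix' => 'the-bucket-path/']) as $object) {
            $doc = new GcsDocument();
            $doc->setGcsUri('gs://my-input-bucket/'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            $allDocs[] = $doc;
        }

        // stay under the 1,000-files-per-request limit
        foreach (array_chunk($allDocs, 1000) as $chunk) {
            $gcsDocuments = new GcsDocuments();
            $gcsDocuments->setDocuments($chunk);
            $inputConfig = new BatchDocumentsInputConfig(['gcs_documents' => $gcsDocuments]);
            // ...submit one batchProcessDocuments() request per chunk,
            //    exactly as in your batchJob() function...
        }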
