
Aim:

I am using the pdfannots tool to extract the text from a highlighted PDF. The aim of my work is to use it for learning German. The problem is that the output .txt (JSON) file does not show any Polish or German special characters (see Example 1).

I would like to ask whether there is an option (and how to set it) to use an encoding that contains the Polish and German special characters.

Example 1 – how the text with Polish and German characters is shown

Correct version
text: Lösung
contents: rozwiązanie

The file encoding is set to UTF-8, and the file was also re-saved in Notepad++:

Example 2 – encoding

I tried to modify the JSON file, but I got stuck.

Anaconda Console line:

pdfannots "path" -f json > directoriesjson_to_csv.txt

Thanks

Some additional information:

The PDF file has been created in the Goethe FF Clan font. When I copy a word from the file and paste it into, e.g., Notepad++/WordPad/a browser, the special characters are copied correctly, too.

Currently I can create the .csv file from the .json output, but there are still no German or Polish characters.
The same happens when I try to create a Markdown (.md) file.

I have not tried the same workflow on a different PDF file; I think the cause lies somewhere else. What seems strange to me is that these special characters are shown in escaped Unicode notation (00f6 is the Unicode code point for ö).
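
As far as I understand, that escaped notation is standard JSON behaviour rather than data loss: JSON writers commonly escape non-ASCII characters, and a JSON parser turns the escapes back into real characters. A minimal sketch of both directions with Python's json module (illustrative only, not pdfannots' actual code):

import json

print(json.dumps({"text": "Lösung"}))                      # {"text": "L\u00f6sung"}  (ensure_ascii=True is the default)
print(json.dumps({"text": "Lösung"}, ensure_ascii=False))  # {"text": "Lösung"}
print(json.loads('{"text": "L\\u00f6sung"}')["text"])      # Lösung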

I would like to avoid modifying the pdfannots code if the result could be achieved in any other way.

The code I used to create the CSV is shown below. As you can see, the encoding is set to UTF-8. Currently I am trying to change the encoding to 'unicode-escape'.

import json
import csv

# jsondata is the list of annotation dicts parsed from the pdfannots JSON output
# (see the loading sketch below); open the CSV with an explicit UTF-8 encoding
data_file = open(r'C:\PYTHON\pdfannots_GERMAN\main\json_to_csv_2.csv', 'w',
                 newline='', encoding='utf-8')

csv_writer = csv.writer(data_file)

count = 0
for data in jsondata:
    if count == 0:
        # write the header row taken from the keys of the first annotation
        header = data.keys()
        csv_writer.writerow(header)
        count += 1
    csv_writer.writerow(data.values())

data_file.close()
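
For completeness, jsondata above comes from parsing the pdfannots output with an explicit UTF-8 encoding; a minimal sketch, with a hypothetical file name standing in for the redirected output from the command above:

import json

# hypothetical path: point this at the file produced by the pdfannots command
with open('json_to_csv.txt', 'r', encoding='utf-8') as f:
    jsondata = json.load(f)   # any \uXXXX escapes become real characters here

Opening both this file and the CSV with encoding='utf-8' keeps ö and ą intact; if the CSV is meant for Excel, encoding='utf-8-sig' may be needed so Excel detects the UTF-8.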

2 Answers


  1. I suppose output.txt contains the \u construction not as Unicode but as a bunch of ordinary characters that are not interpreted as Unicode escapes.

    Before writing and saving output.txt, you can try converting the output data to a Unicode string:

    # result is something like '... "contents": "rozwi\u0105zanie", ...'
    output = result.encode().decode('unicode-escape')
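
    A quick illustration of that round trip (my own hypothetical result value, not the answerer's data). Note it assumes the string is plain ASCII with \uXXXX escapes; characters that are already non-ASCII would be mangled by the default UTF-8 encode:

    result = '... "contents": "rozwi\\u0105zanie", ...'
    output = result.encode().decode('unicode-escape')
    print(output)   # ... "contents": "rozwiązanie", ...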
    
  2. One problem is that, unlike Adobe's FDF toolkit, there is no single cohesive way to describe "/Annot"s in JSON, so every library does it differently rather than the Adobe way!

    Internally the contents can be described, for import, as an HTML fragment such as <p><span>rozwi&#261;zanie</span></p>, or, more usually, as a 16-bit hex-encoded string:
    /Contents<FEFF0072006F007A007700690105007A0061006E00690065>
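
    That hex form is just the text in UTF-16BE preceded by a byte-order mark; a minimal Python sketch of decoding it (my illustration, not part of the answerer's tooling):

    hex_contents = "FEFF0072006F007A007700690105007A0061006E00690065"
    print(bytes.fromhex(hex_contents).decode("utf-16"))   # rozwiązanie (the FEFF BOM is consumed)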

    Here are two totally different JSON outputs from the same input:

    (image: the two JSON outputs shown side by side)

    The left output is designed as HTML strings to be manipulated, for output-only overview.

    [
      {
        "annotationFlags": 4,
        "borderStyle": {
          "width": 0,
          "style": 1,
          "dashArray": [
            3
          ],
          "horizontalCornerRadius": 0,
          "verticalCornerRadius": 0
        },
        "backgroundColor": null,
        "borderColor": null,
        "contentsObj": {
          "str": "Lösungrrozwiązanie",
          "dir": "ltr"
        },
        "color": {
          "0": 255,
          "1": 255,
          "2": 255
        },
        "hasAppearance": true,
        "id": "15R",
        "modificationDate": "D:20230809185910+00'00'",
        "rect": [
          206.80898,
          497.66255,
          347.1585,
          532.10977
        ],
        "subtype": "FreeText",
        "titleObj": {
          "str": "Kuba",
          "dir": "ltr"
        },
        "cr....
    

    The right side in the image above is coherent cpdf output, which is as comprehensive as possible for export/import, but that means much more additional output to describe the big-endian 16-bit text. So the string is a mix that emulates the binary content:

    "þÿ L ö s u n g r r o z w i u0001u0005 z a n i e"

    [
      [
      1, { "/AP": { "/N": 27 }, "/C": [ { "I": 1 }, { "I": 1 }, { "I": 1 } ],
      "/Contents": "þÿu0000Lu0000öu0000su0000uu0000nu0000gu0000ru0000ru0000ou0000zu0000wu0000iu0001u0005u0000zu0000au0000nu0000iu0000e",
      "/CreationDate": "D:20230809185910+00'00'", "/DA": "1 0 0 rg /F1 12 Tf",
      "/DS": "font-family:Arial;font-size:12pt;color:#000000;", "/F": { "I": 4 },
      "/IT": { "N": "/FreeText" }, "/M": "D:20230809185910+00'00'",
      "/NM": "3dfeefcf-0c19-4701-adda7b013cbe96fc", "/P": 9,
      "/RD": [ { "I": 1 }, { "I": 1 }, { "I": 1 }, { "I": 1 } ],
      "/Rect": [
        { "F": 206.80898 }, { "F": 497.66255 }, { "F": 347.1585 }, {
        "F": 532.10977 }
      ], "/Subj": "Text Box", "/Subtype": { "N": "/FreeText" }, "/T": "Kuba",
      "/Type": { "N": "/Annot" } } ], 
    

    So the best format for exporting and importing annotations is the same one used for bulk I/O of AcroForms and the one Adobe uses for collaborative commenting.
