I’m new. I’m working on a program that reads a batch of .docx documents. I extract each document's content from its XML with XPath and xmldom, which gives me an array with every line of the document. The array looks something like this:

[
  '-1911312-14668500FECHA:  15-12-25',
  'NOMBRE Y APELLIDO:  Jhon dee',
  'C.I.: 20020202                                  EDAD: 45                       ',
  'DIRECCION:  LA CASA',
  'TLF:  55555555',
  'CORREO: thiisatest@gmail',
  '                                            HISTORIA CLINICA GINECO-OBSTETRICA',
  'HO',
  'NULIG',
  'FUR',
  '3-8-23',
  'EG',
  '',
  'FPP',
  '',
  'GS',
  '',
  'GSP',
  '',
  '',
  'MC:  CONTROL GINECOLOGICO',
  'HEA',
  '',
  'APP:  NIEGA PAT, NIEGA ALER, QX NIEGA.',
  'APF: MADRE HTA, ABUELA DM.',
  '',
  'AGO: MENARQUIA:  10                FUR:                         CICLO:      4/28              ',
  '    TIPO: EUM',
  ' MET ANTICONCEP:  GENODERM DESDE HACE 3 AÑOS.',
  'PRS:                                      NPS:                                                   ITS: VPH LIE BAJO GRADO 2017 , BIOPSIA.',
  'FUC:  NOV 2022, NEGATIVA. COLPO NEGATIVA.',
  '',
  '',
  'EMBARAZO',
  '#/AÑO',
  'TIPO DE PARTO',
  'INDICACION',
  'RN',
  'SEXO',
  'RN',
  'PESO',
  'OBSERVACIONES',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'EXAMEN FISICO:',
  'PESO:  80,1                  TALLA:                    TA: MMHG                    FC:                    FR: ',  
  '',
  'PIEL Y MUCOSA:  DLN',
  'CARDIOPULMONAR: DLN',
  '',
  'MAMAS: ',
  '',
  'ABDOMEN: ',
  'GENITALES:  CUELLO SIN SECRECION , COLPO SE EVDIENCIA DOS LEISONES HPRA 1 Y HORA 5',
  '',
  'EXTREMIDADES: DLN',
  'NEUROLOGICO: DLN',
  '',
  ' IDX:  LESION EN CUELLO UTERINO',
  '',
  'PLAN: DEFEROL OMEGA, CAUTERIZACION Y TIPIFICACION VIRAL',
  '22-8-23',
  'SE TOMA MUESTRA DE TIPIFICACION VIRAL.',
  '',
  '',
  '',
  'LABORATORIOS:',
  'FECHA',
  'HB/HTO',
  'LEU/PLAQ',
  'GLICEMIA',
  'UREA',
  'CREAT',
  'HIV/VDRL',
  'UROANALISIS',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ... 44 more items
]

So, I want to put this content into a JS object like:

const customObj = {
 fecha: "fecha on the doc",
....
}

I thought this would work:

const fillObject = (inputArray, keywords) => {
    const customObj = {};
    keywords.forEach((keyword, index) => {
        customObj[keyword] = inputArray.map(line => {
            const keywordIndex = line.indexOf(keyword);
            if (keywordIndex !== -1) {
                const nextKeywordIndex = keywords.slice(index + 1).reduce((acc, nextKeyword) => {
                    const nextKeywordIndex = line.indexOf(nextKeyword);
                    return nextKeywordIndex !== -1 && nextKeywordIndex < acc ? nextKeywordIndex : acc;
                }, line.length);
                return line.slice(keywordIndex, nextKeywordIndex).trim();
            }
            return null;
        }).filter(Boolean);
    });
    console.log(customObj);
    return customObj;
};

From this function I get each keyword together with the content up to the next keyword, but I only want the important data.
The format of the documents is always the same, but sometimes there are spaces between a keyword and its content and sometimes there aren't. The keywords are always written in capital letters.

I tried the function above, but I want the search to be more precise and the data in the object to look cleaner. The final result has to be more accurate, because the output currently looks like this:

'FECHA:': [ 'FECHA: 19-10-23' ],
    'NOMBRE Y APELLIDO:': [ 'NOMBRE Y APELLIDO: John Dee' ],
    'C.I.:': [ 'C.I.: 3232323' ],
    'EDAD:': [ 'EDAD: 56' ],
    'DIRECCION:': [ 'DIRECCION:   Marylan' ],
    'TLF:': [ 'TLF:  55555555' ],
    'CORREO:': [ 'CORREO:  [email protected]' ],
    'CONTACTO:': [
      'CONTACTO:  IG                                            HISTORIA CLINICA GINECO-OBSTETRICA'
    ],

As you can see, some properties come out wrong. For example, "CONTACTO" also captures the heading that follows it.

2 Answers


  1. Chosen as BEST ANSWER

    To solve this problem I used a few JS string and array methods and loops.

    /**
     * Joins an array of content lines into a single string, separated by commas.
     * @param {string[]} content
     * @returns {string}
     */
    const formatContent = (content) => {
        return content.join(', ');
    };
    
    /**
     * Fills an object with information extracted from an input array, based on keywords and extended keywords.
     * @param {string[]} inputArray
     * @param {string[]} keywords
     * @param {string[]} extendedKeywords
     * @returns {Object}
     */
    // Definition of the fillObject function with three parameters: inputArray, keywords, and extendedKeywords.
    const fillObject = (inputArray, keywords, extendedKeywords) => {
        // Initialization of customObj as an empty object to store the results.
        const customObj = {};
        // Array of restricted keywords that should not be included in the results.
        const restrictedKeywords = ['FECHA:', 'C.I.', 'TLF'];
    
        // Iteration over each keyword in the keywords array.
        keywords.forEach((keyword, index) => {
            // Mapping each line of the inputArray to process the information.
            const values = inputArray.map((line, lineIndex) => {
                // Searching for the index of the current keyword in the line.
                const keywordIndex = line.indexOf(keyword);
                // If the keyword is found, the following code is executed.
                if (keywordIndex !== -1) {
                    // Searching for the index of the next keyword to delimit the content to be extracted.
                    const nextKeywordIndex = keywords.slice(index + 1).reduce((acc, nextKeyword) => {
                        const nextKeywordIndex = line.indexOf(nextKeyword);
                        // The smallest index that is not -1 is selected as the final delimiter.
                        return nextKeywordIndex !== -1 && nextKeywordIndex < acc ? nextKeywordIndex : acc;
                    }, line.length);
                    // Extraction of the content between the current keyword and the next.
                    let result = line.slice(keywordIndex + keyword.length, nextKeywordIndex).trim();
                    // If the result contains any restricted keywords, only the first word is extracted.
                    if (restrictedKeywords.some(restrictedKeyword => result.includes(restrictedKeyword))) {
                        const match = result.match(/\b(\w+)\b/);
                        result = match ? match[0] : null;
                    }
                    // If the current keyword is in extendedKeywords, additional content is collected.
                    if (extendedKeywords.includes(keyword)) {
                        const content = [];
                        let nextLineIndex = lineIndex + 1;
                        // Additional lines are collected until a new keyword is found.
                        while (nextLineIndex < inputArray.length && !keywords.some(k => inputArray[nextLineIndex].includes(k))) {
                            // The line is added to the content array if it is not empty.
                            if (inputArray[nextLineIndex].replace(/,/g, '').trim() !== '') {
                                content.push(inputArray[nextLineIndex]);
                            }
                            nextLineIndex++;
                        }
                        // The formatContent function is used to join the collected content into a string.
                        result = formatContent(content);
                    }
                    // The result for this iteration of the map is returned.
                    return result;
                }
                // If the keyword is not found, null is returned.
                return null;
            // Null values are filtered out of the resulting array from the map.
            }).filter(Boolean);
            // The array of values or the single value is assigned to the customObj under the corresponding keyword.
            customObj[keyword] = values.length > 1 ? values : values[0];
        });
        // The customObj with the extracted information is returned.
        return customObj;
    };
    

    This gives me something close to what I originally wanted, but I still have a small issue with the tables. Remember, the data is originally XML extracted from a DOCX file. It contains some tables, and the way I store their cells in my array confuses the program.

    When a table has two cells, the left and right values are parsed correctly. However, when it has more than two cells, the values get mixed up, and everything appears after the value of the last cell.
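
    One possible way around this, sketched under assumptions: if a table's header labels are known in advance and each row holds exactly one cell per header, the flat list of cell texts can be cut into rows of that width. `cellsToRecords` is a hypothetical helper, not part of the code above. The headers and the eight empty cells are the ones that appear for the LABORATORIOS table in the question's array.

```javascript
// Hypothetical helper: rebuild table rows from a flat list of cell texts.
// Assumption: every row has exactly one cell per header, in header order.
// A trailing, incomplete row is ignored.
const cellsToRecords = (headers, cells) => {
    const records = [];
    for (let i = 0; i + headers.length <= cells.length; i += headers.length) {
        const row = cells.slice(i, i + headers.length);
        // Pair each header with the cell at the same position in the row.
        records.push(Object.fromEntries(headers.map((header, j) => [header, row[j].trim()])));
    }
    return records;
};

// Header cells of the LABORATORIOS table, followed by its eight empty cells.
const labHeaders = ['FECHA', 'HB/HTO', 'LEU/PLAQ', 'GLICEMIA', 'UREA', 'CREAT', 'HIV/VDRL', 'UROANALISIS'];
const labCells = ['', '', '', '', '', '', '', ''];
console.log(cellsToRecords(labHeaders, labCells));
```

    This only helps if the cells of one table can be separated from the rest of the array first, which the current extraction does not do.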

    Sorry for my bad English; I’m not a native speaker.


  2. Instead of providing a set of keys, I would just parse the input data to recognise the "key: value" pattern.

    Note that you could get ambiguity. For instance, if an input line were:

    TEST: A B C: OK
    

    Then, this could be interpreted as:

    {
        "TEST": "A",
        "B C": "OK"
    }
    

    or as:

    {
        "TEST": "A B",
        "C": "OK"
    }
    

    To break such ties, we could make the capture of the value greedy, so that in the above example the second output would be generated. If however we find that there is a separation of at least three spaces, then we could interpret what follows as a new key/value pair, so that this input:

    TEST: A   B C: OK
    

    …would be interpreted as:

    {
        "TEST": "A",
        "B C": "OK"
    }
    

    Secondly, if a value has commas, you could turn that value into array (except if the comma is part of a numeric value).
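
    As a minimal sketch of that comma rule, assuming a comma directly followed by a digit is a decimal separator: `splitValue` is a hypothetical name, and the two inputs are values taken from the sample data.

```javascript
// Hypothetical helper: split a value on commas, except commas directly
// followed by a digit (treated here as decimal separators, as in "80,1").
const splitValue = value => {
    const parts = value.split(/,(?!\d)/).map(part => part.trim());
    // Return an array only when there really are several items.
    return parts.length > 1 ? parts : parts[0];
};

console.log(splitValue('MADRE HTA, ABUELA DM.')); // [ 'MADRE HTA', 'ABUELA DM.' ]
console.log(splitValue('80,1'));                  // 80,1
```

    One consequence of this choice: a list of numbers written without spaces, such as 1,2,3, would stay a single value.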

    We can use the power of regular expressions to do this kind of parsing.

    Here is a function makeObject and how that could work for your sample input:

    const multiple = arr => arr.length > 1 ? arr : arr[0];
    const regex = /((?:[A-Z.]+ )*[A-Z.]+):((?: {0,2}(?!\S*:)\S+)*)/g;
    const makeObject = data => Object.fromEntries(
        Array.from(data.join("\n").matchAll(regex), ([, key, value]) => [
            key, 
            multiple(value.split(/,(?!\d)/).map(val => val.trim()))
        ])
    );
    
    // Your sample data:
    const data = ['-1911312-14668500FECHA:  15-12-25','NOMBRE Y APELLIDO:  Jhon dee','C.I.: 20020202                                  EDAD: 45                       ','DIRECCION:  LA CASA','TLF:  55555555','CORREO: thiisatest@gmail','                                            HISTORIA CLINICA GINECO-OBSTETRICA','HO','NULIG','FUR','3-8-23','EG','','FPP','','GS','','GSP','','','MC:  CONTROL GINECOLOGICO','HEA','','APP:  NIEGA PAT, NIEGA ALER, QX NIEGA.','APF: MADRE HTA, ABUELA DM.','','AGO: MENARQUIA:  10                FUR:                         CICLO:      4/28              ','    TIPO: EUM',' MET ANTICONCEP:  GENODERM DESDE HACE 3 AÑOS.','PRS:                                      NPS:                                                   ITS: VPH LIE BAJO GRADO 2017 , BIOPSIA.','FUC:  NOV 2022, NEGATIVA. COLPO NEGATIVA.','','','EMBARAZO','#/AÑO','TIPO DE PARTO','INDICACION','RN','SEXO','RN','PESO','OBSERVACIONES','','','','','','','','','','','','','','','','','','','','EXAMEN FISICO:','PESO:  80,1                  TALLA:                    TA: MMHG                    FC:                    FR: ','','PIEL Y MUCOSA:  DLN','CARDIOPULMONAR: DLN','','MAMAS: ','','ABDOMEN: ','GENITALES:  CUELLO SIN SECRECION , COLPO SE EVDIENCIA DOS LEISONES HPRA 1 Y HORA 5','','EXTREMIDADES: DLN','NEUROLOGICO: DLN','',' IDX:  LESION EN CUELLO UTERINO','','PLAN: DEFEROL OMEGA, CAUTERIZACION Y TIPIFICACION VIRAL','22-8-23','SE TOMA MUESTRA DE TIPIFICACION VIRAL.','','','','LABORATORIOS:','FECHA','HB/HTO','LEU/PLAQ','GLICEMIA','UREA','CREAT','HIV/VDRL','UROANALISIS','','','','','','','','',];
    console.log(makeObject(data));

    You’ll see in the output all keys it could find, even those that have an empty value (like AGO). Just extract from this object what you need.
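
    To check the tie-breaking rule against the two TEST lines discussed earlier, here is a small sketch. The pattern is this answer's regex written out in full; `pairPattern` and `parseLine` are hypothetical names used only for this illustration, and comma splitting is left out to keep it short.

```javascript
// Key: uppercase words (dots allowed) separated by single spaces, ending in a colon.
// Value: tokens separated by at most two spaces, stopping before any token
// that contains a colon (that token belongs to the next key).
const pairPattern = /((?:[A-Z.]+ )*[A-Z.]+):((?: {0,2}(?!\S*:)\S+)*)/g;

// Hypothetical helper: parse one line into an object of key/value pairs.
const parseLine = line => Object.fromEntries(
    Array.from(line.matchAll(pairPattern), ([, key, value]) => [key, value.trim()])
);

console.log(parseLine('TEST: A B C: OK'));   // { TEST: 'A B', C: 'OK' }
console.log(parseLine('TEST: A   B C: OK')); // { TEST: 'A', 'B C': 'OK' }
```

    With single spaces the value is captured greedily, so only the last word before the colon becomes the next key. With three spaces the value ends early, and everything after the gap is read as a new key.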
