I’m new. I’m working on a program that reads a bunch of DOCX documents. I get the document content from its XML with XPath and xmldom, which gives me an array with every line of the document. The thing is, I have something like this:
[
'-1911312-14668500FECHA: 15-12-25',
'NOMBRE Y APELLIDO: Jhon dee',
'C.I.: 20020202 EDAD: 45 ',
'DIRECCION: LA CASA',
'TLF: 55555555',
'CORREO: thiisatest@gmail',
' HISTORIA CLINICA GINECO-OBSTETRICA',
'HO',
'NULIG',
'FUR',
'3-8-23',
'EG',
'',
'FPP',
'',
'GS',
'',
'GSP',
'',
'',
'MC: CONTROL GINECOLOGICO',
'HEA',
'',
'APP: NIEGA PAT, NIEGA ALER, QX NIEGA.',
'APF: MADRE HTA, ABUELA DM.',
'',
'AGO: MENARQUIA: 10 FUR: CICLO: 4/28 ',
' TIPO: EUM',
' MET ANTICONCEP: GENODERM DESDE HACE 3 AÑOS.',
'PRS: NPS: ITS: VPH LIE BAJO GRADO 2017 , BIOPSIA.',
'FUC: NOV 2022, NEGATIVA. COLPO NEGATIVA.',
'',
'',
'EMBARAZO',
'#/AÑO',
'TIPO DE PARTO',
'INDICACION',
'RN',
'SEXO',
'RN',
'PESO',
'OBSERVACIONES',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'',
'EXAMEN FISICO:',
'PESO: 80,1 TALLA: TA: MMHG FC: FR: ',
'',
'PIEL Y MUCOSA: DLN',
'CARDIOPULMONAR: DLN',
'',
'MAMAS: ',
'',
'ABDOMEN: ',
'GENITALES: CUELLO SIN SECRECION , COLPO SE EVDIENCIA DOS LEISONES HPRA 1 Y HORA 5',
'',
'EXTREMIDADES: DLN',
'NEUROLOGICO: DLN',
'',
' IDX: LESION EN CUELLO UTERINO',
'',
'PLAN: DEFEROL OMEGA, CAUTERIZACION Y TIPIFICACION VIRAL',
'22-8-23',
'SE TOMA MUESTRA DE TIPIFICACION VIRAL.',
'',
'',
'',
'LABORATORIOS:',
'FECHA',
'HB/HTO',
'LEU/PLAQ',
'GLICEMIA',
'UREA',
'CREAT',
'HIV/VDRL',
'UROANALISIS',
'',
'',
'',
'',
'',
'',
'',
'',
... 44 more items
]
So, I want to put this content into a JS object like:
const customObj = {
fecha: "fecha on the doc",
....
}
I thought this would work:
const fillObject = (inputArray, keywords) => {
  const customObj = {};
  keywords.forEach((keyword, index) => {
    customObj[keyword] = inputArray.map(line => {
      const keywordIndex = line.indexOf(keyword);
      if (keywordIndex !== -1) {
        const nextKeywordIndex = keywords.slice(index + 1).reduce((acc, nextKeyword) => {
          const nextKeywordIndex = line.indexOf(nextKeyword);
          return nextKeywordIndex !== -1 && nextKeywordIndex < acc ? nextKeywordIndex : acc;
        }, line.length);
        return line.slice(keywordIndex, nextKeywordIndex).trim();
      }
      return null;
    }).filter(Boolean);
  });
  console.log(customObj);
  return customObj;
};
The function gives me each keyword together with everything up to the next keyword, but I only want the important data (the value).
The format of the documents is always the same, but sometimes I get spaces between a keyword and its content and sometimes I don’t. The keywords are always in capital letters.
I tried the function above, but I want to be more precise in my search and in how the data looks in the object. The final result has to be more accurate, because the output currently looks like this:
'FECHA:': [ 'FECHA: 19-10-23' ],
'NOMBRE Y APELLIDO:': [ 'NOMBRE Y APELLIDO: John Dee' ],
'C.I.:': [ 'C.I.: 3232323' ],
'EDAD:': [ 'EDAD: 56' ],
'DIRECCION:': [ 'DIRECCION: Marylan' ],
'TLF:': [ 'TLF: 55555555' ],
'CORREO:': [ 'CORREO: [email protected]' ],
'CONTACTO:': [
'CONTACTO: IG HISTORIA CLINICA GINECO-OBSTETRICA'
],
As you can see, some properties come out wrong. "CONTACTO", for example, does not fit well because it also captures the text that follows it.
2 Answers
To solve this problem I used some JS methods and loops.
This gives me something close to what I originally wanted, but I have a small issue with the tables. Remember, this is originally an XML document extracted from a DOCX file. It has some tables, and the way I store them in my array confuses the program.
When I have a table with two cells, the right and left values are parsed correctly. However, when I have more than two cells, the values go crazy, and everything appears after the value of the last cell.
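One possible way to handle the wider tables, shown only as a rough sketch: if each cell ends up as its own array entry and the number of columns is known, the flat list can be cut back into rows. The helper name rowsFromCells, the column count, and the cell values below are made up for illustration. Reading the rows (w:tr) and cells (w:tc) directly from the XML would probably be more robust than guessing from the flat array.

```javascript
// Hypothetical helper: rebuild table rows from a flat list of cell texts.
// Assumes the first `columnCount` entries are the headers and that every
// following group of `columnCount` entries is one table row.
const rowsFromCells = (cells, columnCount) => {
  const headers = cells.slice(0, columnCount);
  const rows = [];
  for (let i = columnCount; i + columnCount <= cells.length; i += columnCount) {
    const row = {};
    headers.forEach((header, j) => {
      row[header] = cells[i + j].trim();
    });
    rows.push(row);
  }
  return rows;
};

// Made-up cell values, using three of the LABORATORIOS headers.
const cells = ["FECHA", "HB/HTO", "GLICEMIA", "1-1-24", "12/36", "90"];
console.log(rowsFromCells(cells, 3));
```

Empty cells simply become empty strings, so a table that has not been filled in yet still produces rows with the right keys.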
Sorry for my bad English; I’m not a native speaker.
Instead of providing a set of keys, I would just parse the input data to recognise the "key: value" pattern.
Note that you could get ambiguity. For instance, if an input line were:
Then, this could be interpreted as:
or as:
To break such ties, we could make the capture of the value greedy, so that in the above example the second output would be generated. If however we find that there is a separation of at least three spaces, then we could interpret what follows as a new key/value pair, so that this input:
…would be interpreted as:
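To make the tie-breaking rule concrete, here is a minimal sketch. The helper name parseLine, the key pattern (capital letters, spaces, dots and slashes up to the first colon), and the two input lines are assumptions for illustration only.

```javascript
// Hypothetical helper: parse one line into key/value pairs.
const parseLine = (line) => {
  const pairs = {};
  // Three or more whitespace characters are taken to separate independent pairs.
  for (const segment of line.split(/\s{3,}/)) {
    // Key: a run of capitals, spaces, dots or slashes ending at the first colon.
    // Value: everything after that colon (greedy), even if it contains more colons.
    const match = segment.match(/([A-ZÑ][A-ZÑ .\/]*):\s*(.*)/);
    if (match) pairs[match[1].trim()] = match[2].trim();
  }
  return pairs;
};

// Single space after the first colon: the greedy value swallows the rest.
console.log(parseLine("AGO: MENARQUIA: 10"));
// one pair: AGO -> "MENARQUIA: 10"

// Four spaces after the first colon: treated as two separate pairs.
console.log(parseLine("AGO:    MENARQUIA: 10"));
// two pairs: AGO -> "" and MENARQUIA -> "10"
```

The threshold of three spaces is a heuristic. If your documents use a different gap between fields, adjust the number in the split pattern.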
Secondly, if a value has commas, you could turn that value into an array (except if the comma is part of a numeric value).
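As a sketch of that comma rule (the helper name splitValue is made up, and treating a comma as numeric only when it has a digit on both sides is an assumption):

```javascript
// Hypothetical helper: split a value on commas, but leave decimal commas alone.
const splitValue = (value) => {
  // Split on a comma unless it has a digit on both sides (as in "80,1").
  const parts = value
    .split(/,(?!\d)|(?<!\d),/)
    .map((part) => part.trim())
    .filter(Boolean);
  // Only return an array when there really is more than one item.
  return parts.length > 1 ? parts : value.trim();
};

console.log(splitValue("80,1"));
// "80,1"
console.log(splitValue("MADRE HTA, ABUELA DM."));
// [ 'MADRE HTA', 'ABUELA DM.' ]
```

A list of bare numbers written without spaces (such as "10,20") would stay a single string under this rule, so it is a trade-off rather than a complete solution.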
We can use the power of regular expressions to do this kind of parsing.
Here is a function makeObject and how that could work for your sample input:
You’ll see in the output all keys it could find, even those that have an empty value (like AGO). Just extract from this object what you need.
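Putting the pieces together, a function along these lines could look like the sketch below. This is only an illustration under assumptions: the key pattern, the three-space separator, the comma rule, and the short sample array (modelled on the question's data, with gaps added by hand) are all mine, and the field names in customObj are just examples.

```javascript
// Hypothetical sketch of a makeObject-style parser over the whole array of lines.
const makeObject = (lines) => {
  const result = {};
  for (const line of lines) {
    // Three or more whitespace characters start a new "KEY: value" pair.
    for (const segment of line.split(/\s{3,}/)) {
      const match = segment.match(/([A-ZÑ][A-ZÑ .\/]*):\s*(.*)/);
      if (!match) continue; // not a "KEY: value" segment
      const value = match[2].trim();
      // Turn comma-separated values into arrays, keeping decimal commas intact.
      const parts = value
        .split(/,(?!\d)|(?<!\d),/)
        .map((part) => part.trim())
        .filter(Boolean);
      result[match[1].trim()] = parts.length > 1 ? parts : value;
    }
  }
  return result;
};

// Made-up sample modelled on the question's data.
const sample = [
  "FECHA: 15-12-25",
  "C.I.: 20020202    EDAD: 45 ",
  "APF: MADRE HTA, ABUELA DM.",
  "AGO:    MENARQUIA: 10",
  "HISTORIA CLINICA GINECO-OBSTETRICA",
];
const parsed = makeObject(sample);

// Pick only the fields you need for your own object.
const customObj = {
  fecha: parsed["FECHA"],
  edad: parsed["EDAD"],
  apf: parsed["APF"],
};
console.log(customObj);
```

Keys that appear more than once (FUR shows up twice in the sample) would overwrite each other in this sketch, so you may want to collect repeated keys in an array instead.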