
From an HTML string, I need to build an array of objects that separates text from markup, like this:

[
    {"text": "A "},
    {"markup": "<b>"},
    {"text": "test"},
    {"markup": "</b>"}
]

The HTML I’m using is:

<h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
<p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>

To avoid regular expressions, I first collect all the nodes into an array and then loop over it, checking whether each node is an element or a text node.

Currently I’m stuck on closing tags when an element node has a child text node followed by child element nodes (and I’m not sure whether I’m overcomplicating things):

<p>Lorem ipsum dolor sit...<a href="..." aria-invalid="true">tr</a><a title="..." href="..." aria-invalid="true">...</a><a title="..." href="..." aria-invalid="true">...</a><a href="..." aria-invalid="true">...</a></p>

For this paragraph, my output looks like this:

{markup: '<p>'}
{text: 'Lorem ipsum dolor sit...'}
{markup: '</p>'}
{markup: '<a>'}
//...

As you can see, the closing </p> tag appears right after the text node, before the <a> elements it should enclose. I’ve handled the case of element nodes followed by other element nodes, but this one still escapes me.

This is what I’ve done so far:

const obj = {
    annotation: []
};

const nodelist = (() => {
    const res = [];
    const tw = document.createTreeWalker(document.body);

    while (tw.nextNode()) {
        res.push(tw.currentNode)
    }

    return res;
})();

console.log(nodelist);

const nodeHasParents = (node) => node.parentNode.nodeName !== 'BODY';
const isTextNode = (node) => node.nodeType === Node.TEXT_NODE;
const isElementNode = (node) => node.nodeType === Node.ELEMENT_NODE;

const GetNextNodeElements = (i) => {
    let n = i + 1;
    let res = [];

    while (nodelist[n] && isElementNode(nodelist[n])) {
        res.push(nodelist[n]);
        n++;
    }

    return res;
}

const GetNextTextNode = (i) => {
    let n = i + 1;

    for (let n = i; n < nodelist.length; n++) {
        if (isTextNode(nodelist[n])) return nodelist[n];
    }
}


for (let i = 0; i < nodelist.length; i++) {
    let node = nodelist[i];
    let opentags = '';
    let closetags = '';

    if (isTextNode(node) && !nodeHasParents(node)) {
        obj.annotation.push({"text": node.textContent});
    }
    else if (isElementNode(node)) {
        opentags += `<${node.nodeName.toLowerCase()}>`;

        const currentNode = node;
        const nextNodeElements = GetNextNodeElements(i);

        if (nextNodeElements) {
            nextNodeElements.forEach(node => opentags += node.outerHTML.replace(node.textContent, '').replace(`</${node.nodeName.toLowerCase()}>`, ''));
            nextNodeElements.reverse();
            nextNodeElements.forEach(node => closetags += `</${node.nodeName.toLowerCase()}>`);

            i = i + nextNodeElements.length;
            node = nodelist[i];
        }

        if (!!closetags.length) {
            closetags = `</${currentNode.nodeName.toLowerCase()}>` + closetags;
        }
        else closetags += `</${currentNode.nodeName.toLowerCase()}>`

        obj.annotation.push({"markup": opentags});
        obj.annotation.push({"text": GetNextTextNode(i)?.textContent});
        obj.annotation.push({"markup": closetags});
    }
}

console.log(obj.annotation);

2 Answers


  1. You could use an npm package that converts XML to a JS object or to JSON. Try searching for:

    • XML to JS
    • XML to JS Object
    • XML to JSON
    • Parse XML to JS Object

    One library I found is xml-js. In the browser, you can load it from a CDN such as cdnjs, jsDelivr, or unpkg, whichever you prefer.

    There is also a similar question on Stack Overflow that is worth reading.

    If you insist on doing it yourself, you will end up in compiler-construction territory: finite automata, regular expressions, lexical analysis, and so on.

    A last option is to parse the XML into a DOM, although I think that would be heavy. You may find this interesting: https://www.w3schools.com/xml/dom_intro.asp

    Also, please do not reinvent the wheel. There is probably an existing library or project that does what you are trying to achieve.
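
    For the DOM route in a browser, the built-in DOMParser may already be enough, with no extra library. A minimal sketch, assuming a browser environment: htmlToBody is a hypothetical helper name, and the injectable parser argument exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: parse an HTML string and return the resulting <body> element.
// `parser` defaults to a fresh DOMParser (browser only); pass your own object with a
// parseFromString(html, mimeType) method when no DOMParser is available.
function htmlToBody(html, parser = new DOMParser()) {
  const doc = parser.parseFromString(html, 'text/html');
  return doc.body;
}

// Browser-only usage: guarded so the file also loads where DOMParser is missing.
if (typeof DOMParser !== 'undefined') {
  const body = htmlToBody('A <b>test</b>');
  console.log(body.childNodes.length);
}
```

    The returned body is detached from the page, so it can be walked without touching document.body.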

  2. Recursion makes it easier:

    const tree = []; 
    
    walk(document.body);
    
    console.log(tree);
    
    function walk(parent) {
        for (const elem of parent.childNodes) {
            if(elem.nodeType === Node.TEXT_NODE){
                tree.push({text: elem.textContent});
            } else if(elem.nodeType === Node.ELEMENT_NODE){
                tree.push({markup: `<${elem.tagName.toLowerCase()}>`});
                elem.hasChildNodes() && walk(elem);
                tree.push({markup: `</${elem.tagName.toLowerCase()}>`});
            }
        }
    }
    <h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
    <p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>
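
    Note that the snippet above emits bare tag names, so attributes such as href are dropped, and a void element like <br> would also get a closing tag. If that matters, a variant along these lines might help. It is only a sketch: tokenize, openTag, escapeAttr and VOID_TAGS are names made up here, and the opening tag is rebuilt from elem.attributes rather than copied from the source, so quoting and spacing may differ from the original markup.

```javascript
// Numeric nodeType values from the DOM spec, so the sketch does not need the global Node.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// HTML void elements: they have no content and no closing tag.
const VOID_TAGS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Escape the characters that would break a double-quoted attribute value.
const escapeAttr = (value) =>
  value.replaceAll('&', '&amp;').replaceAll('"', '&quot;');

// Rebuild the opening tag of an element, attributes included.
function openTag(elem) {
  const name = elem.tagName.toLowerCase();
  const attrs = Array.from(
    elem.attributes,
    (attr) => ` ${attr.name}="${escapeAttr(attr.value)}"`
  ).join('');
  return `<${name}${attrs}>`;
}

// Same recursion as walk(), but it returns the list instead of filling a global.
function tokenize(parent, out = []) {
  for (const node of parent.childNodes) {
    if (node.nodeType === TEXT_NODE) {
      out.push({ text: node.textContent });
    } else if (node.nodeType === ELEMENT_NODE) {
      const name = node.tagName.toLowerCase();
      out.push({ markup: openTag(node) });
      tokenize(node, out); // children go between the opening and closing tag
      if (!VOID_TAGS.has(name)) {
        out.push({ markup: `</${name}>` });
      }
    }
  }
  return out;
}

// Browser-only usage: guarded so the file also loads where there is no document.
if (typeof document !== 'undefined') {
  console.log(tokenize(document.body));
}
```

    Because the closing tag is pushed only after the recursive call returns, it always lands after every child, which is the case the flat node list in the question has trouble with.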