Html - Create an array of objects separating what is text and what is markup

kosmosan
June 13, 2023
268 views
0 votes
2 Answers

From a piece of HTML, I have to build an array of objects separating what is text from what is markup, like this:

[
    {"text": "A "},
    {"markup": "<b>"},
    {"text": "test"},
    {"markup": "</b>"}
]

The HTML code that I’m using is this one:

<h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
<p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>

To avoid using RegEx, I first create an array with all the nodes and then loop over them, checking which are element nodes and which are text nodes.

Currently I'm stuck with closing tags when an element node has child text nodes followed by element nodes (and I'm not sure whether I'm overcomplicating things):

<p>Lorem ipsum dolor sit...<a href="..." aria-invalid="true">tr</a><a title="..." href="..." aria-invalid="true">...</a><a title="..." href="..." aria-invalid="true">...</a><a href="..." aria-invalid="true">...</a></p>

So for this paragraph, my output looks like this:

{markup: '<p>'}
{text: 'Lorem ipsum dolor sit...'}
{markup: '</p>'}
{markup: '<a>'}
//...

As you can see, the closing tag appears right after the text node, before the child elements. I've managed it for element nodes followed by other element nodes, but this case still escapes me.

This is what I’ve done so far (codepen):

const obj = {
    annotation: []
};

const nodelist = (() => {
    const res = [];
    const tw = document.createTreeWalker(document.body);

    while (tw.nextNode()) {
        res.push(tw.currentNode)
    }

    return res;
})();

console.log(nodelist);

const nodeHasParents = (node) => node.parentNode.nodeName !== 'BODY';
const isTextNode = (node) => node.nodeType === Node.TEXT_NODE;
const isElementNode = (node) => node.nodeType === Node.ELEMENT_NODE;

const GetNextNodeElements = (i) => {
    let n = i + 1;
    let res = [];

    while (nodelist[n] && isElementNode(nodelist[n])) {
        res.push(nodelist[n]);
        n++;
    }

    return res;
}

const GetNextTextNode = (i) => {
    let n = i + 1;

    for (let n = i; n < nodelist.length; n++) {
        if (isTextNode(nodelist[n])) return nodelist[n];
    }
}


for (let i = 0; i < nodelist.length; i++) {
    let node = nodelist[i];
    let opentags = '';
    let closetags = '';

    if (isTextNode(node) && !nodeHasParents(node)) {
        obj.annotation.push({"text": node.textContent});
    }
    else if (isElementNode(node)) {
        opentags += `<${node.nodeName.toLowerCase()}>`;

        const currentNode = node;
        const nextNodeElements = GetNextNodeElements(i);

        if (nextNodeElements) {
            nextNodeElements.forEach(node => opentags += node.outerHTML.replace(node.textContent, '').replace(`</${node.nodeName.toLowerCase()}>`, ''));
            nextNodeElements.reverse();
            nextNodeElements.forEach(node => closetags += `</${node.nodeName.toLowerCase()}>`);

            i = i + nextNodeElements.length;
            node = nodelist[i];
        }

        if (!!closetags.length) {
            closetags = `</${currentNode.nodeName.toLowerCase()}>` + closetags;
        }
        else closetags += `</${currentNode.nodeName.toLowerCase()}>`

        obj.annotation.push({"markup": opentags});
        obj.annotation.push({"text": GetNextTextNode(i)?.textContent});
        obj.annotation.push({"markup": closetags});
    }
}

console.log(obj.annotation);

Tags: html javascript

Answers

- FarhanMSabran
- June 8, 2023 at 9:50 am
- 0 votes
You could look for an npm package that converts XML to a JS object or to JSON. Try searching for:
- XML to JS
- XML to JS Object
- XML to JSON
- Parse XML to JS Object
I found an interesting library: xml-js. If you are in the browser, you could load the library from a CDN such as Cloudflare cdnjs, jsDelivr, or unpkg, whichever you prefer.

The same question has also been asked on Stack Overflow. You could read further here:
- XML to JavaScript Object
But if you insist on doing it yourself, you will end up in compiler-construction territory: finite automata, regular expressions, lexical analysis, and so on.
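To give an idea of what the hand-written route involves, here is a minimal sketch of a two-state scanner (in text / inside a tag) that splits an HTML string into text and markup entries without RegEx. The name `tokenizeHtml` is hypothetical, and the sketch assumes well-formed tags: a bare `<` in text, comments, and `<script>`/`<style>` contents are not handled specially.

```javascript
// Hypothetical helper: split an HTML string into {text} and {markup} tokens.
// Assumption: every "<" starts a tag and every tag is closed by ">".
function tokenizeHtml(html) {
    const tokens = [];
    let buffer = '';
    let inTag = false; // scanner state: inside <...> or in plain text
    let quote = null;  // quote character currently open inside a tag, if any

    for (const ch of html) {
        if (!inTag) {
            if (ch === '<') {
                // A tag starts: flush the pending text first.
                if (buffer) tokens.push({ text: buffer });
                buffer = ch;
                inTag = true;
            } else {
                buffer += ch;
            }
        } else {
            buffer += ch;
            if (quote) {
                // Inside an attribute value: only the matching quote ends it.
                if (ch === quote) quote = null;
            } else if (ch === '"' || ch === "'") {
                quote = ch;
            } else if (ch === '>') {
                tokens.push({ markup: buffer });
                buffer = '';
                inTag = false;
            }
        }
    }

    // Flush the remainder; an unterminated tag is kept as markup.
    if (buffer) tokens.push(inTag ? { markup: buffer } : { text: buffer });
    return tokens;
}

console.log(tokenizeHtml('A <b>test</b>'));
// [ { text: 'A ' }, { markup: '<b>' }, { text: 'test' }, { markup: '</b>' } ]
```

Even this small version has to track quoting so that a `>` inside an attribute value does not end the tag, which hints at how quickly the edge cases pile up.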

The last thing you could try is parsing the XML into a DOM, though I think this would be heavy. You may find this interesting: https://www.w3schools.com/xml/dom_intro.asp

Also, please do not reinvent the wheel. There should be an existing library or project that does something similar to what you are trying to achieve.

Recursion makes it easier:

const tree = [];

walk(document.body);

console.log(tree);

// Depth-first walk: the closing tag is pushed only after all children
// have been visited, so it always lands after the nested content.
function walk(parent) {
    for (const node of parent.childNodes) {
        if (node.nodeType === Node.TEXT_NODE) {
            tree.push({text: node.textContent});
        } else if (node.nodeType === Node.ELEMENT_NODE) {
            const tag = node.tagName.toLowerCase();
            // Note: attributes are not included in the markup entries.
            tree.push({markup: `<${tag}>`});
            walk(node);
            tree.push({markup: `</${tag}>`});
        }
    }
}

<h2 id="mcetoc_1h1m1ll27l">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</h2>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris at tincidunt lectus.<a href="https://www.sadasdas.es" aria-invalid="true">tr</a><a title="titulo" href="https://www.sadasdas.es" aria-invalid="true">adsf afjdasi k</a><a title="titlee" href="https://www.sadasdas.es" aria-invalid="true">asdsssssssssssss</a><a href="https://www.sadasdas.es" aria-invalid="true">s</a></p>
<p><a href="https://www.sadasdas.es" aria-invalid="true">Lorem Ipsum</a></p>
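The recursive walk above emits bare tag names. If the attributes should survive in the markup entries, a variant along these lines might work. This is only a sketch: `openTag`, `collect`, and `VOID_TAGS` are hypothetical names, the numeric constants stand in for `Node.ELEMENT_NODE` and `Node.TEXT_NODE` so the function only depends on `nodeType`, `childNodes`, `tagName`, and `attributes`, and void elements such as `<br>` are assumed to need no closing tag.

```javascript
// Same values as Node.ELEMENT_NODE and Node.TEXT_NODE in the DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Elements that have no closing tag in HTML.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Rebuild the opening tag, including attributes, from an element node.
function openTag(elem) {
    const attrs = Array.from(elem.attributes ?? [])
        .map(({ name, value }) => {
            // Escape characters that would break a double-quoted value.
            const safe = value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
            return ` ${name}="${safe}"`;
        })
        .join('');
    return `<${elem.tagName.toLowerCase()}${attrs}>`;
}

// Depth-first walk that returns the list instead of filling a global.
function collect(parent, out = []) {
    for (const node of parent.childNodes) {
        if (node.nodeType === TEXT_NODE) {
            out.push({ text: node.textContent });
        } else if (node.nodeType === ELEMENT_NODE) {
            const name = node.tagName.toLowerCase();
            out.push({ markup: openTag(node) });
            if (!VOID_TAGS.has(name)) {
                collect(node, out);
                out.push({ markup: `</${name}>` });
            }
        }
    }
    return out;
}
```

In a browser this could be called as `collect(document.body)`, or, when starting from an HTML string, as `collect(new DOMParser().parseFromString(html, 'text/html').body)`.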
