I want to find and replace some of the *innerHTML* with regex. I want to do this across a whole document, acting on everything except anchor elements.
I thought I could do this with querySelectorAll, by setting it to select all elements except anchor elements. The problem is that the anchor elements are nested within other elements, as in the code below. So even if I exclude anchor elements, I still traverse them in the regex, because they are nested within other elements (below, within a p element) that are caught by querySelectorAll.
My next step was to try to exclude all elements that are the parent of an anchor element. But that results in HTML being missed by my regex search. For example, in the p element below, the text "hello I am some text" is the *child* of the ‘p’ element. So, by excluding the ‘p’ element, that text falls outside the scope of my regex. I need that text to be included in my regex.
<p class="1 2">
<span class="3">
some writing here
<strong class="4">some more here</strong>
</span>
<strong class="5">
<span class="6">
<span class="7"></span>
<a class="8" href="#abc" title="TITLE" id="9">some text</a>
<span class="10">some text</span>
<span class="11"></span>
</span>
<span class="12"></span>
</strong>
hello I am some text
</p>
There are two further complexities. First, the document I need to traverse is very long, in the region of 250,000 words of HTML, all in a complex nested format perhaps 10 to 15 levels deep. Second, it is not a single regex I am running. I have an array of 300 regexes, and I need to traverse the document for every one of them. The point is that it is quite resource intensive and time consuming. At the moment it takes about an hour to run my code. But that code is wrong because it acts on the anchor elements.
I thought of simply removing the anchor elements along the lines:
anchors.forEach((anchor) => anchor.parentNode.removeChild(anchor));
but then I am left with a document that lacks the anchor elements, and I need them in the document; I just don’t want to traverse them with the regex. I thought of recording the location of the deleted anchor elements and reinserting them after the regex, but I will be inserting new spans, which makes it hard to track where each anchor should be reinserted. This method just becomes too complex.
I would be grateful for suggestions as to how to proceed. Is there some way of avoiding traversing **nested** anchor elements?
2 Answers
Try the following:
Starting with the parent <p> element, the .querySelectorAll("*:not(a)") collection will contain all non-a elements. The text nodes among the child nodes of each of these elements are then processed further: in each of their .textContent values, the string "some" is replaced by "lots of".
Select all non-anchor elements, iterate over their child nodes, and change the text nodes.
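One way to sketch this idea, so that text inside an anchor is left alone even when the anchor is nested several levels down, is a recursive walk that changes only text nodes and never descends into a elements. Note that this is a swapped-in variant, not the querySelectorAll("*:not(a)") call itself; the function name replaceOutsideAnchors and the single "some" to "lots of" rule are assumptions made for illustration.

```javascript
// Hypothetical helper: walks the subtree under `node`, applies `transform`
// to every text node, and skips <a> elements together with all their descendants.
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers

function replaceOutsideAnchors(node, transform) {
  // Array.from takes a static snapshot of the (possibly live) child list
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      child.nodeValue = transform(child.nodeValue);
    } else if (
      child.nodeType === ELEMENT_NODE &&
      child.nodeName.toUpperCase() !== "A"
    ) {
      // Recurse into every element except anchors
      replaceOutsideAnchors(child, transform);
    }
  }
}

// Illustrative rule: replace "some" with "lots of"
const someToLots = (text) => text.replace(/some/g, "lots of");

// In a browser this could be called as:
// replaceOutsideAnchors(document.body, someToLots);
```

Because the walk stops at each anchor, the text "hello I am some text" directly under the p element is still processed, while "some text" inside the a element is not.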
The question is hard to comprehend, though: it’s not clear what exactly should be changed inside anchors, so I guess the child elements should be changed too, and I added a span inside an anchor to show this. It would be better if the OP provided a more extended input HTML and added the desired output.
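On the performance point in the question (an array of about 300 patterns over a very long document), one possible arrangement is to visit each text node once and run every pattern against that node's string, rather than traversing the document once per pattern. Below is a minimal sketch of the string-level step only. The { pattern, replacement } rule format and the name applyRules are assumptions, and whether this is actually faster for a given document would need to be measured.

```javascript
// Hypothetical rule format: { pattern: RegExp with the g flag, replacement: string }.
// Applies every rule, in array order, to a single string.
function applyRules(text, rules) {
  let result = text;
  for (const { pattern, replacement } of rules) {
    result = result.replace(pattern, replacement);
  }
  return result;
}

// Two illustrative rules standing in for the full array of 300
const rules = [
  { pattern: /some/g, replacement: "lots of" },
  { pattern: /hello/g, replacement: "goodbye" },
];

const updated = applyRules("hello I am some text", rules);
// updated is "goodbye I am lots of text"
```

Rules run in array order, so a later pattern sees the output of earlier ones; if the patterns can overlap, the order of the array matters.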