I have lots of HTML image elements as a string, but they often contain rubbish I don’t need. How can I remove title, height, class, etc.?
E.g. <img class="img-fruit" src="apple.png" title="apple" height="25" width="25">
Would become
<img src="apple.png">
The order of attributes varies.
Struggling to think of an easy solution, any ideas?
I have tried searching for specific attributes and calculating their lengths to remove them, but it gets messy.
2 Answers
You can use a regular expression to remove unwanted attributes from the HTML image elements string. Here’s a simple example in JavaScript 👇
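A minimal sketch of that regex approach might look like this. The function name stripImgAttributes is made up for illustration, and the sketch assumes that src values are quoted and that no attribute value contains a > character. Attribute order does not matter, because the src value is picked out of each tag separately.

```javascript
// Keep only the src attribute on every <img> tag in an HTML string.
// Assumptions: src values are quoted; no attribute value contains ">".
function stripImgAttributes(html) {
  return html.replace(/<img\b[^>]*>/gi, (tag) => {
    // The leading \s stops this from matching attributes such as data-src.
    const match = tag.match(/\ssrc\s*=\s*("[^"]*"|'[^']*')/i);
    // Rebuild the tag from scratch; an <img> without src becomes a bare <img>.
    return match ? `<img src=${match[1]}>` : "<img>";
  });
}

const input = '<img class="img-fruit" src="apple.png" title="apple" height="25" width="25">';
console.log(stripImgAttributes(input)); // prints <img src="apple.png">
```

Regular expressions are fine for predictable markup like this, but for arbitrary HTML a real parser is the safer choice.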
To clean up <img> tags in a string using C#, removing every attribute except src, you can use the HtmlAgilityPack library. This library makes it easier to parse and manipulate HTML.
Here’s how you can achieve this in C#:
Step-by-Step Solution
Install HtmlAgilityPack:
You can install the HtmlAgilityPack library via NuGet Package Manager.
Define a Function:
Create a function that takes the HTML string, finds all <img> tags, and removes unwanted attributes, keeping only the src attribute.
Load HtmlAgilityPack:
Add using HtmlAgilityPack; at the top of your file.
Parse the HTML string using HtmlDocument.
Select and Clean Tags:
Use the XPath expression //img to select all <img> elements.
For each <img> element, store the src attribute value.
Remove all attributes using img.Attributes.RemoveAll().
Reassign the src attribute back to the element.
Return the Cleaned HTML:
Convert the modified HTML document back to a string using doc.DocumentNode.OuterHtml.