Remove HTML Attributes and Tags from HTML Source

by Andrew Barber 5. May 2011 17:59

This article describes a reusable, encapsulated method for removing either HTML tags or HTML attributes from a string of HTML source, while leaving the content of the tag(s) untouched. The primary use cases are converting old HTML that uses deprecated syntax or tags, and serving mobile web site content when the output may contain attributes or tags that you would not want on a mobile site.

The solution herein is written in C# with Regular Expressions against the .NET Framework version 3.5. It uses extension methods and a lambda expression for convenience, but those can be removed to use this solution in .NET 2.0. I leave that to the intrepid reader.

Removing HTML Tags

Removing tags is by far the simpler of the two. Keep in mind what we want to do is remove a tag, but leave the content of the tag intact. Let's say, for example, we have a bunch of old HTML that is peppered with <font> tags, like so:

<p>
Hello. <font size="2">I am size 2</font>
<font color="red">and I am red</font>
</p>

To quickly fix this content for a modern web site while keeping the text, you want to just remove the font tags and end up with this:

<p>
Hello. I am size 2
and I am red
</p>

To do that, we use the following Regular Expression:

"</?font[^<]*?>"

...and simply call the Regex.Replace() method with it, against the source. This will remove all instances of the tag - opening, closing or self-closing versions - and leave their content intact. It will remove the entire tag, including any attributes which were defined, as the sample above shows. The regex assumes well-formed HTML - so there should be no stray, literal < or > in the source that are not tag open/close symbols. The pattern seeks a tag's opening < with the close-tag symbol / being optional, then the name of the tag. After that, any number of characters other than < are matched, but as few as possible (hence the *?) until the closing > is encountered. This will work on tags which are spread out on multiple lines, with or without attributes set.
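
As a minimal sketch of that direct call, the snippet below applies the pattern to markup like the sample in this section. The names FontTagDemo and StripFontTags are illustrative only, and RegexOptions.IgnoreCase is an added assumption so that <FONT> is matched as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontTagDemo
{
    // Hypothetical helper name; the pattern itself is the one given above.
    public static string StripFontTags(string html)
    {
        // Replace every opening, closing or self-closing <font> tag with nothing.
        // IgnoreCase is an assumption added here so that <FONT> is handled too.
        return Regex.Replace(html, "</?font[^<]*?>", "", RegexOptions.IgnoreCase);
    }
}
```

Calling FontTagDemo.StripFontTags() on the sample paragraph should leave the <p> wrapper and the text as they were, with only the font tags gone.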

A simple method (defined here as an extension method on string) lets you pass in any tag name you wish:

// Accepts only names made up of word characters, so nothing unexpected reaches the pattern.
private static Regex _validAttributeOrTagNameRegEx =
    new Regex(@"^\w+$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
// {0} is replaced with the tag name.
private const string STR_RemoveHtmlTagRegex = "</?{0}[^<]*?>";
public static string RemoveHtmlTag(this string input, string tagName) {
    if (_validAttributeOrTagNameRegEx.IsMatch(tagName)) {
        Regex reg = new Regex(string.Format(STR_RemoveHtmlTagRegex, tagName),
            RegexOptions.IgnoreCase);
        // Remove each matching tag, leaving its content in place.
        return reg.Replace(input, "");
    }
    else {
        throw new ArgumentException("Not a valid HTML tag name", "tagName");
    }
}

As I noted, pretty simple. The tag name is sanity-checked, then simply substituted into the pattern using string.Format().
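
One behaviour is worth checking before using this with short tag names: because the tag name in the pattern is followed directly by [^<]*?, a name such as b also matches longer names that begin with it, such as body. The sketch below contrasts the pattern above with a stricter variant. The variant and the names TagNameBoundaryDemo, RemoveLoose and RemoveStrict are assumptions made for illustration, not part of the method above, and name validation is omitted to keep it short.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagNameBoundaryDemo
{
    // Pattern as given in the article: the name is followed directly by [^<]*?.
    private const string LoosePattern = "</?{0}[^<]*?>";

    // Hypothetical variant (not from the article): a look-ahead requires
    // whitespace, '/' or '>' immediately after the tag name.
    private const string StrictPattern = @"</?{0}(?=[\s/>])[^<]*?>";

    public static string RemoveLoose(string input, string tagName)
    {
        return Regex.Replace(input, string.Format(LoosePattern, tagName), "",
            RegexOptions.IgnoreCase);
    }

    public static string RemoveStrict(string input, string tagName)
    {
        return Regex.Replace(input, string.Format(StrictPattern, tagName), "",
            RegexOptions.IgnoreCase);
    }
}
```

With the input <body><b>bold</b></body> and the tag name b, RemoveLoose strips the body tags as well and returns just bold, while RemoveStrict returns <body>bold</body>.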

Removing HTML Attributes

Suppose we had some HTML which has inline style attributes that we want to remove...

<div style="background:yellow;" class="mainDiv">
Our Div
</div>

This time, we don't want to remove the whole tag; we just want to remove all instances of style attributes in all tags in the string we pass in, to get this:

<div class="mainDiv">
Our Div
</div>

The Regular Expression pattern for this is quite a bit more complex than for removing the whole tag, though the C# code isn't too much worse. I'll start with the C# code first, this time:

// {0} is replaced with the attribute name.
private const string STR_RemoveHtmlAttributeRegex =
    @"(?<=<)([^/>]+)(\s{0}=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)";
public static string RemoveHtmlAttribute(this string input, string attributeName) {
    if (_validAttributeOrTagNameRegEx.IsMatch(attributeName)) {
        Regex reg = new Regex(string.Format(STR_RemoveHtmlAttributeRegex, attributeName),
            RegexOptions.IgnoreCase);
        // Keep what precedes (group 1) and follows (group 3) the attribute; drop group 2.
        return reg.Replace(input, item => item.Groups[1].Value + item.Groups[3].Value);
    }
    else {
        throw new ArgumentException("Not a valid HTML attribute name", "attributeName");
    }
}

The main difference in the code here is the use of a MatchEvaluator delegate, in the form of a lambda expression, instead of a simple string replacement. The lambda simply returns the values of the match groups that hold the parts of the tag before and after the attribute we seek, explicitly leaving out the attribute itself. A fixed replacement string, such as the empty string used for tags, cannot be used here because it would replace everything the pattern matched inside the tag, not just the attribute we are looking to strip. (A substitution pattern such as "$1$3" passed to Regex.Replace() should give the same result as the lambda.)

Each match will have three groups (four, counting the 'whole match' group at index 0): everything up to the offending attribute, the offending attribute itself, and everything after. Hence, our code just outputs the groups at index 1 and 3.
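
To make the three groups concrete, the following sketch runs the pattern (with style substituted for the attribute name) against a div like the one above and returns what each group captured. AttributeGroupsDemo and GetGroups are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeGroupsDemo
{
    // The attribute pattern with 'style' already substituted for {0}.
    private const string Pattern =
        @"(?<=<)([^/>]+)(\sstyle=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)";

    // Returns the values of groups 1, 2 and 3 for the first match,
    // or an empty array when nothing matches.
    public static string[] GetGroups(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        if (!m.Success)
        {
            return new string[0];
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

For <div style="background:yellow;" class="mainDiv">, group 1 is div, group 2 is the style attribute together with its leading space, and group 3 is the class attribute together with its leading space. Concatenating groups 1 and 3 therefore gives div class="mainDiv".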

Now, let's have a look at the regular expression itself, with the string format replacement token replaced with the attribute name 'style' to fit our example:

@"(?<=<)([^/>]+)(\sstyle=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)"

This comes in five different Regex groups: one look-behind, three matching groups, and one look-ahead. The first group is:

(?<=<)

This finds the opening <, and is specified as a look-behind so that the < is not part of the match and does not produce a match group. Next is:

([^/>]+)

This group matches everything in the tag up to the attribute we seek. If the tag closes before the attribute is found, then there is no match on that particular tag. Next is the main capture:

(\sstyle=['""][^'""]+?['""])

This matches the space before the attribute (which can be any kind of white space), the attribute name itself, then the equals sign, quotes and content. Note that this requires the attribute value to be non-empty and properly quoted, with either single or double quotes. This match group is the part which is 'left out' in the code above, thereby removing the attribute. The next group is for what comes after the matching attribute (which could be empty):

([^/>]*)

Then the look-ahead at the end just ensures that either whitespace or the close of the tag comes next; otherwise, no match will be made:

(?=/?>|\s)

Both the third capture group and the look-ahead are required to ensure that you are matching a tag and that any other attributes are collected into the third group so they can be preserved, while still permitting your attribute to be the very last thing in the tag.
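
A few inputs help show where the look-ahead and the character classes matter. The sketch below wraps the pattern in a hypothetical helper (AttributeEdgeCasesDemo.Strip, which skips the name validation) and lists what each call should return. Treat the last two cases as limitations to check against your own markup rather than as an exhaustive list.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeEdgeCasesDemo
{
    // Same pattern as in the article; {0} is filled in with the attribute name.
    private const string Pattern =
        @"(?<=<)([^/>]+)(\s{0}=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)";

    // Hypothetical helper for illustration only: no name validation.
    public static string Strip(string html, string attributeName)
    {
        Regex reg = new Regex(string.Format(Pattern, attributeName),
            RegexOptions.IgnoreCase);
        return reg.Replace(html, m => m.Groups[1].Value + m.Groups[3].Value);
    }

    public static string[] Samples()
    {
        return new[] {
            // Attribute is last in the tag; the look-ahead sees '>'.
            // Result: <p>Hi</p>
            Strip("<p style=\"color:red\">Hi</p>", "style"),
            // Self-closing tag; the look-ahead sees "/>".
            // Result: <img src="a.png" />
            Strip("<img src=\"a.png\" style=\"border:0\" />", "style"),
            // A '/' in a later value; the whitespace branch of the look-ahead applies.
            // Result: <div data="a/b">
            Strip("<div style=\"x\" data=\"a/b\">", "style"),
            // Unquoted value: not matched, so the input comes back unchanged.
            Strip("<div style=wide class=\"a\">", "style"),
            // A '/' in an earlier value stops group 1, so the input is unchanged.
            Strip("<a href=\"/home\" style=\"color:red\">Home</a>", "style")
        };
    }
}
```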

Conclusion

You can put the above C# code blocks into a static class to use as string extensions, or remove the this parameter modifiers so you can just use them as regular static methods.
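
Assembled that way, the result might look like the sketch below. The class name HtmlStringExtensions is an assumption; the two methods and their patterns are the ones described in this article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods must live in a static class.
public static class HtmlStringExtensions
{
    private static Regex _validAttributeOrTagNameRegEx =
        new Regex(@"^\w+$", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    private const string STR_RemoveHtmlTagRegex = "</?{0}[^<]*?>";

    private const string STR_RemoveHtmlAttributeRegex =
        @"(?<=<)([^/>]+)(\s{0}=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)";

    public static string RemoveHtmlTag(this string input, string tagName)
    {
        if (_validAttributeOrTagNameRegEx.IsMatch(tagName))
        {
            Regex reg = new Regex(string.Format(STR_RemoveHtmlTagRegex, tagName),
                RegexOptions.IgnoreCase);
            return reg.Replace(input, "");
        }
        else
        {
            throw new ArgumentException("Not a valid HTML tag name", "tagName");
        }
    }

    public static string RemoveHtmlAttribute(this string input, string attributeName)
    {
        if (_validAttributeOrTagNameRegEx.IsMatch(attributeName))
        {
            Regex reg = new Regex(
                string.Format(STR_RemoveHtmlAttributeRegex, attributeName),
                RegexOptions.IgnoreCase);
            // Keep groups 1 and 3; group 2 is the attribute being removed.
            return reg.Replace(input,
                item => item.Groups[1].Value + item.Groups[3].Value);
        }
        else
        {
            throw new ArgumentException("Not a valid HTML attribute name", "attributeName");
        }
    }
}
```

Because both methods extend string and return a string, the calls can be chained, for example html.RemoveHtmlTag("font").RemoveHtmlAttribute("style").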

