'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · edit-2 7 months ago

TehPers@beehaw.org · edit-2 7 months ago

This advice mostly applies to people who are less experienced and less familiar with just how complex HTML can be. As for other languages - if you’re doing regex on markdown, you’ll probably be fine (but you should verify if you’re writing something for the general case that must not fail). But in HTML’s case:

You have nested languages (CSS and JS)
You have tag-specific rules (img and link are void elements with no closing tag, and a trailing /> on them is optional and ignored, but div must end in a separate closing tag)
Browsers use error correction to try to make sense of invalid HTML, like inserting missing tags. Many websites rely on this behavior.
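
To make the first two points concrete, here's a rough sketch in Python using only the standard library's html.parser. The sample page, the regex, and the Collector class are all made up for illustration. The regex picks up an anchor that only exists inside a JavaScript string, while the parser treats script contents as raw text and accepts img with or without the trailing slash.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one real link, one "link" that is only a JS string,
# and two images written in both void-element styles.
PAGE = """
<p><a href="/real">real link</a></p>
<script>
  var tpl = '<a href="/fake">not a real link</a>';
</script>
<img src="a.png">
<img src="b.png" />
"""

# Naive regex: matches anything that looks like an anchor tag,
# including text inside <script>.
regex_links = re.findall(r'<a\s+href="([^"]*)"', PAGE)


class Collector(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


parser = Collector()
parser.feed(PAGE)
parser.close()

print("regex links: ", regex_links)
print("parser links:", parser.links)
print("parser imgs: ", parser.images)
```

Note that html.parser does not do the browser-style error correction from the third point. For that you'd want a parser that implements the HTML spec's parsing algorithm.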

If you’re trying to use Regex to parse a specific website’s HTML, you’ll be able to get what you want eventually, but as a general HTML parser, there will always be some website that breaks your assumptions.
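
As a small illustration of that, here's a sketch (Python standard library only; SITE_A, SITE_B, and the function names are hypothetical). The regex is tuned to how one site writes its img tags and silently returns nothing on another site that uses uppercase tags, a different attribute order, or unquoted values. A parser doesn't care about those differences.

```python
import re
from html.parser import HTMLParser

# Regex written against one specific site's markup style.
IMG_SRC = re.compile(r'<img src="([^"]+)"')


def srcs_by_regex(html):
    """Return src values for tags that match the site-specific pattern."""
    return IMG_SRC.findall(html)


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def srcs_by_parser(html):
    """Return src values for all <img> tags, whatever their spelling."""
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.srcs


# Hypothetical markup from two sites with different conventions.
SITE_A = '<div><img src="a.png"><img src="b.png"></div>'
SITE_B = "<div><IMG alt='logo' SRC='c.png'><img src=d.png></div>"

for name, html in (("A", SITE_A), ("B", SITE_B)):
    print(name, "regex: ", srcs_by_regex(html))
    print(name, "parser:", srcs_by_parser(html))
```

You could keep patching the regex for each new case, which is the "eventually" part. The parser version just doesn't need the patches.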

DaleGribble88@programming.dev · 7 months ago

HTML parsers scare me. I already knew it was a big job, but this blog post sealed the deal that HTML, err… the web’s interpretation of HTML(?), is one heck of a mess.
https://jakearchibald.com/2023/against-self-closing-tags-in-html/