'''AntiPattern Name''': Parsing HTML with regular expressions

'''Type''': Development

'''Problem''': Extracting information from HTML text, such as elements or groups of elements with a specific property.

'''Context''': The programmer understands how to use regular expressions and sees that HTML looks fairly regular and easy to parse with this tool.

'''Causes''':
* Using a formatted page as a data source instead of seeking a proper data source.
* Not having taken a formal class in parser construction.
* Not reading the formal HTML specification.

'''Supposed Solution''': Write a half-baked regular expression that works for specific test cases.

'''Resulting Context''': The regular expression may work for most cases, but fails on oddly structured but valid HTML or on broken HTML, both of which are all too common on the Wild Wild Web.

'''Common Resulting Problems''':
* The parser breaks when the HTML code changes even slightly.
* HTML code remains in the parsed result.
* Inability to cope with tricky nesting structures.
* Failure to correctly decode entities.

'''RefactoredSolution''': Find a data source that was designed to be used as a data source. Alternatively, use an existing parser of the correct complexity.

----
See also: ParsingHtml, http://stackoverflow.com/a/1732454/699224 for another take on this anti-pattern
----
CategoryDevelopmentAntiPattern
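----
'''Example''': A minimal sketch of the "use an existing parser" option, assuming Python and its standard-library `html.parser` module. The names `SAMPLE`, `NAIVE`, `LinkCollector` and `extract_links` are hypothetical and exist only for this illustration. The sample is valid HTML, yet it has attributes in an unexpected order, single quotes, upper-case tag names, nested markup and entities, which are the cases listed under the resulting problems above.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: valid HTML that is "oddly structured" from the
# point of view of a quick regular expression.
SAMPLE = """<p><a class="x" href='/a?x=1&amp;y=2'>Fish &amp; <b>Chips</b></a>
<A HREF="/b">Plain</A></p>"""

# The half-baked approach: only matches <a href="..."> followed by plain text.
NAIVE = re.compile(r'<a href="([^"]*)">([^<]*)</a>')


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text.
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside a link; nested tags are skipped
        # because only their text content reaches this method.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(NAIVE.findall(SAMPLE))   # the regular expression finds nothing
    print(extract_links(SAMPLE))   # the parser finds both links
```

On this sample the regular expression matches neither link, while the parser-based version returns both with entities decoded and the nested `<b>` markup removed. This is still only a sketch: `html.parser` is a lenient tokenizer, not a full implementation of the HTML specification, so a dedicated HTML5 parsing library may be the better fit for badly broken pages.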