'''Context:''' * You're writing a piece of code that will have to parse a file which is intended to perform some sort of configuration role. Such files range from "name, value" pairs (e.g. .ini files or dictionaries) on through scripting languages, rule sets, or sentences in first order logic. '''Problem:''' * How do you parse the file and get it into memory in a usable form? '''Forces:''' * ''Implementation complexity''. ** Parsing code varies enormously in complexity and in time to implement. Simple brute force approaches (read the file into a string, scan the string using multiple instances of the strategy pattern) are easy to write and can be easy to extend. BNFs take longer to write and fewer people understand them. * ''Execution Speed'' * ''Extendability'' ** File formats always change. Not just in allowed values, but in syntax as well. Simple "name-value" formats suddenly need to handle variable numbers of values as well. Procedural scripting languages suddenly need full recursion. SQL constantly shoe-horns in new functionality. And so on. * YouArentGonnaNeedIt '''Solution:''' * Always use a parser generator. ** In a lot of languages, there are many objects that can simplify parts of the task. For example, in java.io, the F''''''ileReader class can greatly simplify part of the task (getting the file into memory as a string). At which point, it is easy to write code that simply searches through the string for simple patterns. ** This is a bad idea. The code to do so is usually either brittle or inefficient. Maintaining it is nasty. Spend the effort up front to write a grammar. The added flexibility is always (in my experience) worth it. '''Related Patterns:''' * ParsingPatterns * EncourageExperimentation ------- Discussion: Yes, you should always build a grammar before you write a parser. But you don't always have to use a parser generator. If your grammar is simple then it is easy to write a recursive descent parser. To me, it depends on how complicated the grammar is. And how easy to use your parser generator is. And how well you know how to write recursive-descent parsers. -- RalphJohnson Personally, in the tradition of YouArentGonnaNeedIt I prefer to implement that initial version of the Parser on regular expressions. I am unaware of a modern language that doesn't support them, and your grammar can become surprisingly complex before maintaining them becomes a problem. So for your simple name:value/list/etc file the code is very simple and trivially extensible. Of course when this ceases to be the case, ReFactor and use ParseUsingGrammars. Of course if the alternative is to hand write your parser by hand? Well I would disagree with RalphJohnson. File formats do change, and I have never seen a hand written parser that isn't ridiculously brittle in the face of change compared to a grammar. -- AndraeMuys I ''always'' start off by writing a parser by hand. And the result is ''always'' ugly and hard to maintain, so I ''always'' end up refactoring by reimplementing it with lex/yacc or a regular expression library. When will I ever learn? -- KrisJohnson ---- '''Use XML. It comes with a free grammar... and lots of parsers.''' ---- Another option is to use a language such as LuaLanguage, or any language that can read a representation of data structures at run time and reconstruct them in memory (PerlLanguage and PythonLanguage also qualify, as does LispLanguage). The added advantage of using a language in this role is you can not only incorporate a description of the data, but processing on that data as well. Let's say you wanted to create a graphics editor and store your graphics as a list of data structures - a circle here, a line there - with attributes on these. No problem, but you can also incorporate code. Instead of filling an object with the color red, you might provide a function that does a procedural fill. Or you might have a shape best described not as drawing primitives, but as a function. -- JohnPassaniti ---- I have to disagree with this. Using language specific formats makes it really hard to manipulate the data in another language, worse they can become a living nightmare to support as the language matures and various bugs/features change or disappear. This is especially the case if you are embedding executable code in your data formats. Worse still, this can quickly lead to serious security problems as any Outlook user will attest to. Really this comes down to the PowerOfPlainText. -- AndraeMuys Eh... you use PowerOfPlainText and you'll get all the disadvantages of a language specific format. Indeed, I can't think of something much more language-specific than a specific-language - one designed for the particular application. If you require the ability to describe processing, you'll need to obtain it whether you're using a DSL or a generic extension language like LUA. ---- Save yourself the heartache and megabytes of code involved with a parser generator, and use ReversePolishNotation. You don't need to embed all of ForthLanguage -- just order the data and some simple constructor words. For example, given the following bit of XML: Come Home Margeret McAuther 0-123-45678-9 15.99 you can express this quite effortlessly in RPN: "Come Home" title "Margeret McAuther" author 0 123 45678 9 isbn "15.99" price book The lexer for this need only identify strings, numbers, and pre-existing constructors. The constructors build stuff in memory. The result of executing such a "program" is a data structure in memory, suitable for more convenient processing. I used this technique while working for Hifn, Inc. in a project designed to automate the testing of their HIPP series of chips. It worked REALLY well, despite everyone's abject horror of having to learn a little RPN. I was able to deliver the product in days, not weeks or months, it was fully unit tested (how do you UnitTest the output of yacc?), and I've received zero bug reports on it for the remainder of my time at Hifn. -- SamuelFalvo ---- See LexAndYacc, ProcessingMarkupLanguages, AlternativesToRegularExpressions