Although I have only read the first twelve chapters or so, it's already clear (and perhaps not at all surprising) that there are starkly different definitions of beauty here: the book's greatest strength -- and, frankly, its greatest weakness -- is that the chapters are so incredibly varied. For one chapter, beauty is a small and titilating act of recursion; for the next, it's that a massive and complicated integrated system could be delivered quickly and cheaply. (I might add that the definition of beauty in my own chapter draws something from both of these poles: that in software, the smallest and most devilish details can affect the system at the largest and most basic levels.)
If one can deal with the fact that the chapters are widely divergent, and that there is not even a token attempt to weave them together into a larger tapestry, this book (at least so far, anyway) is (if nothing else) exceptionally thought provoking; if Oprah were a code cranking propeller head, this would be the ideal choice for her book club.
Now in terms of some of my specific thoughts that have been provoked: as I mentioned, quite a few of my coauthors are enamored with the elegance of recursion. While I confess that I like writing a neatly recursive routine, I also find that I frequently end up having to unroll the recursion when I discover that I must deal with data structures that are bigger than I anticipated -- and that my beautiful code is resulting (or can result) in a stack overflow. (Indeed, I spent several unpleasant days last week doing exactly this when I discovered that pathologically bad input could cause blown stacks in some software that I'm working on.)
To take a concrete example, Brian Kernighan has a great chapter in Beautiful Code about some tight, crisp code written by Rob Pike to perform basic globbing. And the code is indeed beautiful. But it's also (at least in a way) busted: it overflows the stack on some categories of bad input. Admittedly, one is talking about very bad input here -- strings that consist of hundreds of thousands of stars in this case -- but this highlights exactly the problem I have with recursion: it leaves you with edge conditions that on the one hand really are edge conditions (deeply pathological input), but with a failure mode (a stack overflow) that's just too nasty to ignore.
Now, there are ways to deal with this. If one can stomach it, the simplest way to deal with this is to setup a sigaltstack and then siglongjmp out of a SIGSEGV/SIGBUS signal handler. You have to be very careful about doing this: the signal handler should look at the si_addr field in the siginfo and comparing it to the stack bounds to confirm that it's a stack overflow, lest it end up siglongjmp'ing out of a non-recursion induced SIGSEGV (which, needless to say, would make a bad problem much worse). While an alternative signal stack solution may sound hideous to some, at least the recursion doesn't have to go under the knife in this approach. If having a SIGSEGV handler to catch this condition feels uncomfortably brittle (as well it might), or if one's state cannot be neatly unwound after an arbitrary siglongjmp (as well it might not), the code will have to change: either a depth counter will have to be passed down and failure propagated when depth exceeds a reasonable maximum, or the recursion will have to be unrolled into iteration. For most aesthetic senses, none of these options is going to make the code more beautiful -- but they will make it indisputably more correct.
I was actually curious about where exactly the Pike/Kernighan code would blow up, so I threw together a little program that uses sigaltstack along with sigsetjmp/siglongjmp to binary search to find the shortest input that induces the failure. My program, which (naturally) includes the Pike/Kernighan code, is here.
Here are the results of running my program on a variety of Solaris platforms, with each number denoting the maximum string length that can be processed by the Pike/Kernighan code without the possibility of stack overflow.
| x86 | SPARC | |||
| 32-bit | 64-bit | 32-bit | 64-bit | |
| Sun cc, unoptimized | 403265 | 187225 | 77649 | 38821 |
| gcc, unoptimized | 327651 | 218429 | 69883 | 40315 |
| Sun cc, optimized | 327651 | 327645 | 174723 | 95303 |
| gcc, optimized | 582489 | 524227 | 149769 | 87367 |
As can be seen, there is a tremendous range here, even across just two different ISAs, two different data models and two different compilers: from 38,821 on 64-bit SPARC using Sun cc without optimization to 582,489 on 32-bit x86 using gcc with optimization -- an order of magnitude difference. So while recursion is a beautiful technique, it is one that ends up with the ugliest of implicit dependencies: on the CPU architecture, on the data model and on the compiler. And while recursion is still beautiful to me personally, it will always be a beauty that is more superficial than profound...
Coming from assembler background, it didn't take very long for me to recognize the fallacy of recursion -- problems can be solved "elegantly", but, apart from designing a special harness to "babysit" the recursion loop, the whole thing can overflow the stack, which makes it inherently unreliable.
Elegance of code is more than just beauty: it need not necessarily be short/small code, but it should be as simple as possible, because, as I often teach, simple is robust; and only when one has robust building blocks as foundations, can one hope to build more complex robust solutions. Recursion, alas, does not belong into that category because of its conventional implementation.
Finally, I'll leave with quoting a famous programmer:
"Programming perfection is not when there's nothing more to add to a program, but when there is nothing left to take away."
Posted by UX-admin on July 11, 2007 at 02:27 AM PDT #
Posted by Chris on July 11, 2007 at 08:30 AM PDT #
Posted by Rafael de F. Ferreira on July 12, 2007 at 08:22 PM PDT #
Posted by Calum Mackay on July 13, 2007 at 04:04 AM PDT #
Posted by Bryan Cantrill on July 13, 2007 at 12:01 PM PDT #
There are some peculiarities about Pike’s matching engine. One is that the given algorithm is non-greedy for the star quantifier – that is, a star will try to match as few characters as possible. This is the result of <code>matchstar</code> using a <code>do</code> loop to attempt to match the rest of the regex first before letting the quantifier consume an input character. This is significant insofar as that this non-greedy algorithm requires backtracking.
But if greediness is acceptable, then expressions with a grammar as simple as the one that this matcher accepts can be tested without any backtracking at all! In in this grammar, atoms can only ever consist of single characters, which means the length of an atom is constant. Therefore the regex itself as given is sufficient as a full state machine. (If quantifiers applied to groups of atoms, you would have to compile a state machine from the regex; this is what DFA matchers do. You still don’t need backtracking then, though; it is only actually necessary once you introduce backreferences, which make the expressions non-regular.) So you can write an engine that is driven by the input string, as opposed to having it driven by the regex à la Pike. Such an engine would simply use the position in the regex as its state, and would consumes the input string character by character, checking whether the current character statisfies the current state’s assertion.
This sounds very iterative. Where is the beauty in that?
Well, personally, I don’t find recursion all that beautiful, nor have I found it very natural (in general). This may seem surprising if you consider my propensity for functional programming; but to me, elegance is rather found in treating input as an immutable sequence to be <em>mapped</em> to another immutable sequence. Some other primitives are trivially derivable from this: filtering is simply the act of mapping a sequence to another sequence that omits some of the input elements but is otherwise identical; reducing maps a sequence of arbitrary length to a sequence of length one.
Matching an input against a regex, in these terms, means reducing an input given as a sequence of characters to a boolean.
Note if you write this purely functionally, recursion still enters the picture, because changing state when values are not mutable involves a call. However, this can be written purely tail-recursively.
(Disclaimer: this is just dashed off, so I might have made mixed up some classes of languages and how much computational complexity they entail.)
Posted by Aristoteles Pagaltzis on July 15, 2007 at 02:27 PM PDT #
Posted by Michael Feathers on July 16, 2007 at 11:00 PM PDT #
Posted by Geoff on July 17, 2007 at 02:49 AM PDT #
Posted by josh on July 17, 2007 at 07:14 AM PDT #
Posted by Klein on July 17, 2007 at 08:27 AM PDT #
Michael, as Geoff mentioned, if input is not checked (and I hasten to add that the length-of-death is much lower than 300K -- as low as 37K) this is a potential security problem. Now, as I believe Josh is trying to say, this does not constitute a buffer overflow attack -- an attacker could not use this vector to execute arbitrary code on the stack. But contrary to Josh's assertion that there is "no security problem", this does allow for a potential DoS attack.
That said, to me the issue is less about security and more about failure mode. Michael, while your assertion about tolerances is interesting, the physical world is ultimately a poor analogy for software: this is not physical failure, it is logical failure -- and there's quite a difference. To me, the better analogy is with mathematics: if you prove a theorem for all n less than 38,821, is that proof correct? It depends, of course, on how that proof is labelled: if one asserts that one has proved the theorem for all n when, in fact, it's only proved for n less than 38,821, than the proof is incorrect. If, however, one accurately claims that one has only proved the theorem for n less than 38,821, then the proof is correct -- but it is also clearly a proof of limited consequence, and certainly not one for Erdös's Book. My argument is that it is better to correctly solve a smaller problem (i.e. limited input) than to incorrectly solve a more ambitious problem (i.e. arbitrary input).
Finally, as for the assertion that we "get ourselves in trouble when we assume that functions have to work across all inputs", I could not disagree more -- indeed, I believe that we get ourselves into trouble when we make implicit assumptions that our input will fit some (ill-defined) notion of "reasonable." In my experience, software that makes such implicit assumptions becomes especially problematic as its used as foundation software, introducing new, cascading failure modes into the system. Certainly if one fancies writing software that will itself be used by higher layers of software, one must be paranoid about input and explicit about failure modes.
Posted by Bryan Cantrill on July 17, 2007 at 08:56 AM PDT #
Posted by Bryan Cantrill on July 17, 2007 at 09:06 AM PDT #
Bryan, we can go back and forth about whether it's a logical or a physical problem.. but those are just labels. To me, the crux of this is the fact that the algorithm has certain computational characteristics that place constraints on the environment. If the environment doesn't meet them, we probably shouldn't run it.
You're proof analogy is fine as it stands, but I think we have to question the assumption. Nobody is making the assertion that the algorithm is correct for all N input sequences. We might be able to construct a proof, but that proof would say something about the algorithm on an ideal machine (infinite stack). It wouldn't say anything about what you can expect in the face of physical (whoops, I said it) resource constraints.
I think that the biggest beef I have with your position is that I feel that there is no single standard for correctness or appropriateness of coding style in the industry. What you do in software that you release to thousands of users can be different from what you do in software you use in a tool for a small workgroup. Safety critical is another area, so is teaching software. As well, not all software is in the middle of an application stack. Not all software is distributed either. Some has trusted users.
FIT is a good example. It's heavily recursive, and I imagine it could overflow the stack if the size of the tables it processes are too large. I don't think it's a serious problem, though. FIT is used in work groups where developers are always present. If it happened, the size of tables would be the obvious culprit. Would I want flight control software to be that trusting? No, but that's a completely different domain. There is no "one size fits all" standard for code.
Posted by Michael Feathers on July 17, 2007 at 01:12 PM PDT #
Posted by Bryan Cantrill on July 17, 2007 at 03:03 PM PDT #
Posted by PJW on July 17, 2007 at 07:40 PM PDT #
Posted by Bryan Cantrill on July 17, 2007 at 09:19 PM PDT #
I think a lot of people are missing the point. This is a very concise and elegant solution to a simple pattern matcher. It will work on all practical examples. If you want to treat some arbitrary input from an untrusted user as a match pattern, it's your own fault if you don't verify and place limits on that input first.
A robust regexp library would compile simple patterns like this to a DFA. Or, more likely, include more features such as alternations and remembering submatches, and compile some or all patterns to NFA's. This will be a lot more complicated. For something as commonly used as regexps it may be worth having such a full library, but on small and embedded systems, or for less common tasks for which there isn't likely to be a pre-existing library, it's a great advantage to be able to quickly put together simple, clean code like this.
Regarding tail-call optimization, the reason the example has non-tail calls is because it's using backtracking, and it needs somewhere to store that information. The stack is the easiest place to do this, but you can always move the data into the heap (a technique Scheme programmers are used to).
Below is a straightforward translation of the same code to use the heap instead of the stack, and thus never blows out the stack when compiled with gcc -O2. It's basically the same thing, except you pass around an extra "stack" parameter. matchstar is made simpler by just pushing a backtrack point onto the stack and recursing (and could just as easily do greedy matching).
For technical reasons, matchhere needs an unused extra "int c" parameter to force tail-call optimizations. See http://www.ddj.com/dept/cpp/184401756 for the explanation. This and explicit malloc/free make the code a lot more clumsy than the equivalent in Scheme, where you have GC and tail-calls are guaranteed to be optimized.
-- Alex
typedef struct _matchstack { int c; char *regexp; char *text; struct _matchstack *next; } *matchstack; int matchhere(int, char *, char *, matchstack); int matchstar(int, char *, char *, matchstack); int matchfail(matchstack); matchstack matchpushstack(int, char *, char *, matchstack); int match(char *regexp, char *text) { if (regexp[0] == '^') return (matchhere('\0', regexp + 1, text, NULL)); else return (matchstar('.', regexp, text, NULL)); } int matchhere(int c, char *regexp, char *text, matchstack stack) { if (regexp[0] == '\0') return (1); if (regexp[1] == '*') return (matchstar(regexp[0], regexp + 2, text, stack)); if (regexp[0] == '$' && regexp[1] == '\0') return (*text == '\0'); if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text)) return (matchhere(c, regexp + 1, text + 1, stack)); return (0); } int matchstar(int c, char *regexp, char *text, matchstack stack) { if (*text != '\0' && (*text == c || c == '.')) stack = matchpushstack(c, regexp, text+1, stack); return matchhere(c, regexp, text, stack); } int matchfail(matchstack stack) { int c; char *regexp, *text; matchstack next; if (stack) { c = stack->c; regexp = stack->regexp; text = stack->text; next = stack->next; free(stack); return matchstar(c, regexp, text, next); } else { return (0); } } matchstack matchpushstack(int c, char *regexp, char *text, matchstack stack) { matchstack newstack = (matchstack) malloc(sizeof(struct _matchstack)); newstack->c = c; newstack->regexp = regexp; newstack->text = text; newstack->next = stack; return newstack; }Posted by Alex Shinn on July 18, 2007 at 09:00 PM PDT #
Posted by Bryan Cantrill on July 19, 2007 at 12:33 PM PDT #
This all brings to mind a conversation that Bill Moore once relayed to me: he was talking to a friend of his who is a professional drummer. The friend said that the primary difference between the amateur drummer and the professional drummer is that the professional can recover when things go wrong. In short, the difference between the amateur and the professional is error handling -- and I think this holds true for software as well.
Posted by Bryan Cantrill on July 19, 2007 at 12:33 PM PDT #
Bryan,
There's only a single malloc, which can easily be wrapped in an
if (! ... ) fatal("out of memory);. That one extra line doesn't exactly sully up the code :) More likely, you'd define a safemalloc utility which did this automatically, and the code is then no longer nor less clear.Sure, in general robust, carefully tested code will be longer and more complex than simple code, but this is completely orthogonal to recursion. Recursion is a useful and powerful tool that can lead to cleaner code - especially if your language gives you a tail-call optimization guarantee.
Posted by Alex Shinn on July 19, 2007 at 12:33 PM PDT #
Posted by Alex Shinn on July 19, 2007 at 04:00 PM PDT #
Posted by Bryan Cantrill on July 19, 2007 at 10:04 PM PDT #
Posted by sri on July 20, 2007 at 05:31 PM PDT #