Stupid Questions are Good for You

Jun 23, 2013 by olsner

The point of this post is that you shouldn't be afraid of asking stupid questions, and that you should never punish anyone for asking them. So-called stupid questions teach something to the people who would answer them.

This is related to another unwritten blog post about how noobs are good for you. It's also somewhat related to the long list of reasons to reject questions as stupid, which I find annoying. I haven't bothered figuring out exactly why it bothers me (perhaps a topic for a future post), but others seem to have similar opinions.

Mind you, these observations only apply in some contexts - say, on a project mailing list among people who are already engaged or employed in the project, rather than somewhere that might receive end-user help requests. Among other things, this means that it's in your interest to make sure that whoever is asking makes progress and that you remain friendly with them, and there's probably some kind of competency requirement they've already passed, so you can assume they're not an idiot.

Good Stupid

You might think that either a question is a stupid question or it's not, and that all stupid questions are intrinsically bad. This is what ESR's essay would tell you. But there are Good Stupid questions!

A lot of the Good Stupid questions come from stuff that the incumbent experts would consider obvious, but that is evidently not obvious to the one asking a question. Other Good Stupid questions are about stuff that's technically explained by the relevant FAQ or introduction document, but where the documentation doesn't actually explain it well enough.

Instead of rebuking the asker for failing to RTFM and do their own research, think of the question itself as a symptom of an underlying problem.

For a question like "How do I build foo?" the underlying problem(s) might be:

  • It's not obvious where to find this information
  • The build instructions are not in the right place
  • The wiki's search function is poor (if there's a wiki and it has this information at all.)
  • The naming of wiki pages is not intuitive, e.g. everything uses code names that aren't easy to guess
  • The source repo does not have the relevant documentation
  • The information was easy to find, but simply wrong. Perhaps it was meticulously well-maintained for Foo 1.0 but nothing was updated for the big Foo 2.0 rewrite.
  • ...

I think that (in the case of a new member asking questions) a lot of these causes boil down to institutional memory not getting transferred. You could view the asking and answering as a transfer of those memories, but I rather like to think of it as a symptom of having failed to transfer them in the first place.

For other questions, say something like "How do I use X to do Y?", there will be a different set of underlying causes. Perhaps Y is something that should be provided by X but isn't. Perhaps X is the worst possible way of doing Y, but it's not obvious that X is inappropriate, or the answer that seems obvious to the experts (using the finely tuned Y frobnicator contained in misc/flibnel.c, for example) is in reality quite obscure.

The very best stupid questions are the ones that expose fundamental misunderstandings of important core concepts - considering most people will do anything to avoid having to risk asking a "stupid" question, how many other developers misunderstood the same thing without ever being set right? Or how much time did they waste before realizing it themselves?

In other words, questions that you think are stupid point out things that you haven't made obvious. Stupid questions are information. Learn from that information and use it to make your code/organization/workflows/etc obvious enough that you stop getting stupid questions.

PS. Here's an inspirational video on the importance of asking the stupid question.

mod_rewrite revisited

May 04, 2011 by olsner

I suddenly decided to continue my earlier mod_rewrite experiments, BF+Thue in mod_rewrite and my first excursion in mod_rewrite. That code all worked in theory (and for very small examples), but not in practice: Apache runs out of memory.

The Problem

The basic problem is that Apache never returns any memory to the system while processing a request - Apache expects trivial things such as interpreting Turing-complete languages to complete quickly, without using more than a smidgen of memory.

To get around this, I decided to change the "run-time system" to make a redirect for each step, i.e. send a redirect response back to the user-agent telling it to make a completely new request to a different URL, instead of making mod_rewrite loop on the Apache side until it's done. Apache allocates the memory for each of these requests separately, so its memory use stays bounded (by the size of the program's state, anyway).

Implementing this mostly meant replacing the [N] flag (which jumps back to the start of the list of rewrite rules and continues rewriting) with the [R] flag, which sends a redirect to the client. But of course a few other minor shenanigans were lurking - primarily around the "bootstrapping" step.

The original RTS relied on Apache always having a slash at the start of the URI to bootstrap the program, and relied on Apache not complaining if you removed the slash internally while doing the rewriting. With the new redirecting approach, we need to continue processing from a URI that has a slash (since it's a completely new request from the client) without bootstrapping that URL again. As described in the comments below, we now use a 'q' to tell a bootstrapped program apart from an unstarted one, instead of checking whether the slash was removed.

While I was at it, I also made the compiler output the RTS together with the program, instead of requiring separate program and RTS config files.

Thue mod_rewrite Compiler v2.0

Without further ado, here is the updated Thue to mod_rewrite compiler in its entirety, now including all the RTS it needs!

# Bootstrapping part: output the RTS prologue
1 {
i\
# This file should be included in an Apache2 config file for a VirtualHost. \
\
# Enable rewriting\
RewriteEngine on\

# Hack to make actual files available if you know their names\
i\
RewriteCond /var/www/rewrite%{REQUEST_FILENAME} -f\
RewriteRule ^.*$ - [L]

# This should only run on the first input string to add the interpreter.
# After each redirect, the client makes a new request and we return to the
# first rule, so this must be safe and idempotent.
# An initial q indicates whether we have bootstrapped the program yet; the
# second q is a separator between the output and the current program state.
# We also add a ^ just for fun.
i\
RewriteRule ^/([^q].*)$ qq^$1
# Remove any leading slash to simplify the rest of our processing
i\
RewriteRule ^/(.*)$ $1

# Add an 'r' to distinguish unchanged strings. This would be the termination
# condition for our rewrite system - if the 'r' is still left after running the
# chain of rewrites, we're done and stop looping.
# This rule is probably completely unnecessary since we'll always redirect
# when rewriting, which should already prevent us from reaching the
# end-of-program rule before we're done.
i\
RewriteRule ^(q.*)q(.*)$ $1qr$2
}

# Strip comments
s/^#.*$//
# Rewrite question marks since they interfere with mod_rewrite by making the
# productions look like requests with query strings, which apache then splits.
s/\?/Q/g
# Escape characters with special meanings in regexps.
s/\([.*+(){}^$<>[\\ |]\|\]\)/\\\1/g
# Skip empty lines
/^$/ d
# This is where we're actually doing something worthwhile:
s/^\(.\+\)::=~\(.*\)$/RewriteRule ^q(.*)qr?(.*)\1(.*)$ q$1\2q$2$3 [R,L]/
s/^\(.\+\)::=\(.*\)$/RewriteRule ^q(.*)qr?(.*)\1(.*)$ q$1q$2\2$3 [R,L]/
# End-of-program marker, we want to ignore everything after this line
/^::=$/ { s/::=//; b end }

# Restart processing and input next program line
b

# Output the RTS epilogue after the last line of program text
:end
# If the string has changed, we are not yet done. Loop to the beginning and
# remove the changed-marker. This doesn't change the string at all ('-' is a
# special substitution string that passes along the original string), just has
# the [N] flag to trigger apache to restart rewriting.
# This rule is probably completely unnecessary since we'll always redirect
# when rewriting.
a\
RewriteRule ^(q.*q[^r].*)$ $1 [N]

# If the string is still unchanged, apply output formatting and send to the
# interface CGI script.

# First, remove the unchanged-marker 'r'
a\
RewriteRule ^q(.*)qr(.*)$ q$1q$2
# Finally, pass the resulting string to a simple CGI or PHP script to print it
# to the web browser. [R] means to redirect to it, ending rewriting.
a\
RewriteRule ^q(.*)q(.*)$ /print.php?$1 [R]
# Done. Quit. Get out of here.
q

Testing it out

For obvious reasons I wouldn't want you to test this on my server, but if you want to set this up on your own server, setup should be very easy: set up a new virtual host and let the virtual host config stanza include the compiled output of your Thue program.

One "small" caveat is that pretty much every HTTP client available will only follow a very small number of recursive redirects. To actually see a largeish program (such as hello world) run to completion you will have to reconfigure your HTTP client to follow a very large (a few hundred thousand) number of redirects. Using the command-line curl program, curl -L -g --max-redirs 300000 seems to do the trick.

Some current projects...

Feb 23, 2011 by olsner

A bit of a mixed bag.

I've started working on a compiler for my programming language - see m3.git. Since code generation is boring, it targets LLVM. Anything related to "M++" is gone, and so are all object-oriented features. This is simply an imperative language with modules, and maybe some type-system features.

That hasn't been moving for some time though, due to my replacement project: an operating system. This is going to be a dirt-simple but (at least theoretically) fully functional operating system for x86-64/amd64/em64t, written in assembly. I've decided that it's not going to run C code until I finish support for loadable processes and/or kernel modules. As a general rule I'm aiming for a microkernel structure - in part because microkernels are cool, in part because assembly is hard to write, so I really want to keep the "kernel" part as small as possible!

Now, the kernel thingy needs to do some memory management internally to keep track of various data structures, most of which are smaller than a page. I have a simple page-frame allocator, but now I need virtual memory management (which will reside inside the microkernel part of the OS) and processes, and to do that conveniently I need a kernel malloc for allocating my metadata.

Enter the third, interrupting project: the malloc. Before I can have one working in the kernel, written in assembly, I thought it best to start by implementing a malloc at all. So this is a malloc that you build into a .so (you can also link it statically, if you compile it for that yourself) and load into any program using LD_PRELOAD to replace the C library's malloc. That is, on Linux - I haven't tried it on anything else :)
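
If you haven't played with LD_PRELOAD before, here's a minimal sketch of just the mechanism - not my allocator, only an invented mmap-backed bump allocator that never gives anything back to the OS:

// preload_malloc.cpp - a toy LD_PRELOAD allocator, not the real thing.
// Build: g++ -shared -fPIC -o preload_malloc.so preload_malloc.cpp
// Use:   LD_PRELOAD=./preload_malloc.so ls
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

namespace {
// Bump-allocator state; single-threaded, and memory is never given back.
char  *cur  = 0;
size_t left = 0;

void *grab(size_t size)
{
    size  = (size + 15) & ~size_t(15);       // 16-byte granularity
    size += 16;                              // plus a 16-byte size header
    if (size > left) {
        // Get a fresh chunk from the kernel, at least 1 MiB at a time.
        size_t chunk = (size + (1 << 20) - 1) & ~size_t((1 << 20) - 1);
        void *p = mmap(0, chunk, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 0;
        cur  = static_cast<char *>(p);
        left = chunk;
    }
    char *block = cur;
    cur  += size;
    left -= size;
    *reinterpret_cast<size_t *>(block) = size - 16;  // remember the usable size
    return block + 16;
}

size_t block_size(void *p)
{
    return *reinterpret_cast<size_t *>(static_cast<char *>(p) - 16);
}
}

// The C library's entry points, overridden for every program we're preloaded into.
extern "C" void *malloc(size_t size) { return grab(size); }
extern "C" void  free(void *)        { /* deliberately a no-op */ }
extern "C" void *calloc(size_t n, size_t s)
{
    void *p = grab(n * s);
    if (p) std::memset(p, 0, n * s);
    return p;
}
extern "C" void *realloc(void *old, size_t size)
{
    void *p = grab(size);
    if (p && old)
        std::memcpy(p, old, size < block_size(old) ? size : block_size(old));
    return p;
}

The real thing obviously keeps free lists and so on; the point here is only the packaging - override the libc entry points, build it as a shared object, and preload it.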

I've stolen most of the ideas from a paper about phkmalloc, so I guess it's not very interesting in terms of malloc design. And it's single-threaded (protected by a single lock, if thread safety is enabled), so there will be no scalability contest between this and e.g. jemalloc! Then again, it shouldn't be excessively hard to take this allocator and give it one heap per thread/core, with a shared page pool at the bottom somewhere.

My malloc currently "works" with every program I've thrown at it, except that it doesn't actually free anything back to the OS. Properly implementing that seems to require a few changes to the underlying data structures. Oops! Which is precisely the reason I'm doing this at all - to find these problems now, so I can redo it correctly in a write-once language like assembly :)

Anyway, once the malloc works, the next step is to translate it into kernel code, implement virtual memory management in the kernel, and use it instead of the hardcoded page tables the kernel currently sets up for user processes. Then it'll be time to implement dynamically allocating processes from an empty or existing address space, starting them up, scheduling them, ending them, and waiting for them! That'll be exciting :)

M++ Design, Part 2 of N

Jul 02, 2008 by olsner

A short post about syntax and compiler usage. SPECS (a new syntax for C++, see below) is something I have dabbled with previously and always found quite interesting. Regarding compiler usage, I guess my inspiration is the way Haskell compilation with ghc works (at least in the ghc --make variant). It should never be any harder than that.

Compiler usage

I'm thinking it would be left up to the compiler to keep track of dependencies and determine which modules need recompilation. Rather than taking a list of source files to compile into objects - running the compiler on each source file to make a .o file and then linking your hundreds of .o files into a shared library or executable - you'd tell the compiler either to build module Main (which can have another name) into an executable, or to build a library exposing one or more given namespaces. The names declared in these namespaces would be exported as if they were in the global namespace (following the target C++ ABI, most likely - I would like to keep binary compatibility as far as possible).

Example 1: Compile the program contained in module Main and all dependent modules into a.out.

m++ --main Main -o a.out

Example 2: Compile modules A, B, C into a shared library.

m++ --export A --export B --export C -o libstuff.so

As mentioned in the previous posts, modules referenced in the source (e.g. Net::HTTP) would be automatically looked up as ./Net/HTTP.mpp in the current include path. It is up to the compiler to apply the necessary magic to that file in order to extract the information it needs to compile the current module. I'm thinking there'd probably be a local file containing a parsed representation of the module source, automatically regenerated whenever the source file is newer.

Due to the way this works, I think mutually recursive modules are basically impossible to write in this new language. C++, with its headers keeping declarations separate from implementation files, lets you get around this problem to some degree, e.g. by using forward declarations in the header. It may be possible for the compiler to apply a similar workaround automatically by parsing declarations and implementations separately, but I actually think it is a good thing not to be able to build mutually recursive modules.
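
As a reminder of what that C++ workaround looks like (a made-up pair of classes, with both headers collapsed into one snippet):

// Plain C++, not M++: a forward declaration stands in for #include "b.h",
// which is what breaks the circular include.
class B;

class A {
public:
    void poke(B &b);        // only needs to know that B is a class
};

class B {
public:
    void poke(A &a) { (void)a; }
};

// A's member body goes where B's full definition is visible.
inline void A::poke(B &b) { b.poke(*this); }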

Code generation would be driven entirely by the need to output the exported names (or just main() in the case of --main), so only names transitively used by those functions would be code-generated at all. In ordinary C++, it is very hard to control the set of functions you're exposing to the world. In M++, it should be very easy and very explicit.

Syntax

I originally thought using the syntax of C++ would be a good thing. After all, what I secretly want this language to do is replace C++ entirely, which I thought more likely to happen with a syntax familiar to the old-fashioned C/C++ tradition. That worked wonders for e.g. Java, C# and JavaScript, didn't it? Too bad all of them punted on the opportunity to actually replace C++, settling instead for some niche where you never really needed C/C++ in the first place.

However, since this is a C++ dialect anyway, why not go the extra mile and just throw out all the syntax, and all the ugly legacy that would come along with it? For example, there's this Modest Proposal for a new syntax for C++. Quite an interesting read, and using a syntax like this would probably eliminate a lot of the quirky issues you run into when trying to parse C++. The proposal also suggests fixing a few quirks of the C++ language, such as making this a reference rather than a pointer. One of the most wonderful parts of SPECS (Significantly Prettier and Easier C++ Syntax) is the complete reworking of type syntax - even a complicated type signature, like that of a pointer to an array of pointers to functions returning a pointer to a function, is easily readable in SPECS:

type ComplicatedType : ^ [7] ^(int -> ^(int -> int));

To read the SPECS typedef, just start at the left and read:

pointer to array of 7 pointers to a function taking int and returning a pointer to a function from int to int

Contrast the equivalent C++:

typedef int (*(*(*ComplicatedType)[7])(int))(int);

There is a somewhat reliable technique for reading this kind of nested type definition (start in the middle and go outwards? something like that...), but in my opinion: don't bother. Give the original author a proper spanking and make them rewrite it as a set of typedefs instead :D
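
For reference, here's roughly what those typedefs could look like - the intermediate names are made up, but the resulting type is the same:

typedef int (*IntToIntFunc)(int);         // pointer to function from int to int
typedef IntToIntFunc (*FuncMaker)(int);   // pointer to function taking int, returning the above
typedef FuncMaker FuncMakerArray[7];      // array of 7 such pointers
typedef FuncMakerArray *ComplicatedType;  // pointer to that array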

SPECS would be so much nicer than C++ syntax, but then I'd be talking much more about an entirely new language than about a C++ dialect (even though some of the semantics would be similar).

So: keep C++ and the world of pain it represents or write something using a new (and therefore scary and controversial) syntax?

Real Modules for C++

May 27, 2008 by olsner

C++ sucks. C++ needs a proper module system, where you can actually separate modules from each other rather than tangling them together in circular header include trees. This is it! Or rather, once the thinking about it has terminated and culminated in the start of an implementation, this will have been the start of it. Think of it as a collection of fluffy ideas that will transform C++ hell into the cozy wonderland it should be ;-)

I'm not entirely sure about this, but maybe I'll call it M++ as in "C++ with Modules". In any case, this will become a C++ dialect as there is no hope of not breaking any code. There will also probably never be a very easy transition path for existing large-ish C++ codebases. (That's not even a goal until this thing is sufficiently awesome to motivate someone to convert a large body of C++ code to it...)

What is M++?

Basically, it is C++ with "modules", where modules are somewhat like ordinary C++ translation units, but with special magic for how names are imported between modules.

Importing a module imports a well-defined set of names into the local scope. Contrast this with C/C++, where module importing is implemented by inclusion of header files. (You know this already, but I'll repeat it for the sake of rhetoric.) Header files may #define just about anything and wreak whatever havoc they like on the environment of the files that follow. This means that any kind of higher-level reasoning about the effects of header-file changes on the actual code is pretty much moot. A compiler must recompile every included header file from every including source file every time any part of this pool of mud changes. A wonky define in module A may produce error messages in system headers included from module B, while compiling module C. Modules A and B may not even be your code, or may be code you can't change. World of woe!

With a proper module system, declarations from different modules will never clash, except maybe in the module that is importing conflicting names from more than one module. But that would be caused by code in your module, and you would have the tools to solve the problem!

Guiding principles

  • Keep as much as possible of the syntax and semantics of C++
  • Remove the need for preprocessor inclusion
  • Keep the preprocessor around for e.g. importing external interfaces
  • Replace the current text-level importing with a symbol-table-level import
  • Provide good means of separating unrelated components by:
    • Limiting the set of exported symbols from components
    • Providing easy means to cherry-pick subcomponents (using namespace::name;)
    • Providing robust non-conflicting importing of whole components (using namespace;)

Basic proof-of-something example

For context, this module would reside in a file Main.mpp (or maybe we should just call these files .cpp?), and it exports the class Main::Foo, the method Main::Foo::Bar and the function Main::main(...).

// This is where the *really* cool part is. These modules (namespaces) are
// automatically imported by searching the system module path. Either they
// are defined by local source tree cpp (mpp?) files named e.g. Win32.cpp
// and Net/HTTP.cpp (Net_HTTP.cpp would also work), or they could be
// binary self-describing modules. Nothing said here on how that binary
// self-description would look. That's the magic left as an exercise for the reader.
using namespace Win32;
using namespace Net::HTTP;

// These are some provisional ideas on how to wrap the C/C++ standard library in
// M++ form. More on importing legacy C/C++ functions and classes later on.
using stdlib::atoi;
using stdio::printf;

// This is a private function. It would not be linkable from other translation
// units (default linkage in the top-level is static), but is usable from all
// definitions in the Main namespace below.
void Foo_Bar()
{}

namespace Main
{
    // Normal classes here
    class Foo
    {
    public:
        void Bar();
    };

    // Maybe the main function could move into the Main namespace from the
    // global namespace like this:
    int main(int argc, char *argv[])
    {
        Foo foo;
        foo.Bar();
        Foo_Bar();
        printf("Baz\n");
    }
}

Unresolved/other issues

  • How do you expose things in the global namespace from M++ modules? (These exposed names should follow the relevant C++ ABI and mix and match with C++ code.)

This one is slightly harder than the next question. Many of the possible solutions are bad, so I think I'll just let this one noodle for a while ;-)

  • How do you import external C or C++ classes/functions into M++?

I'm thinking you'd #include the external headers in one module, then use using ::asdf; to import the top-level declarations into the namespace exported from that module. This means that an M++ compiler must be able to understand the full wonderful ambiguity of C++. But hopefully, the modularization through namespaces would mean that only one module out of a large number needs to go through the hassle of actually parsing all that crud in order to build a small symbol table.
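
As a sketch (hypothetical M++, so the file name and the exact semantics are made up), such a wrapper module might look like:

// Posix.mpp: the one module that has to chew through the raw C headers.
// Everyone else just says "using namespace Posix;" and never sees them.
#include <fcntl.h>
#include <unistd.h>

namespace Posix
{
    // Re-export a hand-picked set of the top-level declarations.
    using ::open;
    using ::close;
    using ::read;
    using ::write;
}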

One remaining issue is how to distribute the kinds of macros that are actually required or useful (e.g. file/line-tracking allocation functions, asserts, debug/release-dependent code). Would you include some small number of headers into each module to do that, or would the module system somehow get involved in preprocessing and let macros be imported as part of a namespace?

Getting rid of macros in the first place is a pretty good idea anyway, but exactly how far can you practically take it? Some things like stdint.h define a large quantity of useful macros. In an ideal world, these would be constant variables (const int INT_MAX; etc.) defined in a suitable namespace (perhaps something like stdint, since they come from stdint.h).
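
Something like this, say (the namespace and the assumption of a 32-bit int are just for illustration):

// Constants instead of preprocessor macros: they respect scoping and
// namespaces, unlike a #define INT_MAX dragged in from some header.
namespace stdint
{
    const int      INT_MAX  = 2147483647;    // assuming 32-bit int
    const int      INT_MIN  = -INT_MAX - 1;
    const unsigned UINT_MAX = 4294967295u;
}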

Ant sucks.

Mar 29, 2008 by olsner

I've recently had the opportunity to use Ant quite heavily in a project at work, and have thus come to realize it sucks. Quite hard.

Contents

  • The structure of an Ant "project"
  • What's a build script anyway?
  • How Ant does it
  • Modelling a build process
  • Refactoring: Worsening the Design of Existing Code
  • Bonus: Conditional execution in Ant
  • Conclusion

The structure of an Ant "project"

The top-level element of an Ant project is the Project, represented by an XML element called project. This entity does not actually do anything useful.

A project consists of a list of definitions of properties, macros and targets, mixed with a number of miscellaneous directives (like import). That's the actual stuff - the need for the top-level node seems to be purely a side-effect of XML requiring, for no good reason, a single specific top-level node.

Contrast this with the Makefile - realizing that there is no real need for a "project" thingy, make puts properties (variables) and targets directly at the top level.

I shouldn't bring up syntax this early in the flame, but I really must.

Compare this, the simplest Ant file I can come up with:

<project name="useless" default="all">
    <property name="foo" value="bar" />

    <target name="all">
        ... <!-- see below -->
    </target>
</project>

And the corresponding Makefile:

foo=bar

all:

See the difference? How much of the Ant file conveys actual contents? How much is just crud to explicitly write out the entire Ant AST in the rawest, most explicit form possible? At this point, I'm thinking that the writers of Ant have entirely missed ... everything that the field of CS has come up with when it comes to programming (and other) languages? But they certainly noticed the invention of XML.

What's a build script anyway?

The thing most build systems have in common is what I'll call the dependencies->target structure. That is, when the build tool is run, you give it a set of targets (or use the default defined by the Ant file or Makefile), and the tool recursively builds each target's dependencies before building the target itself, building each target according to some description of how to build it and when it should be built.

Ant is no different there; and make is the canonical example of such a system.

The difference lies in how the systems determine what to build. Make has an intrinsic notion of targets and dependencies being files (rather, it has a notion about file targets and "phony" targets, defined by the Makefile) - if any of the dependencies are newer than the target, the target is rebuilt. If any of the dependencies were out-of-date and rebuilt, their dependents are rebuilt automatically by make.

This is probably old news to you, but I want to emphasize the effect of this property of make: provided with a correct set of dependencies, make does the right thing. And with make's ability to automatically check files, it is very easy to provide that correct complete set of dependencies.

(Granted, it does get harder in make with things like automatically discovering C/C++ header file dependencies, so many makefile writers google for recipes for automatically maintaining and updating dependency files and copy-paste this into their Makefile. But once this code is in the Makefile, make can take care of keeping the dependency files updated just as any other targets are kept updated.)

How Ant does it

Unlike make, Ant doesn't encourage this kind of specification of recursive dependencies. The easiest way to write an Ant file is to take the shell script you wrote (or memorized!) for building everything and translate it line by line into the corresponding Ant syntax for running things in sequence. Let's say the series of shell commands includes generating a whizbang file, like this:

javac Generate.java
java Generate Whizbang.whizbang Whizbang.java
javac Whizbang.java Foo.java Bar.java

It is relatively straightforward to translate this into Ant syntax. But do take note of how irksome it is to do even simple stuff in Ant!

<project name="useless" default="all">
    <property name="foo" value="bar" />

    <target name="all">
        <javac>
            <!-- god forbid you write a path without putting it in an attribute of an XML node -->
            <src path="Generate.java" />
        </javac>
        <java>
            <arg text="Generate" />
            <!-- yes. please. put. every. string. in. a. separate. XML. node. thank. you. -->
        </java>
        <javac>
            <!-- repeat spanking -->
        </javac>
    </target>
</project>

See, Ant is bash-in-XML, with all the pain and none of the benefit. Every word becomes an attribute of an XML node inside the XML node specifying the "verb" of the action. Being able to put the word javac or java directly in the node name is just an effect of ant being so geared towards Java. For any other program, you'll end up with something like <exec><command name="java" /></exec>. Ugly!

Actually, previous versions of Ant supported putting commands as text inside XML nodes, like <exec>gcc -o target</exec>. This has since been deprecated. The producers of angle bracket keys probably bribed the Ant maintainer at the time.

Modelling a build process

This brings me to possibly the most important part of this whole comparison slash rant: the model of a build process enforced by Ant.

Ant: A project consists of a list of "targets" to run, each containing a list of commands and a list of other targets to run before those commands

Make: A Makefile consists of a list of targets (files or phony), each listing its dependencies and a list of commands to run to update the target from those dependencies

Notice the difference between updating a target from its updated constituent parts and building a target by executing dependency targets and commands. In Ant, all you're really doing is calling functions or post-order traversing a tree of targets.

Refactoring: Worsening the Design of Existing Code

The way an Ant file is refactored into independent parts is to extract a bunch of commands doing one thing (for example mkdir cpp-objects followed by one build command for each object) into its own target (for example <target name="cpp-objects">), and then either call that target inline with <ant ...> (this is just like a function call) or add that target as a dependency. The end result is a large set of phony targets that aren't linked to their products or sources. What has been gained? The resulting Ant file does exactly the same thing, just as badly as the original did.

Contrast this with how you'd do it with make. You take the set of commands in the rule you want to split and identify the products produced and the sources consumed by that set. Let's say it's every .o file generated from cpp-sources/*.c.

The canonical way to do that would be something like this:

# The interesting .o files: one per .c file under cpp-sources/
CPP_OBJECTS=$(patsubst cpp-sources/%.c,cpp-objects/%.o,$(wildcard cpp-sources/*.c))
cpp-objects/%.o: cpp-sources/%.c
    $(CC) -c -o $@ $<

other_target: $(CPP_OBJECTS)

Notice how the features of make conspire to reduce duplication in typing and modelling, how easy it is to produce something that properly models the building of the required set of .o files from the appropriate .c files, and how the actual commands are derived from a generic pattern rule. This is what a DSL (Domain-Specific Language) for building looks like. Not like this:

<exec><command name="build-many-cfiles.sh ${cpp_objects}"/></exec>

I suppose there could well be some kind of "for-each" XML thing in Ant, but it's still just so far from what we're modelling. Building is not running a series of commands; it is updating an end product from a set of sources. There is just no way for Ant to actually do this with a model like that!

(Bonus: Conditional execution in Ant)

Then we have the Ant way to build things conditionally (hang on, this'll get hairy! To spare the sensitive reader, this is only pseudo-Ant code).

Let's say I want to do the Thing only when a certain file is newer than the target or the target file doesn't exist.

<doThing>
    <condition>
        <or>
            <and>
                <exists path="target" />
                <exists path="source" />
                <newer left="source" right="target" />
            </and>
            <not><exists path="target" /></not>
        </or>
    </condition>
</doThing>

Wow! That's lean.

Well, the end result is that some Ant hackers get so fed up with making ant do the right thing that they produce atrocities like

<target name="build" depends="clean"> ... </target>

or

<target name="build"><delete dir="output"/> ... </target>

Just to get ant to rebuild stuff that's out of date, accidentally rebuilding everything else in the process. But workstations are fast, right?

Yes. As I might write about some other day, waiting for the compile is no small factor in the subjective irksomeness of any build system. Going from "Yes, there's the typo!" to waiting for a complete rebuild of your project is enough to make any sane coder go batshit insane. If the delay is due to a stupid build system, even more so!

Conclusion

Ant is a poor way to model builds. Ant has syntax so brain-dead, it goes way beyond any bastard child of Java and Cobol in needless verbosity. In some places every, fucking, word, needs, its, own, XML element. And see above for Ant's take on the ubiquitous if statement. In short: Ant sucks.

If Ant had been developed internally at some code shop, it would surely already have appeared on The Daily WTF. And we would all have laughed at the roundabout ways of writing shell scripts, and at how many tokens you need just to do nothing (not to mention what you need to actually do anything at all).

The Real WTF, someone would say, is that these people produced a make replacement that does half as much as make and does it worse!

Then someone would counter with a joke post along the lines of "hey, if this language had conditionals, it'd probably look like this" (see above), at which point the original submitter realizes with a shiver that that is exactly how it was implemented.

Oh, and just for a final rubbing-in, did I mention that Ant sucks?

Literate Haskell Blogging

Feb 05, 2008 by olsner

An unfortunate thing about this markdown framework I've been using is that it requires me to indent code blocks by four spaces (or a tab). Editing such text in a web browser really sucks, so I've been thinking about how to hack literate Haskell support into markdown. Well, it turns out it wasn't that hard after all.

After some really simple modifications to the markdown+geshi patch referenced in my previous blog post, I now have literate Haskell support! The modification literally took less time than writing this post ;-) Find my new patch attached at the bottom of the page.

So - the testing! (This should render as normal text.)

and "this" {-should-} -- be rendered as haskell

Now some normal text in between the haskell blocks, before we give haskell highlighting a run for its money.

and . more . haskell $ [1,2,3]
main = print.takeid20$fix$(.)<$>(:)<>((.((:[]).))(=<<).().(*2))$1 -- print an interesting sequence

End of test.

Anyway, here's the patch for markdown.php (you'll also need to cut-n-paste the highlight_helper.php file from Dougal Stanton's blog).

Thue in Haskell

Jan 27, 2008 by olsner

Yesterday I wrote a Thue implementation in Haskell. At this point the implementation is painfully slow (about 35 minutes to run the brainfuck "Hello World!" through the BF Thue interpreter on my MacBook, compared to about 4 minutes for the Python Thue implementation run under Python 2.5.1). I'm thinking I should a) use ByteString instead of String, and b) use some smarter algorithm for matching the rules (instead of untilDone (foldl applyRule input rules), it'd be something more like compiling an automaton out of all the rules and then untilDone (run automaton)). By reducing the number of rules (eliminating the 50 or so #::=# rules, which are comments in idiomatic Thue but don't strictly need special processing), I could squeeze about twice the performance out of it - that optimization is what brought the execution time down to 35 minutes. I need at least 10x the performance before this program is anything to brag about!

Another idea for the future is to implement parallel Thue by splitting up the state and then doing replacements within the parts until no more replacements can be done in parallel, then running a few iterations serially on the whole state. You could also just synchronize once and move the split point - which would eliminate the problem of deciding when to start going parallel again, since you'd never really go serial. With the language designed to allow random replacement order, any valid Thue program already contains the synchronization needed to make it evaluate the right thing and e.g. output things in the correct order, so this is perfectly safe provided you have a correct implementation and correct programs.

Although it's fun to make things go fast, I think parallel Thue would be much more interesting as it might even be something that's never been done before!

Anyway, if you want to try it out, here's my Thue darcs repository, containing the Haskell source and the BF interpreter in Thue, along with the BF Hello World program I've been using for testing. It should take exactly 287197 steps (and heaps of time) to complete the execution and print "Hello world!\n" as a series of binary numbers separated by underscores.

BF+Thue in mod_rewrite

Jan 27, 2008 by olsner

Based on the mod_rewrite experiment from the other week, I hacked up a small Thue-to-mod_rewrite compiler in sed. For testing I have used the brainfuck interpreter in Thue (as far as I can see, this is about the only example of a "real-world" Thue program). I tested this on the small 3+5 examples included in the BF Thue distribution and tried to test it on BF Hello World, but Apache runs out of memory before completing the rewriting ;-)

Anyway, in this BF interpreter the interpretation is bootstrapped by a circumflex, '^', so the interpreter placeholder I introduced last week had a natural use: simply enter the brainfuck program as a URL, followed by a colon and the input as binary underscore-separated numbers (terminated by 0_), and have patience.

The Thue interpreter needs to keep track of a few bits of state - all of this has to be encoded in some reliable way in the URL we're continually rewriting:

  • Have we added the interpreter? For BF+Thue this is '^', which is really something that has leaked from this one specific Thue program into the interpreter - really ugly, but hey, at least it was easy. Anyway, since we're at the root of the virtual host, we can use the leading slash to encode this information - simply remove it after adding the interpreter, and we can match on ^/(.*)$ to determine whether we should add the bootstrap. (I have since realized that there is not much that prevents a malicious Thue program from outputting a slash as its first character, thereby breaking my entire system. Well, well, in version 2.0 I'll replace this with a more robust encoding.)
  • The current output string. Stored in the first part of the URL, terminated by a 'q' (I'm having real trouble here with selecting characters unused in the program and all its possible states - otherwise matching would be a bitch)
  • Did we rewrite anything in the last pass over the state? If nothing was rewritten, we must terminate the program. After the 'q' that terminates the output string, we store an 'r' before starting to go through the rules, and every time we apply a rule we remove the 'r'. If the 'r' is still there after having tried every rule in the program, we're done and pass the output string to the printer PHP script.
  • Finally, the current program state is stored after the q and r markers and goes on to the end of the URL.

In the end, this became something quite beautiful in its ugliness, although it could certainly use some robustness work. For example, it blindly relies on the Thue program never ever using q or r anywhere (and probably a few other things I haven't thought of). This should be changed to a totally robust encoding of anything the rewrite rules operate on as data, so that it can't interfere with the characters used as control structures and data formatting by the rewrite rules.

Here is the translator sed script in all its regexp-wtf glory, highlighted as Perl since GeSHi unfortunately doesn't have sed highlighting.

(EDIT: Not actually highlighted as Perl anymore since the replacement of WordPress! I have created my own sed highlighter.)

# Remove comments
s/^#.*$//
# Rewrite question marks since they interfere with mod_rewrite by making the
# productions look like requests with query strings, which apache then splits.
s/\?/Q/g
# Escape characters with special meanings in regexps.
s/\([.*+(){}^$<>[\\ |]\|\]\)/\\\1/g
# Remove the ::= terminating line since it doesn't define a rule.
s/^::=$//
# Output rule: ~asf means to print the string "asf" and remove the matched string.
s/^\(.*\)::=~\(.*\)$/RewriteRule ^(.*)qr?(.*)\1(.*)$ $1\2q$2$3 [N]/
# Normal rule: match substring and replace it
s/^\(.*\)::=\(.*\)$/RewriteRule ^(.*)qr?(.*)\1(.*)$ $1q$2\2$3 [N]/

Note that the resulting rewrite rules are meant to be inserted into the mod_rewrite embedded language bootstrap I built in the previous post.

An excursion in mod_rewrite

Jan 21, 2008 by olsner

While setting up my blosxom blog's Apache vhost to rewrite URLs and pass them to the blosxom CGI script, I happened to set up a redirection loop with Apache's mod_rewrite module - by accident - and realized that mod_rewrite might be Turing-complete. (Always my instinct when something has the ability to loop.) After some more thinking, I realized that this Apache module really is a bona fide string rewriting system, which can be used to implement both tag systems and cellular automata, both of which can be Turing-complete. Actually, since mod_rewrite can implement any tag system or cellular automaton, mod_rewrite is Turing-complete.

So, if one felt like it, mod_rewrite could be configured with a Turing-complete set of rewrite rules such that each accessed URL is a program and the final URL is the output (or a link to a CGI script that gives the final output back to the user, with suitable decoding). It's possible (trivial, even) to add a first rewrite rule that bootstraps the rewriting with an interpreter string. It's theoretically possible for this to be something like a Haskell interpreter - imagine the possibilities! - a dog-slow Haskell interpreter implemented in an Apache config file!

Here's a proof-of-concept implementation of a rewrite system framework in mod_rewrite. All that's missing is a set of rewrite rules to implement something interesting, like 99-bottles-of-beer, bon-digi or a Universal Turing Machine. Note that apache by default limits the number of "internal recursions" to 10, and with any serious use of this you're likely to hit that limit. Increase as needed.

# This file should be included in an Apache2 config file for a VirtualHost.

# Enable rewriting
RewriteEngine on

# First off, set a sensible log level so that you can see what goes wrong
RewriteLogLevel 3
RewriteLog "/var/log/apache2/rewrite/rewrite.log"

# This should only run on the first input string to add the interpreter.
# After each run of substitutions, we return to the first rule, so this must
# be safe and idempotent. So, we add the interpreter prefixed by '-' if the
# string does not start with '-'. (This could also be used for input programs
# to replace the interpreter if they like. We could also simply require input
# programs to come with their own interpreters and do nothing here.)
RewriteRule ^([^-].*)$ -INTERPRETER$1

# Add a comma to distinguish unchanged strings. This would be the termination
# condition for our rewrite system. Of course there are other ways to do this,
# for example a regexp which evaluates into the terminate-rule at the bottom.
# Doesn't need to be a comma, of course. If this character is part of the input
# alphabet, it could confuse some things though.
RewriteRule ^(.*)$ ,$1

# This is where the implementation of the entire string rewriting system would
# be. The general format is ,xyz -> xy'z - remove the first comma to mark the
# string as matched and changed (if there was a comma to start with), then
# change the string y to y' (with x and z left unchanged).
RewriteRule ^,?(.*)foo(.*)$ $1bar$2
# ... etc until your rewrite rules implement some turing-complete language, or
# a program that does something interesting.

# If the string has changed, we are not yet done. Loop to the beginning and
# remove the changed-marker. This doesn't change the string at all ('-' is a
# special substitution string that simply uses the original string), just has
# the [N] flag to trigger apache to restart rewriting.
RewriteRule ^[^,].*$ - [N]

# If the string is still unchanged, terminate and apply output-rewriting.

# First, remove the comma
RewriteRule ^,(.*)$ $1 
# Since the example string INTERPRETER is not actually interpreted in this
# example, remove it after applying the rewrite rules.
RewriteRule ^-INTERPRETER(.*)$ $1
# Finally, pass the resulting string to a simple CGI or PHP script to print it
# to the web browser. [L] means to abort rewriting here.
RewriteRule ^(.*)$ /print.php?$1 [L]