Spectemur Agendo: October 2009

Saturday, October 31, 2009

The language I would like to see

So if you read this blog very much you know that I talk about C a good bit. And you might ask "Well why C why not Python, Java or a more modern language?" The answer is that C sits in a niche neglected by most other programming languages. My job is in that niche of embedded systems.

C places very minimal requirements on the environment to the point that its actually fairly easy to write a C program that works directly with the computer's hardware with no OS to help. In fact by using GNU GRUB all you must do is conform to the multi-boot specification, avoid some sections of the standard library and compile with -ffreestanding -fno-builtin and maybe another flag or two. Which is surprisingly easy because few other languages can do that. Which is exactly why C/C++ is the dominate language of embedded systems (C++ is a little harder to get going but it can be done).

However i'm a little unsatisfied with that. I really wish that there was a language with somewhat higher level constructs that could be used. C++ is close to what I want but it often feels ...kludgy that things just could have been done better. Another candidate is D which is a language that's goal is to be C++ "done right". It is a reboot of adding these higher level features to C but with less weird syntax and avoiding known problems in the design of C++.

But anyway some of the features I want would be:

Manual Memory Management with compiler supported hooks for library based garbage collectors
Arbitrary memory manipulation through pointers or something similar
Inline assembly
Object Oriented
The ability to create objects with sub-byte packing of data elements
Very minimal support environment

Most of those features are targeted at working directly with the hardware of a computer. The language would be mainly for operating system/driver developers, embedded system developers and anyone else that ends up working very closely with the hardware.

Wednesday, October 21, 2009

I'm a rel bad spelr

There is a tool that I would like to see. I don't know about you but I am a really bad speller. Its happens quite often I am so far off that spell checking software does not even know what I'm trying to type. So I know code that I write is littered with spelling mistakes. What I would like to see is a Programmer's spell checker.

Now I don't mean one that spell checks just comments, because VIM already does that for me. But one that understands common programmer naming idioms like Hungarian notation, UNDER_BARS and CammelCase. It would split symbol names up and spell check the names against a dictionary like any other spell checker. The spell checker would probably need to be able to do some limited syntax analysis of the program it is spell checking, to separate out keywords and such. To really be useful it would also need very strong support for per-project user dictionary files and the ability to automatically add words from the interface of modules you are using.

Tuesday, October 20, 2009

And The 121 Dollar Word Is!

Idempotent, which is an adjective that describes something that has the same effect no matter how many times you do it. Which is something very impotent to achieve in clean up functions. Because when using manual resource management it can often be nearly impossible to tell whether something has been released or not at any one particular point in the code. So programmers tend to "over release" things to be safe. That is they release things again because they are not sure if it has been done yet or not.

This is much less of a problem in object oriented languages because you can use RAII. But in classic C where I spend most of my day we don't have destructors so that technique is not helpful. For some reason people sometimes write clean up functions that cannot be run twice without problems. Which inevitably causes problems. Because of what I described before. When writing any such function you should very carefully make sure that if it is run on an already released resource can detect that case and do nothing.

Design From Your User's Standpoint

Something that often seems to be lacking in software architecture is designing for usability of the API. Personally when I design something for other programmers (or even myself) to use I am always envisioning how the API will be used. It has brought me into disagreement with some of the other programmers occasionally in design pow-wows. They would be pushing for some interface that they think is easy to use, and I would be trying to steer them away because of issues envisioned in it's use.

One particularity annoying culprit is one of the hardware vendors used by my company. Recently I was working on using their SSL API but was having problems with the connections failing regularly. As it turns out the problem was that in some of my error handling routines I was releasing some of the SSL's resources before it was apparently done with them. Which caused their library to error and fail to connect from then on.

Now you might say "Well just use the library correctly and there wont be a problem". while thats one way of looking at it I believe it to be a poor opinion to have. What the implementers of the SSL library should have done is to more carefully designed their API and library to be easier to use and harder to break.

As one of my previous posts says "Fail safe but fail loudly!" which is one of my overriding design principles. Shift your point of view to being a client of the interface you are implementing and anticipate as best you can the errors another programmer would make. Those errors should be handled in such a way as it does not put your module/object whatever in an inconstant state. Then make sure the programmer knows that they are doing something wrong. Even better redesign the interface so that the number of errors that can be caused by misuse are minimized.

The weak link in software development is not the program but the programmer. We are all fallible. So code to your weakest link and try to make your APIs Idiot proof. Because God knows, when it comes down to it, we're all idiots in our own little ways.

Tuesday, October 13, 2009

Please Be Nice and Include Everyone

Well I saw something today that made my hair stand on end. One of my coworkers had done something he should have never, never, never, never....never, never (did I mention never) done. If you are not familiar with C you should know this. In C a functions' signature must be declared before it is used. That is you cannot call a function without first at least telling the compiler what it's interface looks like. This was a decision made in the early days of the language to make C easer to parse.

To this end C has two concepts Deceleration or saying something exists but not what it is and Definition which is saying what it is. You can't use anything, be it variable, function or whatever without at least saying it exists first. C also has two different file types the *.h file is reserved for the interface of a module and the *.c for the implementation. Which fits into the Deceleration/Definition scheme perfectly. Basically the *.h file is to be used for Decelerations only. It occasionally can contain some Definitions. To see a Definition in a *.h file usually signals a programming error however.

So ideally a C program should be structured so that there is one "module" per *.c file. Each *.c file has a corresponding *.h file with the module's public interface declared in it. Notice that I said "public interface" something that should be done is to take utility functions that have no bearing on the module's purpose out of the header file. You can define/declare them entirely in the *.c file as static. Or if they are more generally useful they can be moved to a sub-module that other modules can use and be declared/defined normally.

This is where the break down comes in. The developer had taken my advice but not really understood it. He had half way done both things. A function had been forward declared in the *.c file that used it but was defined in a separate *.c file. Which is a big no-no. The C compiler uses the deceleration in the *.h file to check and make sure that you are calling the function with the right number and types of parameters.

By forward declaring the function like he did the compiler had no chance what-so-ever of catching this programming error. While in the file where the function is used the compiler would take the forward deceleration as gospel and compile without even a warning. Then while compiling the file the function was in it would take the definition as gospel and compiled fine. However the generated code now is mismatched. If the two function signatures are ever different the compiler will happily call the functions with the wrong parameters. Another of C's many shotguns to blow your foot off.

This is the second time I've seen this in the company. The first time, was in a long existing code base, I was a little flabbergasted and spent a good bit of time cleaning it up. This time all I could do was shake my head and tell them to fix it. So the moral of the story is ...DON'T DO IT!!! Its a huge headache to find when it breaks and it will break. If you are putting a function in a *.c file and calling it from another *.c then make a header for the module. If the function is never used outside the current module then have it entirely inside the module. There is no other right way to do this!

Monday, October 5, 2009

Betting Against the Tortoise and Hare

So apparently one question that comes up pretty often in programmer interviews is "how do you find a loop in a singly linked list?" I've never personally been asked this question but thats the hearsay. The common answer is Floyd's cycle detection algorithm (also called the tortoise and the hare algorithm). It works by having two iterators the tortoise and the hare, where the hare moves through the list twice as fast as the tortoise. The algorithm can find not only a cycle in the list but where it starts (check Wikipedia's article for details). The algorithm takes a few passes through the list to find the information though.

That got me thinking about a slightly different problem. Can we find the first element of the loop with only a single pass in a list that wont fit in memory. The answer that I came up with is: Sort of. At least we can find it with some small probability that we are wrong. Is this optimal ...I doubt it but it was fun to devise. The algorithm relies on a data structure called a bloom filter.

Bloom filters have the ability to keep a tab on things in a set with very little space overhead. However bloom filters have a non-zero probability of saying something is in a set when it is not. So here comes the algorithm (I'm glossing over the exact details of the filter here).

BloomFilter set;
Iterator loop_beginning = null;
float prob = 1.0;

//go through the entire list
for(Iterator iter = list.beginning();
    iter.hasNext(); ++iter) {

    //is the node in the set
    if(set.contains(*iter)) {
        //it is! we may be in a loop.

        //is this the first node in this possible loop?
        if(loop_beginning == null) {
            prob = set.falsePositiveRate();
            loop_beginning = iter;
        } else {
            prob *= set.falsePositiveRate();
        } //end if

        //is the probability low enough?
        if(prob < errorThreshold) {
            //good enough for us return the
            //first node in the loop.
            return loop_beginning;
        } //end if
    } else {
        //nope not in the set. We have not seen
        //this node before so add it to the set.
        set.add(*iter);

        //no loop
        loop_beginning = null;
    } //end if else
} //end for

return null;

The algorithm is pretty easy it goes around the list until the node being examined is in the bloom filter. We could stop there but it could be that we just found a false positive. Bloom filters never say something is not in the set when it is so we check to see that the next few nodes are in the set also. Every contiguous positive result reduces the probability that we are examining a false positive. So we just go until the probability that this is a false positive drops to an acceptable level.

So when would you use this? You probably wouldn't; not unless you have a known upper bound on the number of nodes and you can't have two iterators walking the sequence for some contrived reason. The size of the bloom filter will grow linearly with the upper bound of links but it will be considerably smaller then storing the keys themselves. Even with that space advantage the bloom filter might not fit in memory. But like I said it was fun coming up with this.

edit: I must have been really fried when I wrote that algorithm the first time. So its better now.

edit-edit: Argh! one day ill get that pseudo-code right.

Thursday, October 1, 2009

A Bit on Bits

Long time no post, try to do better...so on and so forth. With that out of the way.

A problem came up recently that I found interesting. The problem is an old one that of counting the number of bits set in a given integer. So I went looking for the correct algorithm to use. The one that I found was elegant and short but took a minute of inspection to understand how it worked.

The algorithm used a strategy called Dynamic Programming. First off Dynamic Programming is about the worst name for this algorithmic strategy because it conveys nothing about how it works. For that algorithmic strategy to be applicable the problem at hand must meet a few conditions. First you must be able to divide the problem into sub problems. Second you must be able to find answer to your problem by solving and putting together the sub problems.

Now you might say "But that sounds like Divide and Conquer" and you would be right. The difference is that in dynamic programming the identical sub-problem comes up multiple times. If you blindly divide and conquer such a problem you end up recomputing the sub-problems multiple times which often leads to exponential time algorithms. So in dynamic programming we keep a table of the answers computed and the problem becomes putting together the sub-answers to find the solution.

So here is the algorithm in pseudo-code ('&' is the bitwise operation a'la C).

onesCount[0] = 0

for i = 1 to 255
    onesCount[i] = onesCount[i / 2] + (i & 0x1)

See, short. This follows the basic structure of any Dynamic Programming algorithm. Basically this builds a look-up table for a single byte (i.e. onesCount[x] is the number of bits set in the integer x). The first line sets up the trivial answer to the problem's base case. Being that the integer zero contains no set bits. Then the for-loop precedes to fill in the rest of the table with the correct values. How does that work?

Now since we start at 0 and work our way upward we know for sure that the value for i / 2 has already been computed. So we know the number of bits set in that number. Well i / 2 basically chops the right most bit off the number (its a one position right bit shift). So that tells us how many bits have been set in all but the one's position and all that remains is to check out if that last bit is set or not. If it is the number of bits set in i is one more then in i / 2. So i & 0x1 is one in exactly that situation.

So thats the dynamic programming algorithm for constructing that look-up table. Do you see how it fits into the dynamic programming strategy. Be on the lookout for other algorithms from the dynamic programming family. They are always interesting.

Spectemur Agendo