Friday, October 30, 2009

4 Dan debugging

So, a 4 Dan programmer must have some pretty awesome debugging skills, right? I've done most of my programming for the last 8 years or so on Linux using GCC and GDB, so I should have intimate knowledge of the full power of GDB, right?

Wrong. In fact, my knowledge of GDB is probably just a small subset of the things you might find in a GDB cheat sheet. Hmm, maybe I should check that assumption. I'm feeling lucky, what is Google's first link for gdb cheat sheet? Ahh, it is a tutorial page on yolinux.com. Ok, I just skimmed that page and here is my list of commands I commonly use:
  1. gdb -e name-of-executable -c name-of-core-file
  2. bt
  3. up number
  4. down number
  5. list
  6. p variable-name
  7. quit
There are two more gdb commands that I use regularly that are not on the cheat sheet:
  1. thread apply all bt
  2. thread number
That's pretty much it. Yes, I vaguely remember a few other commands and can find them using the help command, but I hardly ever use those commands. Note that I nearly always invoke gdb on a core file, and since I almost never debug a running program, I don't use breakpoints, watchpoints, single stepping, etc.


Let me state up front that I sometimes have real feelings of inferiority over my knowledge of debugging tools. I've known other programmers who were strong designers/programmers but who seemed godlike (to me) in their mastery of debugging tools and I have to admit that I have been envious. Yet I've been on teams with programmers like that and my productivity was comparable or better.


So, do I just write perfect code that never needs to be debugged? Absolutely not. I typically invoke gdb several times a day, sometimes dozens of times. Are you ready for me to reveal my secrets?


Ok, here they are: 
  1. assert() is your friend.
  2. Make your assumptions explicit in your code (see #1)
  3. Don't fall into the trap of believing something to be true without evidence.
There, that's basically it. So, how does it manifest in the way I work? Well, as I said I invoke gdb several times a day or more. Most of the times that I invoke gdb, I do so because a program has just terminated due to an assertion that I put in the code. I run gdb with the core dump, look at the stack trace (or possibly several if the code is multithreaded), and probably examine some variables. I do this to figure out where my previous assumptions where incorrect. This generally leads directly to the fix and probably prompts me to add some more assertions.


Sometimes my program terminated for some other reason than an assertion. Then my job is to look at the stack trace and figure out what assertions I should add that will cause the program to instead terminate with an assertion. Most of the time this is relatively easy to do.


As a quick aside, the various Agile programming methodologies put a big emphasis on unit tests. Note that assertions are a kind of unit test.


There are more to assertions than just assert(). It's very useful to understand the concepts of precondition, postcondition, invariant, etc. Understanding these concepts will help you not only in choosing good assertions to use in your code, but how to better structure your code. However, I've think I've said enough about assertions for this post.


Returning to the three points above, let's look at points 2 and 3. Point 2 is almost redundant. It connects points 1 and 3. Point 3 says to make sure you have evidence to support your beliefs about the functioning of your code. Point 1 gives you the mechanism for  obtaining that evidence from your running code. Point 2 simply says just do it.


So the real key concept that must be mastered is the final one. A master programmer must be a kind of zen skeptic. You must never let your beliefs about reality outweigh the evidence that reality gives you, and you must be actively asking reality to give you evidence. You must always seek evidence with an open mind.


A few days ago I read a delightful post by Paul Buchheit titled Applied Philosophy, a.k.a. "Hacking". Paul opens by saying:
Every system has two sets of rules: The rules as they are intended or commonly perceived, and the actual rules ("reality"). In most complex systems, the gap between these two sets of rules is huge.
Paul goes on to discuss how hackers of various kinds exploit the gap in various systems and gives several great examples in different domains. But lets see how we can apply his statement to the problem of debugging.

Computer programs have a great deal of complexity. There is complexity that the programmer puts into the program through her use of programming language constructs. There is complexity that the compiler puts into the program by turning the program into machine code. There is additional complexity that arises from the data the program consumes, the internal state of the program as it evolves. And finally, if the program is multithreaded, there is additional complexity that arises from the complex timing relationships between the threads.

The human mind is amazing in its ability to understand complex systems, but we have to recognize that our understanding is always an imprecise model of reality. We create beliefs about how complex systems work and these beliefs are necessarily dramatic simplifications. So, there is a gap between reality and our belief about reality. To be able to debug complex computer programs we must be aware of that gap and apply various methods to minimize the gap. Or to at least minimize the negative consequences of the gap. We can do that using concrete tools such as assertions, but the biggest gain comes from having a good understanding of the limitations of our own minds.

We can (and all of us do) have bugs in our thought processes that can hold us back in our ability to design, implement and debug. We begin the path to mastery when we begin to debug our own thought processes.


Thursday, October 29, 2009

Spin Locks

In my post Multithreading And Atomic Integers I briefly introduced atomic integer operations and GCC's built-in functions for them. In this post and the next I want to discuss spin locks that can be implemented using the GCC built-ins.

First, consider a very simple implementation of a spin lock:

class SpinLock
{
public:
    SpinLock() : mWord(0) {}
    void Lock() { while (!__sync_bool_compare_and_swap(&mWord, 0, 1)); }
    void Unlock() { mWord = 0; }
private:
    int mWord;
};

To use this SpinLock, allocate one that is visible to all threads, and then in each thread do something like this:

extern SpinLock gLock;
void Foo()
{
    gLock.Lock();
    ... critical section here
    gLock.Unlock();
}

This SpinLock class allows only one thread into the critical section at a time. It does it using just one bit of an atomic integer -- the code would work fine if mWord was declared as a char. If more than one thread wants into the critical section, one will be allowed access, and the others will spin in a very tight loop. Each spinning thread will burn CPU cycles (and memory bandwidth) until it acquires the lock.

It's worth spending a moment here to consider when spin locks are appropriate. One rule of thumb is that you should only use spin locks if you are willing to limit the number of active threads to the number of cores on your machine. Futhermore you should know that your application is going to be the only process consuming significant CPU cycles on the machine. If you use spin locks, you want it to be highly improbable that the kernel will suspend a thread while it holds a spin lock.

If you are willing to limit the number of active threads to the number of cores, you probably have threads that are primarily CPU bound. If the threads are primarily I/O bound, you are probably better off using mutexes and creating more threads than cores. But note that I am not saying that your application can't do I/O. I have an application that reads a large volume from disk yet is clearly CPU bound. It uses one thread per core and all threads add data into shared data structures. The application performs much better (as measured by the rate that it consumes the data) when using spin locks instead of mutexes.

As a final consideration, if you are going to be able to use spin locks, you probably are using lots of them. If you have only one global mutex, you don't want to replace it with a spin lock. Spin locks work well when they are fine grained and dispersed throughout a large data structure. Perhaps you have a large graph data structure and need a critical section whenever you update any link of this data structure. Each link would then need its own mutex or spin lock. If you have millions of these links then spin locks can be a great choice. The probability of contention on any one link is probably very small and it is likely that the critical sections are short in term of CPU cycles. If you tried to use a Posix mutex for every link, you might find that the memory footprint is prohibitive -- one Posix mutex requires 40 bytes on the Linux platform I use. A spin lock can use as little as one byte.

Let's return to the implementation above. There is one subtle problem with this implementation: if multiple threads contend for the same lock, there is nothing to ensure fairness. Suppose thread A obtains the lock and starts doing the work of its critical section. While it is doing the work, first thread B, then thread C attempt to obtain the lock. Since B was 'first in line', it would be unfair for C to be the thread granted the lock next, but that quite possibly will happen.

If this kind of fairness is a serious requirement for a given application then that application probably shouldn't be using spin locks. But we can implement a spin lock class that ensures fairness. Here is one possibility:

class FairSpinLock
{
public:
    typedef unsigned char Byte;
    FairSpinLock() : mReserved(0), mReleased(0) {}
    void Lock() { Byte resv=Add(&mReserved); while (Add(&mReleased,0)!=resv); }
    void Unlock() { Add(&mReleased); }
private:
    static Byte Add(Byte* val, Byte x=1) { return __sync_fetch_and_add(val, x); }
private:
    Byte mReserved;
    Byte mReleased;
};

Here we use two different atomic integers. Both are single bytes and we only use the Add operation, which is defined using __sync_fetch_and_add(). The first byte mReserved is a running count of how many reservations have been made. The second byte mReleased is a running count of how many acquired locks have been released. A thread holds the lock when it finds that all prior reservations have been released. It holds the lock until it releases its reservation.

There is beauty in the simplicity of this implementation. Note that both counters wrap around. But wrap around is not a problem since we only check for equality. Using bytes means we are limited to no more than 255 pending reservations. Just make sure you don't use this code as is on your machine with 256 or more cores.

So far we have only considered simple mutual exclusion. But there are cases where you want to allow multiple readers as long as write operations are excluded. These are called read/write locks. A write lock is a mutual exclusion lock as we have discussed so far. While a thread holds a write lock, no other thread can obtain either a read or a write lock. A read lock excludes other threads from obtaining a write lock, but allows other threads to obtain concurrent read locks.

So, can we implement read/write spin locks? Yes, we can, but I'm going to save that for a future post. It is a fun exercise to try to implement a read/write spin lock, so you might want to try before reading my follow-up post. First start with the FairSpinLock implementation and extend it, while keeping the fairness feature. Then see if you can implement a read/write lock using a single atomic integer instead of two. I don't know how to do this while still ensuring fairness, but I have a working implementation with a single atomic integer that does not ensure fairness. I'll discuss both implementation in the follow-up post.

Friday, October 23, 2009

The Siren Call of Abstraction

To be a great programmer you must master the use of abstraction. I don't care how big your brain is, if you can't abstract away details, your code will get bogged down in them. To a large degree, modern languages are all designed to help programmers with tools for managing complexity through powerful abstractions.

I know some good programmers who are quite productive despite having mostly low-abstraction tools in their tool set. For example, I know programmers who implement almost any collection of objects using linked lists, with the link implemented as a next pointer embedded inside the object, something like this:

class SomeObject
{
public:
    SomeObject(SomeObject* next) : mNext(next) {}
    SomeObject* Next() const { return mNext; }
    ...
private:
    SomeObject* mNext;
};

This is a fine choice for some problems, but it is a suboptimal choice for most problems. One pragmatic reason is that objects implemented this way can only belong to one collection at a time. The bigger problem comes simply from coupling together the two concepts of Object and Containment. If the programmer sees the next pointer as just another attribute of the object, then the methods of the object will probably blur the different operations, with one line of code manipulating the object's state, and the next line of code manipulating the container state, and no clear demarcation between the two fundamentally different abstract operations.

But what does this have to do with The Siren Call of Abstraction? Well, good programmers can be productive despite not having mastered the use of abstraction. But being obsessed with abstractions can be a major hindrance to productivity. A couple different posts crossed my RSS feed today that made me think about this topic.

First, Joel Spolsky wrote recently about the virtues of Duct Tape Programmers. He sings the glories of programmers who get the job done with simple tools analogous to duct tape and WD-40, and ridicules programmers who love multiple inheritance. As usual Joel is a fun read.

But Joel only ridicules some caricature of a programmer. DPP's tweet today was about a specific programmer and is far more damning, so much so that I feel sorry for the subject of his scorn.

Multithreading and atomic integers

One of the challenges for programmers today is taking advantage of machines with many CPU cores. Two-core machines are so common now that odds are good that your laptop computer has two cores. Most server-class machines already have at least four cores. Machines with 16 or 32 cores are being widely deployed and it is likely that in the near future we'll see machines with considerably more. Consider that Intel demonstrated 80-core processors two years ago!

Writing applications that take full advantage of many cores is difficult. If you can write your applications such that an N-core server runs N CPU-bound processes with no interprocess communication required then you can probably keep all N cores busy. There are applications like this -- probably anything that runs on BOINC or comparable loosely coupled grid services is in this category -- but I'm more interested in applications that aren't as easily decomposed into many independent processes. Instead, let's assume we are developing applications that run as a single process with at least one thread per core and the threads must work with shared resources, in particular shared data structures in memory. Here we need various mechanisms to assure the integrity of the shared data structures.

The most common mechanism for assuring data integrity is the mutex. In POSIX there is the pthreads API, which includes functions for working with mutexes. With this API you can guard a critical section of code like this:
pthread_mutex_lock(&mutex);
... // critical section here
pthread_mutex_unlock(&mutex);

If two or more threads attempt to execute this code simultaneously, the kernel will insure that only one thread can be executing the critical section at a time. The first thread to acquire the lock is allowed to execute the critical section, and while it holds the lock, any other thread that attempts to acquire the lock will be blocked, i.e. the scheduler will stop execution of the thread and allow some other thread that is ready to execute to continue.

Suppose your critical section takes only on the order of a hundred CPU cycles to execute and the section is executed on the order of thousands of times per second. Let's assume the cost of acquiring an unlocked mutex is negligible, but that there will be significant contention for the mutex. Perhaps there is a 10% chance that when a thread tries to acquire the mutex it will be blocked and the kernel will do a context switch. When this is the case, the overhead introduced by the mutex can be huge. It's even possible that an application running on multiple cores might not run much faster than if it was rewritten to run single-threaded without mutexes.

Consider the case where the only operation performed in the critical section is to increment a global counter, but that counter may be incremented hundreds of thousands of times per second. In this case, the overhead of using a mutex is quite possibly unacceptable.

But perhaps you are wondering if just incrementing an integer requires any kind of synchronization? Let me just give the quick answer that if you are programming in C/C++ and only have POSIX APIs available to you, then yes, you absolutely do need a mutex. Suppose we have this function:
void foo()
{
extern int gCounter;
++gCounter;
}


The code your compiler will generate is most likely three separate operations:
  1. Load from RAM into CPU register
  2. Increment CPU register
  3. Store from CPU register into RAM.

Suppose two threads running on 2 cores of a multicore processor both perform these operations simultaneously. What happens? We want the integer to be incremented twice, but it is entirely possible that when both threads have completed these three operations, the value in memory is only one count larger than it was before.

The problem is that the three operations are not performed atomically. It is certainly possible for a CPU designer to provide a single instruction that performs the three steps atomically (and as we'll see, they have), but such instructions will be more expensive to execute than the three primitive operations. So, modern CPUs typically provide both kinds of instructions. However, the C and C++ language definitions don't provide a standard way to access the atomic operations, and POSIX does not define an API for atomic integer operations.

Of course, most compilers provide various extensions, such as the ability to write inline assembler, but these extensions are nonportable. One ramification of the non-portability is that they aren't documented in common sources of documentation. Consider that your humble 4 Dan only just recently stumbled upon the atomic builtins introduced in GCC 4.1.1. For example, there is this operation:
type __sync_add_and_fetch(type* ptr, type value);
For type, GCC will allow any integral scalar or pointer type that is 1, 2, 4 or 8 bytes in length, though the documentation is not entirely clear on the restrictions that might apply on different CPU architectures. I can confirm that the built-in performs correctly on 64-bit Intel Xeon and AMD Opteron processors using the long integer (64-bit) type. I wrote a relatively simple stress test to confirm this, which might be the subject of an upcoming post.

So, to be clear, the builtin behaves as if there are functions declared as
inline char __sync_add_and_fetch(char* ptr, char value);
inline short __sync_add_and_fetch(short* ptr, short value);
inline int __sync_add_and_fetch(int* ptr, int value);
inline long __sync_add_and_fetch(long* ptr, long value);
Or if you are comfortable with C++ templates, like this:
template <typename T>
inline T __sync_add_and_fetch(T* ptr, T value);

So what does __sync_add_and_fetch() do? Well, as you probably guessed, we can use it to do an atomic increment:
int result = __sync_add_and_fetch(&gCounter, 1);
This is an operation we can safely do from several threads on a multicore machine, and know that gCounter will accurately reflect the actual number of operations performed. Note that __sync_add_and_fetch() returns the result of the increment. If the code you are executing in your thread needs to know the current count after the increment, it should do as I've shown here, and not simply access gCounter again. To illustrate why, suppose we need to take some action every thousand increments. Assume we write code that looks like this:
__sync_add_and_fetch(&gCounter, 1);
if ((gCounter % 1000) == 0)
DoActionOnRollover();

This code won't behave as we intend. The function DoActionOnRoller() might not get called when it is supposed to, and it might get called more than once. But let's rewrite it this way:
if ((__sync_add_and_fetch(&gCounter, 1) % 1000) == 0)
DoActionOnRollover();


Now we can be sure that DoActionOnRollover() will get called once and only once per 1000 increments of gCounter.

There is quite a bit more interesting material here, but this post is already long enough. Future posts will cover some performance measurements and other things you can do with the GCC Atomic Builtins.

Shu Ha Ri

Hi. Welcome to my blog. However you arrived here, you might be wondering about the title. Here's a quick explanation.

About six months ago (as I write this) Christian Neukirchen posted on his excellent tumblelog Trivium a link to a page titled Shu Ha Ri. Shu Ha Ri is a Japanese concept that claims that the path from novice to master goes through three stages, with Shu being the novice stage and Ri being the expert stage. But note that entering the Ri stage is just the beginning of mastery. Those of us in the west who are only superficially familiar with eastern martial arts might be surprised to know that obtaining a black belt in a martial art is merely recognition of entering the Ri stage. A person who has just obtained their black belt is 1 Dan, but as they continue on the path of mastery they will advance to 2 Dan, and so on, and perhaps someday reach the rarified level of 9 dan, or perhaps be given the extreme honor of being called 10 Dan.

The Japanese use this rating system for various martial arts and also for the board game Go. I've never studied any martial art but I spent a couple years being nearly obsessed with Go. Most people can learn the rules of Go in about 10 minutes and will play at a level of about 30 Kyu. As beginners progress, they count down to 1 Kyu and then transition to 1 Dan. Go is a fascinating game and the ranking system is very useful, because it can be used to handicap games. Players that are ranked 9 levels or less apart can play a game where the weaker player is given extra stones (placed in established locations) at the start of the game. The system works remarkably well, such that both players can have an interesting and challenging game.

The Shu Ha Ri article was the catalyst that has prompted me to start this blog and indirectly led me to the title 4 Dan Programmer. Please don't take the title literally. There is no system for ranking software engineers. If there were I expect it would be difficult to apply it meaningfully and consistently to the diverse landscape of software engineering. But even if we could devise such a rating system, I really have no idea if I would be 4 Dan. Rather, I just want to convey that I consider myself well into the Ri stage of software engineering, yet still have far to go to achieve the level of mastery that I know other engineers have achieved. I know this from real experience, having been fortunate enough to have worked at Google for two and a half years, where in my view the majority of engineers were "Dan-level" programmers. Google's unique performance review process relies extensively on peer review and engineers are ranked on a well documented scale. There were hundreds of engineers ranked at my level and over a hundred engineers deservedly ranked at higher levels. I might add that I worked with several younger engineers ranked lower than me who proved on numerous occasions that old age and treachery is not always superior to youth and skill. I enjoyed learning from them.

Which brings me to a key point I want to make. I do hope that I may be of some benefit to a few other engineers looking to improve their mastery of software engineering, but I won't consider this blog a success if I learn nothing from the process. Whatever your skill level, if you are moved to question or critique what I have to say, I hope you will so that we both benefit from the exchange.