Today a short story about how a simple mathematical operation can ruin your whole day. This is something that I had the ‘pleasure’ of dealing with about 6 months ago, costed me 2 days of debuging and (as usual) turned out to be a surprising, yet basic Java feature. Because of that I believe it is worth to warn you all. The problem I was dealing with was slightly more complicated, but for the sake of simplicity it can be defined in the following way:

Let’s say that you have few machines and a million independent computations (tasks) that you want to run on them. The problem is how to distribute the tasks among the machines in a way that meets following conditions:

  • every machine gets similar number of tasks
  • all tasks are executed
  • you use all your machines
  • the distribution is ‘random’

First three conditions are self-explanatory. The randomness is important, as the execution time for each task may vary by few orders of magnitude and you do not know how long it will take upfront. Without the randomness you risk that any used distribution pattern ‘by accident’ may send all the heavy tasks to one machine and overload it. On the other hand it can’t be really random – we may want to repeat the process to debug it and/or control the data processing. So what can we do?

No worries, there is a plan! We could use hashing! To be exact, for every task we could calculate a hashcode and use distribution algorithm similar to the one used in HashMap to distribute objects in a array. To be specific: let’s say we have N servers with IDs 0, 1, …, N-1. We will execute given task T on a server with ID equal to T.hashCode() % N. Since hashCode values are meant to be ‘randomly distributed’ they are perfect for this task. What is more they are not really random, so any given task will be executed in every run on the same machine. Perfect!

One more thing: because of performance reasons (that do not affect this puzzle, I promise!) instead of distributing the tasks by a centralised system we will just send all tasks to all servers and let servers decide whether or not they will execute them. One, two, three and the (simplified) code is here! First, there is a Server class – each server knows its ID and the total numbers of working servers. All task distribution logic is in the performTasks() method, which accepts all tasks and decides whether to execute them or not. Decision is based on the algorithm discussed above – use the hashCode() to get a ‘random’ value, get its remainder from dividing by the number of working servers and if you’ll end up with a value equal to server’s ID then you know server should execute it. Simple!

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
public class Server {
    private final int serverID;
    private final int serverCount;

    public Server(int serverID, int serverCount) {
        this.serverID = serverID;
        this.serverCount = serverCount;
    }

    public void performTasks(Object[] tasks) {
        for (Object task : tasks) {
            // We check whether this task
            // should be done by this server.
            if (task.hashCode() % serverCount == serverID) {
                // Instead of 'performing the task' we just print
                // the information about it.
                System.out.println("Server (ID="
                        + serverID + ") performing " + task);
            }
        }
    }
}

Now lets make it all work! Here is a simple class with main() method that creates few easy tasks, few servers and does the shabang. It passes all tasks to all servers and since servers execute tasks just by printing them to console we can just wait and see how the system works!

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
public class TasksEasy {
    public static void main(String[] args) {
        String[] tasksEasy = {"easyTask1", "easyTask2", "easyTask3"};

        Server[] servers = {
            new Server(0, 3),
            new Server(1, 3),
            new Server(2, 3)};

        for (Server server: servers) {
            server.performTasks(tasksEasy);
        }
    }
}

So what will get printed out? Anyone?? Well… no surprises, the distribution algorithm is awesome, so the output is like that:

Server (ID=0) performing easyTask2
Server (ID=1) performing easyTask3
Server (ID=2) performing easyTask1

Well, the luck is on our side. Not only everything worked out, but we even got all our tasks evenly distributed! Hmmm, wait… if everything would be so great I would not waist your time about that… so what is wrong? Before I’ll tell you where the bug is hidden (the post title helps, hehehe!), lets run our code again with different tasks, just for fun!

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
public class TasksHard {
    public static void main(String[] args) {
        String[] tasksHard = {"hardTask1", "hardTask2", "hardTask3"};

        Server[] servers = {
            new Server(0, 3),
            new Server(1, 3),
            new Server(2, 3)};

        for (Server server: servers) {
            server.performTasks(tasksHard);
        }
    }
}

So, what’s the output for the ‘HardTasks’? Hmmm… wait… strange…

Server (ID=0) performing hardTask2

What have happened with ‘hardTask1′ and ‘hardTask3′?? Why no one took care of them? Let’s try to analyse that: each server computes task.hashCode() and takes the remainder from dividing it by three, we have three servers with IDs 0, 1 and 2. It seems there is no way the task can’t be executed, right? I mean if you divide by three and take the rest you’ll get either 0, 1 or 2… right?

Wrong. Check the specification of the Java remainder operator (%), you’ll see that its defined as satisfying the following equation:

(value / divider) * divider + value % divider == value

Still do not see the problem? In plain English: its obvious that 5 % 3 == 2, right? What is not obvious is that -5 % 3 == -2! Yes, when you take reminder of a negative value in Java the result is also negative (non-positive to be exact). That’s exactly what triggered the bug and made us lose ‘hardTask1′ and ‘hardTask3′. The hash codes for the easy tasks were (by chance, haha!) all positive and therefore our algorithm worked. The hard task’s hash codes were on the contrary all negative (again, by chance!) and that is when the bug appeared.

In the real-life situation we have found out that there is a problem in the code because out of 10 billion tasks scheduled for execution we got ~5 billion results. Me and some other coders were going trough the code line-by-line for hours, but since taking the remainder looked so safe and innocent no one have even suggested that this is the line that fails!

So what’s the lesson here? Write Unit Tests for every code that does logic/algorithmic job! Some even say, that if you have an ‘if’ statement in the code, do a Unit Test for it. Second thing: beware of how % operator works, its not that obvious as some might thing and that can bite you in the behind. If you read Joshua Bloch and Neal Gafter’s Java Puzzlers (btw: totally recommended!), you probably remember that the first puzzler discussed in the book is exactly about this bug. Its a nasty one!