Today a short story about how a simple mathematical operation can ruin your whole day. This is something that I had the ‘pleasure’ of dealing with about 6 months ago, costed me 2 days of debuging and (as usual) turned out to be a surprising, yet basic Java feature. Because of that I believe it is worth to warn you all. The problem I was dealing with was slightly more complicated, but for the sake of simplicity it can be defined in the following way:
Let’s say that you have few machines and a million independent computations (tasks) that you want to run on them. The problem is how to distribute the tasks among the machines in a way that meets following conditions:
- every machine gets similar number of tasks
- all tasks are executed
- you use all your machines
- the distribution is ‘random’
First three conditions are self-explanatory. The randomness is important, as the execution time for each task may vary by few orders of magnitude and you do not know how long it will take upfront. Without the randomness you risk that any used distribution pattern ‘by accident’ may send all the heavy tasks to one machine and overload it. On the other hand it can’t be really random – we may want to repeat the process to debug it and/or control the data processing. So what can we do?
No worries, there is a plan! We could use hashing! To be exact, for every task we could calculate a hashcode and use distribution algorithm similar to the one used in HashMap to distribute objects in a array. To be specific: let’s say we have N servers with IDs 0, 1, …, N-1. We will execute given task T on a server with ID equal to T.hashCode() % N. Since hashCode values are meant to be ‘randomly distributed’ they are perfect for this task. What is more they are not really random, so any given task will be executed in every run on the same machine. Perfect!
One more thing: because of performance reasons (that do not affect this puzzle, I promise!) instead of distributing the tasks by a centralised system we will just send all tasks to all servers and let servers decide whether or not they will execute them. One, two, three and the (simplified) code is here! First, there is a Server class – each server knows its ID and the total numbers of working servers. All task distribution logic is in the performTasks() method, which accepts all tasks and decides whether to execute them or not. Decision is based on the algorithm discussed above – use the hashCode() to get a ‘random’ value, get its remainder from dividing by the number of working servers and if you’ll end up with a value equal to server’s ID then you know server should execute it. Simple!
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: |
public class Server {
private final int serverID;
private final int serverCount;
public Server(int serverID, int serverCount) {
this.serverID = serverID;
this.serverCount = serverCount;
}
public void performTasks(Object[] tasks) {
for (Object task : tasks) {
// We check whether this task
// should be done by this server.
if (task.hashCode() % serverCount == serverID) {
// Instead of 'performing the task' we just print
// the information about it.
System.out.println("Server (ID="
+ serverID + ") performing " + task);
}
}
}
}
|
Now lets make it all work! Here is a simple class with main() method that creates few easy tasks, few servers and does the shabang. It passes all tasks to all servers and since servers execute tasks just by printing them to console we can just wait and see how the system works!
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: |
public class TasksEasy {
public static void main(String[] args) {
String[] tasksEasy = {"easyTask1", "easyTask2", "easyTask3"};
Server[] servers = {
new Server(0, 3),
new Server(1, 3),
new Server(2, 3)};
for (Server server: servers) {
server.performTasks(tasksEasy);
}
}
}
|
So what will get printed out? Anyone?? Well… no surprises, the distribution algorithm is awesome, so the output is like that:
Server (ID=0) performing easyTask2 Server (ID=1) performing easyTask3 Server (ID=2) performing easyTask1 |
Well, the luck is on our side. Not only everything worked out, but we even got all our tasks evenly distributed! Hmmm, wait… if everything would be so great I would not waist your time about that… so what is wrong? Before I’ll tell you where the bug is hidden (the post title helps, hehehe!), lets run our code again with different tasks, just for fun!
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: |
public class TasksHard {
public static void main(String[] args) {
String[] tasksHard = {"hardTask1", "hardTask2", "hardTask3"};
Server[] servers = {
new Server(0, 3),
new Server(1, 3),
new Server(2, 3)};
for (Server server: servers) {
server.performTasks(tasksHard);
}
}
}
|
So, what’s the output for the ‘HardTasks’? Hmmm… wait… strange…
Server (ID=0) performing hardTask2 |
What have happened with ‘hardTask1′ and ‘hardTask3′?? Why no one took care of them? Let’s try to analyse that: each server computes task.hashCode() and takes the remainder from dividing it by three, we have three servers with IDs 0, 1 and 2. It seems there is no way the task can’t be executed, right? I mean if you divide by three and take the rest you’ll get either 0, 1 or 2… right?
Wrong. Check the specification of the Java remainder operator (%), you’ll see that its defined as satisfying the following equation:
(value / divider) * divider + value % divider == value |
Still do not see the problem? In plain English: its obvious that 5 % 3 == 2, right? What is not obvious is that -5 % 3 == -2! Yes, when you take reminder of a negative value in Java the result is also negative (non-positive to be exact). That’s exactly what triggered the bug and made us lose ‘hardTask1′ and ‘hardTask3′. The hash codes for the easy tasks were (by chance, haha!) all positive and therefore our algorithm worked. The hard task’s hash codes were on the contrary all negative (again, by chance!) and that is when the bug appeared.
In the real-life situation we have found out that there is a problem in the code because out of 10 billion tasks scheduled for execution we got ~5 billion results. Me and some other coders were going trough the code line-by-line for hours, but since taking the remainder looked so safe and innocent no one have even suggested that this is the line that fails!
So what’s the lesson here? Write Unit Tests for every code that does logic/algorithmic job! Some even say, that if you have an ‘if’ statement in the code, do a Unit Test for it. Second thing: beware of how % operator works, its not that obvious as some might thing and that can bite you in the behind. If you read Joshua Bloch and Neal Gafter’s Java Puzzlers (btw: totally recommended!), you probably remember that the first puzzler discussed in the book is exactly about this bug. Its a nasty one!
7 Comments until now
Thanks for sharing this! I can imagine the furstation of debugging and the relief of “Aha!” moment when you figured it. I had similar moments when I made a mistake in commenting out a innocent looking println (part of code cleanup to avoid clogging of log files) and created a infinite loop in production environment bringing all app servers down! It took around 6 hours to figure out that a simple comment of the println has caused this. The println was “System.out.println(”the element is :” + e.nextElement())” which was inside a while (e.hasMoreElements()) loop. Since only the println was commented, the while loop became a infite one. We never bothered to test it as we just commented a simple println. Hard way to learn the importance of testing!
Rajs
http://thoughtsinprogress.co.in
Relying on hashcode is rather tricky — if any component in the computed hashcode relies on the built-in hashcode of an instance variable, you might be basing your hashcode computation on memory address somewhere.
Have you thought about instantiating a PRNG with a defined seed? Just use random.next() * N and you should get a repeatable, uniform distribution.
It would be quite funny if you have written unit tests with all the task names having positive hashcodes by accident…
Yeah, random with a defined seed would be the best solution – it would give you a guarantee of an even and fair distribution no matter the number of servers or pattern in task queue.
Why bother to randomly distribute the jobs if the execution times of those jobs are inherently random? Say you have a queue of jobs (J), and a circular buffer of servers (S) which those jobs may execute upon. For each job j in J, put j on S and make S the next server in the circular buffer. If the jobs are random, then this configuration will be just ‘as random’ as the configuration you choose by route of hashcode.
As a side note- you could have avoided this bug by going with a uint rather than int and your ring would have been Z+ rather than just Z.
About the circular buffer: Try to imagine that your system is trying to compare two algorithms (A and B) against each other by running them with different inputs (i1, i2,…iN) and checking whether they will give the same results (For example A is a brut-force slow solution, B is a fast sophisticated algorithm that you want to check). It is very possible that your jobs queue will look like this:
A(i1), B(i1), A(i2), B(i2), A(i3), B(i3), …, A(iN), B(iN)
so there is always alternating A and B tasks. The execution time of every task is ‘random’, but then it is expected that A(i) >> B(i).
Now try to imagine that you have 100 machines in S. If the system would use your algorithm, then all servers with even IDs would get the A tasks and servers with even IDs would get the tasks B. Therefore odd-ID servers would have much more load and still continue to process the tasks while the even-ID servers would be idle….
I just wanted to show you that there can be a task distribution pattern in J and its impact on processing can be really bad. Basically since the task generator is not really random you do have to expect to have some pattern in J and that is why you should distribute them among the machines in a ‘random’ way:)
…and yes, unsigned int would solve the issue. But then again you would have to know about the issue first and if you did you would probably find a more elegant solution:)
What is this uint you speak of? Some mystical item from a faraway land?
Add your Comment!