Kevin Boone

Why are the variable names all wrong in my decompiled Java class?

A Java compiler produces machine code, but not machine code for any particular CPU: it produces machine code for the Java virtual machine. Still, it looks like machine code: it has simple operations that do arithmetic, move data between the stack and variables, branch, and call subroutines. Sometimes it is helpful to be able to convert the compiled code (usually called 'byte code' in the Java world) back into Java source. Various tools are available that can attempt this, with varying degrees of success. All these tools have the disadvantage that they lose information; in particular, they lose many of the variable names.

This article explains why this happens, by examining the compiled Java bytecode in detail.


For example, consider this simple Java class:

public class Test
  {
  public void foo()
    {
    int total = 0;
    for (int x = 0; x < 10; x++)
      {
      total += x;
      }
    }
  }

I'll compile this Java source to a class file Test.class, using javac, then pass Test.class to various Java decompilers.

CFR produces this:

public class Test {
    public void foo() {
        int n = 0;
        for (int i = 0; i < 10; ++i) {
            n += i;
        }
    }
}

Fernflower, on the other hand, renders it like this:

public class Test {
   public void foo() {
      int var1 = 0;
      for(int var2 = 0; var2 < 10; ++var2) {
         var1 += var2;
      }
   }
}

Both of these decompilers, and others, correctly detect that I've used a for loop in my code rather than, say, a while loop. All correctly render the names of the class and its single method. None, however, correctly detects the names of any of the variables. I see subtle differences between the decompilers: for example, CFR used i as the loop counter in the for loop, which is a common choice. But it's still wrong -- the original name was x. Fernflower didn't even use common names: we just get var1 and var2.

But why? If the decompiler can get the correct class and method names, why can't it determine variable names?

To answer this question, we need to look at the compiled bytecode itself. We can do this using javap -c, which reveals this:

public class Test {
  public Test();
  ...
  public void foo();
    Code:
       0: iconst_0
       1: istore_1
       2: iconst_0
       3: istore_2
       4: iload_2
       5: bipush        10
       7: if_icmpge     20
      10: iload_1
      11: iload_2
      12: iadd
      13: istore_1
      14: iinc          2, 1
      17: goto          4
      20: return
}

Unlike any 'real' machine language I've come across (that runs directly on a CPU), Java byte code was developed alongside the Java programming language, which is inherently object-oriented. This means that concepts like 'class' and 'method' are built into the byte code specification. So we see the actual class and method names encoded into the bytecode.
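One way to see that these names really do survive compilation is to ask the running JVM for them. The following is only a minimal sketch: NamesDemo and methodNames() are names I've made up for illustration, and the nested Test class is a stand-in for the class above. It uses standard reflection calls to read the class and method names back from the loaded class.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; NamesDemo and methodNames() are illustrative names only.
class NamesDemo {
    // Stand-in for the article's Test class.
    static class Test {
        public void foo() {
            int total = 0;
            for (int x = 0; x < 10; x++) {
                total += x;
            }
        }
    }

    // Collects the names of the methods declared by Test, as the JVM reports them.
    static List<String> methodNames() {
        List<String> names = new ArrayList<>();
        for (Method m : Test.class.getDeclaredMethods()) {
            // Skip anything the compiler or a tool generated behind the scenes.
            if (!m.isSynthetic()) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(Test.class.getSimpleName() + " " + methodNames());
    }
}
```

The standard reflection API has calls for class, method and field names, but nothing comparable for asking what a local variable inside foo() was called.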

But what about variables? The variables used in my simple test class are entirely local to the method foo(). Outside that method, they have no existence at all. The op-code iconst_0 puts the number zero on the stack; istore_1 takes the value from the top of the stack and stores it in 'variable 1'. (In an instance method like this one, variable 0 holds the this reference, which is why my own variables start at 1.) Java byte code has specific op-codes for handling a small number of variables, and some small integers (0, 1, ...). Using more variables, or larger numbers, requires the compiler to use more sophisticated methods to encode the operations as byte code but, in my simple case, I only need two variables, so the compiler can use the built-in op-codes for numbered variables. There's simply no need to give these variables names.
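To make the idea of numbered variables concrete, here is a rough hand-written simulation of the bytecode listing for foo(). It is a sketch, not how a real JVM is implemented: SlotDemo, run(), and the arrays locals and stack are my own illustrative names, and the final 'return' hands back variable 1 purely so the result can be inspected (the real method returns nothing).

```java
// Hypothetical sketch: steps through the bytecode of foo() by hand.
class SlotDemo {
    static int run() {
        int[] locals = new int[3]; // slot 0 would hold 'this'; slots 1 and 2 are the two ints
        int[] stack = new int[2];  // operand stack; this method never needs more than two entries
        int sp = 0;                // index of the next free stack entry
        int pc = 0;                // offset of the instruction being executed
        while (true) {
            switch (pc) {
                case 0:  stack[sp++] = 0;         pc = 1;  break; // iconst_0
                case 1:  locals[1] = stack[--sp]; pc = 2;  break; // istore_1
                case 2:  stack[sp++] = 0;         pc = 3;  break; // iconst_0
                case 3:  locals[2] = stack[--sp]; pc = 4;  break; // istore_2
                case 4:  stack[sp++] = locals[2]; pc = 5;  break; // iload_2
                case 5:  stack[sp++] = 10;        pc = 7;  break; // bipush 10
                case 7: {                                         // if_icmpge 20
                    int b = stack[--sp];
                    int a = stack[--sp];
                    pc = (a >= b) ? 20 : 10;
                    break;
                }
                case 10: stack[sp++] = locals[1]; pc = 11; break; // iload_1
                case 11: stack[sp++] = locals[2]; pc = 12; break; // iload_2
                case 12: {                                        // iadd
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    pc = 13;
                    break;
                }
                case 13: locals[1] = stack[--sp]; pc = 14; break; // istore_1
                case 14: locals[2] += 1;          pc = 17; break; // iinc 2, 1
                case 17: pc = 4;                           break; // goto 4
                case 20: return locals[1];                        // return (value exposed for inspection)
                default: throw new IllegalStateException("unexpected offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Nothing in this simulation needs to know that slot 1 was once called total or that slot 2 was called x; the index is all the instructions ever refer to.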

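It's interesting to me that Fernflower uses variable names var1, var2, etc., that match the numbering of variables used in the Java byte code specification. CFR, on the other hand, attempts to use more 'human' names. But neither will recover the names I originally used, because the compiler never stored them. (javac can be told to record local variable names as debugging information, for example with the -g switch, but it does not do so by default, and I didn't ask it to.)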

Decompilers won't always recover the program structure perfectly, because the byte code has no for or while constructs -- the flow control op-codes provided by the JVM are essentially method calls and returns, switches, and conditional and unconditional branches. You can see in the example above that my for loop is actually implemented by a conditional branch (if_icmpge, 'branch if integer comparison is greater or equal') and a goto.
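To illustrate why the loop type has to be guessed, here are two ways of writing the same loop in source. This is a sketch under my own assumptions: LoopDemo, withFor() and withWhile() are made-up names, both methods return the total so that they can be compared, and I'm not claiming any particular compiler emits byte-for-byte identical output for the two. The second method simply spells out the compare-and-branch shape seen in the listing above.

```java
// Hypothetical sketch: two source forms of the same loop.
class LoopDemo {
    // The loop as originally written, with a for statement.
    static int withFor() {
        int total = 0;
        for (int x = 0; x < 10; x++) {
            total += x;
        }
        return total;
    }

    // The same logic written to mirror the byte code's structure.
    static int withWhile() {
        int total = 0;
        int x = 0;
        while (true) {
            if (x >= 10) {   // corresponds to if_icmpge: leave the loop
                break;
            }
            total += x;      // iload, iload, iadd, istore
            x++;             // iinc
        }                    // the jump back to the test corresponds to goto
        return total;
    }

    public static void main(String[] args) {
        System.out.println(withFor() + " " + withWhile());
    }
}
```

Both methods compute the same value, and a decompiler that sees only a comparison and a backward branch has to choose which of these source forms to show.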

Decompilers use heuristic methods to recover program flow, but it isn't easy to do the same for naming. I've seen attempts to recover names using AI techniques -- after all, there's a large body of code to train a machine learning algorithm on. Developers tend to use the same kinds of names for particular functions, so decompilers with a measure of AI could probably do better. Nevertheless, I don't know of any such implementation in use outside the laboratory.

So that's why your decompiled Java code has incorrect variable names: in a default build, the compiler did not store any local variable names for the decompiler to recover.