\myheading{Dealing with compiler optimizations}
\leveldown{}

The following piece of code \dots

\begin{lstlisting}
mov rax, rdi
add rax, rax
\end{lstlisting}

\dots will be transformed into the \emph{(arg1 + arg1)} expression. It can be reduced to \emph{(arg1 * 2)}. Our toy decompiler can identify such patterns and rewrite them.

\begin{lstlisting}
# X+X -> X*2
def reduce_ADD1 (expr):
    if is_expr_op(expr) and get_op (expr)=="+" and get_op1 (expr)==get_op2 (expr):
        return dbg_print_reduced_expr ("reduce_ADD1", expr,
            create_binary_expr ("*", get_op1 (expr), create_val_expr (2)))
    return expr # no match
\end{lstlisting}

This function just tests whether the current node has the \emph{EXPR\_OP} type, the operation is ``+'' and both children are equal to each other. By the way, since our data structure is just a tuple of tuples, Python can compare them using the plain ``=='' operation. If the test succeeds, the current node is replaced with a new expression: we take one of the children, construct a node of \emph{EXPR\_VALUE} type holding the number 2, and then construct a node of \emph{EXPR\_OP} type with the ``*'' operation. \TT{dbg\_print\_reduced\_expr()} serves solely debugging purposes: it just prints the old and the new (reduced) expressions.

The decompiler then traverses the expression tree recursively, in \emph{depth-first search} fashion.

\begin{lstlisting}
def reduce_step (e):
    if is_expr_op (e)==False:
        return e # expr isn't EXPR_OP, nothing to reduce (we don't reduce EXPR_SYMBOL and EXPR_VAL)

    if is_unary_op(get_op(e)):
        # recreate expr with reduced operand:
        return reducers(create_unary_expr (get_op(e), reduce_step (get_op1 (e))))
    else:
        # recreate expr with both reduced operands:
        return reducers(create_binary_expr (get_op(e), reduce_step (get_op1 (e)), reduce_step (get_op2 (e))))

...

# same as "return ...(reduce_MUL1 (reduce_ADD1 (reduce_ADD2 (... expr))))"
reducers=compose([
    ...
    reduce_ADD1,
    ...
    ...])

def reduce (e):
    print "going to reduce " + expr_to_string (e)
    new_expr=reduce_step(e)
    if new_expr==e:
        return new_expr # we are done here, expression can't be reduced further
    else:
        return reduce(new_expr) # reduced expr has been changed, so try to reduce it again
\end{lstlisting}

The reduction functions are called again and again, as long as the expression keeps changing. Now we run it:

\begin{lstlisting}
python td.py tests/add1.s
...
going to reduce (arg1 + arg1)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
going to reduce (arg1 * 2)
result=(arg1 * 2)
\end{lstlisting}

So far so good; now what if we try this piece of code?

\begin{lstlisting}
mov rax, rdi
add rax, rax
add rax, rax
add rax, rax
\end{lstlisting}

\begin{lstlisting}
python td.py tests/add2.s
...
working out tests/add2.s
going to reduce (((arg1 + arg1) + (arg1 + arg1)) + ((arg1 + arg1) + (arg1 + arg1)))
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
reduction in reduce_ADD1() (((arg1 * 2) * 2) + ((arg1 * 2) * 2)) -> (((arg1 * 2) * 2) * 2)
going to reduce (((arg1 * 2) * 2) * 2)
result=(((arg1 * 2) * 2) * 2)
\end{lstlisting}

The result is correct, but too verbose. We would like to rewrite the \emph{(X*n)*m} expression to \emph{X*(n*m)}, where $n$ and $m$ are numbers.
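The listings above and below rely on small helper functions (\TT{create\_binary\_expr()}, \TT{get\_op()}, \TT{is\_expr\_op()} and so on) which the toy decompiler defines elsewhere. Their exact definitions are not shown in this section; here is a minimal sketch of how they could look on top of the tuple-of-tuples representation (the details, including the \TT{create\_symbol\_expr()} name, are assumptions made for illustration):

\begin{lstlisting}
# a sketch of the assumed node helpers (tuples with the node type first);
# the actual definitions in the toy decompiler may differ
def create_val_expr (val):              return ("EXPR_VALUE", val)
def create_symbol_expr (name):          return ("EXPR_SYMBOL", name)
def create_unary_expr (op, op1):        return ("EXPR_OP", op, op1)
def create_binary_expr (op, op1, op2):  return ("EXPR_OP", op, op1, op2)

def get_expr_type (e):                  return e[0]
def is_expr_op (e):                     return get_expr_type(e)=="EXPR_OP"
def get_val (e):                        return e[1] # for EXPR_VALUE nodes
def get_op (e):                         return e[1] # for EXPR_OP nodes
def get_op1 (e):                        return e[2]
def get_op2 (e):                        return e[3]
\end{lstlisting}

The debug output shown later (for example, \TT{('EXPR\_SYMBOL', 'arg1')}) is consistent with this layout.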
We could handle \emph{(X*n)*m} by adding another function like \TT{reduce\_ADD1()}, but there is a much better option: we can write a matcher for trees. You can think of it as a regular expression matcher, but one that works over trees.

\begin{lstlisting}
def bind_expr (key):
    return ("EXPR_WILDCARD", key)

def bind_value (key):
    return ("EXPR_WILDCARD_VALUE", key)

def match_EXPR_WILDCARD (expr, pattern):
    return {pattern[1] : expr} # return {key : expr}

def match_EXPR_WILDCARD_VALUE (expr, pattern):
    if get_expr_type (expr)!="EXPR_VALUE":
        return None
    return {pattern[1] : get_val(expr)} # return {key : expr}

def is_commutative (op):
    return op in ["+", "*", "&", "|", "^"]

def match_two_ops (op1_expr, op1_pattern, op2_expr, op2_pattern):
    m1=match (op1_expr, op1_pattern)
    m2=match (op2_expr, op2_pattern)
    if m1==None or m2==None:
        return None # one of the operand matches returned None, so we do the same

    # join the two dicts from both operands:
    rt={}
    rt.update(m1)
    rt.update(m2)
    return rt

def match_EXPR_OP (expr, pattern):
    if get_expr_type(expr)!=get_expr_type(pattern): # be sure both are EXPR_OP.
        return None
    if get_op (expr)!=get_op (pattern): # be sure the operations are the same.
        return None

    if (is_unary_op(get_op(expr))):
        # match unary expression.
        return match (get_op1 (expr), get_op1 (pattern))
    else:
        # match binary expression.

        # first try matching the operands as is.
        m=match_two_ops (get_op1 (expr), get_op1 (pattern), get_op2 (expr), get_op2 (pattern))
        if m!=None:
            return m

        # if matching was unsuccessful, and the operation is commutative, also try swapped operands.
        if is_commutative (get_op (expr))==False:
            return None
        return match_two_ops (get_op1 (expr), get_op2 (pattern), get_op2 (expr), get_op1 (pattern))

# returns a dict in case of success, or None
def match (expr, pattern):
    t=get_expr_type(pattern)
    if t=="EXPR_WILDCARD":
        return match_EXPR_WILDCARD (expr, pattern)
    elif t=="EXPR_WILDCARD_VALUE":
        return match_EXPR_WILDCARD_VALUE (expr, pattern)
    elif t=="EXPR_SYMBOL":
        if expr==pattern:
            return {}
        else:
            return None
    elif t=="EXPR_VALUE":
        if expr==pattern:
            return {}
        else:
            return None
    elif t=="EXPR_OP":
        return match_EXPR_OP (expr, pattern)
    else:
        raise AssertionError
\end{lstlisting}

Now here is how we use it:

\begin{lstlisting}
# (X*A)*B -> X*(A*B)
def reduce_MUL1 (expr):
    m=match (expr, create_binary_expr ("*",
        (create_binary_expr ("*", bind_expr ("X"), bind_value ("A"))),
        bind_value ("B")))
    if m==None:
        return expr # no match

    return dbg_print_reduced_expr ("reduce_MUL1", expr, create_binary_expr ("*",
        m["X"], # new op1
        create_val_expr (m["A"] * m["B"]))) # new op2
\end{lstlisting}

We take the input expression and construct a pattern to match it against. The matcher walks both expressions recursively and synchronously. The pattern is also an expression, but it can use two additional node types: \emph{EXPR\_WILDCARD} and \emph{EXPR\_WILDCARD\_VALUE}. These nodes carry keys (stored as strings). When the matcher encounters \emph{EXPR\_WILDCARD} in the pattern, it just stashes the current expression and returns it. If the matcher encounters \emph{EXPR\_WILDCARD\_VALUE}, it does the same, but only if the current node has the \emph{EXPR\_VALUE} type. \TT{bind\_expr()} and \TT{bind\_value()} are the functions which create nodes of these two types.

All this means that \TT{reduce\_MUL1()} will search for an expression of the form \emph{(X*A)*B}, where $A$ and $B$ are numbers. In all other cases the input expression is returned untouched, so these reducing functions can be chained.
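As a quick illustration of the commutativity handling (using the hypothetical \TT{create\_symbol\_expr()} helper from the sketch earlier), the pattern \emph{X+1} matches the expression \emph{1+Q} only thanks to the swapped-operand retry:

\begin{lstlisting}
# a hypothetical usage example, not taken from the toy decompiler itself:
expr    = create_binary_expr ("+", create_val_expr (1), create_symbol_expr ("Q"))
pattern = create_binary_expr ("+", bind_expr ("X"), create_val_expr (1))
print match (expr, pattern) # {'X': ('EXPR_SYMBOL', 'Q')}
\end{lstlisting}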
Now when \TT{reduce\_MUL1()} encounters a (sub)expression we are interested in, \TT{match()} returns a dictionary with the keys and the captured expressions. Let's add a \TT{print m} call somewhere before the return and rerun:

\begin{lstlisting}
python td.py tests/add2.s
...
going to reduce (((arg1 + arg1) + (arg1 + arg1)) + ((arg1 + arg1) + (arg1 + arg1)))
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
{'A': 2, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 2) * 2) -> (arg1 * 4)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() (arg1 + arg1) -> (arg1 * 2)
reduction in reduce_ADD1() ((arg1 * 2) + (arg1 * 2)) -> ((arg1 * 2) * 2)
{'A': 2, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 2) * 2) -> (arg1 * 4)
reduction in reduce_ADD1() ((arg1 * 4) + (arg1 * 4)) -> ((arg1 * 4) * 2)
{'A': 4, 'X': ('EXPR_SYMBOL', 'arg1'), 'B': 2}
reduction in reduce_MUL1() ((arg1 * 4) * 2) -> (arg1 * 8)
going to reduce (arg1 * 8)
...
result=(arg1 * 8)
\end{lstlisting}

The dictionary has the keys we supplied plus the expressions the matcher found. We can then use them to create a new expression and return it. The numbers are multiplied while forming the second operand of the ``*'' operation.

Now a real-world optimization technique: optimizing GCC replaces multiplication by 31 with shift and subtraction operations:

\begin{lstlisting}
mov rax, rdi
sal rax, 5
sub rax, rdi
\end{lstlisting}

Without reduction functions, our decompiler would translate this into \emph{((arg1 << 5) - arg1)}. We can replace the left shift by multiplication:

\begin{lstlisting}
# X<<n -> X*(2^n)
def reduce_SHL1 (expr):
    m=match (expr, create_binary_expr ("<<", bind_expr ("X"), bind_value ("Y")))
    if m==None:
        return expr # no match

    return dbg_print_reduced_expr ("reduce_SHL1", expr, create_binary_expr ("*", m["X"],
        create_val_expr (1<<m["Y"])))

# X*n - X -> X*(n-1)
def reduce_SUB3 (expr):
    m=match (expr, create_binary_expr ("-",
        create_binary_expr ("*", bind_expr("X1"), bind_value ("N")),
        bind_expr("X2")))
    if m!=None and match (m["X1"], m["X2"])!=None:
        return dbg_print_reduced_expr ("reduce_SUB3", expr, create_binary_expr ("*", m["X1"],
            create_val_expr (m["N"]-1)))
    else:
        return expr # no match
\end{lstlisting}

The matcher returns two X's, and we must make sure they are equal. In fact, in previous versions of this toy decompiler, I did the comparison with plain ``=='', and it worked. But we can reuse the \TT{match()} function for the same purpose, because it handles commutative operations better. For example, if X1 is ``Q+1'' and X2 is ``1+Q'', the expressions are equal, but plain ``=='' will not see this. On the other hand, when the \TT{match()} function encounters a ``+'' operation (or another commutative operation) and the comparison fails, it also tries the swapped operands and compares again. However, to make things easier to understand, you can imagine for a moment that there is ``=='' instead of the second \TT{match()}.

Anyway, here is what we've got:

\begin{lstlisting}
working out tests/mul31_GCC.s
going to reduce ((arg1 << 5) - arg1)
reduction in reduce_SHL1() (arg1 << 5) -> (arg1 * 32)
reduction in reduce_SUB3() ((arg1 * 32) - arg1) -> (arg1 * 31)
going to reduce (arg1 * 31)
...
result=(arg1 * 31)
\end{lstlisting}

Another optimization technique is often seen in ARM Thumb code: AND-ing a value with a constant like 0xFFFFFFF0 is implemented using shifts:

\begin{lstlisting}
mov rax, rdi
shr rax, 4
shl rax, 4
\end{lstlisting}

This is quite common in ARM Thumb code, because it's a headache to encode a 32-bit constant using a couple of 16-bit Thumb instructions, while a single 16-bit instruction can shift left or right by 4 bits. Also, the expression \emph{(x>>4)<<4} can jokingly be called a ``twitching operator'': I've heard the ``-{}-i++'' expression was called this in Russian-speaking social networks, as some kind of meme (``operator podergivaniya''). Anyway, these reduction functions will be used:

\begin{lstlisting}
# X>>n -> X / (2^n)
...
def reduce_SHR2 (expr):
    m=match(expr, create_binary_expr(">>", bind_expr("X"), bind_value("Y")))
    if m==None or m["Y"]>=64:
        return expr # no match

    return dbg_print_reduced_expr ("reduce_SHR2", expr, create_binary_expr ("/", m["X"],
        create_val_expr (1<<m["Y"])))

# X<<n -> X*(2^n)
def reduce_SHL1 (expr):
    m=match (expr, create_binary_expr ("<<", bind_expr ("X"), bind_value ("Y")))
    if m==None:
        return expr # no match

    return dbg_print_reduced_expr ("reduce_SHL1", expr, create_binary_expr ("*", m["X"],
        create_val_expr (1<<m["Y"])))

# (X/(2^n))*(2^n) -> X&(~((2^n)-1))
def reduce_MUL2 (expr):
    m=match(expr, create_binary_expr ("*",
        create_binary_expr ("/", bind_expr("X"), bind_value("N1")),
        bind_value("N2")))
    if m==None or m["N1"]!=m["N2"] or is_2n(m["N1"])==False: # short-circuit expression
        return expr # no match

    return dbg_print_reduced_expr("reduce_MUL2", expr, create_binary_expr ("&", m["X"],
        create_val_expr(~(m["N1"]-1)&0xffffffffffffffff)))
\end{lstlisting}

Now the result:

\begin{lstlisting}
working out tests/AND_by_shifts2.s
going to reduce ((arg1 >> 4) << 4)
reduction in reduce_SHR2() (arg1 >> 4) -> (arg1 / 16)
reduction in reduce_SHL1() ((arg1 / 16) << 4) -> ((arg1 / 16) * 16)
reduction in reduce_MUL2() ((arg1 / 16) * 16) -> (arg1 & 0xfffffffffffffff0)
going to reduce (arg1 & 0xfffffffffffffff0)
...
result=(arg1 & 0xfffffffffffffff0)
\end{lstlisting}

\myheading{Division using multiplication}

Division is often replaced by multiplication for performance reasons. From school-level arithmetic, we may remember that division by 3 can be replaced by multiplication by $\frac{1}{3}$. In fact, compilers sometimes do so for floating-point arithmetic; for example, the FDIV instruction in x86 code can be replaced by FMUL. At least MSVC 6.0 replaces division by 3 with multiplication by $\frac{1}{3}$, and sometimes it's hard to be sure which operation was in the original source code.

But when we operate over integer values and CPU registers, we can't use fractions. However, we can rework the fraction:

\begin{center}
{\large $result = \frac{x}{3} = x \cdot \frac{1}{3} = x \cdot \frac{1 \cdot MagicNumber}{3 \cdot MagicNumber}$}
\end{center}

Given the fact that division by $2^n$ is very fast, we now have to find a $MagicNumber$ for which the following equation holds: $2^n = 3 \cdot MagicNumber$.

This code performs division by 10:

\begin{lstlisting}
mov rax, rdi
movabs rdx, 0cccccccccccccccdh
mul rdx
shr rdx, 3
mov rax, rdx
\end{lstlisting}

The division by $2^{64}$ is somewhat hidden: the lower 64 bits of the product (in RAX) are not used (dropped); only the higher 64 bits of the product (in RDX) are used, and they are then shifted by an additional 3 bits.
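A side note on where this constant comes from (simple arithmetic, not part of the decompiler): the exact equation $2^{64+3} = 10 \cdot MagicNumber$ has no integer solution, so the nearest integer from above is used instead:

\begin{center}
{\large $MagicNumber = \lceil 2^{67}/10 \rceil = 14757395258967641293 = 0cccccccccccccccdh$}
\end{center}

The rounding error is small enough that the final result is still exactly $\lfloor arg1/10 \rfloor$ for any 64-bit argument.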
The RDX register is set during processing of MUL/IMUL like this:

\begin{lstlisting}
def handle_unary_MUL_IMUL (registers, op1):
    op1_expr=register_or_number_in_string_to_expr (registers, op1)
    result=create_binary_expr ("*", registers["rax"], op1_expr)
    registers["rax"]=result
    registers["rdx"]=create_binary_expr (">>", result, create_val_expr(64))
\end{lstlisting}

In other words, the assembly code we have just seen multiplies by {\Large $\frac{0cccccccccccccccdh}{2^{64+3}}$}, or divides by {\Large $\frac{2^{64+3}}{0cccccccccccccccdh}$}. To find the divisor, we just have to divide the numerator by the denominator.

\begin{lstlisting}
# n = magic number
# m = shifting coefficient
# return = 1 / (n / 2^m) = 2^m / n
def get_divisor (n, m):
    return (2**float(m))/float(n)

# (X*n)>>m, where m>=64 -> X/...
def reduce_div_by_MUL (expr):
    m=match (expr, create_binary_expr(">>",
        create_binary_expr ("*", bind_expr("X"), bind_value("N")),
        bind_value("M")))
    if m==None:
        return expr # no match

    divisor=get_divisor(m["N"], m["M"])
    return dbg_print_reduced_expr ("reduce_div_by_MUL", expr,
        create_binary_expr ("/", m["X"], create_val_expr (int(divisor))))
\end{lstlisting}

This works, but we have a problem: this rule takes the \emph{(arg1 * 0xcccccccccccccccd) >> 64} expression first and finds the divisor to be $1.25$. This is correct: the result is shifted right by 3 more bits afterwards (i.e., divided by 8), and $1.25 \cdot 8 = 10$. But our toy decompiler doesn't support real numbers. We can solve this problem in the following way: if the divisor has a fractional part, we postpone the reduction, in the hope that the two subsequent right shifts will be reduced into a single one:

\begin{lstlisting}
# (X*n)>>m, where m>=64 -> X/...
def reduce_div_by_MUL (expr):
    m=match (expr, create_binary_expr(">>",
        create_binary_expr ("*", bind_expr("X"), bind_value("N")),
        bind_value("M")))
    if m==None:
        return expr # no match

    divisor=get_divisor(m["N"], m["M"])
    if math.floor(divisor)==divisor:
        return dbg_print_reduced_expr ("reduce_div_by_MUL", expr,
            create_binary_expr ("/", m["X"], create_val_expr (int(divisor))))
    else:
        print "reduce_div_by_MUL(): postponing reduction, because divisor=", divisor
        return expr
\end{lstlisting}

That works:

\begin{lstlisting}
working out tests/div_by_mult10_unsigned.s
going to reduce (((arg1 * 0xcccccccccccccccd) >> 64) >> 3)
reduce_div_by_MUL(): postponing reduction, because divisor= 1.25
reduction in reduce_SHR1() (((arg1 * 0xcccccccccccccccd) >> 64) >> 3) -> ((arg1 * 0xcccccccccccccccd) >> 67)
going to reduce ((arg1 * 0xcccccccccccccccd) >> 67)
reduction in reduce_div_by_MUL() ((arg1 * 0xcccccccccccccccd) >> 67) -> (arg1 / 10)
going to reduce (arg1 / 10)
result=(arg1 / 10)
\end{lstlisting}

I don't know if this is the best solution. An early version of this decompiler processed the input expression in two passes: the first pass for everything except division by multiplication, and the second pass for the latter. I don't know which way is better. Or maybe we could support real numbers in expressions?

A couple of words to better understand division by multiplication. Many people miss the ``hidden'' division by $2^{32}$ or $2^{64}$ when the lower 32-bit (or 64-bit) part of the product is not used (just dropped). Also, there is a misconception that the modular inverse is used here. It is close, but not the same thing. The extended Euclidean algorithm is usually used to find the \emph{magic coefficient}, but in fact, this algorithm is rather used here to solve the equation; you can solve it using any other method.
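As one possible ``other method'', here is a small brute-force search for a magic number and shift amount for unsigned division; this is a sketch of my own for illustration, not part of the toy decompiler, and the names are hypothetical. It uses a standard sufficient condition: take $magic = \lceil 2^{64+s}/d \rceil$; if $magic \cdot d - 2^{64+s} \le 2^s$, then multiplying any 64-bit $x$ by $magic$, taking the upper 64 bits and shifting them right by $s$ gives exactly $\lfloor x/d \rfloor$.

\begin{lstlisting}
# a sketch (not the decompiler's code): find (magic, shift) for unsigned
# division by the constant d on 64-bit registers
def find_magic_unsigned (d):
    for s in range(64):
        magic=(2**(64+s)+d-1)//d # ceil(2^(64+s) / d)
        # the magic number must fit in 64 bits and the rounding error must be small enough:
        if magic < 2**64 and magic*d - 2**(64+s) <= 2**s:
            return magic, s
    return None # some divisors (e.g. 7) need a longer sequence with a 65-bit magic number

print find_magic_unsigned(10) # -> magic 0xcccccccccccccccd, shift 3, as in the listing above
\end{lstlisting}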
Also, needless to say, the equation is unsolvable for some divisors, because it is a Diophantine equation (i.e., an equation that allows only integer results), since we work on integer CPU registers, after all.

\levelup{}