Forward Mode Automatic Differentiation Using Abstract Syntax Trees

In this post, we will look at Forward mode auto differentiation(auto diff) using Python’s Abstract Syntax Tree(AST).

Sample auto diff representation

The above image is a typical image of AutoDiff and it is very similar to a typical image of an AST, which essentially breaks down a set of instructions into individual components.

So simply put, the task was to look at a set of math operations and create an equivalent set of differential operations for the math operations. This is possible using Python's ast module. And later on, use the math and differential operations to perform linear regression.

I am assuming you are familiar with AutoDiff and Python AST. We will directly look at using the Python AST to create the AutoDiff forward mode functionality.

So let’s get started.

The Variable Class

We will create a Variable class that will help us keep track of the name of the variable and the value. Tracking the name of the variable is key to generating differential equations. The Variable class also has overloads for the common math operations.

class Variable:
    def __init__(self, name: str = '', value: float = 0.0):
        self.name = name
        self.value = value

    # Sample Add overload
    def __add__(self, other):
        if isinstance(other, self.__class__):
            value = self.value + other.value
            return Variable(self.name + '_A_' + other.name, value)
        else:
            value = self.value + other
            return Variable(self.name + '_A_' + str(other), value)

We will use objects of this class as our variables.

Predict and Loss Function

Assume the following code, which is common in linear regression.

def train_loop(X, y, w, b):
    '''
    X - Single Input Feature len(X) == 1
    y - Single Output Value
    w - Trainable weight parameter
    b - Trainable bias parameter
    '''
    y_pred = w * X + b           # Linear equation for prediction
    loss = (y_pred - y) ** 2     # MSE with no mean. Because len(X) == 1
    return loss

'''
    Where

    X = Variable(name='X', value=3)
    y = Variable(name='y', value=7)
    w = Variable(name='w', value=np.random.rand(1))
    b = Variable(name='b', value=np.random.rand(1))

'''

We need to find how much w and b should change so that the loss is minimum i.e the value of y_pred is closer to y. This is done through a process called Gradient Descent, in which we get the partial derivatives(Gradients) of the loss function with respect to the trainable parameters w and b.

Python AST

Consider only the first operation y_pred = w * X + b in the above train_loop.

The following is the AST generated by Python for

import ast

print(ast.dump(ast.parse('y_pred = w * X + b'), indent=2))

'''
Module(
  body=[
    Assign(
      targets=[
        Name(id='y_pred', ctx=Store())],
      value=BinOp(
        left=BinOp(
          left=Name(id='w', ctx=Load()),
          op=Mult(),
          right=Name(id='X', ctx=Load())),
        op=Add(),
        right=Name(id='b', ctx=Load())))],
  type_ignores=[])
'''

The value of the Assign node is a Binop operation which again has a Binop as its left input. We separate the "cascading" Binop operations such that each Binop has separate Assign node. This will help in generating the differential equation for that particular Binop. These equations give us the Gradient for each parameter.

Flatten Operation

We pass the above code through a "flattening operation" to extract the individual operations, thereby "flattening" the AST.

# Input to Flattening function
y_pred = w * X + b

# AST is generated from the input.
# AST is parsed and every Binop operation is given its own Assign Node.
# This Assign node is added to the AST to generate new code.

# Output from Flattening function
random_name_1 = w * X
random_name_2 = random_name_1 + b
y_pred = random_name_2

Following in the AST of the above Output

import ast

print(ast.dump(ast.parse(
'''
random_name_1 = w * X
random_name_2 = random_name_1 + b
y_pred = random_name_2
'''), indent=2))

'''
Module(
  body=[
    Assign(
      targets=[
        Name(id='random_name_1', ctx=Store())],
      value=BinOp(
        left=Name(id='w', ctx=Load()),
        op=Mult(),
        right=Name(id='X', ctx=Load()))),
    Assign(
      targets=[
        Name(id='random_name_2', ctx=Store())],
      value=BinOp(
        left=Name(id='random_name_1', ctx=Load()),
        op=Add(),
        right=Name(id='b', ctx=Load()))),
    Assign(
      targets=[
        Name(id='y_pred', ctx=Store())],
      value=Name(id='random_name_2', ctx=Load()))],
  type_ignores=[])
'''

As you can see we have one Binop operation per Assign node.

Create Differential Equations

We can now generate a differential equation for each operation, with respect to the trainable parameters.

These equations are added to the flattened function which can be then called to get the output of the operation as well as the partial derivatives(gradients) of the output with respect to the parameters.

Putting it together

So putting it all together, when we take the train_loop(X, y, w, b) function from above through the process we get the following output.

# Input Function to Flatten function
def train_loop(X, y, w, b):
    y_pred = w * X + b
    loss = (y_pred - y) ** 2

    return loss

# Output of the Flatten function
def train_loop_flat(X, y, w, b):
    w_mult_X = w * X
    w_mult_X.name = 'w_mult_X'
    w_mult_X_add_b = w_mult_X + b
    w_mult_X_add_b.name = 'w_mult_X_add_b'
    y_pred = w_mult_X_add_b
    y_pred.name = 'y_pred'
    y_pred_sub_y = y_pred - y
    y_pred_sub_y.name = 'y_pred_sub_y'
    y_pred_sub_y_pow_uta = y_pred_sub_y ** 2            # uta is a random id 
    y_pred_sub_y_pow_uta.name = 'y_pred_sub_y_pow_uta'
    loss = y_pred_sub_y_pow_uta
    loss.name = 'loss'

    return loss

# Output of the Create Differential Equations Function
def train_loop_flat_and_diff(X, y, w, b, dw=0, db=0):
    w_mult_X = w * X
    w_mult_X.name = 'w_mult_X'
    w_mult_X_add_b = w_mult_X + b
    w_mult_X_add_b.name = 'w_mult_X_add_b'
    y_pred = w_mult_X_add_b
    y_pred.name = 'y_pred'
    y_pred_sub_y = y_pred - y
    y_pred_sub_y.name = 'y_pred_sub_y'
    y_pred_sub_y_pow_uta = y_pred_sub_y ** 2
    y_pred_sub_y_pow_uta.name = 'y_pred_sub_y_pow_uta'
    loss = y_pred_sub_y_pow_uta
    loss.name = 'loss'

    dw_mult_X = dw * X
    dw_mult_X_add_b = dw_mult_X + db
    dy_pred = dw_mult_X_add_b
    dy_pred_sub_y = dy_pred
    dy_pred_sub_y_pow_uta = 2 * y_pred_sub_y ** 1 * dy_pred_sub_y
    dloss = dy_pred_sub_y_pow_uta

    return (loss, dloss)

The train_loop_flat_and_diff above has inputs dw and db for the trainable parameters w and b respectively. The differential equations are created using the Chain Rule of derivatives.

Linear Regression test

We will now put our forward mode auto diff to the test. We will have an input variable X and an output variable y, along with w and b initialized as follows

def train_loop(X, y, w, b):
    y_pred = w * X + b 
    loss = (y_pred - y) ** 2

    return loss

def main():
    X = Variable(name='X', value=np.random.rand(1) * -100) # -100.0 : 0.0
    y = Variable(name='y', value=np.random.rand(1) * 100) # 0.0 : 100.0

    w = fm_ast.Variable(name='w', value=np.random.rand(1))
    b = fm_ast.Variable(name='b', value=np.random.rand(1))

    epochs = 100
    learning_rate = 0.0001
    train_losses = []

    # ForwardModeAutoDiffAST contains all the functionality described above
    # Takes the "training function" and paramter names as inputs
    fm_ad = ForwardModeAutoDiffAST(train_loop, [w.name, b.name])

    for epoch in range(epochs):
        gradients = fm_ad.get_gradients(X=X, y=y, w=w, b=b)

        w = w - gradients[w.name][1].value * learning_rate
        w.name = 'w'
        b = b - gradients[b.name][1].value * learning_rate
        b.name = 'b'

        train_losses.append(gradients[w.name][0].value)

'''
X=[-95.6548842] y_pred=[-38.00573576], y=[3.70300017]
Train Loss at 0 = [1739.61865285]
Train Loss at 10 = [42.05417673]
Train Loss at 20 = [1.01663303]
Train Loss at 30 = [0.02457646]
Train Loss at 40 = [0.00059412]
Train Loss at 50 = [1.43624806e-05]
Train Loss at 60 = [3.47203852e-07]
Train Loss at 70 = [8.39343273e-09]
Train Loss at 80 = [2.02905909e-10]
Train Loss at 90 = [4.90512154e-12]
X=[-95.6548842] y_pred=[3.70299982], y=[3.70300017]
'''

As seen above the loss has greatly reduced throughout the training and the y_pred and y values are nearly equal at the end.

Following is the uncommented, unoptimized first working version.

Thank you for reading, and hope to post Part 2 soon.

https://gist.github.com/nihalkenkre/e51619241de182d8de07acc277787182

AutoDiff and Python AST - Part 1

Table of contents