Notes on Linear Algebra

A quick review of related concepts.

Table of Contents:

  • Matrix basics
  • Vector, Matrix, and Tensor Derivatives
  • Activation Function
  • Backpropagation

Matrix basics

Problem setting

1x + 2y + 3z = 12
2x + 3y + 4z = 11
3x + 4y + 5z = 10

Formulate with numpy

A*X = b
import numpy as np

A = np.array(np.mat('1 2 3; 2 3 4; 3 4 5'))      # from a MATLAB-style string
A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])  # equivalent, from nested lists
b = np.array([12, 11, 10])
# X = np.transpose(np.array([x, y, z])) -- X is the unknown column vector [x, y, z]^T we solve for
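A quick check (a sketch, not part of the original notes): this particular A is singular (the third row equals twice the second minus the first), so np.linalg.solve would raise an error; np.linalg.lstsq still returns a solution of the consistent but rank-deficient system.

# sketch: solve the (singular, but consistent) system with least squares
import numpy as np

A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
b = np.array([12, 11, 10])

X, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(rank)      # 2: A is rank-deficient
print(A @ X)     # ~[12. 11. 10.], so X is one valid solution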

Vector, Matrix, and Tensor Derivatives

More details here

Row vectors

\[\vec{x}: 1 \times D, \quad \vec{y}: 1 \times C, \quad W: D \times C\]

\[\vec{y} = \vec{x}W, \qquad \vec{y_3} = \sum_{j=1}^{D}\vec{x_j}W_{j,3}\]

\[\frac{\partial\vec{y_3}}{\partial\vec{x_7}} = W_{7,3}, \qquad \frac{\partial\vec{y}}{\partial\vec{x}} = W\]
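A numerical sanity check of the last identity (a sketch added here, not from the source notes): build a random x and W, approximate the Jacobian of y = xW by finite differences, and compare it with W.

# sketch: verify d y_j / d x_i = W_{i,j} for a row vector x and y = x W
import numpy as np

D, C = 4, 3
x = np.random.randn(D)        # treat x as a 1 x D row vector
W = np.random.randn(D, C)

eps = 1e-6
jac = np.zeros((D, C))        # jac[i, j] approximates d y_j / d x_i
for i in range(D):
    xp = x.copy(); xp[i] += eps
    xm = x.copy(); xm[i] -= eps
    jac[i] = (xp @ W - xm @ W) / (2 * eps)

print(np.allclose(jac, W))    # True: the Jacobian of y = x W is exactly W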

3 Dimensional array

\[\vec{y_3} = \vec{x_1}W_{1,3} + \vec{x_2}W_{2,3} + \cdots + \vec{x_D}W_{D,3}\]

\[\frac{\partial\vec{y_3}}{\partial\vec{x_2}} = W_{2,3}, \qquad \frac{\partial\vec{y_j}}{\partial\vec{x_i}} = W_{i,j}\]

Multiple data points

\[X: N \times D, \quad W: D \times C, \quad Y: N \times C, \qquad Y = XW\]

\[\frac{\partial Y_{i,j}}{\partial X_{i,k}} = W_{k,j}, \qquad \frac{\partial Y_{i,:}}{\partial X_{i,:}} = W, \quad \text{i.e.}\ \frac{\partial\vec{y}}{\partial\vec{x}} = W\]
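The same kind of check for the batched case (a sketch, not from the original): perturbing X[i, k] only changes row i of Y, and it changes it by W[k, :].

# sketch: dY[i, j] / dX[i, k] = W[k, j], and the other rows of Y are untouched
import numpy as np

N, D, C = 5, 4, 3
X = np.random.randn(N, D)
W = np.random.randn(D, C)
Y = X @ W

eps = 1e-6
i, k = 2, 1                       # arbitrary row / input index
Xp = X.copy(); Xp[i, k] += eps
dY = (Xp @ W - Y) / eps

print(np.allclose(dY[i], W[k]))   # row i of Y changes by W[k, :]
others = np.delete(dY, i, axis=0)
print(np.allclose(others, 0))     # all other rows are unaffected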

Chain rule

\[\vec{y} = VW\vec{x}, \qquad \vec{m} = W\vec{x}\ (\text{the } M\text{-dimensional intermediate})\]

\[\frac{d\vec{y_i}}{d\vec{x_j}} = \sum_{k=1}^{M}\frac{d\vec{y_i}}{d\vec{m_k}}\frac{d\vec{m_k}}{d\vec{x_j}} = \sum_{k=1}^{M}V_{i,k}W_{k,j}\]
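A finite-difference check of the chain-rule result (a sketch with made-up sizes): the Jacobian of y = VWx should equal the matrix product V @ W.

# sketch: for column vectors, y = V (W x) has Jacobian dy_i/dx_j = (V W)_{i,j}
import numpy as np

Nx, M, K = 3, 4, 2                # x in R^Nx, m = W x in R^M, y = V m in R^K
W = np.random.randn(M, Nx)
V = np.random.randn(K, M)
x = np.random.randn(Nx)

eps = 1e-6
jac = np.zeros((K, Nx))
for j in range(Nx):
    xp = x.copy(); xp[j] += eps
    xm = x.copy(); xm[j] -= eps
    jac[:, j] = (V @ (W @ xp) - V @ (W @ xm)) / (2 * eps)

print(np.allclose(jac, V @ W))    # True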

Tips:

  • row × matrix = row picture
  • matrix × column = column picture
  • row × column = scalar
  • elimination views a matrix as stacked rows or columns
  • permutation matrices manipulate row/column vectors
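A small numpy illustration of the row picture vs. the column picture (not in the original notes; A and the vectors are arbitrary):

# sketch: the three ways of reading a product
import numpy as np

A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
r = np.array([1, 0, 2])   # a row vector
c = np.array([1, 0, 2])   # a column vector

# row x matrix: a combination of the rows of A (row picture)
print(r @ A, 1*A[0] + 0*A[1] + 2*A[2])

# matrix x column: a combination of the columns of A (column picture)
print(A @ c, 1*A[:, 0] + 0*A[:, 1] + 2*A[:, 2])

# row x column: a scalar
print(r @ c)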

Activation Function

https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
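A few of the activation functions the linked article covers, written out as plain numpy functions (a sketch; the selection here is mine, not the article's):

# sketch: common activation functions
import numpy as np

def sigmoid(x):
    # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs into (-1, 1), zero-centered
    return np.tanh(x)

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0.0, x)

z = np.linspace(-3, 3, 7)
print(sigmoid(z))
print(tanh(z))
print(relu(z))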

Backpropagation

Backpropagation is computing the gradient analytically by recursively applying the chain rule.

Intuition with gates

  • mini-batch SGD loop (see the sketch after this list):
    • sample a batch of data
    • forward pass to compute the loss
    • backward pass to compute the gradient
    • update the weights
  • layer ↔ gate
  • weights ↔ parameters
  • forward pass ↔ backward pass
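A minimal, self-contained sketch of that loop on a toy linear-regression problem (the data, model, and hyperparameters here are all made up for illustration):

# sketch: mini-batch SGD -- sample, forward (loss), backward (gradient), update
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                        # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)          # toy targets

w = np.zeros(3)                                       # the weights (parameters)
lr, batch_size = 0.1, 64
for step in range(500):
    idx = rng.integers(0, len(X), batch_size)         # 1. sample a mini-batch
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w                                     # 2. forward pass: mean squared error loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2.0 * Xb.T @ (pred - yb) / batch_size      # 3. backward pass: gradient of the loss
    w -= lr * grad                                    # 4. update the weights
print(w)                                              # close to true_w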

dfdx, often abbreviated dx, is the gradient on x, i.e. the partial derivative of f with respect to x.

The full gradient is the vector [dfdx, dfdy, dfdz], or [dx, dy, dz] for short.

# set inputs
x = -2; y = 5; z = -4

# perform the forward pass
q = x + y # q becomes 3
f = q * z   # f becomes -12

# perform the backward pass (backpropagation, reverse-mode differentiation)
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0*dfdq # df/dx = dq/dx * df/dq = 1*df/dq
dfdy = 1.0*dfdq # df/dy = dq/dy * df/dq = 1*df/dq

# second example: a single neuron with a sigmoid activation, f(w, x) = sigma(w0*x0 + w1*x1 + w2)
import math

# assume some random weights and data
w = [2,-3,-3]
x = [-1,-2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2] # dot product x.dot(w.T)
f = 1.0/(1+math.exp(-dot)) # sigmoid function

# backward pass through the neuron
ddot = (1-f)*f # gradient on dot: the sigmoid's local gradient is sigma * (1 - sigma)
dx = [w[0] * ddot, w[1] * ddot] # bp into x, df/dx = df/ddot * ddot/dx = df/ddot * w
dw = [x[0] * ddot, x[1] * ddot, 1.0*ddot ] # bp into w df/dw = df/ddot * ddot/dw = df/ddot * x

In practice: staged computation

\[f(x, y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2}\]

Differentiating this expression with respect to x or y analytically would be very messy. But we do not need an explicit formula for the gradient in order to evaluate it; we only need to know how to compute it, one simple stage at a time.

import math

# some inputs
x = 3
y = -4

# forward pass
sigy = 1.0/(1 + math.exp(-y)) # sigmoid in numerator   #(1)
num = x + sigy # numerator                             #(2)
sigx = 1.0/(1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y                                            #(4)
xpysqr = xpy**2                                        #(5)
den = sigx + xpysqr # denominator                      #(6)
invden = 1.0/den                                       #(7)
f = num * invden                                       #(8)

# backward pass
# bp f = num * invden
dnum = invden                         #(8)
dinvden = num                         #(8)
# bp invden = 1.0/den
dden = (-1.0/(den**2)) * dinvden      #(7)
# bp den = sigx + xpysqr
dsigx = (1.0)*dden                    #(6)
dxpysqr = (1.0)*dden                  #(6)
# bp xpysqr = xpy**2                  
dxpy = (2.0*xpy)*dxpysqr              #(5)
# bp xpy = x + y                       
dx = 1.0*dxpy                         #(4)
dy = 1.0*dxpy                         #(4)
# bp sigx = 1.0/(1 + math.exp(-x))
dx += sigx * (1 - sigx)*dsigx         #(3)
# bp num = x + sigy
dx += (1.0)*dnum                      #(2)
dsigy = (1.0)*dnum                    #(2)
# bp sigy = 1.0/(1 + math.exp(-y))
dy += sigy * (1 - sigy)*dsigy         #(1)

Tips:

  • Cache forward-pass variables: they are needed again in the backward pass.
  • Gradients add up at forks (hence the "+=" above). By the multi-variable chain rule, if a variable branches out to different parts of the circuit, the gradients flowing back to it sum up.

Gradients for vectorized operations

Matrix-matrix multiply gradient

import numpy as np

# forward pass
W = np.random.randn(5, 10) # 5x10
X = np.random.randn(10, 3) # 10x3
D = W.dot(X) # 5x3

# backward pass
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T)
dX = W.T.dot(dD)
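A numerical check of these formulas (a sketch, not in the original): pretend the loss is L = sum(dD * D), so that the upstream gradient on D is exactly dD, then compare the analytic dW with finite differences.

# sketch: check dW = dD.dot(X.T) against a numerical gradient
import numpy as np

W = np.random.randn(5, 10)
X = np.random.randn(10, 3)
dD = np.random.randn(5, 3)                 # pretend this is the upstream gradient on D

loss = lambda W_: np.sum(dD * W_.dot(X))   # scalar whose gradient w.r.t. D is exactly dD

eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num_dW[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(num_dW, dD.dot(X.T), atol=1e-4))   # True

A useful way to remember the formula: dW must have the same shape as W (5x10), and dD.dot(X.T), i.e. (5x3)·(3x10), is the combination of dD and X that produces that shape.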

Tips: keep the calculus, algebra, and geometry views in mind.

Written on November 19, 2016