# Notes on Linear Algebra

A quick review of related concepts.

Table of Contents:

## Matrix basics

### Problem setting

```
1x + 2y + 3 z = 12
2x + 3y + 4z = 11
3x + 4y + 5z = 10
```

### Formulate with numpy

```
A*X = b
A = np.array(np.mat(‘1 2 3; 2 3 4; 3 4 5’))
A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
X = np.transpose(np.array([x, y, z]))
```

## Vector, Matrix, and Tensor Derivatives

More details here

*Row vectors*

*3 Dimensional array*

*Multiple data points*

*Chain rule*

** tips**
row x matrix = row_picture
matrix x column = col_picture
row x column = scalar
elimination views matrix as stacked rows or cols
permutation views manipulate row/col vectors

## Backpropagation

computing the gradient analytically using the chain rule, i.e. backpropagation.

Intuition with gates

- mini-batch SGD loop:
- sample
- forward loss
- backward gradient
- update weights

- layer - gate
- weight - parameters
- forward - backward pass

dfdx, or dx, is the gradient on x, the partial derivative on x

the gradient is the vector [dfdx, dfdy, dfdz] or [dx, dy, dz]

```
# set inputs
x = -2; y = 5; z = -4
# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12
# perform the backward pass (backpropagation, reverse-mode differentiation)
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
# now backprop through q = x + y
dfdx = 1.0*dfdq # df/dx = dq/dx * df/dq = 1*df/dq
dfdy = 1.0*dfdq # df/dy = dq/dy * df/dq = 1*df/dq
```

```
# assume some random weights and data
w = [2,-3,-3]
x = [-1,-2]
# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2] # dot product x.dot(w.T)
f = 1.0/(1+math.exp(-dot)) # sigmoid function
# backward pass through the neuron
ddot = (1-f)*f # sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot] # bp into x, df/dx = df/ddot * ddot/dx = df/ddot * w
dw = [x[0] * ddot, x[1] * ddot, 1.0*ddot ] # bp into w df/dw = df/ddot * ddot/dw = df/ddot * x
```

In practice: staged computation

To differentiate w.r.t. x or y is very complex. But we do not need to have an explicit function to evaluate the gradient. We only need to know how to compute it.

```
# some inputs
x = 3
y = -4
# forward pass
sigy = 1.0/(1 + math.exp(-y)) # sigmoid in numerator #(1)
num = x + sigy # numerator #(2)
sigx = 1.0/(1 + math.exp(-x)) # sigmoid in denominator #(3)
xpy = x + y #(4)
xpysqr = xpy**2 #(5)
den = sigx + xpysqr # denominator #(6)
invden = 1.0/den #(7)
f = num * invden #(8)
# backward pass
# bp f = num * invden
dnum = invden #(8)
dinvden = num #(8)
# bp invden = 1.0/den
dden = (-1.0/(den**2)) * dinvden #(7)
# bp den = sigx + xpysqr
dsigx = (1.0)*dden #(6)
dxpysqr = (1.0)*dden #(6)
# bp xpysqr = xpy**2
dxpy = (2.0*xpy)*dxpysqr #(5)
# bp xpy = x + y
dx = 1.0*dxpy #(4)
dy = 1.0*dxpy #(4)
# bp sigx = 1.0/(1 + math.exp(-x))
dx += sigx * (1 - sigx)*dsigx #(3)
# bp num = x + sigy
dx += (1.0)*dnum #(2)
dsigy = (1.0)*dnum #(2)
# bp sigy = 1.0/(1 + math.exp(-y))
dy += sigy * (1 - sigy)*dsigy #(1)
```

Tips: * Cache forward pass variables. * Gradients add up at forks. “+=” multi-variable chain rule, if a variable branches out at different parts of the circuit, then the gradients that flow back to it will add up.

Gradients for vectorized operations

** Matrix-Matrix** multiply graident_**

```
# forward pass
W = np.random.randn(5, 10) # 5x10
X = np.random.randn(10, 3) # 10x3
D = W.dot(X) # 5x3
# backward pass
dD = np.random.randn(*D.shape) # same shape as D
dW = dD.dot(X.T)
dX = W.T.dot(dD)
```

** tips**
*calculus
*algebra
*geometry