# Notes on Machine Learning with NN

A quick review of related concepts.

## Basics

Questions: what is machine learning?

The fundamental goal of machine learning is to generalize beyond the examples in the training set.

(Machine learning is) the field of study that gives computers the ability to learn without being explicitly programmed – Arthur Samuel

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. – Tom Mitchell

Key Elements of ML Algorithm

• Representation: how to represent knowledge. Examples include decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles and others.
• Evaluation: the way to evaluate candidate programs (hypotheses). Examples include accuracy, prediction and recall, squared error, likelihood, posterior probability, cost, margin, entropy k-L divergence and others.
• Optimization: the way candidate programs are generated known as the search process. For example combinatorial optimization, convex optimization, constrained optimization.

### Model Representation

A typical machine learning model:

 x^{(i)} input variables, input features y^{(i)} output or target variable, to be predicted (x^{(i)}, y^{(i)}) a training example (x^{(i)}, y^{(i)}); (i = 1, …, m) a training set X the input space Y the output space h: X \rightarrow Y a hypothesis function

### Cost Function: how well does the hypothesis function fit the input data?

** Cost function ** measures the accuracy of the hypothesis function, e.g. mean square error function. Our objective is to get the lowest prediction error.

### Parameter Learning via Gradient Descent

** Gradient descent ** estimates the parameters in the hypothesis function, by taking the (partial) derivatives of the the cost function. It repeatedly and simultaneously updates the parameters until convergence.

# compute gradients w.r.t. existing parameters

# compute new parameters

# update parameters
w1 = w1_
w2 = w2_


Difficulties:

• step size, i.e. learning rate
• gradient descent method, e.g. batch, online, stochastic

## Classification

Goals

• build a simple image classification pipeline, based on the k-Nearest Neighbor or the SVM/Softmax classifier.
• understand the basic Image Classification pipeline and the data-driven approach (train/predict stages)
• understand the train/val/test splits and the use of validation data for hyperparameter tuning.
• develop proficiency in writing efficient vectorized code with numpy
• implement and apply a k-Nearest Neighbor (kNN) classifier
• implement and apply a Multiclass Support Vector Machine (SVM) classifier
• implement and apply a Softmax classifier
• implement and apply a Two layer neural network classifier
• understand the differences and tradeoffs between these classifiers
• get a basic understanding of performance improvements from using higher-level representations than raw pixels (e.g. color histograms, Histogram of Gradient (HOG) features)

### Nearest Neighbor Classifier

Clustering is a type of unsupervised learning: input data samples have no output labels. For example, K-Nearest Neighbor (KNN)

K-NN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before k-NN can work well. They are very expensive to train, but once the training is finished it is very cheap to classify a new test example. This mode of operation is much more desirable in practice.

1 Nearest Neighbor classifier

import numpy as np
# input
Xtr, Ytr, Xte, Yte = loadData('path') # magic function
# training
nn = NearestNeighbor() # create a Nearest Neighbor classifier class
nn.train(Xtr, Ytr) # train the classifier on the training data and labels
# testing
Yte_predict = nn.predict(Xte) # predict labels on the test data
# print the classification accuracy: average number of examples that are correctly predicted, label matches
print 'accuracy: %f' % (np.mean(Yte_predict == Yte))
# the classifier
class NearestNeighbor(object):
def __init__(self):
pass

def train(self, X, y):
""" X is N x D where each row is an example. Y is N x 1 """
# the nearest neighbor classifier simply remembers ALL the training data
slef.Xtr = X
self.ytr = y

def predict(self, X):
""" X is N x D where each row is an example, which we wish to predict label for """
num_test = X.shape[0]
# make sure the output type matches the input type
Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

# loop over all test rows
for i in xrange(num_test):
# find the nearest training image to the i'th test image
# using the L1 distance (sum of absolute value differences)
distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
## or using the L2 distance
# distance = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))
min_index = np.argmin(distances) # get the index with the smallest distance
Ypred[i] = self.ytr[min_index] # predict the label of the nearest example
return Ypred


K Nearest Neighbor classifier

Xval = Xtr[:1000, :] # take the first 1000 for validation
Yval = Ytr[:1000]
Xtr = Xtr[1000:, :] # keep last 49,000 for train
Ytr = Ytr[1000:]

# find Hyperparameters that work best on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
# use a particular value of k and evaluate on validation data
nn = NearestNeighbor()
nn.train(Xtr, Ytr)
# assume a modified NearestNeighbor class that can take k as input
Yval_predict = nn.predict(Xval, k = k)
acc = np.mean(Yval_predict == Yval)
print 'accuracy: %f' % (acc,)

# keep track of what works on the validation set
validation_accuracies.append((k, acc))


### Linear Classifier

Three core components of the (image) classification task: Score Function maps the raw image pixels to class scores (e.g. a linear function) Loss function measures the quality of a certain set of parameters (e.g. softmax vs SVM) Optimization finds the set of parameters W that minimize the loss function

3 key components in (image) classification task:

1. A (parameterized) score function maps the raw image pixels to class scores (e.g., a linear function $f(x_i,W) = Wx_i$)
1. A loss function L measures the quality of a particular set of parameters using certain scores (e.g., Softmax/SVM)
1. An optimization function finds the set of parameters W that minimize the loss function

#### Score Function

binary Softmax classifier (also known as logistic regression)

#### Loss Function

loss function measures our unhappiness with outcomes, sometimes also referred to as the cost function or the objective function.

Multiclass SVM

Code for the loss function, without regularization, in python, both unvectorized and half-vectorized form:

def L_i(x, y, W:
"""
unvectorized version.
- x
- y
- W
"""
delta =
score =

def L_i_vectorized(x, y, W):
"""
half-vectorized implementation, where for a single example the implementation contains no for loops, but there is still one loop over the examples outside this function
"""
delta =
scores =
margins =
margins[y] =
loss_i =
return loss_i

def L(X, y, W):
"""
fully-vectroized implementation:
- X
- y
- W
"""
delta =
score =


Softmax

Compare

$\sum_{\forall i}{x_i^{2}}$

$$P(y = 1 \mid x; w,b) = \sigma(w^Tx + b)$$, where $$\sigma(z) = 1/(1+e^{-z})$$

training dataset $$x_i$$, y_{i})\), where $i = 1 \cdots N$

N examples each with dimensionality D K distinct categories

weights, parameters

Data preprocessing: mean subtraction, scale, [-1, 1]

Validation sets for Hyperparameter tuning

#### Optimization

Strategy 1: Random search: try out many different random weights and keep track of what works best.

# X_train is the data where each column is an example (e.g. 3073 x 50,000)
# Y_train are the labels (e.g. 1 x 50,000)
# L evaluates the loss function

bestloss = float("inf") # Python assigns the highest possible float value
for num in xrange(1000):
W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
loss = L(X_train, Y_train, W) # get the loss over the entir training set
if loss < bestloss: #track the best solution
bestloss = loss
bestW = W
print 'In attemp %d the loss was %f, best %f' % (num, loss, bestloss)

# Assume X_test is [3073 x 10000], Y_test is [10000 x 1]
scores = Wbest.dot(Xte_cols) # 10 x 10000, the class scores for all test examples
# find the index with max score in each column, i.e. the predicted class
Yte_predict = np.argmax(scores, axis = 0)
# calculate accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)
# returns 0.1555


Strategy 2: Random local search: tries to extend one foot in a random direction and take a step only if it leads downhill.

• Start out with a random W
• generate random perturbation $\deltaW$, if the loss at the perturbed $W + \deltaW$ is lower, then updata
W = np.random.randn(10, 3073) * 0.001 # generate random starting W
bestloss = float("inf")
for i in xrange(1000):
step_size = 0.0001
Wtry = W + np.random.randn(10, 3073) * step_size
loss = L(Xtr_cols, Ytr, Wtry)
if loss < bestloss:
W = Wtry
bestloss = loss
print 'iter %d loss is %f' % (i, bestloss)
# returns 0.214



Strategy 3: Following the gradient: computes the best direction along which we should change our weight vector that is mathematically guaranteed to be the direction of the steepest descent. This direction will be related to the gradient of the loss function.

• 1-D functions, derivatives, the gradient is the vector of slopes for each dimension
• x-D functions, partial derivatives, the gradient is the vector of partial derivatives in each dimension

• numerical gradient: compute numerically with finite differences
def eval_numerical_gradient(f, x):
"""
a naive implementation of numerical gradient of f at x
- f is a function that takes a single argument
- x is the point (numpy array) to evaluate the gradient at
"""
fx = f(x) # evaluate function value at original point
h = 0.00001

# iterate over all indexes in x
while not it.finished:

# evaluate function at x+h
ix = it.multi_index
old_value = x[ix]
x[ix] = old_value + h # increment by h
fxh = f(x) # evaluate f(x+h)
x[ix] = old_value # restore to previous value (!!!important)

# compute the partial derivative
grad[ix] = (fxh -fx) /h # the slope
it.iternext() # step to next dimension



Compute gradient for the CIFAR-10 loss function at some random point in the weight space:

# a function that takes a single argument, the weights,
def CIFAR10_loss-fun(W):
return L(X_train, Y_train, W)

W = np.random.rand(10, 3073) * 0.001 # random weight vector

# the gradient tells the slope of the loss function along every dimension, we use it to make an update
loss_original = CIFAR10_Loss_fun(W) # original loss
print 'original loss: %f' % (loss_original, )

# effect of multiple step sizes
step_size = 10**step_size_log
W_new = W - step_size*df
loss_new = CIFAR10_loss_fun(W_new)
print 'for step size %f new loss: %f' % (step_size, loss_new)


• analytic gradient: compute analytically with Calculus

#### Linear Regression with One/Multiple Variables

Linear regression predicts a real-valued output based on an input value.

cost function

i.e., $X_0 = 1$

## Non-linear Classification with Neural Networks

Goals

• write backpropagation code, and train Neural Networks and Convolutional Neural Networks.
• understand Neural Networks and how they are arranged in layered architectures
• understand and be able to implement (vectorized) backpropagation
• implement various update rules used to optimize Neural Networks
• implement batch normalization for training deep networks
• implement dropout to regularize networks
• effectively cross-validate and find the best hyperparameters for Neural Network architecture

Questions

• what a neural network is really doing, behavior of deep neural networks
• how it is doing so, when succeeds; and what went wrong when fails.
• explore low-dimensional deep neural networks when classifying certain datasets
• visualise the behavior, complexity, and topology of such networks
• On an example dataset, with each layer, the network transforms the data, creating a new representation

### Architecture

#### Biological motivation


# sigmoid activation function
def sigmoid(x):
return 1.0/(1.0 + math.exp(-x))

class Neuron(object):
# example of forward propagating a single neuron
def forward(inputs):
""" assume inputs and weights are 1-D numpy arrays and bia is a number """
cell_body_sum = np.sum(inputs * self.weights) + self.bias
firing_rate = sigmoid(cell_synapse)
return firing_rate


### Case Study

#### Problem setting 1: linear dataset

1x + 2y + 3 z = 12

2x + 3y + 4z = 11

3x + 4y + 5z = 10


Formulate with numpy

import numpy as np

A*X = b
A = np.array(np.mat(‘1 2 3; 2 3 4; 3 4 5’))
A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
X = np.transpose(np.array([x, y, z]))
# visualize the data
TBD


#### Problem setting 2: non-linear (spiral) dataset

import numpy as np

N = 100 # number of points per class/spiral wing
D = 2  # dimensionality
K = 5  # number of spiral wings
X = np.zeros((N*K)) # data matrix (each row is a single training example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in xrange(K):
ix = range(N*j, N*(j+1))
r = np.linspace(0.0, 1, N) # radius
t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j
# visualize the data
plt.scatter(X[:,0],X[:,1], c=y, s=40, cmap=plt.cm.Spectral)


Loss function

def L_ivectorized(x, y, w):
scores = w.dot(x)
margins = np.maximum(0, scores - scores[y] +1)
margins[y] = 0
loss_i = np.sum(margins)
return loss_i


“Sigmoid Function,” also called the “Logistic Function”:

### Recurrent Neural Networks

Goals

• Implement recurrent networks, and apply them to image captioning on Microsoft COCO.
• Understand the architecture of recurrent neural networks (RNNs) and how they operate on sequences by sharing weights over time
• Understand the difference between vanilla RNNs and Long-Short Term Memory (LSTM) RNNs
• Understand how to sample from an RNN at test-time
• Understand how to combine convolutional neural nets and recurrent nets to implement an image captioning system
• Understand how a trained convolutional network can be used to compute gradients with respect to the input image
• Implement and different applications of image gradients, including saliency maps, fooling images, class visualizations, feature inversion, and DeepDream.

#### Continuous Visualization of Layers

• A linear transformation by the “weight” matrix
• A translation by the vector
• Point-wise application of e.g., tanh.

## ML Applications

### Audio: Human Speech and Language Processing

polyphonic sound event detection in real life recordings

Motivation It is a way of x, which is critical for x, and x.

Problem statement The core problem studies is:

Given x, where x, and we want to compute/debug x.

Expressions and interpretation of the problem

## Tools for ML

“if a human expert couldn’t use the data to solve the problem manually, a computer probably won’t be able to either. Instead, focus on problems where a human could solve the problem, but where it would be great if a computer could solve it much more quickly.”

## Summary

Historical Context

• cat & edges, how brain neurons work, 1981, 1963
• 1966 A.I. Computer Vision

Some Basic Concepts

• Supervised learning
• Nearest neighbor
• Linear classifiers
• Linear regression with one/multiple variable
• Logistic Regression
• Support vector machines
• Naive Bayes classifer
• Support vector machines
• Decision trees
• C4.5
• Random forests
• CART
• Hidden Markov models
• Artificial neural network
• Hopfield networks
• Boltzmann machines
• Restricted Boltzmann machines
• Random Forest
• Conditional Random Field
• ANOVA
• BayEsian network
• Unsupervised learning
• Artificial neural network
• self-organizing map
• Clustering
• K-means clustering
• Fuzzy clustering
• Deep learning
• deep belief networks
• deep Boltzmann machines
• deep Convolutional neural networks
• deep Recurrent neural networks
• Semi-supervised learning (…)
• Reinforcement learning (…)

Organization may vary across subjects, this list is mainly from Coursera (ML basics) and wiki (ML concepts)

ML by Outputs

• Classification
• Regression
• Probability Estimation
• Clustering
• Dimensionality Reduction