Notes on Machine Learning with NN
A quick review of related concepts.
Table of Contents:
 Basics
 Classification
 ML Applications
 [ML Tools][#MLTools]
 [Tensorflow in a Nutshell: a Reading][#Tensorflow]
 Summary
Basics
Questions: what is machine learning?
The fundamental goal of machine learning is to generalize beyond the examples in the training set.
(Machine learning is) the field of study that gives computers the ability to learn without being explicitly programmed – Arthur Samuel
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. – Tom Mitchell
Key Elements of ML Algorithm
 Representation: how to represent knowledge. Examples include decision trees, sets of rules, instances, graphical models, neural networks, support vector machines, model ensembles and others.
 Evaluation: the way to evaluate candidate programs (hypotheses). Examples include accuracy, prediction and recall, squared error, likelihood, posterior probability, cost, margin, entropy kL divergence and others.
 Optimization: the way candidate programs are generated known as the search process. For example combinatorial optimization, convex optimization, constrained optimization.
Model Representation
A typical machine learning model:
x^{(i)}  input variables, input features 
y^{(i)}  output or target variable, to be predicted 
(x^{(i)}, y^{(i)})  a training example 
(x^{(i)}, y^{(i)}); (i = 1, …, m)  a training set 
X  the input space 
Y  the output space 
h: X \rightarrow Y  a hypothesis function 
Cost Function: how well does the hypothesis function fit the input data?
** Cost function ** measures the accuracy of the hypothesis function, e.g. mean square error function. Our objective is to get the lowest prediction error.
Parameter Learning via Gradient Descent
** Gradient descent ** estimates the parameters in the hypothesis function, by taking the (partial) derivatives of the the cost function. It repeatedly and simultaneously updates the parameters until convergence.
# compute gradients w.r.t. existing parameters
grad = function_SGD(x, w1, w2)
# compute new parameters
w1_ = w1  step_size*grad
w2_ = w2  step_size*grad
# update parameters
w1 = w1_
w2 = w2_
Difficulties:
 step size, i.e. learning rate
 gradient descent method, e.g. batch, online, stochastic
Classification
Goals
 build a simple image classification pipeline, based on the kNearest Neighbor or the SVM/Softmax classifier.
 understand the basic Image Classification pipeline and the datadriven approach (train/predict stages)
 understand the train/val/test splits and the use of validation data for hyperparameter tuning.
 develop proficiency in writing efficient vectorized code with numpy
 implement and apply a kNearest Neighbor (kNN) classifier
 implement and apply a Multiclass Support Vector Machine (SVM) classifier
 implement and apply a Softmax classifier
 implement and apply a Two layer neural network classifier
 understand the differences and tradeoffs between these classifiers
 get a basic understanding of performance improvements from using higherlevel representations than raw pixels (e.g. color histograms, Histogram of Gradient (HOG) features)
Nearest Neighbor Classifier
Clustering is a type of unsupervised learning: input data samples have no output labels. For example, KNearest Neighbor (KNN)
KNN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before kNN can work well. They are very expensive to train, but once the training is finished it is very cheap to classify a new test example. This mode of operation is much more desirable in practice.
1 Nearest Neighbor classifier
import numpy as np
# input
Xtr, Ytr, Xte, Yte = loadData('path') # magic function
# training
nn = NearestNeighbor() # create a Nearest Neighbor classifier class
nn.train(Xtr, Ytr) # train the classifier on the training data and labels
# testing
Yte_predict = nn.predict(Xte) # predict labels on the test data
# print the classification accuracy: average number of examples that are correctly predicted, label matches
print 'accuracy: %f' % (np.mean(Yte_predict == Yte))
# the classifier
class NearestNeighbor(object):
def __init__(self):
pass
def train(self, X, y):
""" X is N x D where each row is an example. Y is N x 1 """
# the nearest neighbor classifier simply remembers ALL the training data
slef.Xtr = X
self.ytr = y
def predict(self, X):
""" X is N x D where each row is an example, which we wish to predict label for """
num_test = X.shape[0]
# make sure the output type matches the input type
Ypred = np.zeros(num_test, dtype = self.ytr.dtype)
# loop over all test rows
for i in xrange(num_test):
# find the nearest training image to the i'th test image
# using the L1 distance (sum of absolute value differences)
distances = np.sum(np.abs(self.Xtr  X[i,:]), axis = 1)
## or using the L2 distance
# distance = np.sqrt(np.sum(np.square(self.Xtr  X[i,:]), axis = 1))
min_index = np.argmin(distances) # get the index with the smallest distance
Ypred[i] = self.ytr[min_index] # predict the label of the nearest example
return Ypred
K Nearest Neighbor classifier
Xval = Xtr[:1000, :] # take the first 1000 for validation
Yval = Ytr[:1000]
Xtr = Xtr[1000:, :] # keep last 49,000 for train
Ytr = Ytr[1000:]
# find Hyperparameters that work best on the validation set
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
# use a particular value of k and evaluate on validation data
nn = NearestNeighbor()
nn.train(Xtr, Ytr)
# assume a modified NearestNeighbor class that can take k as input
Yval_predict = nn.predict(Xval, k = k)
acc = np.mean(Yval_predict == Yval)
print 'accuracy: %f' % (acc,)
# keep track of what works on the validation set
validation_accuracies.append((k, acc))
Disadvantages of kNearest Neighbor
Linear Classifier
Three core components of the (image) classification task: Score Function maps the raw image pixels to class scores (e.g. a linear function) Loss function measures the quality of a certain set of parameters (e.g. softmax vs SVM) Optimization finds the set of parameters W that minimize the loss function
3 key components in (image) classification task:

 A (parameterized) score function maps the raw image pixels to class scores (e.g., a linear function )

 A loss function L measures the quality of a particular set of parameters using certain scores (e.g., Softmax/SVM)

 An optimization function finds the set of parameters W that minimize the loss function
Score Function
binary Softmax classifier (also known as logistic regression)
Loss Function
loss function measures our unhappiness with outcomes, sometimes also referred to as the cost function or the objective function.
Multiclass SVM
Code for the loss function, without regularization, in python, both unvectorized and halfvectorized form:
def L_i(x, y, W:
"""
unvectorized version.
 x
 y
 W
"""
delta =
score =
def L_i_vectorized(x, y, W):
"""
halfvectorized implementation, where for a single example the implementation contains no for loops, but there is still one loop over the examples outside this function
"""
delta =
scores =
margins =
margins[y] =
loss_i =
return loss_i
def L(X, y, W):
"""
fullyvectroized implementation:
 X
 y
 W
"""
delta =
score =
Softmax
Compare
$ \sum_{\forall i}{x_i^{2}} $
\(P(y = 1 \mid x; w,b) = \sigma(w^Tx + b)\), where \(\sigma(z) = 1/(1+e^{z})\)
training dataset \(x_i\), y_{i})\), where $ i = 1 \cdots N $
N examples each with dimensionality D K distinct categories
weights, parameters
Data preprocessing: mean subtraction, scale, [1, 1]
Validation sets for Hyperparameter tuning
Optimization
Strategy 1: Random search: try out many different random weights and keep track of what works best.
# X_train is the data where each column is an example (e.g. 3073 x 50,000)
# Y_train are the labels (e.g. 1 x 50,000)
# L evaluates the loss function
bestloss = float("inf") # Python assigns the highest possible float value
for num in xrange(1000):
W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
loss = L(X_train, Y_train, W) # get the loss over the entir training set
if loss < bestloss: #track the best solution
bestloss = loss
bestW = W
print 'In attemp %d the loss was %f, best %f' % (num, loss, bestloss)
# Assume X_test is [3073 x 10000], Y_test is [10000 x 1]
scores = Wbest.dot(Xte_cols) # 10 x 10000, the class scores for all test examples
# find the index with max score in each column, i.e. the predicted class
Yte_predict = np.argmax(scores, axis = 0)
# calculate accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Yte)
# returns 0.1555
Strategy 2: Random local search: tries to extend one foot in a random direction and take a step only if it leads downhill.
 Start out with a random W
 generate random perturbation , if the loss at the perturbed is lower, then updata
W = np.random.randn(10, 3073) * 0.001 # generate random starting W
bestloss = float("inf")
for i in xrange(1000):
step_size = 0.0001
Wtry = W + np.random.randn(10, 3073) * step_size
loss = L(Xtr_cols, Ytr, Wtry)
if loss < bestloss:
W = Wtry
bestloss = loss
print 'iter %d loss is %f' % (i, bestloss)
# returns 0.214
Strategy 3: Following the gradient: computes the best direction along which we should change our weight vector that is mathematically guaranteed to be the direction of the steepest descent. This direction will be related to the gradient of the loss function.
What is the gradient?
 1D functions, derivatives, the gradient is the vector of slopes for each dimension
 xD functions, partial derivatives, the gradient is the vector of partial derivatives in each dimension
How to compute the gradient?
 numerical gradient: compute numerically with finite differences
def eval_numerical_gradient(f, x):
"""
a naive implementation of numerical gradient of f at x
 f is a function that takes a single argument
 x is the point (numpy array) to evaluate the gradient at
"""
fx = f(x) # evaluate function value at original point
grad = np.zeros(x.shape)
h = 0.00001
# iterate over all indexes in x
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
# evaluate function at x+h
ix = it.multi_index
old_value = x[ix]
x[ix] = old_value + h # increment by h
fxh = f(x) # evaluate f(x+h)
x[ix] = old_value # restore to previous value (!!!important)
# compute the partial derivative
grad[ix] = (fxh fx) /h # the slope
it.iternext() # step to next dimension
return grad
Compute gradient for the CIFAR10 loss function at some random point in the weight space:
# a function that takes a single argument, the weights,
def CIFAR10_lossfun(W):
return L(X_train, Y_train, W)
W = np.random.rand(10, 3073) * 0.001 # random weight vector
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient
# the gradient tells the slope of the loss function along every dimension, we use it to make an update
loss_original = CIFAR10_Loss_fun(W) # original loss
print 'original loss: %f' % (loss_original, )
# effect of multiple step sizes
for step_size_log in [10:1:1]
step_size = 10**step_size_log
W_new = W  step_size*df
loss_new = CIFAR10_loss_fun(W_new)
print 'for step size %f new loss: %f' % (step_size, loss_new)
 analytic gradient: compute analytically with Calculus
gradient descent minibatch GD, stochastic GD (i.e. online gradient descent)
Linear Regression with One/Multiple Variables
Linear regression predicts a realvalued output based on an input value.
cost function
gradient descent for learning.
i.e.,
Logistic Regression
Nonlinear Classification with Neural Networks
Goals
 write backpropagation code, and train Neural Networks and Convolutional Neural Networks.
 understand Neural Networks and how they are arranged in layered architectures
 understand and be able to implement (vectorized) backpropagation
 implement various update rules used to optimize Neural Networks
 implement batch normalization for training deep networks
 implement dropout to regularize networks
 effectively crossvalidate and find the best hyperparameters for Neural Network architecture
Questions
 what a neural network is really doing, behavior of deep neural networks
 how it is doing so, when succeeds; and what went wrong when fails.
 explore lowdimensional deep neural networks when classifying certain datasets
 visualise the behavior, complexity, and topology of such networks
 On an example dataset, with each layer, the network transforms the data, creating a new representation
Architecture
Biological motivation
# sigmoid activation function
def sigmoid(x):
return 1.0/(1.0 + math.exp(x))
class Neuron(object):
# example of forward propagating a single neuron
def forward(inputs):
""" assume inputs and weights are 1D numpy arrays and bia is a number """
cell_body_sum = np.sum(inputs * self.weights) + self.bias
firing_rate = sigmoid(cell_synapse)
return firing_rate
Data and Loss Function
Learning and Evaluation
Case Study
Problem setting 1: linear dataset
1x + 2y + 3 z = 12
2x + 3y + 4z = 11
3x + 4y + 5z = 10
Formulate with numpy
import numpy as np
A*X = b
A = np.array(np.mat(‘1 2 3; 2 3 4; 3 4 5’))
A = np.array([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
X = np.transpose(np.array([x, y, z]))
# visualize the data
TBD
Problem setting 2: nonlinear (spiral) dataset
import numpy as np
N = 100 # number of points per class/spiral wing
D = 2 # dimensionality
K = 5 # number of spiral wings
X = np.zeros((N*K)) # data matrix (each row is a single training example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in xrange(K):
ix = range(N*j, N*(j+1))
r = np.linspace(0.0, 1, N) # radius
t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j
# visualize the data
plt.scatter(X[:,0],X[:,1], c=y, s=40, cmap=plt.cm.Spectral)
Loss function
def L_ivectorized(x, y, w):
scores = w.dot(x)
margins = np.maximum(0, scores  scores[y] +1)
margins[y] = 0
loss_i = np.sum(margins)
return loss_i
“Sigmoid Function,” also called the “Logistic Function”:
Convolutional Neural Networks
Recurrent Neural Networks
Goals
 Implement recurrent networks, and apply them to image captioning on Microsoft COCO.
 Understand the architecture of recurrent neural networks (RNNs) and how they operate on sequences by sharing weights over time
 Understand the difference between vanilla RNNs and LongShort Term Memory (LSTM) RNNs
 Understand how to sample from an RNN at testtime
 Understand how to combine convolutional neural nets and recurrent nets to implement an image captioning system
 Understand how a trained convolutional network can be used to compute gradients with respect to the input image
 Implement and different applications of image gradients, including saliency maps, fooling images, class visualizations, feature inversion, and DeepDream.
Continuous Visualization of Layers
 A linear transformation by the “weight” matrix
 A translation by the vector
 Pointwise application of e.g., tanh.
ML Applications
Toy Example
Audio: Human Speech and Language Processing
polyphonic sound event detection in real life recordings
Motivation It is a way of x, which is critical for x, and x.
Problem statement The core problem studies is:
Given x, where x, and we want to compute/debug x.
Expressions and interpretation of the problem
Vision: Image and Video Processing
Tools for ML
“if a human expert couldn’t use the data to solve the problem manually, a computer probably won’t be able to either. Instead, focus on problems where a human could solve the problem, but where it would be great if a computer could solve it much more quickly.”
Tensorflow in a Nutshell: a Reading
Summary
Historical Context
 cat & edges, how brain neurons work, 1981, 1963
 1966 A.I. Computer Vision
Some Basic Concepts
ML by Tasks
 Supervised learning
 Nearest neighbor
 Linear classifiers
 Linear regression with one/multiple variable
 Logistic Regression
 Support vector machines
 Naive Bayes classifer
 Support vector machines
 Decision trees
 C4.5
 Random forests
 CART
 Hidden Markov models
 Artificial neural network
 Hopfield networks
 Boltzmann machines
 Restricted Boltzmann machines
 Random Forest
 Conditional Random Field
 ANOVA
 BayEsian network
 …
 Unsupervised learning
 Artificial neural network
 selforganizing map
 Clustering
 Kmeans clustering
 Fuzzy clustering
 …
 Artificial neural network
 Deep learning
 deep belief networks
 deep Boltzmann machines
 deep Convolutional neural networks
 deep Recurrent neural networks
 Semisupervised learning (…)
 Reinforcement learning (…)
Organization may vary across subjects, this list is mainly from Coursera (ML basics) and wiki (ML concepts)
ML by Outputs
 Classification
 Regression
 Probability Estimation
 Clustering
 Dimensionality Reduction
Further Reading