
Machine Learning Basics

Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions, and a decreased emphasis on proving confidence intervals around these functions.

Learning Algorithms

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

| Term | Description |
| --- | --- |
| Task, T | ML tasks are described in terms of how the ML system should process an example. An example is a collection of features, quantitatively measured from some event or object that the ML model will process. Common tasks: 1. Classification - which of k categories does the input belong to? 2. Classification with missing inputs - the model must learn a set of functions, each mapping a vector with a different subset of its inputs missing to a categorical output. 3. Regression - predict a numerical value given some input. |
| Performance Measure, P | 1. Measure the accuracy or error rate of the model. 2. Evaluate P on a test set kept completely separate from the training set. |
| Experience, E | Based on E, ML algorithms are classified as unsupervised or supervised learning algorithms. |

Training and Testing Data

Consider the scenario below: on the left we have a simple solution that classifies two categories of points, while the solution on the right is more complex. Which one is the better solution?



We split our data into two sets: training data and testing data.
We fit the classification boundary using only the training data (without ever looking at the testing data), and then check how well the prediction performs on the testing data. The dots with a white center are testing data.



Looking at the scenario above, it is clear that the first solution is the better one, since it classifies more of the testing points correctly.
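As a concrete illustration of the split described above, here is a minimal NumPy sketch (the function name, split fraction, and toy data are illustrative assumptions, not part of the original notes) that shuffles a dataset and holds out a fraction of it for testing:

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split a dataset into training and testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle example indices
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data: 100 two-dimensional points with binary labels.
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train, X_test, y_test = train_test_split(X, y)
```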

Linear Regression

  • Goal: build a system that takes a vector x ∈ ℝⁿ as input and predicts the value of a scalar y ∈ ℝ as output. The output is defined as ŷ = wᵀx + b, where b is the intercept term.

  • The term b is often called the bias parameter, in the sense that the output of the transformation is biased toward being b in the absence of any input.
  • w ∈ ℝⁿ is the vector of parameters/weights that determines how each feature affects the prediction.
  • The model's performance can be measured by the Mean Squared Error (MSE) on the test set, MSE(test) = (1/m) Σ_i (ŷ(test)_i − y(test)_i)².

  • Improve the ML model by adjusting w so as to reduce MSE(test), while only being allowed to observe the training set.
  • To get the optimal weights w, minimize MSE(train), i.e. solve for where its gradient equals 0. This gives the normal equations, whose solution is w = (XᵀX)⁻¹Xᵀy.
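A minimal NumPy sketch of this procedure (the helper names and toy data are assumptions for illustration): fit w by solving the normal equations on a training portion, then report the MSE on a held-out portion.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y (bias handled by an all-ones feature)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # the extra all-ones column plays the role of b
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def mse(X, y, w):
    """Mean squared error of the predictions y_hat = w^T x + b."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.mean((Xb @ w - y) ** 2)

# Toy data generated from y = 3x + 2 plus noise; fit on the first 150 points, test on the rest.
X = np.random.randn(200, 1)
y = 3 * X[:, 0] + 2 + 0.1 * np.random.randn(200)
w = fit_linear_regression(X[:150], y[:150])
print("MSE(train):", mse(X[:150], y[:150], w))
print("MSE(test) :", mse(X[150:], y[150:], w))
```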

Capacity, Overfitting and Underfitting

Consider the following scenario to understand overfitting and underfitting -

Scenario 1: Trying to kill Godzilla with a fly-trap => here we are over-simplifying the problem. This is called underfitting.
Scenario 2: Trying to kill a fly with a bazooka => here we are over-complicating the solution, which also leads to bad results. This is known as overfitting.
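To see the same effect numerically, here is a small NumPy sketch (the target function, noise level, and polynomial degrees are arbitrary assumptions) that fits polynomials of increasing degree to a few noisy points: the low-degree fit underfits (high training and test error), while the very high-degree fit overfits (near-zero training error, large test error).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 10))
y = np.sin(3 * x) + 0.1 * rng.standard_normal(10)   # curved target plus noise
x_test = np.linspace(-1, 1, 100)
y_test = np.sin(3 * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```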



In terms of neural networks, overfitting and underfitting can be classified as follows -



  • Generalization - the ability to perform well on previously unobserved inputs.

  • Generalization/Test-Error - Expected value of error on new input.

  • To improve the model's performance, reduce the test error. For linear regression, the test error is MSE(test) = (1/m) ‖X(test)w − y(test)‖².

  • The set of i.i.d. assumptions describes the data-generating process with a probability distribution over a single example.

  • The i.i.d. assumptions state that the examples in each dataset are independent of each other, and that the training set and test set are identically distributed.

  • Under these assumptions, the expected training error of a randomly sampled model equals its expected test error.

  • The factors that determine how well an ML model will perform are its ability to:

    • make the training error small
    • make the gap between training and test error small
  • Underfitting - the model's inability to achieve a sufficiently low training error.

  • Overfitting - gap between training and testing error is too large.

  • Capacity - model's capacity is its ability to fit a wide variety of functions.

  • Hypothesis space - set of functions that the learning algorithm is allowed to select as being the solution.

  • ML algos perform best when their capacity is appropriate for the true complexity of the task.

  • Representational Capacity - the family of functions the model can choose from, determined by the choice of model and its hypothesis space.

  • Effective Capacity <= Representational Capacity, because imperfections of the optimization algorithm can prevent the learner from finding the best function in its hypothesis space.

  • Bayes Error - the error incurred by an oracle making predictions from the true distribution p(x, y); it is the minimum achievable error and comes from noise in that distribution.

  • No Free Lunch Theorem - ML aims to find rules that are probably correct about most members of a set, rather than rules that are certainly correct about every member of the set.
    The theorem states that, averaged over all possible data-generating distributions, no ML algorithm is universally better than any other.

  • Weight Decay - we can give a learning algorithm a preference for certain functions in its hypothesis space. For example, linear regression can be modified to include weight decay by minimizing the criterion J(w) = MSE(train) + λwᵀw, which expresses a preference for weights with a smaller L² norm (see the sketch after this list).

    • λ is a value chosen ahead of time that controls the strength of our preference for smaller weights.
    • λ = 0 imposes no preference.
    • A larger λ forces the weights to become smaller.
    • In general, we can regularize a model that learns a function f(x; θ) by adding a penalty, called a regularizer, to the cost function.
  • Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
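A minimal NumPy sketch of linear regression with weight decay, assuming the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy (the toy data and λ values are arbitrary):

```python
import numpy as np

def fit_weight_decay(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w; closed form: w = (X^T X + lam * I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Toy data: a larger lambda shrinks the weight vector toward zero.
X = np.random.randn(50, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(50)
for lam in (0.0, 0.1, 10.0):
    w = fit_weight_decay(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```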

Hyperparameters and Validation Sets

  • Hyper-parameters -

    • Settings of the learning algorithm that are used to control its behavior and are not adapted by the algorithm itself.
    • Sometimes a setting is made a hyperparameter because it is not appropriate to learn it on the training set; in particular, hyperparameters that control model capacity cannot be learned on the training set, since that would always select the maximum possible capacity, resulting in overfitting.
  • Validation set - to solve this problem, we need a set of examples that the training algorithm has not observed. This set, constructed from the training data, is called the validation set, and it is used to estimate generalization error and to select hyperparameters (see the sketch below).
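A minimal sketch of hyperparameter selection with a validation set, reusing the weight-decay fit from the previous section (the 80/20 split, λ grid, and toy data are assumptions for illustration):

```python
import numpy as np

def fit_weight_decay(X, y, lam):
    """Weight decay solution w = (X^T X + lam * I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Carve a validation set out of the training data.
X = np.random.randn(200, 10)
y = X @ np.random.randn(10) + 0.5 * np.random.randn(200)
X_tr, y_tr, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

# Select the capacity-controlling hyperparameter by validation error, never by training error.
best_lam, best_err = None, np.inf
for lam in (0.0, 0.01, 0.1, 1.0, 10.0):
    w = fit_weight_decay(X_tr, y_tr, lam)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_lam, best_err = lam, val_err
print("selected lambda:", best_lam)
```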

Estimators, Bias, Variance

Estimators

Point Estimator

  • Point estimation is the attempt to provide the single best prediction of some quantity of interest.
  • The point estimate of a parameter θ is denoted θ̂.

  • A good estimator is a function whose output is close to the true value θ that generated the training data.

Function Estimator


  • In function estimation, we approximate f with a model or estimate f̂.
  • The function estimator f̂ is simply a point estimator in function space.

Bias

  • The bias of an estimator is defined as bias(θ̂_m) = E[θ̂_m] − θ.

  • An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0.
  • An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0.

  • The sample mean is an unbiased estimator of the mean of a Gaussian, and the sample mean of Bernoulli observations is an unbiased estimator of the Bernoulli parameter.
  • The estimator σ̂² = (1/m) Σ_i (x(i) − μ̂)² of the variance of a Gaussian is biased, with bias −σ²/m.
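A small simulation sketch of these two facts (the distribution parameters and number of trials are arbitrary assumptions): averaging each estimator over many resampled datasets shows the sample mean is unbiased while the 1/m variance estimator is biased by roughly −σ²/m.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 0.0, 1.0, 10, 100_000

# Draw many independent datasets of size m and average each estimator over them.
x = rng.normal(mu, sigma, size=(trials, m))
sample_means = x.mean(axis=1)        # unbiased estimator of mu
biased_vars = x.var(axis=1)          # np.var uses 1/m by default, i.e. the biased estimator

print("bias of sample mean  :", sample_means.mean() - mu)          # ~ 0
print("bias of 1/m variance :", biased_vars.mean() - sigma**2)     # ~ -sigma^2/m = -0.1
```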

Variance and Standard Error

  • Variance/Standard Error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process.
  • Lower the variance, the better.
  • Central Limit Theorem - the sample mean is approximately normally distributed, which lets us compute confidence intervals centered on the estimate using the standard error.
  • The standard error of the sample mean is given by SE(μ̂_m) = √(Var[(1/m) Σ_i x(i)]) = σ/√m, where σ² is the true variance of the samples.
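A quick check of the σ/√m formula by resampling (σ, m, and the number of trials are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, trials = 2.0, 25, 100_000

# Standard deviation of the sample mean across independently resampled datasets.
sample_means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)
print("empirical SE  :", sample_means.std())
print("sigma/sqrt(m) :", sigma / np.sqrt(m))   # theoretical standard error of the mean
```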

Trading off Bias and Variance to Minimize MSE

  • To choose between two estimators (one with larger bias, the other with larger variance), the most common way to negotiate this trade-off is cross-validation.

  • The MSE of an estimator in terms of bias and variance is MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m).

  • Increasing capacity -> increased variance, decreased bias.

  • As capacity increases, bias tends to fall while variance tends to rise, so the generalization error traces a U-shaped curve as a function of capacity.

Consistency

  • As the number of data points m in the dataset increases, we expect our point estimates to converge to the true values of the corresponding parameters.

  • Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows.

Maximum Likelihood Estimation

Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

  • The maximum likelihood estimator for θ is defined as θ_ML = argmax_θ p_model(X; θ) = argmax_θ Π_i p_model(x(i); θ).

  • Taking the log (which does not change the argmax) and dividing by m to convert the sum into an expectation, we get θ_ML = argmax_θ E_{x∼p̂_data}[log p_model(x; θ)].

  • Maximum likelihood can be seen as minimizing the dissimilarity between the empirical distribution p̂_data defined by the training set and the model distribution.

    • KL Divergence measures degree of dissimilarity between the two.
    • Minimizing KL divergence implies minimizing cross-entropy between distributions.
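As a concrete instance, here is a sketch of maximum likelihood for a univariate Gaussian, where the closed-form maximizers of the log-likelihood are the sample mean and the 1/m sample variance (the toy data and helper names are assumptions):

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form maximum likelihood estimates for a univariate Gaussian."""
    mu_hat = x.mean()                     # argmax of the log-likelihood with respect to mu
    var_hat = np.mean((x - mu_hat) ** 2)  # MLE of the variance (uses 1/m, hence biased)
    return mu_hat, var_hat

def avg_log_likelihood(x, mu, var):
    """(1/m) * sum_i log N(x_i; mu, var) -- the expectation form of the ML criterion."""
    return np.mean(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

x = np.random.default_rng(0).normal(3.0, 1.5, 500)
mu_hat, var_hat = gaussian_mle(x)
print(mu_hat, var_hat, avg_log_likelihood(x, mu_hat, var_hat))
```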

Conditional Log-Likelihood and Mean Squared Error

If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator, assuming the examples are i.i.d., is given by θ_ML = argmax_θ Σ_i log P(y(i) | x(i); θ).

Properties of Maximum Likelihood Estimator

Under appropriate conditions, the maximum likelihood estimator has the property of consistency; these conditions are -

  • The true distribution p_data must lie within the model family p_model(·; θ).

  • The true distribution p_data must correspond to exactly one value of θ.

  • Statistical Efficiency - one consistent estimator may obtain lower generalization error than another for a fixed number of samples m, or equivalently, may require fewer examples to obtain a fixed level of generalization error.

Bayesian Statistics

So far, we have estimated a single value of θ and used it to make all predictions. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

  • The Bayesian uses probability to reflect degrees of certainty in states of knowledge.

  • Given a set of examples {x(1), …, x(m)}, we can recover the effect of the data on our belief about θ by combining the data likelihood p(x(1), …, x(m) | θ) with the prior p(θ) via Bayes' rule: p(θ | x(1), …, x(m)) = p(x(1), …, x(m) | θ) p(θ) / p(x(1), …, x(m)).

  • The maximum likelihood approach addresses the uncertainty in a given point estimate of θ by evaluating its variance, whereas Bayesian statistics addresses uncertainty by integrating over all values of θ when making predictions.

  • The prior has an influence by shifting probability mass (or density) toward regions of the parameter space that are preferred a priori.

  • Bayesian methods typically suffer a high computational cost when the number of training examples is large.
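The classic conjugate Bernoulli/Beta pair gives a self-contained sketch of this recipe (the prior parameters and data are illustrative assumptions, and this example is not taken from the original notes): the posterior is available in closed form, and prediction integrates over θ rather than committing to one point estimate.

```python
import numpy as np

# Prior Beta(a, b): a mild preference for theta near 0.5.
a, b = 2.0, 2.0
data = np.array([1, 0, 1, 1, 1, 0, 1, 1])
k, m = int(data.sum()), len(data)

# Posterior after observing k ones in m trials: Beta(a + k, b + m - k).
a_post, b_post = a + k, b + (m - k)

# Bayesian prediction integrates over theta: p(next = 1 | data) = E[theta | data].
p_next_one = a_post / (a_post + b_post)
theta_mle = k / m                 # maximum likelihood point estimate, for comparison
print("Bayesian predictive p(next=1):", p_next_one)
print("ML point estimate            :", theta_mle)
```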

Maximum a Posteriori (MAP) Estimation

  • Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate.

  • One rational way to do this is to choose the maximum a posteriori (MAP) point estimate, which (in the common case of continuous θ) selects the point of maximal posterior probability density: θ_MAP = argmax_θ p(θ | x) = argmax_θ [log p(x | θ) + log p(θ)].

  • MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms; for example, a Gaussian prior on the weights of linear regression yields the weight decay penalty λwᵀw.
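A minimal sketch of that correspondence, assuming unit noise variance and a zero-mean Gaussian prior on the weights (the names and toy data are illustrative): the MAP estimate coincides with the weight-decay solution from earlier.

```python
import numpy as np

def map_linear_regression(X, y, lam=1.0):
    """MAP estimate of w for y ~ N(Xw, I) with a Gaussian prior w ~ N(0, (1/lam) I).

    Maximizing log p(y | X, w) + log p(w) reduces to minimizing
    ||Xw - y||^2 + lam * ||w||^2, i.e. the weight decay criterion seen earlier.
    """
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X = np.random.randn(40, 3)
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * np.random.randn(40)
print(map_linear_regression(X, y, lam=1.0))
```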

Supervised Learning Algorithms

Supervised learning algorithms are learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y.

Probabilistic Supervised Learning

  • Most supervised learning is based on estimating a probability distribution p(y | x), usually by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

  • Logistic Regression -

    • The normal distribution over real-valued numbers, used for linear regression, is parametrized in terms of its mean.
    • A distribution over a binary variable is slightly more complicated, since its mean must always lie between 0 and 1.
    • This problem is solved by using the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpreting that value as a probability: p(y = 1 | x; θ) = σ(θᵀx).
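A minimal sketch of logistic regression trained by gradient ascent on the conditional log-likelihood (the learning rate, step count, and toy data are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, steps=2000):
    """Maximize sum_i log p(y_i | x_i; theta) by gradient ascent on the average log-likelihood."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ theta)                 # p(y = 1 | x; theta)
        theta += lr * Xb.T @ (y - p) / len(y)   # gradient of the average conditional log-likelihood
    return theta

X = np.random.randn(200, 2)
y = (X[:, 0] - X[:, 1] > 0).astype(float)
theta = fit_logistic_regression(X, y)
preds = sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ theta) > 0.5
print("training accuracy:", np.mean(preds == y))
```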

Support Vector Machines

  • One of the most influential approaches to supervised learning problems.
  • Similar to logistic regression in that it is driven by a linear function wᵀx + b.
  • Does not provide probabilities; it only predicts class identity (the positive class when wᵀx + b is positive, the negative class when it is negative).

Kernels

  • Many machine learning algorithms can be written exclusively in terms of dot products between examples.
  • The linear function used by the SVM can be rewritten as wᵀx + b = b + Σ_i α_i xᵀx(i), where
    • x(i) is a training example and α is a vector of coefficients.
    • We can then replace x with the output of a given feature function φ(x).
  • The resulting dot product k(x, x(i)) = φ(x) · φ(x(i)) is called a kernel.
  • The dot operator here is analogous to the ordinary dot product φ(x)ᵀφ(x(i)).
  • Predictions are made using the function f(x) = b + Σ_i α_i k(x, x(i)), as sketched below.
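A sketch of that kernel prediction function with a Gaussian (RBF) kernel; note that the α values and b below are placeholders rather than the result of actually solving the SVM training problem.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """Gaussian (RBF) kernel: the dot product of an implicit, infinite-dimensional feature map."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def kernel_predict(x, X_train, alpha, b, kernel=rbf_kernel):
    """f(x) = b + sum_i alpha_i * k(x, x_i); only examples with nonzero alpha_i contribute."""
    return b + sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train) if a != 0.0)

# Placeholder values: in a real SVM, alpha and b come from solving the training problem.
X_train = np.random.randn(5, 2)
alpha = np.array([0.7, 0.0, -0.3, 0.0, 0.5])   # mostly zeros -> a sparse support-vector expansion
print(kernel_predict(np.zeros(2), X_train, alpha, b=0.1))
```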

Advantages/Disadvantages of Kernel

  • Enables us to learn models that are nonlinear as a function of x using convex optimization techniques.
  • The kernel function k is often far more computationally efficient than naively constructing two φ(x) vectors and explicitly taking their dot product.
  • However, the cost of evaluating the decision function can be high, since the i-th training example contributes a term α_i k(x, x(i)) to it, and this cost grows with the number of training examples.
  • SVMs are able to mitigate the above problem by learning an α vector that contains mostly zeros.
  • Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero α_i. Such training examples are known as support vectors.

Unsupervised Learning Algorithms

  • Algorithms that experience only the features of a dataset, but not a supervision signal.

Principal Component Analysis (PCA)

  • PCA is an unsupervised learning algorithm that learns a representation of the data.
  • PCA can also be used for dimensionality reduction, preserving as much information about the data as possible.
  • Assume an m × n design matrix X with zero mean, E[x] = 0.
  • The unbiased sample covariance matrix is given by Var[x] = (1/(m−1)) XᵀX.
  • PCA finds a representation through a linear transformation, z = Wᵀx.
  • The principal components of the design matrix are given by the eigenvectors of XᵀX.
  • The principal components can also be obtained via the SVD: let W be the right singular vectors in the decomposition X = UΣWᵀ; the original eigenvector equation is recovered as XᵀX = WΣ²Wᵀ.
  • The variance of X can now be expressed as Var[x] = (1/(m−1)) WΣ²Wᵀ.
  • The covariance of z is diagonal: Var[z] = (1/(m−1)) Σ². A sketch of this computation follows below.
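A minimal NumPy sketch of PCA via the SVD (the correlated toy data are an arbitrary assumption): project the centered design matrix onto the right singular vectors and confirm that the covariance of z is diagonal and equal to Σ²/(m−1).

```python
import numpy as np

# Correlated toy data; PCA assumes the design matrix has zero mean.
X = np.random.randn(500, 3) @ np.array([[2.0, 0.0, 0.0],
                                        [0.5, 1.0, 0.0],
                                        [0.0, 0.3, 0.2]])
X = X - X.mean(axis=0)

# SVD of the design matrix: X = U diag(s) W^T; the columns of W are the principal components.
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

Z = X @ W                               # new representation z = W^T x for every example
cov_z = (Z.T @ Z) / (len(X) - 1)        # covariance of z: diagonal, equal to diag(s**2) / (m - 1)
print(np.round(cov_z, 3))
print(np.round(s**2 / (len(X) - 1), 3))
```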