Machine Learning & Data Science
Great Introductory ML Course
Includes a decent amount of math foundation, practice Jupyter notebooks, and a Kaggle competition.
Performance Measures
- Allow you to evaluate your machine learning model and the results or predictions made by it
Confusion Matrix
- The rows of the matrix correspond to the predictions made by the model and the columns correspond to the actual known data
- The matrix shows the True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN)
- Confusion matrices generated by different machine learning models on the same data can then be used to choose which model makes the best predictions
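As a minimal sketch with made-up labels, a confusion matrix can be computed with scikit-learn (note that `confusion_matrix` uses the opposite orientation to the notes above: its rows are the actual classes and its columns are the predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn convention: rows = actual classes, columns = predictions
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```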
Receiver Operating Characteristic (ROC)
- An ROC graph plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate); when classes are imbalanced, precision is sometimes plotted instead of the false positive rate
- For a given data set, a range of classification thresholds is tried; each threshold yields its own confusion matrix, and the TP and FP rates from each one give one point on the curve
- The final ROC graph shows which threshold gives the best trade-off between true and false positives for the given data set
Area Under the Curve (AUC)
- Has a maximum value of 1
- Allows for comparison of multiple ROC curves
- A greater AUC value indicates a better model
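A small sketch of both ideas using scikit-learn, with made-up labels and scores (`roc_curve` produces one point per threshold, `roc_auc_score` summarizes the curve):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and model scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per classification threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75 -- the closer to the maximum of 1, the better the model
```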
Probability Theory

Random variables
- A variable whose value depends on outcomes of random phenomena
- Serve as a way to map the outcomes of random processes to real numbers
Bayes' Theorem
- Conditional probability - Allows you to calculate the probability of A occurring given that event B occurred
- Bayes’ Theorem generalizes a tree diagram: The numerator is the probability of A and B while the denominator is the marginal probability of B
- $P(A|B) = \frac{P(A)P(B|A)}{P(B)}$
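A worked numeric example of the formula, using hypothetical numbers (a condition with 1% prevalence, a test with 99% sensitivity and a 5% false positive rate):

```python
# P(A): prior probability of having the condition
p_a = 0.01
# P(B|A): probability of a positive test given the condition
p_b_given_a = 0.99
# P(B|not A): probability of a positive test without the condition
p_b_given_not_a = 0.05

# Marginal probability of B (the denominator), as in a tree diagram
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167
```

Despite the accurate test, a positive result only implies about a 1-in-6 chance of actually having the condition, because the prior P(A) is so small.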
Bayesian Statistics
Bayes' Error
Bayes' Classifier
Bias vs Variance
Optimal Model Complexity
The Scikit-learn Library
Classification
Naive Bayes
- A supervised learning algorithm
- Applies Bayes’ Theorem using a naive assumption - assumes conditional independence between each pair of features (given value of the class variable)
- Used to classify data based on features
- For example, classifying emails as spam or not spam
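A minimal sketch of the spam example with scikit-learn's `GaussianNB`, using a made-up single numeric feature per email:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature: one numeric score per email
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = not spam, 1 = spam

clf = GaussianNB().fit(X, y)
pred = clf.predict([[1.1], [5.1]])
print(pred)  # [0 1]
```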
Nearest Neighbor
- Unsupervised nearest neighbors - the foundation of methods such as manifold learning and spectral clustering
- Supervised neighbors - classification of data with discrete labels, and regression of data with continuous labels
- Works by finding the closest predefined number of training samples to a new point, and then predicting a label based on those
Brute force
- Simply computes the distances between all pairs of points
- Classifies new data exclusively through these computed distances
KD tree
- Generates a binary tree structure through the recursive partitioning of data along the data axes provided
- Computes distances based on the partitions made by the tree as opposed to computing it for each data point, thus reducing the number of distance calculations
Ball tree
- KD Trees only partition data along Cartesian axes, so ball trees instead separate data into a series of nesting hyper-spheres to allow for higher dimensions
- The data is divided into nodes defined by a centroid and radius, and the neighbor search utilizes the triangle inequality for classification
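In scikit-learn all three search strategies sit behind the same estimator via the `algorithm` parameter; a sketch with toy clusters showing they agree on the prediction:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated hypothetical clusters
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# The same classifier can be backed by any of the three search strategies
preds = {}
for algo in ("brute", "kd_tree", "ball_tree"):
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo).fit(X, y)
    preds[algo] = clf.predict([[0.5, 0.5], [5.5, 5.5]]).tolist()

print(preds)  # all three strategies agree: [0, 1]
```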
Support Vector Machines
- A supervised learning method that can be used for classification, regression, and outlier detection
- Effective in high dimensions
- SVMs transform data (as necessary) into higher dimensions in order to best classify it based on the training data provided
- This is done through kernel functions that determine the relationships between data points and then set parameters to classify data
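A minimal sketch with scikit-learn's `SVC` on made-up data; the RBF kernel is one such kernel function, implicitly mapping the data into a higher-dimensional space:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class training data
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# The RBF kernel handles the higher-dimensional transformation implicitly
clf = SVC(kernel="rbf").fit(X, y)
pred = clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(pred)  # [0 1]
```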
Multi Layer Perceptron
- A supervised learning algorithm - learns a function based on features that are provided as input (each input feature is represented as a neuron)
- Neurons in the hidden layer (between the input and output) transform the values from each previous layer through a weighted linear summation and then a non-linear activation function
- Finally, the output layer transforms values received from the last hidden layer into output values
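A sketch based on the `MLPClassifier` example from the scikit-learn docs (two input neurons, hidden layers of 5 and 2 neurons, one output); the tiny data set is purely illustrative:

```python
from sklearn.neural_network import MLPClassifier

# Two hypothetical training points, one per class
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]

# lbfgs converges well on tiny data sets; hidden_layer_sizes=(5, 2)
# gives two hidden layers between the input and output layers
clf = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)
pred = clf.predict([[2.0, 2.0], [-1.0, -2.0]])
print(pred)
```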
Random Forest
- A meta estimator - fits decision tree classifiers on various sub-samples of the given data
- Each tree in the ensemble is trained on a sample drawn with replacement (a bootstrap sample)
- The results of these many trees are averaged
- The randomness used to generate the trees and the averaging of the trees decreases variance, increases predictive accuracy and controls over-fitting
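A minimal sketch with scikit-learn on made-up clusters; `bootstrap=True` (the default) gives each tree its own sample drawn with replacement, and the ensemble averages their votes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical two-class training data
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# bootstrap=True: each of the 50 trees is fit on a bootstrap sample
clf = RandomForestClassifier(n_estimators=50, bootstrap=True, random_state=0)
clf.fit(X, y)
pred = clf.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```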
AdaBoost
- Fits a sequence of small decision trees (stumps) that are weak learners on different versions of the data
- Results from all the stumps are combined through a weighted sum to generate the final prediction
- Relies on successive iterations of the algorithm based on sample weights of the data in order to train the model (at each step, incorrectly predicted training examples have their weight increased, while correctly predicted examples weights’ are decreased)
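A minimal sketch with scikit-learn's `AdaBoostClassifier` on made-up data; by default each weak learner is a depth-1 decision tree (a stump), and their weighted votes are combined:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical two-class training data
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# The default base estimator is a decision stump; each boosting round
# re-weights the training examples based on the previous round's errors
clf = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = clf.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```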
Stochastic Gradient Descent
- SGD is an optimization technique, a way to train a model, and it is sensitive to feature scaling
- Similar to gradient descent in methodology, but only looks at one/a mini batch of samples at each step that are randomly selected
- SGD also typically takes larger steps at first and then decreases the step size (learning rate) as the optimal value is approached
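Because SGD is sensitive to feature scaling, scikit-learn's docs recommend standardizing first; a sketch with made-up clusters using `SGDClassifier` in a pipeline:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-class training data
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Standardize features before training, since SGD is scale-sensitive
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X, y)
pred = clf.predict([[0.5, 0.5], [5.5, 5.5]])
print(pred)  # [0 1]
```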
Regression
Ridge
- Performs a regularized form of linear regression (least squares plus a penalty term)
- It minimizes the sum of squared residuals plus the ridge regression penalty
- Ridge regression penalty = ɑ*(slope² + other parameters²), i.e. ɑ times the sum of the squared coefficients
- ɑ is determined using cross validation
- This term introduces bias and as ɑ increases, the slope approaches 0 and sensitivity along the x-axis decreases
- Ultimately, it improves predictions by decreasing variance, making them less sensitive to the training data; this can be very beneficial when the sample size is relatively small or there is not enough data to perform ordinary least squares regression
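A sketch of the shrinkage effect with scikit-learn, on a made-up data set that lies exactly on the line y = x; larger ɑ pulls the slope toward 0:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data lying exactly on the line y = x (true slope = 1)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

ols = LinearRegression().fit(X, y)    # plain least squares: slope = 1
ridge = Ridge(alpha=10.0).fit(X, y)   # the penalty shrinks the slope toward 0

print(ols.coef_[0], ridge.coef_[0])
```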
LASSO
- Similar to Ridge regression, LASSO introduces bias to least squares regression to decrease variance
- It minimizes the sum of squared residuals plus ɑ*(|slope| + |other parameters|), i.e. ɑ times the sum of the absolute values of the coefficients
- As ɑ increases, the slope can actually be 0 (Ridge can only approach 0 asymptotically) and sensitivity along the x-axis decreases
- When performed on data sets with many parameters, LASSO can be used to eliminate parameters by setting their slope to 0, thus LASSO is better than Ridge at reducing variables if a model has many useless parameters/variables
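A sketch of this variable-elimination behaviour with scikit-learn's `Lasso`, on made-up data where only the first of three features actually matters:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 3)                    # 3 features, but only the first matters
y = 3.0 * X[:, 0] + 0.1 * rng.randn(50)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # the useless features' coefficients are driven exactly to 0
```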
LOESS
- Fits a curve to the data by using a sliding window to break the data down into smaller sections where each section is fit using weighted least squares
- Each window has a focal point and weighted least squares is performed based on the distance from each point to the focal point within the window
- The window (generally sized as a percentage of the data set) moves along with the focal point, creating a set of preliminary fitted points
- These preliminary points are then used to re-weight the original points based on their distance from the calculated points; the re-weighting process can be repeated, and a curve is then constructed through the calculated points
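scikit-learn has no LOESS estimator; a minimal sketch using statsmodels' `lowess` on made-up noisy sine data (the `frac` and `it` parameters map onto the window size and re-weighting iterations described above):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.3, size=50)

# frac: window size as a fraction of the data set
# it: number of re-weighting iterations
smoothed = lowess(y, x, frac=0.3, it=3)   # returns (x, fitted y) pairs
print(smoothed.shape)  # (50, 2)
```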
