ID Machine Learning & Data Science

Great Introductory ML Course

Includes a decent amount of math foundation, practice Jupyter notebooks, and a Kaggle competition.

Performance Measures

  • Allow you to evaluate your machine learning model and the results or predictions made by it

Confusion Matrix

  • The rows of the matrix correspond to the predictions made by the model and the columns correspond to the actual known data
  • The matrix shows the True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN)
  • Confusion matrices generated by different machine learning models on the same data can then be used to choose which model makes the best predictions
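A minimal sketch with scikit-learn and toy labels; note that scikit-learn's `confusion_matrix` uses the transposed convention (rows are the actual classes, columns are the predictions):

```python
from sklearn.metrics import confusion_matrix

# Known labels vs. one model's predictions (made-up values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() unpacks the 2x2 matrix in scikit-learn's order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```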

Receiver Operating Characteristics (ROC)

  • An ROC graph plots sensitivity (true positive rate) on the y-axis against 1 - specificity (false positive rate) on the x-axis; with heavily imbalanced classes, precision is sometimes plotted instead of the false positive rate
  • Various classification thresholds can be set on the data set; the TP and FP rates are computed from the confusion matrix produced at each threshold, giving one point on the curve per threshold
  • The final ROC graph helps identify which threshold is best for the given data set

Area Under the Curve (AUC)

  • Has a maximum value of 1
  • Allows for comparison of multiple ROC curves
  • A greater AUC value indicates a better model
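Both the curve and its area can be computed with scikit-learn; a small sketch using made-up classifier scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy scores from a classifier: higher score -> more likely positive
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

# One (FPR, TPR) point per threshold; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
```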

ID Probability Theory

Random variables

  • A variable whose value depends on outcomes of random phenomena
  • Serve as a way to map the outcomes of random processes to real numbers

Bayes' Theorem

  • Conditional probability - Allows you to calculate the probability of event A occurring given that event B occurred
  • Bayes’ Theorem can be derived from a tree diagram: the numerator is the joint probability of A and B, while the denominator is the marginal probability of B
  • $P(A|B) = \frac{P(A)P(B|A)}{P(B)}$
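A quick worked example with hypothetical numbers (a rare condition and an imperfect test):

```python
# Hypothetical numbers for illustration
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B|A): test is positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# Marginal P(B) via the law of total probability (sums the tree's branches)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
```

Even with a fairly accurate test, the posterior stays modest because the prior P(A) is so small.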

ID Bayesian Statistics

Bayes' Error

Bayes' Classifier


ID Bias vs Variance

Optimal Model Complexity


ID The Scikit-learn Library


ID Classification

Naive Bayes

  • A supervised learning algorithm
  • Applies Bayes’ Theorem with a naive assumption - conditional independence between every pair of features, given the value of the class variable
  • Used to classify data based on features
  • For example, classifying emails as spam or not spam
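A minimal classification sketch; here Gaussian Naive Bayes on the built-in iris data stands in for a real spam/not-spam data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the naive-independence model and score it on held-out data
clf = GaussianNB().fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```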

Nearest Neighbor

  • Unsupervised nearest neighbors - manifold learning and spectral clustering
  • Supervised neighbors - classification of data with discrete labels, and regression of data with continuous labels
  • Works by finding a predefined number of training samples closest to a new point, then predicting a label based on those neighbors
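A small sketch of the supervised case with scikit-learn's `KNeighborsClassifier` on toy 2-D points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters of toy points
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Predict from the 3 closest training samples (majority vote)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
```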

Brute force

  • Simply computes the distances between all pairs of points in the data set
  • Classifies new data exclusively through these computed distances

KD tree

  • Generates a binary tree structure through the recursive partitioning of data along the data axes provided
  • Computes distances based on the partitions made by the tree as opposed to computing it for each data point, thus reducing the number of distance calculations

Ball tree

  • KD trees only partition data along Cartesian axes, so ball trees instead separate data into a series of nested hyper-spheres, which remains efficient in higher dimensions
  • The data is divided into nodes defined by a centroid and radius, and the neighbor search utilizes the triangle inequality for classification
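In scikit-learn, all three neighbor-search strategies are available through the `algorithm` parameter and return the same neighbors; a small sketch:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # 100 random points in 3-D

# Run the same 5-nearest-neighbor query under each strategy
results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    dist, idx = nn.kneighbors(X[:1])
    results[algorithm] = idx[0]
```

The strategies differ in speed and memory, not in the answer they return.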

Support Vector Machines

  • A supervised learning method that can be used for classification, regression, and outlier detection
  • Effective in high dimensions
  • SVMs transform data (as necessary) into higher dimensions in order to best classify it based on the training data provided
  • This is done through kernel functions that determine the relationships between data points and then set parameters to classify data
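A minimal sketch: an RBF-kernel SVM on scikit-learn's two-moons toy data, which is not linearly separable in the original space:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles; no straight line separates them
X, y = make_moons(noise=0.1, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional
# space where the classes become separable
clf = SVC(kernel="rbf").fit(X, y)
acc = clf.score(X, y)
```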

Multi Layer Perceptron

  • A supervised learning algorithm - learns a function based on features that are provided as input (each input feature is represented as a neuron)
  • Neurons in the hidden layer (between the input and output) transform the values from each previous layer through a weighted linear summation and then a non-linear activation function
  • Finally, the output layer transforms values received from the last hidden layer into output values
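A minimal sketch with scikit-learn's `MLPClassifier` (one hidden layer of 64 neurons, chosen arbitrarily here):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 64 input features per digit image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer -> one hidden layer of 64 neurons -> output layer
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_train, y_train)
acc = mlp.score(X_test, y_test)
```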

Random Forest

  • A meta estimator - fits decision tree classifiers on various subsets of the given data
  • Each tree in the ensemble is trained on a sample of the data drawn with replacement (a bootstrap sample)
  • The results of these many trees are averaged
  • The randomness used to generate the trees and the averaging of the trees decreases variance, increases predictive accuracy and controls over-fitting
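A small sketch with scikit-learn's `RandomForestClassifier`, which handles the bootstrapping and averaging internally:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap sample with random feature splits;
# predictions are aggregated across the ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
acc = forest.score(X, y)
```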

AdaBoost

  • Fits a sequence of small decision trees (stumps) that are weak learners on different versions of the data
  • Results from all the stumps are combined through a weighted sum to generate the final prediction
  • Relies on successive iterations of the algorithm based on sample weights of the data in order to train the model (at each step, incorrectly predicted training examples have their weights increased, while correctly predicted examples' weights are decreased)
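A minimal sketch with scikit-learn's `AdaBoostClassifier`, whose default base learner is a depth-1 decision tree (a stump):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=200, random_state=0)

# 50 boosting rounds: each stump is fit on re-weighted data, and the
# final prediction is a weighted vote over all stumps
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = ada.score(X, y)
```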

Stochastic Gradient Descent

  • SGD is an optimization technique, a way to train a model, and it is sensitive to feature scaling
  • Similar to gradient descent in methodology, but only looks at one/a mini batch of samples at each step that are randomly selected
  • SGD also takes larger steps at first and uses fewer samples per step, then decreases the step size (learning rate) as the optimal value is approached
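Because SGD is sensitive to feature scaling, it is commonly paired with standardization; a sketch using scikit-learn's `SGDClassifier` in a pipeline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize features first, then train a linear model with SGD
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(random_state=0)).fit(X, y)
acc = clf.score(X, y)
```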

ID Regression

Ridge

  • Performs a special type of linear (least squares) regression
  • It minimizes the sum of squared residuals plus the ridge regression penalty
    • Ridge regression penalty = ɑ*(slope² + other parameters²)
    • ɑ is determined using cross validation
    • This term introduces bias, and as ɑ increases, the slope approaches 0 and sensitivity along the x-axis decreases
  • Ultimately, it improves predictions by decreasing variance: the predictions become less sensitive to the training data, which is especially beneficial when the sample size is relatively small or there is not enough data to perform ordinary least squares regression
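A small sketch comparing the ridge slope to the ordinary least squares slope on synthetic data (the ɑ value here is arbitrary, just to make the shrinkage visible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 1)
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=20)  # true slope is 3

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # ɑ = 10 shrinks the slope toward 0
```

The penalty pulls the ridge slope below the OLS slope, trading a little bias for lower variance.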

LASSO

  • Similar to Ridge regression, LASSO introduces bias to least squares regression to decrease variance
  • It minimizes the sum of squared residuals plus ɑ*(|slope| + |other parameters|)
    • As ɑ increases, the slope can actually reach 0 (Ridge can only approach 0 asymptotically) and sensitivity along the x-axis decreases
  • When performed on data sets with many parameters, LASSO can eliminate parameters entirely by setting their slopes to 0; this makes LASSO better than Ridge at reducing variables when a model has many useless parameters
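A sketch of LASSO zeroing out useless parameters, on synthetic data where only the first two of five features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(50, 5)
# Only features 0 and 1 influence y; the other three are noise
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
# Coefficients of the irrelevant features are driven exactly to 0
n_zero = int(np.sum(lasso.coef_ == 0))
```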

LOESS

  • Fits a curve to the data by using a sliding window to break the data down into smaller sections where each section is fit using weighted least squares
  • Each window has a focal point and weighted least squares is performed based on the distance from each point to the focal point within the window
  • The window (generally sized as a percentage of the data set) moves along with the focal point, creating a set of preliminary fitted points
  • These fitted points are then used to re-weight the original points based on their distance from the calculated points; the re-weighting process can be repeated, and a curve is finally constructed through the calculated points
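A minimal NumPy sketch of one LOESS step, the weighted least-squares line fit inside a single window (tricube weights are a common choice; the re-weighting pass is omitted):

```python
import numpy as np

def loess_point(x, y, x0, frac=0.5):
    """Fit a weighted least-squares line in a window around focal point x0.

    A sketch of one LOESS step: the window holds `frac` of the data,
    and weights fall off with distance from x0 via the tricube function.
    """
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))           # points in the window
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                      # k nearest points to x0
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
    # Weighted least squares for a local line y = a + b*x
    A = np.vstack([np.ones(k), x[idx]]).T
    W = np.diag(w)
    a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])
    return a + b * x0                            # fitted value at the focal point

x = np.linspace(0, 1, 30)
y = x ** 2                   # a smooth curve, no noise
fit = loess_point(x, y, 0.5)
```

Sliding `x0` across the data and collecting the fitted values traces out the full LOESS curve.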

ID Embedding


ID Neural Networks

Feed forward Networks

Tensorflow Scripting

Convolutional Neural Networks


ID Time Series Analysis

ARIMA Models

GARCH Models

PFSA Models


ID Time Series Modeling with Neural Networks

Recurrent Neural Networks

Long-short Memory Models


ID Autoencoders

Variational Autoencoders

Autoencoders and Kolmogorov Complexity


ID Case Studies

Comparison of Genomic Sequences

Genome Wide Association Studies

Epidemiology