Research Topics: Overview
- EHR Projects
- Qnets: Modeling Ultra High Dimensional Dependency Networks
- Cynet: Deep Learning without Neural Networks
- Cognitive Dissonance Modeling
- Sequence to Function Mapping in Biology
- Actionable Forecasting of Urban Crime
- Preempting the next pandemic
Algorithms
XgenESeSS
Purpose
The application supports time series analysis through a comprehensive machine learning package built on the Probabilistic Finite-State Automaton (PFSA), a type of Hidden Markov Model (HMM).
- PFSAs model discrete time series; with these algorithms, the modeling can be extended to continuous time series.
- The effectiveness of the package was demonstrated on data sets with at least 50 observations from the UCR Time Series Classification Archive.
- The package also allows study of the meta-learning problem: determining which features of a dataset favor particular classification algorithms.
Results
- These algorithms performed at or above baseline on 60% of data sets.
- The Csmash clustering algorithm improves performance across a majority of data sets, which indicates that it is useful to assume more than one process generates the time series.
- The PFSA assumes that the underlying process is ergodic, stationary, and has a finite number of states. While the package can be used on any dataset, results on small datasets are unstable and sensitive to small changes in hyperparameters.
PFSA v DTW (Dynamic Time Warping)
- When computing distance between time series of 1s and 0s, PFSA will result in a distance of 0, while DTW will measure the maximum distance due to lack of alignment of symbols.
- DTW and PFSA algorithms define similarity differently.
- If the data were created by a physical process with the properties of an HMM, the PFSA algorithm is expected to perform better than DTW.
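The difference between the two notions of similarity can be illustrated with a small sketch. It compares a textbook DTW distance against a crude model-based distance built from first-order transition probabilities. The transition-probability comparison is only a stand-in for a PFSA-based distance and is an assumption of this example, not the package's implementation; the function names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def transition_probs(symbols):
    """Estimate P(next = 1 | previous symbol) from a binary sequence."""
    ones = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for prev, nxt in zip(symbols, symbols[1:]):
        totals[prev] += 1
        ones[prev] += nxt
    return {s: ones[s] / totals[s] for s in totals if totals[s] > 0}


def model_distance(a, b):
    """Largest gap between the two estimated transition probabilities.

    Assumed stand-in for a PFSA-based distance (not the package's method).
    """
    pa, pb = transition_probs(a), transition_probs(b)
    return max(abs(pa.get(s, 0.0) - pb.get(s, 0.0)) for s in (0, 1))


# Two streams from the same alternating process, one step out of phase.
a = [0, 1, 0, 1, 0, 1, 0, 1]
b = [1, 0, 1, 0, 1, 0, 1, 0]
dtw = dtw_distance(a, b)      # non-zero: the end points cannot be aligned
model = model_distance(a, b)  # zero: both streams follow the same rule
```

In this toy case the model-based distance is 0 because both streams obey the same generating rule, while DTW still reports a positive distance from the misaligned symbols.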
Method
The package provides a series of algorithms:
genESeSS
A supervised inference algorithm that allows for PFSA model inference from a time series.
XG2 (supervised)
The algorithm assumes there is an HMM and then generates a discrete probabilistic state machine based on a logarithmic function. It does so by mapping the initial data to a two- or three-symbol alphabet, and then, for every quantization, computing the conditional probabilities of the next symbol given a prior word.
How?
1. Maps the given time series to a discrete time series in order to compute the features.
2. Calculates the conditional probability of the next symbol from the previous symbols, returning the probability as a feature.
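The two steps above can be sketched as follows. This is a minimal illustration, assuming a two-symbol alphabet obtained by thresholding at the median and a fixed prior word length; both choices, and the function names, are hypothetical and are not taken from the package.

```python
from collections import Counter, defaultdict
from statistics import median


def quantize_binary(series):
    """Map a continuous series to symbols 0/1 (assumed scheme: median threshold)."""
    threshold = median(series)
    return [1 if x > threshold else 0 for x in series]


def next_symbol_probabilities(symbols, word_length=2):
    """Estimate P(next symbol | preceding word) from observed counts."""
    counts = defaultdict(Counter)
    for i in range(word_length, len(symbols)):
        word = tuple(symbols[i - word_length:i])
        counts[word][symbols[i]] += 1
    return {
        word: {s: c / sum(ctr.values()) for s, c in ctr.items()}
        for word, ctr in counts.items()
    }


series = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.2, 0.9]
symbols = quantize_binary(series)                           # step 1
probs = next_symbol_probabilities(symbols, word_length=1)   # step 2
# probs maps each prior word to a distribution over the next symbol;
# these probabilities are the values that would be returned as features.
```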
Lsmash distance (unsupervised)
The algorithm computes a pairwise distance matrix from a collection of time series (2 or more) and compares the log likelihoods to determine if the series were generated by a set of preselected PFSAs.
How?
1. It computes distances between train data by fitting the quantizer (if not fitted).
2. Then pairwise distances are computed and the algorithm returns a distance matrix.
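As a rough sketch of the idea, the example below scores each already-quantized stream against a small set of preselected models and takes the difference between the resulting log likelihoods as the distance. First-order Markov chains stand in for the preselected PFSAs; that substitution, the per-symbol normalization, and all names are assumptions made for illustration.

```python
import math


def log_likelihood(symbols, transition):
    """Per-symbol log likelihood under a first-order Markov chain.

    transition[prev][nxt] is the probability of nxt following prev.
    """
    total = 0.0
    for prev, nxt in zip(symbols, symbols[1:]):
        total += math.log(transition[prev][nxt])
    return total / (len(symbols) - 1)


def pairwise_distances(streams, models):
    """Distance matrix from differences in log likelihood under each model."""
    feats = [[log_likelihood(s, m) for m in models] for s in streams]
    n = len(streams)
    return [
        [sum(abs(x - y) for x, y in zip(feats[i], feats[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical preselected models: one favors repeats, one favors flips.
sticky = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
flippy = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.9, 1: 0.1}}

streams = [
    [0, 0, 0, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 0],
]
dist = pairwise_distances(streams, [sticky, flippy])
```

The first two streams switch symbols equally often, so they land close together, while the alternating third stream is far from both.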
Quant (supervised)
The algorithm generates a discrete time model from the original continuous time series.
How?
1. It computes the optimal discretization scheme.
2. Test data is transformed, returning a list of transformed data.
3. The Quantizer is fit and returns transformed train data.
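A fit/transform quantizer of this kind might look like the sketch below. It uses equal-frequency (quantile) cut points as a simple stand-in for the optimal discretization scheme, which the text does not specify; the class name and its parameters are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


class QuantileQuantizer:
    """Equal-frequency quantizer (assumed stand-in for the optimal scheme)."""

    def __init__(self, alphabet_size=3):
        self.alphabet_size = alphabet_size
        self.cut_points = None

    def fit(self, train_series):
        # Pool all training values and place alphabet_size - 1 cut points.
        pooled = [x for series in train_series for x in series]
        self.cut_points = quantiles(pooled, n=self.alphabet_size)
        return self

    def transform(self, series_list):
        if self.cut_points is None:
            raise ValueError("fit must be called before transform")
        # Each value becomes the index of the bin it falls into.
        return [
            [bisect_right(self.cut_points, x) for x in series]
            for series in series_list
        ]


train = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]]
test = [[1.0, 5.0, 12.0]]

quantizer = QuantileQuantizer(alphabet_size=3).fit(train)
train_symbols = quantizer.transform(train)
test_symbols = quantizer.transform(test)
```

Fitting on the training data only, then reusing the same cut points for the test data, keeps the two symbol streams comparable.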
Smash (supervised)
The algorithm generates a state machine over a 2- or 3-symbol alphabet, or deems it not possible (meaning the model is bad or the process is not an HMM).
How?
1. The algorithm maps the initial data, and for every mapping and label a PFSA is found using the inference algorithm.
2. For each state in the HMM, the probability of the stream originating from that state is computed.
3. The probabilities are weighted by their stationary distribution to compute log likelihoods which are then used as features in a classifier.
Csmash (supervised)
The algorithm combines Smash and Lsmash. Lsmash clusters data from the original classes into additional subclasses (for example, from 2 classes to 3 subclasses), and Smash then solves the classification problem over the 3 subclasses.
How?
1. The algorithm clusters data that belongs to one label and applies Smash on the given data, but with more labels.
2. Streams in the same class are clustered first.
- Clustering - the distance between two streams is given as the difference between log likelihoods that the given streams were produced by the preselected HMM.
3. Csmash is applied on data sets with more/extended labels.
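The label-extension step can be sketched as below, given a pairwise distance matrix between streams. The splitting rule used here (seed two subclusters with the farthest pair in a class, and split only when that pair is at least `min_split` apart) is an assumption chosen for brevity, not the clustering the package performs; `extend_labels` and `min_split` are hypothetical names.

```python
def extend_labels(labels, dist, min_split=1.0):
    """Split each class into at most two subclasses using pairwise distances."""
    extended = list(labels)
    for label in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == label]
        pairs = [(i, j) for i in members for j in members if i < j]
        if not pairs:
            # A single stream cannot be split.
            extended[members[0]] = f"{label}_0"
            continue
        # Seeds: the two streams in this class that are farthest apart.
        seed_a, seed_b = max(pairs, key=lambda p: dist[p[0]][p[1]])
        split = dist[seed_a][seed_b] >= min_split
        for i in members:
            closer_to_b = split and dist[i][seed_b] < dist[i][seed_a]
            extended[i] = f"{label}_{1 if closer_to_b else 0}"
    return extended


# Toy streams summarized by one number each; distance is the absolute gap.
positions = [0.0, 0.1, 5.0, 10.0, 10.2]
labels = ["a", "a", "a", "b", "b"]
dist = [[abs(p - q) for q in positions] for p in positions]

extended = extend_labels(labels, dist, min_split=1.0)
```

Here class "a" splits into two subclasses and class "b" stays whole, so a 2-class problem becomes a 3-subclass problem that a classifier such as Smash would then be trained on.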
Cynet
Purpose
The Cynet Package performs analysis on spatio-temporal data. It allows us to study weather events in the US spanning from 2016-2019 using a dataset containing 4 years of recorded weather events (cold, snow, fog, rain, storm) from over 2000 weather stations.
Results
Current average AUC of 0.76 from the 2079 existing models.
Method
- Algorithm input is the log of events.
- Each entry in the event log records the what, where, and when of a weather event.
- The event log is preprocessed and converted to a time series which contains location and event type.
For example...
- A time series for location "UofC" and event type “rain” may look like 0 1 0 0 0 1
- It means no rain at the first time step, rain at the second time step, no rain for the third to fifth time steps, and rain for the last time step.
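The preprocessing step can be sketched as follows, reproducing the example above. The event tuple layout (location, event type, time step) is a hypothetical schema chosen for illustration, not Cynet's input format.

```python
from collections import defaultdict


def event_log_to_series(events, n_steps):
    """Convert (location, event_type, time_step) records to binary series.

    Returns a dict keyed by (location, event_type); a 1 marks a time step
    at which that event type was logged at that location.
    """
    series = defaultdict(lambda: [0] * n_steps)
    for location, event_type, step in events:
        if 0 <= step < n_steps:
            series[(location, event_type)][step] = 1
    return dict(series)


events = [
    ("UofC", "rain", 1),
    ("UofC", "rain", 5),
    ("O'Hare", "storm", 0),
]
series = event_log_to_series(events, n_steps=6)
uofc_rain = series[("UofC", "rain")]
```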
- Cynet generates a directed network for all the time series.
- The influence of one time series on another is captured in the edges of the network.
For example...
- Assume we have time series UofC - “Rain” and O'Hare - “Storm”
- The edge from O'Hare - “Storm” to UofC - “Rain” is a model showing how storms around O'Hare can be used as predictors for rain around UofC.
- Let us say that typically in Chicago, wind blows from O'Hare to UofC, then the model will have strong predicting power.
- We note that the influence may very well be asymmetric.
- We also infer models for different time lags because some influences may be short-term while others may take some more time to be apparent.
- Once the network is generated for all source and target series, it is pruned to remove weak edges.
- Each edge serves as a predictor for a weather pattern.
- Location and time lag are very influential.
- The Cynet package integrates all these predictions by producing scalar coefficients from each data source in order to make a final prediction about the weather at the desired location.
- Once the network is pruned and scalar coefficients are established for each link in the network, Cynet can make predictions.
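The pruning and integration steps can be pictured with the sketch below. The edge fields, the strength threshold, and the weighted-average rule for combining per-edge predictions are all assumptions made for illustration; the text does not specify how Cynet computes or applies its scalar coefficients.

```python
def prune_edges(edges, min_strength):
    """Drop edges whose predictive strength falls below the threshold."""
    return [e for e in edges if e["strength"] >= min_strength]


def combine_predictions(edges, coefficients):
    """Weighted average of per-edge predictions (assumed integration rule)."""
    total = sum(coefficients[e["source"]] for e in edges)
    weighted = sum(coefficients[e["source"]] * e["prediction"] for e in edges)
    return weighted / total


# Hypothetical edges into the target series UofC - "Rain".
edges = [
    {"source": "O'Hare-storm", "lag": 1, "strength": 0.80, "prediction": 0.9},
    {"source": "Midway-rain", "lag": 2, "strength": 0.50, "prediction": 0.6},
    {"source": "Evanston-fog", "lag": 1, "strength": 0.05, "prediction": 0.1},
]
coefficients = {"O'Hare-storm": 3.0, "Midway-rain": 1.0}

kept = prune_edges(edges, min_strength=0.1)
rain_probability = combine_predictions(kept, coefficients)
```

The weak Evanston edge is pruned before integration, so it needs no coefficient and does not affect the final prediction.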
Qnet
Purpose
To effectively model the evolutionary properties of viruses with a novel machine learning algorithm. The Quasinet (Qnet) framework can be used to simulate the evolution of virus strains, predict the probability of a pandemic risk, and decide better vaccine components.
Results
- Using Qnets, we successfully predicted the global Influenza pandemic in 2009.
- This method provided targets for H1N1 influenza that are much more accurate than the recommended targets from the World Health Organization.
Method
Background of Viruses for Computational Biology
Viruses are infectious agents that latch onto a host, such as a human or animal, in order to replicate. A virus usually has multiple strains, or variations of that virus that perform similar functions. Each strain consists of proteins which originate from a specific sequence of amino acids. These amino acids can be thought of as specific letters, and the entire sequence can be thought of as a sentence. However, a strain can mutate (lose, gain, or exchange amino acids), and these mutations can be significant; changing enough amino acids may change the properties of the strain.
Qnet Framework
- Learns structural dependencies of symbols within sequences
- The Qnet is composed of many conditional inference trees
- Each tree corresponds to a specific location in a sequence
- Once the trees are trained, each tree uses the symbols at the other locations in the sequence to predict the probability of each possible symbol at its own location
Qnets and Virological Understanding
- In the case of a virus, the Qnet learns the structural dependencies between amino acids. The conditional inference decision trees use all the amino acids within a given sequence to predict the probability of an amino acid occurrence at its corresponding index.
- Within a Qnet a distance measure can be defined, called the Qdistance. The Qdistance describes how close one sequence is to another. Given a virus, we can measure the Qdistance between every pair of strains, and then construct phylogenetic trees and understand the evolutionary trajectory of the virus using these Qdistances.
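One way to picture a distance of this kind is sketched below. It assumes the distance averages, over sequence positions, the square root of the Jensen-Shannon divergence between the symbol distributions that the position's predictor assigns to each sequence. The text does not state the formula, so this form is an assumption, and the toy predictors stand in for trained conditional inference trees.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two dict distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def qdistance_sketch(seq_x, seq_y, predictors):
    """Average per-position sqrt-JS divergence (assumed form of the distance)."""
    total = 0.0
    for predict in predictors:
        js = js_divergence(predict(seq_x), predict(seq_y))
        total += math.sqrt(max(js, 0.0))
    return total / len(predictors)


def make_toy_predictor(index, length):
    """Hypothetical predictor: position `index` depends on its right neighbor."""
    def predict(seq):
        neighbor = seq[(index + 1) % length]
        if neighbor == "A":
            return {"A": 0.8, "G": 0.2}
        return {"A": 0.3, "G": 0.7}
    return predict


x, y = "AAGA", "AAGG"
predictors = [make_toy_predictor(i, len(x)) for i in range(len(x))]

d_xy = qdistance_sketch(x, y, predictors)
d_xx = qdistance_sketch(x, x, predictors)
```

A matrix of such pairwise distances between strains is the kind of input a standard tree-building method would need to construct a phylogenetic tree.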
The trained Qnet model can also simulate the evolution of a strain until convergence, using the Qnet-induced probabilities to decide which amino acids to mutate. We use this simulation process to predict the risk of a pandemic, and we say such a risk exists if the following two conditions are met.
- After simulating the evolution of a strain with a non-human host, we find that the strain moves closer to the closest strain with a human host.
- After performing another simulation, we find that the same strain fails to mutate towards the most common human strain.
The Qnet framework can be used to choose vaccine targets for each year. Our choice for the target sequence comes from the common strain (as computed by the Qdistance) of a population from previous years.
Qnets and Coronavirus
When applied to the coronavirus, we find that the phylogenetic trees constructed from the Qdistance provide a better representation of the evolutionary process than existing methods.
