Expanding Prediction To Multiple Viruses
We will attempt to replicate dominant strain prediction to two more viruses:
- HIV1-A
- This will attempt to predict dominant strain within individual patients
- CCHFV
- This is an attempt to predict dominant strain over time, say for 5 years
The key ponts here are: 1) is there enough data available to carry out this modeling? And 2) should we consider geographic stratification?
Lets start with CCHF. We do the following steps:
- Download all CCHFV sequences between collection years 2010 and 2020
- breakdown by 2 year periods
- check which segments are most common (there is S, M, and L segements in the virus)
- Try making Qnets with this 2 year data (after aliognment, and segment choice)
- Carry out dominant strain pred and validation (check how well we do from the actual dominant strain prediction)
Currently looking at HIV patient specific data availability
Task List
- generate the alphabeta and the metadata csvs as above
- Shere the link with the diffusion simulation u did for the different viruses
- Dominat strain tracking task (above)
- We need to do a EPITOPE MATCH task (detailed later)
- We need to do a IRAT calculation task as well (this will be a seprate paper)
Epitope Match Task
Let k be the diffusion simulation step.
compute the MHCI top epitopes using the application u tried, call this set E_k
-
compute the deviation of E_k from E_0 as follows:
Note these are sets of strings Compute $$average_{x \in E_k} min_{y \in E_0} L(x,y)$$
- Plot the averge E_k line, averaged over mutiple simulation runs
IRAT task
- Show qIRAT for 100 H1N1, 100 H3N2, and sufficient number of recent H5 and H7 sequences (report minimum score over 2010-current, and the current year score)
- List top 3 for each variety (minimum current year score)
- as usual report seqName, seqAccession, etc
