Notes for user:
Note 1:
When running the flexroc version of AdaBoost with Python 3,
we need to modify the notebook a little bit:
import subprocess

auc_command = '../bin/flexroc -i {} -w 1 -x 0 -C 1 -L 1 -E 0 -f .2 -t .9'.format(outfile)
output = subprocess.check_output(auc_command, shell=True)
# ------------ add this line if working with Python 3 ----------------
output = output.decode('utf8')
# --------------------------------------------------------------------
Python 2 probably doesn't need it, since check_output returns a bytes-like object only in Python 3.
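A version-agnostic alternative is to ask check_output for text output directly, so the same notebook cell works without the decode line (the echo command below is just a stand-in for the flexroc call):

```python
import subprocess

# universal_newlines=True (aliased to text=True in Python 3.7+) makes
# check_output return str instead of bytes, on both Python 2 and Python 3,
# so no manual .decode('utf8') is needed.
output = subprocess.check_output('echo hello', shell=True, universal_newlines=True)
print(output.strip())  # 'hello', already a str
```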
Note 2:
If we are going to do bi-class classification using AdaBoost,
we probably need to binarize the target
(the last column of the csv file generated by the cynet_chunker call).
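A minimal sketch of the binarization, assuming the last column of the chunker csv is a numeric event count; the column names and the threshold (> 0) here are illustrative assumptions, not cynet's API:

```python
import pandas as pd

# Hypothetical stand-in for the chunker output: the last column is the target.
df = pd.DataFrame({'feat': [0.1, 0.5, 0.9], 'target': [0, 3, 1]})

# Binarize: any positive count becomes class 1, zero stays class 0.
df['target'] = (df['target'] > 0).astype(int)
print(df['target'].tolist())
```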
Notes for developer:
[Improvement] Run run_pipeline() in parallel
In perturbation, if we have $p$ percentages and $n$ variables (each variable takes one of the $p$ percentages), then we have $p^{n}$ different perturbed splits.
When the dataset is large, or $p$ or $n$ is big, we may want to run all the run_pipeline() calls in parallel,
without copying the models folder for each row (or entry).
The current implementation does not support this:
although we can save the .res files in a separate folder,
and the *model_sel_[uuid].json files cannot be overwritten because of the uuid,
the .log files currently cannot be saved to a separate folder,
and can be overwritten.
There is a parameter named LOG_PATH for run_pipeline(), but it seems to have little effect,
because it gets overwritten (around lines 1844-1847 of cynet.py) anyway.
[Bug] peturbation_parallel vs perturbation
We have a function called peturbation_parallel in the cynet class; the name is a typo of perturbation_parallel.
[Improvement] Automated RUNLEN and FLEX_TAIL_LEN from dates
Yi wrote two little pieces of code so that when RUNLEN and FLEX_TAIL_LEN in config.yaml are given as -1,
they can be calculated from START_DATE, END_DATE, OOS_END_DATE, and FREQ in the config.yaml file.
For now this code is included in the user-side notebook/script
(e.g. usWeather_cynet_2_pred.py or usWeather_cynet_3_chunker.py, etc.).
But one day, Yi forgot to pass the calculated RUNLEN and FLEX_TAIL_LEN to cynet_chunker,
and instead passed -1 for both values.
The result was that the cynet binary kept running forever.
So we may want to add a safety check somewhere so that the wrapper or cynet warns the user when -1 is passed.
Alternatively, the inference of RUNLEN and FLEX_TAIL_LEN could be done on the cynet side.
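A sketch of the date-based inference plus a guard against -1 leaking downstream. The variable names mirror the config keys, but the exact formulas (whether RUNLEN runs through OOS_END_DATE, and how the tail is counted) are assumptions about what Yi's code does, not verified cynet behavior:

```python
import pandas as pd

# Assumed config values (config.yaml keys used as variable names).
START_DATE, END_DATE, OOS_END_DATE, FREQ = '2016-01-01', '2016-12-31', '2017-03-31', 'D'

# RUNLEN: total number of time steps from START_DATE through OOS_END_DATE
# (pd.date_range is inclusive on both ends by default).
RUNLEN = len(pd.date_range(START_DATE, OOS_END_DATE, freq=FREQ))

# FLEX_TAIL_LEN: number of out-of-sample steps strictly after END_DATE.
FLEX_TAIL_LEN = len(pd.date_range(END_DATE, OOS_END_DATE, freq=FREQ)) - 1

# Simple safety check: never hand -1 (or any non-positive value) to the binary.
assert RUNLEN > 0 and FLEX_TAIL_LEN > 0, 'RUNLEN/FLEX_TAIL_LEN were not inferred'
print(RUNLEN, FLEX_TAIL_LEN)
```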
[Bug] use of the Python string strip() function:
Quote from the documentation of .strip():
"The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped."
In cynet we sometimes want to change an extension, for example .log. But if we use strip('.log') and the file path also STARTS with characters from 'log', those leading characters will also be stripped. Note that rstrip('.log') has the same character-set semantics on the right end, so it can eat trailing characters beyond the extension too. replace(), os.path.splitext(), or removesuffix() (Python 3.9+) are safer.
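A small demonstration of the pitfall and the safer alternatives (the path here is made up):

```python
import os

# strip('.log') treats '.log' as a character SET {'.', 'l', 'o', 'g'},
# not as a suffix, and strips from BOTH ends.
path = 'log_dir/file.log'
print(path.strip('.log'))           # '_dir/file' -- leading 'log' is eaten too

# rstrip('.log') has the same character-set semantics on the right end:
print('catalog.log'.rstrip('.log'))  # 'cata' -- not 'catalog'

# Safer alternatives:
stem, ext = os.path.splitext(path)   # ('log_dir/file', '.log')
print(stem)
# Or, on Python 3.9+: path.removesuffix('.log') -> 'log_dir/file'
```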
[Bug] double counting caused by inclusive pandas.DataFrame.loc slicing:
DataFrame.loc[A:B] is inclusive on both boundaries A and B. This causes a major problem when the temporal resolution of the raw log file lines up with the temporal discretization of cynet. For example, if the raw log files (fed to cynet) only have dates and we use temporal frequency 1D for cynet, then loc['2016-01-01':'2016-01-02'] will in fact contain events on both 2016-01-01 and 2016-01-02, while the intended behavior is to count only events on 2016-01-01.
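A minimal reproduction with synthetic daily data, plus a half-open workaround using boolean indexing:

```python
import pandas as pd

# Three days of synthetic events at 1D frequency.
idx = pd.date_range('2016-01-01', '2016-01-03', freq='D')
df = pd.DataFrame({'events': [1, 2, 4]}, index=idx)

# .loc label slicing is inclusive on BOTH ends: this grabs Jan 1 AND Jan 2.
both_days = df.loc['2016-01-01':'2016-01-02']
print(both_days['events'].sum())   # 3 -- double counts the boundary day

# Half-open interval [start, end) gives the intended single-day count.
one_day = df[(df.index >= '2016-01-01') & (df.index < '2016-01-02')]
print(one_day['events'].sum())     # 1 -- only events on 2016-01-01
```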
[Improvement] Inconsistency of cynet.spatioTemporal and cynet.cynet_chunker in using the partition file
Suppose we have a binary partition. Then:
- cynet_chunker needs a file with two columns, the first column being the coords (as in the .coords file), but sp (spatioTemporal) needs just one column of the partition points;
- cynet_chunker takes the partition filename as a string, but for sp, one needs to put the filename in a list like ["partition_filename.csv"].
These inconsistencies can be a little confusing for users, and since they probably don't have access to the binary source code, they probably couldn't figure them out either.
[Improvement] About the threshold parameter in spatioTemporal
In case you want to keep all time series no matter how sparse they are, you may want to set the threshold parameter in spatioTemporal to zero. But the current cynet class code uses >= and hence will also keep time series that are entirely zero. A later developer may consider using > instead of >=.
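A sketch of the >= vs > issue, assuming the filter is roughly "fraction of nonzero entries >= threshold"; the exact criterion inside the cynet class may differ:

```python
import numpy as np

# An entirely-zero time series.
ts = np.zeros(100)
sparsity = np.count_nonzero(ts) / len(ts)   # 0.0

# With threshold 0, the current >= comparison keeps the all-zero series:
print(sparsity >= 0.0)   # True  -- kept
# A strict > comparison would drop it while still keeping anything nonzero:
print(sparsity > 0.0)    # False -- dropped
```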
