svm-agp - Documentation

Running the prediction of atypical genes (svm-agp)

System requirements:

The prediction of atypical genes is a Python program tested with Python 2.7 on a Linux machine running Ubuntu 12.04 LTS. The following Python packages are needed:

python-numpy
python-scipy
python-sklearn
python-biopython

Other programs that need to be installed are:

cd-hit (only if you want to remove redundancies in your data)
ncbi-blast+
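
Before running the scripts, it can help to confirm that the requirements listed above are available. The following is a minimal sketch and not part of svm-agp. The import names (numpy, scipy, sklearn, Bio) are the usual ones for the listed packages. The executable names cd-hit and makeblastdb are assumptions about how the programs are installed on your system, so adjust them as needed.

```python
import importlib

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7

# Import names of the Python packages listed above.
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "Bio"]

# Assumed executable names; adjust them to your installation.
REQUIRED_PROGRAMS = ["cd-hit", "makeblastdb"]


def missing_modules(names):
    """Return the module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def missing_programs(names):
    """Return the program names that are not found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules: %s" % (missing_modules(REQUIRED_MODULES) or "none"))
    print("Missing programs: %s" % (missing_programs(REQUIRED_PROGRAMS) or "none"))
```

cd-hit only needs to be present if you plan to use the redundancy reduction.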

Prepare your data:

Get gene records from ENA with a query that matches your interests. We used 'viruses' as the query, limiting the search to coding sequences only. Store the query results as an EMBL flat file (Text) on your machine. Run prepareData.py to prepare the input for the prediction algorithm. The parameters have to be as follows:

Name of the EMBL formatted file obtained from ENA
Name of folder where to store the generated statistics
Redundancy level if you want to apply CD-HIT in order to reduce redundancies in the data beforehand. We used a level of 0.95. The redundancy reduction is omitted if you provide a value of 0.
Percentage of artificial outliers you want to add to the data for each family. Note that the outliers are also taken from the data you provide. This means the data must contain genes from multiple families if you want to set this value to anything larger than 0. We used a value of 5 to create the artificial data sets. The artificial outliers will be marked by “**foreign**” in the data generated.
Max oligonucleotide length to use. We didn't specifically test for values larger than 4.
Number of processors to execute the preparation on. Provide 1 if you don't want to use multiple processors.
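
The parameter list above can be turned into a command line as sketched below. This is an illustration under one assumption: the six parameters are passed as positional arguments in the order in which they are listed. The helper name build_prepare_command and the file and folder names viruses.embl and stats are hypothetical. The default values are the ones mentioned above (0.95, 5, 4) plus a single processor.

```python
def build_prepare_command(embl_file, stats_folder, redundancy=0.95,
                          outlier_percent=5, max_oligo_length=4, processors=1):
    """Build the argument list for prepareData.py.

    Assumption: the six parameters are positional and are given in the
    order in which the documentation lists them.
    """
    return [
        "python", "prepareData.py",
        embl_file,               # EMBL formatted file obtained from ENA
        stats_folder,            # folder for the generated statistics
        str(redundancy),         # 0 skips the CD-HIT redundancy reduction
        str(outlier_percent),    # percentage of artificial outliers per family
        str(max_oligo_length),   # max oligonucleotide length
        str(processors),         # number of processors
    ]


if __name__ == "__main__":
    # Hypothetical file and folder names, for illustration only.
    command = build_prepare_command("viruses.embl", "stats")
    print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call, or the printed line can be pasted into a shell.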

Note that this script may run for a while if your input is large. Warnings may arise if there are only one or zero genes for a family. Families with fewer than three data points are not considered in the prediction algorithm. If you run this script multiple times, for instance to consider a different number of oligonucleotide frequencies, the database that BLAST constructs for quick access to gene entries will not be recreated. This implies that if you want to create statistics from a new ENA query, you need to either make sure the names of the old and the new file are distinct or provide an empty folder in which to store the new statistics.
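
To make the "max oligonucleotide length" parameter more concrete, here is a small sketch of what oligonucleotide frequencies up to a given length look like. It is an illustration only and not the code used in prepareData.py. How the script normalizes counts or treats ambiguous bases is not described here, so the choices below (relative frequencies per length, skipping words with non-ACGT characters) are assumptions, and the function name oligo_frequencies is hypothetical.

```python
from collections import Counter
from itertools import product


def oligo_frequencies(sequence, max_length=4, alphabet="ACGT"):
    """Relative frequencies of all oligonucleotides of length 1..max_length.

    Words containing characters outside the alphabet (for example N)
    are skipped. Frequencies are normalized separately for each length.
    """
    sequence = sequence.upper()
    features = {}
    for k in range(1, max_length + 1):
        counts = Counter()
        # Slide a window of width k over the sequence.
        for i in range(len(sequence) - k + 1):
            word = sequence[i:i + k]
            if all(base in alphabet for base in word):
                counts[word] += 1
        total = sum(counts.values())
        # Emit every possible word so that all genes share the same features.
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            features[word] = counts[word] / float(total) if total else 0.0
    return features


if __name__ == "__main__":
    freqs = oligo_frequencies("ACGTAC", max_length=2)
    print(len(freqs), freqs["AC"])
```

The number of features grows as 4 + 16 + ... + 4^k, which gives 340 values for a maximum length of 4.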

Predict atypical genes:

Assuming you have prepared statistics for at least one virus family and have not modified the files created by prepareData.py, you can run predictor.py to see which genes are most atypical in the family. The following parameters are needed for the script:

The name of the folder where the generated statistics are stored
The name of the virus family for which you want to run the prediction. The family name must match the name of a sub-folder of the folder specified above.
Max oligonucleotide length to use. We didn't specifically test for values larger than 4. Must be consistent with the data you generated.
Limit on the output of the prediction result. For large families in particular, it can be handy to obtain only the top part of the ranked result, up to this threshold.
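
A call to predictor.py can be assembled in the same spirit. The sketch below assumes that the four parameters are positional and given in the order listed above. It also checks that the family name matches a sub-folder of the statistics folder before building the command. The helper name build_predict_command, the folder name stats, the family name Retroviridae and the limit of 20 are hypothetical examples.

```python
import os


def build_predict_command(stats_folder, family, max_oligo_length, limit):
    """Build the argument list for predictor.py.

    Assumption: the four parameters are positional and are given in the
    order in which the documentation lists them.
    """
    family_dir = os.path.join(stats_folder, family)
    if not os.path.isdir(family_dir):
        raise ValueError("no sub-folder %r in %r" % (family, stats_folder))
    return [
        "python", "predictor.py",
        stats_folder,            # folder with the generated statistics
        family,                  # must match a sub-folder name
        str(max_oligo_length),   # same value as used for prepareData.py
        str(limit),              # limit on the ranked output
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    try:
        print(" ".join(build_predict_command("stats", "Retroviridae", 4, 20)))
    except ValueError as error:
        print("Cannot build command: %s" % error)
```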

This script outputs a list of genes for each feature set discussed.