SeqWord Phylogenomics Version "2017/07/01" Readme File

This script is written in Python 3.

The following modules are required for the program to run:
> SciPy 0.16.1 or later
> NumPy 1.10.3 or later
> SymPy 0.7.2 or later
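
The minimum versions above can be checked before running the program. The snippet below is a minimal sketch and is not part of run.py; the function names (version_tuple, meets_minimum, check_dependencies) are hypothetical helpers introduced here for illustration. Versions are compared as tuples of integers because a plain string comparison would rank "1.9" above "1.10".

```python
import re
from importlib import metadata  # standard library, Python 3.8 or later

# Minimum versions as listed in this readme.
MINIMUM_VERSIONS = {"scipy": "0.16.1", "numpy": "1.10.3", "sympy": "0.7.2"}


def version_tuple(version):
    """Turn '1.10.3' (or '1.11.0rc1') into a tuple of ints such as (1, 10, 3)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component without leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """True when the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed version or None, ok flag)}."""
    report = {}
    for name, required in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, required))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_dependencies().items():
        status = "OK" if ok else "needs attention"
        print(name, installed or "not installed", status)
```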


The program can be run from the command line as python3.x run.py followed by the arguments below:

-f [Folder name] > folder containing the GenBank files of the organisms under study (Compulsory):
> The folder must be located inside the input folder given by option "-i".
> All GenBank/FASTA files under study must be within this one folder; the program analyzes every file with these extensions.
> The example folder Bacillus is located within the "zip" input folder and can be used to test the program.
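
The folder layout described above can be inspected before a run. The snippet below is only a sketch of how the dataset folder resolves to "<input folder>/<folder name>"; it is not taken from run.py. The function name list_sequence_files and the extension set are assumptions, since this readme only says "genbank/fasta files" without listing the exact extensions the program accepts.

```python
from pathlib import Path

# Assumed extensions; adjust to whatever the program actually accepts.
ASSUMED_EXTENSIONS = {".gbk", ".gb", ".fasta", ".fas"}


def list_sequence_files(input_folder, dataset, extensions=ASSUMED_EXTENSIONS):
    """List the sequence files found in <input_folder>/<dataset>."""
    dataset_dir = Path(input_folder) / dataset
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"{dataset_dir} is not a folder")
    return sorted(
        path for path in dataset_dir.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

For the bundled example this would be called as list_sequence_files("zip", "Bacillus").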

-g [gyra.fas] > GyrA protein gene sequence in FASTA format (Optional):
> Only the GyrA gene is accepted as the reference protein gene.
> A single FASTA file containing the gyra sequences of all organisms under study.
> This file provides additional information for constructing the final phylogenetic tree, which is built upon the oligonucleotide usage pattern.
> An additional dendrogram and clustering plot are created with this option.
> If the command line interface is used, please enter gyra as the input parameter without the file extension.
> The example file gyra.fas for the taxonomic group Bacillus is located in the "zip/Bacillus" folder and can be used to test this function.

-i [input folder] > by default the "zip" folder is used:
> Working directory path for all input folders.

-o [output folder] > by default the "output" folder is used:
> Working directory path for all output folders.

-r [cluster number - Integer 2,3] > by default the cluster number is calculated automatically by the algorithm:
> Only applicable when the reference protein sequence (GyrA) is used.
> Species are clustered based on the relationship between the pairwise distances calculated from the GyrA protein and those calculated from the oligonucleotide usage pattern.
> Clustering constrains species in the same cluster to be placed closer to each other, based on the integrated information of the GyrA gene and the oligonucleotide usage pattern of the whole genome sequence.
> This option fixes the number of clusters the program creates, instead of the default behaviour where the program determines the best clustering solution.

-c [N Protein Contribution Number - Integer 1,2,3] > by default N = 1:
> Only applicable when the reference protein sequence (GyrA) is used.
> The Protein Contribution Number controls how much weight the GyrA gene distance contributes to the final metric that determines the phylogenetic tree.
> For closely related organisms, a larger "-c" value gives better resolution of relationships within the phylogenetic tree, because their oligonucleotide usage patterns are very similar.
> It is advised to use a value of 2 when closely related species are compared, and a value of 3 to distinguish between subspecies of the same species.

-n [Yes/No] > by default "No"; if "Yes", sequences will be normalized by k-mer pattern:
> This function normalizes the sequence pattern based on k-mer words,
> i.e. n1 - normalization based on mononucleotide frequencies, n2 - normalization based on dinucleotide frequencies.
> It is advised not to change this setting, as this function has not been tested within phylogenetic analysis.
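
To illustrate what normalization by mononucleotide frequencies (n1) can mean, the sketch below divides each observed k-mer frequency by the frequency expected if the bases occurred independently. This observed/expected ratio is one common way to normalize word frequencies; it is an assumption made for illustration and not necessarily the calculation run.py performs. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every ACGT word of length k in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(letters) for letters in product(BASES, repeat=k)]
    total = sum(counts[word] for word in words)  # ignores words with N etc.
    if total == 0:
        raise ValueError("sequence contains no valid words of length %d" % k)
    return {word: counts[word] / total for word in words}


def normalize_by_mononucleotides(seq, k):
    """Observed k-mer frequency divided by the product of its base frequencies."""
    base_freq = kmer_frequencies(seq, 1)
    observed = kmer_frequencies(seq, k)
    ratios = {}
    for word, freq in observed.items():
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        # A word whose expected frequency is zero cannot occur; report 0.0.
        ratios[word] = freq / expected if expected > 0 else 0.0
    return ratios
```

A ratio above 1 marks a word that occurs more often than the base composition alone would predict; a ratio below 1 marks an under-represented word.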

-e [Yes/No] > by default "No"; if "Yes", the default parameters for g and k will be used:
> This function skips the calculation of the clustering parameters and uses pre-calculated parameters g and k estimated from all organisms within the current study (Bacillus, Corynebacteria, Enterobacteria, Lactobacilli, Mycobacteria, Pseudomonas, Prochlorococcus, Thermotoga).
> Using the default g and k parameters ensures sample independence of the results.

-d [Yes/No] > by default "Yes", if "No", distance table will not be saved
-m [Yes/No] > by default "Yes", if "No", clustering table will not be saved
-t [Yes/No] > by default "Yes", if "No", phylogenetic tree will not be saved
-l [Yes/No] > by default "Yes", if "No", cladogram will not be saved
-p [Yes/No] > by default "Yes", if "No", Verhulst plot will not be saved

Example: python3.4 run.py -f Bacillus -g gyra.fas -r 2 -c 3

This will return the phylogenetic tree, cladogram, cluster matrix, distance table and Verhulst plot for the sequences within the dataset Bacillus, using the reference gene gyra with cluster number 2 and protein contribution 3, without using the default g and k parameters.
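
When several datasets are analyzed in a row, the same call can be assembled from a small Python wrapper. The snippet below is a sketch only: build_command and run_analysis are hypothetical helpers, not part of the program, and they use only the flags documented in this readme. It assumes run.py is in the current working directory.

```python
import subprocess
import sys


def build_command(folder, gyra=None, clusters=None, contribution=None,
                  input_folder=None, output_folder=None, script="run.py"):
    """Assemble the argument list for run.py from the documented options."""
    command = [sys.executable, script, "-f", folder]
    optional = [
        ("-g", gyra),
        ("-i", input_folder),
        ("-o", output_folder),
        ("-r", clusters),
        ("-c", contribution),
    ]
    for flag, value in optional:
        if value is not None:  # omitted options keep the program defaults
            command += [flag, str(value)]
    return command


def run_analysis(**options):
    """Run the program once; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(**options), check=True)
```

The example above corresponds to run_analysis(folder="Bacillus", gyra="gyra.fas", clusters=2, contribution=3).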

The program can also be used through an interactive interface, which starts when no arguments are given on the command line.

Example: python3.4 run.py

Arguments can then be entered manually through this interface by following the instructions shown.


