Scythe uses simple, human-readable formats to store gene-transcript and ortholog information. See also Format.
Scripts to convert the following formats to loc format are included:
run
scythe_loc_gff.py -f GFF
#------------ loc output format ---------------------------#
LOCUS0 TRANSCRIPT0_0 TRANSCRIPT0_1 ... TRANSCRIPT0_n
LOCUS1 TRANSCRIPT1_0 ... TRANSCRIPT1_m
.
.
.
LOCUSk TRANSCRIPTk_0 ... TRANSCRIPTk_l
#----------------------------------------------------------#
#-------- gff version 3 input format ----------------------#
example (tab is shown as \t):
##gff-version 3
L1i\texample\tgene\t1000\t9000\t.\t+\t.\tID=L1;Name=L1;Note=example
L1.a\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=L1.a;Parent=L1;Name=L1.a
Note that this script relies only on the "ID" and "Parent" tags;
"Name" is ignored. If a "longest" tag is present (e.g. in Phytozome
files) and its value is "1", the transcript is placed in the first
position for its gene.
#----------------------------------------------------------#
#####################################
# scythe_loc_gff.py -f FILE.gff3 #
#####################################
-f, --file=gff3_FILE
-o, --output=FILE output file [default: gff3_FILE.loc]
-h, --help prints this
-H, --HELP show help on format
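The GFF3-to-loc conversion described above can be sketched in a few lines of Python. This is only an illustration and not the code of scythe_loc_gff.py: the function names are hypothetical, and the tab-separated output is an assumption. It reads the "ID" and "Parent" tags of mRNA features and moves a transcript tagged longest=1 to the front of its locus.

```python
def parse_attributes(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_loc(lines):
    """Return loc lines (locus followed by its transcripts) from GFF3 lines.

    Hypothetical helper; output is tab-separated by assumption.
    """
    loci = {}  # locus ID -> list of transcript IDs, in insertion order
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            loci.setdefault(attrs["ID"], [])
        elif cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            transcripts = loci.setdefault(attrs["Parent"], [])
            if attrs.get("longest") == "1":
                transcripts.insert(0, attrs["ID"])  # longest form goes first
            else:
                transcripts.append(attrs["ID"])
    # Loci without any transcript are left out of the output.
    return ["\t".join([locus] + ts) for locus, ts in loci.items() if ts]
```

Only the feature type in column 3 and the attribute column are inspected; coordinates and the "Name" tag play no role.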
or
scythe_loc_tsv.py -f FILE.tsv
#------------ loc output format ---------------------------#
LOCUS0 TRANSCRIPT0_0 TRANSCRIPT0_1 ... TRANSCRIPT0_n
LOCUS1 TRANSCRIPT1_0 ... TRANSCRIPT1_m
.
.
.
LOCUSk TRANSCRIPTk_0 ... TRANSCRIPTk_l
#----------------------------------------------------------#
example ensembl query:
http://www.ensembl.org/biomart/
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "TSV"
header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
<Filter name = "biotype" value = "protein_coding"/>
<Attribute name = "ensembl_gene_id" />
<Attribute name = "ensembl_transcript_id" />
<Attribute name = "ensembl_peptide_id" />
<Attribute name = "cds_length" />
</Dataset>
</Query>
###########################
# scythe_loc_tsv.py #
###########################
-f, --file=ENSEMBLBioMart.tsv
format: 1st column: gene id, 2nd column: transcript id,
3rd column: peptide id, 4th column: cds length;
gene ids can occur multiple times
-c, --custom=COLx,COLy,COLz,... COLi in ["gene","transcript", "protein", "length"]
Use this if your file is different from the described biomart output.
"cds_length" is optional but recommended; at least one of
["transcript", "protein"] needs to be included
-o, --output=FILE output file [default: ENSEMBLBioMart.tsv.loc]
-h, --help prints this
-H, --HELP show help on format
#----------------------------------#
See also: loc format.
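As a rough sketch of the TSV-to-loc step, the snippet below groups BioMart rows by gene ID. It is not the code of scythe_loc_tsv.py. The function name and its parameters are hypothetical, the output is assumed to be tab-separated, and sorting each gene's entries by cds length (longest first) is an assumption made by analogy with the "longest" handling in the GFF converter.

```python
import csv
import io


def tsv_to_loc(handle, columns=("gene", "transcript", "protein", "length"),
               use="transcript"):
    """Group rows by gene and return loc lines (hypothetical helper).

    columns: meaning of each input column, as in the -c/--custom option.
    use: which identifier ("transcript" or "protein") to write out.
    """
    idx = {name: i for i, name in enumerate(columns)}
    by_gene = {}  # gene ID -> list of (length, identifier)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(columns):
            continue  # skip incomplete rows
        gene, ident = row[idx["gene"]], row[idx[use]]
        if not gene or not ident:
            continue  # e.g. rows without a peptide ID
        length = 0
        if "length" in idx and row[idx["length"]].isdigit():
            length = int(row[idx["length"]])
        by_gene.setdefault(gene, []).append((length, ident))
    lines = []
    for gene, entries in by_gene.items():
        # Assumption: longest cds first; the stable sort keeps file order on ties.
        ordered = sorted(entries, key=lambda e: -e[0])
        lines.append("\t".join([gene] + [ident for _, ident in ordered]))
    return lines


if __name__ == "__main__":
    demo = io.StringIO("G1\tT1\tP1\t300\nG2\tT3\tP3\t150\nG1\tT2\tP2\t900\n")
    print("\n".join(tsv_to_loc(demo)))
```

Gene IDs may occur on several rows, as noted in the option description; the rows are collected per gene before any line is written.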
Please note that the grp converters need a concatenated loc file in addition to the orthology information. Scripts to convert the following formats to grp are included:
run
scythe_grp_orthomcl.py
scythe_grp_proteinortho.py
scythe_grp_tsv.py
See also grp format.
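The concatenated loc file mentioned above can be thought of as a lookup from transcript IDs to their loci. The sketch below shows one way to build such a lookup. It rests on the assumption that the grp converters use the loc file to map IDs found in the orthology output back to gene IDs; this is not taken from the scripts themselves, and both function names are hypothetical.

```python
def concatenate_loc(*loc_files):
    """Join the lines of several loc files (one per species) into one list."""
    merged = []
    for lines in loc_files:
        merged.extend(line.rstrip("\n") for line in lines if line.strip())
    return merged


def transcript_to_locus(loc_lines):
    """Return {transcript_id: locus_id} for whitespace-separated loc lines."""
    lookup = {}
    for line in loc_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # a locus without transcripts carries no mapping
        locus = fields[0]
        for transcript in fields[1:]:
            lookup[transcript] = locus
    return lookup
```

In practice each argument to concatenate_loc could be an open file handle, since iterating over a handle yields its lines.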
To download data from ENSEMBL without the graphical user interface, use scythe_ensembl_fasta.py for sequences (pep and cds fasta files) and scythe_ensembl_ortho_mysql.py for pairwise orthology information.
usage: scythe_ensembl_fasta.py -s species1,species2 -r INT
options:
-s, --species=STR comma-separated list of species (e.g. 'homo_sapiens,gorilla_gorilla')
-r, --release=NUM ENSEMBL version (e.g. '75')
-d, --dir DIR output directory [default ./]
-h, --help prints this
usage: scythe_ensembl_ortho_mysql.py -s species_1,species_2 -r INT
options:
-s, --species=STR comma-separated list of species (e.g. 'homo_sapiens,gorilla_gorilla')
-r, --release=NUM ENSEMBL version (e.g. '75')
-h, --help prints this
If you have pairwise (two-species) files ready and want to merge them into a multi-species .grp file, you can do so with the scythe_ensembl2grp.py and scythe_mergeSubsets.py scripts.
usage: scythe_ensembl2grp.py -f FILE1,FILE2 -o OUT.grp
-f, --files=STR comma-separated list of ensembl tsv files (e.g. sA.tsv,sB.tsv,sC.tsv)
-o, --output=FILE output file
-h, --help prints this
usage: scythe_mergeSubsets.py -g groups.grp -o new.grp
options:
-g, --grp=FILE.grp
-o, --output=OUTFILE.grp output file [default: FILE.allspec.grp]
[-r, --rename discard old orthogroup ids and start numbering from 0]
-h, --help prints this
[-n, --numspec=N min number of species ]
------------
.grp format: GroupID geneIDiSp1 geneIDjSp2 ... geneIDkSpn
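To illustrate how pairwise groups could be combined into multi-species groups, the sketch below joins all .grp entries that share a gene ID, using a small union-find structure. This is one plausible approach and not necessarily what scythe_mergeSubsets.py does. The function names and the min_members parameter are hypothetical, the output is assumed to be tab-separated, and groups are renumbered from 0 in the spirit of the --rename option.

```python
def read_grp(lines):
    """Parse .grp lines into {group_id: [gene IDs]}."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            groups[fields[0]] = fields[1:]
    return groups


def merge_overlapping(groups, min_members=2):
    """Merge groups that share at least one gene ID (hypothetical helper)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for members in groups.values():
        find(members[0])  # register the first gene of the group
        for gene in members[1:]:
            union(members[0], gene)

    merged = {}
    for gene in list(parent):
        merged.setdefault(find(gene), []).append(gene)

    # With one gene per species, the group size equals the species count.
    kept = sorted(sorted(m) for m in merged.values() if len(m) >= min_members)
    return ["\t".join([str(i)] + members) for i, members in enumerate(kept)]
```

For example, the pairwise groups (hsA, ggA) and (ggA, ptA) share ggA and would end up in a single three-member group.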