Converters

Scythe uses simple, human-readable formats to store gene-transcript and ortholog information. See also Format.

loc

Scripts to convert the following formats to loc format are included:

  • gff3
  • tab-separated (e.g. ENSEMBL BioMart)

run

scythe_loc_gff.py -f GFF
    #------------ loc output format ---------------------------#

    LOCUS0	TRANSCRIPT0_0	TRANSCRIPT0_1	...	TRANSCRIPT0_n
    LOCUS1	TRANSCRIPT1_0	...	TRANSCRIPT1_m
    .
    .
    .
    LOCUSk	TRANSCRIPTk_0	...TRANSCRIPTk_l

    #----------------------------------------------------------#


    #-------- gff version 3 input format ----------------------#
    example (tab is shown as \t):

    ##gff-version 3
    L1i\texample\tgene\t1000\t9000\t.\t+\t.\tID=L1;Name=L1;Note=example
    L1.a\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=L1.a;Parent=L1;Name=L1.a


    Note that this script relies only on the "ID" and "Parent" tags;
    "Name" is ignored. If a "longest" tag is present (e.g. in Phytozome
    files) and equals "1", the transcript is placed first among the
    transcripts of its gene.
    #----------------------------------------------------------#

    

    #####################################
    #  scythe_loc_gff.py  -f FILE.gff3  #
    #####################################

    -f, --file=gff3_FILE
    -o, --output=FILE        output file [default: gff3_FILE.loc]
    -h, --help               prints this
    -H, --HELP               show help on format

or

scythe_loc_tsv.py -f FILE.tsv
    #------------ loc output format ---------------------------#

    LOCUS0	TRANSCRIPT0_0	TRANSCRIPT0_1	...	TRANSCRIPT0_n
    LOCUS1	TRANSCRIPT1_0	...	TRANSCRIPT1_m
    .
    .
    .
    LOCUSk	TRANSCRIPTk_0	...TRANSCRIPTk_l

    #----------------------------------------------------------#

    

    example ensembl query:

    http://www.ensembl.org/biomart/
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Query>
    <Query  virtualSchemaName = "default" formatter = "TSV"
    header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "biotype" value = "protein_coding"/>
    <Attribute name = "ensembl_gene_id" />
    <Attribute name = "ensembl_transcript_id" />
    <Attribute name = "ensembl_peptide_id" />
    <Attribute name = "cds_length" />
    </Dataset>
    </Query>
    

    ###########################
    #  scythe_loc_tsv.py      #
    ###########################
    -f, --file=ENSEMBLBioMart.tsv
                              format: 1st column: gene id, 2nd column:transcript id,
                                      3rd column: peptide id, 4th column: cds length;
                                      gene ids can occur multiple times
    -c, --custom=COLx,COLy,COLz,...   COLi in ["gene", "transcript", "protein", "length"]
                                      Use this if your file differs from the BioMart output described above.
                                      "length" is optional but recommended; at least one of
                                      ["transcript", "protein"] needs to be included
    -o, --output=FILE         output file [default: ENSEMBLBioMart.tsv.loc]
    -h, --help                prints this
    -H, --HELP                show help on format
    #----------------------------------#
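
The grouping step for the default four-column BioMart layout can be sketched in Python as follows. This is only an illustration, not the code of scythe_loc_tsv.py: the function name tsv_to_loc is hypothetical, and two behaviours are assumptions, namely that transcript ids (rather than peptide ids) are written to the loc line and that the transcript with the longest CDS is placed first.

```python
from collections import defaultdict


def tsv_to_loc(lines):
    """Turn BioMart rows (gene, transcript, peptide, cds length) into loc lines.

    Hypothetical helper: writing transcript ids and ordering by CDS length
    are assumptions made for this sketch.
    """
    genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        gene, transcript = fields[0], fields[1]
        # The length column is optional; treat a missing or empty value as 0.
        try:
            length = int(fields[3])
        except (IndexError, ValueError):
            length = 0
        # Gene ids can occur multiple times, so collect all rows per gene.
        genes[gene].append((length, transcript))

    loc_lines = []
    for gene, entries in genes.items():
        # Stable sort: longest CDS first, ties keep their input order.
        ordered = sorted(entries, key=lambda entry: -entry[0])
        loc_lines.append("\t".join([gene] + [t for _, t in ordered]))
    return loc_lines
```

Each returned string is one loc line: the gene id followed by its tab-separated transcript ids.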

See also: loc format.
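
For the gff3 route, the use of the "ID", "Parent" and "longest" tags described in the format help can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of scythe_loc_gff.py: the function names are hypothetical, and only features of type "gene" and "mRNA" are considered.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column such as 'ID=L1.a;Parent=L1' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_loc(lines):
    """Collect mRNA ids per parent gene and return loc lines (hypothetical helper)."""
    genes = {}  # gene id -> transcript ids; dicts keep insertion order
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        feature, attrs = fields[2], parse_attributes(fields[8])
        if feature == "gene" and "ID" in attrs:
            genes.setdefault(attrs["ID"], [])
        elif feature == "mRNA" and "ID" in attrs and "Parent" in attrs:
            transcripts = genes.setdefault(attrs["Parent"], [])
            # A transcript flagged longest=1 goes to the first position.
            if attrs.get("longest") == "1":
                transcripts.insert(0, attrs["ID"])
            else:
                transcripts.append(attrs["ID"])
    return ["\t".join([gene] + ts) for gene, ts in genes.items()]
```

"Name" and all other tags are parsed but never used, which mirrors the note in the format help.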

grp

Please note that the grp converters need a concatenated loc file in addition to the orthology information. Scripts to convert the following formats to grp are included:

  • orthomcl
  • proteinortho
  • tab-separated (e.g. ENSEMBL BioMart)

run

scythe_grp_orthomcl.py
scythe_grp_proteinortho.py
scythe_grp_tsv.py

See also grp format.
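
One plausible reason the grp converters need the concatenated loc file is that orthology tools report transcript or protein ids, while a grp line lists gene ids. The sketch below illustrates that mapping step only. It is an assumption about the purpose of the loc file, not the converters' actual code, and the names load_loc and groups_to_grp are hypothetical.

```python
def load_loc(lines):
    """Build a transcript id -> locus id lookup from concatenated loc lines."""
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        locus = fields[0]
        for transcript in fields[1:]:
            if transcript:
                lookup[transcript] = locus
    return lookup


def groups_to_grp(groups, lookup):
    """Translate groups of member ids into grp lines: GroupID, then gene ids.

    Members missing from the loc lookup are skipped (an assumption).
    """
    grp_lines = []
    for number, members in enumerate(groups):
        loci = []
        for member in members:
            locus = lookup.get(member)
            if locus is not None and locus not in loci:
                loci.append(locus)
        grp_lines.append("\t".join([str(number)] + loci))
    return grp_lines
```

Here each group is simply a list of member ids as parsed from the orthology tool's output; parsing the orthomcl or proteinortho files themselves is left out.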

Downloading from ENSEMBL without the GUI

To download sequences (pep and cds FASTA files) from ENSEMBL without the graphical user interface, use scythe_ensembl_fasta.py. To download pairwise orthology information, use scythe_ensembl_ortho_mysql.py.

scythe_ensembl_fasta

    usage: scythe_ensembl_fasta.py -s species1,species2 -r INT

    options:
    -s, --species=STR   comma-separated list of species (eg 'homo_sapiens,gorilla_gorilla')
    -r, --release=NUM   ENSEMBL version (eg '75')
    -d, --dir=DIR       output directory [default: ./]
    -h, --help          prints this

scythe_ensembl_ortho_mysql

    usage: scythe_ensembl_ortho_mysql.py -s species_1,species_2 -r INT

    options:
    -s, --species=STR   comma-separated list of species (eg 'homo_sapiens,gorilla_gorilla')
    -r, --release=NUM   ENSEMBL version (eg '75')
    -h, --help          prints this
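
To show how the -s and -r values relate to what is downloaded, the sketch below builds the FASTA directory locations for a species list and release. It assumes the usual layout of the public ENSEMBL FTP site (release-N/fasta/species/pep and .../cds); whether scythe_ensembl_fasta.py uses these exact locations is not stated here, and the function name is hypothetical. File names inside each directory include assembly names and are not guessed.

```python
def ensembl_fasta_dirs(species, release, kinds=("pep", "cds")):
    """Return FTP directory URLs for each species and sequence type.

    species: comma-separated string, e.g. 'homo_sapiens,gorilla_gorilla'
    release: ENSEMBL release number, e.g. 75
    The URL template is an assumption about the public FTP layout.
    """
    template = "ftp://ftp.ensembl.org/pub/release-{}/fasta/{}/{}/"
    urls = []
    for name in species.split(","):
        name = name.strip().lower()
        if not name:
            continue
        for kind in kinds:
            urls.append(template.format(release, name, kind))
    return urls
```

Calling ensembl_fasta_dirs("homo_sapiens,gorilla_gorilla", 75) yields four directories, one pep and one cds directory per species.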

Manual merge of tab-separated files to one .grp file

If you have pairwise (two-species) files ready and want to merge them into a multi-species .grp file, you can do so with the scythe_ensembl2grp and scythe_mergeSubsets scripts.

scythe_ensembl2grp

    usage: scythe_ensembl2grp.py -f FILE1,FILE2 -o OUT.grp

    -f, --files=STR     list of ensembl tsv files (eg sA.tsv,sB.tsv,sC.tsv)
    -o, --output=FILE   output file
    -h, --help          prints this

scythe_mergeSubsets

    usage:  scythe_mergeSubsets.py -g groups.grp -o new.grp

    options:
    -g, --grp=FILE.grp
    -o, --output=OUTFILE.grp    output file [default: FILE.allspec.grp]
    [-r, --rename               discard old orthogroup ids and start numbering from 0]
    -h, --help                  prints this
    [-n, --numspec=N    min number of species ]
    ------------
    .grp format: GroupID	geneIDiSp1	geneIDjSp2	...geneIDkSpn
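
The idea of merging pairwise orthologs into multi-species groups can be sketched as below. This is not the algorithm of scythe_ensembl2grp.py or scythe_mergeSubsets.py, which is not described here: the sketch assumes simple single-linkage merging with a union-find structure, and the names merge_pairs and to_grp are hypothetical. The numspec argument plays the role of the -n option, and groups are numbered from 0 as with -r.

```python
def merge_pairs(pairs):
    """Merge ortholog pairs into groups (single linkage via union-find).

    pairs: iterable of ((species, gene), (species, gene)) tuples.
    Returns a list of groups, each a list of (species, gene) tuples.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def to_grp(groups, numspec=2):
    """Format groups as grp lines, keeping groups with at least numspec species."""
    grp_lines = []
    for members in groups:
        if len({species for species, _ in members}) < numspec:
            continue
        genes = [gene for _, gene in sorted(members)]
        grp_lines.append("\t".join([str(len(grp_lines))] + genes))
    return grp_lines
```

Single linkage joins any two groups that share one gene, so a stricter criterion may be preferable for real data.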