The Marine Biological Laboratory, Woods Hole: The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution
JBPC Sequencing Informatics

Tutorials

 

Pipelines

JBPC genomics research routinely uses a series of bioinformatics programs to assemble and analyze genomic data.  The most common of these series of steps have been combined into "pipelines": programs that automate a whole series of steps so that it runs as one or a few commands.  Using these pipelines makes it easier to include sequencing projects in the GMOD interface.  Each pipeline script is available to all users of the JBPC computing facility.

Please follow the pipeline links for detailed information on the use of these pipelines. 

  • straw:  takes the sequencing reads, trims vector and low-quality data, and assembles the reads into contigs.  Review the output files with consed, then use them directly in make_scaffold. [Programs included:  phred, phd2fasta, cross_match, phrap]

  • consed:  not a pipeline program, but an editor for sequence assemblies.  Use it for QA/QC of sequences and assemblies before running make_scaffold.

  • make_scaffold: combines the straw output files containing contig information and scaffolds them into supercontigs.  The output data from make_scaffold can be provided to the GMOD administrator for import into the GMOD and GBrowse system. [Programs included: stripx.pl, makemates.pl, goBambus, toArachne.pl]

  • arachne2gbrowse: the final pipeline that imports sequencing data into GMOD.  This script is used by the GMOD administrator, using the output files from make_scaffold provided by the project researcher.

  • assemble_cdna: an alternative initial assembly script, similar to straw, but optimized for cDNA/EST projects that include very large numbers of reads. [Programs included:  phred, phd2fasta, lucy, zapping.awk, cross_match, stripx.pl, tgicl]
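
As a quick cross-reference, the "Programs included" lists above can be written down as a small lookup table. This is only an illustrative sketch in Python: the names PIPELINE_PROGRAMS and pipelines_using are hypothetical and not part of the JBPC scripts, and the table contains only what the list above states.

```python
# Programs bundled in each pipeline, copied from the list above.
# consed and arachne2gbrowse are omitted because no component
# programs are listed for them.
PIPELINE_PROGRAMS = {
    "straw": ["phred", "phd2fasta", "cross_match", "phrap"],
    "make_scaffold": ["stripx.pl", "makemates.pl", "goBambus", "toArachne.pl"],
    "assemble_cdna": ["phred", "phd2fasta", "lucy", "zapping.awk",
                      "cross_match", "stripx.pl", "tgicl"],
}


def pipelines_using(program):
    """Return the names of the pipelines whose list includes the given program."""
    return [name for name, programs in PIPELINE_PROGRAMS.items()
            if program in programs]


if __name__ == "__main__":
    for prog in ("phred", "stripx.pl", "phrap"):
        print(prog, "->", ", ".join(pipelines_using(prog)))
```

A lookup like this can help when deciding which pipeline to rerun after one of the underlying programs changes.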

 

Vector Library

The Sogin Lab has been collecting vector and splice files that are commonly used in sequencing.  Anyone may use them to trim sequences for assembly and analysis.  The map images are listed below; the fasta files are on the xraid, and you can copy them to your workspace.  For example, to find the exact filename you want (here, the pcr4 topo vector file) and copy it to your current directory (NB the final dot):
$ ls /xraid/bioware/linux/seqinfo/vectors
$ cp /xraid/bioware/linux/seqinfo/vectors/pcr4topo_vector.fa .
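
If you would rather do the same lookup and copy from a script, the two commands above can be sketched in Python roughly as follows. The function names find_vectors and copy_vector are hypothetical, chosen only for this example; the directory path is the one given above, and the sketch assumes it is readable from where the script runs.

```python
import shutil
from pathlib import Path

# Vector library location, as given in the text above.
VECTOR_DIR = Path("/xraid/bioware/linux/seqinfo/vectors")


def find_vectors(fragment, vector_dir=VECTOR_DIR):
    """List file names in vector_dir that contain fragment (case-insensitive)."""
    wanted = fragment.lower()
    return sorted(p.name for p in Path(vector_dir).iterdir()
                  if p.is_file() and wanted in p.name.lower())


def copy_vector(name, dest=".", vector_dir=VECTOR_DIR):
    """Copy one vector file into dest (default: current directory)."""
    return Path(shutil.copy(Path(vector_dir) / name, dest))


if __name__ == "__main__":
    # Equivalent of the ls step, narrowed to names containing "pcr4".
    for name in find_vectors("pcr4"):
        print(name)
```

Calling copy_vector("pcr4topo_vector.fa") then plays the role of the cp command, with the default dest="." standing in for the final dot.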

 

Useful Programs 

Several smaller programs are useful in analyzing sequences.  Some of them are listed below:

  • SEALS -  a very useful set of utilities for sequencing and manipulating sequence files, provided through NCBI.  Follow the Documentation link to see the list.

  • EMBOSS - a second set of applications, like SEALS, for data manipulation.  You will find a surprising number of useful tools.  See Overview for the list.

  • Seqinfo/bin - a directory of useful bioinformatic and sequence manipulation tools created by Sue Huse. All of these scripts can be run from your home directory (e.g. $ countbp myseq.fa).

  • ren - renames a series of files in your current directory based on pattern-matching.  Use * and ? as wildcards when specifying the old names, and #1, #2, etc. in the new name to refer to the text matched by the first, second, etc. wildcard.

    A simple example, to rename .fa files to .fasta:
    $ ren "*.fa" "#1.fasta"
    Or, to change from fas.pep and fas.cds to pep.fas and cds.fas:
    $ ren "*.fas.*" "#1.#2.fas"

  • stripx.pl - takes an input fasta or contig file and changes every x or X in the sequence to n or N, respectively.

  • defline_organism.pl - moves the Genus species name in a fasta definition line to the beginning of the text and encloses it in square brackets [ ].

  • defline_jgi.pl - creates a full NCBI-style definition line for sequences downloaded from JGI.

  • measure_polyAT - returns a very approximate measure of the length of polyA and polyT tails in a fasta file.
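
To make two of the descriptions above concrete, here is a minimal Python sketch of the renaming rule used by ren and of the substitution performed by stripx.pl. It is not the code of either tool: the function names ren_new_name and strip_x are hypothetical, the wildcard matching is assumed to be greedy, and definition lines (those starting with ">") are assumed to be left unchanged because the description refers only to "the sequence".

```python
import re


def ren_new_name(old_pattern, new_pattern, filename):
    """Return the new name for filename, or None if it does not match.

    Each * or ? in old_pattern becomes a capture group; #1, #2, ... in
    new_pattern are replaced by the text the matching wildcard captured.
    """
    parts = []
    for ch in old_pattern:
        if ch == "*":
            parts.append("(.*)")   # any run of characters (greedy: an assumption)
        elif ch == "?":
            parts.append("(.)")    # exactly one character
        else:
            parts.append(re.escape(ch))
    match = re.fullmatch("".join(parts), filename)
    if match is None:
        return None
    return re.sub(r"#(\d+)", lambda m: match.group(int(m.group(1))), new_pattern)


def strip_x(fasta_text):
    """Change x to n and X to N on sequence lines, leaving deflines alone."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)       # definition line: kept as-is (assumption)
        else:
            out.append(line.replace("x", "n").replace("X", "N"))
    return "\n".join(out)


if __name__ == "__main__":
    print(ren_new_name("*.fa", "#1.fasta", "myseq.fa"))
    print(ren_new_name("*.fas.*", "#1.#2.fas", "gene.fas.pep"))
    print(strip_x(">seq1\nACGTXXacgx"))
```

The first function only computes the new name and touches no files, which makes it a safe way to preview what a pattern pair would do before running the real command.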

 

 
     
Supported by NIH, NSF, NASA, The Josephine Bay Paul and C. Michael Paul Foundation, W.M. Keck Foundation, G. Unger Vetlesen Foundation, and Ellison Medical Foundation.
Unless otherwise stated, all material © 2004 Bay Paul Center, MBL.