Assignment 1, part 1

Now is the time to test one's soul. Before beginning, let's have one more look at a mummer.
Keep this image in your mind as you work on the assignment.

Step 1: Create output comparing the following two genomes:

  • Mycobacterium tuberculosis CDC1551 (TBCDC1551.1con)
  • Mycobacterium tuberculosis H37rv (TBH37rv.1con)
    Perform the comparison on the nucleotide level.

    Step 2: Parse the output to find SNPs.

    These genomes are very similar to each other and have a relatively small number of nucleotide differences between them. Once you have made the mummer output, examine it carefully and identify where you see single nucleotide differences between the genomes.

    SPECIAL HINT: The output of mummer DOES NOT come in numerical order. You will probably have to sort the data to get an answer.

    One way to sort a Perl list numerically is:

    @list = sort numerically (@list);
    sub numerically {
        $a <=> $b;
    }
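
The same idea extends to whole records. Here is a minimal sketch, assuming each match has already been parsed into three numbers: start in genome 1, start in genome 2, and match length. That layout and the sample values are assumptions for illustration, so check them against your own mummer output. Once the matches are sorted, a gap of exactly one base in both genomes between two consecutive matches is a candidate SNP.

```perl
use strict;
use warnings;

# Hypothetical match records: [start in genome 1, start in genome 2, length].
# The real field layout depends on your mummer output; adjust as needed.
my @matches = (
    [ 2001, 2001, 500 ],
    [    1,    1, 999 ],
    [ 1001, 1001, 999 ],
);

# Sort the records numerically by their start position in genome 1.
my @sorted = sort { $a->[0] <=> $b->[0] } @matches;

# A one-base gap in both genomes between consecutive matches marks a SNP.
my @snps;
for my $i ( 1 .. $#sorted ) {
    my ( $s1, $s2, $len ) = @{ $sorted[ $i - 1 ] };
    my ( $n1, $n2 )       = @{ $sorted[$i] };
    my $end1 = $s1 + $len;    # first base after the previous match, genome 1
    my $end2 = $s2 + $len;    # first base after the previous match, genome 2
    push @snps, $end1 if $n1 - $end1 == 1 && $n2 - $end2 == 1;
}

print "SNP at position $_\n" for @snps;
```

Larger gaps, or gaps of different sizes in the two genomes, are other kinds of polymorphism and are skipped by this check.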
    

    Step 3: "Publish" your results to a web page.

    Report to the user the total number of SNPs found, and then create a table that lists all of them. Alternatively, you don't have to limit yourself to SNPs, which are _single_ nucleotide differences; you can report all the polymorphisms to the user. You'll need to know the SNPs for the next part of the assignment, though.
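
One possible shape for the report, as a sketch only: the helper name snp_report and the sample positions are made up for illustration, and the Content-type line assumes the script runs as a CGI program on your web server.

```perl
use strict;
use warnings;

# Build a small HTML report from a list of SNP positions.
# snp_report is a hypothetical helper name, not part of the course code.
sub snp_report {
    my @positions = @_;
    my $html = "<html><body>\n";
    $html .= "<p>Total SNPs found: " . scalar(@positions) . "</p>\n";
    $html .= "<table border=\"1\">\n<tr><th>Position</th></tr>\n";
    $html .= "<tr><td>$_</td></tr>\n" for @positions;
    $html .= "</table>\n</body></html>\n";
    return $html;
}

# When run as a CGI script, the header line must come before the page body.
print "Content-type: text/html\n\n";
print snp_report( 1000, 2000 );
```

Building the page as a string first makes it easy to write the same report to a file instead of printing it.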


Assignment 1, part 2

    Using the output of the above program, look up the positions of the genes in the genome, which are listed in another file. Then report to the user whether each SNP is in a coding or a non-coding region. Summarize these results in a table.

    The file containing the gene coordinates for Tuberculosis is:

    TBCDC1551.ppt
    
    Its format is the following:
    
    Mycobacterium tuberculosis CDC1551, complete genome - 0..4403836
    4187 proteins
           Location		Strand	Length	PID	Gene	Synonym	Code	COG	Product
    
            1..1524     	 +	508	13879042	MT0001	-	-	-	chromosomal replication initiator protein DnaA	
         2052..3260     	 +	403	13879043	MT0002	-	-	-	DNA polymerase III, beta subunit	
         3280..4437     	 +	386	13879044	MT0003	-	-	-	recF protein	
    
    
    Most of this information you can ignore. The gene coordinates are in the Location column at the start of each line, written as start..end. For example, the first gene starts at nucleotide 1 and ends at nucleotide 1524.
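
As a rough sketch of how the gene file could be used, the following pulls the start and end out of each gene line and tests whether a position falls inside any gene. The two sample lines are copied from the listing above; the SNP positions and the helper name is_coding are made up for illustration, and a real script would read the lines from TBCDC1551.ppt instead.

```perl
use strict;
use warnings;

# Sample lines in the format shown above (tab-separated after Location).
my @lines = (
    "        1..1524     \t +\t508\t13879042\tMT0001\t-\t-\t-\tchromosomal replication initiator protein DnaA",
    "     2052..3260     \t +\t403\t13879043\tMT0002\t-\t-\t-\tDNA polymerase III, beta subunit",
);

# Collect [start, end] pairs from lines that begin with "start..end".
my @genes;
for my $line (@lines) {
    push @genes, [ $1, $2 ] if $line =~ /^\s*(\d+)\.\.(\d+)/;
}

# Return 1 if the position falls inside any gene, 0 otherwise.
sub is_coding {
    my ($pos) = @_;
    for my $gene (@genes) {
        return 1 if $pos >= $gene->[0] && $pos <= $gene->[1];
    }
    return 0;
}

for my $pos ( 1000, 1800 ) {    # made-up SNP positions for illustration
    printf "%d\t%s\n", $pos, is_coding($pos) ? "coding" : "non-coding";
}
```

Because the pattern only matches lines that begin with start..end, the title, protein count, and column header lines are skipped without special handling.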


    The answers to these assignments are in lesson4.