CSE891 Homework 4: Using BLAST to Identify Known ncRNAs

Because of strong secondary structural similarity rather than sequence similarity, using BLAST-like sequence alignment tools may miss homologous ncRNA sequences. This purpose of this homework is to validate this claim or hypothesis. And, this homework requires you to play with different bioinformatics tools and databases. Both BLAST and Rfam database provides good users' guide. Please browse related sections before you ask questions. 

This homework contains two parts. Part A is due on 11/3. Part B is due on 11/10. Be sure to bring your written report to the class!  Important: don't put the whole Rfam database or BLAST software suite on handin. Only submit your own source codes and the final output.

Part A (100 points). Evaluate whether ncRNA sequence families contain highly conserved regions, which can be matched by the first stage of BLAST. 

Step 1 (10 points) : go to http://rfam.janelia.org/browse.html. Extract the accession IDs for all ncRNA families with "Av. % id (seed)" < =70%. For example, RF00925 has Av. % id(seed) 35, which is <= 70. "Av. % id (seed)" represents the average sequence similarity between all the sequences in a seed family. Note that you can click on the column header to sort that column. Then you can copy all the rows with "Av. % id(seed)" <= 70 into a txt file under Linux. Using a command such as "cut" can help you extract the "Accession" column. Save this "Accession" numbers into a list. Submit this list via handin. Don't print it out. 

Step 2: Download the seed sequence family database from ftp://selab.janelia.org/pub/Rfam/Rfam.seed.gz  Extract families using the Accession number list generated from Step 1. For the families with Av. % id <=70, we evaluate whether there are strong local similarities between pairs of random seed sequences. 

Example, this is part of the MSA from an ncRNA family:

AE006696.1/291-218   GCCG.CCGU.A.GCUCAGCC.CGGG...AGAGCG.C.CCGGC.UGAAGAC

X06054.1/711-637     GGGC.CCGU.C.GUCUAGCC.UGGUU..AGGACG.C.UGCCC.UGACGCG

AE006699.1/3409-3482 GGGC.CCGU.A.GCUUAGCUC.GGU...AGAGCG.C.UCGGC.UCAUAAC

Randomly pick pairs of sequences from this sequence family. For example, we pick the first two sequences. And we extract their alignment below.

AE006696.1/291-218   GCCG.CCGU.A.GCUCAGCC.CGGG...AGAGCG.C.CCGGC.UGAAGAC

X06054.1/711-637     GGGC.CCGU.C.GUCUAGCC.UGGUU..AGGACG.C.UGCCC.UGACGCG

Remove columns containing only dots.

AE006696.1/291-218   GCCGCCGUAGCUCAGCCCGGG.AGAGCGCCCGGCUGAAGAC

X06054.1/711-637     GGGCCCGUCGUCUAGCCUGGUUAGGACGCUGCCCUGACGCG

Break the above gapped alignment into unggapped parts. We have two ungapped regions.

AE006696.1/291-218   GCCGCCGUAGCUCAGCCCGGG        AGAGCGCCCGGCUGAAGAC

X06054.1/711-637     GGGCCCGUCGUCUAGCCUGGU        AGGACGCUGCCCUGACGCG

Translate each ungapped alignment into a *similarity* string, which is a bit string with 1 denoting a match and 0 denoting a mismatch. Thus, we get two similarity strings:

100011110100011110110  and 1100111000011110100

Evaluate *local string similarity* by computing percentage of "1" in a substring with length from 11 to 20. Record the highest similarity for each length. For example,  the highest local string similarities for 100011110100011110110 with length 11 is: 7/11. The corresponding string is 11110100011 or 10001111011. Both of them have 7 "1" out of 11 positions. Record the highest local similarity for each length between 11 to 20. 

For each sequence family, you should extract pairwise alignments from the MSA without too many insertions or deletions. Otherwise you cannot build useful ungapped alignments. Please record the total number of pairwise alignments you extract from each sequence family. Suppose you have N sequences in the seed family, you can extract N(N-1)/2 pairwise alignments. In this part, don't try to get all of the possible pairwise alignments. Instead, randomly pick a large number (e.g. if N=100, try to pick at least 1000 random pairwise alignments). 

Submit a histogram of the highest local similarity for length 11 to 20. For example, for length 11, how many alignments fall into the highest local similarity range [0, 10%), [10%, 20%), ...

I will grade the second step using your histogram and a report. For example, how do you randomly pick pairwise alignments from an MSA? What is the average length of the ungapped alignments? How many pairwise alignments you choose from each MSA (tell me the statistics or the method, not the exact numbers)? This step is worth 90 points. 

Part B (100 points). Evaluate ncRNA homology search performance using BLAST. 

Step 1 (30 points): build a test set. Download the full alignments of ncRNA families from ftp://selab.janelia.org/pub/Rfam/Rfam.full.gz. For each family, remove sequences that are includes in the corresponding seed alignment. The remaining sequences consist of the test set for this specific ncRNA family. Only submit your source codes, not the test set via handin. I will test your programs using a randomly chosen sequence family.

Step 2 (70 points): for each sequence in the test family, BLAST is against all the seed sequences which had been downloaded in Part A. NCBI BLAST executable files can downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/. Please use pairwise alignment tool. An example is shown below (please read the help about the input parameters):

./bl2seq -i seq1.fa -j seq2.fa -p blastn -W 3 -e 0.1

Each test sequence will be blasted against all seed sequences from all seed families. Thus, you will get multiple alignment output. Use a voting algorithm to decide the family membership of this test sequence. For example, suppose the test sequence is originally extracted from sequence family A. After comparing it against all seed sequences from family A, B, C,.... you get a bunch of alignment outputs with e values < 0.1 (you can chance E value threshold using -e). Suppose most of these outputs are the alignment between the test sequence and seed sequences of family B, you will assign the test sequence to B. And, this results in a false membership assignment.

Another possibility is that this test sequence does not yield any statistically significant alignment with any seed sequence. That means we cannot decide its membership at all.

For each test set of each sequence family, record how many test sequences are assigned to the correct families; how many are assigned to the wrong families; and how many yield 0 significant alignment. Summarize your results in a text file. And, analyze the relationship between the average sequence similarity and BLAST-based ncRNA classification method. 

10 bogus points will be given to students who analyze how different input parameters (such as -W and -e) affect the final results.