Many ncRNAs function through both their sequences and
secondary structures. Thus, secondary structure derivation
is an important component of today's RNA research. State-of-the-art
structure annotation tools derive the consensus structure of
homologous ncRNAs and achieve better accuracy than ab
initio folding tools. Despite promising results from existing
ncRNA aligning and consensus structure derivation tools,
there is a need for more efficient and accurate ncRNA secondary
structure modeling and alignment methods.
In this work, we introduce a consensus structure derivation
tool based on the grammar string, a novel ncRNA secondary structure
representation that encodes an ncRNA's sequence and secondary
structure in the parameter space of a context-free grammar
(CFG) and a full RNA grammar including pseudoknots. Because a
grammar string is a string defined on a special alphabet constructed
from a grammar, it converts ncRNA alignment into sequence alignment
with O(n^2) complexity. We align hundreds
of ncRNA families from BRAliBase 2.1 and 25 families containing
pseudoknots using grammar strings and compare the resulting consensus
structures with those derived by Murlet and RNASampler. Our experiments
show that grammar-string-based structure derivation competes
favorably in consensus structure quality with Murlet and
RNASampler. Source code and experimental data are available
at http://www.cse.msu.edu/~yannisun/grammar-string.
Grammar rules and pseudocode are available here.
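To illustrate why reducing ncRNA alignment to sequence alignment yields quadratic cost, the following is a minimal sketch of standard global (Needleman-Wunsch) alignment over sequences of arbitrary symbols. It is not the authors' implementation: the function name `align_symbols`, the symbol labels in the usage note, and the match/mismatch/gap scores are all assumptions chosen for illustration, and the real grammar-string alphabet and scoring scheme are defined by the grammar rules referenced above.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # hypothetical gap symbol; assumed not to occur in the alphabet


def align_symbols(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed score for identical symbols
    mismatch: int = -1,  # assumed score for differing symbols
    gap: int = -2,       # assumed linear gap penalty
) -> Tuple[int, List[str], List[str]]:
    """Globally align two symbol sequences with dynamic programming.

    Fills an (len(a)+1) x (len(b)+1) table, so time and memory are
    O(n*m), i.e. O(n^2) for sequences of similar length.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    out_a.reverse()
    out_b.reverse()
    return score[n][m], out_a, out_b
```

For example, with made-up symbols, `align_symbols(["P1", "L1", "P2"], ["P1", "P2"])` aligns the two shared symbols and places a gap opposite `"L1"`. Because each symbol is compared as a whole unit, the same routine works for any alphabet, which is the property that lets a structure-encoding string be aligned with ordinary sequence alignment.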
This material is based upon work supported by the National Science Foundation under Grant No. 0953738.

