Sequence-based Detection of RNA Architectural Modules

Eric Westhof

RNA molecules are characterized by the formation of hydrogen-bonded pairs between the nucleotide bases along the polymer. All base-base interactions present in nucleic acids can be classified into twelve families, where each family is a 4x4 matrix of the bases A, G, C, U. The usual Watson-Crick pairs belong to one of these families, and the other eleven families gather the non-Watson-Crick pairs. The family of the common Watson-Crick pairs forms the secondary structure, and all the other families are critical for the tertiary structure. This classification clarifies RNA architecture, which can be viewed as the hierarchical assembly of preformed double-stranded helices defined by Watson-Crick base pairs and RNA modules maintained by non-Watson-Crick base pairs. RNA modules are recurrent ensembles of ordered non-Watson-Crick base pairs. The geometrical constraints attached to each base-pairing family explain the surprising molecular neutrality observed in sequences and structures during biological evolution. Through systematic comparisons between homologous sequences and X-ray structures, followed by automatic clustering, the whole range of sequence diversity in recurrent RNA modules has been characterized. These data permitted the construction of a computational pipeline for identifying known 3D structural modules in single and multiple RNA sequences in the absence of any other information. Any module can in principle be searched. Up to now, four modules can be searched automatically: the G-bulged loop, the Kink-turn, the C-loop and the tandem GA loop. In controlled test sequences we were able to find all of the known motifs with a false discovery rate of 0.23. The present pipeline can be used for RNA 2D structure refinement, 3D model assembly, and for searching and annotating structured RNAs in genomic data.
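
As a reading aid, the count of twelve families mentioned above can be reproduced with a minimal sketch. It assumes the standard Leontis-Westhof nomenclature, which the abstract itself does not spell out: each base offers three interacting edges (Watson-Crick, Hoogsteen, Sugar), and a pair is formed in either cis or trans orientation. The function and variable names are hypothetical and are not part of the pipeline described here.

```python
from itertools import combinations_with_replacement

# Assumption: edge and orientation labels follow the Leontis-Westhof
# nomenclature; they are not given in the abstract itself.
EDGES = ("Watson-Crick", "Hoogsteen", "Sugar")
ORIENTATIONS = ("cis", "trans")
BASES = ("A", "G", "C", "U")


def base_pair_families():
    """Enumerate families as (orientation, edge of base 1, edge of base 2)."""
    return [
        (orientation, edge1, edge2)
        for edge1, edge2 in combinations_with_replacement(EDGES, 2)
        for orientation in ORIENTATIONS
    ]


def empty_family_matrix():
    """Return a 4x4 table indexed by base; cells are left unfilled (None)."""
    return {b1: {b2: None for b2 in BASES} for b1 in BASES}


families = base_pair_families()
print(len(families))   # 6 unordered edge pairs x 2 orientations
print(families[0])     # the family containing the usual Watson-Crick pairs
```

Each family would then carry one 4x4 table such as the one returned by `empty_family_matrix()`, with a cell filled in for every base combination observed in that geometry.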