The untranslated regions (UTRs) of many mRNAs contain sequence and structural motifs that are used to
regulate the stability, localization and translatability of the mRNA. Unfortunately, the consensus
sequences for these motifs often vary considerably at any given position
and are only loosely characterized. Simple alignment tools are therefore frequently inadequate for the
discovery of previously unidentified RNA regulatory motifs.
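To illustrate what position-level variability means in practice, the following minimal sketch expands a degenerate consensus written in standard IUPAC nucleotide codes into a regular expression and scans a sequence for it. The function names and the consensus used in the usage note are illustrative assumptions and are not taken from the UTResource or from the training sets.

```python
import re

# Standard IUPAC nucleotide codes, expressed in the RNA alphabet.
# T is accepted as input and treated as U.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regex (hypothetical helper)."""
    parts = []
    for code in consensus.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    # A lookahead with a capture group lets finditer report overlapping hits.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_consensus(sequence, consensus):
    """Return (0-based start, matched text) for every hit in the sequence."""
    rna = sequence.upper().replace("T", "U")
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(rna)]
```

For example, find_consensus("GGAUUUAUUUAGG", "WUUUW") scans for an illustrative consensus in which the first and last positions may each be A or U. A scan of this kind captures only sequence-level degeneracy; it does not model structural motifs, which is one reason such simple matching tends to fall short.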
Many newer-generation software tools use
adaptive techniques that require training. One way to train and evaluate such software is to use validated
sets of sequence data known to contain a defined motif.
The UTResource
provides a database of UTRs that contain consensus sequences of experimentally discovered regulatory motifs.
From this database we generated a collection of Training UTR datasets.
The sequence files are meant to be used as blind test sets to simulate experimental results one might expect
from a process such as immunoprecipitation. Each sequence in a given file contains a previously characterized
RNA motif conforming to a defined consensus.
To date, twelve basic training sets have been generated, with associated indexes and answer sets provided
that identify where the previously characterized RNA motif (e.g. the IRE, ARE or SECIS) resides in each
sequence. The current version of the collection represents what could be thought of as ideal experimental data. That is,
each sequence is positive for at least one occurrence of the set's motif. We plan to eventually introduce
noise into the sets in the form of negative sequences, and possibly to increase the length of the sequences
containing the motifs.
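One way the answer sets might be used to evaluate a motif-finding tool is sketched below. The sketch assumes that both the answer set and the tool's predictions have already been parsed into dictionaries mapping a sequence identifier to a list of half-open (start, end) intervals. That in-memory layout, the function names and the overlap criterion are all assumptions made for illustration; they do not describe the actual file format of the distributed indexes and answer sets.

```python
def overlaps(pred, true, min_overlap=1):
    """True if two half-open (start, end) intervals share >= min_overlap positions."""
    shared = min(pred[1], true[1]) - max(pred[0], true[0])
    return shared >= min_overlap


def score_predictions(answers, predictions, min_overlap=1):
    """Compare predicted motif sites with an answer set (hypothetical layout).

    Both arguments map a sequence ID to a list of (start, end) intervals.
    Each annotated site can be credited to at most one prediction.
    Predictions for sequence IDs absent from the answer set are ignored.
    """
    tp = fp = fn = 0
    for seq_id, true_sites in answers.items():
        matched = set()
        for pred in predictions.get(seq_id, []):
            hit = next(
                (i for i, true in enumerate(true_sites)
                 if i not in matched and overlaps(pred, true, min_overlap)),
                None,
            )
            if hit is None:
                fp += 1  # prediction does not overlap any unmatched annotated site
            else:
                matched.add(hit)
                tp += 1
        fn += len(true_sites) - len(matched)  # annotated sites that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "ppv": ppv}
```

Because every sequence in the current sets is positive for its motif, false positives in this sketch can only arise from mislocated predictions within a positive sequence. If negative sequences were added, they could be represented here as sequence IDs with an empty list of annotated sites.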