Bioinformatics analysis on structural features of microRNA precursors in insects

To date, thousands of microRNAs (miRNAs) and their precursors (pre-miRNAs) have been identified in insects and their nucleotide sequences deposited in the miRBase database. In the present work, we used bioinformatics tools to systematically analyze the feature differences between human and insect pre-miRNAs, as well as the differences across 24 insect species. The results showed that nucleotide composition, sequence length, nucleotide preference and secondary structure features differed between human and insects. Subsequently, pre-miRNA sequences were evaluated and scored with three available SVM-based prediction programs. Of 2633 sequences from the 24 chosen insect species, 2229 (84.7%) were successfully recognized by the Mirident classifier, a higher proportion than with Triplet-SVM (72.5%) and PMirP (72.6%). Four species, the domesticated silkworm, Bombyx mori L., the fruit fly, Drosophila melanogaster Meigen, the honeybee, Apis mellifera L. and the red flour beetle, Tribolium castaneum (Herbst), were found to be largely responsible for the poor prediction performance. Compared with other species, B. mori showed the worst performance, with the lowest average MFE index (0.73). Collectively, these results pave the way for understanding the specificity and diversity of miRNA precursors in insects, and lay the foundation for the further development of more suitable algorithms for insects.

* Corresponding author; e-mail: bxzhong@zju.edu.cn

INTRODUCTION
MicroRNAs (miRNAs) are a large class of small, endogenous non-coding RNAs approximately 22 nucleotides (nt) in length that regulate gene expression at the posttranscriptional level and play fundamental roles in multiple biological processes, including cell differentiation, proliferation and apoptosis, as well as disease processes (Ambros, 2004; Bartel, 2004; Bushati & Cohen, 2007; Wang & Li, 2007). According to recent studies, mature miRNAs are originally transcribed from a long primary miRNA (pri-miRNA) and processed into a 60-70 nt miRNA precursor (pre-miRNA) with the aid of two different enzymes, RNA polymerase II and RNase III Drosha, respectively (Lee et al., 2003, 2004). Since their initial discovery in the nematode, Caenorhabditis elegans, the study of miRNAs has become a rapidly growing field in the life sciences (Lee et al., 1993). Compared with miRBase 16.0, which included 142 species, the latest version, miRBase 18.0, has grown to 18226 miRNA gene loci in 168 species and 21643 distinct mature miRNA sequences (Kozomara & Griffiths-Jones, 2011), an increase directly attributable to the development of deep sequencing technology.
Although the founding members of the miRNAs, such as lin-4 and let-7, were identified by a genetic screening approach, computational approaches still play a critical role in the identification of novel miRNAs (Griffiths-Jones, 2004; Jones-Rhoades & Bartel, 2004; Li et al., 2010; Dong et al., 2012). Previous studies reported that miRNA genes are conserved in their primary sequences and secondary structures (Gesellchen & Boutros, 2004; Nam et al., 2005; Wang et al., 2005, 2007). Thus, for most computational approaches attempting to identify miRNAs, one of the critical discoveries has been the finding that all pre-miRNAs have a stem-loop hairpin in their secondary structure, as predicted by RNAfold or Mfold (Mathews et al., 1999; Hofacker, 2003; Zuker, 2003; Unver et al., 2009). To date, more and more computational approaches have been developed and widely applied to predict miRNAs from nematodes (Lim et al., 2003b), flies (Lai et al., 2003), humans (Lim et al., 2003a) and plants (Jones-Rhoades & Bartel, 2004). However, some such approaches are limited when there are no known close homologues among the compared sequences or when insufficient information is available about the species' genomes (Bentwich et al., 2005). Three programs, Triplet-SVM, PMirP and the latest Mirident classifier, which can be used to identify pre-miRNAs without utilizing comparative genomics information, are all based on a support vector machine (SVM) and are reported to perform well in predicting pre-miRNAs from humans (Xue et al., 2005; Zhao et al., 2010; Liu et al., 2012). Triplet-SVM is based on a set of novel structure features of stem-loops, while PMirP utilizes various features to distinguish real pre-miRNAs (Xue et al., 2005; Zhao et al., 2010). Unlike them, the Mirident classifier applies the software Teiresias to recognize sequence-structure motifs (ss-motifs) of different lengths in data sets (Liu et al., 2012). All of these programs can be used to distinguish real pre-miRNAs from pseudo ones; this function can in turn be utilized to illustrate feature differences among pre-miRNAs.
In our study, utilizing information from the miRBase 18.0 database, the feature differences between human and insect pre-miRNAs were systematically analyzed and compared, including the composition and preference of nucleotides and secondary structure characteristics. At the same time, three programs were separately applied to predict each pre-miRNA dataset. The differences across 24 species were also compared with the intention of illustrating the diversity among them and providing more information to bioinformaticians for developing more efficient tools for predicting insect pre-miRNAs. Moreover, by comparing the performance of the three programs, the study aims to help those engaged in miRNA research find more suitable tools for studying insects.

MATERIAL AND METHODS

Data sets
Firstly, all the miRNA precursors were downloaded from the miRNA registry database release 18.0 (November 2011) (http://www.mirbase.org/ftp.shtml). The human set contains 1527 hairpin sequences, while 2633 sequences are available for the 24 insect species. In order to reduce bias caused by redundant sequences, all the sequences were gathered together and redundancies (sequence identity > 90%) were filtered using the CD-HIT software, originally written by Weizhong Li (http://www.bioinformatics.org/cd-hit/).
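As an illustration of the first step, the sketch below groups hairpin records by species. It is a minimal example rather than the procedure actually used: it assumes a FASTA file whose identifiers begin with a species prefix (as in miRBase names such as "bmo-..." or "hsa-..."), and the function names and toy records are hypothetical.

```python
from collections import Counter

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def species_prefix(header):
    # Assumption: the identifier is the first token and starts with a
    # species code followed by "-", e.g. "bmo-mir-1" -> "bmo".
    return header.split()[0].split("-")[0]

def count_by_species(records):
    """Count hairpin records per species prefix."""
    return Counter(species_prefix(header) for header, _ in records)

# Toy records (made-up names and sequences, for illustration only).
toy_fasta = """>hsa-toy-1
UGGGAUGAGGUAGUAGG
>bmo-toy-1
GGCAAAUUGAGGUAGUA
>bmo-toy-2
UUCAGCCUUUGAGAGUU
""".splitlines()

counts = count_by_species(read_fasta(toy_fasta))
```

Redundancy filtering itself is left to CD-HIT, as described above; the sketch only covers reading and grouping the downloaded sequences.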

Three types of prediction software
Two important steps shared by the three programs are constructing the training models and using a support vector machine (SVM) to classify pre-miRNAs versus non-pre-miRNA hairpins. The ab initio program Triplet-SVM was kindly provided by Chenhai Xue of Tsinghua University (China). The PMirP classifier package was obtained from the web (http://ccst.jlu.edu.cn/ci/bioinformatics/MiRNA) and can be run directly on Windows with a C++ compiler. The Mirident classifier package, written by Xiuqin Liu of the Chinese Academy of Sciences, was executed with Python support (http://www.regulatoryrna.org/pub/Mirident/). Different versions of the third-party software libsvm were downloaded and installed on a Linux system according to the authors' instructions (Chang & Lin, 2011).
In order to compare the hairpin features between human and insects, the numbers of each nucleotide (A, U, C and G), the minimum free energy (MFE), sequence length, G + C content and A + U content were calculated for each sequence and exported into an Excel file. Student's t-test was applied to compare nucleotide differences between human and insects. At the same time, the nucleotide frequency at each position of the pre-miRNA sequences was counted separately in order to analyze nucleotide preference. Following a previous report (Zhang et al., 2006), the adjusted MFE (AMFE) and the MFE index (MFEI) were calculated using the following equations: AMFE = (MFE / sequence length) × 100 and MFEI = AMFE / (100 × RatioG-C), where RatioG-C stands for the content of (G + C).
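The two equations can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the MFE is supplied by the caller (for example from a folding program), its absolute value is used so that MFEI is positive as in the values reported here, and RatioG-C is taken as a fraction between 0 and 1. The function name and the toy sequence are hypothetical.

```python
def hairpin_features(seq, mfe):
    """Return composition, AMFE and MFEI for one hairpin sequence.

    seq: RNA (or DNA) sequence; T is treated as U.
    mfe: minimum free energy in kcal/mol (normally negative).
    """
    seq = seq.upper().replace("T", "U")
    length = len(seq)
    counts = {nt: seq.count(nt) for nt in "AUCG"}
    gc_ratio = (counts["G"] + counts["C"]) / length
    au_ratio = (counts["A"] + counts["U"]) / length
    # AMFE = (MFE / sequence length) x 100; the absolute value is an
    # assumption made here so that the index comes out positive.
    amfe = abs(mfe) / length * 100
    # MFEI = AMFE / (100 x RatioG-C); undefined when there is no G or C.
    mfei = amfe / (100 * gc_ratio) if gc_ratio > 0 else None
    return {
        "length": length,
        "counts": counts,
        "gc_ratio": gc_ratio,
        "au_ratio": au_ratio,
        "amfe": amfe,
        "mfei": mfei,
    }

# Toy example: 10 nt, 40% G + C, MFE of -3.4 kcal/mol.
features = hairpin_features("GCGCAUAUAU", -3.4)
```

With these toy inputs the AMFE is about 34 and the MFEI about 0.85, which is the threshold value used later in the analysis.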
In Table 2, the average MFEI was calculated separately for each of the 24 insect species; then, using 0.85 as a threshold, the number of pre-miRNAs with an MFEI greater than 0.85 was counted in each species. The corresponding percentage in each species was also calculated and termed PPM0.85 (percentage of pre-miRNAs with MFEI higher than 0.85).
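The per-species summary described above can be sketched as follows. This is a minimal illustration, assuming that the MFEI values have already been computed and grouped by species; the function name, the dictionary layout and the toy values are hypothetical and are not data from this study.

```python
def summarize_mfei(mfei_by_species, threshold=0.85):
    """Return mean MFEI, count above threshold and PPM per species."""
    summary = {}
    for species, values in mfei_by_species.items():
        n_above = sum(1 for v in values if v > threshold)
        summary[species] = {
            "mean_mfei": sum(values) / len(values),
            "n_above": n_above,
            # PPM0.85: percentage of pre-miRNAs with MFEI > threshold
            "ppm": 100.0 * n_above / len(values),
        }
    return summary

# Toy MFEI values for one made-up species (illustration only).
toy = {"species_x": [0.9, 0.8, 1.0, 0.7]}
result = summarize_mfei(toy)
```

Note that the comparison is strict (greater than 0.85), matching the definition of PPM0.85 given in the text.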

Features comparison of pre-miRNA precursors between human and insects
Based on the sequences available in miRBase, stem-loop features of human and insects were compared with respect to nucleotide composition, contents of A-U and C-G, full length, nucleotide preference and secondary structure. After removing highly similar stem-loop sequences in human and insects separately, 1428 and 1638 non-redundant sequences were retrieved, respectively. Fig. 1A shows that the average numbers of two nucleotides (A and U) differ between human and insects, while those of the remaining two nucleotides, G and C, are largely the same. Moreover, both datasets contained more A + U than G + C nucleotides. The content of U is higher than that of the other nucleotides, especially in insects, which contain around 31% U compared with 26.6% in humans. A previous study revealed that more than 28% of the nucleotides in miRNA precursors were U, a fact which could be used to distinguish miRNAs from other RNAs (Zhang et al., 2006). A higher A-U pair content makes the pre-miRNA less stable and easier to process into mature miRNA. Although the number and content of A-U in insects are clearly higher than those of C-G, the two pairs show no obvious difference in humans (Fig. 1B, C). Specifically, the average content of A-U in insects (56.5%) is clearly higher than that of C-G (43.5%), while the two are nearly equal in humans (50.1% A-U vs 49.9% C-G). The subsequent t-test also showed that the content of the A-U pair was significantly higher than that of the C-G pair in insects (P < 0.001), whereas no difference was found in humans.
Full length and minimum free energy (MFE) are two important features used to distinguish miRNA precursors. Some miRNA predictors take them into account when distinguishing real and pseudo sequences (Xue et al., 2005; Jiang et al., 2007; Zhao et al., 2010; Liu et al., 2012). The two datasets also differ in both full length and MFE (Fig. 1A, C). The average full length in insects is longer than in humans, but the MFE in insects (-33.07 kcal/mol) is less negative than that in humans (-39.83 kcal/mol) (Fig. 1C). Generally, as more bonding interactions are possible in longer molecules, increasing sequence length leads to a more negative MFE (Seffens & Digby, 1999). The opposite trend observed here is nevertheless not surprising, as the C-G pair content in insects is lower than that in humans (Fig. 1B): thermodynamic data illustrate that an A-U pair contributes -0.9 kcal/mol, while a C-G pair contributes -2.9 kcal/mol (Freier et al., 1986).
Thereafter, the contents of the four nucleotides at each position of the individual sequences from the two datasets were calculated and plotted along the X-axis according to their positions in the sequence (Fig. 2A). The head-to-head figure shows that both datasets share similar colour waves of nucleotides. This result indicates that, although the width of each wave differs slightly between insects and humans, they have similar nucleotide preferences. To better understand the differences between them, the ratio of nucleotide content (RNC) for identical nucleotides between humans and insects was used to observe the diversity of the two datasets. For example, human/insects (A) represents the RNC of adenine between humans and insects. If the RNC for a nucleotide at a given position is greater than 1.0, the content of this nucleotide in humans is higher than that in insects. Because the full lengths differ, positions 1-139, which are shared by the two datasets, were selected for further analysis. As shown in Fig. 2B, most RNCs of C and G within the first 56 nucleotides are > 1.0, in contrast to those of A and U. After around position 76, however, all nucleotide contents in humans appear smaller than those in insects. This result means that the RNCs of pre-miRNAs differ profoundly between humans and insects, and this difference might well be caused by insect species diversity.
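The position-wise comparison can be illustrated with a short sketch. It is written under assumptions that the text does not specify: the content of a nucleotide at a position is taken as its fraction among the sequences long enough to reach that position, and the RNC is left undefined (None) where the insect content is zero. The function names and toy sequences are hypothetical.

```python
from collections import Counter

def positional_content(seqs, max_len):
    """Fraction of A, U, C and G at each of the first max_len positions."""
    content = []
    for i in range(max_len):
        column = Counter(s[i] for s in seqs if len(s) > i)
        total = sum(column.values())
        content.append(
            {nt: (column[nt] / total if total else 0.0) for nt in "AUCG"}
        )
    return content

def rnc(human_seqs, insect_seqs, max_len):
    """Ratio of nucleotide content (human / insect) per position."""
    human = positional_content(human_seqs, max_len)
    insect = positional_content(insect_seqs, max_len)
    ratios = []
    for h, ins in zip(human, insect):
        ratios.append(
            {nt: (h[nt] / ins[nt] if ins[nt] > 0 else None) for nt in "AUCG"}
        )
    return ratios

# Toy two-position example (made-up sequences, illustration only).
ratios = rnc(["GA", "GC"], ["GA", "AA"], max_len=2)
```

In this toy case the RNC of G at the first position is 2.0, meaning G is twice as frequent there in the "human" set as in the "insect" set, which is how values above 1.0 are read in Fig. 2B.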

Global comparison and evaluation
Although the three published programs had made great strides in human hairpin prediction, there was evidence that the different features used for pre-miRNA detection could strongly influence the performance of an algorithm (Jiang et al., 2007; Liu et al., 2012). We assumed that differences in prediction performance could serve as an index directly reflecting differences in stem-loop structure. In addition, it was necessary to evaluate the feasibility and performance of an algorithm before employing it in different cases.
As shown in Table 1, the Mirident classifier shows the best performance, achieving an overall accuracy of 84.7% on the 2633 test sequences. Triplet-SVM correctly recognized 1909 out of 2567 pre-miRNAs and the PMirP classifier successfully detected 1913 out of 2529 sequences, giving accuracies of 74.4% and 75.6%, respectively. In order to decrease the bias caused by redundancies, non-redundant sequences were employed to evaluate the performance of the three programs. The results showed that 1315 out of 1638 sequences were correctly recognized by the Mirident classifier, which still ranked first with an accuracy of 80.3%, higher than those of Triplet-SVM (67.0%) and the PMirP classifier (66.8%). Of those successfully recognized precursors, only 835 pre-miRNAs were jointly detected by the three programs, accounting for about 51% of the predictions.
Table 1 notes: Twenty-four insect species were used to evaluate the performance of three nucleotide analysis programs. Accuracy is the degree of closeness of a measured quantity to its actual value (number in brackets). Non-redundant sequences stand for the output of the above total of 2633 insect sequences after filtering the redundancies (sequence identity > 90%) using the CD-HIT software. miRNAs conserved sequences represent stem-loop sequences whose mature miRNAs were found to be conserved in at least two insect species. The items with "#" stand for the corresponding sequences after removing four insect species, namely Bombyx mori, Drosophila melanogaster, Apis mellifera and Tribolium castaneum.
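The accuracy rates quoted above follow directly from the counts given in the text, as the short check below illustrates. The helper name is hypothetical; the counts are those stated in this section. The last line rests on an assumption, namely that the figure of about 51% is taken relative to the 1638 non-redundant sequences.

```python
def accuracy(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

# Counts as stated in the text (all sequences, per program).
mirident_all = accuracy(2229, 2633)
triplet_all = accuracy(1909, 2567)
pmirp_all = accuracy(1913, 2529)

# Non-redundant set, Mirident classifier.
mirident_nr = accuracy(1315, 1638)

# Assumption: jointly detected sequences relative to the 1638
# non-redundant sequences.
joint_share = accuracy(835, 1638)
```

The denominators for Triplet-SVM and PMirP are smaller than 2633 because, as noted in the following paragraph, sequences with multiple loops were removed before prediction by these two programs.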
In this evaluation, all three programs showed an obvious decline in accuracy on the non-redundant set compared with the full set of 2633 sequences. One possible reason is that the 995 highly similar sequences of the 24 insect species discarded by the CD-HIT procedure contained a large share of the correctly detectable sequences. We therefore also predicted these sequences using the three programs. Our results showed that 914 out of the 995 sequences were successfully recognized by the Mirident classifier, an accuracy of 91.9%, still higher than those of Triplet-SVM (86.2%) and the PMirP classifier (89.8%). These data confirmed our assumption and further demonstrated that these highly similar sequences share similar features with human stem-loop structures (Table 1). In addition, some pre-miRNA sequences were deleted before prediction by Triplet-SVM and PMirP because of undesirable multiple loops, in accordance with the design principles of each program. Although motivated by structural considerations, this data pretreatment complicates the subsequent processing of results compared with the Mirident classifier. Moreover, compared with previous tests on human precursor sequences, Triplet-SVM and the PMirP classifier produced more misclassifications in insects. Given that the test sequences came from 24 insect species, a plausible interpretation is that these two programs rely primarily on features of human pre-miRNAs, such as base pairs, full length and MFE. Unlike them, the Mirident classifier seems to be more suitable for predicting the pre-miRNAs of insects. According to the analysis of nucleotide preference, especially in Fig. 2A, the similar tendency in human and insect pre-miRNAs is consistent with Mirident's design concept of ss-motifs of variable length, which does not rely on sequence-structure features of fixed size (Liu et al., 2012).
However, highly similar precursor sequences do not imply that their microRNAs are also conserved, and conversely many microRNAs are almost identical while their precursors differ. For a more objective and accurate evaluation of the three programs, 1597 stem-loop sequences whose mature miRNAs were conserved in at least two insect species were selected and predicted using the three programs. The results in Table 1 revealed that, of the 1597 sequences, 1521 were successfully recognized by the Mirident classifier, an accuracy of 95.2%, still higher than those of Triplet-SVM (87.9%) and the PMirP classifier (89.7%). Compared with the above performance on the 995 highly similar precursor sequences, the two datasets showed similar trends.
Collectively, the global performance comparison illustrated that the different precursor features of human and insect miRNAs influence prediction performance. Species differences among insects were perhaps the main reason for the poor prediction performance obtained. To investigate this, further analyses were required to compare the diversity across the 24 insect species.

Comparison and analysis across 24 species
As shown in Table 1, the accuracy rates for the 24 insect species were obtained and summarized for the three prediction algorithms. Most recognition rates reached > 80%, except in the case of several species, including B. mori, D. melanogaster, A. mellifera and T. castaneum, which are largely responsible for the poor performance noted. This is especially important because these four species contribute 961 non-redundant pre-miRNA precursors, representing around 59% of the 1638 non-redundant sequences. After removing these from the global set, the accuracy rates of the three test programs each rose to higher levels, closer to those of previous human tests. Our results demonstrated that many non-redundant stem-loop sequences in these four species might well differ from those of the other species in the sample of 24 species. Consequently, nucleotide composition, full length and MFE were calculated for the 24 species in order to analyze such potential differences. Fig. 3 illustrates that the four nucleotides show almost the same change tendency across the 24 species. In nearly all of these species U is in excess of the other nucleotides, with the exception of the butterfly, Heliconius melpomene (L.) and the locust, Locusta migratoria (L.). The contents of the four nucleotides reached a minimum value in the aphid, Acyrthosiphon pisum (Harris) and L. migratoria. Furthermore, the negative MFEs also fell to the lowest values noted (Table 1). This demonstrates that the stem-loop structure features of these two species are probably very different from those of the other species in the collection. Moreover, and surprisingly, the accuracy rates of the three programs tended to change in the opposite direction to full sequence length (Fig. 4). This phenomenon indicates that full sequence length might affect the performance of the three chosen predictors.
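One simple way to quantify an opposite change tendency between two per-species series, such as mean sequence length and accuracy rate, is a Pearson correlation coefficient; a clearly negative value would be consistent with the pattern described above. The sketch below is only an illustration of that idea and is not an analysis performed in this study; the function name is hypothetical and the example values are made-up placeholders.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)

# Made-up placeholder values (NOT results from this study):
# mean hairpin length and accuracy for three hypothetical species.
toy_lengths = [80.0, 90.0, 100.0]
toy_accuracy = [90.0, 80.0, 70.0]
r = pearson(toy_lengths, toy_accuracy)
```

For these placeholder series the coefficient is close to -1, the extreme case of two quantities moving in opposite directions.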
Zhang and co-workers (Zhang et al., 2006) developed a new term, the minimum free energy index (MFEI), to distinguish different types of RNA in plants. Their results showed that when the MFEI was > 0.85, the sequence was most likely to be a real miRNA. Using this concept, we employed the MFEI to reveal the differences across the miRNAs of the 24 insect species tested. As shown in Table 2, most species had average MFEIs > 0.85, except for five species, i.e. B. mori, D. melanogaster, D. simulans (Sturtevant), A. mellifera and H. melpomene. Interestingly, three of these species, B. mori, D. melanogaster and A. mellifera, not only had lower MFEIs, but also accounted for three of the four poorly performing species noted above. Accordingly, more than half of the pre-miRNAs in 17 species had MFEIs > 0.85, while only 21.6% of miRNA precursors did so in B. mori (Fig. 4). Considering the curves shown in Fig. 4, PPM0.85 has almost the same change tendency as the performance of Triplet-SVM and the PMirP classifier. This result illustrates that the MFEI values indeed influence the performance of the two programs. More importantly, over-reliance on C-G content, full sequence length and MFE might be the main reason for the poor accuracy rates of the two algorithms in some insect species. Lastly, in the report of Zhang et al. (2006), more than 90% of pre-miRNAs had an MFEI > 0.85, whereas in our study the highest ratio observed was 71.4%, for L. migratoria (Fig. 4). This lack of uniformity might arise from differences between plants and insects. Thus, the MFEI might be a good, but not absolute, parameter for detecting pre-miRNAs; however, the average MFEI could better serve as an important index for describing the pre-miRNA features of a given species.
Fig. 4. Performance comparisons of three predictors through testing 24 insect species. The full lines represent the accuracy rates of the three predictors, while the dotted lines stand for the percentage of full length, MFEI and the percentage of pre-miRNAs with MFEI higher than 0.85 (PPM0.85). The X-axis contains the 24 insect species, which come from miRBase 18.0.
Although the stem-loop hairpin structure is not a unique feature of miRNAs, it is nevertheless still the most important factor in defining pre-miRNAs (Ambros et al., 2003). Full sequence length and MFE reflect the most immediate features of a pre-miRNA, while nucleotide preference leads to diversity and specificity among species. Based on the close relation between the MFEI and the performance of two programs, Triplet-SVM and the PMirP classifier, a lower MFEI appeared to be a key factor leading to poor prediction performance in several of the insect species tested. Perhaps the training models of the two programs rely too heavily on the features of human pre-miRNAs. A more suitable and effective training model for insects is presently under consideration and improvement. The case of B. mori is especially interesting because of its poor performance (Fig. 4), lowest average MFEI (0.73) and smallest PPM0.85 (21.6%) according to the above analyses. On the other hand, it cannot be ignored that errors might be caused by false-positive pre-miRNAs in miRBase. As is well known, many such errors arise from computational methodology without additional experimental confirmation and validation (Griffiths-Jones, 2004).

CONCLUSION
Taken as a whole, developments in deep sequencing and bioinformatics have greatly boosted research into miRNAs. As our results show, however, one important step is clearly the selection of more suitable tools to distinguish real miRNAs from pseudo ones. Given the performance of the Mirident classifier in predicting pre-miRNAs, general sequence features such as ss-motifs should be taken as important factors in constructing new tools or algorithms for pre-miRNA prediction. Furthermore, based on an understanding of the features of known insect pre-miRNA sequences, a better and more specific training model is still required for future studies. For those insect species whose genomes are presently unknown, such bioinformatics algorithms may play important roles in discovering many more useful miRNAs.
Note: Table 2 shows the average of each term, such as the four nucleotides, C-G pair, A-U pair, full sequence length, minimum free energy (MFE) and minimum free energy index (MFEI). The following "± number" = individual standard deviation (STDV).

Fig. 2. Nucleotide distribution and ratios of pre-miRNAs between humans and insects. A - Contents of four nucleotides from two datasets were calculated and plotted along the X-axis according to their positions in the individual sequence. The length of the waves represents sequence length, while the width represents the individual contents of nucleotides at each position. In the head-to-head figure, the left region represents 1428 non-redundant pre-miRNAs of humans, while the right one represents 1638 non-redundant precursor sequences of insects. The X-axis represents the order of nucleotides in their pre-miRNA sequence. B - The lines drawn in four colours represent the ratio of nucleotide content (RNC) for the same nucleotide between human and insects.

Fig. 3. Change tendency of four nucleotides in 24 insect species. The X-axis represents the 24 species; the Y-axis represents nucleotide contents.

TABLE 2. Nucleotide characteristics of 24 insect species.