How can I find promoter sequences




















To cover the whole spectrum of prediction methodologies, we selected a few representative procedures, mainly based on the conservation of gene structure (fprom [1], firstef [13], dpf [12] and nscan [29]), the identification of CpG islands (eponine [22], cpgprod [16] and dgsf [21]), compositional sequence biases (mcpromoter [26, 33]) and other criteria (nnpp [24] and promoter2).

The results of these comparisons show that, despite its simplicity, ProStar performed better than most of the other methods and was similar to two algorithms that use gene structure for prediction (fprom and firstef); only nscan, which is also based on multi-species homology, provided more accurate results for the reference set of genes (Figure 2, Table 2 and Figure S5 in Additional data file 1).

Global analysis of performance using Bajic's metrics [42] (see Materials and methods) showed that the predictive power of our method is improved upon only by nscan (Table 2 and Table S2 in Additional data file 2). Finally, it is worth noting the good performance of ProStar, which uses only simple dinucleotide parameters, compared to complex methods based on n-mer compositional rules (see Materials and methods).

Clearly, the richness of the six-dimensional descriptors obtained for each dinucleotide by the MD simulation explains the success of our simple approach.

Results of performance comparison for the Encode region between ProStar and other programs (Table S1 in Additional data file 2) using a window size D equal to 1, (see Materials and methods). Results obtained compare the predictive power with (a) a subset of Havana protein coding genes, (b) a set of 1, non-coding genes, and (c) a set of 1, annotated TSSs from a Cage data set that fall inside non-CpG island coding genes (see Materials and methods). Squares indicate methods based on gene prediction (exons, intronic signals, and so on), and other methods are represented with circles.

Interestingly, when the analysis is performed for a subset of TSSs of non-coding genes (Figure 2, Table 2 and Figure S6 in Additional data file 1), the performance of all the methods decreases, but ProStar seems more robust than the others. In fact, the analysis of these data shows that, for this subset of genes, ProStar performs better than any method that uses sequence compositional bias, location of known TFBSs, or the presence of TATA-box signals or CpG islands, and similarly to or better than those relying on the presence of orthologs, as shown by Bajic's metrics (Table 2).

Our method works better when predicting promoters associated with CpG islands, but the decrease in performance for promoters not associated with CpG islands is similar to that of other methods, including those based on the maintenance of gene structure (Figure S7a in Additional data file 1). If a conservative definition of a non-CpG associated promoter is used (no CpG island detectable at less than 5 kb from the promoter), the performance of ProStar decreases but is still better than that of most methods (Figure S7b in Additional data file 1), although even in this case the method is not competitive with algorithms based on gene structure conservation.

In any case, the performance of ProStar for genes not associated with CpG islands is quite reasonable, confirming that the need for specific elastic properties at promoter regions is a general requirement and not restricted to the presence of CpG islands or diffuse TSSs. It is also worth noting that ProStar performs better than methods specifically tuned to capture promoters associated with CpG islands when the analysis is restricted to Havana annotated genes with CpG islands (data not shown).

Finally, the performance of ProStar does not decay for genes containing a TATA box (Figure S8 in Additional data file 1), which are the easiest to detect from simple sequence signals.

Having tested the ability of ProStar to reproduce promoters annotated by the Havana group, we explored the ability of the method to locate promoters reported in massive Cage experiments [4], where promoters were often found in unexpected locations.

To increase the challenge, we analyzed only Cage-detected promoters falling inside transcribed regions (including exons and 3' UTR regions) of annotated Havana genes that are not regulated by a CpG island. Our results demonstrate that, although the method was not trained with this type of promoter, it performed quite well (Figure 2, Table 2, Figures S6 and S9 in Additional data file 1), in fact improving on the results obtained by other available methods (Table 2).

The results are summarized in Figure S10 in Additional data file 1 and confirm the quality of our predictions at the genome level. Some caution is needed in the interpretation of these results, since the apparently better performance of our method at the genome level, compared with that obtained using Encode regions, may simply be due to noise in the first dataset.

The final, extreme challenge for ProStar was to find promoters that are not detectable by methods based on sequence conservation among orthologs or on the maintenance of gene structure.

For this purpose, we selected a subset of 1, annotated promoters of non-coding genes that are found as false negatives by nscan, fprom and firstef.

We should clarify that this comparison gives no information on ProStar with respect to 'state of the art' methods based on conservation of gene structure and orthology, but does give some indication of the ability of other methods, including ProStar, to capture promoters located in anomalous positions. The results shown in Figure 3 demonstrate that ProStar can recover a significant fraction of these promoters with a signal-to-noise ratio superior to that of all methods based on the differential genomic content of promoters and on the use of powerful discriminant algorithms.

This suggests that ProStar is a powerful tool for promoter determination and that it could be a good alternative for the location of promoters of fast evolving genes or those appearing in anomalous positions that violate the traditional concept of gene structure.

CC measurement (see Materials and methods) for the subset of Havana TSSs 1, of non-coding protein genes in the Encode region, unrecalled by nscan, fprom and firstef.

Atomic MD simulations, based on physical potentials derived from quantum chemical calculations, yield helical stiffness parameters that reveal the complexity of the deformation pattern of DNA.

The use of these intuitive parameters at the genomic level allowed us to define promoters as regions of unique deformation properties, particularly near TSSs. Taking advantage of this differential pattern, we trained a very simple method, based on Mahalanobis metrics, that is able to locate human promoters with remarkable accuracy.

Our results are better than those of methods based on large batteries of descriptors, such as sequence signals, empirical physical descriptors, and complex statistical predictors (neural networks, hidden Markov models, and so on). The overall performance of ProStar is similar to, and in some cases even better than, that of methods based on the conservation of gene structure, which might not be so accurate in locating promoters of fast evolving genes or those located in unusual positions.

Taken together, our work reveals that even in complex organisms like human, there is a hidden physical code that contributes to the modulation of gene expression.

Neutral hydrated systems were then optimized, thermalized and pre-equilibrated using our standard protocol [43, 44]. The structures obtained at the end of this procedure were then re-equilibrated for an additional 2 ns.

The snapshots obtained at the end of this equilibration were used as starting points for 50 ns trajectories performed at constant temperature K and pressure 1 atm using periodic boundary conditions and Ewald summations [45]. Simulations were carried out using SHAKE [46] on all bonds connecting hydrogens and 2 fs time steps for the integration of Newton's equations of motion.

Note that each of these elements k_i is the force constant associated with the distortion along a given helical coordinate.

ProStar was trained using 5' ends of protein coding genes annotated by the Havana group [ 39 ] in the human Encode [ 40 ] region as a TSS set.

According to Egasp workshop rules [5], the training procedure was restricted to 13 of the 44 Encode regions (see Performance test section). TSS and strand recognition are trained and processed independently. This size is extended to 1, nucleotides for strand prediction (see Strand prediction section). Encode regions, annotated data and predictions were downloaded from the Egasp ftp directory [54].

We used version Coding genes are those with annotated start and stop codon signals; the others are taken as non-coding. In addition to the Egasp test sets, we analyzed the performance of our methodology on selected sets of TSSs that are more difficult to predict, such as TSSs in unexpected positions or TSSs belonging to genes with special characteristics.

Since CpG islands are supposed to be the strongest promoter signals, this set represents an important challenge for our method. Cage predictions [60] were downloaded from the Egasp [54] database. Those overlapping any Havana coding or non-coding gene without a CpG island in the upstream region were selected.

Standard Egasp rules were also used for these challenging sets. We trained our method for promoter recognition with a collection of nucleotide sequences that comprised intervals of nucleotides upstream and downstream of the training TSS set. As a negative set, we collected nucleotide sequences from transcribed regions of Havana coding genes.

We made sure that positive and negative sequences did not overlap. For the recognition of the strand, we trained our method with a collection of DNA sequences that comprised for every TSS in the positive training set the 1, nucleotide DNA sequence ranging from bp upstream to bp downstream of the same TSS.

The reverse complementary sequences of the positive set were taken as a negative set. For a given deformation, we sum the values associated with every dinucleotide step in the sequence and divide the total by n - 1, the number of dinucleotide steps in a sequence of n nucleotides.
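The averaging rule above can be sketched in Python as follows. This is only an illustration: the values in `TOY_TWIST` are invented placeholders rather than the MD-derived stiffness parameters, and the function name `deformation_score` is hypothetical.

```python
# Toy per-dinucleotide values for a single deformation (e.g. twist).
# ASSUMPTION: these numbers are invented for illustration; they are NOT
# the MD-derived parameters used in the original work.
TOY_TWIST = {"AC": 1.0, "CG": 2.0, "GC": 3.0}


def deformation_score(seq, table):
    """Average a per-dinucleotide value over the n - 1 steps of seq."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two nucleotides")
    # A sequence of n nucleotides has n - 1 overlapping dinucleotide steps.
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return sum(table[step] for step in steps) / len(steps)


# ACGC has the steps AC, CG and GC, so the toy score is (1 + 2 + 3) / 3.
score = deformation_score("ACGC", TOY_TWIST)
```

Repeating the calculation with one table per deformation would give the vector of physical properties that describes a sequence.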

For example, the twist deformation score for the sequence ACGC would be 0. We used Mahalanobis distance [61] to classify nucleotide DNA sequences as belonging to the promoter class k_x or the non-promoter class k_y. Every class is defined by a specific dataset of sequences (see Training set section). Computing the physical properties of every sequence of the dataset, we obtain a set of vectors for every class: X for class k_x and Y for class k_y.
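A minimal sketch of Mahalanobis-based classification is given below. It assumes the simplest possible decision rule, namely assigning a vector to the class whose training set is closer; the exact rule and the normalized score of the original method are not reproduced here. All function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mu):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [row[i] - mu[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    return cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(v, rows):
    """Mahalanobis distance from vector v to the class defined by rows."""
    mu = mean_vector(rows)
    diff = [v[i] - mu[i] for i in range(len(mu))]
    z = solve(covariance(rows, mu), diff)  # z = S^-1 (v - mu)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


def classify(v, X, Y):
    """ASSUMED rule: pick the class with the smaller distance."""
    return "promoter" if mahalanobis(v, X) < mahalanobis(v, Y) else "non-promoter"


# Toy 2-D property vectors (invented values).
X = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [2.5, 2.0]]
Y = [[6.0, 6.0], [7.0, 6.5], [6.5, 7.5], [7.5, 7.0]]
label = classify([1.8, 1.9], X, Y)
```

Solving the linear system instead of inverting the covariance matrix keeps the sketch dependency-free; with a numerical library the same distance would normally be computed with its linear-algebra routines.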

Even so, we can modulate the confidence of our decision according to a normalized score defined in equation 4. The reverse complement of the positive set sequences was used as the negative set. As observed using experimental approaches [4], TSSs have a dominant position, but many closely related alternative sites may be found around them. Consequently, every TSS may produce multiple close predictions. To clarify the annotation, our algorithm allows the user to define a window size (set as 1, nucleotides by default) within which all predictions are unified into a single annotation.
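The window-based unification can be sketched as a simple greedy grouping of predicted positions. This is an assumption about how such grouping might be done, not the published procedure: the function name is hypothetical, the window value used in the example is arbitrary, and the sketch stops at forming the groups without choosing a representative position for each.

```python
def group_predictions(positions, window):
    """Greedily group positions so that each group spans at most `window` nt.

    ASSUMPTION: a group is anchored at its first (smallest) position; the
    original method may define windows differently.
    """
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][0] <= window:
            groups[-1].append(pos)  # still inside the current window
        else:
            groups.append([pos])    # start a new window at this position
    return groups


# Five raw predictions collapse into three annotations for a 500 nt window.
groups = group_predictions([5200, 100, 900, 150, 5000], window=500)
```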

Predicted dominant position p' of the window W is computed as:.

The training and performance of ProStar followed the protocol described [5] for the Egasp workshop [54, 56]. Thus, protein coding genes annotated by the Havana group from 13 of the Encode regions were used for training, while the entire set was used in tests (tests performed using only regions that were not considered in the training give very close results; Table S2 in Additional data file 2).

We also include in our analysis the averaged score measure (ASM), which combines many 'independent' descriptors to provide an overall relative measure of the quality of a predictive method with respect to others (Table S2 in Additional data file 2; Additional data file 3). In addition to the methods checked in the Egasp experiment, we performed predictions using programs that were not considered in the Egasp experiment but are publicly available.

In these cases we used the corresponding web-based tool or downloadable script with default parameters (Table S1 in Additional data file 2).

All convolution layers are followed by a ReLU activation function (Glorot et al.). Then, the outputs of these layers are concatenated and fed into a bidirectional long short-term memory (BiLSTM; Schuster and Paliwal) layer with 32 nodes in order to capture the dependencies between the motifs learnt by the convolution layers.

Then we add two fully connected layers for classification. The first one has nodes and is followed by ReLU and dropout with a probability of 0. This is achieved through the LSTM structure, which is composed of a memory cell and three gates called the input, output, and forget gates. These gates are responsible for regulating the information in the memory cell. In addition, utilizing the LSTM module increases the network depth while the number of required parameters remains low.

Having a deeper network enables extracting more complex features, which is the main objective of our models, as the negative set contains hard samples. The Keras framework is used for constructing and training the proposed models (Chollet F.). The Adam optimizer (Kingma and Ba) is used for updating the parameters with a learning rate of 0.

The batch size is set to 32 and the number of epochs is set to Early stopping is applied based on validation loss. In this work, we use widely adopted metrics for evaluating the performance of the proposed models. These metrics are precision, recall, and the Matthews correlation coefficient (MCC), and they are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Here, TP is the number of true positives (promoter sequences correctly identified as promoters), TN is the number of true negatives (non-promoter sequences correctly rejected), FP is the number of false positives (non-promoter sequences incorrectly identified as promoters), and FN is the number of false negatives (promoter sequences incorrectly rejected). When analyzing previously published work on promoter sequence identification, we noticed that the performance of those methods greatly depends on how the negative dataset is prepared.
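As a small illustration, the three metrics can be computed from the four confusion-matrix counts as follows. The function names and the example counts are made up for this sketch, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
import math


def precision(tp, fp):
    """Fraction of predicted promoters that are real promoters."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of real promoters that were predicted as promoters."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts for a hypothetical test set of 100 promoters and
# 100 non-promoters.
tp, tn, fp, fn = 90, 80, 20, 10
print(precision(tp, fp), recall(tp, fn), mcc(tp, tn, fp, fn))
```

Because MCC uses all four counts, it drops when the false positive count grows even if recall stays high, which is why it is informative on hard negative sets.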

They performed very well on the datasets that they prepared; however, they have a high false positive rate when evaluated on a more challenging dataset that includes non-promoter sequences sharing common motifs with promoter sequences. For instance, in the case of the TATA promoter dataset, the randomly generated sequences will not have the TATA motif at the position and -25 bp, which in turn makes the classification task easier. In other words, their classifier depended on the presence of the TATA motif to identify the promoter sequence and, as a result, it was easy to achieve high performance on the datasets they prepared.

However, their models failed dramatically when dealing with negative sequences that contained the TATA motif (hard examples). The precision dropped as the false positive rate increased. Simply put, they classified these sequences as positive promoter sequences. A similar analysis is valid for the other promoter motifs. Therefore, the main purpose of our work is not only to achieve high performance on a specific dataset but also to enhance the model's ability to generalize by training on a challenging dataset.

To further illustrate this point, we train and test our model on the human and mouse TATA promoter datasets with different methods of negative set preparation. The first experiment is performed using randomly sampled negative sequences from non-coding regions of the genome i. These high results are expected, but the question is whether this model can maintain the same performance when evaluated on a dataset that has hard examples.

The answer, based on analyzing the prior models, is no. The second experiment is performed using our proposed method for preparing the dataset, as explained in section 2. This ensures that our model learns more complex features rather than learning only the presence or absence of the TATA-box.

Figure 5.

Over the past years, plenty of promoter region prediction tools have been proposed (Hutchinson; Scherf et al.). However, some of these tools are not publicly available for testing, and some of them require more information besides the raw genomic sequences.

In this study, we compare the performance of our proposed models with the current state-of-the-art work, CNNProm, which was proposed by Umarov and Solovyev, as shown in Table 2. On the other hand, our models are able to deal with these cases more successfully, and the false positive rate is lower compared with CNNProm. For further analysis, we study the effect of altering the nucleotide at each position on the output score. We focus on the region between -40 and 10 bp, as it hosts the most important part of the promoter sequence.

Blue represents a drop in the output score due to mutation, while red represents an increase in the score due to mutation. We notice that altering the nucleotides to C or G in the region between -30 and -25 bp reduces the output score significantly. This region is the TATA-box, which is a very important functional motif in the promoter sequence. Thus, our model successfully identifies the importance of this region.
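The per-position mutation analysis described above can be sketched as follows. The scan itself is generic; `toy_score` is only a stand-in for a trained model's output score and is an assumption made for this illustration, as are the function names.

```python
def mutation_scan(seq, score_fn, alphabet="ACGT"):
    """Return, for each position, the score change caused by each substitution.

    effects[i][alt] = score(mutant with alt at position i) - score(original)
    Negative values correspond to a drop in the output score.
    """
    seq = seq.upper()
    base = score_fn(seq)
    effects = []
    for i, ref in enumerate(seq):
        row = {}
        for alt in alphabet:
            if alt != ref:
                mutant = seq[:i] + alt + seq[i + 1:]
                row[alt] = score_fn(mutant) - base
        effects.append(row)
    return effects


def toy_score(seq):
    """HYPOTHETICAL scorer: 1.0 if the sequence contains TATA, else 0.0."""
    return 1.0 if "TATA" in seq else 0.0


effects = mutation_scan("GGTATAGG", toy_score)
```

With a real model, `score_fn` would wrap the model's prediction call, and the resulting table could be drawn as a heat map with one column per position and one row per substituted nucleotide.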

At the remaining positions, C and G nucleotides are preferred over A and T, especially in the case of the mouse. This can be explained by the fact that the promoter region has more C and G nucleotides than A and T (Shi and Zhou).

Figure 6.

Figure 7.

Accurate prediction of promoter sequences is essential for understanding the underlying mechanism of the gene regulation process. In this work, we were particularly interested in constructing a hard negative set that drives the models toward exploring the sequence for deep and relevant features instead of only distinguishing promoter from non-promoter sequences based on the existence of some functional motifs.

The main benefit of using DeePromoter is that it significantly reduces the number of false positive predictions while achieving high accuracy on challenging datasets. DeePromoter outperformed the previous method not only in performance but also in overcoming the issue of high false positive predictions. It is projected that this framework might be helpful in drug-related applications and academia.

MO and ZL prepared the dataset, conceived the algorithm, and carried out the experiment and analysis.

All authors discussed the results and contributed to the final manuscript. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Alipanahi, B. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.
Angermueller, C. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol.
Baker, T. Benjamin-Cummings Publishing Company.
Behjati, S. What is next generation sequencing? Childhood Educ.
Bharanikumar, R. PeerJ 6:e
Chollet, F. Astrophysics Source Code Library.
Dahl, J. A rapid micro chromatin immunoprecipitation assay (ChIP).
Davuluri, R. Computational identification of promoters and first exons in the human genome.
Down, T. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res.
Dreos, R. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res.
Glorot, X.
Hutchinson, G. The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Bioinformatics 12.
Ioshikhes, I. Large-scale human promoter mapping using CpG islands.
Juven-Gershon, T. The RNA polymerase II core promoter - the gateway to transcription. Cell Biol.
Kanhere, A. A novel method for prokaryotic promoter prediction based on DNA stability. BMC Bioinform.
Kim, J.

Confirm that BLAST results have retrieved a hit on the minus strand by noting the numbering of the hit.

Coordinates in the "seq" link display screen of Map Viewer: two types of coordinates are shown in this display.

The first set, at the top of the display, shows the location of the gene on the overall chromosome. The second set of coordinates, in the gray shaded bar, shows the location of the gene on the individual contig, or chromosome fragment. To obtain the 5' sequence of a gene, the directionality of the gene must first be determined. In other words, is the gene on the plus strand, and therefore read in a direction that moves from top to bottom of the chromosome?

Or is the gene on the minus strand, and therefore read in a direction that moves from bottom to top of the chromosome? Knowing the directionality of the gene is essential to determining the location of the promoter region. There are several ways to determine the directionality of the PER2 gene on the assembled chromosome sequence; one is that a small black arrow beside the gene name points up or down.

For PER2, the arrow points up, meaning the gene is on the minus strand, and that the sequence of the chromosome in that region should be read in the direction of the arrow (from the 5' end of the gene to the 3' end of the gene).
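Once the strand is known, the coordinates of the region 5' of the gene follow directly. The sketch below assumes 1-based inclusive chromosome coordinates with `gene_start` smaller than `gene_end`, and an in-memory chromosome string; the function names, the toy sequence and the upstream length are all made up for illustration and are not part of Map Viewer.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def upstream_region(gene_start, gene_end, strand, length):
    """Return 1-based inclusive (start, end) of the region 5' of the gene.

    For a plus-strand gene the 5' flank has lower coordinates than the gene;
    for a minus-strand gene it has higher coordinates.
    """
    if strand == "+":
        return max(1, gene_start - length), gene_start - 1
    if strand == "-":
        return gene_end + 1, gene_end + length
    raise ValueError("strand must be '+' or '-'")


def upstream_sequence(chrom_seq, gene_start, gene_end, strand, length):
    """Extract the 5' flank, written 5'->3' on the gene's own strand."""
    start, end = upstream_region(gene_start, gene_end, strand, length)
    end = min(end, len(chrom_seq))
    region = chrom_seq[start - 1:end]
    # A minus-strand gene is read in the opposite direction, so its
    # upstream sequence is the reverse complement of the plus-strand text.
    return reverse_complement(region) if strand == "-" else region


chrom = "AAAACCCCGGATTTTT"  # toy 16 nt "chromosome"
plus_flank = upstream_sequence(chrom, 5, 8, "+", 4)
minus_flank = upstream_sequence(chrom, 5, 8, "-", 4)
```

For a real gene the same logic applies to the coordinates shown in the display, with the sequence fetched from a genome file or database rather than a literal string.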


