Question: How to interpret DNAShapeR results
0
2.9 years ago by
Germany
Bioinformatist Newbie0 wrote:

Hello,

I have to compute DNA shape features (MGW, Roll, ProT, HelT) for genomic ranges of my interest. My GRanges object is something like:

> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
seqnames           ranges strand
<Rle>        <IRanges>  <Rle>
3575    chr10 [179300, 179500]      *
3576    chr10 [179350, 179550]      *
3577    chr10 [179400, 179600]      *
3578    chr10 [179450, 179650]      *
3579    chr10 [179500, 179700]      *
3580    chr10 [179550, 179750]      *
-------
seqinfo: 20 sequences from an unspecified genome; no seqlengths

I used following code:

gr <- makeGRangesFromDataFrame(sample, keep.extra.columns = F, ignore.strand = T, seqinfo = NULL, seqnames.field = "chr", start.field = "start", end.field = "stop")
getFasta(gr, BSgenome = Hsapiens, width = 200, filename = "tmp.fa")
fn <- "tmp.fa"
pred2 <- getShape(fn)

The output I get is:

> pred2[1:4,1:5]
MGW.1 MGW.2 MGW.3 MGW.4 MGW.5
1    NA    NA  5.08  4.66  3.90
2    NA    NA  2.85  3.75  4.43
3    NA    NA  5.57  5.96  4.68
4    NA    NA  5.89  5.21  4.52

The no.of rows in pred2 are equal to no.of rows of GRanges object (gr= 90) and no.of columns are 798 and I am not getting this idea why I am getting 798 columns. I think it is because I defined width=200. According to help information: width = A number indicating a fixed width of sequences. I kept width = 200 because if you notice that my genomic ranges are divided into bins of 200 neucleotides each.

My Questions:

1. Why I am getting 798 columns?
2. Why there are NA in MGW.1, MGW.2, MGW.199 and MGW.200 columns and similarly for HelT and so on for every feature?
3. Ideally I want the feature values (4 features= MGW, Roll, ProT, HelT) for each nucleotide. For that what I expect is a matrix of 90 rows and 800 columns (1 column for each feature value in 200 nucleotides, 4*200 = 800) but instead what I am getting is 90*798 with NA values in 12 columns.

If somebody has used the DNAshapeR package then kindly guide me how to interpret the results. Thank you.

dnashaper • 584 views
modified 2.9 years ago by tsupeich30 • written 2.9 years ago by Bioinformatist Newbie0
Answer: How to interpret DNAShapeR results
2
2.9 years ago by
tsupeich30
tsupeich30 wrote:

Hi ,

1. After mining the results of Monte Carlo simulations on seed structures, we generated a pentamer query table for high-throughput DNA shape predictions. The query table contains two types of DNA shape features, one is "nucleotide parameter" (e.g. MGW and ProT) which describes the shape parameter of a specific nucleotide; the other is "base pair step parameter" (e.g. Roll and HelT) which describes the shape parameter between two base pairs. If the length of a given sequence is 200, you would get 200 values for each "nucleotide parameter" and 199 for each "base pair step parameter". For four DNA shape paramters, totally you will get 798 values.
2. and 3. The tool uses a pentameric sliding window approach; therefore, there is no value for both end of the sequence. NA can be chopped off or assigned an average value of the shape features. Here are the average values of the shape parameters in the pentamer query table that you can refer: avgMGW <- 5.072; avgProT <- -6.792; avgRoll <- -0.698; avgHelT <- 34.326

Hope this information can help!

Best regards,

Tsu-Pei

@tsupeich Does it make sense to take average of all the individual  (1 neucletide resolution) features e.g. MGW.1 ... MGW.200 ? Isn't it that we loose the information content of true TF binding sites by taking average of feature value across the entire width of the given sequence (e.g. 200 in this case) ?

Hi,

Tsu-Pei was referring to a replacement for NA values that are naturally occurring at the boundaries of the genomic regions you specified, he was not suggesting to average nucleotide-resolution and base-pair-step-resolution parameters.

@tsupeich We could consider adding an option to automatically fill in values at boundaries.

Federico

@federico.comoglio Sure, we can do that. The NA values might not only happen at the boundaries but also at any position in the sequence (according to the input). Maybe the function should be able to handle all NA values.

yes, agreed.

Hi,

I feel like averaging 200 bps would loose some of the information content after all the MGW can go up and down along the region, which cannot be explained only by one value. The features generated byDNAshapeR are more likely to describe the local structure of DNA (instead of the global one). The way we usually analyze the MGW for TF binding sites is based on one base pair resolution, for example, the analyses in the Fig. 2 and 4 of [1] and the Fig. 5 of [2]. Other than that, we did average the MGW but only for the central five base pairs for Fis protein binding site in the Fig.1 of [3]. For your case, I feel like if you have to average the MGW for 200 base pairs, I would suggest you to do more statistics in addition to averaging the values, for example, also checking the distribution.

[1] N. Abe, I. Dror, L. Yang, M. Slattery, T. Zhou, H.J. Bussemaker, R. Rohs*, and R.S. Mann*: Deconvolving the recognition of DNA sequence from shape. Cell 161, 307-318 (2015)

[2] T.P. Chiu,&, L. Yang,&, T. Zhou, B.J. Main, S.C. Parker, S.V. Nuzhdin, T.D. Tullius, and R. Rohs*: GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res. 43, D103-109 (2015)

[3] T. Zhou, L. Yang, Y. Lu, I. Dror, A.C. Dantas Machado, T. Ghane, R. Di Felice, and R. Rohs*: DNAshape: a method for the high-throughput prediction of DNA structural features on a genome-wide scale. Nucleic Acids Res. 41, W56-62 (2013)

Best wishes,

Tsu-Pei

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.