Question: Normalization for small non-coding RNA
Konstantinos Yeles (University of Salerno, Salerno, Italy) wrote, 4 months ago:

Dear Bioconductor,

I'm currently working on piRNA expression in different cell lines, and I would like to ask how I should proceed with data transformation and normalization.
The main issue is that, in order to enrich for piRNAs, we performed periodate (PO) treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies"

The treated libraries have ~10 million reads and the untreated libraries ~45 million reads.
To find piRNAs in our samples, we used SPORTS1.0.

Its output is split into two groups: reads that match the genome and match the databases, and reads that do not match the genome but do match the databases.
For every small RNA database (rRNA, tRNA, piRNA, lncRNA, ...) we get a file with the reads matched to that database, so we end up with many result files in a tabular format:

t00000406 617 + piR-hsa-3546 3 CTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIII 1
t00000517 445 + piR-hsa-3454 2 CACGTGTTAGGACCCGAAAGA IIIIIIIIIIIIIIIIIIIII 0
t00000519 439 + piR-hsa-3546 0 CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII

 


The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome).

So the library is separated for every smallRNA database.

How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Thank you
Konstantinos
 

Tags: rnaseq, smallrna, deseq2, tmm
Answer: Normalization for small non-coding RNA
Michael Love (United States) wrote, 4 months ago:

I'm not sure I understand the setup exactly, but I will give my best answer: it sounds like you want a robust normalization although there may be many true differences? Do you have any kind of artificial or endogenous controls - some features where you expect no changes? Otherwise, in silico normalization is very difficult, when there may be many true, large differences across the samples. The assumption that most methods make to normalize is that the middle of the distribution of log ratios captures the technical artifact of sequencing depth.
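To make the assumption in the last sentence concrete, here is a minimal sketch of a median-of-ratios style size factor estimate, in the spirit of what tools such as DESeq2 do, but not their actual code. The function name `size_factors`, the `control_features` argument and all counts below are made up for illustration. Sample B is assumed to be sequenced twice as deep as sample A, while most features also carry a true 10-fold change.

```python
import math
from statistics import median

def size_factors(counts, control_features=None):
    """Median-of-ratios style size factors, one per sample.

    counts: dict mapping feature name -> list of raw counts (one per sample).
    control_features: optional names of features assumed not to change;
    if given, only these are used to estimate the factors.
    """
    features = list(counts) if control_features is None else list(control_features)
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for feature in features:
        row = counts[feature]
        if min(row) <= 0:
            continue  # a zero count has no finite log, so skip the feature
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The middle of the log-ratio distribution is taken as the depth effect.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical counts for samples [A, B]: controls differ 2-fold (depth only),
# the three piRNA-like features differ 20-fold (depth plus a true change).
toy = {
    "ctrl_1": [100, 200],
    "ctrl_2": [50, 100],
    "pi_1": [10, 200],
    "pi_2": [20, 400],
    "pi_3": [5, 100],
}

sf_all = size_factors(toy)
sf_ctrl = size_factors(toy, control_features=["ctrl_1", "ctrl_2"])
print("B/A from all features:", sf_all[1] / sf_all[0])
print("B/A from controls only:", sf_ctrl[1] / sf_ctrl[0])
```

In this toy case the all-feature estimate gives a B/A ratio of about 20, so it absorbs the true change into the "depth" correction, while the control-only estimate recovers the 2-fold depth difference. That is why features expected not to change matter when many true differences are present.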


Dear Michael Love,
Unfortunately, we don't have any artificial (RNA spike-ins?) or endogenous controls. Should I provide more information about the setup? 
Correction: sorry for the wrong information above. We do have 2 different kinds of spike-ins, one added before treatment and one added after.
 

written 4 months ago by Konstantinos Yeles

I don't know if I'll be able to provide much more specialized feedback for working with small RNA, just because I'm pretty busy and have a lot of software support requests this time of year. You might get more useful small RNA normalization feedback from a general forum such as Biostars.

written 4 months ago by Michael Love

I will try Biostars!
Thank you for your time!

written 4 months ago by Konstantinos Yeles
Answer: Normalization for small non-coding RNA
António Miguel de Jesus Domingues (Germany) wrote, 4 months ago:

Hi Konstantinos,

"The majority of reads multimap to different piRNAs, I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome)."

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multimapping reads, the best options are to randomly assign the read to one of the matching piRNAs or to weight it: if a read matches two piRNAs, each gets 0.5 counts.
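The weighting option can be sketched as follows. This is a minimal illustration, not the output format of any particular tool: the function name `fractional_counts`, the `(read_id, pirna_id, copies)` tuple layout and the example reads are all assumptions.

```python
from collections import defaultdict

def fractional_counts(alignments):
    """Split each read's copies evenly across the piRNAs it matches.

    alignments: iterable of (read_id, pirna_id, copies) tuples, one per hit,
    where `copies` is how many times that (collapsed) read was sequenced.
    """
    targets = defaultdict(set)
    copies = {}
    for read_id, pirna_id, n in alignments:
        targets[read_id].add(pirna_id)
        copies[read_id] = n
    counts = defaultdict(float)
    for read_id, pirnas in targets.items():
        share = copies[read_id] / len(pirnas)  # e.g. 2 targets -> 0.5 each per copy
        for pirna_id in pirnas:
            counts[pirna_id] += share
    return dict(counts)

# Hypothetical hits: read r1 (10 copies) matches two piRNAs, r2 (4 copies) one.
hits = [
    ("r1", "piR-A", 10),
    ("r1", "piR-B", 10),
    ("r2", "piR-A", 4),
]
weighted = fractional_counts(hits)
print(weighted)
print("total:", sum(weighted.values()))
```

The total of the weighted counts equals the number of sequenced reads (14 here), so each read is counted exactly once, which is the property that summing every hit at full weight would break.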

"How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?"

Since you are working with different cell lines, I assume that you have biological conditions you want to compare, and replicates. What we have done in the past in our group is to analyse the treated and the non-treated samples separately. The treatment biases the library composition quite a lot, so it is tricky to compare them. Furthermore, we never really had a need to compare treated vs non-treated directly: we check whether the treatment worked by quantifying the small RNAs of each class in each library, but that is it. In this type of quantification we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.
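As a rough sketch of the per-class quantification described above, the snippet below scales class totals to reads per million non-structural reads. The function name, the class labels other than rRNA/snoRNA/tRNA, and every number are invented for illustration.

```python
# Classes treated as "structural" here, following the description above.
STRUCTURAL = {"rRNA", "snoRNA", "tRNA"}

def per_million_nonstructural(class_counts):
    """Scale per-class read totals to reads per million non-structural reads.

    class_counts: dict mapping small RNA class -> number of reads in one library.
    """
    denom = sum(n for cls, n in class_counts.items() if cls not in STRUCTURAL)
    if denom == 0:
        raise ValueError("library has no non-structural reads")
    return {cls: 1e6 * n / denom for cls, n in class_counts.items()}

# Hypothetical class totals for one treated and one untreated library.
treated = {"rRNA": 2_000_000, "tRNA": 1_000_000, "piRNA": 6_000_000, "miRNA": 1_000_000}
untreated = {"rRNA": 15_000_000, "tRNA": 10_000_000, "piRNA": 5_000_000, "miRNA": 15_000_000}

treated_rpm = per_million_nonstructural(treated)
untreated_rpm = per_million_nonstructural(untreated)
print("piRNA, treated:", round(treated_rpm["piRNA"]))
print("piRNA, untreated:", round(untreated_rpm["piRNA"]))
```

With these made-up numbers, the piRNA share of non-structural reads rises from 250,000 to roughly 857,000 per million, which is the kind of check that tells you whether the treatment worked. Swapping the denominator for all mapped reads gives the alternative mentioned above.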

For a DESeq2-type analysis, we just do the usual analysis exemplified in the vignette: take a matrix of (piRNA or other small RNA) counts and compare mutant strains vs WT (or whatever biological conditions we are studying) within each treatment.

"So the library is separated for every smallRNA database"

I would suggest putting all the read counts in a single count matrix; otherwise, as far as I know, the DESeq2 modelling might not work as expected.
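A minimal sketch of what "a single count matrix" could look like when the counts start out as one table per database and per sample. The nested-dict input layout, the function name `build_count_matrix`, the sample names, most feature names and all counts are assumptions made for illustration.

```python
def build_count_matrix(per_sample_tables):
    """Merge per-database count tables into one features x samples matrix.

    per_sample_tables: dict sample -> dict database -> dict feature -> count.
    Features missing from a sample get a count of 0.
    """
    samples = sorted(per_sample_tables)
    merged = {}
    for sample in samples:
        merged[sample] = {}
        for table in per_sample_tables[sample].values():
            for feature, count in table.items():
                merged[sample][feature] = merged[sample].get(feature, 0) + count
    features = sorted({f for counts in merged.values() for f in counts})
    matrix = [[merged[s].get(f, 0) for s in samples] for f in features]
    return features, samples, matrix

# Hypothetical per-database tables for two libraries.
tables = {
    "treated_1": {
        "piRNA": {"piR-hsa-3546": 120, "piR-hsa-3454": 40},
        "tRNA": {"tRNA-Gly": 12},
    },
    "untreated_1": {
        "piRNA": {"piR-hsa-3546": 90},
        "tRNA": {"tRNA-Gly": 3000},
        "miRNA": {"miR-21": 150},
    },
}

features, samples, matrix = build_count_matrix(tables)
print(samples)
for name, row in zip(features, matrix):
    print(name, row)
```

The point of the single matrix is that one set of size factors and dispersion estimates is computed per library across all small RNA classes, rather than separately for each database file.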

These are mere pointers. For a more informed answer you would need to define better the biological question you are trying to answer.

written 4 months ago by António Miguel de Jesus Domingues

Dear António Miguel de Jesus Domingues,

  • Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multi-mapping reads the best options are to randomly select a piRNA or weight it - read matches two piRNAs, each gets 0.5 counts.

No, you didn't misunderstand. I've also posted an informative example on Biostars. I don't want to choose a piRNA randomly because that may be misleading. Using weights is a possibility, but what about a read that matches 5 piRNAs such as these:
piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA
piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA
piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG
piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA
piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Each one would get 0.2 counts. Is that the correct way to "counter" multi-mapping?
 

  • We check to see if the treatment worked by quantifying the smallRNAs for each class in each library but that is it. In this type of quantification, we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.

Could using spike-ins before and after treatment help to normalize between treated and untreated samples? What about the differences in library size (treated with ~10 million reads, untreated with ~45 million reads)?
 

  • I would suggest putting all the read counts in a single count matrix otherwise, and afaik, the DESeq modelling might not work as expected.

So I should perform the DE analysis at the level of raw read counts and assign them to each small RNA afterwards?

Thank you for your time and instructive answer!

written 4 months ago by Konstantinos Yeles

See my answer here: https://www.biostars.org/p/351612/#352575

written 4 months ago by António Miguel de Jesus Domingues