Question: Normalization for small non-coding RNA
16 days ago by
University of Salerno, Salerno, Italy
yeles.konstantinos10 wrote:

Dear Bioconductor,

I'm currently working on piRNA expression in different cell lines, and I would like to ask how I should proceed with data transformation and normalization.
The main issue is that, in order to enrich for piRNAs, we performed periodate (PO) treatment. To quote: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies"

We have treated libraries with ~10 million reads and untreated libraries with ~45 million reads.
To find piRNAs in our samples, we used SPORTS1.0.

Its output covers two groups of reads: reads matched to the genome and matched to the databases, and reads unmatched to the genome but matched to the databases.
For each small RNA database (rRNA, tRNA, piRNA, lncRNA, ...) we get a file with the reads matched to that database, so we end up with many result files in a tabular format:

t00000406 617 + piR-hsa-3546 3 CTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIII 1
t00000517 445 + piR-hsa-3454 2 CACGTGTTAGGACCCGAAAGA IIIIIIIIIIIIIIIIIIIII 0
t00000519 439 + piR-hsa-3546 0 CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII

 


The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (counting both the reads unmatched and the reads matched to the genome).

So the library is separated for every smallRNA database.

How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Thank you
Konstantinos
 

modified 15 days ago by António Miguel de Jesus Domingues430 • written 16 days ago by yeles.konstantinos10
16 days ago by
Michael Love20k
United States
Michael Love20k wrote:

I'm not sure I understand the setup exactly, but I will give my best answer: it sounds like you want a robust normalization although there may be many true differences? Do you have any kind of artificial or endogenous controls - some features where you expect no changes? Otherwise, in silico normalization is very difficult, when there may be many true, large differences across the samples. The assumption that most methods make to normalize is that the middle of the distribution of log ratios captures the technical artifact of sequencing depth.
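To make that assumption concrete, here is a minimal sketch of a median-of-ratios size factor calculation, the general idea behind DESeq2's default normalization. It is a pure-Python illustration, not DESeq2 code; the function name and the toy counts are made up for this example. Each sample is compared with a per-feature geometric mean, and the median log ratio is taken as the depth effect, which only works if most features do not truly change.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: dict mapping sample name -> list of counts, with the same
    feature order in every sample (hypothetical input layout).
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])

    # Log geometric mean per feature, using only features that are
    # non-zero in every sample (log of zero is undefined).
    log_geo_means = {}
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[i] = sum(math.log(v) for v in values) / len(values)

    if not log_geo_means:
        raise ValueError("no feature has non-zero counts in all samples")

    # The median log ratio to the reference is taken as the depth effect.
    size_factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][i]) - g for i, g in log_geo_means.items()]
        size_factors[s] = math.exp(median(log_ratios))
    return size_factors


# Toy data: sample "b" is sequenced twice as deeply as "a" for three
# features, while the fourth feature has a large true difference.
toy = {"a": [10, 20, 30, 400], "b": [20, 40, 60, 100]}
factors = median_of_ratios_size_factors(toy)
print(factors)
```

In the toy data the fourth feature does not disturb the estimate because the median ignores it. If most features changed, as may be the case after an enrichment treatment, the median would no longer reflect sequencing depth alone.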

written 16 days ago by Michael Love20k

Dear Michael Love,
Unfortunately, we don't have any artificial (RNA spike-ins?) or endogenous controls. Should I provide more information about the setup?
Correction: sorry for the incorrect information above. We do have 2 different kinds of spike-ins, one added before treatment and one added after.
 

modified 15 days ago • written 16 days ago by yeles.konstantinos10

I don't know if I'll be able to provide much more specialized feedback for working with small RNA, just because I'm pretty busy and have a lot of software support requests this time of year. You might get more useful small RNA normalization feedback from a general forum such as Biostars.

written 16 days ago by Michael Love20k

I will try Biostars!
Thank you for your time!

written 16 days ago by yeles.konstantinos10
15 days ago by
Germany

Hi Konstantinos,

The majority of reads multimap to different piRNAs, I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome).

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multimapping reads, the best options are to randomly select one piRNA or to weight the read: if a read matches two piRNAs, each gets 0.5 counts.
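A minimal sketch of the weighting option, assuming the alignments have already been parsed into (read ID, copy number, piRNA ID) tuples. That tuple layout, the function name and the IDs in the toy data are assumptions for illustration, not the SPORTS1.0 file format.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Split each read's copies equally among the piRNAs it matches.

    alignments: iterable of (read_id, copies, pirna_id) tuples
    (hypothetical layout; one tuple per reported alignment).
    """
    targets = defaultdict(set)   # read_id -> set of matched piRNAs
    copies_by_read = {}          # read_id -> number of copies of that read
    for read_id, copies, pirna_id in alignments:
        targets[read_id].add(pirna_id)
        copies_by_read[read_id] = copies

    counts = defaultdict(float)
    for read_id, pirnas in targets.items():
        # A read matching k piRNAs contributes 1/k of its copies to each.
        share = copies_by_read[read_id] / len(pirnas)
        for pirna_id in pirnas:
            counts[pirna_id] += share
    return dict(counts)


# Toy data: read r1 (10 copies) matches two piRNAs, r2 (4 copies) matches one.
toy_alignments = [
    ("r1", 10, "piR-A"),
    ("r1", 10, "piR-B"),
    ("r2", 4, "piR-A"),
]
print(fractional_counts(toy_alignments))
```

With this scheme the per-piRNA counts add up to the number of reads in the library, which is not the case when every multimapping read is added in full to each piRNA it matches.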

How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Since you are working with different cell lines, I assume that you have biological conditions you want to compare, and replicates. What we have done in the past in our group is to analyse the treated and the non-treated samples separately. The treatment biases the library composition quite a lot, so it is tricky to compare them. Furthermore, we never really had a need to compare treated vs non-treated directly. We check whether the treatment worked by quantifying the small RNAs of each class in each library, but that is it. In this type of quantification we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.
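As a rough sketch of the per-class quantification, the snippet below scales class counts to reads per million non-structural reads. The input layout, the function name and the toy numbers are assumptions for illustration; here the denominator is simply the sum of all classes other than rRNA, snoRNA and tRNA.

```python
STRUCTURAL_CLASSES = {"rRNA", "snoRNA", "tRNA"}


def per_million_non_structural(class_counts):
    """Scale per-class read counts to reads per million non-structural reads.

    class_counts: dict mapping small RNA class -> read count for one library
    (hypothetical layout).
    """
    non_structural = sum(
        n for cls, n in class_counts.items() if cls not in STRUCTURAL_CLASSES
    )
    if non_structural == 0:
        raise ValueError("library has no non-structural reads")
    return {cls: n * 1e6 / non_structural for cls, n in class_counts.items()}


# Toy library: 200,000 non-structural reads (piRNA + miRNA).
toy_library = {"rRNA": 500_000, "tRNA": 300_000, "piRNA": 150_000, "miRNA": 50_000}
print(per_million_non_structural(toy_library))
```

Comparing the piRNA value between a treated and an untreated library of the same sample is one way to see whether the enrichment worked.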

For a DESeq2-type analysis, we just do the usual analysis exemplified in the vignette: take a matrix of (piRNA or other small RNA) counts, and compare mutant strains vs WT (or whatever biological conditions we are studying) within each treatment.

So the library is separated for every smallRNA database

I would suggest putting all the read counts in a single count matrix; otherwise, afaik, the DESeq modelling might not work as expected.
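One possible way to build such a matrix from per-database tables is sketched below. The nested-dict input, the function name and the sample and feature names are hypothetical; the result is a features-by-samples table that could then be written out and loaded for the DESeq2 analysis.

```python
def merge_count_tables(tables):
    """Combine per-database counts into one features x samples matrix.

    tables: {database: {sample: {feature: count}}} (hypothetical layout).
    Returns (row_names, samples, matrix) with missing entries set to 0.
    """
    samples = sorted({s for by_sample in tables.values() for s in by_sample})
    rows = {}
    for database, by_sample in tables.items():
        for sample, features in by_sample.items():
            for feature, count in features.items():
                # Prefix with the database name so that identical IDs
                # from two databases stay distinct.
                rows.setdefault(f"{database}:{feature}", {})[sample] = count
    row_names = sorted(rows)
    matrix = [[rows[name].get(s, 0) for s in samples] for name in row_names]
    return row_names, samples, matrix


toy_tables = {
    "piRNA": {
        "treated_1": {"piR-A": 12},
        "untreated_1": {"piR-A": 30, "piR-B": 4},
    },
    "tRNA": {
        "treated_1": {"tRNA-X": 2},
        "untreated_1": {"tRNA-X": 90},
    },
}
row_names, samples, matrix = merge_count_tables(toy_tables)
for name, row in zip(row_names, matrix):
    print(name, row)
```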

These are mere pointers. For a more informed answer you would need to define better the biological question you are trying to answer.

modified 15 days ago • written 15 days ago by António Miguel de Jesus Domingues430

Dear António Miguel de Jesus Domingues,

  • Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multi-mapping reads the best options are to randomly select a piRNA or weight it - read matches two piRNAs, each gets 0.5 counts.

No, you didn't misunderstand. I've also posted an informative example on Biostars. I don't want to choose a piRNA at random because that may be misleading. Using weights is a possibility, but what about a read that matches 5 piRNAs, such as these:
piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA
piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA
piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG
piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA
piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Each one will get 0.2 counts. Is this the correct way to "counter" multi-mapping?
 

  • We check to see if the treatment worked by quantifying the smallRNAs for each class in each library but that is it. In this type of quantification, we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.

Could using spike-ins before and after treatment help to normalize between treated and untreated samples? And what about the difference in the number of reads per library (treated with ~10 million reads, untreated with ~45 million reads)?
 

  • I would suggest putting all the read counts in a single count matrix otherwise, and afaik, the DESeq modelling might not work as expected.

So I should perform the DE analysis at the level of raw read counts and assign them to each small RNA afterwards?

Thank you for your time and instructive answer!

written 15 days ago by yeles.konstantinos10

See my answer here: https://www.biostars.org/p/351612/#352575

written 9 days ago by António Miguel de Jesus Domingues430