Normalization for small non-coding RNA
@konstantinos-yeles-8961
Last seen 3 months ago
Italy

Dear Bioconductor,

I'm currently working on piRNA expression in different cell lines, and I would like to ask how I should proceed with data transformation and normalization.
The main issue is that, in order to enrich for piRNAs, we performed periodate (PO) treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts".

Our treated libraries have ~10 million reads and the untreated ones ~45 million reads.
To find piRNAs in our samples, we used SPORTS1.0. Its output distinguishes reads that match both the genome and a database from reads that do not match the genome but do match a database.
For every small RNA database (rRNA, tRNA, piRNA, lncRNA, ...) we get a file with the reads matched to that database, so we end up with many result files in tabular format:

t00000406 617 + piR-hsa-3546 3 CTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIII 1
t00000517 445 + piR-hsa-3454 2 CACGTGTTAGGACCCGAAAGA IIIIIIIIIIIIIIIIIIIII 0
t00000519 439 + piR-hsa-3546 0 CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII

 


The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (counting both reads matched and reads unmatched to the genome).

So the library is separated for every smallRNA database.

How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Thank you
Konstantinos
 

TMM deseq2 smallrna rnaseq
@mikelove
Last seen 3 hours ago
United States

I'm not sure I understand the setup exactly, but I will give my best answer: it sounds like you want a robust normalization although there may be many true differences? Do you have any kind of artificial or endogenous controls - some features where you expect no changes? Otherwise, in silico normalization is very difficult, when there may be many true, large differences across the samples. The assumption that most methods make to normalize is that the middle of the distribution of log ratios captures the technical artifact of sequencing depth.
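
To make the assumption in the last sentence concrete, here is a minimal Python sketch of a median-of-ratios size factor estimate: for each sample it takes the median log ratio of the counts to a per-feature geometric mean. It is an illustration only, not DESeq2's code. The function name `size_factors`, the `controls` argument and the toy counts are all made up for this example. Passing `controls` restricts the estimate to features that are expected not to change (for instance spike-ins), which is one way to use controls when many true differences are expected.

```python
import math
from statistics import median


def size_factors(counts, controls=None):
    """Median-of-ratios size factors (simplified sketch).

    counts:   dict mapping feature name -> list of raw counts, one per sample.
    controls: optional list of feature names assumed not to change
              (hypothetical argument; defaults to using every feature).
    """
    features = list(controls) if controls is not None else list(counts)
    n_samples = len(next(iter(counts.values())))

    # Only features with a non-zero count in every sample have a usable log.
    usable = [f for f in features if all(c > 0 for c in counts[f])]
    if not usable:
        raise ValueError("no feature has non-zero counts in every sample")

    # Log of the geometric mean of each feature across samples.
    log_geo_mean = {
        f: sum(math.log(c) for c in counts[f]) / n_samples for f in usable
    }

    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[f][j]) - log_geo_mean[f] for f in usable]
        # The median is the "middle of the distribution of log ratios".
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (made up): sample 2 is sequenced twice as deeply as sample 1,
# and feature "d" carries a large true difference.
toy = {"a": [10, 20], "b": [100, 200], "c": [50, 100], "d": [5, 400]}
factors = size_factors(toy)
control_factors = size_factors(toy, controls=["a", "b"])
```

In the toy data the median ignores the outlying feature `d`, so the two size factors differ by a factor of 2. When most features change in one direction, as may happen after periodate treatment, the median no longer reflects depth alone, which is why the control-based variant matters.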


Dear Michael Love,
Unfortunately, we don't have any artificial (RNA spike-ins?) or endogenous controls. Should I provide more information about the setup? 
Correction: sorry for the wrong information above. We do have 2 different kinds of spike-ins, one added before treatment and one added after.
 


I don't know if I'll be able to provide much more specialized feedback for working with small RNA, just because I'm pretty busy and have a lot of software support requests this time of year. You might get more useful small RNA normalization feedback from a general forum such as Biostars.


I will try Biostars!
Thank you for your time!

@antonio-miguel-de-jesus-domingues-5182
Last seen 8 weeks ago
Germany

Hi Konstantinos,

  • The majority of reads multimap to different piRNAs, I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome).

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multimapping reads, the best options are to assign each read to one randomly selected piRNA or to weight it: if a read matches two piRNAs, each gets 0.5 counts.
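
As a rough illustration of the weighting option, the sketch below splits each read evenly across the piRNAs it matches, so that every read still contributes exactly one count in total. The function name `fractional_counts`, the input layout and the read and piRNA identifiers are hypothetical. In particular, the `copies` value (how many times a collapsed sequence was observed) is an assumption about the input, not something taken from the SPORTS1.0 output format.

```python
from collections import defaultdict


def fractional_counts(read_hits):
    """Distribute each read evenly over the piRNAs it matches.

    read_hits: dict mapping read id -> (copies, list of matched piRNA ids),
               where `copies` is the number of identical reads collapsed
               into that id (assumed input layout).
    """
    totals = defaultdict(float)
    for read_id, (copies, targets) in read_hits.items():
        unique_targets = set(targets)  # guard against duplicated hits
        if not unique_targets:
            continue
        share = copies / len(unique_targets)
        for target in unique_targets:
            totals[target] += share
    return dict(totals)


# Made-up example: r1 (10 copies) matches two piRNAs, r2 (3 copies) matches one.
hits = {
    "r1": (10, ["piR-A", "piR-B"]),
    "r2": (3, ["piR-A"]),
}
weighted = fractional_counts(hits)
```

The weighted counts sum to the 13 input reads, which is the property that simple summing over all matches loses. Note that fractional counts are not integers, so they may need rounding before being passed to tools that expect raw counts.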

  • How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Since you are working with different cell lines, I assume that you have biological conditions you want to compare, and replicates. What we have done in the past in our group is to analyse the treated and the non-treated samples separately. The treatment biases the library composition quite a lot, so it is tricky to compare them. Furthermore, we never really had a need to compare treated vs non-treated directly. We check whether the treatment worked by quantifying the small RNAs of each class in each library, but that is it. In this type of quantification we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.
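
A minimal sketch of the per-class quantification described above, assuming you already have the number of reads per small RNA class for one library. The function name, the class labels and the numbers are made up for illustration. The only part taken from the text is the definition of non-structural reads as those not mapping to rRNA, snoRNA or tRNA.

```python
# Classes treated as structural, following the definition in the text.
STRUCTURAL = {"rRNA", "snoRNA", "tRNA"}


def per_million_non_structural(class_counts):
    """Scale class counts to reads per million non-structural reads.

    class_counts: dict mapping small RNA class -> read count in one library.
    """
    denominator = sum(
        n for cls, n in class_counts.items() if cls not in STRUCTURAL
    )
    if denominator == 0:
        raise ValueError("library has no non-structural reads")
    return {cls: n * 1e6 / denominator for cls, n in class_counts.items()}


# Made-up library: 200 of the 1000 reads are non-structural.
library = {"rRNA": 500, "tRNA": 300, "piRNA": 150, "miRNA": 50}
scaled = per_million_non_structural(library)
```

Comparing the scaled piRNA value between a treated and an untreated library of the same sample is one way to check whether the enrichment worked, without feeding both libraries into the same differential expression model.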

For a DESeq2-type analysis, we just do the usual analysis exemplified in the vignette: take a matrix of (piRNA or other small RNA) counts, and compare mutant strains vs WT (or whatever biological conditions we are studying) within each treatment.

  • So the library is separated for every smallRNA database

I would suggest putting all the read counts in a single count matrix; otherwise, as far as I know, the DESeq modelling might not work as expected.
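
One way to do this, sketched in Python under the assumption that the per-database files have already been parsed into nested dictionaries. The function name `build_count_matrix`, the input layout, the `tRNA-Gly` feature, the sample names and the counts are hypothetical. Row names are prefixed with the database so that features from different databases cannot collide.

```python
def build_count_matrix(per_database, samples):
    """Merge per-database counts into one list of (row name, counts) rows.

    per_database: dict database -> dict feature -> dict sample -> count.
    samples:      ordered list of sample names (the matrix columns).
    """
    rows = []
    for database in sorted(per_database):
        for feature in sorted(per_database[database]):
            by_sample = per_database[database][feature]
            # A feature absent from a sample gets a count of zero.
            counts = [by_sample.get(sample, 0) for sample in samples]
            rows.append((f"{database}|{feature}", counts))
    return rows


# Made-up counts for two samples and two databases.
parsed = {
    "piRNA": {"piR-hsa-3546": {"s1": 3, "s2": 7}},
    "tRNA": {"tRNA-Gly": {"s1": 40}},
}
matrix = build_count_matrix(parsed, ["s1", "s2"])
```

Keeping every class in one matrix means size factors and dispersions are estimated from the whole library rather than from one database at a time. The database prefix in the row name lets you split the results by class again afterwards.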

These are mere pointers. For a more informed answer, you would need to better define the biological question you are trying to answer.


Dear António Miguel de Jesus Domingues,

  • Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multi-mapping reads the best options are to randomly select a piRNA or weight it - read matches two piRNAs, each gets 0.5 counts.

No, you didn't misunderstand. I've also posted an informative example on Biostars. I don't want to choose a piRNA at random because that may be misleading. Using weights is a possibility, but what about a read that matches 5 piRNAs, such as these:
piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA
piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA
piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG
piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA
piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Each one will get 0.2 counts. Is that the correct way to "counter" multi-mapping?
 

  • We check to see if the treatment worked by quantifying the smallRNAs for each class in each library but that is it. In this type of quantification, we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.

Could using the spike-ins added before and after treatment help to normalize between treated and untreated samples? And what about the difference in the number of reads between libraries (treated with ~10 million reads, untreated with ~45 million reads)?
 

  • I would suggest putting all the read counts in a single count matrix otherwise, and afaik, the DESeq modelling might not work as expected.

So should I perform the DE analysis at the level of raw read counts and assign them to each small RNA afterwards?

Thank you for your time and instructive answer!
