Question

Normalization for small non-coding RNA

0

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Italy

Dear Bioconductor,

Currently, I'm working on piRNA expression in different cell lines and I would like to ask how I should proceed with data transformation and normalization.
The main issue is that, in order to enrich for piRNAs, we performed periodate (PO) treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies"

Our treated libraries have ~10 million reads and the untreated ones ~45 million reads.
To find piRNAs in our samples, we used SPORTS1.0, which outputs two groups of reads: those matched to the genome and matched to databases, and those unmatched to the genome but matched to databases.
For every small RNA database (rRNA, tRNA, piRNA, lncRNA ....) we get a file with the reads matched to that database. So we have many result files, in a tabular format:

t00000406	617	+	piR-hsa-3546	3	CTGTTAACCGAAAGGTTGGTGGT	IIIIIIIIIIIIIIIIIIIIIII	1
t00000517	445	+	piR-hsa-3454	2	CACGTGTTAGGACCCGAAAGA	IIIIIIIIIIIIIIIIIIIII	0
t00000519	439	+	piR-hsa-3546	0	CGGCTGTTAACCGAAAGGTTGGTGGT	IIIIIIIIIIIIIIIIIIIIIIIIII

The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome).

So the library is separated for every smallRNA database.

How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Thank you
Konstantinos

TMM deseq2 smallrna rnaseq • 1.9k views

updated 5.4 years ago by António Miguel de Jesus Domingues ▴ 490 • written 5.4 years ago by Konstantinos Yeles ▴ 80

score 2 · Answer 1 · 2018-11-26

2

Michael Love 41k

@mikelove

United States

I'm not sure I understand the setup exactly, but I will give my best answer: it sounds like you want a robust normalization although there may be many true differences? Do you have any kind of artificial or endogenous controls - some features where you expect no changes? Otherwise, in silico normalization is very difficult, when there may be many true, large differences across the samples. The assumption that most methods make to normalize is that the middle of the distribution of log ratios captures the technical artifact of sequencing depth.
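To make that assumption concrete, here is a minimal Python sketch in the spirit of median-of-ratios size factors. It is only an illustration, not DESeq2's actual code: the function name, the input layout, and the optional `control_features` argument (for restricting the estimate to features where no change is expected, such as spike-ins) are all assumptions made for this example.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts, control_features=None):
    """Estimate one size factor per sample.

    counts: dict mapping feature name -> list of raw counts (one per sample).
    control_features: optional list of feature names to use instead of all
        features (hypothetical hook for spike-ins or endogenous controls).
    """
    n_samples = len(next(iter(counts.values())))
    features = list(counts) if control_features is None else control_features
    log_ratios = [[] for _ in range(n_samples)]
    for name in features:
        row = counts[name]
        if min(row) <= 0:
            continue  # the geometric mean is undefined when any count is zero
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no feature has non-zero counts in every sample")
    # The median log ratio is taken as the technical (depth) effect.
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 is sequenced twice as deep; f4 is a true, large change.
toy = {"f1": [10, 20], "f2": [100, 200], "f3": [50, 100], "f4": [30, 600]}
size_factors = median_of_ratios_size_factors(toy)
print(size_factors[1] / size_factors[0])
```

In the toy data the median ignores the one feature that truly changes, so the depth difference is recovered. If most features change in the same direction, the median itself moves, which is why controls are needed in that situation.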

written 5.4 years ago by Michael Love 41k

0

Dear Michael Love,
~~Unfortunately, we don't have any artificial (RNA spike-ins?) or endogenous controls.~~ Should I provide more information about the setup?
Sorry for the false information, we have 2 different kinds of spike-ins, one added before treatment and one added after.

written 5.4 years ago by Konstantinos Yeles ▴ 80

1

I don't know if I'll be able to provide much more specialized feedback for working with small RNA, just because I'm pretty busy and have a lot of software support requests this time of year. You might get more useful small RNA normalization feedback from a general forum such as Biostars.

written 5.4 years ago by Michael Love 41k

0

I will try Biostars!
Thank you for your time!

written 5.4 years ago by Konstantinos Yeles ▴ 80

score 2 · Answer 2 · 2018-11-27

Hi Konstantinos,

> The majority of reads multimap to different piRNAs, I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome).

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multimapping reads, the best options are either to assign the read to one randomly selected piRNA, or to weight it: if a read matches two piRNAs, each gets 0.5 counts.
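A minimal Python sketch of the weighting option, just to show the bookkeeping. The input layout (pairs of read ID and piRNA ID), the optional `read_abundance` lookup, and the IDs in the example are assumptions for illustration, not the SPORTS1.0 output format.

```python
from collections import defaultdict


def fractional_counts(hits, read_abundance=None):
    """Split each read evenly across the features it maps to.

    hits: iterable of (read_id, feature_id) pairs, one per alignment.
    read_abundance: optional dict read_id -> copy number, for collapsed
        reads (hypothetical; each read counts as 1 when omitted).
    """
    targets = defaultdict(set)
    for read_id, feature_id in hits:
        targets[read_id].add(feature_id)

    counts = defaultdict(float)
    for read_id, features in targets.items():
        abundance = 1 if read_abundance is None else read_abundance[read_id]
        share = abundance / len(features)  # e.g. 0.5 each for two features
        for feature_id in features:
            counts[feature_id] += share
    return dict(counts)


# Hypothetical example: read r1 maps to two piRNAs, read r2 to one.
hits = [("r1", "piR-a"), ("r1", "piR-b"), ("r2", "piR-a")]
print(fractional_counts(hits))
```

With this scheme the counts still sum to the number of reads, so no read is counted more than once. Note that DESeq2 expects integer counts, so fractional values would need to be rounded before going into that kind of analysis.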

> How will I perform normalization between libraries with so many quantitative differences so as to check for relative expression?

Since you are working with different cell lines, I assume that you have biological conditions you want to compare, and replicates. What we have done in the past in our group is to analyse the treated and the non-treated samples separately. The treatment biases the library composition quite a lot, so it is tricky to compare them. Furthermore, we never really had a need to compare treated vs non-treated directly. We check whether the treatment worked by quantifying the small RNAs of each class in each library, but that is it. In this type of quantification we tend to normalize to non-structural reads (reads not mapping to rRNA, snoRNA, tRNA), or to mapped reads, but this is totally dependent on the project specifics.
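As a rough Python sketch of what normalizing to non-structural reads can look like, assuming you already have the total number of reads per small RNA class for a library. The function names, the class labels, and the per-million scaling are choices made for this example only.

```python
# Classes treated as structural, following the description above.
STRUCTURAL_CLASSES = {"rRNA", "snoRNA", "tRNA"}


def non_structural_total(class_totals):
    """Sum the reads of every class that is not structural."""
    return sum(n for cls, n in class_totals.items()
               if cls not in STRUCTURAL_CLASSES)


def per_million_non_structural(feature_counts, class_totals):
    """Scale raw feature counts to counts per million non-structural reads."""
    total = non_structural_total(class_totals)
    if total == 0:
        raise ValueError("library has no non-structural reads")
    return {f: c * 1_000_000 / total for f, c in feature_counts.items()}


# Hypothetical library: 200,000 of 1,000,000 reads are non-structural.
class_totals = {"rRNA": 500_000, "tRNA": 300_000,
                "piRNA": 150_000, "miRNA": 50_000}
print(per_million_non_structural({"piR-a": 20}, class_totals))
```

Swapping `non_structural_total` for the number of mapped reads gives the other variant mentioned above.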

For a DESeq2-type analysis, we just do the usual analysis exemplified in the vignette: take a matrix of (piRNA or other small RNA) counts, and compare mutant strains vs WT (or whatever biological conditions we are studying) within each treatment.

> So the library is separated for every smallRNA database

I would suggest putting all the read counts in a single count matrix; otherwise, as far as I know, the DESeq modelling might not work as expected.
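A small Python sketch of one way to combine per-database counts into a single features-by-samples matrix. The nested-dictionary input and the sample, database, and feature names are hypothetical; in practice you would build this from your own per-database result files.

```python
def build_count_matrix(per_sample_counts):
    """Merge per-database counts into one features x samples matrix.

    per_sample_counts: dict sample -> dict database -> dict feature -> count.
    Returns (features, samples, matrix) with matrix[i][j] the count of
    features[i] in samples[j]; features absent from a sample get 0.
    """
    samples = sorted(per_sample_counts)
    merged = {}
    for sample in samples:
        merged[sample] = {}
        for db_counts in per_sample_counts[sample].values():
            for feature, count in db_counts.items():
                merged[sample][feature] = merged[sample].get(feature, 0) + count
    features = sorted({f for s in merged.values() for f in s})
    matrix = [[merged[s].get(f, 0) for s in samples] for f in features]
    return features, samples, matrix


# Hypothetical input: two samples, two databases each.
data = {
    "treated_1": {"piRNA": {"piR-a": 12}, "tRNA": {"tRNA-x": 3}},
    "untreated_1": {"piRNA": {"piR-a": 40, "piR-b": 7}, "tRNA": {"tRNA-x": 90}},
}
features, samples, matrix = build_count_matrix(data)
for name, row in zip(features, matrix):
    print(name, row)
```

The resulting integer matrix, one row per feature and one column per library, is the shape of input a count-based analysis expects.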

These are mere pointers. For a more informed answer you would need to define better the biological question you are trying to answer.