DiffBind full library size not reliable

Marianna ▴ 20

@7cc5052f

Italy

Dear colleagues,

I'm using DiffBind for the first time and have run into a few issues, one of which is quite odd.

I have a simple experiment with 3 conditions and 4 replicates each. I called peaks with MACS2 (using the --broad option).


   SampleID Condition Replicate
1        S1   control         1
2        S2   control         2
3        S3   control         3
4        S4   control         4
5        S5   flu_low         1
6        S6   flu_low         2
7        S7   flu_low         3
8        S8   flu_low         4
9        S9  flu_high         1
10      S10  flu_high         2
11      S11  flu_high         3
12      S12  flu_high         4

After running dba.count() and inspecting the resulting DBA object, I noticed that the number of reads reported per sample is extraordinarily high: at least twice the actual library size, which was around 60M single-end reads. How is this possible?

Could this be because I retained multi-mapping reads?

Any advice would be greatly appreciated.

Best

Marianna

DiffBind Reads MACS2 FullLibSize • 913 views

ADD COMMENT • link updated 7 months ago by Rory Stark ★ 5.2k • written 7 months ago by Marianna ▴ 20

It's unclear what you're asking. Please always provide some minimal code, example data, or plots to illustrate the problem. In general, from what I have seen over the years, multimappers are usually excluded from the analysis. Whether that is the issue here is unclear. Are you referring to raw counts or normalized counts?

ADD REPLY • link 7 months ago ATpoint ★ 4.8k

Thank you ATpoint for your reply.

I'll try to describe the issue in more detail.

I mapped my SE reads to the reference genome using bowtie2 --end-to-end -k 6.
I sorted and indexed the alignments using samtools.
I marked duplicates using Picard.
I called peaks using MACS2 (with the --broad option).

Below is the code used for DiffBind:

> samples <- read.csv("samples.csv")
> samples
> flu <- dba(sampleSheet=samples)
> flu

12 Samples, 12455 sites in matrix (15815 total):
    ID Condition Replicate Intervals
1   S1   control         1      8771
2   S2   control         2      8937
3   S3   control         3      9608
4   S4   control         4      6338
5   S5   flu_low         1      9127
6   S6   flu_low         2      7849
7   S7   flu_low         3      9081
8   S8   flu_low         4      7916
9   S9  flu_high         1     10432
10 S10  flu_high         2      9284
11 S11  flu_high         3      8311
12 S12  flu_high         4     11119

> flu_counts <- dba.count(flu)
> flu_counts

12 Samples, 12374 sites in matrix:
    ID Condition Replicate     Reads FRiP
1   S1   control         1 137864693 0.06
2   S2   control         2 111003187 0.09
3   S3   control         3 142181269 0.07
4   S4   control         4 110355671 0.07
5   S5   flu_low         1 166460540 0.05
6   S6   flu_low         2 128119574 0.06
7   S7   flu_low         3 164164658 0.05
8   S8   flu_low         4 133877017 0.06
9   S9  flu_high         1 169378646 0.05
10 S10  flu_high         2 162547426 0.07
11 S11  flu_high         3 173471350 0.05
12 S12  flu_high         4 196066616 0.05

Now the weird thing is that sample S1, for instance, appears to have more than 130M reads. This is not possible, as there were only about 60M raw reads for this sample! Do you think running bowtie2 with the -k argument could be the problem? With -k 6, the SAM file retains up to 6 alignments for each read that maps to multiple locations.
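As a rough sanity check, the reported Reads values can be compared with the stated library size. The sketch below is in Python purely for illustration and is not part of the DiffBind workflow. It assumes every library holds about 60M reads (the post gives only an approximate figure), and the names `ASSUMED_LIBRARY_SIZE` and `inflation` are made up for this example.

```python
# Reads column reported by dba.count() in the output above, per sample.
reported_reads = {
    "S1": 137864693, "S2": 111003187, "S3": 142181269, "S4": 110355671,
    "S5": 166460540, "S6": 128119574, "S7": 164164658, "S8": 133877017,
    "S9": 169378646, "S10": 162547426, "S11": 173471350, "S12": 196066616,
}

# ASSUMPTION: each library holds roughly 60M single-end reads; the post
# states this only as an approximate figure.
ASSUMED_LIBRARY_SIZE = 60_000_000

# Upper bound on alignments reported per read when running bowtie2 -k 6.
MAX_ALIGNMENTS_PER_READ = 6


def inflation(reported, library_size=ASSUMED_LIBRARY_SIZE):
    """Return the reported read count divided by the assumed library size."""
    return reported / library_size


ratios = {sample: inflation(n) for sample, n in reported_reads.items()}

for sample, ratio in ratios.items():
    print(f"{sample}: {ratio:.2f}x the assumed library size")
```

Under this assumption the ratios range from roughly 1.8x to 3.3x, all below the 6x ceiling that -k 6 would allow. That is consistent with counting several alignment records per read, although it does not prove that this is the cause.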

Thank you

Best

Marianna

ADD REPLY • link 7 months ago Marianna ▴ 20

I see no code, and you did not answer whether it's about raw or normalized counts.

ADD REPLY • link 7 months ago ATpoint ★ 4.8k

Sorry, I accidentally replied before completing the answer. I'm referring to raw counts.

ADD REPLY • link 7 months ago Marianna ▴ 20

Yes, this could be due to multimappers, since (iirc) -k 6 outputs up to 6 locations per multimapper. My recommendation is to remove multimappers, for example with a MAPQ filter such as samtools view -q 20. Note that this is only my recommendation; I am not the DiffBind maintainer. Multimappers are usually excluded because their true origin is uncertain.
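To make the distinction concrete, here is a minimal Python sketch that separates alignment records from distinct reads and applies a MAPQ cutoff, keeping records with MAPQ at or above the threshold (the same rule `samtools view -q` uses). The toy SAM records, read names, and MAPQ values are all hypothetical, and `parse_records` and `summarize` are illustrative helpers, not part of samtools or DiffBind.

```python
# Toy SAM text (tab-separated). All records are hypothetical.
# Only two mandatory columns are used: 1 = QNAME (read name), 5 = MAPQ.
TOY_SAM = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tchr1\t200\t1\t50M\t*\t0\t0\t*\t*",
    "read2\t256\tchr2\t300\t1\t50M\t*\t0\t0\t*\t*",
    "read2\t256\tchr3\t400\t1\t50M\t*\t0\t0\t*\t*",
    "read3\t0\tchr1\t500\t30\t50M\t*\t0\t0\t*\t*",
])


def parse_records(sam_text):
    """Yield (read_name, mapq) for each alignment line, skipping headers."""
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        yield fields[0], int(fields[4])


def summarize(sam_text, min_mapq=20):
    """Count alignment records, distinct reads, and records passing MAPQ."""
    records = list(parse_records(sam_text))
    kept = [(name, mapq) for name, mapq in records if mapq >= min_mapq]
    return {
        "alignment_records": len(records),
        "distinct_reads": len({name for name, _ in records}),
        "records_after_filter": len(kept),
    }


summary = summarize(TOY_SAM)
print(summary)
```

In this toy input, one read contributes three records, so a tool that counts records sees more "reads" than were sequenced. Filtering on MAPQ removes the low-confidence records in this example.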

ADD REPLY • link 7 months ago ATpoint ★ 4.8k

Thank you ATpoint!

I tried filtering with samtools view -q 42 to keep only uniquely mapping reads, but the results were very similar to those reported above. I'm wary of the -k option in bowtie2, so I'm now rerunning the alignment without -k to get rid of the multimappers completely.

Thank you

Marianna

ADD REPLY • link 7 months ago Marianna ▴ 20

I am not exactly sure how the -k parameter works, as I never use it, but I agree: realigning without it and filtering for MAPQ > 0 makes sense. Maybe not 42, which is strict; something like 20 may be better.

ADD REPLY • link 7 months ago ATpoint ★ 4.8k

DiffBind should already be filtering out multi-mapped reads using the mapQCth parameter in dba.count(). This is set to mapQCth=15 by default.

ADD REPLY • link 7 months ago Rory Stark ★ 5.2k
