Question

SystemPipeR: alignments with ambiguous pairing were dumped.

0

Entering edit mode

abass.conteh • 0

@abassconteh-11803

Last seen 8.4 years ago

Hello,

I have some mouse RNA seq data I am trying to process and so far its been one issue after another.

I am using SystemPipeR to complete the workflow. Most processes are done on a school computer cluster

Alignment with Tophat2 & Bowtie2 will not give me alignment percentage greater than 96%:

FileName	Nreads2x	Nalign	Perc_Aligned	Nalign_Primary	Perc_Aligned_Primary
V_3a	64400650	90154756	139.9904	62147942	96.50204152
CE_1	72373724	97572256	134.8172	69027462	95.37641313
V-2a	41102852	56080009	136.4382	39176654	95.31371205
CE_2	78475458	103266995	131.5915	74732731	95.23070385
CE_3	47441492	64027520	134.961	44988260	94.8289316
V_1	34108822	49409113	144.8573	32204806	94.41781953
M1_1	35241898	46316778	131.4253	33220611	94.26453422
M1_4	31914826	53134390	166.4881	29613700	92.78978992
M1_5	65201850	135826244	208.3165	52013333	79.77278712
C1_a	62484420	188092157	301.0225	49205193	78.74793909
M1_3	55370018	99579439	179.8436	43400389	78.38247226
M2_2	54255432	142487044	262.6226	42202003	77.78392217
C2_a	42444164	112527985	265.12	32364201	76.25123916
M2_4	72875614	170897751	234.5061	55534758	76.20485777
M1_2	63688546	74411912	116.8372	47756057	74.98374511
M2_1	71261864	187889713	263.661	53356281	74.87354106
C3	36400264	85669156	235.3531	27208587	74.74832325
C_4a	32817136	79616238	242.6057	24427341	74.43471301
C_5a	29761222	75925049	255.114	22125726	74.34414487
M2_3	61233484	150667977	246.0549	44963721	73.42995705

Some of the concordant pair alignment rate of these these files would be ~68%, while some are ~90%

Example Tophat execution: /tophat2/2.1.0/bin/tophat -p 16 -o /results/022722_C3_1.fastq.gz.tophat /data/genome /data/022722_C3_1.fastq.gz /data/022722_C3_2.fastq.gz

Here is the fastq report:https://drive.google.com/open?id=0B6NkdsrSEj_wUExMSnlXOC14Unc

Any ideas as to why I get such low alignment or how I can improve it?

Okay, besides the low rates, everything else goes fine with alignment. Next step is to get counts. I use SummarizeOverlaps to get counts from bam file using txdb made from mm10

Here is where I am stuck so far:

Trying to do it with parallelization on the Cluster gives an error. I was able to get it to work just using the local computer processor on 2 files. But when I try it on 20 files no luck: I run the following

f <- function(x) {
    library(systemPipeR); library(BiocParallel); library(GenomicFeatures); library(AnnotationDbi)
    txdb <- loadDb("./data/genesmrna.sqlite")
    eByg <- exonsBy(txdb, by=c("gene"))
    args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE.txt")
    bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
    summarizeOverlaps(eByg, bfl[x], mode="Union", ignore.strand=TRUE, inter.feature=TRUE, singleEnd=FALSE)
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts)
}
funs <- makeClusterFunctionsTorque("torque.tmpl")
param <- BatchJobsParam(length(args), resources=list(walltime="20:00:00", nodes="1:ppn=8", memory="16gb"), cluster.functions=funs)
register(param)
counteByg <- bplapply(seq(along=args), f)

I get the following error:

Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
    255004 alignments with ambiguous pairing were dumped.
    Use 'getDumpedAlignments()' to retrieve them from the dump environment.
Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
    516780 alignments with ambiguous pairing were dumped.
    Use 'getDumpedAlignments()' to retrieve them from the dump environment.
Result:
List of 2
 $ message: chr "object 'counteByg' not found"
 $ call   : language seq(along = counteByg)
 - attr(*, "class")= chr [1:3] "remote_error" "bperror" "condition"
 - attr(*, "traceback")= chr [1:29] "17: BatchJobs:::doJob(reg = loadRegistry(\"//BiocParallel_tmp_6c923953a7b5\"), " "        ids = c(1L), multiple.result.files = FALSE, disable.mail = FALSE, " "        first = 2L, last = 1L, array.id = NA)" "16: try(applyJobFunction(reg, job, cache), silent = TRUE)" ...

No idea what: "alignments with ambiguous pairing were dumped." means.

Any help would be greatly appreciated!

R rnaseq tophat summarizeoverlaps systempiper • 2.5k views

ADD COMMENT • link updated 8.4 years ago by Thomas Girke ★ 1.7k • written 8.4 years ago by abass.conteh • 0

score 1 · Accepted Answer · 2016-11-07

There may be subtleties here that I am missing, in which case Thomas Girke will probably be along with a better answer. In the mean time, a 75% alignment isn't particularly bad, and expecting better than 96% alignment is probably a bit optimistic on your part, so I wouldn't be particularly worried about those statistics.

Also note that you aren't getting an Error, you are getting a Warning. These are not the same thing. An error means that something went horribly wrong, and R refused to continue, sputtering in indignation. A warning is just a polite notification that something happened that you might not have expected. In this case, you are just being warned that you had a few hundred thousand alignments dumped, because the pairings were ambiguous.

Unfortunately, the warning is a bit, um, ambiguous itself, and I believe it comes from readGAlignmentPairs in the GenomicAlignments package. If you look at ?readGAlignmentPairs you can see what exactly is being tested to say if a pairing is ambiguous, which has to do with lots of 'inside baseball' about the SAM specification and whatnot. The basic idea behind 'proper pairing' is that we expect (for Illumina data, at least) for paired ends to be on opposite strands, facing inward, with an expected insert size of like 450 bp or so (these are all flags that you send to the aligner - I am giving some idea of the usual defaults). Anything that doesn't fulfill those basic criteria will be marked as not properly paired by the aligner.

Anyway, you have like 25M - 75M reads per sample, so having half a million or so dumped off is inconsequential.

score 1 · Accepted Answer · 2016-11-07

Dear Abass,

What makes you believe that an alignment rate around 90% is low? Especially, for PE reads this is reasonable. Even 70% is not too bad. You could certainly try to diagnose the source of this drop for some of your samples (e.g. lower quality, adapter contaminations, PCR artifacts, etc). For comparison you also may want to take a look what alignment frequencies are reported by Tophat directly. These are stored in the Tophat output directories of this naming structure: results/*.fastq.gz.tophat/align_summary.txt. Please also note, you might get better alignment frequencies with newer aligners such as the new version of Tophat called HISAT2 (param file for systemPipeR is here) or rsubread. Those aligners are more sensitive and also much more time efficient.

As for the error in your read counting step on your computer cluster, one issue is the last line in your f() function. This line doesn't belong there. It needs to be run after the job submission with bplapply (please see vignette for this). Also, in case your computer nodes run several R versions, please make sure that your cluster template file (here torque.tmpl) loads the same R version you are using to submit the jobs (preferentially a recent version). The first messages you are seeing are just warnings from summarizeOverlaps. To understand their meaning it can be helpful to read the help file of the summarizeOverlaps function and/or the vignette of the GenomicAlignments package where this function is defined.

Thomas