Question: SystemPipeR: alignments with ambiguous pairing were dumped.
0
2.9 years ago by
abass.conteh0 wrote:

Hello,

I have some mouse RNA seq data I am trying to process and so far its been one issue after another.

I am using SystemPipeR to complete the workflow. Most processes are done  on a school computer cluster

Alignment with Tophat2 & Bowtie2  will not give me alignment percentage greater than  96%:

 FileName Nreads2x Nalign Perc_Aligned Nalign_Primary Perc_Aligned_Primary V_3a 64400650 90154756 139.9904 62147942 96.50204152 CE_1 72373724 97572256 134.8172 69027462 95.37641313 V-2a 41102852 56080009 136.4382 39176654 95.31371205 CE_2 78475458 103266995 131.5915 74732731 95.23070385 CE_3 47441492 64027520 134.961 44988260 94.8289316 V_1 34108822 49409113 144.8573 32204806 94.41781953 M1_1 35241898 46316778 131.4253 33220611 94.26453422 M1_4 31914826 53134390 166.4881 29613700 92.78978992 M1_5 65201850 135826244 208.3165 52013333 79.77278712 C1_a 62484420 188092157 301.0225 49205193 78.74793909 M1_3 55370018 99579439 179.8436 43400389 78.38247226 M2_2 54255432 142487044 262.6226 42202003 77.78392217 C2_a 42444164 112527985 265.12 32364201 76.25123916 M2_4 72875614 170897751 234.5061 55534758 76.20485777 M1_2 63688546 74411912 116.8372 47756057 74.98374511 M2_1 71261864 187889713 263.661 53356281 74.87354106 C3 36400264 85669156 235.3531 27208587 74.74832325 C_4a 32817136 79616238 242.6057 24427341 74.43471301 C_5a 29761222 75925049 255.114 22125726 74.34414487 M2_3 61233484 150667977 246.0549 44963721 73.42995705

Some of the concordant pair alignment rate of these these files would be ~68%, while some are ~90%

Example Tophat execution:  /tophat2/2.1.0/bin/tophat -p 16 -o /results/022722_C3_1.fastq.gz.tophat /data/genome /data/022722_C3_1.fastq.gz /data/022722_C3_2.fastq.gz

Any ideas as to why I get such low alignment or how I can improve it?

Okay, besides the low rates, everything else goes fine with alignment. Next step is to get counts. I use SummarizeOverlaps to get counts from bam file using txdb made from mm10

Here is where I am stuck so far:

Trying to do it with parallelization on the Cluster gives an error. I was able to get it to work just using the local computer processor on 2 files. But when I try it on 20 files no luck: I run the following

f <- function(x) {
library(systemPipeR); library(BiocParallel); library(GenomicFeatures); library(AnnotationDbi)
eByg <- exonsBy(txdb, by=c("gene"))
args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE.txt")
bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character())
summarizeOverlaps(eByg, bfl[x], mode="Union", ignore.strand=TRUE, inter.feature=TRUE, singleEnd=FALSE)
countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])counts) } funs <- makeClusterFunctionsTorque("torque.tmpl") param <- BatchJobsParam(length(args), resources=list(walltime="20:00:00", nodes="1:ppn=8", memory="16gb"), cluster.functions=funs) register(param) counteByg <- bplapply(seq(along=args), f)  I get the following error: Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 255004 alignments with ambiguous pairing were dumped. Use 'getDumpedAlignments()' to retrieve them from the dump environment. Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 516780 alignments with ambiguous pairing were dumped. Use 'getDumpedAlignments()' to retrieve them from the dump environment. Result: List of 2 message: chr "object 'counteByg' not found"
\$ call   : language seq(along = counteByg)
- attr(*, "class")= chr [1:3] "remote_error" "bperror" "condition"
- attr(*, "traceback")= chr [1:29] "17: BatchJobs:::doJob(reg = loadRegistry(\"//BiocParallel_tmp_6c923953a7b5\"), " "        ids = c(1L), multiple.result.files = FALSE, disable.mail = FALSE, " "        first = 2L, last = 1L, array.id = NA)" "16: try(applyJobFunction(reg, job, cache), silent = TRUE)" ...

No idea what: "alignments with ambiguous pairing were dumped." means.

Any help would be greatly appreciated!

modified 2.9 years ago by Thomas Girke1.7k • written 2.9 years ago by abass.conteh0
Answer: SystemPipeR: alignments with ambiguous pairing were dumped.
1
2.9 years ago by
United States
James W. MacDonald51k wrote:

There may be subtleties here that I am missing, in which case Thomas Girke will probably be along with a better answer. In the mean time, a 75% alignment isn't particularly bad, and expecting better than 96% alignment is probably a bit optimistic on your part, so I wouldn't be particularly worried about those statistics.

Also note that you aren't getting an Error, you are getting a Warning. These are not the same thing. An error means that something went horribly wrong, and R refused to continue, sputtering in indignation. A warning is just a polite notification that something happened that you might not have expected. In this case, you are just being warned that you had a few hundred thousand alignments dumped, because the pairings were ambiguous.

Unfortunately, the warning is a bit, um, ambiguous itself, and I believe it comes from readGAlignmentPairs in the GenomicAlignments package. If you look at ?readGAlignmentPairs you can see what exactly is being tested to say if a pairing is ambiguous, which has to do with lots of 'inside baseball' about the SAM specification and whatnot. The basic idea behind 'proper pairing' is that we expect (for Illumina data, at least) for paired ends to be on opposite strands, facing inward, with an expected insert size of like 450 bp or so (these are all flags that you send to the aligner - I am giving some idea of the usual defaults). Anything that doesn't fulfill those basic criteria will be marked as not properly paired by the aligner.

Anyway, you have like 25M - 75M reads per sample, so having half a million or so dumped off is inconsequential.

Answer: SystemPipeR: alignments with ambiguous pairing were dumped.
1
2.9 years ago by
Thomas Girke1.7k
United States
Thomas Girke1.7k wrote:

Dear Abass,

What makes you believe that an alignment rate around 90% is low? Especially, for PE reads this is reasonable. Even 70% is not too bad. You could certainly try to diagnose the source of this drop for some of your samples (e.g. lower quality, adapter contaminations, PCR artifacts, etc). For comparison you also may want to take a look what alignment frequencies are reported by Tophat directly. These are stored in the Tophat output directories of this naming structure: results/*.fastq.gz.tophat/align_summary.txt. Please also note, you might get better alignment frequencies with newer aligners such as the new version of Tophat called HISAT2  (param file for systemPipeR is here) or rsubread. Those aligners are more sensitive and also much more time efficient.

As for the error in your read counting step on your computer cluster, one issue is the last line in your f() function. This line doesn't belong there. It needs to be run after the job submission with bplapply (please see vignette for this). Also, in case your computer nodes run several R versions, please make sure that your cluster template file (here torque.tmpl) loads the same R version you are using to submit the jobs (preferentially a recent version). The first messages you are seeing are just warnings from summarizeOverlaps. To understand their meaning it can be helpful to read the help file of the summarizeOverlaps function and/or the vignette of the GenomicAlignments package where this function is defined.

Thomas

Thank you so much for the quick reply! I saw the error in the f() function after I posted it on here. That one line was the reason it kept failing, and after I had removed it, everything worked perfectly.  As for the alignments rate, I am not familiar with what is considered acceptable. So basically try to go for at least ~70% for PE data.  I will try HISAT2 or rsubread next time, especially if they are better and faster than Tophat.

Thank you again for your help, and also for creating systemPipeR.