Hello,
I have some mouse RNA seq data I am trying to process and so far its been one issue after another.
I am using SystemPipeR to complete the workflow. Most processes are done on a school computer cluster
Alignment with Tophat2 & Bowtie2 will not give me alignment percentage greater than 96%:
FileName | Nreads2x | Nalign | Perc_Aligned | Nalign_Primary | Perc_Aligned_Primary |
V_3a | 64400650 | 90154756 | 139.9904 | 62147942 | 96.50204152 |
CE_1 | 72373724 | 97572256 | 134.8172 | 69027462 | 95.37641313 |
V-2a | 41102852 | 56080009 | 136.4382 | 39176654 | 95.31371205 |
CE_2 | 78475458 | 103266995 | 131.5915 | 74732731 | 95.23070385 |
CE_3 | 47441492 | 64027520 | 134.961 | 44988260 | 94.8289316 |
V_1 | 34108822 | 49409113 | 144.8573 | 32204806 | 94.41781953 |
M1_1 | 35241898 | 46316778 | 131.4253 | 33220611 | 94.26453422 |
M1_4 | 31914826 | 53134390 | 166.4881 | 29613700 | 92.78978992 |
M1_5 | 65201850 | 135826244 | 208.3165 | 52013333 | 79.77278712 |
C1_a | 62484420 | 188092157 | 301.0225 | 49205193 | 78.74793909 |
M1_3 | 55370018 | 99579439 | 179.8436 | 43400389 | 78.38247226 |
M2_2 | 54255432 | 142487044 | 262.6226 | 42202003 | 77.78392217 |
C2_a | 42444164 | 112527985 | 265.12 | 32364201 | 76.25123916 |
M2_4 | 72875614 | 170897751 | 234.5061 | 55534758 | 76.20485777 |
M1_2 | 63688546 | 74411912 | 116.8372 | 47756057 | 74.98374511 |
M2_1 | 71261864 | 187889713 | 263.661 | 53356281 | 74.87354106 |
C3 | 36400264 | 85669156 | 235.3531 | 27208587 | 74.74832325 |
C_4a | 32817136 | 79616238 | 242.6057 | 24427341 | 74.43471301 |
C_5a | 29761222 | 75925049 | 255.114 | 22125726 | 74.34414487 |
M2_3 | 61233484 | 150667977 | 246.0549 | 44963721 | 73.42995705 |
Some of the concordant pair alignment rate of these these files would be ~68%, while some are ~90%
Example Tophat execution: /tophat2/2.1.0/bin/tophat -p 16 -o /results/022722_C3_1.fastq.gz.tophat /data/genome /data/022722_C3_1.fastq.gz /data/022722_C3_2.fastq.gz
Here is the fastq report:https://drive.google.com/open?id=0B6NkdsrSEj_wUExMSnlXOC14Unc
Any ideas as to why I get such low alignment or how I can improve it?
Okay, besides the low rates, everything else goes fine with alignment. Next step is to get counts. I use SummarizeOverlaps to get counts from bam file using txdb made from mm10
Here is where I am stuck so far:
Trying to do it with parallelization on the Cluster gives an error. I was able to get it to work just using the local computer processor on 2 files. But when I try it on 20 files no luck: I run the following
f <- function(x) { library(systemPipeR); library(BiocParallel); library(GenomicFeatures); library(AnnotationDbi) txdb <- loadDb("./data/genesmrna.sqlite") eByg <- exonsBy(txdb, by=c("gene")) args <- systemArgs(sysma="param/tophat.param", mytargets="targetsPE.txt") bfl <- BamFileList(outpaths(args), yieldSize=50000, index=character()) summarizeOverlaps(eByg, bfl[x], mode="Union", ignore.strand=TRUE, inter.feature=TRUE, singleEnd=FALSE) countDFeByg <- sapply(seq(along=counteByg), function(x) assays(counteByg[[x]])$counts) } funs <- makeClusterFunctionsTorque("torque.tmpl") param <- BatchJobsParam(length(args), resources=list(walltime="20:00:00", nodes="1:ppn=8", memory="16gb"), cluster.functions=funs) register(param) counteByg <- bplapply(seq(along=args), f)
I get the following error:
Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
255004 alignments with ambiguous pairing were dumped.
Use 'getDumpedAlignments()' to retrieve them from the dump environment.
Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
516780 alignments with ambiguous pairing were dumped.
Use 'getDumpedAlignments()' to retrieve them from the dump environment.
Result:
List of 2
$ message: chr "object 'counteByg' not found"
$ call : language seq(along = counteByg)
- attr(*, "class")= chr [1:3] "remote_error" "bperror" "condition"
- attr(*, "traceback")= chr [1:29] "17: BatchJobs:::doJob(reg = loadRegistry(\"//BiocParallel_tmp_6c923953a7b5\"), " " ids = c(1L), multiple.result.files = FALSE, disable.mail = FALSE, " " first = 2L, last = 1L, array.id = NA)" "16: try(applyJobFunction(reg, job, cache), silent = TRUE)" ...
No idea what: "alignments with ambiguous pairing were dumped." means.
Any help would be greatly appreciated!
Thank you so much for the quick reply! I saw the error in the f() function after I posted it on here. That one line was the reason it kept failing, and after I had removed it, everything worked perfectly. As for the alignments rate, I am not familiar with what is considered acceptable. So basically try to go for at least ~70% for PE data. I will try HISAT2 or rsubread next time, especially if they are better and faster than Tophat.
Thank you again for your help, and also for creating systemPipeR.