Question

rsubread versus BWA ( Burrows-Wheeler Alignmener) for whole genome alignment

0

Entering edit mode

tarek.mohamed ▴ 10

@tarekmohamed-9489

Last seen 6.2 years ago

Hi All,

I have some whole genome sequencing data that were generated from illumina hiseq platform.

as a first step, I will align the fastq file to a reference genome. Can I use Rsubread for that? I have used it for RNAseq mapping before and it was very good, I got 90% of the read were mapped to the ref genome. Does BWA have any advantage as compared to Rsubread.

Thanks

Tarek

rsubread BWA NGS • 3.2k views

ADD COMMENT • link updated 8.5 years ago by Wei Shi ★ 3.6k • written 8.6 years ago by tarek.mohamed ▴ 10

score 1 · Answer 1 · 2016-08-30

1

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

Yes you can use the Rsubread align() function to align DNA-seq reads. You will need to set 'type="dna"' for this mapping.

For the comparison between Rsubread and BWA, you may take a look at this paper:

http://www.ncbi.nlm.nih.gov/pubmed/23558742

ADD COMMENT • link 8.6 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

Thanks for your reply.

I am processing whole genome sequencing data, following to alignment, I want to call SNPs within different samples (100 samples). SNP data will be used to perform a downstream GWAS.

DO you think I can use exactSNP() to call SNPs or shall I stick to GATK.

Thanks

Tarek

ADD REPLY • link 8.5 years ago tarek.mohamed ▴ 10

0

Entering edit mode

I cant see why you can not use exactSNP.

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

I tried this with the unzipped file and it returns

*** caught segfault ***
address 0x7fff9a688000, cause 'memory not mapped'

Thanks

Tarek

ADD REPLY • link 8.5 years ago tarek.mohamed ▴ 10

0

Entering edit mode

Can you provide the full screen output?

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

>library(Rsubread)
> setwd("/projects/b1042/BurridgeLab/rnaseq_rarg")
>exactSNP(SNPAnnotationFile="All_20151104.vcf",readFile="subread_jg4.sam",isBAM=FALSE,refGenomeFile="BSgenome.Hsapiens.NCBI.GRCh38.fasta",outputFile="subread_jg4.sam.vcf",nthreads=40)

//====================== Running (08-Sep-2016 11:52:33) ======================\
|| ||
|| WARNING format issue in file 'subread_jg4.sam': ||
|| The required format is : SAM ||
|| The real format seems to be : BAM ||
|| A wrong format may result in wrong results or crash the program. ||
|| Please refer to the manual for file format options. ||
|| If the file is in the correct format, please ignore this message. ||
|| ||
|| WARNING format issue in file 'BSgenome.Hsapiens.NCBI.GRCh38.fasta': ||
|| The required format is : SAM ||
|| The file format is unknown. ||
|| A wrong format may result in wrong results or crash the program. ||
|| Please refer to the manual for file format options. ||
|| If the file is in the correct format, please ignore this message. ||
|| ||
|| Split SAM file into ./temp-snps-003575-58311A9F-* ... ||

*** caught segfault ***
address 0x7fff9598f000, cause 'memory not mapped'

ADD REPLY • link 8.5 years ago tarek.mohamed ▴ 10

score 0 · Answer 2 · 2016-09-08

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

Apparently exactSNP complains about the format of your input data. You need to specify 'isBAM = TRUE' if your input data are in BAM format.

Also please refer to the posting guide (http://www.bioconductor.org/help/support/posting-guide/) when you post questions. You should provide output of sessionInfo() and other info.

ADD COMMENT • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi Wei,

I have already specified isBAM =FALSE. I have added here the sessionInfo()

>exactSNP(SNPAnnotationFile="All_20151104.vcf",readFile="CH1_0uM_subjunc.sam",isBAM=FALSE,refGenomeFile="BSgenome.Hsapiens.NCBI.GRCh38.fasta",outputFile="subread_jg4.sam.vcf",nthreads=40)

> sessionInfo()

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

ADD REPLY • link 8.5 years ago tarek.mohamed ▴ 10

score 0 · Answer 3 · 2016-09-08

0

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 3 months ago

Australia/Melbourne

Sorry you should specify 'isBAM=TRUE' if you provide a BAM input.

Your R version is 1 year old. You will need to update both your R and Rsubread to the latest.

ADD COMMENT • link 8.5 years ago Wei Shi ★ 3.6k

0

Entering edit mode

Hi wie,

One more questions.

I tried subread to align whole genome sequence from one sample (paired end, 150 bp reads) and I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.

I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.

>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1)

Thanks

Tarek

ADD REPLY • link 8.5 years ago tarek.mohamed ▴ 10

0

Entering edit mode

Hi Tarek, are you using the latest version of Rsubread? The session info you provided above showed that your Rsubread and R are at least 1 year old.

Also as I mentioned above, please include all your commands and the full screen output when you post. This will help you to get help from others.

ADD REPLY • link 8.5 years ago Wei Shi ★ 3.6k