Hi,
So I'm running easyRNASSeq on a couple of bam files that I have, before I do it on all 40 samples in order to get a count table for the samples. However, I keep getting an error and am unsure as to why. If I could get any help that would be great.
Below is the script that I am using:
> library(easyRNASeq)
> setwd("/home/RNA-seq data/")
> filenames <- dir(".", "bam$")
> bf <- getBamFileList(filenames)
>anns <- "hg19.gtf"
>annotParam <- AnnotParam(datasource = anns, type = "gtf")
>rnaSeqParam <- RnaSeqParam(annotParam = annotParam, countBy=c("exons"))
> sexp <- simpleRNASeq(bamFiles = bf, param = rnaSeqParam, nnodes = 40, verbose = TRUE)
==========================
simpleRNASeq version 2.2.1
==========================
Creating a SummarizedExperiment.
==========================
Processing the alignments.
==========================
Pre-processing 2 BAM files.
Validating the BAM files.
Extracted 93 reference sequences information.
Found 0 single-end BAM files.
Found 2 paired-end BAM files.
Bam file: 01_v2.bam has reads of length 49bp
Bam file: 01_v4.bam has reads of length 49bp
Bam file: 01_v2.bam has 38562408 reads.
Bam file: 01_v4.bam has 37504885 reads.
==========================
Processing the annotation
==========================
Validating the annotation source
Read 995 records
Validated a datasource of type gtf
Fetching the annotation
Retrieving annotation from a gtf datasource
Read 2748737 records
Warning messages:
1: In FUN(1:2[[2L]], ...) : Bam file: 01_v2.bam is considered unstranded.
2: In FUN(1:2[[2L]], ...) :
Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded.
3: In FUN(1:2[[2L]], ...) : Bam file: 01_v4.bam is considered unstranded.
4: In FUN(1:2[[2L]], ...) :
Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded.
5: In matrix(unlist(strsplit(some.lines[[9]], " |; *")), ncol = 2, :
data length [29599] is not a sub-multiple or multiple of the number of rows [14800]
6: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg 1)
> head(assay(sexp))
01_v2.bam 01_v4.bam
CCDS 0 0
ENSE00000327880 0 0
ENSE00000328922 0 0
ENSE00000329326 0 0
ENSE00000330966 0 0
ENSE00000331689 0 0
When I view the whole table, all the counts are 0, not just the top 5.
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] easyRNASeq_2.2.1
loaded via a namespace (and not attached):
[1] AnnotationDbi_1.28.2 BBmisc_1.9 BatchJobs_1.6
[4] Biobase_2.26.0 BiocGenerics_0.12.1 BiocParallel_1.0.3
[7] Biostrings_2.34.1 DBI_0.3.1 DESeq_1.18.0
[10] GenomeInfoDb_1.2.5 GenomicAlignments_1.2.2 GenomicRanges_1.18.4
[13] IRanges_2.0.1 LSD_3.0 RColorBrewer_1.1-2
[16] RCurl_1.95-4.7 RSQLite_1.0.0 Rsamtools_1.18.3
[19] S4Vectors_0.4.0 ShortRead_1.24.0 XML_3.98-1.3
[22] XVector_0.6.0 annotate_1.44.0 base64enc_0.1-2
[25] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6
[28] checkmate_1.6.0 codetools_0.2-9 digest_0.6.8
[31] edgeR_3.8.6 fail_1.2 foreach_1.4.2
[34] genefilter_1.48.1 geneplotter_1.44.0 genomeIntervals_1.22.3
[37] grid_3.1.2 hwriter_1.3.2 intervals_0.15.0
[40] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26
[43] limma_3.22.7 magrittr_1.5 parallel_3.1.2
[46] sendmailR_1.2-1 splines_3.1.2 stats4_3.1.2
[49] stringi_0.5-5 stringr_1.0.0 survival_2.37-7
[52] tools_3.1.2 xtable_1.7-4 zlibbioc_1.12.0
Thanks for the help.
Hi,
Thanks for the reply. Does version 2.4.4 of easyRNASeq install on R 3.1.2?
I don't think I can update the version of R that I have as I am not in control of the UNIX environment that I am using to run R and analyse my RNAseq data.
Thanks
Hi,
Thanks for that. I have spoken to the administrator and he has managed to install R 3.2.1 onto the system, and this is now my sessionInfo():
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] easyRNASeq_2.4.3 BiocInstaller_1.18.3
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 GenomeInfoDb_1.4.1
[4] XVector_0.8.0 futile.options_1.0.0 bitops_1.0-6
[7] tools_3.2.1 zlibbioc_1.14.0 biomaRt_2.24.0
[10] annotate_1.46.0 RSQLite_1.0.0 lattice_0.20-31
[13] DBI_0.3.1 parallel_3.2.1 DESeq_1.20.0
[16] genefilter_1.50.0 hwriter_1.3.2 Biostrings_2.36.1
[19] S4Vectors_0.6.1 IRanges_2.2.5 locfit_1.5-9.1
[22] stats4_3.2.1 grid_3.2.1 LSD_3.0
[25] Biobase_2.28.0 AnnotationDbi_1.30.1 XML_3.98-1.3
[28] survival_2.38-1 BiocParallel_1.2.7 limma_3.24.12
[31] latticeExtra_0.6-26 geneplotter_1.46.0 lambda.r_1.1.7
[34] edgeR_3.10.2 intervals_0.15.0 genomeIntervals_1.24.1
[37] Rsamtools_1.20.4 splines_3.2.1 BiocGenerics_0.14.0
[40] GenomicAlignments_1.4.1 GenomicRanges_1.20.5 ShortRead_1.26.0
[43] xtable_1.7-4 RCurl_1.95-4.7
I have updated the version of easyRNASeq that I have but only seem to be getting version 2.4.3 rather than 2.4.4.
However, even with the updated version, I am still getting the same error messages as before as well as a count table of just 0 for everything.
Thanks
Hi,
I will keep a look out for the new version of easyRNASeq once it becomes accessible through biocLite().
The gtf file was from ENSEMBL. I retried the whole thing with a different gtf file that I got from gencode, and this is the error that I get:
Warning messages:
1: In FUN(X[[i]], ...) : Bam file: 01_v2.bam is considered unstranded.
2: In FUN(X[[i]], ...) :
Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded.
3: In FUN(X[[i]], ...) : Bam file: 01_v4.bam is considered unstranded.
4: In FUN(X[[i]], ...) :
Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded.
5: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg 1)
I have tried to create a counts table using a different method, using GenomicAlignments and using the gtf file from gencode, and I have managed to create a counts table. So not sure if its a problem with my bam files, the gtf files or something else. If you would like anything else, let me know.
Thanks,
Elie
Hi Nico,
Yea it should be no problem to send you an excerpt of the BAM files. Do you want the header and 1M lines for both files?
For the header do you want the output from:
samtools view -H 01_v2.bam
How do I get the first 1M lines for you?
Thanks,
Elie
Hi Nico,
Thanks for that, I have sent you an email.
Thanks,
Elie
Hej Elie!
Thanks for the data. I figured out what the issue was. Briefly, your BAM files were the culprits. Although every read in your BAM file states that they are paired-end reads (in the flag, the second SAM column), no single reads has any mate information in column 7 to 9. simpleRNASeq autodetects the type of sequencing data (it assumed Paired-End in your case) and uses that information to count reads from valid PE reads (i.e. those that have values in the column 7-9). Hence, with your BAM files, there was no such valid reads and all counts remained 0.
For more details and for a working solution (overriding the simpleRNASeq parameter autodetection), I have put a document there:
https://microasp.upsc.se/root/upscb-public/tree/master/tutorial/easyRNASeq/
called annotation-manipulation-example.R and its html transcript: annotation-manipulation-example.
It details:
1) How I create synthetic transcript from your gtf file - essential step; see the guidelines there: http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis for more details.
2) How I changed the sequence name so that the EnsEMBL annotation I got matched the sequence names in your BAM files
3) How to define the set of parameters to match your BAM files
4) How to override the simpleRNASeq parameter autodetection.
Finally, I have made a few changes to the package to make it more user friendly and these will be available in version 2.4.7; probably after the WE.
HTH,
Cheers,
Nico
Hi Nico,
That's great thanks for figuring out what was wrong for me. I shall have a proper read through the links and the working solution when I get into the office tomorrow, but it is great to finally know what was wrongs.
Thanks,
Elie
Hi Nico,
I have had a quick look at your response, but I can't seem to create a synthetic transcript file. I am following the details on the link you provided, however, this is what I am getting:
> source("https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R")
Error in file(filename, "r", encoding = encoding) :
https:// URLs are not supported
and
> gAnnot <- createSyntheticTranscripts(
filename="Homo_sapiens.GRCh38.80.gtf.gz",
input="gtf",
feature="transcript",
output="GRanges")
Error: could not find function "createSyntheticTranscripts"
Thanks,
Elie
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocInstaller_1.18.3 curl_0.9.1 genomeIntervals_1.24.1
[4] BiocGenerics_0.14.0 intervals_0.15.0 easyRNASeq_2.4.7
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2 futile.logger_1.4.1 GenomeInfoDb_1.4.1
[4] XVector_0.8.0 tools_3.2.1 futile.options_1.0.0
[7] bitops_1.0-6 zlibbioc_1.14.0 biomaRt_2.24.0
[10] annotate_1.46.0 RSQLite_1.0.0 lattice_0.20-31
[13] DBI_0.3.1 DESeq_1.20.0 genefilter_1.50.0
[16] hwriter_1.3.2 Biostrings_2.36.1 S4Vectors_0.6.1
[19] IRanges_2.2.5 stats4_3.2.1 locfit_1.5-9.1
[22] grid_3.2.1 LSD_3.0 Biobase_2.28.0
[25] AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3
[28] BiocParallel_1.2.7 limma_3.24.12 latticeExtra_0.6-26
[31] geneplotter_1.46.0 lambda.r_1.1.7 edgeR_3.10.2
[34] Rsamtools_1.20.4 GenomicAlignments_1.4.1 splines_3.2.1
[37] GenomicRanges_1.20.5 ShortRead_1.26.0 xtable_1.7-4
[40] RCurl_1.95-4.7
Hi Nico,
Thanks for that. I have tried downloading curl and using
curl("https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R").
The output from that is:
description
"https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R"
class
"curl"
mode
"r"
text
"text"
opened
"closed"
can read
"yes"
can write
"no"
And after that, I still get Error: could not find function "createSyntheticTranscripts". I have also tried to download the file itself and using that in the source command, but all I get is:
> source("createSyntheticTranscripts.txt")
Error in source("createSyntheticTranscripts.txt") :
createSyntheticTranscripts.txt:1:1: unexpected input
1: ÿ
^
All the alignment was down by an external company, and from what I can see from the files that came with the aligned BAM files, I think they were done using SOAP aligner.
Sorry for not being able to do this.
Thanks for the help,
Elie
Hi Nico,
That is great thanks! I have finally managed to get it to work and to get a count table at the end of it.
Thanks a lot for all your help and if I run into any other problems with easyRNASeq, I will let you know.
Thanks,
Elie