Question: Issue with easyRNASeq
0
4.1 years ago by
ea11g00
United Kingdom
ea11g00 wrote:

Hi,

So I'm running easyRNASSeq on a couple of bam files that I have, before I do it on all 40 samples in order to get a count table for the samples. However, I keep getting an error and am unsure as to why. If I could get any help that would be great.

Below is the script that I am using:

> library(easyRNASeq)

> setwd("/home/RNA-seq data/")

> filenames <- dir(".", "bam") > bf <- getBamFileList(filenames) >anns <- "hg19.gtf" >annotParam <- AnnotParam(datasource = anns, type = "gtf") >rnaSeqParam <- RnaSeqParam(annotParam = annotParam, countBy=c("exons")) > sexp <- simpleRNASeq(bamFiles = bf, param = rnaSeqParam, nnodes = 40, verbose = TRUE) ========================== simpleRNASeq version 2.2.1 ========================== Creating a SummarizedExperiment. ========================== Processing the alignments. ========================== Pre-processing 2 BAM files. Validating the BAM files. Extracted 93 reference sequences information. Found 0 single-end BAM files. Found 2 paired-end BAM files. Bam file: 01_v2.bam has reads of length 49bp Bam file: 01_v4.bam has reads of length 49bp Bam file: 01_v2.bam has 38562408 reads. Bam file: 01_v4.bam has 37504885 reads. ========================== Processing the annotation ========================== Validating the annotation source Read 995 records Validated a datasource of type gtf Fetching the annotation Retrieving annotation from a gtf datasource Read 2748737 records Warning messages: 1: In FUN(1:2[[2L]], ...) : Bam file: 01_v2.bam is considered unstranded. 2: In FUN(1:2[[2L]], ...) : Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded. 3: In FUN(1:2[[2L]], ...) : Bam file: 01_v4.bam is considered unstranded. 4: In FUN(1:2[[2L]], ...) : Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded. 5: In matrix(unlist(strsplit(some.lines[[9]], " |; *")), ncol = 2, : data length [29599] is not a sub-multiple or multiple of the number of rows [14800] 6: In .Method(..., deparse.level = deparse.level) : number of columns of result is not a multiple of vector length (arg 1) > head(assay(sexp)) 01_v2.bam 01_v4.bam CCDS 0 0 ENSE00000327880 0 0 ENSE00000328922 0 0 ENSE00000329326 0 0 ENSE00000330966 0 0 ENSE00000331689 0 0 When I view the whole table, all the counts are 0, not just the top 5. > sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] easyRNASeq_2.2.1 loaded via a namespace (and not attached): [1] AnnotationDbi_1.28.2 BBmisc_1.9 BatchJobs_1.6 [4] Biobase_2.26.0 BiocGenerics_0.12.1 BiocParallel_1.0.3 [7] Biostrings_2.34.1 DBI_0.3.1 DESeq_1.18.0 [10] GenomeInfoDb_1.2.5 GenomicAlignments_1.2.2 GenomicRanges_1.18.4 [13] IRanges_2.0.1 LSD_3.0 RColorBrewer_1.1-2 [16] RCurl_1.95-4.7 RSQLite_1.0.0 Rsamtools_1.18.3 [19] S4Vectors_0.4.0 ShortRead_1.24.0 XML_3.98-1.3 [22] XVector_0.6.0 annotate_1.44.0 base64enc_0.1-2 [25] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 [28] checkmate_1.6.0 codetools_0.2-9 digest_0.6.8 [31] edgeR_3.8.6 fail_1.2 foreach_1.4.2 [34] genefilter_1.48.1 geneplotter_1.44.0 genomeIntervals_1.22.3 [37] grid_3.1.2 hwriter_1.3.2 intervals_0.15.0 [40] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 [43] limma_3.22.7 magrittr_1.5 parallel_3.1.2 [46] sendmailR_1.2-1 splines_3.1.2 stats4_3.1.2 [49] stringi_0.5-5 stringr_1.0.0 survival_2.37-7 [52] tools_3.1.2 xtable_1.7-4 zlibbioc_1.12.0 Thanks for the help. easyrnaseq error • 1.1k views ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by ea11g00 Answer: Issue with easyRNASeq 0 4.1 years ago by Sweden Nicolas Delhomme320 wrote: Hej! Please upgrade to the latest version of easyRNAseq (version 2.4.4) for R (version 3.2.1). The version you are using is outdated. Let me know if that does not fix your problem. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 06 Jul 2015, at 21:34, ea11g0 [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User ea11g0 wrote Question: Issue with easyRNASeq: > > > Hi, > > So I'm running easyRNASSeq on a couple of bam files that I have, before I do it on all 40 samples in order to get a count table for the samples. However, I keep getting an error and am unsure as to why. If I could get any help that would be great. > > Below is the script that I am using: > > > library(easyRNASeq) > > > setwd("/home/RNA-seq data/") > > > filenames <- dir(".", "bam") > > > bf <- getBamFileList(filenames) > > >anns <- "hg19.gtf" > > >annotParam <- AnnotParam(datasource = anns, type = "gtf") > > >rnaSeqParam <- RnaSeqParam(annotParam = annotParam, countBy=c("exons")) > > > sexp <- simpleRNASeq(bamFiles = bf, param = param, nnodes = 40, verbose = TRUE) > > ========================== > simpleRNASeq version 2.2.1 > ========================== > Creating a SummarizedExperiment. > ========================== > Processing the alignments. > ========================== > Pre-processing 2 BAM files. > Validating the BAM files. > Extracted 93 reference sequences information. > Found 0 single-end BAM files. > Found 2 paired-end BAM files. > Bam file: 01_v2.bam has reads of length 49bp > Bam file: 01_v4.bam has reads of length 49bp > Bam file: 01_v2.bam has 38562408 reads. > Bam file: 01_v4.bam has 37504885 reads. > ========================== > Processing the annotation > ========================== > Validating the annotation source > Read 995 records > Validated a datasource of type gtf > Fetching the annotation > Retrieving annotation from a gtf datasource > Read 2748737 records > > Warning messages: > 1: In FUN(1:2[[2L]], ...) : Bam file: 01_v2.bam is considered unstranded. > 2: In FUN(1:2[[2L]], ...) : > Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded. > 3: In FUN(1:2[[2L]], ...) : Bam file: 01_v4.bam is considered unstranded. > 4: In FUN(1:2[[2L]], ...) : > Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded. > 5: In matrix(unlist(strsplit(some.lines[[9]], " |; *")), ncol = 2, : > data length [29599] is not a sub-multiple or multiple of the number of rows [14800] > 6: In .Method(..., deparse.level = deparse.level) : > number of columns of result is not a multiple of vector length (arg 1) > > > head(assay(sexp)) > 01_v2.bam 01_v4.bam > CCDS 0 0 > ENSE00000327880 0 0 > ENSE00000328922 0 0 > ENSE00000329326 0 0 > ENSE00000330966 0 0 > ENSE00000331689 0 0 > > > When I view the whole table, all the counts are 0, not just the top 5. > > > sessionInfo() > R version 3.1.2 (2014-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_2.2.1 > > loaded via a namespace (and not attached): > [1] AnnotationDbi_1.28.2 BBmisc_1.9 BatchJobs_1.6 > [4] Biobase_2.26.0 BiocGenerics_0.12.1 BiocParallel_1.0.3 > [7] Biostrings_2.34.1 DBI_0.3.1 DESeq_1.18.0 > [10] GenomeInfoDb_1.2.5 GenomicAlignments_1.2.2 GenomicRanges_1.18.4 > [13] IRanges_2.0.1 LSD_3.0 RColorBrewer_1.1-2 > [16] RCurl_1.95-4.7 RSQLite_1.0.0 Rsamtools_1.18.3 > [19] S4Vectors_0.4.0 ShortRead_1.24.0 XML_3.98-1.3 > [22] XVector_0.6.0 annotate_1.44.0 base64enc_0.1-2 > [25] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 > [28] checkmate_1.6.0 codetools_0.2-9 digest_0.6.8 > [31] edgeR_3.8.6 fail_1.2 foreach_1.4.2 > [34] genefilter_1.48.1 geneplotter_1.44.0 genomeIntervals_1.22.3 > [37] grid_3.1.2 hwriter_1.3.2 intervals_0.15.0 > [40] iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 > [43] limma_3.22.7 magrittr_1.5 parallel_3.1.2 > [46] sendmailR_1.2-1 splines_3.1.2 stats4_3.1.2 > [49] stringi_0.5-5 stringr_1.0.0 survival_2.37-7 > [52] tools_3.1.2 xtable_1.7-4 zlibbioc_1.12.0 > > > Thanks for the help. > > > > > > > Post tags: easyRNASeq, error > > You may reply via email or visit Issue with easyRNASeq >

Hi,

Thanks for the reply. Does version 2.4.4 of easyRNASeq install on R 3.1.2?

I don't think I can update the version of R that I have as I am not in control of the UNIX environment that I am using to run R and analyse my RNAseq data.

Thanks

Hej! No it won't - in theory it should, but it has way too many dependencies and that's honestly not worth the effort to try make it work that way. However, even without admin rights, you can install R into your own home directory; there are plenty howto available on the internet. You should also contact your system administrator and ask them to install the latest R version. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 06 Jul 2015, at 22:55, ea11g0 [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User ea11g0 wrote Comment: Issue with easyRNASeq: > > > Hi, > > Thanks for the reply. Does version 2.4.4 of easyRNASeq install on R 3.1.2? > > I don't think I can update the version of R that I have as I am not in control of the UNIX environment that I am using to run R and analyse my RNAseq data. > > Thanks > > > Post tags: easyRNASeq, error > > You may reply via email or visit C: Issue with easyRNASeq >

Hi,

Thanks for that. I have spoken to the administrator and he has managed to install R 3.2.1 onto the system, and this is now my sessionInfo():

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] easyRNASeq_2.4.3     BiocInstaller_1.18.3

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2      futile.logger_1.4.1     GenomeInfoDb_1.4.1
[4] XVector_0.8.0           futile.options_1.0.0    bitops_1.0-6
[7] tools_3.2.1             zlibbioc_1.14.0         biomaRt_2.24.0
[10] annotate_1.46.0         RSQLite_1.0.0           lattice_0.20-31
[13] DBI_0.3.1               parallel_3.2.1          DESeq_1.20.0
[16] genefilter_1.50.0       hwriter_1.3.2           Biostrings_2.36.1
[19] S4Vectors_0.6.1         IRanges_2.2.5           locfit_1.5-9.1
[22] stats4_3.2.1            grid_3.2.1              LSD_3.0
[25] Biobase_2.28.0          AnnotationDbi_1.30.1    XML_3.98-1.3
[28] survival_2.38-1         BiocParallel_1.2.7      limma_3.24.12
[31] latticeExtra_0.6-26     geneplotter_1.46.0      lambda.r_1.1.7
[34] edgeR_3.10.2            intervals_0.15.0        genomeIntervals_1.24.1
[37] Rsamtools_1.20.4        splines_3.2.1           BiocGenerics_0.14.0
[43] xtable_1.7-4            RCurl_1.95-4.7

I have updated the version of easyRNASeq that I have but only seem to be getting version 2.4.3 rather than 2.4.4.

However, even with the updated version, I am still getting the same error messages as before as well as a count table of just 0 for everything.

Thanks

Hej! easyRNASeq version 2.4.5 (it will be version 2.4.6 soon actually) is currently only available through the Bioc SVN repository (http://bioconductor.org/developers/how-to/source-control/). Anyway, easyRNASeq version > 2.4.3 should be built soon and available through the more standard biocLite() hopefully tomorrow. From your description, I would suggest you take a look at this thread: C: easyRNASeq: "Error in IRangesList", where I describe how to create a set of synthetic transcripts, which is essential for proper counting and avoiding multiple counting (i.e. assigning one read to more than one gene or several times for one gene). You should also look up section 7 of the vignette. The vignette is rather outdated, but not that section. Finally, the warning you get seem to indicate an issue with your gtf file. Can you tell me where you got it from, so that I can try to reproduce your error? Thanks, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden ---------------------------------------------------------------

Hi,

I will keep a look out for the new version of easyRNASeq once it becomes accessible through biocLite().

The gtf file was from ENSEMBL. I retried the whole thing with a different gtf file that I got from gencode, and this is the error that I get:

Warning messages:
1: In FUN(X[[i]], ...) : Bam file: 01_v2.bam is considered unstranded.
2: In FUN(X[[i]], ...) :
Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded.
3: In FUN(X[[i]], ...) : Bam file: 01_v4.bam is considered unstranded.
4: In FUN(X[[i]], ...) :
Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded.
5: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg 1)

I have tried to create a counts table using a different method, using GenomicAlignments and using the gtf file from gencode, and I have managed to create a counts table. So not sure if its a problem with my bam files, the gtf files or something else. If you would like anything else, let me know.

Thanks,

Elie

Hej Elie! I would really like to know what is happening, yes :-) Could you share an excerpt of your BAM files (e.g. the header and first 1M lines)? If so, I can provide you an access to upload the data to my box. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 07 Jul 2015, at 16:22, ea11g0 [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User ea11g0 wrote Comment: Issue with easyRNASeq: > > > Hi, > > I will keep a look out for the new version of easyRNASeq once it becomes accessible through biocLite(). > > The gtf file was from ENSEMBL. I retried the whole thing with a different gtf file that I got from gencode, and this is the error that I get: > > Warning messages: > 1: In FUN(X[[i]], ...) : Bam file: 01_v2.bam is considered unstranded. > 2: In FUN(X[[i]], ...) : > Bam file: 01_v2.bam Strandedness could not be determined using 38362 regions spanning 6332885 bp on either strand at a 90% cutoff; 52.12 percent appear to be stranded. > 3: In FUN(X[[i]], ...) : Bam file: 01_v4.bam is considered unstranded. > 4: In FUN(X[[i]], ...) : > Bam file: 01_v4.bam Strandedness could not be determined using 40047 regions spanning 6556771 bp on either strand at a 90% cutoff; 52.21 percent appear to be stranded. > 5: In .Method(..., deparse.level = deparse.level) : > number of columns of result is not a multiple of vector length (arg 1) > > > I have tried to create a counts table using a different method, using GenomicAlignments and using the gtf file from gencode, and I have managed to create a counts table. So not sure if its a problem with my bam files, the gtf files or something else. If you would like anything else, let me know. > > > Thanks, > > Elie > > > Post tags: easyRNASeq, error > > You may reply via email or visit C: Issue with easyRNASeq >

Hi Nico,

Yea it should be no problem to send you an excerpt of the BAM files. Do you want the header and 1M lines for both files?

For the header do you want the output from:

samtools view -H 01_v2.bam

How do I get the first 1M lines for you?

Thanks,

Elie

Great! The command could look like this: samtools view -h 01_v2.bam | head -n 1000000 | samtools view -b -o excerpt01.bam - The first part writes a SAM formatted output including the header which is then piped into head that limits it to the first 1M lines, which is again piped into samtools to create a BAM file out of it. The number of reads won't be exactly a million (due to the header) which might split one PE entry, but that's not a problem. Contact me offline (nicolas.delhomme@umu.se) for the dropbox location - I need your email address :-) Cheers, Nico

Hi Nico,

Thanks for that, I have sent you an email.

Thanks,

Elie

Hej Elie!

Thanks for the data. I figured out what the issue was. Briefly, your BAM files were the culprits. Although every read in your BAM file states that they are paired-end reads (in the flag, the second SAM column), no single reads has any mate information in column 7 to 9. simpleRNASeq autodetects the type of sequencing data (it assumed Paired-End in your case) and uses that information to count reads from valid PE reads (i.e. those that have values in the column 7-9). Hence, with your BAM files, there was no such valid reads and all counts remained 0.

For more details and for a working solution (overriding the simpleRNASeq parameter autodetection), I have put a document there:

https://microasp.upsc.se/root/upscb-public/tree/master/tutorial/easyRNASeq/

called annotation-manipulation-example.R and its html transcript: annotation-manipulation-example.

It details:

1) How I create synthetic transcript from your gtf file - essential step; see the guidelines there: http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis for more details.

2) How I changed the sequence name so that the EnsEMBL annotation I got matched the sequence names in your BAM files

3) How to define the set of parameters to match your BAM files

4) How to override the simpleRNASeq parameter autodetection.

Finally, I have made a few changes to the package to make it more user friendly and these will be available in version 2.4.7; probably after the WE.

HTH,

Cheers,

Nico

Hi Nico,

That's great thanks for figuring out what was wrong for me. I shall have a proper read through the links and the working solution when I get into the office tomorrow, but it is great to finally know what was wrongs.

Thanks,

Elie

Hi Nico,

I have had a quick look at your response, but I can't seem to create a synthetic transcript file. I am following the details on the link you provided, however, this is what I am getting:

> source("https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R")
Error in file(filename, "r", encoding = encoding) :
https:// URLs are not supported

and

> gAnnot <- createSyntheticTranscripts(
filename="Homo_sapiens.GRCh38.80.gtf.gz",
input="gtf",
feature="transcript",
output="GRanges")
Error: could not find function "createSyntheticTranscripts"

Thanks,

Elie

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.3 (Santiago)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] BiocInstaller_1.18.3   curl_0.9.1             genomeIntervals_1.24.1
[4] BiocGenerics_0.14.0    intervals_0.15.0       easyRNASeq_2.4.7

loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-2      futile.logger_1.4.1     GenomeInfoDb_1.4.1
[4] XVector_0.8.0           tools_3.2.1             futile.options_1.0.0
[7] bitops_1.0-6            zlibbioc_1.14.0         biomaRt_2.24.0
[10] annotate_1.46.0         RSQLite_1.0.0           lattice_0.20-31
[13] DBI_0.3.1               DESeq_1.20.0            genefilter_1.50.0
[16] hwriter_1.3.2           Biostrings_2.36.1       S4Vectors_0.6.1
[19] IRanges_2.2.5           stats4_3.2.1            locfit_1.5-9.1
[22] grid_3.2.1              LSD_3.0                 Biobase_2.28.0
[25] AnnotationDbi_1.30.1    XML_3.98-1.3            survival_2.38-3
[28] BiocParallel_1.2.7      limma_3.24.12           latticeExtra_0.6-26
[31] geneplotter_1.46.0      lambda.r_1.1.7          edgeR_3.10.2
[34] Rsamtools_1.20.4        GenomicAlignments_1.4.1 splines_3.2.1
[40] RCurl_1.95-4.7

Hej Elie! Try: install.packages(curl) library(curl) and then source the R script. If that does not work, you can always download the file (just put the URL in your browser) and change the filename in the "source" command to where you have downloaded the file. On a separate note, it would be interesting to know how you generated your BAM files, because it is surprising that the "Paired" flag is set but that no alignments has a mate. Cheers, Nico

Hi Nico,

The output from that is:

description
"https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R"
class
"curl"
mode
"r"
text
"text"
opened
"closed"
"yes"
can write
"no"

And after that, I still get Error: could not find function "createSyntheticTranscripts". I have also tried to download the file itself and using that in the source command, but all I get is:

> source("createSyntheticTranscripts.txt")
Error in source("createSyntheticTranscripts.txt") :
createSyntheticTranscripts.txt:1:1: unexpected input
1: ÿ
^

All the alignment was down by an external company, and from what I can see from the files that came with the aligned BAM files, I think they were done using SOAP aligner.

Sorry for not being able to do this.

Thanks for the help,

Elie

When you have installed curl, you just need to source("https://microasp.upsc.se/root/upscb-public/raw/master/src/R/createSyntheticTranscripts.R") That will probably be irrelevant since the above should work, but it seems that your download was 1) either corrupt or 2) that your web browser saved it with some extra formatting, or as a web page. In most browser, you have an option to "save file as" in which case you should ask for the "page source" format. Otherwise, you can always copy/paste the R code displayed in your browser to a text file and save that file. Cheers, Nico

Hi Nico,

That is great thanks! I have finally managed to get it to work and to get a count table at the end of it.

Thanks a lot for all your help and if I run into any other problems with easyRNASeq, I will let you know.

Thanks,

Elie