Question

RNA-seq analysis with R/BioC: alignment and read counts

0

Entering edit mode

Fiona ▴ 70

@fiona-5790

Last seen 8.1 years ago

United Kingdom

Hi everyone,

I'm analyzing my first paired-end RNA-seq data in R/BioConductor and I'm having some trouble. I've given detail below but please let me know if I need to give more information. I'd be really grateful for any help or pointers.

I downloaded the D. melanogaster reference genome (dm6, the most recent version) from the UCSC server, and used SAMtools and Picard to create a reference genome .fa file. I carried out my alignment to this reference file with R BioC 'QuasR' package, command qAlign, and this gave me .bam files for all my samples. I used:

proj <- qAlign(sampleFile,genomeFile,cacheDir=results)

with sampleFile and genomeFile providing the paths to the sample and reference .fa files, respectively. This step seemed to go fine.

Now I am trying to generate read counts with R BioC 'GenomicFeatures' package. Following the package notes and 'how to' guide, I used:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz",format="gff")

exbygene <- exonsBy(txdb, by="gene")

I get a warning message for makeTxDbFromGFF, stating that:

Warning messages:
1: In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could
  not be inferred (either because the exons are not on the same
  chromosome/strand or because they are not separated by introns):
  FBtr0084079, FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083,
  FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have CDSs that
  cannot be mapped to an exon: FBtr0100857, FBtr0100861, FBtr0100863,
  FBtr0100866, FBtr0100868, FBtr0100870, FBtr0100883, FBtr0433498,
  FBtr0433502

The exonsBy command runs fine though, so I tried to continue with qCount in 'QuasR' with this code:

anycounts <- qCount(proj, txdb, reportLevel="gene", orientation="any")

and I get the following error:

Error in qCount(proj, txdb, reportLevel = "gene", orientation = "any") :
  sequence levels in 'query' not found in alignment files: 2L, 2R, 3L, 3R, 4, X, Y, Unmapped_Scaffold_8_D1580_D1567, 211000022278158, 211000022278279, 211000022278282, 211000022278298, 211000022278307, 211000022278309, 211000022278436, 211000022278449, 211000022278498, 211000022278522, 211000022278603, 211000022278604, 211000022278664, 211000022278724, 211000022278750, 211000022278760, 211000022278875, 211000022278877, 211000022278878, 211000022278879, 211000022278880, 211000022278985, 211000022279055, 211000022279108, 211000022279132, 211000022279134, 211000022279165, 211000022279188, 211000022279222, 211000022279264, 211000022279342, 211000022279392, 211000022279446, 211000022279528, 211000022279529, 211000022279531, 211000022279555, 211000022279681, 211000022279708, 211000022280133, 211000022280328, 211000022280341, 211000022280347, 211000022280481, 211000022280494, 211000022280645, 211000022280703, mitochondrion_genome, rDNA

As far as my beginner knowledge can tell, the reference I used to create .bam files is not matching up with the reference I am using to count reads against. Is this the case?

If so, I understand why that would be a problem, but I don't understand how the problem has occurred, nor how to fix it. As far as I can tell, I used the same reference for both of these steps (in the form of the .fa file for alignment and the .gff to make the gene model). Can anyone help me solve this one?

Thanks in advance for any help,

Fiona

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Scientific Linux release 6.6 (Carbon)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ShortRead_1.26.0        GenomicAlignments_1.4.1 BiocParallel_1.2.3     
 [4] Rsamtools_1.20.4        Biostrings_2.36.1       XVector_0.8.0          
 [7] rtracklayer_1.28.4      edgeR_3.10.2            limma_3.24.10          
[10] DESeq_1.20.0            lattice_0.20-31         locfit_1.5-9.1         
[13] GenomicFeatures_1.20.1  AnnotationDbi_1.30.1    Biobase_2.28.0         
[16] QuasR_1.8.3             Rbowtie_1.8.0           GenomicRanges_1.20.5   
[19] GenomeInfoDb_1.4.0      IRanges_2.2.4           S4Vectors_0.6.0        
[22] BiocGenerics_0.14.0    

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.18.3 RColorBrewer_1.1-2   futile.logger_1.4.1
 [4] bitops_1.0-6         futile.options_1.0.0 tools_3.2.0         
 [7] zlibbioc_1.14.0      biomaRt_2.24.0       annotate_1.46.0     
[10] RSQLite_1.0.0        BSgenome_1.36.0      DBI_0.3.1           
[13] genefilter_1.50.0    hwriter_1.3.2        grid_3.2.0          
[16] XML_3.98-1.2         survival_2.38-2      latticeExtra_0.6-26
[19] geneplotter_1.46.0   lambda.r_1.1.7       splines_3.2.0       
[22] xtable_1.7-4         RCurl_1.95-4.6

drosophila melanogaster quasR genomicfeatures genomicranges alignment • 3.2k views

ADD COMMENT • link updated 8.8 years ago by Hervé Pagès 16k • written 8.8 years ago by Fiona ▴ 70

score 0 · Answer 1 · 2015-07-13

Hi Fiona,

My understanding from the information that you've given is, that the seqlevelsStyle of the GFF files is not UCSC and that is giving rise
to these issues. I suggest using the following steps

a) read in the GFF file as a GRanges
b) read in one of the bam files as a GenomicAlignments.
c) check if both of them have the same chromosome naming convention ?
d) convert everything to have same naming convention
e) then create the txdb object.

I have shown some code below - but have used test files which you can
replace with your own GFF and BAM files

Test GFF file

library(rtracklayer)
test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")

Test Bam file

library(pasillaBamSubset)
fl <- untreated3_chr4()   # filename as a character

Here is an example of how to read in a GFF file as a GRanges -

library(rtracklayer)
gff_gr <- import(test_gff3, format="gff")
seqlevelsStyle(gff_gr)

Read in your BAM files into R with -

library(GenomicAlignments) 
bamData <- readGAlignmentsList(fl)
seqlevelsStyle(bamData)

Checking step:

if(seqlevelsStyle(galist1)==seqlevelsStyle(gff_gr))
{
    # create the txdb
    txdb <- makeTxdbFromGRanges(gff_gr) # should work!
}

This is how you change the chromosome naming style to "UCSC"

library(GenomeInfoDb)
seqlevelsStyle(gff_gr) <- "UCSC"

then run

txdb <- makeTxdbFromGRanges(gff_gr)

Hope that helps!
Sonali.

score 0 · Answer 2 · 2015-07-13

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

Hi Fiona,

I don't want to confuse you with more alternatives, but there is a complete start to finish case study of analysing D. melanogaster RNA-seq data in the last chapter of the limma User's Guide:

http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

This example uses featureCounts() in the Rsubread package to count the reads. It may be worth considering.

ADD COMMENT • link 8.8 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks very much for the suggestion - I didn't realize limma could be used for RNAseq as well as microarray analysis, so this will definitely be worth me trying out.

ADD REPLY • link 8.8 years ago Fiona ▴ 70

score 0 · Answer 3 · 2015-07-13

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 4 days ago

Seattle, WA, United States

Hi Fiona,

I suppose you got dmel-all-r6.05.gff.gz from FlyBase. AFAIK it contains annotations based on reference genome BDGP Release 6 which is indeed the same assembly as dm6 from UCSC. However, as Sonali suggested, UCSC doesn't use the same chromosome naming convention as FlyBase. An easy way to address this is to grab the gene/transcript/exon models from UCSC. That is, just replace:

txdb <- makeTxDbFromGFF("dmel-all-r6.05.gff.gz", format="gff")

with

txdb <- makeTxDbFromUCSC(genome="dm6", tablename="refGene")

See ?makeTxDbFromUCSC for more information.

Hope this helps,

H.

ADD COMMENT • link 8.8 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks for pointing this out - I didn't know about the makeTxDbFromUCSC function, and it fixes the problem.

ADD REPLY • link 8.8 years ago Fiona ▴ 70