.fastq to .txt conversion for the edgeR package and merging two paired-end sequence files
@hamidrezarazzaghian-9208
Last seen 6 days ago
Canada

Dear all,

I am a postdoc at the University of British Columbia, Canada, and I'm pretty new to RNA-seq data analysis. I want to do TMM normalization on my RNA-seq data using the edgeR package in R. I have two questions:

1) How can I convert .fastq files to .txt files so that I can feed them into the edgeR package?

2) My RNA-seq data are paired-end .fastq files. What quality control should I do on them, and how should I merge them prior to analysis?

 

Thanks for the help,

Hamid

 

EdgeR normalization fastq txt TMM • 2.1k views
@james-w-macdonald-5106
Last seen 4 hours ago
United States

You don't feed FASTQ files to edgeR. You first have to align against the genome of your species and then get counts per gene, which is what you then feed into edgeR. For that you could use something like the Rsubread package. It has a User's guide, so I would start there.
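To make the last step concrete, here is a minimal sketch of what edgeR expects once you have gene-level counts. The counts are simulated stand-ins (the gene names, sample names and group labels are made up for illustration), and the tab-delimited file shows the kind of .txt table that edgeR can start from. It is only one reasonable way to set this up, not the required one.

```r
library(edgeR)

# Simulated gene-level counts stand in for real output from a read counter
# (1000 hypothetical genes, 6 hypothetical samples).
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 100, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("sample", 1:6)))

# Write and re-read a tab-delimited count table: genes in rows, samples in columns.
count_file <- tempfile(fileext = ".txt")
write.table(counts, count_file, sep = "\t", quote = FALSE, col.names = NA)
counts_in <- as.matrix(read.delim(count_file, row.names = 1, check.names = FALSE))

# Hypothetical design: three control and three treated samples.
group <- factor(rep(c("control", "treated"), each = 3))
dge <- DGEList(counts = counts_in, group = group)

# Drop lowly expressed genes, then compute TMM normalization factors.
keep <- filterByExpr(dge)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge, method = "TMM")

dge$samples          # library sizes and TMM normalization factors per sample
tmm_cpm <- cpm(dge)  # counts per million based on TMM-adjusted library sizes
```

Note that TMM does not change the counts themselves. It stores a scaling factor per sample that edgeR uses in downstream modelling and in functions such as cpm().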

@hamidrezarazzaghian-9208
Last seen 6 days ago
Canada

Thanks, James, for the fast reply. Unfortunately, Rsubread is not available in Windows-based R. Do you know of any other package for this purpose?

Thanks


As Martin noted, you can use Rbowtie, but that is for the original bowtie aligner, which doesn't do gapped alignments. If you are doing RNA-Seq you probably want bowtie2, which does do gapped alignments. You can run bowtie2 on Windows, so that is probably the best bet, but you have to run it from the command line, not from within R.

Most aligners assume you are using some sort of Linux variant, so you are somewhat hamstrung by being on Windows. But Linux is free after all, and it's relatively simple to set up a dual-boot Ubuntu/Windows system on your computer, so if you are serious that might be something to consider.

One thing about kallisto and sleuth (and salmon or sailfish and sleuth while we are at it). These packages are intended to make comparisons at the transcript level, rather than the gene level. Since part of the alignment process is to infer which transcript a read came from, there is additional uncertainty in your count measurement that you have to account for when fitting a model. This has two downsides. First, that additional uncertainty has a cost, which is a reduction in power to detect differences. Second, you shouldn't use something like edgeR or DESeq2 for transcript-level counts because the model they fit doesn't account for that uncertainty, so you have to use something like sleuth (either Lior Pachter's version or the patched version from Rob Patro's group) to fit the model. And sleuth is just a GitHub package now, so you are pretty much on your own if you want to go that route.

As an (apparent) beginner, you are probably better off just getting bowtie2 and going from there.
 


AFAIK "gapped alignments" in bowtie2 means indels, not junctions, so bowtie2 is not suited for RNA-seq. The original bowtie only supported mismatches, no indels.

H.


Hi Hervé,

Thanks for pointing that out. I naively thought that 'gapped alignment' was more or less a consistently applied term, but obviously not so much.


The Rbowtie package wraps an (older?) version of the Bowtie aligner, but probably most people use alignment tools outside R. The airway vignette and the differential expression workflow describe overall approaches that go from FASTQ to count matrices via whole-genome alignment. kallisto is a different and fast, though not cross-platform, approach; see SummarizedExperiment::readKallisto() in addition to the GitHub sleuth package.

The poster has FASTQ files, but needs alignment (BAM) files before trying to count reads; b.nota's efforts would only be relevant after alignment. Ways to summarize aligned reads to counts across platforms and in R include bamsignals or perhaps GenomicAlignments::summarizeOverlaps().
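As a rough sketch of the summarizeOverlaps() route (the function is exported by the GenomicAlignments package in current Bioconductor releases), something along these lines could work once BAM files exist. The helper name and both of its arguments are hypothetical: `txdb` is assumed to be a TxDb annotation object matching the genome you aligned to, and `bam_paths` a character vector of paired-end BAM files. The settings assume an unstranded library, so adjust them to your protocol.

```r
# Sketch only: `txdb` and `bam_paths` are placeholders supplied by the caller.
count_with_summarizeOverlaps <- function(txdb, bam_paths) {
  # Group exons by gene so that reads are counted at the gene level.
  exons_by_gene <- GenomicFeatures::exonsBy(txdb, by = "gene")

  # Read the BAM files in chunks to keep memory use bounded.
  bams <- Rsamtools::BamFileList(bam_paths, yieldSize = 2e6)

  # singleEnd = FALSE counts each read pair once as a fragment;
  # ignore.strand = TRUE assumes an unstranded library (an assumption here).
  se <- GenomicAlignments::summarizeOverlaps(
    features = exons_by_gene, reads = bams, mode = "Union",
    singleEnd = FALSE, fragments = TRUE, ignore.strand = TRUE)

  # Gene-by-sample count matrix, suitable as input to edgeR::DGEList().
  SummarizedExperiment::assay(se)
}
```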


I counted the reads once in R with a self-made script using the libraries IRanges, GenomicRanges, and Rsamtools. However, if you are pretty new to RNA-seq, I would not recommend trying this yourself. It was pretty hard to do.

I think the easiest way for you to get your counts is to install a virtual machine with Ubuntu and try featureCounts in Rsubread there.

 


"Unfortunately [Rsubread] is not available in Windows-based R."

Rsubread has been available in R for Windows for a few years now.

@hamidrezarazzaghian-9208
Last seen 6 days ago
Canada

Thanks everyone for all the help.

@gordon-smyth
Last seen 18 minutes ago
WEHI, Melbourne, Australia

You can follow one of the example workflows, for example:

or else follow the edgeR User's Guide.

Personally I use Rsubread::align followed by Rsubread::featureCounts to generate counts as input to edgeR, and that works just fine on a Windows laptop running R for Windows. Rsubread takes about 20 minutes on my Windows 10 laptop to align FASTQ files containing about 20 million paired-end reads.
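In outline, that pipeline might look like the sketch below. All file names, the index name, the thread count and the wrapper function `run_rsubread` are hypothetical placeholders, so substitute your own reference genome, annotation and reads. Note that the two mate files of a paired-end sample are not merged beforehand: they are passed side by side as `readfile1` and `readfile2`.

```r
# Hypothetical inputs: replace with your own reference, annotation and reads.
genome_fasta   <- "genome.fa"
annotation_gtf <- "genes.gtf"
read1 <- c("sampleA_R1.fastq.gz", "sampleB_R1.fastq.gz")  # first mates
read2 <- c("sampleA_R2.fastq.gz", "sampleB_R2.fastq.gz")  # second mates
bam   <- c("sampleA.bam", "sampleB.bam")

run_rsubread <- function() {
  # Build the index once per reference genome.
  Rsubread::buildindex(basename = "genome_index", reference = genome_fasta)

  # Align each pair of FASTQ files; one BAM file is written per sample.
  Rsubread::align(index = "genome_index", readfile1 = read1, readfile2 = read2,
                  output_file = bam, nthreads = 4)

  # Count fragments per gene using the GTF annotation.
  fc <- Rsubread::featureCounts(files = bam, annot.ext = annotation_gtf,
                                isGTFAnnotationFile = TRUE, isPairedEnd = TRUE,
                                nthreads = 4)

  # Gene-by-sample count matrix, ready for edgeR::DGEList().
  fc$counts
}
```

Calling run_rsubread() performs the indexing, alignment and counting, so expect it to take a while on real data.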

See also

Liao Y, Smyth GK, Shi W (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research 47(8), e47.

or see

Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME (2016). RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Research 5, 1408.

for a different workflow using limma and edgeR.
