Question

RiboSeqR package input file format

0

Entering edit mode

Laura.Fancello • 0

@laurafancello-6772

Last seen 9.3 years ago

Belgium

Hi, I want to use the riboSeqR package on the output of a TopHat mapping but I have problems getting the right format for this file.

My file looks like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:TopHat       VN:2.0.7        CL:/opt/tophat-2.0.7.Linux_x86_64/tophat        --seed  42      -n      2       -m      1       --no-novel-juncs        --no-novel-indels       --no-coverage-search    --segment-length        25      --transcriptome-index   /srv/data/indexes/gencode.v19.annotation.basic_only     -G      /srv/data/indexes/gencode.v19.annotation.basic_only.gtf -o      /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/tophat_out//2935_6_R98S__A_CCGTCCC_L001_R1_001       -p      2       /srv/data/indexes/hg19.Ensembl_coordinates      /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/clean_fastq/2935_6_R98S__A_CCGTCCC_L001_R1_001.cleaned.fastq.gz
16      chr1    13461   GAAGTAGGCCTCATCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCC
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14919   GACAGAATTACGAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTTCCTCCTAC
0       chr1    16438   GCTCTACAGTTTGAAAACCACTATTTTATGAACCAAGTAGAACAAGATATT
0       chr1    17487   CCACTCGTCACCCCCTGGCTCCTGGCCTATGTGCTGTACCTGTGTCTGATG

If I don't eleiminate the first 27 lines with the headers the file can't be read with the readRibodata function (Reading ribosomal files...Error in `[.data.frame`(read.delim(file, as.is = TRUE, header = header),: undefined columns selected).
If I eliminate these first 27 header lines the file is successfully read but then downstream analyses fail.
I guess this is because I don't have transcripts names in the second column but the chromosomes. How can I obtain them from my Tophat mapping file? Or how can I generate it running TopHat again (the Bowtie option –suppress 1,6,7,8 recommended in the manual doesn't seem to exist umnder tophat).
Thanks a lot for your help!
Laura

P.S. Note that the fCs object ( fCs <- frameCounting(riboDat, fastaCDS)) look like this:

> fCs
GRanges object with 1864591 ranges and 1 metadata column:
                                                                                                                seqnames
                                                                                                                   <Rle>
        [1] ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|
        [2]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [3]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [4]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [5]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        ...                                                                                                          ...
  [1864587]                                   ENST00000361567.2|ENSG00000198786.2|-|-|MT-ND5-201|MT-ND5|1812|CDS:1-1812|
  [1864588]                                     ENST00000361681.2|ENSG00000198695.2|-|-|MT-ND6-201|MT-ND6|525|CDS:1-525|
  [1864589]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
  [1864590]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
  [1864591]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
                  ranges strand   |     frame
               <IRanges>  <Rle>   | <numeric>
        [1]   [  1, 918]      *   |         0
        [2]   [166, 252]      *   |         0
        [3]   [298, 594]      *   |         0
        [4]   [604, 609]      *   |         0
        [5]   [754, 837]      *   |         0
        ...          ...    ... ...       ...
  [1864587] [1551, 1592]      *   |         2
  [1864588] [ 243,  260]      *   |         2
  [1864589] [  87,  650]      *   |         2
  [1864590] [ 813,  857]      *   |         2
  [1864591] [1134, 1141]      *   |         2
  -------
  seqinfo: 95309 sequences from an unspecified genome; no seqlengths

Slot "replicates"
[1] WT M
Levels: M WT

Hits array; dimension  1864591 2 3 6
Unique hits array; dimension  1864591 2 3 6 fCs

And the fs object (fS <- readingFrame(rC = fCs)) looks like this:

> fS
         26 27 28 29 30
          0  0  0  0  0
          0  0  0  0  0
          0  0  0  0  0
frame.ML  0  0  0  0  0

riboSeqR • 1.6k views

ADD COMMENT • link updated 9.5 years ago by Thomas J Hardcastle ▴ 180 • written 9.5 years ago by Laura.Fancello • 0

0

Entering edit mode

Why don't you just align using bowtie, rather than tophat? Tophat is intended to find alternate splicing, which IIRC doesn't occur with rRNAs, so I don't see the rationale for using tophat anyway.

ADD REPLY • link 9.5 years ago James W. MacDonald 65k

score 1 · Answer 1 · 2014-10-07

1

Entering edit mode

Thomas J Hardcastle ▴ 180

@thomas-j-hardcastle-3860

Last seen 6.5 years ago

United Kingdom

Hi Laura,

riboSeq is expecting the output from an alignment to the *transcriptome*, not the genome. If you align to the transcriptome using Bowtie with the suppress options suggested it should all go smoothly.

Best wishes,

Tom Hardcastle

--
Dr. Thomas J. Hardcastle
Senior Research Associate

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom

ADD COMMENT • link 9.5 years ago Thomas J Hardcastle ▴ 180

0

Entering edit mode

Dear Thomas thanks for your answer.

I still have some questions about riboSeqR:

In the filterHits function the option hitMean is not really clear to me. I understood that a minimum number of reads (ribosome footprints) for ONE coding region are required to consider that coding region translated. I agree with the reasoning but how to decide the threshold? If the threshold is too stringent we might loose coding sequences which have very few ribosome footprints only because the sequencing depth is not enough. Do you see what I mean? If you have tried to play around with this threshold do you have any recommendation on how to use it?

Thanks!

Laura

ADD REPLY • link 9.5 years ago Laura.Fancello • 0