Question: RiboSeqR package input file format
0
gravatar for Laura.Fancello
4.5 years ago by
Belgium
Laura.Fancello0 wrote:

Hi, I want to use the riboSeqR package on the output of a TopHat mapping but I have problems getting the right format for this file.

My file looks like this:

@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566
@PG     ID:TopHat       VN:2.0.7        CL:/opt/tophat-2.0.7.Linux_x86_64/tophat        --seed  42      -n      2       -m      1       --no-novel-juncs        --no-novel-indels       --no-coverage-search    --segment-length        25      --transcriptome-index   /srv/data/indexes/gencode.v19.annotation.basic_only     -G      /srv/data/indexes/gencode.v19.annotation.basic_only.gtf -o      /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/tophat_out//2935_6_R98S__A_CCGTCCC_L001_R1_001       -p      2       /srv/data/indexes/hg19.Ensembl_coordinates      /srv/data2/deepseq/codonusage_TGFb/KR20140905/data/clean_fastq/2935_6_R98S__A_CCGTCCC_L001_R1_001.cleaned.fastq.gz
16      chr1    13461   GAAGTAGGCCTCATCCTGACAGGCAGCTGCACCACTGCCTGGCGCTGTGCC
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14888   CGGAGACTTAAATACAGGAGGAAAAAGGCAGGACAGAATTACGAGGTGCTG
0       chr1    14919   GACAGAATTACGAGGTGCTGGCCCAGGGCGGGCAGCGGCCCTTCCTCCTAC
0       chr1    16438   GCTCTACAGTTTGAAAACCACTATTTTATGAACCAAGTAGAACAAGATATT
0       chr1    17487   CCACTCGTCACCCCCTGGCTCCTGGCCTATGTGCTGTACCTGTGTCTGATG

 

If I don't eleiminate the first 27 lines with the headers the file can't be read with the readRibodata function (Reading ribosomal files...Error in `[.data.frame`(read.delim(file, as.is = TRUE, header = header),: undefined columns selected).
If I eliminate these first 27 header lines the file is successfully read but then downstream analyses fail.
I guess this is because I don't have transcripts names in the second column but the chromosomes. How can I obtain them from my Tophat mapping file? Or how can I generate it running TopHat again (the Bowtie option –suppress 1,6,7,8 recommended in the manual doesn't seem to exist umnder tophat).
Thanks a lot for your help!
Laura

 

P.S. Note that the fCs object ( fCs <- frameCounting(riboDat, fastaCDS)) look like this:

> fCs
GRanges object with 1864591 ranges and 1 metadata column:
                                                                                                                seqnames
                                                                                                                   <Rle>
        [1] ENST00000335137.3|ENSG00000186092.4|OTTHUMG00000001094.1|OTTHUMT00000003223.1|OR4F5-001|OR4F5|918|CDS:1-918|
        [2]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [3]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [4]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        [5]   ENST00000423372.3|ENSG00000237683.5|-|-|AL627309.1-201|AL627309.1|2661|UTR5:1-70|CDS:71-850|UTR3:851-2661|
        ...                                                                                                          ...
  [1864587]                                   ENST00000361567.2|ENSG00000198786.2|-|-|MT-ND5-201|MT-ND5|1812|CDS:1-1812|
  [1864588]                                     ENST00000361681.2|ENSG00000198695.2|-|-|MT-ND6-201|MT-ND6|525|CDS:1-525|
  [1864589]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
  [1864590]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
  [1864591]                                   ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141|
                  ranges strand   |     frame
               <IRanges>  <Rle>   | <numeric>
        [1]   [  1, 918]      *   |         0
        [2]   [166, 252]      *   |         0
        [3]   [298, 594]      *   |         0
        [4]   [604, 609]      *   |         0
        [5]   [754, 837]      *   |         0
        ...          ...    ... ...       ...
  [1864587] [1551, 1592]      *   |         2
  [1864588] [ 243,  260]      *   |         2
  [1864589] [  87,  650]      *   |         2
  [1864590] [ 813,  857]      *   |         2
  [1864591] [1134, 1141]      *   |         2
  -------
  seqinfo: 95309 sequences from an unspecified genome; no seqlengths

Slot "replicates"
[1] WT M
Levels: M WT

Hits array; dimension  1864591 2 3 6
Unique hits array; dimension  1864591 2 3 6 fCs

And the fs object (fS <- readingFrame(rC = fCs)) looks like this:
 

> fS
         26 27 28 29 30
          0  0  0  0  0
          0  0  0  0  0
          0  0  0  0  0
frame.ML  0  0  0  0  0

 

riboseqr • 845 views
ADD COMMENTlink modified 4.5 years ago by Thomas J Hardcastle180 • written 4.5 years ago by Laura.Fancello0

Why don't you just align using bowtie, rather than tophat? Tophat is intended to find alternate splicing, which IIRC doesn't occur with rRNAs, so I don't see the rationale for using tophat anyway.

ADD REPLYlink written 4.5 years ago by James W. MacDonald49k
Answer: RiboSeqR package input file format
1
gravatar for Thomas J Hardcastle
4.5 years ago by
United Kingdom
Thomas J Hardcastle180 wrote:

Hi Laura,

riboSeq is expecting the output from an alignment to the *transcriptome*, not the genome. If you align to the transcriptome using Bowtie with the suppress options suggested it should all go smoothly.

 

Best wishes,

 

Tom Hardcastle

 

--
Dr. Thomas J. Hardcastle
Senior Research Associate

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom
ADD COMMENTlink written 4.5 years ago by Thomas J Hardcastle180

Dear Thomas thanks for your answer.

I still have some questions about riboSeqR:

In the filterHits function the option hitMean is not really clear to me. I understood that a minimum number of reads (ribosome footprints) for ONE coding region are required to consider that coding region translated. I agree with the reasoning but how to decide the threshold? If the threshold is too stringent we might loose coding sequences which have very few ribosome footprints  only because the sequencing depth is not enough. Do you see what I mean? If you have tried to play around with this threshold do you have any recommendation on how to use it?

Thanks!

Laura

ADD REPLYlink written 4.5 years ago by Laura.Fancello0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 106 users visited in the last hour