Hi Jo?o:
You should use a lower consensus cutoff when using Rsubread to
map your 36bp reads. The default setting of requiring at least 3
consensus subreads should only be used when reads are at least 45
bases long (so that at least 10 subreads can be extracted from each
read). You can use a consensus threshold of 2 (TH1=2) to map your
data. This should be able to get more reads mapped at a quite low
mapping error rate.
Attached is the C program for retrieving substrings from each
read. Use this command to compile it on a linux computer:
> gcc -g -o get_substrings get_substrings.c
Typing ./get_substrings will tell you help to use it. It only
works with a FASTQ file. The output will be a text file in which each
line is a extracted substring.
After you got the output file, for example a file called
"output.txt", issue the following command to get the frequencies of
distinct substrings:
> sort output.txt | uniq -c > frequency.txt
If your reads include adapter sequence, you should see a 14mer
sequence in your "frequency.txt" file which has a very big number.
Hope these will work for you, but let me know they don't.
Cheers,
Wei
On Apr 22, 2011, at 4:11 AM, Jo?o Moura wrote:
> Dear Wei Shi,
>
> Is it possible to send me the C code you talked about?
>
> Just another note, using bowtie with the full reads I get 66% while
using Rsubread I get only 55%. Maybe the reason is that bowtie is
reporting the alignments saying:
>
> "reads with at least one reported alignment"
>
> While maybe Rsubread just gets one hit per read?
>
> Thanks a lot for your help,
>
> On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi at="" wehi.edu.au=""> wrote:
> Dear Jo?o:
>
> You got to be certain whether the first 14 bases of your reads
are adapters or not before you can trim them. You can do this by
visual inspection on a small set of reads. But a better way will be to
extract all first 14 bases from each read and and then look at the
frequency of these 14mer sequences. I wrote a C program before to
retrieve barcode sequences for each read, which might be useful for
your data as well. This program has not been included in Rsubread
package yet, but I will be happy to send you a copy of it if you want
to use it.
>
> Cheers,
> Wei
>
> On Apr 21, 2011, at 10:44 AM, Jo?o Moura wrote:
>
>> Dear Wei Shi,
>>
>> First of all, thanks a lot for you help. Maybe I was doing it wrong
also with bowtie, but the truth is I'm a master student, thus quite
armature in all of this.
>>
>> Well, the reason why I trimmed my reads in first place was that
sequence content across all bases is biased (which i think is a common
problem in RNAseq). As you can see in the picture I send you, the
first 14/15 position are biased (which I think is due to sequencing
adapters?) So, the reason I followed was to trim the reads in order to
be able to map the 22 bases that otherwise would be unmapped because
of the adapters or other artifacts in the first 14/15 of the reads of
the reads. But, since you say that 16bp is the minimum requirement for
Rsubread, it seems that this problem will be solved using your
software, right?
>>
>> Do you know if bowtie also uses this approach of subseting the
reads? Because if I trim the reads I get an improvement of ~15% on the
alignments.
>>
>> Once again, thank you very much for your help.
>>
>> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi at="" wehi.edu.au="">
wrote:
>> Dear Jo?o:
>>
>> The default setting of Rsubread alignment requires at least 3
subreads (16bp substrings extracted from the read) to be mapped to the
same location to report a hit. Subreads are extracted at a gap of 2bp
from the read sequence, i.e. neighboring subreads are 2bp apart. 3
passes of mapping will be performed for mapping three sets of subreads
respectively, which are extracted from different start positions of
the read (1st base, 2nd base and 3rd base).
>>
>> Therefore, for your trimmed reads which are 20bp long, there
will be only up to 2 subreads extracted from each read and they were
used for mapping. However, because the consensus cutoff (minimal
number of consensus subreads needed to report a hit) is 3 by default
(which I believe is the value used in your analysis), no hits can be
reported because of insufficient number of extracted subreads.
>>
>> DNA sequences of 20 bases are highly repetitive in human or
mouse genome. But I do not know if this is the case for yeast genome.
However with shorter read, you are more likely to get ambiguous
mapping locations.
>>
>> I would suggest you to use the full read sequences for mapping
to take advantage of all the base information in the reads. When
mapping 36bp reads, Rsubread can extract 7 subreads from each read.
This gives a lot more power for finding mapping locations in a
sensitive and accurate way. You can use a consensus threshold of 2
(TH1=2) to call mapping locations, which we found works quite well.
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Apr 21, 2011, at 12:31 AM, Jo?o Moura wrote:
>>
>>> Dear Wei Shi,
>>>
>>> Indeed I can run the example of the vignette but not my own
trimmed reads. With bowtie I get more than 80% but with RSubread i get
0%. I think I got why. the thing is I'm trimming the reads in R and
trying to realign them afterwards. But when I'm triming, I trimmed the
quality and DNA strings. So, a line like this:
>>>
>>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36
>>>
>>>
>>> Will contain "length=36" all the time, even if my reads and length
20. Do you think this might be a problem? Because if I run with
untrimmed reads I get some reasonable results.
>>>
>>> I sent you a untrimmed version and a trimmed version. Maybe you
can try...
>>>
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi at="" wehi.edu.au="">
wrote:
>>> Dear Joao:
>>>
>>> Can you run the example enclosed with Rsubread package?
>>>
>>> Cheers,
>>> Wei
>>>
>>> On Apr 20, 2011, at 3:23 AM, Jo?o Moura wrote:
>>>
>>>> Dear Wei Shi,
>>>>
>>>> I'm trying out the Rsubread and I'm having strange results. That
is I firstly built the reference genome, use a fasta file like this:
>>>>
>>>> [joao at goedel genomes]$ grep ">" yeast.fna
>>>> >Scchr01 1 - 230208
>>>> >Scchr02 1 - 813178
>>>> >Scchr03 1 - 316617
>>>> >Scchr04 1 - 1531917
>>>> >Scchr05 1 - 576869
>>>> >Scchr06 1 - 270148
>>>> >Scchr07 1 - 1090946
>>>> >Scchr08 1 - 562643
>>>> >Scchr09 1 - 439885
>>>> >Scchr10 1 - 745667
>>>> >Scchr11 1 - 666454
>>>> >Scchr12 1 - 1078175
>>>> >Scchr13 1 - 924429
>>>> >Scchr14 1 - 784333
>>>> >Scchr15 1 - 1091289
>>>> >Scchr16 1 - 948062
>>>> >Scmito 1 - 85779
>>>>
>>>> In R I ran the following code:
>>>>
>>>> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")
>>>> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")
>>>> buildindex(basename = file.path(path, "reference_index"),
reference = ref)
>>>> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4",
patt="fastq$", full=TRUE)
>>>> align(index = file.path(path, "reference_index"), readfile1 =
reads[[1]],output_file = file.path(path, "alignResults.SAM"))
>>>>
>>>> 11152683 reads were processed in 121.8 seconds.
>>>> Percentage of successfully mapped reads is 0.00%.
>>>>
>>>>
>>>>
>>>> Completed successfully.
>>>>
>>>>
>>>> If I do the same with bowtie I get more than 70%..What am I doing
wrong here? Thanks a lot!
>>>>
>>>> --
>>>> Jo?o Moura
>>>
>>>
>>>
______________________________________________________________________
>>> The information in this email is confidential and intended solely
for the addressee.
>>> You must not disclose, forward, print or use it without the
permission of the sender.
>>>
______________________________________________________________________
>>>
>>>
>>>
>>> --
>>> Jo?o Moura
>>> <thousand.fastq><untri_thousand.fastq>
>>
>>
>>
______________________________________________________________________
>> The information in this email is confidential and intended solely
for the addressee.
>> You must not disclose, forward, print or use it without the
permission of the sender.
>>
______________________________________________________________________
>>
>>
>>
>> --
>> Jo?o Moura
>
>
>
______________________________________________________________________
> The information in this email is confidential and intended solely
for the addressee.
> You must not disclose, forward, print or use it without the
permission of the sender.
>
______________________________________________________________________
>
>
>
> --
> Jo?o Moura
______________________________________________________________________
The information in this email is confidential and intended solely for
the addressee.
You must not disclose, forward, print or use it without the permission
of the sender.
______________________________________________________________________