FourCSeq - how many bases to trim with countFragmentOverlaps
4
0
Entering edit mode
@francesclopez-8907
Last seen 5.8 years ago
United States

Hi,

A bit of background. We used the following protocol:

3C RE is HindIII with the following seq A^AGCTT

4C RE is DpnII with the following seq ^GATC

exptData <- SimpleList(projectPath =projectPath, fragmentDir = "re_fragments", referenceGenomeFile = referenceGenomeFile, reSequence1 = "AAGCTT", reSequence2 = "GATC", primerFile = primerFile, bamFilePath = bamFilePath)

Bait primer was trimmed before aligning the reads to the reference (76bp -> 51bp reads). After mapping only reads with MPQ >20 were selected.

It is unclear to me what command and options I should use when counting overlaps. countFragmentOverlaps? or countFragmentOverlapsSecondCutter? and the bases to trim?

I used different trimming values. WIth default values I get the following scatter plot.

fc <- countFragmentOverlaps(fc) fc <- combineFragEnds(fc) fc <- smoothCounts(fc)  plotScatter(fc[,c("chr14pv_A1", "chr14pv_A2")], xlab="Rep1", ylab="Rep2", asp=1)

When trimming 1bp I am getting the following.

Trimming 2bp ..... More than 2bp barely produces any dots.

Thanks!

fourcseq • 1.1k views
1
Entering edit mode
felix.klein ▴ 150
@felixklein-6900
Last seen 3.1 years ago
Germany

Hello Francesc,

do your sequences contain AAGCTT? In that case you have to set trim to 6 so that the complete Sequence at the beginning is trimmed off. If the AAGCTT has been trimmed and reads start right next to the cutting site than trim=0 is the correct setting.

From your plot it looks like there might be some differences in where the reads have been trimmed. Some might be off by one base. You can check this in a genome browser by looking at the start positions of the reads for some examples. If you set save=TRUE in the addFragments function, FourCSeq produces bed files of the restriction enzymes in the fragmentDir folder that you can visualize along with the reads to find the relative positions.

The countFragmentOverlapsSecondCutter has to be used for counting if your sequencing was performed from the direction of the second cutter. In this case the overlap with the restriction enzyme positions of the second cutter is used for counting and the read has to overlap the restriction site of the second cutter completely. In you case the reads would have to contain the whole GATC sequence. If this sequence has been trimmed in the preprocessing, than you have to use the default setting extend=TRUE, so that the read starts are extended by the length of the restriction enzyme sequence.

In the second plot there are 3 dense regions which seems weird. I have never seen this before. It only reminds me of copy number plots from colleagues along the genome. Is it possible that there are copy number variations in the replicates? Or that different parts have been amplified?

Best regards,

Felix

1
Entering edit mode
felix.klein ▴ 150
@felixklein-6900
Last seen 3.1 years ago
Germany

Hello,

you just need to specify trim=4 in the countFragmentOverlaps function. Then the reads will be trimmed correctly. FourCSeq takes the orientation of the read into account. Therefore the reads on the reverse strand will be trimmed from the end and all reads will be taken into account.

If you have further questions, just let me know.

Best regards,

Felix

0
Entering edit mode
@francesclopez-8907
Last seen 5.8 years ago
United States

Thanks Felix,

I think I figure out what was going on. The 3' end of the primer overlaps with the restriction site, except for 1bp.

5' .....GAGGAAGCT 3' RE          AAGCTT

Thus, I only need to trim 1 bp. I am still not sure why not trimming produces those plots.

I started over the whole process. This time I classified and trimmed the primer using a method a bit more flexible. I used fastx toolkit barcode splitter to allow some variation in the primer sequence.

I still get the 3 dense regions in all the plots. I don't know why.

Thank you vey much

Francesc

0
Entering edit mode
cvicgar • 0
@cvicgar-10149
Last seen 5.3 years ago

Hi!

I've (very) recently joined the Bioinformatics world so I hope the following question is not too dumb... I got to this post after looking for the trimming step in the countFragmentOverlaps function. In my case, reads should start after the first enzyme restriction site, GATC. When I look at the .fq file, that is certainly the case. For instance:

@FCD2CYUACXX:7:2303:5605:72503#0/1

GATCCCAATTGTCAGTGTCTTAGTTA

+

iiiiiihihifghhhceffhifhfhh

Then I mapped the reads with bowtie (with the sam output option so I could then convert it to .bam and use it with FourCSeq). For the particular read in the example above, I got the following line in the .bam file:

FCD2CYUACXX:7:2303:5605:72503#0/1    16    chr10    109736641    255    26M    *    0    0    TAACTAAGACACTGACAATTGGGATC    hhfhfihffechhhgfihihiiiiii    XA:i:0    MD:Z:26    NM:i:0

Thus, it clearly mapped in the reverse strand, and the GATC site is now at the 3' end. When I look at the .bam file, I get a mix of reads that start or end with GATC. So I cannot simply trim the 4 first bases because that won't work for half of the reads... And also, because the reads are expected to go in a certain direction, I guess the program will remove all those reads that end in GATC. Is that right? Or maybe does it take into account the FLAG information to know where to start trimming?

Thank you very much in advance!

C