#2. Exact duplicate (not PCR duplicate!) reads in BAM files

How I Learned to Stop Worrying and Love the Shell

Someone reached out and asked if I had ever encountered "exactly" duplicate reads in BAMs before. I wasn't aware of the term "exact duplicate read", but I assumed it had something to do with PCR duplicates. I was wrong, and the resulting rabbit hole was an interesting one. What they were talking about was the entire line representing a read appearing twice in a BAM file, which is a pretty interesting situation to find yourself in. What they had been doing was splitting a file into several chunks that could overlap with each other, and then merging them back together. Usually, the way to merge BAM files into one file is to use

$ samtools merge -o out.bam in1.bam ... inN.bam

However, if we have BAM files from the same sample, samtools recommends[1] collapsing read group (@RG) headers using the -c flag. This allows identical reads from different BAM files to end up together in the merged BAM file.

$ samtools merge -c -o sample1_merged.bam sample1_in1.bam ... sample1_inN.bam  
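
A quick way to check whether a merged file actually contains exact duplicates (using the hypothetical merged file from above; note that sort can be slow on a large SAM stream) is to look for repeated lines in the SAM output:

$ samtools view sample1_merged.bam | sort | uniq -d | wc -l

Any non-zero count here means at least one read line appears verbatim more than once.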

An easy way to replicate this problem is to use the samtools view command with the region selection option. If we wanted to subset a BAM to only look at reads from chromosome 1, we could do so by running

$ samtools view -b test.bam chr1 > test_chr1.bam  

Here, the -b option keeps the output in the BAM format instead of converting it to the SAM format. Now, suppose we were to make two subsets of this BAM file: one that contains reads mapping to chr1:1-150,000,000 and another that contains reads mapping to chr1:100,000,000-248956422 (which marks the end of chromosome 1). We could do this by running

$ samtools view -b test.bam chr1:1-150,000,000 > test_chr1_pt1.bam

$ samtools view -b test.bam chr1:100,000,000-248956422 > test_chr1_pt2.bam  

Now, if we were to merge these BAM files using samtools merge -c, the reads that map to chr1:100,000,000-150,000,000 would be present twice, with the entire line repeated verbatim. In order to fix this, I decided to use awk to remove lines that are completely and exactly duplicated. First, we stream the entire file, including the header (-h), as SAM, then do some awk magic, and finally convert the result back to BAM (-b).

$ samtools merge -c -o test_chr1_remerged.bam test_chr1_pt1.bam test_chr1_pt2.bam  

$ samtools view -h test_chr1_remerged.bam |
  awk '!seen[$0]++' |
  samtools view -b - > test_chr1_remerged_fixed.bam  
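
As a sanity check on the result, we can compare record counts before and after the deduplication; the -c flag makes samtools view print the number of matching records instead of the records themselves:

$ samtools view -c test_chr1_remerged.bam

$ samtools view -c test_chr1_remerged_fixed.bam

Assuming the only duplicates came from the merge, the difference between the two counts should match the number of reads falling in the overlapping region chr1:100,000,000-150,000,000.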

Here, the seen in the awk expression isn't a secret command - it's simply an associative array, which is another term for a dictionary-like data structure. The code !seen[$0]++ has two parts - a test condition and a post-increment. Note that $0 in awk represents the entire line. If the value of seen[$0] is 0, the test (!seen[$0]) is TRUE. The value for $0 in the associative array seen is then incremented by one, so the next time the same line is encountered, the test is FALSE. As a result, the test is TRUE for the first occurrence of every line and FALSE for every subsequent occurrence. Whenever a test in an awk expression without an explicit action is TRUE, the default action, i.e. print (equivalent to print $0), is executed, which prints the current line. The output is then passed to samtools view -b, which converts it back into a BAM file.
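
To see the awk idiom in isolation, here's a toy run on plain text, independent of samtools:

$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c

The first occurrence of each line is printed, every repeat is dropped, and the original order is preserved - which is why this works better here than something like sort -u, which would scramble the read order (and misplace the header lines).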

In the end, we have a merged BAM file in which no read line is repeated. While I do not like using awk sandwiched between samtools commands, it seemed like a relatively easy way to do this. Let me know if you have a better way.

Cheers,
Devang