How I Learned to Stop Worrying and Love the Shell
One of the main parameters I use to filter BAM files is the bitwise FLAG that is present in the second column in a BAM file. This contains a lot of information - it tells us whether a read is paired or not, or whether it is a PCR duplicate. The SAM file specification lists what the different flags in a FLAG stand for, as shown below.
Description | ||
---|---|---|
template having multiple segments in sequencing | ||
each segment properly aligned according to the aligner | ||
segment unmapped | ||
next segment in the template unmapped | ||
SEQ being reverse complemented | ||
SEQ of the next segment in the template being reverse complemented | ||
the first segment in the template | ||
the last segment in the template | ||
secondary alignment | ||
not passing filters, such as platform/vendor quality controls | ||
PCR or optical duplicate | ||
supplementary alignment |
These flags can be combined together to describe the properties of reads in a BAM file. To learn more about the different combinations of flags, check out this interactive explainer by the Broad Institute. Now that we know what the SAM flags are, we can move on to the core of this piece. In order to filter a BAM file based on certain combinations, we can samtools view with a variety of options. This is where things can get a little confusing, so we will go through the options one by one.
1. First off, we have -f FLAG or --require-flags FLAG which outputs the alignments with all bits set in FLAG present in the FLAG field.
2. Second, we have -F FLAG or --exclude-flags FLAG which removes the alignments with any bits set in FLAG present in the FLAG field.
3. Next, we have --rf FLAG or --include-flags FLAG which outputs the alignments with any bits set in FLAG present in the FLAG field.
4. Finally, we have -G FLAG which removes the alignments with all bits set in FLAG present in the FLAG field.
If we look at the bolded parts in the description above, we can see a pattern here. The options -f and -G are complementary and so are the options -F and --rf. What this means is that if we were to use -f FLAG -G FLAG or -F FLAG --rf FLAG, we would get no reads. (Note the aymmetry due to lack of -g, which is something the maintainers of samtools talk about here.)
Now we come to the main issue that I faced. I wanted to obtain a FASTQ file with the reads from a BAM file that were "bad". In my case, "bad" meant if the reads were either supplementary alignments or PCR duplicates. Based on the options and flags discussed above, we could use the -rf flag with 3072 to select reads that satisfy any of the two criteria. Instead of having to filter the BAM and then converting it to a FASTQ, we can use samtools fastq to do this in one step. However, until samtools v1.17, there was no --rf option for the samtools fasta/fastq commands, only the other three commands, and even those commands did not have the more readable long-form commands (e.g. --require-flags for -f).
I decided to try to fix this issue in the samtools codebase, which wasn't the easiest to understand as a completely new developer, but I was able to get it done! I'd never written C code before, so that was a fun exercise as well. As it stands, samtools v1.18 now has all four options for samtools fasta/fastq and as a fun bonus, I get to see my name in the changelog! Let me know what you think.
Cheers, Devang