#1. Converting subsets of a BAM to FASTA/FASTQ using samtools

How I Learned to Stop Worrying and Love the Shell

One of the main parameters I use to filter BAM files is the bitwise FLAG that is present in the second column in a BAM file. This contains a lot of information - it tells us whether a read is paired or not, or whether it is a PCR duplicate. The SAM file specification lists what the different flags in a FLAG stand for, as shown below.

Bit
Bit
Description
  (Decimal)  
     (Hexadecimal)     
1
0x1
template having multiple segments in sequencing
2
0x2
each segment properly aligned according to the aligner
4
0x4
segment unmapped
8
0x8
next segment in the template unmapped
16
0x10
SEQ being reverse complemented
32
0x20
SEQ of the next segment in the template being reverse complemented   
64
0x40
the first segment in the template
128
0x80
the last segment in the template
256
0x100
secondary alignment
512
0x200
not passing filters, such as platform/vendor quality controls
1024
0x400
PCR or optical duplicate
2048
0x800
supplementary alignment

These flags can be combined together to describe the properties of reads in a BAM file. To learn more about the different combinations of flags, check out this interactive explainer by the Broad Institute. Now that we know what the SAM flags are, we can move on to the core of this piece. In order to filter a BAM file based on certain combinations, we can samtools view with a variety of options. This is where things can get a little confusing, so we will go through the options one by one.

1. First off, we have -f FLAG or --require-flags FLAG which outputs the alignments with all bits set in FLAG present in the FLAG field.

2. Second, we have -F FLAG or --exclude-flags FLAG which removes the alignments with any bits set in FLAG present in the FLAG field.

3. Next, we have --rf FLAG or --include-flags FLAG which outputs the alignments with any bits set in FLAG present in the FLAG field.

4. Finally, we have -G FLAG which removes the alignments with all bits set in FLAG present in the FLAG field.

If we look at the bolded parts in the description above, we can see a pattern here. The options -f and -G are complementary and so are the options -F and --rf. What this means is that if we were to use -f FLAG -G FLAG or -F FLAG --rf FLAG, we would get no reads. (Note the aymmetry due to lack of -g, which is something the maintainers of samtools talk about here.)

Now we come to the main issue that I faced. I wanted to obtain a FASTQ file with the reads from a BAM file that were "bad". In my case, "bad" meant if the reads were either supplementary alignments or PCR duplicates. Based on the options and flags discussed above, we could use the -rf flag with 3072 to select reads that satisfy any of the two criteria. Instead of having to filter the BAM and then converting it to a FASTQ, we can use samtools fastq to do this in one step. However, until samtools v1.17, there was no --rf option for the samtools fasta/fastq commands, only the other three commands, and even those commands did not have the more readable long-form commands (e.g. --require-flags for -f).

I decided to try to fix this issue in the samtools codebase, which wasn't the easiest to understand as a completely new developer, but I was able to get it done! I'd never written C code before, so that was a fun exercise as well. As it stands, samtools v1.18 now has all four options for samtools fasta/fastq and as a fun bonus, I get to see my name in the changelog! Let me know what you think.

Cheers,
Devang