#3. Filtering BAM reads based on variant alleles at position

How I Learned to Stop Worrying and Love the Shell

Another day, another interesting problem! This time, we are dealing with a BAM filtering problem but it also bears close resemblance to a visualization feature in IGV. If you've spent time looking at variants in IGV, you might know that a simple way to improve your IGV screenshot perusal experience is to group reads using the base_at_pos tag. This feature groups the reads according to the nucleotide base at a given position, allowing you to look at reference and alternate alleles separately.

The problem here is similar but instead of grouping reads in IGV, we want to filter a BAM based on the allele at a particular locus. samtools allows us to filter BAM files based on a number of options - the most commonly used filters are based on regions or flags.

# filter by coordinates
$ samtools view -b -h input.bam chr1:100-200 > input_chr1_100_200.bam

# filter by flag (e.g. remove PCR duplicates)
$ samtools view -b -h -F 1024 input.bam > input_no_pcr_dups.bam

However, filtering based on the base at a particular position is uncharted territory. samtools provides filter expressions for a number of tags and fields but this is a combination of a couple of fields - the position and the base which is currently infeasible. This is exactly the kind of problems that I like to solve - no standard solution but can be solved using some command chaining, or even better, a one-liner! I started off with a verbose rudimentary solution that worked as expected and then set out to make the solution more efficient and cleaner. There is an argument to be made about the benefits of shorter commands split over multiple lines but in this house we go for one-liners when we can. Here's the full command split displayed in a more readable format:

$ samtools view -b -N <(samtools mpileup --output-QNAME --no-output-ends \
-r chr1:150-150 input.bam | cut -f5,7 |
awk -v FS='\t' -v OFS='\t' '{split($2, a, ","); \
for(i=1; i<=length(a); i++) \
if (toupper(substr($1, i, 1)) == "C") print a[i]}') \
input.bam chr1:150-150 > input_base_a.bam

We can now go over this step by step. The first part is the samtools view command with the -b flag and an argument to -N that forms the bulk of the command. The -b flag converts the resulting records into the BAM format. The -N option allows us to pass read names that can be used to filter the BAM file. This is a relatively new option added to release 1.12 in 2021 that makes this task much faster and easier.

The argument passed to the -N option is the meat of this solution. The first tool used in here is samtools mpileup, which generates a summary of nucleotide counts at each position in a set of aligned sequencing reads from a BAM file. We use a couple of options with this command:

--output-QNAME: This tells mpileup to print out the read names of reads that span this position in a comma-delimited column.
--no-output-ends: This tells mpileup to suppress the additional markup for bases that are present at the beginning (^) or end ($) of a read. This allows us to have exactly as many characters in the base column as the read names column.
-r chr1:150-150: This tells mpileup to only calculate the results for this position.

This gives us the following output where the columns in order are (1) chromosome, (2) position, (3) reference base [N in this case because we haven't provided a reference FASTA file], (4) the number of bases covering this position, (5) read bases at this position, (6) quality scores of read bases at this position, and (7) names of reads spanning this position, corresponding to the bases in column 5. We see that there are three reads spanning this position (RNAME1, RNAME2, RNAME3) with 2 cytosine (C) bases and 1 adenine (A) base at this position.

$ chr1 150 N 3 CAc BBB RNAME1,RNAME2,RNAME3

The only fields we care about in this output are the read bases at this position (column 5) and the names of the reads they originate from (column 7). We then use the cut command in bash to extract these two columns from the output. Next, we use an awk command with tab delimiters for input -v FS='\t' and output -v OFS='\t'. The first statement inside the awk command splits the contents of the read names column ($2) with a comma as a delimiter, and stores it in an array a. The second statement inside the awk command is a for loop that iterates over the contents of the array a. For each element in the array (which is the same length as the depth specified by column 4 in the mpileup output as well as the length of the column 5 in the mpileup output containing the read bases which in turn is the first column passed to awk). We use the substring command in awk to obtain the base at each position and compare it to our base of interest, in this case - a cytosine (C). If we find a match, we print the corresponding read name from the array.

This is now the output of the subshell to the -N option to samtools view. We are passing a list of read names that will be used to filter the file. Providing a region at the end of the view command allows samtools to skip searching the entire file for these read names and speeding up the process greatly. This was an optimization I added after I had generated an initial solution that was pretty slow. Overall, this was a very fun exercise! Let me know if you have a better way to do this.

Cheers,
Devang