# Efficient algorithms for SNP genotype data analysis using hidden Markov models of haplotype diversity

Contents Acknowledgments ii List of Figures vii List of Tables x 1 Introduction 1 2 Genotype Error Detecti on Using Hidden Markov Models of Hap- lotype Diversity 6 2.1 Preliminaries 8 2.2 Methods 10 2.2.1 Hidden Markov Model 10 2.2.2 Likelihood Ratio Approach to Genotype Error Detection . . 12 2.2.3 Efficiently Computable Likelihood Functions 14 2.2.4 Trie Speed-up 18 2.3 Experimental results 19 2.3.1 Experimental Setup 19 2.3.2 Results on Synthetic Datasets 21 2.3.3 Results on Real Data from [8] . . . 27 2.4 GEDI Software 29 2.4.1 GEDI Results and Discussion 30 iv

2.5 Conclusion 36 3 Imputation-based Local Ancestry Inference in Admi xed Popula tions 38 3.1 Introduction 38 3.2 Methods 40 3.2.1 Genotype Imputation Within Windows with Known Local Ancestry 40 3.2.2 Local Ancestry Inference 44 3.3 Experimental results 46 3.3.1 Inference of Local Ancestry in Admixed Populations 46 3.3.2 SNP Genotype Imputation in Admixed Populations 51 3.4 GEDI-ADMX Software 52 3.4.1 Gene Admix Viewer 53 3.5 Discussion 53 4 Single Individual Genotyping from Low-Coverage Sequencing Data 55 4.1 Introduction 55 4.2 Methods 58 4.2.1 Notations 58 4.2.2 Single SNP Genotype Calling 59 4.2.3 A Statistical Model for Multilocus Genotype Inference ... 61 4.3 Efficient MGP Heuristics 65 4.3.1 Posterior Decoding 65 4.3.2 Greedy Algorithm 68 4.3.3 Markov Approximation Algorithm 70 4.4 Results 73 4.4.1 Datasets 73 v

4.4.2 Read Mapping 74 4.4.3 Genotype Accuracy 74 4.5 Conclusions and Future Work 80 5 Conclusions 85 Bibliography 88 VI

List of Figures 2.1 The structure of the Hidden Markov Model for n=5 SNP loci and K=4 founders 10 2.2 Sample dataset over 5 SNPs (a) and corresponding trie (b) 19 2.3 Detection ROC curves for parents (P) and children (C) using the three likelihood functions in Section 2.2.3 21 2.4 Histograms of log-likelihood ratios for parents (left) and children (right) SNP genotypes, computed based on trios, unos, duos, or the minimum of uno, duo, and trio log-likelihood ratios 23 2.5 Comparison with FAMHAP accuracy for parents (top) and children (bottom) 24 2.6 Effect of the error model (a), sample size (b), and SNP density (c) on detection accuracy of TotalProb-Combined 26 2.7 Imputation error rate and runtime for varying number of flanking typed SNP loci (IMAGE chr. 22 dataset, 520 training haplotypes). 31 2.8 Imputation error rate on the IMAGE chr. 22 dataset for varying numbers of HMM founders and training haplotypes 32 2.9 GEDI imputation error rate and runtime for varying number of founders (IMAGE chr. 22 dataset, 520 training haplotypes) 33 2.10 Runtime comparison between using GEDI imputation with and with out PopTree 34 vn

2.11 Effect of using pedigree information during imputation (IMAGE chr. 22 dataset, 13 HMM founders) 34 3.1 Factorial HMM model for a multilocus SNP genotype (G\,..., Gn) over an n-locus window within which one haplotype is inherited from ancestral population Vk and the other from ancestral population V\. For every locus i, F^ and H\ denote the founder haplotype, respectively the allele observed on the haplotype originating from population Vk] similarly, F\ and H[ denote the founder haplotype and observed allele for the haplotype originating from population Vi. 41 3.2 Single-window imputation-based ancestry inference algorithm. ... 44 3.3 Accuracy of local ancestry estimates obtained by GEDI-ADMX on the three HapMap admixtures using a single window of varying size. 48 3.4 GEDI-ADMX accuracy (solid) and runtime (dashed) for varying values of the number K of HMM founder haplotypes on the CEU- JPT dataset, consisting of n = 38,864 SNPs on Chromosome 1. . . 49 3.5 Gene Admix Viewer main screen 54 4.1 HF-HMM model for multilocus genotype inference 61 4.2 Schematic of reduction of the consensus string problem to MMGPP. 64 4.3 Comparison of genotype sequencing methods: Single SNP vs. Poste rior Decoding vs. Greedy vs. Markov Approximation; Heterozygous (a), and Homozygous (b) 76 4.4 Comparison between binomial and Multilocus genotype calling on percentage of concordance between predicted and gold standard genotype for different average coverages on the Watson dataset (a) and on the NA18507 datasets (b). Bold lines correspond to homozy gous SNPs while dotted lines correspond to heterozygous SNPs ... 78 vin

4.5 Comparison between single posterior and multilocus genotype call ing on percentage of concordance between predicted and gold stan dard genotype for different probability thresholds expressed as un called genotype rates on the Watson dataset. Results of binomial genotype calling are shown as a single datapoint 79 4.6 Effects of the recombination rate (a), SNP coverage (b) and Panel size (c) on concordance between predicted and gold standard geno type on the Watson dataset 80 IX

List of Tables 2.1 Results of TotalProb-Combined on Becker et al. dataset 27 2.2 Imputation error rate on the IMAGE chr. 22 dataset for varying numbers of HMM founders and training haplotypes 31 2.3 Comparison of two GEDI imputation flows on a version of the IM AGE chr. 22 dataset generated by randomly inserting 1% errors and 1% missing data (520 training haplotypes) 35 3.1 Percentage of correctly recovered SNP ancestries on three HapMap admixtures with a = 0.2 51 3.2 Imputation error rate, in percents, on three HapMap simulated ad mixtures with a = 0.5 52 4.1 Summary statistics for the three datasets used in evaluation .... 74 x

Chapter 1 Introduction Recently, large scale Genome-Wide Association Studies (GWASs) have been made possible by the sequencing of the human genome [13,14] and the initial mapping of human haplotypes by the HapMap project [78]. Evidence supporting GWASs as being a powerful approach to identifying disease-gene associations is growing. One recent success, for example, was a joint GWAS using a large British population set identifying 24 statistically significant independent association signals across 6 diseases [16]. The genetic markers of choice in GWASs are Single Nucleotide Polymorphisms (SNPs), which account for most of the genomic variation in hu mans. A SNP is a single base pair mutation, or a variance in DNA nucleotides between organisms of the same species, at certain locations, called markers, along the chromosome. All possible nucleotides which exist for a large percentage of the population for a specific marker are called alleles. Most SNPs are biallelic, i.e., only two nucleotide variations are known to exist at that SNP marker. According to NCBI, the human genome consists of nearly thirteen million common SNPs, which have been cataloged in the most recent build (130) of the dbSNP database ( ht t p://www.ncbi.nl m.ni h.gov/pr oj ects/SNP/). In diploid organisms such as humans, cells contains two copies of each chromo- 1

some, one for each parent. The combination of alleles present at SNP markers on one parental chromosome is called a haplotype. The conflated allele information from the two haplotypes is called a genotype sequence. However, while the geno type sequence identifies the two alleles at each SNP marker, it does not specify which allele is assigned to a specific chromosome. This thesis presents several highly scalable algorithms that utilize Hidden Markov Models (HMMs) of haplotype diversity which capture SNP information such as Linkage Disequilibrium (LD) observed in the population or populations under study. The algorithms are applied to address the problems of genotype error detection and correction, imputation-based local ancestry inference, and genotype calling with low coverage shotgun sequencing data. The validity of associations uncovered in GWASs critically depends on the accuracy of genotype data. Despite recent progress in genotype calling algo rithms [17,18,51,57,63,83], significant error levels remain present in SNP genotype data due to factors ranging from human error and sample quality to sequence vari ation and assay failure. Since even low error levels can lead to inflated false positive rates and substantial losses in the statistical power of linkage and association stud ies [4,2,12,28,53,86], detecting and correcting genotype errors remains a critical task in genetic data analysis. Further, since causal SNPs are unlikely to be typed directly due to the limited coverage of current genotyping platforms, imputation of genotypes at untyped SNP loci has recently emerged as a powerful technique for increasing the power of association studies. Chapter 2 of this thesis proposes novel methods for genotype error detection which extends the likelihood ratio error detection approach of [8]. While we focus on detecting errors in parents-child trio genotype data, our proposed methods apply with minor modifications to genotype data coming from unrelated individuals and small pedigrees other than trios. Un like previous approaches to genotype error detection [8], which use enumeration of 2

common haplotypes within a small window around each locus, we employ a hid den Markov model (HMM) to represent frequencies of all possible haplotypes over the set of typed loci. Empirical results shows significant improvement for both error accuracy and speed when compared to other methods on real and simulated datasets. The error detection approach can also be modified to address the prob lems of imputation of untyped SNP markers and missing genotype data recovery. We conclude the chapter by introducing GEDI, a software package that imple ments efficient algorithms for performing several common tasks in the analysis of population genotype data, including error detection and correction, imputation of both randomly missing and untyped genotypes, and genotype phasing. One type of powerful tool used in disease-gene association studies that has emerged is admixture mapping. Admixture mapping relies on genotyping hun dreds of thousands of single nucleotide polymorphisms (SNPs) across the genome in a population of recently admixed individuals and is based on the assumption that near a disease-associated locus there will be an enhanced ancestry content from the population with higher disease prevalence. Therefore, a critical step in admixture mapping is to obtain accurate estimates of local ancestry around each genomic locus. Chapter 3 contributes a novel method for imputation-based local ancestry inference that effectively exploits LD information using HMMs of haplo- type diversity. While there are several methods that use a detailed HMM of LD (e.g. SABER [75], SWITCH [66],HAPAA [74]), surprisingly, existing methods that do not exploit LD outperform those that do. This second class of methods (e.g. LAMP [67] and WINPOP [60]) employs a window-based framework to achieve increased accuracy, however these methods differ from each other in the type of information ultimately used to make local ancestry inferences. Our novel method for imputation-based local ancestry inference more effectively exploits LD infor mation by combining these two classes of methods. Our method employs multiple 3

HMMs trained on a set of ancestral haplotypes for each population in the admix ture study to impute genotypes at all typed SNP loci (temporarily marking each SNP genotype as missing) under each possible local ancestry. We then assign to each locus the local ancestry that yields the highest imputation accuracy, as as sessed using a weighted-voting scheme based on multiple SNP windows centered on the locus of interest. Preliminary experiments on simulated admixed populations show that imputation-based ancestry inference has accuracy competitive with best existing methods in the case of distant ancestral populations, and is significantly more accurate for closely related ancestral populations. Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing. Sequencers like Roche/454 FLX Titanium, Illumina Genome Analizer II, ABI SOLiD 3 and Helicos HeliScope are able to provide millions of short reads in a single run which lasts just a few days in some cases and even less than one day in other cases. These advances promise to enable cost-effective shotgun sequencing of individual genomes. After recent publication of five complete individual genomes [9,22,52,62,80,23], ongoing efforts focus on increasing the quality and coverage of short reads and on im proving algorithms for mapping, genotyping and variations discovery to sequence over a thousand more individual genomes [1], While shotgun sequencing can dis cover new SNPs and other forms of sequence variation, its sensitivity of detecting heterozygous SNPs is limited by coverage depth. Chapter 4 demonstrates that highly accurate SNP genotypes can be inferred from very low coverage shotgun sequencing data by using a multilocus inference model that also exploits linkage disequilibrium (LD) information from HMMs of haplotype diversity. While shot gun sequencing can discover new SNPs and other forms of sequence variation, its sensitivity of detecting heterozygous SNPs is limited by coverage depth. It was estimated that a coverage depth of over 21 x is required to achieve 99% sensitiv- 4

ity at detecting heterozygous SNPs based on the rule that each allele must be covered by two or more reads [23]. A coverage depth similar to that in [22,23] (7.5x) detects only 75% of the heterozygous SNPs, and sensitivity drops rapidly at even lower coverage depths. The hierarchical-factorial HMM (HF-HMM) in troduced in this chapter enables the integration of shotgun sequencing data with haplotype frequency information extracted from a reference panel. Efficient de coding algorithms for genotype calling that utilize this HF-HMM are introduced. Experimental results show that our algorithms achieve significant improvements in accuracy compared to previous methods. Based on publicly available reads from three different sequencing technologies, we show that we can achieve more than 95% accuracy on heterozygous SNP calls and more than 99% accuracy on homozy gous SNP calls with just 5x coverage depth. Moreover, our proposed algorithms have a linear run time on the number of SNP loci and individuals to be analyzed. Finally, a current status of this research, as well as a list of several improvements which build upon this research, is provided and will be explored as outlined in the concluding chapter. 5

Chapt er 2 Genot ype Error Detection Using Hi dden Markov Models of Hapl ot ype Diversity1 The sequencing of the human genome coupled with the initial mapping of human haplotypes by the HapMap project and rapid advances in SNP genotyping tech nologies have recently opened up the era of genome-wide association studies, which promise to uncover the genetic basis of common complex diseases such as diabetes and cancer by analyzing the patterns of genetic variation within healthy and dis eased individuals. However, the validity of associations uncovered in these studies critically depends on the accuracy of genotype data. Despite recent progress in genotype calling algorithms [17,18,51,57,63,83], significant error levels remain present in SNP genotype data due to factors ranging from human error and sam ple quality to sequence variation and assay failure, see [61] for a recent survey. A recent study of dbSNP genotype data [84] found that as much as 1.1% of about 20 million SNP genotypes typed multiple times have inconsistent calls, and are thus lrPhe results presented in this chapter are based on joint work with I. Mandoiu and B. Pasaniuc [38,36,37] 6

incorrect in at least one dataset. Recommended quality control procedures such as the use of external control samples from HapMap and duplication of internal samples [55] provide an estimate of error rates, but do not eliminate them. Although systematic errors such as assay failure can be detected by departure from Hardy-Weinberg equilibrium proportions [32,44], and, when genotype data is available for related individuals, some errors become detectable as Mendelian Inconsistencies (Mis), a large fraction of errors remains undetected by these analyses, e.g., as much as 70% of errors in mother- father-child trio genotype data are undetected by Mendelian consistency analysis [20,29]. Since even low error levels can lead to inflated false positive rates and substantial losses in the statistical power of linkage and association studies [4, 2,12,28,53,86], detecting Mendelian consistent errors remains a critical task in genetic data analysis. This task becomes particularly important in the context of association studies based on haplotypes instead of single locus markers, where error rates as low as 0.1% may invalidate some statistical tests for disease association [42]. A powerful approach of dealing with genotyping errors is to explicitly model them in downstream statistical analyses, see, e.g., [11,31,49]. While powerful, this approach often leads to complex statistical models and impractical runtime for large datasets such as those generated by genome-wide association studies. A more practical approach is to perform genotype error detection as a separate anal ysis step following genotype calling. SNP genotypes flagged as putative errors can be either excluded from downstream analyses or retyped when high quality geno type data is required. Indeed, such a separate error detection step is currently implemented in all widely-used software packages for pedigree genotype data anal ysis including Mendel [72], Merlin [3], Sibmed [19], and SimWalk2 [71,72], all of which detect Mendelian consistent errors by independently analyzing each pedigree and identifying loci of excessive recombination. Unfortunately, these methods have 7

very limited power to detect errors in genotype data from small pedigrees such as mother-father-child trios, and do not apply at all to genotype data from unrelated individuals. [8] have recently introduced the use of population level haplotype fre quency information for genotype error detection in trio data via a simple likelihood ratio test. However, detection accuracy of their method is severely limited by the reliance on explicit enumeration of most frequent haplotypes within short blocks of consecutive SNP loci. 2.1 Preliminaries In this chapter we propose novel methods for genotype error detection extending the likelihood ratio error detection approach of [8]. While we focus on detecting errors in trio genotype data, our proposed methods apply with minor modifications to genotype data coming from unrelated individuals and small pedigrees other than trios. Unlike [8], we employ a hidden Markov model (HMM) to represent frequencies of all possible haplotypes over the set of typed loci. Similar HMMs have been successfully used in recent works [41,64,68,69] for genotype phasing and disease association. Two limitations of previous uses of HMMs in this context have been the relatively slow training based on genotype data and the inability to exploit available pedigree information. We overcome these limitations by training our HMM using haplotypes inferred by the pedigree-aware phasing algorithm of [30], based on entropy minimization. The authors of [8] use maximum phasing probability of a trio genotype as the likelihood function whose sensitivity to single SNP genotype deletions signals potential errors. The former is heuristically approximated by a computationally expensive search over quadruples of frequent haplotypes inferred for each window. We show that, when haplotype frequencies are implicitly represented using an 8

HMM, computing the maximum trio phasing probability is, unfortunately, hard to approximate in polynomial time. Despite this hardness result, we are able to significantly improve both detection accuracy and speed compared to [8] by using alternate likelihood functions such as Viterbi probability and the total trio genotype probability, both of which can be computed for commonly vised unrelated and trio genotype data within a worst-case runtime that increases linearly in the number of SNP loci and that of genotyped individuals. Further improvements in detection accuracy for genotype trio data are obtained by combining likelihood ratios computed for different subsets of trio members. Empirical experiments show that this technique is very effective in reducing false positives within correctly typed SNP genotypes for which the same locus is mistyped in related individuals. The rest of the chapter is organized as follows. We introduce basic notations in Section 2.1 describe the structure of the HMM used to represent haplotype frequencies in Section 2.2.1, and present the likelihood ratio approach of [8] in Section 2.2.2. In Section 2.2.3, we show that while the likelihood function in [8] cannot be approximated efficiently when an HMM is used to represent haplotype frequencies, we give three alternative likelihood functions that can be computed efficiently based on an HMM. Finally, we give experimental results assessing the error detection accuracy of our methods on both simulated and real datasets in Section 2.4.1, and conclude with ongoing research directions in Section 2.5. We start by introducing basic terminology and notations used throughout the chapter. We denote the major and minor alleles at a SNP locus by 0 and 1. A SNP genotype represents the pair of alleles present in an individual at a SNP locus. Possible SNP genotype values are 0/1/2/?, where 0 and 1 denote homozygous genotypes for the major and minor alleles, 2 denotes the heterozygous genotype, and ? denotes missing data. SNP genotype g is said to be explained by an ordered pair of alleles (a, a') € {0, l }2 if g —?, or g € {0,1} and a = a' = g, or g = 2 and 9

Figure 2.1: The structure of the Hidden Markov Model for n=5 SNP loci and K=4 founders. o±o>. We denote by n the number of SNP loci typed in the population under study. A multi-locus genotype (or simply genotype) is a 0/1/2/? vector G of length n, while a haplotype is a 0/1 vector H of length n. An ordered pair (H, H') of haplotypes explains multi-locus genotype G iff, for every i = l,...,n, the pair (H(i),H'(i)) explains G(i). A trio genotype is a triple T = (Gm, G/, Gc) consisting of mother, father, and child multi-locus genotypes. Assuming that no recombi nation takes place within the set of SNP loci of interest, we say that an ordered 4-tuple (Hi, H2, H3, H4) of haplotypes explains trio genotype T = (Gm, G/, Gc) iff (Hi,H2) explains Gm, (H3, H4) explains Gf, and (Hi, H^) explains Gc. A geno type duo consisting of mother-child or father-child genotypes is defined similarly. An ordered 3-tuples of haplotypes (Hi, H2,H3) is said to explain such a duo iff (Hi,H2) explains the parent genotype and (Hi,H%) explains the child genotype. 2.2 Met hods 2.2.1 Hi dden Markov Model The HMM used to represent haplotype frequencies has a similar structure to HMMs recently used in [41,51,64,68,69]. This structure (see Figure 2.1) is fully determined by the number of SNP loci n and a user-specified number of founders K (typically a 10

small constant, we used K = 7 in our experiments). Formally, the HMM is specified by a triple M = (Q, 7, e), where Q is the set of states, 7 is the transition probability function, and e is the emission probability function. The set of states Q consists of disjoint sets Q0 = {q0}, Qi, Q2, • • •, Qn, with | Qi| = IQ2I = ••• = \Qn\ = K, where q° denotes the start state and Qj, 1 < j < n, denotes the set of states corresponding to SNP locus j. The transition probability between two states a and b, j(a, b), is non-zero only when a and b are in consecutive sets Qt. The initial state q° is silent, while every other state q emits allele a G {0,1} with probability e(q, a). The probability with which M emits a haplotype H along a path TX starting from q° and ending at a state in Qn is given by: P( i f,7^| M) =7( g^4l ) ) e( v^( l ),^( l ) ) ^"= 2 7( ^- l ) ^«) e( v^( ^),J f/( ^) ) (2.1) Intuitively, M represents founder haplotypes along high-probability paths of states, with recombination between pairs of founder haplotypes being captured via remain ing transition probabilities. As noted above, the structure of our HMM is similar to that of other models proposed in the literature. However, there are also important differences. The model underlying the IMPUTE algorithm described in [51] defines HMM states at each SNP locus directly from reference haplotypes (thus, for iV reference haplotypes there are N HMM states at each locus). Under the IMPUTE model the probability of switching from one reference haplotype to another is derived from the genetic distance between loci and the effective size for the population under study, and does not depend on the states (haplotypes) between which the transition occurs. Similar to our use of founder haplotypes, the model underlying the fastPHASE algorithm [68] reduces the number of HMM states by using at each SNP locus a state for each one of K clusters of reference haplotypes, where K is a user- 11

specified parameter. For each SNP locus, the fastPHASE model estimates K different different transition probabilities, with all transitions into a cluster being given an equal probability. As detailed above, our HMM model allows transition probabilities to depend on both the start and the end states (founder haplotypes), potentially providing more expressive power compared to the models in [68] and [51]. In HMMs nearly identical to our own [41,64], training was accomplished using genotype data via variants of the EM algorithm. Since EM-based training is gen erally slow and cannot be easily modified to take advantage of phase information that can be inferred from available family relationships, we adopted the following two-step approach for training our HMM. First, we use the highly scalable ENT algorithm of [30] to infer haplotypes for all individuals in the sample based on entropy minimization. ENT can handle genotypes related by arbitrary pedigrees, and has been shown to yield high phasing accuracy as measured by the so called switching error, which implies that inferred haplotypes are locally correct with very high probability. In the second step we use the classical Baum-Welch algorithm [6] to train the HMM based on the haplotypes inferred by ENT. 2.2.2 Likelihood Rat i o Approach to Genotype Error De tection Our detection methods are based on the likelihood ratio approach of [8]. We call likelihood function any function L assigning non-negative real-values to trio geno types, with the further constraint that L is non-decreasing under data deletion. Let T = (Gm, Gf, Gc) denote a trio genotype, x € {m, f, c} denote one of the indi viduals in the trio (mother, father, or child), and i denote one of the n SNP loci. The trio genotype T(x*
*