Haplo-ASP

Haplo-ASP is an ASP-based approach to solving Haplotype Inference problems. It is briefly described in the paper "Efficient Haplotype Inference with Answer Set Programming," by Erdem and Ture (AAAI-08).

Haplotype Inference

A genotype is the specific genetic makeup of an individual. It is in the form of a DNA, consisting of a number of sites. Each genotype has two copies, one from the mother and one from the father. These two copies are called haplotypes. Haplotype Inference problem is, for a given set of genotypes, to infer haplotypes that explain them.

The concept of two haplotypes explaining a genotype is defined in various ways. Here is Gusfield's definition.

Haplotype Inference by Pure Parsimony

Haplotype Inference by Pure Parsimony (HIPP) is a slight modification of the Haplotype Inference problem, which asks for a minimal set of haplotypes that explain the given genotypes.

We have studied HIPP by means of its decision version:

HIPP-DEC is NP-complete.

Assume that H is a set that contains 2*n haplotypes, h₁, ..., h_2n, and that every genotype g_i in G is mapped to two haplotypes, h_2i and h_2i-1, in H. Then H is a solution to HIPP-DEC if the following hold:

C1 For every genotype g in G, for every site j of g with value 2, the values of the jth sites of the corresponding two haplotypes are different.
C2 For every genotype g in G, for every site j of g with value 1 or 0, the values of the jth sites of the corresponding two haplotypes are identical to g[j].
C3 There are at most k unique haplotypes in H.

Representing HIPP-DEC in ASP

In ASP, we consider genotypes and haplotypes as sets of atoms. Genotypes are described by means of atoms of the form amb(i,j): The presence of amb(i,j) in an answer set describes that g_i[j] = 2, the presence of -amb(i,j) in an answer set describes that g_i[j] = 1, and the absence of both literals describes that g_i[j] = 0. Haplotypes are described by atoms of the form h(i,j).

We represent HIPP-DEC as a program, where each condition C1--C3 is represented by a set of rules. Here is a straightforward ASP formulation of HIPP-DEC, with respect to Gusfield's definition: hipp-dec-alternative-gus-asp.lp.

We have also formulated HIPP-DEC, with respect to other mathematical models of the problem, and in other declarative programming paradigms (Constraint Programming and Integer Linear Programming). All formulations are available here.

Haplotype Inference from Present-Absent Genotype

Haplotype Inference from Present-Absent Genotype (HIPAG) is another haplotype inference problem, which asks for the minimal set of haplotypes "compatible" with the given genotypes. Both haplotypes and genotypes can be viewed as vectors, as in HIPP. In this problem, each site of a haplotype takes one of the two values 0 or 1 specifying the presence/absence of a particular gene. Sites of genotypes are biallelic; the value of each site is a pair of numbers from {0, 1, ?}. For instance, (1, ?)(1, ?)(1, 1) is a genotype.

For a genotype g of the form (g₁₁, g₁₂)...(g_m1, g_m2), let us denote by g₁ the vector g₁₁g₂₁...g_m1 and by g₂ the vector g₁₂g₂₂...g_m2. We say that two alleles i and j are compatible if they are identical or if one of them is ?. Two haplotypes h₁ and h₂ are compatible with a genotype g, all with m sites, if, for every site j, h₁[j] is compatible with one of the two alleles, g₁[j] or g₂[j], and h₂[j] is compatible with the other. For instance, the haplotypes 011 and 111 are compatible with (1, 0)(1, 1)(1, 1) and (1, ?)(1, ?)(1, 1) but not with (1, ?)(1, 0)(1, 1). Note that, we can discard the sites with missing information while computing a solution for HIPAG.

We consider the following decision version of HIPAG:

HIPAG-DEC Given a set G of n genotypes each with m biallelic sites, and a positive integer k, decide whether there is a set H of at most k unique haplotypes such that each genotype in G is compatible with two haplotypes in H.

H1 For every genotype g in G, for every site j of g, the values of the j'th sites of the corresponding haplotypes h₁ and h₂ are compatible with the values of g₁[j] and g₂[j].
H2 There are at most k unique haplotypes in H.

Representing HIPAG-DEC in ASP

The ASP representation of HIPAG-DEC is similar to the ASP representation of HIPP-DEC: We generate values for the sites of haplotypes and compute the maximum number of unique haplotypes with the same set of rules. Haplotypes are represented as in HIPP representation. Genotypes are represented in terms of atoms of the form present(G,J,I) ("Chromosome I of Genotype G has value 1 at Site J") describing present genes; -present(G,J,I) describes absence of genes. For instance, Genotype 3 of the form (1, ?)(1, 0)(1, 1) is represented as follows:

present(3,1,1). present(3,2,1). present(3,3,1). -present(3,2,2). present(3,3,2).

Here is the ASP program that describes HIPAG-DEC: hipag-dec-asp.lp

Experimental Results

We conducted experiments with 334 problem instances (294 real, 40 artificial), using the answer set solver Cmodels with Minisat for Haplo-ASP, and the latest versions of SHIPs and Rpoly, on a Linux workstation with Intel Centrino 1.5GHz Xeon processor and 4x512 RAM. For each system we assigned 1000 sec.s of CPU time to solve each problem. We then sorted all problem instances according to CPU times, and plotted a graph of problems and the corresponding CPU time values. We also compared the performance of each system on different problem groups. The experimental results can be found here: hipp-results.pdf