An Algorithm for the Extraction of DNA Fragments using Restriction Enzymes

Kaustav Das, Pallab Dasgupta, P.P. Chakrabarti
Dept. of Computer Science & Engineering
Indian Institute of Technology Kharagpur, INDIA
{pallab,ppchak}@cse.iitkgp.ernet.in

Abstract: Extraction of DNA fragments from existing base sequences is an important step in genetic manipulation. Existing tools for sequence alignment and creating the restriction map of families of restriction enzymes do not give a solution to the extraction problem. In this paper we explain why the task of choosing the most suitable enzymes for an extraction is computationally non-trivial. We present an automated algorithm for solving the extraction problem. Experimental results show that our algorithm is capable of performing extraction from large DNA sequences efficiently.

I. INTRODUCTION

Gene manipulation is one of the most exciting topics in genetic engineering [6]. Recent advances in genetic engineering have enabled new combinations of genetic material to be artificially constructed in the laboratory by the controlled insertion and manipulation of nucleic acid sequences. Plasmids and viruses carry these new sequences into host cells where they can be propagated and amplified. In various cases this facilitates transcription into mRNA and subsequent translation into proteins.

One of the main steps in genetic engineering is to extract the desired DNA fragment from a given base sequence. Once this extraction is done, the spliced DNA fragments are incorporated into vectors (such as plasmids) and used for transmission into host cells. Typically the task of cleaving a DNA sequence to extract the target sequence is done using restriction enzymes. These restriction endonucleases recognize specific DNA nucleotide sequences (called recognition sequences), and cleave the DNA double helix at or near these specific sites.
The task of extracting a target DNA sequence from a given base sequence consists of two subtasks, namely:

1) Finding the sites in the base sequence that match the target sequence,
2) Cleaving the base sequence near the endpoints of the match using appropriate restriction enzymes to extract a DNA sequence that is similar to the target sequence.

Typically a match need not be exact, but must have a high score based on a problem-defined matching function. However, an exact match (if it exists) is usually preferred.

Example 1: Suppose the given base sequence is CATGACGCGCG and the target sequence is CGCG. In this case, exact matches of the target sequence can be found starting from the sixth and the eighth positions of the base sequence. These matches are denoted by and . On searching the restriction enzyme database [4], we find the following enzymes:

    Enzyme    Recognition Sequence
    FnuDII    CG.CG
    NlaIII    CATG.
    HhaI      GCG.C

The 'dot' indicates the position at which the cleavage takes place. For example, NlaIII can cleave the base sequence immediately after the fourth position, and HhaI can cleave the base sequence immediately after the ninth position. Therefore, by using these two enzymes the DNA fragment ACGCG (positions five through nine) can be extracted, which contains the target sequence. It is interesting to note that although FnuDII has two cleavage sites in the given base sequence, it is not appropriate since it cleaves the target sequence as well.

There are several tools (such as BLAST [1]) for finding matches for a target sequence in a base sequence. There are also some tools for computing the restriction map of a given base sequence [5], [7]. The restriction map of a base sequence with respect to a set of enzymes indicates the cleavage sites of those enzymes on the base sequence. However, these tools do not solve the extraction problem automatically, and currently the task of finding mutually compatible enzymes¹ for extracting the target sequence is solved manually.
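As an illustration only, Example 1 can be reproduced with a short Python sketch. The function names (`find_all`, `cut_positions`, `extraction_candidates`) are hypothetical and are not taken from the tool described in this paper. The sketch assumes exact matching on a single strand, ignores reverse complements and enzyme compatibility, and ranks candidates by fragment length, which is an assumption made here and not necessarily the paper's notion of the best pair.

```python
from itertools import combinations_with_replacement


def find_all(seq, pattern):
    """Return 1-based start positions of every exact (possibly overlapping) match."""
    starts = []
    i = seq.find(pattern)
    while i != -1:
        starts.append(i + 1)
        i = seq.find(pattern, i + 1)
    return starts


def cut_positions(seq, site):
    """Return positions p such that the enzyme cuts immediately after base p.

    `site` uses the dot notation of Example 1, e.g. 'GCG.C'.
    """
    offset = site.index(".")            # bases of the site that precede the cut
    recognition = site.replace(".", "")
    return [start - 1 + offset for start in find_all(seq, recognition)]


def extraction_candidates(base, target, enzymes):
    """List (fragment, enzyme_a, enzyme_b, match_start), shortest fragment first.

    A single enzyme is covered by the case enzyme_a == enzyme_b.
    """
    cuts = {name: cut_positions(base, site) for name, site in enzymes.items()}
    results = []
    for start in find_all(base, target):
        end = start + len(target) - 1   # 1-based position of the last matched base
        for a, b in combinations_with_replacement(enzymes, 2):
            used = set(cuts[a]) | set(cuts[b])
            if any(start <= c < end for c in used):
                continue                # an enzyme would cut inside the target
            before = [c for c in used if c < start]
            after = [c for c in used if c >= end]
            if before and after:
                # Fragment spans from the nearest cut before the match
                # to the nearest cut at or after its last base.
                results.append((base[max(before):min(after)], a, b, start))
    return sorted(results, key=lambda r: len(r[0]))


enzymes = {"FnuDII": "CG.CG", "NlaIII": "CATG.", "HhaI": "GCG.C"}
print(extraction_candidates("CATGACGCGCG", "CGCG", enzymes))
# [('ACGCG', 'NlaIII', 'HhaI', 6)]
```

Every pair involving FnuDII is rejected because it cuts after position seven, inside the match that starts at position six, and after position nine, inside the match that starts at position eight. Checking all pairs this way is quadratic in the number of enzymes for each match, which hints at why a brute-force search becomes costly on large enzyme databases.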
Since there are about restriction enzymes (today), finding the best pair for a given problem instance is non-trivial.

In this paper, we present an algorithm for finding the best pair of enzymes for a given extraction problem. We present the main features of this algorithm that provide insights into the complexity of the problem. We have implemented a tool based on this algorithm that considers about restriction enzymes. We present experimental results indicating the computational efficiency of the tool.

¹ In some cases a single enzyme can cleave at both ends of the target sequence. In other cases, we require a pair.

The paper is organized as follows. Section II presents an outline of the proposed algorithm. Section III presents our method for creating the restriction map of a given base