Journal of Biomedical Engineering and Technology, 2013, Vol. 1, No. 2, 26-30
Available online at http://pubs.sciepub.com/jbet/1/2/2
© Science and Education Publishing
DOI:10.12691/jbet-1-2-2
Mining Quantitative Association Rules in HIV Protein Sequences
Anubha Dubey1,*, Usha Chouhan2
1Department of Bioinformatics, Manit, Bhopal (M.P), India
2Department of Mathematics, Manit, Bhopal (M.P), India
*Corresponding author: anubhadubey@rediffmail.com
Received July 11, 2013; Revised August 02, 2013; Accepted August 05, 2013
Abstract Although much research has gone into understanding the composition and nature of proteins, many
aspects remain poorly understood. It is now generally believed that the amino acid sequences of proteins are not
random, and thus that the patterns of amino acids observed in protein sequences are also non-random. In this
study, we attempt to decipher the nature of the associations between the different amino acids present in an
HIV protein. This basic analysis provides insights into the co-occurrence of certain amino acids in an HIV
protein. Such association rules are desirable for enhancing our understanding of protein composition and hold the
potential to give clues regarding the global interactions among particular sets of amino acids occurring in
proteins. The aim of association rule mining is to reveal underlying interactions in large sets of data items.
Knowledge of these rules or constraints is highly desirable for the in vitro synthesis of artificial proteins. It will
also give new insights into protein-protein interactions in HIV.
Keywords: data mining, quantitative association rule mining, protein composition.
Cite This Article: Dubey, Anubha, and Usha Chouhan, “Mining Quantitative Association Rules in HIV
Protein Sequences.” Journal of Biomedical Engineering and Technology 1, no. 2 (2013): 26-30. doi:
10.12691/jbet-1-2-2.
1. Introduction
Proteins are an important constituent of the cellular
machinery of any organism. Recombinant DNA technologies
have provided tools for the rapid determination of DNA
sequences and, by inference, the amino acid sequences of
proteins from structural genes [1]. Proteins are
sequences made up of 20 types of amino acids. Each
amino acid is represented by a single-letter code, as
given in Table 1. Each protein adopts a unique
3-dimensional structure, which is determined by its
amino acid sequence. A slight change in the sequence
might completely change the functioning of the protein.
Just as the letters of the alphabet can be combined to form
an almost endless variety of words, amino acids can be
linked together in varying sequences to form a vast variety
of proteins [13].
Table 1. Single-letter codes of amino acids
S.No.  AA code  Full name      Side chain polarity  Side chain charge  Hydropathy index
1      A        Alanine        nonpolar             neutral             1.8
2      C        Cysteine       nonpolar             neutral             2.5
3      D        Aspartic acid  polar                negative           -3.5
4      E        Glutamic acid  polar                negative           -3.5
5      F        Phenylalanine  nonpolar             neutral             2.8
6      G        Glycine        nonpolar             neutral            -0.4
7      H        Histidine      polar                positive           -3.2
8      I        Isoleucine     nonpolar             neutral             4.5
9      K        Lysine         polar                positive           -3.9
10     L        Leucine        nonpolar             neutral             3.8
11     M        Methionine     nonpolar             neutral             1.9
12     N        Asparagine     polar                neutral            -3.5
13     P        Proline        nonpolar             neutral            -1.6
14     Q        Glutamine      polar                neutral            -3.5
15     R        Arginine       polar                positive           -4.5
16     S        Serine         polar                neutral            -0.8
17     T        Threonine      polar                neutral            -0.7
18     V        Valine         nonpolar             neutral             4.2
19     W        Tryptophan     nonpolar             neutral            -0.9
20     Y        Tyrosine       polar                neutral            -1.3
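
As a rough illustration of how the single-letter codes in Table 1 can be turned into quantitative attributes, the sketch below computes the percentage composition of each amino acid in a sequence. It is a minimal example under stated assumptions, not the procedure used in this study: the function name composition and the toy sequence are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Single-letter codes from Table 1 mapped to their full names.
AMINO_ACIDS = {
    "A": "Alanine", "C": "Cysteine", "D": "Aspartic acid", "E": "Glutamic acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine", "I": "Isoleucine",
    "K": "Lysine", "L": "Leucine", "M": "Methionine", "N": "Asparagine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
}


def composition(sequence):
    """Return the percentage of each of the 20 amino acids in a sequence.

    Hypothetical helper for illustration; rejects codes not in Table 1.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")
    counts = Counter(seq)
    # Counter returns 0 for absent residues, so all 20 keys are present.
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


# Toy fragment made up for this example, not a real HIV protein sequence.
example = "MGARASVLSG"
comp = composition(example)
print({aa: round(pct, 1) for aa, pct in comp.items() if pct > 0})
```

Each sequence then becomes a row of 20 numeric attributes, which is the kind of input a quantitative association rule miner typically expects.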