Supplementary data

Evolution of the achaete-scute complex in insects: convergent duplication of proneural genes

Bárbara Negre and Pat Simpson

Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK

Corresponding author: Simpson, P. (pas49@cam.ac.uk)

Supplementary Figure S1. Conserved domains in insect AS-C proteins. Amino acid alignments of the bHLH domain, Ase-specific motif and C-terminal motif of the AS-C genes shown in Figure 2. Amino acids of the alpha-helices that are buried in the interior of the four-helix bundle [1] are denoted with "!".

Phylogenetic analysis

Because reliable alignments of the AS-C proteins are difficult to obtain outside the most conserved domain, and to maximise the resolution of the phylogenetic analysis, we used three datasets. The first includes all the AS-C proteins (26 proteins), the second ASH-like proteins only (17 proteins) and the third Ase-like proteins only (9 proteins). Each set of sequences was aligned with ClustalW [2] and manually corrected with the aid of BioEdit. Only amino acids within conserved blocks as defined by Gblocks [3] were used in the analysis, resulting in datasets of 77, 109 and 158 amino acids respectively. Note that the three datasets are partially redundant; for example, all three include most of the bHLH region.

For each dataset a phylogenetic tree was obtained with PhyML online [4], using the JTT substitution model and 500 bootstrap replicates. The tree shown in Figure 2 is a strict consensus of the three trees obtained. Each gene is present in two of the trees. Of the bootstrap values shown in Figure 2, the first corresponds to the tree obtained with all AS-C genes, and the second to the tree obtained with ASH genes alone or Ase genes alone. All trees show the same topology, the only exception being Agam/ASH, which changes position in relation to the other mosquito ASH proteins; this branch of the tree is shown as unresolved (Figure 2).
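As an illustration of what the 500 replicates involve, the following minimal sketch resamples alignment columns with replacement, which is the standard nonparametric bootstrap for sequence alignments. It is not the PhyML implementation: the function name `bootstrap_alignment`, the sequence names and the toy residues are all hypothetical and are not taken from the AS-C datasets.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    `alignment` maps sequence names to aligned strings of equal length.
    Columns are drawn with replacement, so the replicate has the same
    number of columns as the input.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Draw `length` column indices with replacement.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


# Toy alignment with made-up residues (assumption: NOT the real AS-C data).
toy = {
    "seqA": "RNERERNRVK",
    "seqB": "RNARERNRVK",
    "seqC": "RNERERKRVQ",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(500)]
```

In a full analysis a tree would be inferred from each replicate, and the bootstrap value of a branch would be the percentage of replicate trees in which that branch is recovered.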
tblastn analyses of the unassembled reads of the genomes of two new Anopheles strains confirm the presence of a single gene in this species. All other branches show bootstrap support higher than 60/100, with the exception of Tcas/ASH, which is always located basal to all other ASH proteins but with bootstrap values lower than 40. This branch is therefore also shown as unresolved (Figure 2).

To test the hypothesis that ASH genes duplicated independently in flies, mosquitoes and butterflies, we compared the likelihood of the tree we obtained with six alternative trees consistent with ancestral duplications of ASH. All seven tree topologies were compared with the Shimodaira-Hasegawa test, implemented in the proml program of the PHYLIP package. This test was performed on the ASH dataset only (17 sequences, 109 amino acids).

Evolution of pcl genes within the Drosophila AS-C

In D. melanogaster there is an unrelated gene, pcl, located within the AS-C between l’sc and ase. This gene is present in the same position in all twelve Drosophila species, but is absent in all other insects (Figure 1). In some species there are several pcl genes in this location: D. ananassae, D. virilis and D. mojavensis have three and D. grimshawi four. A comparison between the pcl genes located within the AS-C (Muller element A) and other related aspartic protease family genes revealed a complex evolutionary history (data not shown). The gene pcl transposed into the AS-C before the diversification of the Drosophila genus. At this position the ancestral pcl gene gave rise to pcl and pcl-like (pcl-l) by tandem duplication. The pcl-like gene is still located in the AS-C in the species of the Drosophila subgenus, where it duplicated a second time, giving rise to pcl-like2 (pcl-l2). However, in the ancestor of the Sophophora subgenus the orthologue of the pcl-like genes (Dmel\CG13095) was transposed to Muller element B, where it is now located inside an intron of the gene Dh31.
The remaining pcl genes originated by independent tandem duplications in each lineage: two in D. ananassae (Dana\pcl2 and Dana\pcl3) and one in D. grimshawi (Dgri\pcl2).

Supplementary Table 1. Genome Sequences Analysed.

Nucleotide sequences            Nucleotides        Reference
D. melanogaster Chromosome X    250,001-370,000    Release 5.10
D. simulans Chromosome X        170,001-280,000    [5]
D. erecta Scaffold_4644         195,001-315,000    [5]