Helping scientific communication: The NDOC databank

Raffaele Conte

Abstract

Scientific communication is essential for the development of science because it permits the spread of knowledge. Growing recognition of the importance of data sharing has made a huge amount of information available, and this information needs to be appropriately organised. Databases are the best way to arrange such inputs, as they allow the entire spectrum of research to be analysed. This paper focuses on “negative databases”, namely databanks that record failed interactions, such as the Negatome Database for Organic Chemistry reactions (NDOC databank), and considers their important contribution in raising awareness of outcomes that are not regularly published.

Introduction

Scientific communication refers to the sharing of experimental data among scientists. These pieces of knowledge are often indispensable in organising a project, so they need to be carefully checked and promptly delivered. Researchers regularly receive requests for information related to their activities, where information means research-related findings, methods, data, or materials. This is common even when the results have already been published in journals or technical reports, because these formats frequently omit critical information or detailed descriptions of techniques. In addition, some journals state formal policies on an author’s responsibilities for sharing publication-related data and materials, and articles are not always available as “open source”. Scientists who want to replicate or extend the results of a published study therefore frequently approach the author of the article directly to request additional information. The ethical principle that data sharing is important for the progress of science is supported by the professional norm of communalism.
In fact, Merton’s early formulation of this norm states that “Secrecy is the antithesis of this norm; full and open communication its enactment” 1. The accepted understanding is that withholding information violates a uniformly revered norm of sharing and is a socially unacceptable and morally unjustifiable act 2. Fortunately, the norm of communalism is for the most part respected, and as a result an increasing amount of information is becoming available that must be organised.

A database is a set of data with a regular structure, organised so that the desired information can be found easily. Data are collections of distinct pieces of information. In a databank, data are formatted in a specific way for use in analysis or in decision making. A database is organised as a collection of records, each of which contains one or more fields about some entity. Examples of entities in scientific databases are chemicals, DNA sequences, protein structures, etc.

The use of databases in science stems from the need to organise the enormous amount of information produced by the growth of scientific research and the advent of novel scientific disciplines, which have led to an exponential increase in the number of journals, books, congress proceedings, dissertations, patents, technical reports, and other papers reporting research results. These publications are referred to as “primary publications” or “primary sources of information”. To analyse this huge number of scientific works conveniently, “secondary publications” or “secondary information sources”, which process and summarise primary publications, are used 3. In some cases the comparison with the bibliography carried out by “secondary publications” is also used in scientific validation.
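
The organisation of a database into records and fields, as described above, can be illustrated with a minimal Python sketch. It is not taken from any of the databanks discussed in this paper: the `ChemicalRecord` type, its three fields and the helper functions are hypothetical names chosen for illustration, and the three entries are ordinary textbook values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChemicalRecord:
    """One record about one entity (here, a chemical); each attribute is a field.

    The choice of fields is an assumption made for this illustration.
    """
    name: str
    formula: str
    molar_mass: float  # g/mol


# A tiny "database": a collection of records with a regular structure.
RECORDS = [
    ChemicalRecord("water", "H2O", 18.015),
    ChemicalRecord("ethanol", "C2H6O", 46.069),
    ChemicalRecord("benzene", "C6H6", 78.114),
]


def find_by_field(records, field_name, value):
    """Return every record whose field `field_name` equals `value` (linear scan)."""
    return [r for r in records if getattr(r, field_name) == value]


def build_index(records, field_name):
    """Map each value of `field_name` to the list of records that carry it."""
    index = {}
    for record in records:
        index.setdefault(getattr(record, field_name), []).append(record)
    return index


if __name__ == "__main__":
    # Look up a record by one of its fields.
    print(find_by_field(RECORDS, "formula", "C6H6"))
    # An index avoids rescanning the whole collection on every query.
    by_name = build_index(RECORDS, "name")
    print(by_name["water"])
```

Because every record has the same fields, any field can serve as a search key. Building an index on a field is one simple way to find the desired information without scanning the whole collection each time.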
Databases are the main example of a secondary source of information because they let users search and access the most relevant literature more conveniently 3. Before computers became ubiquitous, printed publications called “index journals” and “abstract journals” served this purpose, and they are widely considered to be the precursors of current online databases. These secondary publications consisted of several types of index: texts that list author names, topics, journal titles, citations, etc., in alphabetical order and point to the original articles 3. Nowadays, given the countless resources available and especially the reach of the Internet, there is no better way to manipulate data than through a database. As an example, below is a list of some of the most used databanks and the advantages of using them.

Figure 1: Wordle of “NDOCdatabank”. Available online at ( http://www.wordle.net/show/wrdl/7780006/NDOCdatabank_words )

List of the most used databases

Accelrys: This is a database developed by Accelrys Inc., a software company that provides programs for chemical, materials and bioscience research for the pharmaceutical, biotechnology, consumer packaged goods, aerospace, energy and chemical industries. The application is built around a management system for storing, searching and retrieving chemical structures, experimental data and registration information. It also includes desktop productivity tools, similar to Microsoft Excel and Microsoft Access, that make the software easy to use. Accelrys offers a wide range of chemistry databases, mostly reaction-based. For example, it gives access to the Royal Society of Chemistry’s Methods in Organic Synthesis (MOS), a monthly periodical that abstracts more than 100 internationally recognised organic chemistry journals.
The Accelrys electronic version of MOS adds about 3,300 reactions each year, is updated quarterly and currently stands at more than 33,000 indexed reactions going back to 1991 4. Accelrys also offers a “Failed Reactions database” that lists reactions that give no products or produce unexpected results.

Beilstein or Reaxys: CrossFire Beilstein, a product of MDL, a subsidiary of Elsevier Science, is the largest database in the field of organic chemistry. It indexes the chemical literature from 1771 onwards and contains structures, physical properties, reactions and literature citations for more than eight million compounds. It also includes details on ecological chemistry and on the synthesis, pharmacology and toxicology of each compound. Beilstein holds more than five million chemical reactions and 35 million associated chemical property and bioactivity records 4.

CAS: The Chemical Abstracts Service substance database, or CAS Registry, is the largest collection of substance information. It is managed by the Chemical Abstracts Service (a division of the American Chemical Society) and currently contains 35,621,639 entries with structures and chemical names of molecules. Each registered substance is identified by a unique CAS registry number, which permits cross-referencing of the substance across many databases, chemical inventories and reference works. CAS is supported by two databases: CAplus, which consists of bibliographic information and abstracts for chemistry-related published articles, and “Registry”, which contains information on more than 71 million organic and inorganic substances and more than 64 million protein and DNA sequences 4.

ChemFinder: This is a portal to free and subscription-based scientific databases that allows searches for chemical structures, physical properties, reactions and purchasing information by entering the CAS registry number of the required compound.
The site was created by CambridgeSoft alongside ChemReact, its largest available database. ChemReact carries data on more than 300,000 reactions abstracted from the chemical literature spanning 1974-1991. It provides the reactant and product structures, the necessary solvents, the required reagents and catalysts, and information on the yield and side products of the reactions 4.

Tripos: Tripos is a set of tools that combine chemical data, structure searching and molecular analysis, and is described as “discovery research software” for pharmaceutical and biotechnology researchers. According to its developers, “this software speeds and improves the processes of molecular discovery efforts and the identification and optimization of new compounds spanning dozens of industries from the largest pharma companies to emerging biotech firms, from agrochemical and chemical makers to the creators of flavours and fragrances” 5.