Abstract—Medical studies often require methods for parameter selection as a second processing step, after the database has been designed and filled with information. One common task is to select the fields that act as risk factors, using well-known methods, in order to find the most relevant risk factors and to establish a possible hierarchy among them. Several methods are available for this purpose, one of the best known being binary logistic regression. We present the mathematical principles of this method and a practical example of its use in analyzing the influence of 10 different psychiatric diagnoses on 4 different types of offences (in a database of 289 psychiatric patients involved in different types of offences). Finally, we make some observations about the relation between the risk factor hierarchy established through binary logistic regression and the individual risks, as well as the results of the Chi-squared test. We show that the hierarchy built using binary logistic regression does not agree with the direct order of the risk factors, even though it would be natural to assume that the two always agree.

Keywords—Databases, risk factors, binary logistic regression, hierarchy.

I. INTRODUCTION

In medical statistical studies, a very important task is to design and create the database for data collection, because it must offer the optimal framework for all the physician's demands concerning data storage and further processing. After the data are stored in a regular database, the physician usually begins the second step of data processing by trying to select the most relevant parameters for further medical interpretation. Many statistical methods are available for this purpose, and they are strongly connected with the nature of the data and the researcher's projects.
A classic method, for example, is principal components analysis, which takes into consideration all the parameters stored in a database and selects the most important ones, using some mathematical principles, by identifying and eliminating the parameters whose absence does not change the global nature and behavior of the data. Another method in the same area is discriminant analysis, possibly used in connection with different algorithms for data clustering.

Manuscript received February 12, 2008. C. G. Dascălu, Ph.D., Lecturer, is with the University of Medicine and Pharmacy “Gr. T. Popa”, Iaşi, Romania, The Medical Informatics and Biostatistics Department, Faculty of Dental Medicine (phone: 0040-232-206441, e-mail: cdascalu@umfiasi.ro). E. M. Carausu, Ph.D., Assoc. Professor, is with the University of Medicine and Pharmacy “Gr. T. Popa”, Iaşi, Romania, The Public Health and Medical Management Department, Faculty of Dental Medicine (e-mail: cm72@email.ro). D. Manuc is with the Public Health Ministry, Bucharest, Romania (e-mail: cotrutz@yahoo.com).

Another method for parameter selection, somewhat more complicated, takes into consideration the internal links between parameters. This is binary logistic regression, which can be viewed as a generalization of linear regression models and is useful when we want to investigate the connections between one or more categorical independent variables (ordinal or binary) and a binary categorical dependent variable. This method is very useful in the study of risk factors for a certain situation (diagnosis, behavior, and so on), because it builds a model that establishes a hierarchy among all the possible risk factors by selecting the most relevant ones, those that have predictive value for the presence or absence of the investigated situation.
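As a minimal illustration of the kind of model described here, the sketch below fits a binary logistic regression with a single binary risk factor and a binary outcome. The counts are hypothetical and are not taken from the database analyzed in this paper; the function name `fit_logistic` and the Newton-Raphson fitting procedure are assumptions made for the example, not the software or algorithm used by the authors. For this simplest case, the exponential of the fitted slope coincides with the odds ratio computed directly from the 2x2 table.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) for one predictor
    by Newton-Raphson maximization of the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the negative Hessian
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical 2x2 table (illustrative counts only):
#                    outcome present   outcome absent
# factor present          a = 30           b = 20
# factor absent           c = 15           d = 35
a, b, c, d = 30, 20, 15, 35
x = [1] * (a + b) + [0] * (c + d)             # risk factor indicator
y = [1] * a + [0] * b + [1] * c + [0] * d     # outcome indicator

table_odds_ratio = (a * d) / (b * c)
b0, b1 = fit_logistic(x, y)

print("odds ratio from the 2x2 table:", table_odds_ratio)
print("exp(b1) from the fitted model:", math.exp(b1))
```

With several risk factors entered together, each coefficient is adjusted for the others, so the exponentiated coefficients need not reproduce the odds ratios obtained from the separate 2x2 tables.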
In this way the database we are working with can be substantially simplified (and possibly divided into smaller data sets) in cases where we want to obtain statistical results about a well-defined diagnosis. In such a case we select from the main database only the records where the investigated diagnosis is found and, for those records, only the fields relevant to the diagnosis, as identified through binary logistic regression, in order to direct the further statistical analysis only to those data.

Methods for Data Selection in Medical Databases: The Binary Logistic Regression - Relations with the Calculated Risks
Cristina G. Dascalu, Elena Mihaela Carausu, and Daniela Manuc
International Journal of Biological and Life Sciences 3:1 2007

II. MATERIAL AND METHODS

Binary logistic regression is used, as stated above, when we want to predict the presence or absence of a certain parameter based on the values of a set of independent predictor variables [1], which are categorical, ordinal or binary. The coefficients of the logistic regression curve can be used to estimate the relative risks (odds ratio) for each independent variable used in the model. It is important to note that, in order to build a logistic regression model, it is not necessary to check in advance the usual requirements about the distribution of the predictor variables' values or about their variance. Logistic regression can therefore be viewed as an alternative when the compulsory requirements of discriminant analysis, for example, are not fulfilled. Because of these characteristics, logistic regression is used with good results especially in epidemiologic databases, because this model approximates the probability of finding a certain result (the dependent variable) when a certain set of conditions is checked (the