Feature Selection for High-Dimensional Industrial Data

Michael Bensch, Michael Schröder, Martin Bogdan, Wolfgang Rosenstiel

Eberhard-Karls-Universität Tübingen - Dept. of Computer Engineering
Sand 13, 72076 Tübingen - Germany

Abstract. In the semiconductor industry the number of circuits per chip is still increasing drastically. This trend, together with strong competition, makes quality control and quality assurance particularly important. As a result, a vast amount of data is recorded during the fabrication process. This data is very complex in structure and heavily affected by noise. Evaluating it is a vital task in supporting engineers who analyse process problems. The current work tackles this problem by identifying the features responsible for success or failure in the manufacturing process (feature selection).

1 Introduction

As part of the project Overall Equipment Efficiency (OEE)¹, the work package Online Tool Controlling (OTC)² deals with identifying problems in the chip-production pipeline. When the number of defective chips reaches a specified level, feature selection can guide the engineer towards a solution by hinting at which features could be responsible. High data dimensionality, unbalanced classes (low yield values are rare), and noise complicate the problem. Moreover, there is no guarantee that the Process Control Monitoring (PCM) data contains all problem-relevant information. However, PCM data seems well suited to feature selection, as the electrical measurements contain many linearly dependent features.

We do not consider feature extraction methods such as PCA here. Extraction methods are less useful to engineers because the extracted features lose the semantics of the original measurements. We therefore concentrate on feature selection (an overview is given in [1]; previous work comparing feature selection algorithms can be found in [2, 3]).

We study the following goals in the context of industrial production processes:

1. The selection of a very small set of important features may give the engineer insight into a particular production problem.
2. Using only relevant features may lead to a more robust classifier for yield prediction.

¹ Funded by the BMBF, in cooperation with Atmel Germany GmbH, camLine GmbH, Elmos Semiconductor AG, Fraunhofer, Philips SC GmbH, Robert Bosch GmbH, TIP GmbH, X-FAB Semiconductor Foundries AG, ZMD
² Funded by the BMBF and supported by the project partners Robert Bosch GmbH and Elmos Semiconductor AG
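
The remark in the introduction that PCM measurements contain many linearly dependent features can be illustrated with a small sketch. It is not the selection method studied in this work, only a simple pairwise redundancy filter on synthetic data: a feature is set aside if its absolute Pearson correlation with an already kept feature exceeds a threshold. The function names `pearson` and `split_redundant`, the threshold of 0.95, and the data are assumptions made for illustration. A pairwise filter of this kind would not detect a feature that depends linearly on several others jointly.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def split_redundant(columns, threshold=0.95):
    """Greedily split feature indices into kept and redundant ones.

    A feature is redundant if it is almost a linear function of a single
    feature that was kept earlier (|correlation| above the threshold).
    """
    keep, drop = [], []
    for j, col in enumerate(columns):
        if any(abs(pearson(col, columns[k])) > threshold for k in keep):
            drop.append(j)
        else:
            keep.append(j)
    return keep, drop


# Synthetic stand-in for measurement data (hypothetical, not PCM data):
# features 2 and 3 are linear functions of features 0 and 1, the latter
# with a small amount of added noise.
random.seed(0)
a = [random.gauss(0, 1) for _ in range(200)]
b = [random.gauss(0, 1) for _ in range(200)]
columns = [
    a,
    b,
    [2.0 * v + 0.5 for v in a],
    [-v + 0.01 * random.gauss(0, 1) for v in b],
]

keep, drop = split_redundant(columns)
print("kept features:", keep)
print("redundant features:", drop)
```

Because the filter only discards features and never combines them, the kept features remain the original measurements, so their meaning to the engineer is preserved. This is the property that motivates selection over extraction above.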