OST, www.obs-ost.fr
OBSERVATOIRE DES SCIENCES ET DES TECHNIQUES
Serving all R&D stakeholders

OST practical notes

Matching patent data with financial data¹

An organization’s patent portfolio forms a critical part of its assets and may greatly influence its strategy and market value. Nevertheless, even though patent offices have recently changed their data policies to make patent data easier to access, most patent data users are still unable to identify patent applicants² unambiguously. One of the main reasons is that patent data are collected to establish the novelty of an invention and to make it public as a new stock of knowledge for later inventors, whoever the applicants are. Quality control therefore focuses first and foremost on scientific and legal information. For instance, every patent has a unique ID number, so that the invention can be identified precisely and scientific links between patents can be established. By contrast, no quality check is applied to applicants’ names and addresses, and in some cases (US data, for instance) address data other than the country are unavailable. Identifying existing applicants by name or address (hereafter, disambiguating them) in order to build a unique identifier for each patenting entity is therefore not a simple task.

Another issue is the timeframe: patent attributes are usually a snapshot of the data at the moment the dataset producer (for instance EPO or USPTO) releases them. If the producer does not receive updates (for example because applicants are not required to provide them, or because the data producer has no need of such data³), these attributes are frozen as of the last update. Patents granted 10 years ago will, in some cases, have been applied for by entities that have since expired, split, merged or been acquired.
For instance, it may not be possible to assign patents owned by Compaq to Hewlett Packard, which acquired it in 2002, since the patents might still be in the name of Compaq. Last but not least, patent data do not include applicant group structures, so patent data alone cannot be used to consolidate patent portfolios by “Global Ultimate Owner” (GUO).

For these reasons a third-party data source is needed, containing for example company history and structure. Several existing data sources provide financial data, indicators, private equity data and portfolios organized by company, together with company ownership structure and history in terms of mergers and acquisitions, name changes or other events that affect that structure.

The purpose of this document is to illustrate the algorithm that we are developing to match patent and financial data through a general-purpose methodology that may also be applied when reconciling other data sources. Other attempts have been made previously. For instance, Grid Thoma et al. (2010) described a methodology to match Amadeus [EU companies] and Patstat for EPO and USPTO patents using a powerful string comparison methodology. In our project we have built on these previous efforts, but we extend the match scope to all application authorities and all companies contained in ORBIS, even if extensive usage has been restricted to three authorities (EPO, USPTO and INPI). We have also focused particularly on high-tech companies, selected from the industrial sector using the NACE 2.0 aggregation⁴ for the purposes of the research project in which this algorithm has been developed. We also enrich our methodology by complementing string comparison tools with extensive use of other data from both datasets in order to remove false positive matches.

The method that we are developing is characterized by three steps: harmonization, match and filtering.
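By way of illustration only, a harmonization, match and filtering sequence of this kind could be organized as in the minimal Python sketch below. It is not the algorithm developed in the project: the function names, the short list of legal forms, the 0.85 similarity threshold, the use of the standard library's `difflib.SequenceMatcher` as string comparison tool, the country-based filter and the sample records are all assumptions made for the example.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to drop; a real list would be far longer.
LEGAL_FORMS = {"inc", "corp", "corporation", "co", "company",
               "ltd", "sa", "gmbh", "ag", "plc", "llc"}


def harmonize(name):
    """Step 1: lowercase, strip accents and punctuation, drop legal-form tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    tokens = re.sub(r"[^a-z0-9 ]", " ", ascii_name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def match(applicants, companies, threshold=0.85):
    """Step 2: keep every applicant/company pair whose harmonized names are similar."""
    pairs = []
    for app in applicants:
        for comp in companies:
            score = SequenceMatcher(
                None, harmonize(app["name"]), harmonize(comp["name"])
            ).ratio()
            if score >= threshold:
                pairs.append((app, comp, score))
    return pairs


def filter_pairs(pairs):
    """Step 3: drop candidate pairs whose other attributes disagree (here: country)."""
    return [(a, c, s) for a, c, s in pairs if a["country"] == c["country"]]


# Invented sample records, standing in for patent applicants and company data.
applicants = [
    {"name": "ACME Optics, Inc.", "country": "US"},
    {"name": "Borealis Pharma GmbH", "country": "DE"},
]
companies = [
    {"name": "Acme Optics Inc", "country": "US"},
    {"name": "Acme Optics Ltd", "country": "GB"},
]

candidates = match(applicants, companies)  # two name-based candidates
kept = filter_pairs(candidates)            # the GB homonym is removed
```

In this toy example the name comparison alone pairs the US applicant with both the US and the GB company, and the additional attribute is what removes the false positive. The all-pairs loop is only practical for small inputs.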
These three steps are illustrated in the following paragraphs. Before going further, we describe the data sources used.

1 This note is a final comment on the results of a project named ‘Valeur Brevet’ carried out with Emilie-Pauline Gallié, Lorenzo Cassi, Anne Plunket, Michele Pezzoni and Valérie Mérindol.
2 Defined as a ‘partnership, corporation, or other organization having the capacity to negotiate contracts, assume financial obligations, and pay off debts’ (http://www.w3.org/2009/03/xbrl/naming.html).
3 Patent offices like EPO or USPTO usually cease data collection when they grant the patent. This means, for instance, that changes in applicant name or patent ownership after the grant are not necessarily reported in the database.
4 Defined in http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/High-tech_statistics