SUPPORTING INFORMATION

Phylomemetic Patterns in Science Evolution: The Rise and Fall of Scientific Fields

David Chavalarias*†, Jean-Philippe Cointet*‡

SI.1 Details of the text-mining procedure

The complete processing of textual data can be described as follows. It first relies on classical linguistic processes, at the end of which sets of candidate noun phrases are defined:

1. POS-tagging: a Part-of-Speech tagging tool first tags every term according to its grammatical type: noun, adjective, verb, adverb, etc. The NLTK module was used extensively for this step.

2. Chunking: tags are then used to identify noun phrases in the corpus. A noun phrase can be minimally defined as a pattern of successive nouns and adjectives. This step builds the set of our possible multi-terms.

3. Normalizing: we correct small spelling differences between multi-terms arising from the presence or absence of hyphens. For example, we consider that the multi-terms "extra-cellular matrix", "extracellular matrix" and "extra cellular matrix" belong to the same class.

4. Stemming: multi-terms can be combined if they share the same stem. Hence, singular and plural forms are automatically grouped into the same class (e.g. "carcinoma" and "carcinomas" are two possible forms of the stem "carcinoma").

The grammatical constraints provide an exhaustive list of possible multi-terms grouped into stemmed classes; however, we still need to select the N most relevant of these. Two assumptions are classically made in multi-word automatic term recognition tasks: relevant terms tend to appear more frequently, and longer phrases are more likely to be relevant. These criteria select the multi-terms that convey a certain semantic unit, that is to say those with the highest "unithood" (Van Eck et al., 2011).
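The normalizing and stemming steps (3 and 4) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class key (lower-casing, then stripping hyphens and spaces) and the crude suffix-stripping stemmer are assumptions chosen to reproduce the two examples given above; a real pipeline would use a proper stemmer.

```python
import re
from collections import defaultdict

def stem(term: str) -> str:
    # Crude singular/plural grouping: strip a trailing "s" from each word
    # (illustrative assumption; a real pipeline would use e.g. a Porter stemmer).
    words = term.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

def group_classes(candidates):
    # Map each candidate multi-term to a class key that ignores
    # hyphen/spacing variants and singular/plural differences.
    classes = defaultdict(set)
    for term in candidates:
        key = re.sub(r"[-\s]+", "", stem(term))
        classes[key].add(term)
    return classes

candidates = ["extra-cellular matrix", "extracellular matrix",
              "extra cellular matrix", "carcinoma", "carcinomas"]
classes = group_classes(candidates)
# the three hyphen/spacing variants fall into one class,
# "carcinoma"/"carcinomas" into another
```

Under these assumptions the five candidates collapse into two stemmed classes, matching the examples in the text.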
To sort the list of candidate terms we then apply a simple statistical criterion, which entails the following steps:

• Counting: we enumerate every multi-term belonging to a given stemmed class in the whole corpus, in order to obtain its total number of occurrences (frequency). In this step, if two candidate multi-terms are nested, we increment the frequency of the longer chain only. For example, if "Insulin Growth Factor" is found in an abstract, we increment the multi-stem "Insulin Growth Factor" only, and not smaller stems such as "Growth Factor".

• Unithood processing: following the method of Frantzi and Ananiadou (2000), we associate each multi-stem with its unithood, defined as u(i) = log(l_i + 1) f_i, where l_i is the number of terms involved in the multi-term i and f_i designates its frequency.

• Pruning: items are then sorted according to their unithood, and the list is pruned to the 4N multi-stems with the highest unithood. This step removes less frequent multi-stems but, more importantly, makes it possible to implement the following second-order analysis on the pruned list.

* Complex Systems Institute of Paris Ile-de-France (ISC-PIF), Paris, France
† CAMS, CNRS - EHESS, Paris, France
‡ INRA-SenS, INRA, Marne-la-Vallée