arXiv:1904.12587v1 [cs.IR] 4 Apr 2019

Text Classification Components for Detecting Descriptions and Names of CAD models

Thomas Köllmer
Fraunhofer Institute for Digital Media Technology IDMT
D-98693 Ilmenau
thomas.koellmer@idmt.fhg.de

Jens Hasselbach
Fraunhofer Institute for Digital Media Technology IDMT
D-98693 Ilmenau
jens.hasselbach@idmt.fhg.de

Patrick Aichroth
Fraunhofer Institute for Digital Media Technology IDMT
D-98693 Ilmenau
patrick.aichroth@idmt.fhg.de

Abstract—We apply text analysis approaches to a specialized search engine for 3D CAD models and associated products. The main goals are to distinguish between actual product descriptions and other text on a website, and to decide whether a given text is or contains a product name. For this we use paragraph vectors for text classification, a character-level long short-term memory network (LSTM) for single-word classification, and an LSTM tagger based on word embeddings for detecting product names within sentences. Despite the need to collect bigger datasets in our specific problem domain, the first results are promising and partially fit for production use.

Index Terms—text processing, machine learning, search engines

I. INTRODUCTION

Text analysis is a crucial processing step when collecting relevant content from websites and filtering out non-relevant content. The context of this paper is the prototype of a search engine specialized for the retrieval of 3D CAD (computer-aided design) models. Manufacturers of furniture, e.g., for offices, provide CAD files on their websites to be used by architects in their planning tools. However, those files are often hard to find on their web pages, e.g., hidden in a separate download area. To make matters worse, every manufacturer's website has its own structure. This makes it hard to quickly find the required models and invites the user to download a model only once and miss updates in the process.
This work is part of an umbrella project whose goal is not only to make various manufacturers' sites searchable (this can be done by all major search engines and by the manufacturers' sites themselves), but also to present products in a unified way, joining product description texts, pictures and CAD files in one coherent interface. The diversity of manufacturers' website presentations, and the fact that product descriptions are mostly separate from their associated CAD files, make this challenging and require a good understanding of how the products are organized within a website.

Given the number of manufacturers and models, it is not feasible to hand-craft crawlers for every manufacturer, nor is there a widely accepted standard for indicating the information we need in a machine-readable format. Using a combination of heuristics and automatic content analysis (text analysis, image analysis, CAD model analysis), we need to find a way to extract the information at an acceptable quality level with minimal human intervention.

This paper puts a spotlight on the textual analysis problems that arise in such an endeavour. First, we detect whether a piece of text resembles a product description. Second, we decide whether a given text is or contains a product name. Both are important for summarizing product information and for linking product pages with their associated CAD model files, e.g., by searching for the product name in a zip archive consisting of CAD files.

While the overall system honours the layout of the page and hints given by the markup or the URL, the goal of the approaches discussed here is to operate on plain text only. That way, they can be used as supporting components for the heuristics that analyse the layout of the page. They also remain useful in cases where there is no clear distinction layout-wise at all.
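The linking step mentioned above, searching for a product name in a zip archive of CAD files, could look roughly like the following minimal sketch. The paper does not specify how this matching is done; the helper names `normalize` and `members_matching`, the normalization rule, and the file names are all hypothetical and serve only as illustration.

```python
import io
import re
import zipfile


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def members_matching(archive, product_name):
    """Return archive member names whose normalized form contains the product name."""
    needle = normalize(product_name)
    if not needle:
        return []
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if needle in normalize(name)]


# Build a small in-memory archive with made-up file names (assumption, not real data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cad/Aero_Chair-2D.dwg", b"")
    zf.writestr("cad/aero chair 3d.dxf", b"")
    zf.writestr("cad/side_table.dwg", b"")

buffer.seek(0)
matches = members_matching(buffer, "Aero Chair")
print(matches)
```

Normalizing both sides makes the match tolerant of separators and capitalization in file names, at the cost of possible false positives for very short product names.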
Combined with visual features (color, capitalization, placement on the page), the proposed techniques help to find the right product name, but also to detect false positives, which can then be suppressed from the result view or flagged for manual classification. In the scope of this paper, we do not apply dictionary approaches, e.g., filtering imprints and company names, but try to build general-purpose classifiers using machine learning.

II. RELATED WORK

A key innovation in text analysis was the use of unsupervised learning to create word embeddings that capture the semantics of words surprisingly well [1].

Text classification traditionally depends on calculating statistics on the input text, e.g., using Naive Bayes approaches or Support Vector Machines (SVM). Neural networks are competitive in this domain as well, either word-based [2] or character-based [3].

Part-of-speech (POS) tagging has made a huge leap forward in recent years by using recurrent neural networks (RNNs) instead of hand-crafted features, or by combining both. The winning approach of the 2017 CoNLL task on part-of-speech tagging is based on LSTMs (long short-term memory networks, a recurrent neural network architecture) [4]. In fact, all but one of the top 10 approaches in this competition are based on recurrent neural networks, most of the time a bidirectional