Semantic HMC: Ontology-described hierarchy
maintenance in Big Data context
Rafael Peixoto 1,2, Christophe Cruz 2, Nuno Silva 1
1 GECAD - ISEP, Polytechnic of Porto, Porto, Portugal
{rafpp,nps}@isep.ipp.pt
2 LE2I UMR6306, CNRS, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
christophe.cruz@u-bourgogne.fr
Abstract. One of the biggest challenges in Big Data is exploiting Value
from large volumes of constantly changing data. To exploit this value, one
must focus on extracting knowledge from these Big Data sources. To extract
knowledge and value from unstructured text, we propose a Hierarchical
Multi-Label Classification process called Semantic HMC, which uses ontologies
to describe the predictive model, including the label hierarchy and the
classification rules. To avoid overloading the user, this process automatically
learns the ontology-described label hierarchy from a very large set of text
documents. This paper presents a process for maintaining the relations of the
ontology-described label hierarchy with regard to a stream of unstructured
text documents in the context of Big Data, without relearning the entire
hierarchy.
Keywords. Maintenance, multi-label classification, hierarchy induction,
ontology, machine learning
1 Introduction
The exponential growth of the amount of data available on the web requires new
forms of processing to enable enhanced decision making, insight discovery and
optimization. The term Big Data is mainly used to describe datasets that cannot be
processed using traditional tools.
To extract knowledge from Big Data sources, we propose a Semantic HMC
process [1, 2] capable of hierarchically multi-classifying a large Variety and
Volume of unstructured data items. Hierarchical Multi-Label Classification (HMC) is
the combination of Multi-Label classification and Hierarchical classification [13]. The
Semantic HMC process is unsupervised, in that it requires neither previously labelled
examples nor enrichment rules relating the data items to the labels. The label
hierarchy and the enrichment rules are automatically learned from the data through
scalable Machine Learning techniques.
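To make the combination of multi-label and hierarchical classification concrete, the following minimal Python sketch shows one common reading of it: an item may carry several labels at once, and each label implies its ancestors in the hierarchy. This is an illustration only and not the Semantic HMC implementation. The toy hierarchy, the label names and the helper with_ancestors are all hypothetical, and the ancestor-closure rule is an assumption made for the example.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy (an assumption for illustration):
# each label maps to the set of its direct parent labels.
PARENTS: Dict[str, Set[str]] = {
    "football": {"sport"},
    "tennis": {"sport"},
    "election": {"politics"},
    "sport": set(),
    "politics": set(),
}


def with_ancestors(labels: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given labels plus every ancestor reachable through `parents`."""
    closed: Set[str] = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue  # already visited; also guards against cycles
        closed.add(label)
        stack.extend(parents.get(label, set()))
    return closed


# Multi-label: one item receives two leaf labels.
# Hierarchical: each leaf label also brings in its ancestors.
item_labels = with_ancestors({"football", "election"}, PARENTS)
```

Here item_labels contains the two assigned labels together with their parents "sport" and "politics". A plain dictionary stands in for the hierarchy in this sketch, whereas the paper describes the hierarchy with an ontology.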
The automatic concept (label) hierarchy extraction from unstructured documents is
not a trivial process and proper techniques for document analysis and representation
are required. In the context of Big Data, this task is even more challenging due to Big