British Journal of Science
December 2016, Vol. 14 (2)
© 2016 British Journals ISSN 2047-3745
Improved Fuzzy C-means for Document Image Segmentation
Abstract
Interest in the automatic analysis and segmentation of document images has increased in
recent years. Document segmentation plays an important role in document analysis: every day,
thousands of documents, including government files, technical reports, books, newspapers, and
magazines, need to be processed so that both their text and non-text contents can be accessed
intelligently. Considerable time, money, and effort can be saved when this processing is
performed automatically. This paper therefore introduces a new document image segmentation
approach based on a proposed improved fuzzy C-means (IFCM) algorithm. The approach separates
text, image, and background pixels in scanned document images using the statistical features of
region pixels; the pixels are collected into areas, which are then clustered into text and
non-text areas. A document image is segmented into several non-overlapping regions via a novel
recursive clustering technique that relies on the statistical features of each pixel and its
neighborhood. The performance of this method is evaluated by examining a variety of
complex document images, such as newspaper layouts, in which the text and images were
artificially segmented. Performance is reported in terms of both quantitative and qualitative
measures. The experimental results are promising and encouraging, and, unlike many techniques
of this class, they are obtained without any auxiliary techniques for pixel segmentation.
They demonstrate the feasibility and practicality of IFCM, which can provide near-optimal
solutions to document layout analysis problems. The method achieved an accuracy rate of 95.21%
and a recall rate of around 97.51% on a set of 390 documents, which confirms the robustness of
the proposed algorithm. However, the overall precision is higher because of the different
evaluation metrics used. Although pixel-wise evaluation allows for more accurate improvement,
these evaluation metrics reflect the objective of IFCM.
Keywords: Fuzzy C-means (FCM), Fuzzy C-means Algorithm, Clustering Algorithms, Cluster
Center, Clustering, Document Image Segmentation, Document Image Analysis, Document
Image.
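To make the clustering idea in the abstract concrete, the following is a minimal sketch of the standard fuzzy C-means algorithm applied to per-pixel neighborhood statistics. It is not the IFCM proposed in this paper, whose improvements and recursive scheme are not specified in this section. The choice of local mean and standard deviation as the "statistical features", the 3x3 window, the fuzzifier m = 2, and all function names (`neighborhood_features`, `fuzzy_c_means`) are assumptions made purely for illustration.

```python
import math
import random


def neighborhood_features(image, radius=1):
    """Return one (mean, std) feature per pixel, in row-major order.

    The window is (2*radius+1) x (2*radius+1), clipped at the image border.
    Using mean and std as features is an illustrative assumption.
    """
    rows, cols = len(image), len(image[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            features.append((mean, math.sqrt(var)))
    return features


def fuzzy_c_means(points, n_clusters, m=2.0, max_iter=300, tol=1e-5, seed=0):
    """Standard FCM: alternate centre and membership updates until the
    largest membership change in one iteration drops below tol."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])
    centres = []
    for _ in range(max_iter):
        # Centre update: mean of the points weighted by u^m.
        centres = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(len(points))]
            total = sum(weights)
            centres.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        change = 0.0
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centres]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            new_row = [x / total for x in inv]
            change = max(change,
                         max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if change < tol:
            break
    return centres, u


# Toy 4x6 "page": dark left half, bright right half (hypothetical data).
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]
features = neighborhood_features(image)
centres, memberships = fuzzy_c_means(features, n_clusters=2)
# Harden the fuzzy memberships into one cluster label per pixel.
labels = [row.index(max(row)) for row in memberships]
```

Each pixel receives a membership degree in every cluster rather than a single hard label; the last line hardens those degrees by taking the largest one. On a real scanned page, three clusters (text, image, background) would be the natural counterpart of the classes named in the abstract, though how IFCM assigns them is not described here.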
Introduction
Nowadays, a wide variety of information is being converted into electronic format for
efficient storage and processing. This requires handling documents with image analysis
techniques. Document analysis techniques decompose the document image into its consistent
components, such as text, graphics, and tables, without prior knowledge of a specific
format.
Document images are frequently generated from physical documents via digitization using
scanner devices or digital cameras. Many documents, such as newspapers and magazines,
have very complex layouts. Automatic analysis of a document with a complex structure and
layout is considered a difficult task, beyond the capabilities of current document layout
analysis systems.
Hasanen S. Abdullah
University of Technology
Computer Sciences Department
E-mail: qhasanen@yahoo.com
Ammar H. Jassim
University of Baghdad / College of Science for Women
Department of Computer Science
E-mail: ammar_hussein_2004@yahoo.com