A Tool for Indexing and Classifying Unstructured Textual Documents Based on Product Family Algebra

Alomair, Deemah

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/28375

Title:	A Tool for Indexing and Classifying Unstructured Textual Documents Based on Product Family Algebra
Authors:	Alomair, Deemah
Advisor:	Khedri, Ridha
Department:	Computing and Software
Keywords:	data analytics;document indexing and classification
Publication Date:	17-Aug-2020
Abstract:	Unstructured textual documents comprise the bulk of the data used and archived by organizations in all sectors of the economy. The need to index and classify these documents has attracted growing attention in the field of data analytics. Approaches to indexing and classifying textual documents range from supervised Machine Learning (ML) methods to rule-based ones. There is a need to explore novel classification approaches that offer better effectiveness and performance in classifying the increasing volume of this kind of data. In this thesis, we propose a novel approach to indexing and classifying unstructured textual documents that is based on Product Family Algebra (PFA) and implemented using Binary Decision Diagrams (BDDs). In the proposed approach, a signature is first constructed for a document or a family of documents. The signature is relative to a dictionary of the typical words used in the category under consideration. Then, using operations on product families implemented with BDDs, we carry out the classification of a document or of families of documents using their signatures. Since ML methods are considered the de facto standard in document classification, and in order to compare the performance of our method with theirs, we implement four ML classification methods: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (K-NN), and Decision Tree (DT). We then merge these modules into one software system called Smart Document Classification System (SDCS). The assessment of our approach shows its flexibility in indexing and classifying families of textual documents. The classification is deterministic, and on a single document (as opposed to families of documents) it compares very well with the SVM ML classifier. Using rules articulated in the language of PFA, it offers a variety of ways of classifying families of documents.
URI:	http://hdl.handle.net/11375/28375
Appears in Collections:	Open Access Dissertations and Theses
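
Illustrative sketch (not taken from the thesis): the abstract describes building a signature for a document relative to a category dictionary and then classifying the document from that signature. The Python below is a minimal approximation of that idea under stated assumptions. Plain Python sets stand in for the PFA operations and the BDD encoding, and the names signature, classify, RULES, and the "required" rule format are all hypothetical, chosen only for illustration.

```python
import re
from typing import Dict, FrozenSet, List, Set


def signature(text: str, dictionary: Set[str]) -> FrozenSet[str]:
    """Return the dictionary words that occur in the text (assumed signature form)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return frozenset(tokens & dictionary)


def classify(text: str, rules: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Return every category whose required words all appear in the signature.

    The 'required' set is a hypothetical stand-in for a rule written in PFA.
    """
    matches = []
    for category, rule in rules.items():
        sig = signature(text, rule["dictionary"])
        if rule["required"] <= sig:  # subset test: all required words present
            matches.append(category)
    return sorted(matches)


# Hypothetical dictionaries and rules, for illustration only.
RULES: Dict[str, Dict[str, Set[str]]] = {
    "finance": {
        "dictionary": {"invoice", "payment", "tax", "balance"},
        "required": {"invoice", "payment"},
    },
    "legal": {
        "dictionary": {"contract", "clause", "liability", "party"},
        "required": {"contract"},
    },
}

doc = "Invoice 17: payment is due within 30 days; tax included."
finance_signature = signature(doc, RULES["finance"]["dictionary"])
categories = classify(doc, RULES)
```

Because the decision depends only on set membership, the same document always receives the same categories, which mirrors the deterministic behaviour the abstract mentions. How the thesis actually encodes signatures and rules in PFA and BDDs is not shown here.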

Files in This Item:

File	Description	Size	Format
Alomair_Deemah_N_2020:09_M.Sc.pdf Open Access		5.9 MB	Adobe PDF
