Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/28375
Title: A Tool for Indexing and Classifying Unstructured Textual Documents Based on Product Family Algebra
Authors: Alomair, Deemah
Advisor: Khedri, Ridha
Department: Computing and Software
Keywords: data analytics; document indexing and classification
Publication Date: 17-Aug-2020
Abstract: Unstructured textual documents comprise the bulk of the data used and archived by organizations within all sectors of the economy. The need to index and classify these documents has become a topic of growing interest in the field of data analytics. Approaches to indexing and classifying textual documents range from supervised Machine Learning (ML) methods to rule-based ones. There is a need to explore novel classification approaches that exhibit better effectiveness and performance in classifying the increasing volume of this kind of data. In this thesis, we propose a novel approach to indexing and classifying unstructured textual documents that is based on Product Family Algebra (PFA) and implemented using Binary Decision Diagrams (BDDs). In the proposed approach, a signature is first constructed for a document or a family of documents. The signature is relative to a dictionary of the typical words used in the category under consideration. Then, using operations on product families implemented with BDDs, we carry out the classification of a document or of families of documents using their signatures. Since ML methods are considered the de facto standard in document classification, and in order to compare the performance of our method to theirs, we implement four ML classification methods: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (K-NN), and Decision Tree (DT). After that, we merge these modules into one software system called the Smart Document Classification System (SDCS). The assessment of our approach to the classification of textual documents shows its flexibility in indexing and classifying families of textual documents. The classification is deterministic, and on a single document (as opposed to families of documents) it compares very well with the SVM ML classifier. Using rules articulated in the language of PFA, it offers a variety of ways of classifying families of documents.
URI: http://hdl.handle.net/11375/28375
Appears in Collections: Open Access Dissertations and Theses
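
The signature-and-refinement idea summarized in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: plain Python sets stand in for BDDs, and the names `signature`, `refines`, `classify`, the sample dictionary, and the sample rules are all hypothetical. It assumes a set-based reading of product families, in which a product is a set of features (dictionary words), a family is a set of products, and a family refines a rule when every product in it contains all features of at least one product in the rule.

```python
import re

def signature(text, dictionary):
    """Return the dictionary words that occur in the text (a 'product')."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return frozenset(tokens & dictionary)

def refines(family, spec):
    """True if every product in `family` contains all features of
    at least one product in `spec` (assumed set-based refinement)."""
    return all(any(required <= product for required in spec)
               for product in family)

def classify(documents, dictionary, rules):
    """Return the sorted names of all categories whose rule is refined
    by the family of signatures built from `documents`."""
    family = {signature(doc, dictionary) for doc in documents}
    return sorted(name for name, spec in rules.items()
                  if refines(family, spec))

# Hypothetical dictionary and rules, for illustration only.
DICTIONARY = {"invoice", "payment", "contract", "clause", "liability"}
RULES = {
    "finance": {frozenset({"invoice", "payment"})},
    "legal": {frozenset({"contract"}), frozenset({"clause", "liability"})},
}

invoice_doc = "Invoice 17: payment due in 30 days."
contract_doc = "The contract includes a liability clause."

print(classify([invoice_doc], DICTIONARY, RULES))
print(classify([contract_doc], DICTIONARY, RULES))
print(classify([invoice_doc, contract_doc], DICTIONARY, RULES))
```

Because the check is a pure set comparison, the same input always yields the same categories, which mirrors the deterministic behaviour the abstract mentions. A family is assigned to a category only when every document in it satisfies the rule, so the mixed two-document family in the last call matches neither category.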

Files in This Item:
File: Alomair_Deemah_N_2020:09_M.Sc.pdf
Description: Open Access
Size: 5.9 MB
Format: Adobe PDF


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
