McMaster University Home Page
Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/28504
Title: Big Data Clustering: Models and Applications
Authors: Wahid, Dewan Ferdous
Advisor: Hassini, Elkafi
Department: Computational Engineering and Science
Keywords: Clustering; Big data; Invoice categorization; Fraud detection; Common-knowledge network; Short-text classification
Publication Date: 2023
Abstract: This thesis presents frameworks for clustering big datasets that arise in real-world applications. Its contributions fall into four areas of data clustering. Correlation clustering is a well-known problem that appears under various names in different scientific areas; it identifies clusters when qualitative information about objects' mutual similarities or dissimilarities is given. The first contribution is a unified, cross-disciplinary treatment of this problem: a taxonomy-based literature review, a bibliometric analysis, and a discussion of literature gaps and dominant research topics. The second contribution introduces the concept of a common-knowledge network and a heuristic algorithm for cluster editing to identify authors' communities in a research institution. Several analyses, such as the dominant research topic and collaboration incident corresponding to each identified research community, are also proposed to investigate multidisciplinary research activities in research institutions. The third contribution is a framework for classifying user-generated short texts based on identified line-item categories. The line-item identification phase uses cograph editing (CoE)-based clustering on a keyword network constructed from the short texts. An integer linear programming formulation for CoE on weighted networks and a corresponding heuristic algorithm for identifying clusters in large-scale networks are also proposed. The framework has been applied to categorize invoices for a subscription-based invoicing and accounting company. The fourth contribution is an augmented artificial intelligence (AI) hybrid fraud detection framework for settings with minimal labelled data. This framework uses unsupervised clustering, a supervised classifier, red-flag prioritization, and augmented AI processes. Finally, the thesis outlines an application of this framework to identify fraudulent users in an invoicing platform.
URI: http://hdl.handle.net/11375/28504
Appears in Collections: Open Access Dissertations and Theses

Files in This Item:
File: Wahid_Dewan_F_202305_PhD.pdf
Access is allowed from: 2024-05-02
Size: 4.59 MB
Format: Adobe PDF


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
