Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/28504
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Hassini, Elkafi | -
dc.contributor.author | Wahid, Dewan Ferdous | -
dc.date.accessioned | 2023-05-05T18:11:29Z | -
dc.date.available | 2023-05-05T18:11:29Z | -
dc.date.issued | 2023 | -
dc.identifier.uri | http://hdl.handle.net/11375/28504 | -
dc.description.abstract | This thesis presents frameworks for data clustering on big datasets that can arise in different real-world applications. The main contributions of this thesis can be divided into the following four areas of data clustering. Correlation clustering is a well-known problem that appears under various names in different scientific areas; it identifies clusters when qualitative information about objects' mutual similarities or dissimilarities is given. The first contribution of this thesis is a unified discussion of a cross-disciplinary taxonomy-based literature review, bibliometric analysis, literature gaps and dominant research topics related to this problem. As the second contribution, this thesis presents the concept of a common-knowledge network and a heuristic algorithm for cluster editing to identify authors' communities in a research institution. Furthermore, several analyses, such as the dominant research topic and collaboration incident corresponding to each identified research community, are proposed to investigate multidisciplinary research activities in research institutions. The third contribution is a framework for user-generated short-text classification based on identified line-item categories. The line-item identification phase uses cograph editing (CoE)-based clustering on a keywords network formulated from short texts. An integer linear programming formulation for CoE on weighted networks and a corresponding heuristic algorithm to identify clusters in large-scale networks are also proposed. The framework has been applied to categorize invoices for a subscription-based invoicing and accounting company. The fourth contribution is an augmented artificial intelligence (AI) hybrid fraud detection framework for settings with minimal labelled data. This framework uses unsupervised clustering, a supervised classifier, red-flag prioritization, and augmented AI processes. Finally, this thesis outlines an application of this framework to identify fraudulent users in an invoicing platform. | en_US
dc.language.iso | en | en_US
dc.subject | Clustering | en_US
dc.subject | Big data | en_US
dc.subject | Invoice categorization | en_US
dc.subject | Fraud detection | en_US
dc.subject | Common-knowledge network | en_US
dc.subject | Short-text classification | en_US
dc.title | Big Data Clustering: Models and Applications | en_US
dc.type | Thesis | en_US
dc.contributor.department | Computational Engineering and Science | en_US
dc.description.degreetype | Thesis | en_US
dc.description.degree | Doctor of Philosophy (PhD) | en_US
dc.description.layabstract | This thesis presents big data clustering frameworks that tackle application problems in different real-world scenarios. Two main approaches have been used in developing these clustering frameworks. The first approach combines problem-specific keywords network formulation with network (graph) clustering models and corresponding heuristic algorithms based on integer linear programming formulations, which can identify communities or clusters in big datasets. Depending on the application area, different procedures were then followed to interpret and use the identified clusters or communities. The second approach is an augmented artificial intelligence hybrid framework that combines unsupervised clustering and supervised classifiers with a minimal set of labelled data. All approaches have been tested with real-world data, including university researchers' publication networks and transaction network data from the customers of a subscription-based accounting firm. In addition, this thesis presents a cross-disciplinary taxonomy-based literature review and a bibliometric analysis for correlation clustering, a well-known network clustering problem. | en_US
Appears in Collections: Open Access Dissertations and Theses

Files in This Item:
File: Wahid_Dewan_F_202305_PhD.pdf
Access is allowed from: 2024-05-02
Size: 4.59 MB
Format: Adobe PDF


Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.

©2022 McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4L8