Big Data Clustering: Models and Applications

Abstract

This thesis presents frameworks for clustering big datasets that arise in real-world applications. Its main contributions fall into four areas of data clustering. Correlation clustering is a well-known problem that appears under various names in different scientific areas; it identifies clusters when qualitative information about objects' mutual similarities or dissimilarities is given. The first contribution is a unified, cross-disciplinary discussion of this problem, comprising a taxonomy-based literature review, a bibliometric analysis, and an account of literature gaps and dominant research topics. The second contribution introduces the concept of a common-knowledge network and a heuristic algorithm for cluster editing to identify authors' communities in a research institution. Several analyses, such as the dominant research topic and collaboration incident corresponding to each identified research community, are also proposed to investigate multidisciplinary research activities in research institutions. The third contribution is a framework for classifying user-generated short texts based on identified line-item categories. The line-item identification phase uses cograph editing (CoE)-based clustering on a keyword network constructed from the short texts. An integer linear programming formulation for CoE on weighted networks and a corresponding heuristic algorithm for identifying clusters in large-scale networks are also proposed. The framework has been applied to categorize invoices for a subscription-based invoicing and accounting company. The fourth contribution is an augmented artificial intelligence (AI) hybrid fraud detection framework for settings with minimal labelled data. This framework uses unsupervised clustering, a supervised classifier, red-flag prioritization, and augmented AI processes. Finally, this thesis outlines an application of this framework to identify fraudulent users in an invoicing platform.