Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/28504
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Hassini, Elkafi | - |
dc.contributor.author | Wahid, Dewan Ferdous | - |
dc.date.accessioned | 2023-05-05T18:11:29Z | - |
dc.date.available | 2023-05-05T18:11:29Z | - |
dc.date.issued | 2023 | - |
dc.identifier.uri | http://hdl.handle.net/11375/28504 | - |
dc.description.abstract | This thesis presents frameworks for data clustering on big datasets that can arise in different real-world applications. The main contributions of this thesis can be divided into the following four areas of data clustering. Correlation clustering is a well-known problem that appears in different scientific areas under various names; it identifies clusters when qualitative information about objects' mutual similarities or dissimilarities is given. The first contribution of this thesis is a unified discussion comprising a cross-disciplinary taxonomy-based literature review, a bibliometric analysis, literature gaps and dominant research topics related to this problem. As the second contribution, this thesis presents the concept of a common-knowledge network and a heuristic algorithm for cluster editing to identify authors' communities in a research institution. Furthermore, several analyses, such as the dominant research topic and collaboration incidence corresponding to each identified research community, are proposed in this thesis to investigate multidisciplinary research activities in research institutions. The third contribution constitutes a framework for user-generated short-text classification based on identified line-item categories. The line-item identification phase uses cograph editing (CoE)-based clustering on a keyword network formulated from short texts. An integer linear programming formulation for CoE on weighted networks and a corresponding heuristic algorithm to identify clusters in large-scale networks are also proposed. The framework has been applied to categorize invoices for a subscription-based invoicing and accounting company. The fourth contribution is an augmented artificial intelligence (AI) hybrid fraud detection framework for use in the presence of minimal labelled data sets. This framework uses unsupervised clustering, a supervised classifier, red-flag prioritization, and augmented AI processes. Finally, this thesis outlines an application of this framework to identify fraudulent users in an invoicing platform. | en_US |
dc.language.iso | en | en_US |
dc.subject | Clustering | en_US |
dc.subject | Big data | en_US |
dc.subject | Invoice categorization | en_US |
dc.subject | Fraud detection | en_US |
dc.subject | Common-knowledge network | en_US |
dc.subject | Short-text classification | en_US |
dc.title | Big Data Clustering: Models and Applications | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computational Engineering and Science | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Doctor of Philosophy (PhD) | en_US |
dc.description.layabstract | This thesis presents big data clustering frameworks that tackle application problems in different real-world scenarios. Two main approaches have been used in developing these clustering frameworks. The first approach utilizes a problem-specific keyword network formulation and network (graph) clustering models with corresponding integer linear programming formulation-based heuristic algorithms, which can identify communities or clusters in big datasets. Furthermore, different procedures were followed, depending on the application area, to interpret and utilize the identified clusters or communities. The second approach is an augmented artificial intelligence hybrid framework of unsupervised clustering and supervised classifiers with a set of minimal labelled data. Both approaches have been tested with real-world data that included university researchers' publication networks and transaction network data from customers of a subscription-based accounting firm. In addition, this thesis presents a cross-disciplinary taxonomy-based literature review and a bibliometric analysis for correlation clustering, a well-known network clustering problem. | en_US |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Wahid_Dewan_F_202305_PhD.pdf | | 4.59 MB | Adobe PDF | View/Open |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.