Language Identification on Short Textual Data

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/25126

Title:	Language Identification on Short Textual Data
Authors:	Cui, Yexin
Advisor:	Chen, Jun
Department:	Electrical and Computer Engineering
Keywords:	Natural Language Processing; Language identification; Textual data
Publication Date:	2020
Abstract:	Language identification is the task of automatically detecting the language(s) in which a given text or document is written, and is the first step for further natural language processing tasks. The task has been well studied for decades; however, most work has focused on long texts rather than short ones, which have proved more challenging because they carry insufficient syntactic and semantic information. In this work, we present approaches to this problem based on deep learning techniques, traditional methods, and their combination. The proposed ensemble model, composed of a learning-based method and a dictionary-based method, achieves 89.6% accuracy on our newly generated gold test set, surpassing the Google Translate API by 3.7% and the industry-leading tool Langid.py by 26.1%.
URI:	http://hdl.handle.net/11375/25126
Appears in Collections:	Open Access Dissertations and Theses

Files in This Item:

File	Description	Size	Format
Cui_Yexin_201912_MASc.pdf Open Access		498.36 kB	Adobe PDF