
Language Identification on Short Textual Data

dc.contributor.advisor: Chen, Jun
dc.contributor.author: Cui, Yexin
dc.contributor.department: Electrical and Computer Engineering [en_US]
dc.date.accessioned: 2020-01-02T17:13:12Z
dc.date.available: 2020-01-02T17:13:12Z
dc.date.issued: 2020
dc.description.abstract: Language identification is the task of automatically detecting the language(s) in which a given text or document is written, and it is the first step of many further natural language processing tasks. The task has been well studied for decades; however, most work has focused on long texts rather than short ones, which have proved more challenging because they carry insufficient syntactic and semantic information. In this work, we present approaches to this problem based on deep learning techniques, traditional methods, and their combination. The proposed ensemble model, composed of a learning-based method and a dictionary-based method, achieves 89.6% accuracy on our newly generated gold test set, surpassing the Google Translate API by 3.7% and the industry-leading tool Langid.py by 26.1%. [en_US]
dc.description.degree: Master of Applied Science (MASc) [en_US]
dc.description.degreetype: Thesis [en_US]
dc.identifier.uri: http://hdl.handle.net/11375/25126
dc.language.iso: en [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: Language identification [en_US]
dc.subject: Textual data [en_US]
dc.title: Language Identification on Short Textual Data [en_US]
dc.type: Thesis [en_US]

Files

Original bundle

Name: Cui_Yexin_201912_MASc.pdf
Size: 498.36 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.68 KB
Format: Item-specific license agreed upon to submission