Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/26456
Title: Temporal Aggregation Approaches for Few-shot Human Action Recognition in Videos
Authors: Bo, Yang
Advisor: He, Wenbo
Department: Computing and Software
Publication Date: 2021
Abstract: Over the past decade, deep learning research has progressed dramatically and achieved strong performance on a wide range of tasks. This success depends heavily on large amounts of manually labeled data. However, it is not always possible to collect enough training data, and manually labelling a large amount of data is labour-intensive. To learn from a limited number of labeled examples, a machine learning paradigm called Few-Shot Learning (FSL) has been introduced. For few-shot human action recognition, the core challenge is preserving both the spatial and the temporal information of a video given only a few labeled videos. Many approaches based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) have been proposed for human action recognition. However, these methods fail to preserve the temporal information of the entire video and require a large number of training videos, so applying them directly to few-shot human action recognition leads to severe overfitting. Only a few approaches currently address the few-shot human action recognition problem; they either learn how to compare the similarity between video descriptors from few training samples, or compensate for the shortage of training samples through data augmentation. In this thesis, we propose three approaches that preserve the temporal information of the entire video given frame/segment features: the Discriminative Video Descriptor (DVD), the Temporal Attention Vector (TAV), and Contents and Length based Temporal Attention (CLTA). These methods preserve the temporal information of the entire video by, respectively, recursively convolving frame features with a basis for a low-dimensional space; aggregating frame/segment features with manually defined temporal weights; and aggregating frame features with temporal weights produced by learned Gaussian distribution functions based on both the length and the content of the video. We evaluated our approaches on several datasets in both regular and few-shot scenarios, and they achieve results comparable to or better than state-of-the-art approaches.
URI: http://hdl.handle.net/11375/26456
Appears in Collections: Open Access Dissertations and Theses
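
The abstract's three methods all reduce to one core operation: collapsing a sequence of per-frame features into a single video descriptor via a weighted sum over time. As a rough illustration of that aggregation step only, not the thesis implementation, here is a minimal NumPy sketch in the spirit of CLTA's Gaussian temporal weighting; the function names, the (T, D) feature layout, and the fixed mu and sigma values are assumptions for illustration, whereas in CLTA these parameters would be learned from the video's length and content.

```python
import numpy as np

def gaussian_temporal_weights(num_frames: int, mu: float, sigma: float) -> np.ndarray:
    """Gaussian weight per frame position, normalized to sum to 1."""
    t = np.linspace(0.0, 1.0, num_frames)       # frame positions scaled to [0, 1]
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)  # Gaussian bump centred at mu
    return w / w.sum()

def aggregate_video(frame_features: np.ndarray,
                    mu: float = 0.5, sigma: float = 0.25) -> np.ndarray:
    """Collapse a (T, D) stack of frame features into one D-dim video descriptor."""
    weights = gaussian_temporal_weights(frame_features.shape[0], mu, sigma)
    return weights @ frame_features             # weighted sum over the time axis

# Toy usage: 16 frames of 512-dim backbone features (random stand-ins).
features = np.random.rand(16, 512).astype(np.float32)
descriptor = aggregate_video(features)
print(descriptor.shape)  # (512,)
```

A fixed weighting like this corresponds more closely to TAV's manually defined temporal weights; learning mu and sigma per video, as CLTA does, would let the weighting adapt to each clip's length and content.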

Files in This Item:
File:        Bo_Yang_202105_Doctor-of-Philosophy.pdf
Description: Open Access
Size:        4.46 MB
Format:      Adobe PDF
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.
