Going deeper into third-person action anticipation

Sareh Rowlands

Abstract


The analysis of human actions in video is attracting a great deal of interest in the field of computer vision. This paper explores and reviews deep learning techniques used for third-person action anticipation. In many architectures, the anticipation task is divided into two stages: feature extraction and a predictive model. This paper outlines a project plan for third-person action anticipation on step-based activities. We will compare several of these architectures across multiple datasets, evaluating their prediction accuracy and their ability to anticipate actions over varying time horizons.
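
To make this two-stage decomposition concrete, the following is a minimal sketch, not the paper's implementation, of a common pattern in the anticipation literature: a CNN backbone extracts per-frame features, and an LSTM summarises the observed frames to predict the action that comes next. The ResNet-18 backbone, layer sizes, class count, and all names here are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class AnticipationModel(nn.Module):
    # Hypothetical two-stage pipeline: frame-level feature extraction
    # followed by a recurrent predictive model (an LSTM).
    def __init__(self, num_actions: int, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        backbone = resnet18(weights=None)   # illustrative feature extractor
        backbone.fc = nn.Identity()         # expose the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W) of observed frames
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))  # (b*t, 512) per-frame features
        feats = feats.view(b, t, -1)                # (b, t, 512) feature sequence
        _, (h_n, _) = self.lstm(feats)              # summarise the observed segment
        return self.classifier(h_n[-1])             # logits for the NEXT action

# Usage: anticipate the upcoming action from 8 observed frames.
model = AnticipationModel(num_actions=48)           # class count is an assumption
logits = model(torch.randn(2, 8, 3, 224, 224))      # -> (2, 48)

Architectures reviewed in the paper differ chiefly in how each stage is realised (e.g., the choice of feature extractor and of recurrent or attention-based predictor), but most fit this extract-then-predict template.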


Keywords


Action anticipation; Third-person vision; Deep learning; LSTM






DOI: https://doi.org/10.23954/osj.v8i2.3437



This work is licensed under a Creative Commons Attribution 4.0 International License.
