|
[1]
|
Karpathy, A., Toderici, G., Shetty, S., et al. (2014) Large-Scale Video Classification with Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 23-28 June 2014, 1725-1732. [Google Scholar] [CrossRef]
|
|
[2]
|
Goyal, R., Ebrahimi Kahou, S., Michalski, V., et al. (2017) The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 5842-5850. [Google Scholar] [CrossRef]
|
|
[3]
|
Chen, J., Li, K., Deng, Q., et al. (2019) Distributed Deep Learning Model for Intelligent Video Surveillance Systems with Edge Computing. IEEE Transactions on Industrial Informatics. [Google Scholar] [CrossRef]
|
|
[4]
|
Bertasius, G., Wang, H. and Torresani, L. (2021) Is Space-Time Attention All You Need for Video Understanding? The 38th International Conference on Machine Learning (ICML 2021), 18-24 July 2021, 1-12.
|
|
[5]
|
Arnab, A., Dehghani, M., Heigold, G., et al. (2021) Vivit: A Video Vision Transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, 11-17 October 2021, 6836-6846. [Google Scholar] [CrossRef]
|
|
[6]
|
Caba Heilbron, F., Escorcia, V., Ghanem, B. and Carlos Niebles, J. (2015) Activitynet: A Large-Scale Video Benchmark for Human Activity Understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 June 2015, 961-970. [Google Scholar] [CrossRef]
|
|
[7]
|
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S. and Zisserman, A. (2017) The Kinetics Human Action Video Dataset.
|
|
[8]
|
Yeung, S., Russakovsky, O., Mori, G. and Fei-Fei, L. (2016) End-to-End Learning of Action Detection from Frame Glimpses in Videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 2678-2687. [Google Scholar] [CrossRef]
|
|
[9]
|
Wu, Z., Xiong, C., Ma, C.Y., Socher, R. and Davis, L.S. (2019) Adaframe: Adaptive Frame Selection for Fast Video Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, 15-20 June 2019, 1278-1287. [Google Scholar] [CrossRef]
|
|
[10]
|
Gao, R., Oh, T.H., Grauman, K. and Torresani, L. (2020) Listen to Look: Action Recognition by Previewing Audio. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 14-19 June 2020, 10457-10467. [Google Scholar] [CrossRef]
|
|
[11]
|
Ghodrati, A., Bejnordi, B.E. and Habibian, A. (2021) Frameexit: Conditional Early Exiting for Efficient Video Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, 20-25 June 2021, 15608-15618. [Google Scholar] [CrossRef]
|
|
[12]
|
Korbar, B., Tran, D. and Torresani, L. (2019) Scsampler: Sampling Salient Clips from Video for Efficient Action Recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, 27-28 October 2019, 6232-6242. [Google Scholar] [CrossRef]
|
|
[13]
|
Zheng, Y.D., Liu, Z., Lu, T. and Wang, L. (2020) Dynamic Sampling Networks for Efficient Action Recognition in Videos. IEEE Transactions on Image Processing, 29, 7970-7983. [Google Scholar] [CrossRef]
|
|
[14]
|
Meng, Y., Lin, C.C., Panda, R., Sattigeri, P., Karlinsky, L., Oliva, A., Feris, R., et al. (2020) Ar-Net: Adaptive Frame Resolution for Efficient Action Recognition. Computer Vision-ECCV 2020: 16th European Conference, Glasgow, 23-28 August 2020, 86-104. [Google Scholar] [CrossRef]
|
|
[15]
|
Sun, X., Panda, R., Chen, C.F.R., Oliva, A., Feris, R. and Saenko, K. (2021) Dynamic Network Quantization for Efficient Video Inference. Proceedings of the IEEE/CVF International Conference on Computer Vision, 11-17 October 2021, 7375-7385. [Google Scholar] [CrossRef]
|
|
[16]
|
Park, S.H., Tack, J., Heo, B., Ha, J.W. and Shin, J. (2022) K-Centered Patch Sampling for Efficient Video Recognition. In: European Conference on Computer Vision, Springer, Cham, 160-176. [Google Scholar] [CrossRef]
|
|
[17]
|
Xie, Z., Zhang, Z., Zhu, X., Huang, G. and Lin, S. (2020) Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation. Computer Vision-ECCV 2020: 16th European Conference, Glasgow, 23-28 August 2020, 531-548. [Google Scholar] [CrossRef]
|
|
[18]
|
Wang, J., Yang, X., Li, H., Liu, L., Wu, Z. and Jiang, Y.G. (2022) Efficient Video Transformers with Spatial-Temporal Token Selection. In: European Conference on Computer Vision, Springer, Cham, 69-86. [Google Scholar] [CrossRef]
|
|
[19]
|
Piergiovanni, A.J., Kuo, W. and Angelova, A. (2023) Rethinking Video Vits: Sparse Video Tubes for Joint Image and Video Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, 17-24 June 2023, 2214-2224. [Google Scholar] [CrossRef]
|
|
[20]
|
Bulat, A., Perez Rua, J.M., Sudhakaran, S., Martinez, B. and Tzimiropoulos, G. (2021) Space-Time Mixing Attention for Video Transformer. Advances in Neural Information Processing Systems, 34, 19594-19607.
|
|
[21]
|
Sun, R., Zhang, T., Wan, Y., Zhang, F. and Wei, J. (2023) Wlit: Windows and Linear Transformer for Video Action Recognition. Sensors, 23, Article No. 1616. [Google Scholar] [CrossRef] [PubMed]
|
|
[22]
|
Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016) Layer Normalization.
|
|
[23]
|
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J. and Tian, Q. (2015) Scalable Person Re-Identification: A Benchmark. Proceedings of the IEEE International Conference on Computer Vision, Santiago, 7-13 December 2015, 1116-1124. [Google Scholar] [CrossRef]
|
|
[24]
|
Wu, Z., Xiong, C., Jiang, Y.G. and Davis, L.S. (2019) Liteeval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, 8-14 December 2019, 1-10.
|
|
[25]
|
Xia, B., Wang, Z., Wu, W., Wang, H. and Han, J. (2022) Temporal Saliency Query Network for Efficient Video Recognition. In: European Conference on Computer Vision, Springer, Cham, 741-759. [Google Scholar] [CrossRef]
|
|
[26]
|
Raviv, A., Dinai, Y., Drozdov, I., Zehngut, N., Goldin, I. and Center, S.I.R.D. (2022) D-Step: Dynamic Spatio-Temporal Pruning. Proceedings of the British Machine Vision Conference, London, 21-24 November 2022, 1-13.
|