Depthwise Separable Temporal Convolutional Network for Action Segmentation

Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal-convolution-based approaches either use an encoder-decoder (ED) architecture or dilations with a doubling factor in consecutive convolution layers to segment actions in videos. However, ED networks operate at low temporal resolution, and the dilations in successive layers cause gridding artifacts. We propose a depthwise separable temporal convolution network (DS-TCN) that operates at full temporal resolution with reduced gridding effects. The basic component of DS-TCN is the residual depthwise dilated block (RDDB). We explore the trade-off between large kernels and small dilation rates using the RDDB, and show that DS-TCN captures long-term dependencies as well as local temporal cues efficiently. Our evaluation on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrates that DS-TCN outperforms existing ED-TCN and dilation-based TCN baselines even with comparatively fewer parameters.
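The paper's exact RDDB layout is not detailed on this page; as a rough illustration of the idea only, the sketch below pairs a depthwise dilated 1-D convolution with a pointwise (1x1) convolution inside a residual connection, padding so the full temporal resolution is preserved. The class name, kernel size, dilation rate, activation, and layer ordering are all assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ResidualDepthwiseDilatedBlock(nn.Module):
    """Hypothetical sketch of an RDDB-style block: a depthwise dilated
    temporal convolution followed by a pointwise convolution, wrapped
    in a residual connection. Details are assumptions."""

    def __init__(self, channels, kernel_size=5, dilation=2):
        super().__init__()
        # pad so output length equals input length (full temporal resolution)
        pad = (kernel_size - 1) // 2 * dilation
        # depthwise: one filter per channel (groups=channels)
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels)
        # pointwise: mix information across channels
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x):                  # x: (batch, channels, frames)
        out = self.relu(self.depthwise(x))
        out = self.pointwise(out)
        return x + out                     # residual connection

# usage: shapes are preserved, so blocks can be stacked freely
x = torch.randn(1, 64, 250)               # e.g. 64-d features over 250 frames
block = ResidualDepthwiseDilatedBlock(64)
assert block(x).shape == x.shape
```

A moderately large kernel with a small dilation rate, as in this sketch, is one way to widen the receptive field without the gridding artifacts that doubling dilation factors can introduce, which matches the trade-off the abstract describes.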

No code implementations yet.

Results from the Paper


Task                 Dataset    Model   Metric  Value  Global Rank
Action Segmentation  50 Salads  DS-TCN  F1@10%  77.0   #24
                                        F1@25%  74.43  #24
                                        F1@50%  65.78  #24
                                        Edit    70.0   #24
                                        Acc     80.0   #24
Action Segmentation  Breakfast  DS-TCN  F1@10%  67.70  #23
                                        F1@25%  62.05  #22
                                        F1@50%  49.18  #21
                                        Edit    69.02  #21
                                        Acc     70.75  #15
Action Segmentation  GTEA       DS-TCN  F1@10%  88.30  #19
                                        F1@25%  85.44  #21
                                        F1@50%  72.84  #22
                                        Edit    84.05  #17
                                        Acc     78.10  #19

Methods


No methods listed for this paper.