Depthwise Separable Temporal Convolutional Network for Action Segmentation
Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal-convolution-based approaches segment actions either with an encoder-decoder (ED) architecture or with dilation factors that double in consecutive convolution layers. However, ED networks operate at low temporal resolution, and the doubling dilations in successive layers cause gridding artifacts. We propose a depthwise separable temporal convolutional network (DS-TCN) that operates at full temporal resolution with reduced gridding effects. The basic component of DS-TCN is the residual depthwise dilated block (RDDB). Using RDDB, we explore the trade-off between large kernels and small dilation rates. We show that DS-TCN efficiently captures both long-term dependencies and local temporal cues. Our evaluation on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrates that DS-TCN outperforms the existing ED-TCN and dilation-based TCN baselines while using comparatively fewer parameters.
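As a rough illustration of the trade-off between large kernels and small dilation rates, the sketch below compares two hypothetical layer stacks: one with kernel size 3 and doubling dilations, and one with a larger kernel and slowly growing dilations. The layer counts, kernel sizes, dilation schedules, and channel width are assumptions chosen for illustration, not the configuration reported for DS-TCN. The parameter counts use the standard formulas for bias-free 1D convolutions.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def max_tap_gap(dilations):
    """Frames skipped between adjacent kernel taps in the most dilated layer."""
    return max(dilations) - 1


def standard_conv_params(c_in, c_out, kernel_size):
    """Weights of a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Depthwise conv (one filter per channel) followed by a 1x1 pointwise conv."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical configurations, for illustration only.
doubling = {"kernel_size": 3, "dilations": [2 ** i for i in range(10)]}
large_kernel = {"kernel_size": 15, "dilations": list(range(1, 11))}
channels = 64  # assumed channel width

for name, cfg in (("doubling", doubling), ("large kernel", large_kernel)):
    k, dil = cfg["kernel_size"], cfg["dilations"]
    print(
        name,
        "| receptive field:", receptive_field(k, dil),
        "| max tap gap:", max_tap_gap(dil),
        "| standard params/layer:", standard_conv_params(channels, channels, k),
        "| depthwise separable params/layer:",
        depthwise_separable_params(channels, channels, k),
    )
```

Under these assumptions, the depthwise separable layer's cost grows only by `c_in` weights per extra kernel tap, which is why a larger kernel stays cheap, while the smaller dilations keep adjacent taps close together.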
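Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Action Segmentation | 50 Salads | DS-TCN | F1@10% | 77.0 | #24 |
Action Segmentation | 50 Salads | DS-TCN | Edit | 70.0 | #24 |
Action Segmentation | 50 Salads | DS-TCN | Acc | 80.0 | #24 |
Action Segmentation | 50 Salads | DS-TCN | F1@25% | 74.43 | #24 |
Action Segmentation | 50 Salads | DS-TCN | F1@50% | 65.78 | #24 |
Action Segmentation | Breakfast | DS-TCN | F1@10% | 67.70 | #23 |
Action Segmentation | Breakfast | DS-TCN | F1@50% | 49.18 | #21 |
Action Segmentation | Breakfast | DS-TCN | Acc | 70.75 | #15 |
Action Segmentation | Breakfast | DS-TCN | Edit | 69.02 | #21 |
Action Segmentation | Breakfast | DS-TCN | F1@25% | 62.05 | #22 |
Action Segmentation | GTEA | DS-TCN | F1@10% | 88.30 | #19 |
Action Segmentation | GTEA | DS-TCN | F1@50% | 72.84 | #22 |
Action Segmentation | GTEA | DS-TCN | Acc | 78.10 | #19 |
Action Segmentation | GTEA | DS-TCN | Edit | 84.05 | #17 |
Action Segmentation | GTEA | DS-TCN | F1@25% | 85.44 | #21 |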
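
The metric names in the table are not defined on this page. The sketch below shows how frame-wise accuracy (Acc), the segmental edit score (Edit), and the segmental F1 score at an overlap threshold (F1@k) are commonly computed in the action segmentation literature; that this paper follows exactly these definitions is an assumption. All function names and the toy label sequences are hypothetical.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)


def levenshtein(a, b):
    """Edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def edit_score(pred, gt):
    """Normalized edit distance over segment label order, as a percentage."""
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    return 100.0 * (1 - levenshtein(p, g) / max(len(p), len(g), 1))


def f1_at_k(pred, gt, k):
    """Segmental F1: a predicted segment is a true positive if its IoU with an
    unmatched ground-truth segment of the same label is at least k."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= k and not used[best_j]:
            tp += 1
            used[best_j] = True
        else:
            fp += 1
    fn = len(g_segs) - tp
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: one boundary is predicted a frame early.
gt = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
print(frame_accuracy(pred, gt), edit_score(pred, gt), f1_at_k(pred, gt, 0.5))
```

In this toy case the shifted boundary costs one frame of accuracy but leaves the segment order intact, which is why the segmental metrics are reported alongside Acc: they penalize over-segmentation rather than small boundary shifts.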