The new dataset contains around 1,500 training videos and 290 test videos, with 50 frames per video on average. It was obtained by processing manually captured video sequences of static real-life urban scenes. Its main property is the abundance of close objects and, consequently, a higher prevalence of occlusions. According to the introduced heuristic, the mean area of occluded image parts for SWORD is approximately five times larger than for RealEstate10k (14% vs. 3%, respectively). This justifies the collection and use of SWORD and explains why it allows training more powerful models despite its smaller size.