Left column: original video sequences from the UCF-101 dataset (192x192 crops taken at random locations, away from the border).
Right column: reconstructions using the estimated affine transforms described in sec. 2.1.
Note: these reconstructions do not use affine transforms predicted by our model; the transforms are estimated from the ground-truth frames.
There is barely any noticeable difference between the ground-truth and reconstructed video sequences. This qualitatively validates two of our assumptions: (i) a frame can be decomposed into patches, and (ii) the motion of each patch is well modeled by an affine transform.
Panel labels: (left) frames from the original sequence; (right) sequence re-created by applying the estimated affine transformations to the previous ground-truth frame.
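
The reconstruction procedure in this caption (warping each patch of the previous ground-truth frame with its own affine transform) can be illustrated with a minimal sketch. This is not the implementation from Sec. 2.1. The function name `warp_patches`, the patch size of 8, the 2x3 matrix convention (mapping output coordinates to source coordinates, relative to the patch centre), and the nearest-neighbour sampling are all assumptions made for illustration.

```python
import numpy as np

def warp_patches(prev_frame, transforms, patch=8):
    """Rebuild a frame by warping each patch of prev_frame (hypothetical helper).

    prev_frame: 2D grayscale array of shape (H, W).
    transforms: array of shape (H // patch, W // patch, 2, 3). Entry [i, j] is
        an assumed 2x3 affine matrix mapping output pixel coordinates (x, y, 1),
        measured from the patch centre, to source coordinates in prev_frame
        (also measured from that centre).
    Rows/columns beyond a whole number of patches are left at zero.
    """
    h, w = prev_frame.shape
    out = np.zeros_like(prev_frame)
    # Homogeneous pixel coordinates of one patch, relative to its centre.
    ys, xs = np.mgrid[0:patch, 0:patch]
    c = (patch - 1) / 2.0
    coords = np.stack([xs.ravel() - c, ys.ravel() - c, np.ones(patch * patch)])
    for i in range(h // patch):
        for j in range(w // patch):
            src = transforms[i, j] @ coords  # shape (2, patch * patch)
            # Back to absolute image coordinates, nearest neighbour, clipped
            # to the frame so that out-of-range samples reuse border pixels.
            sx = np.clip(np.rint(src[0] + c + j * patch), 0, w - 1).astype(int)
            sy = np.clip(np.rint(src[1] + c + i * patch), 0, h - 1).astype(int)
            out[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = (
                prev_frame[sy, sx].reshape(patch, patch)
            )
    return out

# Usage: identity transforms reproduce the previous frame exactly.
frame = np.arange(16 * 16, dtype=float).reshape(16, 16)
identity = np.tile(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]), (2, 2, 1, 1))
same = warp_patches(frame, identity)
```

In this sketch, clipping at the frame boundary is a simplification. Cropping away from the border, as the caption describes, would reduce how often a warped patch samples outside the frame.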