temporally overlapping positive pairs

Because the narration in a video is only loosely aligned with what is on screen, VideoCLIP builds positive pairs from video and text windows that overlap in time but need not share the same start and end points. Dropping the exact-alignment requirement softens the contrastive signal and makes training tolerant of the temporal misalignment inherent in narrated video.
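As a rough illustration of the idea, the sketch below picks a video window that is centered on a narration segment but has a randomly drawn length, so the two windows overlap without coinciding. It is a minimal sketch under stated assumptions, not VideoCLIP's actual implementation: the function names, the `min_len` and `max_len` bounds, and the centering rule are all hypothetical choices made for this example.

```python
import random
from typing import Optional, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


def sample_overlapping_video_window(
    text_window: Window,
    video_duration: float,
    min_len: float = 3.0,   # assumed lower bound on clip length, in seconds
    max_len: float = 32.0,  # assumed upper bound on clip length, in seconds
    rng: Optional[random.Random] = None,
) -> Window:
    """Return a video window that overlaps, but need not equal, text_window.

    Hypothetical helper: the video clip is grown around the midpoint of the
    narration segment with a random duration, then clamped to the video.
    """
    if rng is None:
        rng = random.Random()
    t_start, t_end = text_window
    center = 0.5 * (t_start + t_end)
    length = rng.uniform(min_len, max_len)
    start = max(0.0, center - length / 2.0)
    end = min(video_duration, center + length / 2.0)
    return start, end


def overlap_seconds(a: Window, b: Window) -> float:
    """Length of the intersection of two windows, or 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


if __name__ == "__main__":
    narration = (10.0, 14.0)  # made-up narration timestamps
    clip = sample_overlapping_video_window(narration, video_duration=60.0)
    print("video clip:", clip, "overlap (s):", overlap_seconds(narration, clip))
```

Because the video window always contains the midpoint of the narration segment, the pair is guaranteed to share some time span as long as that midpoint lies strictly inside the video, while the random length keeps the boundaries from lining up exactly.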