RecVis'24

Object recognition and computer vision 2024

Reconnaissance d'objets et vision artificielle (RecVis) - Master M2 MVA

Lecturers

Gül Varol
(Main lecturer)

Teaching Assistants (TAs)

Ricardo Garcia

Lucas Ventura

News

11/12/2024 Internship topics (check back for updates).
22/10/2024 Final Project topics are out - read the papers & submit project proposals by Nov 19.
08/10/2024 Sign up for the TAs' practical session (PyTorch/Kaggle/Google Cloud tutorial and presentations by the TAs about their PhD topics) by filling out this form by Oct 15. Materials here.
08/10/2024 We will use Google Classroom for announcements, discussions, and assignment collection. The access code will be announced during the lectures.

Information

Course description
Automated object recognition -- and more generally scene analysis -- from photographs and videos is the grand challenge of computer vision. This course presents the image, object, and scene models, as well as the methods and algorithms, used today to address this challenge.

Assignments
There will be three programming assignments representing 50% (10% + 20% + 20%) of the grade. The supporting materials for the programming assignments and final projects will be in Python and make use of Jupyter notebooks. For additional technical instructions on the assignments please follow this link.

Final project
The final project will represent 50% of the grade.

Collaboration policy
You can discuss the assignments and final projects with other students in the class. Discussions are encouraged and are an essential component of the academic environment. However, each student has to work out their assignment alone (including any coding, experiments or derivations) and submit their own report. For the final project, you may work alone or in a group of at most 2 people. If working in a group, we expect a more substantial project and an equal contribution from each student in the group. The final project report needs to explicitly specify the contribution of each student. Both students are expected to present the project at the oral presentation and contribute equally to writing the report. The assignments and final projects will be checked for originality. Any uncredited reuse of material (text, code, results) will be considered plagiarism and will result in zero points for the assignment / final project. If plagiarism is detected, the student will be reported to MVA.

Computer vision and machine learning talks
You are welcome to attend seminars in the Imagine and Willow research groups. Please see the seminar schedules for Imagine and Willow. Typically, these are one-hour research talks given by visiting speakers. Imagine talks are at Ecole des Ponts. Willow talks are at Inria, 48 Rue Barrault, 75013 (when you enter the building, tell the receptionist you are going to a seminar).

Feedback
At any point during or after the semester, do not hesitate to fill out this form to provide anonymous feedback about the class.

Schedule (subject to change)

Lecture time: Tuesdays 16:00-19:00
Lecture room: Salle Dussane, ENS Ulm, 45 rue d'Ulm, 75005 Paris
*A few exceptions to the room and time are denoted in the schedule below.*
Google Calendar link
Note: Slides are provided after each lecture.

#	Date	Lecturer	Topic and reading materials	Slides
Instance-level recognition
1	Oct 8	Gül Varol	Class logistics: assignments, final projects, grading; Introduction to visual recognition; Instance-level recognition: local features, correspondence, image matching Scale and affine invariant interest point detectors [Mikolajczyk and Schmid, IJCV 2004], Distinctive image features from scale-invariant keypoints [D. Lowe, IJCV 2004] (SIFT), R. Szeliski, Sections 7.1.1 (feature detectors), 7.1.2 (feature descriptors), 7.1.3 (feature matching), 7.4.2 (Hough transform), 8.1.4 (RANSAC), Video Google: Efficient visual search of videos [Sivic and Zisserman, ICCV 2003] (Bag of features)	logistics & intro & local features
2	Oct 15	Jean Ponce	Camera geometry; Image processing History: J. Mundy - Object recognition in the geometric era: A retrospective; Camera geometry: Forsyth & Ponce Ch.1-2. Hartley & Zisserman - Ch.6; Image processing: End-to-end interpretable learning of non-blind image deblurring [Eboli, Sun and Ponce, ECCV 2022], Lucas-Kanade reloaded: End-to-end super-resolution from raw image bursts [Lecouat, Ponce and Mairal, ICCV 2021] Assignment 1 (A1) out.	[geometry & img processing]
Practical	Oct 17 (1-3pm) Inria, 48 Rue Barrault, 75013	TAs	PyTorch/Kaggle/Google Cloud tutorial. Presentations by TAs about their PhD topics.
3	Oct 22	Gül Varol	Efficient visual search Fast approx. nearest neighbors with automatic algorithm configuration [Muja and Lowe, VISAPP 2009], Video Google: Efficient visual search of videos [Sivic and Zisserman, Book chapter 2006], Object retrieval with large vocabularies and fast spatial matching [Philbin et al., CVPR 2007], Improving bag-of-features for large scale image search [Jegou et al., IJCV 2010], Aggregating local image descriptors into compact codes [Jegou et al., PAMI 2011], HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [Miech et al., ICCV 2019]. Final project (FP) topics are out at the end of the lecture.	[search] [FP topics]
Category-level recognition
4	Oct 29	Gül Varol	Supervised learning and deep learning; Optimization and regularization for neural networks A1 due. A2 out.	[neural networks]
5	Nov 5	Gül Varol	Neural networks for visual recognition: CNNs and image classification Gradient-based learning applied to document recognition [Lecun et al., IEEE 1998] (CNN), ImageNet Classification with Deep Convolutional Neural Networks [Krizhevsky et al., NeurIPS 2012] (AlexNet), Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014], Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al., CVPR 2014] (pretraining), Very Deep Convolutional Networks for Large-Scale Visual Recognition [Simonyan and Zisserman, ICLR 2015] (VGGNet), Deep Residual Learning for Image Recognition [He et al., CVPR 2016] (ResNet) A3 out.	[cnn for img classification]
6	Nov 12 Salle Evariste Galois	Gül Varol	Beyond CNNs: Transformers; Beyond classification: Object detection; Segmentation; Human pose estimation Attention is all you need [Vaswani et al., NeurIPS 2017] (Transformers), An image is worth 16x16 words: Transformers for image recognition at scale [Dosovitskiy et al., ICLR 2021] (ViT), Rich feature hierarchies for accurate object detection and semantic segmentation [Girshick et al., CVPR 2014] (R-CNN), Fast R-CNN [Girshick, ICCV 2015], Faster R-CNN: Towards real-time object detection with region proposal networks [Ren et al., NeurIPS 2015], You only look once: Unified, real-time object detection [Redmon et al., CVPR 2016] (YOLO), Fully convolutional networks for semantic segmentation [Long et al., CVPR 2015] (FCN), Mask R-CNN [He et al., ICCV 2017], Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields [Cao et al., CVPR 2017] (OpenPose) A2 due.	[transformers & detection & segmentation & pose]
Advanced topics
7	Nov 19 Salle Evariste Galois	Gül Varol	Generative models; Vision & language -Generation Chapter: Probabilistic Machine Learning: Advanced Topics [Murphy 2023], -VAEs: Auto-Encoding Variational Bayes [Kingma and Welling, ICLR 2014], -GANs: Generative adversarial nets [Goodfellow et al., NeurIPS 2014], -Diffusion: Denoising diffusion probabilistic models [Ho et al., NeurIPS 2020], -Diffusion tutorial: Understanding Diffusion Models: A Unified Perspective [Luo 2022], -CLIP: Learning Transferable Visual Models From Natural Language Supervision [Radford et al., ICML 2021], -Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models [Rombach et al., CVPR 2022], -BLIP: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [Li et al., ICML 2022] FP proposal due.	[generative & VL]
8	Nov 26 Salle des Actes	Cordelia Schmid	Human action recognition in videos Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation [Brox and Malik, PAMI 2011], DeepFlow: Large displacement optical flow with deep matching [Weinzaepfel et al., ICCV 2013], Learning realistic human actions from movies [Laptev et al., CVPR 2008], Dense trajectories and motion boundary descriptors for action recognition [Wang et al., CVPR 2011], Two-stream convolutional networks for action recognition in videos [Simonyan and Zisserman, NeurIPS 2014], Learning spatiotemporal features with 3D convolutional networks [Tran et al., ICCV 2015] A3 due.	[videos]
9	Dec 3	Ivan Laptev	Vision for robotics SFV: Reinforcement learning of physical skills from videos [Peng et al., ACM TOG 2019], HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips [Miech et al., ICCV 2019], Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [Chen et al., CVPR 2022], Instruction-driven history-aware policies for robotic manipulations [Guhur et al., CoRL 2022], Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [Ahn et al., CoRL 2022], GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos [Soucek et al., CVPR 2024]	[robotics]
10	Dec 10	Mathieu Aubry	3D computer vision 1. 3D analysis: Volumetric and multi-view cnns for object classification on 3d data [Qi et al., CVPR 2016]; PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation [Qi et al., CVPR 2017]; 3D-CODED: 3D correspondences by deep deformation [Groueix et al., ECCV 2018]. 2. 3D generation: A point set generation network for 3d object reconstruction from a single image [Fan et al., CVPR 2017]; AtlasNet: A papier-mâché approach to learning 3d surface generation [Groueix et al., CVPR 2018]; DeepSDF: Learning continuous signed distance functions for shape representation [Park et al., CVPR 2019]; NeRF: Representing scenes as neural radiance fields for view synthesis [Mildenhall et al., ECCV 2020]. 3. Recent works: Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives [Monnier et al., NeurIPS 2023]; Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans [Loiseau et al., CVPR 2024]; Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency [Monnier et al., ECCV 2022]; 3D Gaussian Splatting for Real-Time Radiance Field Rendering [Kerbl et al., SIGGRAPH 2023]. 4. Training with synthetic data: Domain randomization for transferring deep neural networks from simulation to the real world [Tobin et al., IROS 2017]; Unbiased look at dataset bias [Torralba and Efros, CVPR 2011]; Domain-adversarial training of neural networks [Ganin et al., JMLR 2016].	[3D]
FP	Jan 6-7	Gül Varol	FP presentations Presentations will be virtual; the schedule will be announced soon. FP report due Jan 13.

Resources

D.A. Forsyth and J. Ponce, "Computer Vision: A Modern Approach", Prentice-Hall, 2nd edition, 2011
J. Ponce, M. Hebert, C. Schmid and A. Zisserman "Toward Category-Level Object Recognition", Lecture Notes in Computer Science 4170, Springer-Verlag, 2007
O. Faugeras, Q.T. Luong, and T. Papadopoulo, "Geometry of Multiple Images", MIT Press, 2001.
R. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision", Cambridge University Press, 2004.
J. Koenderink, "Solid Shape", MIT Press, 1990
R. Szeliski, "Computer Vision: Algorithms and Applications, 2nd ed.", 2022. Online book.
Computer Vision: Models, Learning, and Inference by Simon J.D. Prince (2012)
Understanding Deep Learning by Simon J.D. Prince (2023)
Deep Learning by I. Goodfellow, Y. Bengio and A. Courville (2016)
Michael Nielsen's online book on Neural Networks and Deep Learning (2019)
David Forsyth's Applied Machine Learning textbook draft (2019)
Andrej Karpathy blog
Previous editions of the course:
- RecVis 2009, RecVis 2010, RecVis 2011, RecVis 2012, RecVis 2013, RecVis 2014, RecVis 2015, RecVis 2016, RecVis 2017, RecVis 2018, RecVis 2019, RecVis 2020, RecVis 2021, RecVis 2022, RecVis 2023