
Gül Varol


Bio: Gül Varol is a permanent researcher (~Assoc. Prof.) in the IMAGINE group at École des Ponts ParisTech, an ELLIS Scholar, and a Guest Scientist at MPI. Previously, she was a postdoctoral researcher at the University of Oxford (VGG), working with Andrew Zisserman. She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis, co-advised by Ivan Laptev and Cordelia Schmid, received PhD awards from ELLIS and AFRIF. During her PhD, she spent time at MPI, Adobe, and Google. Prior to that, she received her BS and MS degrees from Boğaziçi University. She regularly serves as an Area Chair at major computer vision conferences and served as a Program Chair at ECCV'24. She is an associate editor for IJCV and served on the award committee for ICCV'23. She has co-organized a number of workshops at CVPR, ICCV, ECCV, and NeurIPS. Her research interests cover vision and language applications, including video representation learning, human motion synthesis, and sign languages.


Prospective students: Please apply by filling out this form.


News


Activities


Research

See Google Scholar profile for a full list of publications.

MotionFix: Text-Driven 3D Human Motion Editing
Nikos Athanasiou, Alpár Cseke, Markos Diomataris, Michael J. Black, and Gül Varol
SIGGRAPH Asia 2024.
@INPROCEEDINGS{athanasiou24motionfix,
  title     = {{MotionFix}: Text-Driven {3D} Human Motion Editing},
  author    = {Athanasiou, Nikos and Cseke, Alp\'{a}r and Diomataris, Markos and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {SIGGRAPH Asia},
  year      = {2024}
}

The focus of this paper is 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The challenges include the lack of training data and the design of a model that faithfully edits the source motion. In this paper, we address both these challenges. We build a methodology to semi-automatically collect a dataset of triplets in the form of (i) a source motion, (ii) a target motion, and (iii) an edit text, and create the new MotionFix dataset. Having access to such data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We further build various baselines trained only on text-motion pair datasets, and show superior performance of our model trained on triplets. We introduce new retrieval-based metrics for motion editing, and establish a new benchmark on the evaluation set of MotionFix. Our results are encouraging, paving the way for further research on fine-grained motion generation. Code, models and data are available at our project website.

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman
ACCV 2024.
@INPROCEEDINGS{xie24_autoadzero,
  title     = {{AutoAD-Zero}: A Training-Free Framework for Zero-Shot Audio Description},
  author    = {Xie, Junyu and Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Xie, Weidi and Zisserman, Andrew},
  booktitle = {ACCV},
  year      = {2024}
}

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications, without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
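
The two-stage recipe described above can be summarised in a few lines of Python; the vlm and llm callables below are hypothetical stand-ins for off-the-shelf models, not the paper's released code.

def generate_ad(video_clip, character_names, vlm, llm):
    """Training-free AD sketch: dense VLM description, then LLM summarisation."""
    # Stage 1: prompt the VLM, injecting character information directly.
    dense_description = vlm(
        video_clip,
        "Describe this video in detail. Characters present: "
        + ", ".join(character_names),
    )
    # Stage 2: compress the dense description into one succinct AD sentence.
    return llm(
        "Summarise the following into a single audio description sentence:\n"
        + dense_description
    )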

A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude*, K R Prajwal*, Liliane Momeni*, Hannah Bull, Samuel Albanie, Andrew Zisserman and Gül Varol
arXiv 2024.
@ARTICLE{raude24cslr2,
  title   = {A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision},
  author  = {Raude, Charles and Prajwal, K R and Momeni, Liliane and Bull, Hannah and Albanie, Samuel and Zisserman, Andrew and Varol, G{\"u}l},
  journal = {arXiv},
  year    = {2024}
}

In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output embeddings in a joint space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new dataset annotations that have been manually collected. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that by a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.

AutoAD III: The Prequel -- Back to the Pixels
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman
CVPR 2024.
@INPROCEEDINGS{han24_autoad3,
  title     = {{AutoAD III}: {T}he Prequel -- Back to the Pixels},
  author    = {Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Xie, Weidi and Zisserman, Andrew},
  booktitle = {CVPR},
  year      = {2024}
}

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well matched to human performance. Taken together, we improve the state of the art on AD generation.

TMR++: A Cross-Dataset Study for Text-based 3D Human Motion Retrieval
Léore Bensabath, Mathis Petrovich, and Gül Varol
CVPRW 2024.
@INPROCEEDINGS{bensabath24tmrpp,
  title     = {A Cross-Dataset Study for Text-based {3D} Human Motion Retrieval},
  author    = {Bensabath, L\'{e}ore and Petrovich, Mathis and Varol, G{\"u}l},
  booktitle = {CVPRW},
  year      = {2024}
}

We provide results of our study on text-based 3D human motion retrieval, with a particular focus on cross-dataset generalization. Due to practical reasons such as dataset-specific human body representations, existing works typically benchmark by training and testing on partitions from the same dataset. Here, we employ a unified SMPL body format for all datasets, which allows us to perform training on one dataset, testing on another, as well as training on a combination of datasets. Our results suggest that there exist dataset biases in standard text-motion benchmarks such as HumanML3D, KIT Motion-Language, and BABEL. We show that text augmentations help close the domain gap to some extent, but the gap remains. We further provide the first zero-shot action recognition results on BABEL, without using categorical action labels during training, opening up a new avenue for future research.

Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation
Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng and Davis Rempe
CVPRW 2024.
@INPROCEEDINGS{petrovich24stmc,
  title     = {Multi-Track Timeline Control for Text-Driven {3D} Human Motion Generation},
  author    = {Petrovich, Mathis and Litany, Or and Iqbal, Umar and Black, Michael J. and Varol, G{\"u}l and Peng, Xue Bin and Rempe, Davis},
  booktitle = {CVPRW},
  year      = {2024}
}

Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.
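
To illustrate the test-time idea described above, here is a toy sketch of a single denoising step that handles each timeline interval separately and keeps only the body-part dimensions engaged by that interval. The function and array layout (denoise_fn, body_part_masks, a (T, D) motion) are illustrative assumptions, not the released STMC code.

import numpy as np

def timeline_denoise_step(x_t, timeline, denoise_fn, body_part_masks):
    """x_t: (T, D) noisy motion; timeline: list of (start, end, prompt, parts)."""
    x_out = x_t.copy()
    for start, end, prompt, parts in timeline:
        # Denoise this temporal crop conditioned on its own text prompt.
        pred = denoise_fn(x_t[start:end], prompt)          # (end - start, D)
        # Keep only the feature dimensions of the body parts used by this action.
        mask = np.zeros(x_t.shape[1], dtype=bool)
        for p in parts:
            mask |= body_part_masks[p]
        x_out[start:end, mask] = pred[:, mask]
    return x_out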

CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol
AAAI 2024.
@INPROCEEDINGS{ventura24covr,
  title     = {{CoVR}: Learning Composed Video Retrieval from Web Video Captions},
  author    = {Ventura, Lucas and Yang, Antoine and Schmid, Cordelia and Varol, G{\"u}l},
  booktitle = {AAAI},
  year      = {2024}
}

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
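
A rough sketch of the triplet-mining idea described above, in Python: pair videos whose captions are nearly identical, then ask a language model to phrase the difference as a modification text. The generate_edit_text callable is a hypothetical LLM wrapper, and the brute-force pairing is for illustration only, not the released dataset-construction pipeline.

from difflib import SequenceMatcher

def mine_covr_triplets(video_captions, generate_edit_text, sim_threshold=0.8):
    """video_captions: list of (video_id, caption). Returns (source, edit_text, target) triplets."""
    triplets = []
    for i, (vid_a, cap_a) in enumerate(video_captions):
        for vid_b, cap_b in video_captions[i + 1:]:
            # Pair videos whose captions are nearly, but not exactly, the same.
            sim = SequenceMatcher(None, cap_a, cap_b).ratio()
            if sim_threshold <= sim < 1.0:
                # Ask an LLM to describe the change from caption A to caption B.
                edit_text = generate_edit_text(cap_a, cap_b)
                triplets.append((vid_a, edit_text, vid_b))
    return triplets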


CoVR-2: Automatic Data Construction for Composed Video Retrieval
TPAMI 2024. (Journal extension)
@ARTICLE{ventura24covr2,
  title   = {{CoVR}-2: Automatic Data Construction for Composed Video Retrieval},
  author  = {Ventura, Lucas and Yang, Antoine and Schmid, Cordelia and Varol, G{\"u}l},
  journal = {TPAMI},
  year    = {2024}
}

AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman
ICCV 2023.
@INPROCEEDINGS{han23_autoad2,
  title     = {{AutoAD II}: {T}he Sequel -- Who, When, and What in Movie Audio Description},
  author    = {Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Xie, Weidi and Zisserman, Andrew},
  booktitle = {ICCV},
  year      = {2023}
}

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
Mathis Petrovich, Michael J. Black, and Gül Varol
ICCV 2023.
@INPROCEEDINGS{petrovich23tmr,
  title     = {{TMR}: Text-to-Motion Retrieval Using Contrastive {3D} Human Motion Synthesis},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {ICCV},
  year      = {2023}
}

In this paper, we present TMR, a simple yet effective approach for text to 3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available at https://mathis.petrovich.fr/tmr.
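
The combination of a generation loss with contrastive training mentioned above can be sketched as follows; this is a simplified PyTorch illustration of the general recipe (symmetric InfoNCE over text/motion latents plus a reconstruction term), with made-up weights and names, not the released TMR code.

import torch
import torch.nn.functional as F

def tmr_style_loss(text_emb, motion_emb, recon_loss, temperature=0.1, lam=0.1):
    """text_emb, motion_emb: (B, D) latent codes; recon_loss: scalar motion-generation loss."""
    t = F.normalize(text_emb, dim=-1)
    m = F.normalize(motion_emb, dim=-1)
    logits = t @ m.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)    # matched pairs lie on the diagonal
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # Keeping the generation loss alongside the contrastive term is the key point.
    return recon_loss + lam * contrastive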

SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation
Nikos Athanasiou*, Mathis Petrovich*, Michael J. Black, and Gül Varol
ICCV 2023.
@INPROCEEDINGS{athanasiou23sinc,
  title     = {{SINC}: Spatial Composition of {3D} Human Motions for Simultaneous Action Generation},
  author    = {Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {ICCV},
  year      = {2023}
}

Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example 'waving hand' while 'walking' at the same time. We refer to generating such simultaneous movements as performing 'spatial compositions'. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action?", while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). Our experiments show that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/.
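
As a toy illustration of the body-part composition step above: once a language model has indicated which joints an action uses, the second motion's joints can be overwritten onto a base motion. The (T, J, 3) joint layout and joint indices are assumptions for the sketch, not the SINC implementation.

import numpy as np

def compose_motions(motion_a, motion_b, joints_for_b):
    """Overwrite the joints used by action B onto base motion A (both (T, J, 3) arrays)."""
    T = min(len(motion_a), len(motion_b))
    composite = motion_a[:T].copy()
    composite[:, joints_for_b] = motion_b[:T, joints_for_b]
    return composite

# e.g. 'waving' while 'walking': take hypothetical arm-joint indices from the waving motion.
# composite = compose_motions(walking, waving, joints_for_b=[16, 18, 20])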

Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Paola Cascante-Bonilla*, Khaled Shehada*, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, and Leonid Karlinsky
ICCV 2023.
@INPROCEEDINGS{cascantebonilla23_syvic,
  title     = {Going Beyond Nouns With Vision \& Language Models Using Synthetic Data},
  author    = {Paola Cascante-Bonilla and Khaled Shehada and James Seale Smith and Sivan Doveh and Donghyun Kim and Rameswar Panda and G{\"u}l Varol and Aude Oliva and Vicente Ordonez and Rogerio Feris and Leonid Karlinsky},
  booktitle = {ICCV},
  year      = {2023}
}

Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: for example, they have difficulty understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or difficulty performing compositional reasoning, such as understanding the significance of the order of the words in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC), a million-scale synthetic dataset and data generation codebase that allows generating additional suitable data to improve VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g. by 9.9% on ARO and 4.3% on VL-Checklist) with under a 1% drop in their zero-shot accuracy.

Learning text-to-video retrieval from image captioning
Lucas Ventura, Cordelia Schmid, and Gül Varol
CVPRW 2023.
@INPROCEEDINGS{ventura23multicaps,
  title     = {Learning text-to-video retrieval from image captioning},
  author    = {Ventura, Lucas and Schmid, Cordelia and Varol, G{\"u}l},
  booktitle = {CVPRW},
  year      = {2023}
}

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training, which adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework, outperforming the CLIP zero-shot baseline on text-to-video retrieval on standard datasets, including MSR-VTT and MSVD. Code and models will be made publicly available.
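
The relevance-weighted temporal pooling mentioned above can be sketched in a few lines of PyTorch: frame features are aggregated with weights given by their similarity to a sampled caption. This is an illustrative simplification under assumed tensor shapes, not the released code.

import torch
import torch.nn.functional as F

def caption_weighted_pooling(frame_feats, caption_feat, temperature=0.07):
    """frame_feats: (T, D) per-frame embeddings; caption_feat: (D,) text embedding."""
    frames = F.normalize(frame_feats, dim=-1)
    caption = F.normalize(caption_feat, dim=-1)
    scores = frames @ caption / temperature          # (T,) relevance of each frame to the caption
    weights = scores.softmax(dim=0)                  # normalise over time
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)   # (D,) pooled video embedding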


Learning text-to-video retrieval from image captioning
IJCV 2024. (Journal extension)
@ARTICLE{ventura24multicaps,
  title   = {Learning text-to-video retrieval from image captioning},
  author  = {Ventura, Lucas and Schmid, Cordelia and Varol, G{\"u}l},
  journal = {IJCV},
  year    = {2024}
}

AutoAD: Movie Description in Context
Tengda Han*, Max Bain*, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman
CVPR 2023 (Highlight).
@INPROCEEDINGS{han23_autoad,
  title     = {{AutoAD}: Movie Description in Context},
  author    = {Han, Tengda and Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Xie, Weidi and Zisserman, Andrew},
  booktitle = {CVPR},
  year      = {2023}
}

The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles of the current shot; (ii) we address the lack of training data by pretraining on large scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.

Weakly-supervised Fingerspelling Recognition in British Sign Language Videos
K R Prajwal*, Hannah Bull*, Liliane Momeni*, Samuel Albanie, Gül Varol, and Andrew Zisserman
BMVC 2022.
@INPROCEEDINGS{prajwal22transpeller,
  title     = {Weakly-supervised Fingerspelling Recognition in British Sign Language Videos},
  author    = {Prajwal, K R and Bull, Hannah and Momeni, Liliane and Albanie, Samuel and Varol, G{\"u}l and Zisserman, Andrew},
  booktitle = {BMVC},
  year      = {2022}
}

The goal of this work is to detect and recognize sequences of letters signed using fingerspelling in British Sign Language (BSL). Previous fingerspelling recognition methods have not focused on BSL, which has a very different signing alphabet (e.g., two-handed instead of one-handed) to American Sign Language (ASL). They also use manual annotations for training. In contrast to previous methods, our method only uses weak annotations from subtitles for training. We localize potential instances of fingerspelling using a simple feature similarity method, then automatically annotate these instances by querying subtitle words and searching for corresponding mouthing cues from the signer. We propose a Transformer architecture adapted to this task, with a multiple-hypothesis CTC loss function to learn from alternative annotation possibilities. We employ a multi-stage training approach, where we make use of an initial version of our trained model to extend and enhance our training data before re-training to achieve better performance. Through extensive evaluations, we verify our method for automatic annotation and our model architecture. Moreover, we provide a human expert annotated test set of 5K video clips for evaluating BSL fingerspelling recognition methods to support sign language research.
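
One simple reading of a multiple-hypothesis CTC objective is to compute the standard CTC loss for each candidate letter sequence and keep the lowest one; the sketch below shows that reading in PyTorch for a single clip. Shapes and the min-over-hypotheses reduction are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def multi_hypothesis_ctc(log_probs, hypotheses, input_length, blank=0):
    """log_probs: (T, 1, C) log-softmax outputs for one clip;
    hypotheses: list of 1D LongTensors with candidate letter indices."""
    losses = []
    for target in hypotheses:
        loss = F.ctc_loss(
            log_probs,
            target.unsqueeze(0),                  # (1, S) candidate letter sequence
            torch.tensor([input_length]),         # length of the input sequence
            torch.tensor([len(target)]),          # length of this candidate
            blank=blank,
        )
        losses.append(loss)
    # Keep the hypothesis that best explains the clip.
    return torch.stack(losses).min()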

TEACH: Temporal Action Composition for 3D Humans
Nikos Athanasiou, Mathis Petrovich, Michael J. Black, and Gül Varol
3DV 2022.
@INPROCEEDINGS{athanasiou22teach,
  title     = {{TEACH}: Temporal Action Composition for {3D} Humans},
  author    = {Athanasiou, Nikos and Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {3DV},
  year      = {2022}
}

Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text, and follow the temporal order of the instructions. In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition. The current state of the art in text-conditioned motion synthesis only takes a single action or a single sentence as input. This is partially due to lack of suitable training data containing action sequences, but also due to the computational complexity of their non-autoregressive model formulation, which does not scale well to long sequences. In this work, we address both issues. First, we exploit the recent BABEL motion-text collection, which has a wide range of labeled actions, many of which occur in a sequence with transitions between them. Next, we design a Transformer-based approach that operates non-autoregressively within an action, but autoregressively within the sequence of actions. This hierarchical formulation proves effective in our experiments when compared with multiple baselines. Our approach, called TEACH for “TEmporal Action Compositions for Human motions”, produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions. To encourage work on this new task, we make our code available for research purposes at teach.is.tue.mpg.de.

Automatic dense annotation of large-vocabulary sign language videos
Liliane Momeni*, Hannah Bull*, K R Prajwal*, Samuel Albanie, Gül Varol, and Andrew Zisserman
ECCV 2022.
@INPROCEEDINGS{momeni22bsldensify,
  title     = {Automatic dense annotation of large-vocabulary sign language videos},
  author    = {Momeni, Liliane and Bull, Hannah and Prajwal, K R and Albanie, Samuel and Varol, G{\"u}l and Zisserman, Andrew},
  booktitle = {ECCV},
  year      = {2022}
}

Recently, sign language researchers have turned to sign language interpreted TV broadcasts, comprising (i) a video of continuous signing and (ii) subtitles corresponding to the audio content, as a readily available and large-scale source of training data. One key challenge in the usability of such data is the lack of sign annotations. Previous work exploiting such weakly-aligned data only found sparse correspondences between keywords in the subtitle and individual signs. In this work, we propose a simple, scalable framework to vastly increase the density of automatic annotations. Our contributions are the following: (1) we significantly improve previous annotation methods by making use of synonyms and subtitle-signing alignment; (2) we show the value of pseudo-labelling from a sign recognition model as a way of sign spotting; (3) we propose a novel approach for increasing our annotations of known and unknown classes based on in-domain exemplars; (4) on the BOBSL BSL sign language corpus, we increase the number of confident automatic annotations from 670K to 5M. We make these annotations publicly available to support the sign language research community.

TEMOS: Generating diverse human motions from textual descriptions
Mathis Petrovich, Michael J. Black and Gül Varol
ECCV 2022 (Oral).
@INPROCEEDINGS{petrovich22temos,
  title      = {{TEMOS}: Generating diverse human motions from textual descriptions},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {ECCV},
  year      = {2022}
}

We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
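
A minimal sketch of the idea of a text encoder producing distribution parameters compatible with a motion VAE latent space, with a reparameterised sample and a KL term towards a standard-normal prior. Layer sizes and names are placeholders, not the released TEMOS architecture.

import torch
import torch.nn as nn

class TextToLatent(nn.Module):
    def __init__(self, text_dim=768, latent_dim=256):
        super().__init__()
        self.mu_head = nn.Linear(text_dim, latent_dim)
        self.logvar_head = nn.Linear(text_dim, latent_dim)

    def forward(self, text_feat):
        mu, logvar = self.mu_head(text_feat), self.logvar_head(text_feat)
        # Reparameterisation: sample a latent that a motion decoder could consume.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # KL term keeping the text-conditioned distribution close to the prior.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return z, kl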

Sign Language Video Retrieval with Free-Form Textual Queries
Amanda Duarte, Samuel Albanie, Xavier Giró-i-Nieto and Gül Varol
CVPR 2022.
@INPROCEEDINGS{duarte22slretrieval,
  title     = {Sign Language Video Retrieval with Free-Form Textual Queries},
  author    = {Duarte, Amanda and Albanie, Samuel and Gir{\'o}-i-Nieto, Xavier and Varol, G{\"u}l},
  booktitle = {CVPR},
  year      = {2022}
}

Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos, the objective is to find the signing video in the collection that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding which suffers from a scarcity of labeled training data. We, therefore, propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.

BOBSL: BBC-Oxford British Sign Language Dataset
Samuel Albanie*, Gül Varol*, Liliane Momeni*, Hannah Bull*, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland and Andrew Zisserman
arXiv 2021.
@ARTICLE{albanie21bobsl,
  title   = {{BOBSL}: {BBC}-{O}xford {B}ritish {S}ign {L}anguage Dataset},
  author  = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Bull, Hannah and Afouras, Triantafyllos and Chowdhury, Himel and Fox, Neil and Woll, Bencie and Cooper, Rob and McParland, Andrew and Zisserman, Andrew},
  journal = {arXiv},
  year    = {2021}
}

In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL). BOBSL is an extended and publicly released dataset based on the BSL-1K dataset introduced in previous work. We describe the motivation for the dataset, together with statistics and available annotations. We conduct experiments to provide baselines for the tasks of sign recognition, sign language alignment, and sign language translation. Finally, we describe several strengths and limitations of the data from the perspectives of machine learning and linguistics, note sources of bias present in the dataset, and discuss potential applications of BOBSL in the context of sign language technology. The dataset is publicly available.

Towards unconstrained joint hand-object reconstruction from RGB videos
Yana Hasson, Gül Varol, Cordelia Schmid and Ivan Laptev
3DV 2021.
@INPROCEEDINGS{hasson21homan,
  title     = {Towards unconstrained joint hand-object reconstruction from {RGB} videos},
  author    = {Hasson, Yana and Varol, G{\"u}l and Schmid, Cordelia and Laptev, Ivan},
  booktitle = {3DV},
  year      = {2021}
}

Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds a great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is available. In this paper we first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions. Our method relies on cues obtained with common methods for object detection, hand pose estimation and instance segmentation. We quantitatively evaluate our approach and show that it can be applied to datasets with varying levels of difficulty for which training data is unavailable.

Aligning Subtitles in Sign Language Videos
Hannah Bull*, Triantafyllos Afouras*, Gül Varol, Samuel Albanie, Liliane Momeni and Andrew Zisserman
ICCV 2021.
@INPROCEEDINGS{bull21bslalign,
  title     = {Aligning Subtitles in Sign Language Videos},
  author    = {Bull, Hannah and Afouras, Triantafyllos and Varol, G{\"u}l and Albanie, Samuel and Momeni, Liliane and Zisserman, Andrew},
  booktitle = {ICCV},
  year      = {2021}
}

The goal of this work is to temporally align asynchronous subtitles in sign language videos. In particular, we focus on sign-language interpreted TV broadcast data comprising (i) a video of continuous signing, and (ii) subtitles corresponding to the audio content. Previous work exploiting such weakly-aligned data only considered finding keyword-sign correspondences, whereas we aim to localise a complete subtitle text in continuous signing. We propose a Transformer architecture tailored for this task, which we train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video. We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals, which interact through a series of attention layers. Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not. Through extensive evaluations, we show substantial improvements over existing alignment baselines that do not make use of subtitle text embeddings for learning. Our automatic alignment model opens up possibilities for advancing machine translation of sign languages via providing continuously synchronized video-text data.

Action-Conditioned 3D Human Motion Synthesis with Transformer VAE
Mathis Petrovich, Michael J. Black and Gül Varol
ICCV 2021.
@INPROCEEDINGS{petrovich21actor,
  title     = {Action-Conditioned 3{D} Human Motion Synthesis with Transformer {VAE}},
  author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
  booktitle = {ICCV},
  year      = {2021}
}

We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising. Our code and models will be made available.
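
The duration control described above (querying a certain length through positional encodings while conditioning on an action class) can be illustrated with the following conceptual PyTorch sketch; the layer sizes, learned time queries, and action-bias conditioning are simplifying assumptions, not the released ACTOR model.

import torch
import torch.nn as nn

class ActionMotionDecoder(nn.Module):
    def __init__(self, num_actions, latent_dim=256, pose_dim=144, max_len=200):
        super().__init__()
        self.action_bias = nn.Embedding(num_actions, latent_dim)
        self.time_queries = nn.Parameter(torch.randn(max_len, latent_dim))  # positional queries
        layer = nn.TransformerDecoderLayer(latent_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def sample(self, action_id, duration):
        z = torch.randn(1, 1, self.action_bias.embedding_dim)        # latent sample
        z = z + self.action_bias(torch.tensor([[action_id]]))        # condition on the action class
        queries = self.time_queries[:duration].unsqueeze(0)          # (1, duration, D): sets the length
        h = self.decoder(queries, memory=z)                          # cross-attend to the latent
        return self.to_pose(h)                                       # (1, duration, pose_dim)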

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol and Andrew Zisserman
ICCV 2021.
@INPROCEEDINGS{bain21_frozen,
  title     = {Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
  author    = {Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Zisserman, Andrew},
  booktitle = {ICCV},
  year      = {2021}
}

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image and video text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as 'frozen' snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset WebVid-2M, comprised of over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
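
The curriculum idea above (start from single-frame "videos", then grow the temporal context) can be captured by a tiny schedule helper; the epoch boundaries and frame counts below are illustrative assumptions, not the paper's actual schedule.

def frames_for_epoch(epoch, schedule=((0, 1), (2, 4), (4, 8))):
    """Return how many frames to sample per clip at a given epoch,
    starting from single-frame (image) training."""
    num_frames = 1
    for start_epoch, n in schedule:
        if epoch >= start_epoch:
            num_frames = n
    return num_frames

# e.g. epochs 0-1 train on images, epochs 2-3 on 4-frame clips, then 8-frame clips.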

Read and Attend: Temporal Localisation in Sign Language Videos
Gül Varol*, Liliane Momeni*, Samuel Albanie*, Triantafyllos Afouras* and Andrew Zisserman
CVPR 2021.
@INPROCEEDINGS{varol21_bslattend,
  title     = {Read and Attend: Temporal Localisation in Sign Language Videos},
  author    = {Varol, G{\"u}l and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
  booktitle = {CVPR},
  year      = {2021}
}

The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to provide a more robust sign language benchmark; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.

Sign Language Segmentation with Temporal Convolutional Networks
Katrin Renz, Nicolaj C. Stache, Samuel Albanie and Gül Varol
ICASSP 2021.
@INPROCEEDINGS{renz21_segmentation,
  title     = {Sign Language Segmentation with Temporal Convolutional Networks},
  author    = {Renz, Katrin and Stache, Nicolaj C. and Albanie, Samuel and Varol, G{\"u}l},
  booktitle = {ICASSP},
  year      = {2021}
}

The objective of this work is to determine the location of temporal boundaries between signs in continuous sign language videos. Our approach employs 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. We demonstrate the effectiveness of our approach on the BSLCORPUS, PHOENIX14 and BSL-1K datasets, showing considerable improvement over the state of the art and the ability to generalise to new signers, languages and domains.

Synthetic Humans for Action Recognition from Unseen Viewpoints
Gül Varol, Ivan Laptev, Cordelia Schmid and Andrew Zisserman
IJCV 2021.
@ARTICLE{varol21_surreact,
  title   = {Synthetic Humans for Action Recognition from Unseen Viewpoints},
  author  = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia and Zisserman, Andrew},
  journal = {IJCV},
  year    = {2021}
}

Our goal in this work is to improve the performance of human action recognition for viewpoints unseen during training by using synthetic training data. Although synthetic data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (i) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (ii) We introduce a new dataset, SURREACT, that allows supervised training of spatio-temporal CNNs for action classification; (iii) We substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; Finally, (iv) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.

Watch, read and lookup: learning to spot signs from multiple supervisors
Liliane Momeni*, Gül Varol*, Samuel Albanie*, Triantafyllos Afouras and Andrew Zisserman
ACCV 2020. (Best Application Paper Award)
@INPROCEEDINGS{momeni20_spotting,
  title     = {Watch, read and lookup: learning to spot signs from multiple supervisors},
  author    = {Momeni, Liliane and Varol, G{\"u}l and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
  booktitle = {ACCV},
  year      = {2020}
}

The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage with a semi-supervised learning objective; (2) reading associated subtitles (readily available translations of the signed content), which provide additional weak supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on few-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page.


Scaling up sign spotting through sign language dictionaries
IJCV 2022. (Journal Extension)
@ARTICLE{varol22_spotting,
  title   = {Scaling up sign spotting through sign language dictionaries},
  author  = {Varol, G{\"u}l and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
  journal = {IJCV},
  year    = {2022}
}

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Samuel Albanie*, Gül Varol*, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox and Andrew Zisserman
ECCV 2020.
@INPROCEEDINGS{albanie20_bsl1k,
  title     = {{BSL-1K}: {S}caling up co-articulated sign language recognition using mouthing cues},
  author    = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Afouras, Triantafyllos and Chung, Joon Son and Fox, Neil and Zisserman, Andrew},
  booktitle = {ECCV},
  year      = {2020}
}

Recent progress in fine-grained gesture and action classification, and machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign-instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data - the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks - we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area.

Learning joint reconstruction of hands and manipulated objects
Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev and Cordelia Schmid
CVPR 2019.
@INPROCEEDINGS{hasson19_obman,
  title     = {Learning joint reconstruction of hands and manipulated objects},
  author    = {Hasson, Yana and Varol, G{\"u}l and Tzionas, Dimitrios and Kalevatykh, Igor and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {CVPR},
  year      = {2019}
}

Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.
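
A toy version of a contact-style regulariser in the spirit described above: encourage the hand to reach the object surface while penalising interpenetration. The signed-distance input, signs, and reductions are illustrative assumptions, not the exact ObMan contact loss.

import torch

def contact_regulariser(signed_dists):
    """signed_dists: (V,) signed distance of each hand vertex to the object
    surface (negative values mean the vertex is inside the object)."""
    outside = signed_dists.clamp(min=0.0)
    # Attraction: the closest exterior vertex should reach the surface.
    attraction = outside.min()
    # Repulsion: penalise any vertex that penetrates the object.
    penetration = (-signed_dists).clamp(min=0.0).mean()
    return attraction + penetration

# Example with dummy signed distances:
# loss = contact_regulariser(torch.randn(1000) * 0.01)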

BodyNet: Volumetric Inference of 3D Human Body Shapes
Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev and Cordelia Schmid
ECCV 2018.
@INPROCEEDINGS{varol18_bodynet,
  title     = {{BodyNet}: Volumetric Inference of {3D} Human Body Shapes},
  author    = {Varol, G{\"u}l and Ceylan, Duygu and Russell, Bryan and Yang, Jimei and Yumer, Ersin and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {ECCV},
  year      = {2018}
}

Human shape estimation is an important task for video editing, animation and fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in performance improvement as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

Long-term Temporal Convolutions for Action Recognition
Gül Varol, Ivan Laptev and Cordelia Schmid
TPAMI 2018.
@ARTICLE{varol18_ltc,
  title     = {Long-term Temporal Convolutions for Action Recognition},
  author    = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year      = {2018},
  volume    = {40},
  number    = {6},
  pages     = {1510--1517},
  doi       = {10.1109/TPAMI.2017.2712608}
}

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).
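
A minimal illustration of the long-term temporal convolution idea: a 3D convolutional block whose input spans many frames (e.g. 60) rather than a short snippet. The layer sizes and input resolution are placeholders, not the LTC architecture itself.

import torch
import torch.nn as nn

ltc_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),   # space-time convolution
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),
)

clip = torch.randn(1, 3, 60, 58, 58)   # batch, channels, 60 frames, height, width
features = ltc_block(clip)             # temporal extent preserved by the conv, halved by pooling
print(features.shape)                  # torch.Size([1, 64, 30, 29, 29])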

Learning from Synthetic Humans
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev and Cordelia Schmid
CVPR 2017.
@INPROCEEDINGS{varol17_surreal,
  title     = {Learning from Synthetic Humans},
  author    = {Varol, G{\"u}l and Romero, Javier and Martin, Xavier and Mahmood, Naureen and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
  booktitle = {CVPR},
  year      = {2017}
}

Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev and Abhinav Gupta
ECCV 2016.
@INPROCEEDINGS{sigurdsson16_charades,
  title     = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  author    = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ivan Laptev and Ali Farhadi and Abhinav Gupta},
  booktitle = {ECCV},
  year      = {2016}
}

Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.

Theses

HDR Thesis

Bridging Language and Dynamic Visual Data
Gül Varol
Institut Polytechnique de Paris (IPP), 21 November 2024.
@ARTICLE{varol24_hdr,
  title  = {Bridging Language and Dynamic Visual Data},
  author = {G{\"u}l Varol},
  school = {Institut Polytechnique de Paris (IPP)},
  note   = {Habilitation thesis},
  year   = {2024}
}

In the past decades, the fields of computer vision and natural language processing have evolved concurrently, albeit somewhat independently. Recent developments have witnessed a convergence between these two domains, both in terms of cross-inspiration for method development and the creation of unified frameworks capable of handling diverse data types, such as images and text. This manuscript outlines our recent research at the intersection of vision and language, with a particular focus on dynamic visual data, including videos and 3D human motions. Our contributions encompass three key areas: (i) generative modeling for text-to-human motion synthesis, (ii) addressing training data scarcity for text-to-video retrieval, and (iii) automatically annotating sign language videos with text. The common denominator among these contributions is the inclusion of text in the tasks we address. Despite differences in data sources and domain knowledge for task-specific solutions, our methodologies share common tools such as visual sequence modeling with transformers, contrastive learning for retrieval scenarios, and leveraging large language models for text modeling. The manuscript is organized into three main chapters to present these contributions.

In the first part, we delve into human motion synthesis, posing the question: is human motion a language without words? More specifically, can human motions be described or controlled by words? As an effort to answer this question, we develop methods for generating 3D human body movements given textual descriptions. Each work presented in this chapter investigates increasing granularity towards fine-grained semantic control, allowing both simultaneous actions and series of actions. Our approaches employ variational autoencoders with transformer architectures, representing 3D motion as a sequence of parametric body models. The promising results underscore the potential of text-conditioned generative models in this domain, while limitations point to the need for future work on scaling up training data to unlock a broader vocabulary of action descriptions.

The second part emphasizes bridging video data and language, addressing challenges posed by limited annotated data in the video domain. We introduce strategies to overcome training data scarcity, including a large-scale collection of captioned web videos which allows for end-to-end video representation learning via text supervision. Additionally, we automate the annotation of text-video pairs for cross-modal retrieval training, using image captioning models on video frames. Similarly, we automatically construct image-text-video triplets for training video retrieval from composite image-text queries. Our focus extends to automatic audio description generation for movies, which can be seen as a form of long video captioning in the context of a story. Here, we employ partial training strategies by combining various sources of training data, such as text-only movie descriptions.

In the third and final part, we focus on sign language as a unique form of video inherently conveying language. Our works in this chapter span tasks such as temporal localization of signs, subtitle alignment, text-based retrieval, temporal segmentation as a form of tokenization, signer diarization, and fingerspelling detection. These studies represent pioneering attempts to scale up computational sign language analysis in an open-vocabulary setting.

This manuscript is intended to offer an overview of the aforementioned research efforts, framing each work within the broader context of our exploration of dynamic vision and language. Further details can be found in the corresponding publications, and the reader is encouraged to refer to them for additional insights.

- David Forsyth, University of Illinois Urbana-Champaign (reviewer)
- Kristen Grauman, University of Texas at Austin (reviewer)
- Jean Ponce, École Normale Supérieure - PSL (reviewer)
- Alexei (Alyosha) Efros, University of California Berkeley (examiner)
- Patrick Pérez, Kyutai (examiner)
- Andrew Zisserman, University of Oxford (examiner, collaborator)
- Michael J. Black, Max Planck Institute (examiner, collaborator)
- Cordelia Schmid, Inria (examiner, collaborator)

PhD Thesis

Learning human body and human action representations from visual data
Gül Varol
École Normale Supérieure (ENS), 29 May 2019.
@PHDTHESIS{varol19_thesis,
  title     = {Learning human body and human action representations from visual data},
  author    = {G{\"u}l Varol},
  school    = {Ecole Normale Sup\'erieure (ENS)},
  year      = {2019}
}

The focus of visual content is often people. Automatic analysis of people from visual data is therefore of great importance for numerous applications in content search, autonomous driving, surveillance, health care, and entertainment.

The goal of this thesis is to learn visual representations for human understanding. Particular emphasis is given to two closely related areas of computer vision: human body analysis and human action recognition.

In human body analysis, we first introduce a new synthetic dataset for people, the SURREAL dataset, for training convolutional neural networks (CNNs) with free labels. We show the generalization capabilities of such models on real images for the tasks of body part segmentation and human depth estimation. Our work demonstrates that models trained only on synthetic data obtain sufficient generalization on real images while also providing good initialization for further training. Next, we use this data to learn the 3D body shape from images. We propose the BodyNet architecture that benefits from the volumetric representation, the multi-view re-projection loss, and the multi-task training of relevant tasks such as 2D/3D pose estimation and part segmentation. Our experiments demonstrate the advantages from each of these components. We further observe that the volumetric representation is flexible enough to capture 3D clothing deformations, unlike the more frequently used parametric representation.

In human action recognition, we explore two different aspects of action representations. The first one is the discriminative aspect, which we improve by using long-term temporal convolutions. We present an extensive study on the spatial and temporal resolutions of an input video. Our results suggest that 3D CNNs should operate on long input videos to obtain state-of-the-art performance. We further extend 3D CNNs for optical flow input and highlight the importance of the optical flow quality. The second aspect that we study is the view-independence of the learned video representations. We enforce an additional similarity loss that maximizes the similarity between two temporally synchronous videos which capture the same action. When used in conjunction with the action classification loss in 3D CNNs, this similarity constraint helps improve generalization to unseen viewpoints.

In summary, our contributions are the following: (i) we generate photo-realistic synthetic data for people that allows training CNNs for human body analysis, (ii) we propose a multi-task architecture to recover a volumetric body shape from a single image, (iii) we study the benefits of long-term temporal convolutions for human action recognition using 3D CNNs, (iv) we incorporate similarity training in multi-view videos to design view-independent representations for action recognition.

- Francis Bach, Inria (president of the jury)
- Iasonas Kokkinos, University College London (reviewer)
- Marc Pollefeys, ETH Zurich (reviewer)
- Andrew Zisserman, University of Oxford (examiner)
- Ivan Laptev, Inria (supervisor)
- Cordelia Schmid, Inria (supervisor)


Teaching


People

Current:

Past:

Talks