Vision-Based End-to-End Driving
A fundamental problem in the autonomous vehicle domain is correctly handling rare events and objects from the tails of the distribution of possible observations. Such long-tail cases stress the entire perception, prediction, and action cycle, where current approaches struggle.
Learning Terrain Traversal from Human Strategies for Agile Robotics
Teaching robots to walk on complex terrains, such as rocky paths, uneven ground, or cluttered environments, remains a fundamental challenge in robotics and autonomous navigation. Traditional approaches rely on handcrafted rules, terrain classification, or reinforcement learning, but they often struggle to generalize to real-world, unstructured environments.
Scene Exploration and Object Search for Robotic Systems
Object search is the problem of having a robot find an object of interest. To do so, the robot has to explore the environment it is placed into until the object is found. To explore an environment, current robotic methods use geometric sensing, e.g., stereo cameras, LiDAR sensors, or similar, to create a 3D reconstruction of the environment with a clear distinction between 'known & occupied', 'known & unoccupied', and 'unknown' regions of space. The problem with the classic geometric-sensing approach is that it has no knowledge of doors, drawers, or other functional and dynamic elements. These, however, are easy to detect from images. We therefore want to extend prior object-search methods such as https://naoki.io/portfolio/vlfm with an algorithm that can also search through drawers and cabinets. The project will require you to train your own detector network to detect possible locations of an object, and then implement a robot planning algorithm that explores all the detected locations.
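For illustration, a minimal sketch of such a search loop; the `robot`, `mapper`, and `detector` interfaces are hypothetical placeholders, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchCandidate:
    position: tuple   # (x, y, z) in the map frame
    kind: str         # "frontier", "drawer", "cabinet", ...
    score: float      # confidence that the target object may be found here

def object_search(robot, mapper, detector, target: str, max_steps: int = 200):
    """Explore frontiers and detected containers until the target is found."""
    visited = set()
    for _ in range(max_steps):
        rgb, depth, pose = robot.observe()
        mapper.integrate(rgb, depth, pose)   # update occupied/free/unknown grid

        # Image-based detection of functional elements (doors, drawers, cabinets)
        candidates = [c for c in detector.detect(rgb, pose, target)
                      if c.position not in visited]
        # Geometric frontiers on the known/unknown boundary of the voxel map
        candidates += [c for c in mapper.frontiers() if c.position not in visited]

        if not candidates:
            return None                      # environment exhausted
        best = max(candidates, key=lambda c: c.score)
        visited.add(best.position)
        robot.navigate_to(best.position)
        if best.kind in ("drawer", "cabinet"):
            robot.open(best)                 # interact with articulated element
        if detector.found(robot.observe()[0], target):
            return best.position
    return None
```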
Action recognition in egocentric videos
Action recognition has long been a challenging task in video understanding. While recent deep learning models have achieved remarkable performance on various related benchmarks, their generalization capabilities remain limited. Furthermore, the task of action recognition is inherently ambiguous, as the same action can often be described using different verbs and levels of detail. In this project, we aim to address this ambiguity by leveraging low-level cues to enhance the disambiguation abilities of action recognition systems, as well as to improve their robustness to variations in viewpoint, interacted objects, and the manner in which the same action is performed.
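As a rough illustration of the idea, a late-fusion head that combines a clip embedding from any video backbone with a low-level cue vector (e.g., hand pose, object masks, optical flow); all dimensions and module choices below are assumptions:

```python
import torch
import torch.nn as nn

class CueFusionClassifier(nn.Module):
    """Fuse video features with low-level cues before verb classification."""
    def __init__(self, video_dim=768, cue_dim=128, num_verbs=100):
        super().__init__()
        self.cue_proj = nn.Linear(cue_dim, video_dim)   # project cues into video space
        self.head = nn.Sequential(nn.LayerNorm(video_dim),
                                  nn.Linear(video_dim, num_verbs))

    def forward(self, video_feat, cue_feat):
        # video_feat: (B, video_dim) clip embedding; cue_feat: (B, cue_dim) cues
        fused = video_feat + self.cue_proj(cue_feat)    # simple additive fusion
        return self.head(fused)                         # verb logits

# Usage with random tensors in place of real features:
model = CueFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128))
```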
Extending Functional Scene Graphs to Include Articulated Object States
While traditional [1] and functional [2] scene graphs are capable of capturing the spatial relationships and functional interactions between objects and spaces, they encode each object as static, with fixed geometry. In this project, we aim to enable the estimation of the state of articulated objects and include it in the functional scene graph.
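One possible way to attach articulation state to a scene-graph node is sketched below; the field names are illustrative assumptions rather than the representation used in [1] or [2]:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArticulationState:
    joint_type: str    # "revolute" (e.g., cabinet door) or "prismatic" (drawer)
    axis: tuple        # joint axis in the object frame
    value: float       # current opening angle (rad) or extension (m)
    limits: tuple      # (min, max) range of motion

@dataclass
class SceneGraphNode:
    node_id: int
    label: str                                        # e.g., "cabinet", "door"
    geometry: object                                  # mesh / voxels / gaussians
    articulation: Optional[ArticulationState] = None  # None for rigid objects
    edges: list = field(default_factory=list)         # spatial/functional relations
```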
Multimodal Floorplan Encoding
The objective of the project is to train a neural network that takes any floorplan modality as input and outputs an embedding in a latent space shared by all floorplan modalities. This is beneficial for downstream applications such as visual localization and model alignment. Check the attached documents for more details. The thesis will be co-supervised by CVG, ETH Zurich and the Microsoft Spatial AI Lab, Zurich.
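A minimal sketch of one possible setup, assuming per-modality encoders trained with a symmetric contrastive (InfoNCE-style) loss; the architecture and loss choice are assumptions, not a prescribed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FloorplanEncoder(nn.Module):
    """One encoder per modality (e.g., raster plan, CAD, point cloud),
    all projecting into the same latent space."""
    def __init__(self, encoders: dict):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)

    def forward(self, modality: str, x):
        z = self.encoders[modality](x)
        return F.normalize(z, dim=-1)            # unit-norm embeddings

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Pull embeddings of the same floorplan together across two modalities."""
    logits = z_a @ z_b.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```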
Reconstructing liquids from multiple views with 3D Gaussian Splatting
This project reconstructs liquids from multi-view imagery: fluid regions are segmented using methods like Mask2Former, and the static scene is reconstructed with 3D Gaussian Splatting or Mast3r. The identified fluid clusters initialize a particle-based simulation, which is refined for temporal consistency and can optionally be enhanced with thermal data and vision-language models to estimate fluid properties.
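As a toy example of one step in this pipeline, the sketch below seeds simulation particles from fluid-labelled 3D points; the clustering choice (DBSCAN) and all parameters are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def init_fluid_particles(fluid_points: np.ndarray, spacing: float = 0.01):
    """Cluster fluid-labelled points and fill each cluster with particle seeds."""
    labels = DBSCAN(eps=0.05, min_samples=10).fit_predict(fluid_points)
    particles = []
    for k in set(labels) - {-1}:                 # -1 marks DBSCAN noise
        cluster = fluid_points[labels == k]
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        # Regular grid of particle seeds inside the cluster's bounding box
        axes = [np.arange(l, h, spacing) for l, h in zip(lo, hi)]
        grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, 3)
        particles.append(grid)
    return np.concatenate(particles) if particles else np.empty((0, 3))
```

In the project, `fluid_points` would come from the reconstructed geometry that projects into the Mask2Former fluid masks.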
Cross-modal knowledge distillation in egocentric videos
Humans perceive the world from an egocentric perspective, and egocentric perception holds immense promise for learning how humans interact with the world and execute tasks and activities. The rich multimodal setting of many existing egocentric video datasets provides a perfect testbed for the challenging task of cross-modal knowledge distillation.
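A minimal sketch of a standard distillation objective, where a teacher trained on one modality (e.g., audio) supervises a student on another (e.g., RGB); the temperature and loss weighting are illustrative assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft teacher targets with the ground-truth task loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```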
Human-Robot Communication with Text Prompts and 3D Scene Graphs
This project extends previous work [a] on calculating similarity scores between text prompts and 3D scene graphs representing environments. The current method identifies potential locations based on user descriptions, aiding human-agent communication, but is limited by its coarse localization and inability to refine estimates incrementally. This project aims to enhance the method by enabling it to return potential locations within a 3D map and incorporate additional user information to improve localization accuracy incrementally until a confident estimate is achieved.

[a] Chen, J., Barath, D., Armeni, I., Pollefeys, M., & Blum, H. (2024). "Where am I?" Scene Retrieval with Language. ECCV 2024.
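One way the incremental refinement could look: maintain a belief over candidate locations and update it with every new user hint. The softmax-posterior update below is an assumption, and `sims` stands in for the text/scene-graph similarity scores of [a]:

```python
import numpy as np

def update_belief(belief: np.ndarray, sims: np.ndarray, tau: float = 0.1):
    """Bayesian-style update: prior belief times a likelihood from the new hint."""
    likelihood = np.exp(sims / tau)       # turn similarity scores into likelihoods
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Usage with random scores in place of the real scorer:
rng = np.random.default_rng(0)
belief = np.full(50, 1.0 / 50)            # uniform prior over 50 candidates
for _ in range(5):                        # five rounds of user hints
    sims = rng.normal(size=50)            # stand-in for similarity scores
    belief = update_belief(belief, sims)
print(belief.argmax(), belief.max())      # most likely location and confidence
```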
Uncertainty-aware 3D Mapping
The goal of this project is to enhance the 3D mapping capabilities of a robotic agent by incorporating uncertainty measures into MAP-ADAPT, an incremental mapping pipeline that constructs an adaptive voxel grid from RGB-D input.
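For intuition, one simple form such an uncertainty measure could take: weighting each depth observation during voxel fusion by an inverse-variance term instead of a constant. The quadratic depth-noise model is a common RGB-D assumption, not part of MAP-ADAPT itself:

```python
def depth_weight(z: float, sigma0: float = 0.002, k: float = 0.005) -> float:
    """Inverse-variance weight; depth noise grows quadratically with range z."""
    sigma = sigma0 + k * z ** 2
    return 1.0 / sigma ** 2

def fuse_voxel(tsdf: float, weight: float, sdf_obs: float, z_obs: float):
    """Running weighted average of the signed distance stored in one voxel."""
    w_obs = depth_weight(z_obs)
    new_tsdf = (tsdf * weight + sdf_obs * w_obs) / (weight + w_obs)
    return new_tsdf, weight + w_obs
```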
Assembly Assistant: Crafting your robotic companion for teamwork
This project is inspired by the vision of seamless human-robot collaboration in household settings. As our homes become smarter, the need for robotic systems that can work alongside humans to perform tasks with precision and adaptability is growing. This project empowers individuals to design and build a robotic companion tailored to assist with household tasks like assembling furniture. By fostering teamwork between humans and robots, the project highlights how technology can enhance everyday life, promoting efficiency, creativity, and a shared sense of accomplishment. It envisions a future where robots are not just tools but collaborative partners, making home life easier, more productive, and more enjoyable for everyone.
