Gaussian Splatting Representation for Challenging Dynamics
In this project, we want to explore how to accurately represent and render challenging dynamics, such as fast continuous motion or discrete (discontinuous) dynamic signals. This is an interesting but under-studied direction that is essential for many downstream applications such as autonomous driving.
Action Recognition with 3D Scene Graphs
This project explores the potential of 3D scene graphs to improve action recognition in AR/VR and robotic applications, addressing the challenges posed by the complexity and high dimensionality of video data. By leveraging 3D scene graphs, the project aims to overcome the limitations of 2D scene graphs, offering a more scalable and comprehensive approach to understanding egocentric actions in indoor environments.
Learning Affordances and Functionalities from Egocentric Actions
The primary objective of this project is to use egocentric videos to predict the affordances and functionalities of a 3D map of the environment.
3D Hand Forecasting (HoloAssist: Interactive AI Assistants)
The goal of this project is to implement an algorithm that forecasts 3D hand poses.
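As a starting point, here is a minimal sketch of one possible forecaster, assuming 21 hand joints and an autoregressive GRU rollout; the architecture, joint count, and horizon are illustrative choices, not a prescribed design.

import torch
import torch.nn as nn

class HandPoseForecaster(nn.Module):
    """Autoregressive GRU over flattened 3D joint positions (21 joints)."""
    def __init__(self, n_joints=21, hidden=256):
        super().__init__()
        self.gru = nn.GRU(n_joints * 3, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_joints * 3)

    def forward(self, past, horizon=10):
        # past: (B, T, 63) joint positions over T observed frames.
        _, h = self.gru(past)                 # encode the observed window
        step, preds = past[:, -1:], []
        for _ in range(horizon):              # autoregressive rollout
            out, h = self.gru(step, h)
            step = self.out(out)              # next-frame pose prediction
            preds.append(step)
        return torch.cat(preds, dim=1)        # (B, horizon, 63)

# Usage: forecast 10 future frames from 30 observed ones.
future = HandPoseForecaster()(torch.randn(4, 30, 63))
print(future.shape)  # torch.Size([4, 10, 63])

A model of this kind can be trained with a simple L2 loss between predicted and ground-truth future joint positions.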
Action Recognition Using 3D Hand-Object Contact Map
The primary objective of this project is to use an enhanced representation of 3D hand-object interaction to improve action recognition accuracy.
Action Label Correction from Videos with LLMs
The primary objective of this project is to leverage LLMs to correct noisy action labels in videos, improving action recognition accuracy for the development of AI agents.
Learning to interact with objects through Pressure Maps
The primary objective of this project is to use a pressure map representation of hands to improve action recognition accuracy and robotic manipulation.
Multimodal Floorplan Encoding
The objective of the project is to train a neural network that takes any floorplan modality as input and outputs an embedding in a latent space shared by all floorplan modalities. This is beneficial for downstream applications such as visual localization and model alignment. Check the attached documents for more details. The thesis will be co-supervised by CVG, ETH Zurich and the Microsoft Spatial AI Lab, Zurich.
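As a loose illustration of the shared latent space, the sketch below pairs two hypothetical modality encoders (a rasterized floorplan image and a set of wall segments) and aligns their embeddings with a symmetric InfoNCE loss; all architectural choices and the loss are assumptions for illustration, not the project's prescribed design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageFloorplanEncoder(nn.Module):
    """Encodes a rasterized floorplan (1-channel image) into a D-dim embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class VectorFloorplanEncoder(nn.Module):
    """Encodes a vector floorplan given as a set of wall segments (N, 4)."""
    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, segments):
        # Permutation-invariant max-pooling over wall segments.
        return F.normalize(self.point_mlp(segments).max(dim=1).values, dim=-1)

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: paired embeddings of the same floorplan attract."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in data:
imgs = torch.randn(8, 1, 128, 128)   # rasterized floorplans
segs = torch.randn(8, 40, 4)         # 40 wall segments per plan
loss = contrastive_loss(ImageFloorplanEncoder()(imgs), VectorFloorplanEncoder()(segs))
loss.backward()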
Reconstructing liquids from multiple views with 3D Gaussian Splatting
This project reconstructs liquids from multi-view imagery: fluid regions are segmented with methods like Mask2Former, and the static scene is reconstructed with 3D Gaussian Splatting or MASt3R. The identified fluid clusters initialize a particle-based simulation, which is refined for temporal consistency and optionally enhanced with thermal data and visual language models to infer fluid properties.
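As an illustration of the cluster-to-particles stage alone, the sketch below clusters back-projected fluid points and seeds a regular particle grid per cluster; DBSCAN and the uniform spacing are assumptions for illustration, not the project's prescribed method.

import numpy as np
from sklearn.cluster import DBSCAN

def fluid_points_to_particles(points, eps=0.05, min_samples=20, spacing=0.02):
    """Cluster back-projected fluid points and seed particles per cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    particles = []
    for lbl in set(labels) - {-1}:            # -1 marks DBSCAN noise
        cluster = points[labels == lbl]
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        # Fill the cluster's bounding box with a regular particle grid,
        # keeping only seeds close to an observed fluid point.
        grid = np.stack(np.meshgrid(*[np.arange(l, h, spacing) for l, h in zip(lo, hi)]), -1).reshape(-1, 3)
        d = np.linalg.norm(grid[:, None] - cluster[None], axis=-1).min(axis=1)
        particles.append(grid[d < spacing])
    return np.concatenate(particles) if particles else np.empty((0, 3))

# Usage with synthetic points standing in for a segmented fluid region:
pts = np.random.rand(500, 3) * 0.2
seeds = fluid_points_to_particles(pts)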
Cross-modal knowledge distillation in egocentric videos
Humans perceive the world from an egocentric perspective, and egocentric perception holds immense promise for learning how humans interact with the world and execute tasks and activities. The rich multimodal setting of many existing egocentric video datasets provides a perfect testbed for the challenging task of cross-modal knowledge distillation.
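The core objective can be sketched as standard logit-level distillation, with a frozen teacher on one modality (e.g., audio) supervising a student on another (e.g., RGB); the temperature-softened KL is the classic formulation, and the two linear layers below are stand-ins for real action-recognition networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL between temperature-softened teacher and student distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

num_classes = 100
teacher = nn.Linear(128, num_classes)   # stands in for an audio action model
student = nn.Linear(512, num_classes)   # stands in for an RGB action model

audio_feat, rgb_feat = torch.randn(8, 128), torch.randn(8, 512)
with torch.no_grad():
    t_logits = teacher(audio_feat)      # teacher is frozen during distillation
loss = distillation_loss(student(rgb_feat), t_logits)
loss.backward()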
Human-Robot Communication with Text Prompts and 3D Scene Graphs
This project extends previous work [a] on calculating similarity scores between text prompts and 3D scene graphs representing environments. The current method identifies potential locations based on user descriptions, aiding human-agent communication, but is limited by its coarse localization and inability to refine estimates incrementally. This project aims to enhance the method by enabling it to return potential locations within a 3D map and to incorporate additional user information, improving localization accuracy incrementally until a confident estimate is achieved.
[a] Chen, J., Barath, D., Armeni, I., Pollefeys, M., & Blum, H. (2024). "Where am I?" Scene Retrieval with Language. ECCV 2024.
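One way the incremental refinement could look is sketched below with random stand-in embeddings: each new prompt contributes a per-candidate similarity, evidence is fused in log space, and the loop stops once the top candidate leads by a margin; the fusion rule and margin test are assumptions for illustration.

import numpy as np

def fuse(scores, sims):
    """Multiplicative evidence fusion in log space (illustrative rule)."""
    return scores + np.log(np.clip(sims, 1e-6, None))

rng = np.random.default_rng(0)
node_emb = rng.normal(size=(50, 64))     # stand-ins for 50 candidate locations
scores = np.zeros(50)
for prompt in range(3):                  # three successive user prompts
    text_emb = rng.normal(size=64)       # stand-in for a text embedding
    sims = node_emb @ text_emb
    sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-9)  # to [0, 1]
    scores = fuse(scores, sims)
    top2 = np.sort(scores)[-2:]
    if top2[1] - top2[0] > 1.0:          # confident once the margin is large
        break
best = int(np.argmax(scores))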
Uncertainty-aware 3D Mapping
The goal of this project is to enhance the 3D mapping capabilities of a robotic agent by incorporating uncertainty measures into MAP-ADAPT, an incremental mapping pipeline that constructs an adaptive voxel grid from RGB-D input.
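A minimal sketch of what per-voxel uncertainty could look like, assuming inverse-variance (Gaussian) fusion of depth-derived measurements; this is an illustrative scheme, not MAP-ADAPT's actual update rule.

import numpy as np

class UncertainVoxel:
    def __init__(self):
        self.value = 0.0      # fused signed distance (or occupancy)
        self.weight = 0.0     # accumulated inverse-variance weight

    def integrate(self, measurement, variance):
        w = 1.0 / max(variance, 1e-6)   # confidence of this observation
        self.value = (self.weight * self.value + w * measurement) / (self.weight + w)
        self.weight += w

    @property
    def variance(self):
        # Posterior variance of the fused estimate under the Gaussian model.
        return 1.0 / self.weight if self.weight > 0 else np.inf

# Usage: one voxel observed twice, once confidently and once with high noise.
v = UncertainVoxel()
v.integrate(measurement=0.03, variance=0.001)   # close, low-noise reading
v.integrate(measurement=0.10, variance=0.05)    # far, high-noise reading
print(v.value, v.variance)   # fused value stays near the confident reading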
Assembly Assistant: Crafting your robotic companion for teamwork
This project is inspired by the vision of seamless human-robot collaboration in household settings. As our homes become smarter, the need for robotic systems that can work alongside humans to perform tasks with precision and adaptability is growing. This project empowers individuals to design and build a robotic companion tailored to assist with household tasks like assembling furniture. By fostering teamwork between humans and robots, the project highlights how technology can enhance everyday life, promoting efficiency, creativity, and a shared sense of accomplishment. It envisions a future where robots are not just tools but collaborative partners, making home life easier, more productive, and more enjoyable for everyone.
OpenSet Semantic SLAM
The goal of the project is to create a Simultaneous Localization and Mapping algorithm that, besides estimating the camera trajectory and the geometry of the scene, also obtains object instances. These object instances should not be restricted to a fixed set of classes (e.g., chair, table). Hence, the problem is one of open-set segmentation.
Digital Twin for Spot's Home
MOTIVATION ⇾ Creating a digital twin of the robot's environment is crucial for several reasons:
1. Simulate different robots: test various robots in a virtual environment, saving time and resources.
2. Accurate evaluation: precisely assess robot interactions and performance.
3. Enhanced flexibility: easily modify scenarios to develop robust systems.
4. Cost efficiency: reduce costs by identifying issues in virtual simulations.
5. Scalability: replicate multiple environments for comprehensive testing.
PROPOSAL ⇾ We propose to create a digital twin of our semantic environment, built in your preferred graphics platform, so that reinforcement learning agents can be simulated in the digital environment, creating a unified evaluation platform for robotic tasks.
KALLAX Benchmark: Evaluating Household Tasks
Motivation ⇾ There are three ways to evaluate robots on pick-and-place tasks at home:
1. Simulation setups: high reproducibility, but real-world complexities and perception noise are hard to simulate.
2. Competitions: good for comparing overall systems, but they require significant effort and cannot be held frequently.
3. Custom lab setups: common, but they lead to overfitting and lack comparability between labs.
Proposal ⇾ We propose using IKEA furniture to create standardized, randomized setups that researchers can easily replicate, e.g., a 4x4 KALLAX unit with varying door knobs and drawer positions, generating tasks like "move the cup from the upper right shelf into the black drawer." This prevents overfitting and allows for consistent evaluation across different labs.
Semantic SLAM for Robotic Scene Understanding
MOTIVATION ⇾ Most 3D scene understanding work applied in the field of robotics relies on two main assumptions:
1. Detailed and accurate 3D reconstructions
2. Reliable semantic segmentation
PROPOSAL ⇾ We propose to use the robot itself for mapping and then to perform semantic segmentation on the on-board computer. This will give us an end-to-end pipeline for real-time scene understanding on the Spot robot.
Egocentric Video Understanding for Environment Interaction
Motivation ⇾ We want to train robots to interact in everyday home environments, but the robot needs data to learn from:
1. The robot needs data from humans who naturally interact with the environment.
2. We need ground truth of the interaction to evaluate methods.
3. The setup needs to be robust and versatile so that we can make many recordings.
Proposal ⇾ We want to develop a 3D ground-truth methodology for environment interactions. We need a setup that is easier to transport than the classical "camera domes" that record dynamic scenes from every angle. Instead, we combine static scans with egocentric and a few exocentric video cameras. Our goal is to track the dynamic states of the functional elements to within 1 cm accuracy. With this, we can go to any home and record interactions with high accuracy.
Warping Voxel-grids for 3D Robotic Mapping
This project aims to integrate loop closure optimization into voxel-based 3D mapping by developing a method to warp the voxel grid in response to updated camera trajectories. This approach eliminates the need to rebuild the map from scratch, enhancing the efficiency and adaptability of 3D mapping in real-world scenarios.
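A minimal sketch of the warping idea, assuming each voxel is anchored to the keyframe that observed it and moves rigidly with that keyframe's corrected pose; a real system would additionally re-bucket warped voxels into the grid and blend overlapping regions.

import numpy as np

def warp_voxels(voxel_centers, anchor_ids, poses_old, poses_new):
    """voxel_centers: (N,3); anchor_ids: (N,) keyframe index per voxel;
    poses_*: (K,4,4) world-from-camera transforms before/after optimization."""
    warped = np.empty_like(voxel_centers)
    for k in range(len(poses_old)):
        mask = anchor_ids == k
        if not mask.any():
            continue
        # Move points from the old keyframe's world placement to the new one:
        # x_new = T_new @ inv(T_old) @ x_old
        correction = poses_new[k] @ np.linalg.inv(poses_old[k])
        pts = np.c_[voxel_centers[mask], np.ones(mask.sum())]
        warped[mask] = (correction @ pts.T).T[:, :3]
    return warped

# Usage: two keyframes, the second shifted by a loop-closure correction.
poses_old = np.stack([np.eye(4), np.eye(4)])
poses_new = poses_old.copy()
poses_new[1, :3, 3] = [0.1, 0.0, 0.0]            # corrected translation
centers = np.array([[1.0, 0, 0], [2.0, 0, 0]])
print(warp_voxels(centers, np.array([0, 1]), poses_old, poses_new))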
Enhancing NeRF predictions by Matching Rendered Images with Nearby References
MOTIVATION Neural Radiance Fields (NeRFs) require a substantial amount of data from various viewpoints to achieve high-quality reconstructions. This is because NeRFs rely on capturing the intricate details of a scene by learning the light field and volumetric density from multiple angles. Diverse data helps the model understand the scene's geometry, texture, and lighting, allowing it to render detailed and realistic views. PROPOSAL We propose to use MaRiNER [Bösiger et al. 2024] as a post-processing step to enhance NeRF reconstructions performed with a smaller amount of data.
Cross-Modal Zero-Shot Scene Graph Alignment
Develop a zero-shot scene graph alignment algorithm using multi-modal data such as point clouds, CAD meshes, etc.
Action recognition in egocentric videos
Action recognition has long been a challenging task in video understanding. While recent deep learning models have achieved remarkable performance on various related benchmarks, their generalization capabilities remain limited. Furthermore, the task of action recognition is inherently ambiguous, as the same action can often be described using different verbs and levels of detail. In this project, we aim to address this ambiguity by leveraging low-level cues to enhance the disambiguation abilities of action recognition systems, as well as to improve their robustness to variations in viewpoint, interacted objects, and the manner in which the same action is performed.
Learn to predict intent using commonsense knowledge
From robotics to human-computer interaction, numerous real-world tasks would benefit from practical systems that can anticipate future high-level actions and predict intentions and goals based on observation of the past. Intention prediction is important for care robots to anticipate people’s actions and is a key challenge in the design of artificial intelligent systems.
Hand mesh recovery from arbitrary number of unposed views
The goal is to recover the hand mesh from an arbitrary number of unposed views, either by predicting the mesh directly or by predicting the pose and shape parameters of a parametric hand model (i.e., MANO).
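A minimal sketch of the parametric route: a regression head predicts MANO pose (48 axis-angle values including global rotation) and shape (10 betas) from fused per-view features, which would then be passed to a MANO layer (e.g., from the manopth package) to obtain the mesh; the backbone features and the mean-pool fusion over views are placeholder assumptions.

import torch
import torch.nn as nn

class MANORegressor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 48 + 10),      # 48 pose params + 10 shape params
        )

    def forward(self, view_feats):
        # view_feats: (B, V, feat_dim) for V unposed views; mean pooling is a
        # simple permutation-invariant fusion that tolerates any V.
        fused = view_feats.mean(dim=1)
        params = self.head(fused)
        return params[:, :48], params[:, 48:]   # (pose, shape)

# Usage with stand-in features from 3 views:
pose, shape = MANORegressor()(torch.randn(2, 3, 512))
print(pose.shape, shape.shape)   # torch.Size([2, 48]) torch.Size([2, 10])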
Zero-Shot Sequential Localization in Floorplans
This project aims to develop a zero-shot visual localization method for image sequences from diverse sources, enabling floorplan-based localization without environment-specific training.
Feature detection and matching with superresolution
This project proposes an end-to-end framework that employs per-image superresolution networks to upscale images, enabling subpixel accuracy in local feature detection and matching.
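The subpixel step can be illustrated independently of the learned components: the sketch below refines an integer keypoint on an (upscaled) response map by fitting a parabola to its 3x3 neighborhood along each axis; an illustrative building block, not the proposed framework.

import numpy as np

def subpixel_refine(resp, y, x):
    """Parabola fit around the integer peak (y, x) of a response map."""
    dy = 0.5 * (resp[y - 1, x] - resp[y + 1, x]) / (
        resp[y - 1, x] - 2 * resp[y, x] + resp[y + 1, x] + 1e-12)
    dx = 0.5 * (resp[y, x - 1] - resp[y, x + 1]) / (
        resp[y, x - 1] - 2 * resp[y, x] + resp[y, x + 1] + 1e-12)
    return y + dy, x + dx

# Usage: a synthetic quadratic peak placed off-grid; refinement recovers it.
yy, xx = np.mgrid[0:9, 0:9].astype(float)
resp = -((yy - 4.3) ** 2 + (xx - 3.8) ** 2)   # peak at (4.3, 3.8)
print(subpixel_refine(resp, 4, 4))            # ≈ (4.3, 3.8)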
Estimating Generic 3D Room Structures
Indoor rooms are among the most common use cases in 3D scene understanding. Room layouts are especially important, consisting of structural elements in 3D such as walls, floors, and ceilings. The task is to create a method that automatically reconstructs 3D structural elements from monocular RGB videos. Ideally, the method would estimate plane equations for the structural elements and their spatial extent in the scene.
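One building block can be sketched concretely: RANSAC plane fitting on points that a front-end has attributed to a structural element; the thresholds and iteration counts below are illustrative.

import numpy as np

def fit_plane_ransac(points, iters=200, inlier_thresh=0.02, rng=np.random.default_rng(0)):
    """Returns (n, d) with ||n|| = 1 such that n·x + d ≈ 0 for inliers."""
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:
            continue                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n @ p0
        inliers = np.sum(np.abs(points @ n + d) < inlier_thresh)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

# Usage: noisy points on the floor plane z = 0.
pts = np.random.rand(300, 3)
pts[:, 2] = np.random.randn(300) * 0.005
n, d = fit_plane_ransac(pts)
print(n, d)    # normal ≈ ±[0, 0, 1], d ≈ 0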
VR in Habitat 3.0
Motivation: Explore the newly improved Habitat 3.0 simulator, with a special focus on its virtual reality features. This project is meant to be an exploratory task on the Habitat 3.0 simulator, examining all the newly introduced features and focusing specifically on the implementation of virtual reality tools for scene navigation. The idea is to extend these features to self-created environments in Unreal Engine that build upon Habitat.
BeSAFEv2: Benchmarking Safety of Agents in Familiar Environments
Motivation: Create a realistically rendered benchmark to evaluate reinforcement learning agents on visual navigation tasks, interaction with other agents, and navigation in scenes containing static and dynamic objects as well as humans.
