OpenSet Semantic SLAM
The goal of the project is to create a Simultaneous Localization and Mapping (SLAM) algorithm that, besides estimating the camera trajectory and the geometry of the scene, also segments object instances. These object instances should not be restricted to a fixed set of classes (e.g., chair, table); hence, the problem is one of open-set segmentation.
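To illustrate the open-set idea, below is a minimal sketch of free-form instance labeling with CLIP image-text embeddings, assuming the SLAM front end already provides per-instance image crops; the package and model choice are assumptions, not part of the project definition:

    import torch
    import clip  # https://github.com/openai/CLIP
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def label_instance(crop: Image.Image, queries: list[str]) -> str:
        """Assign the best-matching free-form text label to an instance crop."""
        image = preprocess(crop).unsqueeze(0).to(device)
        text = clip.tokenize(queries).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            txt_feat = model.encode_text(text)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
            sims = (img_feat @ txt_feat.T).squeeze(0)
        return queries[int(sims.argmax())]

    # The query list is open-ended and can change at runtime:
    # label_instance(crop, ["a potted plant", "a 3D printer", "a watering can"])

Because the text queries are supplied at runtime, no fixed class vocabulary is baked into the map.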
Language-Conditioned Interaction Trajectories for Robotic Manipulation
The project aims to explore how language conditioning can be integrated with vision encoders to accurately predict interaction trajectories for robotic manipulation tasks.
Digital Twin for Spot's Home
MOTIVATION ⇾ Creating a digital twin of the robot's environment is crucial for several reasons:
1. Simulate Different Robots: Test various robots in a virtual environment, saving time and resources.
2. Accurate Evaluation: Precisely assess robot interactions and performance.
3. Enhanced Flexibility: Easily modify scenarios to develop robust systems.
4. Cost Efficiency: Reduce costs by identifying issues in virtual simulations.
5. Scalability: Replicate multiple environments for comprehensive testing.
PROPOSAL ⇾ We propose to create a digital twin of our semantic environment, built in a graphics platform of your choice, in which reinforcement learning agents can be simulated, yielding a unified evaluation platform for robotic tasks.
KALLAX Benchmark: Evaluating Household Tasks
Motivation ⇾ There are three ways to evaluate robots on pick-and-place tasks at home:
1. Simulation setups: High reproducibility, but real-world complexities and perception noise are hard to simulate.
2. Competitions: Good for comparing overall systems, but they require significant effort and can't be held frequently.
3. Custom lab setups: Common, but they lead to overfitting and lack comparability between labs.
Proposal ⇾ We propose using IKEA furniture to create standardized, randomized setups that researchers can easily replicate: e.g., a 4x4 KALLAX unit with varying door knobs and drawer positions, generating tasks like "move the cup from the upper right shelf into the black drawer" (see the sketch below). This prevents overfitting and allows for consistent evaluation across labs.
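A minimal sketch of what the randomized task generation could look like; the compartment types, object list, and phrasing below are illustrative assumptions rather than a fixed benchmark specification:

    import random

    COLS = ["left", "center-left", "center-right", "right"]
    ROWS = ["upper", "upper-middle", "lower-middle", "lower"]
    INSERTS = ["open shelf", "black drawer", "white door", "glass door"]
    OBJECTS = ["cup", "book", "toy car", "plant"]

    def sample_task(rng: random.Random) -> str:
        # randomly assign an insert type to each of the 16 compartments
        grid = {(r, c): rng.choice(INSERTS) for r in range(4) for c in range(4)}
        src, dst = rng.sample(list(grid), 2)
        obj = rng.choice(OBJECTS)
        describe = lambda cell: f"{ROWS[cell[0]]} {COLS[cell[1]]} {grid[cell]}"
        return f"move the {obj} from the {describe(src)} into the {describe(dst)}"

    rng = random.Random(42)  # a shared seed makes the episode reproducible across labs
    print(sample_task(rng))

Publishing only the seed and the generator would let different labs physically rebuild the same randomized episode.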
Semantic SLAM for Robotic Scene Understanding
MOTIVATION Most 3D scene understanding work applied in robotics relies on two main assumptions:
1. Detailed and accurate 3D reconstructions
2. Reliable semantic segmentation
PROPOSAL We propose to use the robot itself for mapping and to perform the semantic segmentation on the on-board computer. This gives us an end-to-end pipeline for real-time scene understanding on the Spot robot.
Egocentric Video Understanding for Environment Interaction
Motivation ⇾ We want to train robots to interact in everyday home environments, but a robot needs data to learn from.
1. The robot needs data from humans to interact naturally with the environment.
2. We need ground truth of the interaction to evaluate methods.
3. The setup needs to be robust and versatile so that we can make many recordings.
Proposal ⇾ We want to develop a 3D ground-truth methodology for environment interactions. We need a setup that is easier to transport than the classical "camera domes" that record dynamic scenes from every angle. Instead, we combine static scans with egocentric and a few exocentric video cameras. Our goal is to track the dynamic states of the functional elements to within 1 cm accuracy. With this, we can go to any home and record interactions with high accuracy.
Warping Voxel-grids for 3D Robotic Mapping
This project aims to integrate loop closure optimization into voxel-based 3D mapping by developing a method to warp the voxel grid in response to updated camera trajectories. This approach eliminates the need to rebuild the map from scratch, enhancing the efficiency and adaptability of 3D mapping in real-world scenarios.
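A minimal sketch of the warping idea, under the simplifying assumption that each voxel is rigidly anchored to its nearest keyframe, so a loop-closure pose correction of that keyframe moves the voxel with it; all names are illustrative:

    import numpy as np

    def warp_voxel_centers(centers, poses_old, poses_new):
        """centers: (N, 3) voxel centers; poses_old/new: (K, 4, 4) keyframe poses."""
        keyframe_pos = poses_old[:, :3, 3]                  # (K, 3)
        # associate each voxel with its nearest keyframe (rigid per-segment warp)
        d = np.linalg.norm(centers[:, None] - keyframe_pos[None], axis=-1)
        nearest = d.argmin(axis=1)                          # (N,)
        corrections = poses_new @ np.linalg.inv(poses_old)  # (K, 4, 4) pose updates
        homog = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
        warped = np.einsum("nij,nj->ni", corrections[nearest], homog)
        return warped[:, :3]

A smoother warp would blend the corrections of several nearby keyframes (e.g., with inverse-distance weights) to avoid seams at segment boundaries.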
Enhancing NeRF predictions by Matching Rendered Images with Nearby References
MOTIVATION Neural Radiance Fields (NeRFs) require a substantial amount of data from various viewpoints to achieve high-quality reconstructions. This is because NeRFs rely on capturing the intricate details of a scene by learning the light field and volumetric density from multiple angles. Diverse data helps the model understand the scene's geometry, texture, and lighting, allowing it to render detailed and realistic views. PROPOSAL We propose to use MaRiNER [Bösiger et al. 2024] as a post-processing step to enhance NeRF reconstructions performed with a smaller amount of data.
Extending Functional Scene Graphs to Include Articulated Object States
While traditional [1] and functional [2] scene graphs are capable of capturing the spatial relationships and functional interactions between objects and spaces, they encode each object as static, with fixed geometry. In this project, we aim to enable the estimation of the state of articulated objects and include it in the functional scene graph.
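A minimal sketch of how an articulated state could be attached to a scene-graph node; the field names and joint taxonomy are illustrative assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class ArticulatedObjectNode:
        object_id: str
        category: str                    # e.g., "drawer", "door"
        joint_type: str                  # "revolute" (door) or "prismatic" (drawer)
        joint_axis: tuple = (0.0, 0.0, 1.0)
        state: float = 0.0               # normalized: 0.0 = closed, 1.0 = fully open
        children: list = field(default_factory=list)

    # a drawer observed 40% open along the x axis
    drawer = ArticulatedObjectNode("cabinet_drawer_3", "drawer", "prismatic", (1.0, 0.0, 0.0), state=0.4)

Estimating the state field over time turns the static graph into one that tracks articulation.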
Cross-Modal Zero-Shot Scene Graph Alignment
Develop a zero-shot scene graph alignment algorithm using multi-modal data such as point clouds and CAD meshes.
Action recognition in egocentric videos
Action recognition has long been a challenging task in video understanding. While recent deep learning models have achieved remarkable performance on various related benchmarks, their generalization capabilities remain limited. Furthermore, the task of action recognition is inherently ambiguous, as the same action can often be described using different verbs and levels of detail. In this project, we aim to address this ambiguity by leveraging low-level cues to enhance the disambiguation abilities of action recognition systems, as well as to improve their robustness to variations in viewpoint, interacted objects, and the manner in which the same action is enacted.
Learn to predict intent using commonsense knowledge
From robotics to human-computer interaction, numerous real-world tasks would benefit from practical systems that can anticipate future high-level actions and predict intentions and goals based on observation of the past. Intention prediction is important for care robots to anticipate people's actions and is a key challenge in the design of artificially intelligent systems.
Hand mesh recovery from arbitrary number of unposed views
The goal is to predict the hand mesh directly, or by predicting the pose and shape parameters of a parametric hand model (i.e., MANO), from an arbitrary number of unposed views.
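For the parametric route, a minimal sketch using the manopth package (an assumption; any differentiable MANO layer would do) shows how pose and shape coefficients map to a mesh, which a multi-view network could regress from unposed images:

    import torch
    from manopth.manolayer import ManoLayer  # https://github.com/hassony2/manopth

    # requires the MANO model files to be downloaded separately
    mano = ManoLayer(mano_root="mano/models", use_pca=True, ncomps=15, side="right")

    batch = 1
    pose = torch.zeros(batch, 15 + 3)   # PCA pose coefficients + 3 global rotation
    shape = torch.zeros(batch, 10)      # shape (beta) coefficients

    verts, joints = mano(pose, shape)   # (1, 778, 3) vertices, (1, 21, 3) joints

Because the layer is differentiable, reprojection losses over any number of views can be backpropagated into the pose and shape predictions.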
3D Reconstruction, Localization & Segmentation from Unstructured Image Pairs
The goal of this project is to perform joint 3D reconstruction and segmentation given an uncalibrated and unstructured image pair. Here, unstructured means that the images contain the same object instance but with different backgrounds, or that the object is partially covered by an occluder.
From Spatial to Functional: Functional Scene Graph for Enhanced Robotic Decision Making
This project explores the concept of Functional Scene Graphs, an extension of traditional scene graphs that capture not just spatial relationships but also the functional interactions between objects and spaces. For example, a light switch enables illumination of a room, or a key provides access to a locked door. Such relationships, while intuitive for humans, are often overlooked in robotics systems, limiting a robot’s ability to reason about and interact with its environment effectively. The core challenge lies in understanding these functional relationships. While a robot might attempt to explore and infer such connections autonomously, humans could assist by demonstrating interactions, offering a means for robots to learn more efficiently. This project will focus on integrating functional understanding into scene graphs, enabling robots to infer high-level semantic interactions and make better decisions during tasks like navigation and manipulation.
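A minimal sketch of the data structure, using networkx and an assumed (illustrative) relation vocabulary:

    import networkx as nx

    G = nx.MultiDiGraph()
    G.add_node("kitchen", kind="room")
    G.add_node("light_switch_1", kind="object")
    G.add_node("ceiling_lamp_1", kind="object")

    G.add_edge("light_switch_1", "kitchen", relation="in")                # spatial
    G.add_edge("ceiling_lamp_1", "kitchen", relation="in")                # spatial
    G.add_edge("light_switch_1", "ceiling_lamp_1", relation="activates")  # functional

    # query: which objects let the robot change the lamp's state?
    actuators = [u for u, v, d in G.in_edges("ceiling_lamp_1", data=True)
                 if d["relation"] == "activates"]

Functional edges like "activates" or "unlocks" are exactly what a task planner would traverse when the goal refers to an effect ("make the kitchen bright") rather than an object.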
Enhancing Bone Segmentation in Ultrasound Imaging Using Physics-Informed Deep Learning Models
Computer-Assisted Orthopedic Surgery (CAOS) has been demonstrated to improve surgical precision in various procedures, including spinal fusion surgery, arthroplasty, and bone deformity correction [1,2]. Ultrasound, as a radiation-free, cost-effective, and portable alternative to CT and X-ray imaging, has been employed for real-time visualization of both soft tissues and bones through the reflection of acoustic waves. Despite its advantages, ultrasound imaging has inherent limitations, such as a low signal-to-noise ratio, acoustic shadowing, and speckle noise, which pose challenges for interpretation by surgeons. In our project, we have collected a dataset comprising over 100k ultrasound images with precise bone annotations. The bone labels are categorized into two classes: high-intensity regions (high signal-to-noise ratio) and low-intensity regions (low signal-to-noise ratio), as shown in Figure 1. According to experimental results, surgeons' labeling performance on low-intensity regions declined significantly compared to high-intensity regions.
[1] Pandey, Prashant U., et al. "Ultrasound bone segmentation: A scoping review of techniques and validation practices." Ultrasound in Medicine & Biology 46.4 (2020): 921-935.
[2] Hohlmann, Benjamin, Peter Broessner, and Klaus Radermacher. "Ultrasound-based 3D bone modelling in computer assisted orthopedic surgery - a review and future challenges." Computer Assisted Surgery 29.1 (2024): 2276055.
Zero-Shot Sequential Localization in Floorplans
This project aims to develop a zero-shot visual localization method for image sequences from diverse sources, enabling floorplan-based localization without environment-specific training.
Feature detection and matching with superresolution
This project proposes an end-to-end framework that employs per-image superresolution networks to upscale images, enabling subpixel accuracy in local feature detection and matching.
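A minimal sketch of the detection side, with plain bicubic upscaling standing in for a learned per-image superresolution network; mapping keypoints back to the original pixel grid is what yields subpixel coordinates:

    import cv2

    SCALE = 4  # superresolution factor (assumed)

    def detect_subpixel(path: str):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        up = cv2.resize(img, None, fx=SCALE, fy=SCALE, interpolation=cv2.INTER_CUBIC)
        sift = cv2.SIFT_create()
        kps, desc = sift.detectAndCompute(up, None)
        for kp in kps:
            kp.pt = (kp.pt[0] / SCALE, kp.pt[1] / SCALE)  # back to original coords
            kp.size /= SCALE
        return kps, desc

In the full framework, the cv2.resize call would be replaced by the learned superresolution network, and matching would run on the descriptors extracted at the higher resolution.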
Estimating Generic 3D Room Structures
Indoor rooms are among the most common settings in 3D scene understanding. Room layouts are especially important, consisting of structural elements in 3D such as walls, floors, and ceilings. The task is to create a method that automatically reconstructs these 3D structural elements from monocular RGB videos. Ideally, the method would estimate plane equations for the structural elements and their spatial extent in the scene.
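A minimal sketch of the final fitting step: given points segmented as one structural element (e.g., a wall), the plane equation n·x + d = 0 follows from a least-squares fit:

    import numpy as np

    def fit_plane(points: np.ndarray):
        """points: (N, 3) array; returns unit normal n and offset d."""
        centroid = points.mean(axis=0)
        # the right singular vector of the smallest singular value is the normal
        _, _, vt = np.linalg.svd(points - centroid)
        n = vt[-1]
        d = -n @ centroid
        return n, d

The spatial extent could then be recovered as the bounding polygon of the element's points projected into the fitted plane.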
VR in Habitat 3.0
Motivation: Explore the newly improved Habitat 3.0 simulator with a special focus on its virtual reality features. This project is an exploration of the Habitat 3.0 simulator, covering the newly introduced features and focusing specifically on the implementation of virtual reality tools for scene navigation. The idea is to extend these features to self-created environments in Unreal Engine that build upon Habitat.
BeSAFEv2: Benchmarking Safety of Agents in Familiar Environments
Motivation: Create a realistically rendered benchmark for evaluating reinforcement learning agents on visual navigation tasks, including interaction with other agents and navigation in scenes containing static and dynamic objects as well as humans.
Action Label Correction from Videos with LLMs
The primary objective of this project is to leverage LLMs to correct noisy action labels in videos, thereby improving action recognition accuracy for developing AI agents.
Egocentric Reconstruction of Hand-Manipulated Objects
3D reconstruction of objects is an important topic in computer vision. Recently, a lot of focus has been on egocentric video processing due to the emergence of XR devices; in such footage, objects are mostly manipulated by hands. The main idea of this project is to use only an RGB sequence, detecting an object in the scene and reconstructing it in 3D on the go. Ideally, something like BundleSDF or more recent methods based on Gaussian Splatting could be used for the reconstruction.
Object Pose Estimation using Line and Point Features
We hope to push the state of the art on object pose estimation, especially for textureless objects, by using line features as well as point features.
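One hedged sketch of how point and line features could enter a joint pose refinement: point residuals are standard reprojection errors, while a line residual is the distance from the projected endpoints of a 3D segment to the detected 2D line (a, b, c) normalized so that a^2 + b^2 = 1; all names are illustrative:

    import cv2
    import numpy as np
    from scipy.optimize import least_squares

    def residuals(x, K, pts3d, pts2d, segs3d, lines2d):
        rvec, tvec = x[:3], x[3:]
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
        point_res = (proj.reshape(-1, 2) - pts2d).ravel()
        ends, _ = cv2.projectPoints(segs3d.reshape(-1, 3), rvec, tvec, K, None)
        ends = np.concatenate([ends.reshape(-1, 2), np.ones((len(ends), 1))], axis=1)
        # signed point-to-line distance for both endpoints of each segment
        line_res = np.einsum("ni,ni->n", np.repeat(lines2d, 2, axis=0), ends)
        return np.concatenate([point_res, line_res])

    # x0 from, e.g., cv2.solvePnP on the points alone, then refined with lines:
    # sol = least_squares(residuals, x0, args=(K, pts3d, pts2d, segs3d, lines2d))

The line term keeps the pose constrained even when textureless surfaces provide few reliable point matches.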
3D Reconstruction of Water in a Glass
The project is about reconstructing a dynamic scene of water, glass, and an object thrown into the water. The input is images from 2-3 synchronized RGB cameras. The expected output is the 3D reconstruction of each frame, ideally optimized so that the motion is consistent.
GNSS/RTK-SLAM fusion for accurate positioning of geospatial data in Mixed Reality
The main objective of the project is to increase the accuracy and usability of the Mixed Reality solution developed by V-Labs. The V-Labs team expects that the integration of a fusion algorithm based on Artificial Intelligence or an Unscented Kalman Filter (UKF) will be able to reach that goal.
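A minimal toy sketch of the UKF route using filterpy, assuming a 2D constant-velocity state [x, y, vx, vy] with GNSS/RTK position fixes as measurements; this is an illustrative model, not the V-Labs system:

    import numpy as np
    from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

    def fx(x, dt):  # constant-velocity motion model (SLAM odometry could refine this)
        F = np.array([[1, 0, dt, 0],
                      [0, 1, 0, dt],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]], dtype=float)
        return F @ x

    def hx(x):      # GNSS/RTK measures position only
        return x[:2]

    points = MerweScaledSigmaPoints(n=4, alpha=0.1, beta=2.0, kappa=0.0)
    ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=0.1, fx=fx, hx=hx, points=points)
    ukf.x = np.zeros(4)
    ukf.P = np.eye(4)
    ukf.R = np.diag([0.02**2, 0.02**2])  # ~2 cm RTK position noise (assumed)
    ukf.Q = np.eye(4) * 1e-3

    for z in [np.array([0.01, 0.02]), np.array([0.03, 0.02])]:  # fake RTK fixes
        ukf.predict()
        ukf.update(z)

In a real integration, the SLAM pose would enter either as a second measurement or inside the process model, with time synchronization between the two sources handled explicitly.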
Action Recognition with 3D Scene Graphs
This project explores the potential of 3D scene graphs to improve action recognition in AR/VR and robotic applications, addressing the challenges posed by the complexity and high dimensionality of video data. By leveraging 3D scene graphs, the project aims to overcome the limitations of 2D scene graphs, offering a more scalable and comprehensive approach to understanding egocentric actions in indoor environments.
Human-Robot Communication with Text Prompts and 3D Scene Graphs
This project extends previous work [a] on calculating similarity scores between text prompts and 3D scene graphs representing environments. The current method identifies potential locations based on user descriptions, aiding human-agent communication, but is limited by its coarse localization and inability to refine estimates incrementally. This project aims to enhance the method by enabling it to return potential locations within a 3D map and incorporate additional user information to improve localization accuracy incrementally until a confident estimate is achieved. [a] Chen, J., Barath, D., Armeni, I., Pollefeys, M., & Blum, H. (2024). "Where am I?" Scene Retrieval with Language. ECCV 2024.
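A minimal sketch of the incremental refinement loop, assuming each candidate location is described by a text serialization of its scene-graph neighborhood (the model choice and the serializations below are assumptions):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    locations = {  # illustrative serialized scene-graph neighborhoods
        "loc_12": "a desk with a monitor next to a whiteboard",
        "loc_47": "a kitchen counter with a coffee machine",
    }

    def rank(hints):
        query = model.encode(". ".join(hints))
        scores = {k: float(util.cos_sim(query, model.encode(v)))
                  for k, v in locations.items()}
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(rank(["I see a whiteboard"]))
    print(rank(["I see a whiteboard", "there is a monitor on my left"]))  # refined

Each new user hint extends the query, so the ranking tightens until one location dominates with sufficient confidence.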
Learning Affordances and Functionalities from Egocentric Actions
The primary objective of this project is to use egocentric videos to predict the functionality (affordances) of locations in a 3D map.
3D Surface Reconstruction from Sparse Viewpoints for Medical Education and Surgical Navigation
In medical education and surgical navigation, achieving accurate multi-view 3D surface reconstruction from sparse viewpoints is a critical challenge. This Master's thesis addresses this problem by first computing normal and optionally reflectance maps for each viewpoint, and then fusing this data to obtain the geometry of the scene and, optionally, its reflectance. The research explores multiple techniques for normal map computation, including photometric stereo, data-driven methods, and stereo matching, either individually or in combination. The outcomes of this study aim to pave the way for the creation of highly realistic and accurate 3D models of surgical fields and anatomical structures. These models have the potential to significantly improve medical education by providing detailed and interactive representations for learning. Additionally, in the context of surgical navigation, these advancements can enhance the accuracy and effectiveness of surgical procedures.
References:
Yu, Zehao, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction. NeurIPS 2022.
Baptiste Brument, Robin Bruneau, Yvain Quéau, Jean Mélou, François Lauze, Jean-Denis Durou, and Lilian Calvet. RNb-NeuS: Reflectance and Normal Based Reconstruction with NeuS. CVPR 2024.
Gwangbin Bae and Andrew J. Davison. Rethinking Inductive Biases for Surface Normal Estimation. CVPR 2024.
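For the photometric-stereo option mentioned above, a minimal sketch of the classical Lambertian route to normal maps: with k images under known distant lights L (k x 3), intensities satisfy I = L (albedo * n), so albedo and normals follow from per-pixel least squares:

    import numpy as np

    def photometric_stereo(I: np.ndarray, L: np.ndarray):
        """I: (k, h, w) image stack; L: (k, 3) unit light directions."""
        k, h, w = I.shape
        G = np.linalg.lstsq(L, I.reshape(k, -1), rcond=None)[0]  # (3, h*w)
        albedo = np.linalg.norm(G, axis=0)
        normals = G / np.maximum(albedo, 1e-8)
        return normals.T.reshape(h, w, 3), albedo.reshape(h, w)

The resulting normal maps are one of the inputs that the fusion stage (e.g., an RNb-NeuS-style reconstruction) would consume.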
