3D Hand Forecasting (HoloAssist: Interactive AI Assistants)
3D hand pose forecasting is a new benchmark introduced by HoloAssist [1]. Existing action forecasting work mostly focuses on providing semantic labels of future actions and does not provide explicit 3D guidance on hand poses. Predicting 3D hand poses can be useful for various applications: it can augment instructions and spatially guide users in different tasks. In this benchmark, we take 3-second inputs, similar to other 3D body location forecasting literature, and forecast the continuous 3D hand poses for the next 0.5, 1.0, and 1.5 seconds. The evaluation metric is the average over time of the mean per joint position error, in centimeters, with respect to the ground truth. To obtain an evaluation metric suited to 3D action guidance, we remove the mistakes from the action sequences and only forecast 3D hand poses for the correct labels. [1] Wang, X., Kwon, T., Rad, M., Pan, B., Chakraborty, I., Andrist, S., ... & Pollefeys, M. (2023). HoloAssist: An egocentric human interaction dataset for interactive AI assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 20270-20281).
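As a rough illustration of the metric, here is a minimal sketch in Python; the 30 fps frame rate and 26 hand joints are assumptions made for the example, not benchmark specifics.

```python
import numpy as np

def mpjpe_cm(pred, gt):
    """Mean per joint position error in cm: Euclidean distance between
    predicted and ground-truth 3D joints, averaged over frames and joints.
    pred, gt: (T, J, 3) arrays of 3D joint positions in cm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: score the forecast at the 0.5 s, 1.0 s, and 1.5 s
# horizons and average the per-horizon errors.
FPS = 30                                   # assumed frame rate
J = 26                                     # assumed number of hand joints
pred = np.random.rand(45, J, 3) * 10       # placeholder forecast (cm)
gt = np.random.rand(45, J, 3) * 10         # placeholder ground truth (cm)
errors = {h: mpjpe_cm(pred[:int(h * FPS)], gt[:int(h * FPS)])
          for h in (0.5, 1.0, 1.5)}
print(errors, "average:", np.mean(list(errors.values())))
```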
Action Recognition Using 3D Hand-Object Contact/Pressure Map
Action recognition is an essential task in computer vision with numerous applications in fields including robotics, surveillance, and healthcare. Recognizing actions involves analyzing temporal and spatial information within a video sequence. Current state-of-the-art methods use 3D hand and object poses for action recognition, where the object is commonly represented by its corners. However, this approach has limitations in accurately modeling the hand-object interaction. In [1], we show that leveraging a hand-object contact-map representation improves action recognition. However, this representation can also be learned implicitly for the task of action recognition. [1] https://arxiv.org/pdf/2309.10001.pdf
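To make the contact-map idea concrete, here is a minimal sketch (not the exact representation of [1], whose details may differ): a binary map obtained by thresholding distances between hand joints and sampled object surface points; the 1 cm threshold is an assumption.

```python
import numpy as np

def contact_map(hand_joints, object_points, thresh_cm=1.0):
    """Binary contact map: for each hand joint, 1 if any object point
    lies within `thresh_cm` centimeters, else 0.

    hand_joints:   (J, 3) 3D hand joint positions in cm
    object_points: (N, 3) sampled 3D points on the object surface
    """
    # Pairwise distances between every joint and every object point.
    d = np.linalg.norm(hand_joints[:, None, :] - object_points[None, :, :],
                       axis=-1)
    return (d.min(axis=1) < thresh_cm).astype(np.float32)  # shape (J,)
```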
Action Label Correction with LLMs
The recent development of Large Language Models (LLMs), such as ChatGPT and Llama, opens up new possibilities for understanding procedural actions. In the past, action recognition was restricted to the classification of visual frames. With LLMs, however, a model can observe the whole action sequence more effectively and even predict future actions [1]. In this project, students will explore how LLMs can improve action recognition in procedural tasks. Specifically, given a high-level procedural task (e.g., making coffee, copying a paper), students will use existing pretrained action recognition models to predict the top 5 actions for each clip and feed them into an LLM to refine and correct the predicted actions. As a comparison, students will also establish a baseline that corrects actions using simple machine learning and statistical methods. [1] Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023, CVPR'23 workshop
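A minimal sketch of this pipeline, where `query_llm` is a hypothetical stand-in for whatever chat API (e.g., ChatGPT or a local Llama) is used, and the prompt wording is only illustrative:

```python
def build_prompt(task, clips_top5):
    """Assemble an LLM prompt from per-clip top-5 action predictions.

    task: high-level task name, e.g. "making coffee"
    clips_top5: list of lists of (label, score) pairs, one list per
                clip, in temporal order.
    """
    lines = [f"Task: {task}.",
             "Below are the top-5 predicted actions per clip, with scores.",
             "Return the single most plausible action per clip, in order."]
    for i, top5 in enumerate(clips_top5):
        cands = ", ".join(f"{label} ({score:.2f})" for label, score in top5)
        lines.append(f"Clip {i}: {cands}")
    return "\n".join(lines)

# Hypothetical usage:
# corrected = query_llm(build_prompt("making coffee", clips_top5))
```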
Holographic AI Guidance App in MR
Reading text manuals to set up and manipulate devices takes a lot of time and is not intuitive when it comes to 3D instruction. Despite the advent of Mixed Reality (MR) devices, 3D instruction is still limited and expensive to set up. In this project, we will develop an adaptive 3D hand guidance app that projects instructional 3D hand poses on MR devices, driven by pre-recorded instructional videos.
Generalizable Multi-view Reconstruction with 3D Gaussian Splatting
This project explores building a generalizable 3D reconstruction framework with the recently popular 3D Gaussian Splatting, which achieves impressive rendering speed and quality compared with NeRF.
Language-Guided 3D Object Detection
The goal of this project is to use language prompts to help find object parts in 3D.
Metric Relative Pose Estimation
The objective of this project is to determine the metric relative pose between two images using object-to-object matches.
Lifelong Learning in the Context of Long-Term Mapping and Localization
This project explores how to best combine and update visual-inertial maps captured over multiple days despite changes in the environment. Project supervised by and conducted at Magic Leap Zurich.
Leveraging Neural Scene Representations for Large-Scale Localization
Project supervised by and conducted at Magic Leap Zurich. This project explores how recent neural scene representations like NeRFs could be useful for visual localization.
Learning to Understand the World: Semantically-aided Visual Localization
Project supervised by and conducted at Magic Leap Zurich. This project explores how high-level semantic information on objects can improve the robustness of visual localization in self-similar environments.
Learning a Dense Mapping Descriptor for Localization in Challenging Environments
Project supervised by and conducted at Magic Leap Zurich. This project explores how to learn a feature descriptor to use dense volumetric maps for visual localization.
Online Feature Selection for Visual Localization
This project focuses on visual localization by utilizing local features in construction scenes. The main objective of this project is to introduce a method for online descriptor selection.
Accurate SLAM for Human-Robot Teams
We extend the lamar.ethz.ch benchmark to develop accurate SLAM methods that can co-register drones, legged robots, wheeled robots, smartphones, and mixed reality headsets based on visual SLAM.
Retrieval Robust to Object Motion Blur
Fast-moving objects are defined as objects that move over significant distances within the exposure time of a single image or video frame; as a result, they appear significantly blurred. Detection, tracking, and deblurring of such objects have been studied in recent years. However, there are still no methods for robust retrieval of such objects in large image collections.
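For intuition, a minimal sketch of how such blur could be synthesized for experiments: motion blur is approximated by averaging sharp renderings over the exposure window. The renderer `frame_fn` is a hypothetical stand-in, not part of any existing pipeline.

```python
import numpy as np

def synthesize_motion_blur(frame_fn, t0, exposure, n_samples=16):
    """Approximate motion blur by averaging `n_samples` sharp images
    of the moving object along its trajectory within one exposure.

    frame_fn: hypothetical renderer, t -> (H, W, 3) float image
    t0:       exposure start time; exposure: exposure duration
    """
    ts = np.linspace(t0, t0 + exposure, n_samples)
    return np.mean([frame_fn(t) for t in ts], axis=0)
```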
Beyond Marigold: Diffusion-Based Monocular Predictor
This project extends the recent diffusion-based monocular predictor Marigold in different aspects.
3D Auto Segmentation with Language Models
The combination of image captioning models and vision-language models enables segmenting images without any further user input and without defining a set of semantic labels. In this project, we aim to transfer this methodology to 2D+3D input data.
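For the 3D part of the transfer, one standard building block is lifting a 2D mask into a 3D point set with depth and camera intrinsics; the sketch below shows this unprojection under the assumption that per-pixel metric depth is available.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, K):
    """Backproject masked pixels into camera-frame 3D points using a
    pinhole model (standard unprojection, shown only to illustrate
    the 2D -> 2D+3D transfer).

    mask:  (H, W) boolean segmentation mask
    depth: (H, W) metric depth per pixel
    K:     (3, 3) camera intrinsics
    """
    v, u = np.nonzero(mask)               # row/column pixel coordinates
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)   # (N, 3) 3D points
```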
Multi-View 6DoF Object Pose Estimation on HoloLens
The goal of this project is to implement a 6DoF object pose estimation method that utilizes the embedded sensors of head-mounted devices like the Microsoft HoloLens to improve the accuracy of the 6DoF pose estimate. The proposed method will be thoroughly evaluated and compared against single-view, stereo, and multi-view baselines.
Human Action Classification and Inference from Vision
Project supervised by and conducted at Magic Leap Zurich. This project explores how to classify user actions in images captured by egocentric cameras as in augmented-reality devices.
Human Full-Body Pose Prediction from Sparse Data
Project supervised by and conducted at Magic Leap Zurich. This project explores how to estimate the body pose of a user from wearable sensors (cameras, IMUs) attached to their head and hands.
Expressions and Emotion Detection from HMD Sensors
Project supervised by and conducted at Magic Leap Zurich. This project explores how to use sensors mounted on an augmented-reality device (cameras, microphones, IMUs) to infer the emotions and facial expressions of its user.
Splattify: Gaussian Splatting from Feature Sparse Maps in Mixed Reality
This project explores how to leverage 3D Gaussian Splatting with data captured by Mixed Reality devices to reconstruct volumetric 3D scenes and improve SLAM sparse maps. Project supervised by and conducted at Magic Leap Zurich.
Adopting Delaunay Tetrahedralization for Dynamic NeRFs
Tetra-NeRF [1] offers a way to represent the scene as a Delaunay tetrahedralization of the input point cloud. This can also be used to represent dynamic 3D scenes [2], as the deformation is performed on the vertices of the tetrahedral mesh.
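A minimal sketch of the starting point, using SciPy's Qhull-based Delaunay triangulation (which yields tetrahedra for 3D input) together with a toy vertex deformation; this only illustrates the data structure, not the actual deformation model of [2].

```python
import numpy as np
from scipy.spatial import Delaunay

# Build a tetrahedralization of an input point cloud; in 3D, each
# simplex is a tetrahedron given by 4 vertex indices.
points = np.random.rand(500, 3)          # placeholder point cloud
tets = Delaunay(points)                  # tets.simplices: (M, 4)

def deform(vertices, t):
    """Toy time-varying deformation applied to the vertices only;
    the tetrahedral connectivity stays fixed, so anything attached
    to the tetrahedra moves with them."""
    offset = 0.05 * np.sin(2 * np.pi * t)
    return vertices + np.array([0.0, 0.0, offset])

deformed = deform(tets.points, t=0.25)   # same (M, 4) simplices still apply
```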
VR in Habitat 3.0
Motivation: Explore the newly improved Habitat 3.0 simulator with a special focus on its Virtual Reality features. This project is meant to be an exploration of the Habitat 3.0 simulator, covering the newly introduced features with a specific focus on the virtual reality tools for scene navigation. The idea is to extend these features to self-created environments in Unreal Engine that build upon Habitat.
BeSAFEv2: Benchmarking Safety of Agents in Familiar Environments
Motivation: Create a realistically rendered benchmark to evaluate reinforcement learning agents on visual navigation tasks, including interaction with other agents and navigation in scenes with static and dynamic objects and humans.
2D-3D Correspondence Distributions for Articulated Object Pose Estimation
SurfEmb [1] is one of the recent state-of-the-art methods for object pose estimation. Although the underlying problem formulation of SurfEmb could be applied to articulated objects, it has only been investigated for rigid objects so far. The goal of this project is to extend this approach to objects with parameterized articulations.
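To make "parameterized articulations" concrete, the sketch below poses the points of one part with a single revolute joint via Rodrigues' rotation formula; the joint axis and pivot are free parameters of the articulation model and are assumptions of this example.

```python
import numpy as np

def rotate_about_axis(points, pivot, axis, angle):
    """Pose an articulated part by rotating its points `angle` radians
    about a revolute joint (Rodrigues' rotation formula).

    points: (N, 3) canonical part coordinates
    pivot:  (3,) a point on the joint axis
    axis:   (3,) joint axis direction
    """
    p = points - pivot
    k = axis / np.linalg.norm(axis)
    cos, sin = np.cos(angle), np.sin(angle)
    rotated = (p * cos
               + np.cross(k, p) * sin
               + k * (p @ k)[:, None] * (1 - cos))
    return rotated + pivot
```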
Interactive Scene Graphs
Motivation: Use action recognition and object detection to extend the content of static scene graphs for better scene understanding. How: The statically generated scene graph will be updated with information gathered from action recognition networks and object detection algorithms, providing a better understanding of the scene.
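A minimal sketch of what such an updated scene graph could look like as a data structure; all names and fields are illustrative assumptions, not a fixed design.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy dynamic scene graph: nodes are detected objects, edges carry
    relations; recognized actions update the edges over time."""
    nodes: dict = field(default_factory=dict)   # object_id -> class label
    edges: list = field(default_factory=list)   # (subject, relation, object, t)

    def add_detection(self, obj_id, label):
        self.nodes[obj_id] = label

    def add_action(self, subject_id, action, object_id, t):
        # e.g. ("person_1", "opens", "drawer_3", 12.4)
        self.edges.append((subject_id, action, object_id, t))

g = SceneGraph()
g.add_detection("drawer_3", "drawer")
g.add_detection("person_1", "person")
g.add_action("person_1", "opens", "drawer_3", t=12.4)
```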
Making CLIP features multiview consistent
CLIP is a powerful way of connecting images to text prompts and vice versa. However, it is not trained in a multi-view consistent manner: the CLIP features of an object seen from different viewpoints are inconsistent. The goal of this project is to make CLIP multiview consistent.
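One way to quantify this inconsistency is to encode several views of the same object and compare their features. The sketch below does so with the Hugging Face CLIP implementation; `view_images` is assumed to be a list of PIL images of the same object from different viewpoints.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multiview_consistency(view_images):
    """Encode several views of the same object with CLIP and return the
    mean pairwise cosine similarity of their features; values well
    below 1 expose the inconsistency described above."""
    inputs = processor(images=view_images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T                     # pairwise cosine similarities
    n = sim.shape[0]
    return (sim.sum() - n) / (n * (n - 1))    # mean off-diagonal similarity
```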
Editable Scene Representation
Motivation: Map interactions of the users with the scene. These interactions can range from tracking how a drawer is opened and closed, to tracking how objects are placed and taken from places, to learning which light switch is turned on and off. How:
1. A segmentation of the scene at the level of parts, which is more fine-grained than the usual panoptic segmentation.
2. Potentially a hierarchy/graph of how the parts are connected (e.g., everything in a box moves with the box; see the sketch after this list).
3. Tracking of the dynamic elements in the scene.
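A minimal sketch of point 2: a part hierarchy in which a child's pose is composed with its parent's, so that everything in a box moves with the box. Class and field names are illustrative assumptions.

```python
import numpy as np

class PartNode:
    """Node in a part hierarchy; a child's world pose is its parent's
    world pose composed with its own local transform."""
    def __init__(self, name):
        self.name = name
        self.local = np.eye(4)       # 4x4 local rigid transform
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def world_transforms(self, parent_world=None, out=None):
        """Collect the world transform of every part in the subtree."""
        parent_world = np.eye(4) if parent_world is None else parent_world
        out = {} if out is None else out
        world = parent_world @ self.local
        out[self.name] = world
        for c in self.children:
            c.world_transforms(world, out)
        return out

# Moving the box also moves the cup stored inside it.
box = PartNode("box")
box.add(PartNode("cup"))
box.local[:3, 3] = [1.0, 0.0, 0.0]            # translate the box by 1 in x
print(box.world_transforms()["cup"][:3, 3])   # -> [1. 0. 0.]
```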