no code implementations • 20 Dec 2023 • Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao
We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video.
no code implementations • NeurIPS 2023 • Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu
A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions.
no code implementations • 2 Nov 2023 • Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu
We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals.
no code implementations • CVPR 2023 • Samuel Clarke, Ruohan Gao, Mason Wang, Mark Rau, Julia Xu, Jui-Hsien Wang, Doug L. James, Jiajun Wu
Objects make unique sounds under different perturbations, environmental conditions, and poses relative to the listener.
1 code implementation • 1 Jun 2023 • Ruohan Gao, Hao Li, Gokul Dharan, Zhuzhu Wang, Chengshu Li, Fei Xia, Silvio Savarese, Li Fei-Fei, Jiajun Wu
We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear.
no code implementations • CVPR 2023 • Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu
We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch.
no code implementations • 29 Apr 2023 • Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese
For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image.
no code implementations • 10 Mar 2023 • Hong-Xing Yu, Michelle Guo, Alireza Fathi, Yen-Yu Chang, Eric Ryan Chan, Ruohan Gao, Thomas Funkhouser, Jiajun Wu
We propose Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct object appearance from only images.
no code implementations • 7 Dec 2022 • Hao Li, Yizhi Zhang, Junzhe Zhu, Shaoxiong Wang, Michelle A Lee, Huazhe Xu, Edward Adelson, Li Fei-Fei, Ruohan Gao, Jiajun Wu
Humans use all of their senses to accomplish different tasks in everyday activities.
no code implementations • 17 Oct 2022 • Simon Le Cleac'h, Hong-Xing Yu, Michelle Guo, Taylor A. Howell, Ruohan Gao, Jiajun Wu, Zachary Manchester, Mac Schwager
A robot can use this simulation to optimize grasps and manipulation trajectories of neural objects, or to improve the neural object models through gradient-based real-to-simulation transfer.
1 code implementation • CVPR 2022 • Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu
We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects.
no code implementations • CVPR 2022 • Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
no code implementations • 21 Nov 2021 • Rishabh Garg, Ruohan Gao, Kristen Grauman
Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings.
no code implementations • 16 Sep 2021 • Ruohan Gao, Yen-Yu Chang, Shivani Mall, Li Fei-Fei, Jiajun Wu
Multisensory object-centric perception, reasoning, and interaction have been a key research topic in recent years.
1 code implementation • CVPR 2021 • Ruohan Gao, Kristen Grauman
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
1 code implementation • ICLR 2021 • Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room).
no code implementations • ECCV 2020 • Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world.
1 code implementation • CVPR 2020 • Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani
In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical.
Ranked #8 on Action Recognition on ActivityNet
3 code implementations • ICCV 2019 • Ruohan Gao, Kristen Grauman
Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel.
Ranked #1 on Audio Denoising on AV-Bench - Wooden Horse
2 code implementations • CVPR 2019 • Ruohan Gao, Kristen Grauman
We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations.
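The channel arithmetic behind this mono-to-binaural setup can be made concrete. Below is a minimal NumPy sketch assuming the common decomposition in which the mono mixture is the channel sum (mono = L + R) and the network is trained to predict the left-right difference signal (diff = L - R); the `split_binaural` helper and the toy signals are purely illustrative stand-ins for the visually guided network's output, not the paper's actual implementation.

```python
import numpy as np

def split_binaural(mono: np.ndarray, diff: np.ndarray):
    """Recover left/right channels from a mono mixture and a predicted
    difference signal, assuming mono = L + R and diff = L - R."""
    left = (mono + diff) / 2.0
    right = (mono - diff) / 2.0
    return left, right

# Toy check: rebuild a known stereo pair from its mixture and difference.
rng = np.random.default_rng(0)
left_true = rng.standard_normal(16)
right_true = rng.standard_normal(16)
mono = left_true + right_true   # single-channel mixture (network input)
diff = left_true - right_true   # what the network would be trained to predict
left, right = split_binaural(mono, diff)
assert np.allclose(left, left_true) and np.allclose(right, right_true)
```

Predicting the difference rather than the two channels directly is convenient because the mono mixture is already given, so only one residual signal needs to be estimated.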
2 code implementations • ECCV 2018 • Ruohan Gao, Rogerio Feris, Kristen Grauman
Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video.
4 code implementations • CVPR 2018 • Ruohan Gao, Bo Xiong, Kristen Grauman
Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a standard two-stream network for activity recognition.
no code implementations • ECCV 2018 • Dinesh Jayaraman, Ruohan Gao, Kristen Grauman
We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation.
1 code implementation • ICCV 2017 • Ruohan Gao, Kristen Grauman
While machine learning approaches to image restoration offer great promise, current methods risk training models fixated on performing well only for image corruption of a particular level of difficulty, such as a certain level of noise or blur.
no code implementations • 1 Dec 2016 • Ruohan Gao, Dinesh Jayaraman, Kristen Grauman
Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances.