Search Results for author: Zhijie Yan

Found 20 papers, 11 papers with code

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

1 code implementation • 28 Mar 2024 • Bu Jin, Yupeng Zheng, Pengfei Li, Weize Li, Yuhang Zheng, Sujie Hu, Xinyu Liu, Jinwei Zhu, Zhijie Yan, Haiyang Sun, Kun Zhan, Peng Jia, Xiaoxiao Long, Yilun Chen, Hao Zhao

However, the exploration of 3D dense captioning in outdoor scenes is hindered by two major challenges: 1) the \textbf{domain gap} between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the \textbf{lack of data} with comprehensive box-caption pair annotations specifically tailored for outdoor scenes.

3D dense captioning Dense Captioning

Paper
Code

Large Language Models Powered Context-aware Motion Prediction

no code implementations • 17 Mar 2024 • Xiaoji Zheng, Lixiu Wu, Zhijie Yan, Yuanrong Tang, Hao Zhao, Chen Zhong, Bokui Chen, Jiangtao Gong

Traditional methods of motion forecasting primarily encode vector information of maps and historical trajectory data of traffic participants, lacking a comprehensive understanding of overall traffic semantics, which in turn affects the performance of prediction tasks.

Motion Forecasting motion prediction +1

Paper
Add Code

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

2 code implementations • 14 Nov 2023 • Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans.

Ranked #1 on Acoustic Scene Classification on TUT Acoustic Scenes 2017 (using extra training data)

Acoustic Scene Classification Audio captioning +4

3,321

Paper
Code

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

1 code implementation • 7 Oct 2023 • JiaMing Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang

In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation.

Audio captioning Automatic Speech Recognition +11

276

Paper
Code

Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

no code implementations • 18 May 2023 • Xian Shi, Haoneng Luo, Zhifu Gao, Shiliang Zhang, Zhijie Yan

Estimating confidence scores for recognition results is a classic task in ASR field and of vital importance for kinds of downstream tasks and training strategies.

speech-recognition Speech Recognition

Paper
Add Code

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

no code implementations • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings.

Extractive Summarization Keyphrase Extraction

Paper
Add Code

MUG: A General Meeting Understanding and Generation Benchmark

1 code implementation • 24 Mar 2023 • Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

To prompt SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) to benchmark the performance of a wide range of SLP tasks, including topic segmentation, topic-level and session-level extractive summarization and topic title generation, keyphrase extraction, and action item detection.

Extractive Summarization Keyphrase Extraction +1

Paper
Code

Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

1 code implementation • 29 Jan 2023 • Xian Shi, Yanni Chen, Shiliang Zhang, Zhijie Yan

Conventional ASR systems use frame-level phoneme posterior to conduct force-alignment~(FA) and provide timestamps, while end-to-end ASR systems especially AED based ones are short of such ability.

3,321

Paper
Code

INT2: Interactive Trajectory Prediction at Intersections

1 code implementation • ICCV 2023 • Zhijie Yan, Pengfei Li, Zheng Fu, Shaocong Xu, Yongliang Shi, Xiaoxue Chen, Yuhang Zheng, Yang Li, Tianyu Liu, Chuxuan Li, Nairui Luo, Xu Gao, Yilun Chen, Zuoxu Wang, Yifeng Shi, Pengfei Huang, Zhengxiao Han, Jirui Yuan, Jiangtao Gong, Guyue Zhou, Hang Zhao, Hao Zhao

One of the most challenging problems in motion forecasting is interactive trajectory prediction, whose goal is to jointly forecasts the future trajectories of interacting agents.

Motion Forecasting Trajectory Prediction

Paper
Code

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

1 code implementation • 29 Nov 2022 • Xiaohuan Zhou, JiaMing Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, Chang Zhou

Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text.

Ranked #2 on Speech Recognition on AISHELL-1

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

2,324

Paper
Code

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

2 code implementations • 16 Jun 2022 • Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan

However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus.

Language Modelling speech-recognition +1

6,064

Paper
Code

Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

1 code implementation • 18 Mar 2022 • Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings.

Ranked #1 on Speaker Diarization on AliMeeting

Action Detection Activity Detection +2

3,321

Paper
Code

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

no code implementations • 16 Feb 2022 • Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).

Paper
Add Code

BeamTransformer: Microphone Array-based Overlapping Speech Detection

no code implementations • 9 Sep 2021 • Siqi Zheng, Shiliang Zhang, Weilong Huang, Qian Chen, Hongbin Suo, Ming Lei, Jinwei Feng, Zhijie Yan

We propose BeamTransformer, an efficient architecture to leverage beamformer's edge in spatial filtering and transformer's capability in context sequence modeling.

Paper
Add Code

A Real-time Speaker Diarization System Based on Spatial Spectrum

no code implementations • 20 Jul 2021 • Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan

In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting.

speaker-diarization Speaker Diarization +1

Paper
Add Code

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

1 code implementation • 21 May 2020 • Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie

Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention.

Sound Audio and Speech Processing

3,321

Paper
Code

Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

no code implementations • 3 Oct 2019 • Kai Fan, Jiayi Wang, Bo Li, Shiliang Zhang, Boxing Chen, Niyu Ge, Zhijie Yan

The performances of automatic speech recognition (ASR) systems are usually evaluated by the metric word error rate (WER) when the manually transcribed data are provided, which are, however, expensively available in the real scenario.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition

no code implementations • 27 Mar 2019 • Shiliang Zhang, Ming Lei, Zhijie Yan

Results in a 20, 000 hours Mandarin speech recognition task show that the proposed spelling correction model can achieve a CER of 3. 41%, which results in 22. 9% and 53. 2% relative improvement compared to the baseline CTC-based systems decoded with and without language model respectively.

Language Modelling Machine Translation +4

Paper
Add Code

Deep-FSMN for Large Vocabulary Continuous Speech Recognition

1 code implementation • 4 Mar 2018 • Shiliang Zhang, Ming Lei, Zhijie Yan, Li-Rong Dai

In a 20000 hours Mandarin recognition task, the LFR trained DFSMN can achieve more than 20% relative improvement compared to the LFR trained BLSTM.

Language Modelling speech-recognition +1

Paper
Code

Deep Feed-forward Sequential Memory Networks for Speech Synthesis

no code implementations • 26 Feb 2018 • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, Zhijie Yan

The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the best parametric Text-to-Speech (TTS) systems in terms of the naturalness of generated speech, especially the naturalness in prosody.

speech-recognition Speech Recognition +1

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.