Search Results for author: Saravan Rajmohan

Found 27 papers, 8 papers with code

Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments

no code implementations • 13 Mar 2024 • Sitao Cheng, Ziyuan Zhuang, Yong Xu, Fangkai Yang, Chaoyun Zhang, Xiaoting Qin, Xiang Huang, Ling Chen, QIngwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

We instantiate the path on structured environments and provide feedback to edit the path if anything goes wrong.

Paper
Add Code

Exploring LLM-based Agents for Root Cause Analysis

no code implementations • 7 Mar 2024 • Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan

Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA.

Management Retrieval

Paper
Add Code

Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach

no code implementations • 29 Feb 2024 • Pooja Srinivas, Fiza Husain, Anjaly Parayil, Ayush Choure, Chetan Bansal, Saravan Rajmohan

We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology.

Paper
Add Code

Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides

no code implementations • 27 Feb 2024 • Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

Effective incident management is pivotal for the smooth operation of enterprises-level cloud services.

Management

Paper
Add Code

X-lifecycle Learning for Cloud Incident Management using LLMs

no code implementations • 15 Feb 2024 • Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan

to generate insights for detection, root causing and mitigating of incidents.

Management

Paper
Add Code

UFO: A UI-Focused Agent for Windows OS Interaction

1 code implementation • 8 Feb 2024 • Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.

Navigate

4,591

Paper
Code

Dependency Aware Incident Linking in Large Cloud Systems

no code implementations • 5 Feb 2024 • Supriyo Ghosh, Karish Grover, Jimmy Wong, Chetan Bansal, Rakesh Namineni, Mohit Verma, Saravan Rajmohan

In this paper, we propose the dependency-aware incident linking (DiLink) framework which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links not only coming from same service, but also from different services and workloads.

Paper
Add Code

Revisiting VAE for Unsupervised Time Series Anomaly Detection: A Frequency Perspective

1 code implementation • 5 Feb 2024 • Zexin Wang, Changhua Pei, Minghua Ma, Xin Wang, Zhihan Li, Dan Pei, Saravan Rajmohan, Dongmei Zhang, QIngwei Lin, Haiming Zhang, Jianhui Li, Gaogang Xie

To ensure an accurate AD, FCVAE exploits an innovative approach to concurrently integrate both the global and local frequency features into the condition of Conditional Variational Autoencoder (CVAE) to significantly increase the accuracy of reconstructing the normal data.

Anomaly Detection Time Series +1

Paper
Code

Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

no code implementations • 24 Jan 2024 • Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, Saravan Rajmohan

The results reveal that our in-context learning approach outperforms the previous fine-tuned large language models such as GPT-3 by an average of 24. 8\% across all metrics, with an impressive 49. 7\% improvement over the zero-shot model.

In-Context Learning

Paper
Add Code

COIN: Chance-Constrained Imitation Learning for Uncertainty-aware Adaptive Resource Oversubscription Policy

no code implementations • 13 Jan 2024 • Lu Wang, Mayukh Das, Fangkai Yang, Chao Duo, Bo Qiao, Hang Dong, Si Qin, Chetan Bansal, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

We address the challenge of learning safe and robust decision policies in presence of uncertainty in context of the real scientific problem of adaptive resource oversubscription to enhance resource efficiency while ensuring safety against resource congestion risk.

Imitation Learning Management

Paper
Add Code

Contrastive Learning with Negative Sampling Correction

no code implementations • 13 Jan 2024 • Lu Wang, Chao Du, Pu Zhao, Chuan Luo, Zhangchi Zhu, Bo Qiao, Wei zhang, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

To correct the negative sampling bias, we propose a novel contrastive learning method named Positive-Unlabeled Contrastive Learning (PUCL).

Contrastive Learning Data Augmentation +2

Paper
Add Code

Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction

no code implementations • 8 Jan 2024 • Haozhe Li, Minghua Ma, Yudong Liu, Pu Zhao, Lingling Zheng, Ze Li, Yingnong Dang, Murali Chintalapati, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

Using two real-world datasets of disk failure prediction and conducting node prediction experiments in Microsoft Azure, which is a top-tier cloud provider that serves millions of users, we demonstrate Uptake can significantly improve the failure prediction accuracy by 5% on average.

Cloud Computing

Paper
Add Code

Xpert: Empowering Incident Management with Query Recommendations via Large Language Models

no code implementations • 19 Dec 2023 • YuXuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

This paper presents a thorough empirical study on the utilization of queries of KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft.

Management

Paper
Add Code

TaskWeaver: A Code-First Agent Framework

1 code implementation • 29 Nov 2023 • Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang

TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic.

Natural Language Understanding

4,776

Paper
Code

Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective

no code implementations • 27 Nov 2023 • Lukas Wutschitz, Boris Köpf, Andrew Paverd, Saravan Rajmohan, Ahmed Salem, Shruti Tople, Santiago Zanella-Béguelin, Menglin Xia, Victor Rühle

In this paper, we take an information flow control perspective to describe machine learning systems, which allows us to leverage metadata such as access control policies and define clear-cut privacy and confidentiality guarantees with interpretable information flows.

Retrieval

Paper
Add Code

Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

1 code implementation • 7 Nov 2023 • Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei zhang, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

To address these limitations, we introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of "Penrose triangle of existing thought paradigms.

Decision Making

Paper
Code

PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis

no code implementations • 11 Sep 2023 • Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan

Major cloud providers have employed advanced AI-based solutions like large language models to aid humans in identifying the root causes of cloud incidents.

Decision Making

Paper
Add Code

Hybrid Retrieval-Augmented Generation for Real-time Composition Assistance

no code implementations • 8 Aug 2023 • Menglin Xia, Xuchao Zhang, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle

Retrieval augmentation enhances performance of traditional language models by incorporating additional context.

Hallucination Language Modelling +2

Paper
Add Code

Diffusion-based Time Series Data Imputation for Microsoft 365

no code implementations • 3 Aug 2023 • Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

However, they suffer from poor data quality like data missing in model training and prediction, which limits the performance.

Imputation Time Series

Paper
Add Code

Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction

1 code implementation • 1 Aug 2023 • Zhangchi Zhu, Lu Wang, Pu Zhao, Chao Du, Wei zhang, Hang Dong, Bo Qiao, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang

To mitigate the impact of label uncertainty and improve the robustness of learning with positive and unlabeled data, we propose a new robust PU learning method with a training strategy motivated by the nature of human learning: easy cases should be learned first.

Paper
Code

ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection

1 code implementation • 3 Jul 2023 • Yuhang Chen, Chaoyun Zhang, Minghua Ma, Yudong Liu, Ruomeng Ding, Bowen Li, Shilin He, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.

Anomaly Detection Imputation +2

Paper
Code

Introspective Tips: Large Language Model for In-Context Decision Making

no code implementations • 19 May 2023 • Liting Chen, Lu Wang, Hang Dong, Yali Du, Jie Yan, Fangkai Yang, Shuang Li, Pu Zhao, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

The emergence of large language models (LLMs) has substantially influenced natural language processing, demonstrating exceptional results across various tasks.

Decision Making Language Modelling +2

Paper
Add Code

Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering

1 code implementation • 19 May 2023 • Fangkai Yang, Pu Zhao, Zezhong Wang, Lu Wang, Jue Zhang, Mohit Garg, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang

Large Language Model (LLM) has gained popularity and achieved remarkable results in open-domain tasks, but its performance in real industrial domain-specific scenarios is average due to its lack of specific domain knowledge.

Language Modelling Large Language Model +2

Paper
Code

Conservative State Value Estimation for Offline Reinforcement Learning

1 code implementation • NeurIPS 2023 • Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, QIngwei Lin, Saravan Rajmohan, Thomas Moscibroda, Dongmei Zhang

In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns conservative V-function via directly imposing penalty on OOD states.

D4RL reinforcement-learning

Paper
Code

Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models

no code implementations • 10 Jan 2023 • Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, Saravan Rajmohan

In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.

Management Question Answering +1

Paper
Add Code

Learning Cooperative Oversubscription for Cloud by Chance-Constrained Multi-Agent Reinforcement Learning

no code implementations • 21 Nov 2022 • Junjie Sheng, Lu Wang, Fangkai Yang, Bo Qiao, Hang Dong, Xiangfeng Wang, Bo Jin, Jun Wang, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang

To address these two limitations, this paper formulates the oversubscription for cloud as a chance-constrained optimization problem and propose an effective Chance Constrained Multi-Agent Reinforcement Learning (C2MARL) method to solve this problem.

Multi-agent Reinforcement Learning reinforcement-learning +1

Paper
Add Code

Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach

no code implementations • 20 Jul 2022 • Jie Yan, Yunlei Lu, Liting Chen, Si Qin, Yixin Fang, QIngwei Lin, Thomas Moscibroda, Saravan Rajmohan, Dongmei Zhang

This paper investigates a critical resource allocation problem in the first party cloud: scheduling containers to machines.

Scheduling

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.