no code implementations • 13 Mar 2024 • Sitao Cheng, Ziyuan Zhuang, Yong Xu, Fangkai Yang, Chaoyun Zhang, Xiaoting Qin, Xiang Huang, Ling Chen, QIngwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
We instantiate the path on structured environments and provide feedback to edit the path if anything goes wrong.
no code implementations • 7 Mar 2024 • Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan
Lastly, we conduct a case study with a team at Microsoft to equip the ReAct agent with tools that give it access to external diagnostic services that are used by the team for manual RCA.
no code implementations • 29 Feb 2024 • Pooja Srinivas, Fiza Husain, Anjaly Parayil, Ayush Choure, Chetan Bansal, Saravan Rajmohan
We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology.
no code implementations • 27 Feb 2024 • Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Effective incident management is pivotal for the smooth operation of enterprises-level cloud services.
no code implementations • 15 Feb 2024 • Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan
to generate insights for detection, root causing and mitigating of incidents.
1 code implementation • 8 Feb 2024 • Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.
no code implementations • 5 Feb 2024 • Supriyo Ghosh, Karish Grover, Jimmy Wong, Chetan Bansal, Rakesh Namineni, Mohit Verma, Saravan Rajmohan
In this paper, we propose the dependency-aware incident linking (DiLink) framework which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links not only coming from same service, but also from different services and workloads.
1 code implementation • 5 Feb 2024 • Zexin Wang, Changhua Pei, Minghua Ma, Xin Wang, Zhihan Li, Dan Pei, Saravan Rajmohan, Dongmei Zhang, QIngwei Lin, Haiming Zhang, Jianhui Li, Gaogang Xie
To ensure an accurate AD, FCVAE exploits an innovative approach to concurrently integrate both the global and local frequency features into the condition of Conditional Variational Autoencoder (CVAE) to significantly increase the accuracy of reconstructing the normal data.
no code implementations • 24 Jan 2024 • Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, Saravan Rajmohan
The results reveal that our in-context learning approach outperforms the previous fine-tuned large language models such as GPT-3 by an average of 24. 8\% across all metrics, with an impressive 49. 7\% improvement over the zero-shot model.
no code implementations • 13 Jan 2024 • Lu Wang, Mayukh Das, Fangkai Yang, Chao Duo, Bo Qiao, Hang Dong, Si Qin, Chetan Bansal, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
We address the challenge of learning safe and robust decision policies in presence of uncertainty in context of the real scientific problem of adaptive resource oversubscription to enhance resource efficiency while ensuring safety against resource congestion risk.
no code implementations • 13 Jan 2024 • Lu Wang, Chao Du, Pu Zhao, Chuan Luo, Zhangchi Zhu, Bo Qiao, Wei zhang, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
To correct the negative sampling bias, we propose a novel contrastive learning method named Positive-Unlabeled Contrastive Learning (PUCL).
no code implementations • 8 Jan 2024 • Haozhe Li, Minghua Ma, Yudong Liu, Pu Zhao, Lingling Zheng, Ze Li, Yingnong Dang, Murali Chintalapati, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
Using two real-world datasets of disk failure prediction and conducting node prediction experiments in Microsoft Azure, which is a top-tier cloud provider that serves millions of users, we demonstrate Uptake can significantly improve the failure prediction accuracy by 5% on average.
no code implementations • 19 Dec 2023 • YuXuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
This paper presents a thorough empirical study on the utilization of queries of KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft.
1 code implementation • 29 Nov 2023 • Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang
TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic.
no code implementations • 27 Nov 2023 • Lukas Wutschitz, Boris Köpf, Andrew Paverd, Saravan Rajmohan, Ahmed Salem, Shruti Tople, Santiago Zanella-Béguelin, Menglin Xia, Victor Rühle
In this paper, we take an information flow control perspective to describe machine learning systems, which allows us to leverage metadata such as access control policies and define clear-cut privacy and confidentiality guarantees with interpretable information flows.
1 code implementation • 7 Nov 2023 • Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei zhang, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
To address these limitations, we introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of "Penrose triangle of existing thought paradigms.
no code implementations • 11 Sep 2023 • Dylan Zhang, Xuchao Zhang, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, Saravan Rajmohan
Major cloud providers have employed advanced AI-based solutions like large language models to aid humans in identifying the root causes of cloud incidents.
no code implementations • 8 Aug 2023 • Menglin Xia, Xuchao Zhang, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle
Retrieval augmentation enhances performance of traditional language models by incorporating additional context.
no code implementations • 3 Aug 2023 • Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
However, they suffer from poor data quality like data missing in model training and prediction, which limits the performance.
1 code implementation • 1 Aug 2023 • Zhangchi Zhu, Lu Wang, Pu Zhao, Chao Du, Wei zhang, Hang Dong, Bo Qiao, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang
To mitigate the impact of label uncertainty and improve the robustness of learning with positive and unlabeled data, we propose a new robust PU learning method with a training strategy motivated by the nature of human learning: easy cases should be learned first.
1 code implementation • 3 Jul 2023 • Yuhang Chen, Chaoyun Zhang, Minghua Ma, Yudong Liu, Ruomeng Ding, Bowen Li, Shilin He, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.
no code implementations • 19 May 2023 • Liting Chen, Lu Wang, Hang Dong, Yali Du, Jie Yan, Fangkai Yang, Shuang Li, Pu Zhao, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
The emergence of large language models (LLMs) has substantially influenced natural language processing, demonstrating exceptional results across various tasks.
1 code implementation • 19 May 2023 • Fangkai Yang, Pu Zhao, Zezhong Wang, Lu Wang, Jue Zhang, Mohit Garg, QIngwei Lin, Saravan Rajmohan, Dongmei Zhang
Large Language Model (LLM) has gained popularity and achieved remarkable results in open-domain tasks, but its performance in real industrial domain-specific scenarios is average due to its lack of specific domain knowledge.
1 code implementation • NeurIPS 2023 • Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, QIngwei Lin, Saravan Rajmohan, Thomas Moscibroda, Dongmei Zhang
In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns conservative V-function via directly imposing penalty on OOD states.
no code implementations • 10 Jan 2023 • Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, Saravan Rajmohan
In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents.
no code implementations • 21 Nov 2022 • Junjie Sheng, Lu Wang, Fangkai Yang, Bo Qiao, Hang Dong, Xiangfeng Wang, Bo Jin, Jun Wang, Si Qin, Saravan Rajmohan, QIngwei Lin, Dongmei Zhang
To address these two limitations, this paper formulates the oversubscription for cloud as a chance-constrained optimization problem and propose an effective Chance Constrained Multi-Agent Reinforcement Learning (C2MARL) method to solve this problem.
Multi-agent Reinforcement Learning reinforcement-learning +1
no code implementations • 20 Jul 2022 • Jie Yan, Yunlei Lu, Liting Chen, Si Qin, Yixin Fang, QIngwei Lin, Thomas Moscibroda, Saravan Rajmohan, Dongmei Zhang
This paper investigates a critical resource allocation problem in the first party cloud: scheduling containers to machines.