SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Authors: Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao

Abstract: Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing. However, in the real world, the development of mature software is typically driven by complex requirement changes and long-term feature iterations -- a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose SWE-CI, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term functional correctness toward dynamic, long-term maintainability. The key insight is simple: maintainability can be revealed by tracking how functional correctness changes over time. The benchmark comprises 100 tasks, each derived from a real-world code repository with a development history spanning an average of 233 days and 71 consecutive commits. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations. SWE-CI provides valuable insights into how well agents can sustain code quality throughout long-term evolution.
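The evaluation idea described above -- scoring an agent on how functional correctness evolves across consecutive commits rather than on a single snapshot -- can be sketched as a minimal CI-style loop. This is an illustrative sketch only: the names `evaluate_task`, `maintainability_score`, and the stubbed agent/CI callables are hypothetical and not part of the SWE-CI benchmark's actual interface.

```python
# Minimal sketch of a CI-loop evaluation: drive an agent through
# consecutive requirement changes and record the CI pass rate after
# each iteration, then summarize correctness over the whole history.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IterationResult:
    commit_index: int
    pass_rate: float  # fraction of CI tests passing after this iteration


def evaluate_task(
    num_commits: int,
    apply_agent_change: Callable[[int], None],
    run_ci_tests: Callable[[], float],
) -> List[IterationResult]:
    """Run the agent on each consecutive commit and record CI results."""
    history: List[IterationResult] = []
    for i in range(num_commits):
        apply_agent_change(i)   # agent resolves the i-th requirement change
        rate = run_ci_tests()   # CI reports the fraction of tests passing
        history.append(IterationResult(i, rate))
    return history


def maintainability_score(history: List[IterationResult]) -> float:
    """Average pass rate across the history: rewards sustained
    correctness over time, not just the final snapshot."""
    return sum(r.pass_rate for r in history) / len(history)


if __name__ == "__main__":
    # Toy usage with a stubbed agent and a stubbed CI that degrades.
    rates = iter([1.0, 0.9, 0.8])
    hist = evaluate_task(3, lambda i: None, lambda: next(rates))
    print(round(maintainability_score(hist), 2))  # 0.9
```

Averaging over the full trajectory is one simple way to operationalize "long-term maintainability"; a real harness could equally use a trend slope or worst-case pass rate.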
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.03823 [cs.SE] (or arXiv:2603.03823v4 [cs.SE] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.03823

Submission history
From: Jialong Chen
[v1] Wed, 4 Mar 2026 08:20:25 UTC (3,311 KB)
[v2] Tue, 17 Mar 2026 15:22:33 UTC (3,312 KB)
[v3] Wed, 18 Mar 2026 12:07:41 UTC (3,315 KB)
[v4] Wed, 1 Apr 2026 05:06:38 UTC (6,535 KB)