Week 10 Papers — Agentic AI, RAG & Advanced Research Tools

Twelve papers covering harness optimisation, agent benchmarks, the reliability-vs-accuracy distinction, long-horizon planning collapse, agentic RAG, and RAG evaluation.

Each entry links to the canonical version of the paper: on arXiv, at the journal, or at the publisher. Where a paper is paywalled, the DOI is given so it can be accessed through the UCT library.

10.1 · What Agents Are and What's New in 2026

Meta-Harness: End-to-End Optimization of Model Harnesses

Lee, Y., et al. (2026)

View source · arXiv:2603.28052

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces

(2026)

View source · arXiv:2601.11868

10.2 · Failure Modes for Long-Horizon Tasks

Towards a Science of AI Agent Reliability

Rabanser, S., Kapoor, S., Kirgis, P., Liu, K., Utpala, S., & Narayanan, A. (2026) — Princeton CITP

View source · arXiv:2602.16666

Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

Wang, Z., et al. (2026)

View source · arXiv:2601.22311

YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

(2026)

View source · arXiv:2604.01212

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

(2026)

View source · arXiv:2603.29231

Why Language Models Hallucinate

Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025) — also cited in Week 9

View source · arXiv:2509.04664

10.3 · The Current Tool Landscape and MCP

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Sakana AI (2025)

View source · arXiv:2504.08066

10.4 · RAG in 2026

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

Singh, A., Ehtesham, A., Kumar, S., Talaei Khoei, T., & Vasilakos, A. V. (2025)

View source · arXiv:2501.09136

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

Li, Z., Li, C., Zhang, M., Mei, Q., & Bendersky, M. (2024) — Google; EMNLP 2024

View source · arXiv:2407.16833

RAGAS: Automated Evaluation of Retrieval Augmented Generation

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023)

View source · arXiv:2309.15217

10.5 · Advanced Research Tools — A Curated Tour

The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Rajab, J., Aremu, A., Chimoto, E. A., et al. (2025) — also cited in Week 4

View source · arXiv:2502.15916

10.6 · Hands-On Activities and Assessment

Assessment design (the “Same Task, Three Ways” activity).