Twelve papers covering harness optimisation, agent benchmarks, the reliability-vs-accuracy distinction, long-horizon planning collapse, agentic RAG, and RAG evaluation.
Each entry links to the canonical version of the paper: on arXiv, at the journal, or at the publisher. Where a paper is paywalled, the DOI is given so it can be accessed through the UCT library.
10.1 · What Agents Are and What's New in 2026
Meta-Harness: End-to-End Optimization of Model Harnesses
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command-Line Interfaces
10.2 · Failure Modes for Long-Horizon Tasks
Towards a Science of AI Agent Reliability
Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
Why Language Models Hallucinate
10.3 · The Current Tool Landscape and MCP
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
10.4 · RAG in 2026
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
RAGAS: Automated Evaluation of Retrieval Augmented Generation
10.5 · Advanced Research Tools — A Curated Tour
The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages
10.6 · Hands-On Activities and Assessment
Assessment design (the “Same Task, Three Ways” activity).