
Week 9

Critical Evaluation & Limitations of AI

Benchmarks, failure categories, and where AI is now genuinely strong

15 papers spanning benchmark contamination, structural failure modes (the reversal curse, sycophancy, hallucination), and recent cases of AI contributing to genuine mathematical and physical discovery.

Each entry links to the canonical version of the paper, whether on arXiv, at the journal, or at the publisher. Where a paper is paywalled, the DOI is given for UCT-library access.

9.1 · The Trajectory of LLM Capabilities

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Deng, C., Zhao, Y., Tang, X., Gerstein, M., & Cohan, A. (2023) — NAACL 2024

View source · arXiv:2311.09783

LatestEval — dynamic test construction to avoid data contamination

(2023) — AAAI 2024

View source · arXiv:2312.12343

Bridging the Gap — measuring the real-world capability gap

Alhanai, T., Kasumovic, M., Ghassemi, M., Zitzelberger, R., Lundin, E., & Chabot-Couture, G. (2024) — AAAI 2025

View source · arXiv:2412.12417

IrokoBench: a benchmark for African languages

(2024)

View source · arXiv:2406.03368

ProgramBench: Can Language Models Rebuild Programs From Scratch?

(2026) — arXiv, May 2026

View source · arXiv:2605.03546

9.2 · Three Categories of Failure

The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”

Berglund, L., Tong, M., Kaufmann, M., et al. (2023)

View source · arXiv:2309.12288

Mathematical Capabilities of ChatGPT

Frieder, S., Pinchetti, L., et al. (2023) — NeurIPS 2023

View source · arXiv:2301.13867

Towards Understanding Sycophancy in Language Models

Sharma, M., Tong, M., Korbak, T., et al. (2023) — ICLR 2024

View source · arXiv:2310.13548

Why Language Models Hallucinate

Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025)

View source · arXiv:2509.04664

9.3 · Where AI Is Now Genuinely Strong

AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge

Fang, Y.-L., Jian, D.-S., Li, X., & Ma, Y.-Q. (2025)

View source · arXiv:2504.01538

Mathematical exploration and discovery at scale

Georgiev, B., Gómez-Serrano, J., Tao, T., & Wagner, A. Z. (2025)

View source · arXiv:2511.02864

Single-minus gluon tree amplitudes

Guevara, A., Lupsasca, A., Skinner, D., Strominger, A., & Weil, K. (2026) — AI-assisted amplitude calculation

View source · arXiv:2602.12176

Solving an Open Problem in Theoretical Physics using AI

Brenner, M. P., Cohen-Addad, V., & Woodruff, D. (2026)

View source · arXiv:2603.04735

Primitive sets — a number-theory problem solved with AI assistance

Alexeev, B., Barreto, K., Li, Y., Lichtman, J. D., Price, L., Shah, J. I., Tang, Q., & Tao, T. (2026)

View source · arXiv:2605.00301

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

(2026) — arXiv, May 2026

View source · arXiv:2605.06651

9.4 · Illusions of Understanding

Conceptual sub-lesson — draws on the failure-mode papers above rather than introducing new primary literature.

9.5 · Verification Protocols for a Moving Target

Original protocol/workflow content.

9.6 · Hands-On Activities and Assessment

Assessment design.