Week 9

Critical Evaluation & Limitations of AI

Benchmarks, failure categories, and where AI is now genuinely strong

15 papers spanning benchmark contamination, structural failure modes (the reversal curse, sycophancy, hallucination), and recent cases of AI contributing to genuine mathematical and physical discovery.

Each entry links to the canonical version of the paper — on arXiv, the journal, or the publisher. Where a paper is paywalled, the DOI is given for UCT-library access.

9.1 · The Trajectory of LLM Capabilities

Investigating Data Contamination in Modern Benchmarks for Large Language Models
Deng, C., Zhao, Y., Tang, X., Gerstein, M., & Cohan, A. (2023) — NAACL 2024
LatestEval — dynamic test construction to avoid data contamination
(2023) — AAAI 2024
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments
Alhanai, T., Kasumovic, M., Ghassemi, M., Zitzelberger, R., Lundin, E., & Chabot-Couture, G. (2024) — AAAI 2025
IrokoBench: a benchmark for African languages
(2024)
ProgramBench: Can Language Models Rebuild Programs From Scratch?
(2026) — arXiv, May 2026

9.2 · Three Categories of Failure

The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”
Berglund, L., Tong, M., Kaufmann, M., et al. (2023)
Mathematical Capabilities of ChatGPT
Frieder, S., Pinchetti, L., et al. (2023) — NeurIPS 2023
Towards Understanding Sycophancy in Language Models
Sharma, M., Tong, M., Korbak, T., et al. (2023) — ICLR 2024
Why Language Models Hallucinate
Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025)

9.3 · Where AI Is Now Genuinely Strong

AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge
Fang, Y.-L., Jian, D.-S., Li, X., & Ma, Y.-Q. (2025)
Mathematical exploration and discovery at scale
Georgiev, B., Gómez-Serrano, J., Tao, T., & Wagner, A. Z. (2025)
Single-minus gluon tree amplitudes
Guevara, A., Lupsasca, A., Skinner, D., Strominger, A., & Weil, K. (2026) — AI-assisted amplitude calculation
Solving an Open Problem in Theoretical Physics using AI
Brenner, M. P., Cohen-Addad, V., & Woodruff, D. (2026)
Primitive sets — a number-theory problem solved with AI assistance
Alexeev, B., Barreto, K., Li, Y., Lichtman, J. D., Price, L., Shah, J. I., Tang, Q., & Tao, T. (2026)
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
(2026) — arXiv, May 2026

9.4 · Illusions of Understanding

Conceptual sub-lesson — draws on the failure-mode papers above rather than introducing new primary literature.

9.5 · Verification Protocols for a Moving Target

Original protocol/workflow content.

9.6 · Hands-On Activities and Assessment

Assessment design.