15 papers spanning benchmark contamination, structural failure modes (the reversal curse, sycophancy, hallucination), and recent cases of AI contributing to genuine mathematical and physical discovery.
Each entry links to the canonical version of the paper: on arXiv, at the journal, or at the publisher. Where a paper is paywalled, the DOI is given for access through the UCT library.
9.1 · The Trajectory of LLM Capabilities
Investigating Data Contamination in Modern Benchmarks for Large Language Models
LatestEval — dynamic test construction to avoid data contamination
Bridging the Gap — measuring the real-world capability gap
IrokoBench: a benchmark for African languages
ProgramBench: Can Language Models Rebuild Programs From Scratch?
9.2 · Three Categories of Failure
The Reversal Curse: LLMs Trained on “A is B” Fail to Learn “B is A”
Mathematical Capabilities of ChatGPT
Towards Understanding Sycophancy in Language Models
Why Language Models Hallucinate
9.3 · Where AI Is Now Genuinely Strong
AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge
Mathematical exploration and discovery at scale
Single-minus gluon tree amplitudes
Solving an Open Problem in Theoretical Physics using AI
Primitive sets — a number-theory problem solved with AI assistance
AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
9.4 · Illusions of Understanding
Conceptual sub-lesson — draws on the failure-mode papers above rather than introducing new primary literature.
9.5 · Verification Protocols for a Moving Target
Original protocol/workflow content.
9.6 · Hands-On Activities and Assessment
Assessment design.