Week 8

Multimodal AI for Research

Vision, documents, audio, and video — what AI can really do across modalities

Ten papers covering chart understanding, VLM blind spots, scientific images, document OCR, transcription benchmarks, and long-context behaviour. Two further references (ACM FAccT, SSRN) are link-only.

All PDFs link to raw.githubusercontent.com; clicking will download the file directly. Source links go to the canonical version on arXiv, the journal, or the publisher.

8.1 · What Multimodal AI Can See, Hear, and Read

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Wang, Z., et al. (2024)
Vision Language Models Are Blind: Failing to Translate Detailed Visual Features into Words
Rahmanzadehgervi, P., et al. (2024) — ACCV 2024

8.2 · AI and Scientific Images

Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Jin, Q., et al. (2024) — npj Digital Medicine
Efficient deep learning-based approach for malaria detection
Mujahid, M., et al. (2024) — Scientific Reports

8.3 · Document Intelligence

Benchmarking Large Language Models for Handwritten Text Recognition
Crosilla, G., Klic, L., & Colavizza, G. (2025)
olmOCR 2: Unit Test Rewards for Document OCR
Poznanski, J., Soldaini, L., et al. (2025)

8.4 · Transcription and Audio Analysis

Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Imam, S. H., Belay, T. D., et al. (2025)
Benchmarking Automatic Speech Recognition Models for African Languages
Nahabwe, A., Kagumire, S., et al. (2025) — Deep Learning Indaba 2025

8.5 · Video and Multimodal Workflows

Lost in the Middle: How Language Models Use Long Contexts
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024) — TACL 2024

8.6 · Hands-On Activities and Assessment

Hydroxychloroquine and chloroquine prophylaxis for COVID-19
COPCOV Investigators (2024) — PLOS Medicine — activity 3 source paper

Linked but not redistributed

Koenecke, A., et al. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. FAccT ’24. DOI:10.1145/3630106.3658975 (Section 8.4)
ACM Digital Library doesn’t expose a public PDF endpoint; click through or use ACM Open Access.
Friese, S. (2025). From Coding to Conversation: A New Methodological Framework for AI-Assisted Qualitative Analysis. SSRN:5232579 (Section 8.4)
SSRN is bot-protected, so the PDF is not redistributed here; click through to the SSRN page.