Ten papers covering chart understanding, VLM blind spots, scientific images, document OCR, transcription benchmarks, and long-context behaviour. Two further references (ACM FAccT, SSRN) are link-only.
All PDFs link to raw.githubusercontent.com; clicking will download the file directly. Source links go to the canonical version on arXiv, the journal, or the publisher.
8.1 · What Multimodal AI Can See, Hear, and Read
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Vision Language Models Are Blind: Failing to Translate Detailed Visual Features into Words
8.2 · AI and Scientific Images
Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine
Efficient deep learning-based approach for malaria detection
8.3 · Document Intelligence
Benchmarking Large Language Models for Handwritten Text Recognition
olmOCR 2: Unit Test Rewards for Document OCR
8.4 · Transcription and Audio Analysis
Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Benchmarking Automatic Speech Recognition Models for African Languages
8.5 · Video and Multimodal Workflows
Lost in the Middle: How Language Models Use Long Contexts
8.6 · Hands-On Activities and Assessment
Hydroxychloroquine and chloroquine prophylaxis for COVID-19
Linked but not redistributed
Koenecke, A., et al. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. FAccT ’24. DOI:10.1145/3630106.3658975 (§8.4)
ACM Digital Library doesn’t expose a public PDF endpoint; click through or use ACM Open Access.
Friese, S. (2025). From Coding to Conversation: A New Methodological Framework for AI-Assisted Qualitative Analysis. SSRN:5232579 (§8.4)
SSRN is bot-protected; open the link in a browser to download.