Cartography

  1. A benchmark of expert-level academic questions to assess AI capabilities
    The HLE benchmark reveals a substantial gap between the performance of state-of-the-art LLMs and that of human experts on 2,500 closed-ended academic questions spanning mathematics, the humanities, and the natural sciences.