GPT-4.1
GPT-4.1 is a large language model in OpenAI's GPT series, released on April 14, 2025. It can be accessed through the OpenAI API and the OpenAI Developer Playground. Three models were released simultaneously: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.
Overview
All three models have a context window of 1 million tokens and a knowledge cutoff of June 2024.
The models were tested on numerous benchmarks. Academic knowledge benchmarks included the 2024 AIME, GPQA, and MMLU. Coding benchmarks included SWE-bench and SWE-Lancer. Instruction-following benchmarks included COLLIE and IFEval. Vision benchmarks included MMMU (answering questions about images), MathVista (solving vision-related mathematical tasks), and CharXiv (answering questions about charts from research papers). Long-context benchmarks included two new benchmarks introduced by OpenAI: "multi-round coreference", in which the model must retrieve the i-th instance of a request from a long synthetic conversation generated by GPT-4o, and "Graphwalks", which requires the model to perform a breadth-first search over a graph supplied in the prompt.
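The Graphwalks task asks the model to carry out, entirely in-context, the kind of traversal that ordinary code would perform directly. As a rough illustration only (the benchmark's exact prompt and answer format are not described here, and the edge-list representation below is an assumption), a breadth-first search over a directed graph looks like this:

```python
from collections import deque  # deque is the usual BFS queue; a set-based frontier is used below


def bfs_nodes_at_depth(edges, start, depth):
    """Return the set of nodes exactly `depth` hops from `start` in a
    directed graph given as a list of (src, dst) edges.
    Illustrative only; the actual Graphwalks task format may differ."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    frontier = {start}
    visited = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for neighbour in adjacency.get(node, []):
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return frontier


# Tiny example graph; a long-context benchmark embeds a far larger edge list in the prompt.
print(bfs_nodes_at_depth([("a", "b"), ("b", "c"), ("a", "d")], "a", 2))  # -> {'c'}
```

In the benchmark, the graph is embedded in a very long prompt and the model must produce the resulting node set as text, without executing any code.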
The models received additional training on tool calling, and the OpenAI Cookbook accordingly recommends passing tool definitions exclusively through the API's tools field rather than injecting tool descriptions into the prompt. The models are also trained to follow instructions more literally, which makes them more steerable.
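As an illustrative sketch of that recommendation (assuming the OpenAI Python SDK and the Chat Completions endpoint; the get_weather function is a hypothetical example), a tool is declared through the tools field rather than described in the prompt text:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The tool is declared in the `tools` field instead of being described in the prompt.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1",  # model identifier as listed in OpenAI's API documentation
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model decides to use the tool, the response carries a structured tool call
# rather than free text describing the call.
print(response.choices[0].message.tool_calls)
```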
Reception
The Verge described GPT-4.1's release as "mark[ing] a pivot in the company's release schedule". HackerNoon praised the model as "a HUGE win for developers" and stated that it challenged the advantages of Gemini 2.5 Pro (a longer context window) and Claude 3.7 Sonnet (strong reasoning capabilities). Zvi Mowshowitz described GPT-4.1 mini as an "excellent practical model", but criticized OpenAI for not doing enough safety testing, saying that he "hate[s] the precedent this sets".
Two research teams, one led by Oxford University researcher Owain Evans and the other based at the AI red-teaming startup SplxAI, independently found evidence that GPT-4.1 could be more misaligned than GPT-4o.