As large language models (LLMs) evolve rapidly, so does their promise as research assistants. Increasingly, they do more than answer simple factual questions: they handle multi-step reasoning, weigh conflicting information, gather data from across the web, and synthesize it all into coherent output in so-called “deep research” tasks.
This capability is now marketed under a variety of brand names by the major labs: OpenAI calls it “Deep Research,” Anthropic calls it “Extended Thinking,” Google’s Gemini offers a “Search + Pro” feature, and Perplexity labels it “Pro Search” or “Deep Research.” But how effective are these offerings in practice? A new report from FutureSearch, Deep Research Bench (DRB), evaluates web research agents and provides the most rigorous assessment to date, with results that reveal both impressive capabilities and important shortcomings.
What is Deep Research Bench?
Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess how well AI agents perform multi-step, web-based research tasks. These are not simple questions with simple answers; they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.
The benchmark includes 89 different tasks across eight categories, including:
- Find Number. Example: “How many recalls have there been for FDA Class II medical devices?”
- Validate Claim. Example: “Is ChatGPT 10x more energy-intensive than Google Search?”
- Compile Dataset. Example: “US software developer employment trends from 2019 to 2023”
Each task type is carefully constructed with human-validated answers and evaluated against a frozen dataset of scraped web pages known as RetroSearch. This ensures consistency across model evaluations and avoids the variability of the live web.
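For readers who like to see structure, here is a minimal sketch of how a DRB-style task and its scoring might be represented in Python. The field names, category labels, and exact-match scorer are illustrative assumptions, not the benchmark’s actual schema or grading logic.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names and scoring are assumptions,
# not the actual Deep Research Bench schema or grader.
@dataclass
class ResearchTask:
    category: str          # e.g. "find_number", "validate_claim", "compile_dataset"
    prompt: str            # the research question posed to the agent
    reference_answer: str  # human-validated answer used for scoring

def score_answer(task: ResearchTask, agent_answer: str) -> float:
    """Toy scorer: 1.0 for an exact (case-insensitive) match, 0.0 otherwise.
    The real benchmark uses more nuanced, task-specific grading."""
    return 1.0 if agent_answer.strip().lower() == task.reference_answer.strip().lower() else 0.0

task = ResearchTask(
    category="validate_claim",
    prompt="Is ChatGPT 10x more energy-intensive than Google Search?",
    reference_answer="no",
)
print(score_answer(task, "No"))  # 1.0
```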
Agent Architecture: ReAct and RetroSearch
At the heart of Deep Research Bench is the ReAct architecture, short for “Reason + Act.” The method mimics how a human researcher tackles a problem: think through the task, take an action such as running a web search, observe the results, then decide whether to iterate or conclude.
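To make that loop concrete, here is a minimal ReAct-style sketch in Python. The `llm` and `web_search` callables are hypothetical placeholders, and the text-based action protocol is a simplification; production agents use structured tool calls, output parsing, and guardrails.

```python
# Minimal ReAct-style loop (sketch). `llm` and `web_search` are hypothetical
# placeholders, not part of Deep Research Bench itself.
def react_agent(question: str, llm, web_search, max_steps: int = 10) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # 1. Reason: ask the model for its next thought and action.
        step = llm(transcript + "Thought + Action (SEARCH: <query> or FINISH: <answer>):")
        transcript += step + "\n"

        # 2. Act: either run a web search or stop with a final answer.
        if "FINISH:" in step:
            return step.split("FINISH:", 1)[1].strip()
        if "SEARCH:" in step:
            query = step.split("SEARCH:", 1)[1].strip()
            observation = web_search(query)
            # 3. Observe: feed results back so the next thought can use them.
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```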
While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning into their actions more fluidly. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom-built, frozen version of the web. Rather than relying on the ever-changing live internet, agents query a curated archive of web pages scraped with tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as “Gather Evidence,” RetroSearch provides access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
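Conceptually, a frozen web corpus can be served through a very simple search interface: queries hit an archived snapshot instead of the live internet, so every run sees identical data. The sketch below illustrates that idea under assumed names; it is not FutureSearch’s actual RetroSearch implementation.

```python
# Sketch of a "frozen web" search tool: queries hit an archived snapshot
# instead of the live internet. Conceptual illustration only, not
# FutureSearch's actual RetroSearch implementation.
class FrozenWebSearch:
    def __init__(self, archive: dict[str, str]):
        # archive maps URL -> page text, scraped once and never updated
        self.archive = archive

    def search(self, query: str, top_k: int = 5) -> list[str]:
        """Return URLs of archived pages that mention the most query terms."""
        terms = query.lower().split()
        ranked = sorted(
            self.archive.items(),
            key=lambda item: sum(t in item[1].lower() for t in terms),
            reverse=True,
        )
        return [url for url, _ in ranked[:top_k]]

    def fetch(self, url: str) -> str:
        """Return the frozen page content, so every run sees identical data."""
        return self.archive.get(url, "")
```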
Which AI agents perform best?
Of all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on Deep Research Bench. While that may sound modest, it helps to understand the benchmark’s difficulty: because of ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8, what the researchers call the “noise ceiling.” In other words, even today’s best models still fall short of a well-informed, methodical human researcher.
Still, the leaderboard offers clear insight. o3 not only led the pack but did so with speed and consistency, performing strongly on nearly every task type. Anthropic’s Claude 3.7 Sonnet followed closely, showing versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 kept pace with GPT-4 Turbo, a welcome surprise that narrows the performance gap between open and closed models.
A clear pattern emerged across the board: newer “thinking” models consistently outperformed their earlier counterparts, and closed models maintained a marked edge over open-weight alternatives.
Where do agents struggle?
Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating issues I have personally encountered is when an AI agent simply forgets what we are working on, especially during long research and content-creation sessions. As the context window fills, the model starts to lose the thread: key details fade, the goal gets muddled, and suddenly the responses feel disjointed and off-target. At some point, I learned it is better to cut my losses and start from scratch, even if that means throwing away everything generated so far.
That kind of forgetfulness is not merely anecdotal; it was the single biggest predictor of failure in the Deep Research Bench evaluation. Nor is it the only recurring problem. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others craft poor queries rather than thinking critically about how to search effectively. And too often, agents jump to premature conclusions, delivering half-formed answers that technically tick the box but offer no real insight.
Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget earlier steps, while DeepSeek-R1 was more likely to hallucinate, inventing plausible-sounding but incorrect information. Across the board, models frequently failed to cross-check sources or validate their findings before finalizing output. For anyone who relies on AI for serious work, these issues will feel all too familiar, and they underscore how far we still are from agents that can truly think and research like humans.
What about memory-based performance?
Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents, models that operate without access to external tools such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they learned during training. In practice, that means they cannot look anything up or verify information; they guess based on what they “remember.”
Surprisingly, these toolless agents performed almost as well as full research agents on certain tasks. On the Validate Claim task, where the goal is to assess the plausibility of a statement, they scored 0.61, nearly matching the 0.62 average of tool-enabled agents. This suggests that models such as o3 and Claude often carry strong internal priors and can recognize the truth or falsehood of common claims without searching the web.
On the more demanding tasks, however, where multiple values must be pieced together from different sources, or where diverse evidence must be found and weighed in context, toolless agents fell apart completely. Without fresh information or real-time search capabilities, they simply lacked the means to produce accurate or comprehensive answers.
This contrast highlights an important nuance: today’s LLMs can “know” a great deal, but deep research depends on reasoning over up-to-date, verifiable information, not just recall.
Final thoughts
The DRB report makes one thing clear: today’s best AI agents can outperform the average person on narrowly defined tasks, but they still lag behind skilled generalist researchers, especially when it comes to strategic planning.
The gap becomes especially evident in long or complex sessions. I have experienced it firsthand: agents gradually lose track of the task objective, leading to a frustrating breakdown in coherence and usefulness.
What makes Deep Research Bench so valuable is that it tests more than surface-level knowledge: it probes the intersection of tool use, memory, reasoning, and adaptation, resembling real research far more closely than benchmarks such as MMLU or GSM8K.
As LLMs become more deeply integrated into serious knowledge work, evaluation tools like FutureSearch’s DRB will be essential for assessing not just what these systems know, but how well they actually work.