How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

June 3, 2025

Table of Contents

  • What is Deep Research Bench?
  • Agent Architecture: ReAct and RetroSearch
  • Which AI agents perform best?
  • Where do agents struggle?
  • How about memory-based performance?
  • Final thoughts

As large language models (LLMs) evolve rapidly, so does their promise as powerful research assistants. Increasingly, they are asked to do more than answer simple factual questions: they take on “deep research” tasks that involve multi-step reasoning, weighing conflicting information, sourcing data from across the web, and synthesizing it all into a coherent output.

This emerging capability is now marketed under different brand names by the major labs: OpenAI calls it “Deep Research,” Anthropic calls it “Extended Thinking,” Google’s Gemini offers a “Search + Pro” feature, and Perplexity labels it “Pro Search” or “Deep Research.” But how effective are these offerings in practice? A new report from FutureSearch, which evaluates web research agents with its Deep Research Bench (DRB), provides the most rigorous assessment to date, and the results reveal both impressive capabilities and important shortcomings.

What is Deep Research Bench?

Created by the FutureSearch team, Deep Research Bench is a meticulously constructed benchmark designed to assess how well AI agents perform multi-step, web-based research tasks. These are not simple questions with simple answers; they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 different tasks across eight categories, including:

  • Find Number. Example: “How many FDA Class II medical device recalls have there been?”
  • Validate Claim. Example: “Is ChatGPT 10x more energy-intensive than Google Search?”
  • Compile Dataset. Example: “Job trends for US software developers from 2019 to 2023”

Each task type is carefully constructed with human-validated answers and evaluated against a frozen dataset of scraped web pages known as RetroSearch. This ensures consistency across model evaluations by avoiding the variability of the live web.
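
To make the setup concrete, here is a minimal sketch of how a benchmark task and its frozen evaluation context could be represented in code. The field names, helper types, and example values are illustrative assumptions, not FutureSearch’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DRBTask:
    """Illustrative stand-in for a Deep Research Bench task record (assumed schema)."""
    task_type: str                 # e.g. "find_number", "validate_claim", "compile_dataset"
    prompt: str                    # the research question given to the agent
    human_validated_answer: str    # reference answer produced and checked by humans
    frozen_page_ids: list[str] = field(default_factory=list)  # frozen pages the agent may consult

# Hypothetical example in the spirit of the "Find Number" category; values are placeholders.
task = DRBTask(
    task_type="find_number",
    prompt="How many FDA Class II medical device recalls have there been?",
    human_validated_answer="<human-verified count>",
    frozen_page_ids=["fda_recalls_p1", "fda_recalls_p2"],
)
```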

Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench is the ReAct architecture, short for “Reason + Act.” The method mimics how a human researcher tackles a problem: think through the task, take an action such as a web search, observe the result, and then decide whether to iterate or conclude.
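
In code, the ReAct pattern reduces to a loop that alternates between model reasoning and tool use. The sketch below is a generic, simplified version assuming hypothetical `llm_reason` and `run_tool` helpers; it is not DRB’s actual harness.

```python
def react_agent(task: str, llm_reason, run_tool, max_steps: int = 10) -> str:
    """Minimal ReAct ('Reason + Act') loop: think, act, observe, repeat.

    `llm_reason` and `run_tool` are assumed callables: the first asks the model
    for its next thought and action, the second executes a tool such as a web search.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm_reason("\n".join(transcript))      # expects keys: thought, action, action_input, answer
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":                # the model decides it has enough evidence
            return step["answer"]
        observation = run_tool(step["action"], step["action_input"])  # e.g. a search or page fetch
        transcript.append(f"Observation: {observation}")
    return "No answer reached within the step budget."
```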


While earlier models follow this loop explicitly, newer “thinking” models often streamline the process, embedding reasoning into their actions more fluidly. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom-built, static version of the web. Rather than relying on the ever-changing live internet, agents query a curated archive of web pages scraped with tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as gathering evidence, RetroSearch provides access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
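
Conceptually, RetroSearch replaces the live web with a time-frozen index, so a “search” becomes a lookup over pre-scraped pages. The snippet below sketches that idea with a toy in-memory archive; the class and its methods are an invented illustration, and the real system is vastly larger and more sophisticated.

```python
class FrozenWebArchive:
    """Toy stand-in for a RetroSearch-style frozen corpus of scraped pages."""

    def __init__(self, pages: dict[str, str]):
        # pages maps a URL to the page text captured at scrape time
        self.pages = pages

    def search(self, query: str, top_k: int = 5) -> list[str]:
        """Naive keyword match over the frozen pages; every run sees the
        same corpus, so evaluations stay reproducible."""
        terms = query.lower().split()
        scored = [
            (sum(t in text.lower() for t in terms), url)
            for url, text in self.pages.items()
        ]
        return [url for score, url in sorted(scored, reverse=True)[:top_k] if score > 0]

    def fetch(self, url: str) -> str:
        return self.pages.get(url, "")  # no live network call is ever made
```

Because every agent run queries the same frozen pages, differences in scores reflect the agents themselves rather than day-to-day changes on the web.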

Which AI agents perform best?

Of all the contenders, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on Deep Research Bench. While that may sound modest, it is important to understand how difficult the benchmark is: because of ambiguity in task definitions and scoring, even a flawless agent would likely only reach about 0.8, a threshold the researchers call the “noise ceiling.” In other words, even today’s best models still fall short of a well-informed, methodical human researcher.

Still, the leaderboard offers clear insight. o3 not only led the pack but did so with speed and consistency, performing strongly on nearly every task type. Anthropic’s Claude 3.7 Sonnet followed closely, showing versatility in both its “thinking” and “non-thinking” modes. Gemini 2.5 Pro, Google’s flagship model, stood out for its ability to handle tasks that required structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 kept pace with GPT-4 Turbo, a welcome surprise that narrows the performance gap between open and closed models.

A clear pattern emerged across the board: newer “thinking-enabled” models consistently outperformed their earlier counterparts, and closed models maintained a noticeable edge over open-weight alternatives.


Where do agents struggle?

Reading through the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating issues I have personally encountered is when an AI agent simply forgets what we are doing, especially during long research or content-creation sessions. As the context window grows, the model starts to lose the thread: key details fade, the goal becomes muddled, and suddenly the responses feel disjointed or aimless. At some point, I have learned it is better to cut my losses and start from scratch, even if that means throwing away everything generated so far.

That kind of forgetfulness is not merely anecdotal; it was the single strongest predictor of failure in the Deep Research Bench evaluation. And it is not the only recurring problem. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others craft their queries poorly rather than thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions, delivering half-formed answers that technically check the box but lack real insight.
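
One simple guard against the repetitive-search failure mode is to track recent tool calls and refuse to reissue an identical query. This is a hypothetical mitigation sketch, not something the report prescribes:

```python
def should_block_repeat(call_history: list[tuple[str, str]],
                        tool: str, query: str, window: int = 5) -> bool:
    """Return True if the agent already issued this exact tool call within the
    last `window` steps -- a cheap heuristic against the repetitive-search
    failure mode described in the report."""
    return (tool, query) in call_history[-window:]
```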

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a marked tendency to forget earlier steps, while DeepSeek-R1 was more prone to hallucinating plausible-sounding (but incorrect) information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who relies on AI for serious work, these issues will feel all too familiar, and they underscore how far we still are from building agents that can truly think and research like humans.

How about memory-based performance?

Interestingly, Deep Research Bench also evaluated what it calls “toolless” agents: models that operate without access to external tools such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they learned during training. In practice, this means they cannot look anything up or verify information; they guess based on what they “remember.”
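
In terms of the ReAct sketch earlier, a toolless agent is simply the same loop with no working tools, so every action falls back to the model’s own memory. A hypothetical illustration:

```python
def no_tools(action: str, action_input: str) -> str:
    """Stand-in tool runner for a 'toolless' agent: every tool call fails,
    so the model must answer from its parametric memory alone."""
    return "No external tools are available; answer from memory."

# Hypothetical usage with the earlier react_agent sketch:
# answer = react_agent("Is ChatGPT 10x more energy-intensive than Google Search?",
#                      llm_reason, run_tool=no_tools)
```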


Surprisingly, these toolless agents performed nearly as well as full research agents on certain tasks. On the claim-validation task, for instance, where the goal is to assess the plausibility of a statement, they scored 0.61, almost matching the 0.62 average of tool-enabled agents. This suggests that models such as o3 and Claude carry strong internal priors and can often recognize the truthfulness of common claims without searching the web.

On more demanding tasks, however, such as those that require deriving a number by linking values from multiple sources, or gathering and weighing diverse pieces of evidence in context, they fell apart completely. Without fresh information or real-time search capabilities, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: today’s LLMs may “know” a great deal, but deep research depends on reasoning over up-to-date, verifiable information, not just recall.

Final thoughts

The DRB report makes one thing clear: today’s best AI agents can outperform the average person on narrowly defined tasks, but they still lag behind skilled generalist researchers, especially when it comes to planning strategically.

The gap becomes especially evident during long or complex sessions. I have experienced it firsthand: an agent gradually loses track of the task objective, leading to a frustrating breakdown in coherence and usefulness.

What makes Deep Research Bench so valuable is that it tests not just surface-level knowledge but the intersection of tool use, memory, reasoning, and adaptation, making it far closer to real research than benchmarks such as MMLU or GSM8K.

As LLMs become increasingly integrated into serious knowledge work, tools like DRB from FutureSearch will be essential for assessing not only what these systems know, but how well they actually work.
