I tested three AI models on a simple tool-calling task in n8n. GPT-5, the industry's most hyped model, needed 5 tool calls and 2.5 minutes to do what Claude and an open-source model did perfectly in 2 calls and a fraction of the time. The open-source model? It cost roughly 5x less than GPT-5 on this task. We’ll reveal the name of that model later in the article ;)
Here's the complete breakdown of my experiment, the results each model returned, and why this matters for anyone using AI in production workflows.
Quick note: This newsletter was written in 2 hours using the help of my Claude Code Newsletter Assistant - the same system I'm sharing with founding members of AI ContentLab OS.
If you're tired of spending 8+ hours on content creation, I'm building AI systems that maintain your voice while cutting writing time by 75%.
All of the free spots are taken, but the next 13 members are $5/month - then I increase the price. Watch me build the complete content pipeline in real-time and get every tool I create. Join fast and lock in your spot before the rate goes up.
The Experiment: Knowledge Graph Query via MCP Tools
I created an n8n workflow with a custom MCP (Model Context Protocol) server exposing 14 different tools. The task was straightforward: "Count the number of Red Scorpion Gang members residing in New York."
This required:
One tool call to understand the knowledge graph structure
One tool call to filter and count results
Any competent AI should identify the correct tool from the 14 available, formulate the query properly, and return the count. Let's see how each model performed.

The n8n workflow used for the experiment. The most important part is the Query MCP tool that we’ve been testing with different AI models.
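To make the two-call target concrete, here's a minimal sketch of what the ideal sequence could look like at the MCP protocol level. The tool names (`describe_graph`, `query_graph`) and argument shapes are hypothetical stand-ins for illustration, not my server's actual schema:

```typescript
// Hypothetical MCP "tools/call" payloads for the ideal two-call sequence.
// Tool names and argument shapes are illustrative, not the real server schema.

// Call 1: inspect the knowledge graph so the model learns which sets and links exist.
const call1 = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "describe_graph", // hypothetical tool name
    arguments: {},
  },
};

// Call 2: filter, join, and count in a single query instead of several partial ones.
const call2 = {
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: {
    name: "query_graph", // hypothetical tool name
    arguments: {
      set: "Gang Members",
      filter: { "Gang Name": "Red Scorpions" },
      follow: { link: "Name and ID", to: "People", filter: { City: "New York" } },
      aggregate: "count",
    },
  },
};
```

Anything beyond these two round trips is wasted tokens and wasted time, which is exactly where the models start to diverge.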
Model 1: GPT-5 - The Expensive Disappointment
GPT-5 struggled significantly with this task. Instead of the required 2 tool calls, it made 5, several of them redundant or incorrectly formatted.
Performance Metrics:
Tool calls made: 5 (vs 2 required)
Processing time: 2 minutes 30 seconds
Tokens consumed: 49,700 total (43,000 input / 6,700 output)
Cost: $0.122 (at $1.25/M input, $10/M output)
Result accuracy: Correct (eventually)
Actual Output Returned:
Answer: 15
How I queried it:
- Start from People and filter City = "New York"
- Follow link "Name and ID" to Gang Members
- Filter Gang Members by Gang Name = "Red Scorpions"
- Count the resulting records
The response was notably terse, offering just the bare minimum of information. While GPT-5 eventually got the correct answer, the journey was painful: wrong tools selected initially, multiple retries, and excessive token consumption.
Model 2: Claude Sonnet 4 - Fast but Expensive
Claude Sonnet 4 handled the task flawlessly, demonstrating what proper tool calling should look like.
Performance Metrics:
Tool calls made: 2 (perfect)
Processing time: 21 seconds
Tokens consumed: 39,400 total (38,900 input / 500 output)
Cost: $0.124 (at $3/M input, $15/M output)
Result accuracy: Correct
Actual Output Returned:
Based on my query of the knowledge graph, there are **15 members** of the Red Scorpions gang from New York city.
The query I executed:
1. Selected all People from New York city
2. Followed the "Name and ID" links to their Gang Members records
3. Filtered for those with Gang Name = "Red Scorpions"
4. Counted the results
This gives us the exact count of Red Scorpions gang members who are from New York city.
Claude's response was well-structured and explanatory. The execution was swift and efficient: exactly 2 tool calls, as required. The main drawback? At $3/M input and $15/M output tokens, it's the most expensive option per token.
Model 3: GLM-4.5 - The Open-Source Surprise
Here's where things get interesting. GLM-4.5, an open model from Zhipu AI that most developers haven't heard of, delivered outstanding results.
Performance Metrics:
Tool calls made: 2 (perfect)
Processing time: 40.7 seconds
Tokens consumed: 36,200 total (35,000 input / 1,200 output)
Cost: $0.024 (at $0.60/M input, $2.20/M output)
Result accuracy: Correct
Actual Output Returned:
Based on the knowledge graph description, I can see that there are two relevant sets for this query:
1. **Gang Members** - contains gang information including "Gang Name" attribute
2. **People** - contains personal information including "City" attribute
These sets are connected via a "Name and ID" link from People to Gang Members.
I executed the correct query by:
1. Starting with Gang Members and filtering for "Red Scorpions"
2. Following the "Name and ID" link to the People set
3. Filtering for people where City = "New York"
**Result: There are 15 members of the Red Scorpions gang from New York city.**
The members are:
1. Person 1
2. Person 2
3. Person 3
4. Person 4
5. Person 5
6. Person 6
7. Person 7
8. Person 8
9. Person 9
10. Person 10
11. Person 11
12. Person 12
13. Person 13
14. Person 14
15. Person 15
All 15 individuals are confirmed Red Scorpions gang members residing in New York.
GLM-4.5 provided the most comprehensive response, including all member names and a detailed explanation of its query logic. According to benchmarks, GLM-4.5 achieves a 90.6% tool calling success rate, outperforming Claude's 89.5%.
The Numbers Don't Lie: Complete Comparison
Here's the full breakdown of how each model performed:

| Metric | GPT-5 | Claude Sonnet 4 | GLM-4.5 |
| --- | --- | --- | --- |
| Tool calls (2 required) | 5 | 2 | 2 |
| Processing time | 2 min 30 s | 21 s | 40.7 s |
| Tokens (input / output) | 43,000 / 6,700 | 38,900 / 500 | 35,000 / 1,200 |
| Cost per run | $0.122 | $0.124 | $0.024 |
| Result accuracy | Correct (eventually) | Correct | Correct |

Processing Time vs # Tool Calls

Processing Time vs Cost. GLM-4.5 clearly wins here.
💰 Potential Savings with GLM-4.5
80% Cost Reduction
If we ran this workflow 1,000 times daily (a modest volume for production workflows), GLM-4.5 would save about $0.10 per run versus GPT-5 or Claude Sonnet 4. That works out to roughly $100 a day, or close to $3,000 a month, with no degradation in output quality or reliability!
This is huge, and it came as a big surprise to me!
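For transparency, here's the arithmetic behind those numbers: a quick sketch using the token counts and list prices from this experiment. The run volume is an assumption about your own workload:

```typescript
// Cost math from the experiment's token counts and list prices.
// Prices are per million tokens; run volume is an assumed production workload.

type Run = { inputTokens: number; outputTokens: number; inPrice: number; outPrice: number };

const costPerRun = ({ inputTokens, outputTokens, inPrice, outPrice }: Run): number =>
  (inputTokens / 1e6) * inPrice + (outputTokens / 1e6) * outPrice;

const gpt5 = costPerRun({ inputTokens: 43_000, outputTokens: 6_700, inPrice: 1.25, outPrice: 10 });
const claude = costPerRun({ inputTokens: 38_900, outputTokens: 500, inPrice: 3, outPrice: 15 });
const glm45 = costPerRun({ inputTokens: 35_000, outputTokens: 1_200, inPrice: 0.6, outPrice: 2.2 });

console.log(gpt5.toFixed(3), claude.toFixed(3), glm45.toFixed(3)); // ≈ 0.121, 0.124, 0.024

const runsPerDay = 1_000; // assumption: your volume will differ
const dailySavings = (gpt5 - glm45) * runsPerDay; // ≈ $97/day, roughly $2,900/month
```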
Why Tool Calling Performance Matters
Tool calling isn't just another benchmark; it's how AI interacts with the real world. Poor tool calling means:
Slower automation workflows
Higher operational costs
Increased error rates
Frustrated users waiting for responses
When your AI can't efficiently query databases, call APIs, or interact with external systems, it's just an expensive chatbot. Research confirms that "LLMs are notoriously BAD at choosing the right tool from many options", but clearly, some are worse than others.
Practical Implications
For Current GPT-5 Users: If you're using GPT-5 primarily for tool-calling workflows, you're overpaying for underperformance. Consider testing alternatives on your specific use cases.
For Claude Code Subscribers: At $200/month, your subscription costs as much as roughly 8,000 GLM-4.5 runs like this one ($200 ÷ $0.024 per run). OpenCode, a free alternative, supports GLM-4.5 integration.
For n8n Automation Builders: GLM-4.5 integrates seamlessly with n8n's MCP tools. You can switch models in minutes and potentially save thousands annually.
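If you want to sanity-check the swap outside of n8n first, here's a minimal sketch of a tool-calling request against Zhipu's OpenAI-compatible chat completions endpoint. The base URL and model name are what Zhipu documents at the time of writing (verify against their current docs), and the query_graph tool definition is a hypothetical stand-in for whatever your MCP server exposes:

```typescript
// Minimal tool-calling request to GLM-4.5 via an OpenAI-compatible endpoint.
// Assumption: base URL per Zhipu's docs at the time of writing; check before use.
// The "query_graph" tool is a hypothetical stand-in for your MCP-exposed tool.

const BASE_URL = "https://open.bigmodel.cn/api/paas/v4";

const response = await fetch(`${BASE_URL}/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.ZHIPU_API_KEY}`,
  },
  body: JSON.stringify({
    model: "glm-4.5",
    messages: [
      { role: "user", content: "Count the Red Scorpions gang members residing in New York." },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "query_graph", // hypothetical: mirror your MCP server's tool schema here
          description: "Filter, join, and count records in the knowledge graph",
          parameters: {
            type: "object",
            properties: {
              set: { type: "string" },
              filter: { type: "object" },
              aggregate: { type: "string", enum: ["count", "list"] },
            },
            required: ["set"],
          },
        },
      },
    ],
  }),
});

const data = await response.json();
console.log(data.choices[0].message.tool_calls); // the model's proposed tool call(s)
```

Because the payload shape is OpenAI-compatible, switching providers is mostly a matter of changing the base URL and model name, which is exactly why swapping models inside an n8n workflow takes minutes rather than days.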
The Verdict
After running this experiment, the conclusion is clear: for tool-calling workflows, GLM-4.5 offers the best combination of performance, cost, and output quality. It's not about abandoning GPT-5 or Claude entirely - they may excel at other tasks. But for the critical function of tool calling in automated workflows, the data speaks for itself.
The next time someone tells you that you need to pay premium prices for quality AI, show them this data. Sometimes the best solution isn't the most expensive or the most hyped - it's the one that actually works.
Until the next one,
Luke