Pulse24 Tuesday Briefing
Edition #41 · May 5–11, 2026 · Read time ~8 min
Live · 11 May 2026
Tuesday Briefing · 5 stories · 4 signals

Million Tokens on a MacBook

Six months ago, running a frontier AI model on your own machine was a research curiosity. This week, the creator of Redis shipped million-token inference on a MacBook, and Big Tech's AI infrastructure bill reached nearly three quarters of a trillion dollars. The question is no longer whether local AI works. It is which workloads still justify the cloud round trip.

Published: 11 May 2026
Coverage: 4 May 2026 – 11 May 2026
Stories tracked: 36
Featured: 5
Author: Pulse24 Desk
Last updated: 11 May 2026
This week’s pulse

The minimum hardware needed to run powerful AI models locally dropped again this week. Antirez released ds4, an engine that runs DeepSeek V4 Flash, including its full million-token context window, on a MacBook. In the same seven days, Sakana AI shipped a small model that orchestrates larger ones and claims 83% cost savings, OpenAI released real-time voice models that collapse three separate services into one, and Cloudflare cut 1,100 jobs citing AI automation. The gap between what you can run on your own hardware and what requires a cloud subscription keeps narrowing, while the bill for the cloud side keeps climbing.

01

The Six-Month Compression

Six months ago, running a serious AI model required data centre hardware. Most procurement and strategy teams have not caught up with what happened next.

Since then, the floor has dropped every month. In February, a $600 consumer GPU ran 70-billion-parameter models. In March, Google halved inference memory requirements and a full voice assistant ran locally on a Mac. By April, frontier models were practical on everyday MacBooks and running natively on iPhones. Now ds4 extends the trend to a million-token context window on a laptop: roughly enough to hold an entire codebase or a shelf of documents in a single query.
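To get a feel for what a million tokens holds, a common rule of thumb is about four characters of English text or code per token, though real tokenizers vary by language and coding style. The minimal Python sketch below estimates whether a repository fits in a 1,000,000-token window under that assumption. The 4.0 ratio, the file suffixes, and the helper names are illustrative choices, not anything ds4 exposes.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4.0      # rough rule of thumb (assumption); real tokenizers vary
CONTEXT_TOKENS = 1_000_000  # the million-token window discussed above

def estimate_tokens(texts) -> int:
    """Approximate token count for an iterable of strings."""
    return round(sum(len(t) for t in texts) / CHARS_PER_TOKEN)

def read_sources(root: str, suffixes=(".py", ".ts", ".go", ".rs", ".c", ".h")):
    """Yield the text of every source file under root (suffix list is illustrative)."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            yield path.read_text(encoding="utf-8", errors="ignore")

def fits_in_context(texts) -> bool:
    """True if the estimated token count fits in a million-token window."""
    return estimate_tokens(texts) <= CONTEXT_TOKENS
```

For example, `fits_in_context(read_sources("path/to/repo"))` gives a first-pass answer for a hypothetical repository path; only the model's own tokenizer gives a reliable count.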

02

What antirez built

Salvatore Sanfilippo — known as antirez, creator of the database engine Redis — built ds4 to run one specific model, DeepSeek V4 Flash, on Apple's hardware. It requires a MacBook with 128GB of memory and uses aggressive compression to fit what would normally demand cloud infrastructure onto a laptop. It builds on foundational open-source work and follows DeepSeek V4's launch two weeks earlier, which challenged the pricing assumptions of every major AI provider.
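Why 128GB, and why compression? Model weights alone need roughly parameter count × bits per weight ÷ 8 bytes. The sketch below runs that arithmetic for a hypothetical 200-billion-parameter model on a 128GB machine, with a hypothetical 24GB held back for the operating system and the context cache. DeepSeek V4 Flash's real size and ds4's actual compression scheme are not stated in this briefing, so treat the numbers as illustrative only.

```python
GIB = 2**30  # RAM is sized in binary units, so a "128GB" Mac has 128 GiB

def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params * bits_per_weight / 8 / GIB

def fits(params: float, bits_per_weight: float,
         machine_gib: float = 128, reserved_gib: float = 24) -> bool:
    """True if the weights fit after reserving memory for the OS and context cache.

    reserved_gib is a hypothetical allowance; the real cache size for a
    million-token context depends on the model's attention design.
    """
    return weight_gib(params, bits_per_weight) <= machine_gib - reserved_gib

PARAMS = 200e9  # hypothetical parameter count, not DeepSeek V4 Flash's published size
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_gib(PARAMS, bits):6.1f} GiB, fits: {fits(PARAMS, bits)}")
```

Under these assumptions, 16-bit and 8-bit weights overflow the budget while 4-bit weights fit, which is why aggressive compression is the price of admission on a laptop.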

It is a narrow tool — one model, one hardware platform — but it is a proof point for what is now possible without a cloud contract.

03

Why the timing matters

This technical compression is landing against a stark economic backdrop. Big Tech's combined AI infrastructure spend hit $725 billion this year, the Financial Times reported on 8 May, pushing the collective free cash flow of Amazon, Google, Microsoft, and Meta to a decade low. Downstream, the cost pressure is already visible: "tokenmaxxing" (engineers maximising their use of AI coding assistants and chat tools) has driven monthly AI bills for individual engineers as high as $150,000 at some firms, according to the Times of India.

When infrastructure spending runs this far ahead of revenue, the pressure to recoup costs shifts to customers. That means tighter free tiers, more aggressive bundling, or straight price increases on the APIs that many products now depend on.

Running AI locally is no longer just about data privacy. For any team paying serious cloud AI bills, it is becoming a straightforward cost decision — and ds4 means that calculation now extends to the kind of long-context work that, until this week, had no local option at all.
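That cost decision can be framed as a payback period: hardware cost divided by the monthly API spend a local machine removes, net of its own running costs. A minimal sketch follows. Every figure is a hypothetical placeholder except the £3,500 machine price, which is the figure this edition cites for a 128GB MacBook; the bill, the share of work moved local, and the running cost should be swapped for your own.

```python
import math

def months_to_break_even(
    hardware_cost: float,
    monthly_api_bill: float,
    share_moved_local: float,
    monthly_local_running_cost: float = 0.0,
) -> float:
    """Months until a local machine pays for itself; math.inf if it never does."""
    if not 0.0 <= share_moved_local <= 1.0:
        raise ValueError("share_moved_local must be between 0 and 1")
    # API spend avoided each month, minus what the local machine costs to run.
    monthly_saving = monthly_api_bill * share_moved_local - monthly_local_running_cost
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving

# Hypothetical inputs: £3,500 machine, £1,000/month API bill,
# 60% of that work moved local, £50/month for power and upkeep.
payback = months_to_break_even(3500, 1000, 0.6, 50)
print(f"Payback in {payback:.1f} months")
```

With those placeholder numbers the machine pays for itself in a little over six months. The result is most sensitive to the share of work that can actually move local, which is the variable long-context local inference expands.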

04

The counter-case

The trend is real, but the limits matter. A 128GB MacBook is a £3,500+ machine — most teams issue laptops with a quarter of that memory. The compression ds4 uses to fit the model involves accuracy trade-offs that may not be acceptable for every use case. And it runs exactly one model, not a menu. For teams that need flexibility across multiple AI providers or run standard-spec hardware, cloud APIs remain the practical default.

What would reverse this trajectory: a new wave of AI models so large or complex that they push back beyond what consumer hardware can handle. That has not happened yet.

---

05

Quick Hits

Several of this week's other stories reinforce the same cost-and-control tension that runs through the editorial above.

Sakana AI released RL Conductor, a 7B-parameter model that learned to route queries across GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, sending easy tasks to cheaper models and hard tasks to expensive ones. Sakana claims an 83% reduction in token costs compared with static routing. It attacks the same problem ds4 tackles from the hardware side: making AI cheaper to run.
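RL Conductor learns its routing policy with reinforcement learning, and the sketch below is not that method. It is a deliberately simple, hand-written threshold router that shows the underlying economics: send each task to the cheapest model whose capability covers its difficulty, and compare the bill against static routing that always calls the strongest model. The model names, prices, capability scores, and difficulty values are all hypothetical; estimating difficulty well is the hard part a learned router handles.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    price_per_1k_tokens: float  # hypothetical price, arbitrary units
    capability: float           # hypothetical 0-1 score

# Hypothetical tiers, listed from weakest to strongest.
MODELS = [
    Model("small", 0.1, 0.4),
    Model("medium", 0.5, 0.7),
    Model("large", 2.0, 1.0),
]

def route(difficulty: float) -> Model:
    """Return the cheapest model whose capability covers the task's difficulty."""
    for model in sorted(MODELS, key=lambda m: m.price_per_1k_tokens):
        if model.capability >= difficulty:
            return model
    return max(MODELS, key=lambda m: m.capability)  # nothing qualifies: use the strongest

def total_cost(tasks, chooser) -> float:
    """Total cost of (difficulty, tokens) tasks under a routing policy."""
    return sum(chooser(d).price_per_1k_tokens * tokens / 1000 for d, tokens in tasks)

# Hypothetical workload: (estimated difficulty, tokens) per task.
tasks = [(0.2, 2000), (0.3, 1000), (0.6, 1500), (0.9, 3000)]
static = total_cost(tasks, lambda d: MODELS[-1])  # always the strongest model
routed = total_cost(tasks, route)
saving = 1 - routed / static
print(f"static={static:.2f} routed={routed:.2f} saving={saving:.0%}")
```

In this toy mix the router spends 7.05 units against 15.00 for static routing, a 53% saving. The figure depends entirely on the invented inputs and says nothing about Sakana's 83% claim.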

OpenAI launched three real-time voice models for reasoning, translation, and transcription via its Realtime API, removing the need to stitch together separate speech-to-text, language model, and text-to-speech services. This is the cloud side of the argument: capabilities that are hard to replicate locally, designed to keep developers on the platform.

Cloudflare cut 1,100 employees — 20% of its workforce — citing a 600% increase in internal AI usage as the direct driver. It is the most explicit AI-to-headcount attribution from a public company this year, and a reminder that the cost pressure AI creates runs in both directions: infrastructure bills go up, headcount goes down.

Anthropic, OpenAI, and Google began embedding engineers directly inside enterprise customers, moving from model vendors to implementation partners. HCLTech reported 2-3% AI-driven revenue deflation as a result — the clearest sign yet that AI providers are competing not just for API calls, but for the services revenue around them.

---

📡 Signals

Worth tracking.

Markets
Moonshot AI secured $2 billion at a $20 billion valuation for its open-weight Kimi LLMs.
Finance
Big Tech's combined $725 billion AI infrastructure spend pushed aggregate free cash flow to a decade low, according to the Financial Times.
Risk
AI-driven analysis identified embargoed software vulnerabilities before disclosure windows closed, compressing traditional patch timelines.
Macro
OpenAI, AMD, and partners released Multipath Reliable Connection, an open protocol for improving GPU networking reliability in large AI training clusters.

📊 Pulse check

The week by the numbers.

Stories tracked: 36
Busiest category: Product (5)
Companies: Anthropic 5 · Google/Alphabet 5 · OpenAI 4

🔭 The longer view

Trust and predictability are the new constraint.

The editorial above tracks a shift that played out over six months. The question now is what comes next. Two forces are pulling in opposite directions. On one side, the tools and techniques for running AI locally keep getting better — and the hardware most knowledge workers already own keeps getting more capable. On the other, cloud providers are racing to justify their enormous infrastructure spending by building features that are genuinely hard to replicate on a laptop: real-time voice, multi-modal reasoning, AI agents that coordinate across tools and services.

The cloud bet is that complexity creates lock-in — even if you can run a model locally, you cannot easily replicate the integrated platform around it. The local bet is that the core capability — running AI against your own data, on your own hardware, with no round trip and no API bill — is the one that matters most.

Pulse24's read: both bets are partially right, but the trajectory favours local faster than most forecasts assume. If the compression rate of the past six months holds — and every month has delivered a new milestone — then by the end of 2026, the majority of single-model inference workloads in development environments will be runnable on hardware that already sits on engineers' desks. The cloud's defensible territory narrows to orchestration, scale-out training, and multi-modal pipelines that depend on platform integration. The metric to watch: the share of your team's AI queries that still require a round trip to an external API. If that number is not falling quarter over quarter, you are paying for convenience that the market is rapidly making optional.
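The metric above is easy to compute if your AI gateway or proxy logs whether each query went to an external API. A minimal sketch with hypothetical quarterly counts; the logging source and the numbers are assumptions, not data from this edition.

```python
def remote_share(remote_queries: int, total_queries: int) -> float:
    """Fraction of AI queries that needed a round trip to an external API."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    return remote_queries / total_queries

def falling_every_quarter(shares: list[float]) -> bool:
    """True if each quarter's remote share is strictly lower than the previous one."""
    return all(later < earlier for earlier, later in zip(shares, shares[1:]))

# Hypothetical (remote, total) query counts per quarter.
quarters = {
    "Q1": (8_200, 10_000),
    "Q2": (7_100, 10_400),
    "Q3": (6_500, 11_000),
}
shares = [remote_share(remote, total) for remote, total in quarters.values()]
for quarter, share in zip(quarters, shares):
    print(f"{quarter}: {share:.1%} of queries went to an external API")
print("Falling quarter over quarter:", falling_every_quarter(shares))
```

Here the share drops from 82.0% to 59.1% across three quarters, so the check passes; a flat or rising series is the signal to look harder at what the cloud spend is buying.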

---

Pulse24’s view

What is documented above is not a single product story. It is a structural shift that has played out across six months, driven by multiple independent teams. ds4 is the latest proof point, not the last. If your AI strategy assumes cloud by default, this is the week to question that assumption.

👁 Forward watch

What we’re watching next.

2 August 2026
Obligations for high-risk AI systems listed in Annex III of the EU AI Act take effect, setting the compliance deadline for deployers and providers in regulated domains. Source: EU AI Act, Article 113, Official Journal of the EU.
📚 References

Where this week’s evidence comes from.