Pulse24 Tuesday Briefing

Edition #44 · May 19 – June 1, 2026 · Read time ~7 min

Live · 1 Jun 2026

Chips, Code, and the Edge

This fortnight: memory reached 63% of AI chip component cost, vendors shipped tooling and models aimed at cutting inference bills, and agentic models began running on consumer hardware.

Published: 1 Jun 2026

Coverage: 18 May 2026 – 1 Jun 2026

Stories tracked: 80

Featured: 3

Author: Pulse24 Desk

Last updated: 1 Jun 2026

This week’s pulse

Eighty stories in fourteen days. Memory now accounts for 63% of AI chip cost, and several vendors shipped ways to cut inference bills on existing hardware. Microsoft and Uber pulled back on AI tooling spend, while Glean crossed $300M ARR selling token reduction. And new models and chips — Liquid.ai's 8B MoE, Nvidia's RTX Spark, a 200-TOPS RISC-V PC — moved agentic inference onto consumer hardware.

Inference cost is now a memory problem

What happened

Memory, not compute, is the dominant cost in AI inference: Epoch AI found HBM now accounts for 63% of AI chip component costs, up from 52% in 2024. The responses target that directly. Kog AI claims 3,000 tokens per second per request on 8× AMD MI300X GPUs — well beyond interactive speeds, on hardware teams already own — by optimising for memory bandwidth rather than FLOPS. XCENA raised $135 million for its MX1 chip, which integrates compute into DRAM via CXL and claims a tenfold reduction in inference servers.

So what

The cheapest route to more inference throughput now runs through memory, not faster compute, because memory bandwidth and capacity are where the bottleneck sits. Teams can therefore cut cost per token on GPUs they already run, and challengers can compete without beating Nvidia on raw compute.

The counter-case

Kog AI's benchmark used a 2B-parameter model in FP16, so frontier-scale gains are unproven, and XCENA's MX1 won't reach mass production until late 2026. These are vendor claims, not deployed results.

Related signals

Platform engineers and anyone who owns an inference bill — the people deciding what to run, and on what hardware.

Action

If you run inference at any scale, benchmark memory-bandwidth utilisation before buying more GPUs — a memory-optimised runtime may cut your cost per token on the hardware you already have.
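One way to frame that benchmark is a back-of-envelope bandwidth ceiling. The sketch below assumes single-request decoding of a dense model in which every generated token reads all weights from memory once; it ignores KV-cache traffic, batching, and MoE sparsity. The function names and the example figures (an 8B model in FP16, 1,000 GB/s of bandwidth, 25 measured tokens per second) are illustrative assumptions, not numbers from this edition.

```python
def decode_ceiling_tps(params_billion: float, bytes_per_param: float,
                       bandwidth_gb_s: float, num_devices: int = 1) -> float:
    """Bandwidth-bound upper limit on tokens/second for one request.

    Assumes each token requires one full read of the weights, and that
    aggregate bandwidth scales linearly across devices (optimistic).
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s * num_devices / weights_gb


def bandwidth_utilisation(measured_tps: float, ceiling_tps: float) -> float:
    """Fraction of the bandwidth ceiling that a measured run achieves."""
    return measured_tps / ceiling_tps


if __name__ == "__main__":
    # Hypothetical inputs: 8B parameters in FP16 (2 bytes), 1,000 GB/s, one device.
    ceiling = decode_ceiling_tps(8, 2, 1000)
    print(f"ceiling: {ceiling:.1f} tok/s")
    print(f"utilisation: {bandwidth_utilisation(25, ceiling):.0%}")
```

A low ratio suggests there may be headroom for a memory-optimised runtime on hardware you already own; a ratio close to 1 suggests the hardware itself is the limit.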

A cost-optimisation layer starts to form around developer tooling

What happened

Microsoft cut internal Claude Code licenses over rising costs and pushed developers to GitHub Copilot CLI, while Uber's COO said the company exhausted its 2026 AI budget in four months. A response is taking shape: Glean crossed $300M ARR, tripling in 15 months, by selling a "context graph" that cuts token consumption, and coverage of the cost squeeze reports enterprises shifting toward smaller, specialised, and open-source models. Even Anthropic's own Claude Code dynamic workflows, demonstrated by porting 750,000 lines in eleven days, run at a higher token cost than the static flows they replace.

So what

Context pruning, model routing, and open-source substitution are starting to become something buyers pay for rather than build, because token reduction now sells on its own — Glean tripled to $300M doing exactly that.

The counter-case

Part of Glean's $300M is annualised run-rate under consumption pricing, not booked ARR, so the demand signal is softer than the headline. And an optimisation layer adds its own integration overhead and lock-in — it may prove transitional if per-token prices fall fast enough.

Related signals

Engineering leads, CTOs, and procurement teams evaluating AI coding tool renewals.

Action

If you manage developer tooling budgets, instrument per-workflow token costs before your next renewal — you can't evaluate a routing or context layer without knowing which workflows actually burn the budget.
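Per-workflow instrumentation can start as a simple aggregation over usage logs. The following is a minimal sketch, assuming you can export one record per request with a workflow label and token counts; the `UsageRecord` fields, the workflow names, and the per-million-token prices are all hypothetical placeholders to replace with your own data and contract rates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    workflow: str       # hypothetical label, e.g. "code_review"
    input_tokens: int
    output_tokens: int


def cost_per_workflow(records, usd_per_m_input: float, usd_per_m_output: float):
    """Return {workflow: cost in USD}, most expensive workflow first."""
    totals = defaultdict(float)
    for r in records:
        totals[r.workflow] += (
            r.input_tokens * usd_per_m_input
            + r.output_tokens * usd_per_m_output
        ) / 1_000_000
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Made-up usage and prices, for illustration only.
    records = [
        UsageRecord("code_review", 2_000_000, 100_000),
        UsageRecord("code_review", 1_000_000, 50_000),
        UsageRecord("test_gen", 500_000, 200_000),
    ]
    for workflow, usd in cost_per_workflow(records, 3.0, 15.0).items():
        print(f"{workflow}: ${usd:.2f}")
```

Sorting by cost puts the workflows that dominate the bill at the top, which is where a routing or context layer would have to pay for itself.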

Agentic inference reaches consumer silicon

What happened

Liquid.ai released LFM2.5-8B-A1B, an 8B Mixture-of-Experts model with a 128K context window built for tool calling on consumer hardware, with day-one support across llama.cpp, MLX, vLLM, and SGLang. Nvidia launched its RTX Spark Superchip, pairing a Blackwell RTX GPU with a Grace CPU to run local agents on Windows PCs from Dell, HP, and others — its first consumer chip, which sent Qualcomm down 9.8% and Intel down 6%. Nanyang Singtech shipped a RISC-V dataflow AI PC rated at 200 TOPS with 128 GB of unified memory — several times what a typical consumer machine carries, and enough to run a 70B-parameter model locally.
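Whether a 70B-parameter model fits in 128 GB depends on the precision of the weights. The following is a rough sketch, assuming decimal gigabytes and a hypothetical 20% reserve for the KV cache, activations, and the operating system; neither the reserve nor the function names come from the vendor.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal GB."""
    return params_billion * bits_per_param / 8


def fits_in_memory(params_billion: float, bits_per_param: int,
                   memory_gb: float, reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit after holding back a reserve (an assumption)."""
    usable_gb = memory_gb * (1 - reserve_fraction)
    return weight_footprint_gb(params_billion, bits_per_param) <= usable_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_footprint_gb(70, bits)
        verdict = "fits" if fits_in_memory(70, bits, 128) else "does not fit"
        print(f"70B at {bits}-bit: {size:.0f} GB of weights, {verdict} in 128 GB")
```

Under these assumptions, 16-bit weights alone need 140 GB and exceed the machine's memory, while 8-bit (70 GB) and 4-bit (35 GB) weights fit, so the claim presumably refers to a quantised model.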

So what

On-device inference is moving from benchmark to shipping product, because compact MoE models and new consumer silicon now run agentic workloads locally, giving buyers a path that doesn't route every request through a cloud endpoint.

The counter-case

On-device models still trail frontier models on hard reasoning, and Liquid.ai's benchmark gains are vendor-reported. Apple's genai.apple.com subdomain ahead of WWDC signals on-device ambition, but its confirmed reliance on Google Gemini for some features shows the near-term reality is hybrid, not local-only.

Related signals

Product engineers, mobile and edge platform teams, and architects weighing latency- or privacy-sensitive workloads.

Action

If you ship a product with latency- or privacy-sensitive inference, pilot an 8B-class on-device model against your current cloud endpoint on the request subset that doesn't need frontier reasoning — measure quality and unit cost side by side.
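A side-by-side pilot can be summarised with two numbers per backend: how often responses meet your quality bar, and what each acceptable response costs. The sketch below assumes you have already graded each response as pass or fail and attributed a unit cost to it; the `PilotResult` fields and backend labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotResult:
    backend: str     # hypothetical label, e.g. "on_device" or "cloud"
    passed: bool     # did the response meet your quality bar?
    cost_usd: float  # unit cost attributed to this request


def summarise(results: list[PilotResult]) -> dict[str, dict[str, float]]:
    """Pass rate and cost per passing response, grouped by backend."""
    grouped: dict[str, list[PilotResult]] = {}
    for r in results:
        grouped.setdefault(r.backend, []).append(r)

    summary = {}
    for backend, rows in grouped.items():
        passes = sum(1 for r in rows if r.passed)
        total_cost = sum(r.cost_usd for r in rows)
        summary[backend] = {
            "pass_rate": passes / len(rows),
            # Infinite cost per pass if nothing passed, so it sorts last.
            "cost_per_pass_usd": total_cost / passes if passes else float("inf"),
        }
    return summary
```

Dividing total cost by passing responses, rather than by all requests, keeps a cheap backend from looking good when its failures would have to be retried in the cloud anyway.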

---

📡 Signals

Worth tracking.

Markets

Silicon Data and Architect are developing a futures market for GPU compute, aiming to let providers such as CoreWeave hedge against falling prices, with hyperscalers facing rising costs on the other side of the trade.

Finance

Anthropic closed its final private round at a $965 billion valuation, raising $65 billion. The valuation is roughly 2.5 times its $380 billion February mark and takes it past OpenAI as the most valuable AI startup, ahead of its confidential S-1 filing.

Risk

OpenAI and the UK's NCSC confirmed that prompt injection remains fundamentally unsolved.

Macro

NextEra and Dominion Energy are merging to create the largest US regulated utility, driven by AI data centre power demand.

📊 Pulse check

The week by the numbers.

Stories tracked: 80

Busiest category: Product (11)

Anthropic: 3

🔭 The longer view

Trust and predictability are the new constraint.

Computex runs June 2–5, the day after this goes out, and Nvidia has already used it to put AI silicon in consumer PCs. Three editions ago the inference story was cloud cost (Edition #41); two ago it was agent security (Edition #42).

Pulse24's read: watch whether a major cloud provider responds with a non-GPU or on-device inference tier. Our testable call: at least one announces such an option by Q4 2026, alongside the first enterprise RFPs that specify memory-bandwidth-per-dollar as a primary metric. If neither lands by year-end, treat the current fragmentation as slower than the funding suggests.

---

Pulse24’s view

This fortnight's priority: decide where each workload should run before the silicon reaching production this quarter locks in your 2027 cost base. Three paths are now real — stay on your GPU vendor's roadmap, wrap cost-sensitive traffic in an optimisation layer, or push latency- and privacy-sensitive work onto new data-centre silicon or the device itself.

👁 Forward watch

What we’re watching next.

June 2-5

Computex 2026 (Taipei): show runs following Nvidia's RTX Spark unveiling; AMD and rival inference-chip vendors expected to respond. Source: computextaipei.com.tw official schedule

June 8

Apple WWDC 2026 keynote: on-device AI and genai.apple.com expected to launch. Source: Apple Newsroom, developer.apple.com/wwdc26

June 18

Google Gemini CLI migration deadline: developers must move to Antigravity CLI or lose Gemini model access. Source: Google Developers Blog

📚 References

Where this week’s evidence comes from.