Remote Voice Agents

As I’ve spent more time in business ideation mode, I’ve written a lot of docs to kick the tires on ideas.

Typing works, but voice chat with ChatGPT while driving has been illuminating. Having an LLM to bounce ideas off of, run searches, etc., is pretty great. Side note: Claude’s voice mode is nowhere near as fluid as OpenAI’s current implementation.

Both share the downside of not being able to save generated docs anywhere. Tool calls with interruptible voice agents seem a bit tricky, as I found out.

Beyond a single file edit, it’s quite helpful to be able to ask Claude Code to evaluate multiple documents, or to create new files that derive from or support existing ones.

I’ve also wanted the option to kick off Claude Code to do things while driving, either to build something or to explore an existing codebase for answers.

The goal

A web app running on my home dev machine, reachable from outside, that lets me voice-control a file editor plus a Claude agent runner. Essentially a voice product development assistant.

Results

A working web app that uses OpenAI’s Realtime API for voice control and the Claude Agent SDK to run jobs and summarize the results back to me by voice. This runs locally on my MacBook at home, using Tailscale to securely expose the web app to my phone for voice chat on the road.

Learnings

Talking to a voice agent and having it draw a diagram of spoken ideas in real time feels pretty amazing.
Talking to Claude to plan work is definitely faster than typing, but maybe isn’t the right path. The real bottleneck comes from reading those specs in detail and iterating.
The fastest I/O for me is voice out, reading in: I speak, then read the result. Having stuff read to me is slow. Typing is slow.

Tech

Combining a voice agent with a coding agent like Claude Code into a fluid experience is an interesting challenge. I’m still iterating on prompt engineering to make it feel better.
- Claude can do a growing number of useful long running tasks
- These tasks need to be blended back into the voice conversation in a human way
Async tool calls work with voice, but handling them is a bit clunky. Fixable.
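One way to picture the async handoff, as a minimal sketch rather than what my app actually does: the voice side starts a job, speaks an immediate acknowledgement, and holds any finished summaries until I stop talking. Everything here is hypothetical (`JobBoard`, `fake_agent_job`, the `user_is_speaking` flag), and the agent job is simulated with a sleep rather than a real Claude Agent SDK or Realtime API call.

```python
import asyncio
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobBoard:
    """Tracks background jobs and holds finished summaries for the voice loop."""
    finished: deque = field(default_factory=deque)
    tasks: dict = field(default_factory=dict)
    next_id: int = 1

    def start(self, description, work):
        """Run `work` (a coroutine) in the background; return an immediate spoken ack."""
        job_id = self.next_id
        self.next_id += 1
        # Keep a reference to the task so it is not garbage collected mid-run.
        self.tasks[job_id] = asyncio.create_task(self._run(job_id, description, work))
        return f"Started job {job_id}: {description}. I'll tell you when it's done."

    async def _run(self, job_id, description, work):
        try:
            summary = await work
        except Exception as exc:  # report failures out loud instead of dropping them
            summary = f"failed with error: {exc}"
        self.finished.append(f"Job {job_id} ({description}) finished: {summary}")

    def pending_announcements(self, user_is_speaking):
        """Drain finished summaries, but only when the user is not mid-sentence."""
        if user_is_speaking:
            return []
        out = list(self.finished)
        self.finished.clear()
        return out


async def fake_agent_job(seconds, result):
    """Stand-in for a long-running coding agent job (hypothetical)."""
    await asyncio.sleep(seconds)
    return result


async def demo():
    board = JobBoard()
    ack = board.start("explore repo", fake_agent_job(0.01, "found 3 entry points"))
    await asyncio.sleep(0.05)  # the conversation carries on while the job runs
    held = board.pending_announcements(user_is_speaking=True)
    spoken = board.pending_announcements(user_is_speaking=False)
    return ack, held, spoken


if __name__ == "__main__":
    for item in asyncio.run(demo()):
        print(item)
```

The part that matters is the gate in `pending_announcements`: results wait in a queue until there is a natural pause, instead of interrupting.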
Both the Realtime API and the Claude Agent SDK require API token usage, so this gets expensive with heavy use, and seems a bit silly when I already have a Claude Code subscription.
- Moving to a CLI wrapper would be a temporary fix for part of that
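A CLI wrapper could be as small as the sketch below. It assumes the `claude` command is installed and on the PATH, and that its `-p` flag runs a single prompt non-interactively and prints the result. The function names, the timeout, and the example path are my own placeholders, not anything from the app.

```python
import subprocess


def build_command(prompt, executable="claude"):
    """Build the argv for one non-interactive run (assumes `claude -p <prompt>`)."""
    return [executable, "-p", prompt]


def run_job(prompt, cwd, executable="claude", timeout=600):
    """Run one job in `cwd` and return whatever the CLI printed to stdout."""
    result = subprocess.run(
        build_command(prompt, executable),
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"job failed: {result.stderr.strip()}")
    return result.stdout.strip()
```

Usage would look something like `run_job("Summarize the open TODOs in this repo", cwd="/path/to/project")`, with the returned text handed to the voice agent to read back.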

Current status: In personal use, iterating

This is a fun place to work, but locally, it’s not a 10x improvement over using Obsidian and occasionally dictating into text boxes.
Remotely, it’s pretty cool. Voice to long-term memory (LTM), with Claude as a synthesis layer, seems pretty useful.

Atlas is here, bringing browser use (and probably realtime voice) to doc editing in all existing spec formats. Seems like it will eat a lot of this white space.

Claude Code cloud and Codex cloud are here, getting us away from our local machines.

Seems like the pieces are coming together for a fully cloud-based framework for voice specs plus code agent jobs. That cloud-based, voice-agent-enabled, unified spec system seems like white space to fill in. Maybe collaborative specification?