Briefs: Anthropic's Sonnet "Computer Use" Capability (2024-11-05)
Hey Readers!
A note about the format for Briefs issues of Road to Artificia:
Briefs differentiates itself from other AI newsletters by:
• Not losing readers in a firehose of everything that's happening, but rather selecting the most important events and explaining why they are important
• Helping less-technical or non-technical leaders level up their knowledge of AI. I hope to keep this newsletter interesting to my technical colleagues too, but not at the expense of non-technical readers.
I’ll be stubborn about these goals, but flexible about the approach. As such, I plan to tweak the Briefs format as we go based on feedback and other learning.
Anthropic Sonnet “Computer use”
Anthropic released a developer API beta of “computer use” for the Claude 3.5 Sonnet model, which lets the model control a desktop computer, using vision to understand the interface. With the reference implementation, you can ask the model to perform a task, and it uses a bundled Linux desktop with a browser to complete it. It works, if a bit slowly, but I ran into false positives from the safety checks built into the API.
Why is this important?
This is a demo of a weakly agentic model that can navigate a general desktop and web interface to complete a task, driven by chat and a visual understanding of the desktop environment.
Many past software advances required rewriting large chunks of the previous generation's systems to take advantage of them. It’s not a big surprise, but Sonnet computer use demonstrates directly that rewrites will not be required for AI to perform tasks typical of today's knowledge worker roles. Agent capabilities will extend to AI use of GUIs just fine.
Now, if you follow the progress of frontier LLMs on the key benchmarks, you'll see that the models are "saturating" many of them, i.e., LLMs are beginning to max out the measures we use to assess progress, so the community needs to come up with new ones.
That's not yet the case for computer use. In contrast to their sometimes superhuman performance on language-oriented tasks, LLMs still score poorly on computer-use tasks, as you can see below. Even as the leader, Sonnet scores only 22% on the OSWorld benchmark:
OSWorld computer task benchmark
Source: https://os-world.github.io, Xie Tianbao et al.
The slick launch demo:
And a quick demo of what the developer beta actually looks like: you can see Sonnet being given a task, then opening the browser and taking a series of actions while it uses vision to work out how to navigate the browser and web interfaces.
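For my more technical readers, here is a minimal sketch of the general "look at the screen, pick an action, do it, repeat" loop that this kind of agent follows. It is not Anthropic's reference implementation or API: every name here (`FakeDesktop`, `Action`, `scripted_model`, `run_agent`) is hypothetical, the "desktop" only logs what it was asked to do, and the "model" is a fixed script standing in for a real vision model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One step the agent wants to take (hypothetical format)."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Stand-in for a sandboxed desktop; it only records actions."""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness would return image data; this returns a text summary.
        return f"screen after {len(self.log)} actions"

    def execute(self, action: Action) -> None:
        if action.kind == "click":
            self.log.append(f"click({action.x},{action.y})")
        elif action.kind == "type":
            self.log.append(f"type({action.text!r})")
        else:
            raise ValueError(f"unsupported action: {action.kind}")


def scripted_model(task: str, screenshot: str, history: List[Action]) -> Action:
    """Assumption: replaces the real model with a fixed three-step script."""
    script = [Action("click", 120, 40), Action("type", text=task), Action("done")]
    return script[min(len(history), len(script) - 1)]


def run_agent(
    task: str,
    desktop: FakeDesktop,
    decide: Callable[[str, str, List[Action]], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Observe, decide, act, and repeat until 'done' or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = decide(task, desktop.screenshot(), history)
        if action.kind == "done":
            break
        desktop.execute(action)
        history.append(action)
    return history


if __name__ == "__main__":
    desktop = FakeDesktop()
    run_agent("weather in Boston", desktop, scripted_model)
    print(desktop.log)
```

The step budget (`max_steps`) is the design choice worth noticing: because the model decides when it is finished, a harness needs some cap so that a confused agent cannot click around forever.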
AI-powered search has one more player
OpenAI has released ChatGPT search, joining the now-crowded field of Perplexity, Google Gemini, Bing, and You.com. The experience is very close to regular ChatGPT and shares the same interface, but responses are grounded in a web index with inline citations. My experience? It’s fast, but not fast enough; I’m not ready to give up Google just yet.
The significance?
This is just one more step in OpenAI’s attack on Google’s search revenue, part of a long-term effort to constrain Google's spending on AI compute. OpenAI also released a Chrome extension that seems to do nothing but switch your default search engine to ChatGPT.
The trend is toward erasing the distinction between search engine queries and AI chatbot queries altogether.
It’s worth giving ChatGPT search a try, but try Perplexity as well if you haven’t yet - it’s lesser known outside the AI bubble but is well designed. Perplexity’s AI is powered by OpenAI, so how long that relationship lasts is in question, now that their key supplier is a competitor.
AI product releases blocked by insufficient compute
Sam Altman confirmed in a Reddit AMA that OpenAI’s launches are being held up by compute constraints:
Sam Altman on a Reddit AMA
https://www.reddit.com/r/ChatGPT/comments/1ggixzy/comment/luqb1gv/
This should give us pause. The blistering pace of AI product releases would be even faster if more inference compute were available. OpenAI is reportedly working on its own processor, slated to debut in 2026.
At a minimum, we know OpenAI has demonstrated but not fully released:
• o1 (as opposed to o1-preview and o1-mini, which have been released)
• GPT-4o advanced voice with vision
• Sora (OpenAI’s generative video model)
At some point in the next year the compute drought will lift. Calibrate your expectations accordingly: however fast you think AI capabilities are moving, they're moving faster inside the frontier labs. That is the nature of exponential processes.