Guide - Anthropic Sonnet v2 First Impressions with Computer Use
- Type
- Guide
- Year
- 2024
- Category
- Vision Language Model, Amazon Bedrock, Claude 3 Sonnet
After a relatively quiet period, the big AI players are back with the next generation of models: Meta's Llama 3.2 with vision capabilities, Stability AI's SD 3.5, Mistral Large 2, and Anthropic's upgraded 3.5 family of Sonnet and Haiku. OpenAI's Orion and Google's Gemini 2.0 are expected to arrive soon.
We have been using Haiku since its launch back in March this year; it's one of the most affordable models that still performs well. Sonnet, on the other hand, targets more niche use cases, and it's not cheap. Most of our applications are built and deployed on AWS, so Amazon Bedrock is our standard mechanism for consuming these models.
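For reference, a single Bedrock API call is all it takes to consume these models. Here is a minimal sketch using the AWS CLI; the model ID, region, and request body are assumptions based on Bedrock's published conventions, so verify them against your own account:
# Sketch: invoke Claude 3 Haiku on Amazon Bedrock via the AWS CLI
# (model ID and region are assumptions; check model access in your account)
aws bedrock-runtime invoke-model \
  --region us-west-2 \
  --model-id anthropic.claude-3-haiku-20240307-v1:0 \
  --cli-binary-format raw-in-base64-out \
  --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":256,"messages":[{"role":"user","content":"Hello"}]}' \
  response.json
cat response.json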
Frontier LLMs are called "frontier" for a reason: they sit at the forefront of AI research and showcase advanced capabilities. These models push the boundaries of what's possible in language processing, visual understanding, task analysis, tool usage, and coordination.
For testing, I used the reference implementation, which comes with the following features:
- A containerized Ubuntu 22.04 environment with applications like the Firefox browser, a text editor, xpaint, x11vnc as the VNC server, and a few others.
- An agent loop that integrates with the Claude APIs on Anthropic, Amazon Bedrock, or Google Vertex AI.
- A web UI that renders the container VM's screen.
Note: On Amazon Bedrock, Sonnet 3.5 v2 is only available in the Oregon (us-west-2) region. Make sure model access is enabled prior to running the container.
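A quick way to confirm access is to list the Anthropic models visible to your account. A minimal sketch with the AWS CLI; the exact model ID you should see in the output (anthropic.claude-3-5-sonnet-20241022-v2:0) is my assumption, so verify it in the Bedrock console:
# Sketch: list Anthropic model IDs available in us-west-2
aws bedrock list-foundation-models \
  --region us-west-2 \
  --by-provider anthropic \
  --query "modelSummaries[].modelId"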
Container setup
The only addition I made is a shared volume so that I can inspect the downloaded images and the generated bash script at the end.
export AWS_PROFILE=appgambit
docker run \
-e API_PROVIDER=bedrock \
-e AWS_PROFILE=appgambit \
-e AWS_REGION=us-west-2 \
-v $HOME/.aws/credentials:/home/computeruse/.aws/credentials \
-v $HOME/.anthropic:/home/computeruse/.anthropic \
-v /Users/dhavalnagar/Documents/Claude-Shared:/home/shared \
-p 5900:5900 \
-p 8501:8501 \
-p 6080:6080 \
-p 8080:8080 \
-it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
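Once the container is running, the quickstart serves a combined interface (chat plus screen view) on port 8080:
# Open the combined web UI in a browser (macOS; use xdg-open on Linux)
open http://localhost:8080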
Example use case
I decided to go with a moderately complex task that can be repeated as a sequence of steps. The following prompt was prepared as the initial input.
I am tasked with a simple but boring exercise. I need to check the Google Doodle of the day, if found, download the exact image of the doodle and save with proper name. Save the image in /home/shared directory. After download, please validate if it's a valid image or not.
I know that Google is obsessed with latency and complex tech behind the scene so make sure you do proper analysis and carry out this task in optimal way.
At the end, please summarize all successful steps so that the same steps can be used next time.
Output
Sonnet 3.5 v2 understood the task and made a plan to execute the steps. Here is a short video of all the steps it took to complete the work.
Note: The knowledge cutoff for the upgraded Claude 3.5 Sonnet is April 2024, same as that of the original Claude 3.5 Sonnet model.
- The model does a good job of breaking the whole task into smaller steps, and the agent implementation makes coordination and management easy.
- Based on the individual steps and the output of the previous step, it decides what to do next and which tool to use.
- Task validation is an important part, but based on some experiments, I feel validation criteria should be spelled out clearly in the user prompt for better execution.
- Failures are inevitable when you are doing something moderately complex that involves visual processing, but the model is able to adapt and find a solution. At one point, though, it got stuck because of a UI popup asking for permission.
- It takes longer than you would think to carry out certain tasks, but once done, those steps can be automated in the form of a bash script or plain instruction steps.
Generated Bash script
#!/bin/bash
# Function to check and install required packages
# (xdotool added to the list since the script uses it below)
check_and_install_packages() {
    local packages=("xclip" "xdotool" "curl" "file" "firefox-esr")
    for pkg in "${packages[@]}"; do
        if ! command -v "$pkg" &> /dev/null; then
            echo "Installing $pkg..."
            sudo apt-get update &> /dev/null
            sudo apt-get install -y "$pkg" &> /dev/null
        fi
    done
}
# Function to get today's date in YYYYMMDD format
get_date() {
    date +%Y%m%d
}

# Function to create output directory if it doesn't exist
ensure_output_dir() {
    mkdir -p /home/shared
}
# Main function to get Google Doodle
get_google_doodle() {
    local date_str=$(get_date)
    local output_file="/home/shared/google_doodle_${date_str}.webp"

    echo "Starting Firefox to access Google homepage..."
    DISPLAY=:1 firefox-esr "https://www.google.com" &
    sleep 5  # Wait for Firefox to load

    # Move mouse to the Google Doodle position and right-click
    DISPLAY=:1 xdotool mousemove 512 259 click 3
    sleep 1

    # Move to "Copy Image Link" and click
    DISPLAY=:1 xdotool mousemove 601 537 click 1
    sleep 1

    # Get URL from clipboard
    url=$(DISPLAY=:1 xclip -selection clipboard -o)

    if [[ -n "$url" ]]; then
        echo "Downloading Google Doodle..."
        curl -L "$url" -o "$output_file"

        # Validate the downloaded file
        if file "$output_file" | grep -q "Web/P image"; then
            echo "Successfully downloaded and verified Google Doodle to: $output_file"
            # Get the image info
            file "$output_file"
            echo "File size: $(du -h "$output_file" | cut -f1)"
        else
            echo "Error: Downloaded file is not a valid WebP image"
            rm -f "$output_file"
            exit 1
        fi
    else
        echo "Error: No URL found in clipboard"
        exit 1
    fi
}
# Main execution
echo "=== Google Doodle Downloader ==="
echo "Checking required packages..."
check_and_install_packages
ensure_output_dir
get_google_doodle
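To reuse it, I save the script to the shared volume and run it from inside the container. A hypothetical invocation; the filename is my choice, not part of the model's output:
# get_doodle.sh is a hypothetical name for the generated script above
chmod +x /home/shared/get_doodle.sh
/home/shared/get_doodle.sh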
On a day when there is no Google Doodle, the script executes and returns the following message instead.
The script execution failed because there is no Google Doodle available today (October 27, 2024).
This particular example is not a way to assess the capabilities (or limitations) of the model. However, observing what it does and how it performs at each step provides insight into the future direction of these models.