LLMs can Control your Computer now
Enabling an LLM to control your computer is just an API call away
Tech that stops advancing inevitably becomes obsolete.
In this context, I am referring to LLMs. LLMs are the shiny apples in the basket that everyone wants to take a bite of.
AI labs and AI-first startups are investing time and resources to improve LLMs.
It was only a matter of time before LLMs grew a tail and started crawling before learning to fly.
Yesterday we witnessed a new LLM capability that could change how we work.
Large language models (LLMs) can now perform generic and repetitive tasks on your PC through natural-language prompts.
Do you want to send an email, save a photo, or open YouTube to watch a video using an LLM? It’s possible now.
The enhanced version of Claude 3.5 Sonnet offers PC interaction features (computer use).
Interaction Flow of Claude 3.5 Sonnet
We know that LLMs are brilliant at generating content by predicting what comes next: for text, the model picks the next token from the probability distribution it learned during training.
The same idea extends to images and audio, where models predict the next patch of pixels or chunk of waveform.
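As a toy illustration of next-token prediction (the vocabulary and probabilities below are invented for the example, not taken from any real model):

import random

# Hypothetical probabilities a model might assign to candidate next tokens
# after the prefix "The cat sat on the".
next_token_probs = {"mat": 0.62, "sofa": 0.21, "roof": 0.09, "keyboard": 0.08}

# Greedy decoding picks the most likely token; sampling draws from the distribution.
greedy = max(next_token_probs, key=next_token_probs.get)
sampled = random.choices(list(next_token_probs), weights=next_token_probs.values())[0]
print(greedy, sampled)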
How can an LLM run operations on a PC? Tools.
Organizations define tools that let an LLM interact with or manipulate their own services.
A tool can search the internet, call external services, or perform operations on the user's behalf.
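As a minimal sketch of what a tool definition looks like in Anthropic's tool-use API (the get_weather tool here is a made-up example, not one of the computer-use tools):

# A tool is just a name, a description, and a JSON schema for its input;
# the model decides when to call it, but executing it is still the client's job.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
        },
        "required": ["city"],
    },
}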
Anthropic offers three pre-defined tools:
computer_20241022
text_editor_20241022
bash_20241022
These tools are the functional units Claude uses to interact with the computer: the model decides which tool to call and with what input, and the client executes the action.
Since these tools give the model access to the system, Anthropic recommends running the beta inside an isolated Docker container, which keeps things safer from a data and privacy standpoint.
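If you want the model to edit files and run shell commands as well as drive the GUI, all three pre-defined tools can be passed together. The fixed tool names below follow Anthropic's computer-use documentation; the display dimensions describe the virtual screen inside the container:

tools = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,   # resolution of the virtual display in the container
        "display_height_px": 768,
        "display_number": 1,
    },
    {"type": "text_editor_20241022", "name": "str_replace_editor"},
    {"type": "bash_20241022", "name": "bash"},
]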
Simple Beta Feature Implementation
Let us try implementing a simple flow that gives us context on how to set up these tools and call the LLM to trigger prompt-based actions on the system.
We will use a Docker image supplied by Anthropic (all computer-use tools included). So, we pull the pre-built Docker image from the registry and run it in interactive mode.
export ANTHROPIC_API_KEY=%your_api_key%
docker run \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-v $HOME/.anthropic:/home/computeruse/.anthropic \
-p 5900:5900 \
-p 8501:8501 \
-p 6080:6080 \
-p 8080:8080 \
-it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
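At the time of writing, the quickstart README maps these ports as follows: 8080 serves the combined demo interface, 8501 the Streamlit chat app, 6080 a browser-based noVNC view of the virtual desktop, and 5900 raw VNC (see the quickstart link in the references).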
The Docker image is bundled by default with the computer, bash, and text editor tools needed to perform the actions.
Now that the container is running, we can submit a request to the messages endpoint (here via the Python SDK) with a payload specifying the tool and instructions.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            # Required: resolution of the virtual display inside the container
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
    ],
    messages=[{"role": "user", "content": "open browser, go to gmail.com and send email to databracket9@gmail.com"}],
    betas=["computer-use-2024-10-22"],
)
When the API request is sent, the model inspects the available tools and responds with a tool_use content block naming the tool to run and the input to run it with.
The response's stop_reason is set to tool_use, signalling that the client should execute the requested action.
Communication between the model and the Docker container continues through this tool_use / tool_result exchange until the desired action is completed (the agent loop).
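Below is a minimal sketch of that agent loop using the Anthropic Python SDK. The execute_computer_action helper is hypothetical: in the reference Docker image the bundled tool implementations handle the screenshots, clicks, and keystrokes, so treat this as an illustration of the message flow rather than a drop-in implementation.

def run_agent_loop(client, messages, tools):
    """Keep calling the model until it stops asking for tool use."""
    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        # Keep the assistant turn in the conversation history.
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return response  # The model is done; no more actions requested.

        # Execute every tool call the model asked for and report the results back.
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # execute_computer_action is a hypothetical helper that performs
                # the click/keystroke/screenshot described by block.input.
                output = execute_computer_action(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": tool_results})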
Based on the input message, the model produces input for the computer-use tool that conforms to the following schema.
{
    "properties": {
        "action": {
            "description": """The action to perform. The available actions are:
* `key`: Press a key or key-combination on the keyboard.
  - This supports xdotool's `key` syntax.
  - Examples: "a", "Return", "alt+Tab", "ctrl+s", "Up", "KP_0" (for the numpad 0 key).
* `type`: Type a string of text on the keyboard.
* `cursor_position`: Get the current (x, y) pixel coordinate of the cursor on the screen.
* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.
* `left_click`: Click the left mouse button.
* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.
* `right_click`: Click the right mouse button.
* `middle_click`: Click the middle mouse button.
* `double_click`: Double-click the left mouse button.
* `screenshot`: Take a screenshot of the screen.""",
            "enum": [
                "key",
                "type",
                "mouse_move",
                "left_click",
                "left_click_drag",
                "right_click",
                "middle_click",
                "double_click",
                "screenshot",
                "cursor_position",
            ],
            "type": "string",
        },
        "coordinate": {
            "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.",
            "type": "array",
        },
        "text": {
            "description": "Required only by `action=type` and `action=key`.",
            "type": "string",
        },
    },
    "required": ["action"],
    "type": "object",
}
Request fulfillment happens step by step through these structured tool inputs: each response from the model tells the client exactly which action to perform next.
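For the Gmail prompt above, the model's first few tool inputs might look something like the following. This is purely illustrative: the coordinates are made up, and the real model derives them from the screenshots it requests along the way.

# Illustrative only: actual inputs depend on what the model sees on screen.
{"action": "screenshot"}                              # look at the current desktop
{"action": "key", "text": "ctrl+l"}                   # focus the browser address bar
{"action": "type", "text": "gmail.com"}               # type the URL
{"action": "key", "text": "Return"}                   # navigate
{"action": "screenshot"}                              # re-check the screen
{"action": "left_click", "coordinate": [120, 240]}    # click Compose (made-up coordinates)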
The feature is still in beta. Usage and access should be enabled after careful consideration.
Final Words
AI that takes action instead of simply responding to questions is how the future of AI should look.
We are gradually making progress and entering the AIverse.
Automating actions without the need to run scheduled jobs can become a reality soon.
All we need is a simple natural language prompt instead of a programming-language-specific script to automate a task.
In the future, we might not even need to have domain or programming knowledge.
This opens up interesting possibilities that can alter the course of the future in tech.
References
https://docs.anthropic.com/en/docs/build-with-claude/computer-use#bash-tool
https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo#accessing-the-demo-app
Connect with Me
📰 Linkedin | 🐦Twitter | 📽️Youtube | 📖 Medium | 🪝 Gumroad | 🧑💻Github | 📷Instagram