Remote evals and sandboxes

When your task can’t be expressed as a prompt (agents, multi-step workflows, custom tooling, or heavy dependencies), connect your code to a playground or experiment. The iteration workflow stays the same: run evaluations, compare results side-by-side, and share with teammates. Your code handles task execution. Braintrust handles the rest. Two approaches differ in where your code runs:

Remote evals — Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. Braintrust triggers execution, sends parameters, and displays results.
Sandboxes — Run evals in an isolated cloud sandbox, controlled from Braintrust. You push an execution artifact (a code bundle or container snapshot) and Braintrust invokes it on demand. No server to keep running.
Sandboxes are in beta, and the API, configuration, and behavior are likely to change in the near future. They require a Pro or Enterprise plan, and self-hosted deployments require data plane version v2.0.

Common use cases

The first four use cases apply to remote evals; the last three apply to sandboxes.

Private internal resources

Your eval needs to call internal APIs, query private databases, or access services inside your VPN. Because remote evals execute on your infrastructure, that access is already available.

OS-specific or platform-locked tooling

Your eval requires software that only runs on a specific OS or machine — for example, a Windows-only simulation or a Unity project on a dedicated workstation. Remote evals let Braintrust trigger execution on whichever machine has the right environment set up.

Heavy or complex dev setup

Some tools are too painful to install on every teammate’s machine — game engines, large models, specialized SDKs. Set up the environment once on a shared server and let everyone else run the eval from the playground.

Data security and compliance

Sensitive data stays on your infrastructure. Only results are sent to Braintrust.

No server to maintain

Push your eval once and it’s always available from the playground — without keeping a process alive or worrying about uptime. This works well for stable eval versions the whole team can run on demand.

Custom Python or TypeScript environments

Include pip packages with --requirements (Lambda) or bring your own container image (Modal) for full control over the runtime environment.

Reproducible, isolated runs

Each run executes against the same packaged artifact — same bundle or container image — so results are consistent across teammates and over time.

Run a remote eval

Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.

1. Write your eval

A remote eval looks like a standard eval call with a parameters field that defines configurable options. These can be inline parameters, defined directly in your eval code, or saved parameters, created separately and loaded with loadParameters() (TypeScript) or load_parameters() (Python). Both inline and saved parameters become UI controls in the playground.

Install the SDK and dependencies:

# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals

pip install braintrust openai autoevals

# Add to your Gemfile:
gem "braintrust"
gem "openai"

bundle install

# Add to build.gradle dependencies{} block:
implementation 'dev.braintrust:braintrust-sdk-java:<version>'
implementation 'com.openai:openai-java-sdk:<version>'

Create the eval code:

import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
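  task: async (input, { parameters }) => {
    // Prepend the optional prefix so the `prefix` parameter takes effect,
    // mirroring the Python example.
    const promptInput = parameters.prefix ? `${parameters.prefix}: ${input}` : input;
    const completion = await client.chat.completions.create({
      ...parameters.main.build({ input: promptInput }),
      model: parameters.model,
    });
    return completion.choices[0].message.content ?? "";
  },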
  scores: [],
  parameters: {
    model: {
      type: "model",
      description: "Model to evaluate",
      default: "gpt-5-mini",
    },
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});

import openai
from autoevals import Levenshtein
from braintrust import Eval, init_dataset, wrap_openai
from pydantic import BaseModel, Field

client = wrap_openai(openai.AsyncOpenAI())


class PrefixParam(BaseModel):
    """Pydantic model for the prefix parameter. In Python, non-prompt parameters
    must be defined as Pydantic models (not dicts) to appear in the UI."""

    value: str = Field(default="", description="Optional prefix to prepend to input")


async def task(input, hooks):
    parameters = hooks.parameters

    prefix = parameters.get("prefix", "")
    prompt_input = f"{prefix}: {input}" if prefix else input

    completion = await client.chat.completions.create(
        **parameters["main"].build(input=prompt_input),
        model=parameters["model"],
    )

    return completion.choices[0].message.content or ""


Eval(
    "my-project",
    data=init_dataset("my-project", "my-dataset"),
    task=task,
    scores=[Levenshtein()],
    parameters={
        "model": {
            "type": "model",
            "description": "Model to evaluate",
            "default": "gpt-5-mini",
        },
        "main": {
            "type": "prompt",
            "name": "Main prompt",
            "description": "The prompt used to process input",
            "default": {
                "prompt": {
                    "type": "chat",
                    "messages": [{"role": "user", "content": "{{input}}"}],
                },
                "options": {"model": "gpt-5-mini"},
            },
        },
        "prefix": PrefixParam,
    },
)

# Requires Braintrust Ruby SDK v0.2.1+
require "braintrust"
require "braintrust/server"
require "openai"

Braintrust.init(blocking_login: true)
Braintrust.instrument!(:openai)

client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY"))

simple_eval = Braintrust::Eval::Evaluator.new(
  task: ->(input:) {
    response = client.chat.completions.create(
      model: "gpt-5-mini",
      messages: [{role: "user", content: input}]
    )
    response.choices.first.message.content
  },
  scorers: [
    Braintrust::Scorer.new("exact_match") { |expected:, output:| output == expected ? 1.0 : 0.0 }
  ],
  parameters: {
    prefix: {type: "string", description: "Optional prefix to prepend to input", default: ""}
  }
)

run Braintrust::Server::Rack.app(
  evaluators: {"simple-eval" => simple_eval}
)

// Requires Braintrust Java SDK v0.3.16+
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.chat.completions.ChatCompletionCreateParams;
import dev.braintrust.Braintrust;
import dev.braintrust.devserver.Devserver;
import dev.braintrust.devserver.RemoteEval;
import dev.braintrust.eval.*;
import dev.braintrust.instrumentation.openai.BraintrustOpenAI;
import java.util.List;

class RemoteEvalExample {
    public static void main(String[] args) throws Exception {
        var braintrust = Braintrust.get();
        var openTelemetry = braintrust.openTelemetryCreate();
        var client = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

        RemoteEval<String, String> eval =
                RemoteEval.builder(String.class, String.class)
                        .name("my-eval")
                        .task(
                                (datasetCase, parameters) -> {
                                    var request =
                                            ChatCompletionCreateParams.builder()
                                                    .model(parameters.get("model", String.class))
                                                    .addUserMessage(datasetCase.input())
                                                    .build();
                                    var response = client.chat().completions().create(request);
                                    var output =
                                            response.choices()
                                                    .get(0)
                                                    .message()
                                                    .content()
                                                    .orElse("");
                                    return new TaskResult<>(output, datasetCase, parameters);
                                })
                        .scorers(
                                List.of(
                                        Scorer.of(
                                                "exact_match",
                                                (expected, output) ->
                                                        output.equals(expected) ? 1.0 : 0.0)))
                        .parameters(
                                List.of(
                                        ParameterDef.model(
                                                "model", "gpt-5-mini", "OpenAI model to use")))
                        .build();

        Devserver devserver =
                Devserver.builder()
                        .config(braintrust.config())
                        .registerEval(eval)
                        .host("localhost")
                        .port(8300)
                        .build();

        Runtime.getRuntime()
                .addShutdownHook(
                        new Thread(
                                () -> {
                                    System.out.println("Shutting down...");
                                    devserver.stop();
                                }));

        System.out.println("Starting Braintrust dev server on http://localhost:8300");
        devserver.start();
    }
}

Use type: "model" to add a model selector that displays a dropdown of the models configured in your Braintrust project. Use type: "prompt" to add an inline prompt editor. Both are declared with dictionary (object literal) syntax, as in the TypeScript and Python examples above. For other parameter types, use Zod schemas (TypeScript) or Pydantic models (Python).

Inline parameters use different syntax across languages:

Feature	TypeScript	Python	Java	Ruby
Model selector	`type: "model"`	`type: "model"`	`ParameterDef.model(name, defaultValue)`	Not supported
Prompt parameters	`type: "prompt"` with `messages` array in `default`	`type: "prompt"` with nested `prompt.messages` and `options`	Not supported	Not supported
Scalar types	Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()`	Pydantic models with `Field(description=...)`	`ParameterDef.data(name, defaultValue)`, `ParameterDef.model(name, defaultValue)`	Hash with `type:`, `description:`, `default:`
Parameter access	`parameters.prefix`	`parameters.get("prefix")`	`parameters.get("prefix", String.class)`	`parameters["prefix"]` (via keyword arg)
Prompt usage	`parameters.main.build({ input: value })`	`**parameters["main"].build(input=value)`	Not applicable	Not applicable
Async	`async`/`await`	`async`/`await`	Synchronous	Synchronous

To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.
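To see how the prefix and prompt parameters combine before any model call is made, here is a minimal Python sketch. It assumes a plain dict in place of hooks.parameters and uses a hypothetical render_messages helper to mimic the {{input}} substitution. The SDK's build() performs the real rendering and is not reproduced here.

```python
from typing import Any


def build_prompt_input(parameters: dict[str, Any], raw_input: str) -> str:
    """Prepend the optional prefix, as the Python task above does."""
    prefix = parameters.get("prefix", "")
    return f"{prefix}: {raw_input}" if prefix else raw_input


def render_messages(messages: list[dict[str, str]], **variables: str) -> list[dict[str, str]]:
    """Fill {{name}} placeholders in each message's content.

    Hypothetical stand-in for illustration only; real prompt templating
    handles far more than plain string replacement.
    """
    rendered = []
    for message in messages:
        content = message["content"]
        for name, value in variables.items():
            content = content.replace("{{" + name + "}}", value)
        rendered.append({**message, "content": content})
    return rendered


# Stand-in for the values the playground would send (assumed shape).
parameters = {"prefix": "Translate to uppercase"}
default_messages = [{"role": "user", "content": "{{input}}"}]

prompt_input = build_prompt_input(parameters, "hello")
messages = render_messages(default_messages, input=prompt_input)
print(messages)
```

In a real eval, parameters["main"].build(input=prompt_input) takes the place of render_messages, and an empty prefix leaves the input unchanged.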

2. Expose the eval server

Run your eval with the bt eval --dev flag to start a local server:

TypeScript:

bt eval path/to/eval.ts --dev

Python:

bt eval path/to/eval.py --dev

In both cases, the dev server starts at http://localhost:8300. Configure the host and port:

--dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
--dev-port DEV_PORT: The port to bind to. Defaults to 8300.

Java and Ruby start the server differently, as described below.

The Java SDK does not have a CLI command. Start the dev server programmatically using Devserver.builder()...build() followed by devserver.start(), as shown in the code example above.
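Before pointing Braintrust at the dev server, it can help to confirm that something is accepting connections on the configured host and port. The sketch below is a generic TCP check written with the Python standard library. dev_server_is_listening is a hypothetical helper, not part of the bt CLI or any Braintrust SDK, and it only shows that the port is open, not that the eval loaded correctly.

```python
import socket


def dev_server_is_listening(host: str = "localhost", port: int = 8300, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Defaults match the documented dev server address, localhost:8300.
    print("dev server reachable:", dev_server_is_listening())
```

If you changed --dev-host or --dev-port, pass the same values to the helper.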

Run as a Rack app

The dev server requires a Rack-compatible web server that supports streaming:

Server	Version
Puma (recommended)	6.x
Falcon	0.x
Passenger	6.x
WEBrick	Not supported — does not support streaming

Create your eval server file:

eval_server.ru

# Requires Braintrust Ruby SDK v0.2.1+
require "braintrust"
require "braintrust/server"
require "openai"

Braintrust.init(blocking_login: true)
Braintrust.instrument!(:openai)

client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY"))

simple_eval = Braintrust::Eval::Evaluator.new(
  task: ->(input:) {
    response = client.chat.completions.create(
      model: "gpt-5-mini",
      messages: [{role: "user", content: input}]
    )
    response.choices.first.message.content
  },
  scorers: [
    Braintrust::Scorer.new("exact_match") { |expected:, output:| output == expected ? 1.0 : 0.0 }
  ]
)

run Braintrust::Server::Rack.app(
  evaluators: {"simple-eval" => simple_eval}
)

Add dependencies and start the server:

# Gemfile
gem "rack"
gem "puma"

bundle install
bundle exec rackup eval_server.ru -p 8300 -o 0.0.0.0

Run as a Rails engine

If you have an existing Rails application, you can mount the Braintrust eval server as a Rails engine instead of running a separate Rack process.

Requires Rails 8.x. Add to your Gemfile:

gem "actionpack", "~> 8.0"
gem "railties", "~> 8.0"
gem "activesupport", "~> 8.0"

Place evaluator classes under app/evaluators/ as subclasses of Braintrust::Eval::Evaluator:

# app/evaluators/food_classifier.rb
class FoodClassifier < Braintrust::Eval::Evaluator
  def task
    ->(input:) { classify(input) }
  end

  def scorers
    [Braintrust::Scorer.new("exact_match") { |expected:, output:| output == expected ? 1.0 : 0.0 }]
  end
end

Generate the initializer:

bin/rails generate braintrust:server

This creates config/initializers/braintrust_server.rb with a slug-to-evaluator mapping auto-discovered from app/evaluators/.

Mount the engine:

# config/routes.rb
Rails.application.routes.draw do
  mount Braintrust::Contrib::Rails::Server::Engine, at: "/braintrust"
end

Auth configuration

The engine defaults to :clerk_token authentication. For local development, set auth to :none in the generated initializer:

# config/initializers/braintrust_server.rb
Braintrust::Contrib::Rails::Server::Engine.configure do |config|
  config.auth = :none
end

auth: :none disables authentication on incoming requests. Only use this for local development. BRAINTRUST_API_KEY must still be set on the server — it’s required to fetch resources from your project.

3. Configure in your project

To make your eval accessible beyond localhost, add the endpoint to your project:

In your project, go to Settings > Remote evals.
Select Create remote eval source.
Enter the name and URL of your remote eval server.
Select Create remote eval source.

All team members with access to the project can now use this remote eval in their playgrounds. Keep the eval server process running for as long as the remote eval is in use.

4. Run the eval

Run your remote eval from a playground to iterate, or directly as an experiment to capture a durable result.

Playground

A playground is a mutable workspace for iteration. Re-running overwrites previous results, so it’s the place to tune parameters and compare configurations side-by-side.

Open a playground in your project.
Select + Task.
Choose Remote eval from the task type list.
Select your eval and configure parameters using the UI controls.
Provide data inline or select a dataset, optionally add scorers, and click Run.

Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Run a sandbox eval

Sandboxes are in beta, and the API, configuration, and behavior are likely to change in the near future. They require a Pro or Enterprise plan, and self-hosted deployments require data plane version v2.0.

Run evals in an isolated cloud sandbox, controlled from Braintrust. Push an execution artifact once and Braintrust invokes it on demand from the playground — no server to keep running. Braintrust supports two sandbox providers:

Lambda — AWS Lambda-based. The default for braintrust push. Supports both Python and TypeScript. No extra configuration needed.
Modal — Container-based via Modal. Requires a pre-built Modal container image. Executes TypeScript evals only.

1. Write your eval

A sandbox eval looks like a standard eval call with a parameters field that defines configurable options. These can be inline parameters, defined directly in your eval code, or saved parameters, created separately and loaded with loadParameters() (TypeScript) or load_parameters() (Python). Both inline and saved parameters become UI controls in the playground.

Install the SDK and dependencies:

# pnpm
pnpm add braintrust openai autoevals
# npm
npm install braintrust openai autoevals

pip install braintrust openai autoevals

Sandboxes require TypeScript SDK v3.7.1+ or Python SDK v0.12.1+.

Create the eval code:

my_eval.eval.ts

import { Eval, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";

const client = wrapOpenAI(new OpenAI());

Eval("my-project", {
  data: [{ input: "hello", expected: "HELLO" }],
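  task: async (input, { parameters }) => {
    // Prepend the optional prefix so the `prefix` parameter takes effect,
    // mirroring the Python example.
    const promptInput = parameters.prefix ? `${parameters.prefix}: ${input}` : input;
    const completion = await client.chat.completions.create(
      parameters.main.build({ input: promptInput }),
    );
    return completion.choices[0].message.content ?? "";
  },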
  scores: [],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "The prompt used to process input",
      default: {
        messages: [{ role: "user", content: "{{input}}" }],
        model: "gpt-5-mini",
      },
    },
    prefix: z.string().describe("Optional prefix to prepend to input").default(""),
  },
});

import openai
from autoevals import Levenshtein
from braintrust import Eval, init_dataset, wrap_openai
from pydantic import BaseModel, Field

client = wrap_openai(openai.AsyncOpenAI())


class PrefixParam(BaseModel):
    value: str = Field(default="", description="Optional prefix to prepend to input")


async def task(input, hooks):
    parameters = hooks.parameters

    prefix = parameters.get("prefix", "")
    prompt_input = f"{prefix}: {input}" if prefix else input

    completion = await client.chat.completions.create(
        **parameters["main"].build(input=prompt_input)
    )

    return completion.choices[0].message.content or ""


Eval(
    "my-project",
    data=init_dataset("my-project", "my-dataset"),
    task=task,
    scores=[Levenshtein()],
    parameters={
        "main": {
            "type": "prompt",
            "name": "Main prompt",
            "description": "The prompt used to process input",
            "default": {
                "prompt": {
                    "type": "chat",
                    "messages": [{"role": "user", "content": "{{input}}"}],
                },
                "options": {"model": "gpt-5-mini"},
            },
        },
        "prefix": PrefixParam,
    },
)

Inline parameters use different syntax across languages:

Feature	TypeScript	Python
Model selector	`type: "model"`	`type: "model"`
Prompt parameters	`type: "prompt"` with `messages` array in `default`	`type: "prompt"` with nested `prompt.messages` and `options`
Scalar types	Zod schemas: `z.string()`, `z.boolean()`, `z.number()` with `.describe()`	Pydantic models with `Field(description=...)`
Parameter access	`parameters.prefix`	`parameters.get("prefix")`
Prompt usage	`parameters.main.build({ input: value })`	`**parameters["main"].build(input=value)`
Async	`async`/`await`	`async`/`await`

To reference saved parameter configurations instead of defining them inline, use loadParameters() (TypeScript) or load_parameters() (Python). See Parameters for details.

2. Register your sandbox

Sandbox registration uses the Braintrust SDK CLI (braintrust push / npx braintrust push). The bt CLI does not yet support sandbox evals.

Lambda

braintrust push my_eval.py           # Python
npx braintrust push my_eval.eval.ts  # TypeScript

To include pip dependencies:

braintrust push my_eval.py --requirements requirements.txt

To run locally and register the sandbox in one step (TypeScript):

npx braintrust eval my_eval.eval.ts --push

To update an existing sandbox:

braintrust push my_eval.py --if-exists replace           # Python
npx braintrust push my_eval.eval.ts --if-exists replace  # TypeScript

Supported Lambda runtimes: Python 3.8–3.13 and Node.js 18, 20, 22. Pushing with an unsupported version returns an error listing the supported versions.
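A pre-flight check can catch an unsupported runtime before you push. This sketch hard-codes the ranges listed above, which may change while sandboxes are in beta, and assumes the Python version that matters is the interpreter you run braintrust push with. The helper names are illustrative and not part of any Braintrust tooling.

```python
import sys

# Ranges copied from this guide; treat them as a snapshot, not a contract.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 13)
SUPPORTED_NODE_MAJORS = {18, 20, 22}


def python_is_supported(version: tuple[int, int]) -> bool:
    """True if (major, minor) falls within the documented Python range."""
    return MIN_PYTHON <= version <= MAX_PYTHON


def node_is_supported(major: int) -> bool:
    """True if the Node.js major version is one of the documented runtimes."""
    return major in SUPPORTED_NODE_MAJORS


local = (sys.version_info.major, sys.version_info.minor)
if python_is_supported(local):
    print(f"Python {local[0]}.{local[1]} is in the documented range")
else:
    print(f"Python {local[0]}.{local[1]} is outside 3.8-3.13")
```

Tuple comparison keeps the range check correct across minor versions, so (3, 13) passes while (3, 14) does not.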

Modal

Modal sandboxes run your eval in a custom container image. The container must include Node.js and your eval code.

Add your Modal credentials under Settings > Organization > Sandbox providers.

Build the container image using Image.build():

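import modal

# Recent Modal releases expect Image.build() to receive the App to build against.
# "braintrust-sandbox" is a placeholder app name.
app = modal.App.lookup("braintrust-sandbox", create_if_missing=True)

image = modal.Image.from_dockerfile("./Dockerfile")
built_image = image.build(app)
image_id = built_image.object_id  # e.g. "im-icRxmsk1Sz9XPP2f8OblVU"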

The object_id is the image ID. Pass it as snapshotRef (TypeScript) or snapshot_ref (Python) when you register the sandbox:

import { registerSandbox } from "braintrust";

const result = await registerSandbox({
  name: "My Eval Sandbox",
  project: "my-project",
  sandbox: { provider: "modal", snapshotRef: "im-icRxmsk1Sz9XPP2f8OblVU" },
  entrypoints: ["./my_eval.eval.ts"],
});

from braintrust import register_sandbox, SandboxConfig

result = register_sandbox(
    name="My Eval Sandbox",
    project="my-project",
    sandbox=SandboxConfig(provider="modal", snapshot_ref="im-icRxmsk1Sz9XPP2f8OblVU"),
    entrypoints=["./my_eval.eval.ts"],
)

entrypoints lists the eval files available in the snapshot. Re-registering with a new snapshot_ref updates the sandbox in place.

3. Run the eval

Run your sandbox eval from a playground to iterate, or directly as an experiment to capture a durable result.

Playground

A playground is a mutable workspace for iteration. Re-running overwrites previous results, so it’s the place to tune parameters and compare configurations side-by-side.

Open a playground in your project.
Select + Task.
Open the Remote eval submenu and select your sandbox.
Select your eval and configure parameters using the UI controls.
Provide data inline or select a dataset, optionally add scorers, and click Run.

Results stream back as the eval executes. You can run multiple instances side-by-side with different parameters to compare results.

Limitations

The dataset defined in your eval is ignored when running from the playground. Datasets are managed through the playground.
Scorers defined in your eval are concatenated with scorers added in the playground.
For Lambda sandboxes, each individual invocation is capped at 15 minutes, but Braintrust manages dataset iteration outside the Lambda, so large dataset evals are not constrained by that limit.
For Modal sandboxes, each sandbox is capped at a 60-minute lifetime by default. Self-hosted deployments can override this with the MODAL_SANDBOX_TIMEOUT_S environment variable (in seconds).
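Because MODAL_SANDBOX_TIMEOUT_S is expressed in seconds while the limits above are stated in minutes, a small helper can make the units explicit when preparing a self-hosted configuration. This is a sketch of how you might validate the value yourself; how Braintrust parses the variable is not described here, and modal_sandbox_timeout_s is a hypothetical name.

```python
import os
from typing import Mapping

DEFAULT_MODAL_LIFETIME_S = 60 * 60  # 60-minute default sandbox lifetime
LAMBDA_INVOCATION_CAP_S = 15 * 60   # 15-minute cap per Lambda invocation


def modal_sandbox_timeout_s(env: Mapping[str, str] = os.environ) -> int:
    """Read MODAL_SANDBOX_TIMEOUT_S (seconds), falling back to the 60-minute default."""
    raw = env.get("MODAL_SANDBOX_TIMEOUT_S", "").strip()
    if not raw:
        return DEFAULT_MODAL_LIFETIME_S
    seconds = int(raw)  # raises ValueError for non-numeric input
    if seconds <= 0:
        raise ValueError("MODAL_SANDBOX_TIMEOUT_S must be a positive number of seconds")
    return seconds


print(modal_sandbox_timeout_s({}))                                   # default lifetime
print(modal_sandbox_timeout_s({"MODAL_SANDBOX_TIMEOUT_S": "7200"}))  # two hours
```

Passing a dict instead of os.environ keeps the helper easy to test without touching the real environment.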

Next steps

Test prompts and models without custom code
Create parameters to manage configurable settings
Interpret results from your experiments
