When your task can’t be expressed as a prompt (agents, multi-step workflows, custom tooling, or heavy dependencies), connect your code to a playground. The iteration workflow stays the same: run evaluations, compare results side-by-side, and share with teammates. Your code handles task execution. The playground handles the rest. Two approaches differ in where your code runs:
- Remote evals — Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
- Sandboxes — Run evals in an isolated cloud sandbox, controlled from Braintrust. You push an execution artifact (a code bundle or container snapshot) and Braintrust invokes it on demand from the playground. No server to keep running.
Common use cases
Private internal resources
Your eval needs to call internal APIs, query private databases, or access services inside your VPN. Because remote evals execute on your infrastructure, that access is already available.
OS-specific or platform-locked tooling
Your eval requires software that only runs on a specific OS or machine — for example, a Windows-only simulation or a Unity project on a dedicated workstation. Remote evals let Braintrust trigger execution on whichever machine has the right environment set up.
Heavy or complex dev setup
Some tools are too painful to install on every teammate’s machine — game engines, large models, specialized SDKs. Set up the environment once on a shared server and let everyone else run the eval from the playground.
Data security and compliance
Sensitive data stays on your infrastructure. Only results are sent to Braintrust.
Run a remote eval
Run evals on your own infrastructure, controlled from Braintrust. Your evaluation code runs on your machine or server. The Braintrust playground triggers execution, sends parameters, and displays results.
1. Write your eval
A remote eval looks like a standard eval call with a parameters field that defines configurable options. These can be inline parameters, defined directly in your eval code, or saved parameters, created separately and loaded with loadParameters(). Both inline and saved parameters become UI controls in the playground.
Install the SDK and dependencies:
Use type: "model" to add a model selector that displays a dropdown of models configured in your Braintrust project. Use type: "prompt" for an inline prompt editor. Both use dictionary syntax. For other parameter types, use Zod schemas (TypeScript) or Pydantic models (Python).

| Feature | TypeScript | Python | Java | Ruby |
|---|---|---|---|---|
| Model selector | type: "model" | type: "model" | ParameterDef.model(name, defaultValue) | Not supported |
| Prompt parameters | type: "prompt" with messages array in default | type: "prompt" with nested prompt.messages and options | Not supported | Not supported |
| Scalar types | Zod schemas: z.string(), z.boolean(), z.number() with .describe() | Pydantic models with Field(description=...) | ParameterDef.data(name, defaultValue), ParameterDef.model(name, defaultValue) | Hash with type:, description:, default: |
| Parameter access | parameters.prefix | parameters.get("prefix") | parameters.get("prefix", String.class) | parameters["prefix"] (via keyword arg) |
| Prompt usage | parameters.main.build({ input: value }) | **parameters["main"].build(input=value) | Not applicable | Not applicable |
| Async | async/await | async/await | Synchronous | Synchronous |
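To make the table concrete, here is a minimal sketch in plain Python of how declared defaults and values chosen in the playground might combine into the parameters object your task reads. ParamSpec, resolve_parameters, and the sample parameter names are hypothetical and exist only for illustration. This is not the Braintrust SDK's implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ParamSpec:
    """Hypothetical stand-in for one declared parameter (not a Braintrust class)."""
    description: str
    default: Any


def resolve_parameters(specs: Dict[str, ParamSpec], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return the values a task would see: overrides win, defaults fill the rest."""
    unknown = sorted(set(overrides) - set(specs))
    if unknown:
        raise KeyError(f"Unknown parameters: {unknown}")
    return {name: overrides.get(name, spec.default) for name, spec in specs.items()}


# Assumed example declarations; the names and defaults are made up.
SPECS = {
    "prefix": ParamSpec("Text prepended to every input", "this is a math problem"),
    "include_prefix": ParamSpec("Whether to prepend the prefix", True),
}


def task(input_text: str, parameters: Dict[str, Any]) -> str:
    # Mirrors the Python access pattern from the table: parameters.get("prefix")
    if parameters.get("include_prefix"):
        return f'{parameters.get("prefix")}: {input_text}'
    return input_text
```

Calling task("2+2", resolve_parameters(SPECS, {"prefix": "solve"})) uses the overridden prefix, while an empty override dictionary falls back to every declared default.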
2. Expose the eval server
Run your eval with the bt eval --dev flag to start a local server. By default, the server listens at http://localhost:8300. Configure the host and port:
- --dev-host DEV_HOST: The host to bind to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces (be cautious about security when exposing beyond localhost).
- --dev-port DEV_PORT: The port to bind to. Defaults to 8300.
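Before registering the endpoint, it can help to confirm that something is listening on the chosen host and port. The following is a minimal sketch using only Python's standard library. is_listening is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the process behind the port is a Braintrust eval server.

```python
import socket


def is_listening(host: str = "localhost", port: int = 8300, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Defaults match the documented dev server address.
    state = "reachable" if is_listening() else "not reachable"
    print(f"localhost:8300 is {state}")
```

If you start the server with a different --dev-host or --dev-port, pass the same values to the helper.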
3. Configure in your project
To make your eval accessible beyond localhost, add the endpoint to your project:
- In your project, go to Settings > Remote evals.
- Select Create remote eval source.
- Enter the name and URL of your remote eval server.
- Select Create remote eval source.
4. Run from a playground
- Open a playground in your project.
- Select + Task.
- Choose Remote eval from the task type list.
- Select your eval and configure parameters using the UI controls.
- Provide data inline or select a dataset, optionally add scorers, and click Run.
Run a sandbox eval
Run evals in an isolated cloud sandbox, controlled from Braintrust. Push an execution artifact once and Braintrust invokes it on demand from the playground, with no server to keep running. Braintrust supports two sandbox providers:
- Lambda: AWS Lambda-based. The default for braintrust push. Supports both Python and TypeScript. No extra configuration needed.
- Modal: Container-based via Modal. Requires a pre-built Modal container image. Executes TypeScript evals only.
1. Write your eval
A sandbox eval looks like a standard eval call with a parameters field that defines configurable options. These can be inline parameters, defined directly in your eval code, or saved parameters, created separately and loaded with loadParameters(). Both inline and saved parameters become UI controls in the playground.
Install the SDK and dependencies:
Sandboxes require TypeScript SDK v3.7.1+ or Python SDK v0.12.1+.
| Feature | TypeScript | Python |
|---|---|---|
| Model selector | type: "model" | type: "model" |
| Prompt parameters | type: "prompt" with messages array in default | type: "prompt" with nested prompt.messages and options |
| Scalar types | Zod schemas: z.string(), z.boolean(), z.number() with .describe() | Pydantic models with Field(description=...) |
| Parameter access | parameters.prefix | parameters.get("prefix") |
| Prompt usage | parameters.main.build({ input: value }) | **parameters["main"].build(input=value) |
| Async | async/await | async/await |
2. Register your sandbox
Sandbox registration uses the Braintrust SDK CLI (braintrust push / npx braintrust push). The bt CLI does not yet support sandbox evals.
Supported Lambda runtimes: Python 3.8–3.13 and Node.js 18, 20, 22. Pushing with an unsupported version returns an error listing the supported versions.
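To catch an unsupported runtime before pushing, you could run a small pre-flight check against the versions listed above. This is a sketch, not part of the Braintrust CLI. python_supported and node_supported are hypothetical helpers, and the version lists are copied from this page and would need updating if support changes.

```python
import sys
from typing import Optional, Tuple

# Versions taken from the list above: Python 3.8 through 3.13, Node.js 18, 20, 22.
SUPPORTED_PYTHON_MINORS = range(8, 14)
SUPPORTED_NODE_MAJORS = (18, 20, 22)


def python_supported(version: Optional[Tuple[int, int]] = None) -> bool:
    """Check a (major, minor) pair, defaulting to the running interpreter."""
    if version is None:
        version = (sys.version_info[0], sys.version_info[1])
    major, minor = version
    return major == 3 and minor in SUPPORTED_PYTHON_MINORS


def node_supported(version_string: str) -> bool:
    """Check a Node.js version string such as 'v20.11.1' (the format node --version prints)."""
    major = int(version_string.strip().lstrip("v").split(".")[0])
    return major in SUPPORTED_NODE_MAJORS


if __name__ == "__main__":
    print("Current Python supported:", python_supported())
```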
3. Run from a playground
- Open a playground in your project.
- Select + Task.
- Open the Remote eval submenu and select your sandbox.
- Select your eval and configure parameters using the UI controls.
- Provide data inline or select a dataset, optionally add scorers, and click Run.
Limitations
- The dataset defined in your eval is ignored when running from the playground. Datasets are managed through the playground.
- Scorers defined in your eval are concatenated with scorers added in the playground.
- For Lambda sandboxes, each individual invocation is capped at 15 minutes, but Braintrust manages dataset iteration outside the Lambda, so large dataset evals are not constrained by that limit.
- For Modal sandboxes, each sandbox is capped at a 60-minute lifetime by default. Self-hosted deployments can override this with the MODAL_SANDBOX_TIMEOUT_S environment variable (in seconds).
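The first two limitations can be pictured with a small sketch in plain Python. effective_run and the dictionary keys are hypothetical and do not come from the Braintrust SDK. The sketch also assumes that scorers defined in the eval come before those added in the playground, an ordering this page does not specify.

```python
from typing import Any, Dict


def effective_run(eval_def: Dict[str, Any], playground: Dict[str, Any]) -> Dict[str, Any]:
    """Model which dataset and scorers apply when a run starts from the playground."""
    return {
        # The dataset defined in the eval is ignored; the playground's dataset is used.
        "data": playground["data"],
        # Scorers from both sources are concatenated (order is an assumption).
        "scores": list(eval_def.get("scores", [])) + list(playground.get("scores", [])),
    }


# Illustrative values only; scorers are represented by name for simplicity.
eval_def = {"data": ["row defined in code"], "scores": ["exact_match"]}
playground = {"data": ["row from playground dataset"], "scores": ["levenshtein"]}
run = effective_run(eval_def, playground)
```

Here run["data"] holds only the playground rows, and run["scores"] contains both scorers.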
Next steps
- Test prompts and models without custom code
- Create parameters to manage configurable settings
- Interpret results from your experiments