Create an evaluation

Click + next to Evaluations in the sidebar, or use the Evaluations dropdown in a prompt’s or workflow’s toolbar to create one directly from what you’re testing. An evaluation is a workflow with two extra features: a dataset and System Under Test (SUT) marking.

Build your dataset

Switch to the Dataset tab in the bottom pane. This is a spreadsheet-style editor: add rows and columns, edit cells directly, or import from CSV. Each column can map to an input on your evaluation workflow. When the evaluation runs, each row becomes a separate run with those values as inputs.
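The row-to-run mapping can be sketched as plain CSV handling. This is a minimal illustration, not the product's actual import code; the column names (`question`, `expected_topic`) are hypothetical:

```python
import csv
import io

# Hypothetical dataset: each column maps to a workflow input,
# and each row becomes one evaluation run.
dataset_csv = io.StringIO(
    "question,expected_topic\n"
    "What is the capital of France?,geography\n"
    "Explain photosynthesis.,biology\n"
)

rows = list(csv.DictReader(dataset_csv))

# One run per row; the row's cell values are the run's inputs.
runs = [{"inputs": dict(row)} for row in rows]
print(len(runs))  # → 2
```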

Multi-turn testing

For testing conversational prompts and workflows, mark a column as a thread identifier using the column header’s kebab menu. Rows with the same thread value will run against the same conversation thread, simulating a multi-turn interaction.
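Conceptually, rows are grouped by the thread column before running, so each group plays out as successive turns of one conversation. A rough sketch, with a hypothetical `thread` column and `message` input:

```python
from collections import defaultdict

# Hypothetical rows; "thread" is the column marked as the thread identifier.
rows = [
    {"thread": "t1", "message": "Hi, I need help with my order."},
    {"thread": "t1", "message": "It was order #1234."},
    {"thread": "t2", "message": "How do I reset my password?"},
]

# Rows sharing a thread value become turns in the same conversation.
threads = defaultdict(list)
for row in rows:
    threads[row["thread"]].append(row["message"])

print(len(threads["t1"]))  # → 2 (a two-turn conversation)
print(len(threads["t2"]))  # → 1 (a single-turn conversation)
```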

Mark what you’re testing

On the canvas, select a Prompt or Workflow node and mark it as the System Under Test (look for the beaker icon in the node’s hover toolbar). Token usage, cost, and duration metrics count only SUT nodes, so your measurements reflect what you’re actually testing rather than the evaluation scaffolding around it.
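The SUT filter amounts to summing metrics over marked nodes only. A hypothetical sketch (the node names and metric fields are illustrative, not the product's data model):

```python
# Hypothetical per-node metrics from one run; only nodes marked as the
# System Under Test contribute to the reported totals.
node_runs = [
    {"node": "fetch_context", "sut": False, "tokens": 500,  "cost": 0.002},
    {"node": "main_prompt",   "sut": True,  "tokens": 1200, "cost": 0.010},
    {"node": "judge_prompt",  "sut": False, "tokens": 800,  "cost": 0.006},
]

# Scaffolding nodes (context fetching, judging) are excluded.
sut_tokens = sum(r["tokens"] for r in node_runs if r["sut"])
sut_cost = sum(r["cost"] for r in node_runs if r["sut"])
print(sut_tokens)  # → 1200
```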

Add assertions

Wire Assertion nodes into your workflow to define what “good” looks like. Pick an operator, set the expected condition, and assign a weight (0 to 1). The evaluation’s aggregate score is a weighted average of all assertion outcomes.
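The aggregate score described above is a weighted average: each assertion contributes its weight when it passes and nothing when it fails, normalized by the total weight. A minimal sketch with made-up assertion names:

```python
# Hypothetical assertion outcomes: passed counts as 1, failed as 0,
# and each assertion carries a weight between 0 and 1.
assertions = [
    {"name": "contains_answer", "weight": 1.0, "passed": True},
    {"name": "under_200_words", "weight": 0.5, "passed": False},
    {"name": "polite_tone",     "weight": 0.5, "passed": True},
]

# Weighted average: sum of passing weights over total weight.
score = sum(a["weight"] * a["passed"] for a in assertions) / sum(
    a["weight"] for a in assertions
)
print(score)  # → 0.75, i.e. (1.0 + 0 + 0.5) / 2.0
```

Giving a critical assertion weight 1.0 and stylistic checks lower weights keeps one cosmetic failure from sinking the score.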

Run the evaluation

Hit the Run button in the bottom pane. Every dataset row becomes a workflow run. Watch the results appear in the Run tab: inputs, outputs, assertion results, and the aggregate score. Switch between past evaluation batches using the dropdown next to the tabs. Compare how different prompt versions perform against the same dataset.