Click + next to Evaluations in the sidebar, or use the Evaluations dropdown in a prompt’s or workflow’s toolbar to create one directly from what you’re testing. An evaluation is a workflow with two extra features: a dataset and System Under Test (SUT) marking.
Switch to the Dataset tab in the bottom pane. This is a spreadsheet-style editor: add rows and columns, edit cells directly, or import from CSV. Each column can map to an input on your evaluation workflow. When the evaluation runs, each row becomes a separate run with those values as inputs.
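The row-to-run fan-out can be pictured as a loop over the dataset. This is a hypothetical sketch, not the tool's actual API; the `run_workflow` function and the column names are stand-ins for illustration:

```python
# Hypothetical sketch: columns map to workflow input names,
# and each dataset row becomes one independent run.
dataset = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]

def run_workflow(inputs):
    # Stand-in for invoking the evaluation workflow with one row's values.
    return {"output": f"answer to: {inputs['question']}"}

runs = [run_workflow(row) for row in dataset]
print(len(runs))  # one run per dataset row
```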
For testing conversational prompts and workflows, mark a column as a thread identifier using the column header’s kebab menu. Rows with the same thread value will run against the same conversation thread, simulating a multi-turn interaction.
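The thread-identifier behavior amounts to grouping rows by their thread value and replaying each group against one conversation. A minimal sketch of that grouping, with made-up column names:

```python
from collections import defaultdict

# Hypothetical sketch: rows sharing a thread value are grouped,
# then each group's messages run sequentially in the same conversation.
rows = [
    {"thread": "t1", "message": "My name is Ada."},
    {"thread": "t1", "message": "What is my name?"},  # sees prior turn
    {"thread": "t2", "message": "Hello"},             # fresh conversation
]

threads = defaultdict(list)
for row in rows:
    threads[row["thread"]].append(row["message"])

print({t: len(msgs) for t, msgs in threads.items()})
```

Rows without a shared thread value each get their own single-turn conversation, as before.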
On the canvas, select a Prompt or Workflow node and mark it as the System Under Test (look for the beaker icon in the node’s hover toolbar). Token usage, cost, and duration metrics will only count SUT nodes, so your measurements reflect what you’re actually testing rather than the evaluation scaffolding around it.
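The effect of SUT marking on metrics can be sketched as a filter before aggregation. The node names and fields below are hypothetical, not the tool's real data model:

```python
# Hypothetical sketch: only SUT-marked nodes count toward metrics,
# so scaffolding (assertions, judges, data prep) doesn't skew them.
node_runs = [
    {"node": "prompt-under-test", "sut": True,  "tokens": 420, "cost": 0.003},
    {"node": "judge-assertion",   "sut": False, "tokens": 900, "cost": 0.010},
]

sut_runs = [r for r in node_runs if r["sut"]]
total_tokens = sum(r["tokens"] for r in sut_runs)
total_cost = sum(r["cost"] for r in sut_runs)
print(total_tokens, total_cost)  # excludes the judge node entirely
```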
Wire Assertion nodes into your workflow to define what “good” looks like. Pick an operator, set the expected condition, and assign a weight (0 to 1). The evaluation’s aggregate score is a weighted average of all assertion outcomes.
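The aggregate score described above is a weighted average, which can be written out directly. A minimal sketch, assuming a pass counts as 1.0 and a fail as 0.0:

```python
# Weighted average of assertion outcomes: each assertion contributes
# its outcome (pass=1.0, fail=0.0) scaled by its weight in [0, 1].
assertions = [
    {"passed": True,  "weight": 1.0},
    {"passed": False, "weight": 0.5},
    {"passed": True,  "weight": 0.5},
]

score = (
    sum(a["weight"] * (1.0 if a["passed"] else 0.0) for a in assertions)
    / sum(a["weight"] for a in assertions)
)
print(score)  # (1.0 + 0.0 + 0.5) / 2.0 = 0.75
```

A higher weight makes an assertion's pass or fail move the aggregate score more.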
Hit the Run button in the bottom pane. Every dataset row becomes a workflow run. Watch the results appear in the Run tab: inputs, outputs, assertion results, and the aggregate score. Switch between past evaluation batches using the dropdown next to the tabs. Compare how different prompt versions perform against the same dataset.