Note
- https://openai.com/index/introducing-swe-bench-verified/
- Revise the goals of benchmarking:
P1 Internal: Track our AI’s improvement.
P1 External: Communicate results (first to internal Holistics stakeholders: Sales team, Product team; then to prospects & customers).
- The work is divided into several tasks:
- Creating test cases - I’m working on this.
- Evaluation method.
- Evaluation pipeline.
- Presenting evaluation results.
- Goals of creating a test suite for benchmarking:
- Establish a gold standard for evaluating agents’ ability to solve data-related business problems, rather than just generate SQL/AQL snippets (as text-to-SQL models do).
- Have a measurable indicator of how good our AI is (better support for analytics techniques such as running total, percent of total, and trend analysis; fewer lines of code) that correlates with real data analyst productivity.
- Make it easy to compare Holistics’s AI with ChatGPT, text-to-SQL models, or competitors (?).
- Identify capability gaps (where all agents fail).
- Expected one-line pitch:
Our latest AI agent can now successfully handle 95% of common analytical patterns like Period-over-Period comparisons and 85% of complex multi-step Cohort analyses, representing a 20% quarterly improvement in advanced analytical support, according to our benchmark dataset.
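A rough sketch of how the numbers in a pitch like this could be derived from benchmark runs. Everything here is an assumption for illustration: the category names, the `(category, passed)` result shape, and the choice to read "quarterly improvement" as a relative change in pass rate (it could equally mean percentage points). The sample data is made up and is not a real result.

```python
from collections import defaultdict

# Hypothetical results: one (category, passed) pair per benchmark test case.
sample_results = [
    ("period_over_period", True),
    ("period_over_period", True),
    ("period_over_period", False),
    ("cohort_analysis", True),
    ("cohort_analysis", False),
]


def pass_rates(results):
    """Return the fraction of passed test cases per analytical category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


def relative_improvement(previous_rate, current_rate):
    """Relative change between two pass rates (assumed meaning of 'improvement')."""
    if previous_rate == 0:
        raise ValueError("previous_rate must be non-zero for a relative change")
    return (current_rate - previous_rate) / previous_rate


if __name__ == "__main__":
    for category, rate in pass_rates(sample_results).items():
        print(f"{category}: {rate:.0%}")
```

Reporting per-category rates rather than one overall score would also help with the "identify capability gaps" goal, since a category where every agent scores low stands out directly.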
Done
- DONE Run mock server to fix AQL queries
Created Tue, 12 Aug 2025 00:00:00 +0000
Modified Mon, 25 May 2026 06:02:25 +0000