Chinh (lelouvincx) / 2025-08-12

Created Tue, 12 Aug 2025 00:00:00 +0000 Modified Mon, 25 May 2026 06:02:25 +0000

Note
- https://openai.com/index/introducing-swe-bench-verified/
- Revise the goals of benchmarking:
  - P1 Internal: Track our AI’s improvement.
  - P1 External: Communicate results (first to internal Holistics stakeholders: Sales team, Product team; then to prospects & customers).
- The work is divided into several tasks:
  - Creating test cases - I’m working on this.
  - Evaluation method.
  - Evaluation pipeline.
  - Presenting evaluation results.
- Goals of creating a test suite for benchmarking:
  - Make a gold standard for evaluating agents’ ability to work on data-related business problems, rather than just generating SQL/AQL snippets (as text-to-SQL models do).
  - Have a measurable indicator of how good our AI is (e.g. better support for analytics techniques such as running total, percent of total, and trend analysis; fewer lines of code) that correlates with real data analyst productivity.
  - Make it easy to compare Holistics’s AI with ChatGPT, text-to-SQL models, or competitors (?).
  - Identify capability gaps (where all agents fail).
- Expected one-line pitch:
  - Our latest AI agent can now successfully handle 95% of common analytical patterns like Period-over-Period comparisons and 85% of complex multi-step Cohort analyses, representing a 20% quarterly improvement in advanced analytical support, according to our benchmark dataset.
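
A minimal sketch of how the numbers behind such a pitch, and the capability gaps mentioned above, could be computed from raw evaluation results. Everything here is an assumption for illustration: the `Result` record, the agent and category labels, and the sample data are hypothetical and do not describe the actual evaluation pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    agent: str      # hypothetical agent label, e.g. "holistics-ai"
    case_id: str    # hypothetical test case identifier
    category: str   # analytical pattern, e.g. "period-over-period"
    passed: bool    # whether the agent's answer was judged correct


def pass_rate_by_category(results, agent):
    """Return {category: fraction of cases passed} for one agent."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.agent == agent:
            totals[r.category] += 1
            passes[r.category] += int(r.passed)
    return {c: passes[c] / totals[c] for c in totals}


def capability_gaps(results):
    """Return the sorted ids of test cases that no agent passed."""
    any_pass = defaultdict(bool)
    for r in results:
        any_pass[r.case_id] = any_pass[r.case_id] or r.passed
    return sorted(c for c, ok in any_pass.items() if not ok)


# Made-up sample data, for illustration only.
SAMPLE = [
    Result("holistics-ai", "pop-1", "period-over-period", True),
    Result("holistics-ai", "pop-2", "period-over-period", True),
    Result("holistics-ai", "cohort-1", "cohort", False),
    Result("baseline", "pop-1", "period-over-period", True),
    Result("baseline", "pop-2", "period-over-period", False),
    Result("baseline", "cohort-1", "cohort", False),
]

if __name__ == "__main__":
    print(pass_rate_by_category(SAMPLE, "holistics-ai"))
    print(capability_gaps(SAMPLE))
```

Keeping one flat record per (agent, test case) pair makes both views cheap to derive: per-category pass rates for tracking improvement over time, and the set of cases every agent fails for spotting gaps.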
Done
- DONE Run mock server to fix AQL queries