Chinh (lelouvincx) / 2025-08-12

Created Tue, 12 Aug 2025 00:00:00 +0000 Modified Mon, 25 May 2026 06:02:25 +0000
  • Note

    • https://openai.com/index/introducing-swe-bench-verified/
    • Revise the goals of benchmarking:
      • P1 Internal: Track our AI’s improvement.
      • P1 External: Communicate results (first to internal Holistics stakeholders such as the Sales and Product teams, then to prospects and customers).
    • The work breaks down into several tasks:
      • Creating test cases - I’m working on this.
      • Evaluation method.
      • Evaluation pipeline.
      • Presenting evaluation results.
    • Goals of creating the test suite for benchmarking:
      • Make a gold standard for evaluating agents’ ability to work with data-related business problems, rather than generating SQL/AQL snippets (like text-to-SQL models do).
      • Have a measurable indicator of how good our AI is (better support for analytics techniques such as running total, percent of total, and trend analysis; fewer lines of code) that correlates with real data analyst productivity.
      • Make it easy to compare Holistics’s AI with ChatGPT, text-to-SQL models, or competitors (?).
      • Identify capability gaps (where all agents fail).
    • Expected one-line pitch:
      • Our latest AI agent can now successfully handle 95% of common analytical patterns like Period-over-Period comparisons and 85% of complex multi-step Cohort analyses, representing a 20% quarterly improvement in advanced analytical support, according to our benchmark dataset.
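
A minimal sketch of how the numbers behind such a pitch and the capability-gap goal could be computed, assuming each evaluation run yields one pass/fail record per agent and test case. The `Result` record, the category labels, the agent names, and the sample rows are all hypothetical placeholders, not real benchmark data or the actual pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One pass/fail outcome (hypothetical record shape)."""
    agent: str      # which agent was evaluated
    case_id: str    # test case identifier
    category: str   # analytical pattern the case belongs to
    passed: bool


def pass_rate_by_category(results, agent):
    """Fraction of passed cases per category for a single agent."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.agent != agent:
            continue
        totals[r.category] += 1
        passes[r.category] += r.passed  # True counts as 1
    return {c: passes[c] / totals[c] for c in totals}


def capability_gaps(results):
    """Case ids that every evaluated agent failed."""
    passed_by_any = defaultdict(bool)
    for r in results:
        passed_by_any[r.case_id] = passed_by_any[r.case_id] or r.passed
    return sorted(c for c, ok in passed_by_any.items() if not ok)


# Placeholder rows for illustration only.
results = [
    Result("agent_a", "pop-01", "period_over_period", True),
    Result("agent_a", "pop-02", "period_over_period", False),
    Result("agent_a", "cohort-01", "cohort", False),
    Result("agent_b", "pop-01", "period_over_period", True),
    Result("agent_b", "pop-02", "period_over_period", True),
    Result("agent_b", "cohort-01", "cohort", False),
]

rates_a = pass_rate_by_category(results, "agent_a")
gaps = capability_gaps(results)
```

Tracking `pass_rate_by_category` across releases would cover the internal goal (improvement over time), while `capability_gaps` lists the cases where all agents fail.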

  • Done

    • DONE Run mock server to fix AQL queries