Chinh (lelouvincx) / 2026-01-26

Created Mon, 26 Jan 2026 00:00:00 +0000 Modified Mon, 25 May 2026 06:02:25 +0000
378 Words

Notes

  • Tasks

  • Done

    • {{embed ((69671001-f0b8-41c0-9f99-f9706404e7cb))}}
    • DONE Solve issue for Charlotte
  • Answer Charlotte

  • Hello Charlotte,

    I’ve investigated your code. The root cause lies in your models’ relationship, which should be one-to-many instead of one-to-one.

    You can checkout feat/DATA-1122-candiates-allocated to see the fix.

    In the dataset marketing there’re relationships as one-to-one between Practitioners and Strategy Sessions / Documentation Events (I assume it’s CV Upload) / Reference. When the query run, the database joined these tables, causing the Practitioner rows to duplicate the count.

    For example, Dr. Smith (1 Allocation) uploaded their CV 3 times: Row Count: Becomes 3 rows. count(): Returns 3 (Incorrect). approx_count_distinct(): Returns 1 (Correct, but masks the underlying join issue).

    To fix this, I updated the data model to enforce one-to-many relationships and force the metrics to follow.

      1. Relationships marketing.dataset.aml I changed the relationship syntax from one-to-one (-) to one-to-many (>). Please note that when you use one-to-one on a column, Holistics will always assume that your column are already unique in physical table/view.
      // Before: Treated as 1-to-1
      relationship(practitioner_core_documentation_events.practitioner_id - practitioner_practitioners.practitioner_id, true)
      
      // After: 1-to-Many
      relationship(practitioner_core_documentation_events.practitioner_id > practitioner_practitioners.practitioner_id, true)
      
      1. I standardize the duration dimensions (Strategy Session, CV Upload, Reference 2 Added) to explicitly calculate the min timestamp just like what you did with Placement and Presentation.
      // Before
      definition: @aql floor(date_diff('day', ..., practitioner_core_documentation_events.first_cv_upload_timestamp)/30)
      
      // After: Added min() and dimensionalize()
      definition: @aql floor(date_diff('day', ..., min(practitioner_core_documentation_events.first_cv_upload_timestamp))/30)
      | dimensionalize(practitioner_practitioners.practitioner_id)
      | with_relationships(practitioner_core_documentation_events.practitioner_id > practitioner_practitioners.practitioner_id)
      
    • I have implemented these changes. You should be able to use approx_count_distinct and count interchangeably now.
    • Please check it out and let me know if it works!
  • DONE Reply Simon

  • Thank you, Simon. That’s works for me. However I believe it’s better for Anurag to take ownership of this moving forward. With my role is to review his work. Because he has better understanding of the live data (RDS) than I do. I’ve been working off the static dump schema, which is temporary.

    Anurag, the primary goal is to get the tables available in Redshift. Assuming the schema hasn’t evolved much since 10/2025, please load all tables from the previous dump except PriceHistory. To keep it simple: you can load all tables except that one.

    After that, if you’re worry about the airbyte setup, you can either document or give access. I’ll help review and consult.