Notes
Tasks
Done
- {{embed ((69671001-f0b8-41c0-9f99-f9706404e7cb))}}
- DONE Solve issue for Charlotte
Answer Charlotte
Hello Charlotte,
I’ve investigated your code. The root cause lies in your models’ relationship, which should be one-to-many instead of one-to-one.
You can checkout
feat/DATA-1122-candiates-allocatedto see the fix.In the dataset
marketingthere’re relationships as one-to-one between Practitioners and Strategy Sessions / Documentation Events (I assume it’s CV Upload) / Reference. When the query run, the database joined these tables, causing thePractitionerrows to duplicate the count.For example, Dr. Smith (1 Allocation) uploaded their CV 3 times: Row Count: Becomes 3 rows.
count(): Returns 3 (Incorrect).approx_count_distinct(): Returns 1 (Correct, but masks the underlying join issue).To fix this, I updated the data model to enforce one-to-many relationships and force the metrics to follow.
- Relationships
marketing.dataset.amlI changed the relationship syntax from one-to-one (-) to one-to-many (>). Please note that when you use one-to-one on a column, Holistics will always assume that your column are already unique in physical table/view.
// Before: Treated as 1-to-1 relationship(practitioner_core_documentation_events.practitioner_id - practitioner_practitioners.practitioner_id, true) // After: 1-to-Many relationship(practitioner_core_documentation_events.practitioner_id > practitioner_practitioners.practitioner_id, true)- Relationships
- I standardize the duration dimensions (Strategy Session, CV Upload, Reference 2 Added) to explicitly calculate the min timestamp just like what you did with Placement and Presentation.
// Before definition: @aql floor(date_diff('day', ..., practitioner_core_documentation_events.first_cv_upload_timestamp)/30) // After: Added min() and dimensionalize() definition: @aql floor(date_diff('day', ..., min(practitioner_core_documentation_events.first_cv_upload_timestamp))/30) | dimensionalize(practitioner_practitioners.practitioner_id) | with_relationships(practitioner_core_documentation_events.practitioner_id > practitioner_practitioners.practitioner_id)- I have implemented these changes. You should be able to use
approx_count_distinctandcountinterchangeably now. - Please check it out and let me know if it works!
DONE Reply Simon
Thank you, Simon. That’s works for me. However I believe it’s better for Anurag to take ownership of this moving forward. With my role is to review his work. Because he has better understanding of the live data (RDS) than I do. I’ve been working off the static dump schema, which is temporary.
Anurag, the primary goal is to get the tables available in Redshift. Assuming the schema hasn’t evolved much since 10/2025, please load all tables from the previous dump except PriceHistory. To keep it simple: you can load all tables except that one.
After that, if you’re worry about the airbyte setup, you can either document or give access. I’ll help review and consult.