[dblineage] Gathering User Data - lelouvincx's second brain

https://www.reddit.com/r/dataengineering/comments/1hq9dwl/complexity_of_data_transformations_and_lineage
- If you’re working with data as a primary focus, part of the job (a big one) is documenting what you’re doing and validating what you touch before shipping it.
- I’m confident that my systems are and remain correct because I confirm the state of things before starting new work and document what I did, it adds like an hour to a project.
https://www.reddit.com/r/dataengineering/comments/10usa5i/looking_for_an_opensource_data_lineage_app_where/
- Context: company has been documenting all its data objects manually and has a large CSV explicitly showing each data object and its predecessor(s). These aren’t just the standard database/workflow/dashboard objects; these include things like Power Automate scripts. I’m just looking for a good way to show everything in a map, visualize them, and navigate through their connections properly. At this point, I’ll even be happy with a pure visualization engine, for instance if I can repurpose kedro-viz or dbt’s lineage visualizer so that it can take a CSV or JSON of object relationships as an input. Even a custom Power BI visualization or Python graph frontend would be fine, but I can’t seem to find one that works. I’d also be happy if any of the lineage tools I mentioned above have this functionality and I just missed it.
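A minimal sketch of the "CSV of object relationships in, picture out" idea from this comment, using only the Python standard library. The CSV layout (an `object` column and a `;`-separated `predecessors` column) and the sample object names are assumptions for illustration, not the commenter's actual file. The output is Graphviz DOT text, which any DOT renderer can draw.

```python
import csv
import io

# Hypothetical CSV layout: one row per object, predecessors separated by ";".
SAMPLE_CSV = """object,predecessors
raw_orders,
clean_orders,raw_orders
power_automate_export,clean_orders
sales_dashboard,clean_orders;power_automate_export
"""


def read_edges(csv_text):
    """Return (predecessor, object) pairs parsed from the CSV text."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = row["object"].strip()
        for source in (row["predecessors"] or "").split(";"):
            source = source.strip()
            if source:  # skip objects that have no predecessor
                edges.append((source, target))
    return edges


def to_dot(edges):
    """Render the edges as Graphviz DOT text, laid out left to right."""
    lines = ["digraph lineage {", "  rankdir=LR;"]
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


edges = read_edges(SAMPLE_CSV)
dot_text = to_dot(edges)
print(dot_text)
```

The design choice here is to keep parsing and rendering separate, so the same edge list could feed a different frontend (a Python graph library, a BI custom visual) without touching the CSV reader.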
https://www.reddit.com/r/dataengineering/comments/1ba4g7v/how_to_diagram_sql_queries/
- Love dbdiagram :) I’m also using dbdocs as a lightweight data catalog instead of plain dbt docs. While I do find dbt docs useful for data lineage, I’ve discovered that I can achieve the same functionality through my dbt Core setup using the dbt Power User VSCode extension. And dbdocs fills in the gaps: ERD, table metadata, easy to deploy, shareable… it covers almost 90% of my needs.
  - => implying dbdocs isn’t the lineage surface
https://www.reddit.com/r/SQL/comments/nxbtxb/tools_to_draw_data_lineage/
- Good for data models and showing direct relationships between tables, but it doesn’t show data flows. When you visualize data flows, you want to see data from which table ends up where, it’s different than “Column X is an FK to Column Y”
- Will it be possible for dbdiagram to show how several columns are transformed into one column?
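To make the "several columns transformed into one column" case concrete, here is a small sketch of column-level lineage as plain data, which is a different thing from an FK relationship in an ERD. The table and column names are hypothetical, and the sketch assumes the lineage map has no cycles.

```python
# Hypothetical column-level lineage: target column -> columns it is derived from.
COLUMN_LINEAGE = {
    "dim_customer.full_name": ["stg_customer.first_name", "stg_customer.last_name"],
    "stg_customer.first_name": ["raw_crm.fname"],
    "stg_customer.last_name": ["raw_crm.lname"],
}


def root_sources(column, lineage):
    """Return the columns with no recorded upstream that feed `column`.

    Assumes the lineage map is acyclic; a cycle would recurse forever.
    """
    parents = lineage.get(column, [])
    if not parents:
        return {column}
    roots = set()
    for parent in parents:
        roots |= root_sources(parent, lineage)
    return roots


roots = root_sources("dim_customer.full_name", COLUMN_LINEAGE)
print(sorted(roots))  # ['raw_crm.fname', 'raw_crm.lname']
```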
https://www.reddit.com/r/dataengineering/comments/1kyi6hx/what_do_you_use_for_lineage_and_why
- i’ve used a bunch of these. real talk: data lineage is overrated at early stages & often overcomplicated. when ur team is < 10, physical lineage diagrams on a whiteboard + good dbt docs get you 80% there. we started with DBT lineage for our first year which did the job, then built custom lineage in Preswald when we needed more flexibility (needed to include non-dbt systems). the problem with most enterprise lineage tools is they force you into their ecosystem - great for huge teams with dedicated resources, massive overkill for startups. your investment should match your problems - if ur just trying to debug why a dashboard broke, dbt docs are prob fine. if ur trying to comply with SOX, yea get OpenLineage or something heavy duty.
- Hey, I also want an open-source tool for automated data lineage for my company, which we can integrate into our product, a data security product. I am going through OpenMetadata but finding it difficult. Can you suggest any lightweight, easy-to-use open-source tool that can be used for automated lineage? I went through many tools online like DataHub, Collate, Informatica, etc. Most sites and GPTs suggested using OpenMetadata. What is your recommendation?
https://www.reddit.com/r/dataengineering/comments/1ijr4jd/what_data_lineage_tools_do_you_use_and_what_makes/
- I’ve been working with OpenLineage lately and I like it a lot. It’s an open standard for collecting lineage data. Great community and they are very open to PRs and new features. Not a ton of integrations right now, but they have most of the big ones.
- I’ve been looking closely at DataHub for a while now, and I think I’ll be using it in upcoming projects. It’s an open-source tool with a managed version by Acryl. It does a bit more than just data lineage too, so it may be overkill for what you’re after.
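For a feel of what "an open standard for collecting lineage data" looks like on the wire, below is a rough sketch that builds an OpenLineage-style run event as a plain dictionary, with no client library involved. The top-level field names follow my reading of the OpenLineage spec; the producer URI, namespaces, job and dataset names are all hypothetical, and the schema version inside `SCHEMA_URL` is an assumption that should be checked against the spec release actually in use.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values (assumptions): replace with real identifiers.
PRODUCER = "https://example.com/my-pipeline"
# The version segment below is an assumption; check the published spec.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a run event dict; inputs/outputs are (namespace, name) pairs."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example_namespace", "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


event = build_run_event(
    job_name="daily_orders_load",
    inputs=[("example_warehouse", "raw.orders")],
    outputs=[("example_warehouse", "analytics.orders")],
)
print(json.dumps(event, indent=2))
```

A lineage backend would typically receive events like this over HTTP; how they are sent and which facets are attached depends on the integration, so that part is left out here.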
What are the shortcomings of current data lineage tools?
- Do the current lineage tools address data audit needs?
- https://www.reddit.com/r/dataengineering/comments/1gjzsu7/what_are_the_short_comes_of_current_data_lineage/
- The field is pretty crowded and most of the data platforms are already providing lineage out of the box.
- Made our own. That said, 90% of our platform is custom PySpark code running on AWS, Databricks, or Azure. No commercial offering covers that, but they could :). We hooked the backend into our internal LLM bot, so now users can just Slack into it. No commercial vendor would let you do that; they would sell it to you as an add-on. Plus, we are a global brand, we shared our code with other sister brands, and we all exchange internal features.
- Bugs everywhere.
- We use Collibra for Snowflake lineage. Coverage of SQL syntax is OK, but it’s very buggy and hard to manage. No proper APIs for lineage means manual management. Another issue is that it works by scraping the query logs in Snowflake for a period of time, so it can produce confusing results after code changes.
Is data lineage one of the most underrated things in DE?
- https://www.reddit.com/r/dataengineering/comments/1g8k2h5/is_data_lineage_one_of_the_most_underrated_thing/
- I worked for multiple companies as a DE and zero of them applied anything related to data lineage. Whenever my team mentions it would be important to do this it gets ignored.
- If they don’t do documentation, I wouldn’t even expect half of them to know what data lineage even is.
- Data lineage is one of those things no one thinks they need… until they do. Like when you are debugging why a multi-system process or ETL isn’t working. The question of, “where did this data come from” comes up and now you are wasting time trying to find that out. It really sucks if it passes through multiple systems or multiple formats. (ODBC and JDBC are really sneaky like that.) Be the person that documents their stuff and allocate time for it. It will be an uphill battle because documentation is one of the first things thrown overboard when the inevitable money/time crunch shows up.
- This question keeps me up at night, since I’m in the process of building a POC database engine that has cell-level data lineage, forwards and backwards. I’ve been in data over 20 years. Most in DE supporting analytics. I’ve NEVER been somewhere that had robust data lineage. It drove me nuts enough to spend years dreaming up a robust solution. Why don’t places care? As someone who wants to open source something and launch a business around it, it drives me nuts. Am I crazy for finding data lineage fundamental? I don’t think the current gen of tools are there. I don’t think OpenLineage is good enough. It’s progress. (I guess that’s why I’m building my own.) I haven’t used Dagster but anything that doesn’t preserve transaction logs in a way that syncs up with time travel in a consistent way, to me, just isn’t good enough. The major downside of my approach is you only get lineage inside my engine. That’s probably a non-starter for many places, especially those big enough to be early adopters. IDK.
https://www.reddit.com/r/dataengineering/comments/1g3e20y/data_lineage/
- How do you all like to track dataset lineages? Dependencies between tables, sources/sinks per job, something like Kafka to a Spark written Iceberg table joined with another table to eventually landing in Snowflake… etc?
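One lightweight way to track "sources/sinks per job" is to record each job's inputs and outputs and derive dataset dependencies from that. The sketch below mirrors the Kafka to Spark to Iceberg to Snowflake example from the question; the job and dataset names are hypothetical, and this is an illustration of the idea rather than how any particular lineage tool stores it.

```python
from collections import deque

# Hypothetical jobs, each listing the datasets it reads and writes.
JOBS = [
    {"name": "spark_ingest", "inputs": ["kafka.events"], "outputs": ["iceberg.events"]},
    {
        "name": "spark_join",
        "inputs": ["iceberg.events", "iceberg.users"],
        "outputs": ["iceberg.enriched_events"],
    },
    {
        "name": "snowflake_load",
        "inputs": ["iceberg.enriched_events"],
        "outputs": ["snowflake.enriched_events"],
    },
]


def upstream_datasets(dataset, jobs):
    """Breadth-first walk from `dataset` back through the jobs that wrote it."""
    producers = {}  # dataset name -> jobs that write it
    for job in jobs:
        for out in job["outputs"]:
            producers.setdefault(out, []).append(job)

    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for job in producers.get(current, []):
            for source in job["inputs"]:
                if source not in seen:  # also guards against cycles
                    seen.add(source)
                    queue.append(source)
    return seen


upstream = upstream_datasets("snowflake.enriched_events", JOBS)
print(sorted(upstream))
```

Recording lineage per job rather than per table keeps the "which job touched this" question answerable, which is usually what matters when debugging a broken pipeline.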
https://www.reddit.com/r/dataengineering/comments/1cvmerf/data_lineage_tools/
- OP is describing the exact use case for OpenLineage, but it’s hard to estimate how complete their lineage graph would be without knowing more about their tooling. OL will give you column lineage for Spark and Airflow jobs. dbt is supported as well.
- There are open source catalogs, like DataHub, but data lineage in it is extremely limited. So they do exist, but most likely will not suit your needs. Then you have paid products like Informatica’s data catalog, which is out of scope. They support more or less everything.
- Use SQLMesh, it has lineage, diffs, etc.
- I was pushing for OpenMetadata at my last job, lineage being one of the selling points. I never got it deployed.
https://www.reddit.com/r/dataengineering/comments/1iddujm/data_lineage_and_quality_tool/
- I’m exploring OpenMetadata for data quality, governance, and lineage. While I’m not necessarily opposed to containerized deployments, I’m prioritizing ease of use, especially when it comes to automated data lineage and quality testing. I’m looking for alternative tools that might be more convenient to work with in these specific areas. Are there any tools that are considered “better” than OpenMetadata in terms of simplifying the process of setting up and managing automated data lineage and quality tests? Any recommendations would be greatly appreciated!
- SQLMesh is a solid tool for managing transformations, plus you get column level lineage of your models as a part of the open source offering.

Chinh (lelouvincx) / [dblineage] Gathering User Data