Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

  • 56 minutes 11 seconds
    An Opinionated Look At End-to-end Code Only Analytical Workflows With Bruin
    Summary
    The challenges of integrating all of the tools in the modern data stack have led to a new generation of tools that focus on a fully integrated workflow. At the same time, there have been many different opinions about how much of the workflow should be driven by code. Burak Karakan is of the opinion that a fully integrated workflow that is driven entirely by code offers a productive means of generating useful analytical outcomes. In this episode he shares how Bruin builds on those opinions and how you can use it to build your own analytics without having to cobble together a suite of tools with conflicting abstractions.
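    To give a feel for the code-only style discussed in this episode, here is a minimal sketch of a pipeline that mixes a SQL asset with a Python asset. This is a generic illustration in plain Python, not Bruin's actual asset format; consult the Bruin documentation for its real syntax.

    ```python
    # Hypothetical, minimal illustration of a code-only pipeline mixing SQL
    # and Python steps. This is NOT Bruin's real asset format.
    import sqlite3

    import pandas as pd  # assumed available for the Python transformation step

    # A SQL asset: aggregate raw events into daily counts.
    DAILY_COUNTS_SQL = """
        SELECT date(event_time) AS day, COUNT(*) AS events
        FROM raw_events
        GROUP BY date(event_time)
    """

    def enrich(df: pd.DataFrame) -> pd.DataFrame:
        """A Python asset: enrichment that is awkward to express in SQL."""
        df["is_weekend"] = pd.to_datetime(df["day"]).dt.dayofweek >= 5
        return df

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (event_time TEXT)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [("2024-11-09 10:00:00",), ("2024-11-10 11:00:00",), ("2024-11-11 09:30:00",)],
    )

    # Run the SQL asset, then hand its output to the Python asset.
    daily = pd.read_sql_query(DAILY_COUNTS_SQL, conn)
    print(enrich(daily))
    ```

    The appeal of this approach, as argued in the episode, is that every step lives in version-controlled code with a single set of abstractions, rather than being split across separate ingestion, transformation, and orchestration tools.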


    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm interviewing Burak Karakan about the benefits of building code-only data systems
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Bruin is and the story behind it?
      • Who is your target audience?
    • There are numerous tools that address the ETL workflow for analytical data. What are the pain points that you are focused on for your target users?
    • How does a code-only approach to data pipelines help in addressing the pain points of analytical workflows?
      • How might it act as a limiting factor for organizational involvement?
    • Can you describe how Bruin is designed?
      • How have the design and scope of Bruin evolved since you first started working on it?
    • You call out the ability to mix SQL and Python for transformation pipelines. What are the components that allow for that functionality?
      • What are some of the ways that the combination of Python and SQL improves ergonomics of transformation workflows?
    • What are the key features of Bruin that help to streamline the efforts of organizations building analytical systems?
    • Can you describe the workflow of someone going from source data to warehouse and dashboard using Bruin and Ingestr?
    • What are the opportunities for contributions to Bruin and Ingestr to expand their capabilities?
    • What are the most interesting, innovative, or unexpected ways that you have seen Bruin and Ingestr used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bruin?
    • When is Bruin the wrong choice?
    • What do you have planned for the future of Bruin?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    11 November 2024, 12:41 am
  • 47 minutes 36 seconds
    Feldera: Bridging Batch and Streaming with Incremental Computation
    Summary
    In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.
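    For listeners new to the idea, here is a conceptual sketch of incremental view maintenance in the spirit of DBSP, the theory behind Feldera. This is not Feldera's API (Feldera programs are written in SQL and the engine derives the incremental plan automatically); it only illustrates the core trick of consuming deltas instead of recomputing from scratch.

    ```python
    # Conceptual sketch of incremental computation: maintain the view
    # SELECT key, COUNT(*) ... GROUP BY key from a stream of weighted deltas.
    from collections import defaultdict

    class IncrementalCount:
        def __init__(self):
            self.counts = defaultdict(int)

        def step(self, delta):
            """Apply a batch of (key, weight) changes; weight +1 inserts a
            row, weight -1 deletes one. Work is proportional to the size of
            the delta, not the full input, which is the central promise of
            incremental computation."""
            changed = {}
            for key, weight in delta:
                self.counts[key] += weight
                changed[key] = self.counts[key]
            return changed  # the delta of the output view

    view = IncrementalCount()
    print(view.step([("a", +1), ("a", +1), ("b", +1)]))  # {'a': 2, 'b': 1}
    print(view.step([("a", -1)]))                        # {'a': 1}
    ```

    The same operator handles an initial bulk load (one large delta) and subsequent small updates, which is how an engine of this kind can treat batch and streaming uniformly.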

    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
    • Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads
    Interview
    • Introduction
    • Can you describe what Feldera is and the story behind it?
    • DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?
    • Depending on how you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. What are the unique use cases that Feldera is designed to address?
      • In what situations would you replace another technology with Feldera?
      • When is it an additive technology?
    • Can you describe the architecture of Feldera?
      • How have the design and scope evolved since you first started working on it?
    • What are the state storage interfaces available in Feldera?
      • What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?
    • Can you describe a typical workflow for an engineer building with Feldera?
    • You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?
    • What is your philosophy toward the community growth and engagement with the open source aspects of Feldera and how you're balancing that with sustainability of the project and business?
    • What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Feldera?
    • When is Feldera the wrong choice?
    • What do you have planned for the future of Feldera?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    4 November 2024, 2:49 am
  • 48 minutes 50 seconds
    Accelerate Migration Of Your Data Warehouse with Datafold's AI Powered Migration Agent
    Summary
    Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how those experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt, and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's innovative use of AI and large language models (LLMs) to automate translation and reconciliation processes in data migrations, reducing the time and effort required.
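    The Data Migration Agent itself is a commercial product, but the AST-based SQL translation approach referenced later in the interview can be tried with the open-source SQLGlot library. A minimal sketch, assuming only that sqlglot is installed (this is the community library, not Datafold's agent):

    ```python
    # Dialect-to-dialect SQL translation via AST parsing with SQLGlot:
    # the statement is parsed into a syntax tree and regenerated for the
    # target dialect, so functions are rewritten rather than string-replaced.
    import sqlglot

    duckdb_sql = "SELECT EPOCH_MS(1618088028295) AS ts"
    spark_sql = sqlglot.transpile(duckdb_sql, read="duckdb", write="spark")[0]
    print(spark_sql)  # DuckDB's EPOCH_MS is rewritten into Spark SQL equivalents
    ```

    As the interview discusses, pure transpilation like this handles syntax but not semantics, which is where the combination with data diffing for reconciliation comes in.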
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what the Data Migration Agent is and the story behind it?
      • What is the core problem that you are targeting with the agent?
    • What are the biggest time sinks in the process of database and tooling migration that teams run into?
    • Can you describe the architecture of your agent?
      • What was your selection and evaluation process for the LLM that you are using?
    • What were some of the main unknowns that you had to discover going into the project?
      • What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?
    • In terms of SQL translation there are libraries such as SQLGlot and the work being done with SDF that aim to address that through AST parsing and subsequent dialect generation. What are the ways that approach is insufficient in the context of a platform migration?
    • How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?
    • What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI powered migration assistant?
    • When is the data migration agent the wrong choice?
    • What do you have planned for the future of applications of AI at Datafold?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    27 October 2024, 11:28 pm
  • 58 minutes 1 second
    Bring Vector Search And Storage To The Data Lake With Lance
    Summary
    The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.
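    As a quick taste of the developer experience, here is a minimal LanceDB session based on the project's documented Python quickstart; the path and records are illustrative.

    ```python
    # Minimal LanceDB usage sketch: create a table of vectors on local disk
    # and run a nearest-neighbor search over it.
    import lancedb

    db = lancedb.connect("./lance_demo")  # a directory of Lance datasets
    table = db.create_table(
        "items",
        data=[
            {"vector": [3.1, 4.1], "item": "foo"},
            {"vector": [5.9, 26.5], "item": "bar"},
        ],
    )

    # Vector search: find the record closest to the query vector.
    results = table.search([3.0, 4.0]).limit(1).to_pandas()
    print(results)
    ```

    Because the results come back as Arrow-compatible data (here via pandas), this slots directly into the ML and AI toolchains mentioned above.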
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Lance is and the story behind it?
      • What are the core problems that Lance is designed to solve?
        • What is explicitly out of scope?
    • The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?
      • What formats does Lance replace or obviate?
    • In terms of data modeling, Lance obviously adds a vector type. What are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?
      • Are there any practical or hard limitations on vector dimensionality?
    • When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?
    • I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?
    • What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?
    • The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?
      • What are the other main integrations for Lance?
      • What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. Iceberg or Delta to enable its use in data lake environments?
    • What are the most interesting, innovative, or unexpected ways that you have seen Lance used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?
    • When is Lance the wrong choice?
    • What do you have planned for the future of Lance?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    20 October 2024, 10:34 pm
  • 54 minutes 8 seconds
    The Role of Python in Shaping the Future of Data Platforms with DLT
    Summary
    In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform, and its integration with lakehouse architectures and AI application frameworks. The conversation explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB, and concludes with a discussion of the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.
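    For context on what "a library rather than a platform" means in practice, a minimal dlt pipeline is just a few lines of ordinary Python (this sketch assumes dlt is installed with DuckDB support; names are illustrative):

    ```python
    # Minimal dlt pipeline: ordinary Python data in, a queryable DuckDB
    # dataset out, with schema inference and load state handled by the library.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="demo",
        destination="duckdb",   # any supported destination works here
        dataset_name="raw_data",
    )

    users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
    load_info = pipeline.run(users, table_name="users")
    print(load_info)
    ```

    There is no server or UI to stand up; the pipeline runs wherever your Python runs, which is the design stance discussed in the episode.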
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?
      • What are the core principles that guide your work on dlt and dlthub?
    • You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?
    • The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?
      • The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?
    • What are some of the notable investments that you have made in the developer experience for building dlt pipelines?
      • How have the interfaces for source/destination development improved?
    • You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?
    • What is your strategy for building a sustainable product on top of dlt?
      • How does that strategy help to form a "virtuous cycle" of improving the open source foundation?
    • What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
    • When is dlt the wrong choice?
    • What do you have planned for the future of dlt/dlthub?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    13 October 2024, 10:32 pm
  • 42 minutes 36 seconds
    Build Your Data Transformations Faster And Safer With SDF
    Summary
    In this episode of the Data Engineering Podcast Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including support for Python models and enhanced execution capabilities.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what SDF is and the story behind it?
      • What's the story behind the name?
    • What problem are you solving with SDF?
    • dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the Venn diagram of features and functionality across SDF, dbt, and SQLMesh?
    • Can you describe the design and implementation of SDF?
      • How have the scope and goals of the project changed since you first started working on it?
    • What does the development experience look like for a team working with SDF?
      • How does that differ between the open and paid versions of the product?
    • What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
    • One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
      • Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
    • What is your governing principle for what capabilities are in the open core and which go in the paid product?
    • What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
    • When is SDF the wrong choice?
    • What do you have planned for the future of SDF?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    6 October 2024, 11:07 pm
  • 57 minutes 11 seconds
    Scaling Airbyte: Challenges and Milestones on the Road to 1.0
    Summary
    Airbyte is one of the most prominent platforms for data movement. Over the past four years they have invested heavily in solutions for scaling their self-hosted and cloud operations, as well as in the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Airbyte is and the story behind it?
    • What are some of the notable milestones that you have traversed on your path to the 1.0 release?
    • The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?
    • What are some of the hard-won lessons that you have learned about the realities of data movement and integration?
      • What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?
    • What are the core architectural decisions that have proven to be effective?
      • How has the architecture had to change as you progressed to the 1.0 release?
    • A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?
    • What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?
    • When is Airbyte the wrong choice?
    • What do you have planned for the future of Airbyte after the 1.0 launch?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    23 September 2024, 8:26 pm
  • 38 minutes 41 seconds
    Enhancing Data Accessibility and Governance with Gravitino
    Summary
    As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Gravitino is and the story behind it?
    • What problems are you solving with Gravitino?
      • What are the methods that teams have relied on in the absence of Gravitino to address those use cases?
    • What led to the Hive Metastore being the default for so long?
      • What are the opportunities for innovation and new functionality in the metadata service?
    • The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?
    • What are the capabilities that you are explicitly keeping out of scope for Gravitino?
    • Can you describe the technical architecture of Gravitino?
      • How have the design and scope evolved from when you first started working on it?
    • Can you describe how Gravitino integrates into an overall data platform?
      • In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?
    • One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?
    • What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?
    • When is Gravitino the wrong choice?
    • What do you have planned for the future of Gravitino?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    1 September 2024, 10:08 pm
  • 53 minutes 30 seconds
    The Evolution of DataOps: Insights from DataKitchen's CEO
    Summary
    In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. He delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability, emphasizing the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what DataKitchen is and the story behind it?
    • You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
    • Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
    • The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
    • What are the challenges that never went away?
    • You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
    • What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
    • Can you talk through the technical implementation of your new observability and quality testing platform?
    • What does the onboarding and integration process look like?
    • Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
    • What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
    • What do you have planned for the future of your work at DataKitchen?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    4 August 2024, 7:40 pm
  • 49 minutes 26 seconds
    Achieving Data Reliability: The Role of Data Contracts in Modern Data Management
    Summary
    Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
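    To make the enforcement idea concrete, here is a small conceptual sketch of a contract check: a schema promise plus row-level guarantees, verified before data is published downstream. This is illustrative plain Python, not Soda's actual contract language; see Soda's documentation for the real format.

    ```python
    # Conceptual data contract: column types plus named row-level checks.
    contract = {
        "dataset": "dim_customers",
        "columns": {"id": int, "email": str},
        "checks": [
            ("id_not_null", lambda row: row["id"] is not None),
            ("email_has_at", lambda row: "@" in row["email"]),
        ],
    }

    def enforce(rows, contract):
        """Return a list of violations; an empty list means the contract holds."""
        violations = []
        for i, row in enumerate(rows):
            for col, typ in contract["columns"].items():
                if not isinstance(row.get(col), typ):
                    violations.append((i, f"{col} is not {typ.__name__}"))
            for name, check in contract["checks"]:
                if not check(row):
                    violations.append((i, name))
        return violations

    rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "oops"}]
    print(enforce(rows, contract))  # [(1, 'email_has_at')]
    ```

    Run in a pipeline before publishing a dataset, a check like this is what stops a failing guarantee from propagating to downstream consumers, which is the integration pattern discussed in the interview.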

    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
    • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe the scope and purpose of data contracts in the context of this conversation?
    • In what way(s) do they differ from data quality/data observability?
    • Data contracts are also described as "the API for data". Can you elaborate on that framing?
    • What are the types of guarantees and requirements that you can enforce with these data contracts?
    • What are some examples of constraints or guarantees that cannot be represented in these contracts?
    • Are data contracts related to the "shift left" movement?
    • The obvious application of data contracts is in the context of pipeline execution flows, to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
    • How did you approach the design of the syntax and implementation for Soda's data contracts?
    • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
    • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
    • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
    • When are data contracts the wrong choice?
    • What do you have planned for the future of data contracts?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    28 July 2024, 8:10 pm
  • 54 minutes 45 seconds
    How Generative AI Is Impacting Data Engineering Teams
    Summary
    Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you start by clarifying what we are discussing when we say "AI"?
    • Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?
    • Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?
      • What are the areas where LLMs have proven useful/effective in data engineering?
    • Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?
      • As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?
    • As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?
      • What new skills do data teams need to acquire to be effective in supporting AI applications?
    • What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?
    • When is AI the wrong choice?
    • What are your predictions for the future impact of AI on data engineering teams?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    21 July 2024, 7:31 pm