Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

  • 54 minutes 8 seconds
    The Role of Python in Shaping the Future of Data Platforms with DLT
    Summary
    In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform and its integration with lakehouse architectures and AI application frameworks. The episode explores the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. It concludes with a discussion of the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
    • Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?
      • What are the core principles that guide your work on dlt and dltHub?
    • You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?
    • The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?
      • The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?
    • What are some of the notable investments that you have made in the developer experience for building dlt pipelines?
      • How have the interfaces for source/destination development improved?
    • You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?
    • What is your strategy for building a sustainable product on top of dlt?
      • How does that strategy help to form a "virtuous cycle" of improving the open source foundation?
    • What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
    • When is dlt the wrong choice?
    • What do you have planned for the future of dlt/dltHub?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    13 October 2024, 10:32 pm
  • 42 minutes 36 seconds
    Build Your Data Transformations Faster And Safer With SDF
    Summary
    In this episode of the Data Engineering Podcast, Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for data privacy, governance, and quality concerns in modern data management to its distinguishing features such as static analysis and type correctness, Lukas covers what sets SDF apart from tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including support for Python models and enhanced execution capabilities.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what SDF is and the story behind it?
      • What's the story behind the name?
    • What problem are you solving with SDF?
      • dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the Venn diagram of features and functionality across SDF, dbt, and SQLMesh?
    • Can you describe the design and implementation of SDF?
      • How have the scope and goals of the project changed since you first started working on it?
    • What does the development experience look like for a team working with SDF?
      • How does that differ between the open and paid versions of the product?
    • What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
    • One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
      • Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
    • What is your governing principle for what capabilities are in the open core and which go in the paid product?
    • What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
    • When is SDF the wrong choice?
    • What do you have planned for the future of SDF?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    6 October 2024, 11:07 pm
  • 57 minutes 11 seconds
    Scaling Airbyte: Challenges and Milestones on the Road to 1.0
    Summary
    Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Airbyte is and the story behind it?
    • What are some of the notable milestones that you have traversed on your path to the 1.0 release?
    • The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?
    • What are some of the hard-won lessons that you have learned about the realities of data movement and integration?
      • What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?
    • What are the core architectural decisions that have proven to be effective?
      • How has the architecture had to change as you progressed to the 1.0 release?
    • A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?
    • What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?
    • When is Airbyte the wrong choice?
    • What do you have planned for the future of Airbyte after the 1.0 launch?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    23 September 2024, 8:26 pm
  • 38 minutes 41 seconds
    Enhancing Data Accessibility and Governance with Gravitino
    Summary
    As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Gravitino is and the story behind it?
    • What problems are you solving with Gravitino?
      • What are the methods that teams have relied on in the absence of Gravitino to address those use cases?
    • What led to the Hive Metastore being the default for so long?
      • What are the opportunities for innovation and new functionality in the metadata service?
    • The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?
    • What are the capabilities that you are explicitly keeping out of scope for Gravitino?
    • Can you describe the technical architecture of Gravitino?
      • How have the design and scope evolved from when you first started working on it?
    • Can you describe how Gravitino integrates into an overall data platform?
      • In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?
    • One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?
    • What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?
    • When is Gravitino the wrong choice?
    • What do you have planned for the future of Gravitino?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    1 September 2024, 10:08 pm
  • 53 minutes 30 seconds
    The Evolution of DataOps: Insights from DataKitchen's CEO
    Summary
    In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges that data engineers face, such as constant system failures, the need for rapid changes, and high customer demands. He delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what DataKitchen is and the story behind it?
    • You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
    • Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
    • The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
    • What are the challenges that never went away?
    • You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
    • What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
    • Can you talk through the technical implementation of your new observability and quality testing platform?
    • What does the onboarding and integration process look like?
    • Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
    • What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
    • What do you have planned for the future of your work at DataKitchen?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    4 August 2024, 7:40 pm
  • 49 minutes 26 seconds
    Achieving Data Reliability: The Role of Data Contracts in Modern Data Management
    Summary
    Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
    • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe the scope and purpose of data contracts in the context of this conversation?
    • In what way(s) do they differ from data quality/data observability?
    • Data contracts are also known as the API for data. Can you elaborate on this?
    • What are the types of guarantees and requirements that you can enforce with these data contracts?
    • What are some examples of constraints or guarantees that cannot be represented in these contracts?
    • Are data contracts related to the shift-left?
    • The obvious application of data contracts is in the context of pipeline execution flows, to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
    • How did you approach the design of the syntax and implementation for Soda's data contracts?
    • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with tools such as dbt and Great Expectations?
    • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
    • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
    • When are data contracts the wrong choice?
    • What do you have planned for the future of data contracts?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    28 July 2024, 8:10 pm
  • 54 minutes 45 seconds
    How Generative AI Is Impacting Data Engineering Teams
    Summary
    Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you start by clarifying what we are discussing when we say "AI"?
    • Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?
    • Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?
      • What are the areas where LLMs have proven useful/effective in data engineering?
    • Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?
      • As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?
    • As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?
      • What new skills do data teams need to acquire to be effective in supporting AI applications?
    • What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?
    • When is AI the wrong choice?
    • What are your predictions for the future impact of AI on data engineering teams?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    21 July 2024, 7:31 pm
  • 52 minutes 58 seconds
    The Role of Product Managers in Data-Centric Organizations
    Summary
    In this episode Praveen Gujar, Director of Product at LinkedIn, talks about the intricacies of product management for data and analytical platforms. Praveen shares his journey from Amazon to Twitter and now LinkedIn, highlighting his extensive experience in building data products and platforms, digital advertising, AI, and cloud services. He discusses the evolving role of product managers in data-centric environments, emphasizing the importance of clean, reliable, and compliant data. Praveen also delves into the challenges of building scalable data platforms, the need for organizational and cultural alignment, and the critical role of product managers in bridging the gap between engineering and business teams. He provides insights into the complexities of platformization, the significance of long-term planning, and the necessity of having a strong relationship with engineering teams. The episode concludes with Praveen offering advice for aspiring product managers and discussing the future of data management in the context of AI and regulatory compliance.

    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Your host is Tobias Macey and today I'm interviewing Praveen Gujar about product management for data and analytical platforms
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Product management is typically thought of as being oriented toward customer facing functionality and features. What is involved in being a product manager for data systems?
    • Many data-oriented products that are customer facing require substantial technical capacity to serve those use cases. How does that influence the process of determining what features to provide/create?
      • Investment in technical capacity/platforms
      • Identifying groupings of features that can be served by a common platform investment
      • Managing organizational pressures between engineering, product, business, finance, etc.
    • What are the most interesting, innovative, or unexpected ways that you have seen "Data Products & Platforms @ Big-tech" used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on "Building Data Products & Platforms for Big-tech"?
    • When is "Data Products & Platforms @ Big-tech" the wrong choice?
    • What do you have planned for the future of "Data Products & Platforms @ Big-tech"?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    13 July 2024, 8:55 pm
  • 57 minutes 43 seconds
    Neon: A Serverless And Developer Friendly Postgres
    Summary
    Postgres is one of the most widely respected and liked database engines ever. To make it even easier for developers to use, Nikita Shamgunov decided to make it serverless, so that it can scale from zero to infinity. In this episode he explains the engineering involved to make that possible, as well as the numerous details that he and his team are packing into the Neon service to make it even more attractive for anyone who wants to build on top of Postgres.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Nikita Shamgunov about his work on making Postgres a serverless database at Neon.
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Neon is and the story behind it?
      • The ecosystem around Postgres is large and varied. What are the pain points that you are trying to address with Neon? 
    • What does it mean for a database to be serverless?
      • What kinds of products and services are unlocked by making Postgres a serverless database?
    • How does your vision for Neon compare/contrast with what you know of PlanetScale?
    • Postgres is known for having a large ecosystem of plugins that add a lot of interesting and useful features, but the storage layer has not been as easily extensible historically. How have architectural changes in recent Postgres releases enabled your work on Neon?
    • What are the core pieces of engineering that you have had to complete to make Neon possible?
      • How have the design and goals of the project evolved since you first started working on it?
    • The separation of storage and compute is one of the most fundamental promises of the cloud. What new capabilities does that enable in Postgres?
      • How does the branching functionality change the ways that development teams are able to deliver and debug features?
    • Because the storage is now a networked system, what new performance/latency challenges does that introduce? How have you addressed them in Neon?
    • Anyone who has ever operated a Postgres instance has had to tackle the upgrade process. How does Neon address that process for end users?
    • The rampant growth of AI has touched almost every aspect of computing, and Postgres is no exception. How does the introduction of pgvector and semantic/similarity search functionality impact the adoption and usage patterns of Postgres/Neon?
      • What new challenges does that introduce for you as an operator and business owner?
    • What are the lessons that you learned from MemSQL/SingleStore that have been most helpful in your work at Neon?
    • What are the most interesting, innovative, or unexpected ways that you have seen Neon used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Neon?
    • When is Neon the wrong choice? Postgres?
    • What do you have planned for the future of Neon?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    8 July 2024, 2:55 am
  • 59 minutes 48 seconds
    Improve Data Quality Through Engineering Rigor And Business Engagement With Synq
    Summary
    This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this.
    Announcements
    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor
    Interview
    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Synq is and the story behind it? 
      • Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
    • Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams? 
      • Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
    • What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
    • How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
    • With the focus on sharing ownership beyond the boundaries of the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
    • Can you describe how Synq is designed/implemented? 
      • How have the scope and goals of the product changed since you first started working on it?
    • For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
    • What are the types of incidents/errors that you are able to identify and alert on? 
      • What does a typical incident/error resolution process look like with Synq?
    • What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
    • When is Synq the wrong choice?
    • What do you have planned for the future of Synq?
    Contact Info
    Parting Question
    • From your perspective, what is the biggest gap in the tooling or technology for data management today?
    Closing Announcements
    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
    Links
    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
    30 June 2024, 7:00 pm
  • 53 minutes 23 seconds
    Stitching Together Enterprise Analytics With Microsoft Fabric

    Summary

    Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise, Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.

    Announcements

    • Hello and welcome to the Data Engineering Podcast, the show about modern data management
    • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
    • Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data withou

    Interview

    • Introduction
    • How did you get involved in the area of data management?
    • Can you describe what Microsoft Fabric is and the story behind it?
    • Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
    • Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
      • What are the elements of Fabric that were engineered specifically for the service?
      • What are the most interesting/complicated integration challenges?
    • How has your prior experience with Ahana and Presto informed your current work at Microsoft?
    • AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
      • What are the challenges in terms of safety and reliability?
    • What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
    • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
    • When is Fabric the wrong choice?
    • What do you have planned for the future of data lake analytics?

    Contact Info

    Parting Question

    • From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
    • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    • If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

    23 June 2024, 2:00 pm