Summary
As your data needs scale across an organization, the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations, and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
- Introduction
- How did you get involved in the area of data management?
- How do you define data curation?
- What are some of the high level concerns that are encapsulated in that effort?
- How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
- Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it?
- What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
- What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
- As “big data” became more widely discussed, the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry reaches a greater degree of maturity and more regulations are implemented, there has been a shift toward being more deliberate about what information gets stored and for how long. What are your views on that evolution, and what is your litmus test for determining which data to keep?
- In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
- What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
- Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
- ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
- What are some of the areas of data architecture and curation that are most often forgotten or ignored?
- What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Teradata
- Data Architecture
- Data Curation
- Data Warehouse
- Chief Data Officer
- ETL (Extract, Transform, Load)
- Data Lake
- Metadata
- Data Lineage
- Strata Conference
- ELT (Extract, Load, Transform)
- Map-Reduce
- Hive
- Pig
- Spark
- Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle.
Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data Conference in New York in September, then come say hi to Metis Machine at booth P16. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
Your host is Tobias Macey. And today, I'm interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence. So, Todd, could you start by introducing yourself? Hi, Tobias.
[00:01:42] Unknown:
I'm Todd Walter. I'm chief technologist for Teradata. I've been with Teradata for, as crazy as it sounds, 31 years. I joined the organization when Teradata was still a startup company and have walked through all of the cycles and lifetimes of Teradata through its
[00:02:07] Unknown:
history. And do you remember how you first got involved in the area of data management?
[00:02:11] Unknown:
You know, it's funny. I didn't know it when I was a kid, but I've always been a data geek. I've always been fascinated by data. I got involved in a project in high school through my American government teacher, and digitized a bunch of maps and identified all of the publicly owned lands within the city, to better understand the tax consequences of publicly owned land, and just got hooked by data all of my life. I like to know the details.
[00:02:54] Unknown:
Yeah. It's definitely a very valuable skill to have, and it's become even more so as data has become the sort of default currency of businesses.
[00:03:04] Unknown:
It's so crucial these days. You know, there's plenty of people out there saying that, you know, data or die, analytics or die, is the new model for companies. And I think that is more true than ever. I don't think it's the only thing. You still have to have a strategy. You still have to have something compelling to offer to your customers. But if you don't understand in detail how your business operates and how you make money and how you deal with customers, you just can't operate in the world anymore. You can't meet user and customer expectations.
[00:03:50] Unknown:
Yeah. And it seems that in the time leading up to the early 2000s, when we were starting to digitize our data sources but had a very small variety and volume of them, business analytics was in a sense easy, because you didn't have so many different systems to consider. And so, you know, the answers you got were fairly well scoped, whereas now you have mountains of data to churn through and a massive variety of sources, so it's kind of hard to separate the signal from the noise. And for a while in the mid-2000s, there was the onset of the big data mantra of just save everything and magically good things will happen. So in that, I'm sure that your skills with data curation and data architecture have proved invaluable to yourself and all of your clients. So can you start by just defining what those terms mean and what their differences are in terms of curation versus architecture?
[00:04:49] Unknown:
Sure. I'm gonna agree and disagree with you, though, on your point there. Certainly, separating the signal from the noise today has gotten significantly more difficult, partly just because of the enormous volumes of data. However, the number of sources has always been monumental. I've been part of the data warehousing industry for, you know, much of my career at Teradata and worked with a lot of different companies. And when I talk to these big customers, they're talking about thousands, maybe tens of thousands, of source applications, each producing data independently.
None of them architected to talk to each other or interact with each other. None of them with similar keys or similar representations of the data. Just because it was sort of structured data, just because it was more rows and columns, doesn't make the problem any easier. So data curation has been a problem for the entirety of the data warehouse industry. And the problem just keeps getting bigger and harder as we move into this era of big data. So what do I mean by data curation? I mean that the data needs to be managed, cleaned, reassembled, and made into a form that's actually usable by an analyst or a data scientist or a modeling system of some form.
I like the word curation because it makes me think of the curator of a museum. In a museum, there's a lot of artifacts, many of which are in the basement on dusty shelves and aren't on display at all. And they are kept there because they have some amount of value or some amount of research value, but they're not on display for all of the guests of the museum to see. And then, of course, there are the ones that are on display in the museum. And those, of course, are nicely arranged, carefully organized, nicely labeled, easily understood by anyone from children to grandparents that come to the museum.
The skeletons are articulated so that you can understand the creature that they came from, and the displays are all very carefully curated to make that user experience very easy and very understandable. And I think that data people these days need to think the same way. They need to think about how they curate the data and figure out which things are most important to be on display, if you will, to the whole organization, and which things are fine to keep in the basement for a few PhD researchers to
[00:08:21] Unknown:
look at someday. Yeah. I think that's a very valuable metaphor. And also in the case of the skeletons, for instance, curators will actually fill in the gaps of the skeleton to make it easier to understand, and that's a very good parallel to draw with the space of data and business intelligence representation. And so in the space of data curation, what are some of the high level concerns that are encapsulated in that effort? And is it something that would generally be led by a single person, or are there multiple individuals or business roles that are necessary for a proper curation of data within an organization?
[00:09:10] Unknown:
Oh, it's definitely not one person, unless you define being led as the role of the chief data officer. It starts with that kind of a role, the chief data officer, but the curation has to be a combined effort of the business owners and business users, along with the IT people, the ones who actually do the curation. And curation in a large organization is a monumental task. It is a never-ending monumental task, because there's always new sources of data.
There's mergers and acquisitions. There's always new applications. There's applications that change. And so it is this continuous process that needs to be prioritized and managed by multiple teams of people working closely together. The really highest level of concern is time to availability of the data. The traditional data warehouse model was to highly curate all the data and make available only highly curated data. Data that was, let's say, perfect. No data is ever perfect. But let's say the data is perfect, and give it to the users only when the data is perfect for their BI applications and their dashboards and such. But the trade-off for that is that it takes a lot of time to do that.
An ETL project for a new data source can take months, and it can be a costly project that has to get in the budget and get planned for next year. And the elapsed time can be months or even years before the team gets to curate the data and make it available to people. So there's huge pushback. More and more these days, in these days of web-time instant gratification, the users are very impatient to get at their data. As a result, there's a huge amount of pushback saying, well, don't curate the data, just give it to me in its raw form and I'll do it now. I'll play with it as the end user.
And that's okay, and it can actually be a very good thing, but it also can introduce all sorts of problems when the person using the data doesn't understand the characteristics of the raw data and the data quality issues that might be in it. Let's say they compute some report that computes revenue out of the raw data but hasn't applied all of the business rules that you would normally apply in a curation process, and they come up with a completely different revenue number, which they report up through the management chain and it gets used. Now the company is really making bad decisions, or even getting executives in trouble, because they are reporting incorrect data to the street or to investors or whatever. And so the whole process has to be well understood and well governed, to make sure that the data that people use that's not curated is really understood to be not curated, and that they're not using it for business critical decisions.
Whereas you're spending the right amount of energy on curating the right stuff to make sure that the business critical decisions are made on really good data. And that's really hard, because there are many tens of thousands of attributes floating around in thousands of data sets in an organization.
[00:13:43] Unknown:
And so how does the size and maturity of a company affect the different ways that they will approach architecting and curating the data that they're capturing, and the ways that they are building and tiering their data systems to enable the curation strategies that are being architected?
[00:14:09] Unknown:
The size of the company, among the large and medium companies at least, doesn't seem to really be the predictive factor. Really, it's about the data maturity of an organization. And, frankly, the pendulum swings back and forth. As I was saying in my previous comments, the data warehouse people really frustrated the users, because it took so long to get data ready for the data warehouse that the users got really frustrated waiting for that data and wanted raw data to work with. So that happens. And then the users get frustrated with that data not matching the general ledger and not being able to be joined with customer and not having common units. And they come back to the IT organization and want their data curated more.
So the pendulum kind of swings back and forth over time as well. And so what I try to advocate with the customers that I work with, and what Teradata has always tried to advocate, is that people should have a pipeline kind of mentality about curation. These days, we would call it an agile methodology around curation, where they start with data that is more in its raw state and let a few exploratory users work with that. And then, as that data gets to the point where it supports production applications or supports more users or supports more sharing between departments, that data gets more and more curated over time, all managed by a governance process and prioritization process.
There are not very many organizations that have that discipline, however. And so what really happens is that the organizations start with raw data, especially in the big data space, and they hack at it and create data pipelines and actually are creating a fair amount of future issues for themselves in terms of scaling up to more analytic projects and scaling up to production use. But the people who are really data centric organizations, they get this, and they go through this pipeline sort of agile process of continuous curation, continuously making the data better for their users.
[00:17:15] Unknown:
And so in this pipeline-oriented workflow, what does the life cycle of the data look like in terms of the systems that it flows through, the operations being performed on it, and the sort of general availability within the organization to these different layers of data?
[00:17:37] Unknown:
The data, when it first arrives as a new data set that the organization has never seen before, especially when it is one of the forms of big data: it might be text, it might be web logs, it might be IoT data, any of the non-traditional column-and-row kinds of data from traditional applications. That data generally needs to land in some form of a data lake, some form of a file system, a system that doesn't require significant rigor in the format of the data and the structure of the data. But that data is really only usable by a small number of heroes in the organization that really can understand all the details of the source systems and the data quality and the challenges of the structures of the data.
And so then that data needs to be progressively curated, and metadata progressively added to it, so that other people can understand the structure of the more curated form of the data. At the other end of the pipeline, when the data is being used by the enterprise as a whole, let's say it's the common view of customer data or the core financial data or that kind of information for the company, that data needs to be very highly curated and put in a form that everybody can use, which is usually, not always, but usually a more traditional relational form.
But by the time it gets there, a lot of the volume of the raw data has been consumed or left behind at stages along the curation pipeline. Maybe there are aggregates that have made it there. Maybe there are computed scores or values that have made it there. Something like a customer sentiment score has been computed out of a large volume of text; the only thing that makes it into the highly curated data is the score, and the text is left in the data lake for further new analyses.
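To make that pipeline shape concrete, here is a minimal PySpark sketch of the flow Todd describes: raw text lands in the lake, the heavy scoring happens on the scalable platform, and only the condensed score moves toward the warehouse. The paths, table names, and the placeholder `score_sentiment` logic are illustrative assumptions, not anything prescribed in the conversation.

```python
# A minimal sketch of the lake-to-warehouse curation pipeline described above.
# All paths, table names, and the scoring logic are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("curation-pipeline").getOrCreate()

# Stage 1: land the raw, loosely structured records in the data lake as-is.
raw = spark.read.json("s3://example-lake/raw/customer_feedback/")

# Stage 2: heavy lifting on the scalable platform. score_sentiment stands in
# for whatever model or rule set the team actually uses to score each comment.
score_sentiment = F.udf(lambda text: len(text or "") % 5 / 4.0, DoubleType())
scored = raw.withColumn("sentiment", score_sentiment(F.col("comment_text")))

# Stage 3: only the condensed, curated score moves toward the warehouse;
# the raw text stays behind in the lake for future, different analyses.
curated = scored.groupBy("customer_id").agg(
    F.avg("sentiment").alias("sentiment_score"),
    F.count("*").alias("comment_count"),
)
curated.write.mode("overwrite").saveAsTable("warehouse.customer_sentiment")
```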
[00:20:20] Unknown:
And in the data warehouse, where you're storing these aggregates or these condensed analyses of the raw data, would you generally also have some record of the original source of the data and its provenance, so that somebody who is interested in doing a different type of analysis, or trying to use a different algorithm for generating these aggregates, can then go back to those raw records easily to try and either replicate or discover new reflections of that information?
[00:20:53] Unknown:
Lineage and provenance is a huge deal these days, absolutely enormous. And there's a bunch of new companies springing up. Some of them have been around for a little while; some of them are newer. A lot of energy is being expended in the overall metadata, lineage, and provenance areas. And the reasons for that are not only the ones you described, about the analyst just needing to know what's going on and where the data came from and how it got there and what curation rules have been applied. It goes much further than that, because it also goes to being able to prove to an auditor, or a legal challenge on a privacy case, or a legal challenge on a security case, how data got to where it is, and being able to prove that bias wasn't introduced or that the data was handled and maintained and secured properly. And it ties in with access controls and everything else.
So, yeah, that whole lineage space is a huge deal, along with the metadata on the data.
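As a rough illustration of the kind of record that makes those proofs possible, here is a hand-rolled Python sketch of lineage capture at a curation step. A real deployment would write to a metadata catalog or a dedicated lineage tool rather than a local JSON log; every name and field below is an assumption made up for illustration.

```python
# A hand-rolled sketch of lineage capture: what was produced, from what
# inputs, by what transformation, and when. Real systems would write this to
# a metadata catalog; the shape of the record is the point.
import json
from datetime import datetime, timezone

def record_lineage(output_dataset, input_datasets, transformation,
                   log_path="lineage.jsonl"):
    """Append one lineage entry to a newline-delimited JSON log."""
    entry = {
        "output": output_dataset,
        "inputs": input_datasets,
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example: record how the hypothetical curated sentiment table was derived.
record_lineage(
    output_dataset="warehouse.customer_sentiment",
    input_datasets=["s3://example-lake/raw/customer_feedback/"],
    transformation="sentiment scoring + per-customer aggregation",
)
```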
[00:22:25] Unknown:
And when you're dealing with data lakes, there has been a lot of discussion about that being sort of the canonical source of data, and maybe the only source of data within an organization, and people trying to espouse schema on read: you don't need to define all of your schema upfront because it slows you down. Whereas with data warehouses, you have a very strong structure to the data, and you very much need to define that schema upfront, so it's schema on write. And then, with how you actually enforce the schemas, there's the question of doing extract, transform, and then load for being able to put the data into the data warehouse within that structure, or, with the data lakes, doing extract, load, and then transform, where you determine the schema at read time or when you're doing various analyses. So I'm curious what your thoughts are on all of those, as somebody who does have a lot of history and context within the industry.
[00:23:27] Unknown:
Well, that's a big question. There are people in the industry who say that the data lake is a replacement for the data warehouse. I do not believe that; Teradata does not believe that. However, we strongly believe that the data lake has a really important role to play in the overall analytic data platform architecture. There are people that make this an either-or conversation, and we just don't believe that at all. We believe that the data lake and the data warehouse should work together symbiotically to deliver the data and deliver their separate capabilities to the organization.
So the data lake is really good at collecting this really high volume data and doing big grind-them-up operations over that large volume of data. So the big curation steps, the big processing of sensor data, for instance, to put it into a format that analysts can use, and normalize units and normalize time; these are big, heavy-lifting operations on this data. And those are really great things to do in the data lake, while delivering access to the highly curated data of the organization is something that data warehouses do very well. And each of them is bad at doing the other thing. So data warehouses are bad at doing the really heavy lifting on the semi-structured or weakly structured data, the raw data that's coming in.
And the data lake technologies are weak at providing SLAs on high concurrency workloads to support a whole organization. And so we really think that the two should work together, and we think there's a natural flow: as the data is more and more curated, it is more and more likely to belong in the data warehouse, to be delivered to a much wider group of people sharing it across the organization, rather than the exploratory users who are doing the initial analysis and initial understanding and the big heavy-lifting curation processes.
We also think that the data lake is a virtual concept, in that a file system is a great place to land data that is weakly structured: the text, the IoT data, the web log data, all of those forms of data that are the new big data sources of the world. But when the data is coming in a more row-and-column form, landing it and formatting it as unstructured files and then restructuring it back into a form for widespread use in the organization is an extra hop, and extra energy and extra resources used that don't really need to be used.
Your comments or questions about ETL, ELT, and all the variants: at Strata last week, I heard a new one, which I really liked, ELE. It's around the concept of a data hub, where you extract, load, and then extract again to feed out to a very large number of sources. And one of the presenters was bemoaning the fact that all he did, in his entire life, was ELE, just land data and then ship it back out again, and nobody actually ever used it in his data platform. But Teradata has always advocated the use of an ELT kind of model, just because it is a scalable model for the larger datasets.
It's easier to do the transform processes with a parallelized, scalable set of operations, rather than trying to push them through a server, you know, a single-threaded server somewhere, and process them record by record. That's fine for small datasets, but it doesn't work for large datasets. And the ELT model, of course, has been highly adopted by the data lake folks, where a lot of the curation is done on platform, leveraging the tools of the data lake environment, anything from MapReduce, Pig, and Hive all the way up to Spark and Python scripts and everything else.
But the goal is the same. The goal is to push the work into the scalable platform so that you can operate on the very large datasets and do the heavy lifting on the large datasets in a reasonable amount of time. Schema on read and schema on write come right back to the conversation from before about the life cycle of data in an organization and the life cycle of data curation. Schema on read is really great when it's a small number of users who are exploring the data and trying to understand the structure and the value in the content, and trying to derive the interesting things or the interesting new insights out of that data.
That's a great thing to do. And every organization should provide in their processes a way for people to land the data in a raw or lightly curated way and make it available in a schema-on-read kind of model to that small set of super users who can deal with that data in that form. But when you need to get the data out to 10,000 users in 50 organizations, schema on read no longer makes any sense at all. It is a huge resource utilization, because you're doing it over and over and over again for every use of the data. It introduces all sorts of opportunities for each person or each application to curate the data in a different way, and thus get different answers.
It introduces a whole bunch of problems that were all the reasons why we did ETL in the data warehouse world in the first place. And so the more the data is used across the organization, the more production, the more data-as-a-product the data becomes, the more curated it needs to be and the more it needs to be modeled. The curation work gets done once, and then the curated data is used many times by the people downstream.
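Here is a small PySpark sketch of that trade-off, with the transform pushed onto the scalable platform in the ELT spirit Todd describes. The file paths, column names, and table name are assumptions made up for illustration.

```python
# A sketch of schema on read vs. schema on write. Paths, columns, and the
# table name are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema-tradeoff").getOrCreate()

# Schema on read: fine for a few exploratory users. The structure is inferred
# from the raw files, and every reader pays that cost (and risks a different
# interpretation) on every use.
exploratory = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("s3://example-lake/raw/clickstream/")
)

# Schema on write: the schema is an explicit contract, enforced once when the
# curated table is loaded, so thousands of downstream users share one answer.
contract = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
    StructField("revenue", DoubleType(), nullable=True),
])
curated = (
    spark.read.schema(contract)
    .option("header", True)
    .csv("s3://example-lake/raw/clickstream/")
)
curated.write.mode("append").saveAsTable("warehouse.clickstream_events")
```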
[00:31:21] Unknown:
And going back to your metaphor of the museum with the data lake versus the data warehouse, the data warehouse ends up being the display room where all of the exhibits are put on display for everybody to be able to access, and they're easy to consume and understand. And then the data lake is the basement of the museum where all of the raw, unprocessed resources are for people to be able to do their research and analysis and prepare them for moving up to the display room.
[00:31:51] Unknown:
Absolutely. And in the display room, you have lots of metadata. The exhibits are linked together by time, or timelines, or geography, all of that. They're easily understood. They're all articulated so that, a bit like joining tables together, you can see it all in its relationships, in addition to just the data points in individual form.
[00:32:24] Unknown:
And for organizations or individuals who are first starting to plan out the overall data architecture, and the associated infrastructure and systems that they're going to be building curation processes on, what have you found to be some of the common mistakes that ultimately result in failure of a lesser or greater degree?
[00:32:50] Unknown:
The failures come from swinging the pendulum to one side or the other, rather than thinking about it as a continuum. The failures come from both ends of the spectrum in particular. The people who believe that everything needs to be perfectly curated before it gets in the hands of the users are blocking whole groups of users, and especially data science users, from being able to explore new sets of data and work on them without a huge time lag in between. They're also wasting a lot of resources on the curation projects themselves when that data doesn't turn out to have the high utility that justifies that level of curation.
And on the other end, one of my favorite conversations ever was when I sat with a CTO at a very large organization, and he was very proud of the fact that he had gotten IT completely out of the curation business. Instead, all that they were doing was gathering all the data in the raw form, dumping it in the data lake, and then giving access to the business organizations and saying, it's all your problem. That is going to result in failure. Well, it is resulting in failure, because the user organizations don't have the skills. They are each curating their datasets independently.
They're all coming up with different answers to the same question. It's back to the worst of the data mart world in the nineties and early 2000s.
[00:34:47] Unknown:
And so it sounds that for somebody who is first beginning a new project, or starting up, or starting with a new organization, their best path forward would likely be to start landing data in raw form in a data lake; then, either doing it themselves or with the help of an analyst, start exploring the data and building reports off of the data lake to determine what's actually useful and what's being used by the broader organization; and then start to encapsulate that and capture it in the form of a data warehouse, building out that data warehouse based on the datasets and reports that are most valuable to the organization, while continuing to land new sources in the data lake as sort of a staging and testing ground.
[00:35:38] Unknown:
Exactly. And as you start, start from the beginning building out a governance process that works with the users to understand what level of curation is actually required, and do the absolute minimum necessary curation to meet the business requirements. I call it minimum viable curation, stealing the term from the agile world. And the idea is that if you have a good, constructive conversation between the people who are doing curation and the people who are using the data, continuously, then you can spend the right amount of time and dollars on doing the curation and be very selective.
You might have a dataset with a thousand attributes in it, but the users who are producing the business reports say, well, we only care about these five. Then you don't need to curate the other 995. Don't waste your time on it. Curate the five that they care about, and leave the rest for another day, when another application comes along and those need to be curated to support another application or a new business use. That governance team is a key thing. You know, nobody gets to start from scratch. It would be nice to start from a blank sheet of paper or a green field, but nobody gets to start from scratch. If I did get to start from scratch, though, I would start the governance process very early. It might be very light to start with, but I would grow it as the datasets grew and the usage grew, and as more and more people used more and more of the data more widely in the organization.
And I'd build in the metadata from day one. You gotta start capturing the metadata. It's very hard to go back and capture the metadata and the lineage after the fact. So I'd build in the metadata capture and lineage capture, as automated as possible, right from the very beginning, to make sure that there's track and trace on everything.
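A rough PySpark sketch of that minimum viable curation idea: out of a wide raw dataset, only the attributes the report owners asked for get cleaned and promoted, and lineage is captured at the same step. Column names, cleansing rules, paths, and table names are hypothetical; the commented `record_lineage` call refers back to the earlier illustrative helper.

```python
# A sketch of minimum viable curation: curate only the attributes that the
# business users asked for; everything else waits for a real requirement.
# Columns, rules, paths, and table names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("minimum-viable-curation").getOrCreate()

raw = spark.read.parquet("s3://example-lake/raw/orders/")  # ~1,000 attributes

# The report owners only asked for these five attributes.
requested = ["order_id", "customer_id", "order_date", "net_amount", "currency"]

curated = (
    raw.select(*requested)
    .dropDuplicates(["order_id"])                       # one row per order
    .withColumn("order_date", F.to_date("order_date"))  # common date format
    .withColumn("net_amount", F.col("net_amount").cast("decimal(18,2)"))
    .filter(F.col("customer_id").isNotNull())           # example business rule
)
curated.write.mode("overwrite").saveAsTable("warehouse.orders_minimal")

# Capture lineage from day one (see the record_lineage sketch earlier).
# record_lineage("warehouse.orders_minimal",
#                ["s3://example-lake/raw/orders/"],
#                "selected 5 of ~1,000 attributes + basic cleansing rules")
```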
[00:38:03] Unknown:
And once the architecture has been established and put into production and people are starting to use it, what are some of the techniques or strategies that you use to allow for continued evolution of those systems, to prevent stagnation and eventual failure of the data platform and data project, or of the entire
[00:38:26] Unknown:
organization? I think we've touched on a lot of those points already. You have to have the governance, you have to have the metadata, you have to have the continuous conversation with the users, because just dumping data in a pile doesn't do anybody any good. There are lots of published reports these days about, you know, the small percentage of data lakes that are actually successful and actually delivering business value. Rewind two decades, and the same published numbers were written about data warehouses, and the failure modes are the same.
The failure modes are because IT people build "build it and they will come" edifices. And those never work. They never succeed. If they're not tightly linked with the business users from the beginning, and don't have a good governance process, then they will never be able to have a living data organism. The data of an organization is a very living thing, and it needs to be maintained and fed and managed that way.
[00:39:52] Unknown:
And for somebody who is interested in learning more about the overall landscape of data architecture and curation, and some of the concrete strategies and systems that they can implement to help them in that journey, what are some of the resources that you recommend and have found to be the most useful?
[00:40:11] Unknown:
Wow. There's a lot of stuff out there, but it's really difficult to sort the wheat from the chaff. It's best to look for materials that are published by somebody who is more neutral rather than a vendor, because too many vendors are just pushing a strategy that is one-dimensional. You know, a data lake vendor might be pushing the strategy that everything is schema on read and everything's landed raw, and you just, you know, provide the data out to the users. While a, you know, specialized data mart vendor might be pushing that everything has to be in a star schema and carefully curated and made available to BI engines.
So some of the vendor stuff can be quite one-dimensional. So, you know, I encourage people to find the stuff that's written by the people who are more independent, you know, the independent analyst types and such in the marketplace.
[00:41:30] Unknown:
And are there any other aspects of data curation and the associated concerns that we didn't cover yet which you think we should discuss further before we close out the show? I think there's two things that we didn't talk about, and they're kind of related.
[00:41:45] Unknown:
One is, if you collect data and nobody uses it, you have wasted resources. You've wasted your time and energy. You've wasted the physical resources for storing the data and managing the data. I am really irritated by people who tell me they have a successful data lake because they have 2 petabytes of data in it. I ask them how many users they have, and they say, the data lake is successful because we put 2 petabytes of data in it. No, it's not. It was a giant waste of resources, and you should be fired instead of getting your bonus. So the whole idea of gathering data should be because there's something that you're going to do with it. There's somebody who is going to analyze it, somebody who's going to take the results of that analysis and execute a business process using the results. If it doesn't result in a business-changing decision, then it is all worthless.
Right? So doing data science and putting up posters on the walls about a cool data science project is also a waste of time and energy, unless it results in a production business process that is creating some value for the business at the end of the day. And there's way too much of that going on. The flip side of that is, you gotta decide when to delete data, or not keep it in the first place. Again, people are counting petabytes, and that's interesting, but not valuable. The data that people need to keep is the data that is actually useful for the business.
Now, of course, you have to have a retention policy, and the retention policy says that data must be deleted after 7 years in order to meet some compliance requirements, or it needs to be kept for 10 years to meet other compliance requirements. This is a place where the data people need to get with all of the legal people to define the right rules. But then you also have to be smart about which raw datasets you keep and which ones you don't. Some people say keep everything, and, you know, I have kind of a personal feeling about that: I'm kind of a keep-everything kind of person. Just ask my wife.
So, yeah. One of the cool things about the data lake technologies, at a lower cost per terabyte, is that you can keep more of it, and that allows you to go back and curate out more attributes from the history of the raw data. But there's gotta be an end to that. You know, all that data doesn't have value forever. At some point, you're not even selling the cars anymore that you have the sensor data from; the lifetime of the thing that has the sensor in it has passed, or whatever.
And then it's time to delete it. And, of course, some of the new rules around privacy are making for some new deletion requirements that are actually very challenging. You have to be able to forget someone. Someone under GDPR in a European country can call up your company and say, you must forget me, and you need to go through all of the datasets everywhere in your organization, find every record that pertains to that customer, and erase them from your datasets.
That's really, really hard when there are many copies of the data lying around, in a lot of different platforms, replicated in a lot of different ways. That's a very, very hard problem. And so deleting data is as difficult a problem as getting it, storing it, and curating it in the first place.
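As a rough sketch of why that is so hard, and of the one thing that makes it tractable, here is a hypothetical "forget me" sweep in PySpark. It assumes exactly what the conversation says you need: a maintained registry of every dataset holding personal data and its key column, plus table formats that support deletes. All names below are made up for illustration.

```python
# A sketch of a GDPR "forget me" sweep. It presumes a maintained registry of
# every dataset that holds personal data -- i.e., good metadata -- and table
# formats that support DELETE (e.g., Delta Lake or Iceberg). All names here
# are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-erasure").getOrCreate()

# Without a registry like this, the request is nearly impossible to honor
# across all the copies and platforms where the data is replicated.
PERSONAL_DATA_REGISTRY = {
    "warehouse.customers": "customer_id",
    "warehouse.orders_minimal": "customer_id",
    "lake.customer_feedback": "customer_id",
}

def forget(customer_id: str) -> None:
    """Delete every record pertaining to one customer, dataset by dataset."""
    for table, key_column in PERSONAL_DATA_REGISTRY.items():
        # In production, parameterize instead of interpolating into SQL.
        spark.sql(f"DELETE FROM {table} WHERE {key_column} = '{customer_id}'")

forget("customer-12345")
```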
[00:46:23] Unknown:
So for anybody who wants to follow you and keep up to date with the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would be interested to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today. Wow. The biggest gap. There's lots
[00:46:46] Unknown:
of them, and that's a good thing and a bad thing. It's a bad thing for users and IT organizations, but it's a good thing for innovation, and for all the creative people firing up startups and making investments in the space. So I think that a couple of key areas are important. One is the whole data pipeline linked with lineage and metadata. In that whole area, most people in the data lake world are writing code, and that cannot scale over the long run. There are a number of companies in that space, but they're all small and fairly early. And nobody really has a comprehensive end-to-end answer similar to the answer that we had with the ETL tools of the data warehouse era.
And we really need that tooling, because we really need to scale. And on the usage side, it is crucial to much more tightly link all of the different analytics together, and much more tightly link them to the data source. There are a lot of people doing tools and cool algorithms, but they are, you know, completely unlinked from the data stores. And you have to extract data and reformat the data and get it in the right form and put it through a tool. And if you need to use three algorithms, you need to use three tools. So this, again, can't scale, because again, people are writing code, lots and lots of nasty code, in order to solve these problems.
And this is an area where Teradata is spending a lot of energy right now, and we'll be making some
[00:49:00] Unknown:
announcements in the near future at our big user conference coming up in October. Alright. Well, thank you very much for taking the time today to join me and discuss your experience and perspective on data curation and data architecture. It's been very useful for me, and I'm sure that the listeners will appreciate it as well. So thank you for that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. It's been great talking with you.
Introduction and Sponsor Messages
Interview with Todd Walter Begins
The Importance of Data in Business
Defining Data Curation and Architecture
High-Level Concerns in Data Curation
Impact of Company Size and Maturity on Data Strategies
Lifecycle of Data in a Pipeline Workflow
Data Provenance and Lineage
Data Lakes vs. Data Warehouses
Common Mistakes in Data Architecture
Starting a New Data Project
Evolving Data Systems
Unused Data and Deletion Policies
Biggest Gaps in Data Management Tooling
Conclusion and Final Thoughts