Summary
As your data needs scale across an organization, the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations, and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
- Introduction
- How did you get involved in the area of data management?
- How do you define data curation?
- What are some of the high level concerns that are encapsulated in that effort?
- How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
- Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it?
- What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
- What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
- As “big data” became more widely discussed, the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry reaches a greater degree of maturity and more regulations are implemented, there has been a shift toward being more deliberate about what information gets stored and for how long. What are your views on that evolution, and what is your litmus test for determining which data to keep?
- In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
- What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
- Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
- ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
- What are some of the areas of data architecture and curation that are most often forgotten or ignored?
- What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Teradata
- Data Architecture
- Data Curation
- Data Warehouse
- Chief Data Officer
- ETL (Extract, Transform, Load)
- Data Lake
- Metadata
- Data Lineage
- Strata Conference
- ELT (Extract, Load, Transform)
- Map-Reduce
- Hive
- Pig
- Spark
- Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And you work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning life cycle.
Skafos maximizes interoperability with your existing tools and platforms and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data Conference in New York in September, then come say hi to Metis Machine at booth P16. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
Your host is Tobias Macey. And today, I'm interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence. So, Todd, could you start by introducing yourself? Hi, Tobias.
[00:01:42] Unknown:
I'm Todd Walter. I'm chief technologist for Teradata. I've been with Teradata for, as crazy as it sounds, 31 years. I joined the organization when Teradata was still a startup company and have walked through all of the cycles and lifetimes of Teradata through its
[00:02:07] Unknown:
history. And do you remember how you first got involved in the area of data management?
[00:02:11] Unknown:
You know, it's funny. I didn't know it when I was a kid, but I've always been a data geek. I've always been fascinated by data. I got involved in a project in high school through my American government teacher, and digitized a bunch of maps and identified all of the publicly owned lands within the city, to better understand the tax consequences of publicly owned land, and just got hooked by data all of my life. I like to know the details.
[00:02:54] Unknown:
Yeah. It's definitely a very valuable skill to have, and it's become even more so as data has become the sort of default currency of businesses.
[00:03:04] Unknown:
It's so crucial these days. You know, there's plenty of people out there saying that, you know, data or die, analytics or die, is the new model for companies. And I think that is more true than ever. I don't think it's the only thing. You still have to have a strategy. You still have to have something compelling to offer to your customers. But if you don't understand in detail how your business operates and how you make money and how you deal with customers, you just can't operate in the world anymore. You can't meet user and customer expectations.
[00:03:50] Unknown:
Yeah. And it seems that in the time leading up to the early 2000s, when we were starting to digitize our data sources but had a very small variety and volume of them, business analytics was in a sense easy, because you didn't have so many different systems to consider. And so, you know, the answers you got were fairly well scoped, whereas now you have mountains of data to churn through and a massive variety of sources, so it's kind of hard to separate the signal from the noise. And for a while in the mid-2000s, there was the onset of the big data mantra of just save everything and magically good things will happen. So in that, I'm sure that your skills with data curation and data architecture have proved invaluable to yourself and all of your clients. So can you start by just defining what those terms mean and what their differences are in terms of curation versus architecture?
[00:04:49] Unknown:
Sure. I'm gonna agree and disagree with you, though, on your point there. Certainly, separating the signal from the noise today has gotten significantly more difficult, partly just because of the enormous volumes of data. However, the number of sources has always been monumental. I've been part of the data warehousing industry for, you know, much of my career at Teradata and worked with a lot of different companies. And when I talk to these big customers, they're talking about thousands, maybe tens of thousands, of source applications, each producing data independently.
None of them architected to talk to each other or interact with each other. None of them with similar keys or similar representations of the data. Just because it was sort of structured data, just because it was more rows and columns, doesn't make the problem any easier. So data curation has been a problem for the entirety of the data warehouse industry. And the problem just keeps getting bigger and harder as we move into this era of big data. So what do I mean by data curation? I mean that the data needs to be managed, cleaned, reassembled, and made into a form that's actually usable by an analyst or a data scientist or a modeling system of some form.
I like the word curation because it makes me think of the curator of a museum. In a museum, there's a lot of artifacts, many of which are in the basement on dusty shelves and aren't on display at all. And they are kept there because they have some amount of value or some amount of research value, but they're not on display for all of the guests of the museum to see. And then, of course, there are the ones that are on display in the museum. And those, of course, are nicely arranged, carefully organized, nicely labeled, easily understood by anyone from children to grandparents that come to the museum.
The skeletons are articulated so that you can understand the creature that they came from, and the displays are all very carefully curated to make that user experience very easy and very understandable. And I think that data people these days need to think the same way. They need to think about how they curate the data and figure out which things are most important to be on display, if you will, to the whole organization, and which things are fine to keep in the basement for a few PhD researchers to
[00:08:21] Unknown:
look at someday. Yeah. I think that's a very valuable metaphor. And also in the case of the skeletons, for instance, curators will actually fill in the gaps of the skeleton to make it easier to understand, and that's a very good parallel to draw with the space of data and business intelligence representation. And so in the space of data curation, what are some of the high level concerns that are encapsulated in that effort? And is it something that would generally be led by a single person, or are there multiple individuals or business roles that are necessary for a proper curation of data within an organization?
[00:09:10] Unknown:
Oh, it's definitely not one person, unless you define being led as the role of the chief data officer. It starts with that kind of a role, the chief data officer, but the curation has to be a combined effort of the business owners and business users, along with the IT people, the ones who actually do the curation. And curation in a large organization is a monumental task. It is a never-ending monumental task, because there's always new sources of data.
There's mergers and acquisitions. There's always new applications. There's applications that change. And so it is this continuous process that needs to be prioritized and managed by multiple teams of people working closely together. The really highest level of concern is time to availability of the data. The traditional data warehouse model was to highly curate all the data and make available only highly curated data. Data that was, let's say, perfect. No data is ever perfect. But let's say the data is perfect, and give it to the users only when the data is perfect for their BI applications and their dashboards and such. But the trade-off for that is that it takes a lot of time to do that.
An ETL project for a new data source can take months, and it can be a costly project that has to get in the budget and get planned for next year. And the elapsed time can be months or even years before the team gets to curate the data and make it available to people. So there's huge pushback. More and more these days, in these days of web-time instant gratification, the users are very impatient to get at their data. As a result, there's a huge amount of pushback saying, well, don't curate the data, just give it to me in its raw form and I'll do it now. I'll play with it as the end user.
And that's okay, and it can actually be a very good thing, but it also can introduce all sorts of problems when the person using the data doesn't understand the characteristics of the raw data and the data quality issues that might be in it. Let's say they compute some report that computes revenue out of the raw data but hasn't applied all of the business rules that you would normally apply in a curation process, and they come up with a completely different revenue number, which they report up through the management chain and it gets used. Now the company is really making bad decisions, or even getting executives in trouble, because they are reporting incorrect data to the street or to investors or whatever. And so the whole process has to be well understood and well governed, to make sure that the data that people use that's not curated is really understood to be not curated, and that they're not using it for business critical decisions.
Whereas you're spending the right amount of energy on curating the right stuff to make sure that the business critical decisions are made on really good data. And that's really hard, because there are many tens of thousands of attributes floating around in thousands of data sets in an organization.
[00:13:43] Unknown:
And so how does the size and maturity of a company affect the different ways that they will approach architecting and curating the data that they're capturing, and the ways that they are building and tiering their data systems to enable the curation strategies that are being architected?
[00:14:09] Unknown:
The size of the company, among the large and medium companies at least, doesn't seem to really be the predictive factor. Really, it's about the data maturity of an organization. And, frankly, the pendulum swings back and forth. As I was saying in my previous comments, the data warehouse people really frustrated the users, because it took so long to get data ready for the data warehouse that the users got really frustrated waiting for that data and wanted raw data to work with. So that happens. And then the users get frustrated with that data not matching the general ledger and not being able to be joined with customer and not having common units. And they come back to the IT organization and want their data curated more.
So the pendulum kind of swings back and forth over time as well. And so what I try to advocate with the customers that I work with, and what Teradata has always tried to advocate, is that people should have a pipeline kind of mentality about curation. These days, we would call it an agile methodology around curation, where they start with data that is more in its raw state and let a few exploratory users work with that. And then, as that data gets to the point where it supports production applications or supports more users or supports more sharing between departments, that data gets more and more curated over time, all managed by a governance process and prioritization process.
There are not very many organizations that have that discipline, however. And so what really happens is that the organizations start with raw data, especially in the big data space, and they hack at it and create data pipelines and actually are creating a fair amount of future issues for themselves in terms of scaling up to more analytic projects and scaling up to production use. But the people who are really data centric organizations, they get this, and they go through this pipeline sort of agile process of continuous curation, continuously making the data better for their users.
[00:17:15] Unknown:
And so in this pipeline-oriented workflow, what does the life cycle of the data look like in terms of the systems that it flows through, the operations being performed on it, and the sort of general availability within the organization to these different layers of data?
[00:17:37] Unknown:
The data, when it first arrives as a new data set that the organization has never seen before, especially when it is one of the forms of big data: it might be text, it might be web logs, it might be IoT data, any of the non-traditional column-and-row kinds of data from traditional applications. That data generally needs to land in some form of a data lake, some form of a file system, a system that doesn't require significant rigor in the format of the data and the structure of the data. But that data is really only usable by a small number of heroes in the organization that really can understand all the details of the source systems and the data quality and the challenges of the structures of the data.
And so then that data needs to be progressively curated, and metadata progressively added to it, so that other people can understand the structure of the more curated form of the data. At the other end of the pipeline, when the data is being used by the enterprise as a whole, let's say it's the common view of customer data or the core financial data or that kind of information for the company, that data needs to be very highly curated and put in a form that everybody can use, which is usually, not always, but usually a more traditional relational form.
But by the time it gets there, a lot of the volume of the raw data has been consumed or left behind at stages along the curation pipeline. Maybe there are aggregates that have made it there. Maybe there are computed scores or values that have made it there. Something like a customer sentiment score has been computed out of a large volume of text; the only thing that makes it into the highly curated data is the score, and the text is left in the data lake for further new analyses.
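To make that pipeline shape concrete, here is a minimal PySpark sketch of the flow Todd describes: raw text lands in the lake, the heavy scoring happens on the scalable platform, and only the condensed score moves toward the warehouse. The paths, table names, and the placeholder `score_sentiment` logic are illustrative assumptions, not anything prescribed in the conversation.

```python
# A minimal sketch of the lake-to-warehouse curation pipeline described above.
# All paths, table names, and the scoring logic are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("curation-pipeline").getOrCreate()

# Stage 1: land the raw, loosely structured records in the data lake as-is.
raw = spark.read.json("s3://example-lake/raw/customer_feedback/")

# Stage 2: heavy lifting on the scalable platform. score_sentiment stands in
# for whatever model or rule set the team actually uses to score each comment.
score_sentiment = F.udf(lambda text: len(text or "") % 5 / 4.0, DoubleType())
scored = raw.withColumn("sentiment", score_sentiment(F.col("comment_text")))

# Stage 3: only the condensed, curated score moves toward the warehouse;
# the raw text stays behind in the lake for future, different analyses.
curated = scored.groupBy("customer_id").agg(
    F.avg("sentiment").alias("sentiment_score"),
    F.count("*").alias("comment_count"),
)
curated.write.mode("overwrite").saveAsTable("warehouse.customer_sentiment")
```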
[00:20:20] Unknown:
And in the data warehouse, where you're storing these aggregates or these condensed analyses of the raw data, would you generally also have some record of the original source of the data and its provenance, so that somebody who is interested in doing a different type of analysis, or trying to use a different algorithm for generating these aggregates, can then go back to those raw records easily to try and either replicate or discover new reflections of that information?
[00:20:53] Unknown:
Lineage and provenance is a huge deal these days, absolutely enormous. And there's a bunch of new companies springing up. Some of them have been around for a little while; some of them are newer. A lot of energy is being expended in the overall metadata, lineage, and provenance areas. And the reasons for that are not only the ones you described, about the analyst just needing to know what's going on and where the data came from and how it got there and what curation rules have been applied. It goes much further than that, because it also goes to being able to prove to an auditor, or a legal challenge on a privacy case, or a legal challenge on a security case, how data got to where it is, and being able to prove that bias wasn't introduced or that the data was handled and maintained and secured properly. And it ties in with access controls and everything else.
So, yeah, that whole lineage space is a huge deal, along with the metadata on the data.
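As a rough illustration of the kind of record that makes those proofs possible, here is a hand-rolled Python sketch of lineage capture at a curation step. A real deployment would write to a metadata catalog or a dedicated lineage tool rather than a local JSON log; every name and field below is an assumption made up for illustration.

```python
# A hand-rolled sketch of lineage capture: what was produced, from what
# inputs, by what transformation, and when. Real systems would write this to
# a metadata catalog; the shape of the record is the point.
import json
from datetime import datetime, timezone

def record_lineage(output_dataset, input_datasets, transformation,
                   log_path="lineage.jsonl"):
    """Append one lineage entry to a newline-delimited JSON log."""
    entry = {
        "output": output_dataset,
        "inputs": input_datasets,
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example: record how the hypothetical curated sentiment table was derived.
record_lineage(
    output_dataset="warehouse.customer_sentiment",
    input_datasets=["s3://example-lake/raw/customer_feedback/"],
    transformation="sentiment scoring + per-customer aggregation",
)
```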
[00:22:25] Unknown:
And when you're dealing with data lakes, there has been a lot of discussion about that being sort of the canonical source of data, and maybe the only source of data within an organization, and people trying to espouse schema on read: you don't need to define all of your schema upfront because it slows you down. Whereas with data warehouses, you have a very strong structure to the data, and you very much need to define that schema upfront, so it's schema on write. And then, with how you actually enforce the schemas, there's the question of doing extract, transform, and then load for being able to put the data into the data warehouse within that structure, or, with the data lakes, doing extract, load, and then transform, where you determine the schema at read time or when you're doing various analyses. So I'm curious what your thoughts are on all of those, as somebody who does have a lot of history and context within the industry.
[00:23:27] Unknown:
Well, that's a big question. There are people in the industry who say that the data lake is a replacement for the data warehouse. I do not believe that; Teradata does not believe that. However, we strongly believe that the data lake has a really important role to play in the overall analytic data platform architecture. There are people that make this an either-or conversation, and we just don't believe that at all. We believe that the data lake and the data warehouse should work together symbiotically to deliver the data and deliver their separate capabilities to the organization.
So the data lake is really good at collecting this really high volume data and doing big grind-them-up operations over that large volume of data. So the big curation steps, the big processing of sensor data, for instance, to put it into a format that analysts can use, and normalize units and normalize time; these are big, heavy-lifting operations on this data. And those are really great things to do in the data lake, while delivering access to the highly curated data of the organization is something that data warehouses do very well. And each of them is bad at doing the other thing. So data warehouses are bad at doing the really heavy lifting on the semi-structured or weakly structured data, the raw data that's coming in.
And the data lake technologies are weak at providing SLAs on high concurrency workloads to support a whole organization. And so we really think that the two should work together, and we think there's a natural flow: as the data is more and more curated, it is more and more likely to belong in the data warehouse, to be delivered to a much wider group of people sharing it across the organization, rather than the exploratory users who are doing the initial analysis and initial understanding and the big heavy-lifting curation processes.
We also think that the data lake is a virtual concept, in that a file system is a great place to land data that is weakly structured: the text, the IoT data, the web log data, all of those forms of data that are the new big data sources of the world. But when the data is coming in a more row-and-column form, landing it and formatting it as unstructured files and then restructuring it back into a form for widespread use in the organization is an extra hop, and extra energy and extra resources used that don't really need to be used.
Your comments or questions about ETL, ELT, and all the variants: at Strata last week, I heard a new one, which I really liked, ELE. It's around the concept of a data hub, where you extract, load, and then extract again to feed out to a very large number of sources. And one of the presenters was bemoaning the fact that all he did, in his entire life, was ELE, just land data and then ship it back out again, and nobody actually ever used it in his data platform. But Teradata has always advocated the use of an ELT kind of model, just because it is a scalable model for the larger datasets.
It's easier to do the transform processes with a parallelized, scalable set of operations, rather than trying to push them through a server, you know, a single-threaded server somewhere, and process them record by record. That's fine for small datasets, but it doesn't work for large datasets. And the ELT model, of course, has been highly adopted by the data lake folks, where a lot of the curation is done on platform, leveraging the tools of the data lake environment, anything from MapReduce, Pig, and Hive all the way up to Spark and Python scripts and everything else.
But the goal is the same. The goal is to push the work into the scalable platform so that you can operate on the very large datasets and do the heavy lifting on the large datasets in a reasonable amount of time. Schema on read and schema on write come right back to the conversation from before about the life cycle of data in an organization and the life cycle of data curation. Schema on read is really great when it's a small number of users who are exploring the data and trying to understand the structure and the value in the content, and trying to derive the interesting things or the interesting new insights out of that data.
That's a great thing to do. And every organization should provide in their processes a way for people to land the data in a raw or lightly curated way and make it available in a schema-on-read kind of model to that small set of super users who can deal with that data in that form. But when you need to get the data out to 10,000 users in 50 organizations, schema on read no longer makes any sense at all. It is a huge resource utilization, because you're doing it over and over and over again for every use of the data. It introduces all sorts of opportunities for each person or each application to curate the data in a different way, and thus get different answers.
It introduces a whole bunch of problems that were all the reasons why we did ETL in the data warehouse world in the first place. And so the more the data is used across the organization, the more production, the more data-as-a-product the data becomes, the more curated it needs to be and the more it needs to be modeled. The curation work gets done once, and then the curated data is used many times by the people downstream.
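Here is a small PySpark sketch of that trade-off, with the transform pushed onto the scalable platform in the ELT spirit Todd describes. The file paths, column names, and table name are assumptions made up for illustration.

```python
# A sketch of schema on read vs. schema on write. Paths, columns, and the
# table name are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema-tradeoff").getOrCreate()

# Schema on read: fine for a few exploratory users. The structure is inferred
# from the raw files, and every reader pays that cost (and risks a different
# interpretation) on every use.
exploratory = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("s3://example-lake/raw/clickstream/")
)

# Schema on write: the schema is an explicit contract, enforced once when the
# curated table is loaded, so thousands of downstream users share one answer.
contract = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=False),
    StructField("revenue", DoubleType(), nullable=True),
])
curated = (
    spark.read.schema(contract)
    .option("header", True)
    .csv("s3://example-lake/raw/clickstream/")
)
curated.write.mode("append").saveAsTable("warehouse.clickstream_events")
```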
[00:31:21] Unknown:
And going back to your metaphor of the museum with the data lake versus the data warehouse, the data warehouse ends up being the display room where all of the exhibits are put on display for everybody to be able to access, and they're easy to consume and understand. And then the data lake is the basement of the museum where all of the raw, unprocessed resources are for people to be able to do their research and analysis and prepare them for moving up to the display room.
[00:31:51] Unknown:
Absolutely. And in the display room, you have lots of metadata. The exhibits are linked together by time, or timelines, or geography, all of that. They're easily understood. They're all articulated so that, a bit like joining tables together, you can see it all in its relationships, in addition to just the data points in individual form.
[00:32:24] Unknown:
And for organizations or individuals who are first starting to plan out the overall data architecture, and the associated infrastructure and systems that they're going to be building curation processes on, what have you found to be some of the common mistakes that ultimately result in failure of a lesser or greater degree?
[00:32:50] Unknown:
The failures come from swinging the pendulum to one side or the other, rather than thinking about it as a continuum. The failures come from both ends of the spectrum in particular. The people who believe that everything needs to be perfectly curated before it gets in the hands of the users are blocking whole groups of users, and especially data science users, from being able to explore new sets of data and work on them without a huge time lag in between. They're also wasting a lot of resources on the curation projects themselves when that data doesn't turn out to have the high utility that justifies that level of curation.
And on the other end, one of my favorite conversations ever was when I sat with a CTO at a very large organization, and he was very proud of the fact that he had gotten IT completely out of the curation business. Instead, all that they were doing was gathering all the data in the raw form, dumping it in the data lake, and then giving access to the business organizations and saying, it's all your problem. That is going to result in failure. Well, it is resulting in failure, because the user organizations don't have the skills. They are each curating their datasets independently.
They're all coming up with different answers to the same question. It's back to the worst of the data mart world in the nineties and early 2000s.
[00:34:47] Unknown:
And so it sounds that for somebody who is first beginning a new project, or starting up, or starting with a new organization, their best path forward would likely be to start landing data in raw form in a data lake; then, either doing it themselves or with the help of an analyst, start exploring the data and building reports off of the data lake to determine what's actually useful and what's being used by the broader organization; and then start to encapsulate that and capture it in the form of a data warehouse, building out that data warehouse based on the datasets and reports that are most valuable to the organization, while continuing to land new sources in the data lake as sort of a staging and testing ground.
[00:35:38] Unknown:
Exactly. And as you start, start from the beginning building out a governance process that works with the users to understand what level of curation is actually required, and do the absolute minimum necessary curation to meet the business requirements. I call it minimum viable curation, stealing the term from the agile world. And the idea is that if you have a good, constructive conversation between the people who are doing curation and the people who are using the data, continuously, then you can spend the right amount of time and dollars on doing the curation and be very selective.
You might have a dataset with a thousand attributes in it, but the users who are producing the business reports say, well, we only care about these five. Then you don't need to curate the other 995. Don't waste your time on it. Curate the five that they care about, and leave the rest for another day, when another application comes along and those need to be curated to support another application or a new business use. That governance team is a key thing. You know, nobody gets to start from scratch. It would be nice to start from a blank sheet of paper or a green field, but nobody gets to start from scratch. If I did get to start from scratch, though, I would start the governance process very early. It might be very light to start with, but I would grow it as the datasets grew and the usage grew, and as more and more people used more and more of the data more widely in the organization.
And I'd build in the metadata from day one. You gotta start capturing the metadata. It's very hard to go back and capture the metadata and the lineage after the fact. So I'd build in the metadata capture and lineage capture, as automated as possible, right from the very beginning, to make sure that there's track and trace on everything.
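A rough PySpark sketch of that minimum viable curation idea: out of a wide raw dataset, only the attributes the report owners asked for get cleaned and promoted, and lineage is captured at the same step. Column names, cleansing rules, paths, and table names are hypothetical; the commented `record_lineage` call refers back to the earlier illustrative helper.

```python
# A sketch of minimum viable curation: curate only the attributes that the
# business users asked for; everything else waits for a real requirement.
# Columns, rules, paths, and table names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("minimum-viable-curation").getOrCreate()

raw = spark.read.parquet("s3://example-lake/raw/orders/")  # ~1,000 attributes

# The report owners only asked for these five attributes.
requested = ["order_id", "customer_id", "order_date", "net_amount", "currency"]

curated = (
    raw.select(*requested)
    .dropDuplicates(["order_id"])                       # one row per order
    .withColumn("order_date", F.to_date("order_date"))  # common date format
    .withColumn("net_amount", F.col("net_amount").cast("decimal(18,2)"))
    .filter(F.col("customer_id").isNotNull())           # example business rule
)
curated.write.mode("overwrite").saveAsTable("warehouse.orders_minimal")

# Capture lineage from day one (see the record_lineage sketch earlier).
# record_lineage("warehouse.orders_minimal",
#                ["s3://example-lake/raw/orders/"],
#                "selected 5 of ~1,000 attributes + basic cleansing rules")
```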
[00:38:03] Unknown:
And once the architecture has been established and put into production and people are starting to use it, what are some of the techniques or strategies that you use to allow for continued evolution of those systems, to prevent stagnation and eventual failure of the data platform and data project, or of the entire
[00:38:26] Unknown:
organization? I think we've touched on a lot of those points already. You have to have the governance, you have to have the metadata, you have to have the continuous conversation with the users, because just dumping data in a pile doesn't do anybody any good. There are lots of published reports these days about, you know, the small percentage of data lakes that are actually successful and actually delivering business value. Rewind two decades, and the same published numbers were written about data warehouses, and the failure modes are the same.
The failure modes are because IT people build "build it and they will come" edifices. And those never work. They never succeed. If they're not tightly linked with the business users from the beginning, and don't have a good governance process, then they will never be able to have a living data organism. The data of an organization is a very living thing, and it needs to be maintained and fed and managed that way.
[00:39:52] Unknown:
And for somebody who is interested in learning more about the overall landscape of data architecture and curation, and some of the concrete strategies and systems that they can implement to help them in that journey, what are some of the resources that you recommend and have found to be the most useful?
[00:40:11] Unknown:
Wow. There's a lot of stuff out there, but it's really difficult to sort the wheat from the chaff. It's best to look for materials that are published by somebody who is more neutral rather than a vendor, because too many vendors are just pushing a strategy that is one-dimensional. You know, a data lake vendor might be pushing the strategy that everything is schema on read and everything's landed raw, and you just, you know, provide the data out to the users. While a, you know, specialized data mart vendor might be pushing that everything has to be in a star schema and carefully curated and made available to BI engines.
So some of the vendor stuff can be quite one-dimensional. So, you know, I encourage people to find the stuff that's written by the people who are more independent, you know, the independent analyst types and such in the marketplace.
[00:41:30] Unknown:
And are there any other aspects of data curation and the associated concerns that we didn't cover yet which you think we should discuss further before we close out the show? I think there's two things that we didn't talk about, and they're kind of related.
[00:41:45] Unknown:
One is, if you collect data and nobody uses it, you have wasted resources. You've wasted your time and energy. You've wasted the physical resources for storing the data and managing the data. I am really irritated by people who tell me they have a successful data lake because they have 2 petabytes of data in it. I ask them how many users they have, and they say, the data lake is successful because we put 2 petabytes of data in it. No, it's not. It was a giant waste of resources, and you should be fired instead of getting your bonus. So the whole idea of gathering data should be because there's something that you're going to do with it. There's somebody who is going to analyze it, somebody who's going to take the results of that analysis and execute a business process using the results. If it doesn't result in a business-changing decision, then it is all worthless.
Right? So doing data science and putting up posters on the walls about a cool data science project is also a waste of time and energy, unless it results in a production business process that is creating some value for the business at the end of the day. And there's way too much of that going on. The flip side of that is, you gotta decide when to delete data, or not keep it in the first place. Again, people are counting petabytes, and that's interesting, but not valuable. The data that people need to keep is the data that is actually useful for the business.
Now, of course, you have to have a retention policy, and the retention policy says that data must be deleted after 7 years in order to meet some compliance requirements, or it needs to be kept for 10 years to meet other compliance requirements. This is a place where the data people need to get with all of the legal people to define the right rules. But then you also have to be smart about which raw datasets you keep and which ones you don't. Some people say keep everything, and, you know, I have kind of a personal feeling about that: I'm kind of a keep-everything kind of person. Just ask my wife.
So, yeah. One of the cool things about the data lake technologies, at a lower cost per terabyte, is that you can keep more of it, and that allows you to go back and curate out more attributes from the history of the raw data. But there's gotta be an end to that. You know, all that data doesn't have value forever. At some point, you're not even selling the cars anymore that you have the sensor data from; the lifetime of the thing that has the sensor in it has passed, or whatever.
And then it's time to delete it. And, of course, some of the new rules around privacy are making for some new deletion requirements that are actually very challenging. You have to be able to forget someone. Someone under GDPR in a European country can call up your company and say, you must forget me, and you need to go through all of the datasets everywhere in your organization, find every record that pertains to that customer, and erase them from your datasets.
That's really, really hard when there are many copies of the data lying around, in a lot of different platforms, replicated in a lot of different ways. That's a very, very hard problem. And so deleting data is as difficult a problem as getting it, storing it, and curating it in the first place.
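As a rough sketch of why that is so hard, and of the one thing that makes it tractable, here is a hypothetical "forget me" sweep in PySpark. It assumes exactly what the conversation says you need: a maintained registry of every dataset holding personal data and its key column, plus table formats that support deletes. All names below are made up for illustration.

```python
# A sketch of a GDPR "forget me" sweep. It presumes a maintained registry of
# every dataset that holds personal data -- i.e., good metadata -- and table
# formats that support DELETE (e.g., Delta Lake or Iceberg). All names here
# are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-erasure").getOrCreate()

# Without a registry like this, the request is nearly impossible to honor
# across all the copies and platforms where the data is replicated.
PERSONAL_DATA_REGISTRY = {
    "warehouse.customers": "customer_id",
    "warehouse.orders_minimal": "customer_id",
    "lake.customer_feedback": "customer_id",
}

def forget(customer_id: str) -> None:
    """Delete every record pertaining to one customer, dataset by dataset."""
    for table, key_column in PERSONAL_DATA_REGISTRY.items():
        # In production, parameterize instead of interpolating into SQL.
        spark.sql(f"DELETE FROM {table} WHERE {key_column} = '{customer_id}'")

forget("customer-12345")
```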
[00:46:23] Unknown:
So for anybody who wants to follow you and keep up to date with the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would be interested to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today. Wow. The biggest gap. There's lots
[00:46:46] Unknown:
of them, and that's a good thing and a bad thing. It's a bad thing for users and IT organizations, but it's a good thing for innovation, and for all the creative people firing up startups and making investments in the space. So I think that a couple of key areas are important. One is the whole data pipeline linked with lineage and metadata. In that whole area, most people in the data lake world are writing code, and that cannot scale over the long run. There are a number of companies in that space, but they're all small and fairly early. And nobody really has a comprehensive end-to-end answer similar to the answer that we had with the ETL tools of the data warehouse era.
And we really need that tooling, because we really need to scale. And on the usage side, it is crucial to much more tightly link all of the different analytics together, and much more tightly link them to the data source. There are a lot of people doing tools and cool algorithms, but they are, you know, completely unlinked from the data stores. And you have to extract data and reformat the data and get it in the right form and put it through a tool. And if you need to use three algorithms, you need to use three tools. So this, again, can't scale, because again, people are writing code, lots and lots of nasty code, in order to solve these problems.
And this is an area where Teradata is spending a lot of energy right now, and we'll be making some
[00:49:00] Unknown:
announcements in the near future at our big user conference coming up in October. Alright. Well, thank you very much for taking the time today to join me and discuss your experience and perspective on data curation and data architecture. It's been very useful for me, and I'm sure that the listeners will appreciate it as well. So thank you for that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. It's been great talking with you.
Introduction and Sponsor Messages
Interview with Todd Walter Begins
The Importance of Data in Business
Defining Data Curation and Architecture
High-Level Concerns in Data Curation
Impact of Company Size and Maturity on Data Strategies
Lifecycle of Data in a Pipeline Workflow
Data Provenance and Lineage
Data Lakes vs. Data Warehouses
Common Mistakes in Data Architecture
Starting a New Data Project
Evolving Data Systems
Unused Data and Deletion Policies
Biggest Gaps in Data Management Tooling
Conclusion and Final Thoughts