Summary
Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?
- How do the various systems relate to each other for building a full workflow?
- How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?
- What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads?
- What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers?
- What are the systems that you have found to be easiest to work with?
- Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?
- How has your work with customers fed back into the products that you are building on top of?
- What are some examples of architectural or software patterns that are unique to the GCP product suite?
- What are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts?
- What are some of the new capabilities, new services, or industry trends that you are most excited for?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Google Cloud
- Forrester Wave
- Dremel
- BigQuery
- MapReduce
- Cloud Spanner
- Hadoop
- Tensorflow
- Google Cloud SQL
- Apache Spark
- Dataproc
- Dataflow
- Apache Beam
- Databricks
- Mixpanel
- Avalanche data warehouse
- Kubernetes
- GKE (Google Kubernetes Engine)
- Google Cloud Run
- Android
- Youtube
- Google Translate
- Teradata
- Power BI
- AI Platform Notebooks
- GitHub Data Repository
- Stack Overflow Questions Data Repository
- PyPI Download Statistics
- Recommendations AI
- Pub/Sub
- Bigtable
- Datastream
- Change Data Capture
- Document AI
- Google Meet
- Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Lak Lakshmanan about the suite of services for data and analytics in the Google Cloud Platform. So, Lak, can you start by introducing yourself?
[00:02:06] Unknown:
I'm Lak Lakshmanan. I lead analytics and AI solutions at Google Cloud. What our team does is we find problems that a lot of our customers are solving in a repeated way. And we basically provide a reference architecture and the solutions to those common repeated problems in such a way that people can get started a lot more easily. So for example, we see a lot of customers trying to build a marketing analytics platform on Google Cloud. So we basically build reference guides and implementations of how to very quickly bring together ads data with your transaction data in order to carry out marketing campaigns. So that's a very common use case.
But if you are new to the problem, it can help to see what a best practice solution would look like.
[00:02:57] Unknown:
And so that's essentially what my team does. And do you remember how you first got involved in the area of data management?
[00:03:03] Unknown:
When did I get involved in data management? Well, I've been at Google 5 years now. But before I was at Google, I was a research scientist doing machine learning for weather prediction. So essentially predicting flash floods, tornadoes, hail, lightning, etcetera. So essentially, all my career has involved extracting insights out of data. And as you know, if you wanna be extracting insights out of data, especially as a researcher, you have to learn how to manage the data, how to ensure that it has the right quality so that automated systems work well. The data quality considerations of any automated system that you build are gonna be much higher than when you can count on a human user to apply their judgment before they use the data. Right? And so data management, data governance, data quality have always been at the forefront of everything that I've done.
In fact, some of the machine learning algorithms that we built were algorithms to find quality problems in the data and basically fix them. For example, 1 of the neural networks that I built that is still being used in production is to take weather radar data and remove signals from that data that correspond to birds and insects and what's called chaff or anomalous propagation, etcetera. So, a few different scenarios, and, you know, we built a machine learning model to do that. So yeah. So the question is, where did I get started with data management? It seems like I got started soon out of college with my first job. Yeah. Trying to account for all the anomalies that can be caused by various different particulates and organisms sounds like a pretty interesting and complex challenge. Exactly. Exactly. And it's 1 of those things that is never ending. Right? So you built the algorithm. It works. It works most of the time. But when it comes to quality control, and this is 1 of the interesting things, is that we had an algorithm that worked.
I think its accuracy was 99.3%. That sounds really good, but 0.7% error meant that, like, we had data that was coming in from about 150 radars. It was basically coming in every 20 seconds. Every 20 seconds, we basically had a new scan. And when you basically did the math, it turns out that 30% of your images had artifacts in them. Right? So you go from 99.3% pixel-level accuracy to 70% image-level accuracy. At which point, it's like, is this thing even usable?
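The jump from 99.3% to roughly 70% is worth sketching, since it's the core of the point: per-pixel accuracy compounds multiplicatively across every independently judged region of a scan. The episode doesn't state the number of regions per image, so the counts below are assumptions; around 50 independent regions reproduces the quoted figures:

```python
# Per-pixel accuracy compounds across a scan. The region counts below are
# assumptions (the episode doesn't give them); ~50 independent regions per
# image reproduces the quoted drop from 99.3% to roughly 70%.
per_pixel_accuracy = 0.993

def image_level_accuracy(p: float, regions: int) -> float:
    """Probability that every independently judged region is artifact-free."""
    return p ** regions

for n in (10, 50, 100):
    print(n, round(image_level_accuracy(per_pixel_accuracy, n), 3))
```

With 50 regions the image-level accuracy is about 0.70, i.e. roughly 30% of images carry at least one artifact, matching the numbers in the conversation.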
[00:05:52] Unknown:
But it turns out that it is and that we have to be careful, but it's also an ongoing governance program as you work through the data. And so as you mentioned, you've been at Google for 5 years now. And I'm wondering if you can just start by giving a bit of an overview of the tools and products that are offered as part of Google Cloud, particularly focused on data and analytics and data management and sort of where your team sits in relation to those products.
[00:06:17] Unknown:
The data cloud, as we call it, the data platform at Google, is 1 of our strongest offerings. In fact, like, now when you look at, for example, the recent Forrester Wave, we're at the pole position, right at the very top right corner of that graph. Google's well known for having invented many of the big data processing technologies that we all use today. Right? So whether it's NoSQL with Bigtable, separation of compute and storage with Dremel and BigQuery, MapReduce, of course, which was basically the inspiration for Hadoop, or Spanner and TrueTime, the way to basically have global consistency.
Google's always been at the forefront of innovation in terms of data. But for the longest time, we would basically publish papers and expect other people to implement them. I think what's changed with Google Cloud is that you now have a very easy entry into the quality of the analytics and operational datasets and AI systems that Google has built and that Google uses. So when you think of something like TensorFlow, for example, it didn't follow the mechanism that we did with MapReduce, where we published a paper on MapReduce around the time that we were getting out of MapReduce and into Dremel and BigQuery. So when the whole Hadoop ecosystem was basically expanding in the outside world, internally, Google had already moved on because we saw a lot of problems that happened with having to manage and maintain clusters and operationalization, etcetera, into a completely serverless data warehouse.
On the other hand, when it comes to AI, you know, when you take TensorFlow, for example, we basically innovated in the outside world. Right? So it has been an open source project all the way through. And the TensorFlow that you and I use external to Google is the same TensorFlow that gets used internal to Google. So you basically get to share in that innovation. If you ask me, what is part of the platform? What is part of the platform, in broad terms, is transactional data systems. And by that I mean Cloud SQL, which is basically a managed Postgres, MySQL, or SQL Server, a fully managed relational database, which typically works as long as you can run it on 1 vertically scaled machine. But once your datasets grow larger, you want a distributed database, and that's where Spanner comes in. So Spanner is really good for once your transactional requirements grow beyond a single database and you also want to run these databases globally. So we have a lot of, for example, gaming companies and banks that use Spanner because they need to basically manage transactions in a global way. So that's 1 part of our platform, the databases part of the platform. Another part of the data cloud is analytics.
And the crown jewel of our analytics platform is BigQuery. BigQuery is basically a way to carry out analytics on structured and semi-structured data. It has full separation of compute and storage. So it basically scales from really small datasets to petabytes of data. And you basically get thousands of machines that you get to use for seconds at a time to carry out your analytics. That is BigQuery. BigQuery is our SQL data warehouse. It's also a data lake because it basically allows you to store structured and semi-structured data. A lot of people use Spark and Hadoop, the Hadoop ecosystem.
So a lot of people coming on to Google Cloud, the easy entry point into Google Cloud, is to use Google Cloud to run your existing Spark and Hadoop workloads. Dataproc is a product that provides that stepping stone into Google Cloud, into a fully modern cloud platform. The 3rd big product that is part of our analytics data platform is called Dataflow. Dataflow is our ETL and data processing engine. It's the managed version of Apache Beam, which is open source; another example of a technology that Google invented, but that we've open sourced and that we provide a managed environment for. The neat thing about Apache Beam is that you write the same code for both batch and stream. The B is from batch, the EAM from stream. Right? So it's basically a way to run identical code on both batch and stream, which, as you know, is a big challenge because most people end up building 2 separate systems, which is a complete waste. Right? You wanna build a single system that seamlessly transitions from historical data to real time data. And this is particularly important if you're doing machine learning, because when you do machine learning, you train on historical data, but you predict on brand new data. And you don't wanna have to build 2 separate systems, 1 for training and 1 for inference.
You want to use a unified system. So Apache Beam, which is part of our data cloud, is a very important component for simplifying the productionization of ML models. The 3rd part of our data cloud is the AI platform, Vertex AI, which basically provides you a way to develop ML models and to deploy them. In the case of developing ML models, you have tools like notebooks. You also have the concept of datasets, where the dataset could be built from any of those data sources that I talked about, through to deployment, where basically you deploy into an auto-scaling serverless service, a prediction service.
And connecting the development and the deployment is a pipeline system that allows you to go very quickly to operationalize ML models that you've built, add things like continuous evaluation, monitoring, etcetera, feature stores to those modules.
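The single-codepath idea described above can be sketched without the Beam API itself. In Beam, the same transforms run unchanged under a batch or streaming runner; the plain-Python sketch below (field names invented for illustration, not Beam code) only shows why one set of transform logic can serve both the training path and the inference path:

```python
# Illustrative sketch of Beam's unified model in plain Python (this is NOT
# the Beam API): one transform serves both bounded historical data and an
# unbounded live feed. Field names here are made up for illustration.
def parse_and_clean(event: dict) -> dict:
    """The single transform shared by batch (training) and stream (inference)."""
    return {"user": event["user"].strip().lower(),
            "value": float(event["value"])}

# Batch path: historical data, processed all at once for model training.
historical = [{"user": " Alice ", "value": "3"},
              {"user": "BOB", "value": "4.5"}]
training_rows = [parse_and_clean(e) for e in historical]

# Streaming path: the exact same function, applied as events arrive,
# so there is no second implementation to drift out of sync.
def streaming_rows(live_events):
    for event in live_events:
        yield parse_and_clean(event)
```

In actual Beam code, `parse_and_clean` would sit inside a `beam.Map` in a pipeline whose source is either bounded (files, BigQuery) or unbounded (Pub/Sub); swapping the source doesn't change the transform.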
[00:12:48] Unknown:
The different products in the Google Cloud Platform are obviously designed to be able to integrate well together so that when somebody onboards, they can then go from idea to delivery entirely within the Google Cloud suite of services. And I'm wondering what you see as some of the challenges for being able to provide that clean integration between products while also being able to make them useful in isolation?
[00:13:20] Unknown:
That is a great question. And it's something that we really take into account as we design our products, our solutions, and our platform. To take the first point, the integration between these products is extremely important because we want to basically deliver the ability to quickly do end to end workflows. Now, when I went through the platform, it may have sounded like a lot. But really, I was talking about 10 products. But if you were to go look at, you know, other competing platforms, you will see on the order of 150, 200 products.
Right? It takes real discipline to bring this down to 10 products that work really, really well together, that solve the wide variety of use cases, that do so in a very innovative way, that are very consistent with each other. Right? So 1 of the things that Google Cloud is really known for is the quality of our user experience. And it doesn't come magically. It comes because it's something that we design for and we use. But then the converse is, what happens if you want just a point solution? These things are very well integrated and very well put together. For example, I talked about Spark and I talked about BigQuery, and you run Spark on Dataproc, and on BigQuery you can run SQL.
But what if you want to run a Spark program on data that's in BigQuery? Not a problem. There is a Spark connector such that you can run your Spark program on data in BigQuery. What if your Spark program has created Parquet files that you want to query and join against your dataset in BigQuery? Not a problem. The BigQuery SQL engine can read and do SQL on Parquet files dynamically. What if you wanna train a TensorFlow model, right, on data that's in BigQuery? Not a problem. TensorFlow has a BigQuery reader that's basically able to read it and train your model. What if, after you've trained your ML model, you say, I don't wanna deploy to an ML prediction engine, I want to actually run this on batch data in BigQuery?
Not a problem. You can take your TensorFlow model, load it into BigQuery, and run a SQL query that invokes TensorFlow. Right? So all these things are extremely well tied together and well integrated. Having said that, we will always have customers who say, well, no. I love BigQuery, but I don't want to use your Spark engine. I want to use Databricks. That works as well. Databricks is a Google partner. You can get them on the marketplace, and you can work with them. Right? It basically uses Cloud IAM, the same identity access management. That's part of basically building an open platform where you can basically get other point solutions.
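The Parquet and TensorFlow integrations mentioned above are all driven from SQL. The statements below are a hedged sketch of their shape: every dataset, table, bucket, and model name is invented, and executing them requires a real GCP project, so here they are only composed as strings:

```python
# Hypothetical BigQuery statements illustrating the integrations described
# in the conversation. All dataset, table, bucket, and model names are
# made up; running these requires your own GCP project.

# 1. Expose Spark-written Parquet files to BigQuery as an external table.
external_parquet_ddl = """
CREATE EXTERNAL TABLE mydataset.spark_output
OPTIONS (format = 'PARQUET',
         uris = ['gs://my-bucket/spark-output/*.parquet'])
"""

# 2. Import a trained TensorFlow SavedModel into BigQuery ML...
import_model_ddl = """
CREATE MODEL mydataset.tf_model
OPTIONS (model_type = 'TENSORFLOW',
         model_path = 'gs://my-bucket/saved-model/*')
"""

# 3. ...and run batch inference with plain SQL, joining native BigQuery
#    data against the external Parquet table along the way.
batch_predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL mydataset.tf_model,
                (SELECT n.*, p.extra_feature
                 FROM mydataset.native_table AS n
                 JOIN mydataset.spark_output AS p USING (id)))
"""
```

The `CREATE EXTERNAL TABLE`, `CREATE MODEL ... model_type='TENSORFLOW'`, and `ML.PREDICT` constructs are real BigQuery features; only the names and paths are placeholders.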
And because all of our datasets and our APIs are open, we have connectors to hundreds of different systems. Say, for example, you're using a product like Mixpanel; well, Mixpanel will be able to read out of BigQuery. Okay? If you're using a product like Avalanche, a tiny data warehouse, and you wanna use that instead of BigQuery, not a problem. You can do that, and Avalanche should be able to work with Dataproc. Well, part of what we've done by open sourcing all of the core technologies, TensorFlow, Apache Beam, Kubernetes, etcetera, is that we've basically allowed a really strong ecosystem to grow around our core 10 products.
So we build these 10 products and we say, this is a great set of 10 products. You can basically do what you want to do. It's very well integrated because we focused and we built this great thing. But if you wanna bring in something else, those will interoperate with our products because our products are open source and open API. The way we solve the second part of the conundrum is we build a great set of managed products, and we open source the underlying underpinnings so that our customers ultimately get choice and flexibility.
[00:17:42] Unknown:
And in terms of people who come to the Google Cloud Platform, I'm wondering, in your experience, what you've seen as some of the primary motivators for people who are adopting either some or all of the solutions that you provide.
[00:17:55] Unknown:
So what are some of the primary motivations? So 1 of the motivations is, of course, there's a big organic move anyway from on premises to cloud, and that's primarily driven by the speed of innovation, agility, cost. That's 1 set of customers. And at that point, they're often choosing among the 3 major cloud providers: which one am I gonna choose? And in many cases, what people end up doing is they basically run a POC. They test things. They figure out what works for their platform, etcetera. And we get a fair share of those customers, people who are basically carrying out a cloud transformation, carrying out a move.
Within those sets of customers, who tends to choose us? Customers for whom open source is important. Right? So anybody who is really sold on Kubernetes and the ability to basically run the same workload on multiple clouds likes to choose Google Cloud, because we invented Kubernetes, and our GKE, Google Kubernetes Engine, and Cloud Run are the absolute best at basically providing a managed Kubernetes experience, and you get to basically run them on multiple clouds. Right? So you basically get that choice. Another reason that people often choose Google is because of our strength in data and AI. As I said, right, that's 1 area in which, even to this day, there is no other serverless data warehouse that scales to large datasets. Now you have small serverless data warehouses, in memory data warehouses, and stuff like that.
But BigQuery is still today unique 10 years later for basically scaling from small data all the way to petabytes of data in such a way that you don't have to specify the size of a data warehouse beforehand. You don't have to manage clusters. It is literally bring your code. We will run your code for you. So that amount of no ops serverless thing is the second reason why people choose cloud. The third reason is the quality of our AI. Quality of AI models is completely driven by the amount of data that you use to train them with. And whenever you're building AI models, you have a choice. You can buy stuff, you can build it from scratch, or you can customize it.
Building from scratch is very expensive. So whenever possible, most customers try to see what they can buy and reuse immediately, or what they can customize. And whenever you're buying or you're customizing, the quality of the AI model is extremely important. And the quality of the AI model is almost completely driven by the amount of data used to train that model. Think, for example, of a text to speech model or a speech to text model. We get to basically use Android. Right? YouTube. Right? We basically have the kind of quantity of data and quality of data that other people struggle to get. And therefore, the quality of the AI models that are available on Google Cloud is typically head and shoulders above anything else. And you know this. Right? If you've ever done a translation and you compare Google Translate to a translation system anywhere else, you know the difference in quality. And that difference in quality shows up in every 1 of our AI models. So that's the third reason. And the 4th and final reason is what we call the One Google approach.
For example, a lot of customers find that experience of, say you go ahead and you search for a business. Say I'm searching for ladders. On Google Search, you basically get, you know, what's a ladder? Sure. But you also get, where can I buy a ladder? You get Home Depot. You get Lowe's, etcetera. You can click on Home Depot. You get the hours of Home Depot. You can basically then set it up such that, from your search, you can basically ask that store, do they have this particular, like, 21 foot ladder in stock?
You can do that through voice. Right? Or you can have Google Assistant call on your behalf and get this back. So that kind of a complete experience is doable on Google Cloud, whereas it's something that you have to cobble together somewhere else. So that's the 4th and final reason why a lot of people use Google Cloud.
[00:22:38] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
As people are onboarding into Google Cloud and they're starting to either shift their workloads or they have brand new greenfield projects that they're using Google Cloud for, I'm curious, what are some of the challenges or sort of conceptual hurdles that they run into as they're trying to either make their existing workload fit the model that Google Cloud promotes or as they're trying to understand what the capabilities
[00:23:40] Unknown:
are and the sort of best way to approach how to build their solutions using the suite of products that Google offers? I mean, that's a very broad question, like the best way to approach a solution. So let me take a very concrete example. And hopefully that concrete example will give you an idea of the kinds of possibilities that exist. So 1 of the common things that a lot of enterprises are doing now is that they are outgrowing their on prem Teradata instances. And they're basically moving from Teradata to BigQuery. So let's take that as an example. Let's say you're an on prem customer. You're basically using Teradata on prem, and you wanna basically move it to a serverless data warehouse because it scales much better. It's a lot less expensive.
It lets you do machine learning within the data warehouse. All of those advantages. What does that involve now? A few things. Firstly, on prem, you basically have a bunch of systems that are dependent on Teradata. Right? You may have your Power BI tools that are basically accessing Teradata. You may have your ETL pipelines that are publishing into that Teradata instance. You may have data being exported out of Teradata into reports, etcetera. So when we say we wanna move from Teradata to BigQuery, it is not just about moving Teradata to BigQuery.
It's about moving everything. Right? It's moving that entire ecosystem. So how do we basically simplify that? Right? So we do a few things. Firstly, we have automated query translation. Right? So you can basically take your Teradata queries and convert them into BigQuery queries through automated tools. Now the automated tools aren't perfect, but they basically get quite a bit of them. Like, 85 to 90% of queries can be automatically translated. Then what you do is you basically then go to any kind of an ETL system that is basically publishing into Teradata and change the queries in those systems such that they are now querying off BigQuery instead.
The data, we have automated tools to basically take the data in Teradata and move it into BigQuery. But now we're talking about the queries, we're talking about the dependencies. Another thing that we sometimes find we have to do is that you can't just move all the data all at once. You basically break it workload by workload, and each workload, you move it 1 at a time. And for a while, you have both Teradata and BigQuery going, and you have things being published into both places. And in order to simplify that, you might put a virtualization solution in place where you basically have a system that receives Teradata inputs, but actually executes them on BigQuery and sends back Teradata output.
And so that virtualization basically helps that migration happen. But there's another bit that we also have to do. You have to basically think about how do you upskill, right, and change the way people think about how they design schema, how they design tables, how they write queries in a more efficient way, etcetera. Right? So because again, these systems are different, they're built in a different way. And so part of this whole deal is you have to set up training. Right? So basically training for everyone involved in writing queries, using queries, managing systems, etcetera. So a project like this, let's say, okay. I'm gonna take my Teradata on prem and move it to the BigQuery, involves all of these aspects.
Right? Everything from data and schema migration, to query translation, to virtualization, to people training, and you have to put all of those together. And that's basically something that we have experience in, that we help people with, but that's 1 use case. And we basically have to consider every such thing. And as we build experiences, we build playbooks that basically say, this is how to be successful in this particular scenario.
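To make the automated query translation step concrete, here is a deliberately tiny rule-based sketch in Python. Real translators, such as the BigQuery migration tooling alluded to here, parse the SQL properly; the rewrite rules below are just a few illustrative Teradata-to-BigQuery idioms, not a complete mapping:

```python
import re

# Toy rule-based Teradata -> BigQuery rewrites, for illustration only.
# A production translator parses the SQL; these regexes just show the idea.
RULES = [
    (r"\bSEL\b", "SELECT"),  # Teradata's SELECT shorthand
    (r"\bADD_MONTHS\(\s*(\w+)\s*,\s*(-?\d+)\s*\)",
     r"DATE_ADD(\1, INTERVAL \2 MONTH)"),
    (r"\bZEROIFNULL\(\s*(\w+)\s*\)", r"IFNULL(\1, 0)"),
]

def translate(teradata_sql: str) -> str:
    """Apply each rewrite rule in turn to produce BigQuery-flavored SQL."""
    bq_sql = teradata_sql
    for pattern, replacement in RULES:
        bq_sql = re.sub(pattern, replacement, bq_sql, flags=re.IGNORECASE)
    return bq_sql
```

For example, `translate("SEL ZEROIFNULL(amt) FROM t")` yields `SELECT IFNULL(amt, 0) FROM t`. The 85 to 90% coverage mentioned in the conversation comes from tooling far more thorough than this, but the shape of the task is the same: mechanical rewrites handle most statements, and the remainder needs a human.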
[00:27:46] Unknown:
And in your experience, would you say that BigQuery is sort of the biggest center of gravity in the data products that you're offering, but also in terms of what pulls people into using Google Cloud for their data workloads?
[00:28:01] Unknown:
So, 2 separate questions. Center of gravity and what pulls people. So, center of gravity. BigQuery is the center of gravity for SQL workloads. But SQL is not the only type of workload there is. The center of gravity for data processing workloads is Dataflow. Right? So that's the thing. If you wanna move and manipulate data on the fly, you tend to use Dataflow. If you're manipulating SQL based workloads, you tend to use BigQuery. And if you need transactional support and real time consistency, you use Spanner. And what attracts people to Google Cloud, it really depends. Right? So some of the products are beloved products, if you will. Right? So products that people just fall in love with. 1 of them is Cloud Run.
Cloud Run is basically a completely auto-scaled way to take your container and run it. Any kind of containerized workload, just completely fully managed, auto-scaled, and run. That's Cloud Run. Everybody who's used Cloud Run just absolutely loves it. The second, like, very beloved product on Google Cloud is BigQuery. BigQuery is easy to get started with. It is super powerful. Now, SQL is 1 of those things that's been around for 50 years and it's well understood. And it's extremely diverse in terms of the set of use cases that it serves. So that's the second thing that people absolutely love. Right? So, Cloud Run, BigQuery.
The third thing that people love is Spanner. Right? But Spanner is a much more niche use case. But if you have a problem that Spanner solves, there's nothing like it. There's nothing like Spanner. So that's the third very beloved product, one that people just completely fall in love with. And the fourth one that people completely fall in love with is AI Platform Notebooks. Right? So this idea of a fully managed notebook experience that is completely separate from your compute environment, it always blows people's minds when you basically start a notebook and you say, let me add a GPU to this notebook.
Normally we think of: I have my hardware, and I install software on my hardware. This changes the paradigm completely. You have your notebook, and you attach pieces of compute to that notebook, different pieces of compute at different times depending on what you want. If you wanna run BigQuery from that notebook, you send the query off to BigQuery. If you wanna deploy an ML model for prediction, it's a serverless thing. You do it from the notebook. So the notebook becomes this very lightweight interface
[00:31:01] Unknown:
to the rest of the AI and data platform. And that's another thing that people absolutely love. Yeah. I haven't played around with the notebook piece of it yet, but I do know that with BigQuery, 1 of the other things that can draw people into it, even if it's not something that they're specifically looking for, is the fact that a number of useful and interesting datasets have been published onto BigQuery. I'm thinking, from my own experience, of things like the GitHub repositories that are available for people to do analytics across the entire code base of GitHub, or PyPI package download statistics, and things like that. People are able to take those published datasets and run their own analyses on them, but the people who are publishing the dataset don't have to bear the cost of those queries. And so I think that that's another interesting
[00:31:46] Unknown:
sort of innovative aspect of what BigQuery offers. Absolutely. Absolutely. You touched on a few key points. Number 1, all of the commits in GitHub, all of the questions and answers in Stack Overflow, these are all tremendously large datasets, and people are able to publish them. And then you can just come in and query them without having to set up a cluster, without having to do anything beforehand. It's a dynamic query. And to your final point, the people who publish the data don't have to bear the cost of the people querying the data. That's because of that complete separation of compute and storage, which is incredible, and it has a lot of business implications as well. You're able to share data with your suppliers.
Your suppliers are able to share data with you. These can be 2 different organizations. You can break down silos within your company. So there's a lot of benefits to that kind of mechanism.
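That separation of storage from compute is what makes the billing split possible: the publisher pays to store the data, and each querier pays for their own scans. As a loose illustration of the mechanics (this is a toy model, not the BigQuery API; every class, price, and project name below is made up):

```python
# Toy model of BigQuery-style separation of storage and compute:
# the dataset publisher pays for storage, while each querying
# project is billed only for the bytes its own queries scan.

class SharedDataset:
    def __init__(self, owner, rows):
        self.owner = owner          # project that pays for storage
        self.rows = rows            # the published data

class Billing:
    def __init__(self):
        self.charges = {}           # project -> accumulated cost

    def charge(self, project, amount):
        self.charges[project] = self.charges.get(project, 0.0) + amount

def run_query(dataset, querying_project, predicate, billing,
              price_per_row=0.001):
    """Scan the shared dataset; compute cost goes to the querier."""
    result = [r for r in dataset.rows if predicate(r)]
    billing.charge(querying_project, len(dataset.rows) * price_per_row)
    return result

billing = Billing()
commits = SharedDataset("publisher-project",
                        [{"repo": "a", "lines": 10},
                         {"repo": "b", "lines": 200}])
big = run_query(commits, "analyst-project",
                lambda r: r["lines"] > 100, billing)
print(big)              # rows the analyst selected
print(billing.charges)  # only 'analyst-project' is billed for the scan
```

Note that the publisher's project never appears in the query charges, which is exactly why public datasets are viable to host.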
[00:32:41] Unknown:
In terms of your experience of working with the Google Cloud Platform and working with customers who are onboarding and building out these proofs of concept, where you're helping to educate them on the best practices for how to use these different systems, I'm curious how some of those experiences have fed back into the product suite that Google offers in terms of new capabilities, new interfaces, or just user experience improvements.
[00:33:06] Unknown:
Yeah. Absolutely. And, again, this is not just me. We have teams of folks who basically work with our customers, and we learn. Right? As I mentioned, we have a very full-fledged Teradata-to-BigQuery migration playbook. We know all of the stuff that we can do. You can be sure that was not something that you can dream up in an afternoon. It is something where, each time you do it, you learn and you add to it. You basically make it better and nicer for the next person. An example of a product that we built based on customer engagements is Recommendations AI. The way it works is you bring your product catalog and you bring your user transactions, what have people bought in the past? Bringing those 2 datasets in, we will create a very high-quality, leading-edge recommendations model. Customers who have used it have typically seen improvements of 15 to 20% in terms of lift. Right? So a state-of-the-art, high-quality model. And that was an example of something where we had our professional services team work with a customer and help them build a recommendations model. And then another customer said, hey, help us build a recommendations model too, and we went ahead and helped them do that. Right? But the basic concept: you need a product catalog, and you need user behavior data. Once you have those 2 things, you can build your recommendations model. That means we could kind of abstract out the details of how people store their product catalog, how people store all of their transactions and interactions on their websites, etcetera, and say, here's a schema. If you can give us data in this schema, we can build your recommendations model for you, we can help you integrate it, and we can help you serve it out. Right? And that basically became our Recommendations AI solution.
So we basically have that arc: as we work with customers, we can see what problems they're solving and make solutions out of it. But the converse is also true. Sometimes, as we work with customers on problems that they run into, we say, well, this is dumb. This should not be that hard. And we end up adding that capability into our system. A great example of this, and this is something that we're getting much better at, is that people would create BigQuery tables, and you had no way to rename a table. Why would you ever want to rename a table in production? Well, it turns out that there were customers who kept needing this, and they would have to build workarounds: I'm going to keep writing to my real-time table, then take the last 15 days of data and move it out into a new table, and now I have to rename that table with an old date stamp. Right? So those are the kinds of things that people wanted to do, and we did not make it easy. There was no easy way to rename a table.
Just very recently, just over the last week, we added a SQL statement to rename a table. Right? So those are the kinds of things that we see. There's customer friction, a customer pain point. Let's basically, you know, make it easy. Let's solve it. Those are things where we get feature requests, we get customers who get stuck, customers who build workarounds. Again, renaming a table is not hard. You can do it, but it shouldn't be that painful. And that's why we built it into the system.
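The rotation workaround described here, and how little of it survives once a rename statement exists, can be sketched with an in-memory stand-in for a dataset. This is a hypothetical simulation, not BigQuery client code; in BigQuery itself the new DDL is along the lines of `ALTER TABLE mydataset.mytable RENAME TO new_name`.

```python
# Toy simulation of the table-rename workaround: without a RENAME
# operation, "renaming" means copying every row into a new table and
# dropping the old one; with RENAME it is a single metadata change.
# The "dataset" here is just a dict of table name -> rows.

dataset = {"realtime_events": [{"id": 1}, {"id": 2}]}

def rename_by_copy(dataset, old, new):
    """The old, painful path: rewrite all the data under a new name."""
    dataset[new] = list(dataset[old])   # full copy of every row
    del dataset[old]

def rename_in_place(dataset, old, new):
    """What ALTER TABLE ... RENAME TO amounts to: metadata only."""
    dataset[new] = dataset.pop(old)

# Rotate the live table out under a date stamp, then start fresh.
rename_by_copy(dataset, "realtime_events", "events_20210301")
dataset["realtime_events"] = []          # new live table
# With real RENAME support, the same move is 1 cheap operation.
rename_in_place(dataset, "events_20210301", "archived_events")
print(sorted(dataset))  # ['archived_events', 'realtime_events']
```

The difference matters at scale: the copy path rewrites the whole table, while the in-place path touches only its name.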
[00:36:56] Unknown:
In terms of the architectural patterns of systems design, the data integration flows, and any software patterns that go along with that, I'm curious if there are any approaches that you see as being unique to the Google Cloud Platform suite for data and analytics that you don't see in externally available systems, whether that's open source products, other cloud providers, or on-premises systems?
[00:37:23] Unknown:
Absolutely. I think 1 very common pattern that you see on Google Cloud is a trifecta: Pub/Sub, Dataflow, BigQuery. It really works on Google Cloud because Pub/Sub is a message queue that is global. You can publish into Pub/Sub from anywhere in the world, and it is 1 single Pub/Sub, right, in which you have multiple topics and subscriptions, etcetera, but a single Pub/Sub. And then Dataflow, which is basically a single system to process both batch and stream, which means you can do replays, you can do historical data, you can do real-time data. And BigQuery, which is also global and serverless. So now you have 3 serverless products, Pub/Sub, Dataflow, BigQuery, that essentially allow you to ingest data, process it, and make it available for analytics.
So that's a really common thread that you see on Google Cloud. It's a very, very simple architecture that just nails the most common sets of use cases, whether it is IoT, whether it is web traffic, whether it is connected devices, whether it is log analytics. It is just these 3 products. Right? They just nail it. They work on batch, they work on stream, they work with structured data and semi-structured data, and you can do machine learning in BigQuery off of it. That architecture is so powerful and so simple, but it doesn't exist in other places because there is no global message queue. Instead, you basically have to build your message queue in every location.
There is no serverless data processing; you need to build separate batch and stream code paths. There is no serverless data warehouse that scales up and down depending on the traffic, depending on the user query, and depending on the data sizes. It doesn't support streaming SQL. It doesn't support machine learning. All of this is, I think, 1 of the things that continues to surprise me. Many years on, we've shown how it can be done, but I think it speaks to the challenge involved in implementing that simple architecture that it doesn't actually exist elsewhere.
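At its core, the Pub/Sub, Dataflow, BigQuery trifecta is an ingest/transform/sink pipeline. The sketch below imitates that shape with plain Python generators and no GCP libraries at all; in a real deployment the queue would be a Pub/Sub topic, the transform an Apache Beam pipeline running on Dataflow, and the sink a BigQuery table. The message format and validation rule are invented for illustration.

```python
import json

# In-memory stand-ins for the 3 stages: a message queue ("Pub/Sub"),
# a streaming transform ("Dataflow"), and an analytics sink ("BigQuery").

queue = [
    json.dumps({"device": "sensor-1", "temp_c": 21.5}),
    json.dumps({"device": "sensor-2", "temp_c": 99.9}),
    json.dumps({"device": "sensor-1", "temp_c": 22.1}),
]

def ingest(messages):
    """'Pub/Sub': yield raw messages in publish order."""
    yield from messages

def transform(records):
    """'Dataflow': parse, validate, and enrich each element."""
    for raw in records:
        row = json.loads(raw)
        if row["temp_c"] < 60:           # drop implausible readings
            row["temp_f"] = round(row["temp_c"] * 9 / 5 + 32, 1)
            yield row

table = list(transform(ingest(queue)))   # 'BigQuery': the sink table
print(len(table))   # 2 rows survive validation
```

Because each stage only consumes an iterator, the same transform works whether the source is a bounded batch (a historical replay) or an unbounded stream, which is the batch/stream unification point made above.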
[00:39:57] Unknown:
And in your experience of building your own projects on top of Google Cloud and working with your customers to help enable them to build their systems, what are some of the interesting or innovative or unexpected ways that you've seen the product suite used?
[00:40:11] Unknown:
Our customers do amazing things. So let me try to pick a couple of public things. Spotify, for example. Their royalty calculation is done on Google. Just think about what it means to calculate the royalty involved for a specific artist. You have to know what song is playing on every user's device everywhere in the world, and for how long it was played, in order to calculate the royalty. I mean, it's a mind-blowingly challenging use case, and it's completely doable. In fact, it's done with that same simple architecture I talked to you about: Pub/Sub, Dataflow, BigQuery, notebooks.
That's it. It's amazing that Spotify is able to do something that complex on that simple of an architecture. It speaks to the power of the platform.
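Spotify's real royalty system is of course far more involved, but the core aggregation, total qualified listening time per artist turned into a share of a royalty pool, fits in a few lines. The events, threshold, and pool size below are entirely made up for illustration.

```python
from collections import defaultdict

# Toy royalty calculation: sum qualified play time per artist, then
# split a fixed royalty pool proportionally. Real systems add fraud
# filtering, per-market rates, rights splits, and much more.

plays = [
    {"artist": "A", "seconds": 200},
    {"artist": "B", "seconds": 100},
    {"artist": "A", "seconds": 100},
    {"artist": "B", "seconds": 20},   # too short to qualify
]

MIN_QUALIFYING_SECONDS = 30           # made-up qualification rule
POOL = 100.0                          # made-up royalty pool

listened = defaultdict(int)
for p in plays:
    if p["seconds"] >= MIN_QUALIFYING_SECONDS:
        listened[p["artist"]] += p["seconds"]

total = sum(listened.values())
royalties = {artist: POOL * secs / total
             for artist, secs in listened.items()}
print(royalties)   # {'A': 75.0, 'B': 25.0}
```

The hard part in production isn't this arithmetic; it's collecting and deduplicating the play events from every device worldwide, which is exactly what the ingest-and-process layers handle.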
[00:41:15] Unknown:
And in your own experience of working at Google and with Google products, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:23] Unknown:
I think the biggest lesson that we've learned is that a radically simple architecture can be off-putting and can be different and new, especially if you're dealing with a lot of complexity on premises. Right? So people often don't believe us when we say it can be this simple. They try to replicate the exact structures that they're using on prem, they try to bring them to the cloud, and they say, well, do you have a way to do x, y, and z? And our answer is, why do you need to do x, y, and z? It turns out that it's not even needed anymore. But those conversations can be a little hard when you're coming in, when you realize that you don't really need to have authentication built into every database product.
You can have 1 single cloud authentication that basically gives you access to all your products. If you need to guard against data exfiltration, we basically say use a Virtual Private Cloud and VPC Service Controls. And that's a new concept, completely unlike what someone may have been familiar with on prem, and trying to make design decisions while you're learning about these different ways of looking at a problem can be pretty challenging. So that's 1 of the things that we've learned: to step back and ask what the customer is trying to achieve rather than trying to replicate what they had on premises.
[00:43:09] Unknown:
In your experience of working with customers and helping them realize the best solutions to their problems, are there any cases that you can think of where the Google Cloud Platform, or a specific product that they were intent on using, was the wrong choice for the problem that they were trying to solve? In terms of the wrong choice,
[00:43:31] Unknown:
we've had people try to do ML inference on GPUs, for example, because, hey, GPUs are the most performant. And they would find that their ML models were not sped up by the GPU. That, it turns out, is because GPUs are really good at dense models, at speeding up matrix manipulations and so on. They're not as good if you're serving out recommendations, if you're serving out sparse models. Right? So sometimes people end up choosing an architecture because they go look at performance statistics and say, what is the best hardware to run my ML inference? The best hardware to run ML inference is the GPU, but the GPU is not the best place to run every single ML model that you have. There are some ML models that you should be running on a CPU and not on a GPU. Right?
So sometimes people make choices of products and technologies based on the overall aggregate, and the details sometimes matter. For example, where is the best place to do your analytics? It's absolutely BigQuery. But if your throughput is extremely high and your latency requirements are extremely low, then BigQuery is not a good choice. You should be using Bigtable instead. Right? So sometimes you make the product choice without realizing the corner cases, and the corner cases sometimes turn out to be important. And that's when you have to go back and change your design and use a different product, rather than try to soldier on and get better latency out of BigQuery. It is sometimes much better to say, okay, not BigQuery for this problem, but Bigtable for this problem.
[00:45:30] Unknown:
As you continue to work with customers and work with the platform and observe the ongoing trends in the industry, what are some of the new capabilities or new services or emerging approaches and technologies that you're personally excited for?
[00:45:45] Unknown:
Super excited by this new product we have called Datastream. 1 of the very hard things that people used to do was change data capture from an operational system to an analytics system, and you would have to go build your own. Datastream is basically a fully managed, serverless way of mirroring your transactional database into your analytics database. So all changes that happen to your Oracle system show up in BigQuery automatically. Right? Fully autoscaled. It just happens. So that is a radical simplification of many people's lives.
So I'm super excited about it. Another thing that is fundamentally very exciting to me is this set of solutions that we call Document AI. The world is full of paper processes: invoices, bills, W-2 forms, etcetera. They're part of a lot of business processes, and so we end up accepting a lot of error in data entry and a lot of labor in digitizing those things in order to use them. Where Document AI comes in is being able to apply machine learning to understand these unstructured datasets. We've had people use Document AI for mortgage processing, for procurement, and so on. And that, to me, is a revolution that is coming to a lot of back offices. AI has now gotten good enough that much of the drudgery that's involved can be handled by AI.
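The change-data-capture mirroring that Datastream automates boils down to replaying an ordered change log from the transactional source against an analytics replica. As a minimal, hand-rolled sketch of just that mechanic (no Datastream, Oracle, or BigQuery APIs; the log format is invented):

```python
# Toy change-data-capture applier: replay INSERT/UPDATE/DELETE events
# from a transactional source against an analytics replica, keyed by
# primary key. Datastream does this continuously, at scale, and fully
# managed; this shows only the core idea.

def apply_changes(replica, change_log):
    for change in change_log:
        op, key = change["op"], change["pk"]
        if op in ("INSERT", "UPDATE"):
            replica[key] = change["row"]     # upsert the latest image
        elif op == "DELETE":
            replica.pop(key, None)           # tolerate already-gone rows
    return replica

replica = {}
log = [
    {"op": "INSERT", "pk": 1, "row": {"name": "widget", "qty": 5}},
    {"op": "INSERT", "pk": 2, "row": {"name": "gadget", "qty": 1}},
    {"op": "UPDATE", "pk": 1, "row": {"name": "widget", "qty": 4}},
    {"op": "DELETE", "pk": 2},
]
apply_changes(replica, log)
print(replica)   # {1: {'name': 'widget', 'qty': 4}}
```

The hard production problems, reading the source's redo log, preserving ordering, and handling backfill of pre-existing rows, are precisely what a managed service takes off your hands.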
[00:47:31] Unknown:
And are there any other aspects of your work on the Google Cloud Platform or the suite of services that you offer and that you've helped customers onboard onto that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:44] Unknown:
Actually, 1 thing we haven't talked about is COVID. We're living through this pandemic, and 1 of the coolest things about working at Google for me in the last year and a half has been how many companies and organizations we've been able to touch and improve, basically by being the IT department for the world. A few examples, maybe, to close things out, because I think we all love to ask, where are we having an impact beyond just the technology? Right? So an example: when COVID first started, unemployment basically rose dramatically.
And the labor departments of many states couldn't keep up with the hundreds of thousands of applications that they were dealing with. I was talking to you about Document AI. That was 1 of the solutions that we put in place to help a lot of state departments of labor triage unemployment applications. Okay? So 1 of the neat things is that, you know, I feel proud that we, as Google, were able to help improve the lives of so many people whose unemployment claims might have been processed 30 or 40 days too late. Instead, we were able to process them on time, basically by applying cloud technologies and AI to the problem. Another example, again dealing with COVID, is that a lot of grocery chains suddenly had to grow up in a matter of weeks.
Right? So what people thought was gonna happen by 2027 happened in a matter of 8 weeks in 2020. Right? The number of people shopping online, the number of people wanting curbside delivery. Take curbside delivery. We had a grocery chain that needed to implement curbside delivery pronto. What does it involve to do curbside delivery? Well, you need to make sure that your inventory system in the store is perfectly up to date, which means that before you can do curbside delivery, you need to have a real-time inventory system. So we helped them build a real-time inventory system in 6 weeks.
This is the kind of project that would have normally taken 3 years. We just did it in a matter of weeks. And that's again a testament to the number of people and organizations that we were able to help. I feel very fortunate and very proud to have been part of Google and part of Google Cloud, where we've been able to help a lot of organizations come through COVID. For example, Google Meet: we provided it free for education, and usage of Google Meet went up double-digit times, right? Dramatic growth in usage. Even my kids' schools use Meet, and that's how their schools have run over the last year. So that's another example of cloud and cloud technologies coming to the rescue of how we as a society have been able to deal with the pandemic.
[00:51:07] Unknown:
And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:16] Unknown:
Probably the biggest gap that exists is in terms of what we ask our users to do for data governance. The data governance story today is not yet fully integrated. It is not yet as easy and as seamless as it ought to be. So as technologists, as builders of technology tools, the biggest gap that we need to address is that we need to make data governance easier, easier to understand, easier to implement, easier to monitor, and easier to secure.
[00:51:56] Unknown:
Absolutely agreed on that point. Well, thank you very much for taking the time today to join me and share the work that you've been doing and your experience of helping people onboard to Google Cloud, and for helping us explore the capabilities that it provides. It's definitely a very interesting and powerful suite of services. So I appreciate you taking the time today to join me, and all of the time and effort you've put into helping make it a more usable and more useful platform. So thank you for all of that, and I hope you enjoy the rest of your day. Thank you very much. It was a lot of fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview of Atlan
Interview with Lak Lakshmanan: Introduction and Background
Google Cloud Platform: Tools and Products Overview
Customer Motivations and Use Cases for Google Cloud
Challenges and Best Practices for Onboarding to Google Cloud
BigQuery and Its Role in Google Cloud
Customer Feedback and Product Development
Unique Architectural Patterns in Google Cloud
Innovative Customer Use Cases
Impact of COVID-19 and Google Cloud's Role
Future of Data Management and Closing Remarks