Summary
Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?
- How do the various systems relate to each other for building a full workflow?
- How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?
- What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads?
- What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers?
- What are the systems that you have found to be easiest to work with?
- Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?
- How has your work with customers fed back into the products that you are building on top of?
- What are some examples of architectural or software patterns that are unique to the GCP product suite?
- What are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts?
- What are some of the new capabilities, new services, or industry trends that you are most excited for?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Google Cloud
- Forrester Wave
- Dremel
- BigQuery
- MapReduce
- Cloud Spanner
- Hadoop
- Tensorflow
- Google Cloud SQL
- Apache Spark
- Dataproc
- Dataflow
- Apache Beam
- Databricks
- Mixpanel
- Avalanche data warehouse
- Kubernetes
- GKE (Google Kubernetes Engine)
- Google Cloud Run
- Android
- Youtube
- Google Translate
- Teradata
- Power BI
- AI Platform Notebooks
- GitHub Data Repository
- Stack Overflow Questions Data Repository
- PyPI Download Statistics
- Recommendations AI
- Pub/Sub
- Bigtable
- Datastream
- Change Data Capture
- Document AI
- Google Meet
- Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a t l a n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l i n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Lak Lakshmanan about the suite of services for data and analytics in the Google Cloud Platform. So, Lak, can you start by introducing yourself?
[00:02:06] Unknown:
I'm Lak Lakshmanan. I lead analytics and AI solutions at Google Cloud. What our team does is we find problems that a lot of our customers are solving in a repeated way. And we basically provide a reference architecture and the solutions to those common repeated problems in such a way that people can get started a lot more easily. So for example, we see a lot of customers trying to build a marketing analytics platform on Google Cloud. So we basically build reference guides and implementations of how to very quickly bring together ads data with your transaction data in order to carry out marketing campaigns. So that's a very common use case.
But if you are new to the problem, it can help to see what a best practice solution would look like.
[00:02:57] Unknown:
And so that's essentially what my team does. And do you remember how you first got involved in the area of data management?
[00:03:03] Unknown:
When did I get involved in data management? Well, I've been at Google 5 years now. But before I was at Google, I was a research scientist doing machine learning for weather prediction. So essentially predicting flash floods, tornadoes, hail, lightning, etcetera. So essentially, all my career has involved extracting insights out of data. And as you know, if you wanna be extracting insights out of data, especially as a researcher, you have to learn how to manage the data, how to ensure that it has the right quality so that automated systems work well. The data quality considerations of any automated system that you build are gonna be much higher than when you can count on a human user to apply their judgment before they use the data. Right? And so data management, data governance, data quality have always been at the forefront of everything that I've done.
In fact, some of the machine learning algorithms that we built were algorithms to find quality problems in the data and basically fix them. For example, 1 of the neural networks that I built that is still being used in production is to take weather radar data and remove signals from that data that correspond to birds and insects and what's called chaff or anomalous propagation, etcetera. So, a few different scenarios, and, you know, we built a machine learning model to do that. So yeah. So the question is, where did I get started with data management? It seems like I got started soon out of college with my first job. Yeah. Trying to account for all the anomalies that can be caused by various different particulates and organisms sounds like a pretty interesting and complex challenge. Exactly. Exactly. And it's 1 of those things that is never ending. Right? So you built the algorithm. It works. It works most of the time. But when it comes to quality control, and this is 1 of the interesting things, is that we had an algorithm that worked.
I think its accuracy was 99.3%. That sounds really good, but 0.7% error meant that, like, we had data that was coming in from about 150 radars. It was basically coming in every 20 seconds. Every 20 seconds, we basically had a new scan. And when you basically did the math, it turns out that 30% of your images had artifacts in them. Right? So you go from 99.3% pixel-level accuracy to 70% image-level accuracy. At which point, it's like, is this thing even usable?
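The jump from 99.3% to roughly 70% is worth sketching, since it's the core of the point: per-pixel accuracy compounds multiplicatively across every independently judged region of a scan. The episode doesn't state the number of regions per image, so the counts below are assumptions; around 50 independent regions reproduces the quoted figures:

```python
# Per-pixel accuracy compounds across a scan. The region counts below are
# assumptions (the episode doesn't give them); ~50 independent regions per
# image reproduces the quoted drop from 99.3% to roughly 70%.
per_pixel_accuracy = 0.993

def image_level_accuracy(p: float, regions: int) -> float:
    """Probability that every independently judged region is artifact-free."""
    return p ** regions

for n in (10, 50, 100):
    print(n, round(image_level_accuracy(per_pixel_accuracy, n), 3))
```

With 50 regions the image-level accuracy is about 0.70, i.e. roughly 30% of images carry at least one artifact, matching the numbers in the conversation.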
[00:05:52] Unknown:
But it turns out that it is and that we have to be careful, but it's also an ongoing governance program as you work through the data. And so as you mentioned, you've been at Google for 5 years now. And I'm wondering if you can just start by giving a bit of an overview of the tools and products that are offered as part of Google Cloud, particularly focused on data and analytics and data management and sort of where your team sits in relation to those products.
[00:06:17] Unknown:
The data cloud, as we call it, the data platform at Google, is 1 of our strongest offerings. In fact, like, now when you look at, for example, the recent Forrester Wave, we're at the pole position, right at the very top right corner of that graph. Google's well known for having invented many of the big data processing technologies that we all use today. Right? So whether it's NoSQL with Bigtable, separation of compute and storage with Dremel and BigQuery, MapReduce, of course, which was basically the inspiration for Hadoop, or Spanner and TrueTime, the way to basically have global consistency.
Google's always been at the forefront of innovation in terms of data. But for the longest time, we would basically publish papers and expect other people to implement them. I think what's changed with Google Cloud is that you now have a very easy entry into the quality of the analytics and operational datasets and AI systems that Google has built and that Google uses. So when you think of something like TensorFlow, for example, it didn't follow the mechanism that we did with MapReduce, where we published a paper on MapReduce around the time that we were getting out of MapReduce and into Dremel and BigQuery. So when the whole Hadoop ecosystem was basically expanding in the outside world, internally, Google had already moved on because we saw a lot of problems that happened with having to manage and maintain clusters and operationalization, etcetera, into a completely serverless data warehouse.
On the other hand, when it comes to AI, you know, when you take TensorFlow, for example, we basically innovated in the outside world. Right? So it has been an open source project all the way through. And the TensorFlow that you and I use external to Google is the same TensorFlow that gets used internal to Google. So you basically get to share in that innovation. If you ask me, what is part of the platform? What is part of the platform, in broad terms, is transactional data systems. And by that I mean Cloud SQL, which is basically a managed Postgres, MySQL, or SQL Server, a fully managed relational database, which typically works as long as you can run it on 1 vertically scaled machine. But once your datasets grow larger, you want a distributed database, and that's where Spanner comes in. So Spanner is really good for once your transactional requirements grow beyond a single database and you also want to run these databases globally. So we have a lot of, for example, gaming companies and banks that use Spanner because they need to basically manage transactions in a global way. So that's 1 part of our platform, the databases part of the platform. Another part of the data cloud is analytics.
And the crown jewel of our analytics platform is BigQuery. BigQuery is basically a way to carry out analytics on structured and semi-structured data. It has full separation of compute and storage. So it basically scales from really small datasets to petabytes of data. And you basically get thousands of machines that you get to use for seconds at a time to carry out your analytics. That is BigQuery. BigQuery is our SQL data warehouse. It's also a data lake because it basically allows you to store structured and semi-structured data. A lot of people use Spark and Hadoop, the Hadoop ecosystem.
So a lot of people coming on to Google Cloud, the easy entry point into Google Cloud, is to use Google Cloud to run your existing Spark and Hadoop workloads. Dataproc is a product that provides that stepping stone into Google Cloud, into a fully modern cloud platform. The 3rd big product that is part of our analytics data platform is called Dataflow. Dataflow is our ETL and data processing engine. It's the managed version of Apache Beam, which is open source; another example of a technology that Google invented, but that we've open sourced and that we provide a managed environment for. The neat thing about Apache Beam is that you write the same code for both batch and stream. The B is from batch, the EAM from stream. Right? So it's basically a way to run identical code on both batch and stream, which, as you know, is a big challenge because most people end up building 2 separate systems, which is a complete waste. Right? You wanna build a single system that seamlessly transitions from historical data to real time data. And this is particularly important if you're doing machine learning, because when you do machine learning, you train on historical data, but you predict on brand new data. And you don't wanna have to build 2 separate systems, 1 for training and 1 for inference.
You want to use a unified system. So Apache Beam, which is part of our data cloud, is a very important component for simplifying the productionization of ML models. The 3rd part of our data cloud is the AI platform, Vertex AI, which basically provides you a way to develop ML models and to deploy them. In the case of developing ML models, you have tools like notebooks. You also have the concept of datasets, where the dataset could be built from any of those data sources that I talked about, through to deployment, where basically you deploy into an auto-scaling serverless service, a prediction service.
And connecting the development and the deployment is a pipeline system that allows you to go very quickly to operationalize ML models that you've built, add things like continuous evaluation, monitoring, etcetera, feature stores to those modules.
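The single-codepath idea described above can be sketched without the Beam API itself. In Beam, the same transforms run unchanged under a batch or streaming runner; the plain-Python sketch below (field names invented for illustration, not Beam code) only shows why one set of transform logic can serve both the training path and the inference path:

```python
# Illustrative sketch of Beam's unified model in plain Python (this is NOT
# the Beam API): one transform serves both bounded historical data and an
# unbounded live feed. Field names here are made up for illustration.
def parse_and_clean(event: dict) -> dict:
    """The single transform shared by batch (training) and stream (inference)."""
    return {"user": event["user"].strip().lower(),
            "value": float(event["value"])}

# Batch path: historical data, processed all at once for model training.
historical = [{"user": " Alice ", "value": "3"},
              {"user": "BOB", "value": "4.5"}]
training_rows = [parse_and_clean(e) for e in historical]

# Streaming path: the exact same function, applied as events arrive,
# so there is no second implementation to drift out of sync.
def streaming_rows(live_events):
    for event in live_events:
        yield parse_and_clean(event)
```

In actual Beam code, `parse_and_clean` would sit inside a `beam.Map` in a pipeline whose source is either bounded (files, BigQuery) or unbounded (Pub/Sub); swapping the source doesn't change the transform.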
[00:12:48] Unknown:
The different products in the Google Cloud Platform are obviously designed to be able to integrate well together so that when somebody onboards, they can then go from idea to delivery entirely within the Google Cloud suite of services. And I'm wondering what you see as some of the challenges for being able to provide that clean integration between products while also being able to make them useful in isolation?
[00:13:20] Unknown:
That is a great question. And it's something that we really take into account as we design our products, our solutions, and our platform. To take the first point, the integration between these products is extremely important because we want to basically deliver the ability to quickly do end to end workflows. Now, when I went through the platform, it may have sounded like a lot. But really, I was talking about 10 products. But if you were to go look at, you know, other competing platforms, you will see on the order of 150, 200 products.
Right? It takes real discipline to bring this down to 10 products that work really, really well together, that solve the wide variety of use cases, that do so in a very innovative way, that are very consistent with each other. Right? So 1 of the things that Google Cloud is really known for is the quality of our user experience. And it doesn't come magically. It comes because it's something that we design for and we use. But then the converse is, what happens if you want just a point solution? These things are very well integrated and very well put together. For example, I talked about Spark and I talked about BigQuery, and you run Spark on Dataproc, and on BigQuery you can run SQL.
But what if you want to run a Spark program on data that's in BigQuery? Not a problem. There is a Spark connector such that you can run your Spark program on data in BigQuery. What if your Spark program has created Parquet files that you want to query and join against your dataset in BigQuery? Not a problem. The BigQuery SQL engine can read and do SQL on Parquet files dynamically. What if you wanna train a TensorFlow model, right, on data that's in BigQuery? Not a problem. TensorFlow has a BigQuery reader that's basically able to read it and train your model. What if, after you've trained your ML model, you say, I don't wanna deploy to an ML prediction engine, I want to actually run this on batch data in BigQuery?
Not a problem. You can take your TensorFlow model, load it into BigQuery, and run a SQL query that invokes TensorFlow. Right? So all these things are extremely well tied together and well integrated. Having said that, we will always have customers who say, well, no. I love BigQuery, but I don't want to use your Spark engine. I want to use Databricks. That works as well. Databricks is a Google partner. You can get them on the marketplace, and you can work with them. Right? It basically uses Cloud IAM, the same identity access management. That's part of basically building an open platform where you can basically get other point solutions.
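The Parquet and TensorFlow integrations mentioned above are all driven from SQL. The statements below are a hedged sketch of their shape: every dataset, table, bucket, and model name is invented, and executing them requires a real GCP project, so here they are only composed as strings:

```python
# Hypothetical BigQuery statements illustrating the integrations described
# in the conversation. All dataset, table, bucket, and model names are
# made up; running these requires your own GCP project.

# 1. Expose Spark-written Parquet files to BigQuery as an external table.
external_parquet_ddl = """
CREATE EXTERNAL TABLE mydataset.spark_output
OPTIONS (format = 'PARQUET',
         uris = ['gs://my-bucket/spark-output/*.parquet'])
"""

# 2. Import a trained TensorFlow SavedModel into BigQuery ML...
import_model_ddl = """
CREATE MODEL mydataset.tf_model
OPTIONS (model_type = 'TENSORFLOW',
         model_path = 'gs://my-bucket/saved-model/*')
"""

# 3. ...and run batch inference with plain SQL, joining native BigQuery
#    data against the external Parquet table along the way.
batch_predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL mydataset.tf_model,
                (SELECT n.*, p.extra_feature
                 FROM mydataset.native_table AS n
                 JOIN mydataset.spark_output AS p USING (id)))
"""
```

The `CREATE EXTERNAL TABLE`, `CREATE MODEL ... model_type='TENSORFLOW'`, and `ML.PREDICT` constructs are real BigQuery features; only the names and paths are placeholders.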
And because all of our datasets and our APIs are open, we have connectors to hundreds of different systems. Say, for example, you're using a product like Mixpanel; well, Mixpanel will be able to read out of BigQuery. Okay? If you're using a product like Avalanche, a tiny data warehouse, and you wanna use that instead of BigQuery, not a problem. You can do that, and Avalanche should be able to work with Dataproc. Well, part of what we've done by open sourcing all of the core technologies, TensorFlow, Apache Beam, Kubernetes, etcetera, is that we've basically allowed a really strong ecosystem to grow around our core 10 products.
So we build these 10 products and we say, this is a great set of 10 products. You can basically do what you want to do. It's very well integrated because we focused and we built this great thing. But if you wanna bring in something else, those will interoperate with our products because our products are open source and open API. The way we solve the second part of the conundrum is we build a great set of managed products, and we open source the underlying underpinnings so that our customers ultimately get choice and flexibility.
[00:17:42] Unknown:
And in terms of people who come to the Google Cloud Platform, I'm wondering, in your experience, what you've seen as some of the primary motivators for people who are adopting either some or all of the solutions that you provide.
[00:17:55] Unknown:
So what are some of the primary motivations? So 1 of the motivations is, of course, there's a big organic move anyway from on premises to cloud, and that's primarily driven by the speed of innovation, agility, cost. That's 1 set of customers. And at that point, they're often choosing among the 3 major cloud providers: which one am I gonna choose? And in many cases, what people end up doing is they basically run a POC. They test things. They figure out what works for their platform, etcetera. And we get a fair share of those customers, people who are basically carrying out a cloud transformation, carrying out a move.
Within those sets of customers, who tends to choose us? Customers for whom open source is important. Right? So anybody who is really sold on Kubernetes and the ability to basically run the same workload on multiple clouds likes to choose Google Cloud, because we invented Kubernetes, and our GKE, Google Kubernetes Engine, and Cloud Run are the absolute best at basically providing a managed Kubernetes experience, and you get to basically run them on multiple clouds. Right? So you basically get that choice. Another reason that people often choose Google is because of our strength in data and AI. As I said, right, that's 1 area in which, even to this day, there is no other serverless data warehouse that scales to large datasets. Now you have small serverless data warehouses, in memory data warehouses, and stuff like that.
But BigQuery is still today unique 10 years later for basically scaling from small data all the way to petabytes of data in such a way that you don't have to specify the size of a data warehouse beforehand. You don't have to manage clusters. It is literally bring your code. We will run your code for you. So that amount of no ops serverless thing is the second reason why people choose cloud. The third reason is the quality of our AI. Quality of AI models is completely driven by the amount of data that you use to train them with. And whenever you're building AI models, you have a choice. You can buy stuff, you can build it from scratch, or you can customize it.
Building from scratch is very expensive. So whenever possible, most customers try to see what they can buy and reuse immediately, or what they can customize. And whenever you're buying or you're customizing, the quality of the AI model is extremely important. And the quality of the AI model is almost completely driven by the amount of data used to train that model. Think, for example, of a text to speech model or a speech to text model. We get to basically use Android. Right? YouTube. Right? We basically have the kind of quantity of data and quality of data that other people struggle to get. And therefore, the quality of the AI models that are available on Google Cloud is typically head and shoulders above anything else. And you know this. Right? If you've ever done a translation and you compare Google Translate to a translation system anywhere else, you know the difference in quality. And that difference in quality shows up in every 1 of our AI models. So that's the third reason. And the 4th and final reason is what we call the One Google approach.
For example, a lot of customers find that experience of, say you go ahead and you search for a business. Say I'm searching for ladders. On Google Search, you basically get, you know, what's a ladder? Sure. But you also get, where can I buy a ladder? You get Home Depot. You get Lowe's, etcetera. You can click on Home Depot. You get the hours of Home Depot. You can basically then set it up such that, from your search, you can basically ask that store, do they have this particular, like, 21 foot ladder in stock?
You can do that through voice. Right? Or you can have Google Assistant call on your behalf and get this back. So that kind of a complete experience is doable on Google Cloud, whereas it's something that you have to cobble together somewhere else. So that's the 4th and final reason why a lot of people use Google Cloud.
[00:22:38] Unknown:
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
As people are onboarding into Google Cloud and they're starting to either shift their workloads or they have brand new greenfield projects that they're using Google Cloud for, I'm curious, what are some of the challenges or sort of conceptual hurdles that they run into as they're trying to either make their existing workload fit the model that Google Cloud promotes or as they're trying to understand what the capabilities
[00:23:40] Unknown:
are and the sort of best way to approach how to build their solutions using the suite of products that Google offers? I mean, that's a very broad question, like the best way to approach a solution. So let me take a very concrete example. And hopefully that concrete example will give you an idea of the kinds of possibilities that exist. So 1 of the common things that a lot of enterprises are doing now is that they are outgrowing their on prem Teradata instances. And they're basically moving from Teradata to BigQuery. So let's take that as an example. Let's say you're an on prem customer. You're basically using Teradata on prem, and you wanna basically move it to a serverless data warehouse because it scales much better. It's a lot less expensive.
It lets you do machine learning within the data warehouse. All of those advantages. What does that involve now? A few things. Firstly, on prem, you basically have a bunch of systems that are dependent on Teradata. Right? You may have your Power BI tools that are basically accessing Teradata. You may have your ETL pipelines that are publishing into that Teradata instance. You may have data being exported out of Teradata into reports, etcetera. So when we say we wanna move from Teradata to BigQuery, it is not just about moving Teradata to BigQuery.
It's about moving everything. Right? It's moving that entire ecosystem. So how do we basically simplify that? Right? So we do a few things. Firstly, we have automated query translation. Right? So you can basically take your Teradata queries and convert them into BigQuery queries through automated tools. Now the automated tools aren't perfect, but they basically get quite a bit of them. Like, 85 to 90% of queries can be automatically translated. Then what you do is you basically then go to any kind of an ETL system that is basically publishing into Teradata and change the queries in those systems such that they are now querying off BigQuery instead.
The data, we have automated tools to basically take the data in Teradata and move it into BigQuery. But now we're talking about the queries, we're talking about the dependencies. Another thing that we sometimes find we have to do is that you can't just move all the data all at once. You basically break it workload by workload, and each workload, you move it 1 at a time. And for a while, you have both Teradata and BigQuery going, and you have things being published into both places. And in order to simplify that, you might put a virtualization solution in place where you basically have a system that receives Teradata inputs, but actually executes them on BigQuery and sends back Teradata output.
And so that virtualization basically helps that migration happen. But there's another bit that we also have to do. You have to basically think about how do you upskill, right, and change the way people think about how they design schema, how they design tables, how they write queries in a more efficient way, etcetera. Right? So because again, these systems are different, they're built in a different way. And so part of this whole deal is you have to set up training. Right? So basically training for everyone involved in writing queries, using queries, managing systems, etcetera. So a project like this, let's say, okay. I'm gonna take my Teradata on prem and move it to the BigQuery, involves all of these aspects.
Right? Everything from data and schema migration, to query translation, to virtualization, to people training, and you have to put all of those together. And that's basically something that we have experience in, that we help people with, but that's 1 use case. And we basically have to consider every such thing. And as we build experiences, we build playbooks that basically say, this is how to be successful in this particular scenario.
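To make the automated query translation step concrete, here is a deliberately tiny rule-based sketch in Python. Real translators, such as the BigQuery migration tooling alluded to here, parse the SQL properly; the rewrite rules below are just a few illustrative Teradata-to-BigQuery idioms, not a complete mapping:

```python
import re

# Toy rule-based Teradata -> BigQuery rewrites, for illustration only.
# A production translator parses the SQL; these regexes just show the idea.
RULES = [
    (r"\bSEL\b", "SELECT"),  # Teradata's SELECT shorthand
    (r"\bADD_MONTHS\(\s*(\w+)\s*,\s*(-?\d+)\s*\)",
     r"DATE_ADD(\1, INTERVAL \2 MONTH)"),
    (r"\bZEROIFNULL\(\s*(\w+)\s*\)", r"IFNULL(\1, 0)"),
]

def translate(teradata_sql: str) -> str:
    """Apply each rewrite rule in turn to produce BigQuery-flavored SQL."""
    bq_sql = teradata_sql
    for pattern, replacement in RULES:
        bq_sql = re.sub(pattern, replacement, bq_sql, flags=re.IGNORECASE)
    return bq_sql
```

For example, `translate("SEL ZEROIFNULL(amt) FROM t")` yields `SELECT IFNULL(amt, 0) FROM t`. The 85 to 90% coverage mentioned in the conversation comes from tooling far more thorough than this, but the shape of the task is the same: mechanical rewrites handle most statements, and the remainder needs a human.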
[00:27:46] Unknown:
And in your experience, would you say that BigQuery is sort of the biggest center of gravity in the data products that you're offering, but also in terms of what pulls people into using Google Cloud for their data workloads?
[00:28:01] Unknown:
So, 2 separate questions. Center of gravity and what pulls people. So, center of gravity. BigQuery is the center of gravity for SQL workloads. But SQL is not the only type of workload there is. The center of gravity for data processing workloads is Dataflow. Right? So that's the thing. If you wanna move and manipulate data on the fly, you tend to use Dataflow. If you're manipulating SQL based workloads, you tend to use BigQuery. And if you need transactional support and real time consistency, you use Spanner. And what attracts people to Google Cloud, it really depends. Right? So some of the products are beloved products, if you will. Right? So products that people just fall in love with. 1 of them is Cloud Run.
Cloud Run is basically a completely auto-scaled way to take your container and run it. Any kind of containerized workload, just completely fully managed, auto-scaled, and run. That's Cloud Run. Everybody who's used Cloud Run just absolutely loves it. The second, like, very beloved product on Google Cloud is BigQuery. BigQuery is easy to get started with. It is super powerful. Now, SQL is 1 of those things that's been around for 50 years and it's well understood. And it's extremely diverse in terms of the set of use cases that it serves. So that's the second thing that people absolutely love. Right? So, Cloud Run, BigQuery.
The third thing that people love is Spanner. Right? But Spanner is a much more niche use case. But if you have a problem that Spanner solves, there's nothing like it. There's nothing like Spanner. So that's the third very beloved product, one that people just completely fall in love with. And the fourth one that people completely fall in love with is AI Platform Notebooks. Right? So this idea of a fully managed notebook experience that is completely separate from your compute environment, it always blows people's minds when you basically start a notebook and you say, let me add a GPU to this notebook.
Normally we think of: I have my hardware, and I install software on my hardware. This changes the paradigm completely. You have your notebook, and you attach pieces of compute to that notebook, different pieces of compute at different times depending on what you want. If you wanna run BigQuery from that notebook, you send the query off to BigQuery. If you wanna deploy an ML model for prediction, it's a serverless thing. You do it from the notebook. So the notebook becomes this very lightweight interface
[00:31:01] Unknown:
to the rest of the AI and data platform. And that's another thing that people absolutely love. Yeah. I haven't played around with the notebook piece of it yet, but I do know that with BigQuery, 1 of the other things that can draw people into it, even if it's not something that they're specifically looking for, is the fact that a number of useful and interesting datasets have been published onto BigQuery. I'm thinking, from my own experience, of things like the GitHub repositories that are available for people to do analytics across the entire code base of GitHub, or PyPI package download statistics, and things like that. People are able to take those published datasets and run their own analyses on them, but the people who are publishing the dataset don't have to bear the cost of those queries. And so I think that that's another interesting
[00:31:46] Unknown:
sort of innovative aspect of what BigQuery offers. Absolutely. Absolutely. You touched on a few key points. Number 1, all of the commits in GitHub, all of the questions and answers in Stack Overflow, these are all tremendously large datasets, and people are able to publish them. And then you can just come in and query them without having to set up a cluster, without having to do anything beforehand. It's a dynamic query. And to your final point, the people who publish the data don't have to bear the cost of the people querying the data. That's because of that complete separation of compute and storage, which is incredible, and it has a lot of business implications as well. You're able to share data with your suppliers.
Your suppliers are able to share data with you. These can be 2 different organizations. You can break down silos within your company. So there's a lot of benefits to that kind of mechanism.
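That separation of storage from compute is what makes the billing split possible: the publisher pays to store the data, and each querier pays for their own scans. As a loose illustration of the mechanics (this is a toy model, not the BigQuery API; every class, price, and project name below is made up):

```python
# Toy model of BigQuery-style separation of storage and compute:
# the dataset publisher pays for storage, while each querying
# project is billed only for the bytes its own queries scan.

class SharedDataset:
    def __init__(self, owner, rows):
        self.owner = owner          # project that pays for storage
        self.rows = rows            # the published data

class Billing:
    def __init__(self):
        self.charges = {}           # project -> accumulated cost

    def charge(self, project, amount):
        self.charges[project] = self.charges.get(project, 0.0) + amount

def run_query(dataset, querying_project, predicate, billing,
              price_per_row=0.001):
    """Scan the shared dataset; compute cost goes to the querier."""
    result = [r for r in dataset.rows if predicate(r)]
    billing.charge(querying_project, len(dataset.rows) * price_per_row)
    return result

billing = Billing()
commits = SharedDataset("publisher-project",
                        [{"repo": "a", "lines": 10},
                         {"repo": "b", "lines": 200}])
big = run_query(commits, "analyst-project",
                lambda r: r["lines"] > 100, billing)
print(big)              # rows the analyst selected
print(billing.charges)  # only 'analyst-project' is billed for the scan
```

Note that the publisher's project never appears in the query charges, which is exactly why public datasets are viable to host.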
[00:32:41] Unknown:
In terms of your experience of working with the Google Cloud Platform and working with customers who are onboarding and building out these proofs of concept, where you're helping to educate them on the best practices for how to use these different systems, I'm curious how some of those experiences have fed back into the product suite that Google offers in terms of new capabilities, new interfaces, or just user experience improvements.
[00:33:06] Unknown:
Yeah. Absolutely. And, again, this is not just me. We have teams of folks who basically work with our customers, and we learn. Right? As I mentioned, we have a very full-fledged Teradata-to-BigQuery migration playbook. We know all of the stuff that we can do. You can be sure that was not something that you can dream up in an afternoon. It is something where, each time you do it, you learn and you add to it. You basically make it better and nicer for the next person. An example of a product that we built based on customer engagements is Recommendations AI. The way it works is you bring your product catalog and you bring your user transactions, what have people bought in the past? Bringing those 2 datasets in, we will create a very high-quality, leading-edge recommendations model. Customers who have used it have typically seen improvements of 15 to 20% in terms of lift. Right? So a state-of-the-art, high-quality model. And that was an example of something where we had our professional services team work with a customer and help them build a recommendations model. And then another customer said, hey, help us build a recommendations model too, and we went ahead and helped them do that. Right? But the basic concept: you need a product catalog, and you need user behavior data. Once you have those 2 things, you can build your recommendations model. That means we could kind of abstract out the details of how people store their product catalog, how people store all of their transactions and interactions on their websites, etcetera, and say, here's a schema. If you can give us data in this schema, we can build your recommendations model for you, we can help you integrate it, and we can help you serve it out. Right? And that basically became our Recommendations AI solution.
So we basically have that arc: as we work with customers, we can see what problems they're solving and make solutions out of it. But the converse is also true. Sometimes, as we work with customers on problems that they run into, we say, well, this is dumb. This should not be that hard. And we end up adding that capability into our system. A great example of this, and this is something that we're getting much better at, is that people would create BigQuery tables, and you had no way to rename a table. Why would you ever want to rename a table in production? Well, it turns out that there were customers who kept needing this, and they would have to build workarounds: I'm going to keep writing to my real-time table, then take the last 15 days of data and move it out into a new table, and now I have to rename that table with an old date stamp. Right? So those are the kinds of things that people wanted to do, and we did not make it easy. There was no easy way to rename a table.
Just very recently, just over the last week, we added a SQL statement to rename a table. Right? So those are the kinds of things that we see. There's customer friction, a customer pain point. Let's basically, you know, make it easy. Let's solve it. Those are things where we get feature requests, we get customers who get stuck, customers who build workarounds. Again, renaming a table is not hard. You can do it, but it shouldn't be that painful. And that's why we built it into the system.
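The rotation workaround described here, and how little of it survives once a rename statement exists, can be sketched with an in-memory stand-in for a dataset. This is a hypothetical simulation, not BigQuery client code; in BigQuery itself the new DDL is along the lines of `ALTER TABLE mydataset.mytable RENAME TO new_name`.

```python
# Toy simulation of the table-rename workaround: without a RENAME
# operation, "renaming" means copying every row into a new table and
# dropping the old one; with RENAME it is a single metadata change.
# The "dataset" here is just a dict of table name -> rows.

dataset = {"realtime_events": [{"id": 1}, {"id": 2}]}

def rename_by_copy(dataset, old, new):
    """The old, painful path: rewrite all the data under a new name."""
    dataset[new] = list(dataset[old])   # full copy of every row
    del dataset[old]

def rename_in_place(dataset, old, new):
    """What ALTER TABLE ... RENAME TO amounts to: metadata only."""
    dataset[new] = dataset.pop(old)

# Rotate the live table out under a date stamp, then start fresh.
rename_by_copy(dataset, "realtime_events", "events_20210301")
dataset["realtime_events"] = []          # new live table
# With real RENAME support, the same move is 1 cheap operation.
rename_in_place(dataset, "events_20210301", "archived_events")
print(sorted(dataset))  # ['archived_events', 'realtime_events']
```

The difference matters at scale: the copy path rewrites the whole table, while the in-place path touches only its name.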
[00:36:56] Unknown:
In terms of the architectural patterns of systems design, the data integration flows, and any software patterns that go along with that, I'm curious if there are any approaches that you see as being unique to the Google Cloud Platform suite for data and analytics that you don't see in externally available systems, whether that's open source products, other cloud providers, or on-premises systems?
[00:37:23] Unknown:
Absolutely. I think 1 very common pattern that you see on Google Cloud is a trifecta: Pub/Sub, Dataflow, BigQuery. It really works on Google Cloud because Pub/Sub is a message queue that is global. You can publish into Pub/Sub from anywhere in the world, and it is 1 single Pub/Sub, right, in which you have multiple topics and subscriptions, etcetera, but a single Pub/Sub. And then Dataflow, which is basically a single system to process both batch and stream, which means you can do replays, you can do historical data, you can do real-time data. And BigQuery, which is also global and serverless. So now you have 3 serverless products, Pub/Sub, Dataflow, BigQuery, that essentially allow you to ingest data, process it, and make it available for analytics.
So that's a really common thread that you see on Google Cloud. It's a very, very simple architecture that just nails the most common sets of use cases, whether it is IoT, whether it is web traffic, whether it is connected devices, whether it is log analytics. It is just these 3 products. Right? They just nail it. They work on batch, they work on stream, they work with structured data and semi-structured data, and you can do machine learning in BigQuery off of it. That architecture is so powerful and so simple, but it doesn't exist in other places because there is no global message queue. Instead, you basically have to build your message queue in every location.
There is no serverless data processing; you need to build separate batch and stream code paths. There is no serverless data warehouse that scales up and down depending on the traffic, depending on the user query, and depending on the data sizes. It doesn't support streaming SQL. It doesn't support machine learning. All of this is, I think, 1 of the things that continues to surprise me. Many years on, we've shown how it can be done, but I think it speaks to the challenge involved in implementing that simple architecture that it doesn't actually exist elsewhere.
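At its core, the Pub/Sub, Dataflow, BigQuery trifecta is an ingest/transform/sink pipeline. The sketch below imitates that shape with plain Python generators and no GCP libraries at all; in a real deployment the queue would be a Pub/Sub topic, the transform an Apache Beam pipeline running on Dataflow, and the sink a BigQuery table. The message format and validation rule are invented for illustration.

```python
import json

# In-memory stand-ins for the 3 stages: a message queue ("Pub/Sub"),
# a streaming transform ("Dataflow"), and an analytics sink ("BigQuery").

queue = [
    json.dumps({"device": "sensor-1", "temp_c": 21.5}),
    json.dumps({"device": "sensor-2", "temp_c": 99.9}),
    json.dumps({"device": "sensor-1", "temp_c": 22.1}),
]

def ingest(messages):
    """'Pub/Sub': yield raw messages in publish order."""
    yield from messages

def transform(records):
    """'Dataflow': parse, validate, and enrich each element."""
    for raw in records:
        row = json.loads(raw)
        if row["temp_c"] < 60:           # drop implausible readings
            row["temp_f"] = round(row["temp_c"] * 9 / 5 + 32, 1)
            yield row

table = list(transform(ingest(queue)))   # 'BigQuery': the sink table
print(len(table))   # 2 rows survive validation
```

Because each stage only consumes an iterator, the same transform works whether the source is a bounded batch (a historical replay) or an unbounded stream, which is the batch/stream unification point made above.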
[00:39:57] Unknown:
And in your experience of building your own projects on top of Google Cloud and working with your customers to help enable them to build their systems, what are some of the interesting or innovative or unexpected ways that you've seen the product suite used?
[00:40:11] Unknown:
Our customers do amazing things. So let me try to pick a couple of public things. Spotify, for example. Their royalty calculation is done on Google. Just think about what it means to calculate the royalty involved for a specific artist. You have to know what song is playing on every user's device everywhere in the world, and for how long it was played, in order to calculate the royalty. I mean, it's a mind-blowingly challenging use case, and it's completely doable. In fact, it's done with that same simple architecture I talked to you about: Pub/Sub, Dataflow, BigQuery, notebooks.
That's it. It's amazing that Spotify is able to do something that complex on that simple of an architecture. It speaks to the power of the platform.
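Spotify's real royalty system is of course far more involved, but the core aggregation, total qualified listening time per artist turned into a share of a royalty pool, fits in a few lines. The events, threshold, and pool size below are entirely made up for illustration.

```python
from collections import defaultdict

# Toy royalty calculation: sum qualified play time per artist, then
# split a fixed royalty pool proportionally. Real systems add fraud
# filtering, per-market rates, rights splits, and much more.

plays = [
    {"artist": "A", "seconds": 200},
    {"artist": "B", "seconds": 100},
    {"artist": "A", "seconds": 100},
    {"artist": "B", "seconds": 20},   # too short to qualify
]

MIN_QUALIFYING_SECONDS = 30           # made-up qualification rule
POOL = 100.0                          # made-up royalty pool

listened = defaultdict(int)
for p in plays:
    if p["seconds"] >= MIN_QUALIFYING_SECONDS:
        listened[p["artist"]] += p["seconds"]

total = sum(listened.values())
royalties = {artist: POOL * secs / total
             for artist, secs in listened.items()}
print(royalties)   # {'A': 75.0, 'B': 25.0}
```

The hard part in production isn't this arithmetic; it's collecting and deduplicating the play events from every device worldwide, which is exactly what the ingest-and-process layers handle.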
[00:41:15] Unknown:
And in your own experience of working at Google and with Google products, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:23] Unknown:
I think the biggest lesson that we've learned is that a radically simple architecture can be off-putting and can be different and new, especially if you're dealing with a lot of complexity on premises. Right? So people often don't believe us when we say it can be this simple. They try to replicate the exact structures that they're using on prem, they try to bring them to the cloud, and they say, well, do you have a way to do x, y, and z? And our answer is, why do you need to do x, y, and z? It turns out that it's not even needed anymore. But those conversations can be a little hard when you're coming in, when you realize that you don't really need to have authentication built into every database product.
You can have 1 single cloud authentication that basically gives you access to all your products. If you need to guard against data exfiltration, we basically say use a Virtual Private Cloud and VPC Service Controls. And that's a new concept, completely unlike what someone may have been familiar with on prem, and trying to make design decisions while you're learning about these different ways of looking at a problem can be pretty challenging. So that's 1 of the things that we've learned: to step back and ask what the customer is trying to achieve rather than trying to replicate what they had on premises.
[00:43:09] Unknown:
In your experience of working with customers and helping them realize the best solutions to their problems, are there any cases that you can think of where the Google Cloud Platform, or a specific product that they were intent on using, was the wrong choice for the problem that they were trying to solve? In terms of the wrong choice,
[00:43:31] Unknown:
we've had people try to do ML inference on GPUs, for example, because, hey, GPUs are the most performant. And they would find that their ML models were not sped up by the GPU. That, it turns out, is because GPUs are really good at dense models, at speeding up matrix manipulations and so on. They're not as good if you're serving out recommendations, if you're serving out sparse models. Right? So sometimes people end up choosing an architecture because they go look at performance statistics and say, what is the best hardware to run my ML inference? The best hardware to run ML inference is the GPU, but the GPU is not the best place to run every single ML model that you have. There are some ML models that you should be running on a CPU and not on a GPU. Right?
So sometimes people make choices of products and technologies based on the overall aggregate, and the details sometimes matter. For example, where is the best place to do your analytics? It's absolutely BigQuery. But if your throughput is extremely high and your latency requirements are extremely low, then BigQuery is not a good choice. You should be using Bigtable instead. Right? So sometimes you make the product choice without realizing the corner cases, and the corner cases sometimes turn out to be important. And that's when you have to go back and change your design and use a different product, rather than try to soldier on and get better latency out of BigQuery. It is sometimes much better to say, okay, not BigQuery for this problem, but Bigtable for this problem.
[00:45:30] Unknown:
As you continue to work with customers and work with the platform and observe the ongoing trends in the industry, what are some of the new capabilities or new services or emerging approaches and technologies that you're personally excited for?
[00:45:45] Unknown:
Super excited by this new product we have called Datastream. 1 of the very hard things that people used to do was change data capture from an operational system to an analytics system, and you would have to go build your own. Datastream is basically a fully managed, serverless way of mirroring your transactional database into your analytics database. So all changes that happen to your Oracle system show up in BigQuery automatically. Right? Fully autoscaled. It just happens. So that is a radical simplification of many people's lives.
So I'm super excited about it. Another thing that is fundamentally very exciting to me is this set of solutions that we call Document AI. The world is full of paper processes: invoices, bills, W-2 forms, etcetera. They're part of a lot of business processes, and so we end up accepting a lot of error in data entry and a lot of labor in digitizing those things in order to use them. Where Document AI comes in is being able to apply machine learning to understand these unstructured datasets. We've had people use Document AI for mortgage processing, for procurement, and so on. And that, to me, is a revolution that is coming to a lot of back offices. AI has now gotten good enough that much of the drudgery that's involved can be handled by AI.
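The change-data-capture mirroring that Datastream automates boils down to replaying an ordered change log from the transactional source against an analytics replica. As a minimal, hand-rolled sketch of just that mechanic (no Datastream, Oracle, or BigQuery APIs; the log format is invented):

```python
# Toy change-data-capture applier: replay INSERT/UPDATE/DELETE events
# from a transactional source against an analytics replica, keyed by
# primary key. Datastream does this continuously, at scale, and fully
# managed; this shows only the core idea.

def apply_changes(replica, change_log):
    for change in change_log:
        op, key = change["op"], change["pk"]
        if op in ("INSERT", "UPDATE"):
            replica[key] = change["row"]     # upsert the latest image
        elif op == "DELETE":
            replica.pop(key, None)           # tolerate already-gone rows
    return replica

replica = {}
log = [
    {"op": "INSERT", "pk": 1, "row": {"name": "widget", "qty": 5}},
    {"op": "INSERT", "pk": 2, "row": {"name": "gadget", "qty": 1}},
    {"op": "UPDATE", "pk": 1, "row": {"name": "widget", "qty": 4}},
    {"op": "DELETE", "pk": 2},
]
apply_changes(replica, log)
print(replica)   # {1: {'name': 'widget', 'qty': 4}}
```

The hard production problems, reading the source's redo log, preserving ordering, and handling backfill of pre-existing rows, are precisely what a managed service takes off your hands.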
[00:47:31] Unknown:
And are there any other aspects of your work on the Google Cloud Platform or the suite of services that you offer and that you've helped customers onboard onto that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:44] Unknown:
Actually, 1 thing we haven't talked about is COVID. We're living through this pandemic, and 1 of the coolest things about working at Google for me in the last year and a half has been how many companies and organizations we've been able to touch and improve, basically by being the IT department for the world. A few examples, maybe, to close things out, because I think we all love to ask, where are we having an impact beyond just the technology? Right? So an example: when COVID first started, unemployment basically rose dramatically.
And the labor departments of many states couldn't keep up with the hundreds of thousands of applications that they were dealing with. I was talking to you about Document AI. That was 1 of the solutions that we put in place to help a lot of state departments of labor triage unemployment applications. Okay? So 1 of the neat things is that, you know, I feel proud that we, as Google, were able to help improve the lives of so many people whose unemployment claims might have been processed 30 or 40 days too late. Instead, we were able to process them on time, basically by applying cloud technologies and AI to the problem. Another example, again dealing with COVID, is that a lot of grocery chains suddenly had to grow up in a matter of weeks.
Right? So what people thought was gonna happen by 2027 happened in a matter of 8 weeks in 2020. Right? The number of people shopping online, the number of people wanting curbside delivery. Take curbside delivery. We had a grocery chain that needed to implement curbside delivery pronto. What does it involve to do curbside delivery? Well, you need to make sure that your inventory system in the store is perfectly up to date, which means that before you can do curbside delivery, you need to have a real-time inventory system. So we helped them build a real-time inventory system in 6 weeks.
This is the kind of project that would have normally taken 3 years. We just did it in a matter of weeks. And that's again a testament to the number of people and organizations that we were able to help. I feel very fortunate and very proud to have been part of Google and part of Google Cloud, where we've been able to help a lot of organizations come through COVID. For example, Google Meet: we provided it free for education, and usage of Google Meet went up double-digit times, right? Dramatic growth in usage. Even my kids' schools use Meet, and that's how their schools have run over the last year. So that's another example of cloud and cloud technologies coming to the rescue of how we as a society have been able to deal with the pandemic.
[00:51:07] Unknown:
And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:51:16] Unknown:
Probably the biggest gap that exists is in terms of what we ask our users to do for data governance. The data governance story today is not yet fully integrated. It is not yet as easy and as seamless as it ought to be. So as technologists, as builders of technology tools, the biggest gap that we need to address is that we need to make data governance easier, easier to understand, easier to implement, easier to monitor, and easier to secure.
[00:51:56] Unknown:
Absolutely agreed on that point. Well, thank you very much for taking the time today to join me and share the work that you've been doing and your experience of helping people onboard to Google Cloud, and for helping us explore the capabilities that it provides. It's definitely a very interesting and powerful suite of services. So I appreciate you taking the time today to join me, and all of the time and effort you've put into helping make it a more usable and more useful platform. So thank you for all of that, and I hope you enjoy the rest of your day. Thank you very much. It was a lot of fun. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Overview of Atlan
Interview with Lak Lakshmanan: Introduction and Background
Google Cloud Platform: Tools and Products Overview
Customer Motivations and Use Cases for Google Cloud
Challenges and Best Practices for Onboarding to Google Cloud
BigQuery and Its Role in Google Cloud
Customer Feedback and Product Development
Unique Architectural Patterns in Google Cloud
Innovative Customer Use Cases
Impact of COVID-19 and Google Cloud's Role
Future of Data Management and Closing Remarks