Summary
The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
- Your host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Hightouch and your motivation for creating it?
- What are the main points of friction for teams who are trying to make use of customer data?
- Where is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel, Amplitude, etc.?
- Who are the target users of Hightouch?
- How has that influenced the design of the platform?
- What are the baseline attributes necessary for Hightouch to populate downstream systems?
- What are the data modeling considerations that users need to be aware of when sending data to other platforms?
- Can you describe how Hightouch is architected?
- How has the design of the platform evolved since you first began working on it?
- What goals or assumptions did you have when you first began building Hightouch that have been modified or invalidated once you began working with customers?
- Can you talk through the workflow of using Hightouch to propagate data to other platforms?
- How do you keep data up to date between the warehouse and downstream systems?
- What are the upstream systems that users need to have in place to make Hightouch a viable and effective tool?
- What are the benefits of using the data warehouse as the source of truth for downstream services?
- What are the trends in data warehousing that you are keeping a close eye on?
- What are you most excited for?
- Are there any that you find worrisome?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Hightouch used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Hightouch?
- When is Hightouch the wrong choice?
- What do you have planned for the future of the platform?
Contact Info
- @tejasmanohar on Twitter
- tejasmanohar on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting, and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Your host is Tobias Macey, and today I'm interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools. So, Tejas, can you start by introducing yourself?
[00:02:06] Tejas:
Hey, I'm Tejas. I'm one of the co-founders of Hightouch. As Tobias said, we help companies sync data from their data warehouses into sales and marketing tools, specifically customer data. And we do this with a super simple interface, which is just SQL. Essentially, you paste a SQL query into Hightouch, and you tell us how this data should appear in the downstream systems that your business teams use, like Salesforce or even Facebook Ads. And Hightouch handles all the data plumbing work of getting it into those tools so you can just focus on the data.
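As a rough illustration of the model Tejas describes, a sync boils down to a query plus a mapping. Here is a minimal hypothetical sketch in Python; the config keys, object names, and field names are assumptions for illustration, not Hightouch's actual configuration format:

```python
# Hypothetical sketch of a query-plus-mapping sync definition; none of
# these keys or names come from Hightouch's real API.
sync_config = {
    # the SQL model: any query against your warehouse defines the data
    "model_sql": "SELECT id, email, plan FROM analytics.active_users",
    "destination": "salesforce",
    "object": "Contact",          # which downstream object to populate
    "primary_key": "id",          # used to match rows to existing records
    "field_mapping": {            # warehouse column -> destination field
        "email": "Email",
        "plan": "Plan__c",        # a custom Salesforce field
    },
    "schedule": "*/15 * * * *",   # run the sync every 15 minutes
}
```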
[00:02:37] Tobias:
And do you remember how you first got involved in data management?
[00:02:40] Tejas:
Prior to founding Hightouch, I actually worked at Segment for quite a while. Segment provides an API for companies to track all their customer data in a unified way through a central API and a bunch of libraries for different clients, like websites, backends, mobile apps, etcetera. And that's how I got into the whole customer data, data engineering, and also MarTech space. So I joined there as an early engineer, about the 8th or 10th engineer at the company, back in 2016, and then worked there until mid-2019, when I actually left. And recently, the company was acquired by Twilio.
[00:03:15] Tobias:
And so you gave a bit of an overview of what it is that you've built at Hightouch. Can you give a bit more background about your motivation for turning that into a business and some of the story of getting it off the ground?
[00:03:26] Tejas:
At Hightouch, basically, a trend that I noticed while I was at Segment was that the one place where companies are centralizing all their customer data is not a SaaS tool like Segment or a particular vendor, but actually their own data warehouse, especially with all the recent advancements in the data warehousing space, with new data warehouses coming out like BigQuery and Snowflake. Companies are centralizing all their customer data into the warehouse because of these new data warehouses, which allow you to own your data and access it very quickly, but also because of the whole wider ecosystem around the data warehouses. There's plenty of new BI tools on the market, there's tons of ETL providers that help you get data from various SaaS tools into your data warehouse, like Segment and Fivetran, and every single player in the whole marketing and analytics space is basically centralizing around the data warehouse as the one place to export data. So what I found is that all types of companies, B2C, B2B, even newspapers, were using the data warehouse to centralize all their customer data, but they were all using it for analytical and BI and reporting use cases.
What we're trying to do at Hightouch is say that, now that you have all your customer data inside of your actual data warehouse, why stop at using it to answer questions? You should actually use it to power operational use cases, both in the day to day of your business teams, whether that's sales or marketing or support, and also things like in-app personalization in your actual product experiences. So that's our goal at Hightouch: to basically shift the data warehouse from something that's seen as just a repository of customer data for analytical use cases to something that's seen as a repository of customer data for operational use cases. And I see this as such a big opportunity because at every company we talk to, there are so many scripts being created on an ad hoc basis to just get data from one system to another, all these point-to-point integrations, like get data from Salesforce to HubSpot or HubSpot to Marketo, all these random point-to-point integrations that I think could be better served if they're built directly on top of the data warehouse, where you have the power of SQL to manipulate this data freely in a way that's accessible to many more people than engineers. And all your customer data is already in the central place, the data warehouse, anyway.
So at Hightouch, our goal is to eliminate workflows where business teams are uploading CSVs that are manually exported from BI tools on a daily basis, and where engineers are having to write scripts to just get data from one system to another, and replace them with a simple interface that allows you to take the results of any SQL query and automatically stream it into systems like Salesforce or HubSpot or Facebook Ads or 30-plus other tools in a way that's super flexible.
[00:06:05] Tobias:
For business users who are relying on some of these different SaaS platforms for tracking various aspects of the customer life cycle, or understanding the needs and desires of their customers based on the customer acquisition that they're doing, whether it's from entries in Salesforce for people who are entering based on customer engagement, or people who are streaming in click tracking data from things like Segment or Google Analytics, what are some of the main points of friction that those users and teams are facing when they try to actually make use of the customer data that they're collecting through these various avenues?
[00:06:45] Tejas:
Yeah, it's a great question. So one of the most common use cases that we see Hightouch solve, in terms of business teams that get benefit from Hightouch, is in the marketing department. While the user of Hightouch is a technical user, someone who can write SQL, connect to the data warehouse, and actually figure out what data should go to what business system, the person who ultimately receives the benefit from Hightouch is generally someone on a business team. We find marketing is the largest business team that's affected. The way that marketing is affected is, you can imagine that I'm a marketer at a B2C company or even a B2B company, and I'm sitting in my end marketing tool that allows me to send emails to certain groups of customers and manage all the content for those emails.
Think something like Iterable or Salesforce Marketing Cloud or Marketo. And when I'm in this tool, I want to actually pick users to be targeted based on certain criteria about them. So as an example, if I'm at an e-commerce company, I might want to target users who added an item to their cart that was worth over $1,000, but didn't check out that item. This is a pretty common campaign. But sometimes marketers don't actually have the data inside of their end tool, whether it's the item's price, that step in the funnel, the events the user took, or the user's name, to actually build the segment using that tool's segmentation builder to target certain users within their email platform.
So they have two options then. One, they can go to an analyst and ask them to pull a list of users that should be targeted in this campaign as a CSV, and then go upload it to these tools. You can't imagine how many times we actually see this happening, even at companies with pretty sophisticated data stacks. The second option they have is to go ask an engineer to modify the tracking code that's getting data into these tools, whether it's a script that gets data into Marketo or a call from the front end of their website that gets data into Marketo.
And, basically, what Hightouch allows is for anyone who has access to that SQL query, or access to someone who can write that SQL query that can pull a one-off list of users, to essentially automate or operationalize that SQL query. So the marketer can ask the analyst not just to give them a one-time CSV that they have to keep refreshing in their tool. Instead, they can ask them to paste that SQL query inside of the Hightouch UI and say how that data should show up in downstream systems like Marketo. Hightouch handles all the regular syncing of the data. So, really, what we replace is marketers, salespeople, and support people having to ask engineers for one-off scripts to get data into their business systems, or analysts for one-off CSVs.
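To make the abandoned-cart example concrete, here is a minimal sketch of the kind of query an analyst might paste in, written against a hypothetical e-commerce schema (the table and column names are assumptions, not from the episode):

```python
# Hypothetical schema: users, cart_additions, and checkouts tables.
ABANDONED_CART_SQL = """
SELECT u.email,
       u.first_name,
       c.item_name,
       c.item_price
FROM users u
JOIN cart_additions c ON c.user_id = u.id
LEFT JOIN checkouts k ON k.user_id = u.id AND k.item_id = c.item_id
WHERE c.item_price > 1000   -- items worth over $1,000
  AND k.id IS NULL          -- added to cart but never checked out
"""
```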
[00:09:28] Tobias:
And so you mentioned that the primary interface of Hightouch is actually for more technical staff, people who are familiar with SQL, and that the primary teams benefiting from the output of Hightouch are people in things like marketing and possibly sales. So I'm wondering if you can just talk through some of the ways that those different user profiles, and the interaction patterns that you've seen, have influenced the overall design and product prioritization of what you're building at Hightouch.
[00:10:02] Tejas:
Due to the diversity of use cases for customer data around the organization, there are actually different access patterns that we support inside of our tool itself. So just to give you one example, some email marketing tools, like Marketo or Iterable, have pretty sophisticated data platforms within them that allow you to segment users based on various conditions, like Boolean conditions or event funnel conditions, like a user did this and then they did this, or a user did this and then they did not do this. Whereas some other tools, like Facebook Ads, don't have that much sophistication around how you can segment users based on your first-party data inside of their platform.
So what we find is that, for some tools that do have sophisticated data platforms, people use Hightouch to get raw traits about their users into these platforms and then do the segmentation inside of the platform itself, like using Marketo to build Boolean conditions off of these raw traits from the warehouse, like what's the age of the user or what step of the funnel did they leave off at. And then with other platforms like Facebook Ads, which don't have as much data sophistication within their own platform, we actually find that companies are just syncing lists of users that meet certain conditions from the data warehouse into platforms like Facebook Ads. So essentially just automating the upload of a SQL query as a CSV into the system: just a list of users to target, rather than the raw data about them to then manipulate in a tool like Facebook Ads. So there are a lot of different ways that the product can be used, based on how much the actual destination tools support manipulating data within their platforms, and on the use cases.
We like to focus on marketing in particular, not because marketing is the only business team within the company that can leverage customer data, but because we find that it's the business team that leverages the most customer data on a regular basis. So we think it's a pretty interesting and wide-open opportunity. That said, there are use cases across other teams like sales, for example. A lot of B2B SaaS companies have a bottoms-up adoption process where users are signing up individually on their sites, but they also have an enterprise sales process where they're reaching out to accounts that are heavily utilizing the platform and trying to get them on a corporate contract. A lot of these companies also use Hightouch. What they do is sync metrics about how their product is being used by their users into their CRM, like Salesforce, which is traditionally only armed with metrics like when's the last time a salesperson contacted this user. So now, with Hightouch, we basically supercharge their CRM with information about not just how those users are being contacted by the salespeople, but how they're actually using the product. For example, when's the last time they logged in, or how many API calls did they make this month? And with that insight, the salespeople can better target users among the whole customer base to decide which ones to focus on and try to upsell, as well as get visibility into what customers are actually doing right from within their CRM when they're entering a call. So there are a lot of different interesting ways that customer data can be utilized across the organization.
And sometimes what data is actually being synced to a destination depends on the team that's going to consume that data, but sometimes it also depends on the limitations of those downstream tools that you're syncing data to.
[00:13:36] Tobias:
In terms of the actual attributes and records that you're able to propagate to these downstream systems, what are some of the data modeling considerations that users need to factor in, in terms of what attributes are required, which ones are optional, what systems accept which types of records, and the source data that you need to be able to collect to effectively populate these downstream systems with the information that's required?
[00:14:05] Tejas:
Yeah. So more or less, the way it works in Hightouch is that you select all the data that you want to appear in these downstream systems via a simple SQL query. And then, when you're actually telling Hightouch to send that data to a downstream system like Salesforce, that's when you specify how you want it to appear in that downstream system. I'll take Salesforce as an example because it's one of our more popular integrations, both on the B2B and the B2C side, and because it's very robust and horizontal in terms of its use cases. So Salesforce has hundreds of built-in objects, and in addition to that, you can also have custom objects. These objects range from things like leads, to contacts, to accounts, to even custom objects like billing reports for the month that a SaaS company might be using inside of their Salesforce.
And each of these objects is almost like a table in a database. It can have different constraints, required fields, nullable fields, types, etcetera. Hightouch does the heavy lifting of making sure that the data you're picking from the data warehouse matches those constraints, in terms of selecting required fields and selecting fields with types that are compatible between SQL and Salesforce. And it goes very deep in each of these integrations to make sure that the end user doesn't have to be aware of how the integrations work. They can just use the actual UI to make sure that they're handling all these constraints when loading data into them. So we don't actually require your data to be in any particular format inside of the data warehouse, which I think is one of the most powerful pieces of the model we've chosen for Hightouch of building this data integration solution on top of the data warehouse. Instead, if there needs to be any transformation to get data into the right format to end up in a system like Salesforce, then you can actually do that transformation in SQL itself, which is a super powerful language for doing all this transformation.
We find that our customers are about evenly split in terms of how they use the Hightouch product. Half our customers paste extremely complicated SQL queries into the Hightouch product. These are queries that might have 10 joins, a bunch of where clauses, and many functions to transform data, splitting things like full name into first name and last name, as an example. And then the other half of the users just have a select star from a view or table that exists in their data warehouse. Basically, the people in the first camp generally aren't doing much cleanup on the actual data after it arrives in the warehouse. They're just manipulating the data inside of each SQL query as an ad hoc process, whether that's for analysis or for use in Hightouch.
The people in the second camp, who are just doing a select star from a table or view, are generally using software like dbt, the data build tool, which helps you manage SQL views, whether they're materialized as tables or as views inside of your data warehouse, with just SQL. So they're either using software like dbt to basically do post-processing on all the data that's in their warehouse and clean it up into an accessible format, or they're using some sort of Airflow workflows or in-house versions of things like dbt, which we actually see pretty often, before they use Hightouch on top of that data, and then they just have a select star inside of Hightouch. So the nice part about Hightouch is that we make it really seamless to connect data in your data warehouse with all these downstream tools, because we think about all the constraints and issues that might come up when you're actually syncing data into those downstream tools for you, and warn you about them. But at the same time, we don't enforce our data model of what your user data should look like on top of your data warehouse or on top of your actual business systems, because we know that every company is unique. Instead, we let you do all the transformations you want, both in terms of transforming the data to be in a certain format and transforming the data model to be in a certain format, in SQL itself.
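For instance, a cleanup like the full-name split mentioned above can live entirely in the query itself. A minimal sketch against a hypothetical raw_users table:

```python
# SPLIT_PART is available in Postgres, Redshift, and Snowflake; other
# warehouses have equivalent string functions.
CLEANUP_SQL = """
SELECT email,
       SPLIT_PART(full_name, ' ', 1) AS first_name,
       SPLIT_PART(full_name, ' ', 2) AS last_name,
       CAST(signup_date AS DATE)     AS signed_up_on
FROM raw_users
WHERE email IS NOT NULL   -- downstream tools usually require a key field
"""
```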
[00:18:06] Tobias:
And as far as the actual Hightouch platform, can you dig a bit into how it's architected and some of the ways that the system has evolved or changed since you first began working on it?
[00:18:17] Tejas:
So in terms of how the Hightouch platform works, at the simplest level, you can imagine that we have a platform where you can create queries, paste a SQL query in there, connect it to downstream tools like Salesforce, Mixpanel, HubSpot, tools like that, and say how the data should map between the SQL query and the downstream tool. Basically, what we do is run that SQL query against your data warehouse on a schedule that's configurable by you, or it can even be triggered by workloads that are happening in your dbt or in your Airflow.
And then, before sending it to any of these downstream tools, we take a diff on the data. To take this diff, we basically compare it to what we loaded from your data warehouse the previous time, to see what the changes, removals, and additions to your data are before syncing it to any downstream destinations, like Salesforce or Facebook, for example. That helps avoid a lot of concerns around hitting rate limits in these tools, or overwhelming them, or using up all your API credits and making you pay a bunch of money to tools like Salesforce. So Hightouch tries to think about all of this stuff for you that you wouldn't necessarily think about if you were writing a simple Python script that you run on a cron job.
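A toy version of that diffing step, assuming both snapshots fit in memory and rows are keyed by a primary key (Hightouch's real implementation streams the comparison; this is only to show the idea):

```python
def diff_snapshots(previous: dict, current: dict):
    """Both arguments map primary key -> row (a dict of column values)."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: v for k, v in previous.items() if k not in current}
    changed = {
        k: v
        for k, v in current.items()
        if k in previous and previous[k] != v
    }
    # Only added/changed/removed rows are sent downstream, which is what
    # keeps rate limits and API credits under control.
    return added, changed, removed
```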
And in terms of how the architecture has evolved over time, there are a couple of things that we've prioritized. So one of them is actually helping companies own their own data. What I mean by this is, while Hightouch is a SaaS service that you just use by going to app.hightouch.io, we actually have an architecture that allows all the customer data to remain in your own infrastructure. This means all the customer data can remain in your own infrastructure, spread across your own data warehouse and your own Amazon S3, Google, or Azure object storage bucket. And we've really prioritized that architecture because we believe that, with increasing regulation, privacy concerns, and security concerns across all industries in the market, there's a large opportunity to let companies own data in their own infrastructure while still having the ease of using a SaaS service. So that's something that we prioritized pretty early on, but didn't have in the very initial version of Hightouch, because we didn't know how big of an opportunity that was. And that's allowed us to work with a lot of larger clients much earlier than companies in this space usually would have.
And I think the really unique thing to emphasize there is that we allow you to own your data without actually being an on-prem service, in that, by using the data warehouse, the data lives in your environment, but Hightouch isn't built in a way that you have to run the Hightouch platform or the code on your own Amazon EC2 instance or your own server, which has a lot more overhead as a user. So this hybrid architecture is something that we're really proud of. Another thing in terms of how the platform has evolved is the integrations platform itself.
Over time, we've definitely grown the sheer number of integrations, and that's something that we are constantly continuing to do. But with that, we've also increased the robustness of the integration platform itself quite significantly. We have a lot of central abstractions within Hightouch's backend that define how data is queued up before it goes to downstream systems, in addition to central abstractions for things like error reporting to the user, monitoring, and logging. And this is something that we plan to invest more and more in as Hightouch scales, because we want there to basically be no concern that you would have thought of if you were developing your own scripts to get data into systems like Salesforce that Hightouch hasn't already thought of for you.
So I would say the two main places the architecture has evolved so far have been, one, the hybrid architecture, making sure that Hightouch can be run in a way that companies really own their own data and it doesn't have to leave their environment; and two, the integrations platform, making that more and more robust in a way that keeps providing benefit as we build more and more integrations into the platform, because we know we're in the business of building integrations. In terms of the future-looking roadmap, where I think we're going to have a lot of innovation is on the developer experience side of Hightouch. So we're actively working on making Hightouch an amazing developer experience, building deeper integrations with other software in the ecosystem like dbt and Looker. We have some innovation on our own modeling layer that's coming out as well, as well as bringing concepts like CI/CD, testing, and dry-running software, kind of simulating what would happen when you run this workflow against your Salesforce account, to Hightouch. Traditionally, analysts, business users, and non-engineering users don't think about these things in the same way that engineers do. But by giving people who are usually just used to querying data the ability to automate that data by automating a SQL query, there's also a lot of power and a lot of things that can go wrong in your business systems, in that an analyst pasting a SQL query into Hightouch and setting up some settings could now create millions of records in your Salesforce account unexpectedly if Hightouch isn't configured properly. So I think there's a great opportunity looking forward to bring concepts like CI/CD, QA, dry-running software, and sandbox environments into the Hightouch platform, and teach companies a new way of doing this without having to loop in engineers.
[00:23:48] Tobias:
One of the interesting aspects of the integration with some of these other systems, particularly when you were talking about being able to track deleted objects, is the issue of compliance with things like GDPR. I'm wondering if you have the ability to do things like track deleted customers, because they may have requested that their information be removed for GDPR reasons, and then remove those records from the downstream systems to ensure that their information is scrubbed from all the places where it might have been propagated to?
[00:24:24] Tejas:
Yeah, I'm glad that you brought that up. That's actually an area that we don't touch on too much in our marketing content right now, but it's a huge benefit of the Hightouch platform. Essentially, we propagate more than just insertions and updates to the customer data that you're querying in the data warehouse. We also have the ability in our platform to propagate deletions to this customer data, which creates a really interesting setup and vision. If you're thinking about it from the lens of a CIO at a really large company, you have all these problems where you're sending customer data to all these different business systems, but there's no clear place to see the lineage of that customer data, as in, where did this definition come from, or to manage the governance of that customer data, as in, when a GDPR request comes into your business, how do I make sure it's deleted from all the systems?
So there's this interesting vision that we can start to paint with a system like Hightouch operationalizing your data from a central store, which in this case is the data warehouse, where, if you delete data from that central store, or if you mark it as deleted, we can actually propagate that to all the downstream tools, like Salesforce, Marketo, Facebook Ads, etcetera, to make sure that that GDPR deletion request was handled throughout the stack. Deletions are something that we've thought about from day one with Hightouch, and we're really excited about that vision, being able to do both data lineage, in terms of figuring out what this field in Salesforce actually means: well, it's from this view in your data warehouse, which was pushed to Salesforce an hour ago, and that's actually built up out of these other three views that came from this DAG inside of dbt.
So that's one aspect of it, but there's also the governance aspect, which basically says, when we delete a particular attribute or row from the data warehouse due to this GDPR request, that cycles through all of these materialized views, all of these BI reports, but also all of these downstream systems that are traditionally implemented in a fragmented way. So, yeah, that is one of the huge benefits of Hightouch, and we think that's something that we're going to really capitalize on long term as companies move their whole stack to Hightouch. Right now, we go into organizations and let them solve one-off data problems in their stack without moving their whole stack into Hightouch. But once you're using Hightouch throughout your company and all customer data is actually flowing from the warehouse into other business systems, then there's a lot of opportunity to have a central interface where all this data is really governed.
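In terms of the diff described earlier, a GDPR deletion just shows up as a removed row, which can then be fanned out to every destination. A hypothetical sketch (the destination client objects and their delete method are assumptions, not Hightouch's API):

```python
def propagate_deletions(removed_rows: dict, destinations: list):
    """Fan a warehouse deletion out to every downstream tool."""
    for row in removed_rows.values():
        for dest in destinations:          # e.g. Salesforce, Marketo clients
            dest.delete_record(row["id"])  # hypothetical client method
```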
[00:27:00] Tobias:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. In terms of the actual efficiency of being able to keep up to date with all the changes in the data warehouse, I know that for source systems like application databases, along the lines of Postgres or MySQL, there's a lot of movement in the change data capture space. I'm wondering what the availability of interfaces like that is for some of the data warehouses that you're working with, and some of the ways that you're optimizing for keeping up to date with those various change sets. And also, as you mentioned, being able to understand some of the throttling requirements for the downstream systems, and doing things like cost optimization on the data warehouse as well, so that, particularly for things like BigQuery, you don't execute queries on a polling basis, and just reducing some of the cost overhead on that side as well?
[00:28:27] Tejas:
So change data capture, or CDC, is something that we're really excited about. That's something that's becoming more and more common and popular in the data warehousing space. So just for anyone in the audience who doesn't know what that is: some of the data warehouses, like Snowflake, are coming out with features that are basically called change tracking, or change data capture, or streams. In particular, Snowflake calls it table streams. Essentially, what that allows you to do is track changes, whether they're inserts or updates or deletions, or even metadata changes like DDL, on the actual tables that you have in Snowflake, and then operate on that stream to send data to downstream services, create new tables, anything like that. It's kind of like a more robust, non-transactional form of the triggers that you might have in an OLTP database like Postgres or MySQL.
So right now, the way that Hightouch primarily does this sort of diffing of the data is by actually calculating the difference, either in SQL, in your own data warehouse, or on our end via a streaming difference between essentially two files: one coming from the data warehouse, which we're streaming down, and one from the S3 bucket, whether it's owned by Hightouch or by your own company in your own AWS cloud. That being said, CDC is something we're super excited about, to make this process more efficient, and not just more efficient from a computational perspective, but also able to unlock new use cases in the long term that would benefit from lower latency in pushing data out of Hightouch into downstream systems.
So today, there are CDC features in some of these data warehouses. For example, Snowflake has table streams, which can work on an actual table that's been materialized inside of the data warehouse. In addition, they have incrementally computed materialized views. Generally, materialized views created in the data warehouse are just refreshed on a schedule: you can call a command like REFRESH MATERIALIZED VIEW, and it'll rerun the SQL query and save the results as a table. But Snowflake and some of these data warehouses are also coming out with incremental versions of this feature that can process that SQL query incrementally. So we're really excited about both of these features, and streams are something that we're already starting to utilize inside of the Hightouch platform today. Incrementally materialized views are something that some of our customers are utilizing inside of their Snowflake, which we're consuming, but we don't do anything special around that today. That said, a lot of these features as they exist in the data warehouses today come with a lot of quirks or caveats.
For example, table streams can only operate on a table itself that's materialized. A lot of companies using Hightouch query views or paste more complex SQL queries, as I mentioned, into the Hightouch platform, so we wouldn't be able to use table streams in that case. And in addition, incrementally computed materialized views inside of Snowflake don't support certain things like joins in their current state. So if you join, then it actually can't compute that materialized view incrementally anymore, in either BigQuery or Snowflake. But we're really excited about the potential here, because we think there's no theoretical or computer science reason, I guess, that these data warehouses shouldn't also be able to operate as a streaming platform, still using SQL as the interface.
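For reference, Snowflake table streams look roughly like this: you create a stream on a table, then query it to see only the rows that changed since the last read, along with metadata columns describing each change (the table names here are hypothetical):

```python
TABLE_STREAM_SQL = """
-- Record inserts, updates, and deletes on the users table
CREATE OR REPLACE STREAM users_changes ON TABLE analytics.users;

-- Consuming the stream returns only changed rows, plus metadata columns
SELECT *, METADATA$ACTION, METADATA$ISUPDATE
FROM users_changes;
"""
```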
That said, we acknowledge that there are bottlenecks in the progress that these data warehouses have made, and it's going to take a couple of years, we think, until our customers can truly utilize the streaming features of these data warehouses across the board. There are some exciting companies in this space, though. For example, Materialize is a company that we're super excited about. They are working on a streaming data warehouse from the ground up, where streaming SQL was the first thing thought about in the data warehouse, whereas with Snowflake, BigQuery, and Redshift, Amazon, Google, and Snowflake are kind of thinking of streaming as an afterthought to their existing batch-based systems. So we're really super excited about them, and we're super excited for them to have some sort of cloud offering. We have built some experimental connectors into Materialize on our end, but we don't have any customers using them today. We think that when streaming becomes mainstream, there's going to be a lot of potential for more use cases of Hightouch. For example, some companies are using Hightouch to push insights from their data warehouse, like a propensity score, a likelihood-to-buy score, into their production databases today, like Postgres, to then use in their front end or customer experience to actually change the experience that a user is facing. And the problem with this is that those propensity scores can only be computed hourly, or every 30 minutes or so, in batch, in an efficient way. But with streaming SQL, it would be amazing if there were a unified interface where this data could be computed in real time or near real time, in a way that doesn't waste a bunch of compute resources and have to scan the data over and over. Because then things like real-time personalization could be done directly off the data warehouse, orchestrated by something like Hightouch.
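The Materialize model, by contrast, keeps a SQL view incrementally up to date as new data arrives rather than recomputing it in batch. A minimal sketch with hypothetical tables:

```python
STREAMING_VIEW_SQL = """
-- In Materialize, this view is maintained incrementally as events arrive,
-- so reading it is cheap and always near real time.
CREATE MATERIALIZED VIEW purchase_counts AS
SELECT user_id, COUNT(*) AS purchases
FROM events
WHERE event_type = 'purchase'
GROUP BY user_id;
"""
```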
So, in a nutshell, we're super excited about CDC. We're already starting to add some CDC-like features into our platform today, but the data warehouses aren't at a point where we can just rely on them for CDC, so we have to do a lot of diffing ourselves.
[00:34:03] Tobias:
And in terms of the upstream data sources that are necessary for Hightouch to be useful and effective for the people who are trying to use it for the downstream use cases for these different SaaS providers, what are some of the common and core upstream systems that are necessary for populating the data warehouse to allow Hightouch to be used effectively?
[00:34:31] Tejas:
This is a really interesting question, because basically every company in the marketing, analytics, or customer data space is building connectors into the data warehouse today. So it's becoming easier than ever to get data into your data warehouse, and that's why we think the data warehouse is the right place to solve these customer data integration problems. We don't think this was the case 10 years ago, but we do think it is today, because there are so many SaaS services that make it so easy to get data into your data warehouse, whether that's flipping a switch in Google Analytics 360 to get all your data from your Google Analytics platform flowing into BigQuery, or going to Segment and saying, I want my data to show up in Postgres or Redshift, or even going into something like Fivetran and saying, here are the seven SaaS services that my company uses, and I want the data from all of them in the data warehouse. So we think it's become easier than ever to get data into the data warehouse today. But in terms of looking forward to how this data can be made more and more useful, with things like CDC and lower latency requirements, a lot of these systems that get data into the data warehouse today, in terms of SaaS services, are loading data every hour, or every 30 minutes, or at some interval that's not super low latency. And that creates bottlenecks on how you can actually use the data. For example, providing contextual information to customer support representatives: we find that a hard space to get into from Hightouch, because customer support representatives oftentimes don't wanna see data that's 30 minutes old when the customer is doing something in the app right now, and they wanna see that information in the sidebar of their Zendesk. So if they have that data flowing from their data warehouse into Zendesk via Hightouch and it's out of date, then the customer support representative is going to be very confused when they're addressing this customer's ticket, if the customer is talking about an order they just made on the site, but there's no order showing up in the Zendesk sidebar.
So ingestion latency, and the rate at which all these services are pushing data into the data warehouse, is definitely a large issue, and one that isn't just going to need to be fixed inside of the data warehouse platforms themselves, but also one that the whole market is going to have to adapt to, as in, everyone writing into the data warehouses should be aware that you can actually write into them at lower latencies now, and use things like streaming APIs, and have better interfaces to do this sort of stuff. So we think it's something that's definitely gonna happen, but it'll take some time. The other interesting area in terms of upstream services of Hightouch and the data warehouse, helping you get data into the data warehouse, is data modeling.
So an interesting trend that we're seeing these days is that, with the sophistication of these new data warehouses, in particular the separation of compute and storage, it's actually possible to do a lot of data modeling work that used to have to be done outside the data warehouse, as pre-processing, inside the data warehouse, as post-processing. This is the whole shift from the ETL, extract, transform, then load, way of thinking, to the ELT, extract, load, and then transform, way of thinking, which companies like dbt and Fivetran are really advocating for. And we're really excited about this, because it basically means that, now that you can do all this transformation inside the data warehouse, there's no excuse for platforms not having a good way to just get data into the data warehouse.
One of the hardest parts about building connectors into data warehouses previously was deciding on the right way to model that data in the data warehouse. When we were at Segment, we struggled thinking about this a lot, because it's very hard to give analysts data in the data warehouse in a generic format that makes sense for all businesses. In the end, every business is unique. But now that a lot of this transformation can be done within the data warehouse itself, we're seeing providers like Fivetran just dump JSON into Snowflake for some of their newer integrations, since customers can actually manipulate that data as they see fit within Snowflake, using the JSON functions that are available inside of SQL.
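That pattern of dumping raw JSON and reshaping it later looks roughly like this in Snowflake, where a VARIANT column can be traversed with path syntax and cast inline (the table and field names are hypothetical):

```python
JSON_RESHAPE_SQL = """
SELECT raw:id::string           AS user_id,
       raw:traits.email::string AS email,
       raw:traits.plan::string  AS plan
FROM fivetran_raw.users_json    -- raw is a VARIANT column of loaded JSON
"""
```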
So, overall, we've seen this movement to ELT, and the separation of storage and compute inside of the data warehouses, make it way easier to get data into data warehouses, both for vendors and for customers, which we're really excited about, because Hightouch really never wants to focus on data collection. We just wanna help people activate the data that they already have in their warehouse.
[00:39:07] Tobias:
In terms of the actual engineering effort, you have a bit of an N-times-M problem of integration, where you need to be able to support multiple source data warehouses and then a higher number of downstream tools. And all of those systems are going through their own process of evolution, modifying the interfaces that they have available, either to improve the capabilities, or add new attributes, or deprecate old attributes. And I'm wondering how you've structured the overall efforts in order to be able to stay up to date and keep up with the rate of change.
[00:39:41] Tejas:
To be completely honest, I don't think that this is a massive challenge in terms of N times M for Hightouch, the reason being that most of the types of data that people are utilizing and passing to these downstream tools are composed of pretty simple data types. So in Hightouch, once we query a customer's data warehouse, whether it's a Postgres or a Redshift or a BigQuery or a Snowflake, or even some data lake like Presto, we convert it into a central format on our end, and from there, we start thinking about how this data flows to Salesforce, or how it flows to HubSpot, or how it flows to Marketo. We don't really build point-to-point integrations, like, here's how we do Postgres to Salesforce, here's how we do Postgres to Marketo. Instead, we're basically thinking of Postgres, Redshift, BigQuery, etcetera, to a Hightouch data format. You can think of it as like a JSON format that has the type of each field and things like that. And then, from that format, we think about how to get data to the various different business systems.
So that core abstraction is definitely something that's very, very useful, both for us and for our customers, as you might imagine. In the long term, I think that, just like with BI tools, companies don't want offerings like Hightouch, or even dbt or BI tools, to be a part of the data warehouse platform, because they wanna be agnostic between different data warehouse platforms. And when a company moves from one data warehouse platform to another, they want to be able to use the same BI tools and ETL tools, and tools like Hightouch, on top of them. So this central abstraction of data into kind of a generic Hightouch format, before we get it into each destination, is pretty key to both us and our customers.
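A sketch of what such a warehouse-agnostic intermediate record might look like (the shape and the type vocabulary are assumptions for illustration, not Hightouch's actual internal format):

```python
# N source connectors normalize into this shape, and M destination
# connectors consume it, so the work is N + M rather than N * M.
normalized_row = {
    "primary_key": "user_42",
    "fields": {
        "email": {"type": "string", "value": "jane@example.com"},
        "mrr": {"type": "number", "value": 4900},
        "trial_ends": {"type": "timestamp", "value": "2021-04-01T00:00:00Z"},
    },
}
```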
[00:41:23] Tobias:
In terms of the overall data warehouse market, what are some of the trends that you're keeping a close eye on and that you're particularly excited for?
[00:41:32] Tejas:
In general, for the whole data warehouse market, I am particularly excited about streaming. Well, first, I always like to say that I think nightly jobs do solve business problems. A lot of companies, as a first reaction, sometimes think, oh, you're built off the data warehouse, you're going to be slow. That's what the market has kind of told them. But batch processing is getting pretty fast. There are companies like JetBlue using dbt and getting data into Snowflake at frequencies as low as two minutes. So that's pretty fast; batch processing doesn't have to be nightly. Still, they are right to some extent. There are actual bottlenecks when it comes to the latency at which you can ingest data and then transform data through the data warehouses themselves, as well as the wider ecosystem; software like dbt is really meant to run in batch today. So, personally, I'm super excited about the addition of streaming features and incremental computation features to these data warehouse platforms, because I think it unlocks so much potential and removes the biggest reason not to use a data warehouse in data infrastructure projects. At Segment, we built some complicated pipelines that had parallel computation going on in a data warehouse for batch processing and also in stream processing systems, think Apache Flink or Spark Streaming.
And this was very complicated. So I'm really excited for someone to finally solve the problem of streaming SQL and democratize access to this technology. In particular, I'm excited about how Materialize is approaching it from the ground up, and I think that there will definitely be operational use cases for which it makes sense to use Materialize. I'm also excited to see how AWS, Google, and Snowflake position themselves and build solutions to the streaming problem within their platforms, even if it's a 70% solution, or something that's not as good as the niche database. I'm really excited for that technology to be democratized and become just a standard, so that when people think of SQL, they don't just think of streaming, and they don't just think of batch, but they just think of the best way to manipulate data. In terms of other things that I'm excited about in the data warehouse ecosystem, streaming lends its way well into this, but I'm excited for Snowflake in general to offer more and more access patterns on top of the data warehouse.
So Snowflake has already started releasing things like the ability to run Python and Java jobs on top of Snowflake and the data that exists in Snowflake, in addition to the SQL that you can run on Snowflake. With all these different access patterns, whether it's SQL executed in batch, SQL executed in streaming, or Python and Java jobs executed on top of Snowflake, I'm really excited about Snowflake's vision of creating a data cloud, where it's not just a data warehouse technology, not just a stream processing technology, not just a batch code execution engine like Spark, but a repository of customer data that can be operated on with many different access patterns in a way that's scalable. Because I don't think SQL is the end-all. I think it is great for 90% of use cases, and I think it's here to stay, but there are things that are undoubtedly hard to do in SQL.
For example, if you try to implement complex entity resolution logic that needs to recurse, or try to form something like a graph in SQL, you need to have multiple executions, multiple iterations over the data. But if you try to do something like this in Python, or using GraphX and Spark, it can be a lot simpler. The difference is that those systems are not as accessible. You don't have all your customer data in them already; they're hard to run; they're hard to operate. The accessibility of the powerful execution engine that Snowflake has today for SQL, I'm really excited to see that expanded to a bunch of different access patterns, so that data engineers can just operate on data however they'd like, rather than having to use a new, specialized system whenever they hit the constraints of the existing ones.
[00:45:38] Tobias:
And in terms of the ways that you're seeing Hightouch being used, what are some of the most interesting or unexpected or innovative projects that you've seen built with it?
[00:45:46] Tejas:
So I think the most interesting projects built with Hightouch today are actually not the use cases of just syncing data into business tools like Salesforce and Facebook Ads. That solves business problems, but it's not super cool, to be honest. I think what's super cool is people powering customer experiences off Hightouch. So you might think about what a company like Optimizely does, which allows you to have personalized A/B tests, or just personalization in general, off some customer data, like a Boolean flag for a user or a score. Optimizely can basically serve this to your front end through a low-latency API that's globally distributed and available around the world, and then allows your front end to tailor the experience of your site to that user, whether it's personalization, or an A/B test, or an experiment.
But if you think about companies like Optimizely and Dynamic Yield that do this, they're really just a thin layer around your customer data, exposing a different access pattern, which is a low-latency API that's available on the front end of your site, globally, just off of a particular subset of customer data. And really, it would make a lot more sense for services like this to be built on the data warehouse, which has all your customer data, and is where your data scientists and engineers and analysts are pushing new definitions and models about the users. When we talk to data science and data engineering teams, we see a lot of companies where an analyst or data scientist does research on what should drive certain personalized customer experiences in their data warehouse using SQL, but then the product engineering team goes off and reimplements this as code inside of their application API, or inside of the front end, to actually serve those experiences to the customers. We're excited about a world where data from the warehouse can be operationalized not just for business intelligence use cases and for syncing data to SaaS services, but for powering actual customer experiences. And we're already seeing customers do this today, starting to push data from their data warehouse via Hightouch into production systems like Postgres, and then using those production systems to serve personalized experiences with fast single-row lookups inside of their apps. And so that's something we've had customers push us to that we weren't initially thinking about, and that we're super excited about the future of. We think Postgres and MySQL, etcetera, destinations are just a start, and down the road we'll probably be investing in things like a low-latency personalization API that's completely driven off the data in your data warehouse. So just give us a SQL query, and we'll make sure the data gets into this API and is available at low latency from your front end, with some sort of authentication.
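The warehouse-to-Postgres pattern he describes ends with the application doing a fast single-row read at request time. A minimal sketch using psycopg2, with hypothetical table and column names:

```python
import psycopg2

def get_propensity_score(conn, user_id: str) -> float:
    """Single-row lookup against a table kept in sync from the warehouse."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT propensity_score FROM user_scores WHERE user_id = %s",
            (user_id,),
        )
        row = cur.fetchone()
        return row[0] if row else 0.0  # default when no score has synced yet
```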
So we're super excited about the possibility of powering true customer experiences off the data warehouse. We don't think there's a good way to do this today in the market. And we think the way that companies are doing this internally, ad hoc, is really wasteful a lot of the time, because they haven't thought about the potential to use a data warehouse for this purpose.
[00:48:45] Unknown:
In terms of your own experience of building the Hightouch platform and growing its capabilities and customer base, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:48:56] Unknown:
The biggest one is that there are a lot of challenges in scaling an integrations platform. We've acknowledged that we're going to be an integrations company to some extent, and that we're going to need to be really good and diligent about how we interact with the third-party services on the backend. How do we do things like backoffs and retries? How do we QA every possible case of errors that can come back from a destination, and how do we find them before our customers do? Once we find these errors, how do we show them to our users in a way that's understandable? How do we proactively alert them about these errors? How do we warn them before these errors happen by looking at their data? And this comes with a lot of challenges.
There are challenges because you're dealing with third parties: you don't have complete control over them, and you don't have complete visibility into how they work either. All you have are their APIs, which are the public interface to them. In addition, it comes with a lot of challenges in creating the right abstractions internally in your own infrastructure: abstractions that are not too light, so that it isn't easy for someone to make a mistake when building your 300th integration, but also not too heavy, so that it doesn't become hard to build a new integration because destination 200 doesn't work the same way as destination 100. Thankfully, a number of people from our founding team used to work at Segment, including myself, and we've faced these challenges before. We've built abstractions that are too heavy, and we've built abstractions that are too light. And we've seen abstractions that went unbuilt for a long time because of core tech debt. We're figuring out ways to build these as early as possible at Hightouch.
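For a sense of what the backoff-and-retry piece of that looks like, here is a minimal sketch; it is not Hightouch's actual implementation, and the function and exception names are illustrative.

```python
# Minimal retry-with-exponential-backoff sketch for calling a flaky
# third-party destination API. Names here are illustrative only.
import random
import time

class RetryableError(Exception):
    """Transient failure (e.g. rate limit or 5xx) worth retrying."""

def sync_with_retries(send_batch, batch, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return send_batch(batch)
        except RetryableError:
            if attempt == max_attempts:
                raise  # surface the error so it can be shown to the user
            # Exponential backoff with jitter so retries from many syncs
            # don't stampede the destination after an outage.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```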
I think that's one of the biggest technical challenges we have today. Another big one is maintaining that hybrid architecture and figuring out ways that we can work with more and more compliance- and security-sensitive companies. We have a lot of interesting ideas on our roadmap, including proxying the data in an encrypted format so that customer data doesn't even pass through our infrastructure in a visible way. We don't store it today, but we're thinking about a way for it to not even pass through. Very Good Security is a company in the security space that's built some really innovative stuff here, and we're excited to check out their work.
Other things are, how do we connect to companies that have their data warehouse completely blocked off from the public Internet? We've built some stuff here around what we call reverse SSH tunneling. So I think, one, just scaling and maintaining an integrations platform that's better than building scripts in-house is a big challenge, and two, handling very compliance- and security-sensitive companies is a large challenge, both organizationally, in terms of your go-to-market and sales, and technically.
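The reverse tunneling idea can be approximated with OpenSSH's -R remote forwarding: an agent inside the customer's network dials out to a relay, so the warehouse never has to accept inbound connections from the public internet. The hosts and ports below are hypothetical, and this is only a sketch of the general approach, not Hightouch's implementation.

```python
# Sketch of a reverse tunnel using OpenSSH's -R remote forwarding.
# Run from inside the customer's private network; hostnames and ports
# are hypothetical.
import subprocess

subprocess.run([
    "ssh", "-N",                            # no remote command, tunnel only
    "-R", "15432:warehouse.internal:5432",  # relay port -> private warehouse
    "tunnel-user@relay.example.com",
])
# The sync service then connects to port 15432 on the relay, which the
# tunnel forwards to the warehouse inside the private network.
```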
[00:51:36] Unknown:
For people who are looking to be able to integrate their data warehouse into some of these different downstream systems, or just the overall space of data integrations, what are the cases where Hightouch is the wrong choice?
[00:51:49] Unknown:
So off the bat, one of the cases where Hightouch is the wrong choice is when you don't have the data you need in a data warehouse. In that case it's definitely the wrong choice, because we don't help you get data into the data warehouse. If you do have a process for that, say 90% of your data is in the data warehouse and 10% is not, then it's probably the right choice. But if you just don't have a data warehouse at all, it might be the wrong choice. We do think there's a lot of opportunity in helping companies end to end, so they go from zero to solving problems in their Salesforce account by creating a warehouse, getting the data they need into it, and then using Hightouch. With all these services going self-service, pay-as-you-go, sign up online and just set it up, we think there's really an opportunity to build something awesome there. That said, it's not the right choice if you don't have a data warehouse today. Another reason Hightouch might not be the right choice is if you don't have people who can write SQL accessible to you on your team. In the organizations we work with, either the users using Hightouch can write SQL, or they have access to someone who can write or paste a SQL query for them really easily.
But that's not always the state of the world, and there are integration platforms like Zapier, for example, that allow you to build point-to-point integrations between two services without using SQL. So if, every time a ticket gets added to Jira, you want to do something in Salesforce, and you don't have someone who can write SQL, a platform like Zapier can be better, because it's been built for a different audience. We don't really intend to compete for that audience or that user; we intend to focus on the technical user and people with data warehouses. Lastly, I should mention that we do have some features for visually tweaking SQL queries and doing things like marketing segmentation on top of the data warehouse. But even to use those features, you need a data team, or someone nearby who knows SQL, to set them up and connect the data warehouse in the first place. So, two reasons off the bat that you wouldn't want to use Hightouch: one, you don't have the data you need in a warehouse or accessible via SQL; and two, you may have data available in SQL, but no one accessible to you who knows how to query it, in which case it really makes sense to use something like Zapier instead, because those tools have an if-this-then-that kind of workflow built for business users. Another case where it definitely doesn't make sense to use Hightouch is if you need extremely low latency in terms of responding to and ingesting new data. Data warehouses are generally not suited for this today.
That said, I would emphasize that more and more companies are figuring out ways to get data into data warehouses faster and faster, so you may not need a real-time system even if you think you do. We've seen JetBlue getting data into the data warehouse in as little as 2 minutes and using it for things like predicting flight delays, which is super interesting, and it shows the potential of just doing batch processing faster. But if you do have true real-time use cases, or more event-based systems, then the data warehouse might not be the right choice for you either.
[00:54:38] Unknown:
Are there any other aspects of the work that you're doing at Hightouch, or the overall space of customer data and data integrations, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:52] Unknown:
One thing that I think is just an interesting concept: today there's kind of a dichotomy in the market. You can use tools that are built more generically to handle data. Looker offers a visual interface to slice and dice data, but it's pretty generic; it's not built for customer data or for use cases like analyzing user funnels. Or there are tools like Mode Analytics or Periscope or Tableau that let you run SQL on top of your dataset. So basically, you can either use these generic platforms to understand and operate on your customer data, if you want it to be on your own server or your own warehouse, or you can use higher-level tools that operate on their own siloed dataset about your customer information, tools like Mixpanel or Amplitude or Salesforce.
Salesforce is not really seen as a customer data platform as much, but it really is one. So there's this interesting dichotomy where you either get the generic tool on your own data, or the nice UI on someone else's data. I think that's going to be bridged over time. Companies like Indicative are building Amplitude-like functionality directly on top of your own data warehouse, and I think that has a lot of potential. Inside of Hightouch, we have a visual mode that does need to be set up by a SQL-literate user, someone who understands the data models. But you can then expose some of those transformed models that you're creating in your data warehouse to marketers for specific tasks, like audience segmentation before they run an ad campaign.
And I think that visual SQL gets a really bad rep. But if you think about what tools like Amplitude and Mixpanel are doing, they're essentially generating SQL queries on their backend to compute what you're seeing in the UI. I think visual SQL gets a bad rep a lot of the time because visual SQL builders often just look like SQL expression builders: click to build your WHERE clause, click to build your SELECT, click to build your JOIN. But there's a lot of potential for nice visual UIs that help people solve very specific problems, like segmenting users in a marketing use case, directly on top of your own data store. I don't think this dichotomy really has to exist, and I'm super excited for all the companies in that space, including what we're working on at Hightouch. We're not trying to make this appear in some magical way on top of your data warehouse; a technical user still needs to set it up, which I think is inevitable, and you have to embrace that. So, yeah, I'm super excited about where things are going in that space, and I think it's a problem that has been overlooked in a lot of ways.
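To illustrate the point that these visual tools are really SQL generators, here is a toy sketch that turns a marketer's filter selections into a parameterized segmentation query. The model and column names are made up, and this is not how Hightouch implements it.

```python
# Toy illustration of the "visual SQL" point: a segmentation UI is really
# just a SQL generator. Filters picked in a UI become a WHERE clause over a
# model the data team has already defined. Hypothetical names throughout.

OPS = {"equals": "=", "gt": ">", "lt": "<"}

def build_segment_query(model: str, filters: list) -> tuple:
    clauses, params = [], []
    for f in filters:
        # In real code, column names would be validated against the model's
        # schema; only values are bound as parameters by the driver.
        clauses.append(f"{f['column']} {OPS[f['op']]} %s")
        params.append(f["value"])
    where = " AND ".join(clauses) or "TRUE"
    return f"SELECT user_id FROM {model} WHERE {where}", params

sql, params = build_segment_query(
    "analytics.user_traits",  # hypothetical model maintained by the data team
    [{"column": "lifetime_value", "op": "gt", "value": 500},
     {"column": "country", "op": "equals", "value": "US"}],
)
print(sql)     # SELECT user_id FROM analytics.user_traits WHERE ...
print(params)  # [500, 'US']
```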
[00:57:17] Unknown:
For anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:57:32] Unknown:
Outside of the stuff I already mentioned around streaming, I think the biggest gap in the tooling for actual data management today is probably around lineage and metadata: understanding what a given column means in the data warehouse, how it's being used by my teams across different BI functions, and where it's going. Is this column going into a system like Salesforce? What does this column in Salesforce actually mean? I think those questions are still extremely hard to answer. Part of it is a technology problem; a lot of it, in my perspective, is a people problem. I know a lot of people are building different platforms for metadata and lineage and things like that, but I think a lot of it will come down to a people problem until things are super standardized across different companies. In general, it's not a space I hold super strong opinions on, but it's a space that I'm very excited about.
[00:58:18] Unknown:
Well, thank you very much for taking the time today and sharing the work that you're doing at Hightouch and the experiences you've had working so closely with users of data warehouses, making it much easier for them to put their data to multiple uses. It's definitely an interesting problem to solve and an interesting approach to it. I appreciate all the time and effort you've put in, and I hope you enjoy the rest of your day. Sweet. Thank you for having me. Have a nice day. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Tejas Manohar: Introduction to Hightouch
Motivation and Founding Story of Hightouch
Use Cases and Benefits of Hightouch
User Profiles and Product Design
Hightouch Platform Architecture
Future Roadmap and Developer Experience
GDPR Compliance and Data Governance
Change Data Capture and Data Warehouse Optimization
Upstream Data Sources and Data Modeling
Trends in the Data Warehouse Market
Innovative Use Cases of Hightouch
Challenges in Building Hightouch
When Hightouch is Not the Right Choice
Future of Customer Data Platforms
Biggest Gaps in Data Management Tooling
Closing Remarks