Summary
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about ClickHouse, an open source, column-oriented database for fast and scalable OLAP queries
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what ClickHouse is and how you each got involved with it?
- What are the primary use cases that ClickHouse is targeting?
- Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
- Can you describe how ClickHouse is architected?
- Can you talk through the lifecycle of a given record or set of records from when they first get inserted into ClickHouse, through the engine and storage layer, and then the lookup process at query time?
- I noticed that ClickHouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for ClickHouse?
- Aside from directly inserting a record via the client APIs can you talk through the options for loading data into ClickHouse?
- For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to ClickHouse?
- What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to ClickHouse?
- For someone getting started with ClickHouse can you describe how they should be thinking about data modeling?
- Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does ClickHouse fall along that spectrum?
- How is scaling in ClickHouse implemented and what are the edge cases that users should be aware of?
- How is data replication and consistency managed?
- What is involved in deploying and maintaining an installation of ClickHouse?
- I noticed that Altinity is providing a Kubernetes operator for ClickHouse. What are the opportunities and tradeoffs presented by that platform for ClickHouse?
- What are some of the most interesting/unexpected/innovative ways that you have seen ClickHouse used?
- What are some of the most challenging aspects of working on ClickHouse itself, and/or implementing systems on top of it?
- What are the shortcomings of ClickHouse and how do you address them at Altinity?
- When is ClickHouse the wrong choice?
Contact Info
- Robert
- Alexander
- alex-zaitsev on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- ClickHouse
- Altinity
- OLAP
- M204
- Sybase
- MySQL
- Vertica
- Yandex
- Yandex Metrica
- Google Analytics
- SQL
- Greenplum
- InfoBright
- InfiniDB
- MariaDB
- Spark
- SIMD (Single Instruction, Multiple Data)
- Mergesort
- ETL
- Change Data Capture
- MapReduce
- KDB
- OLTP
- Cassandra
- InfluxDB
- Prometheus
- SnowflakeDB
- Hive
- Hadoop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Your host is Tobias Macey. And today, I'm interviewing Robert Hodges and Alexander Zaitsev about ClickHouse, an open source, column-oriented database for fast and scalable OLAP queries. So, Robert, can you start by introducing yourself? Hi. My name is Robert Hodges, and I'm CEO of Altinity. We do commercial
[00:02:12] Unknown:
software and support for ClickHouse.
[00:02:14] Unknown:
And, Alexander, can you introduce yourself?
[00:02:16] Unknown:
Yes. My name is Alexander Zaitsev. I'm CTO of Altinity and, actually, one of the cofounders. We provide commercial ClickHouse support and help everybody get the most out of this excellent database technology.
[00:02:33] Unknown:
And going back to you, Robert, do you remember how you first got involved in the area of data management?
[00:02:38] Unknown:
I do. It was with a database called Model 204 back in 1983. This was a pre-relational system that was really popular with the US government and military, and I loved it. I thought it was the coolest piece of software I'd ever used. And since then, I've worked in and around databases through the present time, including working at Sybase and working on clustering and replication for MySQL, plus many others. So actually, ClickHouse, I think, is database number 20 for me.
[00:03:08] Unknown:
That's impressive. And, Alexander, do you remember how you first got involved in the area of data management?
[00:03:14] Unknown:
Yes. I certainly don't have as much experience as Robert. I started to work with databases in the late nineties, so probably 20 years already, and that was with various kinds of databases. But since 2007 or 2008 I have mostly been working with different analytical technologies, and this is probably the most interesting area, at least for me, in databases. Over the years I worked with a number of cool databases, including Vertica, which, to the best of my knowledge and experience, is the best commercial database for analytics at the moment.
But later on we discovered ClickHouse. Actually, it's three years already since ClickHouse was open sourced, so it's three years since it has been available in public, and we started to work with ClickHouse very closely, looked into it, and eventually migrated a huge system to it. That was a great success for the company I worked for. But I'm probably getting a little bit ahead of the schedule of our talk. During this exercise I realized that ClickHouse is a great technology with great potential, but it still lacks a lot of features and support compared to commercial solutions.
And that's why I founded Altinity, to fill the gap and to make ClickHouse successful and usable for enterprises.
[00:04:48] Unknown:
And so can you dig a bit more now into what ClickHouse itself is? And you mentioned a little bit about how you got involved with that, Alexander, but if you can also share that information, Robert. Yeah. Absolutely. So the way I got involved with it was that Alexander had been bugging me for about 3 years
[00:05:03] Unknown:
about this great database and that I should come look at it. He and I met in 2002 and have been friends for about 17 years, and we have stayed in contact as we've moved across different companies. And I think the real trigger was that, in addition to hearing from Alexander, I had been working at VMware for 4 years after selling a previous company to them. I was working in the business unit that was involved in developing software for VMware partners, and it became clear that ClickHouse just looked like a great fit for a service that these partners could run. And that was really what triggered the thinking of, hey, this looks like something I should sign up for, because this database runs really fast and is very easy to manage. It looked like a really interesting business opportunity. And so I joined the company on January 1st and have been working together with Alexander ever since.
[00:06:02] Unknown:
And so I know that ClickHouse is an open source and column oriented database engine, but I'm wondering what the primary use cases are that ClickHouse is targeting and that it enables.
[00:06:13] Unknown:
Let me answer this one. From the very beginning, ClickHouse was developed to power probably the second biggest web analytics platform in the world, which was developed at Yandex. Yandex is a huge Russian Internet company, and it has an analytics platform, Yandex Metrica, which you can think of as something similar to Google Analytics. They have a lot of users, a lot of traffic, and so on. So the primary use case for ClickHouse was to power this web-based analytics platform. Its first use case was web analytics, certainly. But it happened to be a very flexible system with a lot of features, extensibility, and so on. So right now ClickHouse is used for a variety of use cases.
All kinds of analysis on big data you can think of, starting from web and mobile analytics, time series, and data analysis. It's used in finance for stock market analysis. It's used even for blockchain analytics. There are use cases for ClickHouse in manufacturing, in medicine, in the retail industry. So wherever you have a lot of data that you need to put into some database in a cost-efficient way and to run fast, real-time queries, ClickHouse is a good choice. The industries and different companies recognize it, and we have seen a lot of growth in the last months.
[00:07:47] Unknown:
Oh, if I could extend that just a little bit. I think one of the things that's interesting is that, even though it started with this initial web analytics use case, one of the strengths of ClickHouse is that it is designed to be a general-purpose data warehouse, so it is SQL based. It can do joins. It has a number of very interesting capabilities, like materialized views, that make it suitable for a pretty wide range of problems outside that initial use case.
[00:08:16] Unknown:
And in terms of the broader market for database engines, I'm wondering if you can characterize the position of ClickHouse and how it fits in relation to both commercial and open source databases, and also in terms of the broader big data ecosystem.
[00:08:33] Unknown:
That's a good question. And, actually, at the company that I worked for before, LifeStreet, we were investigating a lot of options before we came to ClickHouse. And there are not too many. So if you go for an analytical database, you don't have too many options. In the commercial world there is Vertica, Greenplum, which is open source now but was commercial originally, Redshift and Snowflake if you go to the cloud, and some bigger players like Teradata or Oracle Exadata and so on. Teradata is very expensive, Vertica is moderately expensive, and cloud solutions have their pros and cons.
In the open source world the situation is even worse. There is only Greenplum right now, and there were several companies who tried to make efficient open source analytical SQL solutions, but they were not successful. One of them is Infobright, and they built quite an efficient data warehouse that you could put onto your desktop. It was a single-server solution, but they couldn't go for a cluster. Another was InfiniDB, which is currently a part of MariaDB as MariaDB ColumnStore. And there were a couple of others, but none of them was successful enough, for a variety of reasons. The technology is complex to develop, it's not easy to market, and so on. So ClickHouse is probably the only one that was backed by a big company like Yandex and had a good team of engineers to execute it. So it evolved.
And if you're asking about positioning, it certainly sits among the massively parallel processing databases. So these are databases that are built for analytics. They are typically column stores, typically with an architecture that scales very well with your hardware resources, CPU cores, memory, storage, or across a cluster, and so on.
[00:11:09] Unknown:
Yeah. And I would also say that we don't just compare ourselves to column stores. One very common way to process data, particularly in batch, is to use Hadoop or Spark, and we see a lot of that. For example, for people who are using Elastic MapReduce on Amazon, there's a huge amount of usage of that. But the big difference is that ClickHouse is real time, or very close to it. Whereas even with Spark, which is much faster than traditional Hadoop, you could be waiting many, many seconds or even minutes to get an answer to analytic questions. So there's definitely a big advantage to using this database.
[00:11:52] Unknown:
And can you talk through how ClickHouse itself is able to handle these scaling and performance capabilities?
[00:12:01] Unknown:
Sure. I'll take that one. So ClickHouse, as we said, is a column store. It has very similar design choices to Vertica. What does that mean specifically? When you load data into a table, instead of storing it as rows the way you do in a database like MySQL, for example, what we're going to do is break up the columns and store them as arrays. Those arrays are stored in individual files, and they're highly compressed. So when we do a query, we will actually only read the columns that are mentioned in the query, and the decompression is very efficient. Moreover, we have optimizations as we're traversing the arrays. The core of the system is C++, and we'll use SIMD instructions whenever possible so that, for example, for summing integers, instead of just summing one of them at a time, we will push as many of them onto registers as we can, process them with SIMD instructions, and get speedups there.
The second thing we do is that we have indexes, which are built automatically, that allow us to locate the parts of the columns that we should be reading. So we're able to drop out a lot of data there. We also have secondary index structures that allow us to be even more specific about what we read. So the amount of data that we process is significantly reduced. Third, we're really efficient about breaking up those arrays and dispatching work onto all cores, so that, for example, when you run ClickHouse on a single node, if you run a utility like htop and you have your query configured to use all resources, you will see this light up in every single core, including cores with hyperthreads. Every single core will go to 100%.
And then finally, we're MPP enabled. That means we can spread queries and data across multiple nodes. Moreover, ClickHouse has both built-in sharding as well as batteries-included replication. So we're able to distribute the data over both shards and replicas within those shards, and ClickHouse will automatically find the appropriate parts of the data, query them, federate the results, and pass them back to you. So there's not one thing in the architecture here. There are multiple things that are all designed to work together to produce very fast query results over enormous volumes of data.
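[Editor's note: to make the column-store design described above concrete, here is a minimal sketch, not from the episode, of creating a MergeTree table and querying it from Python. It assumes the community clickhouse-driver package and a local server; the host, table, and column names are hypothetical.]

```python
# Minimal sketch of the column-store ideas discussed above, using the
# community clickhouse-driver package (pip install clickhouse-driver).
# The host, table, and column names are hypothetical examples.
from clickhouse_driver import Client

client = Client(host='localhost')

# Each column is stored as its own compressed file. ORDER BY defines the
# single "clustering" (primary) sort key, and PARTITION BY controls how
# the data is split into parts on disk.
client.execute("""
    CREATE TABLE IF NOT EXISTS pageviews (
        event_date  Date,
        event_time  DateTime,
        user_id     UInt64,
        url         String,
        duration_ms UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
""")

# Only the columns named in the query (event_date, duration_ms) are read
# from disk; the other columns are never touched.
rows = client.execute("""
    SELECT event_date, avg(duration_ms)
    FROM pageviews
    GROUP BY event_date
    ORDER BY event_date
""")
print(rows)
```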
[00:14:44] Unknown:
Yeah. The vectorization capabilities are definitely particularly interesting for being able to parallelize and compact the overall number of operations necessary to retrieve or manipulate any individual piece of data, so I'm sure that's one of the major capabilities. And then in terms of the overall lifecycle of data as it flows through ClickHouse, going from inserting an individual record or a batch of records through to the storage engine, and then on the query side coming back out, I wonder if you can just talk through the overall lifecycle and process flow of how that data is managed.
[00:15:23] Unknown:
Yeah. It is interesting how ClickHouse handles that, because the data, when it's stored, needs to be sorted in order to be queried efficiently and fast. You can think of ClickHouse as a database with a single clustering index. So the data is stored sorted at rest, and the clustering index allows ClickHouse to locate data efficiently when this index is used. But the challenge comes when new data arrives and you need to keep the data in the same sorted structure. So what ClickHouse does is the following. With every new insert, and we typically insert not a single record but some batch of records,
hundreds of thousands to millions in one insert statement for good performance in actual production workloads, every insert generates a new part of a table, and it's actually called a part in ClickHouse terminology. And this part is internally sorted by the primary sort order. But once you insert more and more data, you have a lot of those parts. So at query time you would have to check all the parts for your data, load it from the old parts and the new parts, and then present the results to the user. That would not be very good in the long term. So what ClickHouse does is start to merge these parts of the data into bigger ones. When a new insert comes, we have a new block of data, ClickHouse picks up some old part of the data, merges them together, and produces a new part.
And this process, which is called merge, happens all the time in the background, so it constantly merges the data. It's actually a merge sort operation. It is a bit more complicated than that, because ClickHouse has to do it per single column: it has to decompress the data first and then compress it afterwards. But in a nutshell, it's a merge sort, which is a very efficient operation that can be performed in a streaming fashion. And during this merge process there is no interference with queries, so it's absolutely transparent for the user.
The new parts of the data just eventually replace the old ones, and the queries run faster all the time. So your data is always pretty well sorted on disk. It's not always going to be a single sorted part or single sorted piece of data, but a small number of those. And this is a very important concept in ClickHouse. Something similar exists in Vertica, but Vertica has a more complicated design with in-memory structures and disk structures. ClickHouse always stores data on disk, and this is the main difference. Actually, it's not really a difference: you have to store data on disk if you're going for huge data.
You don't have that much memory. So because everything is stored on disk, the data is always sorted, but sometimes in several parts, which are being merged in the background all the time. I think that's the main idea.
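[Editor's note: a rough way to observe the parts and background merges described above, again a hedged sketch reusing the hypothetical pageviews table and the community clickhouse-driver package.]

```python
# Observe parts and merges on the hypothetical `pageviews` table.
from datetime import date, datetime
from clickhouse_driver import Client

client = Client(host='localhost')

# Each batched INSERT creates a new sorted part on disk.
batch = [
    (date(2019, 7, 1), datetime(2019, 7, 1, 12, 0), 42, '/home', 125),
    (date(2019, 7, 1), datetime(2019, 7, 1, 12, 1), 43, '/docs', 480),
]
client.execute(
    'INSERT INTO pageviews (event_date, event_time, user_id, url, duration_ms) VALUES',
    batch,
)

# system.parts shows the active parts of the table; background merges
# combine them into fewer, larger sorted parts over time.
parts = client.execute("""
    SELECT partition, name, rows
    FROM system.parts
    WHERE table = 'pageviews' AND active
""")
print(parts)

# OPTIMIZE forces a merge right away (normally the background merge
# process takes care of this on its own).
client.execute('OPTIMIZE TABLE pageviews FINAL')
```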
[00:19:16] Unknown:
And as far as the clustering capabilities, is it a sort of master/minion type of architecture, or is it an active-active setup where any instance can accept writes and then those writes are distributed across the cluster for durability? I'm just wondering how that plays into the overall data flows. It's active-active. So what you have is
[00:19:37] Unknown:
you have shards. And actually, it's kind of interesting how ClickHouse does this, because the shards are just defined through a configuration file that says, hey, we have this cluster, and my shards are located here, here, and here. Now you can extend that definition to also say each of those shards should have 2, 3, or 4 replicas. In that case, what happens is you have to have ZooKeeper running with ClickHouse. And then each time a part gets inserted to a particular replica, a note is made in ZooKeeper that the other replicas can then refer to and recognize, hey, there's something that we need to replicate.
There's a part that needs to be replicated over to our side of the fence. So the replication works such that you can basically insert to any replica, and it will eventually make its way to all the other replicas. So it's kind of an eventual consistency replication model.
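[Editor's note: a hedged sketch of the sharding and replication setup Robert describes. The cluster name, ZooKeeper path, and macros below are placeholders that would normally come from the server configuration; the tables are illustrative.]

```python
# Hypothetical sketch: one replicated local table per node plus a
# Distributed table that fans queries out across the cluster.
from clickhouse_driver import Client

client = Client(host='localhost')

# The {shard}/{replica} macros identify which replica this server is;
# inserts to any replica are copied to the others via ZooKeeper
# (eventually consistent).
client.execute("""
    CREATE TABLE IF NOT EXISTS events_local (
        event_date Date,
        user_id    UInt64,
        url        String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
""")

# The Distributed engine spreads queries and inserts over all shards of
# the cluster named in the config (here a placeholder called my_cluster)
# and federates the results.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_all AS events_local
    ENGINE = Distributed(my_cluster, currentDatabase(), events_local, rand())
""")
```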
[00:20:30] Unknown:
And another interesting feature that I noticed as I was preparing for the interview is the option for some data safeguards, such as deletion protection. So I'm wondering if you can talk through some of those safeguards that are available and how they factor into enabling different use cases for ClickHouse that would otherwise be intractable or impractical?
[00:20:57] Unknown:
It's interesting that you ask this question, because normally users are not concerned with those safeguards. But yes, ClickHouse has some guards against accidental deletes of tables and partitions. Because we're talking about big data, if somebody accidentally deletes a table, it might be quite expensive and take a long time to restore. So by default ClickHouse doesn't allow you to drop tables or partitions above 50 gigabytes in size, but it's certainly configurable and can be disabled. Speaking of data deletes in general, it's a common trait of analytical databases that they don't like deletes and updates too much.
Many analytical databases started out essentially append-only, and ClickHouse is not an exception. But currently you can delete or update data quite flexibly in ClickHouse. Probably not in the sense of traditional SQL, but in a slightly different ClickHouse way. Usually ClickHouse has a lot of its own ways to do certain types of things, which makes working with ClickHouse kind of interesting sometimes. But all of those decisions have
[00:22:30] Unknown:
good thinking behind them. And I would add to that that, as you go into analytic systems, you tend to think about deletes in a different way. The ClickHouse way to think about data deletion tends to be more along the lines of: hey, I've got enormous amounts of raw data coming in that I want to keep for a certain period of time, say 30 days or 7 days or something like that. But then, as time goes on, I want to maintain aggregates of that data that become increasingly sparse over time. So the way you would think about that is you would have the raw data coming in, and ClickHouse has a feature that allows you to assign a time to live, which is based on a SQL expression. A simple thing is to just keep it for 30 days, but it could also be based on some condition in the data itself. So you would do that for your raw data coming into the main fact table, and then you would build materialized views that automatically compute aggregates.
Those are computed from the time that the data is inserted. And then, even after the data gets deleted out of the main table, you would continue to have the materialized views. So that's kind of the ClickHouse way of thinking about that deletion problem.
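[Editor's note: a hedged sketch of the pattern Robert describes, raw data with a TTL plus a materialized view that keeps aggregates after the raw rows expire. Table and column names are illustrative.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# Raw fact table: rows are dropped automatically 30 days after event_date.
client.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        event_date Date,
        user_id    UInt64,
        bytes      UInt64
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
    TTL event_date + INTERVAL 30 DAY
""")

# Materialized view: aggregates are computed at insert time and remain
# queryable after the raw rows have expired from raw_events.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_traffic
    ENGINE = SummingMergeTree()
    ORDER BY event_date
    AS SELECT event_date, sum(bytes) AS total_bytes
    FROM raw_events
    GROUP BY event_date
""")
```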
[00:23:45] Unknown:
And then as far as getting data into ClickHouse, you can obviously insert individual records at a time, but for large volumes that's impractical. So I'm curious what the other loading paths are for getting data into ClickHouse in a scalable and maintainable fashion? You know, this is a great question, because ClickHouse has such a variety
[00:24:06] Unknown:
of ways of loading data that I'm sure I won't be able to remember them all, and Alexander will have to help me. So just as far as the APIs and, in effect, basic inserts, the one really important ClickHouse principle is that, like many data warehouses, you don't want to insert single rows. You want to insert in batches, say, of 10,000, 20,000, 30,000, something like that. So that's the first thing. There are a couple of different ways you can do this through the APIs that I'll just mention in passing. Very commonly, you can insert using HTTP requests. There's also a native API, or what we call the native TCP format, which inserts data that is already typed and in column order, so that results in very efficient inserts. And when we think about using the APIs, you can, of course, write a Python program which invokes these.
But the first-order approach is to actually just use the ClickHouse client, which is a command line client, and it has very efficient streaming of different data formats. So for example, it can read CSV. It can read JSON; there's a format called JSONEachRow with one JSON object per line. There are various other formats that it can read in. And the great thing about it is you just cat the file and basically pipe it into the ClickHouse client, and it just blasts it up to the ClickHouse server at full speed. And you can run many of these in parallel. We do that commonly for data loading. However, beyond that, there are many other ways to read data directly from the ClickHouse server. There are these things called table functions that allow you, for example, to go out and directly read CSV off a URL.
You can read data directly out of Kafka queues. In fact, ClickHouse also has what are known as table engines, which can basically take anything in the outside world and make it look like a ClickHouse table. So a common way to read from Kafka is to use the Kafka table engine, which then makes a Kafka queue look like something you can select out of. So that's another way to put data in. Two other very interesting ways to load data are that we can go directly to Postgres as well as to MySQL, and we can read data directly out of those tables.
So that's yet another way that you can pull data in and actually store it in ClickHouse. And then finally, sometimes you don't even need to read the data at all. For MySQL, for example, ClickHouse very commonly uses what are called dictionaries, which are reference tables that are loaded into memory. But the base for those reference tables can be another database like MySQL. So one of the things you might do, if you have a big fact table and a bunch of smaller dimension tables, say, for example, containing product SKUs or something like that, is to leave them in MySQL and then just define a dictionary which will read them from the MySQL database. And every time they're referred to, ClickHouse can just pull the data in, and you'll be able to access the data directly from MySQL.
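[Editor's note: a hedged sketch of a few of the loading paths mentioned above, using the community clickhouse-driver package. The target table, URL, and MySQL connection details are placeholders.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# 1. Batched insert over the native TCP protocol: thousands of rows per
#    INSERT statement rather than one row at a time.
#    (Assumes a hypothetical table events_all(event_date, user_id, url).)
rows = [('2019-07-01', 42, '/home'), ('2019-07-01', 43, '/docs')]
client.execute('INSERT INTO events_all (event_date, user_id, url) VALUES', rows)

# 2. The url() table function reads CSV straight off an HTTP endpoint,
#    so ClickHouse pulls the data itself instead of having it pushed in.
client.execute("""
    INSERT INTO events_all
    SELECT event_date, user_id, url
    FROM url('https://example.com/events.csv', 'CSV',
             'event_date Date, user_id UInt64, url String')
""")

# 3. The mysql() table function queries a MySQL table in place, without
#    copying it into ClickHouse first.
skus = client.execute("""
    SELECT * FROM mysql('mysql-host:3306', 'shop', 'product_skus', 'user', 'password')
""")
```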
[00:27:15] Unknown:
Yeah. So in a lot of ways, it looks like you're inverting the ETL model, where rather than pushing data into the database, you're actually using the database engine to pull data from other sources, which I think — Correct. — simplifies things in a lot of ways. Correct. Yeah. It's a really interesting model, and I think that it also gives you some flexibility,
[00:27:31] Unknown:
because we talked a little bit about deletes earlier on, and one place where not being able to delete things easily, or in the conventional SQL way, can be a problem is specifically dimension tables. That's a good example of a place where you may want to update things on a regular basis, and in that case keeping them in something like MySQL turns out to be very convenient.
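[Editor's note: a hedged sketch of the dictionary pattern described above, a small MySQL dimension table exposed to ClickHouse queries. Dictionaries are classically defined in the server's XML configuration; newer ClickHouse releases also accept a CREATE DICTIONARY DDL like the one below. All names and credentials are placeholders, and a hypothetical sales table with a UInt64 sku_id column is assumed.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# A dictionary backed by a MySQL table; the data keeps living (and being
# updated) in MySQL, and ClickHouse refreshes it periodically.
client.execute("""
    CREATE DICTIONARY IF NOT EXISTS product_dict (
        sku_id UInt64,
        name   String
    )
    PRIMARY KEY sku_id
    SOURCE(MYSQL(host 'mysql-host' port 3306 user 'user' password 'password'
                 db 'shop' table 'product_skus'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600)
""")

# dictGet() looks the attribute up in memory at query time.
rows = client.execute("""
    SELECT sku_id,
           dictGet('product_dict', 'name', sku_id) AS product_name,
           sum(amount)
    FROM sales
    GROUP BY sku_id, product_name
""")
```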
[00:27:56] Unknown:
And when I was looking through the documentation, it seems that you can actually set up ClickHouse to be a replication target for MySQL or Postgres to simplify the change data capture approach of being able to pull data in as it's updated in the database. I'm wondering if I'm accurate in that interpretation of the documentation.
[00:28:14] Unknown:
That's correct. And we did — oh, Alexander, you wanna take that one?
[00:28:18] Unknown:
Yes. So that's correct, you can do it. But right now it's not a native ClickHouse feature. So you need to install an external tool or software in order to listen for replication logs or something like that on the MySQL or Postgres side and then push the data into ClickHouse. But we're going to change that, actually, and we envision that ClickHouse will be able to connect to MySQL or Postgres directly as a replica and listen to changes to the tables of interest. And once it's implemented, it's actually a very cool feature that we are looking forward to having, because it allows you to speed up your MySQL or Postgres analytics very efficiently with very little effort.
[00:29:14] Unknown:
And, yeah, this is a feature we love. We've had a number of conversations with people about it. Because if you look at MySQL or Postgres applications, it's very common to see one or two large tables, and in some applications they'll grow to hundreds of millions of rows, because they might be, you know, events from marketing campaigns or something like that. And then you'll have a bunch of little tables, a long tail in traditional database schemas. So what this feature would do is allow us to subscribe to changes on those big tables. They would be brought over and be available for processing queries.
But then, because we can integrate with the smaller tables through dictionaries and other features like that, we can go back and, as necessary, just reference the data from the small tables that are living in MySQL. So this just feels like a really compelling and interesting way of accelerating analytics on existing MySQL databases. And it's one we're definitely looking forward to getting implemented.
[00:30:18] Unknown:
And one of the interesting challenges posed by the fact that you can pull data from these external sources or subscribe to these replication feeds is the idea of schema evolution, and being able to evolve the schema in ClickHouse itself when the data formats change in those external systems, or when you have maybe malformed data that's coming through, for instance, in a Kafka topic. So I'm curious how you approach that challenge in your own experience of running ClickHouse.
[00:30:48] Unknown:
Yeah, let me jump in on this, because I actually worked on replication for MySQL for many years. So the short answer is that's an exercise for the reader right now in ClickHouse. This is definitely something we're looking at. ClickHouse actually does have very flexible schema management. You can do ALTER TABLE to add columns, change data types, things like that. It does require that you have the ability to intercept the schema changes in the replication log. And, certainly, the MySQL replication log has this information. The trick is to see it coming and then correctly process the change. So that doesn't exist yet, and that's one of the reasons why we've actually had discussions with users about potentially replicating all data out of MySQL, but it's exactly that schema change that makes this a difficult problem.
So, for now, I think when people are replicating, it's better to do large tables where the schema changes are few and far between, and you should be able to manage them. I think this will be improved over time, but it's just generally a very hard problem. And in particular, it is very hard to solve when you have heterogeneous replication, because there will be schema changes on the source side that are simply very difficult to implement on the target. So I foresee that as always having a somewhat manual component, even as our replication capabilities improve.
[00:32:29] Unknown:
Yeah. Schema evolution in general is a hard problem that doesn't appear to be fully solved in any one system, and it just gets exacerbated when those schema evolutions are bridging different systems, as is the case when you're pulling data from these source systems into a data warehouse, or in the event of using a data lake where you have data landing in different schemas and then need to be able to manipulate and analyze across the whole historical record.
[00:32:54] Unknown:
Yeah. It's just not an easy problem to solve. So I think we would be super happy if we could just get this feature in, where we can at least subscribe to the ongoing data, including inserts, updates, and deletes on large tables, and then basically have ClickHouse exist as an archive. I think that one's tractable to solve. And then once we've made progress on that, we'll look beyond it to what else we can do. And so beyond the core engine of being able to store and scale these
[00:33:25] Unknown:
data storage and analytics requirements, I know that there are also some interesting and advanced capabilities that are built into the engine, such as different SQL extensions and custom data types. And so I'm wondering if you can enumerate some of the more interesting ones that are unique or particularly well implemented in ClickHouse.
[00:33:44] Unknown:
Table engines in ClickHouse are probably, in general, not too interesting by themselves. ClickHouse does have a lot of different table engines. It has table engines for its own internal structures, and it has table engines for external data as well. In particular, you can have a MySQL table inside ClickHouse, and the data will live in MySQL, but you can read it from ClickHouse like a ClickHouse table. But that's probably not the most interesting part. The really interesting thing for me in ClickHouse is its ability to aggregate data in real time automatically, using special aggregating engines that are extensions of the MergeTree engine, which is the main engine at the core of ClickHouse and which does those merge operations in the background to keep the data sorted all the time. During the merge, it can also roll up data. So you can define that your table is actually an aggregate: you have some columns that you use as the base for your aggregation and some other columns that you want to roll up, applying different aggregate functions to different metrics. And ClickHouse will do it automatically in the background.
This is one of the coolest features ClickHouse has, to me at least, because it allows you to build very efficient systems. Say you have data flowing into your system at a huge rate, like a million rows per second, and it's being accumulated in a huge table. This is a good table to look into sometimes and probably run some queries against, but not on a regular basis. Most of the time you typically need the data aggregated differently, by minute or by 10 minutes or by hour or by day. And you can build those aggregating structures, and ClickHouse will maintain the aggregates automatically, in real time. Once you insert the data, it is already available in those aggregates, and you can query it with sub-second response
[00:35:50] Unknown:
time. I don't know of any other system that has this exact feature. Another cool feature is arrays, and probably Robert will talk about that. I will talk about arrays. I love arrays. So there are a lot of databases out there that do support arrays, just as a data type. What's interesting is that ClickHouse really makes them first-class citizens in a couple of important ways. One is that ClickHouse has a feature called an array join, where you can basically compute aggregates which include a bunch of things stuck in an array. So for example, you could aggregate sales by month, and you could keep an array of all the regions that contributed to that. But then what you can do is construct SQL queries which will flip this out, so that the array goes from being an array of values stuck in a row to something it basically pivots around, and you can now have a new set of rows which then associate each region with values. So the array capabilities give you an important pivoting capability, whereas in many systems you actually have to pull data out and do the pivot yourself, for example using Python Pandas.
So that's one really cool thing that I love about arrays. Another thing is there are map functions that you can use to iterate across arrays and change values, sum them, things like that. So you get kind of a programming model even within the SQL queries that has a lot of similarities to the map/reduce constructs that I think people are used to if they're using JavaScript or Python. So there are a lot of really powerful capabilities there. There's one final feature that I think is totally cool that has just arrived in ClickHouse. It's still kind of maturing.
That's ASOF joins, and these are kind of an interesting feature. They exist in KDB, which is a very popular database for financial users. And an ASOF join allows you to join two tables that may, for example, contain values that have a key that is close, but not quite an exact match. The canonical example here is you have financial data where people are measuring market prices, say, every 10 seconds or so, and then you have weather data, which maybe is measured on an hourly basis. And what you'd like to do is join each row of financial data to the nearest weather-related record based on time, and the ASOF join allows you to do that. It's a totally cool capability, and it allows you to merge data and analyze data in a way that's almost impossible if you don't have it.
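[Editor's note: a hedged sketch of the two features Robert describes, ARRAY JOIN for pivoting array values into rows and ASOF JOIN for nearest-match joins on time. The monthly_sales, trades, and weather tables and their columns are illustrative placeholders.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# ARRAY JOIN: unfold each element of the regions array into its own row,
# pivoting array values out of a single row.
pivoted = client.execute("""
    SELECT month, region
    FROM monthly_sales
    ARRAY JOIN regions AS region
""")

# ASOF JOIN: attach the most recent weather reading at or before each
# trade's timestamp, even though the timestamps never match exactly.
joined = client.execute("""
    SELECT t.symbol, t.price, w.temperature
    FROM trades AS t
    ASOF JOIN weather AS w
    ON t.city = w.city AND t.ts >= w.ts
""")
```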
[00:38:46] Unknown:
Yeah. I think those sorts of fuzzy joins are definitely a very interesting capability, particularly for a system that's designed to work with these large quantities of data that might have some variability or some potential minor errors in generation or insertion, where you want to be able to just get an approximate value or perform some estimation across the data without necessarily knowing the exact value, or where values aren't necessarily going to match up precisely as would be the case with a typical relational system that you would use inside of an application.
[00:39:07] Unknown:
Yes. That's a good point. And ClickHouse, in addition to those ASOF joins, which I can't quite call fuzzy joins but which have very interesting semantics, has some more ways to do approximate calculations. One very useful feature is approximate uniques. This is quite common when you analyze a lot of data, with visitors specifically, when you are interested in the number of unique visitors. Traditionally you have to calculate a count distinct, if you wish, in SQL. In ClickHouse, you have a different way to do that more efficiently, with approximate unique calculations that sample the data using some hash functions.
They're pretty accurate for general use, and if you go to small numbers, then they are precisely accurate. It also supports sampling, so you can run queries with the sampling extension, and ClickHouse will take a small portion of the data to calculate the result, and then you just multiply by the sample size. And this is built inside the C++ code, so it's a native feature of ClickHouse. Actually, speaking of uniques, one of the coolest ClickHouse features here is that we can aggregate uniques. Normally, you cannot aggregate uniques for one bucket of the data and for another bucket of your data and then combine them together, because uniques just don't sum up.
ClickHouse found a cool way to sum up uniques, and you can build aggregates with uniques. I don't know of any other system that can do aggregates of uniques. In ClickHouse, you can. And that makes it a great choice for companies who depend on uniques, like companies that are working with large networks, for example, or IP telemetry, where you need to compute metrics like the number of unique devices or unique users.
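[Editor's note: a hedged sketch of the approximate and mergeable unique counts Alexander mentions, reusing the hypothetical pageviews table from the earlier examples.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# uniq() gives a fast approximate count-distinct (uniqExact is also
# available when precision matters more than speed).
client.execute(
    "SELECT event_date, uniq(user_id) FROM pageviews GROUP BY event_date"
)

# uniqState()/uniqMerge() keep a partial aggregation state per bucket,
# so daily states can later be combined into a monthly unique count --
# something plain distinct counts cannot do.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_uniques
    ENGINE = AggregatingMergeTree()
    ORDER BY event_date
    AS SELECT event_date, uniqState(user_id) AS users
    FROM pageviews
    GROUP BY event_date
""")

monthly = client.execute("""
    SELECT toStartOfMonth(event_date) AS month, uniqMerge(users) AS unique_users
    FROM daily_uniques
    GROUP BY month
""")
```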
[00:41:31] Unknown:
And for somebody who's getting started with setting up a ClickHouse system, can you talk through the overall onboarding process and some of the considerations that they should be thinking about as far as data modeling that would be different in ClickHouse than in another relational or column-oriented system that they might be used to? Yeah. I can take that one.
[00:41:51] Unknown:
I think getting started is just not very hard, because you can basically install ClickHouse in about 60 seconds. If you have a decent network connection, you just say apt install clickhouse-server and, bang, it shows up. So when you're getting started, the typical approach is to just jump in and install from packages; it's nice to have good storage, but it's not a big requirement. The really interesting question is around modeling, because what I find is one of the huge differences between ClickHouse and, for example, OLTP databases like MySQL, Postgres, or Oracle is that you actually don't have to think about very many things when you're starting. Yes, your data has to be in a table, we're talking SQL, but you don't have to make decisions about data types. So for example, you can start, and we actually have customers who do this: if you have a CSV file and you want to read it and look at it and start to do initial analytics, you can just load everything as a string. The only decision you really need to make when you're loading the data is that you do need to give a partition key, some way that ClickHouse can decide how to break the data up into parts.
So that's a fairly simple decision. Another thing is you do have to choose a primary sort order. So, again, you make that choice early on. If you don't get it right, you can always reload the data and try again, so that's not a problem. So the first cut, if you're really just getting going with new data and you have no idea what's in there, is to make a basic choice on the partition key, make a basic choice on the sort order, and put everything in as strings. You can then optimize that as you start to look at the data. One thing that you can do which speeds up performance enormously is, after you've looked at the data, go ahead and assign proper data types. So instead of a string date, make it a real DateTime, for example. Instead of a string integer, make it something like a UInt8 so that you're just using a byte. That will vastly improve your performance, and, of course, it makes your queries simpler, because when you're counting things you don't have to apply conversion functions.
And then there's a second-order type of modeling that you want to do, which is to think about codecs. Codecs are basically transformations that are applied to the data as it is stored and, of course, as it is pulled back from storage. A good example of a codec is that you may change, for example, the way that the compression works. On strings, for example, we have a very powerful one called low cardinality. What this does, basically, if you have, let's say, a string value for region in some dataset, well, there's a finite number of geographic regions in most cases, say maybe you have 500 of these. If you apply this LowCardinality data type, though it's really a transformation, to the column, what it will do is automatically convert those region values to an integer and maintain a lookup table for you that it automatically associates, so that when you do a select, you see the region name, but underneath it's just comparing and selecting using integers. So it's basically three levels. When you're starting out, you can make a basic choice about the partition key and the sort order, and just have a rough estimate on the data types. You can then refine the data types, and then you can apply different codecs and other transformations on the data. And that's kind of the way that you can iteratively get to a really efficient schema. I think there's also another question in modeling that comes up very commonly in many use cases, and that's what to do when you have a variable schema.
And that's something that Alexander has actually worked on quite a bit. Alexander, I think that's one where you can offer some interesting comments.
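[Editor's note: a hedged sketch of the iterative modeling approach Robert outlines, starting with everything as strings and tightening types later. Names are illustrative, and in practice the partition key and sort order columns themselves cannot be retyped in place, so those you would choose up front or reload.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# First cut: pick only the partition key and sort order, load everything
# else as strings while you explore the data.
client.execute("""
    CREATE TABLE IF NOT EXISTS raw_csv (
        event_date String,
        region     String,
        duration   String
    )
    ENGINE = MergeTree()
    PARTITION BY event_date
    ORDER BY event_date
""")

# Later: refine the non-key columns once you understand the data.
client.execute("ALTER TABLE raw_csv MODIFY COLUMN duration UInt32")
client.execute("ALTER TABLE raw_csv MODIFY COLUMN region LowCardinality(String)")
```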
[00:45:52] Unknown:
Yes, thank you, Robert. So these days some users are used to schemaless databases, or databases without a very strict schema, like Cassandra, for example, or sometimes time series databases like Influx or Prometheus. And they come to ClickHouse with this background, and they wonder if they can do the same. And in most cases it makes sense, because, for example, consider the problem of monitoring when you have a lot of different devices. All devices may have their own properties and so on, and it's very hard to make a schema that works for all cases, or you would have thousands of columns per table. So instead, what we can do in ClickHouse is store data in arrays.
You might have an array of keys and an array of values, or you can have multiple arrays for different data types. It kind of gets you away from the columnar structure, because an array is stored all together. But you can start from this and work with schemaless data and just store everything as key-value pairs. And if you decide that performance is not very good, you can then materialize some specific values or specific attributes from those key-value arrays as separate columns. It can be done afterwards. It can be done with absolutely no downtime, just by an alter table, and ClickHouse will start populating those specific attributes that are important for filtering, for example, for you. And you can start querying the data using those filters much faster and more efficiently.
So this is kind of a hybrid approach. You start with this array or arrays and probably some other common properties like timestamps, whatever. And then you can extract certain data for better performance. Similar to what Robert said earlier, you can start with the basic schema of these things and then continue to specific data types, codecs, and so on. And ClickHouse allows you to do everything. Yeah. And we have users who are using exactly this approach, for example, to process logs.
[00:48:16] Unknown:
You know, keep the data in arrays and then use this materialized column feature. That's actually another really killer ClickHouse feature. The fact that you can do it with zero downtime and basically instantly begin to refer to the new column with the correct type is a great feature.
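[Editor's note: a hedged sketch of the semi-structured pattern described above, keeping arbitrary attributes in parallel key/value arrays and later promoting a hot attribute to a real column with MATERIALIZED. Table and attribute names are illustrative.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

client.execute("""
    CREATE TABLE IF NOT EXISTS device_metrics (
        ts     DateTime,
        device UInt64,
        keys   Array(String),
        vals   Array(Float64)
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (device, ts)
""")

# Query an attribute straight out of the arrays (slower, but no schema needed).
client.execute("""
    SELECT device, max(vals[indexOf(keys, 'temperature')])
    FROM device_metrics
    GROUP BY device
""")

# Promote the attribute to its own column with no downtime; the value is
# computed from the arrays, and queries can refer to it immediately.
client.execute("""
    ALTER TABLE device_metrics
    ADD COLUMN temperature Float64 MATERIALIZED vals[indexOf(keys, 'temperature')]
""")
```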
[00:48:37] Unknown:
Yeah. And all of these capabilities, as far as the flexible modeling options, put me in mind of the current trend of the shift away from ETL into ELT, where you land the raw data into some source, in a lot of cases a data lake, but some of these modern cloud-capable data warehouses, such as Snowflake DB, I know, encourage you to just load the raw data into them and then do the transformation after the fact. And so it seems that ClickHouse is another case where you would just load the raw data into some source table and then take advantage of these schema evolution or schema manipulation capabilities for then being able to transform the data into a more processed form to use for more advanced analytics?
[00:49:22] Unknown:
Yeah. That's right. I wouldn't say that this is the primary use case for ClickHouse, but ClickHouse can do pretty well with
[00:49:30] Unknown:
this approach as well. Yeah. And I think that we're definitely not — I mean, there are systems out there which will actually let you have a metastore which implements schema on read, so that you may have just text files. This is the Hadoop Hive model. That's sort of the far end of the spectrum, and ClickHouse is absolutely not designed or intended to follow that pattern. To get high performance, we do require that you really put the data into ClickHouse's storage format. We can, of course, refer to data in other databases, but those tend to be smaller tables.
So we do expect that people are going to eventually load most of their data into ClickHouse and store it in the native MergeTree format. So in that sense we're really closer to a traditional database than, for example,
[00:50:27] Unknown:
some of the systems that are designed to read data directly off of S3. And then as far as getting ClickHouse deployed, I know that you said in terms of the initial onboarding you can do an apt-get install, but for actually putting it into a production context, I'm wondering what's involved, some of the edge cases that users should be aware of, and some of the tuning capabilities that would be useful to point out. Good question. So it's right that you can install ClickHouse
[00:50:57] Unknown:
and start it in 60 seconds, and you can run it on your laptop or on a high-end server. ClickHouse will run everywhere. It's absolutely portable. When going into production, there are different considerations. Performance is actually not always the first one, since ClickHouse is already very fast. Quite often it's more about usability and a better fit into the infrastructure. Having said that, for maximum performance ClickHouse is better optimized for running on bare metal servers. It has a lot of internal optimizations that rely on CPU cache utilization, for instance, where the data is local to the CPU.
Sometimes, if you use certain virtualization, this is not the case. But you can go to clouds as well, including public clouds like Amazon or Azure or Google Cloud, and it works reasonably well there, given that you are provided with a good storage option. It certainly doesn't perform as fast as on comparable bare metal servers; there is a certain expense to virtualization, but it's very convenient. So just deploy it to the cloud, and if everything else is in the cloud, you actually don't have a reason to go somewhere else. What we see now in recent months is a growing interest in deploying ClickHouse in Kubernetes, because Kubernetes allows you to run kind of the same ClickHouse configuration everywhere. You can have Kubernetes on bare metal, Kubernetes in public and private clouds, and the running pattern for ClickHouse would be exactly the same in all Kubernetes environments.
We actually see ClickHouse becoming more and more popular in Kubernetes. Actually, at the moment, this is probably the only data warehouse or analytical database that can run in Kubernetes at all. There are multiple reasons for that. One of the reasons is that stateful applications and storage are still pretty new to Kubernetes, and ClickHouse has a simpler deployment pattern than many other analytical databases, not just in the cloud. I kind of lost the thought, but it's great that this ClickHouse portability allows it to run everywhere and be optimized for a particular use case scenario. If you optimize for performance, you will go one way. If you optimize for convenience, you go another way.
And the trade-offs are not that huge. So the Kubernetes overhead is not significant compared to bare metal or to Kubernetes running in Amazon
[00:53:52] Unknown:
and so on. Yeah. I think the performance of ClickHouse on Kubernetes has been very enlightening. So the Kubernetes operator that we've developed takes what is naturally a very complex deployment, which is isn't anything particular to click click houses. It's just that deploying data applications with state on on Kubernetes involves creating a lot of resources and managing them. So, the Kubernetes operator that we've developed reduces these to a single custom resource that's easy to configure and, configure, deploy, and update. But what we've definitely seen is is that well, first of all, these give you huge flexibility because you can quickly spin up clusters. You can create clusters in less than a minute, depending on their size, of course.
We've also seen pretty promising performance results. For example, when we run tests on Kubernetes versus running on VMs, so ClickHouse on Kubernetes on Amazon VMs versus ClickHouse on straight VMs, we see virtually no difference in performance. So that's something that looks very promising. We are working with customers who have major deployments on Kubernetes, and we definitely see it as a growing platform for data warehouses in the future.
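For readers who have not seen the operator, the "single custom resource" idea looks roughly like the following. This is a from-memory sketch, not the authoritative schema: the apiVersion, kind, and field names are assumptions that should be checked against the clickhouse-operator documentation for the version you deploy.

```python
# A rough, from-memory sketch of the single custom resource managed by the
# ClickHouse Kubernetes operator. The apiVersion, kind, and field names are
# assumptions for illustration only -- verify them against the operator docs
# before applying anything like this to a real cluster.
import json

clickhouse_installation = {
    "apiVersion": "clickhouse.altinity.com/v1",  # assumed; check your operator version
    "kind": "ClickHouseInstallation",
    "metadata": {"name": "demo"},
    "spec": {
        "configuration": {
            "clusters": [
                {
                    "name": "demo-cluster",
                    # 2 shards x 2 replicas -> 4 ClickHouse pods managed by the operator
                    "layout": {"shardsCount": 2, "replicasCount": 2},
                }
            ]
        }
    },
}

# Emit JSON that could be piped to `kubectl apply -f -` once the schema is confirmed.
print(json.dumps(clickhouse_installation, indent=2))
```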
[00:55:15] Unknown:
And as far as the use cases and instances of ClickHouse and customers that you've worked with, I'm wondering what have been some of the most interesting or unexpected or innovative ways that you've seen it used. I'll give you one that just about blew my mind about four days after I started the job.
[00:55:33] Unknown:
We have a customer that develops custom fraud detection systems for banks and credit card processors. The way that they use ClickHouse is they bring it in, and the first thing they have to do, in order to build the system and select appropriate models for doing fraud detection, is understand what data that financial institution has available. So they'll just ask the institution to give them a few CSV files containing more or less complete samples of data, throw them into ClickHouse, and start playing with the data and looking around. This is exactly what I was talking about when I was discussing how straightforward modeling can be, that you can literally just start with strings. That's exactly what they do: they throw the CSV in and then start exploring it. I have never seen a database that can do this at this scale, so it's a really interesting and innovative way to use it.
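As a rough illustration of that "throw the CSV in and start exploring" workflow, here is a minimal sketch using ClickHouse's HTTP interface. The table name, column names, and sample.csv file are hypothetical, and a local server on the default port 8123 is assumed; every column is declared as String so no upfront modeling is needed.

```python
# Minimal "load the CSV as strings and explore" sketch over the HTTP interface.
# Assumes a local server on the default port 8123; the table, columns, and
# sample.csv are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = "http://localhost:8123/"

def execute(sql: str) -> str:
    """Run a statement by POSTing it as the request body."""
    with urlopen(Request(URL, data=sql.encode())) as resp:
        return resp.read().decode()

def insert_csv(table: str, csv_bytes: bytes) -> None:
    """Stream raw CSV into a table: the query goes in the URL, the data in the body."""
    qs = urlencode({"query": f"INSERT INTO {table} FORMAT CSV"})
    urlopen(Request(URL + "?" + qs, data=csv_bytes)).read()

# Declare everything as String -- no modeling up front, just start looking at the data.
execute(
    "CREATE TABLE IF NOT EXISTS scratch_sample "
    "(account String, merchant String, amount String) "
    "ENGINE = MergeTree ORDER BY tuple()"
)

with open("sample.csv", "rb") as f:
    insert_csv("scratch_sample", f.read())

# Start exploring: top merchants by row count.
print(execute(
    "SELECT merchant, count() AS c FROM scratch_sample "
    "GROUP BY merchant ORDER BY c DESC LIMIT 10"
))
```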
[00:56:37] Unknown:
Yes, and thanks, Robert, that's a good example. From my side, probably my favorite is a distributed computation system that one of our customers built last year. What they did was stand up a cluster of ClickHouse servers and develop a kind of execution pipeline that processed the data, which was stock market data, absolutely in parallel on all servers. The pipeline took several steps to do very complex mathematical computations, and after all parts of the pipeline finished, the data was merged together into a single table to be accessible by the users. You could pretty easily develop the same thing on Spark or Hadoop, but given the data sizes they used, it would take several days for the results to come; in ClickHouse it took just several hours. It was probably a 20x performance increase.
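A very rough sketch of that pattern, under stated assumptions: each shard holds a disjoint slice of the data in a local table, the same heavy aggregation runs on every shard concurrently, and the per-shard results are merged into one table on a collector node. The host names, table names, and query below are hypothetical placeholders, not the customer's actual pipeline.

```python
# Sketch of using a ClickHouse cluster as a parallel compute engine: run the
# same heavy step on every shard at once, then merge the partial results into
# a single table. Hosts, tables, and the query are hypothetical; it assumes
# each shard holds a disjoint slice of the data, so results do not overlap.
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SHARDS = ["http://shard1:8123/", "http://shard2:8123/", "http://shard3:8123/"]
COLLECTOR = "http://collector:8123/"

HEAVY_STEP = (
    "SELECT symbol, avg(price) AS avg_price, max(price) AS max_price "
    "FROM trades_local GROUP BY symbol FORMAT CSV"
)

def run(base_url: str, sql: str, body: Optional[bytes] = None) -> bytes:
    """Send one statement to one server over the HTTP interface."""
    if body is None:
        return urlopen(Request(base_url, data=sql.encode())).read()
    qs = urlencode({"query": sql})
    return urlopen(Request(base_url + "?" + qs, data=body)).read()

# 1. Run the expensive step on every shard concurrently.
with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    partials = list(pool.map(lambda url: run(url, HEAVY_STEP), SHARDS))

# 2. Merge the per-shard results into a single table on the collector node.
run(COLLECTOR,
    "CREATE TABLE IF NOT EXISTS trade_stats "
    "(symbol String, avg_price Float64, max_price Float64) "
    "ENGINE = MergeTree ORDER BY symbol")
for chunk in partials:
    run(COLLECTOR, "INSERT INTO trade_stats FORMAT CSV", body=chunk)
```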
And the general idea that you can use an analytical database as a computational processing engine is kind of cool, I think. We haven't seen any other client doing the same since that time, but that will probably change, especially given that ClickHouse has been getting closer to the machine learning world recently. New features are being added that bring some machine learning models into ClickHouse, and we are investigating ways to integrate more closely with machine learning capabilities and so on. So it may change, but a year ago it was a real kind of wonder.
And I know of some other companies who migrated their data analytics and machine learning algorithms from Spark to ClickHouse, and, again, the performance and hardware difference was close to 1 to 20. And another thing that I didn't ask about yet is
[00:58:56] Unknown:
whether ClickHouse has any concept of multi-tenancy, where you might have different groups within an organization that want to be able to leverage the capabilities of ClickHouse but still maintain some sort of ownership of the data, and then also be able to grant either restricted access or federate across different groups of data within the overall platform so they can do higher level aggregates across those data sources?
[00:59:22] Unknown:
Yeah, that's a really interesting question that actually comes up a lot. What you're referring to is an authority model that would allow people to share data safely across tenants, and I think at this point it's fair to say, to be honest, that ClickHouse does not have strong capabilities in this area. But there are a couple of interesting features coming down the pike that will make this possible. The first is table and column level security. This is a standard feature that you have in existing databases like Oracle, and it's also in cloud data warehouses like Snowflake.
There it is very well developed. So this is on the road map and will hopefully be coming sometime this year. Beyond that, you need to control resource usage, and that is something that will be arriving in ClickHouse a lot sooner: specifically, resource pools that allow you to restrict things like CPU, RAM, and storage IO or IOPS, basically to control these for groups of users. That capability is fundamental to allowing different groups of users to safely share a large system. It's something we're actually working on right now, and we expect to see it merged
[01:00:58] Unknown:
into ClickHouse by the end of this year. And as far as your overall experience of working on ClickHouse and helping customers get instances deployed and put into production, I'm wondering what you have found to be some of the most challenging aspects and some of the most interesting or unexpected or valuable lessons that you've learned in the process.
[01:01:17] Unknown:
I will just focus on one place that I think is a challenge for many systems, and that is security. We like to say ClickHouse is a young system, so it has great speed and great portability, but the early use cases involved very large quantities of data that didn't have, you know, big security considerations. For example, the data was not considered subject to GDPR, things like that. So definitely, being able to build systems that can pass PCI DSS, for example, for credit card information, or get past corporate infosec teams, that's a challenge. We actually just recently got our first contract to add our first security-related feature: log cleansing, to keep sensitive data out of logs. So we're looking forward to stepping down that path. But in the meantime, if people are building systems that need to meet PCI DSS requirements, what they have to do with ClickHouse is think whole-system: think about security groups, think about network security, and the things you can do to create a trusted environment where it can safely run. So that's definitely a major design challenge for ClickHouse. The learning for me was that, in talking to customers, it can actually be done. We don't have to have all these features yet; there are ways to use ClickHouse that are quite secure, but it does require some additional thinking in your system design, and we hope to reduce that thinking
[01:02:51] Unknown:
over the next year or so. You mentioned that ClickHouse is still a bit of a young system. And, Alexander, you said that part of the reason for founding Altinity was to address some of the sharp edges of ClickHouse itself. So I'm wondering, at this point, what you see as being the shortcomings of ClickHouse and how you address them in terms of the products and services that you provide at Altinity?
[01:03:15] Unknown:
That's a good question. One shortcoming Robert just mentioned is security. ClickHouse has been developed inside an enterprise without strict security requirements, and it is still missing a lot of features that any other major database has. That is one of the things we're working on. Another area that is still not as good in ClickHouse is storage management, and there are probably a few others. In all cases, this comes from the fact that ClickHouse was originally developed by a single company with its own set of requirements and its own view of the product.
Even after three years in open source, that has changed somewhat, but not completely. ClickHouse development is still mostly driven by the core team, and we're working very closely with the core team and the community to shift the agenda and road map to better address the needs of the
[01:04:28] Unknown:
wider audience and wider scope of users, and we're working pretty hard on that, actually. Probably, Robert, you can add something. Yeah, absolutely. Tiered storage, I think, is one of the most important features. What you typically want to do is have hot data, which you may want to keep on SSDs; older data, maybe after a week, that you're content to move to hard disks; and then, after some lengthy period of time, data that you just don't refer to very much, maybe kept for regulatory reasons, that you'd actually like to drop onto object storage. That's probably one of the biggest things we hear from users, to give people the capability to do that. We are working very closely with the Yandex core committers, as well as our own team, to design a solution for that and get it implemented. I think that's going to be a huge step forward, but there are many others. In fact, the flip side of the shortcomings is that people want to use ClickHouse even with the things that are missing, and that has created a market demand to round out ClickHouse. That's actually one of the biggest things we do in our business: we help people identify the areas that are still not mature, then either work around them or actually get budget so that we can fix them. And it's evolving pretty quickly. We've done a number of PRs this year, and we expect the rate at which we're adding features to double or triple by the end of this year. And what are the cases when ClickHouse is the wrong choice and you'd be better suited by choosing either a different database engine or just a different
[01:06:13] Unknown:
overall system architecture?
[01:06:14] Unknown:
It's not an OLTP database. Sometimes people come to it and think, okay, can I make it do both? ClickHouse really is designed to do analytics; we're not trying to compete with Postgres to, like, run a bank or deal with money. I think there's another case that's more subtle, which is within analytics. If you have a lot of concurrent point queries, where you just want to get the data from a single row, ClickHouse is not a good solution for that. The reason is that we're fast because we use sparse indexes; that's one of the big reasons we don't construct these large indexes that you then have to maintain and read. But what that means is that if you ask for a single row, there's a good chance you may read 8,000 values and then decide which one you actually want.
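To make the sparse index point concrete, here is a tiny hypothetical sketch. A MergeTree table keeps one primary index entry per granule (8,192 rows by default), so even an exact-match lookup has to read and filter at least one whole granule; the table, data, and local server URL below are illustrative assumptions.

```python
# Illustration of why point lookups are relatively expensive with a sparse index.
# MergeTree keeps one index entry per granule (8192 rows by default), so an
# exact-match query still decompresses and scans a whole granule to find one row.
# The table, data, and server URL here are hypothetical.
from urllib.request import Request, urlopen

URL = "http://localhost:8123/"

def execute(sql: str) -> str:
    """POST a statement to the HTTP interface and return the raw text result."""
    with urlopen(Request(URL, data=sql.encode())) as resp:
        return resp.read().decode()

execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(user_id UInt64, ts DateTime, payload String) "
    "ENGINE = MergeTree ORDER BY (user_id, ts)"  # sparse primary index over (user_id, ts)
)

# The sparse index narrows the scan to a few granules, but ClickHouse still
# reads ~8192 rows from each matching granule and filters them to find the row.
print(execute(
    "SELECT payload FROM events "
    "WHERE user_id = 42 AND ts = '2019-07-01 00:00:00'"
))
```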
[01:07:04] Unknown:
So that's a case where another database, Redis for example, might be an appropriate solution. And are there any other aspects of the ClickHouse database itself, or the use cases that it enables, or the work that you're doing at Altinity that we didn't discuss yet that you'd
[01:07:23] Unknown:
like to cover before we close out the show? Yeah, I would just summarize by saying that the thing I really like about ClickHouse is the speed and the scaling. But let's be honest, there are a bunch of databases that can do that, and that alone doesn't make you unique. What I think really creates this huge opportunity is, first, that ClickHouse has a permissive open source license, Apache 2.0. That's a really great feature for many people; it means it can be used very flexibly in a wide range of applications. Thing number two is that it's very portable. We already talked about how it runs from bare metal to containers to clouds, and it runs well in all of those environments. And the third is that it's relatively simple to manage, and I think this is something it inherited; the ethos of MySQL is sort of embedded in the design of this system, and the idea is just to make things easy for users.
Make it easy to install. Make it relatively easy to manage. Don't build enormous complexity that requires DBAs. This is a database that can be used, deployed, and operated by developers, and I think that's a really powerful feature.
[01:08:34] Unknown:
Alright. And for anybody who wants to get in touch with either of you or follow the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, I'm going to throw one out there: I think the biggest gap that I see globally in the industry
[01:08:57] Unknown:
is that we've got two separate worlds, one of analytic databases and data management, and a world running in parallel that's doing machine learning and AI. I think the most interesting area, and the place where we need new tooling and new ways of approaching things, is in merging those two worlds. It's an interesting problem because it's not purely technical. There are big challenges in embedding ML in databases, for example, but there are also human challenges, because these problems have up till now tended to be solved or managed by different groups. You'll have, for example, your Python data stack folks, and you'll have your DBAs, or developers who are more familiar with traditional databases. So that, to me, is the big and really interesting problem that we're all kind of circling around and would love to solve.
[01:09:52] Unknown:
I completely agree with Robert. We have discussed it multiple times: there is a huge disconnect between those two worlds, and there is no evidence that it's going to be fixed anytime soon. There are different attempts to do something that glues the two worlds together, but it's not about technology; it's more about mindset.
[01:10:17] Unknown:
So it will take a while to do something here. But it's a problem we're really excited about because, well, we like math, and this basically involves, coming from the database world, a mind shift toward understanding machine learning and AI. We're really excited about learning more about that, and also more generally excited about helping to solve this problem. It's just a really interesting problem, and there's huge value
[01:10:46] Unknown:
in being able to combine these two worlds. Alright. Well, thank you both for taking the time today to join me and talk about the work that you're doing on ClickHouse and the work that you're doing at Altinity to help support that database. It's definitely an interesting piece of technology and one that I plan to take a closer look at and possibly use for my own purposes. So thank you both for all of that, and I hope you enjoy the rest of your day. Thank you, Tobias, very much. Thank you, Tobias. It's been great.
Introduction to ClickHouse with Robert Hodges and Alexander Zaitsev
Robert Hodges' Journey in Data Management
Alexander Zaitsev's Experience with Analytical Databases
What is ClickHouse?
Primary Use Cases for ClickHouse
ClickHouse in the Broader Database Ecosystem
Scaling and Performance Capabilities of ClickHouse
Data Lifecycle in ClickHouse
Clustering Capabilities of ClickHouse
Data Safeguards and Deletion Protection
Loading Data into ClickHouse
Schema Evolution and Handling Malformed Data
Advanced Capabilities and SQL Extensions
Getting Started with ClickHouse
Deploying ClickHouse in Production
Interesting Use Cases and Customer Stories
Multi-Tenancy and Data Ownership
Challenges and Lessons Learned
Shortcomings and Future Directions
When ClickHouse is the Wrong Choice
Final Thoughts and Closing Remarks