Summary
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about ClickHouse, an open source, column-oriented database for fast and scalable OLAP queries
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what ClickHouse is and how you each got involved with it?
- What are the primary use cases that ClickHouse is targeting?
- Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
- Can you describe how ClickHouse is architected?
- Can you talk through the lifecycle of a given record or set of records from when they first get inserted into ClickHouse, through the engine and storage layer, and then the lookup process at query time?
- I noticed that ClickHouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for ClickHouse?
- Aside from directly inserting a record via the client APIs can you talk through the options for loading data into ClickHouse?
- For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to ClickHouse?
- What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to ClickHouse?
- For someone getting started with ClickHouse can you describe how they should be thinking about data modeling?
- Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does ClickHouse fall along that spectrum?
- How is scaling in ClickHouse implemented and what are the edge cases that users should be aware of?
- How is data replication and consistency managed?
- What is involved in deploying and maintaining an installation of ClickHouse?
- I noticed that Altinity is providing a Kubernetes operator for ClickHouse. What are the opportunities and tradeoffs presented by that platform for ClickHouse?
- What are some of the most interesting/unexpected/innovative ways that you have seen ClickHouse used?
- What are some of the most challenging aspects of working on ClickHouse itself, and/or implementing systems on top of it?
- What are the shortcomings of ClickHouse and how do you address them at Altinity?
- When is ClickHouse the wrong choice?
Contact Info
- Robert
- Alexander
- alex-zaitsev on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- ClickHouse
- Altinity
- OLAP
- M204
- Sybase
- MySQL
- Vertica
- Yandex
- Yandex Metrica
- Google Analytics
- SQL
- Greenplum
- InfoBright
- InfiniDB
- MariaDB
- Spark
- SIMD (Single Instruction, Multiple Data)
- Mergesort
- ETL
- Change Data Capture
- MapReduce
- KDB
- OLTP
- Cassandra
- InfluxDB
- Prometheus
- SnowflakeDB
- Hive
- Hadoop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Your host is Tobias Macey. And today, I'm interviewing Robert Hodges and Alexander Zaitsev about ClickHouse, an open source, column-oriented database for fast and scalable OLAP queries. So, Robert, can you start by introducing yourself? Hi. My name is Robert Hodges, and I'm CEO of Altinity. We do commercial
[00:02:12] Unknown:
software and support for ClickHouse.
[00:02:14] Unknown:
And, Alexander, can you introduce yourself?
[00:02:16] Unknown:
Yes. My name is Alexander Zaitsev. I'm CTO of Altinity and, actually, one of the cofounders. We provide commercial ClickHouse support and help everybody get the most out of this excellent database technology.
[00:02:33] Unknown:
And going back to you, Robert, do you remember how you first got involved in the area of data management?
[00:02:38] Unknown:
I do. It was with a database called Model 204 back in 1983. This was a pre-relational system that was really popular with the US government and military, and I loved it. I thought it was the coolest piece of software I'd ever used. And since then, I've worked in and around databases through the present time, including working at Sybase and working on clustering and replication for MySQL, plus many others. So actually, ClickHouse, I think, is database number 20 for me.
[00:03:08] Unknown:
That's impressive. And, Alexander, do you remember how you first got involved in the area of data management?
[00:03:14] Unknown:
Yes. I certainly don't have as much experience as Robert. I started to work with databases in the late nineties, so probably 20 years already, and that was with various kinds of databases. But since 2007 or 2008 I have mostly been working with different analytical technologies, and this is probably the most interesting area, at least for me, in databases. Over the years I worked with a number of cool databases, including Vertica, which, to the best of my knowledge and experience, is the best commercial database for analytics at the moment.
But later on we discovered ClickHouse. Actually, it's three years already since ClickHouse was open sourced, so it's three years since it has been available in public, and we started to work with ClickHouse very closely, looked into it, and eventually migrated a huge system to it. That was a great success for the company I worked for. But I'm probably getting a little bit ahead of the schedule of our talk. During this exercise I realized that ClickHouse is a great technology with great potential, but it still lacks a lot of features and support compared to commercial solutions.
And that's why I founded Altinity, to fill the gap and to make ClickHouse successful and usable for enterprises.
[00:04:48] Unknown:
And so can you dig a bit more now into what ClickHouse itself is? And you mentioned a little bit about how you got involved with that, Alexander, but if you can also share that information, Robert. Yeah. Absolutely. So the way I got involved with it was that Alexander had been bugging me for about 3 years
[00:05:03] Unknown:
about this great database and that I should come look at it. He and I met in 2002 and have been friends for about 17 years, and we have stayed in contact as we've moved across different companies. And I think the real trigger was that, in addition to hearing from Alexander, I had been working at VMware for 4 years after selling a previous company to them. I was working in the business unit that was involved in developing software for VMware partners, and it became clear that ClickHouse just looked like a great fit for a service that these partners could run. And that was really what triggered the thinking of, hey, this looks like something I should sign up for, because this database runs really fast and is very easy to manage. It looked like a really interesting business opportunity. And so I joined the company on January 1st and have been working together with Alexander ever since.
[00:06:02] Unknown:
And so I know that ClickHouse is an open source and column oriented database engine, but I'm wondering what the primary use cases are that ClickHouse is targeting and that it enables.
[00:06:13] Unknown:
Let me answer this one. From the very beginning, ClickHouse was developed to power probably the second biggest web analytics platform in the world, which was developed at Yandex. Yandex is a huge Russian Internet company, and it has an analytics platform, Yandex Metrica, which you can think of as something similar to Google Analytics. They have a lot of users, a lot of traffic, and so on. So the primary use case for ClickHouse was to power this web-based analytics platform. Its first use case was web analytics, certainly. But it happened to be a very flexible system with a lot of features, extensibility, and so on. So right now ClickHouse is used for a variety of use cases.
All kinds of analysis on big data you can think of, starting from web and mobile analytics, time series, and data analysis. It's used in finance for stock market analysis. It's used even for blockchain analytics. There are use cases for ClickHouse in manufacturing, in medicine, in the retail industry. So wherever you have a lot of data that you need to put into some database in a cost-efficient way and to run fast, real-time queries, ClickHouse is a good choice. The industries and different companies recognize it, and we have seen a lot of growth in the last months.
[00:07:47] Unknown:
Oh, if I could extend that just a little bit. I think one of the things that's interesting is that, even though it started with this initial web analytics use case, one of the strengths of ClickHouse is that it is designed to be a general-purpose data warehouse, so it is SQL based. It can do joins. It has a number of very interesting capabilities, like materialized views, that make it suitable for a pretty wide range of problems outside that initial use case.
[00:08:16] Unknown:
And in terms of the broader market for database engines, I'm wondering if you can characterize the position of ClickHouse and how it fits in relation to both commercial and open source databases, and also in terms of the broader big data ecosystem.
[00:08:33] Unknown:
That's a good question. And, actually, at the company that I worked for before, LifeStreet, we were investigating a lot of options before we came to ClickHouse. And there are not too many. So if you go for an analytical database, you don't have too many options. In the commercial world there is Vertica, Greenplum, which is open source now but was commercial originally, Redshift and Snowflake if you go to the cloud, and some bigger players like Teradata or Oracle Exadata and so on. Teradata is very expensive, Vertica is moderately expensive, and cloud solutions have their pros and cons.
In the open source world the situation is even worse. There is only Greenplum right now, and there were several companies who tried to make efficient open source analytical SQL solutions, but they were not successful. One of them is Infobright, and they built quite an efficient data warehouse that you could put onto your desktop. It was a single-server solution, but they couldn't go for a cluster. Another was InfiniDB, which is currently a part of MariaDB as MariaDB ColumnStore. And there were a couple of others, but none of them was successful enough, for a variety of reasons. The technology is complex to develop, it's not easy to market, and so on. So ClickHouse is probably the only one that was backed by a big company like Yandex and had a good team of engineers to execute it. So it evolved.
And if you're asking about positioning, it certainly sits among the massively parallel processing databases. So these are databases that are built for analytics. They are typically column stores, typically with an architecture that scales very well with your hardware resources, CPU cores, memory, storage, or across a cluster, and so on.
[00:11:09] Unknown:
Yeah. And I would also say that we don't just compare ourselves to column stores. One very common way to process data, particularly in batch, is to use Hadoop or Spark, and we see a lot of that. For example, for people who are using Elastic MapReduce on Amazon, there's a huge amount of usage of that. But the big difference is that ClickHouse is real time, or very close to it. Whereas even with Spark, which is much faster than traditional Hadoop, you could be waiting many, many seconds or even minutes to get an answer to analytic questions. So there's definitely a big advantage to using this database.
[00:11:52] Unknown:
And can you talk through how ClickHouse itself is able to handle these scaling and performance capabilities?
[00:12:01] Unknown:
Sure. I'll take that one. So ClickHouse, as we said, is a column store. It has very similar design choices to Vertica. What does that mean specifically? When you load data into a table, instead of storing it as rows the way you do in a database like MySQL, for example, what we're going to do is break up the columns and store them as arrays. Those arrays are stored in individual files, and they're highly compressed. So when we do a query, we will actually only read the columns that are mentioned in the query, and the decompression is very efficient. Moreover, we have optimizations as we're traversing the arrays. The core of the system is C++, and we'll use SIMD instructions whenever possible so that, for example, for summing integers, instead of just summing one of them at a time, we will push as many of them onto registers as we can, process them with SIMD instructions, and get speedups there.
The second thing we do is that we have indexes, which are built automatically, that allow us to locate the parts of the columns that we should be reading. So we're able to drop out a lot of data there. We also have secondary index structures that allow us to be even more specific about what we read. So the amount of data that we process is significantly reduced. Third, we're really efficient about breaking up those arrays and dispatching work onto all cores, so that, for example, when you run ClickHouse on a single node, if you run a utility like htop and you have your query configured to use all resources, you will see this light up in every single core, including cores with hyperthreads. Every single core will go to 100%.
And then finally, we're MPP enabled. That means we can spread queries and data across multiple nodes. Moreover, ClickHouse has both built-in sharding as well as batteries-included replication. So we're able to distribute the data over both shards and replicas within those shards, and ClickHouse will automatically find the appropriate parts of the data, query them, federate the results, and pass them back to you. So there's not one thing in the architecture here. There are multiple things that are all designed to work together to produce very fast query results over enormous volumes of data.
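[Editor's note: to make the column-store design described above concrete, here is a minimal sketch, not from the episode, of creating a MergeTree table and querying it from Python. It assumes the community clickhouse-driver package and a local server; the host, table, and column names are hypothetical.]

```python
# Minimal sketch of the column-store ideas discussed above, using the
# community clickhouse-driver package (pip install clickhouse-driver).
# The host, table, and column names are hypothetical examples.
from clickhouse_driver import Client

client = Client(host='localhost')

# Each column is stored as its own compressed file. ORDER BY defines the
# single "clustering" (primary) sort key, and PARTITION BY controls how
# the data is split into parts on disk.
client.execute("""
    CREATE TABLE IF NOT EXISTS pageviews (
        event_date  Date,
        event_time  DateTime,
        user_id     UInt64,
        url         String,
        duration_ms UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
""")

# Only the columns named in the query (event_date, duration_ms) are read
# from disk; the other columns are never touched.
rows = client.execute("""
    SELECT event_date, avg(duration_ms)
    FROM pageviews
    GROUP BY event_date
    ORDER BY event_date
""")
print(rows)
```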
[00:14:44] Unknown:
Yeah. The vectorization capabilities are definitely particularly interesting for being able to parallelize and compact the overall number of operations necessary to retrieve or manipulate any individual piece of data, so I'm sure that's one of the major capabilities. And then in terms of the overall lifecycle of data as it flows through ClickHouse, going from inserting an individual record or a batch of records through to the storage engine, and then on the query side coming back out, I wonder if you can just talk through the overall lifecycle and process flow of how that data is managed.
[00:15:23] Unknown:
Yeah. It is interesting how ClickHouse handles that, because the data, when it's stored, needs to be sorted in order to be queried efficiently and fast. You can think of ClickHouse as a database with a single clustering index. So the data is stored sorted at rest, and the clustering index allows ClickHouse to locate data efficiently when this index is used. But the challenge comes when new data arrives and you need to keep the data in the same sorted structure. So what ClickHouse does is the following. With every new insert, and we typically insert not a single record but some batch of records,
hundreds of thousands to millions in one insert statement for good performance in actual production workloads, every insert generates a new part of a table, and it's actually called a part in ClickHouse terminology. And this part is internally sorted by the primary sort order. But once you insert more and more data, you have a lot of those parts. So at query time you would have to check all the parts for your data, load it from the old parts and the new parts, and then present the results to the user. That would not be very good in the long term. So what ClickHouse does is start to merge these parts of the data into bigger ones. When a new insert comes, we have a new block of data, ClickHouse picks up some old part of the data, merges them together, and produces a new part.
And this process, which is called merge, happens all the time in the background, so it constantly merges the data. It's actually a merge sort operation. It is a bit more complicated than that, because ClickHouse has to do it per single column: it has to decompress the data first and then compress it afterwards. But in a nutshell, it's a merge sort, which is a very efficient operation that can be performed in a streaming fashion. And during this merge process there is no interference with queries, so it's absolutely transparent for the user.
The new parts of the data just eventually replace the old ones, and the queries run faster all the time. So your data is always pretty well sorted on disk. It's not always going to be a single sorted part or single sorted piece of data, but a small number of those. And this is a very important concept in ClickHouse. Something similar exists in Vertica, but Vertica has a more complicated design with in-memory structures and disk structures. ClickHouse always stores data on disk, and this is the main difference. Actually, it's not really a difference: you have to store data on disk if you're going for huge data.
You don't have that much memory. So because everything is stored on disk, the data is always sorted, but sometimes in several parts, which are being merged in the background all the time. I think that's the main idea.
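[Editor's note: a rough way to observe the parts and background merges described above, again a hedged sketch reusing the hypothetical pageviews table and the community clickhouse-driver package.]

```python
# Observe parts and merges on the hypothetical `pageviews` table.
from datetime import date, datetime
from clickhouse_driver import Client

client = Client(host='localhost')

# Each batched INSERT creates a new sorted part on disk.
batch = [
    (date(2019, 7, 1), datetime(2019, 7, 1, 12, 0), 42, '/home', 125),
    (date(2019, 7, 1), datetime(2019, 7, 1, 12, 1), 43, '/docs', 480),
]
client.execute(
    'INSERT INTO pageviews (event_date, event_time, user_id, url, duration_ms) VALUES',
    batch,
)

# system.parts shows the active parts of the table; background merges
# combine them into fewer, larger sorted parts over time.
parts = client.execute("""
    SELECT partition, name, rows
    FROM system.parts
    WHERE table = 'pageviews' AND active
""")
print(parts)

# OPTIMIZE forces a merge right away (normally the background merge
# process takes care of this on its own).
client.execute('OPTIMIZE TABLE pageviews FINAL')
```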
[00:19:16] Unknown:
And as far as the clustering capabilities, is it a sort of master/minion type of architecture, or is it an active-active setup where any instance can accept writes and then those writes are distributed across the cluster for durability? I'm just wondering how that plays into the overall data flows. It's active-active. So what you have is
[00:19:37] Unknown:
you have shards. And actually, it's kind of interesting how ClickHouse does this, because the shards are just defined through a configuration file that says, hey, we have this cluster, and my shards are located here, here, and here. Now you can extend that definition to also say each of those shards should have 2, 3, or 4 replicas. In that case, what happens is you have to have ZooKeeper running with ClickHouse. And then each time a part gets inserted to a particular replica, a note is made in ZooKeeper that the other replicas can then refer to and recognize, hey, there's something that we need to replicate.
There's a part that needs to be replicated over to our side of the fence. So the replication works such that you can basically insert to any replica, and it will eventually make its way to all the other replicas. So it's kind of an eventual consistency replication model.
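[Editor's note: a hedged sketch of the sharding and replication setup Robert describes. The cluster name, ZooKeeper path, and macros below are placeholders that would normally come from the server configuration; the tables are illustrative.]

```python
# Hypothetical sketch: one replicated local table per node plus a
# Distributed table that fans queries out across the cluster.
from clickhouse_driver import Client

client = Client(host='localhost')

# The {shard}/{replica} macros identify which replica this server is;
# inserts to any replica are copied to the others via ZooKeeper
# (eventually consistent).
client.execute("""
    CREATE TABLE IF NOT EXISTS events_local (
        event_date Date,
        user_id    UInt64,
        url        String
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
""")

# The Distributed engine spreads queries and inserts over all shards of
# the cluster named in the config (here a placeholder called my_cluster)
# and federates the results.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_all AS events_local
    ENGINE = Distributed(my_cluster, currentDatabase(), events_local, rand())
""")
```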
[00:20:30] Unknown:
And another interesting feature that I noticed as I was preparing for the interview is the option for some data safeguards, such as deletion protection. So I'm wondering if you can talk through some of those safeguards that are available and how they factor into enabling different use cases for ClickHouse that would otherwise be intractable or impractical?
[00:20:57] Unknown:
It's interesting that you ask this question, because normally users are not concerned with those safeguards. But yes, ClickHouse has some guards against accidental deletes of tables and partitions. Because we're talking about big data, if somebody accidentally deletes a table, it might be quite expensive and take a long time to restore. So by default ClickHouse doesn't allow you to drop tables or partitions above 50 gigabytes in size, but it's certainly configurable and can be disabled. Speaking of data deletes in general, it's a common trait of analytical databases that they don't like deletes and updates too much.
Many analytical databases started out essentially append-only, and ClickHouse is not an exception. But currently you can delete or update data quite flexibly in ClickHouse. Probably not in the sense of traditional SQL, but in a slightly different ClickHouse way. Usually ClickHouse has a lot of its own ways to do certain types of things, which makes working with ClickHouse kind of interesting sometimes. But all of those decisions have
[00:22:30] Unknown:
good thinking behind them. And I would add to that that, as you go into analytic systems, you tend to think about deletes in a different way. The ClickHouse way to think about data deletion tends to be more along the lines of: hey, I've got enormous amounts of raw data coming in that I want to keep for a certain period of time, say 30 days or 7 days or something like that. But then, as time goes on, I want to maintain aggregates of that data that become increasingly sparse over time. So the way you would think about that is you would have the raw data coming in, and ClickHouse has a feature that allows you to assign a time to live, which is based on a SQL expression. A simple thing is to just keep it for 30 days, but it could also be based on some condition in the data itself. So you would do that for your raw data coming into the main fact table, and then you would build materialized views that automatically compute aggregates.
Those are computed from the time that the data is inserted. And then, even after the data gets deleted out of the main table, you would continue to have the materialized views. So that's kind of the ClickHouse way of thinking about that deletion problem.
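[Editor's note: a hedged sketch of the pattern Robert describes, raw data with a TTL plus a materialized view that keeps aggregates after the raw rows expire. Table and column names are illustrative.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# Raw fact table: rows are dropped automatically 30 days after event_date.
client.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        event_date Date,
        user_id    UInt64,
        bytes      UInt64
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(event_date)
    ORDER BY (event_date, user_id)
    TTL event_date + INTERVAL 30 DAY
""")

# Materialized view: aggregates are computed at insert time and remain
# queryable after the raw rows have expired from raw_events.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_traffic
    ENGINE = SummingMergeTree()
    ORDER BY event_date
    AS SELECT event_date, sum(bytes) AS total_bytes
    FROM raw_events
    GROUP BY event_date
""")
```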
[00:23:45] Unknown:
And then as far as getting data into ClickHouse, you can obviously insert individual records at a time, but for large volumes that's impractical. So I'm curious what the other loading paths are for getting data into ClickHouse in a scalable and maintainable fashion? You know, this is a great question, because ClickHouse has such a variety
[00:24:06] Unknown:
of ways of loading data that I'm sure I won't be able to remember them all, and Alexander will have to help me. So just as far as the APIs and, in effect, basic inserts, the one really important ClickHouse principle is that, like many data warehouses, you don't want to insert single rows. You want to insert in batches, say, of 10,000, 20,000, 30,000, something like that. So that's the first thing. There are a couple of different ways you can do this through the APIs that I'll just mention in passing. Very commonly, you can insert using HTTP requests. There's also a native API, or what we call the native TCP format, which inserts data that is already typed and in column order, so that results in very efficient inserts. And when we think about using the APIs, you can, of course, write a Python program which invokes these.
But the first-order approach is to actually just use the ClickHouse client, which is a command line client, and it has very efficient streaming of different data formats. So for example, it can read CSV. It can read JSON; there's a format called JSONEachRow with one JSON object per line. There are various other formats that it can read in. And the great thing about it is you just cat the file and basically pipe it into the ClickHouse client, and it just blasts it up to the ClickHouse server at full speed. And you can run many of these in parallel. We do that commonly for data loading. However, beyond that, there are many other ways to read data directly from the ClickHouse server. There are these things called table functions that allow you, for example, to go out and directly read CSV off a URL.
You can read data directly out of Kafka queues. In fact, ClickHouse also has what are known as table engines, which can basically take anything in the outside world and make it look like a ClickHouse table. So a common way to read from Kafka is to use the Kafka table engine, which then makes a Kafka queue look like something you can select out of. So that's another way to put data in. Two other very interesting ways to load data are that we can go directly to Postgres as well as to MySQL, and we can read data directly out of those tables.
So that's yet another way that you can pull data in and actually store it in ClickHouse. And then finally, sometimes you don't even need to read the data at all. For MySQL, for example, ClickHouse very commonly uses what are called dictionaries, which are reference tables that are loaded into memory. But the base for those reference tables can be another database like MySQL. So one of the things you might do, if you have a big fact table and a bunch of smaller dimension tables, say, for example, containing product SKUs or something like that, is to leave them in MySQL and then just define a dictionary which will read them from the MySQL database. And every time they're referred to, ClickHouse can just pull the data in, and you'll be able to access the data directly from MySQL.
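[Editor's note: a hedged sketch of a few of the loading paths mentioned above, using the community clickhouse-driver package. The target table, URL, and MySQL connection details are placeholders.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# 1. Batched insert over the native TCP protocol: thousands of rows per
#    INSERT statement rather than one row at a time.
#    (Assumes a hypothetical table events_all(event_date, user_id, url).)
rows = [('2019-07-01', 42, '/home'), ('2019-07-01', 43, '/docs')]
client.execute('INSERT INTO events_all (event_date, user_id, url) VALUES', rows)

# 2. The url() table function reads CSV straight off an HTTP endpoint,
#    so ClickHouse pulls the data itself instead of having it pushed in.
client.execute("""
    INSERT INTO events_all
    SELECT event_date, user_id, url
    FROM url('https://example.com/events.csv', 'CSV',
             'event_date Date, user_id UInt64, url String')
""")

# 3. The mysql() table function queries a MySQL table in place, without
#    copying it into ClickHouse first.
skus = client.execute("""
    SELECT * FROM mysql('mysql-host:3306', 'shop', 'product_skus', 'user', 'password')
""")
```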
[00:27:15] Unknown:
Yeah. So in a lot of ways, it looks like you're inverting the ETL model, where rather than pushing data into the database, you're actually using the database engine to pull data from other sources, which I think — Correct. — simplifies things in a lot of ways. Correct. Yeah. It's a really interesting model, and I think that it also gives you some flexibility,
[00:27:31] Unknown:
because we talked a little bit about deletes earlier on, and one place where not being able to delete things easily, or in the conventional SQL way, can be a problem is specifically dimension tables. That's a good example of a place where you may want to update things on a regular basis, and in that case keeping them in something like MySQL turns out to be very convenient.
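[Editor's note: a hedged sketch of the dictionary pattern described above, a small MySQL dimension table exposed to ClickHouse queries. Dictionaries are classically defined in the server's XML configuration; newer ClickHouse releases also accept a CREATE DICTIONARY DDL like the one below. All names and credentials are placeholders, and a hypothetical sales table with a UInt64 sku_id column is assumed.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# A dictionary backed by a MySQL table; the data keeps living (and being
# updated) in MySQL, and ClickHouse refreshes it periodically.
client.execute("""
    CREATE DICTIONARY IF NOT EXISTS product_dict (
        sku_id UInt64,
        name   String
    )
    PRIMARY KEY sku_id
    SOURCE(MYSQL(host 'mysql-host' port 3306 user 'user' password 'password'
                 db 'shop' table 'product_skus'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600)
""")

# dictGet() looks the attribute up in memory at query time.
rows = client.execute("""
    SELECT sku_id,
           dictGet('product_dict', 'name', sku_id) AS product_name,
           sum(amount)
    FROM sales
    GROUP BY sku_id, product_name
""")
```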
[00:27:56] Unknown:
And when I was looking through the documentation, it seems that you can actually set up ClickHouse to be a replication target for MySQL or Postgres to simplify the change data capture approach of being able to pull data in as it's updated in the database. I'm wondering if I'm accurate in that interpretation of the documentation.
[00:28:14] Unknown:
That's correct. And we did — oh, Alexander, you wanna take that one?
[00:28:18] Unknown:
Yes. So that's correct, you can do it. But right now it's not a native ClickHouse feature. So you need to install an external tool or software in order to listen for replication logs or something like that on the MySQL or Postgres side and then push the data into ClickHouse. But we're going to change that, actually, and we envision that ClickHouse will be able to connect to MySQL or Postgres directly as a replica and listen to changes to the tables of interest. And once it's implemented, it's actually a very cool feature that we are looking forward to having, because it allows you to speed up your MySQL or Postgres analytics very efficiently with very little effort.
[00:29:14] Unknown:
And, yeah, this is a feature we love. We've had a number of conversations with people about it. Because if you look at MySQL or Postgres applications, it's very common to see one or two large tables, and in some applications they'll grow to hundreds of millions of rows, because they might be, you know, events from marketing campaigns or something like that. And then you'll have a bunch of little tables, a long tail in traditional database schemas. So what this feature would do is allow us to subscribe to changes on those big tables. They would be brought over and be available for processing queries.
But then, because we can integrate with the smaller tables through dictionaries and other features like that, we can go back and, as necessary, just reference the data from the small tables that are living in MySQL. So this just feels like a really compelling and interesting way of accelerating analytics on existing MySQL databases. And it's one we're definitely looking forward to getting implemented.
[00:30:18] Unknown:
And one of the interesting challenges posed by the fact that you can pull data from these external sources or subscribe to these replication feeds is the idea of schema evolution, and being able to evolve the schema in ClickHouse itself when the data formats change in those external systems, or when you have maybe malformed data that's coming through, for instance, in a Kafka topic. So I'm curious how you approach that challenge in your own experience of running ClickHouse.
[00:30:48] Unknown:
Yeah, let me jump in on this, because I actually worked on replication for MySQL for many years. So the short answer is that's an exercise for the reader right now in ClickHouse. This is definitely something we're looking at. ClickHouse actually does have very flexible schema management. You can do ALTER TABLE to add columns, change data types, things like that. It does require that you have the ability to intercept the schema changes in the replication log. And, certainly, the MySQL replication log has this information. The trick is to see it coming and then correctly process the change. So that doesn't exist yet, and that's one of the reasons why we've actually had discussions with users about potentially replicating all data out of MySQL, but it's exactly that schema change that makes this a difficult problem.
So, for now, I think when people are replicating, it's better to do large tables where the schema changes are few and far between, and you should be able to manage them. I think this will be improved over time, but it's just generally a very hard problem. And in particular, it is very hard to solve when you have heterogeneous replication, because there will be schema changes on the source side that are simply very difficult to implement on the target. So I foresee that as always having a somewhat manual component, even as our replication capabilities improve.
[00:32:29] Unknown:
Yeah. Schema evolution in general is a hard problem that doesn't appear to be fully solved in any one system, and it just gets exacerbated when those schema evolutions are bridging different systems, as is the case when you're pulling data from these source systems into a data warehouse, or in the event of using a data lake where you have data landing in different schemas and then need to be able to manipulate and analyze across the whole historical record.
[00:32:54] Unknown:
Yeah. It's just not an easy problem to solve. So I think we would be super happy if we could just get this feature in, where we can at least subscribe to the ongoing data, including inserts, updates, and deletes on large tables, and then basically have ClickHouse exist as an archive. I think that one's tractable to solve. And then once we've made progress on that, we'll look beyond it to what else we can do. And so beyond the core engine of being able to store and scale these
[00:33:25] Unknown:
data storage and analytics requirements, I know that there are also some interesting and advanced capabilities that are built into the engine, such as different SQL extensions and custom data types. And so I'm wondering if you can enumerate some of the more interesting ones that are unique or particularly well implemented in ClickHouse.
[00:33:44] Unknown:
Table engines in ClickHouse are probably, in general, not too interesting by themselves. ClickHouse does have a lot of different table engines. It has table engines for its own internal structures, and it has table engines for external data as well. In particular, you can have a MySQL table inside ClickHouse, and the data will live in MySQL, but you can read it from ClickHouse like a ClickHouse table. But that's probably not the most interesting part. The really interesting thing for me in ClickHouse is its ability to aggregate data in real time automatically, using special aggregating engines that are extensions of the MergeTree engine, which is the main engine at the core of ClickHouse and which does those merge operations in the background to keep the data sorted all the time. During the merge, it can also roll up data. So you can define that your table is actually an aggregate: you have some columns that you use as the base for your aggregation and some other columns that you want to roll up, applying different aggregate functions to different metrics. And ClickHouse will do it automatically in the background.
This is one of the coolest features ClickHouse has, to me at least, because it allows you to build very efficient systems. Say you have data flowing into your system at a huge rate, like a million rows per second, and it's being accumulated in a huge table. This is a good table to look into sometimes and probably run some queries against, but not on a regular basis. Most of the time you typically need the data aggregated differently, by minute or by 10 minutes or by hour or by day. And you can build those aggregating structures, and ClickHouse will maintain the aggregates automatically, in real time. Once you insert the data, it is already available in those aggregates, and you can query it with sub-second response
[00:35:50] Unknown:
time. I don't know of any other system that has this exact feature. Another cool feature is arrays, and probably Robert will talk about that. I will talk about arrays. I love arrays. So there are a lot of databases out there that do support arrays, just as a data type. What's interesting is that ClickHouse really makes them first-class citizens in a couple of important ways. One is that ClickHouse has a feature called an array join, where you can basically compute aggregates which include a bunch of things stuck in an array. So for example, you could aggregate sales by month, and you could keep an array of all the regions that contributed to that. But then what you can do is construct SQL queries which will flip this out, so that the array goes from being an array of values stuck in a row to something it basically pivots around, and you can now have a new set of rows which then associate each region with values. So the array capabilities give you an important pivoting capability, whereas in many systems you actually have to pull data out and do the pivot yourself, for example using Python Pandas.
So that's one really cool thing that I love about arrays. Another thing is there are map functions that you can use to iterate across arrays and change values, sum them, things like that. So you get kind of a programming model even within the SQL queries that has a lot of similarities to the map/reduce constructs that I think people are used to if they're using JavaScript or Python. So there are a lot of really powerful capabilities there. There's one final feature that I think is totally cool that has just arrived in ClickHouse. It's still kind of maturing.
That's ASOF joins, and these are kind of an interesting feature. They exist in KDB, which is a very popular database for financial users. And an ASOF join allows you to join two tables that may, for example, contain values that have a key that is close, but not quite an exact match. The canonical example here is you have financial data where people are measuring market prices, say, every 10 seconds or so, and then you have weather data, which maybe is measured on an hourly basis. And what you'd like to do is join each row of financial data to the nearest weather-related record based on time, and the ASOF join allows you to do that. It's a totally cool capability, and it allows you to merge data and analyze data in a way that's almost impossible if you don't have it.
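[Editor's note: a hedged sketch of the two features Robert describes, ARRAY JOIN for pivoting array values into rows and ASOF JOIN for nearest-match joins on time. The monthly_sales, trades, and weather tables and their columns are illustrative placeholders.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# ARRAY JOIN: unfold each element of the regions array into its own row,
# pivoting array values out of a single row.
pivoted = client.execute("""
    SELECT month, region
    FROM monthly_sales
    ARRAY JOIN regions AS region
""")

# ASOF JOIN: attach the most recent weather reading at or before each
# trade's timestamp, even though the timestamps never match exactly.
joined = client.execute("""
    SELECT t.symbol, t.price, w.temperature
    FROM trades AS t
    ASOF JOIN weather AS w
    ON t.city = w.city AND t.ts >= w.ts
""")
```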
[00:38:46] Unknown:
Yeah. I think those sorts of fuzzy joins are definitely a very interesting capability, particularly for a system that's designed to work with these large quantities of data that might have some variability or some potential minor errors in generation or insertion, where you want to be able to just get an approximate value or perform some estimation across the data without necessarily knowing the exact value, or where values aren't necessarily going to match up precisely as would be the case with a typical relational system that you would use inside of an application.
[00:39:07] Unknown:
Yes. That's a good point. And ClickHouse, in addition to those ASOF joins, which I can't quite call fuzzy joins but which have very interesting semantics, has some more ways to do approximate calculations. One very useful feature is approximate uniques. This is quite common when you analyze a lot of data, with visitors specifically, when you are interested in the number of unique visitors. Traditionally you have to calculate a count distinct, if you wish, in SQL. In ClickHouse, you have a different way to do that more efficiently, with approximate unique calculations that sample the data using some hash functions.
They're pretty accurate for general use, and if you go to small numbers, then they are precisely accurate. It also supports sampling, so you can run queries with the sampling extension, and ClickHouse will take a small portion of the data to calculate the result, and then you just multiply by the sample size. And this is built inside the C++ code, so it's a native feature of ClickHouse. Actually, speaking of uniques, one of the coolest ClickHouse features here is that we can aggregate uniques. Normally, you cannot aggregate uniques for one bucket of the data and for another bucket of your data and then combine them together, because uniques just don't sum up.
ClickHouse found a cool way to sum up uniques, and you can build aggregates with uniques. I don't know of any other system that can do aggregates of uniques. In ClickHouse, you can. And that makes it a great choice for companies who depend on uniques, like companies that are working with large networks, for example, or IP telemetry, where you need to compute metrics like the number of unique devices or unique users.
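[Editor's note: a hedged sketch of the approximate and mergeable unique counts Alexander mentions, reusing the hypothetical pageviews table from the earlier examples.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# uniq() gives a fast approximate count-distinct (uniqExact is also
# available when precision matters more than speed).
client.execute(
    "SELECT event_date, uniq(user_id) FROM pageviews GROUP BY event_date"
)

# uniqState()/uniqMerge() keep a partial aggregation state per bucket,
# so daily states can later be combined into a monthly unique count --
# something plain distinct counts cannot do.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_uniques
    ENGINE = AggregatingMergeTree()
    ORDER BY event_date
    AS SELECT event_date, uniqState(user_id) AS users
    FROM pageviews
    GROUP BY event_date
""")

monthly = client.execute("""
    SELECT toStartOfMonth(event_date) AS month, uniqMerge(users) AS unique_users
    FROM daily_uniques
    GROUP BY month
""")
```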
[00:41:31] Unknown:
And for somebody who's getting started with setting up a ClickHouse system, can you talk through the overall onboarding process and some of the considerations that they should be thinking about as far as data modeling that would be different in ClickHouse than in another relational or column-oriented system that they might be used to? Yeah. I can take that one.
[00:41:51] Unknown:
I think getting started is just not very hard, because you can basically install ClickHouse in about 60 seconds. If you have a decent network connection, you just say apt install clickhouse-server and, bang, it shows up. So when you're getting started, the typical approach is to just jump in and install from packages; it's nice to have good storage, but it's not a big requirement. The really interesting question is around modeling, because what I find is one of the huge differences between ClickHouse and, for example, OLTP databases like MySQL, Postgres, or Oracle is that you actually don't have to think about very many things when you're starting. Yes, your data has to be in a table, we're talking SQL, but you don't have to make decisions about data types. So for example, you can start, and we actually have customers who do this: if you have a CSV file and you want to read it and look at it and start to do initial analytics, you can just load everything as a string. The only decision you really need to make when you're loading the data is that you do need to give a partition key, some way that ClickHouse can decide how to break the data up into parts.
So that's a fairly simple decision. Another thing is you do have to choose a primary sort order. So, again, you make that choice early on. If you don't get it right, you can always reload the data and try again, so that's not a problem. So the first cut, if you're really just getting going with new data and you have no idea what's in there, is to make a basic choice on the partition key, make a basic choice on the sort order, and put everything in as strings. You can then optimize that as you start to look at the data. One thing that you can do which speeds up performance enormously is, after you've looked at the data, go ahead and assign proper data types. So instead of a string date, make it a real DateTime, for example. Instead of a string integer, make it something like a UInt8 so that you're just using a byte. That will vastly improve your performance, and, of course, it makes your queries simpler, because when you're counting things you don't have to apply conversion functions.
And then there's a second-order type of modeling that you want to do, which is to think about codecs. Codecs are basically transformations that are applied to the data as it is stored and, of course, as it is pulled back from storage. A good example of a codec is that you may change, for example, the way that the compression works. On strings, for example, we have a very powerful one called low cardinality. What this does, basically, if you have, let's say, a string value for region in some dataset, well, there's a finite number of geographic regions in most cases, say maybe you have 500 of these. If you apply this LowCardinality data type, though it's really a transformation, to the column, what it will do is automatically convert those region values to an integer and maintain a lookup table for you that it automatically associates, so that when you do a select, you see the region name, but underneath it's just comparing and selecting using integers. So it's basically three levels. When you're starting out, you can make a basic choice about the partition key and the sort order, and just have a rough estimate on the data types. You can then refine the data types, and then you can apply different codecs and other transformations on the data. And that's kind of the way that you can iteratively get to a really efficient schema. I think there's also another question in modeling that comes up very commonly in many use cases, and that's what to do when you have a variable schema.
And that's something that Alexander has actually worked on quite a bit. Alexander, I think that's one where you can offer some interesting comments.
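[Editor's note: a hedged sketch of the iterative modeling approach Robert outlines, starting with everything as strings and tightening types later. Names are illustrative, and in practice the partition key and sort order columns themselves cannot be retyped in place, so those you would choose up front or reload.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

# First cut: pick only the partition key and sort order, load everything
# else as strings while you explore the data.
client.execute("""
    CREATE TABLE IF NOT EXISTS raw_csv (
        event_date String,
        region     String,
        duration   String
    )
    ENGINE = MergeTree()
    PARTITION BY event_date
    ORDER BY event_date
""")

# Later: refine the non-key columns once you understand the data.
client.execute("ALTER TABLE raw_csv MODIFY COLUMN duration UInt32")
client.execute("ALTER TABLE raw_csv MODIFY COLUMN region LowCardinality(String)")
```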
[00:45:52] Unknown:
Yes, thank you, Robert. So these days some users are used to schemaless databases, or databases without a very strict schema, like Cassandra, for example, or sometimes time series databases like Influx or Prometheus. And they come to ClickHouse with this background, and they wonder if they can do the same. And in most cases it makes sense, because, for example, consider the problem of monitoring when you have a lot of different devices. All devices may have their own properties and so on, and it's very hard to make a schema that works for all cases, or you would have thousands of columns per table. So instead, what we can do in ClickHouse is store data in arrays.
You might have an array of keys and an array of values, or you can have multiple arrays for different data types. It kind of gets you away from the columnar structure, because an array is stored all together. But you can start from this and work with schemaless data and just store everything as key-value pairs. And if you decide that performance is not very good, you can then materialize some specific values or specific attributes from those key-value arrays as separate columns. It can be done afterwards. It can be done with absolutely no downtime, just by an alter table, and ClickHouse will start populating those specific attributes that are important for filtering, for example, for you. And you can start querying the data using those filters much faster and more efficiently.
So this is kind of a hybrid approach. You start with this array or arrays and probably some other common properties like timestamps, whatever. And then you can extract certain data for better performance. Similar to what Robert said earlier, you can start with the basic schema of these things and then continue to specific data types, codecs, and so on. And ClickHouse allows you to do everything. Yeah. And we have users who are using exactly this approach, for example, to process logs.
[00:48:16] Unknown:
You know, keep the data in arrays and then use this materialized column feature. That's actually another really killer ClickHouse feature. The fact that you can do it with zero downtime and basically instantly begin to refer to the new column with the correct type is a great feature.
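[Editor's note: a hedged sketch of the semi-structured pattern described above, keeping arbitrary attributes in parallel key/value arrays and later promoting a hot attribute to a real column with MATERIALIZED. Table and attribute names are illustrative.]

```python
from clickhouse_driver import Client

client = Client(host='localhost')

client.execute("""
    CREATE TABLE IF NOT EXISTS device_metrics (
        ts     DateTime,
        device UInt64,
        keys   Array(String),
        vals   Array(Float64)
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (device, ts)
""")

# Query an attribute straight out of the arrays (slower, but no schema needed).
client.execute("""
    SELECT device, max(vals[indexOf(keys, 'temperature')])
    FROM device_metrics
    GROUP BY device
""")

# Promote the attribute to its own column with no downtime; the value is
# computed from the arrays, and queries can refer to it immediately.
client.execute("""
    ALTER TABLE device_metrics
    ADD COLUMN temperature Float64 MATERIALIZED vals[indexOf(keys, 'temperature')]
""")
```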
[00:48:37] Unknown:
Yeah. And all of these capabilities, as far as the flexible modeling options, put me in mind of the current trend of the shift away from ETL into ELT, where you land the raw data into some source, in a lot of cases a data lake, but some of these modern cloud-capable data warehouses, such as Snowflake DB, I know, encourage you to just load the raw data into them and then do the transformation after the fact. And so it seems that ClickHouse is another case where you would just load the raw data into some source table and then take advantage of these schema evolution or schema manipulation capabilities for then being able to transform the data into a more processed form to use for more advanced analytics?
[00:49:22] Unknown:
Yeah. That's right. I wouldn't say that this is the primary use case for ClickHouse, but ClickHouse can do pretty well with
[00:49:30] Unknown:
this approach as well. Yeah. And I think that we're definitely not — I mean, there are systems out there which will actually let you have a metastore which implements schema on read, so that you may have just text files. This is the Hadoop Hive model. That's sort of the far end of the spectrum, and ClickHouse is absolutely not designed or intended to follow that pattern. To get high performance, we do require that you really put the data into ClickHouse's storage format. We can, of course, refer to data in other databases, but those tend to be smaller tables.
So we do expect that people are going to eventually load most of their data into ClickHouse and store it in the native MergeTree format. So in that sense we're really closer to a traditional database than, for example,
[00:50:27] Unknown:
some of the systems that are designed to read data directly off of S3. And then as far as getting ClickHouse deployed, I know that you said in terms of the initial onboarding you can do an apt-get install, but for actually putting it into a production context, I'm wondering what's involved, some of the edge cases that users should be aware of, and some of the tuning capabilities that would be useful to point out. Good question. So it's right that you can install ClickHouse
[00:50:57] Unknown:
and start it in 60 seconds, and you can run it on your laptop or on a high-end server. ClickHouse will run everywhere. It's absolutely portable. When going into production, there are different considerations. Performance is actually not always the first one, since ClickHouse is already very fast. Quite often it's more about usability and a better fit into the infrastructure. Having said that, for maximum performance ClickHouse is better optimized for running on bare metal servers. It has a lot of internal optimizations that rely on CPU cache utilization, for instance, where the data is local to the CPU.
Sometimes, if you use certain virtualization, this is not the case. But you can go to clouds as well, including public clouds like Amazon or Azure or Google Cloud, and it works reasonably well there, given that you are provided with a good storage option. It certainly doesn't perform as fast as on comparable bare metal servers; there is a certain expense to virtualization, but it's very convenient. So just deploy it to the cloud, and if everything else is in the cloud, you actually don't have a reason to go somewhere else. What we see now in recent months is a growing interest in deploying ClickHouse in Kubernetes, because Kubernetes allows you to run kind of the same ClickHouse configuration everywhere. You can have Kubernetes on bare metal, Kubernetes in public and private clouds, and the running pattern for ClickHouse would be exactly the same in all Kubernetes environments.
We actually see ClickHouse becoming more and more popular in Kubernetes. Actually, at the moment, this is probably the only data warehouse or analytical database that can run in Kubernetes at all. There are multiple reasons for that. One of the reasons is that stateful applications and storage are still pretty new to Kubernetes, and ClickHouse has a simpler deployment pattern than many other analytical databases, not just in the cloud. I kind of lost the thought, but it's great that this ClickHouse portability allows it to run everywhere and be optimized for a particular use case scenario. If you optimize for performance, you will go one way. If you optimize for convenience, you go another way.
And the trade-offs are not that huge. So the Kubernetes overhead is not significant compared to bare metal or to Kubernetes running in Amazon
[00:53:52] Unknown:
and so on. Yeah. I think the performance of ClickHouse on Kubernetes has been very enlightening. So the Kubernetes operator that we've developed takes what is naturally a very complex deployment, which is isn't anything particular to click click houses. It's just that deploying data applications with state on on Kubernetes involves creating a lot of resources and managing them. So, the Kubernetes operator that we've developed reduces these to a single custom resource that's easy to configure and, configure, deploy, and update. But what we've definitely seen is is that well, first of all, these give you huge flexibility because you can quickly spin up clusters. You can create clusters in less than a minute, depending on their size, of course.
We've also seen pretty promising performance results. For example, when we run tests on Kubernetes versus running on VMs, so ClickHouse on Kubernetes on Amazon VMs versus ClickHouse on straight VMs, we see virtually no difference in performance. So that's something that looks very promising. We are working with customers who have major deployments on Kubernetes, and we definitely see it as a growing platform for data warehouses in the future.
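For readers who have not seen the operator, the "single custom resource" idea looks roughly like the following. This is a from-memory sketch, not the authoritative schema: the apiVersion, kind, and field names are assumptions that should be checked against the clickhouse-operator documentation for the version you deploy.

```python
# A rough, from-memory sketch of the single custom resource managed by the
# ClickHouse Kubernetes operator. The apiVersion, kind, and field names are
# assumptions for illustration only -- verify them against the operator docs
# before applying anything like this to a real cluster.
import json

clickhouse_installation = {
    "apiVersion": "clickhouse.altinity.com/v1",  # assumed; check your operator version
    "kind": "ClickHouseInstallation",
    "metadata": {"name": "demo"},
    "spec": {
        "configuration": {
            "clusters": [
                {
                    "name": "demo-cluster",
                    # 2 shards x 2 replicas -> 4 ClickHouse pods managed by the operator
                    "layout": {"shardsCount": 2, "replicasCount": 2},
                }
            ]
        }
    },
}

# Emit JSON that could be piped to `kubectl apply -f -` once the schema is confirmed.
print(json.dumps(clickhouse_installation, indent=2))
```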
[00:55:15] Unknown:
And as far as the use cases and instances of ClickHouse and customers that you've worked with, I'm wondering what have been some of the most interesting or unexpected or innovative ways that you've seen it used. I'll give you one that just about blew my mind about four days after I started the job.
[00:55:33] Unknown:
We have a customer that develops custom fraud detection systems for banks and credit card processors. The way that they use ClickHouse is they bring it in, and the first thing they have to do, in order to build the system and select appropriate models for doing fraud detection, is understand what data that financial institution has available. So they'll just ask the institution to give them a few CSV files containing more or less complete samples of data, throw them into ClickHouse, and start playing with the data and looking around. This is exactly what I was talking about when I was discussing how straightforward modeling can be, that you can literally just start with strings. That's exactly what they do: they throw the CSV in and then start exploring it. I have never seen a database that can do this at this scale, so it's a really interesting and innovative way to use it.
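As a rough illustration of that "throw the CSV in and start exploring" workflow, here is a minimal sketch using ClickHouse's HTTP interface. The table name, column names, and sample.csv file are hypothetical, and a local server on the default port 8123 is assumed; every column is declared as String so no upfront modeling is needed.

```python
# Minimal "load the CSV as strings and explore" sketch over the HTTP interface.
# Assumes a local server on the default port 8123; the table, columns, and
# sample.csv are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = "http://localhost:8123/"

def execute(sql: str) -> str:
    """Run a statement by POSTing it as the request body."""
    with urlopen(Request(URL, data=sql.encode())) as resp:
        return resp.read().decode()

def insert_csv(table: str, csv_bytes: bytes) -> None:
    """Stream raw CSV into a table: the query goes in the URL, the data in the body."""
    qs = urlencode({"query": f"INSERT INTO {table} FORMAT CSV"})
    urlopen(Request(URL + "?" + qs, data=csv_bytes)).read()

# Declare everything as String -- no modeling up front, just start looking at the data.
execute(
    "CREATE TABLE IF NOT EXISTS scratch_sample "
    "(account String, merchant String, amount String) "
    "ENGINE = MergeTree ORDER BY tuple()"
)

with open("sample.csv", "rb") as f:
    insert_csv("scratch_sample", f.read())

# Start exploring: top merchants by row count.
print(execute(
    "SELECT merchant, count() AS c FROM scratch_sample "
    "GROUP BY merchant ORDER BY c DESC LIMIT 10"
))
```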
[00:56:37] Unknown:
Yes, and thanks, Robert, that's a good example. From my side, probably my favorite is a distributed computation system that one of our customers built last year. What they did was stand up a cluster of ClickHouse servers and develop a kind of execution pipeline that processed the data, which was stock market data, absolutely in parallel on all servers. The pipeline took several steps to do very complex mathematical computations, and after all parts of the pipeline finished, the data was merged together into a single table to be accessible by the users. You could pretty easily develop the same thing on Spark or Hadoop, but given the data sizes they used, it would take several days for the results to come; in ClickHouse it took just several hours. It was probably a 20x performance increase.
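A very rough sketch of that pattern, under stated assumptions: each shard holds a disjoint slice of the data in a local table, the same heavy aggregation runs on every shard concurrently, and the per-shard results are merged into one table on a collector node. The host names, table names, and query below are hypothetical placeholders, not the customer's actual pipeline.

```python
# Sketch of using a ClickHouse cluster as a parallel compute engine: run the
# same heavy step on every shard at once, then merge the partial results into
# a single table. Hosts, tables, and the query are hypothetical; it assumes
# each shard holds a disjoint slice of the data, so results do not overlap.
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SHARDS = ["http://shard1:8123/", "http://shard2:8123/", "http://shard3:8123/"]
COLLECTOR = "http://collector:8123/"

HEAVY_STEP = (
    "SELECT symbol, avg(price) AS avg_price, max(price) AS max_price "
    "FROM trades_local GROUP BY symbol FORMAT CSV"
)

def run(base_url: str, sql: str, body: Optional[bytes] = None) -> bytes:
    """Send one statement to one server over the HTTP interface."""
    if body is None:
        return urlopen(Request(base_url, data=sql.encode())).read()
    qs = urlencode({"query": sql})
    return urlopen(Request(base_url + "?" + qs, data=body)).read()

# 1. Run the expensive step on every shard concurrently.
with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    partials = list(pool.map(lambda url: run(url, HEAVY_STEP), SHARDS))

# 2. Merge the per-shard results into a single table on the collector node.
run(COLLECTOR,
    "CREATE TABLE IF NOT EXISTS trade_stats "
    "(symbol String, avg_price Float64, max_price Float64) "
    "ENGINE = MergeTree ORDER BY symbol")
for chunk in partials:
    run(COLLECTOR, "INSERT INTO trade_stats FORMAT CSV", body=chunk)
```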
And the general idea that you can use an analytical database as a computational processing engine is kind of cool, I think. We haven't seen any other client doing the same since that time, but that will probably change, especially given that ClickHouse has been getting closer to the machine learning world recently. New features are being added that bring some machine learning models into ClickHouse, and we are investigating ways to integrate more closely with machine learning capabilities and so on. So it may change, but a year ago it was a real kind of wonder.
And I know of some other companies who migrated their data analytics and machine learning algorithms from Spark to ClickHouse, and, again, the performance and hardware difference was close to 1 to 20. And another thing that I didn't ask about yet is
[00:58:56] Unknown:
whether ClickHouse has any concept of multi-tenancy, where you might have different groups within an organization that want to be able to leverage the capabilities of ClickHouse but still maintain some sort of ownership of the data, and then also be able to grant either restricted access or federate across different groups of data within the overall platform so they can do higher level aggregates across those data sources?
[00:59:22] Unknown:
Yeah, that's a really interesting question that actually comes up a lot. What you're referring to is an authority model that would allow people to share data safely across tenants, and I think at this point it's fair to say, to be honest, that ClickHouse does not have strong capabilities in this area. But there are a couple of interesting features coming down the pike that will make this possible. The first is table and column level security. This is a standard feature that you have in existing databases like Oracle, and it's also in cloud data warehouses like Snowflake.
There it is very well developed. So this is on the road map and will hopefully be coming sometime this year. Beyond that, you need to control resource usage, and that is something that will be arriving in ClickHouse a lot sooner: specifically, resource pools that allow you to restrict things like CPU, RAM, and storage IO or IOPS, basically to control these for groups of users. That capability is fundamental to allowing different groups of users to safely share a large system. It's something we're actually working on right now, and we expect to see it merged
[01:00:58] Unknown:
into ClickHouse by the end of this year. And as far as your overall experience of working on ClickHouse and helping customers get instances deployed and put into production, I'm wondering what you have found to be some of the most challenging aspects and some of the most interesting or unexpected or valuable lessons that you've learned in the process.
[01:01:17] Unknown:
I will just focus on one place that I think is a challenge for many systems, and that is security. We like to say ClickHouse is a young system, so it has great speed and great portability, but the early use cases involved very large quantities of data that didn't have, you know, big security considerations. For example, the data was not considered subject to GDPR, things like that. So definitely, being able to build systems that can pass PCI DSS, for example, for credit card information, or get past corporate infosec teams, that's a challenge. We actually just recently got our first contract to add our first security-related feature: log cleansing, to keep sensitive data out of logs. So we're looking forward to stepping down that path. But in the meantime, if people are building systems that need to meet PCI DSS requirements, what they have to do with ClickHouse is think whole-system: think about security groups, think about network security, and the things you can do to create a trusted environment where it can safely run. So that's definitely a major design challenge for ClickHouse. The learning for me was that, in talking to customers, it can actually be done. We don't have to have all these features yet; there are ways to use ClickHouse that are quite secure, but it does require some additional thinking in your system design, and we hope to reduce that thinking
[01:02:51] Unknown:
over the next year or so. You mentioned that ClickHouse is still a bit of a young system. And, Alexander, you said that part of the reason for founding Altinity was to address some of the sharp edges of ClickHouse itself. So I'm wondering, at this point, what you see as being the shortcomings of ClickHouse and how you address them in terms of the products and services that you provide at Altinity?
[01:03:15] Unknown:
That's a good question. One shortcoming Robert just mentioned is security. ClickHouse has been developed inside an enterprise without strict security requirements, and it is still missing a lot of features that any other major database has. That is one of the things we're working on. Another area that is still not as good in ClickHouse is storage management, and there are probably a few others. In all cases, this comes from the fact that ClickHouse was originally developed by a single company with its own set of requirements and its own view of the product.
Even after three years in open source, that has changed somewhat, but not completely. ClickHouse development is still mostly driven by the core team, and we're working very closely with the core team and the community to shift the agenda and road map to better address the needs of the
[01:04:28] Unknown:
wider audience and wider scope of users, and we're working pretty hard on that, actually. Probably, Robert, you can add something. Yeah, absolutely. Tiered storage, I think, is one of the most important features. What you typically want to do is have hot data, which you may want to keep on SSDs; older data, maybe after a week, that you're content to move to hard disks; and then, after some lengthy period of time, data that you just don't refer to very much, maybe kept for regulatory reasons, that you'd actually like to drop onto object storage. That's probably one of the biggest things we hear from users, to give people the capability to do that. We are working very closely with the Yandex core committers, as well as our own team, to design a solution for that and get it implemented. I think that's going to be a huge step forward, but there are many others. In fact, the flip side of the shortcomings is that people want to use ClickHouse even with the things that are missing, and that has created a market demand to round out ClickHouse. That's actually one of the biggest things we do in our business: we help people identify the areas that are still not mature, then either work around them or actually get budget so that we can fix them. And it's evolving pretty quickly. We've done a number of PRs this year, and we expect the rate at which we're adding features to double or triple by the end of this year. And what are the cases when ClickHouse is the wrong choice and you'd be better suited by choosing either a different database engine or just a different
[01:06:13] Unknown:
overall system architecture?
[01:06:14] Unknown:
It's not an OLTP database. Sometimes people come to it and think, okay, can I make it do both? ClickHouse really is designed to do analytics; we're not trying to compete with Postgres to, like, run a bank or deal with money. I think there's another case that's more subtle, which is within analytics. If you have a lot of concurrent point queries, where you just want to get the data from a single row, ClickHouse is not a good solution for that. The reason is that we're fast because we use sparse indexes; that's one of the big reasons we don't construct these large indexes that you then have to maintain and read. But what that means is that if you ask for a single row, there's a good chance you may read 8,000 values and then decide which one you actually want.
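To make the sparse index point concrete, here is a tiny hypothetical sketch. A MergeTree table keeps one primary index entry per granule (8,192 rows by default), so even an exact-match lookup has to read and filter at least one whole granule; the table, data, and local server URL below are illustrative assumptions.

```python
# Illustration of why point lookups are relatively expensive with a sparse index.
# MergeTree keeps one index entry per granule (8192 rows by default), so an
# exact-match query still decompresses and scans a whole granule to find one row.
# The table, data, and server URL here are hypothetical.
from urllib.request import Request, urlopen

URL = "http://localhost:8123/"

def execute(sql: str) -> str:
    """POST a statement to the HTTP interface and return the raw text result."""
    with urlopen(Request(URL, data=sql.encode())) as resp:
        return resp.read().decode()

execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(user_id UInt64, ts DateTime, payload String) "
    "ENGINE = MergeTree ORDER BY (user_id, ts)"  # sparse primary index over (user_id, ts)
)

# The sparse index narrows the scan to a few granules, but ClickHouse still
# reads ~8192 rows from each matching granule and filters them to find the row.
print(execute(
    "SELECT payload FROM events "
    "WHERE user_id = 42 AND ts = '2019-07-01 00:00:00'"
))
```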
[01:07:04] Unknown:
So that's a case where another database, Redis for example, might be an appropriate solution. And are there any other aspects of the ClickHouse database itself, or the use cases that it enables, or the work that you're doing at Altinity that we didn't discuss yet that you'd
[01:07:23] Unknown:
like to cover before we close out the show? Yeah, I would just summarize by saying that the thing I really like about ClickHouse is the speed and the scaling. But let's be honest, there are a bunch of databases that can do that, and that alone doesn't make you unique. What I think really creates this huge opportunity is, first, that ClickHouse has a permissive open source license, Apache 2.0. That's a really great feature for many people; it means it can be used very flexibly in a wide range of applications. Thing number two is that it's very portable. We already talked about how it runs from bare metal to containers to clouds, and it runs well in all of those environments. And the third is that it's relatively simple to manage, and I think this is something it inherited; the ethos of MySQL is sort of embedded in the design of this system, and the idea is just to make things easy for users.
Make it easy to install. Make it relatively easy to manage. Don't build enormous complexity that requires DBAs. This is a database that can be used, deployed, and operated by developers, and I think that's a really powerful feature.
[01:08:34] Unknown:
Alright. And for anybody who wants to get in touch with either of you or follow the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. You know, I'm going to throw one out there: I think the biggest gap that I see globally in the industry
[01:08:57] Unknown:
is that we've got two separate worlds, one of analytic databases and data management, and a world running in parallel that's doing machine learning and AI. I think the most interesting area, and the place where we need new tooling and new ways of approaching things, is in merging those two worlds. It's an interesting problem because it's not purely technical. There are big challenges in embedding ML in databases, for example, but there are also human challenges, because these problems have up till now tended to be solved or managed by different groups. You'll have, for example, your Python data stack folks, and you'll have your DBAs, or developers who are more familiar with traditional databases. So that, to me, is the big and really interesting problem that we're all kind of circling around and would love to solve.
[01:09:52] Unknown:
I completely agree with Robert. We have discussed it multiple times: there is a huge disconnect between those two worlds, and there is no evidence that it's going to be fixed anytime soon. There are different attempts to do something that glues the two worlds together, but it's not about technology; it's more about mindset.
[01:10:17] Unknown:
So it will take a while to do something here. But it's a problem we're really excited about because, well, we like math, and this basically involves, coming from the database world, a mind shift toward understanding machine learning and AI. We're really excited about learning more about that, and also more generally excited about helping to solve this problem. It's just a really interesting problem, and there's huge value
[01:10:46] Unknown:
in being able to combine these two worlds. Alright. Well, thank you both for taking the time today to join me and talk about the work that you're doing on ClickHouse and the work that you're doing at Altinity to help support that database. It's definitely an interesting piece of technology and one that I plan to take a closer look at and possibly use for my own purposes. So thank you both for all of that, and I hope you enjoy the rest of your day. Thank you, Tobias, very much. Thank you, Tobias. It's been great.
Introduction to ClickHouse with Robert Hodges and Alexander Zaitsev
Robert Hodges' Journey in Data Management
Alexander Zaitsev's Experience with Analytical Databases
What is ClickHouse?
Primary Use Cases for ClickHouse
ClickHouse in the Broader Database Ecosystem
Scaling and Performance Capabilities of ClickHouse
Data Lifecycle in ClickHouse
Clustering Capabilities of ClickHouse
Data Safeguards and Deletion Protection
Loading Data into ClickHouse
Schema Evolution and Handling Malformed Data
Advanced Capabilities and SQL Extensions
Getting Started with ClickHouse
Deploying ClickHouse in Production
Interesting Use Cases and Customer Stories
Multi-Tenancy and Data Ownership
Challenges and Lessons Learned
Shortcomings and Future Directions
When ClickHouse is the Wrong Choice
Final Thoughts and Closing Remarks