Summary
As communication between machines becomes more commonplace, the need to store the generated data in a time-oriented manner increases. The market for time series data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about TimescaleDB, a scalable time series database built on top of PostgreSQL
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Timescale is and how the project got started?
- The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
- In your blog post that explains the design decisions for how Timescale is implemented, you call out the fact that the inserted data is largely append-only, which simplifies index management. How does Timescale handle out-of-order timestamps, such as those from infrequently connected sensors or mobile devices?
- How is Timescale implemented and how has the internal architecture evolved since you first started working on it?
- What impact has the 10.0 release of PostgreSQL had on the design of the project?
- Is Timescale compatible with systems such as Amazon RDS or Google Cloud SQL?
- For someone who wants to start using Timescale what is involved in deploying and maintaining it?
- What are the axes for scaling Timescale and what are the points where that scalability breaks down?
- Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?
- What has been the most challenging aspect of building and marketing Timescale?
- When is Timescale the wrong tool to use for time series data?
- One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
- What are some of the most interesting uses of Timescale that you have seen?
- Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
- What features or improvements do you have planned for future releases of Timescale?
Contact Info
- Ajay
- @acoustik on Twitter
- Timescale Blog
- Mike
- Website
- @michaelfreedman on Twitter
- Timescale Blog
- Timescale
- Website
- @timescaledb on Twitter
- GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Timescale
- PostgreSQL
- Citus
- Timescale Design Blog Post
- MIT
- NYU
- Stanford
- SDN
- Princeton
- Machine Data
- Timeseries Data
- List of Timeseries Databases
- NoSQL
- Online Transaction Processing (OLTP)
- Object Relational Mapper (ORM)
- Grafana
- Tableau
- Kafka
- When Boring Is Awesome
- RDS
- Google Cloud SQL
- Azure DB
- Docker
- Continuous Aggregates
- Streaming Replication
- PGPool II
- Kubernetes
- Docker Swarm
- Citus Data
- Database Indexing
- B-Tree Index
- GIN Index
- GIST Index
- STE Energy
- Redis
- Graphite
- Prometheus
- pg_prometheus
- OpenMetrics Standard Proposal
- Timescale Parallel Copy
- Hadoop
- PostGIS
- KDB+
- DevOps
- Internet of Things
- MongoDB
- Elastic
- DataBricks
- Apache Spark
- Confluent
- New Enterprise Associates
- MapD
- Benchmark Ventures
- Hortonworks
- Two Sigma Ventures
- CockroachDB
- Cloudflare
- EMC
- Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. I've got a couple of announcements before we start the show. There's still time to register for the O'Reilly Strata Conference in San Jose, California, happening from March 5th to 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O'Reilly AI Conference is also coming up, happening April 29th and 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through 4th. It has become one of the largest events for data scientists, data engineers, and data-driven businesses to get together and learn how to be more effective. To save 60% off your tickets, go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I'm interviewing Ajay Kulkarni and Mike Freedman about TimescaleDB, a scalable time series database built on top of PostgreSQL.
[00:01:59] Unknown:
So if you guys could start by introducing yourselves. Hi, I'm Ajay. I'm the CEO, and this is our CTO, Mike. Hi, everybody. Mike doubles as a professor of computer science at Princeton, and we're the cofounders of Timescale. Mike and I actually have a fun history. We go back 20 years, to college at MIT, where we were roommates. Since then, I went into industry, went back to business school at MIT, and then started a company around communication analysis that was later acquired by GroupMe, which was then acquired by Skype, which was then acquired by Microsoft. All three of those acquisitions happened in six months. In parallel, Mike went into academia, earning his PhD at NYU and Stanford. While at Stanford, he was part of the team that developed software-defined networking, and he also started a company called Illuminics, around IP geolocation, that was then acquired by Quova, now part of Neustar. He then ended up at Princeton, where he's teaching computer science. It was around the time when I was halfway into my golden handcuffs at Microsoft, and Mike had just gotten tenure at Princeton, that we decided to team up again to take on a new challenge. That new challenge was helping developers and companies manage the increasing amount of machine data and time series data that their businesses were generating,
[00:03:15] Unknown:
and that's why we started Timescale. And if you could go a bit deeper into how you each got involved in the area of data management?
[00:03:23] Unknown:
Yeah, it's again one of those fun startup history stories. We started off not building a database, but actually building an IoT platform. What we saw was that there was this rise in machines. In fact, the computing industry moves in waves, right? It moved from mainframes to desktops and laptops and then smartphones. And with each wave, the computers become smaller, but also more powerful and more ubiquitous. You might remember Microsoft's original slogan in the eighties was a computer on every desk, and now with smartphones, you could argue, a computer in every pocket. What we've noticed is that we're entering a new wave: a computer in everything. And these things include our vehicles, our manufacturing lines, our farms, power plants, homes, like with the Alexas of the world, and even our bodies. So what Mike and I realized was that, as human beings, we're now living with machines.

And as businesses, we were swimming in machine data. We started off with that hypothesis, but with a very different product. We started off with an IoT platform, which helped the people making these devices analyze their data and monitor their devices. What ended up happening is we were collecting all kinds of time series sensor data, and we needed a place to store it, so we used a few off-the-shelf time series databases. After several months, we realized that we were quite unhappy with the state of the time series database world. We felt like we essentially had a choice between reliability and ease of use but a lack of scalability, i.e., relational databases.

Or we had NoSQL options, which were scalable but hard to use, not that reliable, not that performant for complex queries, and which in fact siloed our data into two different databases. So we had this insight: we felt like we could build a better time series database on top of Postgres. And being a group of computer science academics and PhDs, we just built that database, though initially we built it just for the IoT platform. Then a funny thing happened: as we were selling the IoT platform, we got some companies to sign on, but we also had a lot of companies, especially larger ones, who said, look, we can't use your IoT platform because we're building our own IoT platform. But tell us more about this database you built, because it sounds like something we could use as well.

And that's when we realized that the solution to the IoT problem was not another platform, but a better database. And as soon as we became a time series database, we realized that the time series data problem was much bigger than IoT. It showed up in other industries as well, like finance, advertising, and logistics. And that's essentially how we got into the database world. That's the whole story.
[00:06:31] Unknown:
And as far as time series data, what is it about the nature of how it comes in, or the ways that it gets stored and used, that makes it such a challenge? And given how extensive the landscape is, I'm wondering how you view your position
[00:06:49] Unknown:
in the overall market, and what makes Timescale stand out from the other options? Let me start with the first part of your question, which was what makes time series data different, and you framed this by asking what makes it more challenging. One of the basic answers is that the volume of this data is so much larger than what you have historically seen in many OLTP or transactional databases. If you think of a large bank, let's say Bank of America, you might have something like 50 million customers. So conceptually, you might think that you have 50 million accounts; that's how large your primary database is, and you're doing operations to modify balances and so forth. But if you think about time series data, you have this vast array of streams of sensor data and events coming in, and they keep building and building.

And we see that across many areas: new types of machines, connected cars, industrial machines, putting out more and more data. Now, one of the interesting things is that the nature of those workloads is different. If I go back to my example of a traditional transactional database, what you might commonly be performing are effectively random updates to the database. I have two records, I do a transaction, I modify both those records in place. That's what a lot of the high-performance OLTP market has focused on, and so when you talk about scaling that, you need to think about how you scale that workload.

But when we talk about time series workloads, the workloads look different. In fact, most of the operations are actually inserts, and they're mostly to the latest time intervals. It's this new wave of data coming in. And when you do queries, you might be scanning the data in different ways, but largely within certain time ranges, or about certain types of devices or stock tickers that are first-class citizens in your data. In some sense, scaling relational databases is a question that people have been working on for 30 years. What we did was solve a different problem, because we're solving it for those time series workloads that are append-heavy and insert-heavy to the latest time interval, as opposed to the random updates that are common to transactional workloads.

And that has allowed us to rethink how we build up from a relational database and how we want to design the whole architecture around this workload.
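To make that workload contrast concrete, here is a minimal SQL sketch of the access pattern described above, using a hypothetical conditions table; the schema and names are illustrative, not from the interview.

```sql
-- Hypothetical sensor-readings table (names are illustrative):
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- The write path is insert-heavy, almost always at the latest time interval:
INSERT INTO conditions VALUES (now(), 'device-42', 21.3);

-- Queries typically scan a time range, often scoped to a device
-- that is a first-class citizen in the data:
SELECT time, temperature
FROM conditions
WHERE device_id = 'device-42'
  AND time > now() - INTERVAL '1 day'
ORDER BY time DESC;
```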
[00:09:25] Unknown:
And you have a very detailed blog post on your site that explains a bit about the design decisions for how Timescale is implemented, and you mention there the fact that the data is largely append-only and immutable. One of the challenges for time series in particular is the fact that you may receive events out of order, or you may have sensors that are infrequently connected, such as mobile devices or low-power sensors, so you need to be able to handle those events coming in out of sequence. I'm wondering if there are any particular challenges posed by those requirements, and how you handle that internally in the database.
[00:10:03] Unknown:
So all those cases actually work out of the box with Timescale. And I should say that when I say the workloads are append-mostly or insert-heavy, what I mean is that the way we get our performance benefits is by relying on those assumptions, but we don't require them from a correctness perspective. You can update data, and you can delete data in Timescale, like you would with a traditional database. And when we talk about out-of-order data, we don't rely on the fact that all data comes exactly in order. Of course, sensors are sometimes going to be late, not all clocks are going to be synchronized, and the timestamps are going to be a little off. What our performance assumptions are driven by is that the data, for example, all comes within roughly the same interval. We do this very automated and native time-based and multidimensional partitioning, where maybe the last hour of your data naturally stays in memory, based on the way we architect the database underneath. So if any of your data comes within the last hour, it's very efficient; it all stays in memory, and that really drives the high performance that we get. Even if the data is delayed by more than several hours, it still works correctly.

It just occasionally has to do a couple of lookups to disk to handle that. Now, the other thing you mentioned with out-of-order data: a common use case in time series databases is the ability to do continuous aggregations or rollups, where you might continuously be calculating averages or other metrics over time, so that, say, your one-minute average and your hour-long average are precomputed in order to make queries faster later. We have very nice native support for this, and it also works with out-of-order data. If data comes into time periods that have already been precomputed, the database makes it trivial and transparent to recompute those calculations, so that you continuously get the correct answer and don't have multiple inaccurate records in your database like some other options on the market.
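As a minimal sketch of the kind of rollup described here, assuming the hypothetical conditions table from earlier: TimescaleDB's time_bucket function computes per-interval aggregates, and because the aggregate is computed from the raw rows, late-arriving data is folded in automatically when the query reruns.

```sql
-- Roll raw readings up into per-minute averages with TimescaleDB's
-- time_bucket() function (table and column names are illustrative):
SELECT time_bucket('1 minute', time) AS minute,
       device_id,
       avg(temperature) AS avg_temp
FROM conditions
WHERE time > now() - INTERVAL '1 hour'
GROUP BY minute, device_id
ORDER BY minute;
```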
[00:12:13] Unknown:
And I'm wondering if you can talk a bit more about how the database itself is implemented, particularly given its nature as an extension to PostgreSQL. Along with that, can you talk about the impact that the 10.x release has had on the implementation
[00:12:30] Unknown:
details of your project? Yeah. So as you alluded to, Timescale is actually implemented as an extension on top of Postgres, and this was a very careful decision that we made when setting out to build it. In doing so, effectively what we do is give the illusion to users and to operators that we look like a Postgres database. Even though we have what we call a hypertable, it looks like a standard Postgres table. You could have a hypertable of a million rows, you could have a hypertable of 500 billion rows; we've had users with a single hypertable at half a trillion rows. Under the covers, of course, we're doing a lot. But what that gives you on top of Postgres is that it inherits the entire ecosystem. If a tool speaks to Postgres, it can speak to Timescale.

So if you use an ORM, in Ruby or in Java with Hibernate, if you use a visualization platform like Grafana or Tableau, if you use admin tools, if you use a queuing system or a pipeline system like Kafka, all those things will just work out of the box with Timescale. That has been a very powerful thing, because what we often find is people say, well, I love Postgres. I started doing this in Postgres, but then I hit this scalability limitation. And now do I have to throw the baby out with the bathwater? Do I need to go to a completely different system? Do I have to hack together something that kinda sorta works and is hard to use? With Timescale, you install our extension and you can basically keep scaling, and even get much higher performance: 20x the insert performance that you would get with native Postgres. So Timescale inherits probably the largest ecosystem of any time series database out there, because it inherits the entire ecosystem of Postgres. It also inherits the 20 years of reliability that Postgres has had. Ajay wrote an interesting launch post; I think it was called "When Boring is Awesome."

The point is that you want your database to just work. You don't want your database to wake you up at 3 AM because it threw some random segfault and you lost data. We saw that, as a database, this is an operational database that powers mission-critical applications. It's a core part of infrastructure, and it takes sometimes five, but often ten, years to reach the reliability that you want from that type of infrastructure. But we can provide that today, because we figured out how to make Postgres, an amazing product, work for time series data. Your second question was about 10.0, and there are a lot of great things in 10.0. It gives a lot of parallelization support, and it obviously has some support for partitioning. So far, our users have had a great experience. We support Timescale running on 10.0 across all different platforms.

And users have been able to take advantage of Timescale combined with Postgres 10, taking advantage of some of the parallelization support and the other new features that Postgres
[00:15:40] Unknown:
10 put out, and it has been a great product for us. Yeah, just to add to that, we're huge fans of the Postgres community. We love the fact that 10 came out last year and 11 will come out this year, and we wanna see that growth continue, because it is great for developers. I think with 10, and even with 11, there are still quite a few limitations when it comes to time series data. Some of them may improve over time, but some of them may never make sense in mainline. So even with all that improvement, Timescale can still add a lot of value on top if you're using Postgres for time series data. But that said, we love all this development, and we're big fans and supporters of the Postgres community.
[00:16:22] Unknown:
And given the fact that it is an extension to Postgres, is it possible to deploy it on top of managed platforms such as Amazon RDS, Google Cloud SQL, or the Azure database offerings?
[00:16:34] Unknown:
Not yet. If you look at our GitHub issues, I think that's our number one GitHub issue by far: requesting support for Timescale on some of the managed offerings. And we hear from customers that they've requested this from Amazon and Google and Microsoft. So we may see something in the near future, but it's currently not available. And the reason it's not available is not a technical limitation
[00:17:00] Unknown:
of Timescale interacting with these products. It's more
[00:17:03] Unknown:
a question of getting launched on these different platforms. And so, for somebody who does want to start using Timescale, what would be involved in getting it deployed and maintained? And what are some of the resource allocations that they should be considering when deploying the database?
[00:17:21] Unknown:
Yeah, so I'll answer that in a few ways. Number one, we support a variety of installation and deployment options. In terms of installation, the most common are either via Docker or via some of the Linux package managers like apt-get or yum/RPM. We also support installation on Macs, which a lot of people use, via Homebrew, for local testing. In terms of deployment, because we're a database and a Postgres extension, we support a variety of deployment models: anything from the public clouds to private clouds, to servers on premise, to gateways at the edge, or even the edge device itself.

We've already tested Timescale on a Raspberry Pi, and while I'm not sure anyone is actually using that today, we do have usage on Linux-based gateways at the edge. So essentially, you can either scale up or scale down with Timescale. In terms of resources required, generally most people that we work with today have their own resources in-house for managing databases: larger companies have DBA teams, and some smaller companies have their own DevOps teams. But we do provide enterprise support for companies who need higher levels of support or an SLA.

So that's an option as well. But, to be perfectly honest, because we're packaged as Postgres, most people have been pleasantly surprised to find that a lot of the tools they already use with Postgres, whether for backups or replication, just work out of the box with Timescale.
[00:19:00] Unknown:
And one of the main selling points of Timescale is the fact that you can scale all the way from small devices up to large server deployments. So I'm wondering what the axes for scaling Timescale are, and at what points that scalability starts to break down?
[00:19:17] Unknown:
So on the axes for scaling: like Ajay said, we've had people at least test us on Raspberry Pis. It's not typically the most performant platform, due to the hardware, but it certainly works on a gateway. In thinking about a time series database, there are really two axes that we think about. One is scale and performance, but the other is things like time-oriented analytical capabilities and time-oriented data management. I say that because there are things like continuous aggregation and data retention, such that you might keep raw data for a week and then aggregated data for a month. The reason you might adopt Timescale on these low-power devices is because it makes time series data management easier, along with some of the analytical features we have, as opposed to the fact that on a Raspberry Pi you're not doing 100,000 inserts a second. So there are really different decisions about why you deploy Timescale along these two axes. Another point is that, because we also allow you to store your relational data along with your time series data, what we've found, for example at the edge, is that somebody can have a single database, basically a Postgres database running Timescale, that they use for all of their data, and it allows them to simplify their stack at the edge. Now, in the cloud, or on premise, people have of course scaled up. We support two types of clustering today. We allow you to do Postgres streaming replication with multiple read-only replicas, so you can scale the number of servers answering queries. And, very interestingly, we also allow you to elastically scale the storage associated with the master and replicas of the Timescale cluster. We have this unique thing where, in one hypertable, unlike a traditional Postgres database and most relational databases, in one hypertable, which again to the user looks like a single table, we allow you to elastically add disks.

So if you're starting at one terabyte today and you need more space, you can add another five terabytes, another ten terabytes. You keep adding disks, and then, given the nature of time series workloads, we will continue to load-balance all your new inserts over these new disks. We've had people take this to the realm of, I think, 50 or 100 terabytes in private cloud environments.
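TimescaleDB exposes this elastic-storage behavior through Postgres tablespaces. Here is a minimal sketch, assuming the hypothetical conditions hypertable from earlier and a hypothetical mount point; the attach_tablespace call is from Timescale's API, though exact parameters may vary by version.

```sql
-- Expose a newly mounted disk to Postgres as a tablespace
-- (the mount point is hypothetical):
CREATE TABLESPACE disk2 LOCATION '/mnt/disk2';

-- Ask TimescaleDB to start placing new chunks of the hypertable there;
-- new inserts are then spread across the attached tablespaces:
SELECT attach_tablespace('disk2', 'conditions');
```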
[00:21:52] Unknown:
And Tobias, let me also answer that same question a different way, because I suspect the next question you might ask is: at what point does the scalability break down? Typically, we measure scalability in three ways. One is insert performance, the other is query latency, and the third is essentially just data volume. In terms of insert performance, today we support hundreds of thousands of inserts a second, and those are row inserts per second. If a row has ten metrics, that works out to millions of metrics per second. So, number one, if you need insert performance in, let's say, the tens of millions of inserts per second, we currently don't support that. It's one area we are working on, though, so that will likely change within a year. The second thing I wanna point out is query latency, and we actually have a great story there, because we're building on Postgres, a relational database. You can build hypertables with as many secondary indexes as you want, which allows for really efficient complex queries. Also, because we look like Postgres, we support Postgres streaming replication, which allows you to have one master but several read replicas. Those read replicas let users increase their query throughput by round-robining queries. That said, if you think most of your queries will touch all of your data, i.e., you can't leverage indexes, then Timescale may not be the best option; you may want something that's more MapReduce in nature, as an example.

The third thing is data volume, i.e., disk storage. As Mike pointed out, our hypertable framework allows us to build a logical table that effectively spans multiple disks. And that has allowed our users, on a single machine with 10, 20, 30 disks, to grow their data volume to 50, 60, almost 100 terabytes, just on a single machine. That's where we are today. One thing we're working on is being able to scale out across multiple machines, which would allow us, within a year or so, to get to a petabyte of data volume. But if today you see yourself having petabytes of data, or you're even beyond that, we would not recommend Timescale; we'd recommend something that's more file-system based, like an HDFS-based system. And a couple of things coming out of that: you mentioned the ability to use streaming replication and then do query balancing for read replicas. I'm wondering if Timescale
[00:24:41] Unknown:
adds any capabilities for proxying, determining whether the queries go to the replicas or the master for reads or writes. Whether the queries go to the replicas or the master? Yeah.
[00:24:57] Unknown:
Today we recommend using one of the tools in the ecosystem. In some sense, that's one of the advantages we have in building on top of Postgres: there is such a rich ecosystem, and different deployments, companies, and users wanna use different things. There are tools custom-built for Postgres, and, with all these new container management systems like Kubernetes or Docker Swarm, there are other ways you can get similar functionality. So there are currently a whole variety of ways that people do this, and we often make recommendations;

people ask us on some of our support channels, and there's a lot of easy support there. As Ajay was talking about, one thing we're working on is a more native scale-out, where you could have many primaries that increase the total capacity. And there, we're going to have better, more built-in support to make this transparent, rather than using these third-party extensions to Postgres.
[00:25:58] Unknown:
And speaking of those third-party extensions for horizontal scale-out, I was wondering if you, or anyone you're aware of, has explored running Timescale in conjunction with Citus for being able to do that horizontal partitioning to scale across multiple instances, or if the nature of the hypertable makes it somewhat impractical?
[00:26:23] Unknown:
That's the question that someone always asks when Timescale comes up or Citus comes up. If you look at the recent Hacker News articles by either Citus or Timescale, there's always a question about the other option and whether they're compatible. The answer is: we don't know. We have not tried it, and we don't know if anyone else has tried it. It's possible that they work together; it's possible that they don't. One thing we've found is that Citus and Timescale are both scaling Postgres, but they're scaling it for very different workloads. In particular, we're very time series oriented, very time series specific, which we believe is a large enough and complicated enough problem on its own that it requires a specialized database. What we think the world will look like is that, if you end up using both, you might end up using both on different machines,

and maybe using something like a foreign data wrapper to query across the two. But in short, we really haven't tested it. In some sense, part of our goal with Timescale is to really make the management, use, and analysis of time series data easy. Performant and easy. And with this whole idea of a logical view, a hypertable, we basically wanna make it look like you're interacting with one table; you don't care where the data is, you don't care how big it is, and so forth. And to give such a great user experience, with such performance and reliability, you have to really think about what the system looks like end to end. You have to think about how you get performance, how you get correctness semantics.

And so, while you can kind of cobble together different solutions, in the end we think the best solution is gonna be something that tackles the problem holistically, focuses on one problem well, and really executes on that. That's why our vision, as we build out that scale-out capability, is to build it knowing what we need for the time series database problem, as opposed to thinking about how to cobble together different tools to get something that kinda sorta works.
[00:28:28] Unknown:
And to that point, I'd like to stress that "scale" is in our name, and I think we built our initial brand around scaling Postgres to these workloads. But the other half of that, which often goes unnoticed but I think is equally important, is ease of use. One thing we really strove for is that, as you are writing and querying time series data, you shouldn't have to worry about anything. You shouldn't have to worry about how your data is stored or how your data is partitioned; we effectively handle that behind the scenes. If you're writing data to Timescale and your volume increases, we will automatically write the new data to new chunks, and we'll create new chunks as necessary. In fact, we've had some customers who were writing half a million inserts a second, and at that volume they're creating a new partition every two to three minutes. That's not even feasible if you're trying to do it manually. But with Timescale handling it behind the scenes, they don't have to worry about: oh, do we need to create a new chunk? Is this a new child table? How do I join across the partition boundary?

That's all behind the scenes. There's actually a lot of work we've done in creating this hypertable abstraction. The hypertable abstraction layer presents this illusion of a logical table across all time and space, across all your data, even though behind the scenes the database is storing the data across multiple partitions and creating new partitions as necessary. What we've essentially done is enforce a contract that says: this hypertable will look exactly like a table to you. If you're inserting data, you insert into the hypertable. If you're querying data, you query the hypertable. If you're creating constraints, you create constraints on the hypertable, and indexes on the hypertable, and Timescale will transparently, behind the scenes, create new chunks as necessary, query the appropriate chunks efficiently to retrieve your data, and propagate constraints and indexes to each of the underlying chunks. The key thing is, as a user,
[00:30:37] Unknown:
you don't have to worry about it. And on the subject of indexes, that also touches both the ease-of-use and scalability aspects, because as anybody who has used a relational database for long enough, and occasionally in the wrong manner, knows, adding too many indexes, or indexes of the wrong sort, can completely destroy your query performance. So I'm wondering if you, in the process of building Timescale, ended up needing to create or modify any index types, or whether you automatically add some form of indexing to the hypertables to make it easier and more performant to query that data.
[00:31:15] Unknown:
There are two parts to your question. One of them is: given that Timescale is built on top of Postgres, how and why do we claim to get 20 times the insert performance? One of the reasons is that, if you have indexed data, effectively what happens is that when you insert a new row, you then have to update all the indexes. For example, if you have an index on, let's say, time, or on some other ID, then as you write the latest data, you're going to update that index in place, to properly maintain the right sort order so you can do efficient queries. What that means is that when you have a million rows in your database, maybe everything's fine, everything's in memory, things are fast. When you have a billion rows, you're gonna start swapping a lot, and that's when your performance is really going to plunge. So one of the ways that we architecturally make this much faster is that our indexes are all local to every partition, to every little chunk. When Ajay was saying that these chunks were dynamically created every three minutes, the indexes are only over that three minutes' worth of data, the 10 or 20 million rows in those couple of minutes. They're not across half a trillion rows in your entire database. What that means is you naturally take advantage of the locality of the data to keep the index that is actively being changed in memory and fast. The stuff that's not being modified sits on disk and doesn't get modified.

Maybe, asynchronously, you could reindex it to make it more efficient, or cluster that chunk. And it's the access pattern to all these chunks that has allowed us to overcome some of the performance hits that you see when trying to take a traditional relational database and just make it much larger. So we haven't really yet gone down the path of trying to build custom index types. In fact, Postgres is great in supporting a whole host of them: B-trees, hash indexes, BRIN, GIN, and GiST, a whole set of indexes. We support them all, but we make them small and localized to data that is often accessed together, from both an insert and a query perspective.
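A minimal sketch of how this looks in practice, reusing the illustrative conditions hypertable: the index is declared once against the hypertable, and each chunk gets its own small, local copy, so the actively written portion stays small enough to live in memory.

```sql
-- Declared once on the hypertable; TimescaleDB propagates it to every
-- chunk, so each per-chunk index covers only a few minutes or hours
-- of data rather than the whole table:
CREATE INDEX ON conditions (device_id, time DESC);
```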
[00:33:32] Unknown:
And given the complexity of the problem space that you're trying to solve for, I'm wondering if you can talk a bit about what the most challenging aspects of building Timescale have been, both from the technical side and the marketing or promotional side.
[00:33:48] Unknown:
Let me start with the marketing side, and I'll let Mike answer the technical side. With marketing, I'd point to a couple of things. Number one, we're a pretty new database, right? We launched last April, but in those nine, ten months we've actually seen quite a bit of growth. We recently passed 100,000 downloads, and we're being deployed by real customers for real use cases in manufacturing, utilities, telecoms, and even finance. I think one challenge for us, which I suspect is a good problem to have, is that because we're so new, a lot of people don't realize how robust and advanced the product already is. Last week we announced, for example, that STE Energy in Italy, a renewable energy utility, recently replaced Redis with Timescale to back the operational dashboards they use to monitor 47 power plants in Italy, Albania, Colombia, Peru, and a variety of other countries. And I think that caught a few people off guard. They said, wow, how is an Italian utility, which you wouldn't think of as the earliest tech adopter, betting on a less-than-a-year-old database? Well, it's because Timescale, because of the approach we've taken and some of the engineering decisions we've made, is already quite robust. Despite being less than a year old, we've been deployed in production for months already.

So that's number one: essentially saying, hey, I know you're used to most databases taking years to get production ready, but we're not most databases. We've been production ready since, like, month three.
[00:35:35] Unknown:
Yeah, and part of that is because, again, we're not trying to build from scratch; we stand on the shoulders of giants. We stand on the shoulders of 20 years of Postgres development, which has put correctness and reliability
[00:35:48] Unknown:
and dependability at its core. I think the second challenge in terms of marketing and outreach has been that there's just a lot of noise in the database market right now. We've found that a lot of people are somewhat fast and loose with their marketing claims, and we try not to do that. But as a result, we find ourselves having to work harder to show how we're different from other options. One example of that is full SQL. A lot of databases, including time series databases, claim to be full SQL. But if you scratch below the surface, there's an asterisk. They'll say, oh, it's full SQL, but we don't support joins. Or it's full SQL, but we don't support window functions. In our minds, joins are a pretty fundamental part of SQL, and in fact, for a lot of our users, window functions are pretty critical as well. Or you can only order by time: we support full SQL, but only ordering by time, not by any other key. And for us it's tough, because they've created a lot of misinformation that we have to cut through and say, okay, I know they claimed they're full SQL, but they're actually not, and we actually support all these other things as well. We, in fact, support everything that Postgres supports, literally everything, which I think most would agree is full SQL. And we say, look, if you don't need joins, if you don't need window functions, if you're okay ordering by time but nothing else, then maybe that's good enough. But what we've found is that most people need the full spectrum of SQL, at least for future-proofing reasons: they wanna be reassured that if in the future they need to do a join, or they need to use geospatial data with something like PostGIS, they'll be able to do that with their database, which they can do with Timescale. I don't know, Mike, you wanna talk about some of the technical challenges? You know, I think the technical challenge goes back to what we view as
[00:37:46] Unknown:
our mission and what we expect people to rely on us for. In building a database, you're really a core part of a company's infrastructure, and they rely on you. Think about the web world, where Facebook was able to build great products because they moved fast and broke things. That might work for a social media platform, and they've done massively impressive engineering, but that is not the goal of a database. So I think our technical challenge is perhaps not surprising: we always need to keep an eye on the features we look to provide to our users, because they're asking for them, balanced against the fact that we need to do this with safety, security, and assurance foremost in our minds.
[00:38:36] Unknown:
And again, going back to the fact that there are so many other time series databases on the market, I'm wondering if there are any that you would call out as ones that you personally view as your closest competitors, or ones that people may want to move off of in favor of Timescale?
[00:38:55] Unknown:
We don't really think about other databases as competitors; we just think of them as other options that developers have for time series data. And there are a variety of options, and I can explain how we compare against them. One option is, if you have time series data but you don't have a lot of it, let's say you're collecting economic forecasts on a monthly basis, then you should just store that in a relational database. I'm not really sure you need a time series database.

On the flip side, if you're storing lots of time series data but your queries are fairly straightforward, maybe you're doing lookups or single-column rollups around the time dimension, and the data is not mission critical, then we would point you to some of the more NoSQL time series databases that perform well for that. But broadly speaking, in the world of time series databases, what we're finding is that until Timescale launched, every time series database, because the number one problem was scale, was effectively a NoSQL database. They essentially gave up the relational model for something that was more optimized for inserting data, but then gave up a lot of the query power, and in some cases reliability as well. The way we like to describe Timescale versus the entire time series database market is that we're the only time series database that's built on a relational database, which means a few things. Number one, we support full SQL. Number two, we support all the good stuff that you get with a relational database, including joins and secondary indexes.

Number three, because we're built on top of Postgres, we're actually quite reliable. And what we've found is that, despite being a few months old, we've had users come to us from older time series databases because they found that we're just more reliable, and we can't take all the credit for that; a lot of it comes from the Postgres community. And number four, because we look like Postgres, we essentially inherit the entire Postgres ecosystem. So whether you wanna use Tableau or Grafana, or any of the Postgres backup utilities, or even if you just have a question of "how do I do this in Timescale?",

we essentially have the broadest ecosystem and probably the most documentation of any time series database,
[00:41:30] Unknown:
despite the fact that we're relatively new on the scene. And to follow up on one thing that Ajay said at the beginning of this conversation: he talked about how we initially built an IoT platform, and we had tried using one of the existing NoSQL databases. Of course, we had all this metadata as well. We were collecting data about devices, so we had a separate Postgres database that stored metadata about the devices and where they were. And what we ended up doing, like many startups in this scenario, with their time series data siloed in a time series database and their relational metadata siloed in a relational database, is that we punted it to the application layer, and we actually had to write code in our microservices that joined across these two databases. In every REST API call, when you did something, you'd have to talk to both databases. And sometimes the databases were not consistent, because somebody had written data to the time series database about a device that had not yet been registered, and then you had to resolve that and deal with all that complexity. When we were talking to other companies, we found this as well: the data scientist wants to do some analysis, but they only have access to the time series database.

For them to merge across these databases meant that they had to talk to engineering, and somehow get engineering to modify the applications to support their new type of analysis. And that always held people back. So, as Ajay was saying, if all you wanna do is simple rollups, a NoSQL database might be useful; for example, if you wanna do a simple dashboard where you just show the latest data about a particular thing across time. But once you start seeing, oh, there's a problem here, let me figure out why there's a problem here, then you need to go to your metadata. You need to figure out: what are the conditions about this device, or about this environment, that caused this? Again, you can't get that directly from such a system; you have to build more complexity into the application. But because we support this whole thing, because it just looks like SQL, you can do both the performant dashboarding and analytical queries and also the ad hoc analysis.
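As a sketch of the kind of single-query join that the siloed setup could not do, assuming a hypothetical devices metadata table alongside the illustrative conditions hypertable (all names are assumptions, not from the interview):

```sql
-- Relational metadata and time series data live in one database, so a
-- single ad hoc query can join across them (schema is illustrative):
SELECT d.location,
       avg(c.temperature) AS avg_temp
FROM conditions c
JOIN devices d ON d.device_id = c.device_id
WHERE c.time > now() - INTERVAL '6 hours'
  AND d.firmware_version < '2.0'
GROUP BY d.location;
```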
And that's also where we saw ourselves as different from a lot of the products that were just looking to plot trend lines.
[00:43:52] Unknown:
Related to that, in terms of plotting trend lines, and also going back to your roots as a project for capturing machine data: one of the use cases that you call out on your website is systems metrics and monitoring in a server environment, which is particularly relevant to me because my primary role is as a DevOps engineer or sysadmin. So I'm wondering how Timescale fits into that overall ecosystem, and whether it can be used along with existing tools such as Graphite or Prometheus, or if it would largely replace them for storing your system metrics. And I know you already mentioned that you have a direct integration with Grafana, which is becoming one of the popular front ends for that use case.
[00:44:37] Unknown:
Yeah, we actually have direct integrations with both Grafana and Prometheus. The Grafana connector was developed initially as a Postgres connector, but the Prometheus connector is something we built. In fact, we wrote a Postgres extension, which we later open sourced, called pg_prometheus. It creates a Prometheus data type within Postgres and allows anyone to store Prometheus data directly in Postgres. And then, obviously, with Timescale you can scale that workload. At a high level, metrics is obviously a huge use case for time series data.

And for us, we're not trying to replace any of these solutions; we're essentially a complement. If you want a complement to Prometheus or Graphite that can store your metrics data in a way that's easily queryable using SQL, or using other tools, Timescale is a great resource for that. And as you already identified, we plug right into Grafana, so you can use this right away. In fact, a lot of people will often ask us, do you support this metrics format or that metrics format?

And one thing we've found is that typically the answer is yes, because typically someone has written a connector or a translator from that metrics format to Postgres. And again, anything that speaks to Postgres will speak to Timescale. So that's how we see ourselves fitting into that ecosystem.
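A rough sketch of the pg_prometheus workflow, based on the project's README around this time; the function name, column names, and sample format here are assumptions and may differ between versions:

```sql
-- Enable the extension and create a Prometheus-typed store
-- (create_prometheus_table is pg_prometheus's setup helper):
CREATE EXTENSION pg_prometheus;
SELECT create_prometheus_table('metrics');

-- Samples in the Prometheus exposition format can be inserted directly:
INSERT INTO metrics
VALUES ('cpu_usage{service="nginx",host="machine1"} 34.6 1494595898000');

-- ...and queried back with ordinary SQL:
SELECT time, name, value, labels
FROM metrics
WHERE name = 'cpu_usage';
```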
[00:46:02] Unknown:
And one thing I just realized that we forgot to touch on specifically: for somebody who wants to use Timescale in their existing Postgres database, is there a specific syntax that they need to add to their CREATE TABLE statements in order to enable the hypertable attributes for a given table?
[00:46:21] Unknown:
Yeah, in fact, it's quite easy to install Timescale in your existing Postgres instance. The first command, because we're an extension, is CREATE EXTENSION timescaledb. And the next thing, as you identified, is that when you're creating a table, you first define the table using the standard CREATE TABLE statement to define the schema. Right after that, you call our function, create_hypertable. create_hypertable takes a number of parameters, which we describe on our website for best use.

Essentially, you just have to tell Timescale: hey, I want you to treat this table as a hypertable, and then we do all the, for lack of a better word, magic behind the scenes. After that point, you write to the hypertable, you query the hypertable, and you create indexes and constraints and triggers, just as if it was a normal table.
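A minimal sketch of the steps just described, using the same illustrative conditions schema as earlier; the chunk_time_interval parameter is optional and shown only as an example of the tuning parameters mentioned:

```sql
-- Enable the extension in an existing Postgres database:
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Define the schema with a standard CREATE TABLE:
CREATE TABLE conditions (
    time        TIMESTAMPTZ      NOT NULL,
    device_id   TEXT             NOT NULL,
    temperature DOUBLE PRECISION
);

-- Convert it into a hypertable, partitioned on the time column:
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '1 day');
```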
[00:47:16] Unknown:
So after that one command, create_hypertable, you don't even have to change your code if you're migrating from Postgres; essentially, it will just work exactly the same. And does that function work properly against an existing table, or would you want to create a brand new table and then migrate your existing time series data into it if you already have it stored in the database? Yeah, currently it actually only works against an empty table. So you need to create a new table, call the create_hypertable command on it, and then migrate your data. In the future we'll consider possibly converting existing tables, but this is relatively easy: you can do it in one command, inserting into the new table with a select from the old table.
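A sketch of that migration path, assuming the illustrative conditions table; the new-table name is hypothetical:

```sql
-- Create an empty table with the same schema, convert it, then copy:
CREATE TABLE conditions_new (LIKE conditions
                             INCLUDING DEFAULTS
                             INCLUDING CONSTRAINTS
                             INCLUDING INDEXES);
SELECT create_hypertable('conditions_new', 'time');
INSERT INTO conditions_new SELECT * FROM conditions;
```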
We've also written, and open sourced on our GitHub, a parallel copy command. COPY, although fast, is actually single threaded, and with our parallel copy we've had people reach, in one case, almost a million rows per second at an insert rate. So you can really load up a hypertable quite quickly. And what are some of the most interesting
[00:48:18] Unknown:
or unexpected uses of Timescale that you have seen? It's quite interesting.
[00:48:24] Unknown:
Our roots for Timescale were in machine data, because we came out of an IoT platform company. So we initially thought the main usage of Timescale would be IoT, whether old-line industrial IoT, like manufacturing machinery or utilities, or maybe even some new-line IoT, like the smart home or any of these things that you see on Kickstarter. And then we thought, okay, of course, we're a time series database, so maybe we'll see some workloads that you typically see with a time series database, like metrics and maybe finance. What we've found is that we've actually seen a lot more uses than just those areas.
Early on, we saw someone migrate to Timescale from Hadoop. This wasn't a petabyte-size Hadoop cluster, it was something in the tens of terabytes, but even then, they moved it to a single Timescale node. That caught us off guard, because then we're like, wait, are we a data warehouse? And then we saw other users deploying us in ways we didn't expect, for example, with geospatial data. When we started, we'd heard of PostGIS, but we hadn't really thought about geospatial temporal data. It turns out Timescale and PostGIS are quite compatible: if you're tracking an asset over time, you have both geospatial data and time series data, and we've seen a lot of usage of Timescale in that area (see the sketch after this answer). Looking back, I think what we saw in that first year is that time series data used to be a niche. It was initially a niche within the financial community, which is where a database like kdb+ came out of. Later, it became a niche within the DevOps community and metrics, which is where arguably every other time series database has come from. What we believe we're seeing today is that time series data is emerging in more and more places. Part of this is obviously IoT, but part of it is also something we call the evolution of data resolution, which just means that as storage gets cheaper and data processing tools become more powerful, you're naturally going to store data at higher and higher resolutions.
In the past, you may have stored the latest balances of people's bank accounts. Today, you'll store the transactions that affect that final balance. In the future, you're going to store every possible interaction with the user that helps predict future behavior. The further along that spectrum you go, the more time series data you're capturing. As a result, time series data is evolving from a niche into something that essentially all data is evolving into, and we strongly believe that over time, more and more data will become time series data.
And I think that's one reason why, if you look at the DB-Engines stats, time series databases have been the fastest growing category of databases over the past 24 months. As what we believe is the only time series database that supports relational workloads at time series scale, I think we're really well positioned to help developers with those sorts of problems.
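Since the PostGIS combination came up above, here is a hedged sketch of what geospatial-temporal tracking can look like; the asset table, column names, and coordinates are invented for illustration:

```sql
-- PostGIS and TimescaleDB are independent Postgres extensions that can
-- coexist in the same database.
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE asset_locations (
    time     TIMESTAMPTZ NOT NULL,
    asset_id TEXT        NOT NULL,
    location GEOMETRY(POINT, 4326)
);
SELECT create_hypertable('asset_locations', 'time');

-- Each row carries both a timestamp and a geospatial point.
INSERT INTO asset_locations
VALUES (now(), 'truck-42', ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326));
```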
[00:51:48] Unknown:
And moving now to the business aspects of Timescale, I'm wondering first whether you had always intended for the code base to be released as open source, or what your reasoning was behind that decision, and also what the business model is for the company to be able to support the future growth and health of both the business and the project?
[00:52:13] Unknown:
So Timescale, again, was initially just a product we built for ourselves. It was a database that we needed, so we built it. Then we found that other people needed it, too. So the question was, if other people need the database, how do we offer it to them? In our minds, there wasn't really a question: if we wanted to use a database today, we would only use an open source database. So out of the gate, we said, if we want to offer this, it needs to be open source. We spent six months open sourcing it, and effectively we became an open source database company. Then the question is, as an open source database company, how does one support oneself such that you can continue to build the database and build a sustainable business around it? To answer that, we looked to other companies who have done this successfully before: Mongo, which IPO'd recently; Elastic, which we hear is doing quite well; even some startups that aren't quite at the IPO level but are doing quite well, like Databricks with Apache Spark and Confluent with Kafka. What we're learning is that the sustainable business model is essentially an open core model, where the vast majority of your code base is open source for the community, but there are some features that large enterprises need when deploying your database in production, typically around operational and convenience use cases. Those companies are more than willing to pay if you can solve those problems, because you're typically solving mission critical use cases. So one thing we're in the process of evaluating now is what the enterprise version of Timescale looks like. We already have some features that customers are using and paying for, but that's more on an ad hoc basis.
What I expect is that in the next year or two, we'll have a more coherent story on what the enterprise model looks like. That being said, our main focus is the open source community. We think the only way this business survives is if we maintain and allow the open source product to flourish, so the last thing we want to do is hamper the open source product. But that said, I think it's in everyone's best interest if Timescale is self sustaining, so we are thinking about how to build that self sustaining model.
[00:54:46] Unknown:
Talking about the future work that you have in store for Timescale, I'm wondering if there are any particular features or improvements that you have planned for upcoming releases?
[00:54:50] Unknown:
Probably the biggest project we're working on, in terms of improving the capabilities of Timescale, is scale out clustering for Timescale primaries, so that you're not limited to a single node primary with multiple read-only replicas. Historically, people call this sharding. I sometimes stay away from that term because, in some sense, even on a single node we do this microsharding, where we could have tens of thousands of these little shards on a single node, which is how we get our performance. The idea is that, in order to gain more capacity, you just keep rolling out new servers. That's probably the largest engineering task we have over the next year. Thankfully, we have a very strong engineering team at the core of our company. A lot of us knew each other previously; many were my former PhD students and postdocs from Princeton and have worked on scalable consistency models and scalable transactional systems for many years. So technically we understand what the design is, and part of it is just the engineering effort to deliver the highly reliable product that one expects from a core database.
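For context on the single-node "microsharding" described here: those little shards correspond to TimescaleDB chunks, which are created per time interval and, optionally, per space partition. A hedged sketch, where the table, the device_id column, and the parameter values are illustrative choices rather than the project's recommendations:

```sql
-- Partition a hypertable on time and additionally on a space dimension,
-- yielding many small chunks even on a single node.
SELECT create_hypertable(
    'conditions', 'time',
    'device_id', 4,                          -- space partitioning column and partition count
    chunk_time_interval => INTERVAL '1 day'  -- one chunk per day per space partition
);
```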
[00:55:59] Unknown:
Is there any particular help that people in the community can offer to help the Timescale database itself improve and grow in the future?
[00:56:12] Unknown:
Definitely. With the database so far, I think we've had more growth than we expected, but we still fully recognize that there's a lot more room to grow, and we still feel like we're just getting started. We welcome and encourage any kind of community feedback and help, whether it's new features we could add or functionality that could be made smoother. Obviously, if there are any bugs or issues that people hit, we'd love to hear about those so we can rapidly rectify them. But also, more broadly, any feedback on how Timescale could solve someone's problem better, we want to hear that. If you look at Timescale today and say, hey, it's 80% there, but if it had this extra 20% then we could totally use it, we want to hear that, and we will honestly take it to heart, because we fully recognize that it's with our community's help that we'll together solve this problem and build the best time series database.
[00:57:17] Unknown:
Are there any topics that you think we should cover before we start to close out the show that we didn't already discuss?
[00:57:23] Unknown:
So, in terms of the business model, if you're using a database for a mission critical system, you want to make sure that database will be around and supported for years. So often what companies will ask us is, how are you supported today as an open source business? We're happy to say that we have some very strong investors backing and supporting us. We just announced last week that we've raised $16 million from a few firms and more than a dozen angels. Those firms include NEA, New Enterprise Associates; their firm has been around for a long time, and they've invested in quite a few open source, data related projects, including Databricks, Mongo, Elastic, MapD, and now us.
Our other primary investor is Benchmark, which obviously also has a lot of experience in the space; in particular, they've invested in Hortonworks, Confluent, the Kafka company, and Docker as well. The partners involved are Forrest Baskett from NEA and Peter Fenton from Benchmark, and I think they're probably two of the best, if not the best, investors in the industry, and we're so glad to have them. We also have Two Sigma Ventures backing us, which is the venture arm of a hedge fund here in New York, one of the leading quantitatively oriented hedge funds. They were actually the only investor we spoke to that had already built their own time series database.
Obviously, for their own needs, they needed one, so they totally got the need for a new time series database, and it's been great to have them on board. Our angels include the founders and CEOs of Hortonworks, CockroachDB, and Cloudflare, as well as of older data companies that have exited, including Data Domain, a data storage company that EMC acquired for $2.4 billion, and Nicira, a software defined networking company that was acquired by VMware for over $1 billion, plus some leading folks in the ad tech space, including the founders of Right Media and Moat here in New York. So we have a very strong set of people supporting us, and I think because of who we have on board and because of our momentum, we'll be well positioned for years to come to grow this company and hire the best talent. Is there anything else that we should talk about? Of course, if you have any questions after the fact, you can email us and we're happy to fill that in.
[00:59:56] Unknown:
And so for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as one final question, to give the listeners one more thing to think about, I'll have you each answer from your perspective:
[01:00:07] Unknown:
what is the biggest gap in the tooling or technology that's available for data management today? As far as data is concerned, we're in a very exciting time right now. We're at a time where data is quickly becoming the most valuable resource for every business in every industry, and as a result, more and more businesses need more and more types of data infrastructure. But a side effect of all this is that data infrastructure challenges have gotten more and more specialized. So, as a result, maybe in the past you had one database and that was it.
Today, you essentially have this potpourri of data infrastructure components that are plugged together, and it's that glue where things get challenging. If you're connecting Timescale to Kafka, to Spark, to maybe a data warehouse, maybe to something else, you can see how your infrastructure slowly gets more and more unwieldy. I think that's a huge challenge today. We think the solution is actually in leveraging SQL, which is maybe somewhat heretical in the data community. SQL is old, and people had essentially moved away from it in recent years to NoSQL or bespoke options, but I think SQL is really poised to make a comeback. In particular, if you're going to have this fragmentation of data infrastructure components, you need a glue, a common language that links it all together.
And we strongly believe that SQL is that glue, which is obviously another reason why we're so bullish as a SQL time series database. I think that's it, actually, unless Mike has one.
[01:01:57] Unknown:
Okay.
[01:01:58] Unknown:
I had something along those lines, but I think that's a good way to end it.
[01:02:02] Unknown:
Fair enough. Alright, well, I really appreciate both of you taking the time out of your day to talk to me and discuss the work that you're doing at Timescale. It's definitely a very interesting project, one that I've been following closely since I first heard about it a few months ago, and one that I'm hoping to make part of my infrastructure for a metrics pipeline that I'm planning to build out. So again, thank you for your time and the work that you're doing, and I hope you enjoy the rest of your day. And thank you for having us. Thanks for having us.
Introduction and Announcements
Guest Introduction: Ajay Kulkarni and Mike Freedman
Background and History of Timescale DB
Challenges and Solutions in Time Series Data Management
Technical Implementation of Timescale DB
Deployment and Scalability
Marketing and Technical Challenges
Competitors and Market Position
Use Cases and Integrations
Business Model and Open Source Strategy
Future Work and Community Involvement
Closing Thoughts and Contact Information