Summary
The most interesting and challenging bugs always happen in production, but recreating them is difficult because the data you have outside of production never quite matches. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
- Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
- Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Tonic is and the story behind it?
- What are the core problems that you are trying to solve?
- What are some of the ways that fake or obfuscated data is used in development and analytics workflows?
- Challenges of reliably subsetting data
- Impact of ORMs and the bad habits developers get into with database modeling
- Can you describe how Tonic is implemented?
- What are the units of composition that you are building to allow for evolution and expansion of your product?
- How have the design and goals of the platform evolved since you started working on it?
- Can you describe some of the different workflows that customers build on top of your various tools?
- What are the most interesting, innovative, or unexpected ways that you have seen Tonic used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic?
- When is Tonic the wrong choice?
- What do you have planned for the future of Tonic?
Contact Info
- @AdamKamor on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Tonic
- Django
- Ruby on Rails
- C#
- Entity Framework
- PostgreSQL
- MySQL
- Oracle DB
- MongoDB
- Parquet
- Databricks
- Mockaroo
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
- Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg) The evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. D&A teams are now in a better position than ever to help lead this change within the organization. Harnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands. Go to [dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda) to find out more. Listeners can save $375 off standard rates with promo code GARTNERDA.
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies, and lead with purpose. Join your peers at Gartner Data and Analytics Summit from March 20th to 22nd in Orlando, Florida for 3 days of expert guidance, peer networking, and collaboration. Listeners can save $375 off of standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda to find out more. Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud native streaming database, not simply a database that connects to streaming systems. With a Postgres compatible interface, you can now work with real time data using ANSI SQL, including the ability to perform multi way complex joins, which support stream to stream, stream to table, table to table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Your host is Tobias Macey. And today, I'm interviewing Adam Kamor about Tonic, a service for generating datasets that are safe for development, analytics, and machine learning. So, Adam, can you start by introducing yourself?
[00:01:34] Unknown:
Absolutely. And by the way, thanks for having me. Really excited to be here today. So my name is Adam Kamor, and I'm one of the cofounders and the head of engineering at Tonic.ai, the fake data company.
[00:01:44] Unknown:
And do you remember how you first got started working in data?
[00:01:47] Unknown:
I feel like I've always been in data. At heart, I like numbers and, you know, I'm kinda nerdy about it. But how did I first start, like, you know, literally working in data? I guess, after grad school, I took a job in Seattle. So I'm originally from Atlanta. And I took a job in Seattle working at Microsoft. And, you know, it's a very data driven culture there. So, I think that's, like, you know, kind of working in data. But when I really got started, I took a job after a few years working at Tableau, a data visualization company. And their motto, I think, is to help people see and understand their data. And that's a very, you know, data focused company. So I think maybe that was my first real foray into it, you know, professionally.
[00:02:26] Unknown:
And so in terms of what you're building at Tonic, can you give a bit of an overview about what you're doing there and some of the story behind how it got started and why this is where you decided to spend your time and energy?
[00:02:36] Unknown:
Yeah. Absolutely. So there's a few questions there. We got started with Tonic because myself and my cofounders, we all kinda knew we wanted to do a startup, and we knew we wanted to do it together. We tried a bunch of silly ideas that were, I think, kind of harebrained and not very good. You know, we were all chatting one day and we kind of, you know, had this realization that we all work in data. Like, we're all very data focused people working at companies where, you know, data is a core part of their mission, or their product, or how they're helping their customers. So, we all had a lot of experience working at companies like this. So, we started thinking about data specific ideas. And then, we ended up with Tonic because Tonic solves a problem that we have all had in our professional lives, you know, prior to starting Tonic. And I think you hear that a lot from startup founders. Like, oh, well, this is a problem I'd had before and I wanted to solve it. And I think that's a pretty good way to do a startup. And it really is a problem we've all had. Like, so as an example for me working at Tableau, you know, customers have workbooks. You know, workbooks are basically like a unit of work at Tableau using their product, and customers would occasionally have problems with these workbooks.
They would need help. Like, maybe they'd be reporting a bug or they would just have a technical question, and we'd say, okay, well, send us the workbook. Well, I can't, because it's connected to a database and you can't access that data. You know, like, there's a firewall, or you don't have permission to, or contractually, we can't show it to you, etcetera. Right? So that was my, you know, experience. My cofounders, you know, who had worked at various other companies, had similar experiences related to not being able to access customer data. And it kind of, you know, clicked. Okay. Well, if we could enable individuals to kind of, you know, access a substitute for the real data, that would probably be useful. That's kind of how we landed on Tonic. Initially, we weren't really sure who our customer was. Was our customer a data scientist? Was it a BI analyst? Or was it a software engineer or tester?
We ultimately ended up going with the software developer and testers. So, you know, they have these production databases. Right? Like, you know, if you're building an application, almost certainly there is a database that backs that application. Right? Like any stateful application. It could be, you know, like the simplest thing would be, I have a CRUD app and users can log in. The logins are stored in the database, but oftentimes the database has a lot more than just logins. So, we know just from our experience that, you know, privacy is becoming more and more of a concern, and companies are kind of starting to lock down production data more and more and developers can't access it. And the options that software developers had at the time for what they do when they can't access production data for developing and testing were not great. You know, option 1 is ignore the rules and regulations and access prod anyways.
A lot of companies 5, 10 years ago were still doing that. There's high risks there though. Option 2 was to kind of like handcraft a database that goes in your staging or dev environments, where you've kind of like manually inserted some number of rows so you can kind of exercise flows in the application. That also kind of sucks because, you know, okay, your production database has, you know, a billion rows across all your tables, your staging database has 30. Right? Like, you're not gonna, like, pick up on all of the bugs and issues that are happening in production if your staging or dev databases are, like, so dissimilar from production in terms of, like, the plethora of combinations available across all the columns in your tables or the scale of that data. Right? I write a new query to power a view in the front end UI.
How do I know that query is actually performing at scale if my table that I'm querying against has 5 rows in it? Like, yeah, you can run an EXPLAIN query, but, you know, those can be nuanced and tricky to read for a lot of people. Like, the best thing you can do is just, you know, run it against data of the right size and see if the UI loads in an appropriate amount of time. Right? So, you know, there's these options that developers have, and we didn't think any of them were great. Like, we wanted to give developers a staging or development database that looks and feels just like production, same complexity, same scale. Right? Obviously, the same schema and structure, that's brass tacks right there. But we wanted it, of course, to be devoid of sensitive information.
That's in essence what Tonic does. You give us a production database, out the other end comes a new database that is identical to production in every way, but all of the sensitive data is replaced with fake but realistic looking data. And then you can use that in your lower environments for testing and development.
[00:06:48] Unknown:
As far as the kind of testing and development environment, being able to deal with production data in those cases, that's a problem that has existed for effectively since we've had production and nonproduction environments and Yes. Something that people have tried to solve in myriad ways. And I'm curious what it is about this particular problem space that makes it such a challenging problem to solve and challenging to solve it in a way that it stays solved.
[00:07:13] Unknown:
Why is it so challenging? Let's say I have done this before. Before starting Tonic, I have worked at companies where we needed, you know, de-identified or fake databases that had the same amount of complexity as production, let's say. And, you know, complexity is both, like, you know, the number of combinations across all the columns and also the scale. So let's say you wanna go build a Python script to do this, to create, like, your fake database. Right? Like, okay, I'm gonna write some Python. I'll programmatically insert rows. I can obviously insert a lot of rows really quickly because I'm gonna write it in code. Let's do this. Right? Okay. So you start off on day 1 and you start with your users table. That's normally, like, a pretty central table in your database. Right? It's like, okay. Well, my users table has first name, last name, email, the date they signed up, etcetera. Okay. I can write a Python script to go do this. Right? Like, I can create random dates between a min and a max. I can go find a static list of names and kind of sample from that at random. Oh, the primary key is an integer. I'll just make it auto incrementing, etcetera. You do all that and you're feeling good. Okay. Well, I've done that. Now, there's other tables that I got to go deal with. And these tables, and there can literally be tens or hundreds of them, are going to have foreign keys back to the users table, or the users table has foreign keys to them. And that's kind of like the first set of dependencies. But then there's going to be secondary, tertiary, and further dependencies of tables that reference those tables, and tables that reference those tables and those tables. And you get this huge complex web of tables and dependencies that you have to deal with. You know, understanding, okay, this table references that one. In this table, I put these primary keys. Therefore, the foreign key column that references it needs to sample only from that primary key column. Oh, but now it's a compound primary key. I'll go deal with that. And it gets really complex.
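To make the shape of that DIY approach concrete, here is a minimal sketch of the kind of Python script being described: it fills a hypothetical `users` table with random names and dates, then populates a child `orders` table by sampling from the generated primary keys so the foreign keys stay valid. The table and column names are invented for illustration, not taken from any real schema.

```python
import random
from datetime import date, timedelta

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan"]      # static lists to sample from
LAST_NAMES = ["Smith", "Jones", "Nguyen", "Patel"]

def random_date(start: date, end: date) -> date:
    """Pick a random date between a min and a max."""
    return start + timedelta(days=random.randrange((end - start).days))

def fake_users(n: int) -> list[dict]:
    users = []
    for pk in range(1, n + 1):                      # auto-incrementing primary key
        users.append({
            "id": pk,
            "first_name": random.choice(FIRST_NAMES),
            "last_name": random.choice(LAST_NAMES),
            "email": f"user{pk}@example.com",
            "signed_up": random_date(date(2020, 1, 1), date(2023, 1, 1)),
        })
    return users

def fake_orders(n: int, users: list[dict]) -> list[dict]:
    user_ids = [u["id"] for u in users]             # foreign keys must sample from real PKs
    return [
        {"id": pk, "user_id": random.choice(user_ids),
         "total_cents": random.randrange(100, 100_000)}
        for pk in range(1, n + 1)
    ]

users = fake_users(1_000)
orders = fake_orders(10_000, users)
```

Every additional dependent table needs another function like `fake_orders`, and none of this captures the hidden business rules the application enforces, which is the wall described next.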
And that's just to create a database that has the referential integrity you need. But it's way worse than that. Because applications make assumptions about the database structure that aren't encoded only in the foreign keys. You know, foreign keys are typically easy to satisfy because you at least know them and they're explicit. And you can write code that will satisfy them. But there's these hidden constraints in the system that you have to then deal with. Like, the application makes the assumption that if the value in this column is A, the value in this other column, in this different table, must be C or D. It can never be F or G. If it's F or G, the application is going to throw an exception. Because that is just not a state a user can get into. Right? And there's going to be hundreds or thousands of these hidden constraints in, you know, any application of even, you know, medium complexity.
Right? So you end up writing a Python script that needs to have essentially the same business logic as your application, just so it can insert data that your application will understand and be able to deal with. That very quickly becomes not tenable. Right? Like, you know, hundreds of developers spending hundreds of thousands of man hours or people hours that have developed this application. 1 person is not going to go create a new application that mimics all of that complexity, so that you can insert test data. It's just not feasible. And that's the approach a lot of people take. And, you know, the first few days things are going well, and then it kinda dawns on them the complexity of their database, which really is meant to reflect the real world. Right?
So it's just not doable. And I'll say that, like, I think that the number 1 competitor for Tonic is typically customers trying to DIY this themselves. That's our biggest competitor. And companies that try to go down that route very often come back to us after they have kinda tried. Or the companies that we work with have these scripts they run, but they work so poorly. It's very difficult and complex to maintain them, and their output is so lackluster that moving to Tonic is a big boon for their, you know, development and test teams.
[00:10:53] Unknown:
I'm also curious, in your opinion, why tools like Django and Rails that have these built in ORMs, where it's, you know, incorporated into the application development experience, there are ubiquitous libraries for fake data generation, they understand all the different correlations of models, why it is that projects like that don't have some sort of built in capability for being able to generate fake data and test data.
[00:11:19] Unknown:
That's an interesting idea. And, like, ORM tools certainly understand the dependencies between your tables. Right? Like, they know the foreign keys. They know default values. They know the check constraints if they exist in the system, etcetera. And they could likely generate data to satisfy all of those constraints. But you still have the issue of the hidden business constraints that I was talking about earlier. You know, the assumptions the application makes about the database that aren't enshrined in the constraints of the database system. They would fail at that. Of course, that doesn't mean ORMs aren't great. They are great. Of course, like, you know, our application happens to be written in C#, and we're using Entity Framework as our ORM for, you know, our own back end systems. It's lovely. But I don't reasonably see how Entity Framework or any other ORM could solve that, you know, hidden constraint issue that I'm referring to.
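As a hedged illustration of that gap, here is a hypothetical pair of Django models invented for this example: the ORM can see the foreign key, the default, and the check constraint, so a schema-driven generator could plausibly satisfy them, but the rule enforced in `save()` lives only in application code and is invisible to anything that reads the schema alone.

```python
from django.db import models

class Account(models.Model):
    # Visible to the ORM: types, a default, and a declared check constraint.
    status = models.CharField(max_length=10, default="active")
    balance_cents = models.IntegerField()

    class Meta:
        constraints = [
            models.CheckConstraint(
                check=models.Q(balance_cents__gte=0), name="non_negative_balance"
            )
        ]

class Invoice(models.Model):
    # Visible to the ORM: the foreign key back to Account.
    account = models.ForeignKey(Account, on_delete=models.CASCADE)
    state = models.CharField(max_length=10)

    def save(self, *args, **kwargs):
        # Invisible to the ORM and the schema: a business rule the application assumes.
        # A generator that only satisfies the declared constraints can still produce
        # rows that make the application throw.
        if self.account.status == "closed" and self.state != "void":
            raise ValueError("closed accounts may only carry void invoices")
        super().save(*args, **kwargs)
```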
[00:12:10] Unknown:
In terms of the core target use case that you're focusing on, you said that you're aiming at developers, but this obviously has some applicability to analytical use cases and potentially kind of machine learning use cases. I'm curious what you see as some of the variance in terms of the maybe scale and type of data that people are trying to obfuscate and some of the specific requirements or constraints that they have to apply in bringing that data to a non production environment or simulating data to then be able to realistically represent what they're trying to do in production? Right. There's a tremendous amount to unpack there. It is definitely true. Our bread and butter, our primary use case, where we spend
[00:12:52] Unknown:
90 plus percent of our focus is on de-identifying production databases for development and test teams. With that being said, we have a growing use case for data that's not for developers and testers, but is instead for these analytics use cases. And there's typically, like, 2 types of consumers of this analytical data. It's going to be, like, your normal data analyst who's using Power BI or Tableau or Looker or any of those tools to kind of analyze the company's data and understand what's happening. And then there are, like, the data science and ML teams, who are likely doing similar things but using very different techniques for how they accomplish that. Right? So both of these teams, I think, typically will use, you know, the data warehouse of the organization to kind of, you know, get their job done and answer their questions.
So there's key things that are different, like when we de-identify a data warehouse versus an application database. The first is the scale. Data warehouses are typically larger because they include most of the data of, you know, the production database or the application database, and then a lot of other data. Right? So scale is 1 thing. Luckily, most people that are using data warehouses are typically using technologies like BigQuery, Redshift, Snowflake, Databricks, or they're just, you know, storing Parquet files in S3, etcetera. These types of, like, storage and database technologies really lend themselves to very performant, you know, loading, unloading, and transforming of data. So, even though the scale of this data is a lot more, the tools that are being used are up to the task, and Tonic is able to kind of scale in a way that can support, you know, really any scale. Like, as an example, you know, we have customers that process 50 terabytes a day, 50 terabytes a day of data, with Tonic. You know, we have other customers with, you know, Snowflake tables that are tens of terabytes in size. And Tonic is able to scale with all of this and, you know, normally perform well. There are exceptions where, you know, things can get nuanced and tricky, but in general, it's fine.
The other key difference is that the types of data transformations you care about when you're de-identifying an application database versus a data warehouse are very different. So in an application database, it's really about generating data that satisfies those constraints, like the ones that we were talking about earlier. Like, you want the data to look realistic at first glance, but primarily, it needs to satisfy those constraints. So the application can work and you can test everything that you wanna test. In an analytics use case, it's more about preserving the quality of any, like, statistical analysis you're going to do. So, you typically care a lot more about, like, statistical relationships between columns.
So, the tools for generating that type of data are very different than for application databases. And in Tonic, that really means, like, what data transformations we have or what data transformations are used will be different depending on your use case and scenario. So, for example, like, a really common transformation when you're doing application testing and development would be, okay, replacing names with fake names. Kind of a trivial example, but it's popular. Right? In a data analytics use case, you likely don't need to replace names with fake names because, like, the name of an individual doesn't really affect an analysis unless you're doing some type of weird analysis. Like, how does the length of a name affect how much money they spend? Right? That's, like, kind of contrived and weird, but you typically don't care what the actual name is. What you do care about, though, is that you can kind of correlate records in other tables with entities like in the user table. So, I don't need to replace names with fake names. I need to replace names with tokens, where a given name always gets the same token and 2 different names don't get the same token ever, so there's no collisions.
Right? And in that way, like, yeah, I don't know the names anymore, so the names are safe, but I can still, like, do correlations and joins across tables and things like that. Right? So, like, that type of transformation is really useful in the analytics use case where, essentially, you just wanna kinda, like, uniquely and consistently tokenize sensitive data while keeping it useful for the analysis that one's doing. To really drive that home, I'll give a good example. Let's say you want to know the maximum amount of money spent in a given state, but, you know, state needs to be de-identified or hidden in some way. Well, if you apply that tokenization strategy I was talking about, you can still find the maximum spent in a state. That's a bit contrived. I'm struggling to think of, like, an immediate example just off the top of my head. It's kind of early in the morning for me and, you know, it's always hard to think on your feet. But that would be an example. Like, there's a lot of different SQL queries or, you know, questions that one can ask of the data that are going to hold true when specific columns are tokenized. And that's kind of where we're going with that approach.
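A minimal sketch of that kind of consistent tokenization, assuming a keyed hash is acceptable for the threat model: the same input always maps to the same token, distinct inputs effectively never collide, and joins or group-bys on the tokenized column still work. This is a generic illustration, not Tonic's actual transformer.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-me-out-of-the-warehouse"  # assumption: held in a secrets manager

def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically map a sensitive value to an opaque token.

    The same name always yields the same token, so joins and GROUP BYs
    across tables are preserved, but the original value is not exposed.
    Normalizing case/whitespace first is a design choice, so near-duplicate
    spellings collapse to one entity.
    """
    normalized = value.lower().strip().encode("utf-8")
    return "tok_" + hmac.new(key, normalized, hashlib.sha256).hexdigest()

# Both tables end up carrying the same token for "Jane Doe", so questions like
# "max(amount) grouped by the tokenized state column" still hold.
assert tokenize("Jane Doe") == tokenize("jane doe ")
assert tokenize("Jane Doe") != tokenize("John Doe")
```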
What we're starting to see is, you know, customers essentially asking us to de-identify data warehouses for analytics use cases. But then there's this whole other use case that we've gone into. Right? Like, that approach there is really good if you just want to give more widespread access to your data warehouse to, like, analysts, data scientists, even to engineers. We have a new tool that we've just come out with, though. It's called Djinn. It's spelled djinn, which in some cultures, specifically Arabic, I think, is kind of representative of a genie.
And, of course, it's a nice play on gin and tonic. Right? Our Djinn offering is actually purpose built for data scientists that are doing that same type of, like, they're still analyzing data. But in Djinn, it's all about data synthesis. So instead of de-identifying data, we actually just look at the dataset that you're interested in, train a machine learning model on that data, which is then used generatively to produce additional rows of purely synthetic data. So data scientists, they no longer need to access the real data or the de-identified data at all. They can just point Djinn at their data. Djinn will train a model. They can then programmatically invoke that model to generate synthetic rows of data, and then they can use those synthetic rows of data in their analysis. It's really useful for privacy use cases where they can no longer access production data in their dev or staging environments.
And it's also useful for data augmentation use cases. Like, let's say I'm a data scientist, and I'm trying to train a model for churning customers. So I want to know when my customers are going to churn so I can do something about it. Right? I'm gonna train a model based on, like, past examples of churn at my company, but I don't have a lot of churn. Right? But I need to train this model to predict who's going to churn. Let's say you're using, like, a logistic regression for doing this. Right? And you're getting a yes or no on whether a customer is going to churn. Those models are going to perform poorly if they're working with highly imbalanced data, meaning most of your customer examples don't churn. So you can use Djinn to actually synthesize new examples of churning, augment them into your data, and then you can train your logistic regression on a more balanced dataset to get a higher quality model. So Djinn is this new offering, like I said, and it's primarily meant for these data scientists.
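To make the churn example concrete, here is a sketch of the augmentation step with scikit-learn, using simple random oversampling of the minority class as a stand-in for the synthetic rows a generative tool like Djinn would produce. The dataset and feature meanings are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy, highly imbalanced churn data: only a few percent of customers churn.
X = rng.normal(size=(2_000, 4))                       # e.g. usage, tenure, spend, tickets
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=2.0, size=2_000) > 3.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Augment the minority (churn) class until the training set is balanced.
# A generative model would synthesize genuinely new rows here; we resample as a placeholder.
churn_idx = np.flatnonzero(y_train == 1)
need = (y_train == 0).sum() - churn_idx.size
extra = rng.choice(churn_idx, size=need, replace=True)
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])

model = LogisticRegression(max_iter=1_000).fit(X_bal, y_bal)
print("test accuracy:", model.score(X_test, y_test))
```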
[00:19:28] Unknown:
Another aspect of what makes this whole problem space challenging is that people don't always model their data properly. They don't necessarily add foreign keys in the database where they're actually implying foreign keys, particularly if they're using an ORM and not doing it quite properly. Or there might be implicit or business rule linkages between data that is completely invisible within the database, and so you have to just be able to, like, look through the code to figure out, oh, this is how these 2 things are actually tied together. A 100%. I'm wondering how you deal with some of those kind of implicit joinings of data and being able to ensure that you can actually represent them in non production environments Yes. Or some of the ways that using something like Tonic helps to actually surface those implicit joins.
[00:20:17] Unknown:
That's right. I think there's 2 issues that you get to tackle here. 1 is the discovery portion. Like, you need to understand where are the foreign keys that don't exist in the database but are assumed by the application. And then 2 is, okay, you have them. What transformations do we apply if they need to be de-identified? Right? The discovery portion is tricky. It's tricky because, like, it is very difficult to write an algorithm that is going to be able to find a foreign key relationship between 2 tables if it's not already encoded in the database system itself.
And the reason for that is you'll just get too many false positives. Right? Like, imagine 2 random tables, both with integer values. Maybe both columns are, like, auto incrementing. Right? Like, it's very easy to think there's a foreign key relationship there just because all of the integers in 1 column are found in this other column. Right? But that can also just be chance. Right? Imagine 2 zip code columns. There likely will be, like, you know, 1 set contained in the other. So it's like, oh, it's a foreign key, but no, it's not. And also, like, how that scales is very difficult. So, like, a, you're gonna get a lot of false positives. B, that's a very expensive computation to run, to see if all the values in 1 column are contained in another column. And then imagine, you know, an application database with a thousand tables. I mean, that's tens of thousands of columns. You know, doing that pairwise across all these columns is really just not tenable.
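For intuition, here is roughly what that naive detection looks like: a containment check over every ordered column pair. Even on a toy schema it happily flags the zip code pair as a "foreign key", and the cost grows with the number of column pairs times the number of values, which is what makes it impractical at catalog scale. All data here is made up.

```python
from itertools import permutations

# Toy column inventory: (table, column) -> distinct values.
columns = {
    ("users", "id"): set(range(1, 1001)),
    ("orders", "user_id"): set(range(1, 901)),                 # a real foreign key
    ("users", "zip"): {"30301", "98101", "10001"},
    ("stores", "zip"): {"30301", "98101", "10001", "60601"},   # not a FK, just overlapping values
}

def candidate_foreign_keys(columns):
    """Flag (child, parent) pairs where every child value appears in the parent.

    This is O(pairs * values) and produces false positives whenever one column's
    values happen to be a subset of another's (zip codes, enums, auto-incrementing ids).
    """
    hits = []
    for child, parent in permutations(columns, 2):
        if columns[child] and columns[child] <= columns[parent]:
            hits.append((child, parent))
    return hits

for child, parent in candidate_foreign_keys(columns):
    print(f"{child} might reference {parent}")
```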
So on the discovery portion, it is typically a somewhat manual process that can be made a lot easier by some type of reasonable naming convention for your columns. That's really, like, the saving grace. If your columns have reasonable names, then it is much easier to kind of identify these, what I'm going to call, virtual foreign keys. In Tonic, foreign keys play a crucial role in 2 ways. Way number 1 is that if there is a foreign key, then you need to make sure that any transformation applied on the primary key column gets applied appropriately to the foreign key column.
So that, you know, you'll maintain referential integrity in your output data set. And, of course, that transformation has to have certain requirements on it. 1, well, if you're applying a transformation on a primary key column, it better maintain uniqueness. Right? Because primary keys are always unique. 2, that transformation must be deterministic. So that when I go apply the same transformation on my foreign key columns, my foreign keys will get mapped in the same way that my primary keys did. So that, again, you're maintaining referential integrity. And then the second reason we care about, you know, foreign keys at Tonic is because of our subsetting engine. I'd say a majority, or very close to a majority, of our customers, in addition to de-identifying their data with Tonic, are subsetting their data. You know, you have this massive, you know, 10 terabyte production database. You don't need a 10 terabyte database for each of your developers. You wanna subset it down to a very realistic looking 5, 10, 20 gigabytes of data. And then each developer gets their own, like, you know, local development database as an example. Right? Well, our subsetting engine, which, again, guarantees referential integrity, needs to take advantage of the foreign keys of the database to understand how to, like, you know, move data and which rows to keep and which to discard when it's subsetting. If a database doesn't have any foreign keys defined, then your subset is gonna be very weird.
So, you know, hopefully, you have all your foreign keys in the database. The subsetting engine is just gonna, you know, work right out of the gate. It's gonna be awesome. If you don't, you need to go into the Tonic UI and actually tell Tonic where your, you know, using air quotes here, where your virtual foreign keys are. And that goes back to that discoverability aspect. And then once you tell Tonic where your virtual foreign keys are, we can then produce a reliable subset. And then we can apply appropriate transformations on the primary and foreign key columns, as I was talking about earlier. Our UI for doing this has gone through, you know, multiple iterations, and actually, I think it's being redesigned again right now. And it's, like, really purpose built for, like, how can I quickly add as many foreign keys as possible with as few clicks as needed? Right? And that's what it's kind of built for, because a lot of our customers have, you know, very large complex schemas, and being able to do this quickly is really key.
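Here is a small sketch of the property being described, using invented tables: the key transformation is deterministic and effectively unique, so applying the very same function to the foreign key column reproduces the mapping and referential integrity survives. It illustrates the requirement, not Tonic's implementation.

```python
import hashlib

def transform_key(value: int, salt: bytes = b"per-run-salt") -> int:
    """Deterministic, collision-resistant remapping of an integer key.

    Deterministic: the same input always yields the same output, so users.id and
    orders.user_id can be transformed independently and still line up.
    Unique with overwhelming probability: distinct inputs map to distinct 63-bit outputs.
    """
    digest = hashlib.blake2b(value.to_bytes(8, "big"), key=salt, digest_size=8)
    return int.from_bytes(digest.digest(), "big") >> 1  # keep it positive

users = [{"id": 1, "name": "Jane"}, {"id": 2, "name": "John"}]
orders = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}, {"id": 12, "user_id": 1}]

out_users = [{**u, "id": transform_key(u["id"])} for u in users]
out_orders = [{**o, "user_id": transform_key(o["user_id"])} for o in orders]

# Every transformed foreign key still points at an existing transformed primary key.
assert {o["user_id"] for o in out_orders} <= {u["id"] for u in out_users}
```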
[00:24:10] Unknown:
Digging into your platform itself, can you talk through some of the design and implementation of what you're building at Tonic and some of the ways that the initial approaches have either had to be reassessed or some of the assumptions that were able to hold true as you went from idea into your current level of production?
[00:24:30] Unknown:
When we first started building, it was just myself and 1 of my cofounders. I have 3 other cofounders, and myself and my cofounder, Andrew, are on the engineering side. So we were the 2 people developing Tonic from day 1. I don't know if we made any assumptions. In the early days of a startup, when you're pre revenue, pre customers, all of that, it's kind of the Wild West. You're just writing code quickly and seeing what sticks. So what assumptions did we make? I honestly don't know if we made any at all. I think it was kind of chaotic. But the core assumption that we made that has held true throughout is kind of, like, the fundamentals of Tonic, you know, what it can do, what it can't do, what the rough experience is, what are the ins and outs, you know, things like that. And that has really, like, held true since day 1, like the demos that we gave early on, even when we were fundraising. I mean, the product has not fundamentally changed since those days. I think we were really honest, I think, from the earliest days. But, you know, engineering wise, we didn't really, you know, get serious until we started getting customers.
And I'd say the thing that has changed the most is probably, and it's the thing that's currently changing the most, is probably our back end architecture. In the early days, like, on day 1, you know, that first customer that we're working with, that was, you know, probably like 1 of our friends, if we're being honest. Right? Imagine, you know, you're an ETL tool and you have to copy all of the rows in 1 database into another database. Right? Well, the most naive way you could do this is, you know, 2 for loops nested inside each other. The first for loop iterates over the tables, the inner for loop iterates over all the rows of the table. And inside that inner for loop, it's like it reads in the row, it processes the row, meaning it applies your transformations, then it writes the row to the output. Now, this is a really naive way of doing it. It's likely not performant. But that is, you know, precisely how it started, and it worked. But then you get to your second customer and, like, okay, well, their database is, you know, I don't know, twice as big. You know, you're going from 10 gigs to 20 gigs, whatever. And it's like, oh, man, this sucks. This is taking way too long. And so, like, over time, we've just constantly been iterating on this back end architecture, and we're going through now what is essentially, it's definitely our largest.
I'm gonna call it a refactor, but, I mean, large portions are being rewritten, you know, honestly. And where we're gonna end up is gonna be, you know, a very new and very performant architecture that I'm very much looking forward to. So, I'd say that's probably, like, the thing that has changed the most, both in terms of, like, the delta and also just how many times we've kind of, like, rewritten parts, refactored parts, done, like, performance traces and CPU profiling, as we are just constantly iterating on this, you know, piece of code.
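For reference, the day-one approach being described is roughly this shape: two nested loops that read each row, apply the configured transformations, and write it out one at a time. In-memory dicts stand in for real connections and config in this sketch; it works, and it is exactly the part that has to be rewritten for performance as databases grow.

```python
# A deliberately naive copy: iterate tables, then rows, transforming as we go.
source = {
    "users": [{"id": 1, "name": "Jane Doe"}, {"id": 2, "name": "John Smith"}],
    "orders": [{"id": 10, "user_id": 1, "total": 42}],
}
dest = {"users": [], "orders": []}

# Per-table, per-column transformations from a (hypothetical) config.
transforms = {"users": {"name": lambda name: "REDACTED"}}

for table, rows in source.items():                   # outer loop: every table
    for row in rows:                                 # inner loop: every row
        out = dict(row)                              # read the row in
        for column, fn in transforms.get(table, {}).items():
            out[column] = fn(out[column])            # apply the transformation
        dest[table].append(out)                      # write the row out, one at a time

print(dest["users"])  # [{'id': 1, 'name': 'REDACTED'}, {'id': 2, 'name': 'REDACTED'}]
```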
[00:27:00] Unknown:
As far as the workflow and use cases around Tonic, I'm wondering if you can talk to some of the ways that your customers are using Tonic and integrating it into their development workflows and their infrastructure and some of the ways that they think about how to approach the problem of working with data before production or being able to obfuscate data effectively?
[00:27:24] Unknown:
It's done in a lot of different ways. I'd say the thing that, I think, most of the ways it's done successfully have in common is automation. Because automation allows you to run Tonic frequently, which means you're constantly getting refreshed and new versions of production. I say this during all of the sales demos that I give, which is not as many as it used to be because we have other folks do that now that are much better than me. But every click you make in Tonic, like, every change you make, everything you can do, you can do via our API as well. Our product ships with, of course, like, you know, API documentation. So our customers that begin automating Tonic, I think, are the most successful. The typical, like, flow of a customer in Tonic is, okay, they install Tonic. When I say install, most of our customers install Tonic on prem. And by on prem, I mean in their own public cloud, in their own data center, but they're installing Tonic on their own machines in their own networks.
This is, I mean, I guess it's pretty obvious why, but, you know, customers use Tonic because they have very sensitive data. You know, it could be healthcare data, finance data. Like, those aren't the only examples, but those are certainly common. Right? And they don't want to send that data outside of their networks. So Tonic goes into their networks and gets installed. A normal, you know, installation of Tonic, you know, can be configured to run entirely air gapped if needed. So, anyways, you know, Tonic gets installed, they start using it, they're primarily using the UI. The first thing typically done in the API would be to start running jobs programmatically.
It's like, okay. Well, I don't want to log in and hit the generate data button every time, because I'm doing it, like, once a night at midnight. I don't wanna have to log in and do that. That's kinda stupid. They will typically start orchestrating that part. Like, there'll be some other process. It could be, like, you know, an Airflow graph, or it could be, you know, some CI/CD pipeline is triggering these jobs. That's typically the first step. And then, there's a lot of different ways they can go. Right? Like, if the database schema is not changing often, you really don't have to do too much else. Let it just run. But, you know, if things are changing often, if tables are being added, columns are being added, you know, etcetera, then some customers actually, you know, take it a step further and start doing everything programmatically.
Meaning, okay, a new column's added. Let's programmatically apply transformations to columns. You see customers, you know, logging into the UI less and less over time. In terms of, like, what workflows they use, I would say most of our customers are running Tonic at least once a week, and they are triggering those jobs programmatically. But there's a lot of, like, cool things you can do. I'll give some examples. At a lot of companies, you know, you have a Git repository that contains all of, like, your DDL statements for your database. Like, you know, it has, like, your create tables, you know, for new tables. It has your modified tables and columns, you know, as you make changes over time. Right? Like, the combination of all these DDL files is your application database. Right? A lot of customers will actually check their Tonic config file into that repository. So let's say I'm a developer at this organization, I go and add a new column to a table, and I copy in data from a column in a different table.
Okay? So, I've done this. And let's say that that column contains sensitive information. Okay? Well, you know, when I'm doing my code review, or when I give my code review to someone else, they can say, hey, you added this new column to this table and it has sensitive information in there, or it will contain sensitive information once it goes live. Right? You need to go, you know, add something to the Tonic configuration to de-identify this column. They know when it's been done because your Tonic config, this file, has been checked into that same repository. And it's just a JSON file. So, you know, customers can start, you know, reviewing and modifying their Tonic config together. That's 1 fun workflow I see.
Another would be, and this is a little in the weeds, I hope I can explain this in a clear way. Like, let's say you have a staging database that you use, and this staging database needs to always be seeded with a specific set of rows, so that you can run your regression tests against staging. Right? Then the regression tests are automated. They need to have these specific users with these IDs in there, so the tests will run. Right? If every night you're kind of refreshing staging with data from production, you need, you know, pretty clever thinking on how you then wanna ensure that your seed data isn't modified incorrectly, or how you could always get your seed data back into staging when everything's been wiped out. Right? So customers come up with pretty clever, like, what are essentially, like, upsert solutions. That's really what they are. Like, how can I upsert my data that I wanna get into the database without messing up my seed users that I need for running regression tests? That's actually something that we started seeing so often that we're starting to, you know, bake that into the product in a first class way. That can lead to fairly complex workflows.
Say, okay. Well, we do this, then we do that, and it becomes this whole dependency graph of what needs to happen first. There could be multiple databases involved. It's like, you know, some can be used just for, like, holding data before it gets put here, etcetera. Right? And I think putting that into the product is gonna hopefully remove all of these, you know, workflows that customers have come up with.
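A sketch of the upsert pattern being described, against a PostgreSQL staging database via psycopg2: the refreshed production-like rows are inserted with ON CONFLICT DO NOTHING keyed on the primary key, so the handful of seed users the regression suite depends on are never overwritten. The table, columns, seed ids, and connection string are assumptions for illustration, not anyone's actual setup.

```python
import psycopg2
from psycopg2.extras import execute_values

SEED_IDS = (1, 2, 3)  # users the regression suite expects to exist, left untouched

fresh_rows = [
    (1, "seeded-admin@example.com"),     # collides with a seed row -> skipped
    (501, "new-user-501@example.com"),   # genuinely new -> inserted
]

conn = psycopg2.connect("dbname=staging")  # placeholder connection string
with conn, conn.cursor() as cur:
    # Refresh everything except the protected seed rows...
    cur.execute("DELETE FROM users WHERE id NOT IN %s", (SEED_IDS,))
    # ...then upsert the new batch; existing (seed) ids are left alone.
    execute_values(
        cur,
        "INSERT INTO users (id, email) VALUES %s ON CONFLICT (id) DO NOTHING",
        fresh_rows,
    )
```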
[00:32:09] Unknown:
As far as the variance in workflows and some of the specific constraints that people apply as far as how it gets run and when it gets run, you mentioned also the challenge of schema migration. I'm wondering what are some of the most challenging edge cases that you are dealing with or have had to engineer around, and, because of the fact that this is such a fractal and complex domain to work in, what are some of the cases where you've decided, okay, this is something that we are absolutely not gonna bother trying to do because it is, you know, NP hard and it just doesn't make, you know, economical sense for us to invest the engineering to try and solve it.
[00:32:47] Unknown:
I think a good example is 1 that I spoke of earlier when we talked about, like, automatically detecting foreign keys. That is something we have thought about before. And just, like, the back of the envelope math that we've done has suggested that it's not really feasible to do. I think that's a good example, and we have not, you know, tried in earnest to do it. I think it'll stay that way. I could be wrong. Maybe someone on the team will come up with a really clever idea. But we have thus far avoided even trying to do it. So I think that's a good example.
[00:33:15] Unknown:
In your work of building this platform and working with your customers and seeing some of the ways that your utility is being applied to their problems, what are some of the most interesting or innovative or unexpected ways that you've seen Tonic used?
[00:33:29] Unknown:
When customers first started using Tonic outside of production databases, I thought that was pretty good. Like, we hadn't considered using it to, like, de-identify data warehouses, or at least we hadn't seriously considered it and what that meant. And customers started doing it and using some of our, like, you know, pretty sophisticated transformations to get really, you know, good statistical results from their warehouse for, like, their analysts. I thought that was impressive. You know, basically, I thought, okay, those are sophisticated customers. That's awesome. Not to pat ourselves on the back, but I thought it was also impressive that Tonic was able to work in use cases that it wasn't originally designed for. It kinda showed that the tool had matured to a point where it could be used in a pretty general way, which was also great to see. In your work of building this system and working with your customers and exploring some of the boundaries of this problem space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? Building a product that connects to a foreign database, meaning it's not the database that backs the application that you're building, is very challenging.
Before doing this, I had always worked on applications that connected to, like, their own database, typically via an ORM. Right? And building a tool that needs to be able to connect to and operate on an arbitrary database that you've never seen before is a real challenge. And it's a real challenge because, well, first of all, you don't get an ORM. So, like, all the abstraction that an ORM delivers, you don't get. You are only interacting with these databases via SQL, you know, asking the questions that you need answered that way. Like, how many tables do you have? How many columns do you have? Give me a list of those columns. Okay. Where are the foreign keys? Etcetera. Right? It's a lot. And it has to be done per database, because each database kinda handles this in a very different way. But even worse than that is databases have a very long tail of features.
Really, really long tail. There'll be a feature that only 1 of your customers is using, but they're using it, so you have to go support it. And what it really means is you need to support that long tail for each database that you support. That is a challenge, and it's expensive. When we first started, our first customer was on Postgres, so we built support for Postgres. And we're like, okay, this is good. Let's just sell more Postgres. Our second customer was on MySQL. Our 3rd customer was on SQL Server. Our 4th was, like, I think, Postgres, but our 5th was Oracle. So within 5 customers, we had to support 4 different databases. It's complex. And, you know, honestly, though, early on, we didn't support those long tails. We just supported the features those customers had. But over time, we've had to support the long tail on each database, I would say.
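To illustrate the kind of raw-SQL introspection that replaces an ORM, here is what answering "what tables, columns, and foreign keys do you have?" can look like against PostgreSQL's information_schema with psycopg2. Other engines need their own, different queries, which is where the per-database cost comes from; the connection string is a placeholder, and the joins are simplified (single schema, single-column keys).

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string

with conn, conn.cursor() as cur:
    # Tables and columns.
    cur.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)
    columns = cur.fetchall()

    # Declared foreign keys: which column references which primary key.
    cur.execute("""
        SELECT tc.table_name, kcu.column_name,
               ccu.table_name AS referenced_table, ccu.column_name AS referenced_column
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
          ON tc.constraint_name = kcu.constraint_name
        JOIN information_schema.constraint_column_usage ccu
          ON tc.constraint_name = ccu.constraint_name
        WHERE tc.constraint_type = 'FOREIGN KEY'
    """)
    foreign_keys = cur.fetchall()

print(f"{len(columns)} columns, {len(foreign_keys)} foreign keys")
```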
[00:35:59] Unknown:
And in terms of the database support matrix, as you said, each individual database, even constraining it to just relational engines, has Yeah. You know, a huge number of potential use cases and features and ways that people are abusing it. And I'm wondering what has been your stance of kind of deciding when to add support for a new database engine and when or if you're planning on moving beyond just the bounds of relational engines.
[00:36:26] Unknown:
That's right. So we already support non relational databases. Mongo is our 1st NoSQL database that we support. In addition, a lot of the data warehouses we support, they sure do look and feel like SQL, and you primarily interact with them the same way, but they don't actually support foreign keys. And then we also support just, you know, operating on flat files. Like, you have a folder of Parquet files, or a folder of CSV files, or an S3 bucket of whatever. We support that as well via, you know, Databricks and Spark. The initial investment to get it supported is normally fine. It's the ongoing support of it that is the real cost that, you know, folks don't always consider. I'd say we are very customer and numbers driven.
So we add support for new databases when we've seen a sufficient number of customers ask for that support. So, like, you know, basically, all the databases that we support today are only in the product because a customer has requested them. It's almost always been that a large number of customers have requested them. That's for adding a database. For adding support for individual features, it depends on the scale of the feature. I could be on a call with a current customer that's complaining about something in the UI that makes it hard to do something, and, like, a day later, we'll have that fixed. That kind of thing happens all the time. But for larger features, like, for example, this upsert feature that I was talking about earlier, we've been talking about doing something like that for well over a year, and it's a very large investment. But we also have a large number of customers that would be helped by it. So we started working on it. For people who are struggling with bringing production
[00:37:54] Unknown:
like data into nonproduction environments, what are the cases where Tonic is the wrong choice?
[00:37:59] Unknown:
Where Tonic is the wrong choice, they're typically not customers, they're prospects that, during, you know, that initial, like, you know, discovery call, are like, oh, we want to generate data from scratch. Meaning, we're building an application and we don't have any customers yet. We want to seed our database with data so we can, you know, test things out a little better. Tonic is typically not the right tool for that situation. You can actually use Tonic to generate data from scratch. You know, connect Tonic to a database with a schema that has no data in it. And you can use a subset of our transformations for basically generating, you know, data without having any seed data, and you can kind of tell it how many rows to add per table. You can do it. But, you know, beyond a couple tables, that's very difficult to do in Tonic, and we're not really built for that. There are other tools in the market. Mockaroo is a really good example.
If you want to create data from scratch, I think, you know, a tool like Mockaroo is probably the way to go. But I struggle to think of a reason why an organization that has an application database, that has users, that has data, couldn't use Tonic to generate de-identified data.
[00:38:59] Unknown:
As you continue to build and iterate on Tonic and work with customers and explore new ways that you could apply your kind of fake data generation and data subsetting utilities, what are some of the things you have planned for the near to medium term or any particular problem areas that you're excited to dig into?
[00:39:15] Unknown:
I'm excited to see where our machine learning efforts with Djinn go. I think we'll have increased investment there and across all of our analytics use cases over time. I think we'll begin making a larger push into de-identifying data warehouses, so more folks in your organization can actually access production like data, you know, for analytics use cases. I think we will continue to invest in making the, you know, developer test data experience better. You know, I think upsert is a great example of that. I think we'll continue adding data sources as they become more requested.
For example, I know we're currently considering the effort it would take to support databases like Dynamo, because we get a lot of requests for it. You know, no promises on when it'll be done, but, you know, we're looking into it now. And this is, like, super high level. I would struggle to come up with more details here, but I'm excited for what additional products we will be able to offer that kind of, like, vertically integrate our solution. There's things that are done before data gets to Tonic and things that are done after data leaves Tonic that I think that we could help with. As 1 example, it's like, okay, well, to use Tonic, you need to know where your sensitive data is. Right? So you can apply transformations. Well, Tonic helps you do that. It scans your database and it finds all of your sensitive data for you. And then it even suggests what transformations to apply based on what type of sensitive data it is. Kind of expanding that into, like, a more fully fledged data catalog solution, I think, would be interesting. As an example, no idea if we'll ever do that. You know, I'm excited for, like, you know, that type of conversation. I like it.
[00:40:42] Unknown:
Are there any other aspects of what you're building at Tonic or the specifics of data subsetting and data obfuscation and dealing with sensitive data in nonproduction environments that we didn't discuss yet that you'd like to cover before we close out the show?
[00:40:55] Unknown:
If anything I've said is interesting today and you are interested in generating production like data for development and testing, you should go to tonic.ai. Check us out. Right at the top on the left hand side is a link that you can click on to create a trial account at app.tonic.ai. You can actually try it out. You know, you can connect to your database, you can connect to 1 of our sample databases, you can give the product a go. If you're a data scientist or a machine learning engineer, you can do the same thing. Go to tonic.ai. And there's a different button for you to try out Djinn, which you can try out at djinn.tonic.ai. It's spelled djinn.
[00:41:29] Unknown:
And for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:44] Unknown:
I would say very few of our customers have a great data discovery option. Or if they do, they're certainly not using it. I think data discovery is something that could maybe use some help. But that's just my possibly pretty narrow view of things.
[00:42:03] Unknown:
Well, that's 1 of the great things about this question is I get a good cross section of people with different areas of focus and seeing kind of whatever it is that they're paying attention to, what are the things that they see as a gap? Because, you know, as with everything in technology, it's all fractal. So Of course. Wherever you are in that particular fractal geometry, there's a particular view that you have. That's right. What's your answer? You probably have a pretty good answer here. I've actually been asked this a couple of times, and the 1 that I keep coming back to is the connection between application developers and data engineers, and the maintenance of context and semantics of domain information as it is created in an application context and then gets handed off to analytical use cases where, right now, data engineers will just dig into the guts of a database and they just pull it out and have to reconstitute the semantics, where if the kind of semantic contextual knowledge management was a more native part of the application development experience, and being able to expose that to analytical use cases, it would save a lot of effort in kind of the downstream applications.
Are there any tools that do this today or attempt it? None to my knowledge, and at least none that are kind of widespread enough to have any meaningful impact. I'm thinking about things like, you know, Django with its REST APIs or its ORM, or Rails, or, you know, Spring Framework, where you have an ORM that you can use to model your data and store it in the application database. There are, you know, contextual semantics in that model information and the logic surrounding it, but there's no way to, you know, preserve that as you're extracting that information into other contexts. And so if those same, you know, ORM frameworks also had a built in framework to expose that information in an analytical API for pulling that data out and maintaining some of the domain knowledge around it, that would reduce a lot of the overhead and burden of data teams who are trying to take that information that was generated for 1 purpose and apply it to others.
[00:44:00] Unknown:
Right. Right. I mean, at a very small scale, we're getting more serious about our own analytics. We're moving data around and, you know, starting to study it a little more. This is definitely true because, like, unless you understand, like, deeply what our application schema is, like, you know, for our own product, it's really hard to get questions answered. You know, being a developer on the team and also doing some of the data engineering work, I'm able to handle it. But, like, you know, other folks that are asking questions, they don't have that context, and it's really hard for them to get things done. So, yeah, that does match my own experience. I'll say that.
[00:44:32] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on Tonic and exploring the space of how to actually make production data useful in non production contexts in a safe and sustainable manner. So I appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. You as well. And thank you very much for having me. I enjoyed our conversation.
[00:44:58] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Event Announcement
Challenges with Streaming Data
Interview with Adam Kamor Begins
Adam's Journey into Data
Founding Tonic and Early Challenges
Problems with Traditional Data Handling
Target Use Cases for Tonic
Design and Implementation of Tonic
Customer Workflows and Automation
Challenging Edge Cases and Limitations
Unexpected Uses and Lessons Learned
Database Support and Expansion
When Tonic is Not the Right Choice
Future Plans and Developments
Closing Remarks and Contact Information