Summary
Managing end-to-end data flows becomes complex and unwieldy as the scale of data and the variety of its applications in an organization grow. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
- Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the SQLake product is and the story behind it?
- What is the core problem that you are trying to solve?
- What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow?
- What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)?
- Can you describe the technical implementation of the SQLake feature?
- What does the workflow look like for designing and deploying pipelines in SQLake?
- What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?
- SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?
- What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales?
- What are some of the edge cases that you have had to provide escape hatches for?
- What are the most interesting, innovative, or unexpected ways that you have seen SQLake used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake?
- When is SQLake the wrong choice?
- What do you have planned for the future of SQLake?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg) The evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. D&A teams are now in a better position than ever to help lead this change within the organization. Harnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands. Go to [dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda) Listeners can save $375 off standard rates with code GARTNERDA Promo Code: GartnerDA
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database, not simply a database that connects to streaming systems. With a Postgres-compatible interface, you can now work with real-time data using ANSI SQL, including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies, and lead with purpose. Join your peers at Gartner Data and Analytics Summit from March 20th to 22nd in Orlando, Florida for 3 days of expert guidance, peer networking, and collaboration. Listeners can save $375 off of standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda to find out more. Your host is Tobias Macey, and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries. So, Ori, can you start by introducing yourself?
[00:01:34] Unknown:
Yes. My name is Ori Rafael. I'm the CEO and one of the cofounders of Upsolver, and I come from a data engineering background.
[00:01:42] Unknown:
And for folks who haven't listened to your past appearances on the show, can you give a bit of a reminder about how you first got started in data?
[00:01:50] Unknown:
Yeah. I started in the Oracle age, owning about 50 Oracle databases, supporting application teams and BI teams. So those were my first steps. Did that for a few years and kind of started to go more into cloud in the last 6, 7 years.
[00:02:06] Unknown:
And in terms of the SQL Lake product, we've done past episodes about your work on Upsolver and some of the aspects of unifying batch and streaming workflows, and this is the latest iteration in terms of the user experience and kind of the functionality of what you're building. So I'm wondering if you can give a bit of an overview about what it is that you've built there and some of the story behind how you came to the conclusion that this was the right feature set to build and some of the, I guess, user challenges that you're trying to address in introducing this capability?
[00:02:39] Unknown:
And I think that SQLake, we don't call it a feature. SQLake is a full product. You can call it the v2 of Upsolver, basically destined to eventually replace the original Upsolver and kind of carrying the, as you call them, conclusions from the previous iteration. And I hope the name helps with that: it is the combination of having SQL for writing big data pipelines, but doing it on top of a cloud data lake environment. Let's say the framing of the problem we're trying to solve: you're trying to build a big data pipeline in the cloud. You're storing your data in a lake, and eventually you're going to deliver it either to engines that query the lake or to data warehouses. That's the problem we're trying to solve. And the North Star for the company has always been ease of use. We come from a database background. We like SQL. We can do a lot with SQL.
And the velocity of developing on open source like Spark wasn't what we were used to. It was something where we had to take a lot of the work on ourselves, because we couldn't really give Spark to citizen data teams, and it just took a long time to do. So we tried to basically bring the SQL, you can even say sometimes Oracle-like, experience to the world of big data pipelines. Now that's Upsolver in general. Now why SQLake? What's different? So the original Upsolver product used a user interface to build pipelines. You would still use SQL for your transformations, but the pipelines themselves were something you built with a drag-and-drop type of environment.
What we figured out is we wanted to create a system that's very well defined, has a very clear API. A UI you can take in many different directions, but with an API, you can just implement that API. You can document it. You can test it. You can bring something that's much more familiar to data engineers, to data teams, without us explaining it. So we felt like our entire system needed to be SQL based, just like you would have with a database. But in this case, this is a sort of database that you're using for moving data, not necessarily for querying data. The other reason to use SQL is CI/CD. I think the market did a pretty big shift in the last few years, from an Informatica type of experience to "I want to own my pipeline code. I wanna test it. I want to be able to manage it as I manage all my other code." And in this case, we're giving you SQL code. So instead of you learning our internal APIs to create pipelines, which is kind of the API behind the UI that you see, everything is SQL based. So managing deployments, managing tests, managing all of that becomes much easier once you work with a SQL system. How does CI/CD work in Upsolver? Well, how does it work in a database? Right? It's your SQL code. You manage it. It's the same. The original name for SQLake inside Upsolver was "database experience": how do we provide a database experience for the people trying to build big data pipelines?
[00:05:31] Unknown:
As far as the pipeline management aspect, you mentioned that in the initial implementation of Upsolver, you went for a UI-driven paradigm of being able to construct the pipelines and manage the data flows. A lot of other people are going to be familiar with tools such as Airflow and some of the other tools of that generation. Lately, there has been kind of a new generation of data orchestration systems that are trying to be more data aware, in terms of things like Dagster, Prefect, Flyte. And I'm wondering if you can talk through some of the anti-patterns that you've seen teams start to adopt in the space of the kind of UI-driven or very task-oriented workflows that they have come to be accustomed to, and some of the ways that you're trying to address some of those anti-patterns and kind of prevent people from being able to fall into those bad practices with what you're building in SQLake?
[00:06:28] Unknown:
Yeah. The 2 words data aware are the keywords. So a pipeline is the combination of 2 things. It's the transformations that you write, and it's the orchestration needed to actually run those transformations. Transformation is a necessary step. Like, you need to define your use case, you need to define what you wanna do with raw data, how it is becoming analytics ready. But orchestration is kind of an unnecessary step that we do, and it's not adding any value. Basically, let's turn a one-time query into something that runs continuously. So orchestration is, I think, the big anti-pattern, and this is not something that only Upsolver is saying. If you look at the big data platforms like Databricks and such, they are talking about declarative pipelines. How do I take the transformations and turn them into a pipeline without additional engineering? Now if you look at the Airflow approach, the Airflow approach is, by design, not data aware, because it's not the system running the pipeline.
You need to write code to define when you can actually run a step and the dependencies between the steps, always creating those dependencies, figuring out how to do it. And there are a lot of dependencies. Processing state, additional tables I want to merge with, aggregations I want to do. Like, there are a billion cases for stateful processing, and it's hard to do on something like Airflow just because you have to define all the dependencies. They're not inferred from your transformation. So, in my opinion, it will be hard to make Airflow a data aware tool, because it doesn't hold the data. It doesn't do the processing. Another system is doing the processing. This is why we are seeing declarative pipelines coming from Databricks and such, and not exactly seeing it from Airflow, because they are not the engine doing the processing. So we kind of figured out, at an early stage in Upsolver, that we wanna have the processing engine and orchestration together so we can make the whole orchestration completely automatic. Let's say the number 1 enemy of ease of use was orchestration.
So if defining the dependencies is 1 anti-pattern, the second anti-pattern is trying to force a batch system onto streaming workloads. Like, I remember writing ETLs in my Oracle days. You know, I could write a query. It could run once a day. Sometimes it ran for an hour or 2 or 4, but nobody cared because data comes every 24 hours. You can scan a lot of historical data. You can just do a lot, and you don't really need to think about your pipeline as a continuous process. Once you start going more real time, even hourly batch is more real time. I'm not even talking about the companies doing it once a minute. Once you do that, you need to really be thinking incrementally. And for your pipeline, if I'm looking at the online documentation, usually people would say that you shouldn't force streaming into something like Airflow. And now you have to think in 2 different ways. I have 1 system for batch. I have 1 system for streaming. Our goal was to unify them and kinda not make the user think too much about whether it is streaming or batch. Let me know what's the data freshness that you need. Give me that contract. Let me figure out how to run your pipeline, but don't try to force that batch process where you're now thinking about very small incremental steps with a lot of dependencies.
And the point where it breaks is streaming. So for companies trying to just take Airflow and build their streaming workloads with it, we often see that fail.
[00:09:45] Unknown:
In terms of the unification of the management of the transformation and the orchestration into the SQL dialect, I'm curious what are some of the trade offs that you considered around the kind of benefits of putting everything into that same language versus being able to more natively represent some of the kind of orchestration and sequencing constructs that might be necessary for more complex use cases?
[00:10:14] Unknown:
So the model we chose to work with: in Upsolver, you have tables and jobs. Those are the 2 main entities. Okay. And the jobs move data between tables. That's clear. Now how does a job work? So I could write in Upsolver CREATE JOB. I have properties of the job, like the refresh interval, when do I wanna start. But it's basically properties. You don't need to know a lot of syntax. You just need to get familiar with the properties. And after that, there is a SQL clause. That SQL clause needs to be as close as possible to an ANSI SQL clause. In most cases, it's just 100% ANSI. So you as a user come, you write your query. You say, this is my transformation. Now you put a CREATE JOB on top of it. You define those properties, and that's basically the contract of the table you're going to create. Now you want to create more tables on top of that? You can do that. For each table, you can create another table. That's the model you work with in Upsolver. So I'm gonna tell the user: you write the query, learn the part about writing a CREATE JOB and some properties, and now you have a pipeline.
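To make the tables-and-jobs model above concrete, here is a rough sketch of what a job definition of that shape could look like: a handful of job properties sitting on top of an otherwise ordinary SQL query. The keywords, property names, and table paths below are illustrative assumptions based on the description in this conversation, not verified SQLake syntax.

```sql
-- Hypothetical sketch of the tables-and-jobs model described above;
-- property names and table paths are assumptions, not verified SQLake syntax.

-- A table is one entity...
CREATE TABLE lake_db.orders_enriched (
  order_id STRING,
  order_total DOUBLE,
  order_date DATE
);

-- ...and a job is a separate entity that moves data into it.
-- Job properties (refresh interval, where to start) sit on top of
-- a SELECT that stays as close to ANSI SQL as possible.
CREATE JOB enrich_orders
  START_FROM = BEGINNING      -- assumed property name
  RUN_INTERVAL = 1 MINUTE     -- assumed property name
AS INSERT INTO lake_db.orders_enriched
SELECT
  order_id,
  net_total + tax AS order_total,
  CAST(event_time AS DATE) AS order_date
FROM lake_db.orders_raw;
```

Each downstream table can then be fed by another job of the same shape, which is how a chain of tables becomes a pipeline.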
[00:11:13] Unknown:
As far as the kind of orchestration primitives that you're building in, what was the level of detail or granularity that you considered for who the consumers are of the SQLake platform, and how did that inform the kind of level of detail and control that you wanted to expose versus just making it a smooth end user experience so that they don't have to worry about some of the inner workings of what their tasks are actually doing?
[00:11:45] Unknown:
I think it's more the latter. I'll explain what we do give to the users, but I think the approach is that it's just too much to put on the user. One of the things that we did is we created a table that you can query with SQL to see all the orchestration tasks and their status, what ran and what didn't run, and we're creating views on top of it so it would be easy to query. But it's still very complex. Like, I'm not expecting the average user to follow it, because even if I'm gonna start the most vanilla pipeline ever, it's still going to create a table. It needs to update metadata, let's say if I'm using something like a catalog or Hive Metastore. It needs to do compaction of small files. It needs to apply retention policies. So you have, like, 3 different angles from which you are going to update the table for a vanilla use case. I'm not even talking about the case where you're updating the table from multiple jobs or running replays or fixing data on a specific time interval. So our approach was to try to basically abstract orchestration completely.
What we do ask users to do is to define the interval, like, how often they actually want to get the data. There is a bunch of stuff they can do to play with sliding time windows. That's usually for the streaming use cases. And 1 thing we introduced that wasn't in Upsolver Classic is the concept of sync jobs. The concept of sync jobs says that, let's say I have a chain of tables, and all of them are connected with sync jobs. It means that all of these tables are going to maintain a timeline of the data. And if minute 1 didn't get into table 3, then the jobs for tables 4 and 5 wouldn't continue their processing. So the entire pipeline from start to finish is going to be managed on a single timeline. That's very important for streaming, but not only, because it gives you a lot of control over your data. Let's say I have some data quality problems, and I identify that it's between 12 o'clock and 1 o'clock at night on a specific day. I can just rerun that piece, and then the entire change is going to propagate along the pipeline, because the entire pipeline is synced to 1 timeline. So the concept of sync jobs is 1 way you can work. The other mode says, I don't care about the timeline, just write into the table. Whenever there is available data, write into the table. So for example, I have some kind of reference data. I'm using an address book. So I don't care about the time I updated the address book. I just want the most recent addresses, and I don't care about sync jobs. But if, for example, I have a clickstream and I wanna take 2 different events and join them together, then the time means something here. If the 2 events, for example sessions, are correlated, I can't run the 2 jobs at completely different times. And if I'm going to do a replay, then maybe I'm just gonna run 1 stream and not run the other stream. So those 2 are not gonna be synced. I'm gonna get bad results.
So in the past, we didn't give the user the option to choose between the 2 modes, synced or not synced. And today, they can do that. So that gives them more flexibility.
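As a sketch of the synced-versus-unsynced distinction described above, the difference is whether jobs in a chain advance along one shared data timeline. The keyword placement and names below are assumptions drawn from the conversation, not verified syntax.

```sql
-- Hypothetical sketch, not verified SQLake syntax.
-- Sync jobs: both inputs to a downstream join advance on a single shared
-- timeline, so a replay of one time interval propagates consistently
-- through the whole chain.
CREATE SYNC JOB load_clicks
AS INSERT INTO lake_db.clicks
SELECT * FROM lake_db.clicks_raw;

CREATE SYNC JOB load_sessions
AS INSERT INTO lake_db.sessions
SELECT * FROM lake_db.sessions_raw;

-- Unsynced job: reference data such as an address book, where only the
-- latest values matter and timeline alignment adds nothing.
CREATE JOB load_addresses
AS INSERT INTO lake_db.addresses
SELECT * FROM lake_db.addresses_raw;
```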
[00:15:00] Unknown:
In terms of the implementation of SQLake, I'm wondering if you can talk through some of the kind of design and technical aspects of it, and some of the ways that you were able to build on top of your existing investment in the previous generation of the Upsolver platform, and cases where you had to either kind of throw pieces away or rebuild or reengineer them?
[00:15:24] Unknown:
Basically, SQLake is built on the Upsolver processing engine, which has been running in production since 2018. So that's the processing engine that we are very certain of. We didn't write it from scratch, because the processing engine works well, and we tested it at scale. What we did do, I think the approach of the original Upsolver Classic was: let's create an entire data architecture as materialized views. So I have my raw data, and I have materialized views on top of that. And what we found out, kind of when you face reality, is that there are a lot of changes people are doing to their tables. First of all, that materialized view, quote unquote materialized view, changes a lot. Adding columns and removing columns and doing a lot of schema changes. I think that I've seen cases with over a 100 edits for 1 pipeline from a single user. The second is that people are really good at screwing up their data, and it needs to get fixed a lot. There is no real meaning to the concept of doing random changes to materialized views, because they are the result of a query. But sometimes you want to change data more and more. So we decided that we are still going to keep the concept of materialized views, but we are going to separate jobs from tables. In the past, 1 job equaled 1 table; today you have tables, you have jobs, you can have multiple jobs writing into 1 table. Like, there is more flexibility in the entities, and that's solving all of those more, let's say, edge use cases that we saw over the last few years. So kind of separating the engine to address tables and jobs as 2 different entities was something that was a redesign.
I think the second was to take a look at SQL and say, well, we want full support of SQL. What do we support, and what don't we support as SQL? For example, you could edit materialized views in Upsolver Classic. We don't have that concept anymore, because editing materialized views was very confusing for our users. What version of the data am I currently looking at? That was a question that came up a lot. And if I'm editing a materialized view, what data is it going to override? Like, let's look at even a small to medium data lake. You can take a data lake of a 100 terabytes and decide to just recompute your materialized view. It's not practical. Everything is an evolution.
You never recompute everything from scratch. So that was a redesign on that part.
[00:17:46] Unknown:
In your redesign and reimplementation of Upsolver into SQLake and orienting around this declarative approach to managing the transformations and the pipelines, what does the workflow look like, and what are some of the pieces that you're trying to optimize for in the customer experience as you are rolling this out and getting people onboarded to it?
[00:18:08] Unknown:
I think the intention of SQLake, like, the user experience we are showing to users, is that there are 2 personas in Upsolver. There's a data engineer persona, and there is a data consumer persona. What would a data engineer do? They will set up the account. They'll create the connections. They will set up the initial ingestion. They would put the raw data in the lake. They would provide the initial access. Now if I'm gonna look further and I'm gonna look at the data consumer, the data consumer has a contract with us. They write queries. We deliver pipelines. That's what we are trying to deliver in the user experience. So they're doing a CREATE JOB that's basically wrapping some kind of transformation SQL that they're writing. When we started Upsolver, we kinda looked at everything as 1 single motion, and that's really not the case. Each 1 of the users has a different life in the system. The data engineers want to get started fast. They want to make sure that you're going to ingest in a secure way, and they're gonna think about cost. But the data consumers, they just want to work in an easy way, so we wanted them to just write a query.
What you need to do in Upsolver is write a query, and everything kinda derives from there.
[00:19:19] Unknown:
Because the entire user experience is oriented around SQL workflows, I'm curious how you've thought about some of the additional tooling around that for being able to manage things like versioning, composability, code reuse. And I'm thinking in particular about things like dbt, being able to apply some of those paradigms to SQLake, both for the transformations, which people are already using dbt for, but also managing kind of reuse and templating around some of the pipeline logic?
[00:19:51] Unknown:
So on the day of the release, we already, like, there's a nice demo of it, but basically, if you work in a Git environment, you can manage the Upsolver pipelines using Git actions. And the feedback on that was very good so far, because most people use Git, and Git actions were a familiar process, at least to the customers we have worked with on it so far. We think that dbt is a great tool for managing deployments and kind of managing my data pipeline ecosystem. And we are building a connector where you basically will be able to write the Upsolver SQL from dbt. dbt will not be doing the orchestration. That's basically 1 difference. Well, dbt Core doesn't do the orchestration. But if you buy dbt Cloud, you do get the orchestration. So we are not behind the approach of "let's do my orchestration in dbt", because everything we said about Airflow still applies here. But I definitely would want to see, like, an Upsolver customer manage their pipelines and deployments and different environments using dbt, and you can use Python scripting to basically wrap the SQL.
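As a rough illustration of the split described here, a plain dbt model can own and version the transformation SQL while the pipeline engine keeps the orchestration. This is a generic dbt sketch with hypothetical model and source names, not the Upsolver dbt connector, which was still in development at the time of recording.

```sql
-- models/enriched_clicks.sql
-- Generic, hypothetical dbt model: dbt versions, tests, and deploys the
-- SQL through Git, while orchestration stays with the pipeline engine.
{{ config(materialized='table') }}

SELECT
    user_id,
    event_time,
    page_url
FROM {{ source('raw', 'clickstream') }}
```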
[00:20:56] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end data observability platform. Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. In terms of some of the challenges that people experience around composability and code reuse in SQL, I'm curious how that influenced your choices around some of the extensions to the dialect that you were designing, and ways that you were trying to kind of reduce the potential for conflicts or code repetition, and optimize for the kind of software engineering practices that a lot of the data community is trying to adopt?
[00:22:17] Unknown:
A few things on the dialect. First of all, every time you write a SQL clause, so I said CREATE JOB and then a SQL clause, we try to keep it ANSI. Like, if you're gonna try to go non-ANSI, there's an Upsolver architect that's gonna jump at you at some point and tell you that you're basically breaking the contract of the system. But we did extend ANSI, especially if you look at streaming. There are some functions for doing streaming, and enrichment functions like IP-to-geo, unpacking user agents, those types of things. So that's an extension, but it's just additional functions. One area where Upsolver kind of diverged a little bit from ANSI, and there is a good reason for that, is nested data.
We feel like SQL is not the first choice for software developers for nested data, because SQL is trying to force you to flatten the data back and then work on it, and you're losing the relationship between parents, siblings, and children, which is important in an object like JSON. So we invested a lot of time in allowing people to work with SQL, but on top of the nested context. For example, I have nested JSON data, and it has arrays. I could actually add an enrichment in the context of an array item, and I could do operations between arrays. Let's say you take 2 arrays in Upsolver, do array A plus array B, and it would just do an item-wise addition between each item in the arrays. So sometimes users don't even necessarily understand that what they're doing is a very complex operation. They just work on the nested object. They don't need to start doing all of that flattening, and I feel like that's the place SQL is going to go to. If you look at Amazon's open source PartiQL, they are already trying to do something like that. In order to do that, we allow you to create, like, local parameters in a query using a LET syntax. You can add that almost after every clause in the query, and we allow you to work in a nested context without flattening. Those were 2 major things that are still not part of ANSI, which was invented quite some time ago. But we are a data pipeline company. The raw data, in many cases, is nested, and we wanted to provide an easy experience for unpacking it. So that felt to us like a good enough reason to break the ANSI contract for that use case.
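To illustrate the two dialect extensions mentioned, a LET-style local parameter and element-wise array arithmetic in the nested context, here is a minimal sketch. The column paths, the placement of the LET clause, and the behavior of `+` on arrays are assumptions based on this description, not verified SQLake syntax.

```sql
-- Hypothetical sketch of the nested-data extensions described above;
-- not verified SQLake syntax.
SELECT
  -- reuse a locally defined value instead of repeating the nested path
  device_agent,
  -- element-wise addition: result[i] = array_a[i] + array_b[i],
  -- computed without flattening the arrays first
  array_a + array_b AS combined_counts
FROM lake_db.raw_events
LET device_agent = payload.device.user_agent;
```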
[00:24:43] Unknown:
The other complex aspect of what you're building is the question of orchestration, which is something that entire companies have dedicated their focus to. And so having it be a component of a broader platform is definitely very ambitious, and I'm curious what are some of the complexities and edge cases that you've had to address in building your orchestration system, and maybe some of the prior art that you looked to for inspiration on how to implement it in your SQLake platform?
[00:25:13] Unknown:
I think that the first part of it is that we are not building the same orchestration system as they do. Like, we have architected Upsolver differently. In Upsolver, we said the words data aware: it stores all the data it ingests on a data lake on a timeline. So your starting point is consistently all the data on a data lake. Now what are the tasks that Upsolver is running? Each task is going to be defined by some kind of time period. A minute of data is going to be a task. So I don't need to ask the users to define the tasks for me, what's a task. I'm going to infer that myself using the data that was ingested. So the architecture is a little different, because we are not trying to do trigger-based or time-based orchestration.
We are trying to do only data aware orchestration for data pipelines. So we are solving a smaller problem and trying to do that in a very good way. Now what are the edge cases? I think the hardest part was what I described before, the synced versus unsynced. That was pretty hard to do, to make sure that you have a data pipeline, it has 6 jobs, and all of those jobs are going to be synced without asking the users to define those dependencies. That was pretty hard to do. And I think that the other part was trying to make the system seamless for stream and batch use cases. So when we started, Upsolver was really just optimized for streaming systems.
Today, it's also optimized for slower batch workloads, which means that you need to take a different sharding strategy. Sharding is something that's hard for the users to really get right. How many shards do I need for every pipeline? In what case? It's something where you really need to understand the internals, and getting everything to work with auto sharding was hard to do and definitely very related to the orchestration.
[00:27:04] Unknown:
In terms of the initial feedback and some of the early feedback that you've gotten from people who are already using SQLake, what are some of the sharp edges or escape hatches that they've asked for in terms of being able to maybe delve deeper into control over the way that the data is being transformed, or some of the operational characteristics of being able to work with that data, and ways that you've been able to wrap that back into your product development?
[00:27:31] Unknown:
An escape hatch is always important. Having UDFs, for example, is always a good escape hatch. You know you're not gonna get stuck on syntax. I think 1 escape hatch was the redesign I described earlier in SQLake, separating jobs and tables into 2 different entities, so you can have multiple jobs writing into the same table. That creates a lot of flexibility for use cases we had been struggling with for the 4 years that preceded SQLake. The other thing that's not in the system yet, and is going to be soon, is having direct insert, update, and delete, so you could fix your data manually in cases where you don't wanna write it as a pipeline, and also to support the GDPR use case, so you would be able to delete user data for the right to be forgotten.
Even if your data is in the data lake, that's something that makes a little more sense to do with DML: insert, update, delete. So that's 1 escape hatch that we are still adding. But those are the main ones.
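A minimal sketch of the direct DML escape hatch being described, which was still being added at the time of recording; the table and column names are hypothetical.

```sql
-- Hypothetical sketch of the planned direct DML escape hatch.
-- Fix a bad record manually instead of defining a pipeline for it:
UPDATE lake_db.orders_enriched
SET order_total = 0
WHERE order_id = 'ord-123';

-- GDPR "right to be forgotten": remove one user's rows directly,
-- even though the data lives in the data lake.
DELETE FROM lake_db.user_events
WHERE user_id = 'user-42';
```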
[00:28:35] Unknown:
Another aspect of what you were talking about is timeliness. As I was reading through some of the documentation around what you're building in SQLake, it was talking about being able to automatically manage the partitioning and chunking of the underlying data so that you can parallelize across it. And I'm curious about some of the challenges that you're experiencing in being able to manage the timeliness of a given pipeline as the volume or complexity of the underlying data scales?
[00:29:03] Unknown:
So I think the first thing was to build the Upsolver processing engine to be headless. We don't have a single coordinator node or manager node or something like that, which would be a single point of failure. That is done with the data aware architecture I described, because every period of data is a task, and every task is going to be idempotent. So it's going to write the exact same bytes every time you're gonna run it, regardless of the time you're gonna run it. So I can actually look at the file system, and the file system is going to help me understand what's the next task that I wanna run. So let's say I'm gonna move from a 4-instance cluster to an 8-instance cluster, I'm gonna get 2x on performance. It makes it much more predictable to understand what I'm actually going to need to support my use case. Now on top of that, there are 3 scaling policies, we sometimes even do some custom ones, but 3 general scaling policies. One is optimizing for lower cost and another is optimizing for consistent lower latency. And the difference between them is basically how aggressive you wanna be with scaling up and then scaling down.
So that's the second method.
[00:30:11] Unknown:
Another interesting challenge that a lot of people in the data ecosystem are trying to address is how to think about pricing and what are the reasonable units to use for computing usage. And I'm wondering if you can talk through your journey of that experience and some of the ways that your experiences with the initial implementation of Upsolver have informed your current approach to how to bring SQLake to market and make it accessible to a wider variety of companies?
[00:30:42] Unknown:
So there are 2 parts to that. The first part is the model: what are going to be the units that we are going to use? And the second part is the entry level. Do I need to pay 5 figures, 4 figures, 3 figures? Like, how much do I need to pay when I actually get started? Now when Upsolver started to sell its product in 2018, we chose a compute-based model. So, basically, based on the number of CPUs. Now the problem with that is that CPUs have an indirect relationship with the value that you're gonna get. I'll give you an example. I have a pipeline that's running on 16 instances, and now I need some more RAM.
So I need to double up, and I need to go to bigger instances. And now I'm paying double for the same pipeline, for the same volume of data, without creating additional business value. That creates friction with users. It creates predictability problems for buyers. So I'm not a fan of going to our buyer and telling them, hey, I know you have bought for x amount, but now you're gonna probably end the year at 4x. Maybe you should go talk to your CFO. That's not a good conversation you wanna send them to have. And it's very hard for them to predict how much it's actually going to cost, and it prohibits usage. We talked about that quite a lot. It was a 6-month project of talking to users, understanding what to do. And we decided that the Upsolver unit of economics is going to be the volume of terabytes ingested, emphasis on ingested. What does this mean? You, Tobias, are going to ingest data into Upsolver. You're gonna pay $99 per terabyte that you ingest.
After you've ingested the data, you can run as many pipelines as you want. We are not gonna charge a software fee from you. Upsolver is gonna run in your account. You're still gonna pay for the instances, but the payment for the instances is so low because it's spot instances. It could be like 2% of what you would pay for the same capacity in a data warehouse, for example. So the cost of the instances is almost negligible. And on the other hand, you understand exactly that if you are processing a 100 terabytes a year, and next year you're gonna process 150, then you understand how much you're gonna pay for the Upsolver software, and it also ties much more strongly and much better to your business value. So if I have a game and my game yields 100 terabytes per year, and now it's gonna yield 150, probably it's because I have more users. And if I have more users, I'm willing to pay more for the data. So it ties better to my business objective. It was more predictable, and the users just loved it. Like, we started migrating existing customers, talked to new customers. We closed some deals that were stuck on the compute-based model. We love the data-ingested model.
It's much, much simpler to explain also. Suddenly I, Ori, as the founder, am no longer at the deal desk for the salespeople. They can just give very simple pricing to their customer, and we can lead with pricing. So instead of asking you to do a POC and after 2 months tell you how much it's actually going to cost, because we actually need to try it, we can tell you how much it's gonna cost before the POC. You know if it makes sense to change an existing system and even to invest resources in this project. So this is the reason why we made the change, and we're very bullish about the change. Now the entry point for Upsolver in the past was roughly $20,000. So we wouldn't start unless you would pay an annual fee of $20,000. We would only look at kinda doing a sales-led type of motion. And today, you can go sign up to Upsolver and start paying $99 per terabyte. If you have a terabyte a year, you're gonna pay $99 for the year.
So you're not gonna pay a lot for data volume, especially if you work on your own. If you work with us, we do charge for the ongoing support services. But if you are independent, and you can be independent on a SQL-only system, the cost is very low for you to get started.
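As a quick worked example of the ingest-based model using the numbers from this conversation ($99 per ingested terabyte, with pipelines on top uncharged):

```sql
-- Worked example of the ingest-based pricing described above:
-- 100 TB ingested in a year  -> 100 * 99 = $9,900 in software cost
-- 150 TB the following year  -> 150 * 99 = $14,850
SELECT
  100 * 99 AS annual_cost_100_tb_usd,   -- 9900
  150 * 99 AS annual_cost_150_tb_usd;   -- 14850
```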
[00:34:38] Unknown:
In terms of the applications of SQLake, now that you have people using it and applying it to their problems, what are some of the most interesting or innovative or unexpected ways that you've seen it used, or workflows that you've seen people build around it? As a data engineer, I tend to drift to the technological
[00:34:58] Unknown:
examples that are really cool, and those are usually the scale ones. So seeing a single workload with 3 million events per second and over a 100 petabytes a year, that was pretty cool. Doing it with so much data for that customer, that sometimes we just ran out of EC2 instances in the region when we needed to do a replay, and things like that. You start facing problems that you didn't even think about when you started planning. So I think that very big data scale was 1. The second was, I would say, streaming. So, for example, a company doing oil rigs, and they were only able to see the data with a 15 minute, 20 minute, 30 minute delay, and now they're getting that at under 1 minute latency.
Now that sensor data is bringing a lot more value. It's actually meaningful for the organization at very different levels. That was very cool to see. And maybe the last, like, 1 of the problems that got me into Upsolver was a problem of building machine learning models over advertising data, and we needed to get to very fine-grained user information. So the fact that people are building a pipeline that's enriching the ad opportunities that I'm looking at with information about a billion users, and you have granular information on a billion users, and you can do that process a 100,000 times per second based on what we created, that was very cool. Yeah. I kind of drift to the high-scale use cases, and think that the business use cases, the real-time ones, are the ones that are the coolest.
[00:36:29] Unknown:
In your own experience of building SQLake and overseeing its development and rollout, what are some of the most interesting or unexpected or challenging lessons you've learned in the process?
[00:36:41] Unknown:
Maybe I'm still digesting, but I feel like switching to a SQL system from a UI-based system means, like, you're basically selling people a database. We're selling something that's very, very similar to a database. We have to be very creative in how we teach them to use that database, so it wouldn't be like a knife you can draw but never use. How do we provide a lot of examples? As a result of the early user feedback of "how do I actually find an example of what we are doing", SQLake has the concept of templates built very strongly into the product.
Every time we see a nice use case, we just add a template for it. And there is a power of community here. As more customers join SQLake, we can create more templates. They can create more templates. And now you have an example for every pipeline that you wanna do. So you can tell the story of an entire data infrastructure on a single document of SQL commands. And that tells a very powerful story. And until we started talking to customers, we didn't understand that the system is the template. We kinda just wanted to give them this empty SQL worksheet, thinking that they would take it from there. And we covered a lot of ground as a result of that. We added snippets. We added templates. Like, as many examples as we can have. Like, there's a GitHub repository. Like, there's a bunch of examples as a result of the feedback we received there.
[00:38:07] Unknown:
For people who are interested in being able to simplify the operational aspects of managing their data, what are the cases where SQLake is the wrong choice? I feel the "why now" for SQLake is
[00:38:21] Unknown:
going more real time. So I kind of put the line in the sand at 1 hour. So if you're not interested in delivering data on an hourly basis, maybe, like, old-style orchestration is good for you. Like, it's not that you don't have a good batch alternative, that Airflow is bad at it, or something like that. Like, that problem is a solved problem. And if you don't have the need to go modern, then don't. Maybe then Upsolver is not the tool for you. I think maybe the second option is if you need a connector tool. So if you wanna bring data from a bunch of SaaS applications, that's not the Upsolver use case, at least not today.
That is more of a Fivetran type of use case. Upsolver is focused on streaming data coming from Kafka and Kinesis, data files in object storage, and database replication. Database replication is something that we added about a year and a half, 2 years ago. So those are the 3 main, not use cases, 3 main types of data. Bringing data from Salesforce and HubSpot and kinda creating the marketing data warehouse based on multiple connectors, that's not an Upsolver use case.
[00:39:27] Unknown:
As you continue to build and iterate on SQL Lake, what are some of the things you have planned for the near to medium term or any particular problem areas or projects
[00:39:38] Unknown:
that you're excited to dig into? There are a few. So I think that the first one that's going to come out is data quality. So basically, every table in Upsolver is going to have expectations, and you will be able to basically understand when expectations are being violated and improve your data quality. That's something that we heard a lot about from customers: they don't want to just define the pipeline. They wanna define all of those expectations in order to proactively monitor their data quality. I've heard it enough times, and I'm very excited about releasing it. It's already in the works.
Other than that, we are going to add support for Iceberg. So right now you can write with Upsolver to data warehouses, and you can write to data lakes if you're using a Hive Metastore or Glue catalog, but we are not creating an Iceberg catalog. That's actually not so hard to do, and we are planning to do this in Q1. This will create a nice opportunity for people using tools like Snowflake, for example. So today, you're using Snowflake, we first create a copy of your data in the lake, then we push it into Snowflake, but your entire data set isn't queryable using Snowflake. Now that Snowflake supports Iceberg tables, we'll be able to create those tables, and you will be able to query your data from Snowflake, whether it's from the lake or from the warehouse. And that's not just true for Snowflake. It's true for any other system that supports Iceberg. So I think that's an important integration that we want to add. One that I mentioned before was direct inserts, updates, and deletes.
So you'll be able to fix data manually when you don't want to work with the continuous pipeline model. You just wanna fix something. It works like a database, so it needs to work like a database. Those are the main 3 that come to mind.
[00:41:21] Unknown:
For anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:41:35] Unknown:
I feel like the gap in data engineering is the number 1 missing skill gap today. Like, it's 4 times worse, I did the numbers, than the gap we have with data scientists. Like, you have 1 data engineer for every 2 data scientists, and you have 2 open data engineer positions for every open data scientist. So, like, a 4 times gap, it's a big problem. If you ask data engineers where they spend most of their time, they're spending it on data pipelines. And I feel like if this is the biggest thing that data engineers are doing, then this is the biggest problem to solve, because organizations move at the pace of data engineering. Now I would very much like to see the world going to a data mesh future where data teams are independent and data engineers are more enablers, more infrastructure developers, than people that are doing ad hoc development for data teams.
And by allowing pipelines to be easier, you're increasing the audience that can do data pipelines. For those 2 reasons, the growth in data mesh and what's currently kind of bogging down data engineers today, I think data pipelines are the biggest problem. Like, data warehouses, you have a bunch of them, they work very well. So I feel like the problem of moving data is the most time consuming.
[00:42:52] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on SQLake, and having the bravery to completely reimagine the product that you're releasing. So I appreciate all of the time and energy that you and your team are putting into making the creation and management of these real-time pipelines more tractable. So thank you again for that, and I hope you enjoy the rest of your day. Thank you, Tobias.
[00:43:24] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Ori Rafael and Upsolver
Overview of the SQLake Product
Challenges with UI-Driven Workflows
Design and Technical Aspects of SQLake
User Experience and Onboarding
SQL Dialect Extensions and Nested Data
Orchestration and Scaling Challenges
Pricing Model and Market Accessibility
Applications and Use Cases of SQLake
Future Plans and Developments