Summary
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers, it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service
Interview
- Introduction
- How did you get involved in the area of data management?
- What is Alooma and what is the origin story?
- How is the Alooma platform architected?
- I want to go into stream vs. batch here
- What are the most challenging components to scale?
- How do you manage the underlying infrastructure to support your SLA of 5 nines?
- What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?
- How do you sandbox users’ processing code to avoid security exploits?
- What are some of the potential pitfalls for automatic schema management in the target database?
- Given the large number of integrations, how do you maintain the code that interacts with all of those external APIs as they evolve?
- What are some challenges when creating integrations, isn’t it simply conforming with an external API?
- For someone getting started with Alooma what does the workflow look like?
- What are some of the most challenging aspects of building and maintaining Alooma?
- What are your plans for the future of Alooma?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Alooma
- Convert Media
- Data Integration
- ESB (Enterprise Service Bus)
- Tibco
- Mulesoft
- ETL (Extract, Transform, Load)
- Informatica
- Microsoft SSIS
- OLAP Cube
- S3
- Azure Cloud Storage
- Snowflake DB
- Redshift
- BigQuery
- Salesforce
- Hubspot
- Zendesk
- Spark
- The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps
- RDBMS (Relational Database Management System)
- SaaS (Software as a Service)
- Change Data Capture
- Kafka
- Storm
- Google Cloud PubSub
- Amazon Kinesis
- Alooma Code Engine
- Zookeeper
- Idempotence
- Kafka Streams
- Kubernetes
- SOC2
- Jython
- Docker
- Python
- Javascript
- Ruby
- Scala
- PII (Personally Identifiable Information)
- GDPR (General Data Protection Regulation)
- Amazon EMR (Elastic Map Reduce)
- Sequoia Capital
- Lightspeed Investors
- Redis
- Aerospike
- Cassandra
- MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new t-shirt. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service. So, Yair, could you start by introducing yourself? Yeah. Thank you, Tobias. I'm Yair,
[00:01:21] Unknown:
one of the founders and CTO. I spent my time building data pipelines in different places. I spent about 12 years in the military in Israel and about 4 years in advertising technology at a company called Convert Media. It was eventually acquired by Taboola. And throughout that time, I was focused on building data pipelines over and over again in different places. And that's what brought me to start Alooma as well, about 4 and a half years ago. And do you remember how you first got involved in the area of data management? There was just a lot of data to process, and someone had to do it at that time. The title data engineer was not in use yet. It was about 15 years ago. We started by trying to create a way to ingest data that was generated back then in the military into somewhere where analysts could access it. And that proved to be not such a trivial problem. And that has pretty much been the same course for the last 15 years.
[00:02:20] Unknown:
And so, as we mentioned, we're talking about your company, Alooma. So I'm wondering if you can describe a bit about what Alooma is and does, and the origin story that brought it about. Yeah. Sure. So, about 4 and a half years ago,
[00:02:35] Unknown:
my other 2 founders, Yoni and Rami, and I thought there was a problem in the data integration space. I mean, we felt that we were building the same product, or the same thing, over and over again wherever we were, in order to ingest data, to make data ready for analytics, to process data. And when we started Alooma, what we thought is: this is not a new problem. I mean, this problem has existed for, like, 30 years. So why is it so difficult? There are big companies out there. Almost every big company, like Oracle, Microsoft, IBM, Cisco, Dell, has some product around data integration. So why are we finding ourselves building the same thing over and over again? And when we started the company, we actually thought we were missing something. Our initial reaction was: we must be missing something. We must be doing something that we are unaware of. I mean, how come this doesn't work, for such an old problem with so much money invested in trying to solve it? So what we did is we went ahead and tried to talk to as many companies as we could. As engineers, and all of us are engineers, our tendency was to get in a room and start coding. But we thought that if we did that and we were indeed missing something, then we might never find out what we were missing. So we went out of our comfort zone and went to talk to about 150 different companies, most of them in the Bay Area, but, you know, all over the place as well. And we tried to ask them: how do you ingest data today? What works? What doesn't work? And then we heard a very similar story over and over again from those 150 companies, and they were all shapes and sizes. We got to someone at Home Depot and Coca Cola, and then we got to, you know, a 5-person startup in, like, a basement. So we pretty much heard the same story over and over again. And that story was: in the data integration space, there were traditionally 2 different product categories. One category was the ESB, or enterprise service bus, with companies like TIBCO and, you know, a newer company, MuleSoft, recently acquired by Salesforce, which were focused around communication between different applications inside the org. And it was focused around on-premise applications and real time. So, you know, let's say an enterprise has a thousand internal applications.
You only need to connect each and every one of them to the bus, instead of creating, like, n-squared connections connecting each one to every other. And the other type was ETL, or extract, transform, and load, where, traditionally again, companies like Informatica, Microsoft SSIS, and Oracle Data Integration focused around batch loading of data and preparing it for analytics. So extracting data from where it resides, some transformation to make the data ready, and then loading it into the favorite analytics or BI tool. And, again, traditionally this was done in what is referred to as OLAP cubes.
So some BI software that holds the cube inside it and usually had all of the data in memory. And that worked for many, many years. But what we found out in those conversations is that 2 major changes have happened, and that resonated with our experience, with our difficulties using the incumbent products for data integration. And you know what? Maybe in retrospect that sounds very trivial. Hindsight is always 20/20. But at the time, it took us some time to get to those conclusions. So the first was that everyone is moving to the cloud. And the cloud, in that context of data integration, means 2 things. One is that the data itself is moving into the cloud, is moving into, you know, cloud storage like Amazon S3 or Google Cloud Storage or Azure Blob Storage.
And then, of course, cloud data warehouses like Amazon Redshift or Snowflake or BigQuery, etcetera. But it also means that some of the applications are moving to the cloud as well. So companies started using SaaS instead of on-premise applications. So Salesforce replaced Siebel, and we have Marketo and HubSpot and Zendesk. And pretty much every company today is using a lot of different SaaS vendors. So even the applications themselves are no longer on prem, no longer part of the org. And on the other hand, what we've seen is that real-time data is becoming more and more important for more business use cases. And I'm sure everyone listening has taken an Uber in their life, and Uber has this amazing feature of surge pricing, where the price is determined on the fly according to supply and demand. I'm now in Israel, in Tel Aviv, and we don't have Uber here because of government regulation. And during rush hour, it's just impossible to get a taxi. But in order to make it possible for the price to be allocated dynamically using supply and demand, you need real-time data. You need real-time access to the passengers, to the cars' data, etcetera. And the traditional ETL tools, they work in batches. They run every day, maybe every 12 hours. And then what we heard over and over again is: whenever we are using something in the cloud or we need real-time data, we are building it ourselves in house using open source tools. So we spin up Kafka and Spark or Storm or, you know, all of those great Apache open source projects that allow us to build a real-time data pipeline in the cloud. And that was the trigger for us to start Alooma. Because what we thought to ourselves is: if so many companies are building this in house, putting a team of engineers on it, and there is even a new title, data engineer, that didn't exist 10 years ago, and telling them, build that pipeline for me, maybe there is a good spot in the market to build a product that can do that as well. You mentioned that there was this need for
[00:08:44] Unknown:
access to real-time data, and with the advent of the cloud and SaaS, it was no longer as simple as just plugging everything into the enterprise service bus or just relying on this batch ETL architecture. And so I'm wondering if you can dig a bit into how Alooma is architected
[00:09:02] Unknown:
and what your decision making process was around whether to go with a stream versus batch architecture for it. Yeah. Sure. So, following those assumptions that we had found, we made 2 main core decisions when we started architecting Alooma. One was to make everything in the cloud. We are going to host Alooma 100% in the cloud. It is going to be a multitenant platform, and we are going to talk about the challenges of that later. But 100% multitenant, running exclusively in the cloud, with great focus around cloud integrations and cloud data warehouses. And the other was to build Alooma around streaming data. So we want to try and turn everything in the world into streaming. And, you know, it is not always possible.
Sometimes there is just data that you cannot get as a stream, especially if you're getting it from an external third party source. But a lot of data sources that traditionally are not streaming, we have worked very hard to turn into streams, into, let's say, an immutable set of changes. Actually, I highly recommend a blog post by Jay Kreps, which is called The Log, which explains how everything is an event stream. If we look at a database, we need to look at the transaction log of the database as an event stream. If we look at the changes that happen over time to, let's say, something that's being updated, we need to look at every update as an event instead of just looking at the final value at each point. So that's also a core thought around the Alooma architecture: try and make everything into an event stream. Now, I think we can divide the different types of data that people want to ingest into 4 categories.
One is files. You have files, and those files can come from many, many different places or sources, and you want to ingest those files. And ideally, what we try to do there, again, is make it as streaming as possible, for example with agents that sit on the log directory and just ship every new log row as it's being written, or, even if it's polling some location as frequently as possible and streaming the file in micro-batches, that's still as fast as possible. Then there are clickstream events, which are streaming by nature. Let's say we have some JavaScript, or an SDK for iOS or Android, that is sending events; those are just streaming by nature. The third category, which was maybe the most difficult to turn into streaming, was RDBMS.
So you want to ingest data from, let's say, a transactional database and run some analytics on top of that. That's a very common use case. And as I mentioned, what we try to do is read the transaction log. Sometimes it's easy to access the transaction log; sometimes it's very, very difficult. And the fourth and last is APIs. So if you need to access some kind of external SaaS API, unfortunately, in those cases streaming is not always possible. So, again, you need to look into the API and see, how can I implement change data capture? How can I capture all the changes from that API? And you do your best. With some APIs it's possible; with others it can be very difficult, and you have to work with micro-batches. But the core idea was to make everything streaming, or data streams, and then build the capabilities of stream processing inside the product itself. Makes sense so far? Yes. Absolutely. Great. So then we turned everything into streaming, and we tried to distill, okay, what are people doing when they're building a data pipeline? So we know they spin up Kafka, we know they might spin up Spark Streaming or Kafka Streams today or Storm. But what are the core components of a pipeline if you try to dissect it? We broke it down into a few main components and then tried to tackle each and every one of them. So, as I mentioned, one part is just ingesting the data, getting data onto the message bus or into Kafka. By the way, Alooma also uses Kafka under the hood. It's an amazing product. But we've seen a lot of companies using a cloud vendor solution like Google Pub/Sub or Amazon Kinesis. Just getting everything onto the message bus, that's the integration or ingestion part. And that's table stakes. I mean, you cannot build a data pipeline if you cannot ingest data. From there, there is one component of running code on the data. So you have data; you need to process it somehow. You need to be able to run code on the data that streams in. In Alooma, we call that component the Code Engine, but you could call it many different names. And then the next part, and that's something that usually is not well thought out in the many ad hoc data pipelines that we've seen, is the schema repository, or how do we do schema detection, schema translation, automatic schema amendment of the target? How do we deal with schema? And, actually, I'm delving into a rant here, but there was this concept of schema on read, and that concept was actually pretty strong for a long time. It was like, okay, let's have a data lake. Let's spin up a Hadoop cluster. Let's shove everything in there and then take care of the schema when we read the data. And I think that conception has a lot of flaws, because then you are unaware of schema changes until you actually try to read the data. And then you find out that you have a data swamp, and you don't know where anything in the data lake even resides anymore. So that schema on read concept, for maybe 4 or 5 years, ruled the data integration world. And I think that today more and more companies are moving off of schema on read and are moving to schema on write.
But at Alooma, we try to make it one of the core components of the pipeline: understand what is going through the pipeline, understand the schema, understand how the schema evolves, and create a schema repository with all the history of the schema changes. And then, of course, apply it wherever it needs to be applied. And then the last piece that is also often overlooked is how to handle errors.
What happens if something doesn't fit the schema? What happens if something doesn't fit the bill, if a component is down, if your Redshift cluster is not responding, if your processing logic is flawed? How do you gracefully recover from those without getting that page at 2 in the morning? So these are the main components that we've tried to build into the Alooma architecture, and this is what we had in mind when trying to design a data pipeline as a service.
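To make that component breakdown concrete, here is a minimal, self-contained Python sketch of the pieces Yair describes: ingestion, a code-engine-style transform, a schema repository check, and an error queue for events that don't fit. It is an illustration only, not Alooma's implementation; every name in it (Pipeline, SchemaRepository, the transform hook) is hypothetical.

```python
from collections import deque

# Hypothetical illustration of the pipeline components described above:
# ingest -> transform ("code engine") -> schema check -> load or error queue.

class SchemaRepository:
    """Tracks the observed schema (field -> type) and its change history."""
    def __init__(self):
        self.current = {}
        self.history = []

    def check(self, event):
        """Return fields whose type doesn't match the known schema."""
        mismatches = []
        for field, value in event.items():
            known = self.current.get(field)
            if known is None:
                # New field: record it and remember the change.
                self.current[field] = type(value)
                self.history.append(("added", field, type(value).__name__))
            elif not isinstance(value, known):
                mismatches.append(field)
        return mismatches


class Pipeline:
    def __init__(self, transform):
        self.transform = transform      # user code, the "code engine" part
        self.schema = SchemaRepository()
        self.error_queue = deque()      # events to restream or fix later
        self.loaded = []                # stand-in for the target warehouse

    def ingest(self, event):
        try:
            event = self.transform(event)
        except Exception as exc:
            self.error_queue.append((event, f"transform failed: {exc}"))
            return
        bad_fields = self.schema.check(event)
        if bad_fields:
            self.error_queue.append((event, f"schema mismatch: {bad_fields}"))
        else:
            self.loaded.append(event)


if __name__ == "__main__":
    pipeline = Pipeline(transform=lambda e: {**e, "source": "web"})
    pipeline.ingest({"user_id": 1, "amount": 9.99})
    pipeline.ingest({"user_id": "oops", "amount": 5.00})  # type drift -> error queue
    print(len(pipeline.loaded), "loaded,", len(pipeline.error_queue), "errored")
```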
[00:15:29] Unknown:
And touching a bit more on the topic of schema on read versus schema on write: when you're just shoving the data into the data lake without processing it, without even looking to see whether it fits the intended shape Mhmm. it will easily hide a lot of potential errors in your source data or your processing. To your point of having some sort of error catching, when you do schema on read, there's no point at which that's even possible until there's no way to recover from it, and you potentially have terabytes or petabytes of data that is completely useless.
[00:16:13] Unknown:
Yep. Exactly. I completely agree. And, you know, I've been using data lakes myself as well, and I've seen that happen. I've seen it go wrong and go right. So I definitely
[00:16:25] Unknown:
don't recommend that approach today. Yeah. I agree. And one thing too, as far as being able to do the processing with the code engine or in the ETL workflow: one thing that seems to be gaining a lot of prevalence is the concept of ELT, where you do the transformation after you've loaded the data Mhmm. in order to be able to maintain the integrity of the data at its origin, versus doing the transformation and potentially either stripping away data or incorrectly merging it. So I don't know if you have any thoughts on that. Yeah. Yeah.
[00:17:01] Unknown:
I might have some thoughts. So, actually, the approach that we took at Alooma, and that I personally believe in, is a two-pronged approach. First, you need to maintain the raw data always, and Alooma, by the way, does that. So we allow you to push your raw data into a cloud storage of your choosing, and it's your cloud storage. So first of all, you have to maintain the raw data for, you know, as long as you want, to make sure you can replay it, you can redo the processing logic. So the raw data should be kept raw, of course. But on the other hand, not all transformations are a good fit for a data warehouse. So let's say you are pushing the data into Snowflake, for example, one of the fastest growing cloud data warehouses, and you want to do ELT. Not every transformation is a good fit to run on the data warehouse itself. Some are much more efficient to run, let's say, in Python on the stream itself.
So what we do, in addition to storing the raw data to make sure that you can always reprocess data and you don't strip out any important information, etcetera (by the way, there is one exception to that that I'll mention soon), is support both, let's say ETLT: both transformations on the stream itself and transformations on the warehouse itself. And just let me give you one very, very funny example from one of our customers. One of our gaming customers released a version of their game, and there was a field that was usually Boolean, and one developer decided it might be a good idea to fill it with yes and no instead of true and false. Now, if you try to handle that as a pure ELT process, you will need to create a new column that is a string or varchar and then create some SQL to cast that string of yes and no into the Boolean column that you really want to query. And of course, you don't want to query strings. And of course, you cannot fix the problem at the source, because once you've released an update to something a user runs on their device, they might never update again. You might be stuck with that bad data forever. So there are some transformations that make so much more sense to run on the stream. And I think a pure ELT approach is often destined to fail. On the other hand, of course, you cannot do everything that way; ETL is not the solution for everything either, because some transformations do make much more sense to run in SQL. So the approach we took is an ETLT approach, where you can run code that transforms the data on the stream itself for the things that are more efficient to do that way, and then we provide a mechanism to run scheduled queries on the data warehouse side. So you can schedule a job that creates roll-up tables, creates fact tables, denormalizes schemas, things that make much more sense to run in SQL. So I think you need to try and get the best of all worlds there.
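The yes/no Boolean story above is the kind of fix that is easiest to do on the stream. Below is a rough Python sketch of what such a stream-side transform could look like; the transform(event) hook and the is_premium field are assumptions made for illustration, not Alooma's documented API.

```python
# Hypothetical stream-side transform in the spirit of the gaming example above.
# Assumed contract: the engine calls transform(event) for every incoming event
# dict and loads whatever it returns. Field names are made up for illustration.

TRUTHY = {"yes", "true", "1"}
FALSY = {"no", "false", "0"}

def transform(event):
    # A client release started sending "yes"/"no" in a field that should be
    # Boolean. Casting it here keeps the warehouse column a clean BOOLEAN
    # instead of forcing an ELT-style VARCHAR column plus a SQL CAST on read.
    raw = event.get("is_premium")
    if isinstance(raw, str):
        lowered = raw.strip().lower()
        if lowered in TRUTHY:
            event["is_premium"] = True
        elif lowered in FALSY:
            event["is_premium"] = False
        # Anything else is left untouched so it surfaces in the error queue
        # rather than being silently coerced.
    return event


if __name__ == "__main__":
    print(transform({"user_id": 42, "is_premium": "Yes"}))
    print(transform({"user_id": 43, "is_premium": False}))
```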
[00:19:55] Unknown:
And going back to the architectural level of what you've built at Alooma: there are these different components and different aspects to it, but I'm wondering if there are any particular pieces that you've found to be the most challenging to scale as you grow the business.
[00:20:09] Unknown:
Uh-huh. Yeah. That's a great question. I am almost tempted to say every piece, but maybe let's dive in a little bit more. So one piece that's actually, you know, surprising at first: if you want to get clicks or events, you need an endpoint, and you need that endpoint to be on the Internet. You need it to be able to accept data from all clients, like millions and millions or hundreds of millions of different clients, and they could be browsers, they could be mobile devices, and they just send you the data at all times. And just creating an event endpoint that is reliable, because if that endpoint is not responding, you're losing data, you're losing your customer's data. So just creating that endpoint and making it scalable so it can handle peaks, you know, things like the Memorial Day weekend shopping spree or the holidays, where sometimes you can have 10x or maybe 100x peaks. Making it elastic and able to withstand the volume of incoming data, that in itself was quite a challenge. Then, let's say the data got in; the first component we have is Kafka. Now, Kafka itself is actually one of the most amazing pieces of software I've seen in terms of scalability. It is extremely easy to scale Kafka, and it's very, very reliable. But Kafka uses ZooKeeper. And ZooKeeper, if you ever get a chance to work with it, is not one of the easiest pieces of software to scale. The reason is that ZooKeeper cannot scale for writes. And actually, this is already ancient history, but at some point Kafka, I think version 0.8.1 or something, released an update where they don't save the topic information in ZooKeeper anymore, but use Kafka for that, and only use ZooKeeper for what it was designed to do, which is leader election. That relieved a lot of the pressure on ZooKeeper. But before that happened, we had tons of issues with scaling ZooKeeper to be able to support our Kafka deployment. On the event processing side, one of the biggest hurdles we had to tackle is exactly-once processing. And, of course, if you read a little bit about exactly once, there isn't really exactly once. The best you can have is idempotence, which means you might need to repeat the processing on the same data, but you need to make sure that the results stay the same even if the same data is ingested more than once. And the implementation there is not the easiest, because basically what you need to do is move to micro-batches, because you cannot ack every single event; that would just be too time consuming.
And then, when a micro-batch fails, you need to be able to reprocess that micro-batch without affecting the result. And there are a lot of implementation details of how to implement that where, on one hand, you actually maintain idempotency, and on the other hand, you can withstand scale. Because when you scale up, you might need the batch size to adjust itself automatically. You might need different batch sizes for different customers. You need to make sure that you don't have starvation, because, of course, every tenant is independent. That was quite a challenge for us as well. We were using Apache Storm for the stream processing component itself. We're actually now in the process of moving off of Apache Storm onto Kafka Streams. We've been running it in dev for some months now and feel like it's ready to move on to production. The reason is that Kafka Streams has become more mature and implements a lot of the exactly-once semantics that we had implemented ourselves in Storm. I think these are the main ones that come to mind.
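As a rough illustration of the idempotent micro-batch idea described above (not Alooma's actual mechanism), the sketch below derives a deterministic batch ID from the source offset range and records committed batches, so that replaying a failed batch cannot double-count results. All names are hypothetical.

```python
import hashlib
import json

# Hypothetical sketch of idempotent micro-batch processing: a batch is
# identified by its topic, partition, and offset range, and results are
# committed keyed by that batch ID, so replaying the same batch after a
# failure has no extra effect.

committed_batches = set()   # stand-in for durable state in the target system
warehouse_rows = []         # stand-in for the target table

def batch_id(topic, partition, first_offset, last_offset):
    """Deterministic ID for an offset range; the same range always hashes the same."""
    key = json.dumps([topic, partition, first_offset, last_offset])
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def process_batch(topic, partition, first_offset, last_offset, events):
    bid = batch_id(topic, partition, first_offset, last_offset)
    if bid in committed_batches:
        # Already applied: a retry of this batch is a no-op (idempotence).
        return "skipped"
    transformed = [{**e, "_batch": bid} for e in events]
    # In a real system the rows and the commit marker would be written in one
    # transaction (or the load itself would be an upsert keyed by _batch).
    warehouse_rows.extend(transformed)
    committed_batches.add(bid)
    return "committed"

if __name__ == "__main__":
    evts = [{"id": 1}, {"id": 2}]
    print(process_batch("clicks", 0, 100, 101, evts))  # committed
    print(process_batch("clicks", 0, 100, 101, evts))  # skipped on replay
    print(len(warehouse_rows))                         # still 2
```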
[00:23:39] Unknown:
And in your documentation, you mention that you have 5 nines of uptime in terms of data processing reliability. So I'm wondering how you manage the underlying infrastructure to be able to support that SLA.
[00:23:53] Unknown:
Yeah. That's a good question. Actually, it's also not easy to maintain. I would say our general approach to infrastructure management has 2 core principles. One is monitoring, and we actually trust monitoring more than testing. Very strict monitoring to make sure that whenever something goes wrong, we know immediately. And the other part is very frequent releases to production. So it's almost counterintuitive when you're saying, okay, how can you maintain 5 nines? What happens if you create a bug and deploy it to production to all of your customers? The idea is that if you have strong monitoring, very frequent releases, and a very strong rollback mechanism, then if you create a problem, you can recover from it extremely quickly. Then, in terms of the underlying infrastructure itself, we run the processing on Kubernetes.
So if any component fails, it recovers, so there's a lot of redundancy there. And, again, as I mentioned, the micro-batch approach makes sure that if some batch fails in the middle, we can reprocess that batch without creating duplication in the data and without creating any missing data. But maybe I should have started the answer with: there is no silver bullet here. There is no silver bullet that tells you, implement this and you'll have 5 nines. It's just many, many lead bullets that you need to use and do correctly. But if I do distill it, it's monitoring and quick deployment, so that any issue can be traced to what exactly caused it very, very quickly. Yeah. And I work in the systems management and cloud automation space, so the release-often
[00:25:37] Unknown:
mantra is definitely something that I'm very close to. And to dig a bit more into that, the reason that it makes it easier to manage your systems when you release frequently is that every change set is very small, so it's easy to see Yep. what was the difference that caused this error, and it makes it easier then to roll forward rather than roll back, because rollback is largely a fantasy: particularly if you have anything involving a schema change, you can't just roll back the code Yeah. and then have everything work again. Yeah. For sure. Roll forward is what we do most of the time. And also, if you're doing frequent releases, you feel confident rolling forward because, again, it's a very small change. And so, as you mentioned, from the start you've been multi-tenant in your entire infrastructure. And I'm curious what additional complexities that introduces, particularly around keeping customers' data separate and meeting their various compliance requirements.
[00:26:31] Unknown:
Yeah. That's a challenge. Actually, the challenge has 2 phases. One is the technology challenge, but the other is just the trust challenge, because you have to gain the trust. The thing is, from the customer's perspective: we are going to send our crown jewels, our most important data, to you, to the cloud. And for some companies, this is an abomination. They're not used to that at all. So you need to overcome the trust issues, and, you know, mostly you do that with good salesmanship and external audits. So we have SOC 2 Type 2 and all of those external audits and compliance certifications that we do in order to gain the trust. The other side is the technology challenge, because you need to make sure you don't mix your customers' data. And it's not even just mixing data: you need to make sure you don't starve the processing for one customer at the expense of another that all of a sudden has a surge in its streaming data. And I would say the most challenging part is the Code Engine, because we let our customers run arbitrary code on our machines. So we have to make sure that it is sandboxed and they cannot get out of that sandbox. And let's say we have a malicious customer that tries to get out of the sandbox and read other customers' data; we need to put very strong walls around that. So our approach is, first, we separate the customer data by using different Kafka topics. That actually also helps us with the starvation, because then we can decide how many partitions we have per customer and how many processing units we have per customer, and we can deploy those separately and then even scale them up and down separately. So we start by having different customers on different topics. That's one. The other piece is building a very strong sandbox around the Code Engine. This is something that we've worked very hard on. And actually, if you think about it, in every company where you deploy something like Kafka plus Spark, for example, you already give every developer that has access to Spark access to all the data, because they can run arbitrary code and it is not sandboxed at all. Whereas in Alooma, because it's multi-tenant from the get-go, we've worked very hard to create sandboxes around the Code Engine. We had several implementations of the sandbox throughout the life of Alooma. The first one was based on Jython, and the Java Security Manager was used as the sandbox. Eventually, we moved off of that to a Docker-type sandbox with extra hardening on the host itself. So the customer code will run inside its own Docker container.
But then, outside of that Docker container, the host itself is also hardened, especially on the networking front, to make sure that the code cannot escape from the container it runs in. And then, on the audit and engineering side, no engineer at Alooma has access to any customer data. In order to be able to access it, let's say we need to do something for support, or to help a customer with something that pops up in monitoring, they will access a service inside Alooma that provides them a temporary token, and then they can access the customer's system or data with that token. And then, of course, everything is audited, and that token expires after some time. So by default, we restrict the access to no access.
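As a loose illustration of the kind of per-tenant sandboxing described here (not Alooma's real setup), the snippet below launches hypothetical customer code in a Docker container with networking disabled, a read-only filesystem, dropped capabilities, and CPU, memory, and process limits. The image name, paths, and limits are placeholders.

```python
import subprocess

# Hypothetical sketch of running untrusted per-tenant code inside a hardened
# container, in the spirit of the sandbox described above. Image name, paths,
# and limits are placeholders, not Alooma's real configuration.

def run_customer_transform(tenant_id: str, code_dir: str) -> int:
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access from the sandbox
        "--read-only",                          # immutable root filesystem
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--pids-limit", "64",                   # limit process count
        "--memory", "256m",                     # memory cap per tenant
        "--cpus", "0.5",                        # CPU cap per tenant
        "--security-opt", "no-new-privileges:true",
        "-v", f"{code_dir}:/code:ro",           # mount tenant transform code read-only
        "sandbox-runtime:latest",               # placeholder runtime image
        "python", "/code/transform.py",
    ]
    # The host would additionally firewall the Docker bridge and audit the run.
    return subprocess.run(cmd, timeout=60).returncode

if __name__ == "__main__":
    exit_code = run_customer_transform("tenant-123", "/srv/tenants/tenant-123/code")
    print("sandboxed run exited with", exit_code)
```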
[00:29:48] Unknown:
And in the Code Engine, you mentioned that it uses Python and your initial implementation used Jython. I'm wondering what your reasoning was for using that as your language target and whether you have plans for expanding the supported runtimes. Yeah. Sure. So originally,
[00:30:05] Unknown:
the use of Python was because that was a requirement we heard a lot from data engineers: we want to use Python. It was the language of choice at the time. And the original use of Jython was because that's how we implemented the first sandbox. We definitely plan to support more runtimes. Now that we run it on top of Docker and it's not tied to any language-specific runtime engine like Jython, it's much easier to add support for more languages. We now have Python 2.7 and Python 3, and at some point in the near future we will add JavaScript, Ruby, and Scala. And again, now it's very easy for us because we have an API.
Any Docker container that can speak that API, we don't care what it runs under the hood, essentially. The system
[00:30:53] Unknown:
does not care. And one of the things that you mentioned earlier as well is the fact that you have this schema repository with schema versioning, and I know that one of the options that you support is automatic schema management at the destination based on the schema that you introspect. So I'm wondering, what are some of the potential pitfalls of doing that automatic management of the schema in the target? Yeah. That's a great question. It actually depends a lot on the type of data warehouse you want to write the data into.
[00:31:26] Unknown:
Because the main pitfalls of automatic schema management are with data that is semi-structured. Actually, that brings up a point I wanted to mention earlier. So there are 2 main pitfalls. One principle is that there is some data that you don't want to write to the data warehouse. So if you do automatic schema management completely, you might write data that you don't want to reach the other side. It's mostly around PII data, personally identifiable information. Sometimes you don't want that to even hit your target data warehouse. So if you have automatic schema management and you add a PII field, it's being written automatically. And actually, it's also relevant to the raw data that's being written to S3, for example, to the cloud storage, because this is the one exception to keeping raw data, especially now with GDPR looming on us, which is a new EU regulation for data privacy and protection.
Companies don't want to store PII data, because they need to get extensive permission from every EU citizen, even if the company resides outside of the EU, and they need to be able to delete it, etcetera. So sometimes they don't even want to store it in the raw data. So this is one aspect: if you do automatic schema management and every new field is automatically written, you might write data that should stay secret, or that you don't want to write for compliance reasons. The other aspect of automatic schema management really depends on where you write the data. For a data warehouse that supports semi-structured data, like, for example, the variant data type in Snowflake, there isn't a real risk, because everything that's nested will just go into that semi-structured data type, and the risk is very small. If you write your data to something that doesn't support it, something like Redshift, where Alooma will automatically map the bunch of keys in the JSON data that comes from your app or from your game or whatever, it just adds more and more columns to the target. At some point, you're going to get a very wide table, and that can be a problem both for performance and for storage space. And string length is another common pitfall. Let's say you want to cast the data into a string: Alooma found out that the data usually fits into, I don't know, 5 kilobytes of string length, and then all of a sudden you have one event that has a longer string. So all of those are pitfalls that might happen to you with automatic schema management. However, I'm not sure you can avoid them if you're doing it manually. So, yes, you can avoid writing secret data, but let's say you map things manually: it's also very difficult to know upfront what the string length of a specific data point is going to be from now into the entire future. Now, if the data, by the way, is coming from a source with a strong schema, let's say it's coming from an Oracle database or from Salesforce, where the source data has a schema attached to it, I don't think there is really a risk, because you are guaranteed that the data will conform with the source schema. But if it's free-schema or semi-structured data, then you might run into the risk of creating too many columns or too-short varchar columns.
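To make those pitfalls concrete, here is a small hypothetical sketch of guarded automatic schema inference: fields on a PII deny-list never become columns, and varchar widths are padded with headroom because future string lengths are unknowable. The rules and names are assumptions for illustration, not Alooma's behavior.

```python
# Hypothetical sketch of guarded automatic schema inference, illustrating the
# pitfalls discussed above: PII fields are excluded up front, and varchar
# widths are padded because future string lengths can't be known in advance.

PII_DENYLIST = {"email", "phone", "ssn", "ip_address"}   # assumed deny-list

def infer_columns(events, existing=None):
    """Return {field: (sql_type, width)} inferred from a sample of JSON events."""
    columns = dict(existing or {})
    for event in events:
        for field, value in event.items():
            if field.lower() in PII_DENYLIST:
                continue                          # never auto-create PII columns
            if isinstance(value, bool):
                sql_type, width = "BOOLEAN", None
            elif isinstance(value, int):
                sql_type, width = "BIGINT", None
            elif isinstance(value, float):
                sql_type, width = "DOUBLE PRECISION", None
            else:
                # Pad the observed length 4x (min 256) so slightly longer
                # values don't overflow the column the day after it's created.
                seen = len(str(value))
                prev = columns.get(field, ("VARCHAR", 0))[1] or 0
                sql_type, width = "VARCHAR", max(256, prev, seen * 4)
            columns[field] = (sql_type, width)
    return columns

if __name__ == "__main__":
    sample = [
        {"user_id": 1, "email": "a@b.com", "comment": "short"},
        {"user_id": 2, "comment": "a considerably longer free-text comment"},
    ]
    for name, (sql_type, width) in infer_columns(sample).items():
        print(name, sql_type, width)
```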
[00:34:34] Unknown:
And in the case when you elect not to have the schema managed automatically and you do have source data that no longer fits the specified shape, will that just automatically end up in the restream queue to be processed manually?
[00:34:49] Unknown:
Yeah. Exactly. So Alooma has 3 modes, actually. The first mode is auto-map, which is what you mentioned. The other 2 are strict and flexible. Strict means if you have a new field, or anything that doesn't conform to the schema, put it in the restream queue, the error handling queue. And flexible means stream it anyway and just alert me. Just let me know that I have new fields I need to map, but it's okay to stream the data without those fields for now, until I take care of it. Now, of course, you don't risk losing any data, because the raw data is still being stored, so you can always replay it. But the question is: do I want to wait and have the data sit in the restream queue, or do I want the data to stream anyway and then take care of those fields at my leisure?
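The three modes can be pictured as a routing decision per event. The sketch below is a hypothetical illustration of that logic in Python (auto-map extends the mapping, strict sends the event to the restream queue, flexible loads the known fields and raises an alert); it is not Alooma's code.

```python
from collections import deque

# Hypothetical illustration of the three mapping modes described above.
# AUTO_MAP: extend the mapping with new fields automatically.
# STRICT:   anything that doesn't conform goes to the restream queue.
# FLEXIBLE: load the known fields now, alert about the new ones.

restream_queue = deque()
alerts = []

def route_event(event, mapping, mode):
    """Return the row to load (or None), updating mapping/queues as a side effect."""
    unknown = [f for f in event if f not in mapping]
    if mode == "auto_map":
        for field in unknown:
            mapping[field] = type(event[field]).__name__   # extend target schema
        return event
    if mode == "strict":
        if unknown:
            restream_queue.append(event)   # wait for a human to map the fields
            return None
        return event
    if mode == "flexible":
        if unknown:
            alerts.append(f"new unmapped fields: {unknown}")
        return {f: v for f, v in event.items() if f in mapping}
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    mapping = {"user_id": "int"}
    print(route_event({"user_id": 1, "plan": "pro"}, mapping, "flexible"))
    print(alerts, list(restream_queue))
```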
[00:35:32] Unknown:
And the risk of having data that changes shape without any sort of warning is increased by the fact that you support source data that is outside of your control, particularly from these various SaaS applications. And given the large number of integrations that you support, I'm wondering how you maintain the code base to be able to interact with these various APIs and source systems, particularly as newer versions are released and schemas or API specifications change.
[00:36:03] Unknown:
Yeah. A large part of it is, you know, a hamster-in-the-wheel job. So we have a team that's responsible for tracking those API changes, and for every API we support, getting all the release notes, understanding what needs to change, and making sure we are always in front of the API changes. Actually, one thing that caught us off guard was the Facebook API changes after the Senate hearing. They did, like, a bunch of sudden changes without any warning. So we had to run faster inside the wheel to make sure we were on top of that. But, generally speaking, it's a lot of work. You know, hundreds of different APIs, making sure you're on top of all of those. We do have a generic infrastructure that makes it much easier, where we basically only need to put in the API URLs and the paging mechanism and the authentication mechanism, etcetera, and then it creates all of the code to ingest data automatically.
So there is a lot of infrastructure around it as well. But eventually, if an API changes version, you have to make sure that you know of it, that the new version doesn't break anything, etcetera.
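The generic integration infrastructure Yair describes, where only the URL, paging, and auth details change per source, could look roughly like the hypothetical sketch below: a declarative config drives one shared paginated puller. The endpoint, parameter names, and paging style are invented for illustration.

```python
# Hypothetical sketch of a declarative integration: each source is described by
# a small config (base URL, auth header, paging parameters), and one generic
# puller walks the pages. Endpoint and field names are invented for illustration.

INTEGRATION = {
    "name": "example_crm",
    "base_url": "https://api.example-crm.com/v2/contacts",
    "auth_header": {"Authorization": "Bearer <token>"},
    "page_param": "page",
    "page_size_param": "per_page",
    "page_size": 100,
    "records_key": "results",
}

def pull_all(config, fetch_page):
    """Generic paginated pull; fetch_page(url, headers, params) -> parsed JSON dict."""
    page = 1
    while True:
        params = {config["page_param"]: page,
                  config["page_size_param"]: config["page_size"]}
        body = fetch_page(config["base_url"], config["auth_header"], params)
        records = body.get(config["records_key"], [])
        if not records:
            break
        yield from records
        page += 1

if __name__ == "__main__":
    # Stub fetcher standing in for an HTTP client (e.g. requests.get(...).json()).
    pages = {1: {"results": [{"id": 1}, {"id": 2}]}, 2: {"results": []}}
    fake_fetch = lambda url, headers, params: pages[params["page"]]
    print(list(pull_all(INTEGRATION, fake_fetch)))
```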
[00:37:19] Unknown:
And when somebody is first getting started with onboarding onto Alooma, I'm wondering what the workflow looks like for setting up the various integrations,
[00:37:28] Unknown:
what initial preparation is necessary, and how much of that is managed automatically by Alooma? Yeah. It should be very simple. So let's say you log into Alooma. You would be able to put in credentials for your data warehouse and then add integrations just by clicking a button in the UI. Let's say you're adding a Google Analytics data source. You add the Google Analytics data source, you give Alooma authorization via OAuth, you choose the report that you want to ingest, and that's it. Alooma has the automatic schema detection; generally, what we call it is one-click mode. So once you click finish, there is nothing else to be done on your side. Just wait and see your data coming in and being ingested; schemas are created, tables are created. The entire process needed to ingest the data is created automatically, and the data makes its way to the target. That's the normal flow. Usually, there is not much preparation except for credentials and passwords. Some data sources require more preparation. Let's say you want to ingest data from a SQL Server using change data capture: you need to enable that on your database. Sometimes that requires your DBA or the admin that manages the SQL Server to enable that upfront. But for most data sources, especially the SaaS ones, it's a matter of putting in the credentials and just sitting back and enjoying.
[00:38:44] Unknown:
And when you refer to watching the data come in, in your case, that's actually fairly literal because I know that 1 of the features that you provide is that live view of the data flow as it's coming from the various systems so that you can have a visual representation of how things are moving and be able to potentially identify issues
[00:39:04] Unknown:
just by doing visual pattern matching of either, you know, this looks normal or this looks abnormal, this data source isn't processing appropriately, etcetera. Yeah. Definitely. So the live view actually started purely as eye candy. Many of the Alooma employees are data engineers themselves and used to be data engineers for many years. And one of the things that came up from the Alooma employees was that the data engineer is always in the position where the CEO only knows about them when something breaks. If everything works, no one even knows they exist. It's a very gray and sad place to be, where people only know about your existence when there are problems. So one of our front end engineers wanted to create an experience where they would be able to showcase to the world even when everything is working. So, hey, look at this cool visualization. It's our data. It's flowing. You can actually see it, see the dots moving. And this was pretty cool as eye candy. But even after we rolled it out and people got excited about, you know, the eye candy properties of it, we found out there is a real use case for it. You can filter on things on that screen and then see the data flows. You see, okay, an event is coming from the web, then it goes into the MySQL database.
Then something updates in Salesforce, and I see it coming from Salesforce. So it's really useful for catching issues, for making sure the data flows correctly, exactly like you said. But,
[00:40:26] Unknown:
the story behind it is that it started purely as eye candy. That's funny. And speaking of use cases, you have a fairly extensive set of potential use cases described on your website, but I'm wondering if there are any situations where it actually doesn't make sense to use Alooma, where it's more useful to build something in house, or if there's a particular scale of company where they're more likely to have the in-house expertise to be able to build something that suits their needs better than something that Alooma can provide. Yeah. Sure. And the main thing, that's actually our first qualifying question to every company that we speak to: do you want to move to the cloud? Because if a company
[00:41:05] Unknown:
is set, and we've met some, especially bigger ones, that say, we still want to use our on-premise Hadoop data lake and we want to ingest data from our data sources into that data lake, then Alooma is not a good fit. We are focused around moving data to the cloud. It doesn't make any sense for us to take the data out of your data center to the cloud only to push it back into your data center. So this is one place where it's not a fit. I think Alooma is very useful for anything that involves the cloud: either you were born in the cloud and your data is already there, or you want to migrate into the cloud. If a company wants to stay on prem, then Alooma will not be a good fit. I would say, in terms of use cases, sometimes Alooma is not a one-stop shop. Sometimes you will still run some software side by side with Alooma. For example, we have one customer, pretty big, that's running a recommendation engine on the data from their website. That recommendation engine doesn't necessarily have to live inside the Alooma Code Engine. Sometimes it's better to run model training, for example, as an EMR job on the data in S3. So that's okay. I mean, we are not offended.
We sometimes even recommend that. We will still ingest the data, we'll put it in S3, and then you can train your models there using code that's running on EMR. From a vision perspective, of course, 3 or 5 years from now, I want Alooma to be the runtime environment for any data application and support any workload. But the focus today is mostly streaming data and going into the cloud.
[00:42:33] Unknown:
And we've spoken a bit about some of the various challenges at different layers of the technical stack for building and maintaining Alooma. But I'm wondering if there are any other big challenges that you either have faced or are facing, whether from the technical or business perspective.
[00:42:49] Unknown:
Yeah. Sure. I mean, building a company is challenging, especially since, you know, that's the first company for me as a founder, and that's true for all of the founders. We are lucky enough to have Sequoia and Lightspeed on our investor team, and they're extremely experienced. They've seen tons of companies, and they help a lot in building the company. But building a company is a huge challenge. And, you know, we have a great executive team that helps there, but it's a challenge every day, even, you know, getting the first customers and then getting from those first customers to material ARR and scaling. Today, Alooma has 130 customers running in production.
And at this point, I don't know every one of them personally, for example. With the first 20 customers, as a CTO, I knew all of their problems personally. Now there is a team of support engineers for that. So scaling the company, we could talk for another full session about the challenges there.
[00:43:49] Unknown:
And you mentioned that your goal for Alooma in the future is to be this one-stop shop for all data flows. But I'm wondering if you have any specific plans for the future of Alooma, or anything in the near to medium term. It's not a revolution; it's to evolve Alooma towards that goal. So, again, we try to support more extensive use cases. We try to support different
[00:44:11] Unknown:
types of workloads and more integrations, more languages, a more robust schema repository and error handling. It's always getting one step ahead, supporting more and more use cases. In the short to medium term, we are very focused on growth. We are trying to grow the company. The goal for us, as I said, we are now at 130; our goal is to get to 300 customers by the end of this year. It definitely seems doable with the pace that we've been growing so far. So that's the plan for the near to medium future. But again, when I say one-stop shop, maybe it's good to talk about that for a few moments. If you think about different abstraction layers, the first abstraction that happened recently was the hardware layer. And I remember, 15 years ago, our biggest problem with a project was that we did not have enough room in the server room to put the server. We didn't have enough power. I found myself going over specs of PSUs to see if this one is 400 watts or 600 watts, if I can even plug it into the wall. And today, we just spin up machines on Amazon. We don't even think about the PSU. We think about the memory that we need, the compute, etcetera. So that abstraction layer is finished. And then the next abstraction layer, with companies like Docker and Heroku, was abstracting the application server. So now, of course, we don't install Linux anymore. We just put our stuff in a container, and we ship that container into the cloud. But whenever we get to data, we go back in time. We start thinking, oh, do I need Cassandra or MongoDB, what type of workload do I have, what type of database will satisfy that need, and then I need to deploy and maintain all of this. The data layer is not abstracted at all.
My goal is to be able to create more and more APIs inside the Alooma runtime environment to interact with data, to be able to just say: I need a key-value store. I don't care what's under the hood. It could be Redis, it could be Aerospike, it could be whatever. I need a key-value store, I need sub-millisecond response time, and I need to be able to store a billion keys. And then you have an API that satisfies that, very similar to how today we have an API that spins up a machine on Amazon. Hope that makes sense. So that's the vision play, I'd say.
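That data-layer abstraction could be imagined as an interface that declares requirements rather than naming a database. The sketch below is purely illustrative of the vision; the KeyValueStore interface and the in-memory backend are made up, and a real runtime might bind Redis, Aerospike, or anything else that meets the stated requirements.

```python
from abc import ABC, abstractmethod
from typing import Optional

# Purely illustrative sketch of the "abstracted data layer" vision: the
# application asks for a key-value store with certain requirements and never
# names the backend.

class KeyValueStore(ABC):
    """What the application declares: capabilities, not a product."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore(KeyValueStore):
    """Toy backend; a runtime could just as well bind Redis or Aerospike here."""
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


def provision_kv_store(max_latency_ms: float, min_keys: int) -> KeyValueStore:
    """Stand-in for the runtime API: pick a backend that satisfies the requirements."""
    # A real platform would choose among managed backends; here there is only one.
    return InMemoryStore()


if __name__ == "__main__":
    store = provision_kv_store(max_latency_ms=1.0, min_keys=1_000_000_000)
    store.put("user:42", b"premium")
    print(store.get("user:42"))
```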
[00:46:30] Unknown:
And as a final question, as somebody who is working very closely with tooling and various platforms for managing data, I'm wondering what you see as the biggest gap in the available tools or technology for data management today. Yeah. That's
[00:46:43] Unknown:
a great question. I'm not sure there is a single gap, because for every specific task that I can think of, there is some tool that's been created. I think the biggest gap is something that ties them all together, ties everything. Because, again, today, if you want to take those data tools and build things in house, you would need to take A and B and C and D and stitch them all together. So I feel like the biggest gap is a tool that can provide the solution for all of those use cases, rather than needing a specific solution or a specific product for each use case.
[00:47:20] Unknown:
And so with that, I'll have you add your preferred contact information to the show notes for anybody who wants to follow up or follow the work that you're doing. Thank you very much for taking the time to join me today and discuss the work that you're doing with Alooma. It's definitely a very interesting platform, and I'm excited to see where you take it in the future. So thank you for that, and I hope you enjoy the rest of your day. Thank you very much for hosting me.
Introduction to Yair Weinberger and His Background
The Origin and Mission of Alooma
Architecting Alooma: Stream vs Batch
Scaling Challenges and Solutions
Schema Management and API Integrations
Onboarding and Live Data Flow Visualization
Use Cases and Limitations of Alooma
Future Plans and Vision for Alooma
Biggest Gaps in Data Management Tools