In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.
Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.
We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.
Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.
Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.
Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today. Your host is Tobias Macey, and today I'm interviewing Ken Pickering about streaming data into a Trino and Iceberg lakehouse. So, Ken, can you start by introducing yourself? Hi. Yeah. Thanks for having me on. I am the VP of Engineering at Going, which is a company that provides
[00:01:01] Ken Pickering:
travel deals for consumers. And do you remember how you first got started working in data? It was a long time ago. I came up from product engineering, but I found over time that more and more of the requirements I was working on involved data in some capacity. I think it started with ecommerce and InsurTech, really, which were based on pretty large volumes of data at the time. So recommendation algorithms for ecommerce, and then I worked for a property insurance company where we'd be able to price roughly what would be in a person's home based on region, size of house, that sort of thing.
And so that's where I really got interested in the topic, and more and more of my career has pushed further into interpreting and then productizing data in some capacity.
[00:01:46] Tobias Macey:
Your current role is at Going, where you're leading their data platform, and I'm wondering if you could just start by giving a bit of an overview of the types of data that Going relies on and the role that it plays in the business.
[00:02:01] Ken Pickering:
So yeah. Going needs to find travel deals. It's a subscription service, so you sign up and we try to find travel deals for you, and more and more that becomes a data-intensive problem. There are sort of two aspects of that. One is, well, how do you find flight prices? It's a very large problem domain. In computer science terms, it could be classified in some capacities as NP-complete, because you can get to a lot of places via a lot of routes, and so trying to find deals for customers is part of that problem space. The data volumes you're dealing with are prohibitively large in some capacity, and so you have to figure out how you want to traverse that. So ingesting large volumes of flight information and flight data is one aspect of it, and building models around "is this a good price or a bad price?" is another.
You know, if you have a specific itinerary you're trying to follow, that's different than finding a globally available deal. So we try to focus on both of those problems and just advise people when a good time to travel would be. And then there are concepts like personalization: what are the person's interests, what kind of deals apply to them? Because when you deal with a mobile form factor, making sure that you're giving timely, topical, and sorted information is interesting, and I'd say expected from a modern consumer in how they use their apps today. So I'd say we collect basic travel information, low flight costs, segments and routes, and all that sort of fun stuff. And then the other side is event stream information: what are people doing on the device, and how can we provide a better service to them based on that input? You mentioned a little bit about the size of data that you're dealing with. Yep. But in terms of the overall volume, variety, velocity,
[00:03:44] Tobias Macey:
what are some of the characteristics that you need to accommodate to be able to fulfill the data-oriented objectives of the Going system?
[00:03:53] Ken Pickering:
So one way we've solved the problem is both through passive and active search. Passive search is where a lot of data comes in. Active search is when we search for flight prices; it's transactional, where we send out a query and we get a response back, and it's a pretty manageable amount of data. The passive data is where we aggregate with a very large provider called a global distribution system, GDS. It's how a lot of online travel agencies search for airfare, and what we do is get access to their global traffic. And so each of them can give you at least a gigabyte a second of travel data.
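To make that concrete, here is a minimal, purely illustrative sketch of what consuming a passive fare feed like that might look like. The broker address, topic name, and field paths are assumptions for the example, not Going's actual pipeline.

```python
# Minimal sketch of consuming a passive GDS fare feed from Kafka and
# flattening the nested JSON before landing it downstream (e.g. appending
# to an Iceberg table). Broker, topic, and field paths are assumptions.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "passive-fare-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["gds.passive.fares"])  # hypothetical topic name


def flatten(fare: dict) -> dict:
    """Pull a handful of fields out of a deeply nested fare document."""
    itinerary = fare.get("itinerary", {})
    return {
        "origin": itinerary.get("origin"),
        "destination": itinerary.get("destination"),
        "departure_date": itinerary.get("departure_date"),
        "price_usd": fare.get("pricing", {}).get("total", {}).get("usd"),
        "observed_at": fare.get("timestamp"),
    }


try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = flatten(json.loads(msg.value()))
        # Downstream: append to Iceberg / publish to a cleaned topic.
        print(record)
finally:
    consumer.close()
```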
And, you know, if you do the math out, it's about 50 petabytes a year of data you're sifting through for each provider. So that is sort of the relative volume of it, and it's kind of messy data in that it's really nested JSON, so there is a fair amount of preprocessing you have to do to be able to derive action from it. But our goal is to try to sift through that haystack and find the needle of good deal flight pricing for our customers. I think the reason we solve the problem with streaming in many capacities is because time matters in these types of situations, because flight prices fluctuate. The sooner we identify a deal and the sooner we notify our consumers of a deal, the more likely that deal is to still exist when they go to transact on it. Prices can change a few times a day, and you don't know how long that deal has been in place: you get a notification of it from a passive search that somebody else did, but you don't really know how long that price has been active, because there's not really a great system for publishing and broadcasting that. So that's why we went with a kind of real-time streaming nature to the architecture, because of the cost considerations. That's also why we tried to land it in our warehouse and in Iceberg in real time. Because if you're already processing the data, because it's a large amount of data in kind of a messy format, and you've already extracted what you're looking for out of it, it makes sense to transform
[00:05:50] Tobias Macey:
and land as well. And before we get too much more into the technical architecture, I'm also interested in the composition of your engineering and data teams that manage and support and consume these data flows where, typically, if you're dealing with analytical data, then a lot of that responsibility lives with a separate team from your software and application engineers. But given that the data is a core capability of the actual product, I'm curious how that influences the ways that you think about the composition of that engineering talent. Going engineering is small,
[00:06:27] Ken Pickering:
small and fierce. It's about 25 people right now, so we all work together. We're still a startup in that capacity, where there aren't really huge organizations and big departmental silos. But, yeah, it does require close coordination in how we operate as a team. We do have people that specialize in data engineering. We have two data engineers and a manager who oversees the two data engineers and our one data scientist, Nick, who does an amazing job for being our solo data scientist. But we really rely on a lot of vendor tooling. In this ecosystem, when you don't have very many engineers, we've kind of strung together a series of SaaS tools that let us streamline engineering operations. So we're mostly focused on how we efficiently transform data on the data engineering side and then how we efficiently model that data on the data science side, and worry less about the operational side of data engineering, which we push out to vendors. And then in terms of how people interact, a lot of it comes from product management. We will develop models, but at the end of the day we have to work closely with product and the product teams: okay, we have a statistical valuation of this price, but how do we present that to a consumer in a way that makes sense? How do we measure notification thresholds, and when should we push something to users versus observe and see if a price gets better? And then we talk to our customers about how we display a lot of the rich information we collect and generate so they can feel like they're making an informed and educated decision on a flight purchase. Being a subscription business, we sort of have to do that, because that's our credibility. We don't make money from the transaction. We make money on advising people of a potential transaction.
And so accuracy and empowering our customers is actually very important for our business, and I don't see a better way to do that than close partnership with data and data science. We collect tons and tons of information on their behalf and then try to represent that in a way that is timely and makes sense, but our credibility is also in those numbers, and people will unsubscribe if they don't see value in the company. So that's really one of the major things, and why we really require close participation
[00:08:30] Tobias Macey:
among the teams. Now digging more into the architecture, you've already mentioned Iceberg. We already have mentioned Trino at the start. You've stated that you're dealing with streaming data flows. I'm wondering if you can just talk to some of the ways that that overall system design fits together and the ways that you're accommodating streaming in that overall architecture.
[00:08:51] Ken Pickering:
So we're heavy users of Confluent to collect the streams. We use Starburst Galaxy to tie into Confluent and then transform to Iceberg. There are some intermediary programmatic steps that are also derived off of Kafka with separate services. And then on the modeling side, we use Databricks. Databricks ties to our Iceberg data warehouse to compute models, and then we have services that run in real time that can latch on to Confluent to parse events in real time or run a model. When you say model, I'm wondering if you can just unpack that a little bit, because it is a very overloaded term. There's a variety of different applications we use. A lot of it is just predictive analytics. Whether something is a good deal or not, we use z-clustering for that in terms of, okay, at what threshold of a deal does this sit? And then we are getting a little more into ML, because we have to: when you start incorporating things like advance notice and seasonality and velocity of the number of searches on a domain, the problem space gets progressively more complex. If a customer says, I wanna fly specifically on this weekend, you can analyze a whole year's worth of data and determine what the highs and lows of pricing are for that year. But to find the right price for the right duration and the right timeline incorporates progressively more complex techniques. So that's how we're doing it today. And then there is the BI side of how we model and cluster user behavior and cohorts and those sorts of things. Now we've established what your current architecture is, some of the different system components. I'm wondering
[00:10:22] Tobias Macey:
whether that is how the overall system design started or whether you had maybe a more batch oriented approach to begin with and maybe what the evolution looked like and some of the challenges of managing that migration process while keeping the business operational?
[00:10:39] Ken Pickering:
No. When I started, and it's been about a year, it was really centered around a Ruby monolith and a tool that would ingest data but not capture data, and it was kind of like an expert rule system as to how we found flight pricing. It was kind of rudimentary and very rule-specific and focused, and the rules were tweaked by humans. And so one of the things I saw when I came in is that you're not actually going to be able to scale with a system like that. The order of magnitude of where we would need to go and still be human-curated is just not possible, honestly. We were a Snowflake customer at the time, with a lot of batch-oriented processes in Snowflake. I chose to go to a more open stack just because I saw the volumes of data we would be ingesting as a business and thought the cost magnitude in Snowflake would eventually be prohibitive. One of the reasons I like the open lakehouse architecture is that it lets me commoditize compute and lets me use different vendors for different purposes, and really tweak and refine how we handle things. In house today, we use Spark for a couple different things. We use Athena; we were using Athena to load into Databricks initially because it was the easiest way to get data to Databricks. Now Databricks has acquired Tabular and they have a native Iceberg integration, so we're talking to them about potentially using that. But those options are open to us because we follow an open lakehouse architecture. So that was really an important thing for me to lay out, just because I wanted the optionality for the business to be able to expand in a bunch of different directions. And I think if you're building a data-derived company today, it's a better pattern. Being able to separate storage and compute with a format that is both fast and has a variety of, I'd say, cool functions in it is probably what the future should be. So that's where I netted out, and that was the first thing we laid out. But now that we have unified ways to process data and handle pipelines, it's a lot easier to incorporate new data sources and new models. One of the things I'm working on is data democratization for our business customers and how we get out of having analysts author every dashboard and data engineers do every aggregation under the sun. Trying to make data the center of the business is really my major focus, because if you think about what we provide, we're the recommendation engine for flight pricing, which is all data. It is all computation. It is all data. We have a UX, we have brand specifications, we have marketing efforts and those sorts of things, but those are also derived from data and our user behavior. In terms of the selection of systems and the structure of the overall architecture, I'm also interested in maybe the whys and wherefores
[00:13:16] Tobias Macey:
of how you ended up with Iceberg and Trino as the core of that system, particularly given the streaming nature and the fact that you are using that data for a consumer-facing use case. Trino is definitely very scalable and performant, but not typically in the way that you would think about from an application design perspective, where you want to have query responses on the order of milliseconds, or maybe low hundreds of milliseconds at the most. And, also, on the Iceberg side, typically when you think streaming lakehouse, I know that Hudi has positioned itself as the answer to that, because that was the core of what it was designed for. And I'm wondering what that evaluation and selection process looked like to get to where you are. Well, choosing a format was an important distinction for us because of the volume of the dataset.
[00:14:09] Ken Pickering:
It gets prohibitively challenging to switch formats. And so when we did the initial analysis of the popularity of the formats, what we see in vendor integrations (because, as I mentioned earlier, we are vendor-centric), where I see the market going, and where I see businesses putting the most effort into a format, Iceberg kind of stood out for us. I'm familiar with Hudi and its capabilities, and I'm familiar with Delta and its capabilities. But for me, it was like a VHS versus Betamax thing, where you don't really wanna be putting everything in Betamax. So for me, it was targeting where I thought the business was going, and I think it turned out to be a correct decision for us. I like Trino for analytic purposes and for aggregating huge datasets; that's what we mostly use it for today. For consumer data, I am all for caching and pushing data to different data sources. For instance, we heavily use OpenSearch for our site, for deal searching and those sorts of things. So for me, if I have to lift and shift for consumer applications and cache current stuff outside of our warehouse, we'll do that. It's really right tool for the right job. Trino is an analytics beast, and it's been great at processing these large datasets. I probably wouldn't use it for direct-to-consumer applications unless I thought it was more cost-performant than Elastic, and the volumes I'm pushing into Elastic are not prohibitive enough, nor is the index too expensive, yet. So that's where we are today. But I think one of the best parts of an open ecosystem is having multiple engines be able to tie into it and having multiple patterns with which to extract data. That's really what we've targeted and focused on. But I do have to abstract what is required to operate a real-time app, where people expect fast results, versus being able to seek across large volumes of data for calculation and eventing purposes. Because if you think about the way a lot of people engage with apps, and the way we try to encourage people to engage with our app, it comes down to push notifications.
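As a rough, hypothetical illustration of that pattern, aggregating deals in the warehouse with Trino and then caching them in a search index for fast consumer reads, a sketch might look like the following. The endpoint, catalog, schema, table, and index names are all invented for the example, and authentication is omitted.

```python
# Hypothetical sketch: aggregate recent fare observations in Trino
# (Starburst Galaxy exposes a Trino endpoint) and bulk-index the resulting
# "current deals" into OpenSearch for fast consumer-facing reads.
# Host names, catalog/schema/table names, and the index are invented,
# and authentication is omitted for brevity.
import trino
from opensearchpy import OpenSearch, helpers

conn = trino.dbapi.connect(
    host="trino.example.com",  # assumed Trino/Galaxy endpoint
    port=443,
    user="analytics",
    catalog="iceberg",
    schema="fares",
    http_scheme="https",
)
cur = conn.cursor()
cur.execute("""
    SELECT origin, destination, departure_date, min(price_usd) AS best_price
    FROM fare_observations  -- hypothetical Iceberg table
    WHERE observed_at > current_timestamp - INTERVAL '1' DAY
    GROUP BY origin, destination, departure_date
""")
rows = cur.fetchall()

os_client = OpenSearch(hosts=[{"host": "search.example.com", "port": 9200}])
actions = (
    {
        "_index": "current-deals",  # hypothetical serving index
        "_id": f"{origin}-{destination}-{departure_date}",
        "_source": {
            "origin": origin,
            "destination": destination,
            "departure_date": str(departure_date),
            "best_price": float(best_price),
        },
    }
    for origin, destination, departure_date, best_price in rows
)
helpers.bulk(os_client, actions)
```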
So, like, what drives our push is massive volumes of data and massive volumes of processing, and then we decide when to push to our customers, and that is basically based on our historical and data warehouse data. When we do a push, though, we want someone to come into the app and have a fast response. So the things we have notified them about will be fast and current for them, so they can just pop in, look at the deal, determine if they want it, and take an action. And so you can kind of work both sides of that and try to find the right technology mix to do that sort of thing. And I'm also
[00:16:38] Tobias Macey:
wondering if you have done any evaluation of some of the recent OLAP engines. I'm thinking in particular of ClickHouse as one of the popular options, although the fact that it has collocated storage and compute I can see as being a potential reason against it in your case. But, also, the Pinot project is another one that comes to mind as an option for OLAP and for being able to actually directly serve some of those end users, because of the fact that it is optimized for those fast query responses.
[00:17:11] Ken Pickering:
I've played with both Pinot and ClickHouse. I don't wanna say any more than that there's a kind of observability system we rely on that I'm 99% sure uses ClickHouse, based on the cost economics of it. For me, in engineering leadership, sometimes we just pick something and we stick with it until it starts to cause us problems later. So I have a lot of intellectual curiosity about Pinot and ClickHouse and those sorts of things, but, honestly, Postgres and Elastic are doing the consumer job for us today. And until it becomes cost-prohibitive or scale-prohibitive, we probably wouldn't take a step back and reevaluate, because it is about cost performance, and until something becomes a problem there, I wanna keep engineering focused on more of the intellectual property generation than on cutting over data stacks. If performance became a problem, we would 100% evaluate that; it's just not a problem today. And so I think it's just setting up your KPIs effectively to say, alright, the app has to respond in this amount of time. If it can't, well, that's why Elastic was brought in in the first place, right? A relational database was way too slow at assembling results for some of these queries. It was just a lot faster to get the results out of Elasticsearch. I know Elastic has limits. I've used Elastic extensively in my career, and at some point we will probably end up needing to replace that at scale. But for today, in the kind of awkward teenage years as a business, it's also cost-effective and appropriate. So. And I imagine too that part of the way that you are transforming and representing the data as it gets to that consumer interaction layer also makes it more conducive to a record-oriented approach, where Postgres and Elastic are able to operate across it effectively. You're not going to have a pattern where the user interacts with the application and says, I want to search for a flight from Boston to Berlin, for example, and then you have to go and scan across thousands or millions of rows. You've already precomputed
[00:18:59] Tobias Macey:
the optimal options for that pairing of locations.
[00:19:04] Ken Pickering:
Especially with an app form factor, you really have to craft the data you're returning to the app for the form factor, because you don't wanna send apps tons of data. They don't have great processors, and they could be on limited bandwidth, so extraneous data tends to slow down your customer experience. The hope would be that you are notifying them about something specific and sending them exactly that data, and if they wanna do exploration, that's fine. But the good thing is that in travel, people expect the experience to be slow. When people are on the well-worn path of, here's a notification, here's what we found for you, it's a fast experience there. But people have kind of learned that if you search for a flight, you're gonna sit there for 30 seconds. We've all used Kayak or Expedia and watched the flight results come in. You can make it look interesting, but at the end of the day, those systems are all slow. So I get a little bit of a pass on speed just based on people's normal experience with travel. But at the end of the day, the problem is you're also getting, in many cases, hundreds of results back from multiple providers that you have to mesh together and then return to the customer. And so that's actually also where we go back and forth between REST and gRPC.
Right? Like, when do you stream data versus when do you have a RESTful interaction? Because in some cases, they're definitely streaming cases. In terms of the orchestration
[00:20:26] Tobias Macey:
of the overall end-to-end data flow, obviously you have your Kafka stream for the data ingest, you have Trino and Iceberg for manipulation and aggregation, and you've got Postgres and Elastic for presentation. But how do you manage the triggers, the workflows, the dependencies, and observability across those different data streams to ensure that you're achieving the latencies that you're targeting and making sure that any errors are identified ahead of time? And this is probably a separate question, but given the fact that these are consumer-facing data flows, I'm also interested in what the uptime requirements are and some of the on-call requirements that you have around those data flows.
[00:21:12] Ken Pickering:
Yeah. Right now, we are a big Datadog shop, and I find companies our size typically use Datadog, just because it is an all-in-one shop for all forms of observability and lets you stand up things rapidly with minimal effort from engineering. I think as engineering shops get more advanced, you start leaning more toward the Prometheus route, or looking at other systems that can do tracing and analysis, just because it gets expensive for a lot of this stuff. But right now, that's what we're mostly targeting. In terms of uptime and on-call, incident response is a thing, and we try to categorize different forms of incidents. If the app or site is down, it's definitely a P0, all hands on deck, we gotta get it back up and running. Depending on which pipelines fail, we can kind of order those based on order of magnitude. We try to have runbooks and priorities tied to each of the alerts, so we know something's going on, we can alert on it, and then decide what the relevant action is. We have a blue team internally that responds to events. Engineers take a rotation where there's a primary and secondary.
You're secondary for the first week and primary for the second week, so there's kind of a handoff and a transition of shadowing, and that's worked for us. I'll say we target 99.9, because that's also what most of our vendors give us, and it's really tough for us to stay up past 99.9 if that's the only certainty we're getting from other places. But I would say we're also not a hospital, so there is some amount of tolerance for downtime, because it's not a critical operating system like I've had to work on in previous roles, where you're talking five nines, those types of systems. But we are a subscription service, and it's frustrating for our customers when we are down, and so we try very hard to hit that 99.9.
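For a sense of what that kind of pipeline monitoring can look like in practice, here is a small, hypothetical sketch of emitting a feed-freshness metric through the Datadog agent's DogStatsD interface, which a monitor could then alert on. The metric name, tags, and the idea of a "lag" gauge are invented for the example, not Going's actual setup.

```python
# Hypothetical freshness check reported through the local Datadog agent's
# DogStatsD interface; a Datadog monitor on this gauge could alert before
# stale data reaches the consumer-facing caches. Metric and tag names are
# invented for the example.
import time

from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)  # local Datadog agent


def report_feed_lag(last_landed_epoch: float) -> None:
    """Emit how far behind real time the latest landed fare data is."""
    lag_seconds = time.time() - last_landed_epoch
    statsd.gauge(
        "pipeline.fare_feed.lag_seconds",
        lag_seconds,
        tags=["feed:passive_gds", "env:prod"],
    )
```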
[00:23:01] Tobias Macey:
As far as the architecture, I imagine that because you are using Postgres and Elastic on the serving and presentation layer, you're caching the data there, and you have a certain amount of room for errors earlier in the stack, because you're not going to bring down the whole site if you fail to get the latest batch of data from one of the data providers. It's just going to be stale for a little bit longer than expected. And I'm wondering if you can talk to some of the ways that you are managing the overall data life cycle because, obviously, there is a certain timeline to the utility of flight data, where once the flight date has passed, your customers don't care anymore. It's useful from an analytical perspective to be able to do some projections and analysis of whether a particular price is good or not, but for the consumer, they don't care once that flight has flown. And I'm wondering if you could talk to some of the ways that you manage the life cycling of data in that presentation layer, how far into the future it can even project
[00:24:06] Ken Pickering:
versus the data size and performance requirements on the consumer layer. Yeah. And that's where the active and passive shopping comes into the mix, because passive data for me is opportunistic data, in that we don't pay for it. It's much cheaper for us to look at other people's searches than to execute our own searches, based on the way search querying and performance works. So at the end of the day, I save bottom-line money by processing huge amounts of data. But at the same time, if for some reason I don't get any searches from, say, JFK to Hawaii, and a lot of people wanna go to Hawaii, then I have to active search for that data, which means we'll pay for it and do a ton of searches to try to collect pricing information. So that's really where the differentiation is: how are you populating your flight cache? I should have called it a flight cache earlier. Every day we try to get results from passive data to know what existing flight prices are across a huge timeline; that's why the data volumes are pretty large in many cases, because we are getting a lot of that. But I have to layer on top of that, check the cache if I haven't gotten events, and then actively search for and populate that data. Because if someone is watching a flight, they're paying us to watch it for them, and so we have to find that data if we don't have it in our system. So that's how that side of it works. For me, if one of the large feeds goes down and I miss an hour or two's worth of data, I miss potential deals, and I miss cost savings based on the data that I could populate, but I can still find that data because I can still search for it. But doing this in a way where we can remain cost-competitive requires passive data, because otherwise I would pay millions and millions of dollars every month for the number of searches we need to do. And for the orchestration
[00:25:53] Tobias Macey:
piece of ensuring that the propagation of data is
[00:25:57] Ken Pickering:
proceeding, what are you relying on for that overarching control plane? That's the thing we're still working on. Right now, we run a job scheduler to search, because we have a number of users who have set up stuff in the system, but it's still a finite number that we can process. So we can scan that, check the cache, and if the data doesn't exist, repopulate it from active search. At some point, though, we'll probably need a control plane mechanism that will auto-populate the cache and apply a lot more intelligence. But for now, we're kind of hacking it with just what we know people are looking for. Again, though, that number gets prohibitively large, and we'll have to think of a new system to build as a result of that. As you continue to build and iterate on the system, you mentioned that it's been about a year in the making to get to where you are now from where you were previously.
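As a rough illustration of the scheduler loop Ken describes above, checking the passively populated flight cache for each watched route and falling back to a paid active search only when nothing fresh exists, a sketch under assumed interfaces (the cache object, the search client, and the freshness window are all hypothetical) might look like this:

```python
# A rough sketch of the scheduler loop described above: for each route a
# member is watching, check the flight cache populated from passive data,
# and fall back to a paid active search only when nothing fresh is there.
# The cache and active_search_client interfaces are hypothetical.
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(hours=6)  # assumed staleness threshold


def refresh_watched_routes(watched_routes, cache, active_search_client):
    for route in watched_routes:  # e.g. ("JFK", "HNL", "2025-03-01")
        cached = cache.get(route)
        is_fresh = (
            cached is not None
            and datetime.now(timezone.utc) - cached["observed_at"] < FRESHNESS_WINDOW
        )
        if is_fresh:
            continue  # passive data already covers this route
        # Paid, transactional query against the fare provider.
        result = active_search_client.search(*route)
        cache.put(route, {
            "price_usd": result.price_usd,
            "observed_at": datetime.now(timezone.utc),
        })
```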
[00:26:44] Tobias Macey:
What are some of the other technologies or other architectural components that you are currently evaluating, planning for, and what are the pieces that you feel are fairly well established
[00:26:58] Ken Pickering:
and are not at any risk of changing? Well, I mean, it's funny, because you can swap in components; there's different componentry for some of these things. Fundamentally, I do believe in event streaming. I believe in kind of a message bus style or event streaming architecture. There are different technologies in there: we use Confluent today, and we could use some other service there tomorrow, right? There are competitors to Confluent and Kafka now. So for me, it's about what makes sense in the natural evolution. I would say the one thing that is sort of immutable right now for us is Iceberg, in that we are landing in it and we have no desire to leave it. Hopefully the technology ecosystem mirrors that. It's Galaxy today, and I really enjoy using Galaxy as a Trino provider, but I could run Trino myself, I could run Athena, I could run Dremio on top of Iceberg. So for me, I'm not looking to switch components out because, well, one, it's sort of a new space and we just stood it up, and we made some pretty conscious decisions initially when we selected this stack. We evaluated SageMaker and Databricks on the ML side, and we went with Databricks, and the odds of Databricks disappointing us for our ML problems in the long term are low. For what we're doing, I think it's a good tool set. An area that we're leaning into in terms of future expansion, though, is LLMs: trying to figure out how we do content generation and copy, or explain our logic and reasoning for why we'd recommend a destination to a customer, without having to have a human being write that to them. Because if we know somebody is watching a lot of travel places, like a lot of tropical travel places, say Aruba and Hawaii and a couple of the Caribbean islands, what if we find a deal for the Azores?
Right? We have to tell them why. Well, hey, I know you set up these things, but here's why I recommend the Azores, and here is what you can do there. That's a perfect use case for an LLM. We've written travel publications internally on a lot of these places and could probably summarize that content pretty readily. So going over all of the unstructured data we've written over the past nine years and figuring out how we could reflect that to customers, and reflect that experience, is something we're considering. We're talking to some folks about a pilot there in the first half of next year. Because I think right now we're kind of done with the overall data flows that we need in our system, and because it's a new system, it's pretty good for now. So it's like, okay, now how can we leverage this more? Probably two years from now, I'll be like, oh, crap, we should probably reevaluate how we do all this stuff. But for now, I feel pretty confident.
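Purely as a thought experiment on what such a pilot could look like (nothing here reflects anything Going has built; the provider, model name, and prompt are all assumptions), a recommendation-explanation call to a hosted LLM might be sketched like this:

```python
# Thought-experiment only: generating a short "why we recommend this
# destination" blurb by summarizing internal travel-guide text with a
# hosted LLM. The provider, model name, and prompt are all assumptions
# and do not reflect anything Going has built.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def explain_recommendation(destination: str, watched: list[str], guide_text: str) -> str:
    prompt = (
        f"A member is watching flights to {', '.join(watched)}. "
        f"We found a deal to {destination}. Using the guide below, write two "
        f"sentences explaining why they might enjoy it.\n\n{guide_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```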
[00:29:33] Tobias Macey:
Yeah. And to your point, Iceberg is definitely on the upswing as far as adoption and integration across the stack. In particular, I think the Arrow ecosystem is really investing in support for Iceberg, and so that also expands the interoperability and the capabilities, particularly in that more ML-oriented ecosystem, because of the investment that's happening in the Python and Rust communities to be able to interoperate with Iceberg. Yeah. A hundred percent. And in your work at Going, building this stack, powering the application, what are some of the most interesting or innovative or unexpected ways that you've seen the data or the overall data stack used? So one of the things I love most about
[00:30:19] Ken Pickering:
consumer-facing businesses is it's all data and insights. You push a feature, you see how people respond to it, and you respond effectively and appropriately. I'd say learning things about our customers, learning things about behavior, and learning things about the changes in technology are always really intriguing to me. Minor tweaks in an onboarding flow can wildly unlock different scenarios, and you don't know that unless you're collecting the data and getting fine-grained about customer behavior in certain instances. So I love the insights that generates. I love the complexity of how we come up with an eventing schema, how we track it, and how we take millions of events and try to aggregate trends out of them to help drive our business and business strategy. We are a very data-centric and data-focused business these days. So it's seeing a feature that you thought was going to be minor really land and seeing the data come in for that, or seeing something that you thought was going to be awesome and just 2% of users adopt it, and you're like, oh, that's kinda sad. Those types of things are really intriguing for me. The other thing I love looking into is global flight trend data. Where are people looking to go? What is a desirable location? You could probably see the impact of Taylor Swift's tour across the US in terms of flight searches. It's things like that where you can find weird kinds of relationships. Because for us, we're looking into eventing, right? Maybe we should help people fly to see Taylor Swift as a feature. And so we can look at data for that and see if it's a global trend, see what it does to flight pricing, and see if it's a thing we could offer our customers. Things like that are pretty neat. Given the fact that the overarching
[00:31:56] Tobias Macey:
mission of the business is helping people get from here to there at a reduced price,
[00:32:03] Ken Pickering:
what are the opportunities that you see for incorporating other modes of travel and price analysis on that? I'd love to. I think in European travel, trains are a huge thing, and we're starting to look into further partnerships around that. How do we actually help people do next-phase trip planning? Flights are part of it, but I'd say Going's mission in the long term is to try to be a generalized travel adviser. We started out with flights, but, hopefully, we can help with accommodations, or recommendations of things to do, or offer alternative modes of transportation, let people trip plan, let people share their trips, and see what other people are doing around trip planning. I think those types of things are kind of interesting for us to lean into. But, you know, 25 engineers are still locked on flights. So, no, that's the growth path we see for us. With a largely US-focused customer base, though, flights are still the biggest, because the rail system in the US is, unless you're traveling on the East Coast, pretty spotty. We've leaned into getting people into Europe thus far, but not what you do once you're there yet. So. And in your experience
[00:33:09] Tobias Macey:
of coming to Going, helping them rebuild their data stack and reimagine the way that their overall system architecture operates, what are some of the most interesting or unexpected or challenging lessons that you learned in the process?
[00:33:23] Ken Pickering:
Data platform cutovers are always a hustle. No matter how well you plan or how well you do everything, there's always this awkward phase of doing the last mile, cutting over from one system to another, and then all the errors that arise from that in terms of dashboards and business-critical stuff that kind of collapses. As an exercise with the migration from Snowflake to Iceberg, we targeted our schemas too; we did a schema overhaul in parallel, because we thought some of our schemas were inefficient for where we saw the business going. So doing all that simultaneously was a big project, and it always turns into a mess at the end. I would love to figure out a better way to do that. I just haven't yet. And then there's the customer education, because we did move from Preset to QuickSight. We were using Superset under the covers for a while, but decided to keep everything in the AWS stack, because I think Amazon Q is interesting, and I think the NLP stuff they're doing with Bedrock is interesting. My eventual goal is to get nontechnical users to be able to actually describe and define dashboards, and to have a good enough schema that it could handle that kind of specification. That's a work in progress; it's still not there. You still have to kind of know what you're doing in the bot to a certain extent. But we already talked a little bit about some of the plans that you have for the forward-looking goals of the Going data architecture.
[00:34:43] Tobias Macey:
But as somebody who is operating in the data space, what are some of the overarching trends or technologies or, just areas of personal interest that you're focused on and keeping an eye towards as you continue to lead and grow and scale your data operations?
[00:35:01] Ken Pickering:
You know, having built stuff like this from scratch and then looking at where we are today, with just the sheer amount of options in the market for virtually every kind of data solution that is wildly more scalable than what we were using, say, six or eight years ago, has been really awesome to see. For me, building something this big and managing this volume of data without the tools we have today would be impossible for a team my size to actually manage, handle, and operate. That's been a huge differentiator in the market: the sheer amount of concurrent data you can stream in and it not be a big deal, the way you can land and process that data at multi-petabyte scale and that not be a big deal. Those things were a big deal at other phases in my career, and so that's been awesome to see from that perspective. Even having to build service frameworks to serve different algorithms up and do my own custom A/B testing: you can get a lot of that out of the box now, and those are pretty large engineering challenges, the kind where you'd employ a team full time to work on those sorts of things. So that's been super cool, that you can actually do a lot of stuff with not a huge amount of humans.
And you can do ambitious stuff with not a huge amount of humans required. It's been a pretty big unlock for engineering.
[00:36:18] Tobias Macey:
Are there any other aspects of the work that you're doing at Going, the data architecture and platform architecture that you have established, or the overall applications
[00:36:29] Ken Pickering:
of the Trino and Iceberg combination that we didn't discuss yet that you'd like to cover before we close out the show? One of the things I like about Trino and Galaxy in general is that they are trying to build a cohesive Iceberg management solution. With Tabular getting snapped up by Databricks, and Databricks still sort of promoting Delta under the covers, for me, being able to work with an open lakehouse architecture and actually work with Trino as a technology matters. Because Trino is a great technology, but it used to be kind of a pain to manage all those clusters. I'd say Galaxy has provided a lot of transparency on that, and they provide some data management capabilities on top of it. I think data management is an area that's kind of lacking in a lot of tool sets that I've worked with in the past. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Oh, man. Data quality is still an issue. Quality checking is still an unsolved problem. Unless a feed completely breaks, it's tough to detect bugs in data feeds accurately, and I think the application of AI in that space is potentially very interesting.
And I'd say data education and lineage are still really hard, especially as companies become more business derived and driven. It's still tough to articulate: you should rely on this dataset because it is concrete and transactional, it integrates with our payment processor, and that is physical money in our bank account, versus this is a tool that is hooked up to our website and is about 20% inaccurate, so you should use it for directional data but not hard business decisions. Education on that, marking that, it's just tough to propagate that context to customers, is what I would say. So for me, that's still an area that we have to invest in and solve as a business. That, and data leakage. People are still leaking PII, and there's the governance side. So it's the nontechnical portion, the policy and quality portions of data management, that I still think are challenges.
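To give a flavor of the gap Ken is describing, here is a deliberately simple sketch of a feed-level sanity check: comparing today's feed metric (a row count, a null rate, and so on) against a trailing baseline and flagging big deviations. The numbers and thresholds are invented, and real feed bugs are usually far subtler than a check like this catches, which is exactly his point.

```python
# A very simple sketch of a feed-level quality check: flag today's metric
# (e.g. row count or null rate) if it deviates too far from the trailing
# baseline. Thresholds and the example numbers are arbitrary illustrations.
from statistics import mean, stdev


def feed_looks_suspicious(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Return True if today's value is more than z_threshold standard
    deviations away from the trailing history."""
    if len(history) < 7:  # not enough baseline to judge
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Example: daily row counts for a passive fare feed (made-up numbers).
row_counts = [41.2e6, 39.8e6, 40.5e6, 42.0e6, 41.1e6, 40.2e6, 39.9e6]
print(feed_looks_suspicious(row_counts, today=12.3e6))  # True: likely a broken feed
```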
[00:38:40] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share your work of building this data architecture and the ways that you are using Iceberg for streaming ingest. It's definitely a very interesting problem space and an interesting solution that you've developed. So thank you again for taking the time, and I hope you enjoy the rest of your day. Thank you. Thanks for having me. This is awesome. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Streaming Data with Ken Pickering
Data's Role in Travel Deals at Going
Handling Large Volumes of Flight Data
Engineering Team Composition and Vendor Tooling
Evolution from Batch to Streaming Architecture
Choosing Iceberg and Trino for Data Architecture
Managing Data Lifecycle and Presentation
Active vs Passive Data Collection
Future Technologies and System Components
Expanding Travel Modes and Future Plans