Summary
In this episode of the Data Engineering Podcast Andrew Luo, CEO of OneSchema, talks about handling CSV data in business operations. Andrew shares his background in data engineering and CRM migration, which led to the creation of OneSchema, a platform designed to automate CSV imports and improve data validation processes. He discusses the challenges of working with CSVs, including inconsistent type representation, lack of schema information, and technical complexities, and explains how OneSchema addresses these issues using multiple CSV parsers and AI for data type inference and validation. Andrew highlights the business case for OneSchema, emphasizing efficiency gains for companies dealing with large volumes of CSV data, and shares plans to expand support for other data formats and integrate AI-driven transformation packs for specific industries.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business
Interview
- Introduction
- How did you get involved in the area of data management?
- Despite the years of evolution and improvement in data storage and interchange formats, CSVs are just as prevalent as ever. What are your opinions/theories on why they are so ubiquitous?
- What are some of the major sources of CSV data for teams that rely on them for business and analytical processes?
- The most obvious challenge with CSVs is their lack of type information, but they are notorious for having numerous other problems. What are some of the other major challenges involved with using CSVs for data interchange/ingestion?
- Can you describe what you are building at OneSchema and the story behind it?
- What are the core problems that you are solving, and for whom?
- Can you describe how you have architected your platform to be able to manage the variety, volume, and multi-tenancy of data that you process?
- How have the design and goals of the product changed since you first started working on it?
- What are some of the major performance issues that you have encountered while dealing with CSV data at scale?
- What are some of the most surprising things that you have learned about CSVs in the process of building OneSchema?
- What are the most interesting, innovative, or unexpected ways that you have seen OneSchema used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on OneSchema?
- When is OneSchema the wrong choice?
- What do you have planned for the future of OneSchema?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- OneSchema
- EDI == Electronic Data Interchange
- UTF-8 BOM (Byte Order Mark) Characters
- SOAP
- CSV RFC
- Iceberg
- SSIS == SQL Server Integration Services
- MS Access
- DataFusion
- JSON Schema
- SFTP == Secure File Transfer Protocol
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business. So, Andrew, can you start by introducing yourself? Hey, Tobias. Thank you for having me. I'm Andrew, CEO of OneSchema.
[00:01:02] Andrew Luo:
We are a platform for automating CSV imports, and I'm super excited to be on the podcast today. And do you remember how you first got started working in data? Yeah. So I used to lead data engineering, CRM migration, and onboarding at a CRM company called Affinity. We were a CRM platform that worked with investment professionals like VCs and investment bankers. Every customer that onboarded onto our platform came with a messy spreadsheet dump from Salesforce, HubSpot, etcetera, and it needed to be validated, cleansed, and loaded onto our platform. So I was the lead of the engineering team that built our self-serve spreadsheet onboarding tool that was used for self-serve customer onboarding, in addition to, of course, the tooling that would enable our data ops and implementation team of 5 or 6 people that also had to deal with the relational joins and all of that for migrating folks onto our platform. As you might imagine, any engineer looking at a team of, you know, people manually cleaning spreadsheets 9 to 5 every single day has to imagine that there's a better way. And so that was kind of the spark of inspiration for OneSchema.
[00:02:10] Tobias Macey:
And before we get too much into OneSchema, I wanna dig a bit more into CSVs in particular, because as you mentioned, they are still very ubiquitous and they have been around for decades at this point. They are the lowest common denominator for data interchange, or there may be a few that are in that same ballpark. But despite the fact that we have had years of evolution and innovation in data interchange formats and data storage formats, CSVs are still everywhere. And I'm wondering what are some of the reasons that you see for their longevity and their ubiquity despite all efforts to the contrary?
[00:02:48] Andrew Luo:
Yeah. I think of, you know, human language as a good analogy for this, right? Spreadsheets are the lingua franca of data exchange, right? They are the one thing that everyone can agree on. Spreadsheets were one of the original user interfaces for data and for computing as we know it. And so the vast, vast majority of business users in an enterprise organization can read and understand a spreadsheet, versus every company has a different API. Right? Imagine a world where, you know, every single company invented its own language with its own API and you needed to build a translation layer in order to understand it. Right?
Python, SQL, these sorts of, you know, technical coding languages are becoming more ubiquitous, but still nowhere near as common as the number of companies and the number of APIs and languages that are being created in the world today. Right? You know, every single year, new data tooling appears and paradigm shifts are so rapid that these languages are hard to keep up with, which is really expensive, right, if you think about building new translation layers and APIs and data pipelines for every single thing. So that's, I think, the number one reason why spreadsheets are so common and so powerful. Of course, they do come with their problems because they're used for everything. Right? There are lots of things that are really nice and structured about APIs, but I think that's the number one reason why spreadsheets are everywhere.
I would say the second most important reason for this is security and access management. I think that because APIs are constantly changing, people tend to not build in the security layers properly with every single new API that exists. And therefore, oftentimes, you see a lot of, like, you know, universally powerful API keys that can access everything in an API without the proper access management. Hence why we see sensitive data use CSVs a lot more often as well. So these would be, you know, government data, health care data, legal data, and financial banking data. These sorts of data streams are really predominantly still CSVs because they're sensitive.
[00:04:59] Tobias Macey:
Which is ironic since so many other data formats have the ability to have specific fields be encrypted or hashed or masked in some way that can then be reversed programmatically without having to transfer that information in plain text over the wire and have the potential for it to leak or get misplaced or have 15 different copies of the same file that accrue over time.
[00:05:22] Andrew Luo:
I know. I know. It is super ironic that the CSV, which, you know, you might think is, like, not secure, right, because it's an Excel file on, you know, a client-facing analyst's laptop or something like that. But in reality, for the person sending the data, they feel like that's really secure because they know exactly what is being sent. They know it's in a spreadsheet file that will not be changed, and they can send it through as a report. And that's the other aspect of it. So many systems of record, whether they are ERPs, EMRs, or CRMs, have native report-building capabilities to generate CSVs, and oftentimes that report tooling is more flexible than what the system's API is even capable of today.
[00:06:11] Tobias Macey:
You mentioned that as far as the sources of CSV data, you have these sensitive data streams. You mentioned earlier that you dealt a lot with that for exports from things like Salesforce or other CRMs. And I'm curious, given the fact that there are so many different options for data interchange, including things like JSON, which at least has some form of schema and type information versus CSVs where it's just text, and the only way that you have to understand the structure is by making sure that the delimiters line up, which often they don't. And I'm wondering what are some of the main sources of CSV data that you're seeing teams have to deal with for business and analytical processes?
[00:06:55] Andrew Luo:
Yeah. So we deal a lot with what I would call cross company data interchange. And the reason cross company data interchange relies on spreadsheets more often is because of the key insight that when you are dealing with cross company, cross department, you know, sorts of boundaries, you often have business users that are actually responsible for the relationship. Right? You have your customer success, your onboarding team, your client services team, your professional services team, your consulting team that are actually in charge of both understanding the schema alignment and the mapping process for schema a to schema b. Right? And because you have this team that is naturally in charge of basically this mapping process, imagine, like, a Workday migration from, you know, something like, you know, a different ERP system.
These can be lengthy 18 month processes, for example. This sort of workflow really requires the nontechnical business user to understand that "first name" in this system means, you know, field A in my other system. That sort of process really requires a lot of, you know, sitting down with the customer, understanding their workflows and those sorts of aspects. Because of that, we often find that the data format itself simply needs to be understandable by the nontechnical business users. Right? And so you might imagine that a spreadsheet is the natural tool that you would go to use for that data discovery, you know, data prep, pre-validation, and all of those sorts of mapping aspects.
Hence why Excel is still more common than JSON, where you might imagine that the data is actually less visible and harder to scan in bulk all at once. And so that's kind of the key insight around, you know, needing nontechnical users to be empowered as business users. Whereas for, of course, a point-to-point API connection or something like that, you're totally right, JSON would be much better. The other interesting
[00:08:56] Tobias Macey:
point around CSVs and their utility is you mentioned their use in these spreadsheet software applications, Excel being the most ubiquitous and widespread. But even with an Excel file, you actually have the ability to store type information in that file format, whereas with CSVs, you lose all of that and you have to infer the type just based on what it looks like. Oh, well, this seems like it's a number, but that one's quoted, so maybe it's not. And I'm just wondering, given the fact that CSVs can be so inconsistent in terms of the type representation and the inference around those types, what are some of the challenges that you see for people who are implementing the export path of CSVs, generating the CSVs in the first place, to make sure that that type information is maintained as closely as possible? And then on the ingest side, the ways that people have to wrestle with figuring out, okay, well, who gave me the CSV? What is it even supposed to mean, and does that even actually map to the data that's in there? Yeah. This is something we see and deal with all the time because, of course, these are the limitations of CSVs. Right? And there are a few. Right? Obviously,
[00:10:05] Andrew Luo:
the lack of nesting inside of CSVs means that lots of times the export format will come up with different types of nesting structures inside of the spreadsheet itself. Right? We've seen everything from, you know, obviously, delimiters within cells, right, semicolons, pipes, all of that, all the way to, of course, you know, multiple rows that need to be collapsed together. Lots of the ways of dealing with that are nonstandard. Relational data inside of CSVs is also difficult. Most of the time when you're doing a migration or an integration path, you have multiple entities that have relational data. Oftentimes, this is why people use Excel files with multiple tabs, in order to kind of represent multiple tables in the destination system. And then, of course, dynamic schemas. Right? Unlike JSON, where the keys and the values and the mappings are implicit in the actual structure itself. And this is also why we have EDI standards, like, you know, X12 and HL7 in health care. Right? The fact that schemas can change on a customer-to-customer basis, and none of that schema information is represented directly in the CSV. It's not encoded into the file format itself, which is of course what makes it difficult.
Then finally, the lack of collaboration and automation workflows, right? Like, you know, most CSV-driven workflows are very much Excel driven, with, like, Excel macros and VBA and these sorts of workflows that are not shareable and hard for a team to collaborate on. And so, of course, when you have very important data pipelines that are running on them, those are the other challenges with using CSV as a file format. OneSchema really, as a company, was designed to tackle these problems head on by being a platform that takes the ubiquity of CSVs but then applies a lot of those nice structures.
We infer and apply data type validation on top of CSVs. We have AI to actually infer data types and validations as well, in addition to suggesting transforms and those sorts of things. Basically, we're trying to give data operations teams the best of both worlds, the ubiquity of CSVs with the reliability of structured validation, which is one of the biggest challenges that OneSchema aims to solve.
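To make the type inference idea concrete, here is a minimal Python sketch of rule-based column type inference, the kind of guess that has to happen before any validation can be applied to an untyped CSV. The patterns and type names here are illustrative assumptions, not OneSchema's actual inference logic (which, as described above, also uses AI).

```python
import re
from collections import Counter

# Order matters: try the most specific pattern first. These heuristics are
# deliberately simplistic; real inference needs many more formats and locales.
PATTERNS = [
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("decimal", re.compile(r"^-?\d+\.\d+$")),
    ("integer", re.compile(r"^-?\d+$")),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
]

def infer_type(values: list[str]) -> str:
    """Guess a column type from sample values by majority vote over patterns."""
    votes = Counter()
    for value in values:
        value = value.strip().strip('"')
        if not value:
            continue  # blank cells should not influence the guess
        for name, pattern in PATTERNS:
            if pattern.match(value):
                votes[name] += 1
                break
        else:
            votes["text"] += 1
    return votes.most_common(1)[0][0] if votes else "text"

print(infer_type(["2024-01-03", "1/3/2024", ""]))  # date
print(infer_type(['"42"', "17", "not a number"]))  # integer
```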
[00:12:13] Tobias Macey:
Another interesting aspect of CSVs is that, to my understanding, there isn't really a concrete specification that defines their structure. They're more just a happenstance of somebody made it, and everybody else started doing it, and it was just let loose into the wild. I'm wondering what you have seen as far as any actual standards around the ways that CSVs should be structured and interpreted, and some of the ways that those standards or specifications
[00:12:44] Andrew Luo:
are just completely ignored and violated left and right. There are so many debates we have on what standards of CSVs should or should not be interpreted in a certain way. Funnily enough, inside of OneSchema, there are, I think, six CSV parsers that we waterfall through, actually in different languages, also for performance reasons, whether it's a Go-based CSV parser or a Rust-based CSV parser or a JavaScript-based one. And we use a variety of these to try to catch all of the different edge cases. The most popular open source CSV parsers in different programming languages actually conform to different standards. I don't know if you were aware of that. But it very much is such a problem where, you know, there's not a real general consensus on, like, you know, what happens when you have a newline character inside of a quoted cell value inside of a CSV? Or how do you handle jagged CSVs, for example, or white space around a cell? Oftentimes, for our product team, when we are thinking about these sorts of challenges, we usually think about what is going to be the safest approach, since we obviously never want to transform data unnecessarily.
But we try to offer these up as intelligent suggestions, whether it's a jagged CSV or whether it's, you know, white space around the cells. In the aim of trying to create, like, a universal standard, we've seen all sorts of things, whether it's file formats, UTF-8 BOM characters, or invisible characters from Excel versions from the early 2000s. We see so many of these CSV workflows across, you know, the hundreds of companies that we work with at this point, and we've really had to piece it together as far as, like, hitting all of the different variants and then applying the right intelligent fix. So we actually have the ability to automatically pad out a CSV if it's jagged, and we have the ability to infer the delimiters. But in every single one of these cases, we always ask the user and, you know, ask for confirmation that this is what they want to do. It's tricky. It's, like, not as good as having a universal standard, right, where everyone agrees. But we found that with CSVs being this, like, you know, lingua franca, we actually do have to make some inferences and then ask the user in those cases.
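For readers who want a feel for the edge cases just listed, here is a small, hedged Python sketch that handles a few of them with the standard library alone: a UTF-8 BOM, an unknown delimiter, newlines inside quoted cells, and jagged rows. It is only an illustration of the categories of problems described, not one of the parsers OneSchema waterfalls through.

```python
import csv

def load_messy_csv(path: str, expected_cols: int | None = None) -> list[list[str]]:
    """Parse a CSV defensively: drop a UTF-8 BOM, sniff the delimiter,
    honor newlines inside quoted cells, and pad jagged rows."""
    # 'utf-8-sig' transparently strips a leading BOM if one is present.
    with open(path, newline="", encoding="utf-8-sig") as f:
        sample = f.read(64 * 1024)
        f.seek(0)
        try:
            # Sniffer guesses the delimiter (comma, semicolon, tab, pipe).
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel  # fall back to the standard comma dialect
        rows = list(csv.reader(f, dialect))  # reader handles quoted newlines

    # Pad jagged rows to a consistent width instead of silently dropping cells.
    width = expected_cols or max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]
```

A production importer would, as Andrew describes, surface these repairs (padding, inferred delimiters) as suggestions for the user to confirm rather than applying them silently.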
[00:15:01] Tobias Macey:
Even if you do have an agreed upon standard, you're always going to find several people who disagree on the implementation or just completely ignore it. I think one of my favorite terrible examples of a standard being ignored is when trying to deal with a SOAP API interface that returned the error response with a 200 HTTP response code, but embedded the error in the actual returned XML that I then had to modify the parser that I was dealing with to actually handle that properly.
[00:15:29] Andrew Luo:
That's right. That's right. And, also, you know, if you really just think about CSVs, one of the easiest checks that we have internally is if Excel can handle it and you can open it in Excel, then it's probably something that OneSchema should support. Once again, just going back to the ubiquity of spreadsheets, right, most people aren't reading the CSV, you know, standard that I think was published in the late nineties. Right? I think most people are probably using Excel's parser as the standard. And so that's what we also try to conform to. And so digging more into
[00:16:02] Tobias Macey:
OneSchema, you mentioned a little bit of the genesis and the problems that you were seeing and trying to address. I'm wondering if you can dig a bit more into how you determined that this was actually a viable business case and something that enough people wanted solved for them, without having to do it in-house, that it was worth investing your time and energy and capital into.
[00:16:26] Andrew Luo:
Yeah. I mean, as far as what this looked like, right, having led the team at a CRM company myself doing this, we spent 4 years building 4 different versions of this product, effectively something similar to Salesforce Data Loader internally. And just the number of edge cases in CSVs, to your point, felt infinite. I still remember, you know, talking to our head of client services as well as our staff engineer and chief architect on our team. And very much there was this clash, right, where the client services team didn't want to continue hiring a massive team of people to manually clean every spreadsheet. But then on the other side, our technical teams were very much saying exactly what you were saying, Tobias. Right? Like, CSVs are an unsolvable problem where the number of edge cases with them is infinite. And so being able to bridge that gap, we never quite got there ourselves.
But I really just had this vision where, if an entire company was dedicated to bridging that gap and solving that problem, we could get the best of both worlds and, of course, reduce a lot of the manual work. All of the people in the world cleaning spreadsheets every single day should really have a better tool to do this. And so that was kind of the genesis of the idea and also the pain point that I saw. Right? Because, you know, we had to scale that team linearly with the number of customers that we had. The more customers that you have onboarding, the more people you need cleaning spreadsheets. And with OneSchema today, our customers can get, you know, 2x, 3x efficiency on the implementation, onboarding, and client services teams that are cleaning spreadsheets manually.
And that's just wild. Right? It scales with the size of the company. If you have 50 people cleaning spreadsheets and you're using Excel and Excel macros and those sorts of things, these automations are just so, so valuable. And so that was kind of the business opportunity that we saw with spreadsheets and CSV automation.
[00:18:25] Tobias Macey:
So for people who are dealing with these challenges of CSV as a data interchange and the ways that you're approaching the problem at OneSchema, I'm curious who is typically the person that identifies that OneSchema is actually the solution for the problem that they're dealing with and brings it into the company, and just some of the ways that you have to work through some of the customer education and helping people at various levels of business or engineering understand that you could solve this yourself, but this is how long it's going to take and how much it's going to cost, or, you know, just doing that cost benefit analysis.
[00:19:04] Andrew Luo:
Yeah. So, typically, the companies that we work with are going to be companies or larger enterprises that already have some form of an implementation team or a data engineering team or a data ops team that is already doing this, cleaning the spreadsheets in order to get client data onto the platform. And also, I know we haven't talked about the tailwinds of AI yet, but we really do see this in the market today where, as companies are increasingly trying to develop their AI copilots or models inside of their products, getting client data into your actual data warehouse or your data stores is super important, right, for actually being able to do that. So increasingly, we are seeing that people are bringing in data teams to ingest or scrape or get client data into their platforms. And so the ROI case is really simple, actually.
Typically, it'll be a, you know, head of data engineering or, you know, a VP of data governance or head of IT, chief information officer, CTO, or even sometimes a head of product, right, that is going to bring OneSchema in because they see this. Right? They see on their client services or customer success teams that they have a lot of people manually doing the work of cleaning spreadsheets, writing one-off Python scripts or SQL scripts. Right? You have a repository of 300 of them that you've copied and pasted, and they're, like, completely brittle and breaking. In the HR world, I know it's January 3rd.
The folks in the HR benefits world are probably just coming off of a very hectic open enrollment season. And, you know, all of those files that have been ingested from HR systems take a lot of that manual work and scripting as well. And so all of those sorts of, you know, workflows are the ones where it's usually going to be a head of ops,
[00:20:59] Tobias Macey:
head of customer success, or head of engineering or data engineering slash product that will bring us in. And in terms of the specific core problems that you're addressing, we talked a lot about the problems of doing this manual data cleaning and doing ingest. But from the product perspective, what are the actual interfaces that you're providing and the solutions that you're offering to be able to make CSVs as a means of data exchange less of a headache for the business?
[00:21:29] Andrew Luo:
Yeah. Great question. So OneSchema comes with an out-of-the-box system to validate, transform, and map CSV data into a known schema inside of the company. For example, we work with enterprise tech companies like Toast and Ramp, large consulting firms like Huron Consulting Group, and e-commerce marketplaces like Mercari. Let's just take, you know, Toast as an example. If you are ingesting, you know, POS, point-of-sale, data, right, from a restaurant or a retailer that Toast is working with, those sorts of spreadsheets provided by the client when migrating from a different system onto Toast would be ingested by the onboarding team, right, into the platform. Historically, they would receive it as an Excel file, clean it inside of Google Sheets or Excel, and then upload it into the system. But with OneSchema, we can either embed a completely self-serve importer directly into the Toast platform, if it's a tech company, so that end users and customers can just do it themselves with our automated validations, AI-powered transformations, and all of those sorts of aspects that preclude the need for the onboarding team to be involved at all. Or, for, you know, a large consulting firm like Huron, we can give the field consultants that work with the customers directly the tools so that, instead of using Excel and VBA and one-off scripts, the data engineering team can pre-program workspaces, and the field consultants who are working with the customers, doing the mapping and understanding the voice of the customer as far as what field means what, can automate that mapping and validation process rather than needing to put it into a BI tool, wait 24 hours for the data team to run a SQL script to cleanse it, send it back, map it. This sort of back-and-forth, you know, data prep process can take weeks or months even to actually handle
[00:23:33] Tobias Macey:
and really slows down the migration process, time to value, or integration process. And OneSchema basically cuts down on all of that, making it possible to automatically generate human-readable error reports that the client can actually use to clean their data faster. You mentioned a little bit about the different parsers that you are using to be able to manage the various edge cases of CSV data. But more broadly, can you give an overview of how you've architected your system to be able to handle the variety and scale and multitenancy
[00:24:05] Andrew Luo:
of the data that you're working with? That's a great question. We've tried all sorts of different architectures in the early days for what works best for a CSV in particular. We've tried relational databases, NoSQL databases, tried loading it into, you know, S3 or Iceberg or something like that. In reality, though, Excel is the number one interface that our customers are using today for doing this, and, of course, that's the value, right? Being able to see all the data and quickly manipulate it. Excel has limitations on million-row-plus files as well, so that's a huge pain point for our customers today. If you have a file above a million rows, you've got to go use SSIS or Microsoft Access or something like that. So we actually built an in-memory CSV database in Rust, specifically for processing spreadsheets and parallelizing them. It's our own system in house. It parallelizes across rows and columns, and it is specifically designed for validations and transformations on top of spreadsheets.
In-memory is really important. This is the reason why we can support up to 50-million-row files, right, loaded into OneSchema, and it's just as snappy as Excel, as if you had a laptop that had 500 gigs of RAM running on it. So that's our core technology and the architecture that we adopted. Of course, it persists out to a cloud storage bucket as well, with a caching layer. But that is the architecture that we use, and it is what makes us unique in the world of data. Unlike a lot of other data tools, where you have to run a profiler or, you know, read a very extensive Java error log in order to see what the validations are actually flagging, OneSchema lights up the validation errors directly inside of the spreadsheet. And so all of those automations happen natively on top of the platform and also link into SFTP connectors, API connectors, S3 buckets, file storage systems, and all of that, and, of course, are also collaborative on the cloud. So all of those aspects are what make us unique.
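As a rough illustration of why an in-memory, column-oriented layout parallelizes so naturally, here is a short Python sketch that validates each column of a table in a separate worker process. The column names and rules are hypothetical, and this is a toy stand-in for the Rust engine described above, not a description of it.

```python
from concurrent.futures import ProcessPoolExecutor
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_email(value: str) -> bool:
    return EMAIL_RE.match(value) is not None

def is_amount(value: str) -> bool:
    return value.replace(".", "", 1).lstrip("-").isdigit()

# Hypothetical per-column rules; a real engine would derive these from a schema template.
RULES = {"email": is_email, "amount": is_amount}

def validate_column(args):
    """Return (column_name, [row indices whose value fails the column's rule])."""
    name, values = args
    rule = RULES[name]
    return name, [i for i, v in enumerate(values) if not rule(v)]

def validate_table(columns: dict[str, list[str]]) -> dict[str, list[int]]:
    """Validate each column in its own worker process; columns are independent,
    so the work splits cleanly, loosely mirroring a columnar in-memory engine."""
    work = [(name, values) for name, values in columns.items() if name in RULES]
    with ProcessPoolExecutor() as pool:
        return dict(pool.map(validate_column, work))

if __name__ == "__main__":
    table = {"email": ["a@example.com", "not-an-email"], "amount": ["10.50", "ten"]}
    print(validate_table(table))  # {'email': [1], 'amount': [1]}
```

Cross-column checks, like the multi-column uniqueness validations Andrew mentions later, are the harder case because they cannot be split column by column.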
[00:26:09] Tobias Macey:
You started this company with a focus on CSVs and addressing that pain point. You mentioned that you custom-wrote an engine specifically for dealing with parallelizing CSV data and all of the challenges that go with that. I'm wondering if you have put any thought into other types of data or other data formats that are maybe adjacent to CSVs that you're considering supporting as you build out your capacity and start to grow into more use cases and markets?
[00:26:45] Andrew Luo:
That's a great question. So we already have expanded, of course. From day one, we had CSV, Excel, and, you know, all of the XLSX and XLS types of support for all of the different spreadsheet-esque formats. As of this year, we've also expanded into XML and JSON support, so nested structures. We've been able to handle a lot of the nested file types, so being able to apply transforms to unnest and parse data into tabular formats from XML and JSON is stuff that we support already as well. Now looking forward into 2025, this being the start of the year, we are also looking at EDI formats. So this would be, you know, encodings like HL7, X12, EDIFACT, all of these sorts of EDI standards that oftentimes can be parsed into CSV.
We also already have fixed width, which is kind of getting close to, you know, other types of EDI standards. Fixed-width files, for everyone on the podcast, are like, you know, TXT files where, for every single column, the client will provide
[00:27:52] Tobias Macey:
the, you know, mappings, character 0 to character 8 is column A, character 9 to character 15 is column B. Right? Those sorts of formats can also be handled by OneSchema. Because of the fact that you wrote your engine in Rust, I'm curious too if you have looked at all into being able to integrate with things like the DataFusion engine from the Arrow ecosystem to be able to get a lot of that interop for free with some of the other query engines and data frame libraries, etcetera?
[00:28:21] Andrew Luo:
That is also coming soon. Interoperability is huge for a data pipelining tool like OneSchema. We have thought a lot about the different engines that we could either partner with or, of course, integrate directly with when it comes to destinations and sources. So far, we have really been focused on our own intelligence layer of actually, you know, getting the transforms set up, mainly because many of the clients that we work with already have an extractor layer and a loading layer built out inside of their systems. And so OneSchema is really the transform and validation layer for CSVs. We've really thought about ourselves as that, and linking into other loading and extracting systems is very much a 2025 priority for interoperability.
We've also, of course, been looking into, for example, interoperability around the validation rules themselves. Inside of OneSchema, you have a templating system for setting up your schemas with all of the columns. We have 50-plus data types, right, for the validation rules, for dates, phone numbers, addresses, and all of that. We are looking into interoperability with JSON Schema and other, you know, JSON Schema systems that clients and customers might already be using so that we can natively link into the API connections and, you know, the S3 or other AWS and Azure cloud data buckets that are already being used. Today, we primarily see a lot of SFTP, though. It turns out, when it comes to working with CSVs, you can't escape SFTP as a standard. And so, a lot of that, we've also been working on, for example, file name tokenizers, since a lot of metadata is stored in file names, and being able to route files around to integrate with MFT providers that customers might already be using.
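To picture what JSON Schema interoperability for column-level validation rules might look like, here is a hedged sketch using the third-party jsonschema Python package. The template below is purely illustrative; it is not OneSchema's template format or its actual data types.

```python
# Requires the third-party "jsonschema" package (pip install jsonschema).
from jsonschema import Draft7Validator

# Hypothetical importer template, expressed as a JSON Schema for one CSV row.
row_schema = {
    "type": "object",
    "properties": {
        "email": {"type": "string", "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
        "amount": {"type": "string", "pattern": r"^-?\d+(\.\d{1,2})?$"},
        "hire_date": {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
    },
    "required": ["email", "amount"],
}

validator = Draft7Validator(row_schema)

def validate_row(row: dict) -> list[str]:
    """Return human-readable error messages for one parsed CSV row."""
    return [
        f"{'/'.join(str(p) for p in error.path) or '<row>'}: {error.message}"
        for error in validator.iter_errors(row)
    ]

print(validate_row({"email": "not-an-email", "amount": "ten"}))
```

Expressing rules this way is what makes them portable to other systems that already speak JSON Schema.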
[00:30:02] Tobias Macey:
You mentioned that you had to write your engine in Rust because of the inherent limitation as far as the million-row count of Excel that has been there for probably as long as there has been Excel, although before that, in the early days, it was probably a much lower limit. And I'm wondering what are some of the other types of performance issues that you have dealt with in having to process all of these types of CSV information, and some of the orchestration layers that you've had to build out to be able to manage retries, failures, and being able to just address all of the different edge cases around this problem space.
[00:30:48] Andrew Luo:
Yeah. There is so much that I can go into around the actual architecture and the performance problems. Performance is key for us, right? Especially because large files are one of the main reasons that people need large client-facing data engineering teams in order to do this workflow, in addition to, of course, data pipelining, if you get the same spreadsheet every day or every hour, as is the case for catalogs or supply chains and things like that. Those are the two main reasons that, you know, data engineers oftentimes get looped into these client-facing data workloads.
But I would say the ways we've thought about this is, firstly, parallelization is really important. CSVs by nature are quite parallelizable for the vast majority of transformation use cases, especially given that, you know, CSVs are oftentimes entity based. So row by row, you can usually, you know, even handle row- and column-unique transformation sets. So we parallelize natively. Our engine parallelizes across rows and columns by default. And we also have the ability to write custom code inside of OneSchema that is vectorized by default. So we have a code editor that you can use directly inside of OneSchema. You can use TypeScript to write your own transforms or your own validations. You can parameterize them, and then they vectorize into our Rust engine so that you can write really performant custom transforms and custom validations across your spreadsheet.
So parallelization, I think, is first. The second thing we had to really think about from day one was the operations framework, the undoable and logged nature of all of the transforms. Every single transformation is not only undoable and logged, but also replayable. This is really helpful for integration use cases, where we have the ability natively for every single transform that a user performs inside of OneSchema to be recorded and replayed directly on the same spreadsheet again, to build data pipelines and data workflows so that, you know, if a nontechnical business analyst cleans the spreadsheet once, we can replay it on every single subsequent file that comes into an SFTP server, for example. That we had from day one, and figuring out the right abstractions for the inputs and outputs of a transform needed to be thought through both from a performance standpoint and from a saving, logging, and snapshotting standpoint. Excel files have memory explosion sometimes when you unzip them or when you save them into a CSV. Excel files are effectively complex zip files with a variety of sheets and other metadata inside. It turns out, because we are an in-memory system, anything that's not a true CSV file has the possibility of memory explosion, where a 5 megabyte Excel file might explode up to being a 5 gigabyte, right, CSV. And so we had to handle a lot of those edge cases as well when it came to Excel files, and even zip files that explode in memory when they are expanded. And then, of course, joins are one of the huge memory challenges for our system as well. If you have two very large files that need to be joined together, you know, the ability to stream them together and do the join in a performant way, those sorts of aspects are also difficult when it comes to being able to hold in an in-memory system the two pieces that are important for joining. So I would say those are some of the big performance aspects, right? And then, of course, anything that's across the entire file is a little difficult. So multi-column uniqueness checks, I would say, are maybe the other most commonly known, right, you know, type of validation that is difficult to do across a very, very large file.
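Here is a compact Python sketch of the record-and-replay idea Andrew describes: every transform applied interactively is appended to a log that can be replayed on the next file that arrives. The transform names, the table representation, and the registry are all illustrative assumptions, not OneSchema's internals.

```python
from dataclasses import dataclass, field
from typing import Callable

Table = list[dict[str, str]]  # one dict per row; a stand-in for a real columnar store

@dataclass
class TransformLog:
    """Records each named transform so it can be replayed on later files."""
    steps: list[tuple[str, dict]] = field(default_factory=list)

    def apply(self, table: Table, name: str, **params) -> Table:
        self.steps.append((name, params))
        return TRANSFORMS[name](table, **params)

    def replay(self, table: Table) -> Table:
        for name, params in self.steps:
            table = TRANSFORMS[name](table, **params)
        return table

# Illustrative transform registry, keyed by name so logged steps stay replayable.
def trim_whitespace(table: Table, column: str) -> Table:
    return [{**row, column: row[column].strip()} for row in table]

def uppercase(table: Table, column: str) -> Table:
    return [{**row, column: row[column].upper()} for row in table]

TRANSFORMS: dict[str, Callable[..., Table]] = {
    "trim_whitespace": trim_whitespace,
    "uppercase": uppercase,
}

log = TransformLog()
monday = [{"state": " ca "}, {"state": "ny"}]
cleaned = log.apply(monday, "trim_whitespace", column="state")
cleaned = log.apply(cleaned, "uppercase", column="state")

tuesday = [{"state": " tx"}]
print(log.replay(tuesday))  # [{'state': 'TX'}]
```

Keying logged steps by name and parameters, rather than storing closures, is what keeps the log serializable and safe to replay on tomorrow's file.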
[00:34:36] Tobias Macey:
On the point of time saving features that you offer, I imagine that you have built up a catalog of here are common transformations that are done for files from this type of system that are going to this type of system or things of that nature. And then you also mentioned some reliance on AI and LLMs as part of the transformation and mapping machinery. I'm wondering if you can talk a bit more about some of those time saving and user experience improvements that you've built into the system.
[00:35:07] Andrew Luo:
Yeah. So that is the main focus for us in 2025 and what we have planned for OneSchema and our product road map. We have seen with OneSchema today, even with our, you know, native out-of-the-box rules-based validations and transforms, we can see a 2x efficiency improvement in data operations teams cleaning spreadsheets received from clients, whether it's migrations or integrations from client systems. Now, with AI and AI-suggested transforms, we actually expect there to be an even more significant improvement in efficiency that we can drive. But with AI, what we've also found is that we need to build AI transformation packs for specific industries and use cases. Some of the big ones are, for example, accounting use cases, which are really, really popular with OneSchema. Accounts receivable data and accounts payable data is oftentimes sent using spreadsheets.
HR data goes without saying. Right? Employee eligibility files, right, and employee files from an HRIS are very commonly all sent via spreadsheets as well because HRISs are also quite sensitive, right, from a salary SSN standpoint. Also, ERP supply chain data for catalogs. Right? I would say that's another really big one for purchase orders, procurement catalogs. And then, of course, health care EMR systems. Right? Very sensitive patient data as well. So whether it's migrations from systems like those or integrations from systems like those, we expect those use cases to be the ones where we're going to be building our AI transformation packs from, like, accounting system a to accounting system b, HR system a to HR system b, or HR system A to benefits system. That is where we are going to be building specific AI based transformation packs so that if we know that you're coming from a NetSuite or a Sage and you're trying to get that data into a specific format that is also known, we will have AI transformations that will be able to automate even more of the existing workloads.
I would say with rules, we can probably get, like I said, 60% of the way there with automating a lot of the validations and transforms, which is what drives this 2x efficiency. But I think with the AI features that we have planned around these use cases and industries, we can get to 90 plus percent of automation efficiency.
[00:37:29] Tobias Macey:
And in terms of the applications of CSV data and the fact that you are building your business around supporting those different workflows, I'm wondering what are some of the most surprising ways that you have seen CSVs used, or surprising things that you've learned about CSVs, in the process of building OneSchema and having to dig very deeply into this niche problem space?
[00:37:54] Andrew Luo:
Funnily enough, working in this problem space, it feels niche. Right? Like, when we first started this company, we had many advisors and investors asking us, you know, Andrew, why are you so excited about spreadsheets? Thinking about spreadsheets all day long and every day, it just seems like not only is it this relic of the past, right, when it comes to data formats, but also, you know, are they really that common? But I would say the number one surprising thing I've learned about CSVs is simply how truly ubiquitous they are, because every single day I find new spreadsheet use cases that I haven't thought of. For example, in the education space, all student records are typically sent through spreadsheets as well. We have some really amazing social impact use cases inside of OneSchema because there are so many types of nonprofits, for example, that either use us for fundraising or use us to, for example, match students with teachers or match foster parents with orphans. Right? All of these sorts of use cases happen through spreadsheets. We've worked with religious groups that have used spreadsheets to organize their communities. We have worked with marketplaces that load spreadsheets of Pokemon cards to be sold on their sites.
The beauty of it is really the fact that the spreadsheet use cases are all unique, right? It turns out barcodes in certain countries have specific validation rules and checksums inside of them that need to be validated, versus, you know, Pokemon card serial numbers have a very specific validation set as well. And the really fun thing about OneSchema is that we see all of it, and we see the long tail of all of the data in the world that is still sent around using spreadsheets. One of our advisors said it really well: you know, Excel is still the number one ETL tool in the world by use. The biggest and most popular Fivetran connector is still the spreadsheet connector as well. CSVs are this, like, sleeping giant, right, in the data world that we all knew was big and was kind of there, but not really that flashy or that exciting. And I think by being CEO of a company that really just specializes in processing spreadsheets, my eyes are truly open to the world of, wow, these are all the use cases that still rely on CSVs today. If the CSV were to disappear, I think the world of data as we know it would probably grind to a halt, given the number of use cases that people rely on CSVs for. And Excel is also, I believe, I don't know how valid this is, but anecdotally, I've heard that it is the most widely used programming environment too.
I do not doubt it. People have done amazing things with Excel formulas, especially in the finance and accounting worlds. For example, I have seen, Excel formulas that are pages and pages long for being able to put together some of these models.
[00:41:05] Tobias Macey:
And as much as people want to kill CSVs because they are lacking in various ways, they are also still very widely used in the machine learning and AI space for model training, model experimentation, loading weights, etcetera.
[00:41:22] Andrew Luo:
Absolutely. Absolutely. And that's not even to mention the web scraping world, which is also, of course, becoming really important for getting pretraining data into AI models. Right? Most of the time when you scrape datasets, you probably output them as either JSON or CSV. And you can even just think about it: if you were to purchase datasets from vendors, the most common format you get that in is also a spreadsheet, either that or a giant JSONL dump of some kind, which is also effectively tabular. So being, you know, the lingua franca of data exchange, the language that everyone knows how to use, it's, I think, just been long overdue that we have better technology and better tooling. I think the other thing people assume is that it's been around for so long, there's nothing we can do to make it better, even though all of the challenges that we've talked about today around spreadsheets and why they're hard to work with are well known. And in your work of building OneSchema, working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen your platform used?
Yeah. So I think when we first set out to build OneSchema, we thought of ourselves as a tool primarily for tech companies, tech companies that needed CRM migration or CRM import, these sorts of tools that we could embed into a tech platform to make it possible for customers to onboard or to integrate. That, of course, was in fact a really great, you know, market for OneSchema. But I would say the more surprising market is actually the vast world of not tech companies, but consulting firms, real estate firms, banks, accounting firms.
It turns out spreadsheets are even more painful and even more ubiquitous beyond the world of tech and tech companies. We see, for example, that it is in fact the companies that do not have very large, sprawling engineering teams and data engineering teams that have the biggest pain with spreadsheets. As you might imagine, right, if you have a team of 1,000 consultants, for example, but not a lot of data engineers, you have even more of a problem with spreadsheets and have probably built more complexity and more process automation on top of spreadsheets and Excel macros than a team with the most modern data stack rollout that there ever was, with 20 layers of dbt that handle this process. And so I would say that has been the most interesting aspect. We have found tremendous market pull from larger, more traditional enterprises and governments, for example, that have to deal with spreadsheets and actually have even more of a pain point: receiving spreadsheets on a daily basis, with a very brittle automation sitting on top of the SFTP server that relies on an Excel macro written 5 or 10 years ago that continues to power the banking pipelines or the tax compliance pipelines for reporting to a government, you know, a key, key business workflow.
Those are the workflows that we have been most surprised by because they oftentimes exist in companies that are actually not what you would typically imagine are your traditional
[00:44:46] Tobias Macey:
tech company. And in your work of building the business, building the technology,
[00:44:51] Andrew Luo:
working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned personally? I think the one big lesson, kind of along these lines, that Y Combinator, when we went through it, really taught us well was around being close to your customer. The lesson we learned was: if you're not building your product or talking to your customer, everything else is irrelevant. That's kind of the key lesson that drove us to this insight. It also turns out, you know, oftentimes when you start off building a company, you're building the product for what you as a founder understand and know. For me, coming very much from tech startups, I understood tech companies. But as we continued building this company and continued talking to our customers, this is where we realized that enterprises outside of Silicon Valley are really where a lot of the big spreadsheet problems lurk. Especially with AI, I think it's been a unique moment in time for us, right, as a lot of these traditional industries and traditional companies are really looking for ways to drive efficiencies in their teams and companies with AI. And so as we've really leaned into that pull that we have seen, we have really, you know, opened up our eyes to, like, all of the different aspects that we historically have not looked at, right? These different industries that we are less familiar with, we historically haven't leaned into, because we were so focused, right, on the tech companies that we understood. I think the unexpected lesson there is truly listening to your customers, even if your customers don't quite look exactly like you, and following where your customers are pulling you is always, you know, a better strategy than trying to, I guess, use intelligent design or blueprinting for your company.
Oftentimes, letting the market pull you where it wants to pull you is really important to listen to. Obviously, there are traps to just doing whatever the whims of the market and the flavor of the month are. But for us, I think listening to our customers, and especially the customers that we understand less, opening our ears up to that and really allowing ourselves to go in that direction is what really made the difference for us. For people who are working with CSVs
[00:46:59] Tobias Macey:
and figuring out what their approach is going to be, what are the cases where OneSchema is the wrong choice?
[00:47:05] Andrew Luo:
That's a great question. I would say the landscape of data integration is really broad, and there are so many data integration tools out there, iPaaS tools, embedded iPaaS tools, ETL tools, etcetera. I would say OneSchema is the wrong choice for anyone looking for a point-to-point API solution, right, any sort of integration or migration workflow where you have a standard known input schema and a standard known output schema. When you know the input and you know the output and you're looking for a connector, there are so many better options than OneSchema for that. You have your ELT tools and your unified API tools. You can just load it into the warehouse and then transform it there. There's really not that much that you need from OneSchema in terms of transformation.
You really just need a better extractor and a better loader. Right? Especially if you have, like, 50 bajillion different connectors that you need. OneSchema, like I said earlier, is really more of a data prep, pre-validation, validation, and transformation tool for semi-structured data, right, like CSVs, Excels, XMLs, JSONs, those sorts of things, whether it's a one-time migration or whether it's recurring, and you really want your business analysts to do that. So I would say OneSchema is the wrong choice if you're looking for a point-to-point, API-to-API connection. But if you have unstructured or semi-structured data and you have a large team of business analysts that are doing this, that's really the key for knowing that you should be using OneSchema. And as you continue
[00:48:40] Tobias Macey:
to build and iterate on OneSchema, you mentioned a few of the things you have planned for the road map for 2025. I'm wondering if there are any specific projects or problem areas you're excited to explore.
[00:48:51] Andrew Luo:
Yeah. One of the ones I'm super excited about for this quarter is intelligent relational joins. Relational joins are historically one of the most difficult concepts not just to understand, but even to perform inside of an Excel file, right, with VLOOKUPs, and being able to actually get that to work, especially if you have a nontechnical team. Intelligently detecting join keys, suggesting relational join patterns, being able to guide our users through left joins, inner joins, outer joins, and those sorts of nuances, and also helping and guiding our end users through debugging those.
That is one of the things I am most excited about from our intelligent transforms road map for 2025.
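As a rough sketch of what "intelligently detecting join keys" could mean in practice, the following Python snippet scores candidate column pairs by uniqueness on one side and value overlap with the other side. It is an assumption-laden illustration of the general idea, not OneSchema's algorithm, and the sample column names are invented.

```python
from itertools import product

Table = list[dict[str, str]]

def key_score(left: Table, right: Table, left_col: str, right_col: str) -> float:
    """Score a candidate join-key pair: prefer columns that are nearly unique
    on the left and whose values mostly appear on the right."""
    left_vals = [row[left_col] for row in left]
    right_vals = {row[right_col] for row in right}
    if not left_vals:
        return 0.0
    uniqueness = len(set(left_vals)) / len(left_vals)
    overlap = sum(v in right_vals for v in left_vals) / len(left_vals)
    return uniqueness * overlap

def suggest_join_keys(left: Table, right: Table) -> list[tuple[str, str, float]]:
    """Rank all column pairs as join-key candidates, best first."""
    left_cols = left[0].keys() if left else []
    right_cols = right[0].keys() if right else []
    scored = [(lc, rc, key_score(left, right, lc, rc))
              for lc, rc in product(left_cols, right_cols)]
    return sorted(scored, key=lambda t: t[2], reverse=True)

employees = [{"emp_id": "1", "dept": "eng"}, {"emp_id": "2", "dept": "eng"}]
salaries = [{"employee": "1", "amount": "90"}, {"employee": "2", "amount": "85"}]
print(suggest_join_keys(employees, salaries)[0])  # ('emp_id', 'employee', 1.0)
```

Ranking candidates this way also gives the user something to confirm or override, in the same spirit as the delimiter and padding suggestions discussed earlier.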
[00:49:39] Tobias Macey:
Are there any other aspects of the work that you're doing at OneSchema, the overall challenge of dealing with CSVs, or the ecosystem around CSV as a data interchange that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:49] Andrew Luo:
Yeah. I think the number one way you would know if you have this problem is actually simply looking at your team: the implementation and onboarding managers, client services analysts, and sales and solutions engineers that are really mired down by this problem. I think a lot of the data engineering and data world is really focused on internal problems, whether it's, you know, reports and dashboards and BI and analytics to inform exec teams and leadership.
I think that is where data teams are the best and are really, really effective, because the centralized model of data tooling has been designed for that. What I see is that, because we are a cross-company data exchange tool, it's when data teams are pulled into these client-facing data workloads that are unique, right, for every single customer, whether it's an integration or a, you know, migration workflow, that we find the data tooling as we know it is somewhat lacking. We see people using really powerful ETL tools and pointing them at each individual client, right, and their spreadsheet format, or using, like, very heavy dbt-style scripts for that. And I think that is where the shift-left paradigm that we have seen, where enabling the end users, the client-facing teams, and the business analysts to perform validation and prevent the bad data from even entering the warehouse, is super important, not least for saving time. I mean, every data team has a giant backlog of work that they haven't gotten around to. Right? And so I would say that the paradigm shift that we've really seen at OneSchema is especially around client-facing and client services types of data work. We really see that traditional data tools are lacking when it comes to being able to handle the flexibility and the one-off nature of client-to-client data integration and data
[00:51:54] Tobias Macey:
interaction. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology
[00:52:10] Andrew Luo:
management today. Yeah. That's a great question. I would say there's so much data tooling in the world today, and very much we are only working on a very small slice of it. I would say even still today when it comes to data management and data tooling, the world of unstructured data, I think, is still a gap even beyond what one schema can tackle. I think that there are a lot of really interesting companies working on, for example, PDFs and also other types of images and unstructured data. I think unstructured data in general, we have seen lots of companies ask us for things beyond spreadsheets that we can't handle, more complex file types, even things like tax documents, customs documents, and these sorts of use cases. I think with, of course, the advent of a lot of these LLM based technologies, other forms of document processing specific to the nature of specific industry and use case workflows will be a gap that I think can be solved for the first time with LLMs and AI that will drive a lot of efficiency as well when it comes to, you know, manual, large offshore teams that are reading through and parsing, like, you know, an invoice document or
[00:53:23] Tobias Macey:
or a customs document or a a tax document, for example. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at One schema and the hard one lessons around CSPs and all of their applications that you have gained in the process. It's definitely a very interesting problem domain and one that as you pointed out is still as ubiquitous as ever. So I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Awesome. You as well. Thank you so much for having me, Tovise. Thank you for listening, and don't forget to check out our other shows. Podcast.net covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts, and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business. So, Andrew, can you start by introducing yourself? Hey, Tobias. Thank you for having me. I'm Andrew, CEO of OneSchema.
[00:01:02] Andrew Luo:
We are a platform for automating CSV imports, and I'm super excited to be on the podcast today. And do you remember how you first got started working in data? Yeah. So I used to lead data engineering, CRM migration, and onboarding at a CRM company called Affinity. We were a CRM platform that worked with investment professionals like VCs and investment bankers. Every customer that onboarded onto our platform came with a messy spreadsheet dump from Salesforce, HubSpot, etcetera, and it needed to be validated, cleansed, and loaded onto our platform. So I was the lead of the engineering team that built our self-serve spreadsheet onboarding tool that was used for self-serve customer onboarding, in addition to the tooling that would enable our data ops and implementation team of 5 or 6 people who also had to deal with the relational joins and all of that for migrating folks onto our platform. As you might imagine, any engineer looking at a team of people manually cleaning spreadsheets 9 to 5 every single day has to imagine that there's a better way. And so that was kind of the spark of inspiration for OneSchema.
[00:02:10] Tobias Macey:
And before we get too much into OneSchema, I wanna dig a bit more into CSVs in particular, because as you mentioned, they are still very ubiquitous and they have been around for decades at this point. They are the lowest common denominator for data interchange, or there may be a few others in that same ballpark. But despite the fact that we have had years of evolution and innovation in data interchange formats and data storage formats, CSVs are still everywhere. And I'm wondering what are some of the reasons that you see for their longevity and their ubiquity despite all efforts to the contrary?
[00:02:48] Andrew Luo:
Yeah. I think of human language as a good analogy for this, right? Spreadsheets are the lingua franca of data exchange. They are the one thing that everyone can agree on. Excel files and spreadsheets were one of the original user interfaces for data and for computing as we know it. And so the vast, vast majority of business users in an enterprise organization can read and understand a spreadsheet, whereas every company has a different API. Imagine a world where every single company invented its own language with its own API and you needed to build a translation layer in order to understand it.
Python, SQL, these sorts of technical coding languages are becoming more ubiquitous, but still not even remotely close to keeping up with the number of companies and the number of APIs and languages that are being created in the world today. Every single year, new data tooling appears, and paradigm shifts are so rapid that these languages are hard to keep up with, which is really expensive if you think about building new translation layers and APIs and data pipelines for every single thing. So that's, I think, the number one reason why spreadsheets are so common and so powerful. Of course, they do come with their problems precisely because they're used for everything. There are lots of things that are really nice and structured about APIs, but I think that's the number one reason why spreadsheets are everywhere.
I would say the second most important reason for this is security and access management. I think that because APIs are constantly changing, people tend to not build in the security layers properly with every single new API that exists. And therefore, oftentimes, you see a lot of universally powerful API keys that can access everything about an API without the proper access management. Hence why we see sensitive data use cases rely on CSVs a lot more often as well. These would be government data, health care data, legal data, and financial banking data. These sorts of data streams are really still predominantly CSVs because they're sensitive.
[00:04:59] Tobias Macey:
Which is ironic since so many other data formats have the ability to have specific fields be encrypted or hashed or masked in some way that can then be reversed programmatically without having to transfer that information in plain text over the wire and have the potential for it to leak or get misplaced or have 15 different copies of the same file that accrue over time.
[00:05:22] Andrew Luo:
I know. I know. It is super ironic, because you might think the CSV is not secure, right, since it's an Excel file sitting on a client facing analyst's laptop or something like that. But in reality, for the person sending the data, it feels really secure because they know exactly what is being sent. They know it's in a spreadsheet file that will not be changed, and they can send it through as a report. And that's the other aspect of it. So many systems of record, whether they are ERPs, EMRs, or CRMs, have natural report building capabilities to generate CSVs, and oftentimes the report building tooling in these systems is more flexible at generating a CSV than even their API is capable of today.
[00:06:11] Tobias Macey:
You mentioned that as far as the sources of CSV data, you have these sensitive data streams, and you mentioned earlier that you dealt a lot with that for exports from things like Salesforce or other CRMs. And I'm curious, given the fact that there are so many different options for data interchange, including things like JSON, which at least has some form of schema and type information, versus CSVs where it's just text and the only way you have to understand the structure is by making sure that the delimiters line up, which often they don't. I'm wondering what are some of the main sources of CSV data that you're seeing teams have to deal with for business and analytical processes?
[00:06:55] Andrew Luo:
Yeah. So we deal a lot with what I would call cross company data interchange. And the reason cross company data interchange relies on spreadsheets more often is because of the key insight that when you are dealing with cross company or cross department boundaries, you often have business users that are actually responsible for the relationship. You have your customer success, your onboarding team, your client services team, your professional services team, your consulting team that are actually in charge of both understanding the schema alignment and the mapping process from schema A to schema B. And because you have this team that is naturally in charge of this mapping process, imagine, for example, a Workday migration from a different ERP system.
These can be lengthy 18 month processes, for example. This sort of workflow really requires the nontechnical business user to understand that first name in this system means field A in my other system. That sort of process requires a lot of sitting down with the customer, understanding their workflows, and those sorts of aspects. Because of that, we often find that the data format itself simply needs to be understandable by the nontechnical business users. And so you might imagine that a spreadsheet is the natural tool that you would reach for for that data discovery, data prep, pre validation, and all of those sorts of mapping aspects.
Hence why Excel is still more common than JSON, where the format is much harder to visually scan and parse through all of the data in bulk all at once. And so that's kind of the key insight: needing nontechnical business users to be empowered. For a point to point API connection or something like that, you're totally right, JSON would be much better.
[00:08:56] Tobias Macey:
The other interesting point around CSVs and their utility is you mentioned their use in these spreadsheet software applications, Excel being the most ubiquitous and widespread. But even with an Excel file, you actually have the ability to store type information in that file format, whereas with CSVs, you lose all of that and you have to infer the type just based on what it looks like. Oh, well, this seems like it's a number, but that one's quoted, so maybe it's not. And I'm just wondering, given the fact that CSVs can be so inconsistent in terms of type representation and the inference around those types, what are some of the challenges that you see for people implementing the export path, generating the CSVs in the first place, to make sure that that type information is maintained as closely as possible? And then on the ingest side, the ways that people have to wrestle with figuring out, okay, who gave me the CSV? What is it even supposed to mean, and does that even actually map to the data that's in there?
[00:10:05] Andrew Luo:
Yeah. This is something we see and deal with all the time because, of course, these are the limitations of CSVs, and there are a few. Obviously, the lack of nesting inside of CSVs means that lots of times the export format will come up with different types of nesting structures inside of the spreadsheet itself. We've seen everything from delimiters within cells, semicolons, pipes, all of that, all the way to multiple rows that need to be collapsed together, and the ways of dealing with that are nonstandard. Relational data inside of CSVs is also difficult. Most of the time when you're doing a migration or an integration, you have multiple entities that have relational data. Oftentimes, this is why people use Excel files with multiple tabs, in order to represent multiple tables in the destination system. And then, of course, dynamic schemas. Unlike JSON, where the keys, the values, and the mappings are implicit in the actual structure itself, and unlike EDI standards like ANSI X12 or HL7 in health care, the fact that schemas can change on a customer to customer basis means none of that schema information is represented directly in the CSV. It's not encoded into the file format itself, which is of course what makes it difficult.
Then finally, there's the lack of collaboration and automation workflows. Most CSV driven workflows are very much Excel driven, with Excel macros and VBA and these sorts of workflows that are not shareable and are hard for a team to collaborate on. And so when you have very important data pipelines running on them, those are other challenges with using CSV as a file format. OneSchema really, as a company, was designed to tackle these problems head on by being a platform that takes the ubiquity of CSVs but then applies a lot of that nice structure on top.
We infer and apply data type validation on top of CSVs. We have AI to actually infer data types and validations as well, in addition to suggesting transforms and those sorts of things. Basically, we're trying to give data operations teams the best of both worlds, which is one of the biggest challenges that OneSchema aims to solve.
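As an illustration of the kind of type inference being described, here is a minimal sketch of naive column type detection over an untyped CSV. It is not OneSchema's implementation; the file name and the accepted date formats are arbitrary assumptions.

```python
import csv
from datetime import datetime

def infer_column_type(values):
    """Guess a column type from its string values (deliberately naive)."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        # Only two formats are tried here; real importers need many more.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    if all(is_int(v) for v in non_empty):
        return "integer"
    if all(is_float(v) for v in non_empty):
        return "float"
    if all(is_date(v) for v in non_empty):
        return "date"
    return "string"

# "customers.csv" is a hypothetical input file for the sketch.
with open("customers.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.reader(f))

header, data = rows[0], rows[1:]
for i, name in enumerate(header):
    column = [r[i] for r in data if i < len(r)]
    print(name, "->", infer_column_type(column))
```

A production importer would need far more formats, locale handling, and, as discussed above, user confirmation before committing to a type.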
[00:12:13] Tobias Macey:
Another interesting aspect of CSVs is that, to my understanding, there isn't really a concrete specification that defines their structure. They're more of a happenstance: somebody made it, everybody else started doing it, and it just was let loose into the wild. I'm wondering what you have seen as far as any actual standards around the ways that CSVs should be structured and interpreted, and some of the ways that those standards or specifications are just completely ignored and violated left and right.
[00:12:44] Andrew Luo:
There are so many debates we have on whether a given CSV should or should not be interpreted in a certain way. Funnily enough, inside of OneSchema, there are, I think, 6 CSV parsers that we waterfall through, in different languages, also for performance reasons, whether it's a Go based CSV parser, a Rust based CSV parser, or a JavaScript based one. And we use a variety of these to try to catch all of the different edge cases. The most popular open source CSV parsers in different programming languages actually conform to different standards; I don't know if you are aware of that. It very much is a problem where there's not a real general consensus on, for example, what happens when you have a newline character inside of a quoted cell value inside of a CSV, or how you handle jagged CSVs, or white spaces around the cell. Oftentimes, for our product team, when we are thinking about these sorts of challenges, we usually think about what is going to be the safest approach, because we obviously never want to transform data unnecessarily.
But we try to offer these up as intelligent suggestions, whether it's a jagged CSV or white spaces around the cells. In the absence of a universal standard, we've seen all sorts of things: odd file formats, UTF-8 BOM characters, invisible characters from Excel versions from the early 2000s. We see so many of these CSV workflows across the hundreds of companies that we work with at this point, and we've really had to piece it together as far as hitting all of the different variants and then applying the right intelligent fix. So we actually have the ability to automatically pad out a CSV if it's jagged, and we have the ability to infer the delimiters. But in every single one of these cases, we always ask the user for confirmation that this is what they want to do. It's tricky. It's not as good as having a universal standard where everyone agrees. But we found that with CSVs being this lingua franca, we actually do have to make some inferences and then ask the user in those cases.
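To make the delimiter inference and jagged row handling concrete, here is a small sketch using Python's standard csv module. The sample data, the candidate delimiters, and the pad-or-truncate policy are illustrative assumptions; as noted above, a real tool would surface these as suggestions and ask the user to confirm.

```python
import csv
import io

# A deliberately messy sample: semicolon-delimited, a newline inside a quoted cell,
# one short (jagged) row, and one row with an extra cell.
raw = 'id;name;notes\n1;Acme;"multi\nline note"\n2;Globex\n3;Initech;ok;extra\n'

# Infer the delimiter from the header line instead of assuming a comma.
dialect = csv.Sniffer().sniff(raw.splitlines()[0], delimiters=";,|\t")
rows = list(csv.reader(io.StringIO(raw), dialect=dialect))

width = len(rows[0])  # the header width defines the expected column count
repaired = []
for row in rows:
    if len(row) < width:
        row = row + [""] * (width - len(row))   # pad short rows with empty cells
    elif len(row) > width:
        print("extra cells need user confirmation:", row[width:])
        row = row[:width]                       # or keep them, pending user choice
    repaired.append(row)

for row in repaired:
    print(row)
```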
[00:15:01] Tobias Macey:
Even if you do have an agreed upon standard, you're always going to find several people who disagree on the implementation or just completely ignore it. I think one of my favorite terrible examples of a standard being ignored was dealing with a SOAP API that returned the error response with a 200 HTTP response code but embedded the error in the returned XML, so I then had to modify the parser that I was using to actually handle that properly.
[00:15:29] Andrew Luo:
That's right. That's right. And also, if you really just think about CSVs, one of the easiest checks that we have internally is that if Excel can handle it and you can open it in Excel, then it's probably something that OneSchema should support. Once again, just going back to the ubiquity of spreadsheets, most people aren't reading the CSV standard that, I think, was published in the late nineties. I think most people are probably using Excel's parser as the standard, and so that's what we also try to conform to.
[00:16:02] Tobias Macey:
And so digging more into OneSchema, you mentioned a little bit of the genesis and the problems that you were seeing and trying to address. I'm wondering if you can dig a bit more into how you determined that this was actually a viable business case, and something that enough people wanted solved for them without having to do it in house, that it was worth investing your time and energy and capital into.
[00:16:26] Andrew Luo:
Yeah. As far as what this looked like, having led the team doing this at a CRM company myself, we spent 4 years building 4 different versions of this product, effectively something similar to an internal Salesforce Data Loader. And just the number of edge cases in CSVs, to your point, felt infinite. I still remember talking to our head of client services as well as our staff engineer and chief architect on our team. And very much, it was this clash where the client services team didn't want to continue hiring a massive team of people to manually clean every spreadsheet. But then on the other side, our technical teams were saying exactly what you were saying, Tobias: CSVs are an unsolvable problem where the number of edge cases is infinite. And as far as being able to bridge that gap, we never quite got there ourselves.
But I really just had this vision that if an entire company was dedicated to bridging that gap and solving that problem, we could get the best of both worlds and reduce a lot of the manual work; all of the people in the world cleaning spreadsheets every single day should really have a better tool to do this. And so that was the genesis of the idea and also the pain point that I saw, because we had to scale that team linearly with the number of customers that we had. The more customers you have onboarding, the more people you need cleaning spreadsheets. With OneSchema today, our customers can get 2x, 3x efficiency on the implementation, onboarding, and client services teams that are cleaning spreadsheets manually.
And that's just wild. Right? It scales with the size of the company. If you have 50 people cleaning spreadsheets and you're using Excel and Excel macros and those sorts of things, these automations are just so, so valuable. And so that was the business opportunity that we saw with spreadsheets and CSV automation.
[00:18:25] Tobias Macey:
So for people who are dealing with these challenges of CSV as a data interchange and the ways that you're approaching the problem at OneSchema, I'm curious who is typically the person that identifies that OneSchema is actually the solution for the problem that they're dealing with and brings it into the company, and some of the ways that you have to work through customer education, helping people at various levels of business or engineering understand that you could solve this yourself, but this is how long it's going to take and how much it's going to cost, or just doing that cost benefit analysis.
[00:19:04] Andrew Luo:
Yeah. So, typically, the companies that we work with are going to be companies or larger enterprises that already have some form of an implementation team or a data engineering team or a data ops team that is already doing this, cleaning the spreadsheets in order to get client data onto the platform. And also, I know we haven't talked about the tailwinds of AI yet, but we really do see this in the market today: as companies are increasingly trying to develop AI copilots or models inside of their products, getting client data into your actual data warehouse or your data stores is super important for actually being able to do that. So increasingly, we are seeing that people are bringing in data teams to ingest or scrape or get client data into their platforms. And so the ROI case is really simple, actually.
Typically, it'll be a, you know, head of data engineering or, you know, a VP of data governance or head of IT, chief information officer, CTO, or even, sometimes a head of product, right, as well that is going to bring 1 schema in because they see this. Right? They see on their client services or customer success teams that they have a lot of people manually doing the work of cleaning spreadsheets, writing one off Python scripts or SQL scripts. Right? You have a repository of 300 of them that you've copy and pasted. Right? And they're like completely brittle and breaking. In the HR world, I know it's January 3rd.
The folks in the HR benefits world are probably just coming off of a very hectic open enrollment season, and all of those files that have been ingested from HR systems take a lot of that manual work and scripting as well. And so all of those sorts of workflows are the ones where it's usually going to be a head of ops, head of customer success, or head of engineering or data engineering slash product that will bring us in.
[00:20:59] Tobias Macey:
And in terms of the specific core problems that you're addressing, we talked a lot about the problems of doing this manual data cleaning and ingest. But from the product perspective, what are the actual interfaces that you're providing and the solutions that you're offering to make CSVs as a means of data exchange less of a headache for the business?
[00:21:29] Andrew Luo:
Yeah. Great question. So OneSchema comes with an out of the box system to validate, transform, and map CSV data into a known schema inside of the company. We work with enterprise tech companies like Toast and Ramp, we work with large consulting firms like Huron Consulting Group, and we work with e commerce marketplaces like Mercari. Let's just take Toast as an example. If you are ingesting POS, point of sale, data from a restaurant or a retailer that Toast is working with, those sorts of spreadsheets, provided by the client when migrating from a different system onto Toast, would be ingested by the onboarding team into the platform. Historically, they would receive it as an Excel file, clean it inside of Google Sheets or Excel, and then upload it into the system. But with OneSchema, we can either embed a completely self-serve importer directly into the Toast platform, if it's a tech company, so that end users and customers can just do it themselves with our automated validations, AI powered transformations, and all of those sorts of aspects that preclude the need for the onboarding team to be involved at all. Or, for a large consulting firm like Huron, for example, we can give the field consultants that work with the customers directly the tools so that, instead of using Excel and VBA and one off scripts, the data engineering team can pre program workspaces for the field consultants that are working with the customers, doing the mapping and understanding the voice of the customer as far as what field means what. They can automate that mapping and validation process rather than needing to put it into a BI tool, wait 24 hours for the data team to run a SQL script to cleanse it, send it back, and map it. This sort of back and forth data prep process can take weeks or months to actually handle and really slows down the migration process, time to value, or integration process. And OneSchema basically cuts down on all of that, making it possible to automatically generate human readable error reports that the client can actually use to clean their data faster.
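To sketch what "validate against a known schema and produce human readable errors" can look like in the abstract, here is a toy example in Python. The column names, rules, and file name are hypothetical and far simpler than a real template system.

```python
import csv
import re

# A tiny illustrative schema: column name -> (required, validator, error message).
SCHEMA = {
    "email": (True, lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v), "is not a valid email"),
    "amount": (True, lambda v: re.fullmatch(r"-?\d+(\.\d+)?", v), "is not a number"),
    "state": (False, lambda v: len(v) == 2, "should be a 2-letter code"),
}

def validate_file(path):
    """Return a list of human readable error strings for a CSV file."""
    errors = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            for col, (required, check, message) in SCHEMA.items():
                value = (row.get(col) or "").strip()
                if required and not value:
                    errors.append(f"Row {line_no}: '{col}' is required but empty")
                elif value and not check(value):
                    errors.append(f"Row {line_no}: '{col}' value '{value}' {message}")
    return errors

# "invoices.csv" is a hypothetical input file for the sketch.
for err in validate_file("invoices.csv"):
    print(err)
```

The point is that each error names the row, the column, and the reason, which is something a nontechnical client can act on without a data engineer in the loop.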
[00:23:33] Tobias Macey:
You mentioned a little bit about the different parsers that you are using to be able to manage the various edge cases of CSV data. But more broadly, can you give an overview of how you've architected your system to be able to handle the variety and scale and multitenancy of the data that you're working with?
[00:24:05] Andrew Luo:
That's a great question. We tried all sorts of different architectures in the early days for what works best for a CSV in particular: relational databases, NoSQL databases, loading it into S3 or Iceberg or something like that. In reality, though, Excel is the number one interface that our customers are using today for doing this, and that's the value, right? Being able to see all the data and quickly manipulate it. But Excel has limitations on million row plus files, so that's a huge pain point for our customers today. If you have a file above a million rows, you've got to go use SSIS or Microsoft Access or something like that. So we actually built an in memory CSV database in Rust, specifically for processing spreadsheets and parallelizing them. It's our own system in house. It parallelizes across rows and columns, and it is specifically designed for validations and transformations on top of spreadsheets.
In memory is really important. This is the reason why we can support files of up to 50 million rows loaded into OneSchema, and it's just as snappy as Excel, as if you had a laptop with 500 gigs of RAM running on it. So that's our core technology and the architecture that we adopted. Of course, it persists out to a cloud storage bucket as well, with a caching layer. But that is the architecture that we use, and it is what makes us unique in the world of data: unlike a lot of other data tools, where you have to run a profiler or read a very extensive Java error log in order to see what the validations are actually seeing, OneSchema lights up the validation errors directly inside of the spreadsheet. And so all of those automations happen natively on top of the platform, also link into SFTP connectors, API connectors, S3 buckets, file storage systems, and all of that, and, of course, are also collaborative on the cloud. So all of those aspects are what make us unique.
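OneSchema's engine is a proprietary in-memory Rust system, so the sketch below is only a rough illustration of the row-chunk parallelism idea, written in Python with invented column rules and file name; it is not how their engine is actually built.

```python
import csv
from concurrent.futures import ProcessPoolExecutor

def check_chunk(chunk):
    """Validate one slice of rows; returns (row_number, message) pairs."""
    problems = []
    for row_no, row in chunk:
        if not row.get("sku"):
            problems.append((row_no, "missing sku"))
        if row.get("qty") and not row["qty"].isdigit():
            problems.append((row_no, f"qty '{row['qty']}' is not an integer"))
    return problems

def parallel_validate(path, workers=4, chunk_size=50_000):
    # Read everything into memory, split into row chunks, and validate chunks in parallel.
    with open(path, newline="", encoding="utf-8-sig") as f:
        rows = list(enumerate(csv.DictReader(f), start=2))
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(check_chunk, chunks)
    return [problem for chunk_result in results for problem in chunk_result]

if __name__ == "__main__":
    # "catalog.csv" is a hypothetical input file for the sketch.
    for row_no, msg in parallel_validate("catalog.csv"):
        print(f"row {row_no}: {msg}")
```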
[00:26:09] Tobias Macey:
You started this company with a focus on CSVs and addressing that pain point. You mentioned that you custom wrote an engine specifically for parallelizing CSV data and dealing with all of the challenges that go with that. I'm wondering if you have put any thought into other types of data or other data formats that are maybe adjacent to CSVs that you're considering supporting as you build out your capacity and start to grow into more use cases and markets?
[00:26:45] Andrew Luo:
That's a great question. So we already have expanded. From day 1, we had CSV, Excel, and all of the XLSX and XLS types of support for the different spreadsheet-esque formats. Recently, we've also expanded into XML and JSON support, so nested structures. We've been able to handle a lot of the nested file types, so being able to apply transforms to unnest and parse data into tabular formats from XML and JSON is something we support already as well. Now looking forward into 2025, this being the start of the year, we are also looking at EDI formats. This would be encodings like HL7, X12, and EDIFACT; all of these sorts of EDI standards can oftentimes be parsed into CSV.
We also already have fixed width, which is kind of getting close to other types of EDI standards. Fixed width files, for everyone on the podcast, are TXT files where, for every single column, the client will provide the mappings: character 0 to character 8 is column A, character 9 to character 15 is column B. Those sorts of formats can also be handled by OneSchema.
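For readers who have not worked with fixed width files, here is a tiny sketch of the character-range parsing being described; the layout and the sample record are made up for illustration.

```python
# Column layout as a client might describe it: name -> (start, end) character
# positions, end-exclusive. These ranges are invented for the example.
LAYOUT = {
    "account_id": (0, 8),
    "state": (8, 10),
    "balance": (10, 22),
}

def parse_fixed_width(line, layout):
    """Slice one fixed-width record into named, stripped fields."""
    return {name: line[start:end].strip() for name, (start, end) in layout.items()}

record = "10294817CA     1520.75"
print(parse_fixed_width(record, LAYOUT))
# {'account_id': '10294817', 'state': 'CA', 'balance': '1520.75'}
```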
[00:27:52] Tobias Macey:
Because of the fact that you wrote your engine in Rust, I'm curious too if you have looked at all into being able to integrate with things like the DataFusion engine from the Arrow ecosystem to be able to get a lot of that interop for free with some of the other query engines and data frame libraries, etcetera?
[00:28:21] Andrew Luo:
That is also coming soon. Interoperability is huge for a data pipelining tool like OneSchema. We have thought a lot about the different engines that we could either partner with or integrate directly with when it comes to destinations and sources. So far, we have really been focused on our own intelligence layer of actually getting the transforms set up, mainly because many of the clients that we work with already have an extractor layer and a loading layer built out inside of their systems. And so we've really thought about OneSchema as the transform and validation layer for CSVs, and linking into other loading and extracting systems is still very much a 2025 priority for interoperability.
We've also, of course, been looking into interoperability around the validation rules themselves. Inside of OneSchema, you have a templating system for setting up your schemas with all of the columns. We have 50 plus data types for the validation rules, for dates, phone numbers, addresses, and all of that. We are also looking into interoperability with JSON Schema and other JSON Schema based systems that clients and customers might already be using, so that we can natively link into the API connections and the S3 or other AWS and Azure cloud data buckets that are already being used. Today, we primarily see a lot of SFTP, though. It turns out that when it comes to working with CSVs, you can't escape SFTP as a standard. And so we've also been working on file name tokenizers, for example; a lot of metadata is stored in file names, and we want to be able to route files around and integrate with the MFT providers that customers might already be using.
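As a hypothetical illustration of what interoperability with JSON Schema style rules could look like for a single parsed row, here is a sketch using the open source jsonschema package; the schema, field names, and row are invented and do not reflect OneSchema's actual rule format.

```python
# pip install jsonschema
from jsonschema import Draft7Validator

# Invented per-row rules expressed as JSON Schema.
row_schema = {
    "type": "object",
    "required": ["order_id", "email"],
    "properties": {
        "order_id": {"type": "string", "pattern": r"^\d{6}$"},
        "email": {"type": "string", "pattern": r"^[^@\s]+@[^@\s]+$"},
        "country": {"type": "string", "minLength": 2, "maxLength": 2},
    },
}

# One parsed CSV row, represented as a dict of strings.
row = {"order_id": "12345", "email": "not-an-email", "country": "USA"}

validator = Draft7Validator(row_schema)
for error in validator.iter_errors(row):
    # error.path points at the offending field; error.message explains the failure.
    print(list(error.path), error.message)
```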
[00:30:02] Tobias Macey:
You mentioned that you had to write your engine in Rust because of the inherent limitation of Excel's million row count, which has been there for a long time, although in the early days it was a much lower limit. And I'm wondering what are some of the other types of performance issues that you have dealt with in having to process all of this CSV information, and some of the orchestration layers that you've had to build out to be able to manage retries, failures, and all of the different edge cases around this problem space.
[00:30:48] Andrew Luo:
Yeah. There is so much that I can go into around the actual architecture and the performance problems. Performance is key for us, especially because large files are one of the main reasons that people need large client facing data engineering teams in order to do this workflow, in addition to, of course, data pipelining, if you get the same spreadsheet every day or every hour, as is the case for catalogs or supply chains and things like that. Those are the two main reasons that data engineers oftentimes get looped into these client facing data workloads.
But I would say the way we've thought about this is, firstly, parallelization is really important. CSVs by nature are quite parallelizable for the vast majority of transformation use cases, especially given that CSVs are oftentimes entity based. So, row by row, you can usually handle even row and column specific transformation sets. So we parallelize natively; our engine parallelizes across rows and columns by default. And we also have the ability to write custom code inside of OneSchema that is vectorized by default. We have a code editor that you can use directly inside of OneSchema. You can use TypeScript to write your own transforms or your own validations, you can parameterize them, and then they are vectorized into our Rust engine so that you can write really performant custom transforms and custom validations across your spreadsheet.
So parallelization, I think, is first. The second thing we had to really think about from day 1 was the list-of-operations framework: the undoable, logged nature of all of the transforms. Every single transformation is not only undoable and logged, but also replayable. This is really helpful for integrations use cases, where we have the ability natively for every single transform that a user performs inside of OneSchema to be recorded and replayed directly on the same spreadsheet again to build data pipelines and data workflows, so that if a nontechnical business analyst cleans the spreadsheet once, we can replay it on every single subsequent file that comes into an SFTP server, for example. We had that from day 1, and it was pretty important: figuring out the right abstractions for the inputs and outputs of a transform needed to be thought through both from a performance standpoint and from a saving, logging, and snapshotting standpoint.

Excel files also have memory explosion sometimes when you unzip them or save them into a CSV. Excel files are effectively complex zip files with a variety of CSVs and other metadata inside. It turns out, because we are an in memory system, anything that's not a true CSV file has the possibility of memory explosion, where a 5 megabyte Excel file might explode into a 5 gigabyte CSV. And so we had to handle a lot of those edge cases as well when it comes to Excel files, or even zip files, that explode in memory when they are expanded.

And then, of course, joins are one of the huge memory challenges for our system as well. If you have 2 very large files that need to be joined together, the ability to stream them together and do the join in a performant way is difficult when it comes to holding in memory the two pieces that are important for joining. So I would say those are some of the big performance aspects. And then anything that runs across the entire file is a little difficult; multi column uniqueness checks, I would say, are maybe the other most commonly known type of validation that is difficult to do across a very, very large file.
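To illustrate the record-and-replay idea in the abstract, here is a minimal sketch of a logged list of column transforms being re-applied to a new file; the steps, column names, and file names are hypothetical, and the real operation log described here is far richer (undo, snapshots, parameterization).

```python
import csv

# A recorded session: each step is (column, transform function). In a real tool the
# steps would be captured from user actions and serialized; here they are hard-coded.
RECORDED_STEPS = [
    ("email", str.strip),
    ("email", str.lower),
    ("amount", lambda v: v.replace("$", "").replace(",", "")),
]

def replay(in_path, out_path, steps):
    """Re-apply a recorded list of column transforms to a new file."""
    with open(in_path, newline="", encoding="utf-8-sig") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for column, fn in steps:
                if column in row and row[column] is not None:
                    row[column] = fn(row[column])
            writer.writerow(row)

# Hypothetical file names: replay yesterday's cleanup on today's SFTP drop.
replay("dropped_2025_01_03.csv", "cleaned.csv", RECORDED_STEPS)
```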
[00:34:36] Tobias Macey:
On the point of time saving features that you offer, I imagine that you have built up a catalog of here are common transformations that are done for files from this type of system that are going to this type of system or things of that nature. And then you also mentioned some reliance on AI and LLMs as part of the transformation and mapping machinery. I'm wondering if you can talk a bit more about some of those time saving and user experience improvements that you've built into the system.
[00:35:07] Andrew Luo:
Yeah. So that is the main focus for us in 2025 and what we have planned for OneSchema and our product roadmap. We have seen with OneSchema today, even with our native, out of the box, rules based validations and transforms, a 2x efficiency improvement in data operations teams cleaning spreadsheets received from clients, whether it's migrations or integrations from client systems. Now, with AI and AI suggested transforms, we actually expect an even more significant improvement in efficiency that we can drive. But with AI, what we've also found is that we need to build AI transformation packs for specific industries and use cases. Some of the big ones, for example: accounting use cases are really, really popular with OneSchema; accounts receivable data and accounts payable data are oftentimes sent using spreadsheets.
HR data goes without saying. Employee eligibility files and employee files from an HRIS are very commonly sent via spreadsheets as well, because HRIS data is also quite sensitive from a salary and SSN standpoint. Also, ERP supply chain data for catalogs; I would say that's another really big one, for purchase orders and procurement catalogs. And then, of course, health care EMR systems, with very sensitive patient data as well. So whether it's migrations or integrations from systems like those, we expect those use cases to be the ones where we're going to be building our AI transformation packs: accounting system A to accounting system B, HR system A to HR system B, or HR system A to a benefits system. That is where we are going to be building specific AI based transformation packs, so that if we know you're coming from a NetSuite or a Sage and you're trying to get that data into a specific format that is also known, we will have AI transformations that will be able to automate even more of the existing workloads.
I would say with rules, we can probably get, like I said, 60% of the way there with automating a lot of the validations and transforms, which is what drives this 2x efficiency. But I think with the AI features that we have planned around these use cases and industries, we can get to 90 plus percent of automation efficiency.
[00:37:29] Tobias Macey:
And in terms of the applications of CSV data and the fact that you are building your business around supporting those different workflows, I'm wondering what are some of the most surprising ways that you have seen CSV used or surprising things that you've learned about CSVs in the process of building one schema and having to dig very deeply into this niche problem space?
[00:37:54] Andrew Luo:
Funnily enough, working in this problem space, it feels niche. When we first started this company, we had many advisors and investors asking us, Andrew, why are you so excited about spreadsheets? Thinking about spreadsheets all day long, every day, it just seems like not only is it this relic of the past when it comes to data formats, but also, are they really that common? But I would say the number one surprising thing I've learned about CSVs is simply how truly ubiquitous they are, because every single day I find new spreadsheet use cases that I haven't thought of. For example, in the education space, all student records are typically sent through spreadsheets as well. We have some really amazing social impact use cases inside of OneSchema, because there are so many types of nonprofits, for example, that either use us for fundraising or use us to match students with teachers or match foster parents with orphans. All of these sorts of use cases happen through spreadsheets. We've worked with religious groups that have used spreadsheets to organize their communities. We have worked with marketplaces that load spreadsheets of Pokemon cards to be sold on their sites.
The beauty of it is that the spreadsheet use cases are all unique. It turns out barcodes in certain countries have specific validation rules and checksums inside of them that need to be validated, and Pokemon card serial numbers have a very specific validation set as well. And the really fun thing about OneSchema is that we see all of it, the long tail of all of the data in the world that is still sent around using spreadsheets. I think one of our advisors said it really well: Excel is still the number one ETL tool in the world by use. The biggest and most popular Fivetran connector is still the spreadsheet connector as well. CSVs are this sleeping giant in the data world that we all knew was big and kind of there, but not really that flashy or that exciting. And by being CEO of a company that really just specializes in processing spreadsheets, my eyes are truly open to, wow, these are all the use cases that still rely on CSVs today. If the CSV were to disappear, I think the world of data as we know it would probably grind to a halt, given the number of use cases that rely on CSVs. And Excel is also, I believe, and I don't know how valid this is, but anecdotally I've heard that it is the most widely used programming environment too.
I do not doubt it. People have done amazing things with Excel formulas, especially in the finance and accounting worlds. For example, I have seen, Excel formulas that are pages and pages long for being able to put together some of these models.
[00:41:05] Tobias Macey:
And as much as people want to kill CSVs because they are lacking in various ways, they are also still very widely used in the machine learning and AI space for model training, model experimentation, loading weights, etcetera.
[00:41:22] Andrew Luo:
Absolutely. Absolutely. And that's not even to mention the web scraping world, which is also, of course, becoming really important for getting pre training data into AI models. Most of the time when you scrape datasets, you probably output them as either JSON or CSV. And you can even just think about purchasing datasets from vendors; the most common format you get that in is also a spreadsheet, either that or a giant JSONL dump of some kind, which is also effectively tabular. So with CSV being the lingua franca of data exchange, the language that everyone knows how to use, I think it's just been long overdue that we have better technology and better tooling. I think the other thing people think is that it's just been around for so long, so there's nothing we can do to make it better, even though all of the challenges that we've talked about today around spreadsheets and why they're hard to work with are well known. And in your work of building OneSchema, working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen your platform used?
Yeah. So I think when we first set out to build OneSchema, we thought of ourselves as a tool primarily for tech companies, tech companies that needed CRM migration or CRM import, or these sorts of tools that could be embedded into a tech platform to make it possible for customers to onboard or to integrate. That, of course, was in fact a really great market for OneSchema. But I would say the more surprising market is actually the vast world of not tech companies, but consulting firms, real estate firms, banks, accounting firms.
It turns out spreadsheets are even more painful and even more ubiquitous beyond the world of tech and tech companies. We see, for example, that it is in fact the companies that do not have very large, sprawling engineering teams and data engineering teams that have the biggest pain with spreadsheets. As you might imagine, if you have a team of 1,000 consultants, for example, but not a lot of data engineers, you have even more of a problem with spreadsheets and have probably built more complexity and more process automation on top of spreadsheets and Excel macros than a team with the most modern data stack rollout there ever was, with 20 layers of dbt handling this process. And so I would say that has been the most interesting aspect. We have found tremendous market pull from larger, more traditional enterprises and governments, for example, that have to deal with spreadsheets and actually have even more of a pain point: receiving spreadsheets on a daily basis, with a very brittle automation sitting on top of the SFTP server that relies on an Excel macro written 5 or 10 years ago, which continues to power the banking pipelines or the tax compliance pipelines for reporting to a government on a key business workflow.
Those are the workflows that we have been most surprised by, because they oftentimes exist in companies that are actually not what you would typically imagine as your traditional tech company.
[00:44:46] Tobias Macey:
And in your work of building the business, building the technology, working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:44:51] Andrew Luo:
I think the one big lesson, kind of along these lines, is something that Y Combinator, when we went through it, really taught us well: being close to your customer. The lesson we learned was that if you're not building your product or talking to your customer, everything else is irrelevant. That's the key lesson that drove us to this insight. It also turns out that oftentimes when you start off building a company, you're building the product for what you as a founder understand and know. For me, coming very much from tech startups, I understood tech companies. But as we continued building this company and continued talking to our customers, we realized that enterprises outside of Silicon Valley are really where a lot of the big spreadsheet problems lurk. Especially with AI, I think it's been a unique moment in time for us, as a lot of these traditional industries and traditional companies are really looking for ways to drive efficiencies in their teams and companies with AI. And as we've really leaned into that pull that we have seen, we have really opened up our eyes to all of the different aspects that we historically have not looked at. These different industries that we are less familiar with, we historically hadn't leaned into, because we were so focused on the tech companies that we understood. I think the unexpected lesson there is that truly listening to your customers, even if your customers don't quite look exactly like you, and following where your customers are pulling you is always a better strategy than trying to use intelligent design or blueprinting for your company.
Oftentimes, letting the market pull you where it wants to pull you is really important. Obviously, there are traps to just doing whatever the whims of the market and the flavor of the month dictate. But for us, I think listening to our customers, especially the customers that we understand less, opening our ears up to that and really allowing ourselves to go in that direction is what really made the difference for us.
[00:46:59] Tobias Macey:
For people who are working with CSVs and figuring out what their approach is going to be, what are the cases where OneSchema is the wrong choice?
[00:47:05] Andrew Luo:
That's a great question. I would say the landscape of data integration is really broad, and there are so many data integration tools out there: iPaaS tools, embedded iPaaS tools, ETL tools, etcetera. I would say OneSchema is the wrong choice for anyone looking for a point to point API solution, any sort of integration or migration workflow where you have a standard known input schema and a standard known output schema. When you know the input and you know the output and you're looking for a connector, there are so many better options than OneSchema for that. You have your ELT tools and your unified API tools, or you can just load it into the warehouse and then transform it there. There's really not that much that you need from OneSchema in terms of transformation.
You really just need a better extractor and a better loader, especially if you have 50 bajillion different connectors that you need. OneSchema, like I said earlier, is really more of a data prep, pre validation, validation, and transformation tool for semi structured data like CSVs, Excels, XMLs, and JSONs, whether it's a one time migration or whether it's recurring, and where you really want your business analysts to do that work. So that's the number one thing I would say: OneSchema is the wrong choice if you're looking for a point to point, API to API connection. But if you have unstructured or semi structured data and you have a large team of business analysts doing this work, that's really the key for knowing that you should be using OneSchema.
[00:48:40] Tobias Macey:
And as you continue to build and iterate on OneSchema, you mentioned a few of the things you have planned for the roadmap for 2025. I'm wondering if there are any specific projects or problem areas you're excited to explore.
[00:48:51] Andrew Luo:
Yeah. One of the ones I'm super excited about for this quarter is intelligent relational joins. Relational joins are historically one of the most difficult concepts not just to understand, but to perform, even inside of an Excel file with VLOOKUPs, and getting that to actually work is hard, especially if you have a nontechnical team. Intelligently detecting join keys, suggesting relational join patterns, being able to guide our users through left joins, inner joins, outer joins, and those sorts of nuances, and also helping and guiding our end users through debugging those.
That is one of the things I am most excited about from our intelligent transforms roadmap for 2025.
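For readers less familiar with the join nuances mentioned here, this small pandas sketch shows how left, inner, and outer joins treat unmatched keys differently; the data is invented and the example is generic, not a OneSchema feature.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 4], "total": [120.0, 85.5, 42.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Acme", "Globex", "Initech"]})

# Left join keeps every order, even when the customer is missing (customer_id 4 -> NaN).
left = orders.merge(customers, on="customer_id", how="left")

# Inner join keeps only rows with a match on both sides, silently dropping order 4.
inner = orders.merge(customers, on="customer_id", how="inner")

# Outer join keeps everything from both sides; the indicator column helps spot unmatched keys.
outer = orders.merge(customers, on="customer_id", how="outer", indicator=True)

print(left, inner, outer, sep="\n\n")
```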
[00:49:39] Tobias Macey:
Are there any other aspects of the work that you're doing at OneSchema, the overall challenge of dealing with CSVs, or the ecosystem around CSV as a data interchange that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:49] Andrew Luo:
Yeah. I think the number one way you would know if you have this problem is actually simply looking at your team: the onboarding managers, implementation managers, client services analysts, and sales and solutions engineers that are really mired down by this problem. I think a lot of the data engineering and data world is really focused on internal problems, whether it's reports and dashboards and BI and analytics to inform exec teams and leadership.
I think that is where data teams are the best and are really, really effective, because the centralized model of data tooling has been designed for that. What I see, because we are a cross company data exchange tool, is that it's when data teams are pulled into these client facing data workloads that are unique for every single customer, whether it's an integration or a migration workflow, that we find the data tooling as we know it is somewhat lacking. We see people using really powerful ETL tools and pointing them at each individual client and their spreadsheet format, or using very heavy dbt style scripts for that. And those are the cases where the shift left paradigm that we have seen, enabling the end users, the client facing teams, and the business analysts to perform validation and prevent the bad data from even entering the warehouse, is super important, not least because every data team has a giant backlog of work that they haven't gotten around to. And so I would say that the paradigm shift we've really seen at OneSchema is especially around client facing and client services types of data work. We really see that traditional data tools are lacking when it comes to being able to handle the flexibility and the one off nature of client to client data integration and data interaction.
[00:51:54] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology for data management today.
[00:52:10] Andrew Luo:
Yeah. That's a great question. I would say there's so much data tooling in the world today, and we are only working on a very small slice of it. Even still today, when it comes to data management and data tooling, the world of unstructured data, I think, is still a gap, even beyond what OneSchema can tackle. I think there are a lot of really interesting companies working on, for example, PDFs and other types of images and unstructured data. With unstructured data in general, we have seen lots of companies ask us for things beyond spreadsheets that we can't handle: more complex file types, even things like tax documents, customs documents, and these sorts of use cases. I think, with the advent of a lot of these LLM based technologies, other forms of document processing specific to particular industries and use case workflows will be a gap that can be solved for the first time with LLMs and AI, and that will drive a lot of efficiency as well when it comes to the large, manual offshore teams that are reading through and parsing, say, an invoice document or a customs document or a tax document, for example.
[00:53:23] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at OneSchema and the hard-won lessons around CSVs and all of their applications that you have gained in the process. It's definitely a very interesting problem domain and one that, as you pointed out, is still as ubiquitous as ever. So I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Awesome. You as well. Thank you so much for having me, Tobias. Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to OneSchema and CSV Challenges
The Ubiquity and Challenges of CSVs
Cross-Company Data Interchange and CSVs
Challenges in CSV Export and Import
The Genesis of OneSchema
OneSchema's Solutions and Interfaces
Expanding Beyond CSVs
Surprising Uses of CSVs
When OneSchema is Not the Right Choice