Summary
Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. IllumiDesk was built to take advantage of this intersection. In this episode, Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data-driven experience for learners.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI powered online learning platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what IllumiDesk is and the story behind it?
- What are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences?
- How are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware?
- What are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors?
- What are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports?
- What are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content?
- Can you describe how you have architected the IllumiDesk platform?
- How have the design and goals shifted since you first began working on it?
- What are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities?
- What are the failure modes of the content generation that you need to account for?
- What are the most interesting, innovative, or unexpected ways that you have seen IllumiDesk used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on IllumiDesk?
- When is IllumiDesk the wrong choice?
- What do you have planned for the future of IllumiDesk?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- IllumiDesk
- Generative AI
- Vector Database
- LTI == Learning Tools Interoperability
- SCORM
- xAPI
- Prompt Engineering
- GPT-4
- Llama
- Anthropic
- FastAPI
- LangChain
- Celery
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack. You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products.
Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free. Your host is Tobias Macey, and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI-powered online learning platform. So, Greg, can you start by introducing yourself?
[00:01:35] Unknown:
Sure. My name is Greg Werner, as you said. And I am the cofounder of IllumiDesk. We are based out of Atlanta and Utah mostly, and we have a small team in Ukraine as well. And do you remember how you first got started working in data? I do. So my previous venture was very tightly focused on e-invoicing for banks on the sell side of things, and their transaction volume was extremely high, particularly at the end of the month. And on the supply side, or buy side of things, we integrated supply chains. So that's when, at a professional level, I really started to get involved with data engineering, because of the massive amounts of data that we had to deal with in short time spans.
So those documents were structured as XML documents and had a specific schema. So we had to convert that XML document, store it in a database, put it in a data lake, process documents asynchronously, and that kind of thing. Nothing exciting, but it just had to work, and it had to be performant. And any second of performance advantage that we could get there was not only extremely cost beneficial for us, but also put a big smile on our customers' faces. So now with IllumiDesk, the data part of it is really interesting because of the generative AI and the embedding models that we have to interact with, and how the content is loaded and split and stored into vector databases, and then how we retrieve that data to improve the model's context in order to create these wonderful courses that we're building with AI.
And as a side note, in grad school I did take a few data science classes, and that's where I also got exposed to data and quickly understood that it's garbage in, garbage out. So I understood that the data engineering pipeline was a must-have for any data science endeavor to work. That also got me into understanding the intricacies of, you know, relational databases, NoSQL databases, and pipelines, and streaming, and all that stuff. So if that stuff doesn't work, obviously, none of the toys that we're all talking about today would work either.
[00:04:18] Unknown:
And so in terms of the IllumiDesk project, you mentioned a little bit about using AI for content generation. I'm wondering if you can just give an overview of the focus of the project and what it is about this problem space that is making it worth your while to invest the time and energy into building it.
[00:04:38] Unknown:
Yeah. So the first idea for this started a few years ago, and the problem space that we were trying to deal with was helping instructors and content managers save time when building and distributing their content, or surfacing their content to their learners. And, ultimately, the goal for any instructor, or any organization that has instructors and content managers, is to improve learning outcomes, which lead to improved productivity at work and a better learner experience. You know, especially in the realms of K through 12 or higher ed, they maybe have a more traditional approach to assessing student outcomes, but the student should still be happy, and there's still competition between different colleges and universities to attract and retain the best talent, if you will. But the problem space we were dealing with was that the instructor is also the content manager in many situations, which sort of reminds me of when you're a data scientist, you sort of have to dabble in the data engineering aspects of your project setup.
Or if you're a front end developer, you also have to dabble in deploying your application to something like, you know, Vercel or Netlify, because it's just really hard to have a budget that would allow you to recruit and retain a content manager and then also recruit and retain an instructor. And a lot of companies just start with the instructor, maybe even as a third party consultant. So what we're trying to do is help that persona use templates and AI to create high quality courses that they can then deliver to their students, and then also manage that course over time so that it has up-to-date, relevant data that they can train their learners with. I hope that made sense.
[00:06:52] Unknown:
Yeah. Absolutely. And so with this idea of generative AI, using it in an educational context, I like your point of using templates for managing the structure of that. And for educators and content creators, I'm wondering if you can talk to some of the challenges that they face in being able to develop and maintain the course materials and content for their target audiences, and in particular, being able to adapt that content to learners, particularly if they might be at different stages of sophistication for the subject matter at hand?
[00:07:25] Unknown:
Yeah. So there are several standards in the industry. Some are focused on specific verticals. The higher ed or K to 12 vertical has a standard called Learning Tools Interoperability, or LTI, which is a way for the platform or the learning management system to interact with external tools, such as a plagiarism detector, a course builder like ourselves, etcetera. And then in the private sector, there are other standards like SCORM, and then there are different versions of SCORM because that standard's been around for a while. There's xAPI, among others. So to answer your question, I think part of the issue that content managers and instructors have when developing courses is that it's hard for them to create the content and distribute it in a format that's compatible with the formats that everybody else needs to surface it to their own learners. For example, I may be a content manager at a training agency that focuses specifically on developing content, and you happen to use PowerPoint only for teaching your courses, and then someone else may use, you know, Google Docs or Word documents. Someone else may use Notion, or something more interactive, or a headless CMS.
So, you know, the possibilities are somewhat endless. For us, the way we thought about it is that we had to develop sort of a canonical internal document schema, structured as JSON, that would allow us to inject content into different places within that schema, whether manually or with the generative AI, and then use standard transformation tools to export from our canonical schema to other formats, whether that be a package format like SCORM or xAPI that I mentioned, or specific documents in a folder collection, like, you know, Google Docs or Word docs or something like that. And then on the importing side of things, it works the same way. So we can retrieve data from different data sources and inject it into our internal canonical document format, so that we can obtain or fetch content from different sources, including directly from the generative AI models.
So that's kind of the way we approach this problem space.
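To make the canonical schema concrete, here is a minimal sketch of what such an internal course document and one export transformation might look like. The field names are hypothetical illustrations, not IllumiDesk's actual schema.

```python
# A minimal sketch of a canonical course document as a Python dict.
# All field names are hypothetical, for illustration only.
canonical_course = {
    "title": "Intro to Data Pipelines",
    "modules": [
        {
            "title": "Loading Data",
            "blocks": [
                {"type": "paragraph", "text": "Every pipeline starts with extraction."},
                {"type": "code", "language": "python", "source": "print('hello')"},
            ],
        }
    ],
}

def export_markdown(course: dict) -> str:
    """One example transformation: canonical schema out to Markdown.
    Exporters for SCORM, xAPI, or Google Docs would walk the same tree."""
    fence = chr(96) * 3  # builds the triple-backtick fence marker
    lines = [f"# {course['title']}"]
    for module in course["modules"]:
        lines.append(f"## {module['title']}")
        for block in module["blocks"]:
            if block["type"] == "paragraph":
                lines.append(block["text"])
            elif block["type"] == "code":
                lines.append(f"{fence}{block['language']}\n{block['source']}\n{fence}")
    return "\n\n".join(lines)
```

Importers would run the same mapping in reverse, populating the block list from whatever source format the content arrives in.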
[00:10:05] Unknown:
And generative AI is gaining everybody's attention right now because of the kind of headline capabilities that people are touting: oh, I can ask it this question, and then it will answer in a manner that is comprehensible and, for the most part, generally accurate. But for the case of educational material, you typically need to go beyond just generally accurate, and you need to have some validation and confidence in the content that it's producing. Because as an educator or as a content creator, you are representing yourself through this educational material. So you want people to have confidence that what you're saying is factual and accurate. And I'm wondering if you can talk to some of the ways that, in your platform and in the workflow of these content creators and content managers, they go through that process of validating the output from the AI models, as well as reducing the burden on them to double check everything that is being produced.
[00:11:08] Unknown:
Yeah. So I think that's even more relevant in verticals that are extremely strict with compliance requirements. Something like health care, for example: that content better be accurate, and it better cite all of the sources, or you're in big trouble. For other verticals or use cases, maybe it's not so strict, but it's still extremely relevant. So, I'm sure you're familiar with the term hallucinations and the things like that that generative AI is famous for. You wanna avoid that wherever possible, or at least detect when the generative AI is providing you with inaccurate results. So I think we can parse the question out into two spaces.
The first one is, how do we send the request to the generative AI so that we can limit or reduce the risk that there will be hallucinations or inaccurate data returned to us in the generative AI's output. And then, also, how do we validate or evaluate the quality of those outputs, including, as you mentioned, where we can automate those things, because it can be a very tedious task. So I'll start with the context part. What we do here involves a lot of experimentation, also known as prompt engineering, where the user has to instruct the generative AI model how it should behave: by identifying, perhaps, the persona that the generative AI model should emulate, and by providing the generative AI model with additional context. For example, saying, I'm Tobias Macey, and I have a data engineering podcast, and I would like to ask some concrete questions about XYZ. That's a much better way to guide the generative AI into providing more relevant responses to you.
And then, from a data engineering standpoint, there are also ways of structuring the request to the generative AI where you can instruct it to respond in specific formats or schemas. For example, if you're asking the generative AI to inject content into certain placeholders within a template, say a layout template, you may need to instruct the generative AI to respond with a specific JSON schema. That could include, for example, a paragraph key with the text for that paragraph, and then another key, like a code key, with the content for that code block, instead of, or in addition to, instructing the generative AI to respond in Markdown format. You could have the same headers and code blocks in the response, but structured as JSON instead of Markdown. So those are a couple of ways that you can guide the AI to give you better responses, and also to structure them in a way that your own systems can parse out the results and then import them into the course. And then, one of the big surprises for us was being able to structure the format in such a way that it's more relevant for the use case, a learning use case, if you will. As you know, learners learn in different ways, and one of those ways could be, for example, a question and answer chatbot, which everybody's familiar with these days, I'm sure. But how you interact with a knowledge base in a question and answer format is very different from having the generative AI feed you content in a free-flowing form of text. It comes down to how we store the data. If you have a JSON schema that holds an email, for example, and you feed it to the generative AI, you might ask, how do I want this email to look in a question and answer format? So you would say something like, who is the author? And the response would be, the author is Tobias Macey. Then you ask, what is the content or the main ideas of this email? And the AI could parse out the email and provide you with a summary or the main idea of each paragraph. And then you can store that complete result in the question and answer format, so that when you do use something like a Q&A chatbot, it's accessing the same content, but structured in a way that's better handled by the Q&A format in that chatbot interface.
So there are different ways to store the content as well, depending on how you're retrieving it for the purposes of creating learning environments. Should I stop there?
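As an illustration of the structured-output technique described above, a request might be assembled and parsed roughly like this. This is a hedged sketch: `call_model` is a placeholder for whatever vendor SDK is actually in use, and the schema is invented for the example.

```python
import json

def call_model(prompt: str) -> str:
    """Placeholder for the actual LLM client call (vendor SDK not shown)."""
    raise NotImplementedError

prompt = """
You are an instructional designer writing for working data engineers.
Fill in the lesson template placeholders and respond ONLY with JSON
matching this schema: {"paragraph": "<explanation>", "code": "<example>"}
Topic: incremental loading in ETL pipelines.
"""

raw = call_model(prompt)
try:
    block = json.loads(raw)  # fails loudly if the model ignored the schema
    paragraph, code = block["paragraph"], block["code"]
except (json.JSONDecodeError, KeyError):
    # Malformed output is a known failure mode of schema-constrained
    # prompting; a retry or repair step would go here.
    paragraph = code = None
```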
[00:16:44] Unknown:
I think that's a good spot to tee off into the next thing I was gonna ask about. And with that aspect of the learning environment, and the fact that you are building a generalized platform to allow people to educate or teach people across different problem domains, it also brings up the question of how you allow them to provide their own contextual cues to the AI to ensure that the content that is being produced is relevant to the topic that they are trying to address. And I'm wondering if you can talk to some of the data integrations, or ways that the people who are using your platform to build this content are able to bring their own information to populate those contextual aspects, and either bring their own vector DB or load data into your platform to give that information to the AI, so that it's producing useful content?
[00:17:35] Unknown:
Yeah. So I think this is where LLM frameworks, and those concepts in general, really help us solve use cases where companies or organizations of any type have to train their staff, potential employees, partners, or students in higher ed or K to 12 with specific data or content that wasn't used to train the generative AI model. So just a little bit of background, and I'm sure your audience knows this, but I'll go ahead and provide a summary: generative AI models like GPT-4 or Llama or Anthropic's models, and there are more popping up every day, are, generally speaking, called foundational models, and they're trained with a general corpus of publicly available information on the web. And sometimes it's public, but it's gated, like Stack Overflow's.
Their content is public to us, but they don't allow generative AI models to scrape their data and then be able to answer programming questions, for example, that were in the Stack Overflow database. So even though the information may be public, it still may be gated somewhat. So if you have a specific use case for content that needs to be transformed into a course deliverable or Q&A deliverable, then what you need to do is a few things. The first one is to load the content from the source. Let's just use a hypothetical large organization in the health care space.
So this hypothetical organization has a ton of content internally. They may have content in images. They may have it in video. They may have it in PDFs. They may have it in, you know, traditional relational databases. They may have it in internal knowledge bases that are managed by third party vendors. So if you wanted to teach a course, for example, on, give me the lowdown or the summary on the content for medical machines used to do knee surgeries, and I'm not even sure I'm using the right terms here because I'm not a health care guy, then it would have to pull from a variety of data sources in order to provide better context to develop that course.
So that's one thing. And that pipeline usually consists of loading the document and then splitting the document into chunks. The generative AI model usually has a limited number of tokens that it can accept in the request in order to create a response. It could be 4,096 tokens. It could be 8,192 tokens, etcetera. So there is a limit, whether it's large or small. So all of that content is loaded, and it's transformed into raw text, usually something like ASCII. You're stripping things like HTML tags, you're converting things from PDFs to raw text, things like that. And then there's a splitting job.
And once it's split, there are certain parameters that you can put in place so you can have overlaps between the split chunks of text. And then there's a separate model that you can call to assign mathematical values to each chunk. And that's why vector databases are just so prevalent now: once you have mathematical values, basically floating point numbers, assigned to each chunk of text, all of that is stored in a vector database. And there's a vector space where you can get a top-k result. There's a lot of math involved, and I'm not an expert on the math, but there's a relationship between the numbers and a similarity between how closely related those numbers are. So when you retrieve context or data from that vector database and there's a specific term in your query, for example, you say in a Q&A chatbot, please provide me with the most relevant techniques to do knee surgery with this new machine that we have, then it can provide you with concise answers because it has better context.
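Since LangChain comes up later as part of the stack, the load/split/embed/store pipeline described here might look roughly like the sketch below. It assumes LangChain's classic module layout and a FAISS store; both are illustrative choices, not a description of IllumiDesk's actual code.

```python
# Assumed dependencies: langchain, faiss-cpu, openai (classic API layout).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

raw_text = open("knee_surgery_manual.txt").read()  # already stripped to raw text

# Split into overlapping chunks that fit within the model's token limit.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(raw_text)

# Embed each chunk and store the vectors for similarity search.
store = FAISS.from_texts(chunks, OpenAIEmbeddings())

# Retrieve the top-k most similar chunks to ground a generation prompt.
context = store.similarity_search(
    "most relevant techniques for knee surgery with the new machine", k=4
)
```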
So there is a pretty involved data engineering pipeline. Obviously, you can use open source LLM frameworks, but those are just abstractions over the more traditional ETL and transformation pipelines that you've talked about so much on your podcast, and over how you store that information in a vector database. And then, obviously, in this hypothetical organization's vertical, there's also a lot of cleaning going on to remove personally identifiable information if it is in the data source. So that's to establish context. But then there's another aspect of this.
If the model isn't trained to understand those medical terms, then it's gonna return garbage. So the model has to be fine-tuned. There's usually a machine learning ops, or MLOps, pipeline, which has been around for a few years now, and I'm sure your audience is also very familiar with it. It's basically grabbing a foundational model, and, you know, we may use a privately hosted version of GPT-4, or we may grab a Llama model and host it in house. And then you use a training and test set of data where, in many cases, there are humans involved, a human in the loop, to label the data and to validate whether the outputs created by the generative AI model are indeed accurate. So you need trained staff to validate that the generative AI is providing the expected results.
So that information is used to fine-tune the model. And then, once the model is deployed to production, MLOps comes into play because you can use users' feedback. I'm sure everybody's seen the like or dislike buttons that are common with generative AI chatbots. That's basically a way for us to tell the system, or the LLM framework, whether or not a response is accurate, and that gets fed into the training and test data in order to keep iterating on and improving the model's performance. So those are the two or three pieces, and the same thing would have to be done for the embedding model, of course. And then generating images, or interpreting images and converting them into text, also starts from foundational models that get fine-tuned. These generative AI models can be multimodal, in the sense that they can understand not only text: you can also send images or videos to improve context directly with that media, instead of having to, for example, transcribe a video to text, or describe an image in text, and then send that text to the generative AI. You can send the image directly.
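The like/dislike feedback loop described above might be captured with something as simple as the sketch below; the record shape and file-based storage are hypothetical stand-ins for whatever the platform actually uses.

```python
import datetime
import json

def record_feedback(response_id: str, user_id: str, liked: bool,
                    path: str = "feedback.jsonl") -> None:
    """Append one thumbs-up/down event; batches of these later become
    labeled examples in the fine-tuning train/test sets."""
    event = {
        "response_id": response_id,
        "user_id": user_id,
        "label": "accurate" if liked else "inaccurate",
        "ts": datetime.datetime.utcnow().isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```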
Now, to answer the second part of your question: how do you validate the results or the quality of the generative AI models, and how do you automate that evaluation process? That's a big focus for us, because we are developing courseware, and we need to make sure that the source of the content is accurate, but also that the quality of the content meets a minimal level of criteria. So for this hypothetical health care company, you can store in the vector database the source where the content was obtained, or you can have third party data sources that you fetch from with similarity search. You can say, okay, this content was fetched from this URL or from this knowledge base article, and you can append, as part of the response schema, not just the content but also the source it was obtained from.
And then you can have an evaluation model that is checking whether the content provided by the generative AI is in fact located in that source, with a certain statistical measure of accuracy. There are different ways to do that, but you can grab a sample of the responses, grab all the sources for that content, and then do a check to see if there's a statistical likelihood that it's not accurate. And then you can flag that content so a human can go in and manually check whether or not it's accurate. So there is some parameterization involved, where you can say, well, it has to be 99% accurate, or 85% accurate, and that's something that you would need to adjust over time. And then, hopefully, with time, the results would get more and more accurate, so you can go from, for example, 85% to 90% to 95%.
So that evaluation model is working automatically. And then the last thing is the process to evaluate the evaluation model itself. You can have an auditor go into the company and look at that evaluation service, if you will, but then they're gonna have to see audit trails and a methodology: how does that evaluation model work, when was it run, which humans signed off on it, etcetera, so that the auditor can go ahead and provide the company with an attestation report that the content they're providing to their health care employees is indeed accurate.
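A bare-bones version of the automated grounding check described here might compare a generated claim against its cited source chunk in embedding space. This is a sketch under assumptions: `embed` stands in for the embedding model, and the 0.85 threshold is exactly the kind of tunable parameter mentioned above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for the embedding model call."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_grounding(claim: str, source_chunk: str,
                    threshold: float = 0.85) -> bool:
    """True if the claim is plausibly supported by its cited source;
    False flags the content for human review."""
    return cosine(embed(claim), embed(source_chunk)) >= threshold
```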
[00:28:08] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And then the other aspect of data in this platform: we've been talking about being able to bring data into the context for the AI, and we've talked about using AI for generating the content, so we've largely been discussing the content creator or educator standpoint. But on the other side of the equation is the learner, and being able to understand their experiences and interaction with the platform. I'm curious if you can talk to some of the opportunities for gaining insights from learner interactions with the platform, to feed back into the content development and content update workflow, and some of the ways that you're thinking about the useful pieces of information to collect from that learner engagement?
[00:29:39] Unknown:
What we've learned so far, and this isn't rocket science by any means, is that being able to normalize the data based on how learners interact with it, so that the data can be used for analytics, and perhaps even for machine learning models that could recommend better approaches to surfacing that content, is really hard if the course material is not standardized. If you have a course that's completely asynchronous, as in the instructor isn't actively involved with you in a live session, it's just something that you take on your own and the instructor provides you with feedback asynchronously when they have time to do so, then that, I think, is a more standard way of providing the platform with data points that are consistent across students.
However, it does restrict students to learning in one way. So if you're a student that would rather learn with audio, instead of clicking around with your mouse in your web browser after logging into the web application, then those data points are gonna be very different. So what we're trying to understand is, how do we assess the learners when they first start the course to understand their personality traits, their learning traits, their desired outcomes? And then, based on those features, we can recommend the most likely path for them to improve their outcome. It could be a series of audio or videos, or a series of chapters in an ebook, and then we can monitor how their learning outcomes compare to learning outcomes from other students.
And another big data point that's important for us to understand is the level of engagement. Generally speaking, if the student is engaging a lot with the course, that means they're probably gonna learn more. So having features such as gamification, or other aspects of the course material that may improve accessibility, such as for the blind and deaf, or, you know, improving contrast, the size of the fonts, just basic things like that, also may help improve engagement. But it's also about how to use generative AI to restructure the content into different tones.
Someone may learn with a more humorous approach. Some may learn with a more factual or dry approach. So we can also use the AI to improve the likelihood that the learner is gonna engage more with the content. If student A is more likely to engage with the content through video, then we can translate the text that we have for the course, and the images, into video with, you know, a human-looking avatar explaining the content. Other students may prefer more interactivity, or quizzes, or live coding exercises, as in a microlearning approach.
Because if they're like me, they're, you know, probably ADHD or something, so they need small snippets of learning content. And others may just learn with raw text and images, in the more traditional sense of a textbook. But the AI can help us transform the content, very similar to how the context is obtained by transforming different data sources into a structure that can be stored in the vector database. So if you have course content that's all text, and that course content was obtained from a variety of sources and stored as text, then you can also convert that text into other formats, such as video, and then you can measure the interactivity specifically with those videos. So you can have videos with calls to action, with little quizzes within the video, or you can have that same quiz reside within a text block in a more traditional ebook type of environment. I don't know if that answered your question.
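The kind of normalized interaction data point described here could be modeled on the xAPI actor/verb/object statements mentioned earlier, so that a video view and a quiz attempt land in the same analytics table. The field values and the toy metric below are hypothetical.

```python
# One normalized learner event, loosely in the actor/verb/object style of xAPI.
event = {
    "actor": {"id": "learner-42"},
    "verb": "completed",  # or viewed, answered, paused, ...
    "object": {"type": "video", "id": "module-3/intro"},
    "result": {"score": 0.9, "duration_s": 312},
}

def engagement_score(events: list[dict]) -> float:
    """Toy engagement metric: fraction of scored events at or above 0.7."""
    scored = [e["result"]["score"] for e in events
              if "score" in e.get("result", {})]
    return sum(s >= 0.7 for s in scored) / len(scored) if scored else 0.0
```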
[00:34:14] Unknown:
Yeah. And so now, talking through the implementation of IllumiDesk, I'm curious if you can talk to some of the ways that you've thought about the architecture of the platform, particularly given the fact that you're aiming at this rapidly evolving space of generative AI and the types of models, and an as-yet-to-be-cemented stack for being able to interact with these systems, and just some of the ways that you've thought about the design and implementation of that overall product.
[00:34:47] Unknown:
Yeah. So our stack is mostly a Django Python back end. We have Django with the Django REST Framework for our back end, with traditional RESTful endpoints for CRUD operations. And then we have some microservices built with the FastAPI framework, and those microservices are used sort of as building blocks for the as-yet-to-be-cemented LLM framework. So we're having to piece together things as microservices that can scale both vertically and horizontally within our Kubernetes cluster, which is hosted by one of the big cloud vendors.
So, for example, one of our FastAPI microservices is tasked only with the data transformation part of the data engineering pipeline. Even though we could use a more robust ETL tool, we're actually using LangChain, mostly because it's Python, and we're using a lot of the transformer classes that they already have available. And if something doesn't meet our criteria for performance or reliability, then we may use a more battle-tested approach, whether it be a managed service or something like connecting to a Spark cluster for ETL.
So each part of our data pipeline is sort of split into a microservice, and Django sort of bootstraps that together with Celery jobs. We have Celery jobs that run as a pipeline, hooking these different microservices together, including spawning pods in our Kubernetes cluster when we need to run code. One of the features that we have in our learning platform is being able to create code exercises and connect to a back end runtime with a Jupyter kernel, which allows users to test or run their code. We also have an auto grader that may use unit testing to evaluate whether or not the student's answer was correct. So part of that pipeline is also interacting with our orchestration service, Kubernetes, to run some of those things. And our front end is mostly a React application.
So we interact with our back end through WebSockets whenever we need to stream updates to our React front end, and with traditional CRUD requests and responses through our REST API.
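A Celery pipeline of the kind described, chaining microservice-backed steps so each task's return value feeds the next, might be wired up roughly as below. The task bodies and broker URL are placeholders, not IllumiDesk's actual code.

```python
from celery import Celery, chain

app = Celery("pipeline", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def load_document(url: str) -> str:
    ...  # call the ingestion service, return raw text

@app.task
def split_document(text: str) -> list[str]:
    ...  # call the FastAPI transformation service, return chunks

@app.task
def embed_and_store(chunks: list[str]) -> str:
    ...  # embed the chunks, write to the vector database, return an id

# chain() passes each task's result as the first argument of the next task.
result = chain(
    load_document.s("https://example.com/handbook.pdf"),
    split_document.s(),
    embed_and_store.s(),
)()
```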
[00:37:32] Unknown:
And as you have built out this platform, going from the initial idea to where you are today, I'm curious what have been some of the assumptions that you had going in, or some of the ways that you thought about the problem, that have changed or evolved over that time?
[00:37:51] Unknown:
I think the biggest surprise for me is that it always goes back to the basics from a data engineering standpoint. When I first was exposed to the term LLM framework, I was like, oh, wow, that's cool. That's new. Right? But I very quickly saw that the LLM framework is really just a new term, a new title, for things that we've known and loved for many, many years, which is basically a data engineering pipeline. I think the LLM framework does deal with things that are more focused on the generative AI landscape. For example, some of these open source frameworks do have retrievers that are focused specifically on vector databases, and they're, you know, ready out of the box.
And the communities that support these frameworks are, in many cases, the vendors themselves. As you know, many new vector databases have spawned in the ecosystem, purpose-built for this. But other than that, if we're talking about document loading, I think we all know that's pretty traditional. Document or data transformations, I think that's pretty traditional. Calling an embedding model to, you know, represent chunks of text with numbers, and labeling our data, those have also been around for years.
So the more we try to incorporate an LLM framework into our back end stack, the more we move toward developing a battle-proof, end-to-end data engineering pipeline that just so happens to interact with the elements that we need from a generative AI framework. Part of that pipeline is calling the generative AI model to get a response, storing it, and then providing that response to the end user. So for me, that's been my biggest takeaway from our adventure so far in this space. I'm sure there's a lot to be done still, and I'm sure we all have our own predictions on what could happen in the future.
But moving forward, we just wanna leverage things that we know work and that have been battle tested wherever possible. And then if one of those tools doesn't have the capability that we need, we can just have small snippets of code that handle the things that we need. For example, if we have a specific parsing problem that we can't solve within the open source or, excuse me, battle-tested solution, then we can develop a small snippet of code to, for example, convert fenced code in Markdown to a specific block in a JSON schema.
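That kind of small parsing snippet might look like the sketch below: a few lines that lift fenced code blocks out of Markdown and into blocks of a canonical JSON schema. The block shapes are hypothetical illustrations.

```python
import re

TICKS = chr(96) * 3  # the triple-backtick fence marker
FENCE = re.compile(TICKS + r"(\w*)\n(.*?)" + TICKS, re.DOTALL)

def markdown_to_blocks(md: str) -> list[dict]:
    """Split Markdown into schema blocks, turning each fenced code
    block into a {"type": "code"} entry."""
    blocks, last = [], 0
    for match in FENCE.finditer(md):
        text = md[last:match.start()].strip()
        if text:
            blocks.append({"type": "paragraph", "text": text})
        blocks.append({"type": "code",
                       "language": match.group(1) or "text",
                       "source": match.group(2).rstrip()})
        last = match.end()
    tail = md[last:].strip()
    if tail:
        blocks.append({"type": "paragraph", "text": tail})
    return blocks
```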
[00:40:28] Unknown:
And in terms of the application of IllumiDesk, we've talked largely about the platform, the features, the capabilities. For people who are looking to build content, what are some of the ways that you have seen it used most broadly? Is it largely for paid courses, or for, you know, individual practitioners to be able to share their knowledge? Are you seeing it used for internal company trainings, for being able to stay up to date with internal technology stacks? I'm just wondering if you can talk to the workflow of onboarding into IllumiDesk, thinking about the content creation workflow, and being able to use the AI capabilities to iterate on the development process there?
[00:41:18] Unknown:
Sure. So when we first started, our niche was very focused on the data science training space, partly because of our background at the company, and also just because we knew the personas and the problem, you know, and the content that they were trying to teach. And what we incorporated at that time was the feature, as mentioned previously, that connects to Jupyter kernels to allow instructors to create coding exercises and content with native Markdown, so they can import their Jupyter notebooks directly into our authoring tool, which we call the activity editor.
But very quickly, we saw that the data science subject is used by many professions, whether that be financial, health care, or sports. You know, obviously, there's a lot of statistics and data science in sports. So we became exposed to other departments, if you will, once we had the data science use case shipped. And now those instructors and those content managers are saying, hey, I need a compliance course for my company, and my company is in the insurance vertical or in the health care vertical, and this compliance course needs to be based mostly on my private internal data.
And the generative AI model that I'm using actually also needs to be private. So we don't wanna have a situation where we're grabbing people's private data and then sending it to a publicly hosted generative AI model. So I think for us, moving forward, the most surprising thing is the breadth of the types of content you can create, as far as use cases go. You can do vendor training. You can do compliance training. In higher ed, we have a customer doing a robotics course with a programming language called Julia. So if you can mix and match these Legos, if you will, or pieces of the LLM framework puzzle, where you can bring your own model, get your own context, and then also structure the course in the layout that makes the most sense for your learners, and then evaluate the outcomes for your learners, in addition to the generative AI's content during the authoring process, then the possibilities are somewhat endless, and for us a little bit overwhelming. So we're trying to stay focused on very specific use cases so that we're not trying to be all things to all people.
[00:44:05] Unknown:
And in your work of building the platform and putting it in front of people to develop their own content, what are some of the most interesting or innovative or unexpected ways that you've seen IllumiDesk used?
[00:44:14] Unknown:
So the first use case that we had was to build courses. We went to market with a traditional course authoring and learning management system paradigm, where you have a course authoring tool to develop your courses and your content. Whether or not you choose to use a template or start from a blank slate is up to the content manager and/or instructor. And then, on the other hand, we have the system of record with the learning management system. But what we're quickly finding out is that people are using our generative AI models, and the chat-with-AI interface that we incorporated as a first class citizen in our user interface, where the instructors and content managers do a question and answer session with their internal documentation, whether it be a knowledge base or a corpus of PDF documents.
We run the jobs to load the data, everything we've talked about, run the data engineering pipeline, if you will, to improve context. And then, based on the interactivity that they have with that Q&A chatbot, they can copy that information into the authoring tool to start building their courses, or enhance the template that we already provide them. That was probably the most surprising workflow that we saw when we shipped our tool, and the ask for the chat with AI was so great that we just incorporated it, like I said, as a first class citizen. And the other one that was surprising to us is surfacing the content not just as a course, but also in other formats. A mini series of blog articles is, I think, a pretty common test bed for, hey, I'm thinking about writing a book, which is obviously another way to learn content, but I would like to test how many people might be interested in this subject by having a mini series of blog articles first, sort of as an MVP. So having the platform become more of a content management system is probably the most surprising thing to us. It sort of turns us into a CMS platform for content managers and instructors, more than a learning management system with an authoring tool to develop courses.
[00:46:35] Unknown:
And in your experience of building the platform, building the business around it, and working with some of your customers, what are the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:50] Unknown:
I think the most interesting and challenging parts of it are how to deal with organizations that have strict privacy requirements but also wanna leverage generative AI wherever possible to improve their level of productivity. And, obviously, improved productivity for them means, you know, dollar signs. They don't have to hire expensive consultants to develop the content and then have in-house or other consultants maintaining the content, as you said previously. So if they can have generative AI do 80% of that work, then that's a big win for them. However, they're also very concerned with privacy requirements and hallucinations from the generative AI. So the biggest challenge for us is having a platform that a third party auditor can attest is both secure from a privacy standpoint and also providing content that is accurate, and having our customers not perceive what our tool can offer them as risky.
So I think those are the biggest challenges for us. But on the other hand, they're also extremely interested, regardless of whether it's with us or with someone else, because, obviously, many people are incorporating generative AI into their platforms. You know, they're really pushing for having a framework in place that would allow them to take advantage of these tools.
[00:48:20] Unknown:
And as people are thinking about building content, trying to share their knowledge, what are the cases where IllumiDesk might be the wrong choice?
[00:48:31] Unknown:
I don't think IllumiDesk is a good platform when you need very specific content in a format, from a specialized vendor, that is better suited for your learners. However, we can interact with those other solutions through their APIs. I'll give you a specific example. If you wanted to create a product demo video, and there are a few vendors out there doing that, and you determine internally that all you need is a human-looking avatar created by an AI talking about your product, and that that's good enough for teaching your content, then we're probably not a good fit. What our platform does is let you hook those videos into the material. But if you need other things, like improving context, bringing your own model, being able to leverage things like coding exercises, auto grading, analytics, or reports on how the learners interact with your content, and things of that nature, then we're a better fit. And as you continue to build and iterate on the product and the problem domain, what are some of the things you have planned for the near to medium term? I think that our ability to integrate with other platforms is key, not only from a data engineering standpoint. For example, if a vendor has a privately hosted vector database, or one hosted in the vector database vendor's cloud, a cloud-first approach, if you will, we want to allow them to select that vector database as their data source for context.
So allowing the customer to mix and match the different pieces of that yet-to-be-cemented LLM framework that you mentioned previously would really help them meet compliance requirements, but also help them experiment with different pieces of the puzzle in order to improve results. Another example is the embedding model, the generative AI model itself, or calling other models in the data engineering pipeline. So doing things like transcribing 500 videos that you may have from a podcast to text, and being able to retrieve that information in a Q&A format, would also be something that we could do with an API-first approach, or we may call a service that offers audio-to-text transcription so that we're not building it in house. So those are the things that we see moving forward as important.
And, you know, we have a beta version of our Zapier integration now. That was just the first approach, but we're gonna continue to leverage those integrations moving forward so that our customers have more flexibility.
[00:51:15] Unknown:
Are there any other aspects of the IllumiDesk platform, or the overall space of AI generated content, or the challenges of building educational materials, that we didn't discuss yet that you would like to cover before we close out the show?
[00:51:24] Unknown:
I think we pretty much covered everything, generally speaking. I think one of the key aspects for us in all of this, again back to the data engineering pipeline, is the formats that users may already have courses built in. A lot of companies obviously are already delivering courses, but those may be in an older version of SCORM. They may even be using an open source LMS that's not really maintained anymore. So having a data engineering pipeline that imports full-on courses from those other formats into our platform is something that's also related to data engineering, because we can import them into our canonical format for context, and then from there, we can export them into the IllumiDesk version. I think how users interact with the courseware, and how they can use generative AI, is also important. The learners can also have access to Q&A models, generative AI models, etcetera. As the old saying goes, sometimes the best way of learning is teaching, so having them develop content, I think, is also important.
[00:52:38] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:02] Unknown:
I think that the tooling is there. I think the messaging is what needs to be fixed, if you will. The so-called LLM frameworks that you alluded to previously are, in my personal opinion, really just a repurposed data engineering pipeline. So I think all of the tools are there, but perhaps what we need to do is fix the messaging: take the different pieces of the LLM framework that we all know, the data loaders, the data transformers, the evaluators, you know, the retrievers, and things like that, and couple those with the terms that are more typically used with data engineering pipelines. And then have the customer decide which aspects to handle with a more battle-tested data engineering solution, such as Spark or something similar, and where to use a more bleeding-edge LLM framework, in order to solve their challenges.
[00:53:51] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at IllumiDesk. It's definitely a very interesting product, so I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Likewise.
[00:54:04] Unknown:
Thank you.
[00:54:11] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Greg Werner and IllumiDesk
Focus and Challenges of IllumiDesk
Generative AI in Educational Content
Data Integration and Contextual Cues
Learner Engagement and Feedback
Platform Architecture and Evolution
Use Cases and Content Creation Workflow
Challenges and Lessons Learned
Future Plans and Integrations