Summary
Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI-powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the ways that humans have to adapt to the tool.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
- This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
- Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what Tabnine is and the story behind it?
- What are the individual and organizational motivations for using AI to generate code?
- What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
- What are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?
- What are some of the primary ways that developers interact with Tabnine during their development workflow?
- Are there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)
- For natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)
- Can you describe the structure and implementation of Tabnine?
- Do you rely primarily on a single core model, or do you have multiple models with subspecialization?
- How have the design and goals of the product changed since you first started working on it?
- What are the biggest challenges in building a custom LLM for code?
- What are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?
- For users of Tabnine, how do you assess/monitor the accuracy of recommendations?
- What are the feedback and reinforcement mechanisms for the model(s)?
- What are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?
- When is an AI developer assistant the wrong choice?
- What do you have planned for the future of Tabnine?
Contact Info
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- TabNine
- Technion University
- Program Synthesis
- Context Stuffing
- Elixir
- Dependency Injection
- COBOL
- Verilog
- MidJourney
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack)
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access) today and get 2 weeks free!
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting [dataengineeringpodcast.com/datafold](https://www.dataengineeringpodcast.com/datafold) today!
This week is a special crossover episode from our other show, The Machine Learning Podcast. If you like what you hear, then you can find more at themachinelearningpodcast.com. Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable enriched data to every downstream team. You specify the customer traits, then profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack.
You shouldn't have to throw away the database to build with fast changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades old batch computation model for an efficient incremental engine to get complex queries that are always up to date. With Materialize, you can. It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free.
Your host is Tobias Macey, and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine. So, Eran, can you start by introducing yourself?
[00:01:43] Unknown:
Hey. Thanks for having me. I'm Eran. I'm the CTO and cofounder of Tabnine. Other than that, I'm a professor of CS at the Technion, which is one of the leading Israeli universities. I've been doing research on program synthesis for many, many years now, I think since before it was cool. And definitely looking forward to this conversation.
[00:02:08] Unknown:
And do you remember how you first got started working in machine learning?
[00:02:11] Unknown:
I've been working on program synthesis for many years now. I think it's since the mid-2000s or something like that. And somewhere around 2010, we realized that a lot of programming tasks are extremely repetitive and can be automated if you learn from millions of examples. So, initially, we worked on classical approaches, mostly logic based approaches to program synthesis, using version spaces to represent spaces of candidate programs and explore the space of programs for synthesis. But when, you know, neural networks started to gain popularity again, and mostly when LSTMs started to be useful, I really got hooked on that. And from then on, I think I really evolved together with the field, closely to NLP techniques, through transformers, and, you know, the rest, as they say, is history.
And, really, you know, since the age of transformers, so to speak, together with my students and with Yoav Goldberg from Bar-Ilan University, we've done a lot of cool things around the expressive power of various networks. So fairly theoretical work on the expressive power of various RNNs, applications of LLMs in general for software engineering. And we also did very cool work with my student, Gail Weiss, and with Yoav on interpretability of transformers and reverse engineering transformers. So that was a few years back already. So I've been in this field for a while now.
[00:03:50] Unknown:
And so for the Tabnine project, can you describe a bit about what it is and some of the story behind how it came to be and why this is the problem that you want to spend your time and energy on? So Tabnine is an AI assistant for software development. It helps you
[00:04:05] Unknown:
with all software development and maintenance tasks. It can help you generate code, generate tests, generate documentation, review your code, and it will eventually help you drive the entire software development life cycle end to end using AI. So, back, I think, in 2018, maybe, we were the first to bring AI code completions to market, initially just in Java, based on classical techniques. Let's call them more logic based than semantic techniques. But then we moved to GPT based networks in 2019 and extended the platform to support more languages. We started by focusing on code completions because we saw that it's a good place to deliver a lot of value to developers.
But the vision is much wider. I think it's pretty obvious now that the future of software development is AI driven. So, like, everything in software development is going to be assisted with AI. You will not write software or maintain software without AI. Just like you don't do it without a compiler, right, or without an interpreter, it doesn't matter, for interpreted languages. So the future of software development is AI driven. It will take some time to get to that point where the entire process is AI driven, but we definitely see more and more pieces of this vision materialize. Right? Every month that passes, we get another task in software engineering boosted by AI in a significant manner.
[00:05:39] Unknown:
For the use case of AI as an assistant to your development process, you mentioned code completion as the initial foray into that space. I'm wondering what you see as the main motivations for individual developers, and then at the team or organizational level, to start to adopt these AI capabilities into their development practices?
[00:06:04] Unknown:
Yeah. So I think for individuals, it's mostly productivity and, at least for me, also some elements of discovery. So as I'm working with Tabnine, I obviously get the acceleration of not having to think about syntax and even about the implementation of various things that, you know, either I've done a million times before and I don't want to remember, or even things that I don't know how to do. And I don't really want the nitty gritty details. I just want it to be done, like, following some API or something like that. So definitely acceleration and a productivity boost for the individual, but also higher quality of code and discoverability, especially with Tabnine Chat. I ask questions on how to do certain things where I don't know the answer, and I get educated. I learn stuff from Tabnine Chat. So maybe a more structured answer to your question, as an individual, is that I feel Tabnine serves three different layers for me.
One is remind me. It reminds me of things that I know how to do, but, you know, I just don't remember the exact details. Second is teach me. It teaches me how to do things that I don't necessarily know how to do. And the third thing is elevate me, like, really give me wider context and make me a better developer by kind of exposing various ways to do stuff. So as an individual, it is really like working with an expert that can expose you to new things as well as reminding you of kind of the non interesting things and getting them out of the way quickly. To answer the second part of the question, for organizations, I think, again, it's clear to any organization that all software is going to be developed with the assistance of AI. Right? So there is AI in the future of software development anywhere and everywhere. So organizations are coming to us to see how Tabnine can improve productivity, how it can harmonize the code that is being created in the organization. Right? Because when everybody is following kind of the same model or the same AI when they generate new code, you get harmonized code. They're kind of all inside the distribution, so to speak. So nobody is kind of getting off on a tangent. Right? They're all on the common path in some sense.
And with the launch of Tabnine Chat, organizations are also coming to see whether chat can accelerate knowledge sharing and help answer questions that would otherwise require human experts to answer. So a wide range of things. And, again, as the functionality expands to code review and to other surface areas, I think we'll see even more demand from organizations.
[00:08:54] Unknown:
Yeah. I particularly like that thought of having the expert sitting over your shoulder because that is one of the challenges to scaling engineering teams: there's only so much expertise that can be built up within an individual or a set of individuals, and it takes time to proliferate that knowledge. And it's also sometimes challenging to be able to ensure that all of the people who are coming into the team are allocated enough time with those experts to be able to pick up the various signals that you receive through that interaction. And so by being able to reduce the total amount of time required to gain some of that context and knowledge, it allows that subject matter expert to scale more effectively because they don't have to spend as much of their time staring over somebody's shoulder while they write code and try to figure things out.
[00:09:45] Unknown:
Yeah. Absolutely. And it allows you to kind of get things right from the get go because you're getting help as you're generating the code. You may be generating it in a way that will pass the review later. Right? So it's not only that you get the expert knowledge. In a sense, you get the expert knowledge early enough that you don't have to get rejected in code review and redo what you did. Right?
[00:10:08] Unknown:
And the other side of that, though, is there's only so much expertise that an AI can consume or consolidate, because some of that expertise is contextual and requires things that are, at least as of now, still within the domain of human only capabilities, such as intuition or being able to make logical leaps between two different things or understanding the overall business context without having to explain it in very minute detail to the AI. And I'm wondering what you see as some of the real world limitations of using a generative AI in the creative process of software development?
[00:10:47] Unknown:
Yeah. So definitely architectural reasoning and high level reasoning, business kind of reasoning on why we're doing it this way, is something that is right now at least beyond what the AI is capable of, or at least it's not easy to communicate that information to the AI right now, not to say that it will remain that way forever. But at this point, this kind of reasoning is pretty hard for the AI, and it's pretty hard to communicate. But I think that that's the beautiful thing about AI systems in software development: you can relieve the humans from, like, the nitty gritty details of the exact syntax of calling an API or how do I sequence, you know, some calls to other things. And the humans can focus on the architecture and on kind of the business aspects of why we are doing it this way. And so, usually, when people ask me, like, you know, like a general audience asks, are developers going to be replaced by AI? Like, are software engineers going to be replaced by AI? And I tell them, no. Absolutely not. Because the job of a software engineer is to solve business problems using software. It is not to generate code. Right? So the code generation aspect of a software engineer, yeah, that may be accelerated or, to a large degree, replaced by AI. But thinking about the business problem and what software architecture and what software should be created for it is definitely in the realm of the human. But maybe to answer even slightly more philosophically, I would say that the limitation of Gen AI in general is the human.
The models are already extremely strong in the ability both to take a lot of input, you know, be it through fine tuning on customer code, say, or using very large context windows or context stuffing or whatever other techniques you want to use to communicate a lot of information into the model. And the models can also generate very extensive outputs. Right? So the bottleneck is really the human. Like, do you, Tobias, really want to read 2,000 lines generated by Tabnine and review them to see that they do exactly what you wanted? Like, in one go, like, here, Tobias, bam, 2,000 lines. Good luck. I don't think so. That's not how humans work. So really the barrier, to a large degree for Gen AI in general, actually, and in software creation, is the human, and we need to find better ways to communicate with the human and lower the cognitive load on the human when they work with the system. Right now, this is done by keeping the natural granularity of communication at the class level, method level, something like that, where you can say, oh, yeah, it does what I want. But if you want to go beyond that, the barrier is not the model. The barrier is the human being able to say, yes, I understand kind of what this does, and this is what I wanted.
[00:13:49] Unknown:
Speaking as somebody who's been in software engineering for a little while, it's very similar to the experience of working with the person who's requesting the software to be built and doing the requirements gathering to understand what it is, actually, that you're trying to solve for. And so we're just moving that another layer down, where I, as the human, have to get the requirements from other humans, and then I, as the human, have to relay those requirements to the AI in a way that the AI can understand. So it's really just the same experience, just with a different interface.
[00:14:17] Unknown:
Yeah. It's a similar experience, and we would think that in the future, we'll find a better interface. Right? Because, otherwise, as you said, we just, like, rolled the problem one level up or one level down. What you really want is a better way to communicate that intent, right, and to create the software in a better manner. Yeah. I mean, as you said, as software engineers, our job is to solve problems for the business, and half of that is understanding what the problem actually is and what the technologies are able to solve. Yeah. I think maybe one positive outlook on this: as software engineers, we shouldn't be too negative and bitter. I think the positive here is that AI, Gen AI, allows us to get the prototype faster and to kind of get to a faster realization that the software actually doesn't do what we wanted and that the requirements are unclear. Right? So it's a faster iteration. I think this is extremely valuable. Absolutely. Humans as intermediate representation.
Yeah. Absolutely.
[00:15:20] Unknown:
And then another aspect of AI in the software development context, and we've addressed this a little bit, is that as software engineers, we feel like, oh, we're the experts in the problem domain. We know exactly what we're doing. We don't need an AI to come and give a bunch of recommendations that might be wrong, which I then have to evaluate. I'm wondering what are some of the aspects of skepticism that you have come up against with developers who are starting to think about this utility in their tool belt, or some of the aspects of oversight that they need to be aware of as they're starting to exercise generative AI for producing larger and larger components of their code? So, actually,
[00:15:59] Unknown:
I think skepticism is already gone. Like, a couple years back maybe, or maybe three years ago, it was like, yeah, I'm not sure that this can generate anything useful. Maybe it automates only the parts that, you know, I don't care about, and, like, very mundane stuff, and doesn't help me. It's just another thing that gets in the way. But in the last maybe year and a half, maybe two years even, I think the technology has matured to a level that is really pleasing, and it's a pleasure to work with something like Tabnine. You work with it and it's like, oh, I actually didn't know that it is capable of doing that. I'm pleasantly surprised, and you get, like, a sequence of these pleasant surprises all the time. Not to say that all of them are pleasant. Sometimes, you know, the surprises are unpleasant. But for the most part, it's a pleasant experience. And one interesting psychological thing about the product is that people love the product. And I think one reason is that we are developers. We are fans of the technology, and we are really rooting for the product to succeed. Right? We want AI to succeed in some sense. So whenever it succeeds, it's like, oh, yeah, it got that right. Like, we're happy for the product that it succeeded, which is a fascinating psychological phenomenon from my perspective.
I think another aspect there is the variable reward. It doesn't always succeed. And so the fact that you're kind of expecting to see something, there's some game there. Right? You're playing with AI. It's playful. It's a playful thing. And the technology is mature enough that even when it fails, you kind of understand why it fails. So you treat it like a child that has not developed the deeper understanding. You say, oh, yeah, I see why you got that wrong. Let me try to rephrase my prompt, and maybe that will make you succeed. Right? So there's a lot of, like, nurturing of the AI by the developer. Definitely, if you want to take the kind of more critical point on that, I think the AI does act as an amplifier. Right? So Tabnine is contextualized on your code. And if your code sucks, it will keep generating code in that vein. Right? So if you write horrible code, guess what?
I understand that this is what you want. You probably want more horrible code, and so it will help you create that. It will try not to. Right? But it could. You could persuade it to follow your style into the dark corners of software. And so I think developers need to be aware that AI imitates their style in some sense, and they have to be careful and keep an eye on what gets generated, especially if their current code is not that great.
[00:18:54] Unknown:
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture, with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Absolutely. And that brings to mind my other question about technical debt and managing the amount of technical debt that is generated through this interface, where it's the same garbage in, garbage out principle that you were describing, where if you write a really bad way of looping through a string, then you're just gonna get more bad loops and more bad loops, and you're, you know, you're quickly going to get into a situation where your software doesn't work at all. Yeah. It's not necessarily that bad. I mean, some of the customers, you know, when we step into a customer, they typically say something like, oh, can you train on our code, so we get something that is similar to our code? And then they think about it again and say, actually, you know what? Let's train on somebody else's code.
[00:20:30] Unknown:
Remember that we said we have 30 million lines of code? Actually, we just want to train on the 3 million. Right? The other 27 million lines of code, it's better that they don't see the light of day. Right? So I think we're seeing a lot of that. And if you do train on your good projects, if you're kind of careful about what you put into the context, I think you will get high quality code that you can be happy with. Right? And you're absolutely right that if you're not careful, you could enter the business of producing what we call new legacy. Right? Just like you have a bunch of legacy code, you're just generating new legacy code. Right? So you gotta be careful
[00:21:10] Unknown:
with that. And that brings me to the question of some of the ways that developers and development teams use something like Tabnine in their development workflow. Is it largely for exploring new capabilities and quickly iterating on prototypes? Is it for managing refactoring workflows, generating tests? I'm just wondering if you can talk to some of the ways that people think about the AI capabilities as a complement to their existing skill set.
[00:21:39] Unknown:
I think still the most prevalent use is in code generation and code completion, just because it integrates so naturally into the flow and happens with a very high frequency. Right? So every time you type, Tabnine gives you a code completion. Often, it is what you want, so you just hit the tab and take it. So this is definitely kind of the workhorse of the AI assistant. It happens all the time. Similarly, when you write a kind of a method signature or a comment inside the code, that is, again, in the code generation style that may be longer form code generation of an entire function or entire class directly from the comment. This is also very widely used. Other things like test generation and documentation generation definitely happen, but naturally their frequency is lower, just because their frequency in the software development life cycle is slightly lower, right, unless your job is just to generate tests, which happens.
You're still spending most of the coding time, like, just writing code, not writing tests or documentation, so to say. These are being used quite often, but their frequency is obviously lower. And Tabnine Chat is also used very frequently. There it's more like what you said, more discoverability, like, how do I do x or where can I find y and stuff like that? So it's,
[00:23:08] Unknown:
yeah. The other piece of curiosity that I have around this overall set of capabilities is whether there are any particular categories of software for which something like Tabnine is most effective or most capable, thinking as far as web applications versus machine learning models versus infrastructure as code. Just wondering if you can talk to some of the ways that it's used most broadly and some of the cases where it's most effective.
[00:23:37] Unknown:
So Tabnine is used by over a million users, you know, every month. And so, you know, these users come from all avenues of software, all programming languages, and I think almost all kinds of applications. And even here internally, at Tabnine, we use Tabnine every day across the entire stack. That being said, I think that, at least for code generation, there are languages that are more amenable to getting a very high, what we call, automation factor, like huge amounts of your code getting generated for you. And, you know, languages such as JavaScript UI or TypeScript UI, React, whatever. You can get a lot of automation there. Definitely, Python data science work is also highly automatable, again, because many times the tasks are well defined. So it is easy for you to communicate to the AI what it is that you want, and it is easy for you to judge that the result is what you wanted. So maybe taking a step back, I usually talk about the fundamental theorem of Gen AI, which holds also in Gen AI for software. I call it a theorem, but it's really maybe a trivial observation that just says that the cost of specifying what you want plus the cost of consuming the result has to be much, much lower than the cost of doing the task manually. Right? So if I have to work real hard to tell Tabnine what I want, and I have to work real hard to check that the result is what I wanted, then maybe it would have been better for me just to do it manually. So I would say the kinds of software or the styles of software for which AI is most useful or appropriate are tasks where it is easy for you to define what you want and easy for you to judge whether what you got back from Tabnine is what you needed. Right? So it's really, again, more about where it is easy for the human to communicate with the system, both in terms of input and in terms of consuming the output. And UI is a trivial example because you can run the program and see whether the UI looks like what you wanted or not. Right? So it's easy for you to judge. Data science is another because you maybe hit the button, see the plot, and say, yeah, that's the plot I wanted. Right? So things where it is very easy for you to judge whether the program does what you wanted it to do.
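That "fundamental theorem" can be written down as a simple inequality (my own notation for the observation above, not a formula given in the episode): let $C_{\text{spec}}$ be the cost of telling the assistant what you want, $C_{\text{verify}}$ the cost of reviewing what it returns, and $C_{\text{manual}}$ the cost of doing the task by hand. The assistant only pays off when

$$C_{\text{spec}} + C_{\text{verify}} \ll C_{\text{manual}}$$

which is why tasks with cheap specification and cheap verification (a plot you can glance at, a UI screen you can render) are the natural fit.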
[00:26:23] Unknown:
And the other interesting piece that you already alluded to is the set of languages that are supported. You mentioned that when you were first doing code completion, you started with Java. Now that you're using generative AI, you've expanded the set of languages that it works with. I'm curious if you can talk to the analogy of natural language large language models being biased largely towards English, how that compares in the software ecosystem, and some of the ways that you're looking to tackle the long tail of languages that people would like to have supported?
[00:26:58] Unknown:
Yeah. That's interesting. I think there's enough code in most languages to drive a very successful model. So our current models support, I think, maybe up to, I don't know, 60 or 80 languages, depending on how you count. The ability of the model to generate clever things is definitely more biased towards the heavy head of languages. So it'll get great results in, you know, JavaScript, TypeScript, Java, Python, PHP, C++, C, Rust. I don't know, probably Ruby. I probably forgot a bunch of others. And maybe the support in Lua
and Elixir would be slightly less sophisticated, but there's definitely transfer happening between languages, also in natural languages, by the way, but more so, I think, in programming languages. So you'd be surprised that even if you don't have a lot of code out there in the world, the model does pretty well, and, you know, even more interestingly, it does pretty well on Verilog or SystemVerilog code, which is also substantially different than what we're used to in higher level programming languages. So it's not exactly the same effect as, you know, having a language model for Hebrew, which is very, very challenging.
There's more transfer happening, I think, between the programming languages.
[00:28:43] Unknown:
Yeah. I think that the overall problem space is compressed compared to all of human language because all programming languages are targeting the same core capability of Turing completeness. They might have different semantics or different ways of approaching it, but they all have the same base set of constructs and capabilities that they're trying to compile down to. And I'm sure that that simplifies the translation between the syntactical elements because the core semantics are largely similar. I mean, there are obviously programming languages that have very esoteric capabilities or ways of approaching a problem, but at the end of the day, it all compiles down to CPU instruction sets.
[00:29:26] Unknown:
Yeah. There's another aspect here. You're absolutely right, but there's another aspect, which is that there's actually the base signal, like the carrier signal, which is the programming language syntax itself. But, largely, what we do today with programming languages is calling APIs and libraries. And, you know, the syntax of the programming language is very, very simple, and it's easy for the model to pick that up. The model almost never makes syntactic mistakes. Right? So it's all about the APIs and libraries and how you call them, how you sequence them, etcetera. And that is largely shared between languages. So, you know, if you're using whatever Twilio API in Java, JavaScript, Rust, or something, you know, you're gonna get very similar names for kind of the API calls.
And the syntax of calling a function may be slightly different, but, you know, there is some structure there that is really shared across all languages, and the model picks that up quite nicely.
[00:30:27] Unknown:
And now digging into Tabnine itself, can you talk to some of the design and implementation details of the product and application, but also some of the ways that you think about the model training and development and deployment aspects?
[00:30:42] Unknown:
Yeah. So let me see where I should start. Let me start with large blocks, let's call them. So, I think people don't really realize the distance between a model and a product using a model. And, you know, the model changes all the time, and new models come out every week almost, and we kind of improve our models all the time. But the model is a kind of small part of the product. There's so much going on around that. So, things like vector memory that help the product be contextually aware of what's going on in your code base. Things like semantic memory that help the product be aware of the semantic kind of context of what's going on in your project. So Tabnine, for example, you know, will know things that are defined in libraries, even very deep inside the library, even when the source code of the library is not available. Right? So there's some level of semantic memory there that helps Tabnine be contextually aware of what's going on in the project and respect that.
Other components, you know, models for ranking, for filtering, for toxicity. I don't even know. There are so many kind of moving parts in the product. But maybe I would say the model is one big piece. Semantic memory and vector memory are two other big pieces, and there are many, many small pieces around that. So now, in terms of addressing the problem of contextual awareness, how you provide a product that is aware of customer code is also very central to Tabnine. And so we have both the machinery for fine tuning on the customer code base and the machinery for doing retrieval augmented generation, both text based and semantic based, to provide contextual awareness. And on top of that, we have some additional symbolic semantic awareness in the IDE that provides additional hints to the model for what gets generated. So definitely contextual awareness is a central theme of the product that required a lot of work. In Tabnine Chat, there are additional aspects, including, again, context awareness, but also limiting what gets generated, and some alignment problems there that are also interesting.
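To make the retrieval augmented generation idea a bit more concrete, here is a minimal sketch of what assembling context for a code completion request can look like in general. All of the names, the embedding, and the ranking logic are hypothetical placeholders for illustration, not Tabnine's actual implementation:

```python
# Hypothetical sketch of retrieval-augmented prompt assembly for code
# completion -- illustrative only, not Tabnine's implementation.
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    text: str

def embed(text: str) -> list[float]:
    """Placeholder embedding; a real system would call an embedding model."""
    return [float(ord(c) % 7) for c in text[:32]]

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def build_prompt(cursor_context: str, repo_index: list[Snippet], k: int = 3) -> str:
    """Rank indexed repository snippets against the code around the cursor
    and prepend the top-k as extra context ahead of the completion request."""
    query = embed(cursor_context)
    ranked = sorted(repo_index, key=lambda s: cosine(query, embed(s.text)), reverse=True)
    retrieved = "\n\n".join(f"# from {s.path}\n{s.text}" for s in ranked[:k])
    return f"{retrieved}\n\n# current file\n{cursor_context}"
```

A real system layers fine tuning, semantic (type and symbol level) context, ranking, and filtering on top of a retrieval step like this, but the basic shape of the prompt assembly is similar.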
You also asked about deployment. So Tabnine is basically deployed into a Kubernetes cluster. It's a bunch of services, inference and the others that I mentioned. You can deploy Tabnine in your own virtual private cloud, or you can deploy Tabnine completely air gapped on your own hardware. And we have a number of customers that have gone the kind of hardware route and have their own air gapped deployments of Tabnine. As far as the model architecture, I'm also interested in understanding
[00:34:16] Unknown:
what your approach has been and some of the experiments that you've done of, do you just have 1 monolithic core model that does all of the programming languages and code generation, or have you broken it into a, I guess, a core model, but then different models with subspecializations and some of the ways that you think about the interaction across those boundaries?
[00:34:38] Unknown:
Yeah. We kind of oscillate architecturally. We have taken the approach of having multiple models, and there have definitely been times that we served, I don't know, maybe up to 8 different models, serving different programming languages or even different parts of the software universe, like different stacks. I think these days, we have converged to, like, maybe 3 or 4 different models, and they serve different facets of the product. So different parts of the product have very different latency cost trade offs. And when you come to optimize Tabnine and, you know, make it possible to run Tabnine also on premise without requiring an entire GPU farm there, you need to optimize the consumption in different facets of the product. Some facets are very latency sensitive. Right? So for code completions, you want to get generation as you type.
And so the model depth cannot be very large for that if you're optimizing for latency. And that, in turn, drives other parameters of the model, and you get to certain model sizes if you want to do that economically, with very low latency. Other facets like chat are maybe less latency sensitive, and you can go with much bigger models, and you have more leeway on how you handle the inferences and how you deal with in flight requests and batching and all that. There's a lot there. Like, inference is hard. I don't think people realize how hard inference is when you try to do it economically at scale, like when you try to serve Tabnine SaaS, right, when you serve hundreds of thousands of users and you have to do that economically. I think the inference just becomes a hard problem that then also dictates the kinds of models that you can and do deploy in production.
And as you said, specialized smaller models, like a sparse architecture, sometimes allow you to do things with better latency and also more cost effectively.
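To illustrate the latency driven routing trade off being described, here is a toy sketch. The model names, sizes, and latency budgets are made up for illustration and are not Tabnine's actual configuration:

```python
# Toy sketch of routing requests to different models by latency budget --
# hypothetical names and numbers, not Tabnine's actual architecture.
MODELS = {
    "completion-small": {"approx_params": "1B", "budget_ms": 150},   # inline completions as you type
    "chat-large": {"approx_params": "15B", "budget_ms": 5000},       # conversational, multi-turn answers
}

def pick_model(task: str) -> str:
    """Latency sensitive tasks (typing-time completions) go to the small,
    shallow model; chat can afford a deeper, slower, more capable one."""
    return "completion-small" if task == "completion" else "chat-large"

for task in ("completion", "chat"):
    name = pick_model(task)
    print(task, "->", name, MODELS[name])
```

The point of the sketch is only the shape of the decision: latency budgets constrain model depth and size for completions, while chat style requests leave room for bigger models and more aggressive batching.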
[00:37:09] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data diffing to compare production and development environments and column level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration.
Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today. And while I was preparing for this interview, I looked to see how long Tabnine has been around, how long you've been working in this space, and I saw that Tabnine as a business was founded back in 2013. And I'm wondering if you can talk to some of the ways that your overall goals and approach to your core problem statement have changed or evolved over that period, and in particular, some of the ways that the advent of generative AI kind of shook up your ideas of what was possible.
[00:38:24] Unknown:
Yeah. Definitely. So, in fact, 2013 is an exaggeration because I was a professor working on this, like, super, super part time, just with Dror. But in 2017, we kind of decided that this is maybe something worth pursuing seriously and got funding. And so I would draw the line from 2017 for a real company. But, definitely, when we started, I think we knew that we could do stuff by learning from millions of examples to kind of help you develop software better. The imagination was that all software will be written using AI. So that was the vision, I think, even as early as 2013 when we were just, like, playing with ideas.
Initially, when we started, we said, okay, you will write code, and we will present suggestions in a sidebar. Like, next to you, as you write, we'll give you, like, the suggestions of what to do. And it was beautifully done, really. We had, like, a great designer back in 2017. Super talented guy. It was the most beautiful product that I ever saw. I loved it. And everybody absolutely hated it other than me. Like, all developers in the world hated it with a raging passion. And the reason is that as you would type, things would refresh on the side all the time. So imagine that you start writing and you have a sidebar that always changes all the time. It's super distracting, even if every suggestion there is exactly the right one. And then one of our users said, like, hey, guys.
Like, why don't you just do exactly the same, but make it a code completion natively in the IDE? Just insert it in the right place and pick one. Just pick the best one, because we had said we'd show you several options you could pick from, and we were totally kind of, like, very naive in the user interaction. And once we put that in as a code completion, I think it then clicked, and we got very good adoption. Starting in 2018 and 2019, when we expanded to additional languages, Tabnine really started to get traction.
Yeah. I think I've been humbled by the progress. So back in 2019, had you asked me if we'd be able to do what Tabnine is doing now, I would have said probably in a decade. I couldn't even imagine that this would be possible. Every day, I'm kind of surprised and amazed and humbled by what it can do. And as, you know, the entire research area keeps progressing, I'm kind of amazed by the things that we are able to do these days. I think, really, for me, in the last few years I've kind of come to the, I don't know, realization or speculation that quantity does breed quality, which is something I was highly skeptical of in the past. So, like, I would have said piling more and more parameters will not make you that much better.
And, you know, it looks like it does. So whether you do it, like, you know, with a sparse architecture or keep on piling parameters into a dense model is beside the point, but these models become stronger and stronger. And I'm amazed by the abilities that we have in our hands today. Again, as I said earlier in the conversation, I think for many of the tasks, the human is the bottleneck. So your ability to describe what it is that you want and your ability to consume back what has been generated is the bottleneck. So the bottleneck is IO, so to speak, between the human
[00:42:35] Unknown:
and the machine. Right? If only we could figure out how to make humans multicore.
[00:42:43] Unknown:
Yeah. Exactly.
[00:42:45] Unknown:
In your work of using these generative AI capabilities to build a product that is focused on developers and their workflows, you mentioned things like latency being a challenge. I'm wondering if you can talk to some of the most complex aspects of customizing an LLM for this specific context of software engineers.
[00:43:08] Unknown:
Yeah. That's an interesting question. I think, as you mentioned, latency. I think we spend a lot of time on the granularity. So it's not necessarily the model directly, but, again, how you interact with it. What is the right granularity of generation? Like, should it be 5 lines, 1 line, 200 lines? What are the boundaries? How do you make it easy for the user to consume the result? And this is happening also in test generation. You know, like, how do you make the result accessible? Let's say I generated 200 tests for you. What are you going to do with that? Are they all valuable? Which ones are the valuable ones? How do you guide the model to generate the valuable tests and not just, you know, increase the test count? Right? So these kinds of questions. I wouldn't say that the model per se is a challenge, but definitely everything around the model.
Context awareness still remains a problem in this space because one of the advantages of you as a human is that you know the entire project. You maybe know the entire organizational code base, what microservices are available, what other architectural decisions have been made. Right? And as we try to convey this kind of information to the model, there is both a limitation on what the model can capture, but also a limitation on how you phrase that, how you extract the information from, say, the code base or from the organizational knowledge base and communicate that to the model in a useful way. So that's definitely another challenge.
[00:44:53] Unknown:
Another aspect of the problem is not just the capabilities, but also the accuracy, and understanding when you've generated an accurate result, and being able to build a feedback loop for the model. And I'm curious if you can talk to some of the ways that you think about the assessment and measurement of accuracy, given the fact that that can be a bit of a nebulous concept in software engineering because, again, it comes back to: are the requirements accurate? So is the output accurate based on the requirements, or is it just that the requirements are inaccurate?
[00:45:32] Unknown:
Yeah. It's a really hard problem, and we spend a lot of energy on evaluation. We obviously have our own evaluation harnesses that, you know, intuitively, pick up a repository that we haven't seen in training from GitHub or somewhere, erase some parts, try to complete them, and see how well we did. But, you know, even measuring how well you did is a kind of tricky concept. What exactly do you measure? Each metric has its own problems, and so we have, like, a slew of metrics that we learned how to weigh together and how to kind of get a read of whether the model is better than the previous one or not. So definitely a lot of lab evaluation.
In terms of users, I think the ultimate test, for code generation, is whether the code that we generated got adopted or not. Right? And so that's slightly easier. But, also, as you said, it's not necessarily equivalent across all users because we found out some users take it even if it's completely wrong, and then they massage it into what they want. Right? And some users prefer not to take it and just write something very similar themselves. And so the signals are also quite tricky to analyze. That's for code completion. But for code completion, I feel like we have a pretty good read based on both the kind of internal harness and metrics and on user feedback of accepted slash rejected completions.
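As a toy illustration of that accepted/rejected signal and why it is noisy, here is a sketch of aggregating such events. The event shape is hypothetical, not Tabnine's actual telemetry:

```python
# Toy aggregation of accept/reject signals for completions --
# hypothetical event shape, not Tabnine's actual telemetry.
from collections import Counter

events = [
    {"user": "a", "completion_id": 1, "action": "accepted"},
    {"user": "a", "completion_id": 2, "action": "rejected"},
    # Accepted but then heavily edited -- muddies the signal, as noted above.
    {"user": "b", "completion_id": 3, "action": "accepted", "edited_after": True},
]

counts = Counter(e["action"] for e in events)
acceptance_rate = counts["accepted"] / sum(counts.values())
edited = sum(1 for e in events if e.get("edited_after"))
print(f"acceptance rate: {acceptance_rate:.0%}, accepted-but-edited: {edited}")
```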
For chat, this is much harder because the specification, like the question of what you asked, is often more fuzzy. It's more natural language heavy, the interaction. It's multi-turn. Right? So there's, like, more than one iteration in which you get the requirement specified and refined. And then the result may or may not be taken by the user based on, you know, it may be too long, so the user didn't take it. So there are all sorts of harder to analyze signals when you're analyzing the score or the kind of evaluation of chat. Again, a lot of lab evaluation there, including human tagging. So we employ some, you know, human tagging teams to give us feedback and to help us feed that back into the model.
Some efforts around RLHF, obviously. We do generate multiple answers in Tabnine Chat, and we see how users interact with that, which allows us to get some read of reality. But these evaluations are very, very tricky, especially for chat that is generating code, because, just as you said, Tobias, the spec is unclear. Right? Somebody wrote some natural language specification and then got some result, which is not clearly completely wrong or completely right, and it's very hard to evaluate. Yeah. I would add, even using humans, it is very hard to evaluate. Like, even if you give me the results, right, or you give our human team the results and say, was the chat correct? Was Tabnine Chat correct or not?
It's pretty hard.
[00:48:56] Unknown:
Another layer to that challenge also is shared vocabulary, where, in the conversational aspect of write me a module that does x, you're saying, oh, I want you to use dependency injection, versus somebody else says, oh, I want you to use inversion of control, and just some of the ways that the names for the same concept are different across different teams and how you manage to map those to the same inputs for the model. I'm wondering how that plays out.
[00:49:26] Unknown:
Yeah. The honest answer is I don't know. It's hard. As you said, I think we're not there yet. I think we're hitting a lower bar of challenges on the human expressivity, of people saying, I don't know, whatever, write me a sorting algorithm. So, you know, you might get bubble sort and say, oh, no, I didn't want bubble sort. I want something else. You get some other code and say, but, actually, this is totally not what I wanted because I wanted to make a library call. I didn't want you to reimplement the entire sorting algorithm. Right? So now this entire interaction, is it good or bad? Should you have given, like, the library call as the first answer for the first interaction?
You don't know how to grade that. Right? And we're seeing a lot of this. It's a trivial example, but we're seeing a lot of those interactions. I think, as always with these technologies, they are successful when the humans adjust. Right? So I think it's not about tuning the model so much as tuning the humans. Right? It's kind of a calibration of expectations, and people learn, and we see that with users. They learn how to write the questions in a way that is helpful for kind of cutting through to the right answer on the first interaction.
It's not different from people learning how to use Google. Right? It's kind of the same kind of skill. And I think developers will improve in doing that. So people like to call it prompt engineering, but it's not really that. It doesn't go as far as that. It's just learning to communicate with an assistant that is maybe very smart,
[00:51:17] Unknown:
but very limited in its ability to communicate, right? So, yeah, I was going to ask you about that idea of prompt engineering and being able to share a set of prompts that produce a particular category of output, or some of the ways that you think about training humans. The main question I'm driving at is: what are the aspects of customer education that you find yourself having to come back to as people are onboarding onto Tabnine?
[00:51:45] Unknown:
There's definitely some aspect of that. For example, when you're doing code completions, people write comments to kind of prompt the model, but in their head they're thinking chat. So they write a comment like, "Question: how do I do blah blah?" And that's not how the model has been trained for code completion. Code completion is not a conversational model; the model expects comments as they would appear in actual code. When you start to write these elaborate stories as comments, you're actually distracting the model. So that's part of the education that you have to do.
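As a hedged illustration of that point (the task, file, and column name are invented for the example), a chat-style comment and a code-style comment over the same completion request might look like this:

```python
# Distracting: a chat-style question written as a comment. A completion model
# trained on real code rarely sees comments phrased like this.
# Question: how do I read a CSV file and get the average of the "price" column???

# Better: a comment phrased the way it would appear in production code,
# placed right above where the completion should land.
import csv

def average_price(path: str) -> float:
    # Compute the mean of the "price" column in a CSV file.
    with open(path, newline="") as f:
        prices = [float(row["price"]) for row in csv.DictReader(f)]
    return sum(prices) / len(prices)
```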
On the chat side, again, I think most of the education is: be short and to the point. Don't tell elaborate stories, because, again, you're distracting the model in a sense. But I think people already get that.
[00:52:39] Unknown:
Yeah. And to the point of putting a conversational tone into your code comments, it's also bad form for the actual longevity of the project, because you don't want a very verbose comment. You just want to know: what is this trying to tell me, and why do I care?
[00:52:54] Unknown:
Right. Yeah. Absolutely. One interesting thing is actually test generation. We do test generation from the code itself right now, so it's the opposite of TDD in some sense: you don't say what the requirements are; you look at the code and generate the tests for it. You're not given a specification unless there is some comment there. So for test generation, for example, it is helpful to have a comment that says what the code is expected to do. It helps test generation test what you intended the code to do, and not what the code does. Right? These kinds of things are usability issues that should probably be solved in a different way in the product, where it asks you for the spec in natural language.
But right now, that's the way that they work.
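A minimal, hypothetical sketch of why that intent comment matters: if the docstring states the intended behavior, a test generated from it can catch a bug that a test generated from the code alone would simply enshrine. The function, numbers, and test are invented for illustration.

```python
def bulk_discount(quantity: int) -> float:
    """Return the discount rate: 10% for orders of 100 units or more, else 0%."""
    if quantity > 100:   # bug: the stated intent says "100 or more", so this should be >=
        return 0.10
    return 0.0

# A test generated from the docstring exercises the *intent* and catches the bug.
# A test generated only from the code would assert the buggy behavior instead.
def test_bulk_discount_applies_at_exactly_100():
    assert bulk_discount(100) == 0.10
```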
[00:53:46] Unknown:
Yeah. It's interesting, because there are a few libraries I've come across where the idea is that in the comments you write the contract for the function: it accepts these inputs, they should be within this range, et cetera. It's intended to help with constraining the tests and informing other developers of how the function should be used, but it also has the benefit of speaking to the model in a language that it understands.
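As a rough sketch of that contract-in-a-comment style, not the syntax of any particular library, and with a made-up function, it might look something like this:

```python
def resize(image_width: int, scale: float) -> int:
    """Contract:
        image_width: positive integer, in pixels
        scale: float in the half-open range (0, 10]
        returns: the scaled width, rounded to the nearest pixel, always >= 1
    """
    # Runtime checks mirror the documented contract.
    assert image_width > 0, "image_width must be positive"
    assert 0 < scale <= 10, "scale must be in (0, 10]"
    return max(1, round(image_width * scale))
```

The same contract text that constrains the tests and documents the function also gives a model a precise, code-adjacent spec to work from.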
[00:54:12] Unknown:
Yeah. Exactly right. And I think the mental model of these AI systems should be: I'm working with this other person who is very smart, but whom I have to communicate with extremely clearly. Otherwise, they get distracted or derailed. Right? That's the mental model.
[00:54:35] Unknown:
And in your experience of building Tabnine, particularly in your latest iterations of using these generative models for code suggestions and code creation, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:54:53] Unknown:
So I've seen it applied in strange ways. Maybe my own application is that I do all my writing with Tabnine. Emails, meeting summaries, everything, I write in Sublime using Tabnine, just for the English model. I got used to it. I don't think it's extremely unusual, but that's just me. I recently had some people interested in using it to migrate COBOL code to more modern languages, which I thought would not really work that well. But, again, if you calibrate your expectations properly, I think it's not that bad.
A migration project as a whole is probably a bad use for Tabnine, because taking one code base and saying, abracadabra, make this COBOL into Java, is just not going to work architecturally. The architecture of the application is going to be completely different. But if you have this opaque COBOL code and you want to massage it into some Java procedure that does something you can understand, and maybe help it a little bit on the edges manually, I think that is an interesting, or at least an unexpected, use from my perspective.
I've seen people trying to do magic TDD, or TDD to the extreme: writing the tests and trying to force Tabnine to generate the code. It kind of works in small cases where you can fit all the test cases into the context. I think that's an interesting direction for the future. I think something like that will happen in certain domains: you just write the tests, hit a button, and get the code. So I can definitely imagine that as a viable use case of the technology moving forward.
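A hedged sketch of that tests-first flow, with a deliberately tiny, invented example: the tests come first and are small enough to fit in a model's context, and the implementation below them is one a model could plausibly produce from the tests alone.

```python
import re

# The tests fully pin down the behavior and fit comfortably in a context window.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_punctuation():
    assert slugify("Rock & Roll!") == "rock-roll"

def test_slugify_collapses_whitespace():
    assert slugify("  many   spaces  ") == "many-spaces"

# One implementation a model could plausibly generate from the tests above.
def slugify(text: str) -> str:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)
```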
Yeah. I think that's roughly it at the high level.
[00:57:07] Unknown:
From the test creation perspective, too, it sounds like you would probably want to have your test harness preconfigured, so you're just filling in the individual tests rather than saying, I have absolutely no tests, write something.
[00:57:18] Unknown:
Yeah, it has to be something like that. Some balance.
[00:57:24] Unknown:
And in your experience of building this product, working with the developer community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:57:35] Unknown:
Yeah. I found out that people are very passionate about languages that I didn't know existed, so we get a lot of requests. Again, it's just my ignorance, not to say that the language is bad in any way; it's just that I was not aware of the amount of code that is being created there. For example, PowerShell was one where the community was very excited about using Tabnine for PowerShell, and I was like, hey, guys, I didn't even know that it works. But, definitely, those languages were kind of surprising.
Maybe another surprising aspect is the Verilogs of the world, the hardware languages that, again, were completely off my radar. People just came and said, hey, it kind of works for these languages, but can we make it better? And so we are making it better for these languages. It's interesting; it's very different from what I'm used to, at least as a software engineer.
[00:58:43] Unknown:
In the space of software engineering, software development, and development teams, I'm wondering what are the cases where you see AI and AI-assisted development as the wrong choice.
[00:58:59] Unknown:
That's a good question. I think for code generation, probably if you are working on the one-off algorithm that is very clever. I've worked in the past on concurrent low-level algorithms, so I can imagine that if you're working on a concurrent low-level algorithm that requires a lot of global, subtle reasoning, and you're the only one in the world that ever wrote this algorithm. It's not that you're reinventing some concurrent garbage collector or something; you are legitimately
writing this algorithm for the first time. Then probably AI code generation is not your best friend at this point. It's a very subtle kind of puzzle where every piece has to fit neatly together, with global reasoning, and for the generation part it's maybe not the best use of the tool. This is really a task that is heavy on human intelligence and human reasoning right now; that could change tomorrow, when the models become even better. For aspects that are not generation, I think AI assistance is always the correct choice. If you're using it to review your code or to do test generation or things like that, even if the AI is wrong, it teaches you something. If you say, review this code, and the AI gives you some comments, you might say, this comment is wrong, but I understand that the code is written in a way that could mislead you to that reasoning. So let me restructure the code to make it more obvious to you and to the next human reader. I'm not doing a service to the AI; I'm in service of future Eran, who will come to this code a year from now and say, who's the idiot who wrote that? Because I don't understand it.
And so I think for review and for test generation, AI assistance is always the correct choice, even for the one-off algorithm.
[01:00:59] Unknown:
And as you continue to build and iterate on Tabnine, and continue to keep up with the rapidly evolving space of large language models, what are some of the things you have planned for the near to medium term, or any particular problem areas that you're excited to dig into?
[01:01:16] Unknown:
Super excited about code review. We've worked on the problem for a while now; I think the current code review product is version 3. We had version 1 maybe a couple of years back. I loved it; everybody hated it. You start to see a theme here, right? The developers on the team said, this is mostly distracting us. It's giving us comments where maybe 2 out of 10 are what we want, and for the other 8 we just have to fight the tool, so we don't want that. Version 2 was much, much better. And I think now, with version 3, it's actually really useful and valuable. So definitely excited about code review coming out later this year.
Integration with non-code sources is another thing that is coming out, and I'm super excited about the ability to have all of Tabnine be aware of, you know, Confluence and Notion and Jira and other sources of information that are non-code. This integration, which we've been working on for a while now, is a really hard one. People think, oh, you just slap a vector database over the documents and you'll be fine. No, far from it. It's a really tricky one to make useful. And I'm very excited about that, because informing code generation, test generation, and all the other tasks of Tabnine with some architectural details, with some other non-code source of information, really changes how the product reacts. You suddenly see it start to use a microservice that has not been defined anywhere other than the docs. You start to see interesting things happening because you've informed it with a more general context, like a human has. So these horizontal integrations with other data sources, I think, are beginning to inject the level of human expertise that you'd expect from a human into the product. And I think as we improve that, and surface it in the product, it will become immensely
[01:03:45] Unknown:
more human-like. Yeah. Leveling it up from an intermediate to a senior engineer.
[01:03:51] Unknown:
Kinda, yeah, I guess. That's a good way to phrase it. Thank you.
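To make the "slap a vector database over the documents" idea concrete, here is a minimal, hypothetical sketch of that naive retrieval approach, the one Eran describes as far from sufficient on its own. The embed parameter stands in for any embedding model, and the document titles and prompt format are invented.

```python
import math

def cosine(a: list, b: list) -> float:
    # Similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query: str, docs: dict, embed, k: int = 3) -> list:
    # Rank Confluence/Jira/Notion pages by similarity to the task description.
    q = embed(query)
    ranked = sorted(docs, key=lambda title: cosine(q, embed(docs[title])), reverse=True)
    return [f"# From {title}:\n{docs[title]}" for title in ranked[:k]]

def build_prompt(task: str, docs: dict, embed) -> str:
    # Prepend the retrieved non-code context to the code-generation request.
    context = "\n\n".join(retrieve(task, docs, embed))
    return f"{context}\n\n# Task: {task}\n"
```

The conversation's point is that retrieval like this is only the starting layer; deciding what is relevant, trustworthy, and current enough to inform generation is where the real work is.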
[01:03:55] Unknown:
Alright. And are there any other aspects of the Tabnine product itself, the overall space of AI assistants for software engineering, or the rapidly evolving area of large language models that we didn't discuss yet that you would like to cover before we close out the show?
[01:04:12] Unknown:
I'm very curious about this space. I don't think we'll go that far with Tabnine, but back in 2016, maybe, I worked on this kind of crazy idea of learning programming from YouTube videos. There are so many tutorials on YouTube that teach you how to do certain things with programming, and I actually had very nice work with a student on how to learn from those video tutorials. When I have time, someday, I'm curious about going back to that and maybe generating programming tutorials completely automatically. I don't think we'll do that in Tabnine anytime soon; I think the roadmap is pretty full.
[01:04:57] Unknown:
But something there feels right. I mean, programming tutorial videos and the ability to generate them automatically sounds exciting to me. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:05:20] Unknown:
Two of them, probably. I think one is definitely privacy and security, as we see with Tabnine customers that don't want to send all their information outside the org. So there's definitely a barrier to adoption there, on the architectural side. On the product side, I maintain that the biggest barrier to adoption is the human. We need to find better ways to interface with humans. This is not only for software creation; this is for any gen AI product. You need to somehow find the right level of presentation to make it easy for your user to say, uh-huh, yeah, you just generated what I wanted. For image generation, maybe that's easy, because you get the princess riding the unicorn.
But with other products it may be very, very hard, and it may even require a different language for communicating back the result. Even for us in Tabnine, one of the high-level thoughts is: should there be a different language in which you communicate the results? Even if you ask for a C program, maybe I shouldn't show you the 2,000 lines of C; maybe I should summarize the main ideas in that C program for you, so you can say, yeah, okay, I got it, that's what I wanted. Right?
[01:06:40] Unknown:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing on Tabnine. It's a very interesting product, and it's definitely very exciting to have these capabilities for people working in the software space. So I definitely appreciate all the time and energy that you and your team are putting into accelerating software engineers, and I hope you enjoy the rest of your day.
[01:07:02] Unknown:
Thank you very much. Thanks for having me. It was great.
[01:07:11] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to the Episode
Eran Yahav's Background and Journey in Machine Learning
Overview of Tabnine and Its Evolution
Adoption and Skepticism of AI in Software Development
Use Cases and Effectiveness of Tabnine
Design and Implementation of Tabnine
Evolution of Tabnine and Generative AI Impact
Challenges in Customizing LLMs for Software Engineering
Interesting Applications and Lessons Learned
Future Plans and Exciting Areas for Tabnine