Why an Open-Source Embedding Database?

Anton appears on the Software Engineering Daily podcast

April 20, 2023

We chatted with Lee Atchison of Software Engineering Daily to introduce Chroma and LLM apps to a broad audience of software engineers.

Transcript

(below is transcribed with whisper.cpp - please excuse transcription mistakes)

Chroma is an open source embedding database that is designed to make it easy to build large language model applications by making knowledge, facts, and skills pluggable.

Anton Troynikov is the co-founder of Chroma, and he is my guest today.

This episode is hosted by Lee Atchison.

Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization.

His most recent book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments.

Lee is the host of his podcast, Modern Digital Business, an engaging and informative podcast produced for people looking to build and grow their digital business with the help of modern applications and processes developed for today's fast-moving business environment.

Subscribe at MDB.FM and follow Lee at LeeAtchison.com.

Anton, welcome to Software Engineering Daily.

Great to be here, Lee. Thanks for having me.

Intro to Chroma [02:55]

First, let me ask, did I get your name right?

Yeah, pretty good.

Pretty good.

Good.

I always ask that instead of just plowing ahead; I always want to make sure I get it right.

I often don't get it right, so I just want to make sure.

OK, let's jump ahead.

So you just described Chroma as an embedding database.

Can you tell me a little bit more what you mean by that?

Yeah, absolutely.

So there's a few pieces to that.

The first is sort of what embeddings are, really.

An embedding is a numerical vector representation of any kind of data.

And the special property of the embedding models that create these numerical representations is that things that are semantically similar, whether it's text, images, or audio, tend to lie close together in the vector space, and things that are semantically different tend to lie further apart.

That's how the embedding models are trained.

And so what a database that supports embeddings allows you to do is essentially semantic similarity search.

So if you have a document or a query that has a certain meaning, you can pull in items in the database that are similar to that query, in a sort of geometric way rather than the algebraic way you would in a SQL database.
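To make the geometric idea concrete, here is a minimal sketch using the open source sentence-transformers library, which comes up later in this conversation; the model name and example texts are illustrative assumptions, not a recommendation.

```python
# Minimal sketch of "geometric" retrieval with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The cat sat on the mat.",
    "Quarterly revenue grew by 12 percent.",
    "A kitten was resting on the rug.",
]

# Each text becomes a vector; semantically similar texts land close together.
doc_vectors = model.encode(docs)
query_vector = model.encode("Where is the cat?")

# Cosine similarity measures "closeness" in the vector space.
print(util.cos_sim(query_vector, doc_vectors))
# The two cat sentences score noticeably higher than the revenue one.
```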

So what Chroma is, is a database that allows the creation of these embeddings.

Is that a fair statement?

It's more for the storage and manipulation of them.

So the embeddings themselves come from embedding models, either open source, like sentence transformers or the Hugging Face transformer models, or from OpenAI or Cohere or other embedding providers that are coming online.

That's who actually creates the embeddings for us.

I think another way to think about Chroma is that currently, we use embeddings as the representation for the data.

But really, what we're building is, as you said, pluggable knowledge for language-model-in-the-loop applications.

And it just sort of happens that embeddings are currently the best representation of data to fulfill that function.

So tell me a little bit more about that.

So how is it optimized for these embeddings, for these large language model embeddings?

How does it make that happen?

Yeah, there's a few pieces to that.

Chroma's aim is to give the best possible developer experience when you're building a language-model-in-the-loop application that needs state and memory, which we provide through the embeddings store.

And there's a few pieces to that.

The first piece is that the embedding function itself is first class in Chroma.

In other words, once you specify it for a collection, you never have to think about a vector ever again.

In fact, one of the ways that we think about this is: to use a sort of embedding store in your language-model-in-the-loop application, you shouldn't need an infrastructure engineer to actually run the vector database part of this.

And you also shouldn't need a data scientist to figure out if the retrieved information you're getting is actually relevant to the task.

So that's the other part of what Chroma does, and the features that we're building, which are coming out very soon.

Things like query relevance to help you figure out whether or not you're actually retrieving information relevant to the task you're asking the model to perform.

Okay.

So, as opposed to a traditional table-based database, where the data is organized and structured in a very table-driven fashion, this is all relationship-based, and the relationships are based on these embeddings.

Correct.

It's based on the geometric relationships between them. So yeah, you're exactly right.

Yeah, that makes sense.

So this is a brand new database, right?

And where are you in the development phase right now?

Do you have a proof of concept?

Do you have a product?

Where are you right now?

Yeah, so the concept of a vector database has been around actually for a little while.

Traditionally, these have been used in applications like web scale semantic search or, you know, recommender systems where if you're an e-commerce vendor, for example, you want to help people find similar products and you would have some model that allows you to do that, right?

And there's a few key differences between what we're doing, because we're specifically targeting this AI-in-the-loop use case, versus what was done for the vector DBs that were built for those use cases a few years ago.

The first is that rather than having a very large index of perhaps billions of entries, which is batch updated very infrequently, Chroma needs to act more as an application database, where it's updated continuously in response to user actions: new data comes in, things are updated, they're deleted, they're removed.

So that's one part of it.
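As a minimal sketch of what that record-level, continuously updated usage looks like with Chroma's Python client (the collection name, ids, and texts are made up for illustration):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="app_data")

# Records change continuously in response to user actions,
# rather than being rebuilt in infrequent batch jobs.
collection.add(documents=["New support article."], ids=["kb-101"])
collection.update(ids=["kb-101"], documents=["Revised support article."])
collection.delete(ids=["kb-101"])
```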

The other part of it, of course, is that we need to be aware of the actual model itself because we need to make sure that we're putting the right things in the context window.

Right now, we have an open source product available.

It's the simplest way to get started with using embeddings in your language-model-in-the-loop application.

All it takes to use Chroma is pip install chromadb, import chromadb, and you're good to go.

Our docs page tries to make it as simple as possible because again, we're very focused on developer experience.
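A minimal sketch of that quickstart; the collection name, documents, and query here are made up for illustration:

```python
import chromadb

# Runs fully in-memory by default; no server or configuration needed.
client = chromadb.Client()
collection = client.create_collection(name="my_notes")

# Chroma embeds raw text for you with a sensible default embedding function.
collection.add(
    documents=[
        "Chroma is an open source embedding database.",
        "Embeddings place semantically similar text close together.",
    ],
    ids=["doc1", "doc2"],
)

# Query with plain text; the nearest documents come back by semantic similarity.
results = collection.query(query_texts=["What is Chroma?"], n_results=1)
print(results["documents"])
```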

We have a hosted product in the works.

We expect that to be out in approximately the next two months. It will have true serverless pricing, allowing developers to scale and use Chroma in the way that they're accustomed to using all kinds of other APIs.

And we have a lot of interesting features coming out, as I said, which are more focused on making sure that the right information is getting into the right model at the right time.

So you've got the open source version available now.

You're working on a hosted version, a SaaS version essentially, or an IaaS version, or DB-as-a... whatever it is, DBaaS.

But basically, it's a hosted version, and that's where you're planning on monetizing the technology in order to build your company.

That's the starting point.

The reason that we're actually building hosting is because it's definitely the number one thing that our developers are requesting from us right now.

Application developers are at this point used to just being able to call out to an API for infrastructure and storage components like what Chroma is for them.

And so to meet them where they are, we really need to have that hosted version built out.

And we will be monetizing it, we will charge, there will be early revenue into the company from the hosted product.

But the intention is to keep the open source piece the easiest and best thing to use for our individual developers.

And then if you need a managed, hosted solution, we'll have that available for you.

If you want to scale this to an enterprise-grade application, we're building for that in the future too.

So the long-term monetization strategy is the enterprise-grade product.

That's the current plan.

Of course, the way that I think about this whole space is really, on the one hand...

Very early.

It's very early. On the one hand, there seems to be a fairly traditional playbook here.

Open source data products are a well-trodden path in the software business.

There's Mongo and Databricks and these great companies.

And so there's a path that you go from open source to hosted and then you build for the enterprise and then you really drive monetization at that stage.

But as you just said, this is very, very early.

And one of the things that Chroma really aims to do is be there along the adoption curve.

Watch for the use cases that people are actually building on the ground.

And this is why we're investing so much in sort of developer outreach and really working with people.

And you might have heard of our hacker-in-residence program, which we just launched.

We think it's incredibly important to not call our shot too early, and to basically build for the use cases that are actually happening.

We're thinking about this general space of state, memory, knowledge: just-in-time knowledge for language-model-in-the-loop, large-model-in-the-loop applications. As these models become multimodal, I think it will be hard to just call them language models anymore.

There'll just be large models.

As those applications emerge, the shape of the product that we need to build further will naturally have to adapt.

And so we're thinking about it in that way.

What makes Chroma Unique [14:59]

So in the space of other vector databases, other databases in the space, what makes you unique?

What is the number one thing that pulls people in and sets you apart? Why would people use Chroma versus some of the other large language model databases out there?

Sure, I think one of our big early advantages is the fact that we have an easy-to-start-up, embedded product.

In other words, like I mentioned, once you pip install chromadb, you just import it into your Python application and you're up and running.

There's no additional configuration that you need to do.

But of course, we certainly provide you with the handles that you need to climb the abstraction curve if you want more fine-grained control than our defaults, which are very sensible.

An application developer who's starting today, working with language models in the loop, never needs to think about vectors. They never need to think about embedding queries, anything like that.

It just works for them out of the box. You can literally just throw text at it and it will work for you, completely in memory. There's nothing you have to do.

But we also allow you to perform initial scaling on your local developer machine, and eventually we'll ramp you up to the hosted solution once you get to that scale.

That's key. That's an important piece of our differentiation right now. And I think the other important piece is, as I mentioned, the fact that we're really taking these AI use cases, large model use cases, very seriously. They're the thing that we're 100% focused on,

which means that we have to design the entire system and the features that we bring in around them, not around that legacy application of web-scale recommender systems with billions of vectors.

It's more about really effectively managing, let's say, thousands of indices with hundreds of thousands of entries in them, which are updated continuously, instead of a billion-scale index which is updated maybe monthly, or daily if you're certain well-funded companies.

I guess what I'm hearing is you're more than just storing the vectors; you are creating the vectors, given the raw content that's coming in.

You are performing the machine learning on it in order to create the vectors.

We basically make that transparent to the user. So if they don't want to think about an embedding model, we provide a sensible default that runs completely locally. If they'd like to use an embedding provider's API, we provide very simple affordances for them to do that. All they have to do is drop in their API key and they're good to go.
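As a sketch of that affordance, here is how a provider-backed embedding function can be attached to a collection; the API key placeholder and model name are assumptions for illustration:

```python
import chromadb
from chromadb.utils import embedding_functions

# Drop in a provider API key to swap the local default for, e.g., OpenAI's
# embedding endpoint (key placeholder and model name are illustrative).
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002",
)

client = chromadb.Client()

# The embedding function is declared once, for the collection; from here on
# you add and query with plain text and never handle a vector directly.
collection = client.create_collection(
    name="provider_backed",
    embedding_function=openai_ef,
)
```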

The whole point, as I mentioned, and I'll repeat it again: our entire outlook on this is that AI application developers shouldn't need an infrastructure engineer or a data scientist to build robust applications with memory and skills, and that's what Chroma is building towards.

So you've just received your first round of seed money, I believe.

Second round. This is the second round.

That's fine. We had a pre-seed, which we raised last year, and we recently closed a pretty significant round with Quiet Capital, which, you know, is intended for hiring.

We're hiring across the board, everything from product engineering on the existing open source product, to platform and distributed systems engineering on the back end, to applied research, API engineering, and community, because we really believe in the power of open source here.

Our community is very, very active. We take a lot of input from other developers, and we're building a company that's able to sustain that and be robust around it.

And so, yeah, hiring across the board there. And that's, that's essentially what that raise was about.

Got it. Okay, so you already have a strong community of developers, and you need to continue that.

And that's where you're getting your primary growth from. So, with a lot of open source software, while it is open source, most of the actual development that occurs in it comes from commercial companies that need and leverage the open source.

A lot of the most popular open source software doesn't get a lot of direct community work: upgrading, fixing, working on the software itself.

Now, you've got a community of users. How active is your community of developers working on Chroma?

Yeah. I mean, look, we field a significant number of PRs every week. We get a lot of input from our Discord community chat.

I'm on the ground talking to users and people using Chroma essentially every week, in person, getting a lot of input and feedback that way.

But from a development perspective, it is Chroma, the company, driving development. It is us, you know, grooming what comes in and thinking about what needs to be built.

But we've been fairly successful so far at integrating community input, and we'll continue to do so.

I think that there's an interesting effect, because as you mentioned, right now this is very early, right?

And that means that the boundary between somebody who helps build this technology, say an embedding store, like a knowledge base for LLMs, or whatever you want to call it, whatever this ends up being,

And somebody who wants to build applications on top of that, that boundary is not so well defined at the moment.

And so giving people affordances to contribute back based on what they're building, I think is really valuable for us and really valuable for the community as a whole too.

Yeah, it's a really nebulous space. I imagine it's very difficult to explain to investors in particular, but also to potential clients, what exactly you do, because you're still figuring out what you're trying to do.

Well, there's a few use cases that have proven to be very popular already.

I think ChatGPT really switched people on to this entire domain.

The first use cases that people thought of were, you know, I want ChatGPT, but I want it for my data.

I want it for my documents. I want it to reason about, you know, the things that I know about, which were not in its training data, or which I need it to really not hallucinate about.

I need it to just, you know, look at this specific thing which I've given it.

And that, you know, sparked off a very large burst of creativity.

That's actually something that a lot of people are building for still right now.

And then recently, we've had a couple of very popular so-called agent projects, you know, BabyAGI and AutoGPT, which also require this sort of memory, the continuously, automatically updating memory, right? So this is another use case where knowledge and skills are immediately very important to what the application is doing.

So there are a few concrete things which do help us set our roadmap fairly continuously, and we're very fortunate that our investors in particular have been interested in this space and have seen the potential of this kind of capability for language-model-in-the-loop applications for quite a while.

And that's why we chose to partner with Quiet in particular.

And I think that while it's true that, you know, the most powerful applications haven't emerged, we haven't seen them yet, it is fairly obvious, at least to me, that something very important is happening here, and that this will be a key component of how that type of software is going to get built.

The analogy I love to give is, you know, it feels like it's 1995, and people are just figuring out what to do with the web for the first time.

And what happens is, when a new technology or new capability emerges, the first thing people do is think by analogy because they think to themselves, well, what things am I already doing that this makes better?

But the most powerful things emerge later on when the things that this enables me to do for the first time get discovered.

I've never thought of before.

Yeah, exactly. I mean, the reason for that is twofold. One, there is a set of tools which never needed to exist before this new technology.

So, you know, before the web, nobody needed a Yahoo. Nobody needed a web portal.

But as the web emerged as this great platform for publishing, well, suddenly you need some sort of portal or aggregator to tell you where to go.

I mean, I certainly remember the days where I was writing down URLs in my little notebook that friends told me were good and I should visit, right?

So that's something that emerges because there's a need created by a new technology for the first time.

And then on the other side of this, there are things that can only exist for the first time because of that technology.

And, you know, I think of things like e-commerce, Amazon is the big one, right?

Like not just zero marginal cost of distribution, but infinite shelf space.

First, you have to think of how to actually utilize the idea of infinite shelf space in this new paradigm.

Whatever that happens to be for language-model-in-the-loop applications, we haven't seen it yet, but I'm confident that it's there.

You kind of have to take a bet.

Right. You have to take a leap of faith that there is something out there you just don't know what it is.

Do you have any... so when you go to bed at night and you think about what the future might look like for machine learning, what types of things come to your mind? Make a prediction, make a guess:

what do you think is going to be the killer app that comes out in the next five years?

Here's what I think is going to happen. Here's how I feel at the present moment.

We have these large general-purpose, general-knowledge models like GPT-4, ChatGPT, you know, PaLM from Google, other language model providers coming online, and Anthropic has Claude, other things.

And these are kind of great, but they're really the first step.

And this sort of idea of pluggable knowledge kind of implies the future existence of another class of model, where it's not that they have knowledge and memory in their weights.

It's that the memory and skills that they have are external, completely controllable.

We understand them.

And the model itself is just a system for acquiring, composing, and synthesizing actions and knowledge based specifically on the information and skills that we give it.

There's early signs of this today.

It's a very early stage of research.

People are actively working on this, but I think that that class of model represents a very interesting sort of platform to build on.

It allows you to get their reasoning capabilities without necessarily having to store the knowledge the model is based on in the weights.

I think that that's a possible future.

I know that there are certainly things that I want to exist that I hope that they emerge, but even those feel so nearsighted to me.

The one that I always think about is: you know, being a startup founder, you do a lot of work that in principle is quite menial, but you have to do it.

There's not a lot of decision-making involved.

And I would like a little guy who basically sits across all of my communications and knows what my preferences are or can intuit them and just responds appropriately.

And then when things need my attention, then it interacts with me.

Otherwise, I can just stay in flow for the things that I really need to be inventing and thinking about.

A very different definition of a virtual assistant.

Yeah, a purely virtual assistant, right? A truly purely virtual assistant.

What's interesting, though, and again, this is why even that vision feels so shortsighted to me, is that this is an idea that existed in the 80s.

This is when expert systems were big; people thought that we could have these things. And you read the science fiction literature of the time and it seemed kind of inevitable. We haven't arrived yet, but now the possibility is there.

And you know, obviously that's an application of pluggable knowledge and just-in-time information for these models. The other thing that I get really excited about, and this kind of happens with every technology revolution, is education.

I'll put this caveat that everybody believes that every new technology revolution will be a huge boon for education.

The reality is it tends to get used for entertainment or propaganda, but I'm still hopeful that there are enormous possibilities here.

Imagine adaptive educational curricula which allow everybody, from an elementary-schooler or a middle-schooler all the way up to a grad student, to really just spin up on any knowledge or skills that they want, because instead of a flat textbook, what you have is this thing that adapts to you. You can ask it questions. I've built a few demos of this, right?

It's fairly straightforward to build a chat-your-textbook application, and then you can ask it questions in broad strokes.

You can ask it questions in specifics. It can even show you sources and say, "Yeah, this is on this page of the textbook. This is where you should go look."

Generalizing that capability I think will be really, really cool.

One thing that I would love to see in the near future, and I've been putting it out there that someone should build this, is basically: let me throw a corpus of documents at a model or system, and it just returns a spaced repetition program for me from that. That would be incredible. That's like the first step.

You know, a wiki would be a perfect use case for that, right?

Imagine a thousand people all putting the knowledge that they have for their job into some random database, essentially. Getting the information out, keeping it up to date, and making it relevant is always a challenge.

How do you search? How do you index? There's so much knowledge in it; how do you get the information out of it?

It is a hard problem. So that would be a fantastic use case here.

I actually think that the most powerful applications, you know, in every era of automation, people believe that, "Oh, it's going to replace humans entirely."

Actually, I think what ends up happening is you find new ways to make humans more valuable in the loop than they could have been before because so much of the repetitive menial work ends up getting replaced.

A machine can do that, and the part where you need a human, where you need human input, becomes more important.

You know, it's like Amdahl's law, or a Jevons paradox almost, right?

The rest of the system has shrunk as small as it possibly can, so the marginal impact of another person is much, much higher.

Jobs replaced by AI [31:00]

Right, right.

Yeah, everyone keeps talking about, you hear the mainstream talk about what jobs are going to be replaced by AI.

And I always have to groan when I hear that, because it's just like any other piece of technology; it does the exact same thing.

Well, what jobs are going to be replaced by AI? It changes jobs. It creates jobs, but it doesn't replace jobs.

Your job will change and a lot of people's jobs will change, but they'll change for the better hopefully.

And that's what normally happens, and I don't anticipate this to be any different.

A significant part of my job remains writing code and coding assistants have improved my speed and let me remain in focus much better than before.

I would say that I'm two or three times more productive having that assistant on board.

And what that tells me again as a founder is that each marginal engineer I hire is going to be that much better for me.

That makes me want to hire more engineers, not fewer.

Right, exactly. Cool.

So, let's see, you talked about query relevancy. Can you give a little bit more of a definition of what you mean by that, and how that relates specifically to the product you're working on?

Absolutely, yeah. So this is a great point.

So in a traditional vector database product, as I mentioned, you have these vectors, and then you put a query vector in and you grab the nearest neighbors under some metric, right? So say you ask for the five nearest neighbors. Now, the thing is, when you ask for five nearest neighbors, you will get five nearest neighbors.

That doesn't tell you how relevant they are to the actual query because they could be very far away.

You could have put a query somewhere into that vector space where there's nothing around.

It'll just grab five completely irrelevant things, right?

That's a real problem if you're an application developer, because what happens is you grab the documents associated with those embeddings, you put them into the context window of the language model, but you've given it nonsense, and it's going to output nonsense.

It's not robust. You can't build applications that way, because they won't be robust to user inputs.

They won't perform the functions that users expect. And the flexibility of these models is kind of the entire point.

So having irrelevant results pretty much defeats that.

The other case is if you have redundant data inside your knowledge base.

Excuse me, I'm just going to have a quick drink of water.

If you have redundant knowledge inside your knowledge base and you ask for five nearest neighbors, well, those documents might all be virtually identical.

And so the model might not have enough information to answer the query or perform the task.

So query relevance is basically this: you need an algorithmic way of determining whether you've put enough information into the context window of the model to perform the task or answer the query that the user has put in.
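Chroma's own relevance measure was unreleased at the time of this conversation, but as a naive stand-in, an application developer can threshold the distances returned alongside query results. A sketch, assuming a collection like the ones in the earlier examples; the cutoff value is arbitrary:

```python
# Naive relevance check: five neighbors always come back, but far-away
# ones are probably irrelevant, so filter on the returned distances.
results = collection.query(
    query_texts=["How do I reset my password?"],
    n_results=5,
    include=["documents", "distances"],
)

MAX_DISTANCE = 0.5  # arbitrary cutoff; tune per embedding model and corpus

relevant = [
    doc
    for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist <= MAX_DISTANCE
]

if not relevant:
    print("Low likelihood of relevant context; don't prompt the model blindly.")
```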

And so we have an algorithmic approach to doing this.

It's in the experimental phase; we're trialing it and will have it released to our users very soon.

And that's kind of our first step here; it's almost an applied research direction, in fact.

We want to help answer these questions for our users. The number one questions that we get around embeddings are: which model should I use? How do I divide up my documents to make sure I'm getting relevant results?

But what they're really asking is how do I make sure that the model has the relevant context to the task or query that I'm asking it to do?

And how can I tell when a specific query does not have relevant results?

Correct.

Exactly.

So this is the goal.

Exactly.

Because that allows the application developer to then do something about it.

So I'll give you an example workflow.

Imagine I have a customer support wiki or a set of documentation, and I go ahead and embed it. It's a knowledge base now.

It's inside Chroma.

And a query comes in, and we raise to the developer: okay, this is the likelihood that there is something relevant in your knowledge base.

The likelihood is quite low.

And so what the developer can do then is say: okay, I don't actually have that information in my knowledge base yet.

I will call out to, say, my customer support Slack channel, or however my customer support agents communicate, and say: here's a user query.

The knowledge base can't answer it right now.

Can you create an answer?

Of course the customer support agent can respond.

The user gets the response in real time.

But now you have a new question-answer pair which you can store in your knowledge base. You can even use a model to synthesize a new document for your knowledge base, and then embed it and store it.

And now what you've done is you've turned a static set of documents into a dynamic, adaptive knowledge base that will continue to improve over time.

That's a very powerful thing.

And again, that's just a basic first cut.

This is just the most obvious thing, but these kinds of loops are the things that I look for.

These loops become very very powerful.

These self improving systems become very very powerful.
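A sketch of that escalate-and-learn loop under stated assumptions: `answer_with_context` and `ask_support_team` are hypothetical caller-supplied functions, not Chroma APIs, and the distance cutoff is arbitrary.

```python
import uuid

MAX_DISTANCE = 0.5  # arbitrary relevance cutoff, as in the earlier sketch

def handle_query(collection, user_query, answer_with_context, ask_support_team):
    # answer_with_context and ask_support_team are hypothetical placeholders
    # supplied by the application; they are not Chroma APIs.
    results = collection.query(
        query_texts=[user_query],
        n_results=3,
        include=["documents", "distances"],
    )
    distances = results["distances"][0]
    best = min(distances) if distances else float("inf")

    if best <= MAX_DISTANCE:
        # Likely relevant: answer from the knowledge base.
        return answer_with_context(user_query, results["documents"][0])

    # Nothing relevant: escalate to a human agent in real time.
    human_answer = ask_support_team(user_query)

    # Store the new question-answer pair; the static document set becomes
    # a knowledge base that improves over time.
    collection.add(
        documents=[f"Q: {user_query}\nA: {human_answer}"],
        ids=[str(uuid.uuid4())],
    )
    return human_answer
```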

You had a quote on your website which I highlighted and wanted to throw in at some point, and this seems like the right place to do it.

And that is, "Chroma is building the database that learns."

That's essentially what you're talking about here: if you don't have the right information, it's better to say you don't have the right information, so that you can create it and learn it, versus just spewing out nonsense.

Yeah and again this is one of those things where you really have to build for applications.

Not semantic search, not you know web scale recommender systems.

You're building an application knowledge base.

And that immediately suggests things like, okay, human feedback, right?

You can incorporate human feedback as to whether or not you've got a relevant result from your users and then adapt the embedding space to reflect that user feedback.

This is something that you can only do in this kind of modality.

You can't do that really easily in a traditional database.

Building in all sorts of things like this and creating easy affordances for application developers to use these things is a very powerful thing and that's kind of what we're 100% focused on.

We don't really compete with traditional databases on, sort of, speeds and feeds.

And we're very much focused on: no, let's create this. Let's give the developer access to memory, state, knowledge, just-in-time knowledge for the language-model-in-the-loop application.

And whatever technology helps us to deliver that in an effective way so that the developer can build these things robustly on their own is the technology that we'll choose.

Makes sense.

So where can people who are intrigued by this conversation go to find out more about you, and about Chroma in general?

Sure.

Yeah, I mean, look, the right place to start is our website, trychroma.com.

There's a fairly active discord which you can join.

You can also find me on Twitter, or even just email us directly at hello@trychroma.com, and we'll be happy to chat: either to get you started with building these language model applications with state and memory, or to talk about working together.

There's a lot of opportunities here.

Cool.

Great.

Yeah.

So yeah, you're looking for potential customers, as well as potential employees, as well as potential partners who are in the same space and can work with you on it.

There's a lot of work.

There's a lot of work to be done.

Chroma intends to do a lot of it, but, you know, we'll be effectively partnering to get a lot of it done, in parts really.

Right.

Yeah, you're going to need partners to get the breadth of coverage of knowledge.

Yeah.

Great.

Well, thank you.

Is there anything else you want to tell me, about either Chroma or about what you're doing?

Anything we haven't talked about that you'd like to throw in?

I think we really did a good job of covering pretty much everything: you know, where we came from, where we're going, what we're about.

So yeah, I think I'm happy with what we covered.

Cool.

Okay.

So, Anton is the co-founder of Chroma, an open source embedding database.

Anton, thank you for being with me on software engineering daily.

Thanks, Lee.

Thanks for having me.