Show HN: CocoIndex – Open-Source Data framework for AI, built for data freshness

github.com

14 points by badmonster 15 hours ago

Hi HN, I’ve been working on CocoIndex, an open-source ETL framework for transforming data for AI, optimized for data freshness.

You can start a CocoIndex project with `pip install cocoindex` and declare a data flow that composes ETL steps like LEGO bricks - build a RAG pipeline with vector embeddings, construct knowledge graphs, or extract and transform data with LLMs. It is a data processing framework that goes beyond text. Whether you run the data flow in live mode or batch mode, it processes data incrementally with minimal recomputation, making it very fast to update target stores when sources change.

Get started video: https://www.youtube.com/watch?v=gv5R8nOXsWU Demo video: https://www.youtube.com/watch?v=ZnmyoHslBSc

Previously, I worked at Google for 8 years on projects like search indexing and ETL infra. After leaving Google last year, I built various projects and went through pivoting hell. In every project I built, data still sat at the center of the problem, and I found myself building data infra rather than the business logic I needed for data transformation. The current prepackaged RAG-as-a-service offerings don't serve my needs: I need to choose a different strategy depending on the context, and I also need deduplication, clustering (items are related), and other custom features that are commonly needed. That's where CocoIndex started.

There's a simple philosophy behind it - data transformation is like formulas in a spreadsheet. The source of truth lives in the source data; every transformation step and the final target store are derived data, and should react to source changes. With CocoIndex, you only need to worry about defining the transformations, like writing formulas.
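The spreadsheet analogy can be sketched in plain Python (this is illustrative only, not CocoIndex's actual API): derived data is a pure function of the source, so when the source changes, re-evaluating the "formula" brings the target back in sync.

```python
# Illustrative sketch (not CocoIndex's API): the target store is
# derived data - a pure function of the source, like a spreadsheet cell.

source = {"doc1": "hello world", "doc2": "cocoindex"}

def derive(src):
    # Each "formula" is a pure transformation of source rows.
    return {key: text.upper() for key, text in src.items()}

target = derive(source)       # initial build
source["doc1"] = "hello hn"   # source of truth changes
target = derive(source)       # derived data reacts to the change
```

Naively, the whole formula is re-evaluated on every change; the point of the framework is to get the same result without the full recomputation.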

*Data flow paradigm* - this was an immediate choice: because there are no side effects, lineage and observability come out of the box.

*Incremental processing* - if you are a data expert, a good analogy is a materialized view, beyond SQL. The framework tracks pipeline state in a database and reprocesses only the necessary portions. When data changes, the framework handles change data capture comprehensively, combining push and pull mechanisms, then clears stale derived data/versions and re-indexes based on tracked data/logic changes or data TTL settings. There are lots of edge cases to get right - for example, when a row is referenced elsewhere and that row changes. These should be handled at the framework level.
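One common way to implement this kind of incremental update is content fingerprinting - track a hash per source row and reprocess only rows whose hash changed, clearing derived rows whose source disappeared. This is a generic sketch of the technique, not CocoIndex's internals:

```python
import hashlib

# Generic sketch of incremental processing (not CocoIndex internals):
# per-row fingerprints decide what to reprocess and what to clear.

state = {}  # row key -> fingerprint of last processed content

def fingerprint(text):
    return hashlib.sha256(text.encode()).hexdigest()

def incremental_update(source, transform, target):
    for key, text in source.items():
        fp = fingerprint(text)
        if state.get(key) == fp:
            continue                 # unchanged: skip recomputation
        target[key] = transform(text)
        state[key] = fp
    for key in list(target):         # clear stale derived rows
        if key not in source:
            del target[key]
            state.pop(key, None)

target = {}
incremental_update({"a": "x", "b": "y"}, str.upper, target)
incremental_update({"a": "x", "b": "z"}, str.upper, target)  # only "b" is reprocessed
```

A real framework additionally has to fingerprint the transformation logic itself (so changed code triggers reprocessing) and handle the cross-row references mentioned above.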

*At the compute engine level* - the framework has to account for multiple processes and concurrent updates, and it should be able to resume from existing state after a terminated execution. In the end, we want a framework that is easy to build with at exceptional velocity, yet scalable and robust in production.

*Standardized interface throughout the data flow* - it's really easy to plug in custom logic like LEGO, alongside a variety of native built-in components. For example, it takes only a few lines to switch among Qdrant, Postgres, and Neo4j.
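The "swap targets in a few lines" property follows from targets sharing one interface. A minimal sketch of that design in plain Python (illustrative only; the class and method names here are made up, not CocoIndex's API):

```python
# Illustrative sketch of a standardized target interface (made-up names,
# not CocoIndex's API): the flow only depends on the export() contract,
# so switching stores changes a single line.

class MemoryTarget:
    """Stand-in for a real store (Postgres/Qdrant/Neo4j adapter)."""
    def __init__(self):
        self.rows = {}

    def export(self, key, value):
        self.rows[key] = value

def run_flow(source, target):
    # The flow is written once against the interface, not the backend.
    for key, text in source.items():
        target.export(key, text.upper())

target = MemoryTarget()  # swapping backends means constructing a different target here
run_flow({"a": "x", "b": "y"}, target)
```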

CocoIndex is licensed under Apache 2.0 https://github.com/cocoindex-io/cocoindex

Getting started: https://cocoindex.io/docs/getting_started/quickstart

Excited to hear your thoughts, and thank you so much! Linghua

renning22 4 hours ago

Love the idea. It saved me a ton of time updating the vector embeddings for my startup. The step-by-step tutorial made it easy to get started!

  • badmonster 3 hours ago

    Appreciate the feedback! We are adding more native built-ins. If there's anything I can help with, please feel free to open a ticket on our repo or shoot me an email: linghua@cocoindex.io

jingconan 3 hours ago

Love the idea - CocoIndex makes building AI backends much easier!

gkabhi 3 hours ago

What's the benefit of incremental processing? Why do I need it?

nicolehuang 3 hours ago

Can you give me an example of what I can build with it?

zzhibb 4 hours ago

I don't need RAG - I just want to parse a few documents and dump them into a database. Can I use it for that?

georgehe9 4 hours ago

Love the idea of writing data pipelines like spreadsheets. Spreadsheets are an amazing programming model: I can write my formulas without thinking about it, they calculate results in the right order, and updates are taken care of automatically.