Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment

At our weekly This Week in Machine Learning (TWIML) meetings, (our leader and facilitator) Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them when the (semantic) similarity between consecutive sentences (or sentence-grams) falls below some predefined threshold. I had tried it earlier (pre-LangChain) and while results were reasonable, it would need a lot of processing, so I went back to what I was using before.

I was also recently exploring LlamaIndex as part of the effort to familiarize myself with the GenAI ecosystem. LlamaIndex supports hierarchical indexes natively, meaning it provides the data structures that make building them easier and more natural. Unlike the typical RAG index, which is just a sequence of chunks (and their vectors), hierarchical indexes cluster chunks into parent chunks, and parent chunks into grandparent chunks, and so on. A parent chunk would generally inherit or merge most of the metadata from its children, and its text would be a summary of its children's text contents. To illustrate my point about LlamaIndex data structures having natural support for this kind of setup, here are the definitions of the LlamaIndex TextNode (the LlamaIndex Document object is just a child of TextNode with an additional doc_id: str field) and the LangChain Document. Of particular interest is the relationships field, which allows pointers to other chunks using named relationships PARENT, CHILD, NEXT, PREVIOUS, SOURCE, etc. Arguably, the LlamaIndex TextNode could be represented more generally and succinctly by the LangChain Document, but the hooks do help to support hierarchical indexing more naturally.

# this is a LlamaIndex TextNode
class TextNode:
  id_: str = None
  embedding: Optional[List[float]] = None
  extra_info: Dict[str, Any]
  excluded_embed_metadata_keys: List[str] = None
  excluded_llm_metadata_keys: List[str] = None
  relationships: Dict[NodeRelationship, Union[RelatedNodeInfo, List[RelatedNodeInfo]]] = None
  text: str
  start_char_idx: Optional[int] = None
  end_char_idx: Optional[int] = None
  text_template: str = "{metadata_str}\n\n{content}"
  metadata_template: str = "{key}: {value}"
  metadata_separator: str = "\n"

# and this is a LangChain Document
class Document:
  page_content: str
  metadata: Dict[str, Any]

In any case, having discovered the hammer that is LlamaIndex, I began to see a lot of potential hierarchical index nails. One such nail that occurred to me was to use Semantic Chunking to cluster consecutive chunks rather than sentences (or sentence-grams), and then create parent nodes from these chunk clusters. Instead of computing cosine similarity between consecutive sentence vectors to build up chunks, we compute cosine similarity across consecutive chunk vectors and split them up into clusters based on some similarity threshold, i.e. if the similarity drops below the threshold, we terminate the cluster and start a new one. A minimal sketch of this clustering step is shown below.
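
The function names and the numpy-based cosine similarity here are mine, and the sketch assumes each chunk node already carries its embedding; it is meant to show the idea, not my exact implementation.

# a minimal sketch of the chunk clustering step described above
import numpy as np

def consecutive_similarities(nodes):
    # cosine similarities between each pair of consecutive chunk vectors
    vectors = [np.asarray(node.embedding) for node in nodes]
    return [float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in zip(vectors, vectors[1:])]

def split_into_clusters(nodes, threshold):
    sims = consecutive_similarities(nodes)
    clusters, current = [], [nodes[0]]
    for node, sim in zip(nodes[1:], sims):
        if sim < threshold:
            clusters.append(current)   # similarity dropped: close this cluster
            current = []
        current.append(node)
    clusters.append(current)
    return clusters                    # a List[List[TextNode]]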

Both LangChain and LlamaIndex have implementations of Semantic Chunking (for sentence clustering into chunks, not chunk clustering into parent chunks). LangChain's Semantic Chunking allows you to set the threshold using percentiles, standard deviation, and inter-quartile range, while the LlamaIndex implementation supports only the percentile threshold. But intuitively, here is how you could get an idea of the percentile threshold to use (thresholds for the other methods can be computed similarly). Assume your content has N chunks and K clusters (based on your understanding of the data or from other estimates); then, assuming a uniform distribution, there would be N/K chunks in each cluster. If N/K is approximately 20% of N, then your percentile threshold would be approximately 80.
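
One way to formalize that intuition, under the same uniform-distribution assumption: K clusters imply roughly K - 1 boundaries among the N - 1 consecutive similarities, so the threshold is the percentile that puts exactly those boundaries below it. The helper below is my own sketch, not either library's API.

# hedged sketch: derive a similarity threshold for a target cluster count K
import numpy as np

def similarity_threshold(sims, num_clusters):
    # sims: consecutive similarities from the previous snippet (length N - 1)
    breakpoint_pct = 100.0 * (num_clusters - 1) / len(sims)
    return float(np.percentile(sims, breakpoint_pct))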

LlamaIndex provides an IngestionPipeline which takes a list of TransformComponent objects. My pipeline looks something like the one below. The last component is a custom subclass of TransformComponent; all you need to do is override its __call__ method, which takes a List[TextNode] and returns a List[TextNode].

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

transformations = [
    text_splitter,         # a SentenceSplitter
    embedding_generator,   # a HuggingFaceEmbedding
    summary_node_builder,  # a SemanticChunkingSummaryNodeBuilder (custom)
]
ingestion_pipeline = IngestionPipeline(transformations=transformations)
docs = SimpleDirectoryReader("/path/to/input/docs").load_data()
nodes = ingestion_pipeline.run(documents=docs)

My custom component takes the desired cluster size K during construction. It uses the vectors computed by the (LlamaIndex provided) HuggingFaceEmbedding component to compute similarities between consecutive vectors and uses K to compute a threshold to use. It then uses the threshold to cluster the chunks, resulting in a list of lists of chunks List[List[TextNode]]. For each cluster, we create a summary TextNode and set its CHILD relationships to the cluster nodes, and the PARENT relationship of each child in the cluster to this new summary node. The text of the child nodes is first condensed using extractive summarization, then these condensed summaries are further summarized into one final summary using abstractive summarization. I used bert-extractive-summarizer with bert-base-uncased for the first and a HuggingFace summarization pipeline with facebook/bart-large-cnn for the second. I suppose I could have used an LLM for the second step, but it would have taken more time to build the index, and I have been experimenting with ideas described in the DeepLearning.AI course Open Source Models with Hugging Face.
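
A condensed sketch of the component is below. It reuses the consecutive_similarities, similarity_threshold, and split_into_clusters helpers sketched earlier, and the class internals are illustrative rather than my exact implementation (metadata merging is omitted for brevity).

# illustrative sketch of the custom summary node builder
from typing import Any, List

from llama_index.core.schema import (
    NodeRelationship, RelatedNodeInfo, TextNode, TransformComponent)
from summarizer import Summarizer      # bert-extractive-summarizer
from transformers import pipeline

extractive = Summarizer("bert-base-uncased")
abstractive = pipeline("summarization", model="facebook/bart-large-cnn")

class SemanticChunkingSummaryNodeBuilder(TransformComponent):
    num_clusters: int = 10   # the desired K, set at construction

    def __call__(self, nodes: List[TextNode], **kwargs: Any) -> List[TextNode]:
        sims = consecutive_similarities(nodes)            # from earlier sketch
        threshold = similarity_threshold(sims, self.num_clusters)
        summary_nodes = []
        for cluster in split_into_clusters(nodes, threshold):
            # condense each child extractively, then combine the condensed
            # texts into one final abstractive summary
            condensed = " ".join(extractive(child.text) for child in cluster)
            summary = abstractive(condensed, truncation=True)[0]["summary_text"]
            parent = TextNode(text=summary)
            # wire up the CHILD / PARENT relationships in both directions
            parent.relationships[NodeRelationship.CHILD] = [
                RelatedNodeInfo(node_id=child.node_id) for child in cluster]
            for child in cluster:
                child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
                    node_id=parent.node_id)
            summary_nodes.append(parent)
        return nodes + summary_nodes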

Finally, I recalculate the embeddings for the summary nodes: I ran the summary node texts through the HuggingFaceEmbedding, but I suppose I could have done some aggregation (mean-pool / max-pool) on the child vectors as well.
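
For reference, the mean-pool variant of that aggregation alternative would be a one-liner (my sketch, not what I actually shipped):

# hedged sketch: approximate a parent's embedding by mean-pooling its
# children's vectors instead of re-embedding the summary text
import numpy as np

def mean_pool(child_vectors):
    return np.mean(np.stack(child_vectors), axis=0).tolist()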

Darin also pointed out another instance of a hierarchical index proposed via RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval and described in detail by the authors in this LlamaIndex webinar. This is a little more radical than my idea of using semantic chunking to cluster consecutive chunks, in that it allows clustering of chunks across the entire corpus. One other significant difference is that it allows for soft clustering, meaning a chunk can be a member of more than one cluster. They first reduce the dimensionality of the vector space using UMAP (Uniform Manifold Approximation and Projection) and then apply a Gaussian Mixture Model (GMM) to do the soft clustering. To find the optimal number of clusters K for the GMM, one can use a combination of AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
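
A sketch of that recipe is below; the UMAP and GMM hyperparameters (n_components, the K search range, random_state) are assumed defaults of mine, not values from the RAPTOR paper.

# hedged sketch of RAPTOR-style soft clustering: UMAP for dimensionality
# reduction, then GMMs scored with AIC / BIC to choose K
import numpy as np
import umap
from sklearn.mixture import GaussianMixture

def soft_cluster(vectors: np.ndarray, max_k: int = 50):
    reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(vectors)
    bics = []
    for k in range(2, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=42).fit(reduced)
        bics.append(gmm.bic(reduced))   # likewise gmm.aic(reduced) for AIC
    best_k = int(np.argmin(bics)) + 2   # K with the lowest BIC
    gmm = GaussianMixture(n_components=best_k, random_state=42).fit(reduced)
    # soft assignments: row i holds P(cluster | chunk i), so a chunk can
    # belong to more than one cluster above some probability cutoff
    return best_k, gmm.predict_proba(reduced)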

In my case, when training the GMM, the AIC kept decreasing as the number of clusters increased, and the BIC had its minimum value for K=10, which corresponds roughly to the 12 chapters in my Snowflake book (my test corpus). But there was a lot of overlap, which would force me to implement some kind of logic to take advantage of the soft clustering, which I didn't want to do, since I wanted to reuse code from my earlier Semantic Chunking node builder component. Ultimately, I settled on 90 clusters by using my original intuition to compute K, and the resulting clusters seemed reasonably well separated.

Using the results of the clustering, I built this also as another custom LlamaIndex TransformComponent for hierarchical indexing. This implementation differs from the previous one only in the way it assigns nodes to clusters; all other details with respect to text summarization and metadata merging are identical.

For both of these indexes, we have a choice: maintain the index as hierarchical and decide which layer(s) to query based on the question, or add the summary nodes into the same level as the other chunks and let vector similarity surface them when queries deal with cross-cutting issues that may be found together in these nodes. The RAPTOR paper reports that they don't see a significant gain from the first approach over the second. Because my query functionality is LangChain based, my approach has been to generate the nodes, reformat them into LangChain Document objects, and use LCEL to query the index and generate answers, so I haven't looked into querying from a hierarchical index at all.
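
The handoff itself is simple; here is a minimal sketch (the FAISS store and the embedding_model name are my choices for illustration):

# hedged sketch of the TextNode -> LangChain Document handoff, with the
# summary nodes indexed at the same level as the regular chunks
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

def to_documents(nodes):
    return [Document(page_content=node.text, metadata=dict(node.metadata))
            for node in nodes]

# vectorstore = FAISS.from_documents(to_documents(nodes), embedding_model)
# retriever = vectorstore.as_retriever()   # then compose with LCEL as usual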

Looking back on this work, I am reminded of similar choices when designing traditional search pipelines. Often there is a choice between building functionality into the index to support a cheaper query implementation, or building the logic into the query pipeline, which may be more expensive but also more flexible. I think LlamaIndex started with the first approach (as evidenced by their blog posts Chunking Strategies for Large Language Models Part I and Evaluating Ideal Chunk Sizes for RAG Systems using LlamaIndex) while LangChain started with the second, though nowadays there is a lot of convergence between the two frameworks.
