How to implement Weaviate RAG applications with Local LLMs and Embedding models

Develop RAG applications and don’t share your private data with anyone!

Tomaz Bratanic
5 min readOct 30, 2023

In the spirit of Hacktoberfest, I decided to write a blog post using a vector database for change. The main reason for that is that in spirit of open source love, I have to give something back to Philip Vollet in exchange for all the significant exposure he provided me, starting from many years ago.

Philip works at Weaviate, which is a vector database, and vector similarity search is prevalent in retrieval-augmented applications nowadays. As you might imagine, we will be using Weaviate to power our RAG application. In addition, we’ll be using local LLM and embedding models, making it safe and convenient when dealing with private and confidential information that mustn’t leave your premises.

Agenda for this blog post. Image by author

They say that knowledge is power, and Huberman Labs podcast is one of the finer source of information of scientific discussion and scientific-based tools to enhance your life. In this blog post, we will use LangChain to fetch podcast captions from YouTube, embed and store them in Weaviate, and then use a local LLM to build a RAG application.

The code is available on GitHub.

Weaviate cloud services

To follow the examples in this blog post, you first need to register with WCS. Once you are registered, you can create a new Weaviate Cluster by clicking the “Create cluster” button. For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days.

For the next steps, you will need the following two pieces of information to access your cluster:

  • The cluster URL
  • Weaviate API key (under “Enabled — Authentication”)
import weaviate


client = weaviate.Client(
url=WEAVIATE_URL, auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY)

Local embedding and LLM models

I am most familiar with the LangChain LLM framework, so we will be using it to ingest documents as well as retrieve them. We will be using sentence_transformers/all-mpnet-base-v2 embedding model and zephyr-7b-alpha llm. Both of these models are open source and available on HuggingFace. The implementation code for these two models in LangChain was kindly borrowed from the following repository:

If you are using Google Collab environment, make sure to use GPU runtime.

We will begin by defining the embedding model, which can be easily retrieved from HuggingFace using the following code:

# specify embedding model (using huggingface sentence transformer)
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(

Ingest HubermanLabs podcasts into Weaviate

I have learned that each channel on YouTube has an RSS feed, that can be used to fetch links to the latest 10 videos. As the RSS feed returns a XML, we need to employ a simple Python script to extract the links.

import requests
import xml.etree.ElementTree as ET

URL = ""
response = requests.get(URL)
xml_data = response.content

# Parse the XML data
root = ET.fromstring(xml_data)
# Define the namespace
namespaces = {
"atom": "",
"media": "",
# Extract YouTube links
youtube_links = [
for link in root.findall(".//atom:link[@rel='alternate']", namespaces)

Now that we have the links to the videos at hand, we can use the YoutubeLoader from LangChain to retrieve the captions. Next, as with most RAG ingestions pipelines, we have to chunk the text into smaller pieces before ingestion. We can use the text splitter functionality that is built into LangChain.

from langchain.document_loaders import YoutubeLoader

all_docs = []
for link in youtube_links:
# Retrieve captions
loader = YoutubeLoader.from_youtube_url(link)
docs = loader.load()
# Split documents
text_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=0)
split_docs = text_splitter.split_documents(all_docs)

# Ingest the documents into Weaviate
vector_db = Weaviate.from_documents(
split_docs, embeddings, client=client, by_text=False

You can test the vector retriever using the following code:

"Which are tools to bolster your mental health?", k=3)

Setting up a local LLM

This part of the code was completely copied from the example provided by the AI Geek. It loads the zephyr-7b-alpha-sharded model and its tokenizer from HuggingFace and loads it as a LangChain LLM module.

# specify model huggingface mode name
model_name = "anakin87/zephyr-7b-alpha-sharded"

# function for loading 4-bit quantized model
def load_quantized_model(model_name: str):
:param model_name: Name or path of the model to be loaded.
:return: Loaded quantized model.
bnb_config = BitsAndBytesConfig(

model = AutoModelForCausalLM.from_pretrained(
return model

# function for initializing tokenizer
def initialize_tokenizer(model_name: str):
Initialize the tokenizer with the specified model_name.

:param model_name: Name or path of the model for tokenizer initialization.
:return: Initialized tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)
tokenizer.bos_token_id = 1 # Set beginning of sentence token id
return tokenizer

# initialize tokenizer
tokenizer = initialize_tokenizer(model_name)
# load model
model = load_quantized_model(model_name)
# specify stop token ids
stop_token_ids = [0]

# build huggingface pipeline for using zephyr-7b-alpha
pipeline = pipeline(

# specify the llm
llm = HuggingFacePipeline(pipeline=pipeline)

I haven’t played around yet, but you could probably reuse this code to load other LLMs from HuggingFace.

Building a conversation chain

Now that we have our vector retrieval and th LLM ready, we can implement a retrieval-augmented chatbot in only a couple lines of code.

qa_chain = RetrievalQA.from_chain_type(
llm=llm, chain_type="stuff", retriever=vector_db.as_retriever()

Let’s now test how well it works:

response =
"How does one increase their mental health?")

Let’s try another one:

response ="How to increase your willpower?")


Only a couple of months ago, most of us didn’t realize that we will be able to run LLMs on our laptop or free-tier Google Collab so soon. Many RAG applications deal with private and confidential data, where it can’t be shared with third-party LLM providers. In those cases, using a local embedding and LLM models as described in this blog post is the ideal solution.

As always, the code is available on GitHub.



Tomaz Bratanic

Data explorer. Turn everything into a graph. Author of Graph algorithms for Data Science at Manning publication.