This solution targets enterprise knowledge base question answering for customer support. Its core capabilities include LoRA fine-tuning, RAG-based retrieval augmentation, Agent tool calling, and Streamlit frontend deployment. It addresses three common problems: general-purpose LLMs lack business knowledge, answers are inconsistent, and private knowledge is difficult to integrate. Keywords: LangChain, RAG, LoRA.
This project delivers a complete customer support pipeline from training to deployment
| Parameter | Description |
|---|---|
| Language | Python 3.10 / 3.11 |
| Core Protocols | OpenAI-Compatible API, HTTP |
| Base Model | Qwen2.5-3B-Instruct |
| Fine-Tuning Method | LoRA |
| Vector Database | Milvus Lite / pymilvus |
| Orchestration Framework | LangChain |
| Frontend | Streamlit |
| Core Dependencies | transformers, peft, vllm, sentence-transformers, langchain, pymilvus, ragas |
The system’s core value is injecting domain knowledge into the customer support answer pipeline
The project goal is clear: build a single-turn knowledge base QA customer support system. After a user submits a question, the system first retrieves relevant passages from the knowledge base, then uses a fine-tuned instruction model to generate a professional answer.
Compared with directly calling a general-purpose LLM, this approach is better suited to business-heavy scenarios such as after-sales support, service bots, order inquiries, and returns. It solves two key problems: the model lacks enterprise private knowledge, and answer style is difficult to keep consistent.
The technology choices balance cost, quality, and deployability
The original project uses Qwen2.5-3B-Instruct as the base model because it follows instructions well, and the 3B parameter size fits consumer GPUs such as the RTX 4090. The fine-tuning layer uses LoRA, which significantly reduces GPU memory usage.
The vector database is Milvus, and the retrieval pipeline uses Dense retrieval + BM25 + RRF + Cross-Encoder reranking. In other words, the system does not rely on vector search alone. It improves hit rate and answer relevance through hybrid recall and reranking.
tech_stack = {
"llm": "Qwen2.5-3B-Instruct", # Base model responsible for final generation
"finetune": "LoRA", # Lightweight fine-tuning that reduces training cost
"retrieval": ["Dense", "BM25", "RRF", "CrossEncoder"], # Hybrid retrieval pipeline
"frontend": "Streamlit" # Fast way to build a chat interface
}
This configuration summarizes the project’s minimum viable technology stack.
The system uses a clear four-layer architecture to reduce engineering complexity
The overall flow can be summarized as follows: user question → query parsing → hybrid retrieval → reranking → answer generation. The retrieval side depends on Milvus and BM25, while the generation side depends on the fine-tuned Qwen model.
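To make this flow concrete, here is a minimal orchestration sketch in Python. The stage functions it calls (parse_query, hybrid_retrieve, rerank, generate_answer) are hypothetical names standing in for the pipeline stages, not identifiers from the actual codebase.
def answer(question: str) -> str:
    # Hypothetical stage functions standing in for the pipeline stages
    query = parse_query(question)            # query parsing / normalization
    candidates = hybrid_retrieve(query)      # Milvus dense recall + BM25 keyword recall
    context = rerank(query, candidates)      # RRF fusion + Cross-Encoder reranking
    return generate_answer(query, context)   # fine-tuned Qwen served behind vLLM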
The module boundaries make replacement and extension easier later
- Knowledge base construction module: handles document loading, chunking, embedding, and indexing.
- Retrieval optimization module: handles vector recall, BM25 recall, RRF fusion, and Cross-Encoder reranking.
- Answer generation module: handles model invocation after deployment with vLLM/SGLang.
- Evaluation and monitoring module: handles RAGAS metrics and system effectiveness validation.
The data preparation stage determines both fine-tuning quality and retrieval ceiling
The project references Hugging Face’s customer_service_chat dataset, which includes three fields: instruction, input, and output. The instruction field is the user question, and the output field is the standard answer.
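As an illustration, a single record in this format looks roughly like the following; the sample content is made up for readability and is not taken from the real dataset.
sample = {
    "instruction": "How do I check the shipping status of my order?",  # user question
    "input": "",                                                        # optional extra context, often empty
    "output": "You can check the shipping status on the 'My Orders' page...",  # reference answer
}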
To improve training sample coverage, the original project also applies a simple form of data augmentation: synonym replacement. Although the method is simple, it helps cover the diverse ways customers phrase the same question.
def simple_synonym_replace(text):
    # Map common customer-support words to candidate synonyms (Chinese phrasing from the source data)
    synonyms = {"怎么": ["如何", "怎样"], "查询": ["查看", "了解"]}
    for word, candidates in synonyms.items():
        if word in text:
            text = text.replace(word, candidates[0], 1)  # Swap the first occurrence for the first synonym candidate
    return text
This code expands user question phrasing at low cost and improves fine-tuning data coverage.
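A minimal sketch of how the function above might be applied across the dataset; treating each rewritten question as an additional sample (while keeping the original answer) is an assumption about how the augmentation is wired in, not a detail confirmed by the source.
def augment_dataset(samples):
    augmented = list(samples)
    for s in samples:
        variant = simple_synonym_replace(s["instruction"])
        if variant != s["instruction"]:
            # Only the question phrasing changes; the reference answer stays the same
            augmented.append({**s, "instruction": variant})
    return augmented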
AI Visual Insight: This image shows the terminal output after the data augmentation script runs. It highlights that the processed samples were written successfully, indicating that the original conversation data has been expanded and is ready for training.
AI Visual Insight: This image shows the local project directory structure. You can typically see folders such as raw, processed, config, and knowledge, which indicates that the training data, configuration files, and knowledge base code are organized in an engineering-friendly layout.
The LoRA fine-tuning workflow emphasizes low-cost training and rapid iteration
The project uses the LLaMA-Factory visual interface to run SFT fine-tuning, with 8-bit quantization and LoRA parameters enabled. The key settings include lora_rank=8, lora_alpha=16, and q_proj and v_proj as the target modules.
The author specifically notes that a batch size of 32 is somewhat aggressive and recommends 16 as the safer choice in practice. In other words, although the project runs end to end, a production rollout still calls for conservative tuning with GPU memory headroom in mind.
llamafactory-cli train \
--stage sft \
--model_name_or_path ./Qwen/Qwen2.5-3B-Instruct \
--finetuning_type lora \
--dataset robot-qa \
--learning_rate 1e-4 \
--quantization_bit 8 \
--lora_rank 8 \
--lora_alpha 16 \
--lora_target q_proj,v_proj \
--per_device_train_batch_size 16
This command runs LoRA instruction fine-tuning, with the core goal of producing a business-adapted model at relatively low resource cost.
AI Visual Insight: This image shows the parameter configuration panel in a cloud-based training interface, indicating that the project adopts a visual fine-tuning workflow for quickly configuring datasets, training epochs, quantization, and LoRA hyperparameters.
AI Visual Insight: This image reflects real-time status monitoring after training starts. It typically includes loss, step count, and GPU memory usage, which helps verify whether fine-tuning is converging normally.
The inference deployment stage uses vLLM to provide a high-throughput compatible interface
At the deployment layer, the project prioritizes vLLM and enables dynamic LoRA loading. This allows the service to expose an OpenAI-compatible interface directly without merging weights first, which makes it easier for LangChain to call.
python -m vllm.entrypoints.openai.api_server \
--model ./Qwen2.5-3B-Instruct \
--enable-lora \
--lora-modules customer_service=./lora \
--port 6006 \
--served-model-name customer_service
This command starts an inference service with LoRA support and provides a unified model endpoint for downstream RAG and Agent workflows.
AI Visual Insight: This image shows the console output after the vLLM service starts successfully, indicating that the model is loaded, the port is listening, and the server can accept OpenAI-style requests.
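Because the endpoint is OpenAI-compatible, downstream code can talk to it through LangChain's standard OpenAI client. The sketch below is an assumption about the client-side wiring: the base URL and the dummy API key simply mirror the port and served model name from the command above.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:6006/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not verify the key unless configured to
    model="customer_service",             # matches --served-model-name
)
print(llm.invoke("How do I check my order status?").content)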
The knowledge base construction stage uses LangChain and Milvus to index documents
The project loads PDF and TXT documents through DirectoryLoader, then uses RecursiveCharacterTextSplitter to split them into 512-character chunks with 100-character overlap. This parameter combination balances contextual continuity and retrieval granularity.
The embedding model is bge-small-zh-v1.5 with 512 dimensions. The chunked results are ultimately written into Milvus Lite, which enables local vector storage and retrieval experiments.
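A minimal indexing sketch under these settings, assuming the langchain_community loaders and the langchain_milvus integration are used; the knowledge directory, Milvus Lite file path, and collection name are illustrative, while the chunk size, overlap, and embedding model follow the values above.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

docs = DirectoryLoader("./knowledge", glob="**/*.txt", loader_cls=TextLoader).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=100).split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")  # 512-dim Chinese embeddings
vector_store = Milvus.from_documents(
    chunks,
    embeddings,
    connection_args={"uri": "./milvus_demo.db"},  # Milvus Lite stores the index in a local file
    collection_name="customer_service_kb",
)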
The hybrid retrieval strategy is the key lever for improving accuracy in this project
The project implements a HybridRetriever that runs vector retrieval, BM25 retrieval, RRF fusion, and Cross-Encoder reranking in sequence. This pipeline is more stable than single-path vector recall, especially for customer support text containing technical terms, product models, and error codes.
def hybrid_search(query, vector_results, bm25_results):
fused = rrf_fusion(vector_results, bm25_results) # Fuse the two retrieval result sets first
ranked = rerank_with_cross_encoder(query, fused) # Then rerank with a cross-encoder
return ranked[:5] # Return the top 5 most relevant knowledge entries
This pseudocode summarizes the main retrieval-augmentation pipeline in the project.
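For reference, here is a minimal sketch of the RRF step assumed in the pseudocode above. It takes two result lists already sorted by their own retriever's score and fuses them by reciprocal rank; the constant k=60 is the commonly used default, not a value confirmed by the project.
def rrf_fusion(vector_results, bm25_results, k=60):
    scores = {}
    for results in (vector_results, bm25_results):
        for rank, doc_id in enumerate(results):
            # Each list contributes 1 / (k + rank); documents recalled by both paths score higher
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)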
AI Visual Insight: This image shows the knowledge base construction or retrieval test output interface. It typically includes recalled passages, scores, and source files, demonstrating the interpretability of the retrieval pipeline.
RAG packaging and Agent tooling give the system extensible orchestration capabilities
The project does not stop at “retrieve and append to the prompt.” Instead, it further packages the RAG process as a tool for the Agent to call on demand. This means the system can be extended to connect business capabilities such as ticket lookup, order lookup, and FAQ routing.
The RagSummarize class is responsible for loading Milvus data, initializing the embedding model, connecting to vLLM, and concatenating context before calling the model to generate an answer. It is then exposed as an Agent tool through @tool.
from langchain_core.tools import tool

@tool(description="RAG-based knowledge summarization tool")
def rag_summarize_tool(query: str) -> str:
    # `rag` is an instance of the project's RagSummarize class, created at startup
    return rag.rag_summarize(query)  # Call the packaged retrieval-augmented QA capability
This code turns the RAG capability into a standard tool interface so the Agent can decide when to call it.
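How the tool is handed to the agent is not spelled out in detail, so the sketch below uses LangGraph's prebuilt ReAct agent purely as an assumption; llm is the ChatOpenAI client pointing at the vLLM endpoint shown earlier, and the original project may wire its agent differently.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(llm, tools=[rag_summarize_tool])  # the agent decides when to call the RAG tool
result = agent.invoke({"messages": [("user", "How do I request a refund?")]})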
AI Visual Insight: This image shows the Agent runtime effect. You can typically see the user question, traces of tool invocation, and the final answer, indicating that the system has evolved from static QA into an intelligent customer support assistant with tool-use capabilities.
Evaluation and frontend deployment complete the real-world delivery loop
The evaluation layer uses RAGAS, with metrics including Faithfulness, Answer Relevance, Context Recall, and Context Precision. This upgrades assessment from “does the answer sound right?” to “is the answer grounded in evidence?”
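A minimal evaluation sketch with these four metrics, assuming question/answer/contexts/ground_truth records have already been collected (column names can differ slightly across ragas versions); the sample row is a placeholder.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

eval_data = Dataset.from_dict({
    "question": ["How do I check my order status?"],
    "answer": ["You can check it on the 'My Orders' page."],
    "contexts": [["Order status can be viewed on the 'My Orders' page."]],
    "ground_truth": ["Check the 'My Orders' page for shipping status."],
})
report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
print(report)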
The frontend uses Streamlit to quickly build a chat page and improves the interaction experience through streaming output. For internal demos, PoC validation, and rapid business pilots, this combination is highly efficient.
if prompt := st.chat_input("请输入你的问题..."):
    with st.chat_message("assistant"):
        message_placeholder = st.empty()  # Placeholder that is re-rendered as chunks arrive
        full_response = ""
        for chunk in st.session_state.agent.execute_stream(prompt):
            full_response += chunk  # Concatenate response content in a streaming manner
            message_placeholder.markdown(full_response + "▌")  # Typing-cursor effect while streaming
        message_placeholder.markdown(full_response)  # Final render without the cursor
This code implements frontend chat input and streaming display of model responses.
AI Visual Insight: This image shows the result after the Streamlit service starts, indicating that the system already provides a browser-based human-computer interaction interface.
AI Visual Insight: This image further shows the actual QA effect of the customer support chat interface. You can inspect message bubbles, the input box, and generated output, which validates that end-to-end deployment is complete.
This project is well suited to fast implementation for small and medium-scale private customer support and knowledge QA scenarios
If your goal is to build an enterprise FAQ bot, after-sales assistant, product documentation QA system, or internal knowledge assistant, this solution has strong reuse value. It is not the academically optimal design, but from an engineering perspective it delivers a complete loop across training, retrieval, inference, evaluation, and frontend delivery.
One important caveat is that the original configuration contains issues such as Python version switching, dependency conflicts, and duplicated path settings. In real reproduction, it is best to split environments first and standardize configuration management.
FAQ
1. Why does this project use both LoRA and RAG?
LoRA addresses the model’s speaking style and domain-specific expression, while RAG addresses the model’s lack of up-to-date or private knowledge. You need both to balance professionalism, controllability, and knowledge freshness.
2. Why not rely only on a vector database for retrieval?
Customer support scenarios often include keywords, product models, error codes, and fixed terminology. Pure vector retrieval may miss exact-match content, so the project adds BM25, RRF, and Cross-Encoder reranking to improve recall and ranking quality.
3. Is this project better suited to a PoC or a production environment?
It is an excellent fit for PoCs, internal testing, and small to medium production pilots. If you move into formal production, you should add access control, logging and tracing, a configuration center, fault recovery, caching, and multi-instance deployment.
AI Readability Summary
This article reconstructs a knowledge base QA customer support project built with LangChain, RAG, LoRA, and Streamlit. It covers data preparation, LoRA fine-tuning, vLLM deployment, Milvus-based hybrid retrieval, Agent orchestration, RAGAS evaluation, and frontend launch. The solution is a strong fit for enterprise private intelligent customer support deployments.