In this video, we build a fully local RAG chatbot that runs entirely on a MacBook - no cloud APIs, no usage costs, complete privacy.
We use Arcee's Trinity Mini, a 26-billion-parameter mixture-of-experts model trained for real-world enterprise tasks, including RAG, function calling, and tool use. Quantized to Q8 and served through llama.cpp with Metal acceleration, it's surprisingly capable on Apple Silicon.
This builds on a previous video where we used Arcee Conductor for cloud-based inference. Same stack - LangChain for orchestration, ChromaDB for vector storage, Gradio for the UI - but now the model runs locally.
We also explore advanced retrieval techniques:
- MMR (Maximal Marginal Relevance) for diverse results
- Hybrid search combining vector similarity and BM25 keyword matching
- Query rewriting to clean up messy questions before retrieval
- Cross-encoder re-ranking to sharpen precision after a broad, recall-oriented first retrieval pass
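To make the first two techniques concrete, here is a minimal sketch in plain Python. It is not the code from the video or the repository: the function names (`cosine`, `mmr`, `rrf`) and the toy vectors are hypothetical, and reciprocal rank fusion is only one assumed way to merge a vector ranking with a BM25 ranking. The actual project may weight or combine results differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Pick k document indices by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already-selected documents.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the closest match among documents picked so far.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def rrf(rankings, k=60):
    """Merge several ranked lists of document ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document; k=60 is a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
    print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
    print(mmr(query, docs, k=2, lambda_mult=0.3))  # favors diversity
    vector_hits = ["a", "b", "c"]  # hypothetical vector-search ranking
    bm25_hits = ["c", "a", "d"]    # hypothetical BM25 ranking
    print(rrf([vector_hits, bm25_hits]))
```

With a low `lambda_mult`, MMR skips the near-duplicate and picks the more distinct document instead, which is the "diverse results" effect. The fusion step rewards documents that rank well in both the vector and the keyword lists.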
All running on a Mac. No internet required.
Resources
- https://www.arcee.ai/blog/the-trinity-manifesto
- https://huggingface.co/arcee-ai/Trinity-Mini-GGUF
- https://github.com/juliensimon/local-rag-chatbot/
#ArceeAI #TrinityMini #RAG #LocalLLM #llamacpp #ChromaDB #LangChain #HybridSearch #Reranking #AppleSilicon #EnterpriseAI #AITutorial #GenerativeAI #python