Inside OpenCoder: The Data and Cookbook
Paper 📜 https://arxiv.org/abs/2411.04905
OpenCoder Data 🧠 https://www.oxen.ai/OpenCoder-LLM/opc-sft-stage1
Links + Notes 📝 https://www.oxen.ai/blog/opencoder-the-open-cookbook-for-top-tier-code-llms
Join Arxiv Dives 🤿 https://oxen.ai/community
Discord 🗿 https://discord.com/invite/s3tBEn7Ptg
--
Use Oxen AI 🐂 https://oxen.ai/
Oxen AI makes versioning your datasets as easy as versioning your code! Even with millions of unstructured images, the tool quickly handles any type of data so you can build cutting-edge AI.
--
Chapters
0:00 Intro
2:19 OpenCoder
8:43 OpenCoder Goals
10:27 Pre-Training Data
11:07 RefineCode
12:41 Raw Code for Pre-Training
13:20 Data Preprocessing
14:05 Data Deduplication
17:05 How Data Deduplication Improved OpenCoder
18:12 Data Transformation
19:05 Data Filtering
31:58 Sampling
36:36 Code-Related Data
39:53 Post Training
48:49 The Two Stages of Instruct Tuning
51:20 Evaluation
53:47 Conclusion & Future Work