Code LLMs: Evolution, Scorecard, and What’s Next

Large language models like ChatGPT are making big waves in software development. This has sparked the creation of specialized models, called Code LLMs, tailored just for software engineering. Many of these Code LLMs are fine-tuned versions of general large language models, meaning they’re updated often and their performance depends a lot on the base model. But here’s the thing: no one has really looked into these Code LLMs systematically. So, what’s the deal? Are they better than general models? Which ones are best for different tasks? We dug into 134 pieces of work from major databases to find out.

First, we sorted out the different types of Code LLMs and how they relate to each other and to general models. Then, we checked whether Code LLMs are truly better at software tasks. Turns out, it really depends on the base model and the specific task. We also tested these models on multiple benchmarks to see which ones come out on top. Our findings can help developers choose the best base models for creating even more advanced Code LLMs. For practitioners, they point to clear directions for improving these models.
