🧼 From GPU-poor to data-rich: data quality practices for LLM fine-tuning

Thursday, May 23

11:00 - 11:30
Audience level: Intermediate
Elevator pitch

If you are GPU-poor, you need to become data-rich. I will give an overview of what we learned from looking at Alpaca, LIMA, Dolly, UltraFeedback and Zephyr, and how we applied those lessons to fine-tune the state-of-the-art open-source LLMs Notus and Notux.


GPUs are in high demand and short supply, but being GPU-poor can be offset by focusing on data quality and becoming data-rich. Efforts like Alpaca, LIMA, Dolly, UltraFeedback and Zephyr show again and again that data quality does not get the attention it deserves.

1) Alpaca was made up of synthetic data that was not representative of real-world usage.
2) LIMA, short for Less Is More for Alignment, showed that a small, high-quality curated dataset could outperform much larger datasets on alignment tasks.
3) Dolly was written by Databricks employees, who seemed to misunderstand the annotation task at hand.
4) UltraFeedback showed that synthetic data at scale was possible and that GPT-4 could be used to curate data aligned with human judgement.
5) Zephyr was trained on UltraFeedback, but a bug in the dataset was overlooked.
6) We trained Notus after resolving this bug, but overlooked the overlap between our training data and the benchmarks.
7) We started distilabel and worked on Notux.

Tags: Machine-Learning, Open-Source, Best Practice, Case Study, Natural Language Processing

David Berenstein

👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing | Developer Advocate @argilla-io