Milestones
- September 26, 2022: Announcement of the BigCode project.
- October 6, 2022: Webinar with the BigCode Community to provide strategic direction.
- October 27, 2022: Introduction of “The Stack” dataset and publication of the accompanying paper.
- November 15, 2022: Introduction of the “Am I in The Stack” tool and the BigCode Opt-Out process.
- November 23, 2022: Details shared on the approach to de-identification of personally identifiable information (PII).
- November 29, 2022: Sharing of Weights & Biases dashboards for the first models.
- December 1, 2022: Release of The Stack v1.1 with expanded data and programming languages.
- December 2, 2022: In-person meetup with the BigCode community alongside NeurIPS 2022.
- December 9, 2022: Meetup at EMNLP 2022 to raise awareness and engage with the NLP research community.
- December 12, 2022: Communication to raise awareness of the “Am I in The Stack” tool and the opt-out option.
- December 14, 2022: Second webinar with the BigCode Community to review progress.
- December 22, 2022: Release of SantaCoder, a 1.1B-parameter multilingual language model for code.
- March 20, 2023: Announcement of The Stack v1.2, including additional datasets and a simplified opt-out process.
- April 13, 2023: Analysis of Chinchilla scaling laws for training smaller language models.
- May 4, 2023: Announcement of StarCoder and StarCoderBase, code language models trained on GitHub data.