Milestones

September 26, 2022: Announcement of the BigCode project.
October 6, 2022: Webinar with the BigCode Community to provide strategic direction.
October 27, 2022: Introduction of “The Stack” dataset and paper publication.
November 15, 2022: Introduction of the “Am I in The Stack” tool and the BigCode opt-out process.
November 23, 2022: Details shared on the approach to de-identification of personally identifiable information (PII).
November 29, 2022: Sharing of Weights & Biases dashboards for the first models.
December 1, 2022: Release of The Stack v1.1 with expanded data and programming languages.
December 2, 2022: In-person meetup with the BigCode community alongside NeurIPS 2022.
December 9, 2022: Meetup at EMNLP 2022 to raise awareness and engage with the NLP research community.
December 12, 2022: Communication to raise awareness of “Am I in The Stack” and the opt-out option.
December 14, 2022: Second webinar with the BigCode Community to review progress.
December 22, 2022: Release of SantaCoder, a 1.1B-parameter multilingual language model for code.
March 20, 2023: Announcement of The Stack v1.2, including additional datasets and a simplified opt-out process.
April 13, 2023: Analysis of Chinchilla scaling laws for training smaller language models.
May 4, 2023: Announcement of StarCoder and StarCoderBase, code language models trained on GitHub data.