Milestones

  • September 26, 2022: Announcement of the BigCode project. 
  • October 6, 2022: Webinar with the BigCode Community to provide strategic direction. 
  • October 27, 2022: Introduction of “The Stack” dataset and paper publication. 
  • November 15, 2022: Introduction of the “Am I in The Stack” tool and the BigCode opt-out process. 
  • November 23, 2022: Details shared on the approach to de-identification of personally identifiable information (PII). 
  • November 29, 2022: Sharing of Weights & Biases dashboards for the first models. 
  • December 1, 2022: Release of The Stack v1.1 with expanded data and programming languages. 
  • December 2, 2022: In-person meetup with the BigCode community alongside NeurIPS 2022. 
  • December 9, 2022: Meetup at EMNLP 2022 to raise awareness and engage with the NLP research community. 
  • December 12, 2022: Communication to raise awareness of the “Am I in The Stack” tool and the opt-out option. 
  • December 14, 2022: Second webinar with the BigCode Community to review progress. 
  • December 22, 2022: Release of SantaCoder, a 1.1B-parameter multilingual language model for code. 
  • March 20, 2023: Announcement of The Stack v1.2, including additional datasets and a simplified opt-out process. 
  • April 13, 2023: Analysis of Chinchilla scaling laws for training smaller language models. 
  • May 4, 2023: Announcement of StarCoder and StarCoderBase, code language models trained on GitHub data.