GitHub releases an open dataset for multilingual developer content

Developers coordinate code across README files, issue threads, and pull request discussions. Much of that exchange happens in English, but a large share happens in other languages. GitHub has released a dataset built to help researchers and developers locate public repositories that carry non-English natural-language content.

GitHub Multilingual Repositories Dataset

The GitHub Multilingual Repositories Dataset is available on GitHub under the CC0-1.0 license. The release follows a commitment GitHub made in 2025 as part of Microsoft’s European Digital Commitments to widen access to multilingual data, including for open source AI developers.

Scope of the data

“The dataset covers over 80 million classification rows across more than 40 million repositories,” explained Kevin Xu, Staff Software Engineer at GitHub.

For each public repository, the dataset records language classifications of the README, the most-commented issue, and the most-commented pull request. The first 150 characters of each text serve as the input sample, and texts under 20 characters are excluded. Three classifiers handle the work: fastText, gcld3, and lingua-py. Each produces a confidence score, and the dataset includes only classifications above 0.5 confidence. Results from the three classifiers are reported separately so users can set their own strictness: requiring agreement among all three for high precision, or accepting any one for broad recall.
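The sampling rule and the agreement threshold described above can be sketched in Python. This is a minimal illustration, not GitHub's pipeline: the `Classification` row shape, its field names, and the helpers `make_sample` and `agreed_language` are hypothetical, and the real dataset's columns may differ. Only the three constants (150 characters, 20 characters, 0.5 confidence) come from the article.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

SAMPLE_CHARS = 150     # article: the first 150 characters are the input sample
MIN_CHARS = 20         # article: texts under 20 characters are excluded
MIN_CONFIDENCE = 0.5   # article: only classifications above 0.5 are included


def make_sample(text: str) -> Optional[str]:
    """Return the classifier input sample, or None if the text is too short.

    Assumption: length is measured on the raw text, with no trimming.
    """
    if len(text) < MIN_CHARS:
        return None
    return text[:SAMPLE_CHARS]


@dataclass(frozen=True)
class Classification:
    # Hypothetical row shape; the published column names may differ.
    repo: str
    source: str        # e.g. "readme", "issue", "pull_request"
    classifier: str    # e.g. "fasttext", "gcld3", "lingua-py"
    language: str
    confidence: float


def agreed_language(rows: Iterable[Classification], min_votes: int) -> Optional[str]:
    """Return a language backed by at least `min_votes` distinct classifiers.

    min_votes=3 favors precision; min_votes=1 favors recall. On a tie,
    the language seen first in `rows` wins.
    """
    votes: dict = {}
    for row in rows:
        if row.confidence > MIN_CONFIDENCE:
            votes.setdefault(row.language, set()).add(row.classifier)
    best = max(votes, key=lambda lang: len(votes[lang]), default=None)
    if best is not None and len(votes[best]) >= min_votes:
        return best
    return None
```

With made-up rows for one README where fastText and gcld3 report `pt` and lingua-py reports `es`, `agreed_language(rows, min_votes=3)` returns `None`, while `min_votes=2` returns `"pt"`. Keeping the vote count as a parameter mirrors the article's point that users pick their own trade-off between precision and recall.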

Each entry also carries repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.

Language distribution

Language patterns differ by text source. Korean ranks as the most common non-English language in issue text and the fifth-most common in README files. Portuguese leads the non-English README list, appearing in more than 3 million repositories.

Stated limits

GitHub describes the dataset as a discovery tool and cautions against treating it as a ground-truth benchmark for language identification. Repository text is often short and can mix badges, templates, commands, and code, so a 150-character sample may misrepresent a repository. The data carries repository-level signals and should not be used to infer sensitive attributes about owners, contributors, or communities.
