First, the exciting news: we are delighted to share that we have started working with the Open Source Collective and Ecosystems teams on this research, and the Open Source Collective is sponsoring the following updates!

Updates

Highlights

We had the chance to improve the data, processes, and algorithms over the last couple of weeks. Here are some highlights:

  • We have completely recreated the Criticality Score calculation formulas in our documentation, which enables us to easily modify the algorithm.
  • With that, we have extended the algorithm to include usage metrics from the Ecosystems API and applied it to all open source accounts under the Open Collective, resulting in a new ranking system!

You can see all the changes on the Open Source Public Fund experiment document.

Dataset refresh

We have refreshed the account list and included the country and yearly budget in the data. There are now 4729 accounts with a code repository.

Criticality Score’s latest version

We have updated our Criticality Score repository to the latest version. The new version is written in Go rather than Python and includes a second algorithm that calculates the score by retrieving the “dependent count” data from the Open Source Insights (deps.dev) API.
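The second algorithm's "dependent count" lookup can be sketched in a few lines of Python. The endpoint path below reflects our reading of the deps.dev API and may need adjusting against the official documentation; the package name and version in the example are illustrative.

```python
import json
import urllib.request

DEPS_DEV_API = "https://api.deps.dev/v3alpha"

def dependents_url(system, package, version):
    # Endpoint path is an assumption based on the deps.dev API docs;
    # verify it against the live documentation before relying on it.
    return f"{DEPS_DEV_API}/systems/{system}/packages/{package}/versions/{version}:dependents"

def fetch_dependent_count(system, package, version):
    """Fetch the 'dependent count' signal used by the second algorithm."""
    with urllib.request.urlopen(dependents_url(system, package, version)) as resp:
        return json.load(resp).get("dependentCount", 0)

if __name__ == "__main__":
    # Illustrative lookup; requires network access.
    print(fetch_dependent_count("npm", "react", "18.2.0"))
```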

Manual score calculations

We have recreated all the formulas for calculating scores in the “Criticality Score – Results” sheet. It is now possible to experiment with the weight of each parameter and see the results directly in the document.
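The formula behind those sheet calculations is the weighted average of log-scaled signals from the original Criticality Score design: each signal S_i contributes log(1 + S_i) / log(1 + max(S_i, T_i)), weighted by alpha_i. Here is a minimal Python sketch; the signal names, weights, and thresholds below are illustrative, not the exact values from our config sheet.

```python
import math

def criticality_score(signals, weights, thresholds):
    """Weighted average of log-scaled signals, per the original
    Criticality Score formula. A signal at or above its threshold
    contributes its full weight; the result is between 0 and 1."""
    total_weight = sum(weights.values())
    score = 0.0
    for name, value in signals.items():
        alpha = weights[name]
        threshold = thresholds[name]
        score += alpha * math.log(1 + value) / math.log(1 + max(value, threshold))
    return score / total_weight

# Illustrative weights and thresholds -- not the config sheet's exact values.
weights = {"created_since": 1, "contributor_count": 2, "github_mention_count": 2}
thresholds = {"created_since": 120, "contributor_count": 5000,
              "github_mention_count": 500000}
signals = {"created_since": 60, "contributor_count": 300,
           "github_mention_count": 1200}

print(round(criticality_score(signals, weights, thresholds), 4))
```

Adjusting a weight in the dictionary plays the same role as editing a cell on the "Criticality Score – Config" sheet.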

New config with the Ecosystems data

Decoupling data collection from score calculation made it easy to extend the data with other custom sources. Thus, we have retrieved each repository’s “dependent_repos_count” data from the Ecosystems API and created a new algorithm configuration, which replaces deps.dev’s “dependent count” parameter.
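Retrieving that "dependent_repos_count" value might look like the sketch below. The hostname and query parameter reflect our reading of the Ecosystems API docs and should be verified against the live documentation; since one repository can publish several packages, the sketch sums the counts across all of them.

```python
import json
import urllib.parse
import urllib.request

# Host and endpoint are assumptions based on the Ecosystems API docs.
ECOSYSTEMS_LOOKUP = "https://packages.ecosyste.ms/api/v1/packages/lookup"

def lookup_url(repo_url):
    """Build the lookup URL for a repository's packages."""
    query = urllib.parse.urlencode({"repository_url": repo_url})
    return f"{ECOSYSTEMS_LOOKUP}?{query}"

def dependent_repos_count(repo_url):
    """Sum dependent_repos_count over every package published
    from the given repository."""
    with urllib.request.urlopen(lookup_url(repo_url)) as resp:
        packages = json.load(resp)
    return sum(p.get("dependent_repos_count") or 0 for p in packages)

if __name__ == "__main__":
    # Illustrative lookup; requires network access.
    print(dependent_repos_count("https://github.com/expressjs/express"))
```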

Now there are three different algorithms for score calculation, and you can see their parameters under the “Criticality Score – Config” sheet:

  • original_pike – Yellow: The default algorithm from the Criticality Score.
  • pike_deps.dev – Green: The second algorithm from the Criticality Score, which adds the “dependent count” from deps.dev as an extra parameter.
  • ecosyste.ms – Blue: The new algorithm, which replaces deps.dev’s “dependent count” with the “dependent_repos_count” data from the Ecosystems API. We use the results of this algorithm to distribute the funds.

Changing a parameter on the “Criticality Score – Config” sheet updates the scores on the “Criticality Score – Results” sheet. Feel free to copy the document and try it yourself.

Stats

In the new “Stats” sheet, you can see a quick overview of the entire dataset, including budgets, languages, and licenses.

Results

This sheet shows the history and details of the three projects we randomly select each month to test the algorithm.

Process details

Below is a brief list of actions to prepare the data and the final results. One critical remark is that the process currently only works with GitHub repositories, so we exclude non-GitHub ones—hopefully a detail to improve in the future.

  • Retrieve account data from the Open Collective API to create “Accounts” and “Budgets” sheets.
  • Call the GitHub API to find the most starred repositories of each GitHub user, which you can find under the “GitHub – Top repos” sheet.
  • Run the Criticality Tool to retrieve the data points for each repository and store them in the “Criticality Score – Results” sheet.
  • Call the Ecosystems API’s Lookup endpoint to retrieve each repository’s additional “dependent_repos_count” data, then combine it with the other parameters in the “Criticality Score – Results” sheet.
  • Once the data is in place, the existing formulas in the “Criticality Score – Results” sheet calculate the scores.
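The middle steps of that pipeline can be sketched as two small helpers: picking each account's most-starred repository and assembling one results row. The field names below are illustrative, not the sheets' exact column headers.

```python
def top_starred_repo(repos):
    """Pick the most-starred repository from a GitHub user's repo list
    (this feeds the 'GitHub - Top repos' sheet). Uses the GitHub API's
    stargazers_count field."""
    return max(repos, key=lambda r: r["stargazers_count"])

def build_result_row(account, repo, signals, dependent_repos_count):
    """Combine the data sources into one 'Criticality Score - Results'
    row; the sheet formulas then compute the scores from these values.
    Keys here are illustrative, not the sheet's exact headers."""
    row = {"account": account["slug"], "repo": repo["full_name"]}
    row.update(signals)  # data points from the Criticality tool
    row["dependent_repos_count"] = dependent_repos_count
    return row
```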

Latest stats

Here are some stats that stand out in our dataset:

  • The total number of Open Collective accounts with a code repository is 4729.
  • 96.32% of those accounts use GitHub as their code repository.
  • However, 15.38% of GitHub usernames are invalid or non-existent, a significant figure. It would be beneficial for Open Collective to implement a verification mechanism for such critical user information, or at least require users to update their profiles periodically.
  • 81.62% of the accounts use USD as their currency. These accounts have an annual budget of approximately $15 million, with an average of $3,887 per account. That total likely amounts to less than one percent of what these projects would ideally receive.
  • JavaScript is the most widely used programming language in our dataset; it’s used in almost a third of all repositories. Python is second, and PHP is third.
  • MIT dominates the license list at 41.91%; the rest is spread across 27 other licenses, and 15.41% of the repositories don’t have any license at all.
  • By country, the United States leads the list with 11.4%. China follows with 2.9%, the United Kingdom with 2.8%, Germany with 2.7%, and India with 2.5%.
  • The Ecosystems search returns 1453 matches out of 3336 unique repositories. Among this data, npm is the top ecosystem at 46.94%, Go is second at 13.35%, and PyPI is third at 9.70%.
  • Through this experiment, we have invested $4,259 in 57 open source collectives over 19 rounds.
  • And last, here are the top five repositories with the highest criticality score based on the Ecosystems config:

You can see the full ranking of each algorithm and more under the “Stats” sheet.

What’s next?

Here are some of the items that we plan to work on:

  • Most under-appreciated: Find the accounts with the highest scores and the lowest yearly budgets.
  • License parameter: Categorize the licenses as permissive vs. copyleft, and add a license parameter to the algorithm as an experiment.
  • Repositories vs. releases: Combine repository data with release information, and improve the algorithm by including release metrics.
  • National public funds simulation: Categorize accounts by country and simulate fund distribution per country.

Thank you for tuning in, and we hope you enjoy this experiment as much as we do. See you on the next update!
