flowchart TB
%% sources %%
plos["PLOS<br/>(n~60,000 pairs)"] --> unfiltered[Unfiltered Article-Repository Pairs]
softwarex["SoftwareX<br/>(n~700 pairs)"] --> unfiltered
joss["JOSS<br/>(n~3,000 pairs)"] --> unfiltered
pwc["Papers with Code<br/>(n~180,000 pairs)"] --> unfiltered
%% filtering %%
unfiltered --> |Filter non-GitHub Article-Repository Pairs| dataset[GitHub-only Article-Repository Pairs]
%% article processing %%
dataset --> s2[DOI Resolution with Semantic Scholar]
s2 --> openalex[Article Metadata, Author Information, and Bibliometrics Retrieved from OpenAlex]
%% repo processing %%
dataset --> gh-api[Repository Metadata and Contributor Information Retrieved from GitHub]
%% merge %%
openalex --> core-dataset[Metadata Enriched Article-Repository Dataset]
gh-api --> core-dataset
%% author-developer matching %%
subgraph admatching[" "]
direction TB
author1[Author] <-..-> contrib1[Code Contributor]
author2[Author] <--> |Match| contrib1
author3[Author] <-..-> contrib1
author4[Author] <-..-> contrib1
author1 <--> |Match| contrib2[Code Contributor]
author2 <-..-> contrib2
author3 <-..-> contrib2
author4 <-..-> contrib2
linkStyle 11,14 stroke:green,color:green;
end
core-dataset --> |Match Authors and Developers in Each Article-Repository Pair| admatching
admatching --> |Store Data for Articles, Repositories, Authors, Code Contributors, Matched Individuals| final-dataset[("Final Dataset 'rs-graph-v1'")]
style admatching fill: lightgrey
Code Contribution and Credit in Science
Software development has become essential to scientific research, but its relationship to traditional metrics of scholarly credit remains poorly understood. We develop a dataset of approximately 140,000 paired research articles and code repositories, and a predictive model that matches research article authors with software repository developer accounts. We use this dataset to investigate how software development activities influence credit allocation in collaborative scientific settings. Our findings reveal significant patterns distinguishing software contributions from traditional authorship credit. We find that ~30% of articles include non-author code contributors—individuals who participated in software development but received no authorship recognition. While code-contributing authors are associated with a ~4.2% increase in article citations, this effect becomes non-significant when controlling for domain, article type, and open access status. First authors are significantly more likely to be code contributors than other author positions. Notably, we identify a negative relationship between coding frequency and scholarly impact metrics. Authors who contribute code more frequently exhibit progressively lower h-indices than non-coding colleagues, even when controlling for publication count, author position, domain, and article type. These results suggest a disconnect between software contributions and credit, highlighting important implications for institutional reward structures and science policy.
scientific software, research software, authorship, contributorship, credit
1 Introduction
Recent advances in genomic sequencing, climate modeling, particle physics, and neuroimaging have all required the development of novel scientific software (Hocquet et al. 2024). In many ways, software development has changed how scientific work is organized and performed, but we lack large-scale quantitative evidence characterizing these effects. This lack of evidence has important consequences. Scientific institutions continue to rely heavily on bibliometric indicators to evaluate research productivity (Haustein and Larivière 2014), but software is rarely mentioned or cited in research publications (Du et al. 2021). As a result, researchers who invest substantial effort in building computational tools may face systematic disadvantages in career advancement. We argue that understanding whether and how software contributions translate into formal scientific credit is important for designing equitable reward structures and evidence-based science policy.
The challenge of measuring software’s role in science stems, in part, from the historical separation of code from publication. Scientific articles and their metadata (e.g., author lists, citations, and acknowledgments) can provide a valuable tool for estimating productivity, tracing collaboration, or inferring impact (Fortunato et al. 2018). However, software contributions often remain invisible in these records, making it difficult to quantify their relationship to traditional markers of scientific productivity (Weber and Thomer 2014). As a result, studies regarding the maintainers and contributors of scientific software have relied primarily on self-reports (surveys) (Carver et al. 2022) or participant observation (ethnographies) (Paine and Lee 2017), which provide valuable insights but are limited in scope and generalizability.
In the following paper, we overcome this challenge by constructing a dataset (rs-graph-v1) of 138,596 research articles and their associated code repositories. Using this data we then develop a machine learning model to match article authors with repository contributor accounts. By connecting papers to repositories, and authors to developers, we are able to observe both the presence and effect of code contributions across different authorship positions and examine how coding activity relates to both article-level citations and career-level productivity metrics (e.g., h-index). This approach enables us to address fundamental questions about the scientific reward system: How do software contributions correspond to authorship positions? Does coding effort translate into citation impact? How do patterns of code contribution relate to long-term career trajectories?
In analyzing a filtered subset of the rs-graph-v1 dataset, we identify three patterns that distinguish code contributions from other forms of scientific recognition. First, we show that 28.6% (n=5,694) of articles have non-author code-contributors, or individuals who may have helped create the software but received no formal authorship credit. We also show that code-contributing authors are associated with a modest increase in article-level impact metrics (on average, a 4.2% increase in citations per code-contributing author), but these effects become statistically non-significant when controlling for domain, article type, and open access status. Second, we find that first authors are significantly more likely to be code contributors than other author positions across all conditions tested. Third, we find a negative relationship between coding frequency and scholarly impact: Authors who contribute code more frequently show progressively lower h-indices than their non-coding peers, a pattern that persists even when controlling for publication count and the author’s most common author position (first, middle, or last), domain (Physical Sciences, Health Sciences, Life Sciences, or Social Sciences), and article type (preprint, research article, or software article)1.
1 We use the phrase “most common” to mean most frequent and, in the case of a tie, most recent. That is, if an author has been first author on four publications in our dataset, middle author on two publications, and has never been last author, they are considered a “first author” for the purposes of our analysis. We discuss the limitations of this approach in Section 5.
The remainder of this paper proceeds as follows: First, we review related work regarding software development and the recognition of code contributors in scientific credit systems. In doing so, we motivate three specific hypotheses that guide our investigation. Next, we provide an overview of our data and methods, describing how we linked articles to repositories, trained and evaluated a predictive model for entity matching, and applied this model across each article-repository pair. We then present our analysis, focusing on article-level dynamics before moving to individual-level (career) patterns, formally accepting or rejecting each hypothesis based on our findings.
2 Background
The proper allocation of credit has been a longstanding topic in social studies of science. Merton’s description of a “Matthew Effect” is perhaps the most famous account of a cumulative advantage in publishing, where established scientists receive more attention for work of similar importance or value than their less established peers (Merton 1968). Contemporary studies of science continue to demonstrate important aspects of the dilemma. Dashun Wang et al. show that random career outcomes and the “hot streak” phenomenon suggest that cumulative advantage and early career success may be less predictive of long-term impact than traditionally assumed, complicating how we understand the accrual of scientific credit over time (Wang, Song, and Barabási 2013; L. Liu et al. 2018). Separately, work on team science demonstrates that larger teams receive disproportionate credit compared to smaller teams producing equally impactful work, further illustrating how credit allocation deviates from underlying contributions (Wu, Wang, and Evans 2019).
In quantitative work, the attribution of credit is most often established using proxies in bibliographic data like author order (Zuckerman 1968; Sarsons 2017; Sauermann and Haeussler 2017), which is an imperfect means of assigning credit, and doing so may exacerbate existing inequalities (West et al. 2013). More recent work on algorithms to assign credit (Shen and Barabási 2014) and on formally documenting contributor roles (Allen et al. 2014) aims to improve upon author order as the status quo mechanism for assigning credit in collaborative settings. Further, the use of contributor statements as data has provided important insights into the actual division of labor in scientific collaborations and revealed systematic patterns in how different types of contributions are valued and recognized across fields (Lu et al. 2020, 2022).
Among the contributions systematically undervalued in collaborative science are technical and infrastructural contributions, particularly software development. While contributor role taxonomies have begun to make visible the diverse forms of labor that constitute scientific work, these systems still struggle to adequately recognize the significance of technical contributions that enable research, but may not fit traditional notions of intellectual authorship (Larivière, Pontille, and Sugimoto 2020). Even less is known about the broader patterns of attention that software-producing research teams receive, and under what conditions technical contributions achieve visibility in an attention economy driven by citations. We argue that understanding these patterns requires examining how team composition and software contributions relate to scientific attention at scale.
Next, we build upon these general findings to motivate three specific hypotheses about code contributions and credit that we then test using the rs-graph-v1 dataset.
2.1 H1: Research Team Composition and Scientific Attention
In collaborative settings, experimental and theoretical research often receives more citations than methods-focused contributions, with the exception being when methods represent foundational shifts in a field (Aksnes 2006; X. Liu, Zhang, and Li 2023; Chen et al. 2024). Software is often positioned as a methodological contribution and, as a result, can be found in some of the highest-cited papers of the 21st century (Jin et al. 2015; Hampton et al. 2013; Hasselbring et al. 2024).
Prior work also establishes a positive relationship between team size and citation count, where larger and more diverse teams produce higher-impact scientific work (Franceschet and Costantini 2010; Larivière et al. 2014; Yoo et al. 2024). Research in empirical software studies similarly finds that larger development teams tend to create more reliable software with fewer defects (Herbsleb and Mockus 2003), though this comes at the expense of slower development cycles. These findings suggest that team size may be particularly important in scientific software development, where technical robustness and reproducibility remain gold standards (Milewicz and Raybourn 2018).
We argue that the unique characteristics of scientific software development — including implementing novel algorithms, requiring deep domain knowledge, and an increased emphasis on reproducibility (Muna et al. 2016; Howison and Herbsleb 2013) — make team composition especially important for understanding credit allocation. Software development in organized teams may enhance scientific impact through multiple mechanisms: teams can produce more robust and generalizable software tools for methodological contributions while enabling more sophisticated computational analyses and larger-scale data processing for experimental work.
Given these patterns in team dynamics, software development practices, and citations in collaborative research, we propose that:
H1: The number of individuals contributing code to a publication’s associated repository positively correlates with the article’s citation count.
2.3 H3: Code Contribution and Individual Scientific Impact
Despite the increasingly central role of software in science, researchers who dedicate significant effort to its development often face systemic hurdles in receiving formal scientific credit. Their contributions may be relegated to acknowledgment sections rather than rewarded with authorship credit (Weber and Thomer 2014). Further, the scholarly practice of software citation remains inconsistent, frequently undervaluing crucial maintenance and extension work (Carver et al. 2022; Philippe et al. 2019; Lamprecht et al. 2020; Katz et al. 2020; Smith, Katz, and Niemeyer 2016). The h-index, a widely used proxy for an individual’s cumulative scientific impact, is derived from an individual’s record of formally authored publications and the citations these publications receive (Hirsch 2005). Consequently, if substantial time and intellectual effort are invested in software development that does not consistently translate into formal authorship on associated research papers, or if the software outputs themselves are not robustly and formally cited in a way that accrues to the individual developer, then the primary activities that build a researcher’s h-index are effectively diminished or bypassed.
This creates a structural misalignment where contributions essential for scientific advancement (i.e., software development and maintenance) may not adequately capture and could even detract from time spent on activities that bolster traditional bibliometric indicators of individual success. While collaborative software development may yield short-term benefits through increased citations on individual papers (as suggested by H1), researchers specializing in code development may face long-term career disadvantages as their expertise becomes increasingly divorced from traditional publication pathways that drive academic recognition and advancement. Based on these challenges in the recognition and citation of software contributions and their potential impact on h-index accumulation, we hypothesize:
H3: The frequency with which individual researchers contribute code to their research projects negatively correlates with their h-index.
3 Data and Methods
We examine the relationship between software contributions and scientific credit through a three-step process: (1) building a dataset of linked scientific articles and code repositories; (2) developing a predictive model to match article authors with developer accounts; and, (3) analyzing patterns in these relationships.
Our dataset integrates article-repository pairs from four sources, each with explicit mechanisms for code sharing: the Journal of Open Source Software (JOSS) and SoftwareX require code repositories as part of publication, Papers with Code directly connects preprints from ArXiv with software implementations, and Public Library of Science (PLOS) articles include mandatory data and code availability statements that we mined for repository links. We focused exclusively on GitHub-hosted repositories, which represent the predominant platform for scientific software sharing (Cao et al. 2023; Escamilla et al. 2022). For each article in our corpus, we resolved the source DOI via Semantic Scholar to ensure we captured its latest version and then extracted publication metadata and author metrics through OpenAlex. Finally, we collected information about each code repository and the repository’s contributors via the GitHub API. A data collection and processing workflow diagram is available in Figure 1. All data was collected between October and November 20242.
2 The Journal of Open Source Software (JOSS) is available at https://joss.theoj.org/. SoftwareX articles are available at https://www.sciencedirect.com/journal/softwarex/ and SoftwareX repositories are available at https://github.com/ElsevierSoftwareX/. Public Library of Science (PLOS) is available at https://plos.org/ and article data is available via the allofplos Python package (https://github.com/PLOS/allofplos/). Papers with Code is available at https://paperswithcode.com/ and data for links between papers and code is available at https://paperswithcode.com/about/. Documentation for the Semantic Scholar API is available at https://api.semanticscholar.org/, documentation for the OpenAlex API is available at https://docs.openalex.org/, and documentation for the GitHub API is available at https://docs.github.com/en/rest/.
3 Papers with Code (arXiv) preprints likely include some number of software articles, but without a clear and consistent mechanism to identify these articles, we are unable to label them as such. We discuss this limitation further in Section 5. Further, while some PLOS, JOSS, and SoftwareX articles may also have a preprinted version, we select and analyze the published version of the article when one is available at the time of data processing. This is made possible via the DOI resolution process using Semantic Scholar.
4 OpenAlex documentation for concepts, topics, fields, and domains is available at https://help.openalex.org/hc/en-us/articles/24736129405719-Topics/
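To make the enrichment step concrete, the following is a minimal sketch of how a single DOI can be resolved through the Semantic Scholar Graph API and then looked up in OpenAlex. The selected fields and the example DOI are illustrative rather than the exact queries used in our pipeline.

```python
import requests

# Minimal sketch of the article enrichment step described above; the field
# selections and example DOI are illustrative, not our exact pipeline queries.

def resolve_doi_semantic_scholar(doi: str) -> dict:
    """Resolve a DOI via the Semantic Scholar Graph API to find the latest record."""
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
        params={"fields": "title,externalIds,publicationTypes"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def fetch_openalex_work(doi: str) -> dict:
    """Retrieve work-level metadata (authorships, open access status, topics) from OpenAlex."""
    resp = requests.get(f"https://api.openalex.org/works/doi:{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    example_doi = "10.21105/joss.01035"  # hypothetical example DOI
    paper = resolve_doi_semantic_scholar(example_doi)
    work = fetch_openalex_work(example_doi)
    print(paper.get("title"), work.get("cited_by_count"))
```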
Articles collected from JOSS and SoftwareX were labeled as “software articles,” articles obtained from PLOS were labeled as “research articles,” and articles from Papers with Code were labeled as either “preprint” or “research article” based on their document type from OpenAlex. While Papers with Code data is largely tied to ArXiv preprints, because of our DOI resolution process, we were able to identify a subset of these articles that had been published in a journal and were thus labeled as “research articles” by OpenAlex3. Article domain classifications (e.g., Health Sciences, Life Sciences, Physical Sciences, and Social Sciences) were obtained from OpenAlex. While each article can be associated with multiple “topics”, we utilize the primary topic and its associated domain for our analyses4.
To match article authors with repository developer accounts, we developed a machine-learning approach using transformer-based architectures. Specifically, we use transformers to overcome entity matching challenges such as when exact name or email matching is insufficient due to formatting variations and incomplete information (e.g., “J. Doe” vs. “Jane Doe” in publications or use of institutional versus personal email addresses). When exact matching fails, there is typically high semantic overlap between an author’s name and developer account details (i.e., username, name, and email) that transformer models can leverage. We created a gold-standard dataset of 2,999 annotated author-developer account pairs from JOSS publications, where two independent reviewers classified each pair as matching or non-matching. After systematic evaluation of three transformer architectures with various feature combinations, our optimal model (fine-tuned DeBERTa-v3-base including developer account username and display name in the training data) achieved a binary F1 score of 0.944, with 0.938 precision and 0.95 recall (positive = “match”)5. Applying our model across all article-repository pairs yielded a large-scale dataset linking scientific authors and code contributors.
5 A detailed comparison of models and feature sets, and an evaluation of model performance across non-JOSS author-developer pairs is made available in the Appendix.
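As an illustration of how the released classifier can be queried, the sketch below loads the published model with the Hugging Face transformers pipeline and scores a hypothetical author-developer pair. The exact input formatting used during fine-tuning may differ; the sci-soft-models inference library released alongside the dataset wraps the preprocessing actually used.

```python
from transformers import pipeline

# Illustrative sketch of author-developer matching inference with the released
# classifier; the pairing format below is an assumption, and the sci-soft-models
# library wraps the preprocessing actually used during fine-tuning.
clf = pipeline("text-classification", model="evamxb/dev-author-em-clf")

author_name = "Jane Doe"                        # hypothetical author name
dev_username, dev_name = "jdoe-lab", "J. Doe"   # hypothetical developer account details

result = clf({"text": author_name, "text_pair": f"{dev_username} {dev_name}"})
print(result)  # e.g. [{'label': 'match', 'score': ...}] -- label names are model-specific
```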
Our complete dataset, named the “Research Software Graph” (rs-graph-v1), contains 163,292 article-repository pairs. However, 24,696 article-repository pairs form many-to-many relationships in which the same article is linked to multiple repositories or the same repository is linked to multiple articles. The median number of repositories per article is 1 (95th percentile: 1, 97th percentile: 2, 99th percentile: 2). The median number of articles per repository is 1 (95th percentile: 1, 97th percentile: 2, 99th percentile: 2).
To ensure that our analysis focuses on clear, unambiguous relationships between articles and repositories, prior to analysis we utilize a partial rs-graph-v1 dataset which filters out any articles or repositories associated with multiple article-repository pairs. This filtering step removes 24,696 article-repository pairs. As shown in Table 1, the complete one-to-one article-repository dataset contains 138,596 unique article-repository pairs spanning multiple domains and publication types (2,336 unique articles, repositories, and pairs from JOSS, 6,090 from PLOS, 129,615 from Papers with Code, and 555 from SoftwareX). Additionally, the one-to-one dataset includes information for 295,806 distinct authors and 152,170 developer accounts. The one-to-one dataset includes 90,086 author-developer account relationships, creating a unique resource for investigating code contribution patterns in scientific teams. This dataset enables systematic examination of how software development work relates to scientific recognition and career trajectories. The complete rs-graph-v1 dataset (https://doi.org/10.7910/DVN/KPYVI1), the trained model (https://huggingface.co/evamxb/dev-author-em-clf), and a supporting inference library for using the model (sci-soft-models) are made publicly available to support further research.
| Category | Subset | Article-Repository Pairs | Authors | Developers |
|---|---|---|---|---|
| By Domain | Health Sciences | 5,172 | 25,979 | 7,248 |
| | Life Sciences | 7,729 | 31,649 | 12,150 |
| | Physical Sciences | 116,600 | 240,545 | 130,592 |
| | Social Sciences | 8,838 | 29,269 | 14,043 |
| By Document Type | preprint | 72,177 | 170,301 | 87,311 |
| | research article | 63,528 | 173,183 | 78,935 |
| | software article | 2,891 | 9,294 | 12,868 |
| By Access Status | Closed | 5,740 | 23,668 | 9,352 |
| | Open | 132,856 | 286,874 | 147,831 |
| By Data Source | joss | 2,336 | 7,105 | 11,362 |
| | plos | 6,090 | 30,233 | 8,784 |
| | pwc | 129,615 | 262,889 | 134,926 |
| | softwarex | 555 | 2,244 | 1,628 |
| Total | | 138,596 | 295,806 | 152,170 |
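The one-to-one filtering step described above can be expressed compactly; the sketch below assumes a table of pairs with illustrative column names and keeps only articles and repositories that appear in exactly one pair.

```python
import pandas as pd

# Sketch of the one-to-one filtering step; file and column names are illustrative.
pairs = pd.read_parquet("rs-graph-v1-pairs.parquet")  # hypothetical table of article-repository pairs

article_counts = pairs["article_doi"].value_counts()
repo_counts = pairs["repo_url"].value_counts()

one_to_one = pairs[
    pairs["article_doi"].map(article_counts).eq(1)
    & pairs["repo_url"].map(repo_counts).eq(1)
]
print(len(pairs) - len(one_to_one), "many-to-many pairs removed")  # 24,696 in rs-graph-v1
```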
5 Discussion
Our analysis reveals significant disparities in the recognition of software contributions to scientific research, with ~28.6% (n=5,694) of articles having code-contributors not matched to any article author, partially representing unrecognized code contribution. This pattern, which persists across time and team sizes, suggests a systemic disconnect between software development and scientific recognition systems, reflecting challenges in how scientific contributions are valued and credited. This exclusion reflects what Shapin (1989) observed about scientific authority—the selective attribution of technical work as either genuine knowledge or mere skill significantly impacts who receives formal recognition. These findings further support previous surveys (Philippe et al. 2019; Carver et al. 2022) in which scientists report the relegation of software contributors to acknowledgment sections or to receiving no credit at all. The stability of this pattern over time indicates that this phenomenon has embedded itself in scientific software development rather than representing a transitional phase, raising questions about scientific labor and how reward structures integrate technical contributions.
Our finding that, on average, article-repository pairs have only a single code contributor mirrors prior work from Färber (2020). Further, the distribution of code contributions across author positions provides context to the hierarchical organization of scientific work. First authors are significantly more likely to contribute code, with 69.8% of all first authors in our dataset contributing code. Middle and last authors, meanwhile, were statistically significantly less likely to contribute code, with only 9.7% of middle authors and 7.6% of last authors acting as code-contributing members of the research team. Corresponding authors were similarly less likely than expected to be code contributors, as we found that, within our dataset, corresponding authors were code contributors 29.7% of the time. These patterns align with traditional scientific labor distribution, where first authors might be expected to handle technical aspects of research while middle and last authors are likely specialist contributors or provide guidance and oversight (Larivière, Pontille, and Sugimoto 2020; Sauermann and Haeussler 2017). However, our data did not support our initial hypothesis that corresponding authors would also be more likely to contribute code due to their shared responsibility for the long-term maintenance of research artifacts. This finding suggests a potential strict division between project management responsibilities and direct technical engagement with software development.
The modest citation advantage associated with code-contributing authors (~4.2% increase in citations per code-contributing author) stands in contrast with the significant negative relationship between coding frequency and an individual’s scholarly impact (h-index). This misalignment between code contributions and scientific recognition creates an asymmetrical relationship in which software development may enhance research impact but potentially penalizes individual careers. The progressive reduction in h-index as coding frequency increases indicates a cumulative disadvantage for frequent code contributors. This pattern persists even when controlling for publication count, suggesting issues in how software contributions are valued relative to other scientific outputs. These findings echo concerns raised by Muna et al. (2016) about the sustainability of research software development and highlight how current reward structures may discourage talented developers from pursuing scientific careers.
Software development represents a form of scholarly labor that has become increasingly essential to modern research yet remains incompletely integrated into formal recognition systems. Similar to the high proportion of articles with authors who made data curation contributions towards research observed by Larivière, Pontille, and Sugimoto (2020), our finding that more than a quarter of papers have unacknowledged code contributors highlights a labor role that is simultaneously common and undervalued. The prevalence of code contributions across domains demonstrates the importance of this work to contemporary research. However, the persistent exclusion of contributors from authorship suggests that researchers continue to classify software development as technical support rather than intellectual contribution. This classification may reflect disciplinary traditions that privilege certain forms of scholarly production despite the growing recognition that software represents a legitimate research output (Katz et al. 2020). The tension between software’s importance and contributors’ recognition status raises questions about how we define, value, and reward different forms of scientific labor in an increasingly computational research landscape.
5.1 Limitations
Our data collection approach introduces several methodological constraints to consider when interpreting these results. By focusing exclusively on GitHub repositories, we likely miss contributions stored on alternative platforms such as GitLab, Bitbucket, or institutional repositories, potentially skewing our understanding of contribution patterns. As Trujillo, Hébert-Dufresne, and Bagrow (2022), Cao et al. (2023), and Escamilla et al. (2022) have all noted, while GitHub is the predominant host of scientific software, significant portions of research code exist on other platforms. Additionally, our reliance on public repositories means we cannot account for private repositories or code that was never publicly shared, potentially underrepresenting sensitive research areas or proprietary methods.
Further, while our data processing workflow began with ~60,000 possible article-repository pairs from PLOS, ~3,000 from JOSS, ~700 from SoftwareX, and ~180,000 from Papers with Code for a possible total of ~243,70012, rs-graph-v1 contained a total of 163,292 article-repository pairs. Many of the possible article-repository pairs were filtered out due to the linked repository not being accessible, or the bibliometric metadata not being available.
12 Article-repository pair counts are approximate because there are no snapshots of these databases at a single point in time. The estimates are based on counts from each data source taken in October 2025.
Our labeling of article types (software article, research article, preprint) was based on the data source (PLOS, JOSS, SoftwareX, Papers with Code) and, in the case of Papers with Code articles, our DOI resolution process and the document type available from OpenAlex. This approach may misclassify certain articles, especially those from Papers with Code (arXiv). One potential alternative approach would involve classification of the repository itself following the recommendations of Hasselbring et al. (2025) in breaking down repositories by their role in research (e.g., “Modeling, Simulation, and Data Analysis”, “Technology Research Software”, and “Research Infrastructure Software”). This classification would allow us to investigate not only the differences between “software papers,” “research articles,” and “preprints” (we believe the latter two would typically be paired with “Modeling, Simulation, and Data Analysis” repositories), but also the purpose of the code as it relates to the research. However, there is currently no established automated method for performing this classification at scale.
Similarly, our simplification of author positions, domains, and article types to each author’s “most common” (most frequent, with ties broken by most recent) introduces potential biases (Li, Rousseau, and Jia 2025). This reduction may obscure the diversity of an author’s contributions across different contexts, particularly for interdisciplinary researchers or those with varied roles in different projects. Further, these labels are created from metadata for articles only within our dataset. That is, even though a researcher may have dozens of articles, their “most common” author position, domain, and article type were determined using only the article-repository pairs in our dataset. This inherently biases the dataset towards research teams who, as a collective, frequently create and share software and code as a part of their research process. While this approach was necessary for managing the complexity of our analysis, it may not fully capture the nuances of individual research careers.
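As a concrete illustration of the “most common” reduction, the sketch below picks each author’s most frequent label and breaks ties by the most recent publication date; the column names and example data are illustrative.

```python
import pandas as pd

# Sketch of the "most common" label: most frequent value per author, with ties
# broken by the most recent publication. Column names and data are illustrative.
records = pd.DataFrame({
    "author_id": ["A1", "A1", "A1"],
    "author_position": ["first", "middle", "first"],
    "publication_date": pd.to_datetime(["2020-01-01", "2021-06-01", "2022-03-01"]),
})

def most_common(group: pd.DataFrame, col: str) -> str:
    counts = group.groupby(col)["publication_date"].agg(["size", "max"])
    # Sort by frequency, then by most recent occurrence, and take the top label.
    return counts.sort_values(["size", "max"], ascending=False).index[0]

labels = records.groupby("author_id").apply(most_common, col="author_position")
print(labels)  # A1 -> "first"
```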
Our predictive modeling approach for matching authors with developer accounts presents additional limitations. The model’s performance can be affected by shorter names where less textual information is available for matching, potentially creating biases against researchers from cultures with shorter naming conventions. Organization accounts used for project management pose particular challenges for accurate matching, and while we implemented filtering mechanisms to minimize their impact, some misclassifications may persist. Furthermore, our approach may not capture all code contributors if multiple individuals developed code but only one uploaded it to a repository—creating attribution artifacts that may systematically underrepresent specific contributors, particularly junior researchers or staff who may not have direct repository access. However, as discussed further in the Appendix (Section 12.1.3), our dataset is relatively diverse, with the median preprint-repository pair having a commit duration (the number of days between the repository’s creation and the repository’s most recent commit) of 53 days, research article-repository pairs having a median commit duration of 114 days, and software article-repository pairs having a median commit duration of 282 days. This diversity in commit durations suggests that our dataset contains a range of development practices, including some “code dumps,” as well as year-long and multi-year projects.
Our analytical approach required substantial data filtering to ensure reliable results, introducing potential selection biases in our sample. By focusing on article-repository pairs with commit activity no later than 90 days past the date of article publication and at least three and fewer than 11 authors, we may have systematically excluded certain types of research projects, particularly those with extended development timelines or extensive collaborations. Our categorization of coding status (non-coder, any coding, majority coding, always coding) necessarily simplifies complex contribution patterns. It does not account for code contributions’ quality, complexity, or significance. Additionally, our reliance on OpenAlex metadata introduces certain limitations to our analysis. While OpenAlex provides good overall coverage, it lags behind proprietary databases in indexing references and citations. The lag in OpenAlex data may affect our citation-based analyses and the completeness of author metadata used in our study (Alperin et al. 2024).
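For clarity, the sketch below expresses the filters and the coding-status categories described above; the column names and the exact category cutoffs are assumptions about a reasonable implementation rather than verbatim analysis code.

```python
import pandas as pd

# Sketch of the analysis filters and coding-status bins described above.
# File name, column names, and exact cutoffs are illustrative assumptions.
pairs = pd.read_parquet("analysis-pairs.parquet")  # hypothetical enriched pair table

analysis_pairs = pairs[
    pairs["n_authors"].between(3, 10)                      # at least 3, fewer than 11 authors
    & (pairs["days_last_commit_after_publication"] <= 90)  # no commits later than 90 days post-publication
]

def coding_status(code_share: float) -> str:
    """Map the fraction of an author's papers with code contributions to a category."""
    if code_share == 0:
        return "non-coder"
    if code_share == 1:
        return "always coding"
    if code_share > 0.5:
        return "majority coding"
    return "any coding"
```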
5.2 Future Work
Future technical improvements may enhance our understanding of the relationship between software development and scientific recognition systems. Expanding analysis beyond GitHub to include other code hosting platforms would provide a more comprehensive understanding of scientific software development practices across domains and institutional contexts. More sophisticated entity-matching techniques could improve author-developer account identification, particularly for cases with limited information or common names. Developing more nuanced measures and classifications of code contribution type, quality, and significance beyond binary contribution identification would better capture the true impact of technical contributions to research (as we have started to do in Section 12.1.2.2). These methodological advances would enable more precise tracking of how code contributions translate—or fail to translate—into formal scientific recognition, providing clearer evidence for policy interventions.
Our findings point to several directions for future research on the changing nature of scientific labor and recognition. Longitudinal studies tracking how code contribution patterns affect career trajectories would provide valuable insights into the long-term impacts of the observed h-index disparities and whether these effects vary across career stages. Comparative analyses across different scientific domains could reveal discipline-specific norms and practices around software recognition, potentially identifying models that more equitably credit technical contributions. Qualitative studies examining how research teams make authorship decisions regarding code contributors would complement our quantitative findings by illuminating the social and organizational factors influencing recognition practices. Additionally, to better understand corresponding authors’ role in maintaining research artifacts, future work could remove the 90-day post-publication commit activity filter to examine long-term sustainability actions. However, this approach would need to address the possibility of introducing contributors unrelated to the original paper.
Despite their growing importance, the persistent under-recognition of software contributions suggests a need for structural interventions in how we conceptualize and reward scientific work. Building upon efforts like CRediT (Brand et al. 2015), future work should investigate potential policy changes to better align institutional incentives with the diverse spectrum of contributions that drive modern scientific progress. However, as the example of CRediT demonstrates, even well-intentioned taxonomies may reproduce existing hierarchies or create new forms of inequality if they fail to address underlying power dynamics in scientific communities. The challenge is not merely technical but social: creating recognition systems that simultaneously support innovation, ensure appropriate credit, maintain research integrity, and foster equitable participation in an increasingly computational scientific enterprise.
6 Acknowledgments
We thank our anonymous reviewers for their helpful feedback. We especially thank Molly Blank, Shahan Ali Memnon, David Farr, and Jevin West for their valuable insights at multiple stages of this research.
8 Competing Interests
The authors have no competing interests.
9 Funding Information
This research was supported by grants from the National Science Foundation’s (NSF) Office of Advanced Cyberinfrastructure (OAC-2211275) and the Sloan Foundation (G-2022-19347).
10 Data Availability
The software and code used to train, evaluate, and apply the author-developer account matching model is available at: https://github.com/evamaxfield/sci-soft-models (https://doi.org/10.5281/zenodo.17401863). The software and code used to gather, process, and analyze the dataset of linked scientific articles and code repositories and their associated contributors and metadata is available at: https://github.com/evamaxfield/rs-graph (https://doi.org/10.5281/zenodo.17401960). The compiled rs-graph-v1 dataset, as well as data required for model training and evaluation, is available at: https://doi.org/10.7910/DVN/KPYVI1. Certain portions of the compiled rs-graph-v1 dataset, and the model training and evaluation datasets contain linked personal data (e.g., author names, developer account usernames) and are only available by request.
11 References
12 Appendix
12.1 Extended Data and Methods
12.1.1 Building a Dataset of Linked Scientific Articles and Code Repositories
The increasing emphasis on research transparency has led many journals and platforms to require or recommend code and data sharing (Stodden, Guo, and Ma 2013; Sharma et al. 2024), creating traceable links between publications and code. These explicit links enable systematic study of both article-repository and author-developer account relationships (Hata et al. 2021; Kelley and Garijo 2021; Stankovski and Garijo 2024; Milewicz, Pinto, and Rodeghero 2019).
Our dataset collection process leveraged four sources of linked scientific articles and code repositories, each with specific mechanisms for establishing these connections:
- Public Library of Science (PLOS): We extracted repository links from PLOS articles’ mandatory data and code availability statements.
- Journal of Open Source Software (JOSS): JOSS requires explicit code repository submission and review as a core part of its publication process.
- SoftwareX: Similar to JOSS, SoftwareX mandates code repositories as a publication requirement.
- Papers with Code: This platform directly connects machine learning preprints with their implementations. We focus solely on the “official” article-repository relationships rather than the “unverified” or “unofficial” links.
We enriched these article-repository pairs with metadata from multiple sources to create a comprehensive and analyzable dataset. We utilized the Semantic Scholar API for DOI resolution to ensure we found the latest version of each article. This resolution step was particularly important when working with preprints, as journals may have published these papers since their inclusion in the Papers with Code dataset. Using Semantic Scholar, we successfully resolved ~56.3% (n=78,021) of all DOIs within our dataset13.
13 Broken out by dataset source, we resolved ~2.1% (n=125) of all PLOS DOIs, ~4.0% (n=93) of all JOSS DOIs, ~0.0% (n=0) of all SoftwareX DOIs, and ~49.2% (n=63,817) of all Papers with Code (arXiv) DOIs.
We then utilized the OpenAlex API to gather detailed publication metadata, including:
- Publication characteristics (open access status, domain, publication date)
- Author details (name, author position, corresponding author status)
- Article- and individual-level metrics (citation counts, FWCI, h-index)
Similarly, the GitHub API provided comprehensive information for source code repositories (a request sketch follows this list):
- Repository metadata (name, description, programming languages, creation date)
- Contributor details (username, display name, email)
- Repository-level metrics (star count, fork count, issue count)
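The sketch below illustrates the repository-side retrieval with the GitHub REST API; the token, owner, and repository names are placeholders, and contributor display names and emails require an additional per-user lookup.

```python
import os
import requests

# Illustrative sketch of the repository enrichment step via the GitHub REST API.
# GITHUB_TOKEN, owner, and repo values are placeholders.
session = requests.Session()
session.headers.update({"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"})

def fetch_repo_metadata(owner: str, repo: str) -> dict:
    """Repository metadata: description, language, creation date, stars, forks, issues."""
    resp = session.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def fetch_contributors(owner: str, repo: str) -> list:
    """Contributor accounts; display name and email require follow-up /users/{login} calls."""
    resp = session.get(
        f"https://api.github.com/repos/{owner}/{repo}/contributors",
        params={"per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```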
12.1.3 Dataset Characteristics and Repository Types
Our compiled dataset appears to contain a mix of repository types, varying from analysis script repositories to software tools and likely some “code dumps” (where code is copied to a new repository immediately before publication). This diversity is reflected in the commit duration patterns across different publication types. The median commit duration for repositories in our analysis is:
- 53 days for preprints
- 114 days for research articles
- 282 days for software articles
Complete statistics on commit durations, including count, mean, and quantile details, are available in Table 4.
| article_type | count | mean | std | min | 10% | 25% | 50% | 75% | 90% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| preprint | 2683 | 110 | 182 | -1520 | 0 | 6 | 53 | 138 | 285 | 2091 |
| research article | 17017 | 193 | 253 | -931 | 0 | 19 | 114 | 269 | 487 | 3176 |
| software article | 200 | 394 | 475 | -1 | 0 | 50 | 282 | 536 | 951 | 3007 |
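The commit duration metric above is simply the number of days between two repository timestamps; the sketch below uses the GitHub created_at and pushed_at fields as stand-ins for repository creation and the most recent commit, which is an approximation. Negative values, as in the minima of Table 4, can occur when imported commit history predates the repository's creation on GitHub.

```python
from datetime import datetime, timezone

# Sketch of the commit duration metric: days between repository creation and the most
# recent commit. GitHub's "created_at"/"pushed_at" timestamps stand in for those events.
def commit_duration_days(created_at: str, last_commit_at: str) -> int:
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # GitHub API timestamp format
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    last = datetime.strptime(last_commit_at, fmt).replace(tzinfo=timezone.utc)
    return (last - created).days

print(commit_duration_days("2020-01-01T00:00:00Z", "2020-02-23T12:00:00Z"))  # 53
```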
12.4 Additional Analysis of Code-Contributing Non-Authors
12.4.1 Sample Construction and Labeling
To better understand the nature and extent of contributions made by code-contributing non-authors in our dataset, we conducted a detailed analysis using a random sample of 200 individuals from our filtered dataset used in other analyses (Table 5). Our analysis combined qualitative labeling of contribution types with quantitative analysis of commit activity, additions, and deletions made by these contributors to their respective article-repository pairs.
12.4.1.1 Annotation Process
Two independent annotators labeled each of the 200 code-contributing non-authors across three dimensions after completing two rounds of trial labeling on 20 cases to establish agreement. The final labeling criteria were:
Contribution Type:
- “docs”: Contributors who only modified documentation files (README, LICENSE, etc.) or made changes limited to code comments.
- “code”: Contributors who modified actual code files (.py, .R, .js, etc.) with substantive changes or modified code support files (requirements.txt, pyproject.toml, package.json, etc.). Contributors who made code and documentation changes were labeled “code.”
- “other”: Contributors whose changes did not fit the above categories, including those who committed to upstream forks or merged code without authoring it.
Author Matching Assessment:
- “yes”: Contributors who should have been matched to an author (missed classification).
- “no”: Contributors correctly classified as non-authors.
- “unclear”: Cases with insufficient information for determination.
Bot Account Detection:
- “yes”: Automated accounts (GitHub Actions, Dependabot, etc.).
- “no”: Human users.
After establishing near-perfect agreement for contribution type (κ=0.89) and perfect agreement for author matching assessment and bot account detection (κ=1.0), each annotator independently labeled 90 contributors; the final sample of 200 was created by combining both annotators' sets with the 20 cases used for criteria development.
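The agreement statistics above are standard Cohen's kappa scores; a minimal sketch of the computation, with illustrative labels, is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Minimal sketch of the inter-annotator agreement computation; labels are illustrative.
annotator_a = ["code", "code", "docs", "other", "code"]
annotator_b = ["code", "code", "docs", "code", "code"]
print(cohen_kappa_score(annotator_a, annotator_b))
```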
12.4.1.2 Quantitative Metrics
For each code-contributing non-author, we collected commit activity data using the GitHub API contributor stats endpoint:
- Number of Commits: The total number of commits made by the code contributor to the article-repository pair.
- Number of Additions: The total number of lines of code added by the code contributor to the article-repository pair.
- Number of Deletions: The total number of lines of code deleted by the code contributor to the article-repository pair.
- Number of Total Repository Commits: The total number of commits made to the article-repository pair, regardless of code contributor.
- Number of Total Repository Additions: The number of lines of code added to the article-repository pair, regardless of code contributor.
- Number of Total Repository Deletions: The total number of lines of code deleted from the article-repository pair, regardless of code contributor.
We additionally calculated the absolute change for each code contributor as the sum of additions and deletions, which provides a measure of the total impact of their contributions. Further, we normalized these metrics by the total number of commits, additions, deletions, and absolute changes made to the article-repository pair, regardless of code contributor. This normalization allows us to compare the relative contribution of each code-contributing non-author to the overall amount of changes to the repository.
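A sketch of these per-contributor metrics, using the GitHub stats/contributors endpoint, is shown below; the owner, repository, and token are placeholders, and the endpoint may return HTTP 202 while statistics are still being computed.

```python
import requests

# Illustrative sketch of the per-contributor commit/addition/deletion metrics and their
# normalization by repository totals (OWNER, REPO, and the token are placeholders).
resp = requests.get(
    "https://api.github.com/repos/OWNER/REPO/stats/contributors",
    headers={"Authorization": "Bearer <GITHUB_TOKEN>"},
    timeout=30,
)
resp.raise_for_status()
stats = resp.json()  # one entry per contributor with weekly additions (a), deletions (d), commits (c)

totals = {"commits": 0, "additions": 0, "deletions": 0}
per_contributor = {}
for entry in stats:
    login = entry["author"]["login"]
    metrics = {
        "commits": sum(week["c"] for week in entry["weeks"]),
        "additions": sum(week["a"] for week in entry["weeks"]),
        "deletions": sum(week["d"] for week in entry["weeks"]),
    }
    metrics["abs_change"] = metrics["additions"] + metrics["deletions"]
    per_contributor[login] = metrics
    for key in totals:
        totals[key] += metrics[key]

# Normalize each contributor's activity by repository-wide totals (as in Tables 7-9).
for metrics in per_contributor.values():
    metrics["commit_share"] = metrics["commits"] / max(totals["commits"], 1)
    metrics["abs_share"] = metrics["abs_change"] / max(totals["additions"] + totals["deletions"], 1)
```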
12.4.2 Results
We find that ~39% (n=78) of code-contributing non-authors were correctly classified as non-authors, ~30.5% (n=61) were unclear due to insufficient profile information, and ~30.5% (n=61) appeared to be missed classifications that should have been matched to authors. Only two accounts (~1%) were identified as bot accounts.
When broken out by contribution type, we find that:
- “true non-authors” (n=78): 59 contributed code, 13 contributed documentation, and 4 contributed some other type of change
- “missed classifications” (n=61): 49 contributed code, 12 contributed documentation, and 0 contributed some other type of change
- “unclear” (n=61): 50 contributed code, 8 contributed documentation, and 3 contributed some other type of change
Table 7 and Table 8 present commit statistics for true non-authors and unclear cases, respectively. Among true non-authors making code contributions, the top quartile (75th percentile and above) contributed ~10.7% of total repository commits and ~14.4% of absolute changes (additions + deletions). The unclear cases showed substantially higher contribution levels. Code contributors in this group comprised ~50.5% of total repository commits and ~41.7% of repository absolute changes, even at the 25th percentile.
| | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| commit_stats | 0.178 | 0.307 | 0.001 | 0.007 | 0.029 | 0.107 | 1.0 |
| addition_stats | 0.227 | 0.380 | 0.000 | 0.001 | 0.007 | 0.192 | 1.0 |
| deletion_stats | 0.193 | 0.358 | 0.000 | 0.000 | 0.006 | 0.149 | 1.0 |
| abs_stats | 0.222 | 0.379 | 0.000 | 0.001 | 0.010 | 0.144 | 1.0 |
| | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| commit_stats | 0.747 | 0.366 | 0.004 | 0.505 | 1.0 | 1.0 | 1.0 |
| addition_stats | 0.722 | 0.420 | 0.000 | 0.310 | 1.0 | 1.0 | 1.0 |
| deletion_stats | 0.737 | 0.417 | 0.000 | 0.722 | 1.0 | 1.0 | 1.0 |
| abs_stats | 0.733 | 0.404 | 0.000 | 0.417 | 1.0 | 1.0 | 1.0 |
We observed a notable pattern where very few true non-authors (n=4) were repository owners, while ~49.2% of unclear cases (n=30) owned the repositories they contributed to. This suggests that many unclear contributors were likely primary code authors who could not be matched due to limited profile information. When excluding repository owners from the unclear group (Table 9), the median contribution drops to ~34.6% of total commits and ~12.7% of absolute changes, though this still represents substantial technical involvement.
| | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| commit_stats | 0.454 | 0.399 | 0.004 | 0.041 | 0.346 | 0.841 | 1.0 |
| addition_stats | 0.380 | 0.443 | 0.000 | 0.016 | 0.094 | 0.917 | 1.0 |
| deletion_stats | 0.486 | 0.492 | 0.000 | 0.003 | 0.431 | 0.999 | 1.0 |
| abs_stats | 0.392 | 0.440 | 0.000 | 0.011 | 0.127 | 0.919 | 1.0 |
Our analysis provides evidence that code-contributing non-authors represent a heterogeneous group with varying contribution levels. While defining “substantial” contribution worthy of authorship remains challenging, our findings reveal a clear mix of legitimate non-authors and potentially missed classifications, with both groups often contributing meaningful portions of repository commits and code changes.
Our sample size of 200 limits generalizability to the full population of code-contributing non-authors. Additionally, the manual annotation process introduces potential subjectivity despite our established criteria, and our reliance on publicly available GitHub profiles may systematically underestimate contributions from developers with minimal profile information.
12.4.3 Qualitative Error Analysis of Missed Classifications
To better understand why certain code-contributing non-authors were missed classifications, we conducted a qualitative error analysis of the 61 contributors labeled as such. We identified several common themes:
- Limited Information from Text Alone: The original dataset for model training and evaluation was constructed using only text-based features from author names and developer information. However, for this extended examination, annotators utilized the full code-contributor profile, including linked websites or linked ORCID profiles. This was done because we wanted to understand the nature of missed classifications (with more time and information to make a classification) rather than strictly replicating the model’s text-only approach. From text alone, many of these missed classifications would have been very challenging to identify. This highlights a limitation in our current model, and a potential area for future work, such as incorporating details from linked websites or other contextual information to improve matching performance.
- Name Variations and Cultural Differences: The model performed better with Anglo-Saxon names, while names from other cultures were more likely to be missed. This suggests possible bias in the training data and a clear area for future work.
- Additional Unrelated Text in Names: When usernames or display names contained longer phrases or unrelated words, the model tended to classify them as no-match, even if there were strong indicators of a match. For example, a username such as “awesome_computational_biologist_john_d” paired with an author name “John Doe” might be missed due to the additional text in the username.
- Significant Differences Between Username and Author Name: The model struggled when there were substantial differences between the username and author name, such as when an individual provides a chosen name in their GitHub profile that differed significantly from their authorship name. Most commonly this occurred when an individual used a chosen “English” name in their GitHub profile that was very different from their authorship name.
These themes highlight areas for potential improvement in the model, such as incorporating more diverse training data and exploring additional features that could capture cultural name variations and contextual information.
12.5 Article Citation Linear Model Results
| Dep. Variable: | cited_by_count | No. Observations: | 19303 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 19298 |
| Model Family: | NegativeBinomial | Df Model: | 4 |
| Link Function: | Log | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -66676. |
| Date: | Tue, 06 Jan 2026 | Deviance: | 19541. |
| Time: | 15:34:48 | Pearson chi2: | 2.98e+04 |
| No. Iterations: | 13 | Pseudo R-squ. (CS): | 0.2879 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.9530 | 0.028 | 33.695 | 0.000 | 0.898 | 1.008 |
| Total Authors | 0.0764 | 0.004 | 17.815 | 0.000 | 0.068 | 0.085 |
| Code-Contrib. Authors | 0.0407 | 0.011 | 3.686 | 0.000 | 0.019 | 0.062 |
| Code-Contrib. Non-Authors | -0.0022 | 0.004 | -0.507 | 0.612 | -0.011 | 0.006 |
| Years Since Publication | 0.3954 | 0.004 | 90.744 | 0.000 | 0.387 | 0.404 |
| Dep. Variable: | cited_by_count | No. Observations: | 19303 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 19295 |
| Model Family: | NegativeBinomial | Df Model: | 7 |
| Link Function: | Log | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -66612. |
| Date: | Tue, 06 Jan 2026 | Deviance: | 19413. |
| Time: | 15:34:48 | Pearson chi2: | 2.95e+04 |
| No. Iterations: | 13 | Pseudo R-squ. (CS): | 0.2926 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.6503 | 0.065 | 9.953 | 0.000 | 0.522 | 0.778 |
| Total Authors | 0.0776 | 0.004 | 18.076 | 0.000 | 0.069 | 0.086 |
| Code-Contrib. Authors | -0.0337 | 0.048 | -0.699 | 0.484 | -0.128 | 0.061 |
| Code-Contrib. Non-Authors | 0.0122 | 0.016 | 0.766 | 0.444 | -0.019 | 0.043 |
| Years Since Publication | 0.3855 | 0.004 | 87.024 | 0.000 | 0.377 | 0.394 |
| Is Open Access | 0.3386 | 0.063 | 5.337 | 0.000 | 0.214 | 0.463 |
| Code-Contrib. Authors × Is Open Access | 0.0773 | 0.049 | 1.564 | 0.118 | -0.020 | 0.174 |
| Code-Contrib. Non-Authors × Is Open Access | -0.0141 | 0.017 | -0.853 | 0.393 | -0.046 | 0.018 |
| Dep. Variable: | cited_by_count | No. Observations: | 19303 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 19289 |
| Model Family: | NegativeBinomial | Df Model: | 13 |
| Link Function: | Log | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -66598. |
| Date: | Tue, 06 Jan 2026 | Deviance: | 19385. |
| Time: | 15:34:48 | Pearson chi2: | 2.95e+04 |
| No. Iterations: | 14 | Pseudo R-squ. (CS): | 0.2937 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.8737 | 0.075 | 11.648 | 0.000 | 0.727 | 1.021 |
| Total Authors | 0.0775 | 0.004 | 17.869 | 0.000 | 0.069 | 0.086 |
| Code-Contrib. Authors | -0.0017 | 0.059 | -0.028 | 0.977 | -0.117 | 0.113 |
| Code-Contrib. Non-Authors | 0.0194 | 0.027 | 0.713 | 0.476 | -0.034 | 0.073 |
| Years Since Publication | 0.4008 | 0.004 | 91.045 | 0.000 | 0.392 | 0.409 |
| Domain Life Sciences | -0.2270 | 0.084 | -2.690 | 0.007 | -0.392 | -0.062 |
| Domain Physical Sciences | 0.1081 | 0.071 | 1.528 | 0.126 | -0.031 | 0.247 |
| Domain Social Sciences | -0.2418 | 0.091 | -2.656 | 0.008 | -0.420 | -0.063 |
| Code-Contrib. Authors × Domain Life Sciences | 0.0964 | 0.070 | 1.374 | 0.169 | -0.041 | 0.234 |
| Code-Contrib. Authors × Domain Physical Sciences | 0.0340 | 0.060 | 0.568 | 0.570 | -0.083 | 0.152 |
| Code-Contrib. Authors × Domain Social Sciences | 0.1345 | 0.073 | 1.835 | 0.067 | -0.009 | 0.278 |
| Code-Contrib. Non-Authors × Domain Life Sciences | -0.0465 | 0.036 | -1.303 | 0.193 | -0.116 | 0.023 |
| Code-Contrib. Non-Authors × Domain Physical Sciences | -0.0226 | 0.028 | -0.820 | 0.412 | -0.077 | 0.031 |
| Code-Contrib. Non-Authors × Domain Social Sciences | -0.0348 | 0.037 | -0.931 | 0.352 | -0.108 | 0.038 |
| Dep. Variable: | cited_by_count | No. Observations: | 19303 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 19292 |
| Model Family: | NegativeBinomial | Df Model: | 10 |
| Link Function: | Log | Scale: | 1.0000 |
| Method: | IRLS | Log-Likelihood: | -66266. |
| Date: | Tue, 06 Jan 2026 | Deviance: | 18721. |
| Time: | 15:34:48 | Pearson chi2: | 2.86e+04 |
| No. Iterations: | 13 | Pseudo R-squ. (CS): | 0.3175 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.4361 | 0.046 | 9.583 | 0.000 | 0.347 | 0.525 |
| Total Authors | 0.0763 | 0.004 | 17.740 | 0.000 | 0.068 | 0.085 |
| Code-Contrib. Authors | -0.0051 | 0.029 | -0.175 | 0.861 | -0.062 | 0.052 |
| Code-Contrib. Non-Authors | -0.0308 | 0.010 | -3.143 | 0.002 | -0.050 | -0.012 |
| Years Since Publication | 0.4078 | 0.004 | 93.146 | 0.000 | 0.399 | 0.416 |
| Article Type Research Article | 0.5309 | 0.041 | 12.958 | 0.000 | 0.451 | 0.611 |
| Article Type Software Article | -0.4617 | 0.139 | -3.311 | 0.001 | -0.735 | -0.188 |
| Code-Contrib. Authors × Article Type Research Article | 0.0661 | 0.032 | 2.088 | 0.037 | 0.004 | 0.128 |
| Code-Contrib. Authors × Article Type Software Article | -0.0842 | 0.070 | -1.203 | 0.229 | -0.221 | 0.053 |
| Code-Contrib. Non-Authors × Article Type Research Article | 0.0377 | 0.011 | 3.440 | 0.001 | 0.016 | 0.059 |
| Code-Contrib. Non-Authors × Article Type Software Article | 0.0844 | 0.077 | 1.089 | 0.276 | -0.068 | 0.236 |
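For reference, the citation models above are negative binomial GLMs with a log link; a minimal statsmodels sketch with illustrative column names follows. Under a log link, a coefficient b implies a multiplicative effect of exp(b) on expected citations; for example, the 0.0407 estimate for code-contributing authors in the base model corresponds to exp(0.0407) ≈ 1.042, the ~4.2% per-author increase discussed in the text.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sketch of the citation models above: negative binomial GLM with a log link.
# The input file and column names are illustrative stand-ins for our variables.
articles = pd.read_parquet("analysis-articles.parquet")

model = smf.glm(
    "cited_by_count ~ total_authors + code_contrib_authors"
    " + code_contrib_non_authors + years_since_publication",
    data=articles,
    family=sm.families.NegativeBinomial(),  # log link by default
).fit()
print(model.summary())

# Interpreting a log-link coefficient as a multiplicative effect on expected citations:
print(np.exp(0.0407))  # ≈ 1.042, i.e. ~4.2% more citations per code-contributing author
```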
12.8 Filtered Dataset Description for h-Index Analysis
| Category | Subset | Total Authors | Any Code | Majority Code | Always Code |
|---|---|---|---|---|---|
| By Common Domain | Health Sciences | 1501 | 338 | 196 | 81 |
| | Life Sciences | 1436 | 351 | 236 | 127 |
| | Physical Sciences | 49447 | 14756 | 7954 | 3725 |
| | Social Sciences | 1297 | 274 | 216 | 176 |
| By Document Type | Preprint | 29038 | 9255 | 4828 | 2151 |
| | Research Article | 24265 | 6419 | 3657 | 1830 |
| | Software Article | 378 | 45 | 117 | 128 |
| By Author Position | First | 11459 | 1671 | 4864 | 3249 |
| | Last | 10208 | 2260 | 550 | 186 |
| | Middle | 32014 | 11788 | 3188 | 674 |
| Total | | 53681 | 15719 | 8602 | 4109 |
12.9 h-Index Linear Model Results
| Dep. Variable: | h_index | No. Observations: | 49483 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 49478 |
| Model Family: | Gaussian | Df Model: | 4 |
| Link Function: | Log | Scale: | 198.76 |
| Method: | IRLS | Log-Likelihood: | -2.0115e+05 |
| Date: | Tue, 06 Jan 2026 | Deviance: | 9.8342e+06 |
| Time: | 15:34:49 | Pearson chi2: | 9.83e+06 |
| No. Iterations: | 46 | Pseudo R-squ. (CS): | 0.1757 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.1825 | 0.004 | 838.738 | 0.000 | 3.175 | 3.190 |
| Works Count | 0.0001 | 1.98e-06 | 66.904 | 0.000 | 0.000 | 0.000 |
| Any Coding | -0.3213 | 0.008 | -42.806 | 0.000 | -0.336 | -0.307 |
| Majority Coding | -0.7591 | 0.014 | -53.908 | 0.000 | -0.787 | -0.732 |
| Always Coding | -0.9583 | 0.025 | -38.142 | 0.000 | -1.008 | -0.909 |
| Dep. Variable: | h_index | No. Observations: | 49483 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 49466 |
| Model Family: | Gaussian | Df Model: | 16 |
| Link Function: | Log | Scale: | 197.47 |
| Method: | IRLS | Log-Likelihood: | -2.0098e+05 |
| Date: | Tue, 06 Jan 2026 | Deviance: | 9.7681e+06 |
| Time: | 15:34:49 | Pearson chi2: | 9.77e+06 |
| No. Iterations: | 48 | Pseudo R-squ. (CS): | 0.1823 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.3122 | 0.018 | 186.715 | 0.000 | 3.277 | 3.347 |
| Works Count | 0.0001 | 2.05e-06 | 66.928 | 0.000 | 0.000 | 0.000 |
| Any Coding | -0.3814 | 0.045 | -8.521 | 0.000 | -0.469 | -0.294 |
| Majority Coding | -1.4652 | 0.104 | -14.082 | 0.000 | -1.669 | -1.261 |
| Always Coding | -1.1829 | 0.187 | -6.335 | 0.000 | -1.549 | -0.817 |
| Common Domain Life Sciences | 0.1045 | 0.025 | 4.124 | 0.000 | 0.055 | 0.154 |
| Common Domain Physical Sciences | -0.1423 | 0.018 | -7.831 | 0.000 | -0.178 | -0.107 |
| Common Domain Social Sciences | -0.1658 | 0.030 | -5.459 | 0.000 | -0.225 | -0.106 |
| Any Coding × Common Domain Life Sciences | 0.0831 | 0.059 | 1.418 | 0.156 | -0.032 | 0.198 |
| Any Coding × Common Domain Physical Sciences | 0.0618 | 0.045 | 1.359 | 0.174 | -0.027 | 0.151 |
| Any Coding × Common Domain Social Sciences | 0.0174 | 0.073 | 0.238 | 0.812 | -0.126 | 0.161 |
| Majority Coding × Common Domain Life Sciences | 0.8432 | 0.120 | 7.054 | 0.000 | 0.609 | 1.078 |
| Majority Coding × Common Domain Physical Sciences | 0.7252 | 0.105 | 6.904 | 0.000 | 0.519 | 0.931 |
| Majority Coding × Common Domain Social Sciences | 0.7523 | 0.136 | 5.520 | 0.000 | 0.485 | 1.019 |
| Always Coding × Common Domain Life Sciences | 0.2804 | 0.213 | 1.315 | 0.188 | -0.137 | 0.698 |
| Always Coding × Common Domain Physical Sciences | 0.2258 | 0.189 | 1.197 | 0.231 | -0.144 | 0.595 |
| Always Coding × Common Domain Social Sciences | 0.2834 | 0.221 | 1.283 | 0.200 | -0.150 | 0.716 |
| Dep. Variable: | h_index | No. Observations: | 49483 |
|---|---|---|---|
| Model: | GLM | Df Residuals: | 49470 |
| Model Family: | Gaussian | Df Model: | 12 |
| Link Function: | Log | Scale: | 195.37 |
| Method: | IRLS | Log-Likelihood: | -2.0072e+05 |
| Date: | Tue, 06 Jan 2026 | Deviance: | 9.6651e+06 |
| Time: | 15:34:49 | Pearson chi2: | 9.67e+06 |
| No. Iterations: | 47 | Pseudo R-squ. (CS): | 0.1927 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.0879 | 0.006 | 532.656 | 0.000 | 3.077 | 3.099 |
| Works Count | 0.0001 | 1.99e-06 | 64.759 | 0.000 | 0.000 | 0.000 |
| Any Coding | -0.2938 | 0.011 | -27.311 | 0.000 | -0.315 | -0.273 |
| Majority Coding | -0.7606 | 0.021 | -36.189 | 0.000 | -0.802 | -0.719 |
| Always Coding | -0.9799 | 0.040 | -24.644 | 0.000 | -1.058 | -0.902 |
| Common Article Type Research Article | 0.1836 | 0.008 | 24.214 | 0.000 | 0.169 | 0.198 |
| Common Article Type Software Article | 0.2231 | 0.055 | 4.034 | 0.000 | 0.115 | 0.331 |
| Any Coding × Common Article Type Research Article | -0.0296 | 0.015 | -1.984 | 0.047 | -0.059 | -0.000 |
| Any Coding × Common Article Type Software Article | 0.1663 | 0.103 | 1.622 | 0.105 | -0.035 | 0.367 |
| Majority Coding × Common Article Type Research Article | 0.0071 | 0.028 | 0.251 | 0.802 | -0.048 | 0.063 |
| Majority Coding × Common Article Type Software Article | 0.3797 | 0.090 | 4.220 | 0.000 | 0.203 | 0.556 |
| Always Coding × Common Article Type Research Article | 0.0018 | 0.052 | 0.034 | 0.973 | -0.101 | 0.104 |
| Always Coding × Common Article Type Software Article | 0.3679 | 0.108 | 3.417 | 0.001 | 0.157 | 0.579 |
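The h-index models above use a Gaussian GLM with a log link; a minimal statsmodels sketch with illustrative column names follows.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sketch of the h-index models above: Gaussian GLM with a log link.
# The input file and column names are illustrative stand-ins for our variables.
authors = pd.read_parquet("analysis-authors.parquet")

model = smf.glm(
    "h_index ~ works_count + any_coding + majority_coding + always_coding",
    data=authors,
    family=sm.families.Gaussian(link=sm.families.links.Log()),
).fit()
print(model.summary())
```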
12.10 Study Differences from Preregistration
12.10.1 Analysis of Article Field Weighted Citation Impact (FWCI) and Code Contribution
In our pre-registered analysis plan (https://osf.io/fc74m), we stated that we would additionally investigate the relationship between an article’s Field Weighted Citation Impact (FWCI) and the number of code contributors to the project. We decided against this analysis because the FWCI metric was available from OpenAlex for only 55.5% (n=76,904) of articles in the rs-graph-v1 dataset at the time of data processing. In addition, our analysis of the relationship between article citations and the number of code contributors already includes the article’s domain and time since publication, providing similar controls.