Open source software GitHub ecosystem: a SEM approach

Abdulhassan Alshomali, Mohammad Azeez (2018) Open source software GitHub ecosystem: a SEM approach. PhD thesis, James Cook University.

PDF (Thesis)
Download (2MB) | Preview
View at Publisher Website:


Open source software (OSS) is a collaborative effort. Getting affordable high-quality software with less probability of errors or fails is not far away. Thousands of open-source projects (termed repos) are alternatives to proprietary software development. More than two-thirds of companies are contributing to open source. Open source technologies like OpenStack, Docker and KVM are being used to build the next generation of digital infrastructure. An iconic example of OSS is 'GitHub' - a successful social site. GitHub is a hosting platform that host repositories (repos) based on the Git version control system.

GitHub is a knowledge-based workspace. It has several features that facilitate user communication and work integration. Through this thesis I employ data extracted from GitHub, and seek to better understand the OSS ecosystem, and to what extent each of its deployed elements affects the successful development of the OSS ecosystem. In addition, I investigate a repo's growth over different time periods to test the changing behavior of the repo. From our observations developers do not follow one development methodology when developing, and growing their project, and such developers tend to cherry-pick from differing available software methodologies.

GitHub API remains the main OSS location engaged to extract the metadata for this thesis's research. This extraction process is time-consuming - due to restrictive access limitations (even with authentication). I apply Structure Equation Modelling (termed SEM) to investigate the relative path relationships between the GitHub- deployed OSS elements, and I determine the path strength contributions of each element to determine the OSS repo's activity level.

SEM is a multivariate statistical analysis technique used to analyze structural relationships. This technique is the combination of factor analysis and multiple regression analysis. It is used to analyze the structural relationship between measured variables and/or latent constructs.

This thesis bridges the research gap around longitude OSS studies. It engages large sample-size OSS repo metadata sets, data-quality control, and multiple programming language comparisons. Querying GitHub is not direct (nor simple) yet querying for all valid repos remains important - as sometimes illegal, or unrepresentative outlier repos (which may even be quite popular) do arise, and these then need to be removed from each initial OSS's language-specific metadata set.

Eight top GitHub programming languages, (selected as the most forked repos) are separately engaged in this thesis's research. This thesis observes these eight metadata sets of GitHub repos. Over time, it measures the different repo contributions of the deployed elements of each metadata set.

The number of stars-provided to the repo delivers a weaker contribution to its software development processes. Sometimes forks work against the repo's progress by generating very minor negative total effects into its commit (activity) level, and by sometimes diluting the focus of the repo's software development strategies. Here, a fork may generate new ideas, create a new repo, and then draw some original repo developers off into this new software development direction, thus retarding the original repo's commit (activity) level progression.

Multiple intermittent and minor version releases exert lesser GitHub JavaScript repo commit (or activity) changes because they often involve only slight OSS improvements, and because they only require minimal commit/commits contributions. More commit(s) also bring more changes to documentation, and again the GitHub OSS repo's commit (activity) level rises.

There are both direct and indirect drivers of the repo's OSS activity. Pulls and commits are the strongest drivers. This suggests creating higher levels of pull requests is likely a preferred prime target consideration for the repo creator's core team of developers.

This study offers a big data direction for future work. It allows for the deployment of more sophisticated statistical comparison techniques. It offers further indications around the internal and broad relationships that likely exist between GitHub's OSS big data. Its data extraction ideas suggest a link through to business/consumer consumption, and possibly how these may be connected using improved repo search algorithms that release individual business value components.

Item ID: 56908
Item Type: Thesis (PhD)
Keywords: activity level, agile methodologies, GitHub software development, GitHub, open source community open source software (OSS), open source, popularity, software development, software methodologies, structure equation model (SEM), structure equation modelling (SEM)
Copyright Information: Copyright © 2018 Mohammad Azeez Abdulhassan Alshomali
Date Deposited: 16 Jan 2019 04:40
FoR Codes: 15 COMMERCE, MANAGEMENT, TOURISM AND SERVICES > 1503 Business and Management > 150307 Innovation and Technology Management @ 50%
08 INFORMATION AND COMPUTING SCIENCES > 0806 Information Systems > 080699 Information Systems not elsewhere classified @ 50%
SEO Codes: 89 INFORMATION AND COMMUNICATION SERVICES > 8999 Other Information and Communication Services > 899999 Information and Communication Services not elsewhere classified @ 100%
Downloads: Total: 886
Last 12 Months: 47
More Statistics

Actions (Repository Staff Only)

Item Control Page Item Control Page