Parsing More Than 10TB of GitHub Logs with Trickest and Extracting Public Details of All GitHub Users & Repositories


2023-06-15 09:37:42

In today's digital age, the vast expanse of data available to us offers incredible opportunities but also presents complex challenges. Among these challenges is the task of navigating and parsing huge data logs, especially when dealing with open-source platforms like GitHub. This blog post will dive into how I used Trickest's workflow methodology to parse over 10TB of GitHub logs, extracting public details of all the users and repositories present and turning a colossal data pool into useful insights.

Understanding the importance of this topic is crucial, as GitHub embodies a vibrant, ever-evolving ecosystem of developers and their projects. By parsing this data, we have the opportunity to gain insights that would otherwise slip under the radar, producing a wealth of open-source intelligence (OSINT) that can improve user engagement, support targeted problem-solving, and foster predictive analytics.

Using the latest Trickest engine update, which improves node functionality and reliability, I set out to parse all of the GitHub logs from 2015 onward, extracting the public information for all users and repositories logged within. This exhaustive process allowed me to generate a comprehensive list of all users and repositories from 2015 to date.

Despite the limited OSINT data available about GitHub, my goal was to strengthen the community's knowledge by sharing our findings and offering additional details, derived from the GitHub API, about the users and repositories logged.

Exploring the GitHub Archive

While there aren't publicly available rankings of repositories sorted by star or fork counts, user contribution data, or even the identities of deleted users, the GH Archive steps in to fill this gap. This valuable resource houses all GitHub logs from 2011 onwards, a testament to GitHub's commitment to community service.

These logs contain valuable information, including details about the changes made to each repository as well as the people behind those changes. Parsing this data allows for the uncovering of all users and repositories: created, deleted, altered, and more. This abundance of information can be harnessed for OSINT purposes.

However, it's important to keep in mind the sheer scale of the archive. With roughly 15TB of logs (even when compressed to save space), parsing this volume of data is a significant undertaking.

Parsing the Logs

For the initial step, I turned to the recently upgraded Trickest engine to handle this data, attracted by its ability to run tasks in parallel and its remarkable speed. To streamline the process, I chose to limit execution to medium machines (4GB of RAM) running in parallel. Given the substantial amount of data these machines needed to parse, it was essential to keep the scripts as memory-efficient as possible.

We chose to parse logs from 2015 onward, because it was from this point that the logs were officially recorded via the Events API.

The task of parsing presented its own set of challenges, given the huge quantity of data to parse and store. It took several iterations to determine the most efficient method for downloading, parsing, and storing the data for future analysis. The process was so demanding that Trickest had to increase the volume sizes to hold all the data. Consequently, the initial script, designed for downloading and parsing the data, was divided into three distinct stages:

  1. The first stage was dedicated to generating all the URLs needed to download the logs. The URL format was {year}-{month:02d}-{day:02d}-{hour:02d}.json.gz. We configured Trickest to handle these URLs in parallel chunks of 200.

  2. The second stage involved the script gh-scraper, which took a number of log URLs as input and downloaded and parsed each one individually. For handling large downloads, we used curl, as it proved more efficient than the Python requests library. To parse the data, the script opened each file as JSON, pulled out the necessary information (names of users, names of repositories, and records of the changes in each repository), and wrote this data into a CSV file. The program downloaded, parsed, and wrote data sequentially before proceeding to the next file, an approach designed to minimize memory usage.

  3. The final stage involved custom Python scripts that read all of the CSV files generated earlier, merged them, combined results, and removed duplicates. We developed separate scripts for extracting user and repository information.
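As a rough illustration, the three stages above might look like the following Python sketch. This is a hedged approximation, not the actual gh-scraper code: the function names, the extracted columns, and the use of urllib instead of curl are my assumptions.

```python
import csv
import gzip
import json
import urllib.request
from datetime import datetime, timedelta

# Stage 1: generate one GH Archive URL per hour of logs.
def archive_urls(start="2015-01-01", end="2016-01-01"):
    t, stop = datetime.fromisoformat(start), datetime.fromisoformat(end)
    while t < stop:
        yield (f"https://data.gharchive.org/"
               f"{t.year}-{t.month:02d}-{t.day:02d}-{t.hour:02d}.json.gz")
        t += timedelta(hours=1)

# Stage 2: stream one gzipped log, pull out user/repo/event type, and
# append rows to an open CSV writer. Processing one file at a time
# keeps memory usage low on a 4GB machine.
def parse_log(url, writer):
    with urllib.request.urlopen(url) as resp:
        with gzip.open(resp, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                event = json.loads(line)
                user = event.get("actor", {}).get("login")
                repo = event.get("repo", {}).get("name")
                if user and repo:
                    writer.writerow([user, repo, event.get("type")])

# Stage 3: merge the per-chunk CSV files and drop duplicate rows.
def merge_csvs(paths, out_path):
    seen = set()
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as fh:
                for row in csv.reader(fh):
                    if tuple(row) not in seen:
                        seen.add(tuple(row))
                        writer.writerow(row)
```

In the real workflow, each medium machine handled a chunk of 200 such URLs, and the merge ran as a separate final stage so the parsers never had to hold more than one log file in memory.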

The image below provides a visual representation of how the workflow looked:


Although these scripts are highly memory-efficient, the code may not look elegant, due to the many modifications made to handle this volume of data on machines with just 4GB of RAM.

Running the Workflow

Workflow run

Upon initiating the workflow, it took slightly over 23 hours to download, parse, and merge all the data. The merging process itself took 19 hours on two parallel large machines, one each for users and repositories. The downloading and parsing used 15 parallel medium machines and took a mere 4 hours to download, decompress, and parse all of the GitHub logs. A special acknowledgment goes to Trickest for handling such a massive dataset at such speed.

The result was a massive 4.6GB CSV file containing data on more than 45 million users and the repositories they had contributed to, plus an 8.6GB CSV file containing data on more than 220 million repositories.

Enriching the Data

Despite having a list of users and repositories that had been created, deleted, modified, etc., from 2015 to the present day, detailed information was still missing. To resolve this, we decided to enrich the data with the GitHub API.

We created two similar workflows in Trickest to enrich the CSV files created in the previous step with the GitHub API:

  • The workflow began by downloading the CSV file that listed either users or repositories.
  • The list was then divided into chunks of 1,000,000 users or repositories to ensure efficient parallel processing without overloading a medium machine.
  • I ran the script gh-enhancer to enrich each segment of users and repositories using the GitHub API.
  • The enriched data from all segments was merged into a single CSV file.
  • Finally, we used the script gh-investigator to extract interesting information from the CSV file generated in the previous step.
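The chunking and per-user enrichment steps can be sketched as follows. The endpoint is GitHub's public `GET /users/{login}`, but the helper names are mine, not the actual gh-enhancer code:

```python
import json
import urllib.error
import urllib.request

def chunk(items, size=1_000_000):
    """Split the user/repo list into fixed-size segments so that each
    medium machine can enrich one segment in parallel."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def enrich_user(login, token):
    """Fetch public profile details for one user via the GitHub API.
    Returns None when the account no longer exists (HTTP 404)."""
    request = urllib.request.Request(
        f"https://api.github.com/users/{login}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with urllib.request.urlopen(request) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # deleted or renamed account
            return None
        raise
```

Treating a 404 as "account deleted" rather than an error is what makes it possible to build the list of deleted users mentioned below.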

“Enhance Users” Workflow

Enhance users workflow


Executing the workflow on 4 medium machines in parallel took roughly 11 hours and 45 minutes. To avoid exceeding the requests-per-hour limit, I used 5 different GitHub API keys. However, I never reached the API rate limit, suggesting that the process could have been completed faster by using more machines in parallel.
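Cycling through several API keys can be sketched like this. This is a minimal illustration under my own assumptions; the workflow's actual key handling isn't published. The header names, however, are the real ones GitHub returns:

```python
import time
from itertools import cycle

class TokenRotator:
    """Round-robin over several GitHub tokens so that each key's
    requests-per-hour allowance is spread across the whole run."""

    def __init__(self, tokens):
        self._tokens = cycle(tokens)

    def next_token(self):
        return next(self._tokens)

def wait_for_reset(remaining, reset_epoch):
    """Sleep until the rate-limit window resets when a key is exhausted.
    The values come from the X-RateLimit-Remaining and X-RateLimit-Reset
    response headers that the GitHub API returns on every request."""
    if remaining == 0:
        time.sleep(max(0.0, reset_epoch - time.time()))
```

With 5 keys rotated this way, each key only absorbs a fifth of the request volume, which is consistent with never hitting the limit on the users run.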

The final user CSV file was 4.6GB and contained details of roughly 45 million users, with rows structured as:

user,repos_collab,deleted,site_admin,hireable,email,company,github_star

The gh-investigator script created files with the following information:

  • List of users who are site_admin
  • List of users who are hireable
  • List of users who have configured a public email
  • List of users who have configured a company
  • List of users who are github_star
  • List of users who have deleted their account
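Splitting the enriched CSV into those per-flag lists might look roughly like this. The column names follow the CSV header above, but the function itself is my sketch, not the real gh-investigator:

```python
import csv

BOOLEAN_FLAGS = ["site_admin", "hireable", "github_star", "deleted"]

def split_by_flag(csv_lines):
    """Collect, for each boolean column, the users that have it set.
    Accepts any iterable of CSV lines (an open file works too)."""
    buckets = {flag: [] for flag in BOOLEAN_FLAGS}
    for row in csv.DictReader(csv_lines):
        for flag in BOOLEAN_FLAGS:
            if (row.get(flag) or "").strip().lower() == "true":
                buckets[flag].append(row["user"])
    return buckets
```

Because the function only keeps the matching usernames rather than whole rows, it stays well within the memory budget of a medium machine even on a 4.6GB input.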

“Enhance Repos” Workflow

Enhance repos workflow

This workflow took over 32 hours to run on 15 medium machines in parallel. We used the same 5 GitHub API keys, and in this case, we did hit the API rate limit.

The final repository CSV file was 8.8GB and contained details of over 220 million repositories.


The gh-investigator script created files with the following information:

  • List of repositories sorted by stars (only those with more than 500 stars)
  • List of repositories sorted by forks (only those with more than 100 forks)
  • List of repositories sorted by watchers (only those with more than 30 watchers)
  • List of deleted repositories
  • List of private repositories
  • List of archived repositories
  • List of disabled repositories
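Each of the sorted lists above amounts to a threshold filter plus a sort, which can be sketched as follows. The `repo` and `stars` column names are my assumptions, not the actual schema:

```python
import csv

def top_repos(csv_lines, column="stars", minimum=500):
    """Keep repositories whose count in `column` exceeds `minimum`,
    returned as (count, repo) pairs sorted in descending order."""
    kept = []
    for row in csv.DictReader(csv_lines):
        count = int(row.get(column) or 0)
        if count > minimum:
            kept.append((count, row["repo"]))
    kept.sort(reverse=True)
    return kept
```

The same function covers the fork and watcher lists by changing `column` and `minimum`; filtering before sorting keeps the in-memory list small relative to the 220-million-row input.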

In Conclusion

We succeeded in our goal of parsing roughly 15TB of GitHub logs dating back to 2015, all within a span of 23 hours and 45 minutes, and further enriched this data using the GitHub API.
This project led to the creation of a comprehensive database of all users and repositories, those created, deleted, modified, and more, spanning from 2015 to the present, complete with detailed insights on each.

For those intrigued by our process and looking to build their own workflows, or perhaps use pre-existing ones, we invite you to get access by filling out the form
and start your own journey, completely free. In the upcoming installment of this blog series, we're excited to delve deeper into the data analysis, share our discoveries, and provide access to all of the GitHub data we've collected.

