
Git scraping: track changes over time by scraping to a Git repository
9th October 2020
Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a couple of years now. It’s really effective, and more people should use it.
Update 5th March 2021: I presented a version of this post as a five minute lightning talk at NICAR 2021, which includes a live coding demo of building a new git scraper.
Update 5th January 2022: I released a tool called git-history that helps analyze data that has been collected using this technique.
The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data. The @nyt_diff Twitter account tracks changes made to New York Times headlines, for example, which offers a fascinating insight into that publication’s editorial process.
We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource, and records changes to that resource over time in the commit history.
Here’s a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.
Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first, reveals this endpoint:
https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents
That’s a 241KB JSON endpoint with full details of the various fires around the state.
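Before pointing a scraper at an endpoint like this, it’s worth poking at it from the command line. Here’s a minimal sketch using curl and jq, the same two tools the scraper below relies on; the specific jq queries are illustrative assumptions on my part, not part of the original scraper:

    # Download the JSON and check how big it is
    curl -s https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents -o incidents.json
    wc -c incidents.json

    # Pretty-print it and skim the structure
    jq . incidents.json | head -n 40

    # Count the top-level items (assumes the response is a JSON array)
    jq 'length' incidents.json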
So… I started running a git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.
Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using jq and commits it back to the repo if it has changed.
This means I now have a commit log of changes to that information about fires in California. Here’s an example commit showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798, and the number of engines responding dropped from 82 to 59.
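Plain git is all you need to explore that history once a few commits have accumulated. A quick sketch using standard git commands, where <commit-sha> is a placeholder for a real commit hash:

    # Show every change to the scraped file, as a series of diffs
    git log -p -- incidents.json

    # Inspect what changed in one specific commit
    git show <commit-sha> -- incidents.json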
The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It’s in a file called .github/workflows/scrape.yml which looks like this:
name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
That’s not a lot of code!
It runs on a schedule at 6, 26 and 46 minutes past the hour. I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.
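For anyone less familiar with cron syntax, here’s how that five-field expression breaks down (standard cron, as used by GitHub Actions schedules):

    # minute   hour   day-of-month   month   day-of-week
    # 6,26,46   *          *           *          *
    #
    # i.e. at 6, 26 and 46 minutes past every hour, every day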
The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it, and saving the result to incidents.json.
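That jq . pretty-printing step is what makes the diffs useful: Git diffs are line-based, so a single-line JSON response would show up as one enormous changed line on every update, while indented output confines the diff to the values that actually changed. An illustration (the field names here are invented for the example):

    # Raw response: everything on one line, so any change rewrites the whole line
    {"incidents": [{"name": "Zogg Fire", "contained": 90, ...}]}

    # After jq .: one value per line, so a later diff pinpoints the change
    -    "contained": 90,
    +    "contained": 92,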
The “commit and push if it changed” block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in this TIL a few months ago.
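The trick is the || exit 0 on the git commit line: git commit exits with a non-zero status when there is nothing staged to commit, so || exit 0 turns “no changes” into a clean success instead of a failed workflow run. The same pattern in isolation:

    git add -A
    # git commit fails if nothing is staged; || exit 0 makes that a no-op
    git commit -m "Latest data: $(date -u)" || exit 0
    # Only reached when a new commit was actually created
    git push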
I have a whole bunch of repositories running git scrapers now. I’ve been labeling them with the git-scraping topic so they show up in one place on GitHub (other people have started using that topic as well).
I’ve written about some of these in the past.
I hope that by giving this technique a name I can encourage more people to add it to their toolbox. It’s an extremely effective way of turning all sorts of interesting data sources into a changelog over time.
Comment thread on this post over on Hacker News.