scrapeghost
scrapeghost is an experimental library for scraping websites using OpenAI’s GPT.
The library provides a way to scrape structured data from HTML without writing page-specific code.
Important
Before you proceed, here are at least three reasons why you shouldn’t use this library:
- It’s very experimental; no guarantees are made about the stability of the API or the accuracy of the results.
- It relies on the OpenAI API, which is quite slow and can be expensive. (See costs before using this library.)
- It is currently licensed under the Hippocratic License 3.0. (See FAQ.)
Use at your own risk.
Quickstart
Step 1) Obtain an OpenAI API key (https://platform.openai.com) and set an environment variable:
export OPENAI_API_KEY=sk-...
Step 2) Install the library however you like:
pip install scrapeghost
or
pipx install scrapeghost
Step 3) Instantiate a SchemaScraper by defining the shape of the data you wish to extract:
from scrapeghost import SchemaScraper

# The schema values ("string", "url") are free-form hints for the model.
scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
Note
There is no pre-defined format for the schema. The GPT models do a good job of figuring out what you want, and you can use whatever values you like to provide hints.
Step 4) Passing a URL (or HTML) to the resulting scraper will return the scraped data:
resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data
{"name": "Emanuel 'Chris' Welch",
 "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
 "district": "7th", "party": "D",
 "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
 "offices": [
   {"name": "Springfield Office",
    "address": "300 Capitol Building, Springfield, IL 62706",
    "phone": "(217) 782-5350"},
   {"name": "District Office",
    "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154",
    "phone": "(708) 450-1000"}
 ]}
That’s it!
Read the tutorial for a step-by-step guide to building a scraper.
Command Line Usage Example
If you’ve installed the package (e.g. with pipx), you can use the scrapeghost command line tool to experiment.
#!/bin/sh
scrapeghost https://www.ncleg.gov/Members/Biography/S/436 \
    --schema "{'first_name': 'str', 'last_name': 'str',
    'photo_url': 'url', 'offices': [] }" \
    --css div.card | python -m json.tool
{
    "first_name": "Gale",
    "last_name": "Adcock",
    "photo_url": "https://www.ncleg.gov/Members/MemberImage/S/436/Low",
    "offices": [
        {
            "type": "Mailing",
            "address": "16 West Jones Street, Rm. 1104, Raleigh, NC 27601"
        },
        {
            "type": "Office Phone",
            "phone": "(919) 715-3036"
        }
    ]
}
See the CLI docs for more details.
Features
The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.
While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use.
Python-based schema definition – Define the shape of the data you wish to extract as any Python object, with as much or as little detail as you want.
Preprocessing
- HTML cleaning – Remove unnecessary HTML to reduce the size and cost of API requests.
- CSS and XPath selectors – Pre-filter HTML by writing a single CSS or XPath selector.
- Auto-splitting – Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.
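To make the auto-splitting idea concrete, here is a minimal sketch, not scrapeghost's own implementation. It assumes the page has already been reduced to a list of HTML fragments (for example, the elements matched by a selector) and uses character count as a rough stand-in for token count. The function name `split_fragments` and the `max_chars` parameter are hypothetical.

```python
def split_fragments(fragments, max_chars):
    """Group HTML fragments into batches, one batch per model call.

    Assumption: len(fragment) is a rough proxy for token usage.
    A single fragment longer than max_chars still gets its own batch.
    """
    batches = []
    current = []
    size = 0
    for fragment in fragments:
        # Close the current batch if this fragment would push it over the limit.
        if current and size + len(fragment) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(fragment)
        size += len(fragment)
    if current:
        batches.append(current)
    return batches
```

Each batch would then be sent to the model separately and the results concatenated, which only works when the page is a list of similar items.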
Postprocessing
- JSON validation – Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it isn’t.)
- Schema validation – Go a step further and use a pydantic schema to validate the response.
- Hallucination check – Does the data in the response truly exist on the page?
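The JSON validation and hallucination check steps could look roughly like the following. This is an illustrative sketch using only the standard library, not the library's actual code; the helper names `parse_json`, `iter_strings`, and `missing_from_page` are hypothetical, and the check assumes a value counts as present only if it appears verbatim in the page source.

```python
import json


def parse_json(raw):
    """Return the parsed object, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def iter_strings(value):
    """Yield every string leaf in a nested dict/list structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def missing_from_page(data, html):
    """List string values that do not appear verbatim in the page source."""
    return [text for text in iter_strings(data) if text not in html]
```

A verbatim match is deliberately strict: values the model has normalized (such as "D" for "Democrat") would be flagged, so a real check may need to be more forgiving.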
Cost Controls
- Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
- Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
- Allows setting a budget and stops the scraper if the budget is exceeded.
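As a rough illustration of how running token totals and a budget limit can fit together, here is a small sketch. It is not scrapeghost's API: the `CostTracker` and `BudgetExceeded` names are hypothetical, and the per-1,000-token prices are placeholders the caller must supply from current OpenAI pricing.

```python
class BudgetExceeded(Exception):
    """Raised when the running cost passes the configured limit."""


class CostTracker:
    """Keep running token totals and stop once a budget is exceeded."""

    def __init__(self, prompt_price_per_1k, completion_price_per_1k, max_cost):
        # Prices are assumptions supplied by the caller, in dollars per 1,000 tokens.
        self.prompt_price_per_1k = prompt_price_per_1k
        self.completion_price_per_1k = completion_price_per_1k
        self.max_cost = max_cost
        self.prompt_tokens = 0
        self.completion_tokens = 0

    @property
    def total_cost(self):
        """Estimated spend so far, derived from the token totals."""
        return (
            self.prompt_tokens / 1000 * self.prompt_price_per_1k
            + self.completion_tokens / 1000 * self.completion_price_per_1k
        )

    def record(self, prompt_tokens, completion_tokens):
        """Add one request's token usage, raising if the budget is now exceeded."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        if self.total_cost > self.max_cost:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.4f}, limit ${self.max_cost:.4f}"
            )
```

In this sketch the check runs after each request is recorded, so the final request can overshoot the budget slightly before the scraper stops.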