How does it know I would like csv? — An HTTP trick
Forgotten components of RFC2616
2023-01-17
by Cal Paterson
How come if you go to
https://csvbase.com/meripaterson/stock-exchanges
in a browser you get a webpage,
but if you curl
the same url you get a csv file?
The url is identical – so how come?
The answer is HTTP’s built-in “content negotiation”.
How content negotiation works
When an HTTP client sends any request, it sends “headers” with that request.
Here are the headers that Google Chrome sends:
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
pragma: no-cache
sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36
That's a lot. Only the first of these is relevant: the accept
header:
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
The accept
header is an unordered list of preferences for what media type
(aka “Content Type”, or file format) the server should send.
Chrome is saying:
- Ideally, give me:
  - text/html
  - application/xhtml+xml
  - image/avif
  - image/webp
  - image/apng
- Otherwise (“q”, or quality, 0.9), give me
  - application/xml
  - or application/signed-exchange;v=b3
- And if none of these are available (lowest priority; q=0.8):
  - just send me anything (*/*)
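To see where those tiers come from, here is a minimal Python sketch that splits an accept header into media ranges and their q values. The function name `parse_accept` is made up for illustration, and it only understands the `q` parameter; it is not a full parser for the header grammar in the HTTP spec.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media range, q) pairs, highest q first."""
    entries = []
    for item in header.split(","):
        if not item.strip():
            continue  # tolerate stray commas
        parts = [p.strip() for p in item.split(";")]
        media_range, params = parts[0], parts[1:]
        q = 1.0  # a media range with no q parameter counts as q=1
        kept = []
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
            else:
                kept.append(param)  # e.g. v=b3 stays with its media range
        entries.append((";".join([media_range, *kept]), q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


CHROME_ACCEPT = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,"
    "image/avif,image/webp,image/apng,*/*;q=0.8,"
    "application/signed-exchange;v=b3;q=0.9"
)

for media_range, q in parse_accept(CHROME_ACCEPT):
    print(f"{q:.1f}  {media_range}")
```

Anything without an explicit q lands in the top tier, which is why text/html and the image types share first place.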
csvbase has an HTML representation of the url being requested, which is
Chrome’s joint top preference, so it just replies with that.
What accept header does curl send? It sends just:
accept: */*
Much shorter. Curl will take anything. And csvbase has a default format:
csv, so it replies with that.
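The server's side of this can be sketched roughly as follows. This is not csvbase's actual code: the function names, the list of available media types and its ordering (csv first, as the default) are all assumptions made for illustration. The idea is to score each format the server can produce against the accept header and let the server's own order break ties.

```python
def quality(media_type: str, accept_header: str) -> float:
    """Return the q value the header gives media_type (0.0 if not acceptable)."""
    type_, _, subtype = media_type.partition("/")
    best_specificity, best_q = -1, 0.0
    for item in accept_header.split(","):
        media_range, *params = [p.strip() for p in item.split(";")]
        q = 1.0  # no q parameter means q=1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        r_type, _, r_subtype = media_range.partition("/")
        if (r_type, r_subtype) == ("*", "*"):
            specificity = 0
        elif r_type == type_ and r_subtype == "*":
            specificity = 1
        elif (r_type, r_subtype) == (type_, subtype):
            specificity = 2
        else:
            continue  # this range does not cover media_type
        # the most specific matching range decides the q value
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def negotiate(accept_header: str, available: list[str]):
    """Pick the client's favourite from `available`; earlier entries win ties."""
    scored = [(quality(m, accept_header), m) for m in available]
    best_q = max((q for q, _ in scored), default=0.0)
    if best_q == 0.0:
        return None  # nothing acceptable; a server might answer 406 here
    return next(m for q, m in scored if q == best_q)


# Assumed for illustration: csv listed first because it is the default.
AVAILABLE = ["text/csv", "text/html"]
CHROME_ACCEPT = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,"
    "image/avif,image/webp,image/apng,*/*;q=0.8,"
    "application/signed-exchange;v=b3;q=0.9"
)

print(negotiate(CHROME_ACCEPT, AVAILABLE))  # the browser case
print(negotiate("*/*", AVAILABLE))          # the curl case
```

With Chrome's header, text/html scores 1.0 while text/csv only matches the catch-all at 0.8. With curl's header everything scores 1.0, so the server's default wins.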
“Why though?”, escape hatches and non-negotiables
The main reason why csvbase bothers at all with this is to make it easier to
export tables. For example, to get a table loaded into
pandas, all you have to do
is paste the url into the first argument for pandas’ read_csv
– the same url
as for the page:
import pandas as pd
pd.read_csv("https://csvbase.com/meripaterson/stock-exchanges")
This works in most tools – curl, pandas,
R and many others.
But not all. Some, like Apache
Spark, ask for HTML for some
reason. So csvbase has an escape hatch from content negotiation, adding
a file extension:
https://csvbase.com/meripaterson/stock-exchanges.csv
always returns a csv file.
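One way a server could implement such an escape hatch is to look for a known extension on the path before it even considers the accept header. The sketch below is a guess at the shape of that logic, not csvbase's implementation; `EXTENSIONS`, `split_format` and `choose_format` are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical table of extensions the server recognises.
EXTENSIONS = {".csv": "csv", ".parquet": "parquet"}


def split_format(url: str):
    """Return (path without a recognised extension, format name or None)."""
    path = urlsplit(url).path
    stem, ext = posixpath.splitext(path)
    if ext in EXTENSIONS:
        return stem, EXTENSIONS[ext]
    return path, None


def choose_format(url: str, negotiated: str) -> str:
    """An explicit extension wins; otherwise use whatever was negotiated."""
    _, fmt = split_format(url)
    return fmt if fmt is not None else negotiated


# A client that asks for HTML still gets csv when the url says so.
print(choose_format("https://csvbase.com/meripaterson/stock-exchanges.csv", "html"))
print(choose_format("https://csvbase.com/meripaterson/stock-exchanges", "html"))
```

Because the extension is checked first, it works for any client, whatever accept header it insists on sending.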
This escape hatch is also useful for other formats. Media types are
managed by the Internet Assigned Numbers
Authority.
Not every file format has been given a media type. For example, parquet (the
current favourite of most data scientists) does not have a media type. Neither
does jsonlines. At least not yet.
csvbase can output in those formats too, but there is no way to content negotiate
them – at least not until the IANA get around to formally assigning them a
media type.
Until then, use:
https://csvbase.com/meripaterson/stock-exchanges.parquet
You can even do:
import pandas as pd
pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")