How does it know I want csv? An HTTP trick

image of a signpost

Forgotten components of RFC2616

2023-01-17

by Cal Paterson

How come, if you go to
https://csvbase.com/meripaterson/stock-exchanges
in a browser, you get a webpage:

screenshot of csvbase table web page

but if you curl the same url, you get a csv file?

screenshot of csvbase table in curl

The url is identical – so how come?

The answer is HTTP’s built-in “content negotiation”.

How content negotiation works

When an HTTP client sends any request, it sends “headers” with that request.
Here are the headers that Google Chrome sends:

accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-GB,en-US;q=0.9,en;q=0.8
cache-control: no-cache
pragma: no-cache
sec-ch-ua: "Google Chrome";v="105", "Not)A;Brand";v="8", "Chromium";v="105"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36

That’s a lot. Only the first of these is relevant: the accept header:

accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9

The accept header is an unordered list of preferences for what media type
(aka “Content Type”, or file format) the server should send.

Chrome is saying:

  • Ideally, give me:

    • text/html
    • or application/xhtml+xml
    • or image/avif
    • or image/webp
    • or image/apng
  • Otherwise (“q”, or quality, 0.9), give me:

    • application/xml
    • or application/signed-exchange;v=b3
  • And if none of these are available (lowest priority; q=0.8):

    • just send me anything (*/*)
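To make that reading of the header concrete, here is a minimal sketch in Python of how an accept header could be split into media types and their “q” values. The function name `parse_accept` is made up for illustration, and the sketch drops parameters other than q (such as v=b3), which a real parser would keep.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs, highest preference first."""
    preferences = []
    for item in header.split(","):
        # "application/xml;q=0.9" -> "application/xml" and ["q=0.9"]
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # q defaults to 1 when the client does not give one
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        preferences.append((media_type, q))
    # sorted() is stable, so media types with equal q keep their header order
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


chrome_accept = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,"
    "image/avif,image/webp,image/apng,*/*;q=0.8,"
    "application/signed-exchange;v=b3;q=0.9"
)
for media_type, q in parse_accept(chrome_accept):
    print(f"{q:.1f}  {media_type}")
```

The five types without an explicit q come out first, then the two at 0.9, then `*/*` at 0.8.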

csvbase has an HTML representation of the url being requested, which is
Chrome’s joint top preference, so it just replies with that.

What accept header does curl send? It sends just:

accept: */*

Much shorter. Curl will take anything. And csvbase has a default format:
csv, so it replies with that.
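The server side of this exchange can be sketched in a few lines of Python. This is not csvbase’s actual code: the function name `negotiate`, its parameters, and the decision to handle only exact matches and `*/*` (ignoring partial wildcards such as `text/*`) are all assumptions made for illustration.

```python
def negotiate(accept_header: str, available: list[str], default: str) -> str:
    """Pick the media type to reply with (a simplified sketch).

    `available` is what the server can produce; `default` is what it
    sends when the client will take anything.
    """
    best, best_q = None, 0.0
    for item in accept_header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # q defaults to 1 when not given
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        if media_type == "*/*":
            candidate = default  # "anything" means the server's default
        elif media_type in available:
            candidate = media_type
        else:
            continue  # the server cannot produce this type
        if q > best_q:
            best, best_q = candidate, q
    # Falling back to the default is a choice made for this sketch;
    # a stricter server could answer 406 Not Acceptable instead.
    return best if best is not None else default


formats = ["text/html", "text/csv"]
print(negotiate("text/html,application/xml;q=0.9,*/*;q=0.8", formats, "text/csv"))
print(negotiate("*/*", formats, "text/csv"))
```

With the browser-style header the sketch picks text/html, and with curl’s `*/*` it picks the text/csv default.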

“Why though?”, escape hatches and non-negotiables

The main reason why csvbase bothers at all with this is to make it easier to
export tables. For example, to get a table loaded into pandas, all you have
to do is paste the url (the same url as for the web page) into the first
argument of pandas’ read_csv:

screenshot of csvbase table in curl

This works in most tools: curl, pandas, R and many others.
But not all. Some, like Apache Spark, ask for HTML for some reason. So csvbase
has an escape hatch from content negotiation, which is adding a file
extension:

https://csvbase.com/meripaterson/stock-exchanges.csv
always returns a csv file.

This escape hatch is also useful for other formats. Media types are
managed by the Internet Assigned Numbers Authority.
Not every file format has been given a media type. For example, parquet (the
current favourite of most data scientists) doesn’t have a media type. Neither
does jsonlines. At least not yet.

csvbase can output in those formats too, but there is no way to content negotiate
them, at least not until the IANA get round to formally assigning them a
media type.

Until then, use:
https://csvbase.com/meripaterson/stock-exchanges.parquet

You can even do:

import pandas as pd

pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")
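The file-extension escape hatch could be sketched as a check that runs before any content negotiation. Again, this is a guess at the shape of the logic rather than csvbase’s real implementation: `pick_format`, the `FORMATS_BY_EXTENSION` table, and the crude accept-header fallback are hypothetical.

```python
import posixpath

# Hypothetical table of the extensions the server recognises.
FORMATS_BY_EXTENSION = {".csv": "csv", ".parquet": "parquet"}


def pick_format(url_path: str, accept_header: str) -> str:
    """Return a format name for a request (a sketch under assumptions)."""
    extension = posixpath.splitext(url_path)[1].lower()
    if extension in FORMATS_BY_EXTENSION:
        # An explicit extension wins, whatever the accept header says.
        return FORMATS_BY_EXTENSION[extension]
    # Deliberately crude stand-in for real content negotiation:
    # HTML if the client mentions it, otherwise the csv default.
    return "html" if "text/html" in accept_header else "csv"


browser_accept = "text/html,application/xhtml+xml,*/*;q=0.8"
print(pick_format("/meripaterson/stock-exchanges", browser_accept))
print(pick_format("/meripaterson/stock-exchanges.csv", browser_accept))
print(pick_format("/meripaterson/stock-exchanges.parquet", "*/*"))
```

Because the extension is checked first, a client that asks for HTML still gets csv or parquet when the url says so, and formats without a media type remain reachable.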
