🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…

2024-01-13 13:28:12

ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a free central archive, but they require all archives to be public, and they can't save every type of content.

ArchiveBox is an open source tool that helps you archive web content on your own (or privately within an organization): save copies of browser bookmarks, preserve evidence for legal cases, back up photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more…

➡️ Use ArchiveBox as a command-line package and/or self-hosted web app on Linux, macOS, or in Docker.


📥 You can feed ArchiveBox URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

💾 It saves snapshots of the URLs you feed it in several redundant formats.
It also detects any content featured inside each webpage & extracts it out into a folder:

  • HTML/Generic websites -> HTML, PDF, PNG, WARC, Singlefile
  • YouTube/SoundCloud/etc. -> MP3/MP4 + subtitles, description, thumbnail
  • News articles -> article body TXT + title, author, featured images
  • Github/Gitlab/etc. links -> git cloned source code
  • and more…

It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI.


🏛️ ArchiveBox is used by many professionals and hobbyists who save content off the web, for example:

  • Individuals:
    backing up browser bookmarks/history, saving FB/Insta/etc. content, shopping lists
  • Journalists:
    crawling and collecting research, preserving quoted material, fact-checking and review
  • Lawyers:
    evidence collection, hashing & integrity verifying, search, tagging, & review
  • Researchers:
    collecting AI training sets, feeding analysis / web crawling pipelines

The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.

📦  Get ArchiveBox with docker / apt / brew / pip3 / nix / etc. (see Quickstart below).

# Get ArchiveBox with Docker or Docker Compose (recommended)
docker run -v $PWD/data:/data -it archivebox/archivebox:dev init --setup

# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
pip3 install archivebox

# Or use the optional auto setup script to install it
curl -sSL 'https://get.archivebox.io' | sh

🔢 Example usage: adding links to archive.

archivebox add 'https://example.com'                                   # add URLs one at a time
archivebox add < ~/Downloads/bookmarks.json                            # or pipe in URLs in any text-based format
archivebox schedule --every=day --depth=1 https://example.com/rss.xml  # or auto-import URLs regularly on a schedule

🔢 Example usage: viewing the archived content.

archivebox server 0.0.0.0:8000            # use the interactive web UI
archivebox list 'https://example.com'     # use the CLI commands (--help for more)
ls ./archive/*/index.json                 # or browse directly via the filesystem

Key Features

🤝 Professional Integration

Contact us if your non-profit institution/org wants to use ArchiveBox professionally.

  • setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc.
  • for individuals, NGOs, academia, governments, journalism, law, and more…

All our work is open-source and primarily geared towards non-profits.
Support/consulting pays for hosting and funds new ArchiveBox open-source development.


🖥  Supported OSs: Linux/BSD, macOS, Windows (Docker)   👾  CPUs: amd64 (x86_64), arm64 (arm8), arm7 (raspi>=3)
Note: On arm7 the playwright package is not available, so chromium must be installed manually if needed.

✳️  Easy Setup

Docker docker-compose (macOS/Linux/Windows)   👈  recommended

👍 Docker Compose is recommended for the easiest install/update UX + best security + all the extras out-of-the-box.

  1. Install Docker and Docker Compose on your system (if not already installed).
  2. Download the docker-compose.yml file into a new empty directory (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'

  3. Run the initial setup and create an admin user.
    docker compose run archivebox init --setup

  4. Optional: Start the server, then log in to the Web UI http://127.0.0.1:8000 ⇢ Admin.
    docker compose up
    # completely optional, CLI can always be used without running a server
    # docker compose run [-T] archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

Docker docker run (macOS/Linux/Windows)

  1. Install Docker on your system (if not already installed).
  2. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    docker run -v $PWD:/data -it archivebox/archivebox init --setup

  3. Optional: Start the server, then log in to the Web UI http://127.0.0.1:8000 ⇢ Admin.
    docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
    # completely optional, CLI can always be used without running a server
    # docker run -v $PWD:/data -it archivebox/archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

bash auto-setup script (macOS/Linux)

  1. Install Docker on your system (optional, highly recommended but not required).
  2. Run the automatic setup script.
    curl -sSL 'https://get.archivebox.io' | sh

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See setup.sh for the source code of the auto-install script.
See the “Against curl | sh as an install method” blog post for my thoughts on the shortcomings of this install method.

🛠  Package Manager Setup

aptitude apt (Ubuntu/Debian)

  1. Add the ArchiveBox repository to your sources.
    echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
    sudo apt update

  2. Install the ArchiveBox package using apt.
    sudo apt install archivebox
    sudo python3 -m pip install --upgrade --ignore-installed archivebox   # pip needed because apt only provides a broken older version of Django

  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    archivebox init --setup           # if any problems, install with pip instead

    Note: If you encounter issues with NPM/NodeJS, install a more recent version.

  4. Optional: Start the server, then log in to the Web UI http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See the debian-archivebox repo for more details about this distribution.

homebrew brew (macOS)

  1. Install Homebrew on your system (if not already installed).
  2. Install the ArchiveBox package using brew.
    brew tap archivebox/archivebox
    brew install archivebox

  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    archivebox init --setup         # if any problems, install with pip instead

  4. Optional: Start the server, then log in to the Web UI http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See the homebrew-archivebox repo for more details about this distribution.

Pip pip (macOS/Linux/BSD)

  1. Install Python >= v3.9 and Node >= v18 on your system (if not already installed).
  2. Install the ArchiveBox package using pip3.
    pip3 install archivebox

  3. Create a new empty directory and initialize your collection (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    archivebox init --setup
    # install any missing extras like wget/git/ripgrep/etc. manually as needed

  4. Optional: Start the server, then log in to the Web UI http://127.0.0.1:8000 ⇢ Admin.
    archivebox server 0.0.0.0:8000
    # completely optional, CLI can always be used without running a server
    # archivebox [subcommand] [--args]

See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
See the pip-archivebox repo for more details about this distribution.

Arch pacman / FreeBSD pkg / Nix nix (Arch/FreeBSD/NixOS/more)

> [!WARNING]
> *These are contributed by external volunteers and may lag behind the official `pip` channel.*

See below for usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.

🎗  Other Options

Docker docker + Electron electron Desktop App (macOS/Linux/Windows)

  1. Install Docker on your system (if not already installed).
  2. Download a binary release for your OS or build the native app from source.

✨ Alpha (contributors wanted!): for more info, see the Electron ArchiveBox repo.

Paid hosting solutions (cloud VPS)

For more discussion on managed and paid hosting options see here: Issue #531.

➡️  Next Steps

Usage

⚡️  CLI Usage

# archivebox [subcommand] [--args]
# docker-compose run archivebox [subcommand] [--args]
# docker run -v $PWD:/data -it archivebox/archivebox [subcommand] [--args]

archivebox init --setup      # safe to run init multiple times (also how you update versions)
archivebox --version
archivebox help
  • archivebox setup/init/config/status/manage to administer your collection
  • archivebox add/schedule/remove/update/list/shell/oneshot to manage Snapshots in the archive
  • archivebox schedule to pull in fresh URLs regularly from bookmarks/history/Pocket/Pinboard/RSS/etc.

🖥  Web UI Usage

archivebox manage createsuperuser  # set an admin password
archivebox server 0.0.0.0:8000     # open http://127.0.0.1:8000 to view it

# you can also configure whether or not login is required for most features
archivebox config --set PUBLIC_INDEX=False
archivebox config --set PUBLIC_SNAPSHOTS=False
archivebox config --set PUBLIC_ADD_VIEW=False

🗄  SQL/Python/Filesystem Usage

sqlite3 ./index.sqlite3    # run SQL queries on your index
archivebox shell           # explore the Python API in a REPL
ls ./archive/*/index.html  # or inspect snapshots on the filesystem
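
As a minimal sketch of the SQL route driven from Python, the snippet below opens the index read-only with the standard-library `sqlite3` module and lists the newest rows. The table name `core_snapshot` and the `timestamp`, `url`, and `title` columns are assumptions about the schema that this page does not confirm; check them with `.tables` and `.schema` in the `sqlite3` shell first. `recent_snapshots` is a hypothetical helper, not part of ArchiveBox.

```python
import sqlite3
from pathlib import Path


def recent_snapshots(db_path, limit=10, table="core_snapshot"):
    """Return up to `limit` (timestamp, url, title) rows, newest first.

    ASSUMPTION: the table and column names are illustrative guesses at the
    schema; verify them with `.schema` before relying on this.
    """
    # Open read-only so a query can never modify the index by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        cursor = conn.execute(
            f'SELECT timestamp, url, title FROM "{table}" '
            "ORDER BY timestamp DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for timestamp, url, title in recent_snapshots("./index.sqlite3"):
        print(timestamp, url, title)
```

Run it from inside the data folder so that `./index.sqlite3` resolves; the read-only URI keeps the sketch from touching the index.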

Input Formats

ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!

Click these links for instructions on how to prepare your links from these sources:

# archivebox add --help
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'
echo 'http://example.com' | archivebox add
echo 'any_text_with [urls](https://example.com) in it' | archivebox add

# if using Docker, add -i when piping stdin:
# echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
# if using Docker Compose, add -T when piping stdin / stdout:
# echo 'https://example.com' | docker compose run -T archivebox add

See the Usage: CLI page for documentation and examples.

It also includes a built-in scheduled import feature with archivebox schedule and browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly/on-demand.
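
To illustrate the "any text with URLs in it" idea, here is a rough Python sketch that pulls http(s) URLs out of arbitrary text with a regular expression. It is not ArchiveBox's actual parser, which this page does not describe; `extract_urls` and the pattern are hypothetical and deliberately simple.

```python
import re

# Deliberately simple pattern: a scheme, then any run of characters that are
# not whitespace, quotes, angle brackets, or a closing bracket/parenthesis.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>\])]+""")


def extract_urls(text):
    """Return the unique http(s) URLs found in `text`, in first-seen order."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen
```

A pattern like this will miss or truncate some valid URLs (for example ones containing parentheses), so treat it as a pre-filter for your own lists rather than a replacement for `archivebox add`.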

Output Formats

Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:

./archive/TIMESTAMP/*

  • Index: index.html & index.json HTML and JSON index files containing metadata and details
  • Title, Favicon, Headers Response headers, site favicon, and parsed site title
  • SingleFile: singlefile.html HTML snapshot rendered with headless Chrome using SingleFile
  • Wget Clone: example.com/page-name.html wget clone of the site with warc/TIMESTAMP.gz
  • Chrome Headless
    • PDF: output.pdf Printed PDF of site using headless chrome
    • Screenshot: screenshot.png 1440×900 screenshot of site using headless chrome
    • DOM Dump: output.html DOM Dump of the HTML after rendering using headless chrome
  • Article Text: article.html/json Article text extraction using Readability & Mercury
  • Archive.org Permalink: archive.org.txt A link to the saved site on archive.org
  • Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
  • Source Code: git/ clone of any repository found on GitHub, Bitbucket, or GitLab links
  • More coming soon! See the Roadmap

It does everything out-of-the-box by default, but you can disable or tweak individual archive methods via environment variables / config.

Configuration

ArchiveBox can be configured via environment variables, by using the archivebox config CLI, or by editing ./ArchiveBox.conf directly.

archivebox config                               # view the entire config
archivebox config --get CHROME_BINARY           # view a specific value

archivebox config --set CHROME_BINARY=chromium  # persist a config using CLI
# OR
echo CHROME_BINARY=chromium >> ArchiveBox.conf  # persist a config using file
# OR
env CHROME_BINARY=chromium archivebox ...       # run with a one-off config

These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details.

The config loading logic with all the options defined is here: archivebox/config.py.

Most options are also documented on the Configuration Wiki page.
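
To make the three configuration routes concrete, here is a small Python sketch that reads `KEY=VALUE` lines from a config file and overlays environment variables on top. It assumes environment variables take precedence over the file, which this page does not state, and `load_config` is a hypothetical helper rather than ArchiveBox's own loader (that lives in archivebox/config.py).

```python
import os


def load_config(conf_path, defaults, environ=None):
    """Merge defaults, a KEY=VALUE config file, and environment variables.

    ASSUMPTION: precedence is defaults < file < environment. Only keys
    present in `defaults` are read, and all values are kept as strings.
    """
    environ = os.environ if environ is None else environ
    config = dict(defaults)

    if os.path.exists(conf_path):
        with open(conf_path, encoding="utf-8") as conf_file:
            for raw_line in conf_file:
                line = raw_line.strip()
                # Skip blanks, comments, and INI-style [SECTION] headers.
                if not line or line.startswith(("#", ";", "[")) or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                if key.strip() in config:
                    config[key.strip()] = value.strip()

    for key in config:
        if key in environ:
            config[key] = environ[key]
    return config
```

For example, `load_config("ArchiveBox.conf", {"CHROME_BINARY": "chrome", "TIMEOUT": "60"})` would return the file's values for those two keys unless the same names are set in the environment.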

Most Common Options to Tweak

# e.g. archivebox config --set TIMEOUT=120

TIMEOUT=120                # default: 60    add more seconds on slower networks
CHECK_SSL_VALIDITY=False   # default: True  False = allow saving URLs w/ bad SSL
SAVE_ARCHIVE_DOT_ORG=False # default: True  False = disable Archive.org saving
MEDIA_MAX_SIZE=1500m       # default: 750m  raise/lower youtubedl output size

PUBLIC_INDEX=True          # default: True  whether anon users can view index
PUBLIC_SNAPSHOTS=True      # default: True  whether anon users can view pages
PUBLIC_ADD_VIEW=False      # default: False whether anon users can add new URLs

CHROME_USER_AGENT="Mozilla/5.0 ..."  # change these to get around bot blocking
WGET_USER_AGENT="Mozilla/5.0 ..."
CURL_USER_AGENT="Mozilla/5.0 ..."

Dependencies

To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools specializing in extracting different types of content.

> *TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the [⭐️ official Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.*

These optional dependencies used for archiving sites include:

- `chromium` / `chrome` (for screenshots, PDF, DOM HTML, and headless JS scripts)
- `node` & `npm` (for readability, mercury, and singlefile)
- `wget` (for plain HTML, static files, and WARC saving)
- `curl` (for fetching headers, favicon, and posting to Archive.org)
- `yt-dlp` or `youtube-dl` (for audio, video, and subtitles)
- `git` (for cloning git repos)
- `singlefile` (for saving into a self-contained html file)
- `postlight/parser` (for discussion threads, forums, and articles)
- `readability` (for articles and long text content)
- and more as we grow…

You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your `$PATH`.

If not using Docker, make sure to keep the dependencies up-to-date yourself and check that ArchiveBox isn't reporting any incompatibility with the versions you install.
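
The `$PATH`-based detection described above can be approximated with the standard library. This is only a sketch of the idea using `shutil.which`, not ArchiveBox's own detection logic; `find_dependencies` is a hypothetical helper, and the binary names are taken from the list above.

```python
import shutil


def find_dependencies(binaries):
    """Map each binary name to its resolved path on $PATH, or None if missing."""
    return {name: shutil.which(name) for name in binaries}


if __name__ == "__main__":
    # Names taken from the dependency list above; adjust to your setup.
    wanted = ["chromium", "node", "npm", "wget", "curl", "yt-dlp", "git"]
    for name, path in find_dependencies(wanted).items():
        print(f"{name:10} {'found at ' + path if path else 'missing'}")
```

A binary reported as missing here would correspond to an extractor being skipped; `archivebox --version` remains the authoritative check.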

```bash
# install python3 and archivebox with your system package manager
# apt/brew/pip/etc install ... (see Quickstart instructions above)

archivebox setup       # auto install all the extractors and extras
archivebox --version   # see info and check validity of installed dependencies
```

Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not officially supported** (I cannot respond to Windows support tickets), but some advanced users have reported getting it working.

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/wiki/Install#dependencies
- https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install
- https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
- https://github.com/ArchiveBox/ArchiveBox/wiki/Troubleshooting#installing

Archive Layout

All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the “ArchiveBox Data Folder”.
Data folders can be created anywhere (~/archivebox or $PWD/data as seen in our examples), and you can create more than one for different collections.

All archivebox CLI commands are designed to be run from inside an ArchiveBox data folder, starting with archivebox init to initialize a new collection inside an empty directory.

mkdir ~/archivebox && cd ~/archivebox   # just an example, can be anywhere
archivebox init

The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard index.sqlite3 database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the ./archive/ subfolder.

/data/
    index.sqlite3
    ArchiveBox.conf
    archive/
        ...
        1617687755/
            index.html
            index.json
            screenshot.png
            media/some_video.mp4
            warc/1617687755.warc.gz
            git/somerepo.git
            ...

Each snapshot subfolder ./archive/TIMESTAMP/ includes a static index.json and index.html describing its contents, and the snapshot extractor outputs are plain files within the folder.
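
Because the layout is plain folders and JSON, it can be walked without ArchiveBox installed. The sketch below iterates over `./archive/*/index.json` and loads each file; `iter_snapshots` is a hypothetical helper, and the `url` and `title` keys read in the usage example are assumptions about the JSON contents rather than a documented schema.

```python
import json
from pathlib import Path


def iter_snapshots(data_dir):
    """Yield (folder_name, metadata_dict) for each snapshot with an index.json."""
    archive_dir = Path(data_dir) / "archive"
    if not archive_dir.is_dir():
        return
    for folder in sorted(archive_dir.iterdir()):
        index_file = folder / "index.json"
        if index_file.is_file():
            with index_file.open(encoding="utf-8") as fh:
                yield folder.name, json.load(fh)


if __name__ == "__main__":
    for timestamp, meta in iter_snapshots("."):
        # ASSUMPTION: "url" and "title" keys; .get() tolerates their absence.
        print(timestamp, meta.get("url"), meta.get("title"))
```

Folders without an `index.json` are skipped silently, which keeps the sketch usable on partially written archives.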

Learn More

  • https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Disk-Layout
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#large-archives
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#output-folder
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive
  • https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives

Static Archive Exporting

You can export the main index to browse it statically as plain HTML files in a folder (without needing to run a server).

> *NOTE: These exports are not paginated, so exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*

```bash
# archivebox list --help
archivebox list --html --with-headers > index.html     # export to static html table
archivebox list --json --with-headers > index.json     # export to json blob
archivebox list --csv=timestamp,url,title > index.csv  # export to csv spreadsheet

# (if using Docker Compose, add the -T flag when piping)
# docker compose run -T archivebox list --html --filter-type=search snozzberries > index.json
```

The paths in the static exports are relative, so make sure to keep them next to your `./archive` folder when backing them up or viewing them.

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive#2-export-and-host-it-as-static-html
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#public_index--public_snapshots--public_add_view


Caveats

Archiving Private Content

If you're importing pages with private content or URLs containing secret tokens you don't want public (e.g. Google Docs, paywalled content, unlisted videos, etc.), you may want to disable some of the extractor methods to avoid leaking that content to 3rd party APIs or the public.

```bash
# don't save private content to ArchiveBox, e.g.:
archivebox add 'https://docs.google.com/document/d/12345somePrivateDocument'
archivebox add 'https://vimeo.com/somePrivateVideo'

# without first disabling saving to Archive.org:
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False  # disable saving all URLs in Archive.org

# restrict the main index, Snapshot content, and Add Page to authenticated users as-needed:
archivebox config --set PUBLIC_INDEX=False
archivebox config --set PUBLIC_SNAPSHOTS=False
archivebox config --set PUBLIC_ADD_VIEW=False

# if extra paranoid or anti-Google:
archivebox config --set SAVE_FAVICON=False          # disable favicon fetching (it calls a Google API passing the URL's domain part only)
archivebox config --set CHROME_BINARY=chromium      # ensure it's using Chromium instead of Chrome
```

> *CAUTION: Assume anyone *viewing* your archives will be able to see any cookies, session tokens, or private URLs passed to ArchiveBox during archiving.*
> *Make sure to secure your ArchiveBox data and don't share snapshots with others without stripping out sensitive headers and content first.*

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
- https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file

Security Risks of Viewing Archived JS

Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context, and typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks. See the Security Overview page and Issue #239 for more details.

```bash
# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html

# example.com/index.js can now make a request to read everything from:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server
```

The admin UI is also served from the same origin as replayed JS, so malicious pages could also potentially use your ArchiveBox login cookies to perform admin actions (e.g. adding/removing links, running extractors, etc.). We are planning to fix this security shortcoming in a future version by using separate ports/origins to serve the Admin UI and archived content (see [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239)).

> *NOTE: Only the `wget` & `dom` extractor methods execute archived JS when viewing snapshots, all other archive methods produce static output that does not execute JS on viewing.*
> *If you are worried about these issues ^ you should disable these extractors using `archivebox config --set SAVE_WGET=False SAVE_DOM=False`.*

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview
- https://github.com/ArchiveBox/ArchiveBox/issues/239
- https://github.com/ArchiveBox/ArchiveBox/security/advisories/GHSA-cr45-98w9-gwqx (`CVE-2023-45815`)
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#publishing

Working Around Sites that Block Archiving

For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) actively block archiving or bots in general. There are a number of approaches to work around this.

- Set [`CHROME_USER_AGENT`, `WGET_USER_AGENT`, `CURL_USER_AGENT`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#curl_user_agent) to impersonate a real browser (instead of an ArchiveBox bot)
- Set up a logged-in browser session for archiving using [`CHROME_USER_DATA_DIR` & `COOKIES_FILE`](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile)
- Rewrite your URLs before archiving to swap in an alternative frontend that's more bot-friendly e.g.
`reddit.com/some/url` -> `teddit.net/some/url`: https://github.com/mendel5/alternative-front-ends
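
The URL-rewriting approach in the last bullet can be sketched in a few lines of Python. `rewrite_url` and the `FRONTENDS` mapping are hypothetical; only the reddit to teddit pair comes from the text above, and whether any given frontend is still online is something to check yourself.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping; only the reddit -> teddit pair comes from the text above.
FRONTENDS = {"reddit.com": "teddit.net"}


def rewrite_url(url, frontends=FRONTENDS):
    """Swap the host of `url` for an alternative frontend if one is mapped."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    bare_host = host[4:] if host.startswith("www.") else host
    replacement = frontends.get(bare_host)
    if replacement is None:
        return url  # leave unmapped sites untouched
    # Note: any port or user info in the original netloc is dropped here.
    return urlunsplit(parts._replace(netloc=replacement))
```

The rewritten URLs can then be piped into `archivebox add` like any other list of links.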

In the future we plan on adding support for running JS scripts during archiving to block ads, cookie popups, modals, and fix other issues. Follow here for progress: [Issue #51](https://github.com/ArchiveBox/ArchiveBox/issues/51).

Saving Multiple Snapshots of a Single URL

ArchiveBox appends a hash with the current date https://example.com#2020-10-24 to differentiate when a single URL is archived multiple times.

Because ArchiveBox uniquely identifies snapshots by URL, it must use a workaround to take multiple snapshots of the same URL (otherwise they would show up as a single Snapshot entry). It makes the URLs of repeated snapshots unique by adding a hash with the archive date at the end:

```bash
archivebox add 'https://example.com#2020-10-24'

archivebox add 'https://example.com#2020-10-25'
```

The Re-Snapshot button in the Admin UI is a shortcut for this hash-date multi-snapshotting workaround.

Improved support for saving multiple snapshots of a single URL without this hash-date workaround will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
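
If you generate URL lists from a script, the hash-date workaround is easy to apply yourself. The sketch below is one way to do it in Python; `with_date_fragment` is a hypothetical helper, and unlike a plain append it replaces any fragment the URL already carries, which may not be what you want for pages that rely on fragments.

```python
from datetime import date
from urllib.parse import urldefrag


def with_date_fragment(url, day=None):
    """Return `url` with its fragment replaced by an ISO date (YYYY-MM-DD)."""
    day = date.today() if day is None else day
    base, _old_fragment = urldefrag(url)
    return f"{base}#{day.isoformat()}"
```

Calling it without `day` uses today's date, so running the same import on different days yields distinct Snapshot entries.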

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/issues/179
- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#explanation-of-buttons-in-the-web-ui---admin-snapshots-list

Storage Requirements

Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive.
There are also some special requirements when using filesystems like NFS/SMB/FUSE.

**ArchiveBox can use anywhere from ~1GB per 1000 articles, to ~50GB per 1000 articles**, mostly dependent on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750m`.

Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). **Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `archive/` folder. **Try to keep the `index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; the `archive/` folder, however, can be on a network mount or slower HDD.

If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
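
For rough capacity planning, the per-1000 figures above can be turned into a quick estimate. This is simple arithmetic on the quoted ~1GB and ~50GB bounds, not a measurement; `estimate_storage_gb` is a hypothetical helper and real usage depends heavily on which extractors are enabled.

```python
def estimate_storage_gb(num_urls, gb_per_1000_low=1.0, gb_per_1000_high=50.0):
    """Return a (low, high) disk-usage estimate in GB for `num_urls` snapshots."""
    if num_urls < 0:
        raise ValueError("num_urls must be non-negative")
    scale = num_urls / 1000
    return scale * gb_per_1000_low, scale * gb_per_1000_high
```

For a collection of 20,000 URLs this gives a range of roughly 20 GB to 1 TB, which is why disabling media saving or lowering its size cap matters so much.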

#### Learn More

- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Disk-Layout
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#output-folder
- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#large-archives
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid
- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#do-not-run-as-root


Screenshots


ArchiveBox aims to enable more of the internet to be saved from deterioration by empowering people to self-host their own archives. The intent is for all the web content you care about to be viewable with common software in 50 - 100 years without needing to run ArchiveBox or other specialized software to replay it.

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010's flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content,
ArchiveBox archives the sites in **several different formats** beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

Comparison to Other Projects


Check out our community wiki for a list of web archiving tools and orgs.

A variety of open and closed-source archiving projects exist, but few provide a nice UI and CLI to manage a large, high-fidelity archive collection over time.


ArchiveBox tries to be a robust, set-and-forget archiving solution suitable for archiving RSS feeds, bookmarks, or your entire browsing history (beware, it may be too big to store), including private/authenticated content that you wouldn't otherwise share with a centralized service.

Comparison With Centralized Public Archives

Not all content is suitable to be archived in a centralized collection, whether because it's private, copyrighted, too large, or too complex. ArchiveBox hopes to fill that gap.

By having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle. The eventual goal is to work towards federated archiving where users can share portions of their collections with each other.

See Also

Comparison With Other Self-Hosted Archiving Options

ArchiveBox differentiates itself from [similar self-hosted projects](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects) by providing a comprehensive CLI for managing your archive, a Web UI that can be used either independently or together with the CLI, and a simple on-disk data format that can be used without either.

*If you want better fidelity for very complex interactive pages with heavy JS/streams/API requests, check out [ArchiveWeb.page](https://archiveweb.page) and [ReplayWeb.page](https://replayweb.page).*

*If you want more bookmark categorization and note-taking features, check out [Archivy](https://archivy.github.io/), [Memex](https://github.com/WorldBrain/Memex), [Polar](https://getpolarized.io/), or [LinkAce](https://www.linkace.org/).*

*If you need more advanced recursive spider/crawling ability beyond `--depth=1`, check out [Browsertrix](https://github.com/webrecorder/browsertrix-crawler), [Photon](https://github.com/s0md3v/Photon), or [Scrapy](https://scrapy.org/) and pipe the outputted URLs into ArchiveBox.*
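
As a rough illustration of the "pipe the outputted URLs into ArchiveBox" step, the sketch below pulls absolute links out of one HTML page using only Python's standard library and prints them one per line. The `LinkCollector` class and `extract_links` function are hypothetical helpers, not part of ArchiveBox or of the crawlers listed above, and a real crawler handles far more (robots.txt, JS-rendered links, rate limiting).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return de-duplicated absolute http(s) links in document order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


if __name__ == "__main__":
    sample = '<a href="/about">About</a> <a href="https://example.org/x">X</a>'
    for url in extract_links(sample, "https://example.com/"):
        print(url)
```

Saved as a script (say `list_links.py`, a made-up name), the output could be piped along with `python list_links.py | archivebox add`, since `archivebox add` reads URLs from stdin.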

For more alternatives, see our [list here](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects)…

ArchiveBox is neither the highest fidelity nor the simplest tool available for self-hosted archiving; rather, it's a jack-of-all-trades that tries to do most things well by default. We encourage you to try these other tools made by our friends if ArchiveBox isn't suited to your needs.



Web Archiving Ecosystem

Our Community Wiki page serves as an index of the broader web archiving community.

  • See where archivists hang out online
  • Explore other open-source tools for your web archiving needs
  • Learn which organizations are the big players in the web archiving space

Explore our index of web archiving software, blogs, and communities around the world…

- [Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community)
  - [The Master Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists)
    _Community-maintained indexes of archiving tools and institutions._
  - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects)
    _Open source tools and projects in the internet archiving space._
  - [Reading List](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#reading-list)
    _Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._
  - [Communities](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#communities)
    _A collection of the most active internet archiving communities and initiatives._
- Check out the ArchiveBox [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap) and [Changelog](https://github.com/ArchiveBox/ArchiveBox/wiki/Changelog)
- Learn why archiving the internet is important by reading the "[On the Importance of Web Archiving](https://items.ssrc.org/parameters/on-the-importance-of-web-archiving/)" blog post.
- Reach out to me for questions and comments via [@ArchiveBoxApp](https://twitter.com/ArchiveBoxApp) or [@theSquashSH](https://twitter.com/thesquashSH) on Twitter

Need help building a custom archiving solution?

Hire the team that built ArchiveBox to work on your project. (@ArchiveBoxApp)

(We also offer general software consulting across many industries)


Documentation

We use the GitHub wiki system and Read the Docs (WIP) for documentation.

You can also access the docs locally by looking in the ArchiveBox/docs/ folder.

Getting Started

Advanced

Developers

More Info


Development

All contributions to ArchiveBox are welcomed! Check our issues and Roadmap for things to work on, and please open an issue to discuss your proposed implementation before working on things! Otherwise we may have to close your PR if it doesn't align with our roadmap.

For low hanging fruit / easy first tickets, see: ArchiveBox/Issues #good first ticket #help wanted.

Python API Documentation: https://docs.archivebox.io/en/dev/archivebox.html#module-archivebox.main

Set up the dev environment


#### 1. Clone the main code repo (making sure to pull the submodules as well)

```bash
git clone --recurse-submodules https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout dev  # or the branch you want to test
git submodule update --init --recursive
git pull --recurse-submodules
```

#### 2. Option A: Install the Python, JS, and system dependencies directly on your machine

```bash
# Install ArchiveBox + python dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'
# or: pipenv install --dev && pipenv shell

# Install node dependencies
npm install
# or
archivebox setup

# Check to see if anything is missing
archivebox --version
# install any missing dependencies manually, or use the helper script:
./bin/setup.sh
```

#### 2. Option B: Build the docker container and use that for development instead

```bash
# Optional: develop via docker by mounting the code dir into the container
# if you edit e.g. ./archivebox/core/models.py on the docker host, runserver
# inside the container will reload and pick up your changes
docker build . -t archivebox
docker run -it \
    -v $PWD/data:/data \
    archivebox init --setup
docker run -it -p 8000:8000 \
    -v $PWD/data:/data \
    -v $PWD/archivebox:/app/archivebox \
    archivebox server 0.0.0.0:8000 --debug --reload

# (remove the --reload flag and add the --nothreading flag when profiling with the django debug toolbar)
# When using --reload, make sure any files you create can be read by the user in the Docker container, e.g. with 'chmod a+rX'.
```
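
The `chmod a+rX` hint in the last comment can also be applied from Python. This is a minimal sketch using only the standard library; `make_world_readable` is a hypothetical helper name, and it applies the recursive form (`chmod -R a+rX`): read for everyone, execute only on directories and on files that are already executable by someone.

```python
import os
import stat


def make_world_readable(root):
    """Recursively apply the equivalent of `chmod -R a+rX` to `root`."""
    read_all = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    exec_all = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH

    def fix(path):
        if os.path.islink(path):
            return  # leave symlinks (and their targets) untouched
        mode = stat.S_IMODE(os.stat(path).st_mode)
        new_mode = mode | read_all
        # Capital X: add execute only for directories or already-executable files
        if os.path.isdir(path) or mode & exec_all:
            new_mode |= exec_all
        os.chmod(path, new_mode)

    fix(root)
    # Top-down walk: each directory is fixed before os.walk descends into it
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            fix(os.path.join(dirpath, name))
```

Calling `make_world_readable("archivebox")` on the mounted code directory is one way to use it; whether that is sufficient depends on which user the container runs as.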

Common development tasks

See the ./bin/ folder and read the source of the bash scripts within.
You can also run all these in Docker. For more examples see the GitHub Actions CI/CD tests that are run: .github/workflows/*.yaml.

Run in DEBUG mode


```bash
archivebox config --set DEBUG=True
# or
archivebox server --debug ...
```

https://stackoverflow.com/questions/1074212/how-can-i-see-the-raw-sql-queries-django-is-running

Install and run a specific GitHub branch


##### Use a Pre-Built Image

If you're looking for the latest `dev` Docker image, it's often available pre-built on Docker Hub; simply pull and use `archivebox/archivebox:dev`.

```bash
docker pull archivebox/archivebox:dev
docker run archivebox/archivebox:dev version
# verify the BUILD_TIME and COMMIT_HASH in the output are recent
```

##### Build Branch from Source

You can also build and run any branch yourself from source, for example to build & use `dev` locally:

```bash
# docker-compose.yml:
services:
    archivebox:
        image: archivebox/archivebox:dev
        build: 'https://github.com/ArchiveBox/ArchiveBox.git#dev'

# or with plain Docker:
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -it -v $PWD:/data archivebox:dev init --setup

# or with pip:
pip install 'git+https://github.com/pirate/ArchiveBox@dev'
npm install 'git+https://github.com/ArchiveBox/ArchiveBox.git#dev'
archivebox init --setup
```

Run the linters


```bash
./bin/lint.sh
```
(uses `flake8` and `mypy`)

Run the integration tests


```bash
./bin/test.sh
```
(uses `pytest -s`)

Make migrations or enter a django shell


Make sure to run this whenever you change things in `models.py`.

```bash
cd archivebox/
./manage.py makemigrations

cd path/to/test/data/
archivebox shell
archivebox manage dbshell
```

https://stackoverflow.com/questions/1074212/how-can-i-see-the-raw-sql-queries-django-is-running

Contributing a new extractor

ArchiveBox [`extractors`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/media.py) are external binaries or Python/Node scripts that ArchiveBox runs to archive content on a page.

Extractors take the URL of a page to archive, write their output to the filesystem `archive/TIMESTAMP/EXTRACTOR/...`, and return an [`ArchiveResult`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/core/models.py#:~:text=return%20qs-,class%20ArchiveResult,-(models.Model)%3A) entry which is saved to the database (visible on the `Log` page in the UI).
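
To make that contract concrete, here is a toy extractor written as a sketch. It is not ArchiveBox's real API: the `ExtractorResult` dataclass merely stands in for `ArchiveResult`, and the `save_title` function, its arguments, the status strings, and the `title` output folder are all hypothetical. It only shows the shape described above: take a URL, write output under `archive/TIMESTAMP/EXTRACTOR/`, and return a result record.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from urllib.request import urlopen


@dataclass
class ExtractorResult:
    """Hypothetical stand-in for ArchiveBox's ArchiveResult record."""
    extractor: str
    url: str
    status: str   # "succeeded" or "failed" (illustrative values)
    output: str   # path of the written file, or an error message


def fetch_html(url):
    """Download a page body as text (no retries, no JS execution)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def save_title(url, timestamp, archive_root="archive", fetch=fetch_html):
    """Toy extractor: write the page <title> to archive/TIMESTAMP/title/title.txt."""
    out_dir = Path(archive_root) / timestamp / "title"
    try:
        html = fetch(url)
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else ""
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / "title.txt"
        out_file.write_text(title, encoding="utf-8")
        return ExtractorResult("title", url, "succeeded", str(out_file))
    except Exception as err:  # a real extractor would catch narrower errors
        return ExtractorResult("title", url, "failed", str(err))
```

Passing `fetch` as a parameter keeps the sketch testable offline; a real extractor would more likely shell out to an external binary and record its exit status.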

*Check out how we added **[`archivebox/extractors/singlefile.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/singlefile.py)** as an example of the process: [Issue #399](https://github.com/ArchiveBox/ArchiveBox/issues/399) + [PR #403](https://github.com/ArchiveBox/ArchiveBox/pull/403).*

**The process to contribute a new extractor is like this:**

1. [Open an issue](https://github.com/ArchiveBox/ArchiveBox/issues/new?assignees=&labels=changes%3A+behavior%2Cstatus%3A+idea+phase&template=feature_request.md&title=Feature+Request%3A+...) with your proposed implementation (please link to the pages of any new external dependencies you plan on using)
2. Ensure any dependencies needed are easily installable via a package manager like `apt`, `brew`, `pip3`, `npm`
   (Ideally, prefer to use external programs available via `pip3` or `npm`; however, we do support using any binary installable via package manager that exposes a CLI/Python API and writes output to stdout or the filesystem.)
3. Create a new file in [`archivebox/extractors/EXTRACTOR.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors) (copy an existing extractor like [`singlefile.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/singlefile.py) as a template)
4. Add config settings to enable/disable any new dependencies and the extractor as a whole, e.g. `USE_DEPENDENCYNAME`, `SAVE_EXTRACTORNAME`, `EXTRACTORNAME_SOMEOTHEROPTION` in [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py)
5. Add a preview section to [`archivebox/templates/core/snapshot.html`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/templates/core/snapshot.html) to view the output, and a column to [`archivebox/templates/core/index_row.html`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/templates/core/index_row.html) with an icon for your extractor
6. Add an integration test for your extractor in [`tests/test_extractors.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/tests/test_extractors.py)
7. [Submit your PR for review!](https://github.com/ArchiveBox/ArchiveBox/blob/dev/.github/CONTRIBUTING.md) 🎉
8. Once merged, please document it in these places and anywhere else you see info about other extractors:
    - https://github.com/ArchiveBox/ArchiveBox#output-formats
    - https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#archive-method-toggles
    - https://github.com/ArchiveBox/ArchiveBox/wiki/Install#dependencies
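
The toggles in step 4 follow a naming pattern (`USE_DEPENDENCYNAME`, `SAVE_EXTRACTORNAME`). The sketch below shows one way such boolean flags could be read from environment variables. It is an illustration under assumptions rather than how `archivebox/config.py` actually works, and `env_flag`, `should_run_extractor`, `SAVE_MYEXTRACTOR`, and `USE_MYTOOL` are made-up names.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name, default, environ=None):
    """Read a boolean toggle such as SAVE_MYEXTRACTOR from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default  # unset or empty: fall back to the default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be a boolean, got {raw!r}")


def should_run_extractor(environ=None):
    """The extractor runs only if it and its dependency are both enabled."""
    return (
        env_flag("SAVE_MYEXTRACTOR", True, environ)
        and env_flag("USE_MYTOOL", True, environ)
    )
```

Keeping the dependency toggle separate from the extractor toggle lets a user disable one binary without touching every extractor that might use it.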

Build the docs, pip package, and docker image


(Normally CI takes care of this, but these scripts can be run to do it manually)
```bash
./bin/build.sh

# or individually:
./bin/build_docs.sh
./bin/build_pip.sh
./bin/build_deb.sh
./bin/build_brew.sh
./bin/build_docker.sh
```

Roll a release


(Normally CI takes care of this, but these scripts can be run to do it manually)
```bash
./bin/release.sh

# or individually:
./bin/release_docs.sh
./bin/release_pip.sh
./bin/release_deb.sh
./bin/release_brew.sh
./bin/release_docker.sh
```


Further Reading




2022 Blinking Robots.