
WARC Websites

The WARC (Web ARChive) file format is a successor to the ARC format. It specifies a method for combining multiple digital resources into an aggregate archival file together with related information.
Category ID: 54797
1 -

Web Data Commons

The project extracts structured data from the Common Crawl and provides it for public download.
2 -

Common Crawl data set

Description of the data set.
3 -

Github: example-warc-java

Java and Clojure examples for processing Common Crawl WARC files.
4 -

Github: webarchive-commons

Common web archive utility code.
5 -

WARC, Web ARChive file format

Format description, ISO 28500:2009. Used by archival institutions to store content harvested by web crawls, for example with the Heritrix harvesting tool.
6 -

Wget with WARC output

About the development version of Wget, which can save WARC files.
7 -

The WARC File Format (ISO 28500)

Information, maintenance, drafts, hosted by the Bibliothèque nationale de France.
8 -


Python library for reading and writing warc files and warc headers.
9 -

WARC File Format Specifications

Collection of drafts prepared as the WARC format developed.
10 -

Web Archive Transformation (WAT) Specification, Utilities, and Usage Overview

Utilities to extract metadata from WARC files and create data analysis reports. Terminology, using WAT and Pig for data analysis.
11 -

The WARC Ecosystem

Wiki with resources about the WARC format and the tools that support it.
12 -

International Internet Preservation Consortium: Tools and Software

Perspectives on setting up a web archiving chain; contains tools recommended and used by members of the IIPC.
13 -


A lightweight Erlang library for writing web archiving software. Overview, requirements, quick start, tutorial, support services, bug reports, license and third-party libraries.
14 -

WARC Implementation Guidelines v.1

Gathers advice and best practices to help institutions design and create WARC files for collection management, access, preservation, and interoperability with collections from other institutions.
15 -

Github: pylibwarc

A Python library for dealing with Web ARChive (WARC) files.
16 -

Digital Preservation Coalition: Web-Archiving

Report intended for those with an interest in, or responsibility for, setting up a web archive, particularly new practitioners or senior managers wishing to develop a holistic understanding of the issues and options available.