SIST ISO 28500:2018
Information and documentation -- WARC file format
ISO 28500:2017 specifies the WARC file format:
- to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, DNS, and FTP;
- to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
- to support data compression and maintain data record integrity;
- to store all control information from the harvesting protocol (e.g. request headers), not just response information;
- to store the results of data transformations linked to other stored data;
- to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);
- to be extended without disruption to existing functionality;
- to support handling of overly long records by truncation or segmentation, where desired.
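As a rough illustration of the record model summarized above, the following Python sketch serializes a single record carrying only the four fields that the table of contents marks as mandatory (WARC-Record-ID, Content-Length, WARC-Date, WARC-Type). The `WARC/1.1` version line, the CRLF framing, the `urn:uuid` identifier form, and the helper names `build_warc_record` and `parse_warc_header` are assumptions made for this example and are not quoted from the text on this page; individual record types may require further fields.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(record_type, block, record_id=None, date=None):
    """Serialize one record: version line, named fields, blank line, block.

    Hypothetical helper; assumes a 'WARC/1.1' version line and CRLF framing.
    """
    record_id = record_id or f"<urn:uuid:{uuid.uuid4()}>"
    date = date or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", date),
        ("Content-Length", str(len(block))),  # length of the block in bytes
    ]
    header = b"WARC/1.1" + CRLF
    for name, value in fields:
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates header and block; two CRLFs close the record.
    return header + CRLF + block + CRLF + CRLF


def parse_warc_header(record):
    """Return the version line and named fields of a serialized record."""
    head, _, _ = record.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    fields = {"version": lines[0].decode("utf-8")}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.decode("utf-8").partition(":")
        fields[name] = value.strip()
    return fields


if __name__ == "__main__":
    rec = build_warc_record("warcinfo", b"software: example-crawler\r\n")
    print(parse_warc_header(rec))
```

Content-Length is computed from the block in bytes, so a reader can skip over a record without interpreting its payload.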
Information et documentation -- Format de fichier WARC
Informatika in dokumentacija - Datotečna oblika zapisa WARC
Standards Content (Sample)
SLOVENIAN STANDARD
01 September 2018
Replaces: SIST ISO 28500:2009
Informatika in dokumentacija - Datotečna oblika zapisa WARC
Information and documentation -- WARC file format
Information et documentation -- Format de fichier WARC
This Slovenian standard is identical to: ISO 28500:2017
ICS: 35.240.30 IT applications in information, documentation and publishing
2003-01. Slovenian Institute for Standardization (Slovenski inštitut za standardizacijo). Reproduction of this standard, in whole or in part, is not permitted.
INTERNATIONAL STANDARD
ISO 28500
Second edition
2017-08
Information and documentation —
WARC file format
Information et documentation — Format de fichier WARC
Reference number
© ISO 2017
© ISO 2017, Published in Switzerland
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form
or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior
written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of
the requester.
ISO copyright office
Ch. de Blandonnet 8 • CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Refers-To-Target-URI
5.13 WARC-Refers-To-Date
5.14 WARC-Target-URI
5.15 WARC-Truncated
5.16 WARC-Warcinfo-ID
5.17 WARC-Filename
5.18 WARC-Profile
5.19 WARC-Identified-Payload-Type
5.20 WARC-Segment-Number
5.21 WARC-Segment-Origin-ID
5.22 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 ‘warcinfo’
6.3 ‘response’
6.3.1 General
6.3.2 ‘http’ and ‘https’ schemes
6.3.3 Other URI schemes
6.4 ‘resource’
6.4.1 General
6.4.2 ‘http’ and ‘https’ schemes
6.4.3 ‘ftp’ scheme
6.4.4 ‘dns’ scheme
6.4.5 Other URI schemes
6.5 ‘request’
6.5.1 General
6.5.2 ‘http’ and ‘https’ schemes
6.5.3 Other URI schemes
6.6 ‘metadata’
6.7 ‘revisit’
6.7.1 General
6.7.2 Profile: Identical Payload Digest
6.7.3 Profile: Server Not Modified
6.7.4 Other profiles
6.8 ‘conversion’
6.9 ‘continuation’
7 Record segmentation
8 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards
bodies (ISO member bodies). The work of preparing International Standards is normally carried out
through ISO technical committees. Each member body interested in a subject for which a technical
committee has been established has the right to be represented on that committee. International
organizations, governmental and non-governmental, in liaison with ISO, also take part in the work.
ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of
electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are
described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the
different types of ISO documents should be noted. This document was drafted in accordance with the
editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of
any patent rights identified during the development of the document will be in the Introduction and/or
on the ISO list of patent declarations received (see www.iso.org/patents).
Any trade name used in this document is information given for the convenience of users and does not
constitute an endorsement.
For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and
expressions related to conformity assessment, as well as information about ISO’s adherence to the
World Trade Organization (WTO) principles in the Technical Barriers to Trade
...