zimscraperlib.rewriting.url_rewriting

URL rewriting tools

This module is about url and entry path rewriting.

The global scheme is the following:

Entries are stored in the ZIM file using their fully decoded path:

- The full path is the full URL without the scheme, username, password, port and fragment (i.e. the hostname, then the path, then "?" and the query string when there is one). See the documentation of the normalize function for more details.
- urldecoded: the path itself must not be url-encoded, or it would conflict with the ZIM specification and readers would not be able to retrieve it, some parts (e.g. the querystring) might be absorbed by a web server, ...
  . This is valid: "foo/part with space/bar?key=value"
  . This is NOT valid: "foo/part%20with%20space/bar%3Fkey%3Dvalue"
- Even having multiple ? in a ZIM path is valid.
  . This is valid: "foo/part/file with ? and +?who=Chip&Dale&question=Is there any + here?"
  . This is NOT valid: "foo/part/file with %3F and +?who=Chip%26Dale&question=Is%20there%20any%20%2B%20here%3F"
- A space in the query string must be stored as a literal space, not as %2B, %20 or +; a + in a ZIM path means a %2B in the web resource (HTML document, ...).
  . This is valid: "foo/part/file?question=Is there any + here?"
  . This is NOT valid: "foo/part/file?question%3DIs%20there%20any%20%2B%20here%3F"
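
As a minimal illustration of the rules above, using only the standard library (this is not the module's own code, and the variable names are made up for the example), url-decoding the invalid forms gives the valid ZIM paths:

```python
from urllib.parse import unquote

# Encoded forms, as they could appear in a web document (NOT valid ZIM paths)
encoded_simple = "foo/part%20with%20space/bar%3Fkey%3Dvalue"
encoded_plus = "foo/part/file?question%3DIs%20there%20any%20%2B%20here%3F"

# Fully decoded forms, as they are expected to be stored in the ZIM
decoded_simple = unquote(encoded_simple)
decoded_plus = unquote(encoded_plus)

print(decoded_simple)  # foo/part with space/bar?key=value
print(decoded_plus)  # foo/part/file?question=Is there any + here?
```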

On top of that, fuzzy rules are applied to the ZIM path. For instance, the URL "https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value" is transformed to "youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx" by slightly simplifying the path and keeping only the useful arguments in the querystring.

When rewriting documents (HTML, CSS, JS, ...), every time we find a URI to rewrite, we start by resolving it into an absolute URL (based on the absolute URI of the containing document), then apply the transformation to compute the corresponding ZIM path, and finally url-encode the whole ZIM path, so that readers have one single blob to process: they url-decode it and find the corresponding ZIM entry. Only '/' separators are considered safe and are not url-encoded.
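
The encoding step can be sketched with the standard library as follows. This is only an illustration under the assumption that the ZIM path is already known; the names zim_path and document_uri are hypothetical.

```python
from urllib.parse import quote, unquote

zim_path = "foo/part with space/bar?key=value"

# Encode everything except "/" so that readers receive one single blob
document_uri = quote(zim_path, safe="/")
print(document_uri)  # foo/part%20with%20space/bar%3Fkey%3Dvalue

# A reader url-decodes the blob and gets the ZIM path back
assert unquote(document_uri) == zim_path
```

Because "?" and "=" are encoded too, a reader does not mistake the querystring part of the ZIM path for a real query string.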

Attributes:

COMPILED_FUZZY_RULES module-attribute

COMPILED_FUZZY_RULES = [
    (
        AdditionalRule(
            match=compile(rule["pattern"]),
            replace=rule["replace"],
        )
    )
    for rule in FUZZY_RULES
]

AdditionalRule dataclass

AdditionalRule(match: Pattern[str], replace: str)

Attributes:

match instance-attribute

match: Pattern[str]

replace instance-attribute

replace: str

ArticleUrlRewriter

ArticleUrlRewriter(
    *,
    article_url: HttpUrl,
    article_path: ZimPath | None = None,
    existing_zim_paths: set[ZimPath] | None = None,
    missing_zim_paths: set[ZimPath] | None = None,
)

Rewrite urls in article.

This is typically used to rewrite URLs found in an HTML document, but can be used beyond that.

Initialise the rewriter

Parameters:

  • article_url (HttpUrl) –

    URL where the original document was located, used to resolve the relative URLs passed to the rewriter

  • article_path (ZimPath | None) –

    ZIM path of the document; when not provided, it is computed by normalizing article_url

  • existing_zim_paths (set[ZimPath] | None) –

    set of ZIM paths which are known to exist, useful if one wants to rewrite the URL to a local one only if the item exists in the ZIM

  • missing_zim_paths (set[ZimPath] | None) –

    set of ZIM paths which are known to be missing from existing_zim_paths; only useful together with existing_zim_paths; new missing entries are added as URLs are normalized

Methods:

  • apply_additional_rules

    Apply additional rules on a URL or relative path

  • get_document_uri

    Given a ZIM item path and its fragment, get the URI to use in the document

  • get_item_path

    Utility to transform an item URL into a ZimPath

  • normalize

    Transform an HTTP URL into a ZIM path to use as an entry's key.

Source code in src/zimscraperlib/rewriting/url_rewriting.py
def __init__(
    self,
    *,
    article_url: HttpUrl,
    article_path: ZimPath | None = None,
    existing_zim_paths: set[ZimPath] | None = None,
    missing_zim_paths: set[ZimPath] | None = None,
):
    """
    Initialise the rewriter

    Args:
      article_url: URL where the original document was located, used to resolve
        relative URLs which will be passed
      existing_zim_paths: set of ZIM paths which are known to exist, useful if one
        wants to rewrite the URL to a local one only if item exists in the ZIM
      missing_zim_paths: set of ZIM paths which are known to already be missing
        from the existing_zim_paths ; useful only in complement with this variable ;
        new missing entries will be added as URLs are normalized in this function
    """
    self.article_path = article_path or ArticleUrlRewriter.normalize(article_url)
    self.article_url = article_url
    self.existing_zim_paths = existing_zim_paths
    self.missing_zim_paths = missing_zim_paths

additional_rules class-attribute

additional_rules: list[AdditionalRule] = (
    COMPILED_FUZZY_RULES
)

article_path instance-attribute

article_path = article_path or normalize(article_url)

article_url instance-attribute

article_url = article_url

existing_zim_paths instance-attribute

existing_zim_paths = existing_zim_paths

missing_zim_paths instance-attribute

missing_zim_paths = missing_zim_paths

apply_additional_rules classmethod

apply_additional_rules(uri: HttpUrl | str) -> str

Apply additional rules on a URL or relative path

The first additional rule matching the input value is applied and its result is returned.

If no additional rule matches, the input is returned as-is.
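
The first-match behaviour can be sketched as below. The Rule class and both patterns are hypothetical and exist only for this illustration; the real rules are compiled from FUZZY_RULES and are not reproduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:  # hypothetical stand-in for AdditionalRule
    match: re.Pattern[str]
    replace: str


# Hypothetical rules, for illustration only
RULES = [
    Rule(
        match=re.compile(r"^example\.com/video/.*[?&](id=[^&]+).*$"),
        replace=r"example.fuzzy.invalid/video?\1",
    ),
    Rule(match=re.compile(r"^example\.com/(.*)$"), replace=r"catch-all/\1"),
]


def apply_rules(value: str) -> str:
    for rule in RULES:
        if match := rule.match.match(value):
            # only the first matching rule is applied
            return match.expand(rule.replace)
    return value  # no rule matched: the input is returned as-is


print(apply_rules("example.com/video/watch?a=1&id=42&b=2"))
print(apply_rules("example.com/page"))
print(apply_rules("other.org/page"))
```

The first input matches both rules, but only the first one is applied, so the order of the rules matters.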

Source code in src/zimscraperlib/rewriting/url_rewriting.py
@classmethod
def apply_additional_rules(cls, uri: HttpUrl | str) -> str:
    """Apply additional rules on a URL or relative path

    The first additional rule matching the input value is applied and its
    result is returned.

    If no additional rule matches, the input is returned as-is.
    """
    value = uri.value if isinstance(uri, HttpUrl) else uri
    for rule in cls.additional_rules:
        if match := rule.match.match(value):
            return match.expand(rule.replace)
    return value

get_document_uri

get_document_uri(
    item_path: ZimPath, item_fragment: str
) -> str

Given a ZIM item path and its fragment, get the URI to use in the document

This function takes the path of a ZIM item we want to address from the current document (HTML / JS / ...) and returns the corresponding URI to use.

It computes the relative path based on the current document location and escapes everything that needs escaping to turn the ZIM path into a valid RFC 3986 URI.

It also appends the item fragment, if any, at the end of the resulting URI.
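
A rough sketch of the same idea follows. It uses posixpath.relpath as a stand-in for the PurePosixPath.relative_to(..., walk_up=True) call of the actual implementation, which needs a recent Python version. The function name and its plain str arguments are assumptions made for the example.

```python
import posixpath
from urllib.parse import quote


def document_uri(article_path: str, item_path: str, fragment: str = "") -> str:
    # directory containing the current document
    if article_path.endswith("/"):
        base_dir = article_path
    else:
        base_dir = posixpath.dirname(article_path)
    relative = posixpath.relpath(item_path, start=base_dir or ".")
    # relpath drops a trailing "/", so add it back
    if item_path.endswith("/"):
        relative += "/"
    # encode everything except "/" ("?" and "=" included)
    uri = quote(relative, safe="/")
    return f"{uri}#{fragment}" if fragment else uri


print(
    document_uri(
        "example.com/a/b/index.html",
        "example.com/a/img/my pic.png?size=1",
        "top",
    )
)  # ../img/my%20pic.png%3Fsize%3D1#top
```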

Source code in src/zimscraperlib/rewriting/url_rewriting.py
def get_document_uri(self, item_path: ZimPath, item_fragment: str) -> str:
    """Given an ZIM item path and its fragment, get the URI to use in document

    This function transforms the path of a ZIM item we want to address from current
    document (HTML / JS / ...) and returns the corresponding URI to use.

    It computes the relative path based on current document location and escapes
    everything that needs escaping to turn the ZIM path into a valid RFC 3986 URI.

    It also appends a potential trailing item fragment at the end of the resulting
    URI.

    """
    item_parts = urlsplit(item_path.value)

    # item_path is both path + querystring, both will be url-encoded in the document
    # so that readers consider them as a whole and properly pass them to libzim
    item_url = item_parts.path
    if item_parts.query:
        item_url += "?" + item_parts.query
    relative_path = str(
        PurePosixPath(item_url).relative_to(
            (
                PurePosixPath(self.article_path.value)
                if self.article_path.value.endswith("/")
                else PurePosixPath(self.article_path.value).parent
            ),
            walk_up=True,
        )
    )
    # relative_to removes a potential last '/' in the path, we add it back
    if item_path.value.endswith("/"):
        relative_path += "/"

    return (
        f"{quote(relative_path, safe='/')}"
        f"{'#' + item_fragment if item_fragment else ''}"
    )

get_item_path

get_item_path(
    item_url: str, base_href: str | None
) -> ZimPath

Utility to transform an item URL into a ZimPath
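
The URL resolution step can be illustrated with urllib.parse.urljoin alone. This sketch stops before normalization, and the helper name absolute_item_url is hypothetical: the optional base href is resolved against the article URL first, then the item URL is resolved against the result.

```python
from __future__ import annotations

from urllib.parse import urljoin


def absolute_item_url(article_url: str, base_href: str | None, item_url: str) -> str:
    # a <base href> is itself resolved against the article URL
    base = urljoin(article_url, base_href) if base_href else article_url
    return urljoin(base, item_url)


article = "https://example.com/docs/guide/index.html"

print(absolute_item_url(article, None, "../img/logo.png"))
# https://example.com/docs/img/logo.png
print(absolute_item_url(article, "/static/", "css/site.css"))
# https://example.com/static/css/site.css
print(absolute_item_url(article, None, "https://cdn.example.org/lib.js"))
# https://cdn.example.org/lib.js
```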

Source code in src/zimscraperlib/rewriting/url_rewriting.py
def get_item_path(self, item_url: str, base_href: str | None) -> ZimPath:
    """Utility to transform an item URL into a ZimPath"""

    item_absolute_url = urljoin(
        urljoin(self.article_url.value, base_href), item_url
    )
    return ArticleUrlRewriter.normalize(HttpUrl(item_absolute_url))

normalize classmethod

normalize(url: HttpUrl) -> ZimPath

Transform an HTTP URL into a ZIM path to use as an entry's key.

According to RFC 3986, a URL allows only a very limited set of characters, so we assume by default that the url is encoded to match this specification.

The transformation rewrites the hostname, the path and the querystring.

The transformation drops the URL scheme, username, password, port and fragment:

- we suppose there is no conflict of URL scheme or port: there are no two resources with the same hostname, path and querystring but a different URL scheme or port leading to different content
- we consider that username and password are purely authentication mechanisms which have no impact on the content to serve
- we know that the fragment is never passed to the server, it stays in the User-Agent, so if we encounter a fragment while normalizing a URL found in a document, it won't make its way to the ZIM anyway and will stay client-side

The transformation consists mainly in decoding the three components so that ZIM path is not encoded at all, as required by the ZIM specification.

Decoding is done differently for the hostname (decoded from Punycode / IDNA) and for the path and querystring (both url-decoded).

The final transformation is the application of fuzzy rules (sourced from wabac) to transform some URLs into replay URLs and drop some useless stuff.

Returned value is a ZIM path, without any puny/url encoding applied, ready to be passed to python-libzim for UTF-8 encoding.
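
The core of the transformation can be sketched with the standard library only. This simplified version is an assumption-laden illustration, not the actual method: it works on plain strings, and it skips Punycode decoding of the hostname, the removal of repeated slashes and the fuzzy rules. The name simple_normalize is hypothetical.

```python
from urllib.parse import unquote, urlsplit


def simple_normalize(url: str) -> str:
    parts = urlsplit(url)
    # .hostname excludes username, password and port
    hostname = parts.hostname or ""
    # an empty path and "/" must lead to the same entry
    path = unquote(parts.path) if parts.path else "/"
    query = ""
    if parts.query:
        # "+" means space in a query string: replace it before %2B is
        # decoded into a real "+"
        query = "?" + unquote(parts.query.replace("+", " "))
    # scheme and fragment are simply not used
    return f"{hostname}{path}{query}"


print(simple_normalize("https://user:pw@example.com:8080/a%20b?q=1+2%2B3#frag"))
# example.com/a b?q=1 2+3
print(simple_normalize("https://example.com"))
# example.com/
```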

Source code in src/zimscraperlib/rewriting/url_rewriting.py
@classmethod
def normalize(cls, url: HttpUrl) -> ZimPath:
    """Transform a HTTP URL into a ZIM path to use as a entry's key.

    According to RFC 3986, a URL allows only a very limited set of characters, so we
    assume by default that the url is encoded to match this specification.

    The transformation rewrites the hostname, the path and the querystring.

    The transformation drops the URL scheme, username, password, port and fragment:
    - we suppose there is no conflict of URL scheme or port: there are no two
      resources with same hostname, path and querystring but different URL scheme or
      port leading to different content
    - we consider username/password are purely authentication mechanisms which
      have no impact on the content to serve
    - we know that the fragment is never passed to the server, it stays in the
      User-Agent, so if we encounter a fragment while normalizing a URL found in a
      document, it won't make its way to the ZIM anyway and will stay client-side

    The transformation consists mainly in decoding the three components so that ZIM
    path is not encoded at all, as required by the ZIM specification.

    Decoding is done differently for the hostname (decoded from Punycode / IDNA) and
    the path and querystring (both decoded with url decoding).

    The final transformation is the application of fuzzy rules (sourced from wabac)
    to transform some URLs into replay URLs and drop some useless stuff.

    Returned value is a ZIM path, without any puny/url encoding applied, ready to be
    passed to python-libzim for UTF-8 encoding.
    """

    url_parts = urlsplit(url.value)

    if not url_parts.hostname:
        # cannot happen because of the HttpUrl checks, but important to please the
        # type checker
        raise Exception("Hostname is missing")  # pragma: no cover

    # decode the hostname if it is Punycode-encoded
    hostname = (
        idna.decode(url_parts.hostname)
        if url_parts.hostname.startswith("xn--")
        else url_parts.hostname
    )

    path = url_parts.path

    if path:
        # unquote the path so that it is stored unencoded in the ZIM as required by
        # ZIM specification
        path = unquote(path)
    else:
        # if path is empty, we need a "/" to remove ambiguities, e.g.
        # https://example.com and https://example.com/ must all lead to the same ZIM
        # entry to match RFC 3986 section 6.2.3:
        # https://www.rfc-editor.org/rfc/rfc3986#section-6.2.3
        path = "/"

    query = url_parts.query

    # if query is missing, we do not add it at all, not even a trailing ? without
    # anything after it
    if url_parts.query:
        # `+` in query parameter must be decoded as space first to remove
        # ambiguities between a space (encoded as `+` in url query parameter) and a
        # real plus sign (encoded as %2B but soon decoded in ZIM path)
        query = query.replace("+", " ")
        # unquote the query so that it is stored unencoded in the ZIM as required by
        # ZIM specification
        query = "?" + unquote(query)
    else:
        query = ""

    fuzzified_url = ArticleUrlRewriter.apply_additional_rules(
        f"{hostname}{ArticleUrlRewriter._remove_subsequent_slashes(path)}{ArticleUrlRewriter._remove_subsequent_slashes(query)}"
    )

    return ZimPath(fuzzified_url)

HttpUrl

HttpUrl(value: str)

A utility class representing an HTTP URL, useful to pass this data around

Includes basic validation, ensuring that a scheme (http or https) and a lower-case hostname are provided.

Source code in src/zimscraperlib/rewriting/url_rewriting.py
def __init__(self, value: str) -> None:
    HttpUrl.check_validity(value)
    self._value = value

value property

value: str

check_validity classmethod

check_validity(value: str) -> None
Source code in src/zimscraperlib/rewriting/url_rewriting.py
@classmethod
def check_validity(cls, value: str) -> None:
    parts = urlsplit(value)

    if parts.scheme.lower() not in ["http", "https"]:
        raise ValueError(
            f"Incorrect HttpUrl scheme in value: {value} {parts.scheme}"
        )

    if not parts.hostname:
        raise ValueError(f"Unsupported empty hostname in value: {value}")

    if parts.hostname.lower() not in value:
        raise ValueError(f"Unsupported upper-case chars in hostname : {value}")

RewriteResult dataclass

RewriteResult(
    absolute_url: str,
    rewriten_url: str,
    zim_path: ZimPath | None,
)

Attributes:

absolute_url instance-attribute

absolute_url: str

rewriten_url instance-attribute

rewriten_url: str

zim_path instance-attribute

zim_path: ZimPath | None

ZimPath

ZimPath(value: str)

A utility class representing a ZIM path, useful to pass this data around

Includes basic validation, ensuring that the path does not start with a scheme, hostname, username or password.

Source code in src/zimscraperlib/rewriting/url_rewriting.py
def __init__(self, value: str) -> None:
    ZimPath.check_validity(value)
    self._value = value

value property

value: str

check_validity classmethod

check_validity(value: str) -> None
Source code in src/zimscraperlib/rewriting/url_rewriting.py
@classmethod
def check_validity(cls, value: str) -> None:
    parts = urlsplit(value)

    if parts.scheme:
        raise ValueError(f"Unexpected scheme in value: {value} {parts.scheme}")

    if parts.hostname:
        raise ValueError(f"Unexpected hostname in value: {value} {parts.hostname}")

    if parts.username:
        raise ValueError(f"Unexpected username in value: {value} {parts.username}")

    if parts.password:
        raise ValueError(f"Unexpected password in value: {value} {parts.password}")