zimscraperlib.rewriting.url_rewriting

URL rewriting tools

This module is about url and entry path rewriting.

The global scheme is the following:

Entries are stored in the ZIM file using their fully decoded full path:

- The full path is the full URL without the scheme, username, password, port and fragment (i.e. "<host>/<path>(?<query_string>)"). See the documentation of the normalize function for more details.
- urldecoded: the path itself must not be urlencoded, or it would conflict with the ZIM specification and readers would not be able to retrieve it, some parts (e.g. the querystring) might be absorbed by a web server, ...
    . This is valid: "foo/part with space/bar?key=value"
    . This is NOT valid: "foo/part%20with%20space/bar%3Fkey%3Dvalue"
- even having multiple ? in a ZIM path is valid
    . This is valid: "foo/part/file with ? and +?who=Chip&Dale&question=Is there any + here?"
    . This is NOT valid: "foo/part/file with %3F and +?who=Chip%26Dale&question=Is%20there%20any%20%2B%20here%3F"
- a space in the query string must be stored as ` `, not `%2B`, `%20` or `+`; a `+` in a ZIM path means a `%2B` in the web resource (HTML document, ...)
    . This is valid: "foo/part/file?question=Is there any + here?"
    . This is NOT valid: "foo/part/file?question%3DIs%20there%20any%20%2B%20here%3F"
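The space and plus-sign rule is the easiest one to get wrong, so here is a minimal sketch of it using only the standard library. The helper name `decode_query` is hypothetical and not part of zimscraperlib; it only mirrors the order of operations described above (turn `+` into a space first, then url-decode).

```python
from urllib.parse import unquote


def decode_query(raw_query: str) -> str:
    """Decode a raw query string into the form stored in a ZIM path.

    Hypothetical helper, for illustration only.
    """
    # In a raw query string "+" stands for a space, so convert it first.
    # A literal plus sign arrives as %2B and only becomes "+" in unquote().
    return unquote(raw_query.replace("+", " "))


print(decode_query("question=Is+there+any+%2B+here%3F"))
# question=Is there any + here?
```

Doing the two steps in the opposite order would turn the literal plus sign into a space as well, which is the ambiguity the rule is meant to avoid.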

On top of that, fuzzy rules are applied to the ZIM path. For instance, the URL "https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value" is transformed to "youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx" by slightly simplifying the path and keeping only the useful arguments in the querystring.
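To make the YouTube example concrete, the sketch below applies one regular-expression rule to the scheme-less form of that URL. The pattern and replacement are assumptions written for this illustration; the real rules come from FUZZY_RULES and may differ.

```python
import re

# Hypothetical rule, written only for this illustration.
pattern = re.compile(
    r"^(?:www\.)?youtube\.com/youtubei/v1/([^?]+)\?.*(videoId=[^&]+).*$"
)
replace = r"youtube.fuzzy.replayweb.page/youtubei/v1/\1?\2"

# Scheme-less input: hostname, path and decoded querystring.
path = (
    "www.youtube.com/youtubei/v1/foo/baz/things?key=value"
    "&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value"
)

match = pattern.match(path)
# Fall back to the untouched path when the rule does not apply.
fuzzy_path = match.expand(replace) if match else path
print(fuzzy_path)
# youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx
```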

When rewriting documents (HTML, CSS, JS, ...), every time we find a URI to rewrite, we start by resolving it into an absolute URL (based on the containing document's absolute URI), then apply the transformation to compute the corresponding ZIM path, and finally url-encode the whole ZIM path, so that readers have one single blob to process, url-decode and look up as a ZIM entry. Only '/', '=' and ',' are considered safe and not url-encoded. '=' and ',' are kept unencoded because some sites (e.g. YouTube) embed scripts that parse their own script URL and expect these characters to appear literally in it.
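As a small illustration of that final encoding step, the snippet below url-encodes a whole ZIM path (querystring included) with the same safe characters, then shows that a single url-decode recovers it. The sample path is made up for the example.

```python
from urllib.parse import quote, unquote

# Made-up ZIM path: path and querystring are treated as one blob.
zim_path = "foo/part with space/bar?key=value,other"

# Only "/", "=" and "," are left unencoded; even "?" becomes %3F.
encoded = quote(zim_path, safe="/=,")
print(encoded)
# foo/part%20with%20space/bar%3Fkey=value,other

# A reader recovers the ZIM path with one url-decode.
decoded = unquote(encoded)
```

Because the `?` is encoded, the web server or reader sees no querystring at all, so nothing can be split off before the entry lookup.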

Classes:

AdditionalRule –
ArticleUrlRewriter –

Rewrite urls in article.
HttpUrl –

A utility class representing an HTTP URL, useful to pass this data around
RewriteResult –
ZimPath –

A utility class representing a ZIM path, useful to pass this data around

Attributes:

COMPILED_FUZZY_RULES –

COMPILED_FUZZY_RULES `module-attribute`

COMPILED_FUZZY_RULES = [
    (
        AdditionalRule(
            match=re.compile(rule["pattern"]),
            replace=rule["replace"],
        )
    )
    for rule in FUZZY_RULES
]

AdditionalRule `dataclass`

AdditionalRule(match: Pattern[str], replace: str)

Attributes:

match (Pattern[str]) –
replace (str) –

match `instance-attribute`

match: Pattern[str]

replace `instance-attribute`

replace: str

ArticleUrlRewriter

ArticleUrlRewriter(
    *,
    article_url: HttpUrl,
    article_path: ZimPath | None = None,
    existing_zim_paths: set[ZimPath] | None = None,
    missing_zim_paths: set[ZimPath] | None = None,
)

Rewrite urls in article.

This is typically used to rewrite URLs found in an HTML document, but it can be used beyond that.

Initialise the rewriter

Parameters:

article_url (HttpUrl) –

URL where the original document was located, used to resolve relative URLs which will be passed

existing_zim_paths (set[ZimPath] | None, default: None) –

set of ZIM paths which are known to exist, useful if one wants to rewrite the URL to a local one only if the item exists in the ZIM

missing_zim_paths (set[ZimPath] | None, default: None) –

set of ZIM paths which are known to already be missing from existing_zim_paths; useful only in combination with that argument; new missing entries will be added as URLs are normalized in this function

Methods:

apply_additional_rules –

Apply additional rules on a URL or relative path
get_document_uri –

Given a ZIM item path and its fragment, get the URI to use in document
get_item_path –

Utility to transform an item URL into a ZimPath
normalize –

Transform an HTTP URL into a ZIM path to use as an entry's key.

Attributes:

additional_rules (list[AdditionalRule]) –
article_path –
article_url –
existing_zim_paths –
missing_zim_paths –

Source code in src/zimscraperlib/rewriting/url_rewriting.py

def __init__(
    self,
    *,
    article_url: HttpUrl,
    article_path: ZimPath | None = None,
    existing_zim_paths: set[ZimPath] | None = None,
    missing_zim_paths: set[ZimPath] | None = None,
):
    """
    Initialise the rewriter

    Args:
      article_url: URL where the original document was located, used to resolve
        relative URLs which will be passed
      existing_zim_paths: set of ZIM paths which are known to exist, useful if one
        wants to rewrite the URL to a local one only if the item exists in the ZIM
      missing_zim_paths: set of ZIM paths which are known to already be missing
        from existing_zim_paths; useful only in combination with that argument;
        new missing entries will be added as URLs are normalized in this function
    """
    self.article_path = article_path or ArticleUrlRewriter.normalize(article_url)
    self.article_url = article_url
    self.existing_zim_paths = existing_zim_paths
    self.missing_zim_paths = missing_zim_paths

additional_rules `class-attribute`

additional_rules: list[AdditionalRule] = (
    COMPILED_FUZZY_RULES
)

article_path `instance-attribute`

article_path = article_path or ArticleUrlRewriter.normalize(
    article_url
)

article_url `instance-attribute`

article_url = article_url

existing_zim_paths `instance-attribute`

existing_zim_paths = existing_zim_paths

missing_zim_paths `instance-attribute`

missing_zim_paths = missing_zim_paths

apply_additional_rules `classmethod`

apply_additional_rules(uri: HttpUrl | str) -> str

Apply additional rules on a URL or relative path

The first additional rule matching the input value is applied and its result is returned.

If no additional rule matches, the input is returned as-is.
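The first-match behaviour means rule order matters. The sketch below reproduces it with a stand-in `Rule` dataclass and two made-up rules; none of these names or patterns come from zimscraperlib.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Stand-in for AdditionalRule, for illustration only."""

    match: re.Pattern[str]
    replace: str


# Made-up rules: the second one is broader than the first.
RULES = [
    Rule(re.compile(r"^example\.com/a/(.*)$"), r"first/\1"),
    Rule(re.compile(r"^example\.com/(.*)$"), r"second/\1"),
]


def apply_rules(value: str) -> str:
    for rule in RULES:
        if found := rule.match.match(value):
            # Stop at the first rule that matches; later rules are ignored.
            return found.expand(rule.replace)
    # No rule matched: the input comes back unchanged.
    return value


print(apply_rules("example.com/a/b"))  # first/b
print(apply_rules("example.com/c"))    # second/c
print(apply_rules("other.org/x"))      # other.org/x
```

Swapping the two rules would make the broader pattern win for every `example.com` input, so more specific rules should come first.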

Source code in src/zimscraperlib/rewriting/url_rewriting.py

@classmethod
def apply_additional_rules(cls, uri: HttpUrl | str) -> str:
    """Apply additional rules on a URL or relative path

    The first additional rule matching the input value is applied and its
    result is returned.

    If no additional rule matches, the input is returned as-is.
    """
    value = uri.value if isinstance(uri, HttpUrl) else uri
    for rule in cls.additional_rules:
        if match := rule.match.match(value):
            return match.expand(rule.replace)
    return value

get_document_uri

get_document_uri(
    item_path: ZimPath, item_fragment: str
) -> str

Given a ZIM item path and its fragment, get the URI to use in document

This function transforms the path of a ZIM item we want to address from the current document (HTML / JS / ...) and returns the corresponding URI to use.

It computes the relative path based on the current document location and escapes everything that needs to be escaped to transform the ZIM path into a valid RFC 3986 URI.

It also appends a potential trailing item fragment at the end of the resulting URI.
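The three steps (relative path, escaping, fragment) can be sketched with the standard library alone. This is an approximation under stated assumptions: `document_uri` is a hypothetical function working on plain strings, and it uses `posixpath.relpath` where the method itself uses `PurePosixPath.relative_to(..., walk_up=True)`, which requires Python 3.12.

```python
import posixpath
from urllib.parse import quote


def document_uri(article_path: str, item_path: str, fragment: str = "") -> str:
    """Hypothetical stand-in for get_document_uri, working on plain strings."""
    # Links are relative to the directory holding the current document.
    if article_path.endswith("/"):
        base_dir = article_path
    else:
        base_dir = posixpath.dirname(article_path)
    relative = posixpath.relpath(item_path, base_dir or ".")
    # relpath drops a trailing "/", so restore it when the item had one.
    if item_path.endswith("/"):
        relative += "/"
    # Encode path and querystring as one blob, keeping "/", "=" and ",".
    uri = quote(relative, safe="/=,")
    return f"{uri}#{fragment}" if fragment else uri


print(document_uri("example.com/a/index.html", "example.com/img/my pic.png?v=1", "top"))
# ../img/my%20pic.png%3Fv=1#top
```

The fragment is appended after encoding, so its `#` stays a real fragment separator instead of becoming `%23`.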

Source code in src/zimscraperlib/rewriting/url_rewriting.py

def get_document_uri(self, item_path: ZimPath, item_fragment: str) -> str:
    """Given a ZIM item path and its fragment, get the URI to use in document

    This function transforms the path of a ZIM item we want to address from the
    current document (HTML / JS / ...) and returns the corresponding URI to use.

    It computes the relative path based on the current document location and
    escapes everything that needs to be escaped to transform the ZIM path into a
    valid RFC 3986 URI.

    It also appends a potential trailing item fragment at the end of the resulting
    URI.

    """
    item_parts = urlsplit(item_path.value)

    # item_path is both path + querystring, both will be url-encoded in the document
    # so that readers consider them as a whole and properly pass them to libzim
    item_url = item_parts.path
    if item_parts.query:
        item_url += "?" + item_parts.query
    relative_path = str(
        PurePosixPath(item_url).relative_to(
            (
                PurePosixPath(self.article_path.value)
                if self.article_path.value.endswith("/")
                else PurePosixPath(self.article_path.value).parent
            ),
            walk_up=True,
        )
    )
    # relative_to removes a potential last '/' in the path, we add it back
    if item_path.value.endswith("/"):
        relative_path += "/"

    return (
        f"{quote(relative_path, safe='/=,')}"
        f"{'#' + item_fragment if item_fragment else ''}"
    )

get_item_path

get_item_path(
    item_url: str, base_href: str | None
) -> ZimPath

Utility to transform an item URL into a ZimPath

Source code in src/zimscraperlib/rewriting/url_rewriting.py

def get_item_path(self, item_url: str, base_href: str | None) -> ZimPath:
    """Utility to transform an item URL into a ZimPath"""

    item_absolute_url = urljoin(
        urljoin(self.article_url.value, base_href), item_url
    )
    return ArticleUrlRewriter.normalize(HttpUrl(item_absolute_url))

normalize `classmethod`

normalize(url: HttpUrl) -> ZimPath

Transform an HTTP URL into a ZIM path to use as an entry's key.

According to RFC 3986, a URL allows only a very limited set of characters, so we assume by default that the url is encoded to match this specification.

The transformation rewrites the hostname, the path and the querystring.

The transformation drops the URL scheme, username, password, port and fragment:

- we suppose there is no conflict of URL scheme or port: there are no two resources with the same hostname, path and querystring but a different URL scheme or port leading to different content
- we consider that username/password are purely an authentication mechanism which has no impact on the content to serve
- we know that the fragment is never passed to the server, it stays in the User-Agent, so if we encounter a fragment while normalizing a URL found in a document, it won't make its way to the ZIM anyway and will stay client-side

The transformation consists mainly in decoding the three components so that ZIM path is not encoded at all, as required by the ZIM specification.

Decoding is done differently for the hostname (decoded with puny encoding) and the path and querystring (both decoded with url decoding).

The final transformation is the application of fuzzy rules (sourced from wabac) to transform some URLs into replay URLs and drop some useless stuff.

Returned value is a ZIM path, without any puny/url encoding applied, ready to be passed to python-libzim for UTF-8 encoding.
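A simplified sketch of these steps using only the standard library follows. It is not the library's implementation: `simple_normalize` is a hypothetical name, and the sketch leaves out punycode decoding of the hostname, the removal of repeated slashes and the fuzzy rules.

```python
from urllib.parse import unquote, urlsplit


def simple_normalize(url: str) -> str:
    """Hypothetical, simplified version of normalize (see omissions above)."""
    parts = urlsplit(url)
    # Scheme, username, password, port and fragment are simply not reused.
    hostname = parts.hostname or ""
    # An empty path becomes "/" so both spellings share one ZIM entry.
    path = unquote(parts.path) if parts.path else "/"
    query = ""
    if parts.query:
        # "+" is a space in a query string; convert it before url-decoding.
        query = "?" + unquote(parts.query.replace("+", " "))
    return f"{hostname}{path}{query}"


print(simple_normalize("https://user:pw@example.com:8080/a%20b?q=1+%2B+1#frag"))
# example.com/a b?q=1 + 1
print(simple_normalize("https://example.com"))
# example.com/
```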

Source code in src/zimscraperlib/rewriting/url_rewriting.py

@classmethod
def normalize(cls, url: HttpUrl) -> ZimPath:
    """Transform an HTTP URL into a ZIM path to use as an entry's key.

    According to RFC 3986, a URL allows only a very limited set of characters, so we
    assume by default that the url is encoded to match this specification.

    The transformation rewrites the hostname, the path and the querystring.

    The transformation drops the URL scheme, username, password, port and fragment:
    - we suppose there is no conflict of URL scheme or port: there are no two
      resources with the same hostname, path and querystring but a different URL
      scheme or port leading to different content
    - we consider that username/password are purely an authentication mechanism
      which has no impact on the content to serve
    - we know that the fragment is never passed to the server, it stays in the
      User-Agent, so if we encounter a fragment while normalizing a URL found in a
      document, it won't make its way to the ZIM anyway and will stay client-side

    The transformation consists mainly in decoding the three components so that ZIM
    path is not encoded at all, as required by the ZIM specification.

    Decoding is done differently for the hostname (decoded with puny encoding) and
    the path and querystring (both decoded with url decoding).

    The final transformation is the application of fuzzy rules (sourced from wabac)
    to transform some URLs into replay URLs and drop some useless stuff.

    Returned value is a ZIM path, without any puny/url encoding applied, ready to be
    passed to python-libzim for UTF-8 encoding.
    """

    url_parts = urlsplit(url.value)

    if not url_parts.hostname:
        # cannot happen because of the HttpUrl checks, but important to please the
        # type checker
        raise Exception("Hostname is missing")  # pragma: no cover

    # decode the hostname if it is punycode-encoded
    hostname = (
        idna.decode(url_parts.hostname)
        if url_parts.hostname.startswith("xn--")
        else url_parts.hostname
    )

    path = url_parts.path

    if path:
        # unquote the path so that it is stored unencoded in the ZIM as required by
        # ZIM specification
        path = unquote(path)
    else:
        # if path is empty, we need a "/" to remove ambiguities, e.g.
        # https://example.com and https://example.com/ must all lead to the same ZIM
        # entry to match RFC 3986 section 6.2.3:
        # https://www.rfc-editor.org/rfc/rfc3986#section-6.2.3
        path = "/"

    query = url_parts.query

    # if query is missing, we do not add it at all, not even a trailing ? without
    # anything after it
    if url_parts.query:
        # `+` in query parameter must be decoded as space first to remove
        # ambiguities between a space (encoded as `+` in url query parameter) and a
        # real plus sign (encoded as %2B but soon decoded in ZIM path)
        query = query.replace("+", " ")
        # unquote the query so that it is stored unencoded in the ZIM as required by
        # ZIM specification
        query = "?" + unquote(query)
    else:
        query = ""

    fuzzified_url = ArticleUrlRewriter.apply_additional_rules(
        f"{hostname}{ArticleUrlRewriter._remove_subsequent_slashes(path)}{ArticleUrlRewriter._remove_subsequent_slashes(query)}"
    )

    return ZimPath(fuzzified_url)

HttpUrl

HttpUrl(value: str)

A utility class representing an HTTP URL, useful to pass this data around

Includes a basic validation, ensuring that the scheme is http or https, that a hostname is present and that the hostname is written in lower case.

Methods:

check_validity –

Attributes:

value (str) –

Source code in src/zimscraperlib/rewriting/url_rewriting.py

def __init__(self, value: str) -> None:
    HttpUrl.check_validity(value)
    self._value = value

value `property`

value: str

check_validity `classmethod`

check_validity(value: str) -> None

Source code in src/zimscraperlib/rewriting/url_rewriting.py

@classmethod
def check_validity(cls, value: str) -> None:
    parts = urlsplit(value)

    if parts.scheme.lower() not in ["http", "https"]:
        raise ValueError(
            f"Incorrect HttpUrl scheme in value: {value} {parts.scheme}"
        )

    if not parts.hostname:
        raise ValueError(f"Unsupported empty hostname in value: {value}")

    if parts.hostname.lower() not in value:
        raise ValueError(f"Unsupported upper-case chars in hostname : {value}")

RewriteResult `dataclass`

RewriteResult(
    absolute_url: str,
    rewriten_url: str,
    zim_path: ZimPath | None,
)

Attributes:

absolute_url (str) –
rewriten_url (str) –
zim_path (ZimPath | None) –

absolute_url `instance-attribute`

absolute_url: str

rewriten_url `instance-attribute`

rewriten_url: str

zim_path `instance-attribute`

zim_path: ZimPath | None

ZimPath

ZimPath(value: str)

A utility class representing a ZIM path, useful to pass this data around

Includes a basic validation, ensuring that the path does not start with a scheme, hostname, ...

Methods:

check_validity –

Attributes:

value (str) –

Source code in src/zimscraperlib/rewriting/url_rewriting.py

def __init__(self, value: str) -> None:
    ZimPath.check_validity(value)
    self._value = value

value `property`

value: str

check_validity `classmethod`

check_validity(value: str) -> None

Source code in src/zimscraperlib/rewriting/url_rewriting.py

@classmethod
def check_validity(cls, value: str) -> None:
    parts = urlsplit(value)

    if parts.scheme:
        raise ValueError(f"Unexpected scheme in value: {value} {parts.scheme}")

    if parts.hostname:
        raise ValueError(f"Unexpected hostname in value: {value} {parts.hostname}")

    if parts.username:
        raise ValueError(f"Unexpected username in value: {value} {parts.username}")

    if parts.password:
        raise ValueError(f"Unexpected password in value: {value} {parts.password}")
