zimscraperlib.rewriting.url_rewriting
URL rewriting tools
This module is about url and entry path rewriting.
The global scheme is the following:
Entries are stored in the ZIM file using their fully decoded full path:
- The full path is the full URL without the scheme, username, password, port and fragment
(i.e. only the hostname, path and querystring are kept). See the normalize function
for more details.
- urldecoded: the path itself must not be urlencoded or it would conflict with ZIM
specification and readers won't be able to retrieve it, some parts (e.g. querystring)
might be absorbed by a web server, ...
. This is valid : "foo/part with space/bar?key=value"
. This is NOT valid : "foo/part%20with%20space/bar%3Fkey%3Dvalue"
- even having multiple ? in a ZIM path is valid
. This is valid :
"foo/part/file with ? and +?who=Chip&Dale&question=Is there any + here?"
. This is NOT valid :
"foo/part/file with %3F and +?who=Chip%26Dale&quer=Is%20there%20any%20%2B%20here%3F"
- a space in the query string must be stored as a literal space, not as %20 or +; a + in a ZIM
path corresponds to a %2B in the web resource (HTML document, ...):
. This is valid : "foo/part/file?question=Is there any + here?"
. This is NOT valid : "foo/part/file?question%3DIs%20there%20any%20%2B%20here%3F"
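The decoding rules listed above can be illustrated with the Python standard library. This is a minimal sketch, not the library's own code: the helper names `sketch_decode_path` and `sketch_decode_query` are hypothetical, and it assumes that `+` is treated as an encoded space only inside the querystring.

```python
from urllib.parse import unquote, unquote_plus


def sketch_decode_path(encoded_path: str) -> str:
    """Url-decode a path component; a literal '+' is left untouched."""
    return unquote(encoded_path)


def sketch_decode_query(encoded_query: str) -> str:
    """Url-decode a querystring; '+' becomes a space, '%2B' becomes '+'."""
    return unquote_plus(encoded_query)


print(sketch_decode_path("foo/part%20with%20space/bar"))
print(sketch_decode_query("question=Is+there+any+%2B+here%3F"))
```

The second call yields `question=Is there any + here?`, which matches the valid form shown in the list.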
On top of that, fuzzy rules are applied on the ZIM path. For instance, the URL "https://www.youtube.com/youtubei/v1/foo/baz/things?key=value&other_key=other_value&videoId=xxxx&yet_another_key=yet_another_value" is transformed to the ZIM path "youtube.fuzzy.replayweb.page/youtubei/v1/foo/baz/things?videoId=xxxx" by slightly simplifying the path and keeping only the useful arguments in the querystring.
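A fuzzy rule of this kind can be pictured as a regular expression plus a replacement template. The sketch below is an assumption made for illustration: the pattern in `SKETCH_RULE` is hypothetical and is not the rule shipped with the library, it merely reproduces the transformation described above on an already host-relative path.

```python
import re

# Hypothetical rule, shaped like a dict with "pattern" and "replace" keys.
SKETCH_RULE = {
    "pattern": r"^(?:www\.)?youtube\.com/(youtubei/v1/[^?]+\?).*(videoId=[^&]+).*$",
    "replace": r"youtube.fuzzy.replayweb.page/\1\2",
}


def sketch_fuzzy(path: str) -> str:
    """Apply the hypothetical rule; non-matching input is returned unchanged."""
    return re.sub(SKETCH_RULE["pattern"], SKETCH_RULE["replace"], path)


print(
    sketch_fuzzy(
        "www.youtube.com/youtubei/v1/foo/baz/things"
        "?key=value&other_key=other_value&videoId=xxxx"
        "&yet_another_key=yet_another_value"
    )
)
```

The first capture group keeps the simplified path up to the `?`, and the second keeps only the `videoId` argument.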
When rewriting documents (HTML, CSS, JS, ...), every time we find a URI to rewrite, we start by resolving it into an absolute URL (based on the containing document's absolute URI), then apply the transformation to compute the corresponding ZIM path, and finally url-encode the whole ZIM path. Readers thus have one single blob to process: they url-decode it and look up the corresponding ZIM entry. Only '/' separators are considered safe and are not url-encoded.
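The encoding step can be sketched with `urllib.parse.quote`, keeping only `/` as a safe character. The function name `sketch_encode_zim_path` is hypothetical, and the real implementation may differ in details.

```python
from urllib.parse import quote, unquote


def sketch_encode_zim_path(zim_path: str) -> str:
    """Url-encode a whole ZIM path, leaving only '/' separators unescaped."""
    return quote(zim_path, safe="/")


encoded = sketch_encode_zim_path("foo/part with space/bar?key=value")
print(encoded)
# A reader url-decodes the blob to recover the ZIM entry path.
print(unquote(encoded))
```

Note that the encoded form is what appears inside the rewritten document, while the decoded form is what is stored as the entry path in the ZIM.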
Classes:
- AdditionalRule
- ArticleUrlRewriter – Rewrite urls in article.
- HttpUrl – A utility class representing an HTTP url, useful to pass this data around
- RewriteResult
- ZimPath – A utility class representing a ZIM path, useful to pass this data around
Attributes:
COMPILED_FUZZY_RULES
module-attribute
COMPILED_FUZZY_RULES = [
    (
        AdditionalRule(
            match=compile(rule["pattern"]),
            replace=rule["replace"],
        )
    )
    for rule in FUZZY_RULES
]
AdditionalRule
dataclass
ArticleUrlRewriter
ArticleUrlRewriter(
*,
article_url: HttpUrl,
article_path: ZimPath | None = None,
existing_zim_paths: set[ZimPath] | None = None,
missing_zim_paths: set[ZimPath] | None = None,
)
Rewrite urls in article.
This is typically used to rewrite urls found in an HTML document, but can be used beyond that usage.
Initialise the rewriter
Parameters:
- article_url (HttpUrl) – URL where the original document was located, used to resolve the relative URLs which will be passed
- existing_zim_paths – set of ZIM paths which are known to exist, useful if one wants to rewrite the URL to a local one only if the item exists in the ZIM
- missing_zim_paths – set of ZIM paths which are known to already be missing from existing_zim_paths; useful only in complement with that variable; new missing entries will be added as URLs are normalized in this function
Methods:
- apply_additional_rules – Apply additional rules on a URL or relative path
- get_document_uri – Given a ZIM item path and its fragment, get the URI to use in document
- get_item_path – Utility to transform an item URL into a ZimPath
- normalize – Transform an HTTP URL into a ZIM path to use as an entry's key.
Attributes:
- additional_rules (list[AdditionalRule])
- article_path
- article_url
- existing_zim_paths
- missing_zim_paths
Source code in src/zimscraperlib/rewriting/url_rewriting.py
article_url
instance-attribute
article_url = article_url
existing_zim_paths
instance-attribute
existing_zim_paths = existing_zim_paths
missing_zim_paths
instance-attribute
missing_zim_paths = missing_zim_paths
apply_additional_rules
classmethod
Apply additional rules on a URL or relative path
The first additional rule matching the input value is applied and its result is returned.
If no additional rule matches, the input is returned as-is.
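The first-match behaviour described above can be sketched as follows. `SketchRule`, `SKETCH_RULES` and `sketch_apply_rules` are hypothetical names, and both patterns are invented for illustration; only the `match` and `replace` field names mirror how rules are built in COMPILED_FUZZY_RULES.

```python
import re
from dataclasses import dataclass


@dataclass
class SketchRule:
    match: re.Pattern[str]
    replace: str


# Hypothetical rules; order matters because only the first match is applied.
SKETCH_RULES = [
    SketchRule(re.compile(r"^cdn\.example\.com/(.+)\?.*$"), r"cdn.fuzzy.example/\1"),
    SketchRule(re.compile(r"^(.+)\?utm_source=.*$"), r"\1"),
]


def sketch_apply_rules(value: str, rules: list[SketchRule]) -> str:
    """Return the result of the first matching rule, or the input as-is."""
    for rule in rules:
        if rule.match.match(value):
            return rule.match.sub(rule.replace, value)
    return value


print(sketch_apply_rules("cdn.example.com/lib.js?utm_source=x", SKETCH_RULES))
print(sketch_apply_rules("example.org/page?id=3", SKETCH_RULES))
```

The first input matches both rules, but only the first one is applied; the second input matches none and comes back unchanged.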
get_document_uri
Given a ZIM item path and its fragment, get the URI to use in document
This function takes the path of a ZIM item we want to address from the current document (HTML / JS / ...) and returns the corresponding URI to use.
It computes the relative path based on the current document location and escapes everything which needs to be escaped to transform the ZIM path into a valid RFC 3986 URI.
It also appends a potential trailing item fragment at the end of the resulting URI.
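The three steps above (relative path, escaping, fragment) can be approximated with the standard library. This is only a sketch under stated assumptions: `sketch_document_uri` is a hypothetical helper, paths are plain strings rather than ZimPath objects, the fragment is appended without further escaping, and edge cases such as trailing slashes are ignored.

```python
import posixpath
from urllib.parse import quote


def sketch_document_uri(article_path: str, item_path: str, fragment: str = "") -> str:
    """Build a relative, url-encoded URI from the article location to an item."""
    # Relative path is computed from the directory containing the article.
    article_dir = posixpath.dirname(article_path) or "."
    relative = posixpath.relpath(item_path, start=article_dir)
    # Escape everything except '/' separators.
    uri = quote(relative, safe="/")
    if fragment:
        uri += "#" + fragment
    return uri


print(sketch_document_uri("example.com/a/b/index.html", "example.com/a/img/pic 1.png", "top"))
```

With these inputs the item sits one directory above the article, so the sketch produces a `../`-prefixed URI with the space escaped and the fragment appended.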
get_item_path
Utility to transform an item URL into a ZimPath
normalize
classmethod
Transform an HTTP URL into a ZIM path to use as an entry's key.
According to RFC 3986, a URL allows only a very limited set of characters, so we assume by default that the url is encoded to match this specification.
The transformation rewrites the hostname, the path and the querystring.
The transformation drops the URL scheme, username, password, port and fragment:
- we suppose there is no conflict of URL scheme or port: there are no two resources with the same hostname, path and querystring but a different URL scheme or port leading to different content
- we consider that the username and password are purely authentication mechanisms which have no impact on the content to serve
- we know that the fragment is never passed to the server, it stays in the User-Agent, so if we encounter a fragment while normalizing a URL found in a document, it won't make its way to the ZIM anyway and will stay client-side
The transformation consists mainly in decoding the three components so that ZIM path is not encoded at all, as required by the ZIM specification.
Decoding is done differently for the hostname (decoded from Punycode) and for the path and querystring (both url-decoded).
The final transformation is the application of fuzzy rules (sourced from wabac) to transform some URLs into replay URLs and drop some useless stuff.
Returned value is a ZIM path, without any puny/url encoding applied, ready to be passed to python-libzim for UTF-8 encoding.
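The overall shape of this transformation, without the fuzzy rules, might look like the sketch below. It is not the library's implementation: `sketch_normalize` is a hypothetical name, Python's built-in `idna` codec is assumed for the hostname, and `unquote_plus` is assumed for the querystring.

```python
from urllib.parse import unquote, unquote_plus, urlsplit


def sketch_normalize(url: str) -> str:
    """Drop scheme, credentials, port and fragment; decode the remaining parts."""
    parts = urlsplit(url)
    # parts.hostname already excludes username, password and port.
    hostname = (parts.hostname or "").encode("ascii").decode("idna")
    path = unquote(parts.path)
    query = unquote_plus(parts.query)
    zim_path = hostname + path
    if query:
        zim_path += "?" + query
    return zim_path


print(
    sketch_normalize(
        "https://user:pw@xn--mnchen-3ya.de:8080/part%20with%20space/bar?key=value#frag"
    )
)
```

Here the Punycode hostname is decoded to its Unicode form, the path loses its percent-escapes, and the credentials, port and fragment are gone from the result.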
HttpUrl
HttpUrl(value: str)
A utility class representing an HTTP url, useful to pass this data around
Includes a basic validation, ensuring that URL is encoded, scheme is provided.
Methods:
Attributes:
check_validity
classmethod
check_validity(value: str) -> None
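As an illustration of what such a basic validation could check, here is a hedged sketch. `sketch_check_http_url` is a hypothetical function, and the "is encoded" test (ASCII only, no spaces) is an assumed heuristic rather than the library's actual rule.

```python
from urllib.parse import urlsplit


def sketch_check_http_url(value: str) -> None:
    """Raise ValueError if the value does not look like an encoded HTTP(S) URL."""
    parts = urlsplit(value)
    if parts.scheme.lower() not in ("http", "https"):
        raise ValueError(f"Missing or unsupported scheme in {value!r}")
    if not parts.hostname:
        raise ValueError(f"Missing hostname in {value!r}")
    # Assumed heuristic: an encoded URL contains only ASCII and no spaces.
    if not value.isascii() or " " in value:
        raise ValueError(f"URL does not look encoded: {value!r}")


sketch_check_http_url("https://example.com/part%20with%20space")  # passes silently
```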
RewriteResult
dataclass
Attributes:
- absolute_url (str)
- rewriten_url (str)
- zim_path (ZimPath | None)
ZimPath
ZimPath(value: str)
A utility class representing a ZIM path, useful to pass this data around
Includes a basic validation, ensuring that the path does not start with a scheme, hostname, ...
Methods:
Attributes:
check_validity
classmethod
check_validity(value: str) -> None
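A ZIM path check could be sketched as the mirror image of a URL check: the value should parse with no scheme and no network location. `sketch_check_zim_path` is a hypothetical function and only covers these two conditions, which is an assumption about what the basic validation looks at.

```python
from urllib.parse import urlsplit


def sketch_check_zim_path(value: str) -> None:
    """Raise ValueError if the value still carries URL-only components."""
    parts = urlsplit(value)
    if parts.scheme:
        raise ValueError(f"Unexpected scheme in ZIM path {value!r}")
    if parts.netloc:
        raise ValueError(f"Unexpected network location in ZIM path {value!r}")


sketch_check_zim_path("example.com/part with space/bar?key=value")  # passes silently
```

A leading "example.com/" is accepted here because, without a scheme or "//" prefix, it parses as an ordinary first path segment.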