zimscraperlib.zim.dedup
Classes:
- Deduplicator – Automatically deduplicate potential ZIM items before adding them to the ZIM
Attributes:
CONTENT_BUFFER_READ_SIZE
module-attribute
CONTENT_BUFFER_READ_SIZE = 1048576
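The value 1048576 is 1 MiB, and the constant's name suggests it is the chunk size used when reading item content from disk for hashing. This page does not show that code, so the following is only a minimal sketch under that assumption. The helper `digest_file` is hypothetical, and `hashlib.blake2b` with a 4-byte digest is a standard-library stand-in for the hash actually used by the library.

```python
import hashlib
from pathlib import Path

CONTENT_BUFFER_READ_SIZE = 1048576  # 1 MiB, same value as the module attribute


def digest_file(fpath: Path, chunk_size: int = CONTENT_BUFFER_READ_SIZE) -> bytes:
    """Hash a file chunk by chunk so it never has to fit in memory at once.

    Hypothetical helper: not part of zimscraperlib.
    """
    # Stand-in hash (assumption): a 4-byte BLAKE2b digest from the stdlib.
    hasher = hashlib.blake2b(digest_size=4)
    with open(fpath, "rb") as fh:
        # read() returns b"" at end of file, which ends the loop
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.digest()
```

Reading in fixed-size chunks keeps peak memory bounded by the buffer size regardless of how large the file is, and the resulting digest is identical to hashing the whole content in one call.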
Deduplicator
Deduplicator(creator: Creator)
Automatically deduplicate potential ZIM items before adding them to the ZIM
This class automatically computes the digest of every item added to the ZIM, and either adds the entry (if the item is not yet inside the ZIM) or an alias (if an item with the same digest has already been added to the ZIM).
This class must be configured with filters to specify which item paths to consider. It is possible to consider all paths (i.e. all items) with a wide regex, or to operate on a subset (e.g. all images) with more precise filters. An item is considered for deduplication if any filter matches its path. Configuring these filters carefully is recommended: it saves time and memory by automatically ignoring items which are known to always differ and/or to be too numerous.
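The "any filter matches" rule can be sketched with the standard `re` module. This page does not document how filters are registered on the class, so the `filters` list and the `is_considered` helper below are hypothetical names used only for illustration, and matching with `Pattern.match` (anchored at the start of the path) is an assumption.

```python
import re

# Hypothetical filter set: only images and fonts are deduplication candidates.
filters: list[re.Pattern[str]] = [
    re.compile(r".*\.(png|jpe?g|gif|webp)$"),
    re.compile(r"^fonts/"),
]


def is_considered(path: str, filters: list[re.Pattern[str]]) -> bool:
    """Return True if at least one filter matches the item path."""
    return any(f.match(path) is not None for f in filters)


for path in ("images/logo.png", "fonts/roboto.woff2", "index.html"):
    print(path, is_considered(path, filters))
```

A single wide pattern such as `re.compile(r".*")` would make every path a candidate, at the cost of hashing every item and remembering every path.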
Digests are computed, and digest/path pairs stored, only for items matching the filters.
The xxh32 algorithm (https://github.com/Cyan4973/xxHash), which is known to be good at avoiding collisions with a minimal memory and CPU footprint, is used, so most of the memory consumption comes from the paths that have to be kept. This hashing algorithm is not meant for security purposes, since one might infer original content from hashes, but that is not our use case.
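The add-or-alias behaviour described above can be sketched as follows. This is not the library's implementation: `RecordingCreator` and `SketchDeduplicator` are hypothetical classes written for this illustration, their method names are assumptions, and `zlib.crc32` is a 32-bit standard-library stand-in for xxh32.

```python
import re
import zlib


class RecordingCreator:
    """Hypothetical stand-in for a ZIM creator; it only records requests."""

    def __init__(self) -> None:
        self.entries: dict[str, bytes] = {}
        self.aliases: dict[str, str] = {}

    def add_entry(self, path: str, content: bytes) -> None:
        self.entries[path] = content

    def add_alias(self, path: str, target: str) -> None:
        self.aliases[path] = target


class SketchDeduplicator:
    """Illustrative only: add an entry once, then aliases for identical content."""

    def __init__(self, creator: RecordingCreator) -> None:
        self.creator = creator
        self.filters: list[re.Pattern[str]] = []
        # digest -> path of the first item seen with that content
        self.seen: dict[int, str] = {}

    def add(self, path: str, content: bytes) -> None:
        # Paths matching no filter bypass deduplication entirely.
        if not any(f.match(path) for f in self.filters):
            self.creator.add_entry(path, content)
            return
        digest = zlib.crc32(content)  # stand-in for xxh32 (assumption)
        first_path = self.seen.get(digest)
        if first_path is None:
            self.seen[digest] = path
            self.creator.add_entry(path, content)
        else:
            self.creator.add_alias(path, first_path)


creator = RecordingCreator()
dedup = SketchDeduplicator(creator)
dedup.filters.append(re.compile(r".*\.png$"))
dedup.add("img/a.png", b"same bytes")
dedup.add("img/b.png", b"same bytes")  # same digest: recorded as an alias
dedup.add("page.html", b"same bytes")  # not matched by a filter: plain entry
```

With a 32-bit digest each stored key is only a few bytes, which is consistent with the remark that the retained paths dominate memory use.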
Methods:
- add_item_for – Add an item at given path or an alias
Attributes:
Source code in src/zimscraperlib/zim/dedup.py
creator
instance-attribute
creator = creator
add_item_for
add_item_for(
path: str,
title: str | None = None,
*,
fpath: Path | None = None,
content: bytes | str | None = None,
**kwargs: Any,
)
Add an item at given path or an alias
Source code in src/zimscraperlib/zim/dedup.py