
zimscraperlib.zim.dedup

Classes:

  • Deduplicator

    Automatically deduplicate potential ZIM items before adding them to the ZIM

Attributes:

CONTENT_BUFFER_READ_SIZE module-attribute

CONTENT_BUFFER_READ_SIZE = 1048576

Deduplicator

Deduplicator(creator: Creator)

Automatically deduplicate potential ZIM items before adding them to the ZIM

This class automatically computes the digest of every item added to the ZIM, and either adds the entry (if the item is not yet inside the ZIM) or an alias (if an item with the same digest has already been added to the ZIM).
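The entry-or-alias decision described above can be sketched with the standard library alone. This is an illustrative sketch, not the library's code: `RecordingCreator` and `add_deduplicated` are hypothetical names, and a 4-byte `hashlib.blake2b` digest stands in for xxh32 so that the example runs without third-party packages.

```python
import hashlib


class RecordingCreator:
    """Hypothetical stand-in for the ZIM Creator; it only records calls."""

    def __init__(self):
        self.items = {}    # path -> content
        self.aliases = {}  # alias path -> target path

    def add_item_for(self, path, content):
        self.items[path] = content

    def add_alias(self, path, target_path):
        self.aliases[path] = target_path


def add_deduplicated(creator, added_items, path, content):
    # Stand-in 4-byte digest (assumption); the real class uses xxhash.xxh32.
    digest = hashlib.blake2b(content, digest_size=4).digest()
    existing_path = added_items.get(digest)
    if existing_path is not None:
        # Same digest already stored: add an alias instead of a second copy.
        creator.add_alias(path, existing_path)
        return
    # First time this digest is seen: remember its path and add the entry.
    added_items[digest] = path
    creator.add_item_for(path, content)


creator = RecordingCreator()
added_items = {}
logo = b"\x89PNG fake image bytes"
add_deduplicated(creator, added_items, "images/logo.png", logo)
add_deduplicated(creator, added_items, "assets/logo-copy.png", logo)
add_deduplicated(creator, added_items, "images/banner.png", b"different bytes")
```

After these three calls, the stand-in creator holds two entries and one alias pointing from `assets/logo-copy.png` to `images/logo.png`.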

This class must be configured with filters to specify which item paths to consider. It is of course possible to consider all paths (i.e. all items) with a wide regex, or to operate on a subset (e.g. all images) with more precise filters. An item is considered for deduplication if any filter matches its path. It is recommended to configure these filters properly to save time and memory by automatically ignoring items which are known to always be different and/or to be too numerous.
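As a rough illustration of how such filters behave, the sketch below builds two compiled patterns and applies the same "any filter matches" test that `add_item_for` uses. The paths and patterns are made-up examples, and `is_considered` is a hypothetical helper; with the real class, compiled patterns would presumably be appended to the `filters` list attribute.

```python
import re

# Example filters (assumptions): images by extension, and everything under a fonts folder.
filters = [
    re.compile(r"^images/.*\.(png|jpe?g|svg)$"),
    re.compile(r"^assets/fonts/"),
]


def is_considered(path):
    # An item is considered for deduplication as soon as one filter matches its path.
    return any(f.match(path) is not None for f in filters)


# Pattern.match only matches at the beginning of the string, so a pattern that
# targets a suffix needs a leading ".*" to be useful as a filter.
suffix_only = re.compile(r"\.png$")
wide_suffix = re.compile(r".*\.png$")
```

With these definitions, `is_considered("images/logo.png")` is true while `is_considered("index.html")` is false, and `suffix_only.match("images/logo.png")` returns `None` even though the path ends in `.png`.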

Only the digest and path of items matching the filters are computed and stored.

The xxh32 algorithm (https://github.com/Cyan4973/xxHash), which is known to be good at avoiding collisions with a minimal memory and CPU footprint, is used, so most of the memory consumption comes from the paths that have to be kept. This hashing algorithm is not meant for security purposes, since one might infer original content from hashes, but this is not our use case.

Methods:

  • add_item_for

    Add an item at given path or an alias

Attributes:

  • added_items
  • creator
  • filters

Source code in src/zimscraperlib/zim/dedup.py
def __init__(self, creator: Creator):
    self.creator = creator
    self.filters: list[re.Pattern[str]] = []
    self.added_items: dict[bytes, str] = {}

added_items instance-attribute

added_items: dict[bytes, str] = {}

creator instance-attribute

creator = creator

filters instance-attribute

filters: list[Pattern[str]] = []

add_item_for

add_item_for(
    path: str,
    title: str | None = None,
    *,
    fpath: Path | None = None,
    content: bytes | str | None = None,
    **kwargs: Any,
)

Add an item at given path or an alias

Source code in src/zimscraperlib/zim/dedup.py
def add_item_for(
    self,
    path: str,
    title: str | None = None,
    *,
    fpath: pathlib.Path | None = None,
    content: bytes | str | None = None,
    **kwargs: Any,
):
    """Add an item at given path or an alias"""
    existing_item = None
    if any(_filter.match(path) is not None for _filter in self.filters):
        if content:
            digest = xxhash.xxh32(
                content.encode() if isinstance(content, str) else content
            ).digest()
        else:
            if not fpath:
                raise Exception("Either content or fpath are mandatory")
            xxh32 = xxhash.xxh32()
            with open(fpath, "rb") as f:
                while True:
                    data = f.read(CONTENT_BUFFER_READ_SIZE)  # read content in chunk
                    if not data:
                        break
                    xxh32.update(data)
            digest = xxh32.digest()

        if existing_item := self.added_items.get(digest):
            self.creator.add_alias(
                path,
                targetPath=existing_item,
                title=title or path,
                hints={Hint.FRONT_ARTICLE: True} if kwargs.get("is_front") else {},
            )
            return
        else:
            self.added_items[digest] = path

    self.creator.add_item_for(path, title, fpath=fpath, content=content, **kwargs)
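The `fpath` branch above hashes the file in chunks of `CONTENT_BUFFER_READ_SIZE` bytes, so the whole file never has to be held in memory. The sketch below illustrates that incremental hashing yields the same digest as hashing the full content in one call. It is a standard-library approximation: `digest_file` is a hypothetical helper, and a 4-byte `hashlib.blake2b` digest stands in for `xxhash.xxh32`, which exposes a similar `update()` / `digest()` interface.

```python
import hashlib
import tempfile
from pathlib import Path

CONTENT_BUFFER_READ_SIZE = 1048576  # 1 MiB, same value as the module attribute


def digest_file(fpath):
    # Stand-in hasher (assumption); the real code uses xxhash.xxh32().
    hasher = hashlib.blake2b(digest_size=4)
    with open(fpath, "rb") as f:
        # Read and hash one chunk at a time until the file is exhausted.
        while chunk := f.read(CONTENT_BUFFER_READ_SIZE):
            hasher.update(chunk)
    return hasher.digest()


payload = b"abc" * 1_000_000  # about 3 MB, so the file spans several reads
with tempfile.TemporaryDirectory() as tmp_dir:
    fpath = Path(tmp_dir) / "payload.bin"
    fpath.write_bytes(payload)
    chunked_digest = digest_file(fpath)

# Hashing the same bytes in a single call gives the same digest.
one_shot_digest = hashlib.blake2b(payload, digest_size=4).digest()
```

Because both routes give the same digest, an item passed as `content` and an identical item passed as `fpath` end up under the same key in `added_items`.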