
zimscraperlib.rewriting.css

CSS Rewriting

This module contains tools to rewrite CSS retrieved from an online source so that it can safely operate within a ZIM, linking only to ZIM entries every time a URL is used.

The rewriter needs an article URL rewriter to rewrite URLs found in the CSS. It also accepts an optional base href, used when the CSS was found inline in an HTML document that sets one, and an optional flag that controls what happens on a parsing error: either fall back to simple regex rewriting or drop the offending rule.

Classes:

CssRewriter

CssRewriter(
    url_rewriter: ArticleUrlRewriter,
    base_href: str | None,
    *,
    remove_errors: bool = False,
)

CSS rewriting class

Parameters:

  • url_rewriter (ArticleUrlRewriter) –

    the rewriter of URLs

  • base_href (str | None) –

    if the CSS to rewrite was found inline in an HTML page, this is the base href found in the HTML document, if any

  • remove_errors (bool, default: False) –

    if True, bad CSS rules are simply dropped; if False, we fall back to regex-based rewriting of the whole CSS document
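To illustrate why base_href matters for inline CSS, here is a minimal sketch using only the standard library. The helper `resolve_css_url` and the example URLs are hypothetical and not part of zimscraperlib; the sketch only shows how a base href changes the target of a relative URL before any ZIM-specific rewriting happens.

```python
from __future__ import annotations

from urllib.parse import urljoin


def resolve_css_url(css_url: str, document_url: str, base_href: str | None) -> str:
    """Resolve a URL found in inline CSS, honoring an optional base href.

    Hypothetical helper for illustration only, not the library's implementation.
    """
    # The base href may itself be relative to the document URL
    base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(base, css_url)


document = "https://example.com/a/page.html"
# Without a base href, the URL is relative to the document itself
print(resolve_css_url("img/bg.png", document, None))
# With a base href, the same relative URL points somewhere else
print(resolve_css_url("img/bg.png", document, "/static/"))
```

The two calls resolve the same relative URL to different absolute URLs, which is why the rewriter presumably needs to know the base href when the CSS comes from an HTML document.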

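Methods:

  • rewrite

    Rewrite a 'standalone' CSS document

  • rewrite_inline

    Rewrite an 'inline' CSS document

Attributes:

  • base_href

  • fallback_rewriter

  • remove_errors

  • url_rewriter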

Source code in src/zimscraperlib/rewriting/css.py
def __init__(
    self,
    url_rewriter: ArticleUrlRewriter,
    base_href: str | None,
    *,
    remove_errors: bool = False,
):
    """
    Args:
      url_rewriter: the rewriter of URLs
      base_href: if CSS to rewrite has been found inline on an HTML page, this is
        the potential base href found in HTML document
      remove_errors: if True, we just drop bad CSS rules; if False, we fall back to
        regex-based rewriting of the whole CSS document
    """
    self.url_rewriter = url_rewriter
    self.base_href = base_href
    self.remove_errors = remove_errors
    self.fallback_rewriter = FallbackRegexCssRewriter(url_rewriter, base_href)

base_href instance-attribute

base_href = base_href

fallback_rewriter instance-attribute

fallback_rewriter = FallbackRegexCssRewriter(
    url_rewriter, base_href
)

remove_errors instance-attribute

remove_errors = remove_errors

url_rewriter instance-attribute

url_rewriter = url_rewriter

rewrite

rewrite(content: str | bytes) -> str

Rewrite a 'standalone' CSS document

'standalone' means "not inline in an HTML document"

Source code in src/zimscraperlib/rewriting/css.py
def rewrite(self, content: str | bytes) -> str:
    """
    Rewrite a 'standalone' CSS document

    'standalone' means "not inline in an HTML document"
    """
    try:
        if isinstance(content, bytes):
            rules, _ = parse_stylesheet_bytes(content)
        else:
            rules = parse_stylesheet(content)
        self._process_list(rules)
        return self._serialize_rules(rules)
    except Exception:
        # If tinycss fails to parse the CSS, it generates an "Error" token,
        # and the exception is raised at serialization time.
        # We wrap the whole process in try/except to be safe anyway.
        logger.warning(
            (
                "Css transformation fails. Fallback to regex rewriter.\n"
                "Article path is %s"
            ),
            self.url_rewriter.article_url,
        )
        return self.fallback_rewriter.rewrite(content, {})
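The parse-then-fall-back strategy used by rewrite can be sketched in isolation as follows. This is an illustration under stated assumptions: `rewrite_with_fallback`, `strict` and `lenient` are hypothetical stand-ins for the parser-based and regex-based paths, and unlike the real method this sketch decodes bytes up front instead of handing them to the parser.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


def rewrite_with_fallback(
    content: str | bytes,
    strict_rewrite: Callable[[str], str],
    fallback_rewrite: Callable[[str], str],
) -> str:
    """Try the strict rewriter first; on any failure, use the fallback."""
    text = content.decode() if isinstance(content, bytes) else content
    try:
        return strict_rewrite(text)
    except Exception:
        # Any failure in the strict path triggers the lenient one
        logger.warning("Strict rewriting failed, falling back to lenient rewriting")
        return fallback_rewrite(text)


def strict(text: str) -> str:
    # Hypothetical strict rewriter: refuses input with unbalanced braces
    if text.count("{") != text.count("}"):
        raise ValueError("unbalanced braces")
    return text.replace("http://example.com/", "./")


def lenient(text: str) -> str:
    # Hypothetical fallback: plain text replacement, no validation at all
    return text.replace("http://example.com/", "./")


# Well-formed input goes through the strict path
print(rewrite_with_fallback("a { b: url(http://example.com/x.png) }", strict, lenient))
# Malformed input (missing closing brace) still gets rewritten by the fallback
print(rewrite_with_fallback(b"a { b: url(http://example.com/x.png)", strict, lenient))
```

The design choice is that a parsing problem never costs the reader the whole stylesheet: the output may be less precisely rewritten, but it is still produced.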

rewrite_inline

rewrite_inline(content: str) -> str

Rewrite an 'inline' CSS document

'inline' means "inline in an HTML document"

Source code in src/zimscraperlib/rewriting/css.py
def rewrite_inline(self, content: str) -> str:
    """
    Rewrite an 'inline' CSS document

    'inline' means "inline in an HTML document"
    """
    try:
        rules = parse_declaration_list(content)
        self._process_list(rules)
        return self._serialize_rules(rules)
    except Exception:
        # If tinycss fails to parse the CSS, it generates an "Error" token,
        # and the exception is raised at serialization time.
        # We wrap the whole process in try/except to be safe anyway.
        logger.warning(
            (
                "Css transformation fails. Fallback to regex rewriter.\n"
                "Content is `%s`"
            ),
            content,
        )
        return self.fallback_rewriter.rewrite(content, {})

FallbackRegexCssRewriter

FallbackRegexCssRewriter(
    url_rewriter: ArticleUrlRewriter, base_href: str | None
)

Bases: RxRewriter

Fallback CSS rewriting based on regular expression.

This is obviously far less powerful than real CSS parsing, but it makes it possible to cope with CSS we failed to parse without dropping any CSS rule (the problem could be just a parsing issue, not necessarily a bad CSS rule).

Create a RxRewriter adapted for CSS rules rewriting

Methods:

  • rewrite

    Apply the unique compiled_rule pattern and replace the content.

Attributes:

  • compiled_rule

  • rules

Source code in src/zimscraperlib/rewriting/css.py
def __init__(self, url_rewriter: ArticleUrlRewriter, base_href: str | None):
    """Create a RxRewriter adapted for CSS rules rewriting"""

    # we have only one rule, searching for url(...) functions and rewriting the
    # URL found
    rules = [
        TransformationRule(
            [
                re.compile(
                    r"""url\((?P<quote>['"])?(?P<url>.+?)(?P=quote)(?<!\\)\)"""
                ),
                partial(
                    self.__simple_transform,
                    url_rewriter=url_rewriter,
                    base_href=base_href,
                ),
            ]
        )
    ]
    super().__init__(rules)
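To see what the url(...) pattern above captures, the following sketch applies the same regular expression with `re.sub`. The `to_local` callback is a hypothetical stand-in for the real URL rewriter: it simply keeps the file name and points it at the current directory.

```python
from __future__ import annotations

import re

# Same pattern as in the listing above
URL_FUNCTION = re.compile(r"""url\((?P<quote>['"])?(?P<url>.+?)(?P=quote)(?<!\\)\)""")


def to_local(match: re.Match[str]) -> str:
    # Hypothetical transformation: keep only the file name, preserve the quote style
    quote = match.group("quote")
    filename = match.group("url").rsplit("/", 1)[-1]
    return f"url({quote}./{filename}{quote})"


css = """body { background: url("https://example.com/img/bg.png"); }
@font-face { src: url('https://example.com/fonts/a.woff2'); }"""

rewritten = URL_FUNCTION.sub(to_local, css)
print(rewritten)
```

The named groups `quote` and `url` give the callback both the quoting style and the raw URL. Note that in Python's `re` module a backreference to a group that did not participate in the match fails, so this sketch only exercises quoted URLs.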

compiled_rule instance-attribute

compiled_rule: Pattern[str] | None = None

rules instance-attribute

rules = rules or []

rewrite

rewrite(
    text: str | bytes, opts: dict[str, Any] | None = None
) -> str

Apply the unique compiled_rule pattern and replace the content.

Source code in src/zimscraperlib/rewriting/rx_replacer.py
def rewrite(
    self,
    text: str | bytes,
    opts: dict[str, Any] | None = None,
) -> str:
    """
    Apply the unique `compiled_rule` pattern and replace the content.
    """
    if isinstance(text, bytes):
        text = text.decode()

    def replace(m_object: re.Match[str]) -> str:
        """
        This method searches for the specific rule which has matched and applies it.
        """
        for i, rule in enumerate(self.rules, 1):
            if not m_object.group(i):
                # This is not the ith rule which matched
                continue
            result = rule[1](m_object, opts)
            return result
        # this fallback is never supposed to be reached since this method is
        # called by Pattern.sub, which already checks there is a match
        return text  # pragma: no cover

    assert self.compiled_rule is not None  # noqa
    return self.compiled_rule.sub(replace, text)
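The dispatch logic in replace can be illustrated with a self-contained sketch. It assumes the single compiled pattern is an alternation with one capturing group per rule; that layout is an assumption made for this example, since the code that builds compiled_rule is not shown here. The two rules and their transformations are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical rules: (pattern, transformation) pairs. In this sketch the
# patterns must not contain capturing groups of their own, otherwise the
# group indexes used for dispatch would shift.
rules = [
    (re.compile(r"\bfoo\b"), lambda m: "FOO"),
    (re.compile(r"\d+"), lambda m: str(int(m.group(0)) * 2)),
]

# One capturing group per rule, joined by alternation (assumed layout)
compiled = re.compile("|".join(f"({pattern.pattern})" for pattern, _ in rules))


def replace(match: re.Match[str]) -> str:
    """Find which rule's group matched and apply its transformation."""
    for i, (_, transform) in enumerate(rules, 1):
        if not match.group(i):
            # Group i is empty, so rule i is not the one that matched
            continue
        return transform(match)
    # Not expected to be reached: sub() only calls us when there is a match
    return match.group(0)


result = compiled.sub(replace, "foo has 21 items")
print(result)
```

Scanning the text once with a combined pattern avoids running each rule over the whole document separately, at the cost of having to work out afterwards which rule produced the match.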