zimscraperlib.rewriting.css

CSS Rewriting

This module contains tools to rewrite CSS retrieved from an online source so that it can safely operate within a ZIM, linking only to ZIM entries every time a URL is used.

The rewriter needs an article URL rewriter to rewrite the URLs found in CSS, an optional base href (used when the CSS to rewrite was found inline in an HTML document that sets a base href), and an optional flag indicating whether, on a parsing error, to drop the offending rule or to fall back to simple regex rewriting.
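
To illustrate the role of the base href, here is a minimal sketch using only the standard library. The helper name `resolve_css_url` and the example URLs are hypothetical, and the library's `ArticleUrlRewriter` may resolve URLs differently; the sketch only shows how a base href changes what a relative URL in inline CSS points to.

```python
from __future__ import annotations

from urllib.parse import urljoin


def resolve_css_url(url: str, document_url: str, base_href: str | None = None) -> str:
    """Resolve a URL found in CSS to an absolute URL (illustrative helper)."""
    # A base href may itself be relative, so resolve it against the document first.
    base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(base, url)


page = "https://example.com/articles/page.html"

# Without a base href, the relative URL is resolved against the page itself.
without_base = resolve_css_url("img/a.png", page)

# With a base href, the same relative URL points somewhere else.
with_base = resolve_css_url("img/a.png", page, base_href="/static/")
```

For a standalone stylesheet there is no enclosing HTML document, which is presumably why the base href is optional.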

Classes:

CssRewriter – CSS rewriting class
FallbackRegexCssRewriter – Fallback CSS rewriting based on regular expression.

CssRewriter

CssRewriter(
    url_rewriter: ArticleUrlRewriter,
    base_href: str | None,
    *,
    remove_errors: bool = False,
)

CSS rewriting class

Parameters:

url_rewriter (ArticleUrlRewriter) – the rewriter of URLs
base_href (str | None) – if the CSS to rewrite was found inline on an HTML page, this is the base href found in the HTML document, if any
remove_errors (bool) – if True, bad CSS rules are simply dropped; if False, we fall back to regex-based rewriting of the whole CSS document

Methods:

rewrite – Rewrite a 'standalone' CSS document
rewrite_inline – Rewrite an 'inline' CSS document

Attributes:

base_href –
fallback_rewriter –
remove_errors –
url_rewriter –

Source code in src/zimscraperlib/rewriting/css.py

def __init__(
    self,
    url_rewriter: ArticleUrlRewriter,
    base_href: str | None,
    *,
    remove_errors: bool = False,
):
    """
    Args:
      url_rewriter: the rewriter of URLs
      base_href: if CSS to rewrite has been found inline on an HTML page, this is
    the potential base href found in HTML document
      remove_errors: if True, we just drop bad CSS rules ; if False, we fallback to
    regex-based rewriting of the whole CSS document
    """
    self.url_rewriter = url_rewriter
    self.base_href = base_href
    self.remove_errors = remove_errors
    self.fallback_rewriter = FallbackRegexCssRewriter(url_rewriter, base_href)

base_href `instance-attribute`

base_href = base_href

fallback_rewriter `instance-attribute`

fallback_rewriter = FallbackRegexCssRewriter(
    url_rewriter, base_href
)

remove_errors `instance-attribute`

remove_errors = remove_errors

url_rewriter `instance-attribute`

url_rewriter = url_rewriter

rewrite

rewrite(content: str | bytes) -> str

Rewrite a 'standalone' CSS document

'standalone' means "not inline in an HTML document"

Source code in src/zimscraperlib/rewriting/css.py

def rewrite(self, content: str | bytes) -> str:
    """
    Rewrite a 'standalone' CSS document

    'standalone' means "not inline an HTML document"
    """
    try:
        if isinstance(content, bytes):
            rules, _ = parse_stylesheet_bytes(content)
        else:
            rules = parse_stylesheet(content)
        self._process_list(rules)
        return self._serialize_rules(rules)
    except Exception:
        # If tinycss fails to parse the CSS, it generates an "Error" token.
        # The exception is raised at serialization time.
        # We wrap the whole process in try/except to be safe anyway.
        logger.warning(
            (
                "Css transformation fails. Fallback to regex rewriter.\n"
                "Article path is %s"
            ),
            self.url_rewriter.article_url,
        )
        return self.fallback_rewriter.rewrite(content, {})
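
The try/except structure above follows a common "strict first, tolerant second" pattern. The sketch below isolates that pattern with the standard library only; `rewrite_with_fallback`, `strict` and `tolerant` are hypothetical stand-ins for illustration, not functions provided by the library.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)


def rewrite_with_fallback(
    content: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Try the strict rewriter first; on any error, use the tolerant one."""
    try:
        return primary(content)
    except Exception:
        logger.warning("Primary rewriting failed, falling back")
        return fallback(content)


def strict(content: str) -> str:
    # Hypothetical stand-in for a real CSS parser: rejects unbalanced braces.
    if content.count("{") != content.count("}"):
        raise ValueError("unbalanced braces")
    return "parsed:" + content


def tolerant(content: str) -> str:
    # Hypothetical stand-in for a regex-based rewriter: never raises.
    return "regex:" + content


ok_result = rewrite_with_fallback("a{}", strict, tolerant)
broken_result = rewrite_with_fallback("a{", strict, tolerant)
```

Catching a broad `Exception` trades precise error handling for the guarantee that some rewritten output is always produced.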

rewrite_inline

rewrite_inline(content: str) -> str

Rewrite an 'inline' CSS document

'inline' means "inline in an HTML document"

Source code in src/zimscraperlib/rewriting/css.py

def rewrite_inline(self, content: str) -> str:
    """
    Rewrite an 'inline' CSS document

    'inline' means "inline an HTML document"
    """
    try:
        rules = parse_declaration_list(content)
        self._process_list(rules)
        return self._serialize_rules(rules)
    except Exception:
        # If tinycss fails to parse the CSS, it generates an "Error" token.
        # The exception is raised at serialization time.
        # We wrap the whole process in try/except to be safe anyway.
        logger.warning(
            (
                "Css transformation fails. Fallback to regex rewriter.\n"
                "Content is `%s`"
            ),
            content,
        )
        return self.fallback_rewriter.rewrite(content, {})

FallbackRegexCssRewriter

FallbackRegexCssRewriter(
    url_rewriter: ArticleUrlRewriter, base_href: str | None
)

Bases: RxRewriter

Fallback CSS rewriting based on regular expression.

This is obviously far less powerful than real CSS parsing, but it makes it possible to cope with CSS we failed to parse without dropping any CSS rule (the problem could be just a parsing issue, not necessarily a bad CSS rule).

Create a RxRewriter adapted for CSS rules rewriting

Methods:

rewrite – Apply the unique compiled_rules pattern and replace the content.

Attributes:

compiled_rule (Pattern[str] | None) –
rules –

Source code in src/zimscraperlib/rewriting/css.py

def __init__(self, url_rewriter: ArticleUrlRewriter, base_href: str | None):
    """Create a RxRewriter adapted for CSS rules rewriting"""

    # we have only one rule, searching for url(...) functions and rewriting the
    # URL found
    rules = [
        TransformationRule(
            [
                re.compile(
                    r"""url\((?P<quote>['"])?(?P<url>.+?)(?P=quote)(?<!\\)\)"""
                ),
                partial(
                    self.__simple_transform,
                    url_rewriter=url_rewriter,
                    base_href=base_href,
                ),
            ]
        )
    ]
    super().__init__(rules)
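
The idea of rewriting every `url(...)` function with a single regular expression can be tried out with the standard library alone. This is a minimal sketch under stated assumptions: the pattern is a simplified variant (not the library's exact expression), and `rewrite_css_urls` and the `rewrite_url` callback are hypothetical names standing in for `TransformationRule` and `ArticleUrlRewriter`.

```python
from __future__ import annotations

import re
from typing import Callable

# Simplified pattern: an optional quote, the URL, then the same quote again
# (possibly empty) before the closing parenthesis.
URL_FUNCTION = re.compile(
    r"""url\(\s*(?P<quote>['"]?)(?P<url>.+?)(?P=quote)\s*\)"""
)


def rewrite_css_urls(css: str, rewrite_url: Callable[[str], str]) -> str:
    """Replace the target of every url(...) function found in `css`."""

    def replace(match: re.Match[str]) -> str:
        new_url = rewrite_url(match.group("url"))
        # Always emit a double-quoted URL, whatever the input quoting was.
        return f'url("{new_url}")'

    return URL_FUNCTION.sub(replace, css)


css = "a { background: url('img/a.png') } b { background: url(img/b.png) }"
result = rewrite_css_urls(css, lambda url: "../" + url)
```

A regex like this knows nothing about comments, strings or escapes in CSS, which is why it only makes sense as a fallback.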

compiled_rule `instance-attribute`

compiled_rule: Pattern[str] | None = None

rules `instance-attribute`

rules = rules or []

rewrite

rewrite(
    text: str | bytes, opts: dict[str, Any] | None = None
) -> str

Apply the unique compiled_rules pattern and replace the content.

Source code in src/zimscraperlib/rewriting/rx_replacer.py

def rewrite(
    self,
    text: str | bytes,
    opts: dict[str, Any] | None = None,
) -> str:
    """
    Apply the unique `compiled_rules` pattern and replace the content.
    """
    if isinstance(text, bytes):
        text = text.decode()

    def replace(m_object: re.Match[str]) -> str:
        """
        This function searches for the specific rule which has matched and applies it.
        """
        for i, rule in enumerate(self.rules, 1):
            if not m_object.group(i):
                # This is not the ith rule which matched
                continue
            result = rule[1](m_object, opts)
            return result
        # fallback never supposed to be reached since this method is called
        # by Pattern.sub which already checks there is a match
        return text  # pragma: no cover

    assert self.compiled_rule is not None  # noqa
    return self.compiled_rule.sub(replace, text)
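
The dispatch inside `replace` relies on all rules being combined into one pattern, with the matched group identifying the rule. The sketch below shows that technique in isolation. It assumes one capturing group per rule and no capturing groups inside the individual patterns; `RULES`, `COMPILED` and the two toy rules are hypothetical, and how the library actually builds `compiled_rule` is not shown in this section.

```python
from __future__ import annotations

import re
from typing import Callable

# Each rule is a (pattern, handler) pair; these two toy rules are illustrative.
RULES: list[tuple[str, Callable[[re.Match[str]], str]]] = [
    (r"\d+", lambda m: str(int(m.group(0)) * 2)),  # double every integer
    (r"[A-Z]+", lambda m: m.group(0).lower()),  # lowercase runs of capitals
]

# One capturing group per rule, so group i belongs to rule i (1-based).
COMPILED = re.compile("|".join(f"({pattern})" for pattern, _ in RULES))


def rewrite(text: str) -> str:
    def replace(match: re.Match[str]) -> str:
        # Find which rule's group took part in the match and apply its handler.
        for i, (_, handler) in enumerate(RULES, 1):
            if match.group(i):
                return handler(match)
        # Not expected to be reached: sub() only calls replace on a match.
        return match.group(0)

    return COMPILED.sub(replace, text)


result = rewrite("AB 12 cd")
```

Combining the rules this way means the text is scanned once, whatever the number of rules.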
