Utility functions for the engines
- searx.utils.XPathSpecType: TypeAlias = str | lxml.etree.XPath
Type alias used by searx.utils.get_xpath, searx.utils.eval_xpath and other XPath selectors.
- searx.utils.SEARCH_LANGUAGE_CODES = frozenset({'af', 'ar', 'be', 'bg', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'ga', 'gd', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'kn', 'ko', 'lt', 'lv', 'ml', 'mr', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'ur', 'vi', 'zh'})
Languages supported by most SearXNG engines (searx.sxng_locales.sxng_locales).
- searx.utils.gen_useragent(os_string: str | None = None) str [source]
Return a random browser User Agent. See searx/data/useragents.json.
- searx.utils.html_to_text(html_str: str) str [source]
Extract text from an HTML string.
- Args:
html_str (str): HTML string
- Returns:
str: extracted text
- Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?<![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
>>> html_to_text(r'<p><b>Lorem ipsum </i>dolor sit amet</p>')
'Lorem ipsum </i>dolor sit amet</p>'
>>> html_to_text(r'> < a')
'> < a'
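The tag-stripping behaviour in the examples above can be reproduced with a stdlib-only sketch. This is not the actual implementation (which also normalizes whitespace and handles parser errors, as the fourth example shows); the class and function names here are illustrative:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('style', 'script'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('style', 'script') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text_sketch(html_str: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_str)
    return ''.join(parser.parts)
```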
- searx.utils.markdown_to_text(markdown_str: str) str [source]
Extract text from a Markdown string.
- Args:
markdown_str (str): Markdown string
- Returns:
str: extracted text
- Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'
- searx.utils.extract_text(xpath_results: list[ElementBase] | ElementBase | str | Number | bool | None, allow_none: bool = False) str | None [source]
Extract text from an lxml result:
- if xpath_results is a list, extract the text from each result and concatenate the values;
- if xpath_results is an XML element, extract all of its text nodes (lxml's text_content() method);
- if xpath_results is a string, it is already text and is returned as-is.
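The type dispatch described above can be sketched with the stdlib ElementTree in place of lxml (a minimal sketch, not the real implementation, which also normalizes whitespace; itertext() plays the role of lxml's text_content()):

```python
import xml.etree.ElementTree as ET
from numbers import Number


def extract_text_sketch(xpath_results, allow_none=False):
    # list -> extract text from each item and concatenate
    if isinstance(xpath_results, list):
        return ''.join(extract_text_sketch(r) or '' for r in xpath_results)
    # element -> all of its text nodes
    if isinstance(xpath_results, ET.Element):
        return ''.join(xpath_results.itertext())
    # str / Number / bool -> already text (or trivially convertible)
    if isinstance(xpath_results, (str, Number, bool)):
        return str(xpath_results)
    if xpath_results is None and allow_none:
        return None
    raise ValueError('unsupported type')
```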
- searx.utils.normalize_url(url: str, base_url: str) str [source]
Normalize a URL: add the protocol, join the URL with base_url, and add a trailing slash if there is no path.
- Args:
url (str): Relative URL
base_url (str): Base URL; it must be an absolute URL.
- Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError
- Raises:
lxml.etree.ParserError
- Returns:
str: normalized URL
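The join-and-slash logic can be approximated with urllib.parse (a minimal sketch under the same contract, not the actual implementation):

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(url: str, base_url: str) -> str:
    # resolve the (possibly relative or scheme-relative) URL against the base
    absolute = urljoin(base_url, url)
    parsed = urlparse(absolute)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError('base_url must be an absolute URL')
    # add a trailing slash if there is no path
    if not parsed.path:
        absolute += '/'
    return absolute
```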
- searx.utils.extract_url(xpath_results: list[ElementBase] | ElementBase | str | Number | bool | None, base_url: str) str [source]
Extract and normalize a URL from an lxml Element.
- Example:
>>> def f(s, search_url):
...     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError
- Raises:
ValueError
lxml.etree.ParserError
- Returns:
str: normalized URL
- searx.utils.dict_subset(dictionary: MutableMapping[Any, Any], properties: set[str]) MutableMapping[str, Any] [source]
Extract a subset of a dict.
- Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}
- searx.utils.humanize_bytes(size: int | float, precision: int = 2)[source]
Determine the human-readable value of bytes on a 1024 base (1KB=1024B).
- searx.utils.humanize_number(size: int | float, precision: int = 0)[source]
Determine the human-readable value of a decimal number.
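Both helpers follow the same divide-and-scale loop. A minimal sketch of the 1024-base byte variant (the unit labels and output format here are assumptions, not necessarily what SearXNG renders):

```python
def humanize_bytes_sketch(size: float, precision: int = 2) -> str:
    # assumed unit labels; divide by 1024 until the value fits the unit
    units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f'{size:.{precision}f} {units[index]}'
```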
- searx.utils.convert_str_to_int(number_str: str) int [source]
Convert number_str to an int, or 0 if number_str is not a number.
- searx.utils.extr(txt: str, begin: str, end: str, default: str = '')[source]
Extract the string between begin and end from txt.
- Parameters:
txt – String to search in
begin – First string to be searched for
end – Second string to be searched for after begin
default – Default value if begin or end is not found. Defaults to an empty string.
- Returns:
The string between the two search strings begin and end. If at least one of begin or end is not found, the value of default is returned.
- Examples:
>>> extr("abcde", "a", "e")
"bcd"
>>> extr("abcde", "a", "z", default="nothing")
"nothing"
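The contract above amounts to two find() calls; a stdlib sketch (illustrative, not the actual implementation):

```python
def extr_sketch(txt: str, begin: str, end: str, default: str = '') -> str:
    start = txt.find(begin)
    if start < 0:
        return default
    start += len(begin)
    stop = txt.find(end, start)  # search for `end` only after `begin`
    if stop < 0:
        return default
    return txt[start:stop]
```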
- searx.utils.int_or_zero(num: list[str] | str) int [source]
Convert num to an int, or 0. num can be either a str or a list. If num is a list, the first element is converted to an int (or 0 is returned if the list is empty). If num is a str, see convert_str_to_int.
- searx.utils.is_valid_lang(lang: str) tuple[bool, str, str] | None [source]
Return the language code and name if lang describes a language.
- Examples:
>>> is_valid_lang('zz')
None
>>> is_valid_lang('uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang(b'uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang('en')
(True, 'en', 'english')
>>> searx.utils.is_valid_lang('Español')
(True, 'es', 'spanish')
>>> searx.utils.is_valid_lang('Spanish')
(True, 'es', 'spanish')
- searx.utils.ecma_unescape(string: str) str [source]
Python implementation of the JavaScript unescape function.
https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string
https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape
- Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
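The two ECMA escape forms (%uXXXX for BMP code points, %XX for single code units) can be sketched with two regex substitutions; the %uXXXX form must be handled first so that e.g. '%u5409' is not misread as '%u5' plus text:

```python
import re


def ecma_unescape_sketch(string: str) -> str:
    # '%u5409' -> chr(0x5409)
    string = re.sub(
        r'%u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), string
    )
    # '%20' -> chr(0x20)
    return re.sub(
        r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), string
    )
```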
- searx.utils.remove_pua_from_str(string: str)[source]
Removes Unicode "PRIVATE USE CHARACTER"s (PUA) from a string.
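Private-use characters carry the Unicode general category Co, so the filter can be sketched with the stdlib unicodedata module (illustrative, not the actual implementation):

```python
import unicodedata


def remove_pua_sketch(string: str) -> str:
    # category 'Co' marks private-use characters in all three PUA ranges
    return ''.join(ch for ch in string if unicodedata.category(ch) != 'Co')
```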
- searx.utils.get_engine_from_settings(name: str) dict[str, dict[str, str]] [source]
Return the engine configuration from settings.yml for a given engine name.
- searx.utils.get_xpath(xpath_spec: str | XPath) XPath [source]
Return a cached, compiled lxml.etree.XPath object.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
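The compile-once caching pattern is easy to reproduce. Since lxml may not be installed everywhere, this sketch caches compiled regular expressions instead of lxml.etree.XPath objects; the mechanism (expensive compile on first call, cached object on every later call) is the same:

```python
import functools
import re


@functools.lru_cache(maxsize=None)
def get_compiled(spec: str):
    # first call with a given spec compiles it;
    # subsequent calls return the identical cached object
    return re.compile(spec)
```

Two lookups with the same spec return the same object, so hot engine code never recompiles its selectors.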
- searx.utils.eval_xpath(element: ElementBase, xpath_spec: str | XPath) Any [source]
Equivalent of element.xpath(xpath_str), but compiles xpath_str into a lxml.etree.XPath object once and for all. The return value of xpath(..) is complex; read XPath return values for more details.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
SearxEngineXPathException – Raised when the XPath can't be evaluated (masked lxml.etree.XPathError).
- searx.utils.eval_xpath_list(element: ElementBase, xpath_spec: str | XPath, min_len: int | None = None) list[Any] [source]
Same as searx.utils.eval_xpath, but additionally ensures the return value is a list. The minimum length of the list is also checked (if min_len is set).
- searx.utils.eval_xpath_getindex(element: ElementBase, xpath_spec: str | XPath, index: int, default: Any = <searx.utils._NotSetClass object>) Any [source]
Same as searx.utils.eval_xpath_list, but returns the item at position index in the list (the index starts at 0). The exceptions known from searx.utils.eval_xpath are thrown. If a default is specified, it is returned when no element at position index could be determined.
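The default handling relies on a sentinel (the _NotSetClass object visible in the signature) to distinguish "no default given" from "the default is None". A sketch of that pattern on a plain list (illustrative names, not the actual implementation):

```python
_NOTSET = object()  # stands in for searx.utils._NotSetClass


def getindex_sketch(results: list, index: int, default=_NOTSET):
    try:
        return results[index]
    except IndexError:
        if default is _NOTSET:
            raise  # no default given: propagate the error
        return default
```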
- searx.utils.get_embeded_stream_url(url: str)[source]
Converts a standard video URL into its embed format. Supported services include YouTube, Facebook, Instagram, TikTok, Dailymotion, and Bilibili.
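As an illustration of the URL rewriting involved, here is a hypothetical helper covering only the YouTube watch-URL case (the function name and scope are this sketch's own; the youtube.com/embed/<id> form is YouTube's documented embed scheme):

```python
from urllib.parse import urlparse, parse_qs


def youtube_embed_sketch(url: str):
    # hypothetical: handles only https://www.youtube.com/watch?v=<id>
    parsed = urlparse(url)
    if parsed.hostname in ('youtube.com', 'www.youtube.com') and parsed.path == '/watch':
        video_id = parse_qs(parsed.query).get('v', [None])[0]
        if video_id:
            return f'https://www.youtube.com/embed/{video_id}'
    return None
```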
- searx.utils.detect_language(text: str, threshold: float = 0.3, only_search_languages: bool = False) str | None [source]
Detect the language of the text parameter.
- Parameters:
text (str) – The string whose language is to be detected.
threshold (float) – Filters the returned labels by a threshold on probability; a value of 0.3 returns only labels with at least 0.3 probability.
only_search_languages (bool) – If True, returns only supported SearXNG search languages. See searx.languages.
- Return type:
str | None
- Returns:
The detected language code or None. See below.
- Raises:
ValueError – If text is not a string.
The language detection is done by using a fork of the fastText library (python fasttext). fastText distributes the language identification model, for reference:
The language identification model supports the language codes (ISO-639-3):
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
By using only_search_languages=True the language identification model is harmonized with SearXNG's language (locale) model. General conditions of SearXNG's locale model are:
- SearXNG's locale of a query is passed to searx.locales.get_engine_locale to get a language and/or region code that is used by an engine.
- Most of SearXNG's engines do not support all the languages from the language identification model, and there is also a discrepancy between the ISO-639-3 (fastText) and ISO-639-2 (SearXNG) handling. Furthermore, in SearXNG locales like zh-TW (zh-CN) are mapped to zh_Hant (zh_Hans), while the language identification model reduces both to zh.