Utility functions for the engines

searx.utils.convert_str_to_int(number_str: str) int[source]

Convert number_str to int or 0 if number_str is not a number.
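
The behavior described above can be sketched as follows. This is a minimal stand-in, not the SearXNG implementation: the name convert_str_to_int_sketch is hypothetical, and the handling of signs and surrounding whitespace is an assumption (the sketch accepts whatever int() accepts).

```python
def convert_str_to_int_sketch(number_str: str) -> int:
    """Return number_str as an int, or 0 if it cannot be parsed."""
    try:
        # Assumption: anything int() can parse counts as "a number".
        return int(number_str)
    except ValueError:
        return 0


print(convert_str_to_int_sketch("42"))   # 42
print(convert_str_to_int_sketch("abc"))  # 0
```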

searx.utils.detect_language(text: str, threshold: float = 0.3, only_search_languages: bool = False) str | None[source]

Detect the language of the text parameter.

Parameters:
  • text (str) – The string whose language is to be detected.

  • threshold (float) – Threshold filters the returned labels by a threshold on probability. A choice of 0.3 will return labels with at least 0.3 probability.

  • only_search_languages (bool) – If True, returns only supported SearXNG search languages. See searx.languages.

Return type:

str, None

Returns:

The detected language code or None. See below.

Raises:

ValueError – If text is not a string.

The language detection is done by using a fork of the fastText library (python fasttext). fastText distributes the language identification model, for reference:

The language identification model supports the language codes (ISO-639-3):

af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs
bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es
et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia
id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li
lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah
nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru
rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl
tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh

With only_search_languages=True, the language identification model is harmonized with SearXNG’s language (locale) model. General conditions of SearXNG’s locale model are:

  1. SearXNG’s locale of a query is passed to searx.locales.get_engine_locale to get a language and/or region code that is used by an engine.

  2. Most of SearXNG’s engines do not support all the languages of the language identification model, and there is also a discrepancy between the ISO-639-3 (fasttext) and ISO-639-2 (SearXNG) handling. Furthermore, in SearXNG, locales like zh-TW (zh-CN) are mapped to zh_Hant (zh_Hans), while the language identification model reduces both to zh.
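
How threshold and only_search_languages interact can be illustrated with a small sketch. It does not call fastText; the prediction data, the pick_language helper, and the reduced SEARCH_LANGUAGES set are hypothetical and exist only for illustration. The "__label__" prefix follows the fastText label convention.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical subset of search languages, for illustration only.
SEARCH_LANGUAGES = frozenset({"de", "en", "fr", "zh"})


def pick_language(
    predictions: Sequence[Tuple[str, float]],
    threshold: float = 0.3,
    only_search_languages: bool = False,
) -> Optional[str]:
    """Return the most probable language passing the threshold, or None."""
    candidates = [(label, prob) for label, prob in predictions if prob >= threshold]
    if not candidates:
        return None
    label, _ = max(candidates, key=lambda item: item[1])
    # fastText-style labels look like "__label__en"; strip the prefix.
    language = label.replace("__label__", "", 1)
    if only_search_languages and language not in SEARCH_LANGUAGES:
        return None
    return language


# Invented predictions: (label, probability) pairs.
predictions = [("__label__als", 0.55), ("__label__de", 0.40)]
print(pick_language(predictions))                              # als
print(pick_language(predictions, only_search_languages=True))  # None
print(pick_language(predictions, threshold=0.9))               # None
```

In this sketch, a top label outside the search languages yields None rather than falling back to the next candidate; whether that matches every SearXNG version is an assumption.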

searx.utils.dict_subset(dictionary: MutableMapping, properties: Set[str]) Dict[source]

Extract a subset of a dict

Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}
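
A minimal sketch that reproduces the examples above with a dict comprehension. The name dict_subset_sketch is hypothetical; the actual implementation may differ.

```python
from typing import Any, Dict, Iterable, Mapping


def dict_subset_sketch(dictionary: Mapping[str, Any], properties: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested keys that are present in dictionary."""
    # Requested keys that are missing from the dictionary are silently skipped.
    return {key: dictionary[key] for key in properties if key in dictionary}


print(dict_subset_sketch({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D']))  # {'A': 'a'}
```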

searx.utils.ecma_unescape(string: str) str[source]

Python implementation of the JavaScript unescape function.

https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string
https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape

Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
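
The decoding can be sketched with a single regular expression that handles both the %uXXXX and the %XX form in one pass. The name ecma_unescape_sketch is hypothetical, and this is an illustration rather than the SearXNG source.

```python
import re

# Either "%u" followed by four hex digits, or "%" followed by two hex digits.
_ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def ecma_unescape_sketch(string: str) -> str:
    """Replace %uXXXX and %XX escapes with the corresponding character."""

    def decode(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE_RE.sub(decode, string)


print(ecma_unescape_sketch('%u5409'))  # 吉
```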

searx.utils.eval_xpath(element: ElementBase, xpath_spec: str | XPath)[source]

Equivalent to element.xpath(xpath_spec), but xpath_spec is compiled only once. See https://lxml.de/xpathxslt.html#xpath-return-values

Args:
  • element (ElementBase): lxml element to apply the XPath to.

  • xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath

Returns:
  • result (bool, float, list, str): Results.

Raises:
  • TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath

  • SearxXPathSyntaxException: Raise when there is a syntax error in the XPath

  • SearxEngineXPathException: Raise when the XPath can’t be evaluated.

searx.utils.eval_xpath_getindex(elements: ~lxml.etree.ElementBase, xpath_spec: str | ~lxml.etree.XPath, index: int, default=<searx.utils._NotSetClass object>)[source]

Call eval_xpath_list, then get one element using the index parameter. If the index does not exist, either raise an exception if default is not set, or otherwise return the default value (which can be None).

Args:
  • elements (ElementBase): lxml element to apply the xpath.

  • xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath.

  • index (int): index to get

  • default (Object, optional): Value returned if index doesn’t exist.

Raises:
  • TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath

  • SearxXPathSyntaxException: Raise when there is a syntax error in the XPath

  • SearxEngineXPathException: if the index is not found. Also see eval_xpath.

Returns:
  • result (bool, float, list, str): Results.
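
Because None is a legitimate default, "no default given" has to be signalled by a dedicated sentinel object, as the signature above suggests. The following sketch shows the pattern on a plain list; get_index, _NotSet, and the use of IndexError (in place of SearxEngineXPathException) are hypothetical stand-ins.

```python
from typing import Any, Sequence


class _NotSet:
    """Marker type: its single instance means "no default was passed"."""


_NOTSET = _NotSet()


def get_index(items: Sequence[Any], index: int, default: Any = _NOTSET) -> Any:
    """Return items[index]; fall back to default, or raise if none was given."""
    if -len(items) <= index < len(items):
        return items[index]
    if default is _NOTSET:
        # Stand-in for the engine exception raised by the real function.
        raise IndexError(f"index {index} not found")
    return default


print(get_index(["a", "b"], 1))          # b
print(get_index([], 0, default=None))    # None
```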

searx.utils.eval_xpath_list(element: ElementBase, xpath_spec: str | XPath, min_len: int | None = None)[source]

Same as eval_xpath, but checks that the result is a list.

Args:
  • element (ElementBase): lxml element to apply the XPath to.

  • xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath

  • min_len (int, optional): Minimum expected length of the result list. Defaults to None.

Raises:
  • TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath

  • SearxXPathSyntaxException: Raise when there is a syntax error in the XPath

  • SearxEngineXPathException: raise if the result is not a list

Returns:
  • result (bool, float, list, str): Results.

searx.utils.extr(txt: str, begin: str, end: str, default: str = '')[source]

Extract the string between begin and end from txt

Parameters:
  • txt – String to search in

  • begin – First string to be searched for

  • end – Second string to be searched for after begin

  • default – Default value if one of begin or end is not found. Defaults to an empty string.

Returns:

The string between the two search-strings begin and end. If at least one of begin or end is not found, the value of default is returned.
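
A minimal sketch of this lookup using str.index, which raises ValueError when a marker is missing. The name extr_sketch is hypothetical and the actual implementation may differ.

```python
def extr_sketch(txt: str, begin: str, end: str, default: str = "") -> str:
    """Return the text between begin and end, or default if either is missing."""
    try:
        start = txt.index(begin) + len(begin)
        # Search for end only after the position where begin finished.
        return txt[start:txt.index(end, start)]
    except ValueError:
        return default


print(extr_sketch("abcde", "a", "e"))  # bcd
```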

Examples:
>>> extr("abcde", "a", "e")
"bcd"
>>> extr("abcde", "a", "z", default="nothing")
"nothing"

searx.utils.extract_text(xpath_results, allow_none: bool = False) str | None[source]

Extract text from an lxml result

  • if xpath_results is a list, extract the text from each result and concatenate the results

  • if xpath_results is an XML element, extract all the text nodes from it (text_content() method from lxml)

  • if xpath_results is a string element, then it is already done

searx.utils.extract_url(xpath_results, base_url) str[source]

Extract and normalize URL from lxml Element

Args:
  • xpath_results (Union[List[html.HtmlElement], html.HtmlElement]): lxml Element(s)

  • base_url (str): Base URL

Example:
>>> def f(s, search_url):
...     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError

Raises:
  • ValueError

  • lxml.etree.ParserError

Returns:
  • str: normalized URL

searx.utils.gen_useragent(os_string: str | None = None) str[source]

Return a random browser User Agent

See searx/data/useragents.json

searx.utils.get_engine_from_settings(name: str) Dict[source]

Return engine configuration from settings.yml of a given engine name

searx.utils.get_xpath(xpath_spec: str | XPath) XPath[source]

Return cached compiled XPath

There is no thread lock. In the worst case, xpath_spec is compiled more than once.

Args:
  • xpath_spec (str|lxml.etree.XPath): XPath as a str or lxml.etree.XPath

Returns:
  • result (lxml.etree.XPath): Compiled XPath.

Raises:
  • TypeError: Raise when xpath_spec is neither a str nor a lxml.etree.XPath

  • SearxXPathSyntaxException: Raise when there is a syntax error in the XPath
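
The lock-free caching pattern described above can be sketched without lxml. In this illustration re.compile stands in for lxml.etree.XPath, and get_compiled and _cache are hypothetical names; only the caching logic is the point.

```python
import re
from typing import Dict, Union

_cache: Dict[str, re.Pattern] = {}


def get_compiled(spec: Union[str, re.Pattern]) -> re.Pattern:
    """Return a compiled object for spec, compiling each string at most rarely."""
    if isinstance(spec, re.Pattern):
        # Already compiled: hand it back unchanged.
        return spec
    if not isinstance(spec, str):
        raise TypeError("spec must be a str or a compiled pattern")
    compiled = _cache.get(spec)
    if compiled is None:
        # No lock: two threads may both compile here, and the last write wins.
        compiled = re.compile(spec)
        _cache[spec] = compiled
    return compiled


first = get_compiled(r"\d+")
print(get_compiled(r"\d+") is first)  # True
```

Skipping the lock keeps the hot path cheap; a duplicate compilation is harmless because both results are equivalent.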

searx.utils.html_to_text(html_str: str) str[source]

Extract text from an HTML string

Args:
  • html_str (str): HTML string

Returns:
  • str: extracted text

Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?<![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
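
A rough sketch of tag stripping with the standard library's html.parser, covering the first two examples above. The names _TextExtractor and html_to_text_sketch are hypothetical, and the sketch makes no attempt to handle malformed markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the content of script and style elements."""

    SKIPPED = ("script", "style")

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text_sketch(html_str: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_str)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(html_to_text_sketch('Example <span id="42">#2</span>'))  # Example #2
```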

searx.utils.humanize_bytes(size, precision=2)[source]

Determine the human readable value of bytes on 1024 base (1KB=1024B).
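
The 1024-based scaling can be sketched as a loop that divides until the value fits the current unit. The name humanize_bytes_sketch, the unit list, and the exact output format are assumptions, not taken from the SearXNG source.

```python
def humanize_bytes_sketch(size: float, precision: int = 2) -> str:
    """Format a byte count using 1024-based units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    index = 0
    # Move to the next unit while the value is at least 1024 of the current one.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.{precision}f} {units[index]}"


print(humanize_bytes_sketch(1536))  # 1.50 KB
```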

searx.utils.humanize_number(size, precision=0)[source]

Determine the human readable value of a decimal number.

searx.utils.int_or_zero(num: List[str] | str) int[source]

Convert num to int or 0. num can be either a str or a list. If num is a list, the first element is converted to int (or 0 is returned if the list is empty). If num is a str, see convert_str_to_int.
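
The list and str branches can be sketched as follows. The names int_or_zero_sketch and _to_int are hypothetical, and _to_int is only a stand-in for convert_str_to_int (it accepts whatever int() accepts, which is an assumption).

```python
from typing import List, Union


def _to_int(number_str: str) -> int:
    """Stand-in for convert_str_to_int: 0 when the string is not a number."""
    try:
        return int(number_str)
    except ValueError:
        return 0


def int_or_zero_sketch(num: Union[List[str], str]) -> int:
    if isinstance(num, list):
        if not num:
            return 0  # empty list
        num = num[0]  # only the first element is considered
    return _to_int(num)


print(int_or_zero_sketch(["7", "8"]))  # 7
print(int_or_zero_sketch([]))          # 0
```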

searx.utils.is_valid_lang(lang) Tuple[bool, str, str] | None[source]

Return language code and name if lang describes a language.

Examples:
>>> is_valid_lang('zz')
None
>>> is_valid_lang('uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang(b'uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang('en')
(True, 'en', 'english')
>>> searx.utils.is_valid_lang('Español')
(True, 'es', 'spanish')
>>> searx.utils.is_valid_lang('Spanish')
(True, 'es', 'spanish')

searx.utils.js_variable_to_python(js_variable)[source]

Convert a JavaScript variable into JSON and then load the value.

It does not deal with all cases, but it is good enough for now. chompjs has a better implementation.

searx.utils.markdown_to_text(markdown_str: str) str[source]

Extract text from a Markdown string

Args:
  • markdown_str (str): Markdown string

Returns:
  • str: extracted text

Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'

searx.utils.normalize_url(url: str, base_url: str) str[source]

Normalize URL: add protocol, join URL with base_url, add trailing slash if there is no path

Args:
  • url (str): Relative URL

  • base_url (str): Base URL, it must be an absolute URL.

Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError

Raises:
  • ValueError

Returns:
  • str: normalized URL
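
The three normalization steps (add protocol, join with base_url, add a trailing slash) can be sketched with urllib.parse. The name normalize_url_sketch is hypothetical, and the fallback to "http" when base_url has no scheme is an assumption; the sketch reproduces the examples above but is not the SearXNG source.

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(url: str, base_url: str) -> str:
    if url.startswith("//"):
        # Protocol-relative URL: borrow the scheme from base_url.
        scheme = urlparse(base_url).scheme or "http"
        url = f"{scheme}:{url}"
    elif "://" not in url:
        # Relative (or empty) URL: resolve it against base_url.
        url = urljoin(base_url, url)

    parsed = urlparse(url)
    if not parsed.netloc:
        raise ValueError("Cannot parse url")
    if not parsed.path:
        # No path at all: add the trailing slash.
        url += "/"
    return url


print(normalize_url_sketch("/path?a=1", "https://example.com"))
# https://example.com/path?a=1
```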

searx.utils.searx_useragent() str[source]

Return the searx User Agent

searx.utils.to_string(obj: Any) str[source]

Convert obj to its string representation.

searx.utils.SEARCH_LANGUAGE_CODES = frozenset({'af', 'ar', 'be', 'bg', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fa', 'fi', 'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'it', 'ja', 'kn', 'ko', 'lt', 'lv', 'ml', 'mr', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sv', 'ta', 'th', 'tr', 'uk', 'ur', 'vi', 'zh'})

Languages supported by most SearXNG engines (searx.sxng_locales.sxng_locales).