Utility functions for the engines

searx.utils.XPathSpecType: TypeAlias = str | lxml.etree.XPath

Type alias used by searx.utils.get_xpath, searx.utils.eval_xpath and other XPath selectors.

searx.utils.searxng_useragent() → str

Return the SearXNG User Agent

searx.utils.gen_useragent(os_string: str | None = None) → str

Return a random browser User Agent

See searx/data/useragents.json

searx.utils.gen_gsa_useragent() → str

Return a random “Android Google App” User Agent suitable for Google

See searx/data/gsa_useragents.txt

class searx.utils.HTMLTextExtractor

Internal class to extract text from HTML

searx.utils.html_to_text(html_str: str) → str

Extract text from an HTML string

Args:
  • html_str (str): HTML string

Returns:
  • str: extracted text

Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?&lt;![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
>>> html_to_text(r'<p><b>Lorem ipsum </i>dolor sit amet</p>')
'Lorem ipsum </i>dolor sit amet</p>'
>>> html_to_text(r'&#x3e &#x3c &#97')
'> < a'
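
The behaviour shown in the first two examples can be approximated with the standard library alone. The following is a minimal sketch, not the SearXNG implementation: the names TextCollector and html_to_text_sketch are hypothetical, and both the set of skipped tags and the whitespace normalization are assumptions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: only these are dropped

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities for us
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text_sketch(html_str: str) -> str:
    collector = TextCollector()
    collector.feed(html_str)
    collector.close()  # flush any buffered text
    # Collapse runs of whitespace into single spaces (an assumption).
    return " ".join("".join(collector.parts).split())
```

The sketch does not try to reproduce how the real function treats malformed markup, such as the unbalanced tags in the fourth example above.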

searx.utils.markdown_to_text(markdown_str: str) → str

Extract text from a Markdown string

Args:
  • markdown_str (str): Markdown string

Returns:
  • str: extracted text

Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'

searx.utils.extract_text(xpath_results: list[ElementBase | _Element] | ElementBase | _Element | str | Number | bool | None, allow_none: bool = False) → str | None

Extract text from an lxml result

  • If xpath_results is a list of ElementType objects, extract the text from each result and concatenate the results into a single string.

  • If xpath_results is an ElementType object, extract all text nodes from it ( lxml.html.tostring, method="text" )

  • If xpath_results is of type str, Number, or bool, its string value is returned.

  • If xpath_results is None, a ValueError is raised, unless allow_none is True, in which case None is returned.

searx.utils.normalize_url(url: str, base_url: str) → str

Normalize URL: add protocol, join URL with base_url, add trailing slash if there is no path

Args:
  • url (str): Relative URL

  • base_url (str): Base URL, it must be an absolute URL.

Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError

Raises:
  • lxml.etree.ParserError

Returns:
  • str: normalized URL
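
The three steps named in the description (add protocol, join with base_url, add a trailing slash) can be sketched with urllib.parse. This is an illustrative approximation that reproduces the examples above, not the SearXNG source; the name normalize_url_sketch and the "http" fallback scheme are assumptions.

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(url: str, base_url: str) -> str:
    if url.startswith("//"):
        # Protocol-relative URL: borrow the scheme of base_url.
        # Falling back to "http" is an assumption of this sketch.
        scheme = urlparse(base_url).scheme or "http"
        url = f"{scheme}:{url}"
    elif "://" not in url:
        # Relative URL (or empty string): resolve it against base_url.
        url = urljoin(base_url, url)

    parsed = urlparse(url)
    if not parsed.netloc:
        # base_url was not absolute, so no host could be determined.
        raise ValueError("cannot normalize a URL without a host")
    if not parsed.path:
        url += "/"
    return url
```

For instance, normalize_url_sketch('/path?a=1', 'https://example.com') resolves the relative path against the base, while a relative base_url such as '/path' ends in a ValueError because no host is available.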

searx.utils.extract_url(xpath_results: list[ElementBase | _Element] | ElementBase | _Element | str | Number | bool | None, base_url: str) → str

Extract and normalize URL from lxml Element

Example:
>>> from lxml import html
>>> import searx.utils
>>> def f(s, search_url):
...     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError

Raises:
  • ValueError

  • lxml.etree.ParserError

Returns:
  • str: normalized URL

searx.utils.dict_subset(dictionary: MutableMapping[Any, Any], properties: set[str]) → MutableMapping[str, Any]

Extract a subset of a dict

Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}

searx.utils.humanize_bytes(size: int | float, precision: int = 2)

Determine the human readable value of bytes on 1024 base (1KB=1024B).
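
As an illustration of the 1024-based conversion, here is a minimal sketch. The function name humanize_bytes_sketch, the unit labels, and the exact output format are assumptions; the real function may format its result differently.

```python
def humanize_bytes_sketch(size: float, precision: int = 2) -> str:
    """Hypothetical helper: format a byte count with 1024-based units."""
    units = ["B", "KB", "MB", "GB", "TB"]  # unit labels are an assumption
    index = 0
    # Divide by 1024 until the value fits the unit or no larger unit is left.
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.{precision}f} {units[index]}"
```

With these assumptions, 1536 bytes at precision=1 comes out as "1.5 KB", since 1536 / 1024 = 1.5.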

searx.utils.humanize_number(size: int | float, precision: int = 0)

Determine the human readable value of a decimal number.

searx.utils.convert_str_to_int(number_str: str) → int

Convert number_str to an int; return 0 if number_str is not a number.

searx.utils.extr(txt: str, begin: str, end: str, default: str = '') → str

Extract the string between begin and end from txt

Parameters:
  • txt – String to search in

  • begin – First string to be searched for

  • end – Second string to be searched for after begin

  • default – Default value if one of begin or end is not found. Defaults to an empty string.

Returns:

The string between the two search-strings begin and end. If at least one of begin or end is not found, the value of default is returned.

Examples:
>>> extr("abcde", "a", "e")
'bcd'
>>> extr("abcde", "a", "z", default="nothing")
'nothing'
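
The described behaviour maps naturally onto str.index, which raises ValueError when a substring is missing. The following is a sketch under that assumption (extr_sketch is a hypothetical name, not the library function):

```python
def extr_sketch(txt: str, begin: str, end: str, default: str = "") -> str:
    try:
        # Position right after the first occurrence of `begin`.
        start = txt.index(begin) + len(begin)
        # `end` is searched only after `begin`, as the parameter list states.
        stop = txt.index(end, start)
    except ValueError:
        # Either marker is missing: fall back to the default value.
        return default
    return txt[start:stop]
```

A typical use would be pulling a value out of a snippet of markup, for example extr_sketch('<a href="x">', 'href="', '"').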

searx.utils.int_or_zero(num: list[str] | str) → int

Convert num to an int, or return 0. num can be either a str or a list. If num is a list, its first element is converted to an int (0 is returned if the list is empty). If num is a str, see convert_str_to_int.
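
The interplay between the two conversion helpers can be sketched as follows. Both function names are hypothetical, and using str.isdecimal() as the test for "is a number" is an assumption of this sketch.

```python
def convert_str_to_int_sketch(number_str: str) -> int:
    # Assumption: "a number" means a non-empty string of decimal digits,
    # so signs, decimal points, and surrounding whitespace yield 0.
    return int(number_str) if number_str.isdecimal() else 0


def int_or_zero_sketch(num) -> int:
    if isinstance(num, list):
        if not num:
            return 0  # empty list
        num = num[0]  # only the first element is considered
    return convert_str_to_int_sketch(num)
```

So a list such as ["7", "8"] yields 7, while an empty list or a non-numeric string yields 0.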

searx.utils.to_string(obj: Any) → str

Convert obj to its string representation.

searx.utils.ecma_unescape(string: str) → str

Python implementation of the JavaScript unescape function

https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string
https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape

Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
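
The two escape forms in the examples (%uXXXX and %XX) can be decoded with a single regular expression. This is a sketch of the technique rather than the SearXNG code; ecma_unescape_sketch and _UNESCAPE_RE are hypothetical names.

```python
import re

# Either "%u" followed by four hex digits, or "%" followed by two hex digits.
_UNESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def ecma_unescape_sketch(string: str) -> str:
    def _decode(match: re.Match) -> str:
        # Exactly one of the two groups is set; read it as a hex code point.
        return chr(int(match.group(1) or match.group(2), 16))

    return _UNESCAPE_RE.sub(_decode, string)
```

Sequences that match neither form, such as "%zz", are left untouched.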

searx.utils.remove_pua_from_str(string: str)

Removes Unicode’s “PRIVATE USE CHARACTER”s (PUA) from a string.
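
One way to detect private use characters is the Unicode general category "Co" ("Other, private use"). The sketch below relies on that category; whether the real function uses the category or explicit code point ranges is not stated here, and remove_pua_sketch is a hypothetical name.

```python
import unicodedata


def remove_pua_sketch(string: str) -> str:
    # Keep every character whose general category is not "Co" (private use).
    return "".join(ch for ch in string if unicodedata.category(ch) != "Co")
```

For example, "a\ue000b" becomes "ab", because U+E000 lies in the Basic Multilingual Plane's private use area.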

searx.utils.get_engine_from_settings(name: str) → dict[str, dict[str, str]]

Return engine configuration from settings.yml of a given engine name

searx.utils.get_xpath(xpath_spec: str | XPath) → XPath

Return the cached, compiled lxml.etree.XPath object for xpath_spec.

TypeError:

Raised when xpath_spec is neither a str nor a lxml.etree.XPath.

SearxXPathSyntaxException:

Raised when there is a syntax error in the XPath selector (str).

searx.utils.eval_xpath(element: ElementBase | _Element, xpath_spec: str | XPath) → Any

Equivalent to element.xpath(xpath_spec), but xpath_spec is compiled into an lxml.etree.XPath object only once and then reused. The return value of xpath(..) is complex; read “XPath return values” in the lxml documentation for more details.

TypeError:

Raised when xpath_spec is neither a str nor a lxml.etree.XPath.

SearxXPathSyntaxException:

Raised when there is a syntax error in the XPath selector (str).

SearxEngineXPathException:

Raised when the XPath can’t be evaluated (masked lxml.etree.XPathError).

searx.utils.eval_xpath_list(element: ElementBase | _Element, xpath_spec: str | XPath, min_len: int | None = None) → list[Any]

Same as searx.utils.eval_xpath, but additionally ensures the return value is a list. The minimum length of the list is also checked (if min_len is set).

searx.utils.eval_xpath_getindex(element: ElementBase | _Element, xpath_spec: str | XPath, index: int, default: Any = <searx.utils._NotSetClass object>) → Any

Same as searx.utils.eval_xpath_list, but returns the item at position index of the list (indexing starts at 0).

The same exceptions as in searx.utils.eval_xpath are raised. If a default is specified, it is returned when no element can be found at position index.
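
The default value in the signature is a sentinel object, which lets a caller pass default=None and still have it recognized as an explicit default. The pattern is sketched below on a plain list, without lxml. The names _NotSet, _NOTSET and getindex_sketch are hypothetical, and raising IndexError when no default is given is an assumption of this sketch, not documented behaviour.

```python
from typing import Any


class _NotSet:
    """Sentinel type: distinguishes 'no default given' from default=None."""


_NOTSET = _NotSet()


def getindex_sketch(items: list, index: int, default: Any = _NOTSET) -> Any:
    if -len(items) <= index < len(items):
        return items[index]
    if default is _NOTSET:
        # No default supplied: signal the missing item (assumed exception type).
        raise IndexError(f"no item at index {index}")
    return default
```

With a sentinel, getindex_sketch([], 0, default=None) returns None, whereas getindex_sketch([], 0) raises.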

searx.utils.get_embeded_stream_url(url: str)

Converts a standard video URL into its embed format. Supported services include YouTube, Facebook, Instagram, TikTok, Dailymotion, and Bilibili.

searx.utils.js_obj_str_to_python(js_obj_str: str) → Any

Convert a JavaScript variable into JSON and then load the value

It does not deal with all cases, but it is good enough for now. chompjs has a better implementation.

searx.utils.parse_duration_string(duration_str: str) → timedelta | None

Parse a time string in format MM:SS or HH:MM:SS and convert it to a timedelta object.

Returns None if the provided string doesn’t match any of the formats.
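
The two accepted formats can be handled by splitting on ":" and padding a missing hours field. This is a minimal sketch, not the SearXNG implementation; parse_duration_sketch is a hypothetical name, and requiring every field to consist of decimal digits is an assumption.

```python
from datetime import timedelta
from typing import Optional


def parse_duration_sketch(duration_str: str) -> Optional[timedelta]:
    parts = duration_str.split(":")
    # Accept only MM:SS or HH:MM:SS with purely numeric fields.
    if len(parts) not in (2, 3) or not all(p.isdecimal() for p in parts):
        return None
    values = [int(p) for p in parts]
    if len(values) == 2:
        values.insert(0, 0)  # MM:SS: treat hours as 0
    hours, minutes, seconds = values
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

For example, "03:25" maps to timedelta(minutes=3, seconds=25), and a string such as "abc" yields None.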