Utility functions for the engines¶
- searx.utils.XPathSpecType: TypeAlias = str | lxml.etree.XPath¶
Type alias used by searx.utils.get_xpath, searx.utils.eval_xpath and other XPath selectors.
- searx.utils.gen_useragent(os_string: str | None = None) str[source]¶
Return a random browser User Agent
See searx/data/useragents.json
- searx.utils.gen_gsa_useragent() str[source]¶
Return a random “Android Google App” User Agent suitable for Google
See searx/data/gsa_useragents.txt
- searx.utils.html_to_text(html_str: str) str[source]¶
Extract text from an HTML string
- Args:
html_str (str): HTML string
- Returns:
str: extracted text
- Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?<![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
>>> html_to_text(r'<p><b>Lorem ipsum </i>dolor sit amet</p>')
'Lorem ipsum </i>dolor sit amet</p>'
>>> html_to_text(r'> < a')
'> < a'
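The examples above can be approximated with a stdlib-only sketch (illustrative; the real searx.utils implementation is based on lxml and also normalizes whitespace):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('style', 'script'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('style', 'script') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html_str: str) -> str:
    # Sketch only: tolerant parsing of malformed markup differs from lxml's.
    parser = _TextExtractor()
    parser.feed(html_str)
    return ''.join(parser.parts).strip()
```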
- searx.utils.markdown_to_text(markdown_str: str) str[source]¶
Extract text from a Markdown string
- Args:
markdown_str (str): Markdown string
- Returns:
str: extracted text
- Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'
- searx.utils.extract_text(xpath_results: list[ElementBase | _Element] | ElementBase | _Element | str | Number | bool | None, allow_none: bool = False) str | None[source]¶
Extract text from an lxml result
- If xpath_results is a list of ElementType objects, extract the text from each result and concatenate the list into one string.
- If xpath_results is an ElementType object, extract all the text nodes from it (lxml.html.tostring, method="text").
- If xpath_results is of type str, Number or bool, the string value is returned.
- If xpath_results is None, a ValueError is raised, unless allow_none is True, in which case None is returned.
- searx.utils.normalize_url(url: str, base_url: str) str[source]¶
Normalize URL: add protocol, join URL with base_url, add trailing slash if there is no path
- Args:
url (str): Relative URL
base_url (str): Base URL; it must be an absolute URL.
- Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError
- Raises:
lxml.etree.ParserError
- Returns:
str: normalized URL
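The documented behaviour can be sketched with urllib from the standard library (a minimal approximation, not the actual searx.utils code; edge cases may differ):

```python
from urllib.parse import urljoin, urlparse


def normalize_url(url: str, base_url: str) -> str:
    """Join url with base_url; add a trailing slash when the result has
    no path; raise ValueError when no absolute URL results."""
    joined = urljoin(base_url, url)
    parsed = urlparse(joined)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError('not an absolute URL: ' + joined)
    if not parsed.path:
        # e.g. 'https://example.com' -> 'https://example.com/'
        joined += '/'
    return joined
```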
- searx.utils.extract_url(xpath_results: list[ElementBase | _Element] | ElementBase | _Element | str | Number | bool | None, base_url: str) str[source]¶
Extract and normalize URL from lxml Element
- Example:
>>> def f(s, search_url):
>>>     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError
- Raises:
ValueError
lxml.etree.ParserError
- Returns:
str: normalized URL
- searx.utils.dict_subset(dictionary: MutableMapping[Any, Any], properties: set[str]) MutableMapping[str, Any][source]¶
Extract a subset of a dict
- Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}
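As the second example shows, keys missing from the source dict are silently ignored. A one-line sketch of that behaviour:

```python
from typing import Any, Iterable, Mapping


def dict_subset(dictionary: Mapping[Any, Any], properties: Iterable[str]) -> dict[str, Any]:
    """Keep only the keys listed in properties; missing keys are ignored."""
    return {k: dictionary[k] for k in properties if k in dictionary}
```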
- searx.utils.humanize_bytes(size: int | float, precision: int = 2)[source]¶
Determine the human-readable value of bytes, base 1024 (1 KB = 1024 B).
- searx.utils.humanize_number(size: int | float, precision: int = 0)[source]¶
Determine the human-readable value of a decimal number.
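A minimal sketch of both helpers; the unit names and the exact output format ('1.50 KB', '1K') are assumptions for illustration, the real functions may format differently:

```python
def humanize_bytes(size: float, precision: int = 2) -> str:
    """Scale a byte count by powers of 1024 (1 KB = 1024 B)."""
    units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
    for unit in units:
        if abs(size) < 1024 or unit == units[-1]:
            return f'{size:.{precision}f} {unit}'
        size /= 1024


def humanize_number(size: float, precision: int = 0) -> str:
    """Scale a decimal number by powers of 1000 (K, M, B, T)."""
    suffixes = ['', 'K', 'M', 'B', 'T']
    for suffix in suffixes:
        if abs(size) < 1000 or suffix == suffixes[-1]:
            return f'{size:.{precision}f}{suffix}'
        size /= 1000
```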
- searx.utils.convert_str_to_int(number_str: str) int[source]¶
Convert number_str to int or 0 if number_str is not a number.
- searx.utils.extr(txt: str, begin: str, end: str, default: str = '') str[source]¶
Extract the string between begin and end from txt
- Parameters:
txt – String to search in
begin – First string to be searched for
end – Second string to be searched for after begin
default – Default value if one of begin or end is not found. Defaults to an empty string.
- Returns:
The string between the two search strings begin and end. If at least one of begin or end is not found, the value of default is returned.
- Examples:
>>> extr("abcde", "a", "e")
"bcd"
>>> extr("abcde", "a", "z", default="nothing")
"nothing"
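The documented contract fits in a few lines of str.find (a sketch of the described behaviour, not the actual searx.utils code):

```python
def extr(txt: str, begin: str, end: str, default: str = '') -> str:
    """Return the substring of txt between begin and end, or default
    when either marker is missing."""
    start = txt.find(begin)
    if start < 0:
        return default
    start += len(begin)
    stop = txt.find(end, start)  # end is searched only after begin
    if stop < 0:
        return default
    return txt[start:stop]
```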
- searx.utils.int_or_zero(num: list[str] | str) int[source]¶
Convert num to int or 0. num can be either a str or a list. If num is a list, the first element is converted to int (or return 0 if the list is empty). If num is a str, see convert_str_to_int
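Both conversion helpers can be sketched together (illustrative; the real convert_str_to_int may accept more formats):

```python
def convert_str_to_int(number_str: str) -> int:
    """Return int(number_str), or 0 if number_str is not a number."""
    try:
        return int(number_str)
    except ValueError:
        return 0


def int_or_zero(num) -> int:
    """Accept a str or a list of str; an empty list yields 0."""
    if isinstance(num, list):
        if not num:
            return 0
        num = num[0]  # only the first element is converted
    return convert_str_to_int(num)
```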
- searx.utils.ecma_unescape(string: str) str[source]¶
Python implementation of the JavaScript unescape function
https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape
- Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
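Per the ECMA-262 specification linked above, %uXXXX sequences decode four hex digits and %XX sequences decode two; a regex-based sketch of that rule:

```python
import re

# %uXXXX must be handled before %XX, otherwise '%u54' would match first.
RE_U = re.compile(r'%u([0-9a-fA-F]{4})')
RE_X = re.compile(r'%([0-9a-fA-F]{2})')


def ecma_unescape(string: str) -> str:
    string = RE_U.sub(lambda m: chr(int(m.group(1), 16)), string)
    return RE_X.sub(lambda m: chr(int(m.group(1), 16)), string)
```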
- searx.utils.remove_pua_from_str(string: str)[source]¶
Removes Unicode’s “PRIVATE USE CHARACTER”s (PUA) from a string.
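Unicode defines three Private Use Areas (U+E000–U+F8FF in the BMP plus planes 15 and 16); filtering them out can be sketched as:

```python
# The three Unicode Private Use Areas.
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def remove_pua_from_str(string: str) -> str:
    """Drop every code point that falls inside a Private Use Area."""
    return ''.join(
        c for c in string
        if not any(lo <= ord(c) <= hi for lo, hi in PUA_RANGES)
    )
```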
- searx.utils.get_engine_from_settings(name: str) dict[str, dict[str, str]][source]¶
Return the engine configuration from settings.yml for a given engine name
- searx.utils.get_xpath(xpath_spec: str | XPath) XPath[source]¶
Return cached compiled lxml.etree.XPath object.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
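The caching pattern can be sketched with functools.lru_cache (an approximation of the documented behaviour; the real function wraps XPath syntax errors in SearxXPathSyntaxException, which is omitted here):

```python
from functools import lru_cache

from lxml import etree


@lru_cache(maxsize=None)
def _compile_xpath(xpath_str: str) -> etree.XPath:
    # Compiled once per distinct selector string, then reused.
    return etree.XPath(xpath_str)


def get_xpath(xpath_spec):
    """Accept a str or an already-compiled lxml.etree.XPath."""
    if isinstance(xpath_spec, etree.XPath):
        return xpath_spec
    if isinstance(xpath_spec, str):
        return _compile_xpath(xpath_spec)
    raise TypeError('xpath_spec must be str or lxml.etree.XPath')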
- searx.utils.eval_xpath(element: ElementBase | _Element, xpath_spec: str | XPath) Any[source]¶
Equivalent of element.xpath(xpath_str) but compiles xpath_str into a lxml.etree.XPath object once for all. The return value of xpath(..) is complex, read XPath return values for more details.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
SearxEngineXPathException – Raised when the XPath can’t be evaluated (masked lxml.etree.XPathError).
- searx.utils.eval_xpath_list(element: ElementBase | _Element, xpath_spec: str | XPath, min_len: int | None = None) list[Any][source]¶
Same as searx.utils.eval_xpath, but additionally ensures the return value is a list. The minimum length of the list is also checked (if min_len is set).
- searx.utils.eval_xpath_getindex(element: ~lxml.etree.ElementBase | ~lxml.etree._Element, xpath_spec: str | ~lxml.etree.XPath, index: int, default: ~typing.Any = <searx.utils._NotSetClass object>) Any[source]¶
Same as searx.utils.eval_xpath_list, but returns the item at position index from the list (index starts at 0). The exceptions known from searx.utils.eval_xpath are thrown. If a default is specified, it is returned when an element at position index could not be determined.
- searx.utils.get_embeded_stream_url(url: str)[source]¶
Converts a standard video URL into its embed format. Supported services include YouTube, Facebook, Instagram, TikTok, Dailymotion, and Bilibili.
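The YouTube case of such a conversion can be sketched with urllib (hypothetical helper name; the real function covers all the listed services and more URL shapes):

```python
from urllib.parse import parse_qs, urlparse


def youtube_embed_url(url: str):
    """Rewrite a YouTube watch URL to its /embed/ form, else return None."""
    parsed = urlparse(url)
    if parsed.netloc in ('www.youtube.com', 'youtube.com') and parsed.path == '/watch':
        video_id = parse_qs(parsed.query).get('v', [None])[0]
        if video_id:
            return f'https://www.youtube.com/embed/{video_id}'
    return None
```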