Utility functions for the engines
- searx.utils.XPathSpecType: TypeAlias = str | lxml.etree.XPath
Type alias used by searx.utils.get_xpath, searx.utils.eval_xpath and other XPath selectors.
- searx.utils.SEARCH_LANGUAGE_CODES = frozenset({'af', 'ar', 'be', 'bg', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'ga', 'gd', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'kn', 'ko', 'lt', 'lv', 'ml', 'mr', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'ur', 'vi', 'zh'})
Languages supported by most SearXNG engines (searx.sxng_locales.sxng_locales).
- searx.utils.gen_useragent(os_string: str | None = None) str [source]
Return a random browser User Agent. See searx/data/useragents.json.
- searx.utils.html_to_text(html_str: str) str [source]
Extract text from an HTML string.
- Args:
html_str (str): HTML string
- Returns:
str: extracted text
- Examples:
>>> html_to_text('Example <span id="42">#2</span>')
'Example #2'
>>> html_to_text('<style>.span { color: red; }</style><span>Example</span>')
'Example'
>>> html_to_text(r'regexp: (?<![a-zA-Z]')
'regexp: (?<![a-zA-Z]'
>>> html_to_text(r'<p><b>Lorem ipsum </i>dolor sit amet</p>')
'Lorem ipsum </i>dolor sit amet</p>'
>>> html_to_text(r'> < a')
'> < a'
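The tag-stripping behaviour in the examples above can be reproduced with a stdlib-only sketch. This is not the actual implementation (which also normalizes whitespace and handles parser errors, as the fourth example shows); the class and function names here are illustrative:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('style', 'script'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('style', 'script') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text_sketch(html_str: str) -> str:
    parser = _TextExtractor()
    parser.feed(html_str)
    return ''.join(parser.parts)
```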
- searx.utils.markdown_to_text(markdown_str: str) str [source]
Extract text from a Markdown string.
- Args:
markdown_str (str): Markdown string
- Returns:
str: extracted text
- Examples:
>>> markdown_to_text('[example](https://example.com)')
'example'
>>> markdown_to_text('## Headline')
'Headline'
- searx.utils.extract_text(xpath_results: list[ElementBase] | ElementBase | str | Number | bool | None, allow_none: bool = False) str | None [source]
Extract text from an lxml result:
- if xpath_results is a list, extract the text from each result and concatenate the values;
- if xpath_results is an XML element, extract all of its text nodes (lxml's text_content() method);
- if xpath_results is a string, it is already text and is returned as-is.
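The type dispatch described above can be sketched with the stdlib ElementTree in place of lxml (a minimal sketch, not the real implementation, which also normalizes whitespace; itertext() plays the role of lxml's text_content()):

```python
import xml.etree.ElementTree as ET
from numbers import Number


def extract_text_sketch(xpath_results, allow_none=False):
    # list -> extract text from each item and concatenate
    if isinstance(xpath_results, list):
        return ''.join(extract_text_sketch(r) or '' for r in xpath_results)
    # element -> all of its text nodes
    if isinstance(xpath_results, ET.Element):
        return ''.join(xpath_results.itertext())
    # str / Number / bool -> already text (or trivially convertible)
    if isinstance(xpath_results, (str, Number, bool)):
        return str(xpath_results)
    if xpath_results is None and allow_none:
        return None
    raise ValueError('unsupported type')
```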
- searx.utils.normalize_url(url: str, base_url: str) str [source]
Normalize a URL: add the protocol, join the URL with base_url, and add a trailing slash if there is no path.
- Args:
url (str): Relative URL
base_url (str): Base URL; it must be an absolute URL.
- Example:
>>> normalize_url('https://example.com', 'http://example.com/')
'https://example.com/'
>>> normalize_url('//example.com', 'http://example.com/')
'http://example.com/'
>>> normalize_url('//example.com', 'https://example.com/')
'https://example.com/'
>>> normalize_url('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> normalize_url('', 'https://example.com')
'https://example.com/'
>>> normalize_url('/test', '/path')
raise ValueError
- Raises:
lxml.etree.ParserError
- Returns:
str: normalized URL
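The join-and-slash logic can be approximated with urllib.parse (a minimal sketch under the same contract, not the actual implementation):

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(url: str, base_url: str) -> str:
    # resolve the (possibly relative or scheme-relative) URL against the base
    absolute = urljoin(base_url, url)
    parsed = urlparse(absolute)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError('base_url must be an absolute URL')
    # add a trailing slash if there is no path
    if not parsed.path:
        absolute += '/'
    return absolute
```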
- searx.utils.extract_url(xpath_results: list[ElementBase] | ElementBase | str | Number | bool | None, base_url: str) str [source]
Extract and normalize a URL from an lxml Element.
- Example:
>>> def f(s, search_url):
...     return searx.utils.extract_url(html.fromstring(s), search_url)
>>> f('<span id="42">https://example.com</span>', 'http://example.com/')
'https://example.com/'
>>> f('https://example.com', 'http://example.com/')
'https://example.com/'
>>> f('//example.com', 'http://example.com/')
'http://example.com/'
>>> f('//example.com', 'https://example.com/')
'https://example.com/'
>>> f('/path?a=1', 'https://example.com')
'https://example.com/path?a=1'
>>> f('', 'https://example.com')
raise lxml.etree.ParserError
>>> searx.utils.extract_url([], 'https://example.com')
raise ValueError
- Raises:
ValueError
lxml.etree.ParserError
- Returns:
str: normalized URL
- searx.utils.dict_subset(dictionary: MutableMapping[Any, Any], properties: set[str]) MutableMapping[str, Any] [source]
Extract a subset of a dict.
- Examples:
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'C'])
{'A': 'a', 'C': 'c'}
>>> dict_subset({'A': 'a', 'B': 'b', 'C': 'c'}, ['A', 'D'])
{'A': 'a'}
- searx.utils.humanize_bytes(size: int | float, precision: int = 2)[source]
Determine the human-readable value of bytes on a 1024 base (1KB=1024B).
- searx.utils.humanize_number(size: int | float, precision: int = 0)[source]
Determine the human-readable value of a decimal number.
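Both helpers follow the same divide-and-scale loop. A minimal sketch of the 1024-base byte variant (the unit labels and output format here are assumptions, not necessarily what SearXNG renders):

```python
def humanize_bytes_sketch(size: float, precision: int = 2) -> str:
    # assumed unit labels; divide by 1024 until the value fits the unit
    units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f'{size:.{precision}f} {units[index]}'
```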
- searx.utils.convert_str_to_int(number_str: str) int [source]
Convert number_str to an int, or 0 if number_str is not a number.
- searx.utils.extr(txt: str, begin: str, end: str, default: str = '')[source]
Extract the string between begin and end from txt.
- Parameters:
txt – String to search in
begin – First string to be searched for
end – Second string to be searched for after begin
default – Default value if begin or end is not found. Defaults to an empty string.
- Returns:
The string between the two search strings begin and end. If at least one of begin or end is not found, the value of default is returned.
- Examples:
>>> extr("abcde", "a", "e")
"bcd"
>>> extr("abcde", "a", "z", default="nothing")
"nothing"
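The contract above amounts to two find() calls; a stdlib sketch (illustrative, not the actual implementation):

```python
def extr_sketch(txt: str, begin: str, end: str, default: str = '') -> str:
    start = txt.find(begin)
    if start < 0:
        return default
    start += len(begin)
    stop = txt.find(end, start)  # search for `end` only after `begin`
    if stop < 0:
        return default
    return txt[start:stop]
```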
- searx.utils.int_or_zero(num: list[str] | str) int [source]
Convert num to an int, or 0. num can be either a str or a list. If num is a list, the first element is converted to an int (or 0 is returned if the list is empty). If num is a str, see convert_str_to_int.
- searx.utils.is_valid_lang(lang: str) tuple[bool, str, str] | None [source]
Return the language code and name if lang describes a language.
- Examples:
>>> is_valid_lang('zz')
None
>>> is_valid_lang('uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang(b'uk')
(True, 'uk', 'ukrainian')
>>> is_valid_lang('en')
(True, 'en', 'english')
>>> searx.utils.is_valid_lang('Español')
(True, 'es', 'spanish')
>>> searx.utils.is_valid_lang('Spanish')
(True, 'es', 'spanish')
- searx.utils.ecma_unescape(string: str) str [source]
Python implementation of the JavaScript unescape function.
https://www.ecma-international.org/ecma-262/6.0/#sec-unescape-string
https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Objets_globaux/unescape
- Examples:
>>> ecma_unescape('%u5409')
'吉'
>>> ecma_unescape('%20')
' '
>>> ecma_unescape('%F3')
'ó'
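The two ECMA escape forms (%uXXXX for BMP code points, %XX for single code units) can be sketched with two regex substitutions; the %uXXXX form must be handled first so that e.g. '%u5409' is not misread as '%u5' plus text:

```python
import re


def ecma_unescape_sketch(string: str) -> str:
    # '%u5409' -> chr(0x5409)
    string = re.sub(
        r'%u([0-9a-fA-F]{4})', lambda m: chr(int(m.group(1), 16)), string
    )
    # '%20' -> chr(0x20)
    return re.sub(
        r'%([0-9a-fA-F]{2})', lambda m: chr(int(m.group(1), 16)), string
    )
```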
- searx.utils.remove_pua_from_str(string: str)[source]
Removes Unicode "PRIVATE USE CHARACTER"s (PUA) from a string.
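Private-use characters carry the Unicode general category Co, so the filter can be sketched with the stdlib unicodedata module (illustrative, not the actual implementation):

```python
import unicodedata


def remove_pua_sketch(string: str) -> str:
    # category 'Co' marks private-use characters in all three PUA ranges
    return ''.join(ch for ch in string if unicodedata.category(ch) != 'Co')
```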
- searx.utils.get_engine_from_settings(name: str) dict[str, dict[str, str]] [source]
Return the engine configuration from settings.yml for a given engine name.
- searx.utils.get_xpath(xpath_spec: str | XPath) XPath [source]
Return a cached, compiled lxml.etree.XPath object.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
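The compile-once caching pattern is easy to reproduce. Since lxml may not be installed everywhere, this sketch caches compiled regular expressions instead of lxml.etree.XPath objects; the mechanism (expensive compile on first call, cached object on every later call) is the same:

```python
import functools
import re


@functools.lru_cache(maxsize=None)
def get_compiled(spec: str):
    # first call with a given spec compiles it;
    # subsequent calls return the identical cached object
    return re.compile(spec)
```

Two lookups with the same spec return the same object, so hot engine code never recompiles its selectors.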
- searx.utils.eval_xpath(element: ElementBase, xpath_spec: str | XPath) Any [source]
Equivalent of element.xpath(xpath_str), but compiles xpath_str into a lxml.etree.XPath object once and for all. The return value of xpath(..) is complex; read XPath return values for more details.
- Raises:
TypeError – Raised when xpath_spec is neither a str nor a lxml.etree.XPath.
SearxXPathSyntaxException – Raised when there is a syntax error in the XPath selector (str).
SearxEngineXPathException – Raised when the XPath can't be evaluated (masked lxml.etree.XPathError).
- searx.utils.eval_xpath_list(element: ElementBase, xpath_spec: str | XPath, min_len: int | None = None) list[Any] [source]
Same as searx.utils.eval_xpath, but additionally ensures the return value is a list. The minimum length of the list is also checked (if min_len is set).
- searx.utils.eval_xpath_getindex(element: ElementBase, xpath_spec: str | XPath, index: int, default: Any = <searx.utils._NotSetClass object>) Any [source]
Same as searx.utils.eval_xpath_list, but returns the item at position index in the list (the index starts at 0). The exceptions known from searx.utils.eval_xpath are thrown. If a default is specified, it is returned when no element at position index could be determined.
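The default handling relies on a sentinel (the _NotSetClass object visible in the signature) to distinguish "no default given" from "the default is None". A sketch of that pattern on a plain list (illustrative names, not the actual implementation):

```python
_NOTSET = object()  # stands in for searx.utils._NotSetClass


def getindex_sketch(results: list, index: int, default=_NOTSET):
    try:
        return results[index]
    except IndexError:
        if default is _NOTSET:
            raise  # no default given: propagate the error
        return default
```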
- searx.utils.get_embeded_stream_url(url: str)[source]
Converts a standard video URL into its embed format. Supported services include YouTube, Facebook, Instagram, TikTok, Dailymotion, and Bilibili.
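As an illustration of the URL rewriting involved, here is a hypothetical helper covering only the YouTube watch-URL case (the function name and scope are this sketch's own; the youtube.com/embed/<id> form is YouTube's documented embed scheme):

```python
from urllib.parse import urlparse, parse_qs


def youtube_embed_sketch(url: str):
    # hypothetical: handles only https://www.youtube.com/watch?v=<id>
    parsed = urlparse(url)
    if parsed.hostname in ('youtube.com', 'www.youtube.com') and parsed.path == '/watch':
        video_id = parse_qs(parsed.query).get('v', [None])[0]
        if video_id:
            return f'https://www.youtube.com/embed/{video_id}'
    return None
```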
- searx.utils.detect_language(text: str, threshold: float = 0.3, only_search_languages: bool = False) str | None [source]
Detect the language of the text parameter.
- Parameters:
text (str) – The string whose language is to be detected.
threshold (float) – Filters the returned labels by a threshold on probability; a value of 0.3 returns only labels with at least 0.3 probability.
only_search_languages (bool) – If True, returns only supported SearXNG search languages. See searx.languages.
- Return type:
str | None
- Returns:
The detected language code or None. See below.
- Raises:
ValueError – If text is not a string.
The language detection is done by using a fork of the fastText library (python fasttext). fastText distributes the language identification model, for reference:
The language identification model supports the language codes (ISO-639-3):
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
By using only_search_languages=True the language identification model is harmonized with SearXNG's language (locale) model. General conditions of SearXNG's locale model are:
- SearXNG's locale of a query is passed to searx.locales.get_engine_locale to get a language and/or region code that is used by an engine.
- Most of SearXNG's engines do not support all the languages from the language identification model, and there is also a discrepancy between the ISO-639-3 (fastText) and ISO-639-2 (SearXNG) handling. Furthermore, in SearXNG locales like zh-TW (zh-CN) are mapped to zh_Hant (zh_Hans), while the language identification model reduces both to zh.