datadiligence.rules package
Submodules
datadiligence.rules.base module
- class datadiligence.rules.base.BulkRule
Bases: Rule
Base class for bulk rules. Subclasses must implement filter_allowed and is_ready.
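The contract described above can be sketched roughly as follows. This is a minimal illustration only: `BulkRuleSketch` is a stand-in written with `abc` rather than the library's own base class, and `BlocklistRule` with its blocklist logic is hypothetical. The real method signatures in datadiligence may differ.

```python
from abc import ABC, abstractmethod


class BulkRuleSketch(ABC):
    """Stand-in for a bulk rule base class (hypothetical, not the library's)."""

    @abstractmethod
    def filter_allowed(self, urls=None, **kwargs):
        """Return the subset of urls that may be used."""

    @abstractmethod
    def is_ready(self):
        """Return True once the rule has everything it needs to run."""


class BlocklistRule(BulkRuleSketch):
    """Hypothetical bulk rule that rejects URLs containing a blocked domain."""

    def __init__(self, blocked_domains=None):
        self.blocked_domains = blocked_domains

    def is_ready(self):
        # The rule is usable only after a blocklist has been supplied.
        return self.blocked_domains is not None

    def filter_allowed(self, urls=None, **kwargs):
        urls = urls or []
        # Keep a URL only if none of the blocked domains appears in it.
        return [u for u in urls if not any(d in u for d in self.blocked_domains)]
```

A caller would typically check `is_ready()` before passing a batch of URLs to `filter_allowed()`.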
- class datadiligence.rules.base.HttpRule(user_agent=None)
Bases: Rule
- get_header_value(headers, header_name)
Handle the headers object to get the header value.
- Parameters:
headers (dict|http.client.HTTPMessage|CaseInsensitiveDict) – The headers object.
header_name (str) – The header name.
- Returns:
The header value.
- Return type:
str
- get_header_value_from_response(response, header_name)
Handle the response object to get the header value.
- Parameters:
response (http.client.HTTPResponse|requests.Response) – The response object.
header_name (str) – The header name.
- Returns:
The header value.
- Return type:
str
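Because the headers argument may be a plain dict, an `http.client.HTTPMessage`, or a case-insensitive mapping, a lookup has to cope with differing key-matching behavior. The sketch below shows one way to do that with the standard library only. `get_header_value_sketch` is a hypothetical helper, not the library's implementation, and returning an empty string for a missing header is an assumption.

```python
from http.client import HTTPMessage


def get_header_value_sketch(headers, header_name):
    """Return the value of header_name from several kinds of headers object.

    Hypothetical helper; returns "" when the header is absent (an assumption).
    """
    if headers is None:
        return ""
    if isinstance(headers, HTTPMessage):
        # HTTPMessage.get() already matches header names case-insensitively.
        return headers.get(header_name, "")
    # Plain dicts are case-sensitive, so compare lowercased names instead.
    wanted = header_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return ""
```

The generic branch relies only on `.items()`, so it also accepts other mapping types that expose that method.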
datadiligence.rules.http module
Rules to manage validation using HTTP properties.
- class datadiligence.rules.http.TDMRepHeader
Bases: HttpRule
This class wraps logic to evaluate the TDM Reservation Protocol headers: https://www.w3.org/2022/tdmrep/.
- HEADER_NAME = 'tdm-reservation'
- is_allowed(url=None, response=None, headers=None, **kwargs)
Check if the tdm-rep header allows access to the resource without a policy.
- Parameters:
url (str) – The URL of the resource.
response (http.client.HTTPResponse|requests.Response, optional) – The response object. Defaults to None.
headers (dict|http.client.HTTPMessage, optional) – The headers dictionary. Defaults to None.
- Returns:
True if access is allowed for the resource, False otherwise.
- Return type:
bool
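The decision described for is_allowed can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the class's actual code: it assumes that a `tdm-reservation` value of `1` signals reserved rights, and that a missing header or any other value is treated as allowed. `tdm_rep_allows` is a hypothetical name.

```python
def tdm_rep_allows(header_value):
    """Return True unless the tdm-reservation value reserves TDM rights.

    Assumption: "1" means reserved; a missing header or any other value
    is treated as allowed.
    """
    if not header_value:
        # No header present, so nothing has been reserved.
        return True
    return header_value.strip() != "1"
```

Note that this sketch never consults a policy: a reserved resource is reported as not allowed even if a policy might grant access.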
- class datadiligence.rules.http.XRobotsTagHeader(user_agent=None, respect_noindex=False)
Bases: HttpRule
This class wraps logic to read the X-Robots-Tag header.
- AI_DISALLOWED_VALUES = ['noai', 'noimageai']
- HEADER_NAME = 'X-Robots-Tag'
- INDEX_DISALLOWED_VALUES = ['noindex', 'none', 'noimageindex', 'noai', 'noimageai']
- is_allowed(url=None, response=None, headers=None, **kwargs)
Check if the X-Robots-Tag header allows the user agent to access the resource.
- Parameters:
url (str) – The URL of the resource.
response (http.client.HTTPResponse|requests.Response, optional) – The response object. Defaults to None.
headers (dict|http.client.HTTPMessage, optional) – The headers dictionary. Defaults to None.
- Returns:
True if the user agent is allowed to access the resource, False otherwise.
- Return type:
bool
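To make the role of the two value lists concrete, here is a minimal sketch of how an X-Robots-Tag value might be evaluated. The constants are copied from the class attributes listed above; everything else is an assumption: that `respect_noindex` selects the stricter list, that directives are comma-separated, and that a directive of the form `agent: value` applies only to the named user agent. `x_robots_allows` is a hypothetical function, not the library's method.

```python
AI_DISALLOWED_VALUES = ["noai", "noimageai"]
INDEX_DISALLOWED_VALUES = ["noindex", "none", "noimageindex", "noai", "noimageai"]


def x_robots_allows(header_value, user_agent=None, respect_noindex=False):
    """Hypothetical evaluation of an X-Robots-Tag header value."""
    if not header_value:
        # A missing header places no restriction on the resource.
        return True
    disallowed = INDEX_DISALLOWED_VALUES if respect_noindex else AI_DISALLOWED_VALUES
    for directive in header_value.split(","):
        directive = directive.strip().lower()
        # A bare directive applies to every user agent.
        if directive in disallowed:
            return False
        # An "agent: value" directive applies only to the named user agent.
        agent, sep, value = directive.partition(":")
        if sep and user_agent and agent.strip() == user_agent.lower():
            if value.strip() in disallowed:
                return False
    return True
```

With the default `respect_noindex=False`, a value such as `noindex` does not block access, while `noai` does.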
datadiligence.rules.spawning module
This module wraps HTTP calls to the Spawning AI API. See Spawning API documentation here: https://opts-api.spawningaiapi.com/docs.
- class datadiligence.rules.spawning.SpawningAPI(user_agent=None, chunk_size=10000, timeout=20, max_retries=5, max_concurrent_requests=10)
Bases: BulkRule
This class wraps basic requests to the Spawning API.
- API_KEY_ENV_VAR = 'SPAWNING_OPTS_KEY'
- DEFAULT_TIMEOUT = 20
- MAX_CHUNK_SIZE = 10000
- MAX_CONCURRENT_REQUESTS = 10
- SPAWNING_AI_API_URL = 'https://opts-api.spawningaiapi.com/api/v2/query/urls/'
- filter_allowed(urls=None, url=None, **kwargs)
Submit a list of URLs to the Spawning AI API.
- Parameters:
urls (list) – A list of URLs to submit.
url (str) – A single URL to submit.
- Returns:
A list containing the allowed urls of the submission.
- Return type:
list
- async filter_allowed_async(urls=None, url=None, **kwargs)
Submit a list of URLs to the Spawning AI API.
- is_allowed(urls=None, url=None, **kwargs)
Submit a list of URLs to the Spawning AI API.
- Parameters:
urls (list) – A list of URLs to submit.
url (str) – A single URL to submit.
- Returns:
A list containing booleans, indicating if a respective URL is allowed.
- Return type:
list
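The relationship between the chunk size and the two return shapes (a list of booleans from is_allowed, a list of URLs from filter_allowed) can be sketched without any network access. The helpers below are hypothetical and only illustrate the idea; how SpawningAPI actually batches requests and combines the results is not specified here. `MAX_CHUNK_SIZE` mirrors the documented class constant.

```python
MAX_CHUNK_SIZE = 10000  # mirrors the class constant documented above


def chunk_urls(urls, chunk_size=MAX_CHUNK_SIZE):
    """Split urls into consecutive batches of at most chunk_size items.

    Hypothetical helper; the requested size is capped at MAX_CHUNK_SIZE.
    """
    size = min(chunk_size, MAX_CHUNK_SIZE)
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def filter_by_flags(urls, flags):
    """Keep the URLs whose corresponding boolean flag is True."""
    return [url for url, allowed in zip(urls, flags) if allowed]
```

Given per-URL booleans in the same order as the input, `filter_by_flags` yields the kind of list that filter_allowed is described as returning.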
Module contents
This module contains default Rules.