datadiligence.rules package

Submodules

datadiligence.rules.base module

class datadiligence.rules.base.BulkRule[source]

Bases: Rule

Base class for bulk rules. filter_allowed and is_ready must be implemented.

filter_allowed(**kwargs)[source]

Filter a list of entries based on the rules in this evaluator.

class datadiligence.rules.base.HttpRule(user_agent=None)[source]

Bases: Rule

get_header_value(headers, header_name)[source]

Handle the headers object to get the header value.

Parameters:
  • headers – The headers object to read from.

  • header_name – The name of the header to look up.

Returns:

The header value.

Return type:

str

get_header_value_from_response(response, header_name)[source]

Handle the response object to get the header value.

Parameters:
  • response – The response object to read from.

  • header_name – The name of the header to look up.

Returns:

The header value.

Return type:

str
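As a rough illustration of what a header lookup like this has to deal with, the sketch below reads a header from either a plain dict or an http.client.HTTPMessage, ignoring case. The helper name lookup_header and the choice to return an empty string for a missing header are assumptions made for this example, not the library's implementation.

```python
from http.client import HTTPMessage


def lookup_header(headers, header_name):
    """Return the value of header_name, or "" if it is absent (assumed default)."""
    if isinstance(headers, HTTPMessage):
        # HTTPMessage.get already matches header names case-insensitively.
        return headers.get(header_name, "")
    # Plain dicts are case-sensitive, so compare lowercased names.
    wanted = header_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return ""


message = HTTPMessage()
message["X-Robots-Tag"] = "noai"

from_message = lookup_header(message, "x-robots-tag")
from_dict = lookup_header({"x-robots-tag": "noimageai"}, "X-Robots-Tag")
missing = lookup_header({}, "X-Robots-Tag")
```

Normalizing the lookup in one place lets rule classes accept whatever headers object the caller's HTTP client produces.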

is_ready()[source]

These rules should always be ready.

class datadiligence.rules.base.Rule[source]

Bases: object

Base class for rules. is_allowed and is_ready must be implemented.

is_allowed(**kwargs)[source]

Check if the request is allowed. Must be implemented.

Parameters:

**kwargs – Arbitrary keyword arguments to read args from.

is_ready()[source]

Check if the rule is ready to be used.
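The contract described above (implement is_allowed and is_ready) can be sketched with a small stand-in. The Rule class below only mimics the documented interface so the example runs without the package installed; BlockListRule and its block list are hypothetical, and raising NotImplementedError in the base class is an assumption.

```python
class Rule:
    """Stand-in for the documented base class, not the library's code."""

    def is_allowed(self, **kwargs):
        raise NotImplementedError

    def is_ready(self):
        raise NotImplementedError


class BlockListRule(Rule):
    """Hypothetical rule that rejects URLs found on a fixed block list."""

    def __init__(self, blocked):
        self.blocked = set(blocked)

    def is_allowed(self, url=None, **kwargs):
        # Extra keyword arguments are accepted and ignored.
        return url not in self.blocked

    def is_ready(self):
        # Nothing external to configure, so the rule is always usable.
        return True


rule = BlockListRule(["https://example.com/private.png"])
public_ok = rule.is_allowed(url="https://example.com/public.png")
private_ok = rule.is_allowed(url="https://example.com/private.png")
```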

datadiligence.rules.http module

Rules to manage validation using HTTP properties

class datadiligence.rules.http.TDMRepHeader[source]

Bases: HttpRule

This class wraps logic to evaluate the TDM Reservation Protocol headers: https://www.w3.org/2022/tdmrep/.

HEADER_NAME = 'tdm-reservation'
is_allowed(url=None, response=None, headers=None, **kwargs)[source]

Check if the tdm-rep header allows access to the resource without a policy.

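Parameters:
  • url – The URL of the resource.

  • response – The response object for the resource.

  • headers – The HTTP headers of the response.

Returns: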

True if access is allowed for the resource, False otherwise.

Return type:

bool
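A minimal sketch of this kind of check, assuming that a tdm-reservation value of 1 means rights are reserved and that a missing header is treated as allowed. The function name tdm_allows is hypothetical, and the library's actual evaluation may differ.

```python
HEADER_NAME = "tdm-reservation"


def tdm_allows(headers):
    """Return False only when the tdm-reservation header is exactly "1"."""
    value = ""
    for name, header_value in headers.items():
        # Header names are compared case-insensitively.
        if name.lower() == HEADER_NAME:
            value = header_value
    # Assumption: absent or any value other than "1" counts as allowed.
    return value.strip() != "1"


no_header = tdm_allows({})
reserved = tdm_allows({"tdm-reservation": "1"})
not_reserved = tdm_allows({"TDM-Reservation": "0"})
```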

class datadiligence.rules.http.XRobotsTagHeader(user_agent=None, respect_noindex=False)[source]

Bases: HttpRule

This class wraps logic to read the X-Robots-Tag header.

AI_DISALLOWED_VALUES = ['noai', 'noimageai']
HEADER_NAME = 'X-Robots-Tag'
INDEX_DISALLOWED_VALUES = ['noindex', 'none', 'noimageindex', 'noai', 'noimageai']
is_allowed(url=None, response=None, headers=None, **kwargs)[source]

Check if the X-Robots-Tag header allows the user agent to access the resource.

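Parameters:
  • url – The URL of the resource.

  • response – The response object for the resource.

  • headers – The HTTP headers of the response.

Returns: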

True if the user agent is allowed to access the resource, False otherwise.

Return type:

bool
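The following sketch shows one way such a check could work, using the two value lists documented on the class. It assumes that respect_noindex switches from AI_DISALLOWED_VALUES to INDEX_DISALLOWED_VALUES and that agent-specific directives take the form "agent: value". The function x_robots_allows is hypothetical, and the comma splitting is a simplification of real X-Robots-Tag parsing.

```python
AI_DISALLOWED_VALUES = ["noai", "noimageai"]
INDEX_DISALLOWED_VALUES = ["noindex", "none", "noimageindex", "noai", "noimageai"]


def x_robots_allows(header_value, user_agent=None, respect_noindex=False):
    """Return False if any directive that applies to us is disallowed."""
    disallowed = INDEX_DISALLOWED_VALUES if respect_noindex else AI_DISALLOWED_VALUES
    if not header_value:
        # No header means no restriction was stated.
        return True
    for directive in header_value.split(","):
        directive = directive.strip()
        if directive in disallowed:
            return False
        # Agent-specific form, e.g. "mybot: noai".
        agent, sep, value = directive.partition(":")
        if sep and user_agent and agent.strip() == user_agent:
            if value.strip() in disallowed:
                return False
    return True


blocked_for_all = x_robots_allows("all, noimageai")
noindex_ignored = x_robots_allows("noindex")
noindex_respected = x_robots_allows("noindex", respect_noindex=True)
blocked_for_agent = x_robots_allows("mybot: noai", user_agent="mybot")
other_agent = x_robots_allows("otherbot: noai", user_agent="mybot")
```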

datadiligence.rules.spawning module

This module wraps HTTP calls to the Spawning AI API. See Spawning API documentation here: https://opts-api.spawningaiapi.com/docs.

class datadiligence.rules.spawning.SpawningAPI(user_agent=None, chunk_size=10000, timeout=20, max_retries=5, max_concurrent_requests=10)[source]

Bases: BulkRule

This class wraps basic requests to the Spawning API.

API_KEY_ENV_VAR = 'SPAWNING_OPTS_KEY'
DEFAULT_TIMEOUT = 20
MAX_CHUNK_SIZE = 10000
MAX_CONCURRENT_REQUESTS = 10
SPAWNING_AI_API_URL = 'https://opts-api.spawningaiapi.com/api/v2/query/urls/'
filter_allowed(urls=None, url=None, **kwargs)[source]

Submit a list of URLs to the Spawning AI API.

Parameters:
  • urls (list) – A list of URLs to submit.

  • url (str) – A single URL to submit.

Returns:

A list containing the allowed URLs from the submission.

Return type:

list
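The chunk_size parameter and MAX_CHUNK_SIZE constant suggest that large submissions are split into batches before being sent. The sketch below illustrates that idea only. The helper chunked is hypothetical and makes no network calls, and whether the library batches exactly this way is an assumption.

```python
def chunked(urls, chunk_size=10000):
    """Yield consecutive slices of urls, each at most chunk_size long."""
    for start in range(0, len(urls), chunk_size):
        yield urls[start:start + chunk_size]


# 25 placeholder URLs split into batches of 10.
sample_urls = [f"https://example.com/image-{i}.png" for i in range(25)]
batches = list(chunked(sample_urls, chunk_size=10))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be submitted as its own request, with the per-batch results concatenated in order.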

async filter_allowed_async(urls=None, url=None, **kwargs)[source]

Submit a list of URLs to the Spawning AI API.

Parameters:
  • urls (list) – A list of URLs to submit.

  • url (str) – A single URL to submit.

Returns:

A list containing the allowed URLs.

Return type:

list

is_allowed(urls=None, url=None, **kwargs)[source]

Submit a list of URLs to the Spawning AI API.

Parameters:
  • urls (list) – A list of URLs to submit.

  • url (str) – A single URL to submit.

Returns:

A list of booleans, one per submitted URL, indicating whether that URL is allowed.

Return type:

list
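To make the difference between the two return shapes concrete: is_allowed yields one boolean per URL, while filter_allowed yields only the URLs that passed. The sketch below derives the second from the first. The flags list is a made-up stand-in for an API result, and treating the two methods as related in exactly this way is an assumption.

```python
urls = [
    "https://example.com/a.png",
    "https://example.com/b.png",
    "https://example.com/c.png",
]

# Hypothetical per-URL result, in the same order as urls.
flags = [True, False, True]

# Keep only the URLs whose flag is True.
allowed_urls = [url for url, ok in zip(urls, flags) if ok]
```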

async is_allowed_async(urls=None, url=None, **kwargs)[source]

Submit a list of URLs to the Spawning AI API.

Parameters:
  • urls (list) – A list of URLs to submit.

  • url (str) – A single URL to submit.

Returns:

A list of boolean values indicating if a URL is allowed to be used or not.

Return type:

list

is_ready()[source]

Check if the Spawning AI API is ready to be used. This is determined by whether or not the API key is set.

Returns:

True if the API key is set, False otherwise.

Return type:

bool
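A readiness check of this kind can be sketched as a lookup of the environment variable named by API_KEY_ENV_VAR. The function api_key_is_set is hypothetical. It takes the environment as a parameter so it can be exercised without changing the real process environment, and treating an empty value as "not set" is an assumption.

```python
import os

API_KEY_ENV_VAR = "SPAWNING_OPTS_KEY"


def api_key_is_set(environ=os.environ):
    """Return True if the API key variable is present and non-empty."""
    return bool(environ.get(API_KEY_ENV_VAR))


without_key = api_key_is_set({})
with_key = api_key_is_set({"SPAWNING_OPTS_KEY": "example-key"})
```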

Module contents

This module contains default Rules.