Advanced Usage

For most users, the default settings will be sufficient. However, if you want to create your own Rules and Evaluators, it’s pretty easy.

Make your own Rules

Rules are where the logic to filter out results is performed. A Rule should always descend from the datadiligence.rules.Rule class:

from datadiligence.rules import Rule

class MyRule(Rule):
    def is_ready(self):
        return True

    def is_allowed(self, **kwargs):
        return True

A Rule has two functions you need to implement: is_ready and is_allowed.

The is_ready function is used to determine if the rule’s dependencies are all present. For example, if the rule requires an API Key in an environment variable:

import os

def is_ready(self):
    return os.environ.get('MY_ENV_VAR') is not None

The is_allowed function is where the logic to evaluate whether the arguments are allowed is performed. For example, if we want to disallow loading any resource over HTTP (instead of HTTPS):

def is_allowed(self, url=None, **kwargs):
    if url is None:
        raise Exception("Url is required")

    return url.startswith('https://')

Note this rule will raise an exception if the url argument is not provided. This is because we don’t want the rule to fail silently when url is missing. The function may also receive arguments intended for other Rules, so we keep the **kwargs in the function signature.
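
For example, extra keyword arguments intended for other Rules pass through this rule harmlessly (a quick illustration, assuming MyRule uses the is_allowed above):

>>> rule = MyRule()
>>> rule.is_allowed(url='https://example.com', some_other_arg=123)
True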

It’s best to keep the arguments sent to the is_allowed function to the minimum required to perform the filtering/validation logic. Any additional configuration is best provided to the Rule constructor:

def __init__(self, block_http=True, **kwargs):
    super().__init__()
    self.block_http = block_http

def is_allowed(self, url=None, **kwargs):
    if url is None:
        raise ValueError("url is required")

    # only enforce the https check when block_http is enabled
    return not self.block_http or url.startswith('https://')

Though this is up to the developer to decide, be wary that arguments passed to is_allowed are forwarded to every Rule in an Evaluator, so they could be unintentionally picked up and evaluated by other Rules.
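
Putting these pieces together, a complete version of the rule might look like this (a minimal sketch; any exception type will do for the missing-argument check):

from datadiligence.rules import Rule

class MyRule(Rule):
    def __init__(self, block_http=True, **kwargs):
        super().__init__()
        self.block_http = block_http

    def is_ready(self):
        # no external dependencies, so this rule is always ready
        return True

    def is_allowed(self, url=None, **kwargs):
        if url is None:
            raise ValueError("url is required")

        # only enforce the https check when block_http is enabled
        return not self.block_http or url.startswith('https://')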

Now that we have our Rule, we can add it to the default Evaluator, or to one of the existing Evaluators:

from datadiligence.evaluators import Evaluator
from my_code import MyRule

my_evaluator = Evaluator()
my_evaluator.add_rule(MyRule())

And now the Evaluator will run your rule like any other:

>>> my_evaluator.is_allowed(url='http://example.com')
False
>>> my_evaluator.is_allowed(url='https://example.com')
True

Make your own Evaluator

Sometimes you want to bundle several Rules together, provide default arguments, add flags to customize the Rules, and so on. In this case, it’s best to make your own Evaluator:

from datadiligence.evaluators import Evaluator
from my_code import MyRule

class MyEvaluator(Evaluator):
    def __init__(self, check_http=True, block_http=True):
        super().__init__()
        if check_http:
            self.add_rule(MyRule(block_http=block_http))

This Evaluator will run MyRule only when the check_http flag is set to True.

>>> my_evaluator = MyEvaluator()
>>> my_evaluator.is_allowed(url='http://example.com')
False
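
Since MyRule only enforces the check when block_http is enabled, disabling the flag lets plain HTTP URLs through:

>>> my_evaluator = MyEvaluator(block_http=False)
>>> my_evaluator.is_allowed(url='http://example.com')
True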

All Rules in an Evaluator should share at least one common keyword argument they can use to evaluate; if that argument is missing, the Rule should throw an exception. In other words, an Evaluator should NOT mix-and-match required arguments across its Rules. Otherwise, Rules that never receive their required arguments could populate Evaluators and provide a mistaken sense of compliance, when in fact they’re not being evaluated at all:

class MyRule(Rule):
    def is_ready(self):
        return True

    def is_allowed(self, argument=None, **kwargs):
        # raise instead of failing silently when the required argument is missing
        if argument is None:
            raise ValueError("argument is required")

        return True

This also lets Evaluators be composed into logical units which can be used in different contexts, instead of one massive Evaluator containing every rule, which then must be customized further. We can instead create default Evaluators with sane defaults for given contexts.

For example, PostprocessEvaluator and PreprocessEvaluator were both created with the img2dataset workflow in mind. Other workflows, however, may not already have HTTP responses on hand, so their evaluators may consume a URL and download the response directly.
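
As an illustration, a rule for such a workflow might fetch the response itself. This is only a sketch: FetchAndCheckRule and UrlEvaluator are hypothetical names, and the noai header check is purely illustrative:

import urllib.request

from datadiligence.evaluators import Evaluator
from datadiligence.rules import Rule

class FetchAndCheckRule(Rule):
    def is_ready(self):
        return True

    def is_allowed(self, url=None, **kwargs):
        if url is None:
            raise ValueError("url is required")

        # download the response ourselves instead of expecting one to be passed in
        with urllib.request.urlopen(url) as response:
            # illustrative check: disallow responses served with a noai directive
            return "noai" not in response.headers.get("X-Robots-Tag", "")

class UrlEvaluator(Evaluator):
    def __init__(self):
        super().__init__()
        self.add_rule(FetchAndCheckRule())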

Make your own Bulk Rule

Bulk Rules are handled slightly differently from normal Rules. They should subclass datadiligence.rules.BulkRule and implement the filter_allowed function:

from datadiligence.rules import BulkRule

class MyBulkRule(BulkRule):
    def is_ready(self):
        return True

    def filter_allowed(self, **kwargs):
        return []

Notice that BulkRule classes use the filter_allowed function instead of is_allowed. The Evaluator should also implement a filter_allowed function:

class MyEvaluator(Evaluator):
    def filter_allowed(self, urls=None, **kwargs):
        if urls is None:
            raise ValueError("urls is required")

        # start with every url allowed, then let each rule narrow the list;
        # once a url is filtered out, it is never re-added
        allowed_urls = list(urls)
        for rule in self.rules:
            if rule.is_ready():
                allowed_urls = rule.filter_allowed(urls=allowed_urls)

        return allowed_urls

Notice the response type is not a boolean, but a list: filter_allowed should return the list of approved URLs. As a best practice, run the rules expected to filter out the most URLs first and the rules expected to filter out the fewest last, so each subsequent rule has fewer URLs to process.
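
For example, using the MyBulkRule stub above (which filters out every URL):

>>> my_evaluator = MyEvaluator()
>>> my_evaluator.add_rule(MyBulkRule())
>>> my_evaluator.filter_allowed(urls=['http://example.com', 'https://example.com'])
[]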

Only Run the Spawning API

If you only want to check your URLs against the Spawning API, perform the following setup:

$ export SPAWNING_API_KEY=<your_key>
$ python
>>> from datadiligence.rules import SpawningAPI
>>> urls = ['http://example.com', 'https://example.com']
>>> spawning_rule = SpawningAPI(user_agent="my-user-agent")
>>> spawning_rule.filter_allowed(urls=urls)
[]