# Harmful Detector
The Harmful detector prevents potentially harmful content from reaching your LLM by screening input prompts with OpenAI's moderation API.
## Vulnerability
## Usage
Utilizing the Harmful detector allows moderators to confidently detect and monitor harmful content sent to LLMs. The detector screens each input prompt for the following categories (a rough sketch of how scores are compared against thresholds follows the list):
- `hate`
- `hate/threatening`
- `maliciousness`
- `controversiality`
- `coherence`
- `correctness`
- `harmfulness`
- `helpfulness`
- `misogyny`
- `criminality`
- `insensitivity`
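Each category receives a score from the moderation backend, and the per-category thresholds you configure below control how sensitive the detector is. As a rough illustration only (not the library's actual implementation), flagging can be thought of as comparing each category's score to its threshold; the scores in this sketch are made up.

```python
# Illustrative sketch of threshold-based flagging, not the HarmfulInput
# implementation. The category scores below are invented for the example.
category_scores = {"hate": 0.12, "maliciousness": 0.71, "criminality": 0.08}
thresholds = {"hate": 0.6, "maliciousness": 0.5, "criminality": 0.5}

flagged = {
    category: score
    for category, score in category_scores.items()
    if score >= thresholds.get(category, 1.0)  # assumption: score >= threshold counts as harmful
}
print(flagged)  # {'maliciousness': 0.71}
```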
## Configuration
Initialize the Harmful Detector with the desired options:
```python
from guardrail.firewall import Firewall  # import path assumed from the package layout
from guardrail.firewall.input_detectors import HarmfulInput

# Per-category thresholds between 0.0 and 1.0; each category's moderation
# score is checked against its threshold to decide whether the prompt is flagged.
thresholds = {
    "hate": 0.6,
    "hate/threatening": 0.6,
    "maliciousness": 0.5,
    "controversiality": 0.7,
    "coherence": 0.7,
    "correctness": 0.6,
    "harmfulness": 0.5,
    "helpfulness": 0.7,
    "misogyny": 0.5,
    "criminality": 0.5,
    "insensitivity": 0.5,
}

firewall = Firewall()
input_detectors = [HarmfulInput(thresholds)]

prompt = "Tell me about your product's security features."  # example prompt to scan
sanitized_prompt, valid_results, risk_score = firewall.scan_input(prompt, input_detectors)
```
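After scanning, you would typically check the results before passing the prompt on to your model. The sketch below assumes `valid_results` maps each detector name to a boolean and that `risk_score` is a number; both are assumptions to verify against your installed version, and `call_your_llm` is a hypothetical placeholder.

```python
def call_your_llm(prompt: str) -> str:
    # Hypothetical placeholder for your actual model call.
    return f"(model response to: {prompt})"

# Assumption: valid_results is a dict of detector name -> bool.
if all(valid_results.values()):
    response = call_your_llm(sanitized_prompt)
else:
    print(f"Prompt blocked by the Harmful detector (risk score: {risk_score})")
```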
Here's what the option is for:

- `thresholds` (dict): a dictionary of per-category thresholds from 0.0 to 1.0, used to determine whether a category is harmful based on its moderation score.
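Lowering a threshold makes the corresponding category easier to flag (under the assumption that a score at or above the threshold counts as harmful). Below is a minimal sketch that assumes `HarmfulInput` accepts a partial thresholds dictionary; whether omitted categories fall back to defaults depends on your installed version.

```python
from guardrail.firewall.input_detectors import HarmfulInput

# Stricter settings for hate-related categories (assumes partial-dict support).
strict_thresholds = {
    "hate": 0.3,
    "hate/threatening": 0.3,
    "harmfulness": 0.4,
}
strict_detector = HarmfulInput(strict_thresholds)
```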