# Harmful Detector
The Harmful detector prevents potentially harmful content from reaching your LLM by screening input prompts with OpenAI's moderation API.
## Vulnerability
## Usage
Utilizing the Harmful detector allows moderators to confidently detect and monitor harmful content sent to LLMs. The detector screens each input prompt for the following categories (a rough sketch of how scores are compared against thresholds follows the list):
- `hate`
- `hate/threatening`
- `maliciousness`
- `controversiality`
- `coherence`
- `correctness`
- `harmfulness`
- `helpfulness`
- `misogyny`
- `criminality`
- `insensitivity`
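Each category receives a score from the moderation backend, and the per-category thresholds you configure below control how sensitive the detector is. As a rough illustration only (not the library's actual implementation), flagging can be thought of as comparing each category's score to its threshold; the scores in this sketch are made up.

```python
# Illustrative sketch of threshold-based flagging, not the HarmfulInput
# implementation. The category scores below are invented for the example.
category_scores = {"hate": 0.12, "maliciousness": 0.71, "criminality": 0.08}
thresholds = {"hate": 0.6, "maliciousness": 0.5, "criminality": 0.5}

flagged = {
    category: score
    for category, score in category_scores.items()
    if score >= thresholds.get(category, 1.0)  # assumption: score >= threshold counts as harmful
}
print(flagged)  # {'maliciousness': 0.71}
```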
## Configuration
Initialize the Harmful Detector with the desired options:
```python
from guardrail.firewall import Firewall  # import path assumed from the package layout
from guardrail.firewall.input_detectors import HarmfulInput

# Per-category thresholds between 0.0 and 1.0; each category's moderation
# score is checked against its threshold to decide whether the prompt is flagged.
thresholds = {
    "hate": 0.6,
    "hate/threatening": 0.6,
    "maliciousness": 0.5,
    "controversiality": 0.7,
    "coherence": 0.7,
    "correctness": 0.6,
    "harmfulness": 0.5,
    "helpfulness": 0.7,
    "misogyny": 0.5,
    "criminality": 0.5,
    "insensitivity": 0.5,
}

firewall = Firewall()
input_detectors = [HarmfulInput(thresholds)]

prompt = "Tell me about your product's security features."  # example prompt to scan
sanitized_prompt, valid_results, risk_score = firewall.scan_input(prompt, input_detectors)
```
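After scanning, you would typically check the results before passing the prompt on to your model. The sketch below assumes `valid_results` maps each detector name to a boolean and that `risk_score` is a number; both are assumptions to verify against your installed version, and `call_your_llm` is a hypothetical placeholder.

```python
def call_your_llm(prompt: str) -> str:
    # Hypothetical placeholder for your actual model call.
    return f"(model response to: {prompt})"

# Assumption: valid_results is a dict of detector name -> bool.
if all(valid_results.values()):
    response = call_your_llm(sanitized_prompt)
else:
    print(f"Prompt blocked by the Harmful detector (risk score: {risk_score})")
```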
Here's what the option is for:

- `thresholds` (dict): a dictionary of per-category thresholds from 0.0 to 1.0, used to determine whether a category is harmful based on its moderation score.
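Lowering a threshold makes the corresponding category easier to flag (under the assumption that a score at or above the threshold counts as harmful). Below is a minimal sketch that assumes `HarmfulInput` accepts a partial thresholds dictionary; whether omitted categories fall back to defaults depends on your installed version.

```python
from guardrail.firewall.input_detectors import HarmfulInput

# Stricter settings for hate-related categories (assumes partial-dict support).
strict_thresholds = {
    "hate": 0.3,
    "hate/threatening": 0.3,
    "harmfulness": 0.4,
}
strict_detector = HarmfulInput(strict_thresholds)
```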