Harmful Detector

The Harmful Content Detector serves as a robust defense mechanism, employing OpenAI's moderation API to identify and prevent potentially harmful content generated by language models.

Vulnerability

Preventing the dissemination of harmful or inappropriate content is of utmost importance. Such content can range from hate speech and threats to malicious or criminal intent. Allowing it to surface poses a serious risk to an organization's reputation, customer trust, and regulatory compliance.

Usage

By integrating Guardrail's Harmful Content Detection module, teams can confidently moderate language model outputs. The detector screens generated content for harmful elements such as hate speech, malicious intent, and controversial statements, and also evaluates qualities such as coherence, correctness, and insensitivity for a thorough assessment.

The Harmful Content Detection module utilizes OpenAI's moderation API to provide real-time alerts for any detected harmful content. Swift action can then be taken to prevent the dissemination of inappropriate or damaging information, maintaining a safe and respectful environment for users and stakeholders.
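
For context, the check behind the detector is similar in spirit to calling OpenAI's moderation endpoint directly. The snippet below is a minimal, illustrative sketch of such a call using the openai Python SDK (v1+); it is not Guardrail's internal implementation, and the example text is made up.

# Illustrative only: a direct call to OpenAI's moderation API, similar in
# spirit to what the detector performs for you. Requires OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

llm_output = "Example LLM output to screen for harmful content."
moderation = client.moderations.create(input=llm_output)
result = moderation.results[0]

print("flagged:", result.flagged)                      # overall boolean verdict
print("scores:", result.category_scores.model_dump())  # per-category scores in [0, 1]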

Using the Harmful detector, moderators can confidently monitor LLM outputs for harmful content. The detector screens model output against the following categories (a simplified sketch of the screening step follows the list):

  • hate
  • hate/threatening
  • maliciousness
  • controversiality
  • coherence
  • correctness
  • harmfulness
  • helpfulness
  • misogyny
  • criminality
  • insensitivity

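Conceptually, screening reduces to comparing each category's score against a configurable threshold. The snippet below is a simplified, self-contained sketch of that idea, assuming a category is flagged once its score meets or exceeds its threshold; it is not the detector's actual implementation, and the scores are made up.

# Simplified sketch of per-category screening (not Guardrail's internals).
# Assumption: a category is flagged when its score meets or exceeds its threshold.
category_scores = {"hate": 0.12, "maliciousness": 0.71, "insensitivity": 0.44}
category_thresholds = {"hate": 0.6, "maliciousness": 0.5, "insensitivity": 0.5}

flagged = {
    category: score
    for category, score in category_scores.items()
    if score >= category_thresholds[category]
}

print(flagged)  # {'maliciousness': 0.71}
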
Configuration

Initialize the Harmful Detector with the desired options:

from guardrail.firewall import Firewall
from guardrail.firewall.output_detectors import HarmfulOutput

# Thresholds between 0.0 and 1.0 for each category the detector screens.
thresholds = {
    "hate": 0.6,
    "hate/threatening": 0.6,
    "maliciousness": 0.5,
    "controversiality": 0.7,
    "coherence": 0.7,
    "correctness": 0.6,
    "harmfulness": 0.5,
    "helpfulness": 0.7,
    "misogyny": 0.5,
    "criminality": 0.5,
    "insensitivity": 0.5
}

firewall = Firewall()
output_detectors = [HarmfulOutput(thresholds)]

# sanitized_prompt is typically the prompt after input scanning, and
# response_text is the raw LLM response to screen (placeholder values shown).
sanitized_prompt = "Tell me about your refund policy."
response_text = "Our refund policy allows returns within 30 days of purchase."

sanitized_response, valid_results, risk_score = firewall.scan_output(sanitized_prompt, response_text, output_detectors)

Here's what the option is for:

  • thresholds (dict): a dictionary of per-category thresholds from 0.0 to 1.0 used to decide whether a category's score is considered harmful
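
As a usage example, you can tighten or relax individual categories by adjusting their thresholds and re-running the scan. The snippet below builds on the configuration above and assumes the convention that a lower threshold makes a category stricter (i.e., it is flagged at a lower score); the specific values are illustrative.

# Example: stricter screening for hate-related categories, reusing the
# firewall, prompt, and response from the configuration example above.
# Assumption: lowering a threshold flags that category at lower scores.
strict_thresholds = {**thresholds, "hate": 0.3, "hate/threatening": 0.3, "criminality": 0.3}
strict_detectors = [HarmfulOutput(strict_thresholds)]

sanitized_response, valid_results, risk_score = firewall.scan_output(
    sanitized_prompt, response_text, strict_detectors
)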