Toxicity Detector

The Toxicity Detector identifies and filters out toxic or harmful content in LLM outputs. It plays a crucial role in maintaining a positive and respectful experience for users interacting with AI systems.

Vulnerability

In the absence of content moderation, LLM outputs may contain offensive, harmful, or inappropriate language. Such content not only creates a negative user experience but can also raise legal and ethical concerns. The Toxicity Detector counters this vulnerability by filtering out toxic content so that only respectful and safe interactions occur.

Usage

By integrating the Toxicity Detector into your applications, you can stop toxic LLM output before it reaches users, fostering a safe and inclusive environment. The detector uses the Perspective API, a tool developed by Jigsaw, to assess the toxicity of text outputs.

Note: The Perspective API employs machine learning models to analyze text for toxicity. It evaluates the likelihood of a text being perceived as toxic, offensive, or inappropriate.
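For reference, a direct call to the Perspective API looks roughly like the sketch below, which follows Google's published quickstart for the google-api-python-client package. The detector presumably makes a similar request internally (an assumption about its internals), so you normally do not need to call the API yourself.

# Sketch of a direct Perspective API call, based on Google's quickstart.
# Requires the google-api-python-client package and a valid API key.
import os
from googleapiclient import discovery

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=os.environ["PERSPECTIVE_API_KEY"],
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

analyze_request = {
    "comment": {"text": "You are a wonderful person."},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=analyze_request).execute()

# summaryScore.value is a probability-like score in [0, 1]
print(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])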

Configuration

To configure the Toxicity Detector, follow these steps:

  1. Obtain API Key: Sign up for the Perspective API and obtain your unique API key. More information and sign-up instructions are available on the Perspective API developer site (https://developers.perspectiveapi.com/).

  2. Set PERSPECTIVE_API_KEY in a local .env file (see the loading snippet after this list).

  3. Initialize the output detector and scan.
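For step 2, one common approach is to load the key at startup with the python-dotenv package. This is a minimal sketch assuming that package; any mechanism that exports PERSPECTIVE_API_KEY into the environment works equally well.

import os
from dotenv import load_dotenv

load_dotenv()  # reads PERSPECTIVE_API_KEY from the local .env file into the process environment
assert os.getenv("PERSPECTIVE_API_KEY"), "PERSPECTIVE_API_KEY is not set"

For step 3, initialize the firewall and scan the model output: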

from guardrail.firewall import Firewall  # assumed top-level import path for Firewall
from guardrail.firewall.output_detectors import ToxicityOutput

firewall = Firewall()
output_detectors = [ToxicityOutput()]

# prompt is the original user input; response_text is the raw LLM output to scan
prompt = "Summarize the customer's complaint."
response_text = "..."  # replace with the LLM response produced for this prompt

sanitized_response, valid_results, risk_score = firewall.scan_output(prompt, response_text, output_detectors)

The risk_score returned indicates the likelihood that the output is toxic. You can set a threshold based on your application's tolerance and take appropriate action when the score exceeds it (e.g., filtering the content, issuing a warning, or blocking the output).
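As an illustration, a simple gate on the returned score might look like this (the 0.7 threshold is illustrative, not a library default):

TOXICITY_THRESHOLD = 0.7  # illustrative tolerance level; tune per application

if risk_score > TOXICITY_THRESHOLD:
    # Block or replace the output, log the event, or warn the user.
    final_response = "This response was withheld because it may contain harmful content."
else:
    final_response = sanitized_response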

By implementing the Toxicity Detector, you can ensure a safer and more respectful interaction environment, protecting both users and the integrity of your applications.