Contradictions

The Contradictions and Entailment detector assesses whether generated LLM output contradicts or refutes a given statement or prompt. Its primary purpose is to ensure the consistency and correctness of language model outputs, especially in situations where logical contradictions can cause downstream problems.

Vulnerability

When interacting with users or processing information, it is crucial that a language model does not generate responses that directly contradict the provided inputs or established facts. Such contradictions can confuse users or spread misinformation, which is detrimental in any deployment context.

The Contradictions detector is employed to highlight such inconsistencies in the output, allowing developers to rectify them before the content is disseminated.

Usage

The detector leverages pretrained natural language inference (NLI) models from HuggingFace, such as ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli (and other variants), to assess the relationship between a given input prompt and the generated LLM output.

The detector computes a contradiction score: the higher the score, the more strongly the output contradicts the given prompt.

The score is then compared to a configurable threshold. Outputs whose score exceeds this threshold are flagged as contradictory, enabling developers to identify and address these inconsistencies promptly.
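Under the hood, an NLI model treats the prompt as the premise and the LLM output as the hypothesis, and predicts probabilities for entailment, neutral, and contradiction. The following is a minimal sketch of that scoring step, written directly against the HuggingFace transformers API rather than guardrail's internals; the example texts and the 0.7 threshold are illustrative, and the label order shown is the one published for this checkpoint:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The meeting was moved to Monday."           # e.g. the input prompt
hypothesis = "The meeting will not happen on Monday."  # e.g. the LLM output

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# This checkpoint's labels are ordered (entailment, neutral, contradiction).
probs = torch.softmax(logits, dim=-1).squeeze()
contradiction_score = probs[2].item()

if contradiction_score > 0.7:  # illustrative threshold
    print(f"Flagged as contradictory (score={contradiction_score:.2f})")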

Configuration

Initialize the Contradictions detector with the desired options:

from guardrail.firewall import Firewall
from guardrail.firewall.output_detectors import Contradictions

firewall = Firewall(no_defaults=True)
output_detectors = [Contradictions(threshold=0.7)]

# sanitized_prompt is the already-scanned user prompt; response_text is the raw LLM reply.
sanitized_response, valid_results, risk_score = firewall.scan_output(
    sanitized_prompt, response_text, output_detectors
)
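The returned valid_results and risk_score can then drive your handling logic. A minimal sketch, assuming valid_results maps each detector name to a pass/fail boolean (the exact structure may differ across guardrail versions):

# Assumption: valid_results maps detector names to booleans.
if not all(valid_results.values()):
    print(f"Output flagged as contradictory (risk score: {risk_score})")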

By integrating the Contradictions detector into your Firewall, you can improve the accuracy and reliability of your language model outputs, ensuring a consistent user experience and preventing the dissemination of contradictory or misleading information.