Skip to content

Prompt Injection Detector

Our Prompt Injection Detector provides robust defense against adversarial input manipulations aimed at large language models (LLMs). By promptly identifying and neutralizing these attacks, our detector ensures the LLM operates securely, preventing it from falling victim to injection attacks.

Vulnerability

Injection attacks, particularly within LLM contexts, can prompt the model to execute unintended actions. Attackers typically exploit these vulnerabilities in two main ways:

  • Direct Injection: Involves overwriting system prompts directly.

  • Indirect Injection: Involves altering inputs sourced from external channels.

Info

As specified by the OWASP Top 10 LLM attacks, this vulnerability is categorized under:

LLM01: Prompt Injection - Manipulating LLMs via crafted inputs can lead to unauthorized access, data breaches, and compromised decision-making.

How it works

Our detector leverages model JasperLS/deberta-v3-base-injection for its operation. However, it's important to note that although the current model is effective in detecting attempts, it may occasionally produce false positives.

Warning

Given this limitation, it is advisable to exercise caution when contemplating its deployment in a production environment.

We are building internal models with prompt injection databases and known attack patterns from research and external data.

Usage

Configuration

from guardrail.firewall.input_detectors import PromptInjection

firewall = Firewall()
input_detectors = [PromptInjection(threshold=0.7)]

sanitized_prompt, valid_results, risk_score = firewall.scan_input(prompt, input_detectors)