Prompt Injection Detector
Our Prompt Injection Detector provides robust defense against adversarial input manipulations aimed at large language models (LLMs). By promptly identifying and neutralizing these attacks, our detector ensures the LLM operates securely, preventing it from falling victim to injection attacks.
Vulnerability
Injection attacks, particularly within LLM contexts, can prompt the model to execute unintended actions. Attackers typically exploit these vulnerabilities in two main ways:
-
Direct Injection: Involves overwriting system prompts directly.
-
Indirect Injection: Involves altering inputs sourced from external channels.
Info
As specified by the OWASP Top 10 LLM attacks
, this vulnerability is categorized under:
LLM01: Prompt Injection - Manipulating LLMs via crafted inputs can lead to unauthorized access, data breaches, and compromised decision-making.
How it works
Our detector leverages model JasperLS/deberta-v3-base-injection for its operation. However, it's important to note that although the current model is effective in detecting attempts, it may occasionally produce false positives.
Warning
Given this limitation, it is advisable to exercise caution when contemplating its deployment in a production environment.
We are building internal models with prompt injection databases and known attack patterns from research and external data.
Usage
Configuration
from guardrail.firewall.input_detectors import PromptInjection
firewall = Firewall()
input_detectors = [PromptInjection(threshold=0.7)]
sanitized_prompt, valid_results, risk_score = firewall.scan_input(prompt, input_detectors)