Home
Guardrail ML - Alignment toolkit for safeguarding LLMs
Guardrail ML is an alignment toolkit to use LLMs safely and securely. Our firewall scans prompts and LLM behaviors for risks to bring your AI app from prototype to production with confidence.
What is Guardrail?
Our firewall wraps your GenAI apps with a protective layer, safeguarding malicious inputs and filtering model outputs. Our comprehensive toolkit has 20+ out-of-the-box detectors for robust protection of your GenAI apps.
From prompt injections, PII leakage, DoS to ungrounded additions (hallucinations) and harmful language detection, our firewall protects LLMs through a multi-layered defense.
Installation
$ pip install guardrail-ml
Quickstart
Go to app.useguardrail.com to get your API key and set it in your .env
.
🛡️Features
Block CVEs with our extensible firewall that gets stronger the more that it's attacked. Our automatic evaluators log everything for compliance and debugging.
Our default Firewall covers most of the Detectors, except for the ones with * which needs to be further customized or requires an API_KEY
.
- Input Detectors
- Anonymize
- DoS Tokens
- Malware URL
- Prompt Injections
- Secrets
- Stop Input Substrings
- Toxicity *
perspective_api_key
required - Harmful Moderation
- Text Quality
- Coding Language *
coding language
required - Regex *
patterns
required
- Output Detectors
- Factual Consistency
- Deanonymize
- Factuality Tool *
openai_api_key
required - Sensitive PII
- Stop Output Substrings
- Toxicity *
perspective_api_key
required - Harmful Moderation
- Text Quality
- Regex *
patterns
required - Coding Language *
coding language
required - Relevance
Multi-layered defense
- Heuristic Detectors: detect and filter potentially malicious input before reaching LLMs
- LLM-based detectors: use specialized LLMs to analyze inputs and outputs to identify potential attacks and safety risks
- Vector Database: store previous attacks in a vector database to recognize and prevent similar attacks (coming soon)
- Red Teaming: Adversarial Tests: auto-generate adversarial attacks to fortify defenses (coming soon)