guardrails-safety
Protecting AI applications - input/output guards, toxicity detection, PII protection, injection defense, constitutional AI. Use when securing AI systems, preventing misuse, or ensuring compliance.
When & Why to Use This Skill
This Claude skill provides a comprehensive security framework for AI applications, implementing multi-layered defense mechanisms such as input/output validation, toxicity detection, and PII protection. It empowers developers to build trustworthy AI systems by preventing prompt injections, ensuring data privacy, and maintaining alignment with safety principles through Constitutional AI techniques.
Use Cases
- Prompt Injection Defense: Identifying and blocking malicious attempts to override system instructions or hijack the AI's persona.
- PII Redaction and Privacy: Automatically detecting and masking sensitive information like emails, phone numbers, and credit card details to ensure regulatory compliance.
- Content Moderation: Real-time filtering of toxic or harmful content in both user queries and model responses to maintain a safe user environment.
- Hallucination Control: Validating the factuality and citation accuracy of AI-generated outputs to improve reliability in production.
- Constitutional AI Enforcement: Applying a set of ethical principles to critique and revise AI responses, ensuring they align with organizational values and safety standards.
| name | guardrails-safety |
|---|---|
| description | Protecting AI applications - input/output guards, toxicity detection, PII protection, injection defense, constitutional AI. Use when securing AI systems, preventing misuse, or ensuring compliance. |
Guardrails & Safety Skill
Protecting AI applications from misuse.
Input Guardrails
class InputGuard:
def __init__(self):
self.toxicity = load_toxicity_model()
self.pii = PIIDetector()
self.injection = InjectionDetector()
def check(self, text):
result = {"allowed": True, "issues": []}
# Toxicity
if self.toxicity.predict(text) > 0.7:
result["allowed"] = False
result["issues"].append("toxic")
# PII
pii = self.pii.detect(text)
if pii:
result["issues"].append(f"pii: {pii}")
text = self.pii.redact(text)
# Injection
if self.injection.detect(text):
result["allowed"] = False
result["issues"].append("injection")
result["sanitized"] = text
return result
Output Guardrails
class OutputGuard:
def check(self, output, context=None):
result = {"allowed": True, "issues": []}
# Factuality
if context:
if self.fact_checker.check(output, context) < 0.7:
result["issues"].append("hallucination")
# Toxicity
if self.toxicity.predict(output) > 0.5:
result["allowed"] = False
result["issues"].append("toxic")
# Citations
invalid = self.citation_validator.check(output)
if invalid:
result["issues"].append(f"bad_citations: {len(invalid)}")
return result
Injection Detection
class InjectionDetector:
PATTERNS = [
r"ignore (previous|all) instructions",
r"forget (your|all) (instructions|rules)",
r"you are now",
r"new persona",
r"act as",
r"pretend to be",
r"disregard",
]
def detect(self, text):
text_lower = text.lower()
for pattern in self.PATTERNS:
if re.search(pattern, text_lower):
return True
return False
Constitutional AI
class ConstitutionalFilter:
def __init__(self, principles):
self.principles = principles
self.critic = load_model("critic")
self.reviser = load_model("reviser")
def filter(self, response):
for principle in self.principles:
critique = self.critic.generate(f"""
Does this violate: "{principle}"?
Response: {response}
""")
if "violates" in critique.lower():
response = self.reviser.generate(f"""
Rewrite to comply with: "{principle}"
Original: {response}
Critique: {critique}
""")
return response
PRINCIPLES = [
"Do not provide harmful instructions",
"Do not reveal personal information",
"Acknowledge uncertainty",
"Do not fabricate facts",
]
PII Protection
class PIIDetector:
PATTERNS = {
"email": r"\b[\w.-]+@[\w.-]+\.\w+\b",
"phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
"ssn": r"\b\d{3}-\d{2}-\d{4}\b",
"credit_card": r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
}
def detect(self, text):
found = {}
for name, pattern in self.PATTERNS.items():
matches = re.findall(pattern, text)
if matches:
found[name] = matches
return found
def redact(self, text):
for name, pattern in self.PATTERNS.items():
text = re.sub(pattern, f"[{name.upper()}]", text)
return text
Best Practices
- Defense in depth (multiple layers)
- Log all blocked content
- Regular adversarial testing
- Update patterns continuously
- Fail closed (block if uncertain)