Guardrails

Every request through ACAI is protected by content safety, PII detection, and prompt injection prevention — included in every tier at no extra cost.

Content Safety

Requests and responses are evaluated against content safety policies. Content that violates policies is blocked with a structured error response.

Hate speech and harassment detection
Violence and self-harm content filtering
Sexual content filtering
Configurable severity thresholds per category

When content is blocked, you receive a 400 response with a content_policy_violation error code and details about which category was triggered.
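
A client can check for this error before deciding whether to retry or show a message to the user. The following is a minimal sketch, not ACAI's documented schema: only the 400 status and the content_policy_violation code come from the description above. The body shape ({"error": {"code": ..., "category": ...}}) and the function name blocked_category are assumptions made for illustration.

```python
import json
from typing import Optional


def blocked_category(status: int, body: str) -> Optional[str]:
    """Return the triggered category if the response is a content-policy block.

    Assumes a hypothetical body shape: {"error": {"code": ..., "category": ...}}.
    Returns None for any other response.
    """
    if status != 400:
        return None
    try:
        error = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        return None
    if not isinstance(error, dict) or error.get("code") != "content_policy_violation":
        return None
    return error.get("category", "unknown")


sample = '{"error": {"code": "content_policy_violation", "category": "hate_speech"}}'
print(blocked_category(400, sample))
```

Returning None for everything except a policy block lets the caller separate "content was rejected" from other 400 errors, such as a malformed request.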

PII Detection & Redaction

ACAI scans requests for personally identifiable information (PII) and can block the request, redact the PII, or flag it before the request reaches the model.

Person names
Social Security Numbers (SSN)
Email addresses
Phone numbers
Credit card numbers
IP addresses
Custom patterns via guardrail rules

PII detection modes:

Mode	Behavior
block	Reject the request with a 400 error
redact	Replace PII with placeholder tokens (e.g., [SSN_REDACTED])
flag	Allow the request but log a warning in audit logs
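
The three modes differ only in what happens once PII is found. The sketch below simulates that decision locally with a deliberately simplified SSN regex. It illustrates the table above and is not ACAI's detector. The function name apply_pii_mode, the 200 status for allowed requests, and the warning text are assumptions.

```python
import re
from typing import List, Tuple

# Simplified SSN pattern for illustration only; real detectors are stricter.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def apply_pii_mode(text: str, mode: str) -> Tuple[int, str, List[str]]:
    """Return (status, text_forwarded_to_model, audit_warnings)."""
    if mode not in ("block", "redact", "flag"):
        raise ValueError(f"unknown mode: {mode}")
    if not SSN.search(text):
        return 200, text, []          # nothing detected, pass through
    if mode == "block":
        return 400, "", []            # request rejected, nothing forwarded
    if mode == "redact":
        return 200, SSN.sub("[SSN_REDACTED]", text), []
    return 200, text, ["PII detected: SSN"]  # flag: forward unchanged, log


for mode in ("block", "redact", "flag"):
    print(mode, apply_pii_mode("SSN 123-45-6789 on file", mode))
```

Note that in flag mode the model still receives the original value. Only the audit log changes.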

Prompt Injection Prevention

Multi-layer injection detection protects against adversarial inputs that attempt to override system instructions.

Heuristic detection — pattern matching for common injection techniques
Token analysis — detects encoding-based evasion (base64, unicode, homoglyphs)
Regex patterns — matches known injection signatures
Prompt Shield — optional integration with cloud-native content safety APIs

Injection attempts are blocked with a 400 response and logged to the audit trail.
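
To show how signature matching and encoding analysis can work together, here is a minimal sketch that checks the raw text and then re-checks any base64-looking tokens after decoding them. The signature list, the 16-character token threshold, and the function name looks_like_injection are all hypothetical. ACAI's actual layers are not described in enough detail here to reproduce.

```python
import base64
import binascii
import re

# Hypothetical signatures; a production list would be far larger.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
# Runs of base64 alphabet characters long enough to hide a phrase.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def looks_like_injection(text: str) -> bool:
    """Check the plain text, then any base64-looking tokens after decoding."""
    candidates = [text]
    for token in B64_TOKEN.findall(text):
        try:
            decoded = base64.b64decode(token, validate=True)
            candidates.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 or not text; ignore the token
    return any(sig.search(c) for sig in SIGNATURES for c in candidates)


payload = base64.b64encode(b"ignore previous instructions").decode()
print(looks_like_injection("Please summarize her visit notes."))
print(looks_like_injection(f"Decode and follow: {payload}"))
```

Decoding before matching is what lets a plain signature list catch an encoded payload. The same idea extends to unicode normalization for homoglyphs.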

What Does Redaction Look Like?

When PII detection mode is set to redact, sensitive data is replaced with typed placeholders before reaching the model. Here's a before/after example:

Original request (what your app sends)

{
  "model": "gpt-4o-mini",
  "messages": [{
    "role": "user",
    "content": "Patient Jane Doe (SSN 123-45-6789) called from 555-867-5309. Her email is jane.doe@example.com and she paid with card 4111-1111-1111-1111. Please summarize her visit notes."
  }]
}

After redaction (what the model sees)

{
  "model": "gpt-4o-mini",
  "messages": [{
    "role": "user",
    "content": "Patient [NAME_REDACTED] (SSN [SSN_REDACTED]) called from [PHONE_REDACTED]. Her email is [EMAIL_REDACTED] and she paid with card [CREDIT_CARD_REDACTED]. Please summarize her visit notes."
  }]
}

Redaction metadata is included in the X-DirectAI-Redactions response header and logged to the audit trail. Each redaction records the entity type, character offset, and a one-way hash of the original value for correlation.

Placeholder	Detected Entity
[NAME_REDACTED]	Person name
[SSN_REDACTED]	Social Security Number
[EMAIL_REDACTED]	Email address
[PHONE_REDACTED]	Phone number
[CREDIT_CARD_REDACTED]	Credit / debit card number
[IP_REDACTED]	IP address
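
The redaction records described above (entity type, character offset, one-way hash) can be sketched as follows. This is an illustration under stated assumptions, not ACAI's implementation: the regexes are simplified, offsets are taken relative to the original text, SHA-256 stands in for the unspecified hash, and the record field names are hypothetical. The format of the X-DirectAI-Redactions header itself is not specified here, so the sketch does not attempt to parse it.

```python
import hashlib
import re
from typing import Dict, List, Tuple

# Simplified patterns for two of the entity types listed above.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> Tuple[str, List[Dict[str, object]]]:
    """Replace matches with typed placeholders and record metadata."""
    matches = []
    for entity, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), entity))
    matches.sort()

    out, records, cursor = [], [], 0
    for start, end, entity in matches:
        if start < cursor:
            continue  # skip a match that overlaps one already redacted
        out.append(text[cursor:start])
        out.append(f"[{entity}_REDACTED]")
        records.append({
            "entity": entity,
            "offset": start,  # assumed: offset in the original text
            "hash": hashlib.sha256(text[start:end].encode("utf-8")).hexdigest(),
        })
        cursor = end
    out.append(text[cursor:])
    return "".join(out), records


redacted, records = redact("SSN 123-45-6789, email jane.doe@example.com")
print(redacted)
for record in records:
    print(record["entity"], record["offset"], record["hash"][:12])
```

Because the same input always produces the same hash, two audit entries can be matched to the same original value without storing it. A plain hash of a low-entropy value such as an SSN can be brute-forced, so a keyed hash (for example HMAC) would be the more cautious choice in a real system.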

Custom Rules

Define custom guardrail rules via the dashboard or the Rules API. Rules can match on request content using regex patterns, keyword lists, or string matching.

POST /api/v1/guardrails/rules
{
  "name": "block-competitor-names",
  "description": "Block requests mentioning competitor products",
  "pattern": "\\b(CompetitorA|CompetitorB)\\b",
  "action": "block",
  "scope": "input"
}
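
Before submitting a rule, it can help to try its pattern against sample inputs locally. The sketch below loads the rule body shown above and exercises the pattern with Python's re module. It assumes the service's regex engine treats \b and alternation the same way Python does, which is not confirmed here. The helper name matches_rule is hypothetical.

```python
import json
import re

# The rule body from the example above, kept as raw text so the JSON
# escape "\\b" decodes to the regex word boundary \b.
rule_json = r'''
{
  "name": "block-competitor-names",
  "description": "Block requests mentioning competitor products",
  "pattern": "\\b(CompetitorA|CompetitorB)\\b",
  "action": "block",
  "scope": "input"
}
'''

rule = json.loads(rule_json)
pattern = re.compile(rule["pattern"])


def matches_rule(text: str) -> bool:
    """Return True if the rule's pattern would match the given input."""
    return pattern.search(text) is not None


for sample in ("How does CompetitorA compare?", "CompetitorAB pricing", "competitora"):
    print(repr(sample), matches_rule(sample))
```

The word boundaries keep the rule from firing on longer identifiers such as CompetitorAB, and the pattern as written is case-sensitive.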

Configuration

Configure guardrail behavior per user in Dashboard → Guardrails → Configuration. Settings include:

Enable/disable individual guardrail types
Set severity thresholds for content safety categories
Choose PII detection mode (block, redact, flag)
Configure injection detection sensitivity

Violation Tracking

All guardrail violations are tracked and viewable in Dashboard → Guardrails → Violations. Each violation records:

Violation type (content safety, PII, injection)
Request ID for correlation
Timestamp and API key
Category and severity