01What guardrails do

Guardrails prevent AI systems from producing outputs or taking actions that violate defined boundaries. They may prevent an AI from making statements the organisation has not approved, from providing advice in areas outside its authorised scope, from accessing data it is not permitted to use, or from taking actions that require human approval.

In a simple example, a customer service AI with appropriate guardrails will refuse to provide legal or medical advice, will not make pricing commitments outside pre-approved ranges, and will escalate to a human when it encounters queries outside its authorised scope. Without these guardrails, the AI may provide legally problematic advice or make commitments the organisation cannot or does not intend to honour.

02Types of guardrails

Guardrails operate at several levels.

Model-level guardrails are built into the AI model itself during training. Constitutional AI is an example: Anthropic trains Claude to reason about its outputs in terms of values that constitute guardrails at a fundamental level. Model-level guardrails are the most robust because they cannot be easily overridden by prompts.

System prompt guardrails are instructions in the system prompt that tell the AI what it should and should not do. They are effective for shaping AI behaviour for specific deployment contexts but can potentially be overridden by sufficiently sophisticated inputs (prompt injection).

External guardrails are filters applied to AI inputs and outputs before they reach the model and before the output reaches the user. Azure OpenAI Content Safety is a service that adds external guardrails to AI deployments: detecting and blocking inappropriate content in both directions.

Workflow guardrails are process controls that require human review before certain AI-generated outputs are acted on. They are not technical controls on the AI itself but governance controls on what happens with AI output.

03Governance implications

For boards, the key governance questions about AI guardrails are: are guardrails in place for all production AI deployments? Are those guardrails appropriate to the risk level of each deployment? Are the guardrails tested to confirm they are effective? And are they monitored to detect when they are being bypassed?

Guardrails that have never been adversarially tested may pass normal use without incident but fail when the AI encounters users who are actively attempting to bypass them. Testing includes attempting to bypass guardrails deliberately, which should be part of AI deployment validation.

Key Takeaways

1.AI guardrails are technical controls that constrain AI behaviour to comply with policy, regulation, and brand standards.
2.Guardrails operate at multiple levels: model-level (trained in), system prompt (instructed), external filters (Azure Content Safety), and workflow controls (human review).
3.Model-level guardrails are most robust; system prompt guardrails can potentially be overridden by prompt injection; external filters add an additional layer of protection.
4.Guardrails must be adversarially tested to confirm they hold under deliberate bypass attempts, not just normal use.
5.Board governance of AI should include confirmation that appropriate guardrails are in place, tested, and monitored for each production AI deployment.

References & Further Reading

[1]
Azure AI Content SafetyMicrosoft

Want to discuss this with an expert?

Book a strategy call to explore how these insights apply to your organisation.

Book a Strategy Call

What Is an AI Guardrail? The Governance Layer That Keeps AI Within Business Policy

01What guardrails do

02Types of guardrails

03Governance implications

Key Takeaways

References & Further Reading