Tool Calling is the Real Safety Feature
When an AI detects suicide risk, it shouldn’t improvise. It should perform a validated clinical assessment. In clinical care, that means something like the Columbia Suicide Severity Rating Scale. Humans don’t make up the questions each time. Neither should a model.
We need probabilistic models to run deterministic, predefined assessments reliably. Some research is pushing LLMs in that direction, but it's still early and unproven in production. The answer today is tool calling. When the model hits a risk flag, it can invoke a structured, evidence-based protocol: the same one a clinician would use. That makes its behavior predictable and testable.
Tool calling isn’t just for assessment, it’s how LLMs can take real, safe action. A model could directly schedule a session with a human coach or clinician, trigger alerts within an EHR so the care team can reach out, or even escalate to emergency services when criteria are met. These aren’t theoretical ideas; they’re how AI can integrate into existing care systems while keeping humans firmly in the loop.
Most evaluation frameworks focus on whether a model says the right thing when it encounters suicidal ideation. That’s a start, but it’s not enough. The evaluation needs to test what happens next: how the model uses tool calls to take action, route information, or trigger the right workflow. That’s what actually determines safety.
We’re building toward this kind of evaluation now. But it’s not something any one team or company can define alone. We need researchers, builders, clinicians, and other companies working together to figure out what reliable, safe tool calling should look like in mental health. And it needs to happen in the open. Shared standards are the only way to make sure “safety” means the same thing for everyone.

