Show HN: Managed MCP Sandbox Environments for RL Training on Tool Use

wirehack Friday, December 12, 2025

Hi HN! We're Klavis AI (https://www.klavis.ai/), and we're launching a managed MCP Sandbox-as-a-Service for RL training on tool use.

If you want a model to learn tool use through RL, you need realistic environments where the model can take actions, you can observe the resulting state, and you can compute a reward. For SaaS tools, this means managing dozens of test accounts, handling OAuth and token refresh, seeding realistic data for each episode, resetting state between runs, and ensuring isolation across concurrent training sessions. Most research teams spend months building this plumbing per integration.

Klavis is a managed sandbox service that handles all of that. You call our API to get an isolated sandbox backed by a real service instance (not a mock), initialize it with whatever starting state you need, let your model interact via MCP, then dump the final state to compute your reward. One more API call resets everything for the next episode.
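To make the lifecycle concrete, here is a minimal Python sketch of one training episode. The endpoint paths, payload shapes, and helper functions below are hypothetical placeholders for illustration, not our documented API:

    import requests

    BASE = "https://api.klavis.ai/v1"  # hypothetical base URL
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

    def run_agent_episode(mcp_url: str) -> None:
        """Placeholder: your model's rollout loop, speaking MCP to the sandbox."""

    def compute_reward(final_state: dict, target_state: dict) -> float:
        """Placeholder: score the dumped state against your target."""
        return float(final_state == target_state)

    # 1. Create an isolated sandbox backed by a real service instance.
    sandbox = requests.post(f"{BASE}/sandboxes", headers=HEADERS,
                            json={"integration": "google_calendar"}).json()
    sandbox_id = sandbox["id"]

    # 2. Seed the starting state for this episode.
    requests.post(f"{BASE}/sandboxes/{sandbox_id}/initialize", headers=HEADERS,
                  json={"events": [{"title": "Standup",
                                    "start": "2025-12-15T09:00:00Z"}]})

    # 3. Let the model act against the sandbox via MCP.
    run_agent_episode(sandbox["mcp_url"])

    # 4. Dump the final state and compute the reward.
    final_state = requests.get(f"{BASE}/sandboxes/{sandbox_id}/dump",
                               headers=HEADERS).json()
    target_state = {"events": [...]}  # whatever end state you expect
    reward = compute_reward(final_state, target_state)

    # 5. Reset everything for the next episode.
    requests.post(f"{BASE}/sandboxes/{sandbox_id}/reset", headers=HEADERS)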

The key thing is that these are real services, not static mocks. When your model creates a calendar event or updates a Salesforce record, that action actually executes against real infrastructure. The state changes are real. This matters because you want training to reflect production behavior exactly.

We currently support 50+ integrations across productivity tools (Google Calendar, Outlook, Slack), CRM (Salesforce, HubSpot), dev tools (GitHub, Jira, Linear), databases (Postgres, Snowflake), and others. We handle the account pooling, auth management, and lifecycle orchestration so researchers can focus on the actual training.

Technically, the workflow is: create a sandbox, call the initialize API with a JSON payload defining your starting state, let the model interact via standard MCP tools, call the dump API to get a typed snapshot of the final state, compare that snapshot against your target for reward calculation, then call reset or delete. We use strict Pydantic schemas for all inputs and outputs, so malformed data gets rejected immediately rather than causing silent failures mid-training.
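As a toy illustration of that strict-validation point, here is the kind of up-front rejection Pydantic v2 gives you. The schema itself is a made-up example, not our actual model:

    from pydantic import BaseModel, ConfigDict, ValidationError

    # Hypothetical schema for illustration. extra="forbid" rejects unknown
    # fields; strict=True disables implicit type coercion.
    class CalendarEvent(BaseModel):
        model_config = ConfigDict(extra="forbid", strict=True)
        title: str
        start: str  # ISO 8601 timestamp

    try:
        # Misspelled field name: rejected at the API boundary, not
        # hours into a training run.
        CalendarEvent(title="Standup", strat="2025-12-15T09:00:00Z")
    except ValidationError as e:
        print(e)  # reports the extra field 'strat' and the missing 'start'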

Here is a quick demo: https://youtu.be/10C18rpCYcA.

We look forward to your comments. Thanks for reading!
