Salesforce Agentforce 360: What Enterprise Buyers Should Ask
Agentforce 360 brings observability, MCP, and FedRAMP High to Salesforce's agent platform. Here's what the public record shows, what it doesn't, and how to pilot it.
By Marcus Wong, Insightful AI Desk
On 23 June 2025, Salesforce announced Agentforce 3, framing the release around two themes that had become the most common objections from enterprise buyers in the prior nine months: visibility into what agents were doing in production, and a standard way to wire those agents to systems outside Salesforce. Roughly four months later, on 13 October 2025, the company rolled the platform forward again with Agentforce 360, which unified the agent tooling, Slack, Tableau, Data 360 and the underlying Atlas reasoning engine into a single branded surface. Read together, the two announcements describe a deliberate pivot from "we have agents" to "we have an operating system you can run agents on." The interesting question for enterprise buyers is no longer whether Salesforce ships agentic features. It is whether the production evidence has caught up with the marketing surface, and whether the contract you sign in May 2026 will hold up against the operational reality you discover in August.
The honest answer, after reading the primary materials and the publicly disclosed customer outcomes, is: in some places yes, in some places not yet, and in a few places the data simply does not exist in the public record. That gap is itself the story.
What concretely shipped
The Agentforce 3 release in June introduced what Salesforce calls a Command Center — an observability layer that exposes agent traces, tool calls, latency, escalation rates and topic-level performance to administrators. Built on OpenTelemetry, the Command Center is the first piece of the platform that lets a buyer answer "what did the agent actually do last Tuesday at 14:32?" without instrumenting their own logging stack. It also added native support for the Model Context Protocol (MCP), the open specification originally proposed by Anthropic that lets agents discover and call external tools through a standardised server interface rather than bespoke connectors. The October Agentforce 360 release then extended Atlas, layered in a low-latency voice modality, brought Slack into the agentic surface as a first-class environment, and made the entire platform FedRAMP High authorised for U.S. federal use.

Atlas as a configurable, multi-model controller
The piece worth dwelling on is Atlas. Salesforce describes Atlas as a reasoning engine rather than a single model — an orchestration layer that decomposes a user request into sub-tasks, retrieves grounding data, plans the tool calls, executes, evaluates the output, and iterates. The architecture is explicitly ensemble: Atlas can route sub-tasks to OpenAI's GPT family, to Anthropic's Claude models on Amazon Bedrock, to Google Gemini, or to Salesforce's own xGen models, depending on the workload. In the company's own framing the engine implements an inference-time "System 2" loop in the Kahneman sense — slow, deliberate, evaluative reasoning over the faster pattern-matching of a base LLM — with the explicit goal of lowering hallucination rates on enterprise tasks.
The Agentforce 360 update made Atlas configurable, which is the change buyers should care about most. Administrators can now bias the engine toward deterministic, rule-bound execution for workflows where consistency matters more than creativity (claims triage, refund processing, compliance lookups) and toward more open-ended reasoning where it does not (sales coaching, content drafting, product discovery). That dial is the difference between an agent you can defend in front of a regulator and an agent you can defend in front of a marketing audit. It is also the dial that, in practice, will determine how a customer's bill scales — more aggressive reasoning loops mean more model calls and higher per-conversation cost.
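To make that trade-off concrete, the sketch below models how reasoning depth drives per-conversation model calls, and therefore cost. Every name and number here is invented for illustration; Salesforce exposes the Atlas dial through admin configuration, not through a public Python API.

```python
from dataclasses import dataclass

# Hypothetical knobs for illustration only; Agentforce exposes this dial through
# admin settings, not a Python API. The point is the cost shape, not the numbers.
@dataclass
class ReasoningPolicy:
    mode: str            # "deterministic" or "open_ended"
    max_iterations: int  # how many plan/execute/evaluate loops the engine may run
    temperature: float   # sampling freedom for the underlying model calls

DETERMINISTIC = ReasoningPolicy("deterministic", max_iterations=1, temperature=0.0)
OPEN_ENDED    = ReasoningPolicy("open_ended",    max_iterations=4, temperature=0.7)

def estimated_model_calls(policy: ReasoningPolicy, tool_calls_per_iteration: int = 2) -> int:
    """Rough call count per conversation: each loop plans, calls tools, then evaluates."""
    return policy.max_iterations * (1 + tool_calls_per_iteration + 1)

# A claims-triage agent pinned to the deterministic end of the dial makes ~4 calls;
# the same request under open-ended reasoning can make ~16, with a bill to match.
print(estimated_model_calls(DETERMINISTIC), estimated_model_calls(OPEN_ENDED))
```

The exact figures are placeholders; the shape is the lesson. The open-ended end of the dial multiplies model calls per conversation, which is precisely what a consumption-priced contract will meter.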
MCP and the AgentExchange topology
The MCP story is structurally more important than the headline suggests. Before MCP support, every external integration was a custom Apex action, a flow, or a partner-built connector with its own auth model and its own breakage profile. With MCP, an Agentforce agent calls a remote MCP server — hosted by the tool vendor — and the server advertises its tools, schemas, and auth requirements in a standard handshake. According to Salesforce, more than 30 partners shipped MCP servers on the expanded AgentExchange at launch, with three early integrations worth naming:
- Box. An Agentforce service agent can call Box's MCP server to retrieve a signed NDA from a customer's content cloud, summarise the salient terms, and write the summary back into a Slack thread for the account team — without the customer first copying the document into Salesforce.
- PayPal. Through PayPal's MCP server, agents can list products, place orders, take payments, dispute claims, track shipments, manage subscriptions and issue refunds. This is the closest the platform comes to "agentic commerce" as a wired-in primitive rather than a demo.
- WRITER. Agentforce can hand a generation task — long-form content, knowledge retrieval, or a regulated compliance check — to WRITER's enterprise agents through the same MCP transport, then continue the workflow inside Salesforce.
The architectural payoff is that the integration surface is now versioned and inspectable. The risk is that every MCP server you wire in becomes part of your agent's blast radius. We will return to that point.
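For readers who have not worked with MCP directly, the handshake is easier to reason about in code. The sketch below is a compressed illustration only: it skips the protocol's initialize and session-negotiation steps, and the server URL, auth token and tool name are placeholders, though the JSON-RPC method names (tools/list, tools/call) come from the MCP specification itself.

```python
import requests

# Hypothetical remote MCP server and token; only the JSON-RPC method names
# follow the MCP specification, everything else is a placeholder.
MCP_URL = "https://mcp.example-partner.com/mcp"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def rpc(method: str, params: dict | None = None, request_id: int = 1) -> dict:
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or {}}
    return requests.post(MCP_URL, json=payload, headers=HEADERS, timeout=30).json()

# Discovery: the server advertises its tools and their input schemas.
tools = rpc("tools/list")["result"]["tools"]
for tool in tools:
    print(tool["name"], "-", tool.get("description", ""))

# Invocation: the agent (or its orchestrator) calls one advertised tool by name.
result = rpc("tools/call",
             {"name": "search_documents", "arguments": {"query": "signed NDA"}},
             request_id=2)
```

The discovery step is what makes the surface "versioned and inspectable": the tools and their schemas are data you can log, diff and review, rather than behaviour buried inside a bespoke connector.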
FedRAMP High and what it actually unlocks
The FedRAMP High authorisation matters in a way that is easy to underweight from the commercial side. High is the top of the three FedRAMP impact levels, which means Agentforce, Data Cloud, Marketing Cloud and Tableau Next are now cleared for U.S. federal data where the loss of confidentiality, integrity, or availability would have a "severe or catastrophic" impact — law enforcement, certain health and benefits workloads, critical infrastructure. The authorisation also means agencies can buy through existing vehicles such as GSA Schedule, NASA SEWP and AWS Marketplace without layering additional security assessments on top. For the broader market, FedRAMP High functions as a credible third-party security baseline: a private-sector buyer in a regulated industry can use the assessment as a reference point even if they are not procuring under federal rules. It is the closest thing the agent market currently has to an external security floor.
The customers who have publicly committed numbers
Public claims about agent performance can usually be sorted into three buckets: aggregate platform statistics, named-customer percentages, and named-customer absolute numbers. Salesforce has supplied examples of all three, with varying density.
On the aggregate side, the June press release reported that Agentforce had handled "more than one million customer conversations" on Salesforce's own Help Portal, with a resolution rate of 84 percent on the conversations it took. That is a single internal deployment, which is worth saying out loud — Salesforce running Salesforce is a useful proof point but not a generalisable one. The denominator (how many conversations Agentforce was eligible to take versus how many were escalated immediately to a human) is not disclosed.
The named-customer disclosures are more revealing. OpenTable said that within three weeks of launch its diner-facing agent was handling 73 percent of restaurant web queries, and within a few weeks the agent had taken "tens of thousands" of conversations that would otherwise have gone to human support. Publisher Wiley reported a more than 40 percent increase in case resolution after going live, outperforming its previous bot. Pandora Jewelry has deployed an Agentforce service agent named Gemma — expected to deflect around 30 percent of support calls — and is separately building a personal-shopper agent for jewelry selection ahead of a wider rollout. Numerical outcomes on the personal-shopper agent have not yet been disclosed.
In travel and hospitality, Heathrow Airport committed to using Agentforce against its knowledge base, flight-info APIs and Data Cloud passenger records to deliver 24/7 status, wayfinding and amenity responses with what it described as 95 percent accuracy. Air India, in an April 2025 expansion of its Salesforce relationship, became one of the first airlines to take Agentforce into production for refunds — collapsing a "several days" process into "a few hours." Staffing group Adecco described agents that screen resumes, reply to applicants with feedback and alternate role suggestions, and schedule interviews so that human recruiters spend more time on direct candidate relationships.
Then there is the Williams-Sonoma deployment, which is worth tracing across both announcements. The retailer surfaced in earlier Salesforce material as an Agentforce evaluator. In October 2025 it formally committed to Agentforce 360 across its full brand portfolio — Williams Sonoma, Williams Sonoma Home, West Elm, Pottery Barn, Pottery Barn Kids, Pottery Barn Teen, Rejuvenation, Mark & Graham and GreenRow — and disclosed an internal target of autonomously resolving more than 60 percent of chat inquiries. It is also building a culinary agent named "Olive" on top of Data 360, designed to plan menus, discover products and walk a customer through hosting events. The Williams-Sonoma case is the clearest example of an enterprise moving from pilot disclosure to portfolio commitment in a single calendar year.

Why named-customer numbers vary so much in depth
The variation between "tens of thousands of conversations" (OpenTable), "more than 40 percent increase in resolution" (Wiley), "95 percent accuracy" (Heathrow), and "more than 60 percent autonomous resolution target" (Williams-Sonoma) is not random. Each of those framings is a deliberate choice by the customer's communications team about how much operational reality to expose to competitors, regulators, employees and investors. A precise resolution percentage tells a competitor how many human agents you have probably retired. A precise CSAT delta tells a regulator what your prior baseline was. A precise dollar figure tells an investor whether the unit economics are working. Buyers reading these disclosures should treat them as the upper bound of what a customer was comfortable saying in public — the underlying operating data is almost always less flattering, not because the technology is failing, but because edge cases, model regressions and seasonality always exist in real workloads.
What the public record still does not show
The disclosures above are the most concrete the market has. They still leave three categories of information on the table.
The first is unit economics. None of the named customers has published the per-conversation cost of an Agentforce-handled case versus the fully-loaded cost of the human-handled equivalent. Salesforce's pricing for Agentforce moved during 2025 from a flat per-conversation model toward consumption-based pricing tied to "Flex Credits" — a shift that makes total cost of ownership materially harder to forecast for a buyer planning a multi-year deployment. Without unit economics in the public record, the resolution percentages are difficult to translate into operating-margin impact.
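In the absence of published unit economics, buyers can at least build their own forecast. The sketch below is illustrative only: the credit price, the credits consumed per conversation shape and the traffic mix are all placeholder values that would need to come from your own contract and your own historical volumes.

```python
# Illustrative cost model only; replace every constant with figures from your contract.
CREDIT_PRICE_USD = 0.10  # hypothetical price per Flex Credit

# Hypothetical credits consumed per conversation "shape"
SHAPES = {
    "one_step_lookup":     2,
    "multi_step_workflow": 9,
    "escalated_to_human":  4,   # the agent still burns credits before handing off
}

def blended_cost_per_conversation(mix: dict[str, float]) -> float:
    """Weighted per-conversation cost for a projected traffic mix (weights sum to 1)."""
    return sum(SHAPES[shape] * weight for shape, weight in mix.items()) * CREDIT_PRICE_USD

projected_mix = {"one_step_lookup": 0.55, "multi_step_workflow": 0.30, "escalated_to_human": 0.15}
print(round(blended_cost_per_conversation(projected_mix), 3))  # ~0.44 USD per conversation here
```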
The second is quality at the tail. Aggregate resolution rates and accuracy scores describe the modal case. They do not describe the distribution of failures: how often the agent gives a confidently wrong answer, how often it hands a customer off after enough turns that the customer is already annoyed, how often it escalates correctly but with insufficient context for the human to recover. These are the metrics that determine whether an agent is a quiet productivity gain or a brand-reputation hazard, and they are nearly absent from public materials across the agentic-AI industry — Salesforce is not alone in this.
The third is longitudinal performance. Most named-customer disclosures cover weeks-to-quarters, not years. Whether resolution rates hold up after the first model migration, the first product-catalogue change, the first prompt update from a partner MCP server, or the first regulatory disclosure obligation is something the market will only learn in 2026 and 2027.

What the third-party evaluators currently show
The independent-analyst picture is informative but partial. Forrester named Salesforce a Leader in the Customer Service Solutions Wave for Q1 2026, citing enterprise scale and Agentforce Service momentum, while explicitly flagging complex pricing and slower-than-expected value realisation on the agent layer as cautions. Salesforce was also positioned as a Leader in the 2026 Forrester Wave for Revenue Marketing Platforms (B2B) and recognised across Forrester's CRM and Digital Experience Platform evaluations. Gartner's 2025 cycle placed Salesforce as a Leader in the Magic Quadrant for Customer Data Platforms, "highest in Ability to Execute and furthest in Completeness of Vision," and Gartner has separately predicted that 40 percent of enterprise apps will feature task-specific AI agents by 2026, up from less than 5 percent in 2025 — a market-shaping projection that benefits any incumbent CRM with an agent layer. What the evaluations have not yet produced is a head-to-head, like-for-like benchmark of agent quality across vendors on enterprise tasks. Independent academic benchmarks — the agentic equivalent of SWE-bench or τ-bench — are emerging but are not yet authoritative for procurement decisions, and IDC's vendor assessments in the space remain focused on capability matrices rather than measured outcomes.
The security and governance evidence gap
The most consequential gap is on the security side. In September 2025, Noma Security publicly disclosed "ForcedLeak," an indirect prompt injection chain in Agentforce with a CVSS score of 9.4. The vulnerability used a 42,000-character Web-to-Lead description field as the payload vector and exploited an expired allow-listed domain in the Content Security Policy that the researchers were able to purchase for five US dollars. Salesforce patched the issue, enforced trusted-URL controls for Agentforce and Einstein, and the relevant attack path is now closed. A parallel disclosure, "PipeLeak" from Capsule Security, followed in April 2026 — roughly seven months later. Both incidents are useful in the same way the early-2000s SQL-injection disclosures were useful: they convert an abstract risk into a measured one. They also reveal the absence the industry should be filling. There is no standard public report — comparable to a SOC 2 Type II, or to the cloud providers' annual penetration-test summaries — that tells a buyer how a given agent platform performs under red-team prompt-injection conditions, how often it leaks data through tool calls, or how often a jailbreak via an MCP server reaches grounded customer data. Until that evidence exists, every buyer's security review is, in effect, doing the work that the market has not yet standardised.
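ForcedLeak also points at the kind of compensating controls a deployer can add on their own side of the boundary. The sketch below is not Salesforce's patch and not a complete defence; it simply illustrates the two controls the disclosure highlights, a length cap on inbound free-text fields and an allowlist check on any URLs they contain, with hypothetical limits and hosts.

```python
import re
from urllib.parse import urlparse

# Illustrative deployer-side screening of inbound free-text fields before they
# reach an agent's context window. Limits and hosts are hypothetical.
MAX_FIELD_LENGTH = 2_000                                  # far below the 42,000 chars used in the exploit
ALLOWED_LINK_HOSTS = {"example.com", "help.example.com"}  # hypothetical trusted hosts

URL_PATTERN = re.compile(r"https?://\S+")

def screen_inbound_field(text: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a lead or case field before agent ingestion."""
    if len(text) > MAX_FIELD_LENGTH:
        return False, "field exceeds length cap"
    for url in URL_PATTERN.findall(text):
        host = urlparse(url).hostname or ""
        if host not in ALLOWED_LINK_HOSTS:
            return False, f"untrusted link host: {host}"
    return True, "ok"
```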
What enterprise buyers should be asking
For a CIO or a head of customer operations evaluating Agentforce 360 — or, candidly, any vendor's agentic platform — the most useful exercise right now is a structured set of questions designed to surface evidence rather than narrative. We would propose seven:
- What is the per-conversation cost at our projected volume, and how does it scale with reasoning depth? Because Agentforce moved to Flex Credits, a multi-step plan-and-execute loop can cost several times as much as a one-step lookup. A good vendor answer models three or four representative conversation shapes against your projected mix and gives you a credible per-unit cost band, not a single headline number. If the only answer is "it depends on your usage," that is a signal to ask harder.
- Show us the trace, end-to-end, for a representative failed conversation. Resolution rate is meaningless without an audit trail you can defend. The Command Center is the right primitive — confirm that it surfaces tool calls, model selections, retrieval grounding, and escalation reasons at the conversation level, exportable to your SIEM. A good demo lets you click into the worst-performing topic of the past 24 hours and see the precise reasoning chain on three or four real (anonymised) cases.
- Which MCP servers will be in our blast radius, and who owns the patch SLA on each? Box, PayPal and WRITER are public; your final list will be longer. Every MCP server you wire in is potentially a privileged caller into your agent. A good answer maps each server to a named owner, a documented auth model, a tested rollback path, and an incident-notification clause. If a partner cannot commit to a patch SLA in writing, treat it as a trust-store decision, not an integration decision.
- What does the human-in-the-loop workflow look like for the top five most consequential decision types? Refunds above a threshold. Account closures. Policy lookups for regulated products. Hiring decisions. Health or safety advisories. A good answer specifies the exact gate, the SLA for human review, the fallback if the human queue is saturated, and how the platform learns from override patterns without quietly raising the auto-resolution threshold.
- What is the documented behaviour of Atlas under prompt-injection conditions? Ask for the post-ForcedLeak hardening report in writing, including the trusted-URL enforcement, the input-length controls on lead and case fields, and the red-team frequency. A good answer references the MCP support model for inbound calls, a named internal red-team cadence, and a clear protocol for customer notification when a vulnerability is disclosed.
- What is our exit cost? Specifically: if we decide in 18 months to move Atlas workloads to a different orchestration layer while keeping Data 360 as the system of record, what is the contractual and technical cost of doing so? A good answer includes an export specification for agent definitions and traces, a transition period with parallel-run support, and a price-protection clause if you commit to additional consumption.
- Where are the failure modes in regulated workflows? If we deploy in financial services, healthcare or HR, what guardrails apply, and what is your evidence base — including FedRAMP High control mappings — for those guardrails holding in production? A good answer pairs the control framework with two or three named regulated-industry customers willing to take a confidential reference call.
Red flags that should kill the procurement
Some answers should end the conversation rather than continue it. If the vendor refuses to model unit economics against your actual conversation mix, you do not have a partnership; you have a meter. If you cannot find a single named customer in your vertical at your scale that has gone past pilot, you are funding the reference case rather than benefiting from one. If observability is closed-loop — visible inside the vendor's console but not exportable in OpenTelemetry or an equivalent open format to your SIEM, your data lake, or your incident-response tooling — you have bought a system you cannot supervise. If the security disclosures are described as "confidential to existing customers" rather than summarised in a written attestation, you have bought a system you cannot defend. And if the contract does not contain a price-protection clause against unilateral pricing model changes, you have effectively signed a variable rate against a market that is still finding its floor.
How to run a defensible 90-day pilot
The procurement questions matter, but the pilot is where the real evidence is created. A 90-day Agentforce pilot — or any agentic-platform pilot — should be designed from the start to produce data you can defend to a regulator, a board and a successor team. The following is a builder's playbook: concrete, opinionated, and meant to be useful on Monday morning.

Day-one instrumentation
Before a single production conversation runs, instrument the platform end-to-end. Configure the Command Center's OpenTelemetry export into your SIEM (Splunk, Sentinel, Chronicle — whichever you already operate). Mirror the same traces into your data lake for longitudinal analysis. Tag every conversation with a stable session identifier that joins to your CRM record and, where applicable, to your billing or claims system, so that downstream business outcomes can be attributed back to the conversation that produced them. If you cannot trace an Agentforce conversation from intent to resolution to financial outcome in your own warehouse, you are flying on the vendor's instruments only.
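A minimal version of that tagging, written against the open-source OpenTelemetry Python SDK, looks like the sketch below. The collector endpoint and attribute names are our own convention rather than a Salesforce schema; the point is a stable join key that travels with every conversation into the SIEM and the warehouse.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Hypothetical collector endpoint; the attribute names below are our own
# convention, not a Salesforce or Command Center schema.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel-collector.internal/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-pilot")

def record_conversation(session_id: str, crm_record_id: str, billing_case_id: str | None) -> None:
    """Emit one span per conversation, keyed so SIEM, data lake and CRM can be joined later."""
    with tracer.start_as_current_span("agentforce.conversation") as span:
        span.set_attribute("conversation.session_id", session_id)
        span.set_attribute("crm.record_id", crm_record_id)
        if billing_case_id:
            span.set_attribute("billing.case_id", billing_case_id)
```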
A measurement plan written before go-live
Write the measurement plan before the first agent handles a real customer. Define what "resolved" means in operational, not marketing, terms — typically: customer did not return within N days for the same intent, no follow-up human contact required, and no negative survey signal. Define the failure ladder: hallucination (factually wrong answer), policy violation (correct facts, wrong action), unhelpful (correct facts, no useful action), and abandonment (customer disengaged mid-conversation). Set the thresholds that will trigger a model change, a prompt change, or a rollback — and put them in writing, signed by the executive sponsor. Without these in advance, every regression becomes a debate rather than a decision.
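The measurement plan is more durable if it exists as a reviewable artifact rather than a slide. A sketch follows, with illustrative thresholds that your own risk appetite and baseline would replace.

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):                 # the failure ladder, written down once
    HALLUCINATION = "factually wrong answer"
    POLICY_VIOLATION = "correct facts, wrong action"
    UNHELPFUL = "correct facts, no useful action"
    ABANDONMENT = "customer disengaged mid-conversation"

@dataclass(frozen=True)
class MeasurementPlan:
    no_return_window_days: int       # "resolved" = no repeat contact on the same intent
    max_hallucination_rate: float    # weekly rate that triggers a rollback
    max_policy_violation_rate: float
    min_containment_rate: float      # below this, revisit grounding or prompts

# Illustrative thresholds only; set and sign these before go-live.
PLAN = MeasurementPlan(
    no_return_window_days=7,
    max_hallucination_rate=0.01,
    max_policy_violation_rate=0.002,
    min_containment_rate=0.50,
)

def rollback_required(hallucination_rate: float, policy_violation_rate: float) -> bool:
    return (hallucination_rate > PLAN.max_hallucination_rate
            or policy_violation_rate > PLAN.max_policy_violation_rate)
```

Encoding the thresholds this way also means the rollback decision can be computed from the weekly numbers rather than re-litigated in a meeting.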
A composable human-in-the-loop review workflow
Build the human-in-the-loop review as a first-class workflow, not an exception path. For the top five consequential decision types, route them to a named queue with a defined SLA, an explicit reviewer rubric, and a feedback loop that updates the agent's grounding rather than only its prompt. Capture the reviewer's rationale every time — even a single-sentence note converts a binary approve/override into a training signal. Plan for queue saturation: define what happens when the human queue is full at peak hours, and decide explicitly whether the answer is to hold the conversation, escalate to a senior reviewer, or defer to a deterministic fallback path. Do not let saturation quietly degrade into auto-approval.
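One way to keep the saturation decision explicit is to write the routing rule down in whatever workflow tooling you already use. The Python sketch below is illustrative only; the decision types mirror the five named above, and the queue parameters are invented.

```python
from dataclasses import dataclass

@dataclass
class ReviewQueue:
    name: str
    sla_minutes: int
    capacity: int
    pending: int

# Hypothetical gated decision types, mirroring the "top five" in the text.
GATED_DECISIONS = {"refund_above_threshold", "account_closure", "regulated_policy_lookup",
                   "hiring_decision", "health_or_safety_advisory"}

def route(decision_type: str, queue: ReviewQueue) -> str:
    """Route consequential decisions to a named human queue; never auto-approve on saturation."""
    if decision_type not in GATED_DECISIONS:
        return "auto_resolve"
    if queue.pending < queue.capacity:
        return f"human_review:{queue.name} (SLA {queue.sla_minutes} min)"
    # Saturation path chosen explicitly in advance: hold, or fall back to a
    # deterministic path; never a silent fall-through to auto-approval.
    return "hold_and_escalate_to_senior_reviewer"
```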
Failure-mode capture for post-mortem
Every failure is a free training set if you capture it. For each failure-ladder category, sample at least 20 cases per week into a structured review log. Tag the root cause: grounding gap, retrieval miss, tool-call error, model-reasoning error, policy ambiguity, MCP-server regression. Run a weekly post-mortem on the sample, and produce a single-page summary per category — what failed, why, what you changed, what you will measure next. By day 90 you will have a body of evidence that tells you whether the platform is improving on the failure types that matter to your business, or improving only on the ones the vendor's own benchmarks measure.
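The sampling and tagging can be a few lines of code run against your exported traces. A sketch follows, assuming each failed conversation is a record carrying its failure-ladder category and a reviewer-assigned root cause; the record shape is our own assumption.

```python
import random
from collections import Counter

# Root-cause tags from the text; the record shape below is assumed for illustration.
ROOT_CAUSES = ["grounding_gap", "retrieval_miss", "tool_call_error",
               "model_reasoning_error", "policy_ambiguity", "mcp_server_regression"]

def weekly_sample(failed_conversations: list[dict], per_category: int = 20) -> list[dict]:
    """Sample up to N failures per failure-ladder category for the weekly review."""
    by_category: dict[str, list[dict]] = {}
    for conv in failed_conversations:
        by_category.setdefault(conv["failure_mode"], []).append(conv)
    sample = []
    for convs in by_category.values():
        sample.extend(random.sample(convs, min(per_category, len(convs))))
    return sample

def summarise(reviewed: list[dict]) -> Counter:
    """Feedstock for the one-page summary: how often each tagged root cause appeared this week."""
    return Counter(conv["root_cause"] for conv in reviewed)
```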
Contract-protection clauses to insist on
Translate the measurement plan into contract language. Insist on: a notification window before any model or default-prompt change that materially affects your production agents; an obligation to provide post-incident reports within a defined window for any security disclosure on the platform; a price-protection clause that caps year-over-year increases in per-credit cost; an export specification for agent definitions, traces and grounding data; and a parallel-run period at no additional cost if Salesforce changes the underlying model architecture in a way that affects measured performance. None of these clauses prevents the vendor from improving the product. They prevent you from being a passenger when it changes.
What is worth watching from here
There are four indicators we would track over the next two quarters.
The first is whether the Command Center's OpenTelemetry exports become a real feedstock for third-party observability stacks — Datadog, Dynatrace, Grafana — at customer sites, rather than living only inside the Salesforce console. A platform that lets you take your traces with you is a platform that respects the buyer's leverage.
The second is the maturation of the AgentExchange MCP catalogue. Thirty partners at launch is a starting position. What matters is the patch cadence, the security review process for new entrants, and whether buyers can express trust-store policies — "this agent is allowed to call these MCP servers and no others" — as a first-class control. That is the equivalent of an app store's review process, and it will determine how comfortable regulated industries get with the integration surface.
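What such a trust-store policy could look like as an artifact is simple to sketch. To our knowledge Salesforce does not expose exactly this object today, and the agent names and endpoints below are hypothetical, but it is the control buyers should be asking to express and enforce.

```python
# Illustrative trust-store policy: which remote MCP servers each agent may call.
# Agent names and endpoints are hypothetical placeholders.
TRUST_STORE = {
    "service-agent": {"https://mcp.box.example.com", "https://mcp.paypal.example.com"},
    "content-agent": {"https://mcp.writer.example.com"},
}

def is_call_allowed(agent: str, mcp_server_url: str) -> bool:
    """Deny by default: an agent may only call servers explicitly listed for it."""
    return mcp_server_url in TRUST_STORE.get(agent, set())
```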
The third is named-customer disclosure of unit economics rather than resolution percentages. The first large customer that publishes per-conversation cost, fully loaded, against a credible human baseline will reshape how the rest of the market negotiates. It is also the disclosure that the named customers have the most reason to delay, because precise unit economics tell a competitor exactly how their cost structure has shifted.
The fourth is regulatory friction. The EU AI Act took effect on 1 August 2024, with prohibited-practice rules enforceable from 2 February 2025, general-purpose AI obligations from 2 August 2025, and high-risk-system obligations from 2 August 2026. Customer-facing agents that materially influence consequential decisions — credit, insurance, employment, housing, healthcare, education, essential services — are within the high-risk perimeter, and the Act's extraterritorial reach captures non-EU vendors and deployers serving EU users. On the U.S. side, the Colorado AI Act (SB 24-205) was delayed by the August 2025 special session and is currently scheduled to take effect 30 June 2026, while California's SB 53 Frontier AI Transparency Act went live on 1 January 2026. The cross-border implication for a multinational Agentforce deployment is concrete: a single agent design may need region-specific guardrails, region-specific disclosure flows, region-specific human-review thresholds, and region-specific data-residency configurations. FedRAMP High solves the U.S. federal procurement question; it does not solve the EU high-risk classification question, the Colorado consequential-decision question, or the California transparency question. Buyers operating across all three regimes should expect their compliance surface to be wider than their vendor's contract suggests, and they should ask — explicitly — which obligations the vendor will take, which it will share, and which it will leave to the deployer.
Agentforce 360 is a serious piece of platform engineering wrapped around a serious bet about how enterprises will buy AI over the next five years. The named-customer evidence is real, the security disclosures have been handled responsibly, and the federal authorisation is meaningful. What is still missing is the layer of public, comparable, longitudinal operational data that would let buyers move from narrative to analysis. The enterprises that pilot most rigorously over the next two quarters — instrumenting from day one, writing their measurement plan before go-live, capturing their failures, and protecting themselves in contract — will be the ones in a position to demand, and reasonably expect, that the measurement gap close.
Further reading: start with Salesforce Engineering’s deep dive on the Atlas reasoning engine, the Model Context Protocol specification, Forrester’s analyst assessment of Agentforce’s adoption gap, and The Hacker News’ disclosure of the ForcedLeak vulnerability.