Resilience is the operational property every owner says they value and almost nobody engineers for. There's no time for "what if" scenarios while the business is busy succeeding. Then the breakage happens. The key person calls out. The platform goes down. The vendor pulls a contract. Customer data leaks. And the operator discovers, in real time and on the wrong day, how exposed the business actually was.
Resilience is engineered before the disruption. It cannot be improvised during one.
What follows is a set of five questions, the failure modes behind each, and what to do about them. It's designed to be read with your specific business in mind, not as a generic exercise.
This is not a Business Continuity Plan
A business continuity plan is a deliverable. Resilience is a property. The two get confused, and the confusion is expensive.
Most BCPs I've read are theater. Written once, often by an outside firm, signed off, filed in a binder, never tested. The plan exists. The capability does not.
Resilience is what your business actually does when pressure arrives. The systems hold or they don't.
The questions below are meant to surface what's brittle, not generate another deliverable. If they produce a deliverable, the deliverable should be a list of things you fixed and a date next to each one.
Question 01: If your most critical employee called out for two weeks tomorrow, what specifically breaks?
Not "could we cope?" Coping is what you do under pressure with willpower. The question is sharper. What stops?
The honest answer is usually some mix of customer relationships, vendor escalation, billing decisions, hiring approvals, signature authority, and "the way we always handle this exact issue." That list is rarely written down because the person you're worrying about is the one who would be writing it.
This question has a brutal corollary. If the answer is "we'd be fine," you're either lying or you're not paying that person enough. Critical people leave gaps. The work is mapping them before the absence does.
What to do: pick the top three people in your business by criticality, not title. For each one, write a one-page brief. What decisions they make daily. Who their key relationships are. The quirks of the systems they touch. Hand that brief to the next new hire on day one. Most companies that try this discover the brief was a better diagnostic than they expected.
Question 02: If your primary tool stack went down for 48 hours, could you still serve your top 10 customers?
Tech dependency is the resilience question most owners haven't actually thought through. There's a difference between "we use this tool every day" and "we cannot operate without this tool for two days."
Pick a category. Scheduling. Payments. Communications. File storage. Anything customer-facing. For each one, ask: if it went down right now and stayed down through Friday, what's the workaround? Some have a clean fallback. Email goes down, you can call. Some don't. If the booking system goes down and you have eighty appointments tomorrow, the workaround isn't a workaround. It's chaos with apologies.
What to do: list your tools. Mark each one Critical, Important, or Useful. For each Critical one, write the manual fallback in two sentences. If you can't, that's not a tool. That's a single point of failure with a billing relationship attached.
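If a spreadsheet feels too loose for this, the same inventory can be kept as a few lines of Python. This is a minimal sketch, not a prescribed tool. The `Tool` class, the `single_points_of_failure` helper, and the sample entries are all hypothetical, made up for illustration. Swap in your own stack.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    tier: str           # "Critical", "Important", or "Useful"
    fallback: str = ""  # the manual fallback, in about two sentences


def single_points_of_failure(tools):
    """Return the names of Critical tools with no written manual fallback."""
    return [
        t.name
        for t in tools
        if t.tier == "Critical" and not t.fallback.strip()
    ]


# Hypothetical inventory, for illustration only.
tools = [
    Tool(
        "Email",
        "Critical",
        "Call customers from the printed contact sheet. Log each call by hand.",
    ),
    Tool("Booking system", "Critical"),  # no fallback written yet
    Tool("Team chat", "Important"),
]

print(single_points_of_failure(tools))
```

Anything this prints is a Critical tool with nothing written in the fallback column. That list is the to-do list.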
Question 03: Where is the "everyone just knows" knowledge that nobody has written down?
Every operating business of any size has tribal knowledge. The way returns get processed when the customer is over a certain threshold. Which vendors will break the rules and which won't. The clause that goes in every contract because the legal review missed it once and now it lives in tradition. The reason that one column in the spreadsheet says nothing but matters.
Most of this knowledge sits in three or four heads. It's the difference between an organization that runs and one that has to be run.
The test isn't "is it documented." It's "could a smart, motivated new hire produce reasonable results in thirty days using only the documents that exist?" If the answer is no, the gap between your documented operation and your real operation is bigger than you think. AI doesn't fix this. It exposes it. So does turnover.
What to do: pick three operational areas where you're sure tribal knowledge is heaviest. Onboarding. Customer escalations. Vendor management. Find the person who holds it. Pay them for an afternoon to spell it out. The cost of that afternoon is microscopic compared to the cost of finding out it was missing.
Question 04: Who has authority to make a decision in the next 60 minutes?
Undefined decision authority in a crisis often destroys more value than the incident itself.
Concrete scenario. A vendor calls at four o'clock on a Friday demanding a contract change. The change is not insignificant. The vendor needs an answer by Monday. Your point of contact doesn't have signature authority. The person who does is on a flight. The chain of escalation has never been written down. By Monday, half the company has weighed in, and the answer that gets given is the answer of whoever was most available, not whoever should have decided.
Multiply that scenario across the year. Public statement after a customer complaint. Authorizing emergency overtime. Approving a refund above policy. Personnel action in a crisis. Each one has a "right answer" version where someone with the right authority decides quickly, and a "default answer" version where availability is the deciding factor.
What to do: write a one-page decision matrix. Spend approval up to a defined dollar amount. Public communications. Policy overrides. Personnel actions. Name the role and the named backup for each. Distribute it. The act of writing it surfaces gaps you didn't know existed.
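The matrix is small enough to sketch as a lookup table. The version below is one possible shape, under stated assumptions: the role names, the dollar limit, and the `who_decides` and `gaps` helpers are hypothetical, not a standard. The point is only that each decision type names a role, a backup, and a way to spot rows that are missing one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Authority:
    role: str                           # primary decision-maker, by role
    backup: str                         # named backup role
    spend_limit: Optional[int] = None   # dollars; only for spend decisions


# Hypothetical matrix. Every role and amount here is a placeholder.
DECISION_MATRIX = {
    "spend_approval": Authority("Operations Manager", "Finance Lead", 5000),
    "public_communications": Authority("CEO", "Head of Customer Success"),
    "policy_override": Authority("Operations Manager", ""),  # backup missing
    "personnel_action": Authority("CEO", "HR Lead"),
}


def who_decides(decision_type, unavailable=frozenset()):
    """Return the first available role for a decision, or None if nobody is."""
    entry = DECISION_MATRIX.get(decision_type)
    if entry is None:
        return None
    for role in (entry.role, entry.backup):
        if role and role not in unavailable:
            return role
    return None


def gaps(matrix):
    """Return decision types that lack a primary role or a backup."""
    return sorted(k for k, a in matrix.items() if not a.role or not a.backup)


# The person with authority is on a flight: who answers the vendor?
print(who_decides("spend_approval", unavailable={"Operations Manager"}))
print(gaps(DECISION_MATRIX))
```

In this sample data the policy override row has no backup, so it shows up as a gap. That is the kind of hole the writing exercise tends to surface.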
Question 05: When was the last time you tested any of this?
Most resilience is theoretical. The plan you've never run is fiction. The decision matrix you've never used in anger is wishful.
The companies I see actually being resilient run small tabletop exercises. Not annual five-day disaster simulations from a consulting firm. Forty-five minutes, once a quarter, with the leadership team. Pick a scenario. Walk through it out loud. Notice where the "obvious" steps weren't actually obvious to everyone in the room.
A CEO I worked with put it this way. The first time they ran one, they discovered three of their senior people each had a different mental model of who handled customer-facing crisis communications. Each one was sure they were right. None of them had ever talked to the others about it. That conversation took twenty minutes. It would have taken twenty days during an actual incident.
What to do: this quarter, pick one of the four questions above. Run a tabletop. If you don't have time for forty-five minutes, you don't have a resilience problem. You have a calendar problem, and the resilience consequences are downstream.
The gap was always there
The thing every operator notices on the other side of a real incident is that the gap was always there. The incident didn't create it. It just made it visible.
Resilience is engineered before the disruption. The questions in this piece aren't a checklist you complete. They're a calendar you keep. Quarterly is enough. Yearly is theater. Never is the default, and never is what produces the war stories you don't want to be in.
If you can't answer them clearly for your business, the questions aren't the problem. The questions are doing their job. They're showing you what to engineer.