Production Incident Management

📖 Concept

Managing production incidents in enterprise Salesforce requires a systematic approach — from detection through resolution to post-mortem. This is a critical skill for senior developers and architects.

Incident severity levels:

P1 — Critical: System down, all users affected, data loss risk
P2 — High: Major feature broken, business impact, workaround exists
P3 — Medium: Feature degraded, limited users affected
P4 — Low: Minor issue, cosmetic, no business impact
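
The severity definitions above can be encoded so that triage is applied the same way every time. What follows is a minimal Apex sketch, not a prescribed implementation: the class name `IncidentSeverity` and its four Boolean inputs are hypothetical, and real triage usually weighs more factors (for example, whether a workaround exists).

```apex
// Hypothetical helper: maps a few impact facts to the P1-P4 levels listed above.
public class IncidentSeverity {

    public static String classify(Boolean systemDown, Boolean dataLossRisk,
                                  Boolean majorFeatureBroken, Boolean usersAffected) {
        // P1: the system is down or data is at risk
        if (systemDown == true || dataLossRisk == true) {
            return 'P1';
        }
        // P2: a major feature is broken, with business impact
        if (majorFeatureBroken == true) {
            return 'P2';
        }
        // P3: a feature is degraded for a limited set of users
        if (usersAffected == true) {
            return 'P3';
        }
        // P4: minor or cosmetic, no business impact
        return 'P4';
    }
}
```

For example, `IncidentSeverity.classify(false, false, true, true)` returns `'P2'`, because the major-feature check matches before the limited-users check is reached.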

Incident response process:

  1. Detection — Alerts, user reports, monitoring
  2. Triage — Severity assessment, team notification
  3. Investigation — Debug logs, error logs, deployment history
  4. Resolution — Hotfix, rollback, configuration change
  5. Communication — Stakeholder updates, user notification
  6. Post-Mortem — Root cause analysis, prevention measures

Common production issues:

  1. Governor limit exceptions (new code hitting limits with production data)
  2. Sharing calculation delays (large role changes)
  3. Integration failures (external system changes)
  4. Record locking (concurrent updates to same records)
  5. Performance degradation (non-selective queries with growing data)
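
Governor limit failures, the first item above, are easier to catch when code checks its own headroom before doing more work. The sketch below uses the standard `Limits` class; the class name `LimitMonitor`, the method `isNearLimit`, and the choice of which limits to track are assumptions made for illustration.

```apex
// Hypothetical helper: reports whether the current transaction is close to
// any of a few key governor limits.
public class LimitMonitor {

    // Returns true when any tracked limit has reached the given fraction
    // of its maximum (for example 0.8 for 80%).
    public static Boolean isNearLimit(Decimal threshold) {
        return ratio(Limits.getQueries(), Limits.getLimitQueries()) >= threshold
            || ratio(Limits.getDmlStatements(), Limits.getLimitDmlStatements()) >= threshold
            || ratio(Limits.getCpuTime(), Limits.getLimitCpuTime()) >= threshold
            || ratio(Limits.getHeapSize(), Limits.getLimitHeapSize()) >= threshold;
    }

    // Fraction of a limit already consumed; 0 when the maximum is unknown.
    private static Decimal ratio(Decimal used, Decimal max) {
        if (max == null || max == 0) {
            return 0;
        }
        return used / max;
    }
}
```

A batch or queueable job could call `LimitMonitor.isNearLimit(0.8)` inside its processing loop and defer the remaining records to a follow-up job rather than failing mid-transaction.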

Rollback strategies:

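  • Metadata rollback: Redeploy the previous version from Git
  • Data rollback: Restore from a backup; native options are limited, so this depends on backups or exports taken before the change
  • Feature toggle: Disable the feature via a Custom Setting or Custom Metadata flag, with no deployment
  • Destructive changes: Deploy a destructiveChanges.xml manifest to remove the problematic components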
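
The feature-toggle strategy can also be backed by Custom Metadata instead of a Custom Setting. The following is a sketch under stated assumptions: the Custom Metadata Type `Feature_Flag__mdt`, its checkbox field `Is_Enabled__c`, and the class name `FeatureFlags` are all hypothetical and would need to exist in the org.

```apex
// Assumes a hypothetical Custom Metadata Type Feature_Flag__mdt
// with a checkbox field Is_Enabled__c.
public class FeatureFlags {

    public static Boolean isEnabled(String developerName) {
        // getInstance looks the record up by its DeveloperName without a SOQL query
        Feature_Flag__mdt flag = Feature_Flag__mdt.getInstance(developerName);

        // No record means the feature was never flagged, so leave it enabled
        if (flag == null) {
            return true;
        }
        // "== true" keeps the result non-null if the checkbox is unset
        return flag.Is_Enabled__c == true;
    }
}
```

One design difference to weigh: Custom Metadata records travel with deployments and packages, while Custom Setting values are data and have to be set in each org.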

💻 Code Example

```apex
// Production Incident Management Patterns

// 1. Feature Kill Switch
public class FeatureKillSwitch {

    // Reads a hierarchy Custom Setting, so a feature can be toggled in Setup
    // with no deployment.
    public static Boolean isFeatureEnabled(String featureName) {
        Kill_Switch__c settings = Kill_Switch__c.getInstance();

        // "== true" keeps the result non-null: an unset checkbox counts as disabled
        switch on featureName {
            when 'NewPricingEngine' { return settings.New_Pricing__c == true; }
            when 'AutoEscalation'   { return settings.Auto_Escalation__c == true; }
            when 'ExternalSync'     { return settings.External_Sync__c == true; }
            when else               { return true; } // Unknown features default to enabled
        }
    }
}

// In trigger handler:
// if (FeatureKillSwitch.isFeatureEnabled('NewPricingEngine')) {
//     PricingService.calculate(opps);
// }

// 2. Circuit Breaker for External Integrations
public class CircuitBreaker {
    // State lives in a list Custom Setting, one record per service name,
    // so it persists across transactions.

    public static Boolean isOpen(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null || cb.Failure_Count__c == null || cb.Max_Failures__c == null) {
            return false; // Not configured: let calls through
        }

        // Open the circuit once the failure threshold is reached
        if (cb.Failure_Count__c >= cb.Max_Failures__c) {
            Integer cooldown = 0;
            if (cb.Cooldown_Minutes__c != null) {
                cooldown = cb.Cooldown_Minutes__c.intValue();
            }
            if (cb.Last_Failure__c != null &&
                cb.Last_Failure__c.addMinutes(cooldown) > Datetime.now()) {
                return true; // Circuit is open: skip calls
            }
            // Cooldown expired: allow one trial call. No DML happens here, because
            // DML before a callout in the same transaction throws a CalloutException.
            // The caller invokes resetCircuit() after a successful callout.
        }
        return false;
    }

    // Call after a failed callout
    public static void recordFailure(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null) {
            return;
        }
        Decimal failures = cb.Failure_Count__c;
        if (failures == null) {
            failures = 0;
        }
        cb.Failure_Count__c = failures + 1;
        cb.Last_Failure__c = Datetime.now();
        update cb;
    }

    // Call after a successful callout
    public static void resetCircuit(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null) {
            return;
        }
        cb.Failure_Count__c = 0;
        update cb;
    }
}

// 3. Incident alert notification
public class IncidentAlert {

    public static void raiseAlert(String severity, String message, String context) {
        // Log to a custom object. If the surrounding transaction later rolls back,
        // this record is rolled back with it.
        insert new Incident_Log__c(
            Severity__c = severity,
            Message__c = message,
            Context__c = context,
            Timestamp__c = Datetime.now(),
            Resolved__c = false
        );

        // Send email alert for P1/P2
        if (severity == 'P1' || severity == 'P2') {
            sendAlertEmail(severity, message, context);
        }

        // Publish Platform Event for real-time monitoring
        Database.SaveResult result = EventBus.publish(new Incident_Event__e(
            Severity__c = severity,
            Message__c = message,
            Context__c = context
        ));
        if (!result.isSuccess()) {
            System.debug(LoggingLevel.ERROR,
                'Incident event publish failed: ' + result.getErrors());
        }
    }

    private static void sendAlertEmail(String severity, String msg, String ctx) {
        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        // Placeholder addresses: replace with your on-call distribution list
        mail.setToAddresses(new List<String>{
            'oncall@company.com', 'sf-admin@company.com'
        });
        mail.setSubject('[' + severity + '] Salesforce Production Alert');
        mail.setPlainTextBody(
            'Severity: ' + severity + '\n' +
            'Message: ' + msg + '\n' +
            'Context: ' + ctx + '\n' +
            'Time: ' + Datetime.now().format() + '\n' +
            'Org: ' + URL.getOrgDomainUrl().toExternalForm()
        );
        try {
            Messaging.sendEmail(new List<Messaging.SingleEmailMessage>{ mail });
        } catch (System.EmailException e) {
            // An email failure (for example a send limit) must not undo the incident log
            System.debug(LoggingLevel.ERROR, 'Alert email failed: ' + e.getMessage());
        }
    }
}
```

🏋️ Practice Exercise

Incident Management Practice:

  1. Create a Kill_Switch__c Custom Setting with toggles for 5 features
  2. Implement a Circuit Breaker for an external API integration
  3. Build an Incident_Log__c object with severity, status, and resolution tracking
  4. Create a Platform Event that fires on P1/P2 incidents for real-time monitoring
  5. Design a rollback procedure for a failed production deployment
  6. Write a post-mortem template document for Salesforce production incidents
  7. Set up Apex Exception Email alerts for your production org
  8. Create a monitoring dashboard showing incident trends and MTTR (mean time to resolution)
  9. Simulate a governor limit failure in sandbox and practice the investigation process
  10. Design an on-call rotation system for Salesforce production support

⚠️ Common Mistakes

  • Not having kill switches for new features: without them, the only rollback option is a redeployment, which can take hours once validation and test runs are included

  • Deploying to production on Friday afternoon — follow change management best practices: deploy during low-traffic hours, not before weekends

  • Not testing with production-data volumes — code works in sandbox (1K records) but fails in production (1M records)

  • No rollback plan — every production deployment should have a documented rollback procedure before it starts

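  • Not logging errors persistently: debug logs are only captured while a trace flag is active, and system debug logs are kept for just 24 hours. Without a persistent log object such as Error_Log__c, production errors are lost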
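
Persistent logging has one catch: a log record inserted in a failing transaction is rolled back with everything else. One way around this, sketched below, is to publish a Platform Event and let a subscriber write the log. The event `Error_Event__e` and its fields `Message__c`, `Stack_Trace__c`, and `Context__c` are hypothetical, and the sketch assumes the event's publish behavior is set to Publish Immediately. The `Incident_Log__c` fields are the ones used in the code example above.

```apex
// Assumes a hypothetical Platform Event Error_Event__e (Publish Immediately)
// with text fields Message__c, Stack_Trace__c, and Context__c.
public class ErrorLogger {

    public static void log(Exception e, String context) {
        // With Publish Immediately, the event is sent even if this transaction rolls back
        Database.SaveResult result = EventBus.publish(new Error_Event__e(
            Message__c = e.getMessage(),
            Stack_Trace__c = e.getStackTraceString(),
            Context__c = context
        ));
        if (!result.isSuccess()) {
            System.debug(LoggingLevel.ERROR,
                'Error event not published: ' + result.getErrors());
        }
    }
}

// Separate file: the subscriber runs in its own transaction and persists the log
trigger ErrorEventSubscriber on Error_Event__e (after insert) {
    List<Incident_Log__c> logs = new List<Incident_Log__c>();
    for (Error_Event__e evt : Trigger.new) {
        logs.add(new Incident_Log__c(
            Message__c = evt.Message__c,
            Context__c = evt.Context__c,
            Timestamp__c = evt.CreatedDate,
            Resolved__c = false
        ));
    }
    insert logs;
}
```

A typical call site is a `catch` block: `ErrorLogger.log(e, 'OpportunityTriggerHandler');` followed by rethrowing the exception, so the user still sees the failure while the log survives.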
