Production Incident Management
📖 Concept
Managing production incidents in enterprise Salesforce requires a systematic approach — from detection through resolution to post-mortem. This is a critical skill for senior developers and architects.
Incident severity levels:
P1 — Critical: System down, all users affected, data loss risk
P2 — High: Major feature broken, business impact, workaround exists
P3 — Medium: Feature degraded, limited users affected
P4 — Low: Minor issue, cosmetic, no business impact
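The severity levels above can be turned into a small triage helper so that alerts are classified consistently. The following is a minimal sketch, not a standard: the class name `IncidentTriage` and the four Boolean inputs are hypothetical, and the rules simply mirror the list above.

```apex
// Hypothetical helper: maps observed impact to the P1-P4 levels listed above.
public class IncidentTriage {

    public enum Severity { P1, P2, P3, P4 }

    // All arguments are expected to be non-null.
    public static Severity classify(Boolean systemDown, Boolean dataLossRisk,
                                    Boolean majorFeatureBroken, Boolean featureDegraded) {
        if (systemDown || dataLossRisk) {
            return Severity.P1; // Critical: system down or data at risk
        }
        if (majorFeatureBroken) {
            return Severity.P2; // High: major feature broken
        }
        if (featureDegraded) {
            return Severity.P3; // Medium: feature degraded for some users
        }
        return Severity.P4;     // Low: minor or cosmetic issue
    }
}
```

Calling `IncidentTriage.classify(false, false, true, false).name()` yields the string `P2`, which can be stored in a severity field or passed to an alerting method.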
Incident response process:
- Detection — Alerts, user reports, monitoring
- Triage — Severity assessment, team notification
- Investigation — Debug logs, error logs, deployment history
- Resolution — Hotfix, rollback, configuration change
- Communication — Stakeholder updates, user notification
- Post-Mortem — Root cause analysis, prevention measures
Common production issues:
- Governor limit exceptions (new code hitting limits with production data)
- Sharing calculation delays (large role changes)
- Integration failures (external system changes)
- Record locking (concurrent updates to same records)
- Performance degradation (non-selective queries with growing data)
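For the governor limit case in the list above, one defensive option is to check consumption with the standard `Limits` class before starting more work. This is a sketch under assumptions: the class name `LimitGuard` and the percentage threshold are illustrative, and only three of the available limits are checked.

```apex
// Hypothetical guard: reports when the current transaction is close to a governor limit.
public class LimitGuard {

    // thresholdPercent is a whole number, for example 80 for "80% consumed".
    public static Boolean isNearLimit(Integer thresholdPercent) {
        return exceeds(Limits.getQueries(), Limits.getLimitQueries(), thresholdPercent)
            || exceeds(Limits.getDmlRows(), Limits.getLimitDmlRows(), thresholdPercent)
            || exceeds(Limits.getCpuTime(), Limits.getLimitCpuTime(), thresholdPercent);
    }

    // Integer arithmetic avoids rounding questions: used/allowed >= percent/100.
    private static Boolean exceeds(Integer used, Integer allowed, Integer thresholdPercent) {
        return allowed > 0 && used * 100 >= allowed * thresholdPercent;
    }
}
```

A batch or trigger handler could call `LimitGuard.isNearLimit(80)` inside a loop and defer the remaining records to asynchronous processing instead of failing the whole transaction.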
Rollback strategies:
- Metadata rollback: Deploy previous version from Git
- Data rollback: Restore from a backup or data export (native restore options in Salesforce are limited, so plan this in advance)
- Feature toggle: Disable via Custom Metadata / Feature Flag
- Destructive changes: Remove problematic components
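The feature toggle strategy above mentions Custom Metadata. A minimal sketch of that variant follows, assuming a hypothetical custom metadata type `Feature_Flag__mdt` with a checkbox field `Is_Enabled__c`; neither name comes from a standard package.

```apex
// Hypothetical feature flag lookup backed by a Custom Metadata Type.
public class FeatureFlags {

    // developerName is the DeveloperName of a Feature_Flag__mdt record,
    // for example 'NewPricingEngine'.
    public static Boolean isEnabled(String developerName) {
        Feature_Flag__mdt flag = Feature_Flag__mdt.getInstance(developerName);

        // No record found: treat the feature as enabled (fail open).
        // Flip this default if unknown features should stay off.
        if (flag == null) {
            return true;
        }
        return flag.Is_Enabled__c;
    }
}
```

Custom metadata records can be edited in Setup and are also deployable, so the flag state can be version-controlled alongside the code it protects.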
💻 Code Example
// Production Incident Management Patterns

// 1. Feature Kill Switch
public class FeatureKillSwitch {

    // Uses a hierarchy Custom Setting for an instant toggle (no deployment needed)
    public static Boolean isFeatureEnabled(String featureName) {
        Kill_Switch__c settings = Kill_Switch__c.getInstance();

        switch on featureName {
            when 'NewPricingEngine' { return settings.New_Pricing__c; }
            when 'AutoEscalation' { return settings.Auto_Escalation__c; }
            when 'ExternalSync' { return settings.External_Sync__c; }
            when else { return true; } // Default: enabled
        }
    }
}

// In trigger handler:
// if (FeatureKillSwitch.isFeatureEnabled('NewPricingEngine')) {
//     PricingService.calculate(opps);
// }

// 2. Circuit Breaker for External Integrations
public class CircuitBreaker {
    // Failure state is tracked in a list Custom Setting keyed by service name,
    // so it persists across transactions.

    public static Boolean isOpen(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null) return false;

        // Open the circuit if there have been too many failures
        if (cb.Failure_Count__c >= cb.Max_Failures__c) {
            // Check the cooldown period
            if (cb.Last_Failure__c != null && cb.Cooldown_Minutes__c != null &&
                cb.Last_Failure__c.addMinutes(cb.Cooldown_Minutes__c.intValue()) > Datetime.now()) {
                return true; // Circuit is open: skip calls
            }
            // Cooldown expired: allow a trial call. The caller should invoke
            // resetCircuit() after a success. No DML happens here, because DML
            // before a callout in the same transaction blocks the callout.
        }
        return false;
    }

    // Call after a failed callout
    public static void recordFailure(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null) return;

        Decimal failures = cb.Failure_Count__c;
        if (failures == null) failures = 0;
        cb.Failure_Count__c = failures + 1;
        cb.Last_Failure__c = Datetime.now();
        update cb;
        System.debug('Circuit breaker: ' + serviceName + ' failure recorded');
    }

    // Call after a successful callout
    public static void resetCircuit(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null || cb.Failure_Count__c == null || cb.Failure_Count__c == 0) return;

        cb.Failure_Count__c = 0;
        update cb;
        System.debug('Circuit breaker: ' + serviceName + ' reset');
    }
}

// 3. Incident alert notification
public class IncidentAlert {

    public static void raiseAlert(String severity, String message, String context) {
        // Log to custom object
        insert new Incident_Log__c(
            Severity__c = severity,
            Message__c = message,
            Context__c = context,
            Timestamp__c = Datetime.now(),
            Resolved__c = false
        );

        // Send email alert for P1/P2
        if (severity == 'P1' || severity == 'P2') {
            sendAlertEmail(severity, message, context);
        }

        // Publish Platform Event for real-time monitoring
        EventBus.publish(new Incident_Event__e(
            Severity__c = severity,
            Message__c = message,
            Context__c = context
        ));
    }

    private static void sendAlertEmail(String severity, String msg, String ctx) {
        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        mail.setToAddresses(new List<String>{
            'oncall@company.com', 'sf-admin@company.com'
        });
        mail.setSubject('[' + severity + '] Salesforce Production Alert');
        mail.setPlainTextBody(
            'Severity: ' + severity + '\n' +
            'Message: ' + msg + '\n' +
            'Context: ' + ctx + '\n' +
            'Time: ' + Datetime.now().format() + '\n' +
            'Org: ' + URL.getOrgDomainUrl().toExternalForm()
        );
        Messaging.sendEmail(new List<Messaging.SingleEmailMessage>{mail});
    }
}
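A caller might wrap a callout with the `CircuitBreaker` class above roughly as follows. This is a usage sketch under assumptions: the class `ExternalSyncClient`, the Named Credential `External_Sync`, and the `/status` path are hypothetical. State changes are made only after the callout, since Apex does not allow a callout once DML is pending in the same transaction.

```apex
// Hypothetical client showing where the circuit breaker calls would go.
public class ExternalSyncClient {

    private static final String SERVICE = 'ExternalSync';

    // Returns the response body, or null when the call is skipped or fails.
    public static String fetchStatus() {
        if (CircuitBreaker.isOpen(SERVICE)) {
            return null; // Circuit open: do not call the external system
        }

        HttpRequest req = new HttpRequest();
        req.setEndpoint('callout:External_Sync/status'); // assumed Named Credential
        req.setMethod('GET');
        req.setTimeout(10000); // milliseconds

        try {
            HttpResponse res = new Http().send(req);
            if (res.getStatusCode() >= 500) {
                CircuitBreaker.recordFailure(SERVICE);
                return null;
            }
            CircuitBreaker.resetCircuit(SERVICE); // success closes the circuit
            return res.getBody();
        } catch (System.CalloutException e) {
            // Timeouts and connection errors count as failures
            CircuitBreaker.recordFailure(SERVICE);
            return null;
        }
    }
}
```

Treating only 5xx responses and callout exceptions as failures is a design choice; a 4xx response usually points to a bad request rather than an unhealthy service.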
🏋️ Practice Exercise
Incident Management Practice:
- Create a Kill_Switch__c Custom Setting with toggles for 5 features
- Implement a Circuit Breaker for an external API integration
- Build an Incident_Log__c object with severity, status, and resolution tracking
- Create a Platform Event that fires on P1/P2 incidents for real-time monitoring
- Design a rollback procedure for a failed production deployment
- Write a post-mortem template document for Salesforce production incidents
- Set up Apex Exception Email alerts for your production org
- Create a monitoring dashboard showing incident trends and MTTR (mean time to resolution)
- Simulate a governor limit failure in sandbox and practice the investigation process
- Design an on-call rotation system for Salesforce production support
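As a starting point for the MTTR exercise above, mean time to resolution is the total resolution time divided by the number of resolved incidents. The sketch below assumes the `Incident_Log__c` object from the code example plus a hypothetical `Resolved_At__c` date/time field that records when the incident was closed.

```apex
// Hypothetical metrics helper: MTTR in hours over a recent window.
public class IncidentMetrics {

    // Returns null when no incident in the window has been resolved.
    public static Decimal meanHoursToResolution(Integer lastNDays) {
        Datetime since = Datetime.now().addDays(-lastNDays);
        Long totalMs = 0;
        Integer resolvedCount = 0;

        for (Incident_Log__c log : [
            SELECT Timestamp__c, Resolved_At__c
            FROM Incident_Log__c
            WHERE Resolved__c = true
              AND Timestamp__c >= :since
              AND Resolved_At__c != null
        ]) {
            // getTime() returns milliseconds since the epoch
            totalMs += log.Resolved_At__c.getTime() - log.Timestamp__c.getTime();
            resolvedCount++;
        }

        if (resolvedCount == 0) {
            return null;
        }
        // 3,600,000 milliseconds per hour
        return (Decimal.valueOf(totalMs) / resolvedCount / 3600000).setScale(2);
    }
}
```

For large incident volumes a report or an aggregate query would scale better than iterating in Apex; this version favors readability.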
⚠️ Common Mistakes
- Not having kill switches for new features. Without them, the only rollback option is a full redeployment, which can take hours.
- Deploying to production on Friday afternoon. Follow change management best practices: deploy during low-traffic hours, not right before a weekend.
- Not testing with production-scale data volumes. Code that works in a sandbox (1K records) can fail in production (1M records).
- No rollback plan. Every production deployment should have a documented rollback procedure before it starts.
- Not logging errors persistently. Debug logs are captured only while a trace flag is active, and system debug logs are kept for just 24 hours. Without a persistent log object such as Error_Log__c, production errors are lost.
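The last point can be addressed with a small logging utility. This is a minimal sketch, assuming a hypothetical `Error_Log__c` custom object with the fields shown; the field names are illustrative and should match whatever the org defines.

```apex
// Hypothetical persistent error logger.
public class ErrorLogger {

    // Usage inside a catch block:
    //   try {
    //       PricingService.calculate(opps);
    //   } catch (Exception e) {
    //       ErrorLogger.log('PricingService.calculate', e);
    //   }
    public static void log(String source, Exception e) {
        String message = e.getMessage();
        Error_Log__c entry = new Error_Log__c(
            Source__c = source,
            Exception_Type__c = e.getTypeName(),
            Message__c = message == null ? null : message.left(255),
            Stack_Trace__c = e.getStackTraceString(),
            Occurred_At__c = Datetime.now()
        );
        // allOrNone = false: a logging failure must not raise a second exception
        Database.insert(entry, false);
    }
}
```

One caveat: the inserted record is part of the current transaction, so it is rolled back if the exception is rethrown and left unhandled. When the error must survive a rollback, a Platform Event configured to publish immediately is a common alternative.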