Production Incident Management


📖 Concept

Managing production incidents in enterprise Salesforce requires a systematic approach — from detection through resolution to post-mortem. This is a critical skill for senior developers and architects.

Incident severity levels:

P1 — Critical: System down, all users affected, data loss risk
P2 — High: Major feature broken, business impact, workaround exists
P3 — Medium: Feature degraded, limited users affected
P4 — Low: Minor issue, cosmetic, no business impact
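
The severity levels above can be encoded so that triage gives the same answer every time. The following is a minimal sketch, not a standard Salesforce API: the class name `IncidentSeverity`, the enum `Level`, and the four Boolean inputs are all hypothetical and simply mirror the definitions in the list.

```apex
// Hypothetical helper: maps the impact questions from the list above to a level.
public class IncidentSeverity {

    public enum Level { P1, P2, P3, P4 }

    public static Level classify(Boolean systemDown,
                                 Boolean dataLossRisk,
                                 Boolean majorFeatureBroken,
                                 Boolean limitedUsersAffected) {
        // P1: system down or data at risk
        if (systemDown == true || dataLossRisk == true) {
            return Level.P1;
        }
        // P2: a major feature is broken
        if (majorFeatureBroken == true) {
            return Level.P2;
        }
        // P3: degraded for a limited set of users
        if (limitedUsersAffected == true) {
            return Level.P3;
        }
        // P4: everything else (minor or cosmetic)
        return Level.P4;
    }
}
```

The checks run from most to least severe, so an incident that matches several conditions is always assigned the highest applicable level.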

Incident response process:

Detection — Alerts, user reports, monitoring
Triage — Severity assessment, team notification
Investigation — Debug logs, error logs, deployment history
Resolution — Hotfix, rollback, configuration change
Communication — Stakeholder updates, user notification
Post-Mortem — Root cause analysis, prevention measures

Common production issues:

Governor limit exceptions (new code hitting limits with production data)
Sharing calculation delays (large role changes)
Integration failures (external system changes)
Record locking (concurrent updates to same records)
Performance degradation (non-selective queries with growing data)
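
For the governor limit item in particular, code can check how close the current transaction is to a limit before doing more work. The sketch below uses the standard `Limits` class; the class name `LimitGuard`, its method names, and the idea of a percentage threshold are assumptions made for illustration.

```apex
// Hypothetical guard: reports how much of a governor limit has been consumed.
public class LimitGuard {

    // Pure arithmetic: whole-number percentage of a limit already used.
    public static Integer percentUsed(Integer used, Integer max) {
        if (used == null || max == null || max <= 0) {
            return 0;
        }
        return (used * 100) / max; // Integer division truncates
    }

    // True when SOQL queries issued so far reach the given percentage
    // of the per-transaction limit.
    public static Boolean isNearSoqlLimit(Integer thresholdPercent) {
        return percentUsed(Limits.getQueries(), Limits.getLimitQueries())
            >= thresholdPercent;
    }

    // Same check for the number of records processed by DML statements.
    public static Boolean isNearDmlRowLimit(Integer thresholdPercent) {
        return percentUsed(Limits.getDmlRows(), Limits.getLimitDmlRows())
            >= thresholdPercent;
    }
}
```

A handler could call `LimitGuard.isNearSoqlLimit(80)` before an optional query and skip or defer the work instead of failing the whole transaction.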

Rollback strategies:

Metadata rollback: Deploy previous version from Git
Data rollback: Restore from a backup or data export (native restore options are limited, so a backup process must already be in place)
Feature toggle: Disable via Custom Metadata / Feature Flag
Destructive changes: Remove problematic components by deploying a destructiveChanges.xml manifest
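
The feature toggle strategy can also be backed by Custom Metadata, as the list mentions. This is a minimal sketch under stated assumptions: the type `Feature_Flag__mdt` and its checkbox field `Is_Enabled__c` are hypothetical and would need to be created in the org, while `getInstance(developerName)` is the standard accessor for custom metadata records.

```apex
// Hypothetical feature flag lookup backed by a Custom Metadata Type.
// Assumes Feature_Flag__mdt exists with a checkbox field Is_Enabled__c.
public class FeatureFlags {

    public static Boolean isEnabled(String developerName) {
        // Returns null when no record with this DeveloperName exists.
        Feature_Flag__mdt flag = Feature_Flag__mdt.getInstance(developerName);

        // Unknown flags default to enabled; only an explicit false disables.
        return flag == null || flag.Is_Enabled__c != false;
    }
}
```

Defaulting unknown flags to enabled means a missing record never silently turns a feature off; a team that prefers to fail closed would invert that default.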

💻 Code Example


// Production Incident Management Patterns

// 1. Feature Kill Switch
public class FeatureKillSwitch {

    // Uses a hierarchy Custom Setting for an instant toggle (no deployment needed)
    public static Boolean isFeatureEnabled(String featureName) {
        Kill_Switch__c settings = Kill_Switch__c.getInstance();

        switch on featureName {
            when 'NewPricingEngine' { return isOn(settings.New_Pricing__c); }
            when 'AutoEscalation' { return isOn(settings.Auto_Escalation__c); }
            when 'ExternalSync' { return isOn(settings.External_Sync__c); }
            when else { return true; } // Default: enabled
        }
    }

    // A checkbox may be null when no setting record exists. Treat unset as
    // enabled so callers never receive a null Boolean in an if condition.
    private static Boolean isOn(Boolean flag) {
        return flag != false;
    }
}

// In trigger handler:
// if (FeatureKillSwitch.isFeatureEnabled('NewPricingEngine')) {
//     PricingService.calculate(opps);
// }

// 2. Circuit Breaker for External Integrations
public class CircuitBreaker {
    // Tracks failures in a list Custom Setting keyed by service name
    // (persists across transactions)

    public static Boolean isOpen(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null || cb.Failure_Count__c == null || cb.Max_Failures__c == null) {
            return false;
        }

        // Open circuit if too many failures
        if (cb.Failure_Count__c >= cb.Max_Failures__c) {
            Decimal cooldown = cb.Cooldown_Minutes__c;
            Integer minutes = cooldown == null ? 0 : cooldown.intValue();
            // Check cooldown period
            if (cb.Last_Failure__c != null &&
                cb.Last_Failure__c.addMinutes(minutes) > Datetime.now()) {
                return true; // Circuit is open: skip calls
            }
            // Cooldown expired: let a trial call through. No DML happens here,
            // because uncommitted DML would block the callout that follows.
            // The caller invokes resetCircuit() after a successful call.
        }
        return false;
    }

    public static void recordFailure(String serviceName) {
        // Update the setting record for this specific service
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        if (cb == null) return; // No breaker configured for this service
        Decimal count = cb.Failure_Count__c;
        if (count == null) { count = 0; }
        cb.Failure_Count__c = count + 1;
        cb.Last_Failure__c = Datetime.now();
        update cb;
        System.debug('Circuit breaker: ' + serviceName + ' failure recorded');
    }

    public static void resetCircuit(String serviceName) {
        Circuit_Breaker__c cb = Circuit_Breaker__c.getInstance(serviceName);
        // Skip the DML when there is nothing to reset
        if (cb == null || cb.Failure_Count__c == 0) return;
        cb.Failure_Count__c = 0;
        update cb;
        System.debug('Circuit breaker: ' + serviceName + ' reset');
    }
}

// 3. Incident alert notification
public class IncidentAlert {

    public static void raiseAlert(String severity, String message, String context) {
        // Log to custom object. Note: this insert is rolled back if the
        // calling transaction later fails with an unhandled exception.
        insert new Incident_Log__c(
            Severity__c = severity,
            Message__c = message,
            Context__c = context,
            Timestamp__c = Datetime.now(),
            Resolved__c = false
        );

        // Send email alert for P1/P2
        if (severity == 'P1' || severity == 'P2') {
            sendAlertEmail(severity, message, context);
        }

        // Publish Platform Event for real-time monitoring
        Database.SaveResult result = EventBus.publish(new Incident_Event__e(
            Severity__c = severity,
            Message__c = message,
            Context__c = context
        ));
        if (!result.isSuccess()) {
            System.debug(LoggingLevel.ERROR,
                'Incident event not published: ' + result.getErrors());
        }
    }

    private static void sendAlertEmail(String severity, String msg, String ctx) {
        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        // Example addresses; in practice, load recipients from configuration
        mail.setToAddresses(new List<String>{
            'oncall@company.com', 'sf-admin@company.com'
        });
        mail.setSubject('[' + severity + '] Salesforce Production Alert');
        mail.setPlainTextBody(
            'Severity: ' + severity + '\n' +
            'Message: ' + msg + '\n' +
            'Context: ' + ctx + '\n' +
            'Time: ' + Datetime.now().format() + '\n' +
            'Org: ' + URL.getOrgDomainUrl().toExternalForm()
        );
        try {
            Messaging.sendEmail(new List<Messaging.SingleEmailMessage>{mail});
        } catch (System.EmailException e) {
            // Alerting must not fail the caller (for example, email limit reached)
            System.debug(LoggingLevel.ERROR, 'Alert email failed: ' + e.getMessage());
        }
    }
}
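
To show how the three patterns fit together, here is one possible way to wrap an outbound callout. It is a sketch under stated assumptions, not part of the listing above: the class `ExternalSyncService`, the helper `isServerError`, and the Named Credential path `callout:External_Sync/orders` are hypothetical, and only the `FeatureKillSwitch`, `CircuitBreaker`, and `IncidentAlert` methods already shown are reused.

```apex
// Hypothetical service combining the kill switch, circuit breaker, and alert.
public class ExternalSyncService {

    // Treat 5xx responses as failures of the remote system.
    public static Boolean isServerError(Integer statusCode) {
        return statusCode != null && statusCode >= 500;
    }

    // Returns the response, or null when the call was skipped or failed.
    public static HttpResponse send(String body) {
        // Skip entirely if the feature is switched off or the circuit is open.
        if (FeatureKillSwitch.isFeatureEnabled('ExternalSync') == false ||
            CircuitBreaker.isOpen('ExternalSync')) {
            return null;
        }

        HttpRequest req = new HttpRequest();
        req.setEndpoint('callout:External_Sync/orders'); // assumed Named Credential
        req.setMethod('POST');
        req.setHeader('Content-Type', 'application/json');
        req.setBody(body);
        req.setTimeout(10000); // milliseconds

        try {
            HttpResponse res = new Http().send(req);
            if (isServerError(res.getStatusCode())) {
                CircuitBreaker.recordFailure('ExternalSync');
            } else {
                CircuitBreaker.resetCircuit('ExternalSync');
            }
            return res;
        } catch (System.CalloutException e) {
            // All DML happens after the callout attempt, never before it.
            CircuitBreaker.recordFailure('ExternalSync');
            IncidentAlert.raiseAlert('P2', e.getMessage(), 'ExternalSyncService.send');
            return null;
        }
    }
}
```

The ordering matters: Apex rejects a callout when the transaction has uncommitted DML, so the failure counter and the incident log are written only after the HTTP call has completed or thrown.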

🏋️ Practice Exercise

Incident Management Practice:

Create a Kill_Switch__c Custom Setting with toggles for 5 features
Implement a Circuit Breaker for an external API integration
Build an Incident_Log__c object with severity, status, and resolution tracking
Create a Platform Event that fires on P1/P2 incidents for real-time monitoring
Design a rollback procedure for a failed production deployment
Write a post-mortem template document for Salesforce production incidents
Set up Apex Exception Email alerts for your production org
Create a monitoring dashboard showing incident trends and MTTR (mean time to resolution)
Simulate a governor limit failure in sandbox and practice the investigation process
Design an on-call rotation system for Salesforce production support

⚠️ Common Mistakes

Not having kill switches for new features: without them, the only way to turn a feature off is a redeployment, which can take hours once test runs and approvals are included
Deploying to production on Friday afternoon: follow change management best practices and deploy during low-traffic hours when the team is available to monitor, not right before a weekend
Not testing with production-scale data volumes: code that works in a sandbox with 1K records can fail in production with 1M records
No rollback plan: every production deployment should have a documented rollback procedure before it starts
Not logging errors persistently: debug logs are short-lived (system debug logs are kept for about 24 hours and are only captured while a trace flag is active). Without a persistent custom object such as Error_Log__c, the details of production errors are lost
