Troubleshooting and Diagnostics in Azure Service Fabric

Even well-designed systems face issues. In production, being able to quickly troubleshoot and diagnose problems in Service Fabric is essential to maintaining reliability and performance.

🔍 Common Issues You May Encounter

  • Application services fail to start, or crash after starting.
  • Nodes become unhealthy or go down.
  • Cluster upgrades fail.
  • Partition movements are delayed, or replica builds fail.

Real-World Analogy:

Think of Service Fabric as a complex transportation system: if a train is delayed (a service failure) or a track is broken (a node is down), you need live dashboards, logs, and alerts to fix the problem quickly.

🚑 Key Troubleshooting Techniques

1. Service Fabric Explorer (SFX)
  • Open the cluster dashboard (http://localhost:19080/Explorer for a local cluster, or the same path on your Azure cluster's management endpoint).
  • Look for Red (Error) or Yellow (Warning) markers.
  • Click into:
    • Nodes → see node health.
    • Applications → view replica status.
    • System Services → ensure platform health.
2. Events and Logs
  • Use Windows Event Viewer for local clusters.
  • Use the Azure Diagnostics extension for cloud clusters (it can be enabled in the portal).
  • Look under Applications and Services Logs → Microsoft-Service Fabric, in the Operational and Admin logs, for critical errors.
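
On a Windows node, the same event channels can also be read from PowerShell. The following is a minimal sketch, assuming the Service Fabric channels are registered under names that start with Microsoft-ServiceFabric. The first command lists what is actually present on your machine, so confirm the channel names there before relying on the second one.

```powershell
# List the Service Fabric event channels present on this machine.
# The 'Microsoft-ServiceFabric*' name pattern is an assumption; adjust it
# if nothing is returned.
Get-WinEvent -ListLog 'Microsoft-ServiceFabric*' -ErrorAction SilentlyContinue |
    Select-Object LogName, RecordCount

# Show up to 20 of the most recent Critical (1), Error (2), and Warning (3)
# events from the Operational channel.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-ServiceFabric/Operational'
    Level   = 1, 2, 3
} -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List
```

If the second command prints nothing, either no matching events exist or the channel name differs on your machine.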
3. ETW (Event Tracing for Windows)
  • An advanced logging mechanism that captures detailed trace events.
  • Use PerfView to read ETW traces on a node, or Azure Monitor once the traces have been collected from the cluster.
4. Health Reports
  • Services can proactively report degraded conditions.
  • View health events inside SFX → Health Events tab.
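
Health reports can also be sent and read from PowerShell, which is handy for testing how a report shows up in SFX. This is a sketch under stated assumptions: the source ID (QueueWatchdog), the property name (QueueDepth), and the description are hypothetical placeholders, and the service name is the example name used elsewhere in this article.

```powershell
# Connect to the local development cluster. For a remote cluster, pass
# -ConnectionEndpoint and the appropriate credentials instead.
Connect-ServiceFabricCluster | Out-Null

# Report a Warning against a service. SourceId, HealthProperty, and
# Description are hypothetical values chosen for illustration.
Send-ServiceFabricServiceHealthReport `
    -ServiceName 'fabric:/MyApp/MyService' `
    -SourceId 'QueueWatchdog' `
    -HealthProperty 'QueueDepth' `
    -HealthState Warning `
    -Description 'Work queue is backing up.' `
    -TimeToLiveSec 300 `
    -RemoveWhenExpired

# Read back the health events recorded on the service.
(Get-ServiceFabricServiceHealth -ServiceName 'fabric:/MyApp/MyService').HealthEvents
```

The report may take a few seconds to appear. With -RemoveWhenExpired set, it disappears from the health store once the time to live elapses, so a test report does not linger.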

🚀 Step-by-Step: Diagnosing Common Problems

Problem 1: A Node Shows "Down" Status
  • Check the VM status in the Azure Portal.
  • Verify that the VM agent is running.
  • Restart the Service Fabric host service (FabricHostSvc) on the VM if needed.
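
The node checks above can be sketched in PowerShell. The first part runs from any machine connected to the cluster. The second part is meant to run in an elevated session on the affected VM itself and assumes a Windows node where the Service Fabric host runs as the FabricHostSvc service.

```powershell
# Connect to the local development cluster. For a remote cluster, pass
# -ConnectionEndpoint and the appropriate credentials instead.
Connect-ServiceFabricCluster | Out-Null

# List nodes that are not Up, together with their health state.
Get-ServiceFabricNode |
    Where-Object { $_.NodeStatus -ne 'Up' } |
    Select-Object NodeName, NodeStatus, HealthState, IpAddressOrFQDN

# On the affected VM (elevated session): check the Service Fabric host
# service and (re)start it only if it is not running.
$svc = Get-Service -Name 'FabricHostSvc'
if ($svc.Status -ne 'Running') {
    Restart-Service -Name 'FabricHostSvc'
}
```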
Problem 2: Application Fails to Deploy
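  • Check ApplicationManifest.xml and ServiceManifest.xml for mismatched names or versions (for example, a ServiceManifestVersion in ApplicationManifest.xml that does not match the Version in ServiceManifest.xml).
  • Check the logs for deployment errors.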
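
Manifest problems can often be caught before deployment by validating the package locally. This is a minimal sketch; the package path is a hypothetical placeholder, so point it at your own application package folder.

```powershell
# Validate an application package before registering it.
# The path below is a hypothetical placeholder.
$packagePath = 'C:\Deploy\MyApp\pkg\Debug'
Test-ServiceFabricApplicationPackage -ApplicationPackagePath $packagePath

# List the application types and versions already registered in the
# cluster, to compare against the versions declared in the manifests.
Connect-ServiceFabricCluster | Out-Null
Get-ServiceFabricApplicationType |
    Select-Object ApplicationTypeName, ApplicationTypeVersion, Status
```

Test-ServiceFabricApplicationPackage returns True or False and prints the validation errors it finds, which usually point to the manifest element at fault.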
Problem 3: High CPU or Memory Usage
  • Use the metrics view in Service Fabric Explorer.
  • Review the load reported by specific services.
  • Scale out nodes or partition services if needed.
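
Load information is also available from PowerShell. Note the assumption in this sketch: these cmdlets show the load metrics that the cluster tracks for placement and balancing, which is not necessarily the same as raw CPU or memory usage on the VM, so treat the output as a starting point.

```powershell
# Connect to the local development cluster. For a remote cluster, pass
# -ConnectionEndpoint and the appropriate credentials instead.
Connect-ServiceFabricCluster | Out-Null

# Cluster-wide view of each load metric.
Get-ServiceFabricClusterLoadInformation

# Per-node load for every node that is currently Up.
Get-ServiceFabricNode |
    Where-Object { $_.NodeStatus -eq 'Up' } |
    ForEach-Object { Get-ServiceFabricNodeLoadInformation -NodeName $_.NodeName }
```

A node whose load sits far above the others for the same metric is a hint that the service needs more partitions or the cluster needs more nodes.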

🛠️ Helpful PowerShell Commands

Useful for faster troubleshooting:

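# Connect first (with no parameters this targets the local cluster)
Connect-ServiceFabricCluster
Get-ServiceFabricClusterHealth
# -NodeName is required; _Node_0 is an example name
Get-ServiceFabricNodeHealth -NodeName _Node_0
Get-ServiceFabricApplicationHealth -ApplicationName fabric:/MyApp
Get-ServiceFabricServiceHealth -ServiceName fabric:/MyApp/MyService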
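
These commands can be chained to drill down from the cluster to the failing service, much as you would click through the SFX tree. The following is a rough sketch rather than a complete diagnostic script. It only reports entities whose aggregated health state is something other than Ok.

```powershell
# Connect to the local development cluster. For a remote cluster, pass
# -ConnectionEndpoint and the appropriate credentials instead.
Connect-ServiceFabricCluster | Out-Null

$cluster = Get-ServiceFabricClusterHealth
"Cluster health: $($cluster.AggregatedHealthState)"

# Nodes whose aggregated health is not Ok.
$cluster.NodeHealthStates |
    Where-Object { $_.AggregatedHealthState -ne 'Ok' } |
    Select-Object NodeName, AggregatedHealthState

# For each application that is not Ok, list its services that are not Ok.
$cluster.ApplicationHealthStates |
    Where-Object { $_.AggregatedHealthState -ne 'Ok' } |
    ForEach-Object {
        $app = Get-ServiceFabricApplicationHealth -ApplicationName $_.ApplicationName
        $app.ServiceHealthStates |
            Where-Object { $_.AggregatedHealthState -ne 'Ok' } |
            Select-Object ServiceName, AggregatedHealthState
    }
```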
    

💡 Did You Know?

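When a node goes down, Service Fabric automatically rebuilds the affected replicas and restarts service instances on the remaining healthy nodes, so a single failed node does not have to mean an outage!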

⚡ Common Diagnostic Mistakes and Solutions

  • Problem: No logs available.
    Solution: Ensure logging is enabled in ServiceManifest.xml and that the diagnostics extension is configured on the cluster's VMs in Azure.
  • Problem: Hard to pinpoint which service failed.
    Solution: Use the Service Fabric Explorer tree view to drill down from application to service, partition, and replica until you reach the failing entity.
  • Problem: ETW trace files grow too large.
    Solution: Use filters and targeted collection during production troubleshooting.

🚨 Best Practices for Troubleshooting

  • Always enable basic telemetry even for local clusters.
  • Start with SFX for a quick first look at the root cause before digging into ETW traces or detailed logs.
  • Set up Azure Monitor Alerts based on key health metrics.

✅ Self-Check Quiz

  • Which dashboard tool lets you see live cluster health?
  • Which method allows fine-grained tracing in Service Fabric?
  • Name two PowerShell commands useful for troubleshooting service health.