SCOM 2012 R2 - Agent Health Service constantly restarts
Some time ago I received an e-mail from a client who uses System Center Operations Manager to monitor application up-time from two locations and provide insights based on the data sent back by the agents. The health service on one of the agents was restarting every 15-30 minutes, and each restart left a 3-5 minute gap in collecting and aggregating the data from the agents.
The client also reported that the HealthServiceStore database on one of the agents was huge compared to the other one, which raised some eyebrows because the two agents were identical in terms of loaded management packs.
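For anyone who wants to compare the store size on two agents themselves, here is a rough sketch. It assumes the same registry key and folder layout that the scripts later in this post rely on; `Get-FileSizeMB` is a helper name I made up for the example, not a SCOM cmdlet.

```powershell
# Hypothetical helper: returns the size of a file in MB, rounded to two decimals.
function Get-FileSizeMB {
    param([Parameter(Mandatory)][string]$Path)
    # Length is in bytes; MB is easier to compare between agents.
    [math]::Round((Get-Item -LiteralPath $Path).Length / 1MB, 2)
}

# Assumption: the agent registers its install folder under this key
# (the same key the scripts further down read).
$setupKey = 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup\'
if (Test-Path -Path $setupKey) {
    $installDir = (Get-ItemProperty -Path $setupKey).InstallDirectory
    $storePath  = Join-Path $installDir 'Health Service State\Health Service Store\HealthServiceStore.edb'
    if (Test-Path -LiteralPath $storePath) {
        '{0}: {1} MB' -f $env:COMPUTERNAME, (Get-FileSizeMB -Path $storePath)
    }
}
```

Run it on each agent and compare the numbers; nothing is printed on a machine where the key or the file is missing.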
The obvious answer to the problem was to flush the cache on the agents.
    # Run from a machine that has the Operations Manager console / shell installed.
    Import-Module -Name OperationsManager
    $AgentName = Read-Host -Prompt 'Enter the agent name'
    # Get every agent instance known to the management group.
    $SCOMAgents = Get-SCOMClass -Name 'Microsoft.SystemCenter.Agent' | Get-SCOMClassInstance
    foreach ($Agent in $SCOMAgents) {
        # Match on the start of the display name, then run the flush task against that agent.
        if ($Agent.DisplayName -like "$AgentName*") {
            Get-SCOMTask 'Flush Health Service State and Cache' | Start-SCOMTask -Instance $Agent -Verbose
        }
    }
Flushing the cache caught me off guard: instead of solving the issue, it aggravated it. Once the operation was done, both agents started restarting, this time in sync, each time creating a 3-5 minute gap in data collection. At that point I had a feeling I knew what was causing the problem, but without more information the fix I had in mind would only have solved the issue artificially and delayed the inevitable.
So we waited a bit to collect more data on the problem, and after a while the HealthServiceStore database on the agent that had caused problems in the first place started growing again. That led me to believe that the management packs sent by the management server were not being loaded properly on the machine where the agent was installed.
Operations Manager has an automatic mechanism that defragments the HealthServiceStore every day, and that process should keep the agent’s database from growing indefinitely, but in this particular case that didn’t happen. Believing that the automatic defragmentation wasn’t working properly, I passed on the information needed to perform a manual offline defragmentation of the HealthServiceStore database and waited for a resolution.
    #requires -Version 1
    # Find the agent install folder from the registry.
    $AgentLocation = Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup\'
    $AgentInstallDirectory = $AgentLocation.InstallDirectory
    # The database must be offline, so stop the agent first.
    Stop-Service -Name 'Microsoft Monitoring Agent'
    Set-Location "$AgentInstallDirectory\Health Service State\Health Service Store\"
    # Replay the transaction logs (base name "edb"), then defragment the database file.
    esentutl.exe /r edb
    esentutl.exe /d HealthServiceStore.edb
    Start-Service -Name 'Microsoft Monitoring Agent'
After a while I received the reply. The offline defragmentation was done; the good part was that the agent had stopped restarting at random, but the bad part was that it still restarted every 30 minutes.
Since nothing had worked so far, I suggested running an ETL trace on the machine with the problematic agent to find out what was actually causing the problem. You may think this should have been the obvious step right after flushing the cache, but doing this over e-mail without any context doesn’t help at all.
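    # Find the agent install folder from the registry and switch to its Tools folder.
    $AgentLocation = Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Setup\'
    $AgentInstallDirectory = $AgentLocation.InstallDirectory
    Set-Location "$AgentInstallDirectory\Tools\"
    # Stop any running trace and clear out old trace files before starting fresh.
    .\StopTracing.cmd
    Remove-Item C:\Windows\Logs\OpsMgrTrace\* -Recurse -Force
    .\StartTracing.cmd
    # Let the trace run for two hours so it covers several restarts.
    Start-Sleep -Seconds 7200
    .\StopTracing.cmd
    # Convert the binary ETL files to readable logs and open the folder.
    .\FormatTracing.cmd
    Invoke-Item 'C:\Windows\Logs\OpsMgrTrace\'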
So the ETL tracing was done and I had the log file in front of me. So what did I see?
Well I saw that the log was filled with:
[SecureStorageManager] [] [Error] :CSecureStorageManager::resolveReference{SecureStorageManager_cpp5746}Unable to get type of the storage Id: WINERROR=80FF003F
That error can mean access denied or that an operation could not run, which is pretty generic and doesn’t help much, but further down in the log something caught my eye:
[ModulesCrimson] [] [Error] :CrimsonSubscriber::ReadEvent{CrimsonSubscriber_cpp679}Error retrieving event, EvtNext failed with 122(ERROR_INSUFFICIENT_BUFFER)
[ModulesCrimson] [] [Error] :CrimsonSubscriber::OnWorkCallback{CrimsonSubscriber_cpp898}`ReadEvent()` failed, ignoring ERROR: {hr= 0x8007007a(ERROR_INSUFFICIENT_BUFFER)}
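As a side note, the `122` in the first line and the `0x8007007a` in the second are the same error in two notations. A small sketch of the decoding, assuming nothing beyond the standard HRESULT layout (low 16 bits are the error code, the next 11 bits are the facility, and facility 7 means a wrapped Win32 error); the variable names are mine:

```powershell
# HRESULT copied from the OnWorkCallback trace line.
$hr = 0x8007007a

# Low 16 bits: the error code. Bits 16-26: the facility (7 = FACILITY_WIN32).
$win32Code = $hr -band 0xFFFF
$facility  = ($hr -shr 16) -band 0x7FF

"Facility: $facility, Win32 error: $win32Code"
# Facility: 7, Win32 error: 122
```

So the HRESULT is simply Win32 error 122, ERROR_INSUFFICIENT_BUFFER, the same code that `EvtNext` reported one line earlier.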
And at that point I knew how to actually solve the problem. Remember that earlier in the post I mentioned a fix that would only solve the problem artificially? Well, in this case it was exactly what was needed.
SCOM has an auto-recovery mechanism that restarts the agent once its private bytes reach or pass 300MB or when the handle count passes 6000. Depending on the situation, once the agent hits one or both limits it restarts itself in order to flush the memory / handles.
Now that I had all the information I needed, I asked the client to check the state history of the Monitoring Host Private Bytes Threshold monitor and, if there was any red in there, to override the Private Bytes Threshold from 300MB to something higher, like 700MB, and monitor the situation.
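If you want a quick spot check on the agent machine before digging through the console, something like the sketch below can help. It is only an approximation: the limits are the ones quoted above and may have been overridden in your management group, the process name `MonitoringHost` is my assumption based on the monitor's name, and `Test-AgentRestartLikely` is a made-up helper. The real monitor evaluates samples over time, not a single reading.

```powershell
# Limits as quoted in this post; treat them as assumptions for your environment.
$PrivateBytesLimit = 300MB   # 314572800 bytes
$HandleCountLimit  = 6000

# Hypothetical helper: true when either limit from the post is hit.
function Test-AgentRestartLikely {
    param(
        [long]$PrivateBytes,
        [int]$HandleCount
    )
    # Private bytes "reach or pass" the limit; the handle count has to "pass" it.
    return ($PrivateBytes -ge $PrivateBytesLimit) -or ($HandleCount -gt $HandleCountLimit)
}

# Spot check of the local MonitoringHost processes, if any are running.
Get-Process -Name MonitoringHost -ErrorAction SilentlyContinue | ForEach-Object {
    [pscustomobject]@{
        Id            = $_.Id
        PrivateMB     = [math]::Round($_.PrivateMemorySize64 / 1MB, 1)
        Handles       = $_.HandleCount
        RestartLikely = Test-AgentRestartLikely -PrivateBytes $_.PrivateMemorySize64 -HandleCount $_.HandleCount
    }
}
```

A process that sits close to either limit for a long time is a good hint that the restarts come from the recovery and not from a crash.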
In this example I have it raised to 1 GB but for 700MB the override value would be 734003200.
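The override value is just the size expressed in bytes, and PowerShell can do the arithmetic for you because its `MB` and `GB` suffixes are binary multiples (1MB = 1024 * 1024 bytes). A minimal sketch, with variable names of my own choosing:

```powershell
# Size literals expand to bytes: 1MB = 1048576, 1GB = 1073741824.
$defaultThreshold = 300MB
$overrideFor700MB = 700MB
$overrideFor1GB   = 1GB

"Default (300 MB): $defaultThreshold"   # Default (300 MB): 314572800
"Override 700 MB:  $overrideFor700MB"   # Override 700 MB:  734003200
"Override 1 GB:    $overrideFor1GB"     # Override 1 GB:    1073741824
```

Any other size works the same way, so there is no need to calculate the number by hand.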
After a couple of days, the client replied back telling me that the problem was solved and everything is running perfectly. Success! Problem solved.
Have a good one!