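Q&A: Analyzing Intermittent Reboots of the EMC VMAX3 MMCS Console
Today a friend asked about an EMC VMAX 100K array on which the MMCS2 management console reboots intermittently. They had never been able to find the cause, so I spent a little time looking at it for the customer. The conclusion first: MMCS2 does reboot at irregular intervals, and every reboot logs the same message: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly. Because time was short, the root cause was not found, but I want to share the troubleshooting approach and some of the scripts used.
First, the customer's report was simply "intermittent reboots", with no quantitative description: when the reboots happen, whether there is a pattern, whether they started after some change, and so on. To pin this down, we use the following PowerShell command to check the Windows reboot history: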
C:\Users\mmcs>powershell
Windows PowerShell
Copyright (C) 2009 Microsoft Corporation. All rights reserved.
PS C:\Users\mmcs> Get-EventLog -LogName System | Where-Object {$_.EventID -eq 41} | Format-Table -AutoSize
Index Time EntryType Source InstanceID Message
----- ---- --------- ------ ---------- -------
11468 Mar 25 01:26 0 Microsoft-Windows-Kernel-Power 41 The s...
If the system has rebooted many times, many records will be listed here. The customer's output confirmed that many reboots had indeed occurred.
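To turn "intermittent" into something measurable, the gaps between consecutive Event ID 41 records can be listed. The following is only a minimal sketch: `Get-RebootInterval` is a hypothetical helper name, not something from the MMCS tooling, and it sticks to syntax that should work on the PowerShell 2.0 shown in the banner above.

```powershell
# Hypothetical helper (not part of the original article): list the gap between
# consecutive unexpected-reboot timestamps so any pattern becomes visible.
function Get-RebootInterval {
    param([datetime[]]$Times)

    # Sort oldest first so each gap is measured against the previous reboot
    $sorted = @($Times | Sort-Object)
    for ($i = 1; $i -lt $sorted.Count; $i++) {
        $gap = $sorted[$i] - $sorted[$i - 1]
        New-Object PSObject -Property @{
            Previous = $sorted[$i - 1]
            Current  = $sorted[$i]
            GapHours = [math]::Round($gap.TotalHours, 1)
        }
    }
}

# Live usage; Get-EventLog exists only in Windows PowerShell, hence the guard
if (Get-Command Get-EventLog -ErrorAction SilentlyContinue) {
    $times = @(Get-EventLog -LogName System -InstanceId 41 -ErrorAction SilentlyContinue |
        ForEach-Object { $_.TimeGenerated })
    if ($times.Count -gt 1) {
        Get-RebootInterval -Times $times | Format-Table Previous, Current, GapHours -AutoSize
    }
}
```

Roughly constant gaps would suggest a scheduled task or watchdog timer, while irregular gaps are more consistent with a hang or crash.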
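Next, we look at the detailed log for one of these records. The Message field should carry a fuller description of the error and usually mentions the reason for the reboot. The following command displays the complete message: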
PS C:\Users\mmcs> Get-EventLog -logname system -InstanceId 41 | format-list
Index : 11468
EntryType : 0
InstanceId : 41
Message : The system has rebooted without cleanly shutting down firs
t. This error could be caused if the system stopped respon
ding, crashed, or lost power unexpectedly.
Category : (63)
CategoryNumber : 63
ReplacementStrings : {0, 0x0, 0x0, 0x0...}
Source : Microsoft-Windows-Kernel-Power
TimeGenerated : 3/25/2025 1:26:33 AM
TimeWritten : 3/25/2025 1:26:33 AM
UserName : NT AUTHORITY\SYSTEM
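Focus on the Message field. In fact it does not reveal the problem: it only says that a reboot occurred and that the cause may be that the system stopped responding, crashed, or lost power.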
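The ReplacementStrings field in the same output can say a little more than the message. For Kernel-Power Event ID 41, the first element is, as far as I know, the BugcheckCode in decimal, and a value of 0 generally means no stop error was recorded. The sketch below assumes that layout; `ConvertTo-StopCode` is a hypothetical helper name introduced here for illustration.

```powershell
# Hypothetical helper: format the decimal BugcheckCode from a Kernel-Power 41
# event as the hexadecimal stop code used in bugcheck references.
function ConvertTo-StopCode {
    param([string]$BugcheckCode)
    # [long] avoids overflow for codes above 0x7FFFFFFF such as 0xC000021A
    '0x{0:X8}' -f [long]$BugcheckCode
}

# Live usage; Get-EventLog exists only in Windows PowerShell, hence the guard
if (Get-Command Get-EventLog -ErrorAction SilentlyContinue) {
    Get-EventLog -LogName System -InstanceId 41 -ErrorAction SilentlyContinue |
        ForEach-Object {
            # Assumption: element 0 of ReplacementStrings is BugcheckCode
            $code = $_.ReplacementStrings[0]
            New-Object PSObject -Property @{
                Time     = $_.TimeGenerated
                StopCode = ConvertTo-StopCode $code
            }
        } | Format-Table Time, StopCode -AutoSize
}
```

In the record shown above the first element is 0, so no stop code was captured for that reboot. That fits a hang or power interruption better than a blue screen, though it does not prove either.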
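So where should we look next? Since the reboot record itself yields nothing more of value, we work around it and look at what happened on the system just before the reboot.
Check the log entries preceding this error for anything useful: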
Get-EventLog -LogName System -Before '3/25/2025 1:26:33 AM' -Newest 100 | Format-Table TimeGenerated, EntryType, Source, EventID, Message -Wrap
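A long table of pre-reboot events is hard to scan, so it can help to count the non-informational entries per source in a window before the reboot. This is a rough sketch under stated assumptions: `Get-PreRebootSummary` and the 30-minute default are illustrative choices of mine, not anything prescribed by EMC or Microsoft.

```powershell
# Hypothetical helper: count non-informational events per Source in the
# window just before a reboot timestamp.
function Get-PreRebootSummary {
    param(
        [object[]]$Events,      # objects exposing TimeGenerated, EntryType, Source
        [datetime]$RebootTime,
        [int]$Minutes = 30      # assumed window size, adjust as needed
    )
    $start = $RebootTime.AddMinutes(-$Minutes)
    $Events |
        Where-Object { $_.TimeGenerated -ge $start -and $_.TimeGenerated -lt $RebootTime } |
        Where-Object { "$($_.EntryType)" -ne 'Information' } |
        Group-Object Source |
        Sort-Object Count -Descending |
        Select-Object Count, Name
}

# Live usage; Get-EventLog exists only in Windows PowerShell, hence the guard
if (Get-Command Get-EventLog -ErrorAction SilentlyContinue) {
    $reboot = [datetime]'3/25/2025 1:26:33 AM'
    $events = Get-EventLog -LogName System -After ($reboot.AddMinutes(-30)) -Before $reboot -ErrorAction SilentlyContinue
    Get-PreRebootSummary -Events $events -RebootTime $reboot | Format-Table -AutoSize
}
```

A source that shows up before several different reboots is a better lead than one that appears only once.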
Below is a PowerShell script for analyzing reboots, shared here for everyone:
# Save this script as Diagnose-Reboot.ps1 and run in PowerShell as Administrator
$report = @()
$now = Get-Date
$header = "===== Windows Unexpected Reboot Diagnostic Report =====`nGenerated at: $now`n===================================================="
$report += $header
# 1. Recent Kernel-Power (Event ID 41) errors
$report += "`n[1] Recent Kernel-Power (Event ID 41) errors:"
$report += Get-EventLog -LogName System -InstanceId 41 -Newest 5 |
Select-Object TimeGenerated, Message | Format-List | Out-String
# 2. Recent blue screens or system errors
$report += "`n[2] Recent blue screen or system errors:"
$report += Get-EventLog -LogName System -InstanceId 1001 -Newest 3 |
Select-Object TimeGenerated, Message | Format-List | Out-String
# 3. System events within 5 minutes before last Kernel-Power event
$report += "`n[3] System events within 5 minutes before last Kernel-Power event (up to 20 entries):"
$lastKP = Get-EventLog -LogName System -InstanceId 41 -Newest 1 -ErrorAction SilentlyContinue
if ($lastKP) {
    # Only query the window when a Kernel-Power 41 event actually exists
    $cutoff = $lastKP.TimeGenerated.AddMinutes(-5)
    $report += Get-EventLog -LogName System -After $cutoff -Before $lastKP.TimeGenerated -Newest 20 -ErrorAction SilentlyContinue |
        Select-Object TimeGenerated, Source, EventID, EntryType, Message | Format-List | Out-String
} else {
    $report += "No Kernel-Power (Event ID 41) events found."
}
# 4. Recent disk-related errors (Source: Disk)
$report += "`n[4] Recent disk-related errors (Source: Disk):"
$report += Get-EventLog -LogName System -Source Disk -Newest 5 -ErrorAction SilentlyContinue |
    Select-Object TimeGenerated, Message | Format-List | Out-String
# 5. Current power plan configuration
$report += "`n[5] Power plan configuration (current plan):"
$report += powercfg /query | Out-String
# 6. System file check suggestion
$report += "`n[6] Suggested command: Run 'sfc /scannow' to check system file integrity"
# 7. Memory diagnostic suggestion
$report += "`n[7] Suggested tool: Run 'mdsched.exe' to perform Windows Memory Diagnostic"
# Output report
$path = "$env:USERPROFILE\Desktop\Reboot-Diag-Report.txt"
$report | Out-File -Encoding UTF8 -FilePath $path
Write-Output "Report generated: $path"
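Running this script in PowerShell automatically creates a file named Reboot-Diag-Report.txt on the desktop, collecting the useful information in one document for easier review. You can also send it to us for help with the analysis; add us on WeChat (vx): StorageExpert.
In closing: intermittent reboots of an EMC VMAX MMCS come down to either software or hardware. The checks above found no clear hardware problem, and the message says only: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly. First, lost power can be ruled out, which leaves the system having stopped responding or crashed. Given that this is an MMCS, one can speculate that because the system had been cracked, some background processes are misbehaving and causing the "stopped responding" or "crashed" conditions. Loss of communication with the peer MMCS, or long timeouts, could also cause the Windows system to reboot. For further analysis, start from the service crash or application crash events. Of course, this takes a great deal of time and effort. The simplest fix is to replace the MMCS outright, which resolves hardware and software problems together.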
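As a starting point for the service crash and application crash events mentioned above, the sketch below filters events by source and event ID. The signature table reflects the commonly documented Windows IDs (Service Control Manager 7031 and 7034, Application Error 1000, Application Hang 1002). `Select-CrashEvent` and the 2000-entry limit are illustrative assumptions of mine, and the MMCS may log relevant failures under other sources.

```powershell
# Assumed signatures of common crash records (Source -> Event IDs)
$crashSignatures = @{
    'Service Control Manager' = 7031, 7034   # service terminated unexpectedly
    'Application Error'       = 1000         # application crash
    'Application Hang'        = 1002         # application stopped responding
}

# Hypothetical helper: keep only events whose Source and EventID match a signature
function Select-CrashEvent {
    param(
        [object[]]$Events,
        [hashtable]$Signatures = $crashSignatures
    )
    $Events | Where-Object {
        $src = "$($_.Source)"
        $Signatures.ContainsKey($src) -and ($Signatures[$src] -contains $_.EventID)
    }
}

# Live usage; Get-EventLog exists only in Windows PowerShell, hence the guard
if (Get-Command Get-EventLog -ErrorAction SilentlyContinue) {
    $all = @(Get-EventLog -LogName System -Newest 2000 -ErrorAction SilentlyContinue) +
           @(Get-EventLog -LogName Application -Newest 2000 -ErrorAction SilentlyContinue)
    Select-CrashEvent -Events $all | Sort-Object TimeGenerated |
        Format-Table TimeGenerated, Source, EventID, Message -Wrap
}
```

Comparing the timestamps of these events with the Event ID 41 records shows whether a particular service or application tends to fail shortly before each reboot.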