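Android Watchdog
1: Concept
The Android Watchdog is the mechanism the Android system uses to monitor the running state of its critical system services. Its core goal is to detect when a system service has stopped responding for a long time because of a deadlock, a blocked call, or some other fault, and to trigger system recovery (such as a restart) when necessary.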
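2: Core Functions
2.1 Service state monitoring
The Watchdog periodically checks whether critical system services (such as ActivityManager and WindowManager) still respond normally, so that a blocked service cannot freeze the whole system.
2.2 Timeout handling
If a service does not answer its "heartbeat" (the monitor call) within the allowed time, the Watchdog treats it as timed out and starts the follow-up handling.
2.3 Log collection and debugging
When a timeout occurs, the Watchdog automatically collects stack information (including the call stacks of all threads) to help locate the root cause.
2.4 System recovery
On a severe timeout, the Watchdog may forcibly restart the system process (system_server) or the whole device, so that the user is not left facing an unresponsive screen for a long time.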
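The stack collection described in 2.3 can be illustrated in plain Java. This is only a minimal sketch: it uses the standard Thread.getAllStackTraces() call and a hypothetical class name StackDumpSketch, and it is not how the Watchdog itself writes its trace files.

```java
import java.util.Map;

final class StackDumpSketch {
    /** Returns a text dump of the stack of every live thread in this VM. */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and current state (RUNNABLE, BLOCKED, ...)
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

A thread that is stuck on a lock typically shows up in such a dump in the BLOCKED state, with the frame that is waiting for the lock at the top of its stack.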
3: Implementation
3.1 Startup
Like the system services, the Watchdog is started by SystemServer. The startup logic is as follows:
frameworks/base/services/java/com/android/server/SystemServer.java
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
    ...
    // Create the Watchdog instance and start its thread
    t.traceBegin("StartWatchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.start();
    mDumper.addDumpable(watchdog);
    t.traceEnd();
    ...
    // Initialize the Watchdog
    t.traceBegin("InitWatchdog");
    watchdog.init(mSystemContext, mActivityManagerService);
    t.traceEnd();
    ...
}
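3.2 Initialization
- The Watchdog itself runs on a thread named "watchdog".
- A background thread named "watchdog.monitor" handles the tasks sent to it through a Handler.
- A HandlerChecker wraps a Handler bound to the Looper of that ServiceThread. The Handler posts messages and tasks to the ServiceThread for execution, which is how the responsiveness of the thread is checked.
- Critical system threads such as "monitor thread", "foreground thread", "main thread", "ui thread", "i/o thread", "display thread", "animation thread", and "surface animation thread" are added to the monitored set.
- The monitor for Binder threads is initialized.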
frameworks/base/services/core/java/com/android/server/Watchdog.java
private Watchdog() {
    // Create a thread named "watchdog"
    mThread = new Thread(this::run, "watchdog");
    ...
    // Start a background thread named "watchdog.monitor" that handles the tasks sent through a Handler
    ServiceThread t = new ServiceThread("watchdog.monitor",
            android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
    t.start();
    // Create a Handler bound to the ServiceThread's Looper. It posts messages and tasks
    // to the ServiceThread, which is how the thread's responsiveness is checked
    mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread", mLock);
    // Register the checkers for "monitor thread", "foreground thread", "main thread", "ui thread",
    // "i/o thread", "display thread", "animation thread", "surface animation thread", etc.
    // Only the monitor checker is shown here; the others are in the elided code
    mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
    ...
    // Initialize the monitor for Binder threads
    addMonitor(new BinderThreadMonitor());
    ...
}
public void start() {
    // Start the "watchdog" thread
    mThread.start();
}
public void init(Context context, ActivityManagerService activity) {
    mActivity = activity;
    // Register a receiver for the reboot broadcast (Intent.ACTION_REBOOT)
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
    ...
}
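The idea behind HandlerChecker, posting a task to the monitored thread's queue and seeing whether it ever runs, can be sketched without Android classes. The example below is only an illustration under stated assumptions: a single-thread ExecutorService stands in for a Looper thread, and QueueProbeSketch and isResponsive are hypothetical names that do not exist in the framework.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueProbeSketch {
    /** Posts an empty task to the worker and reports whether it ran within timeoutMillis. */
    static boolean isResponsive(ExecutorService worker, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        // The probe queues behind whatever the worker is currently doing
        worker.execute(done::countDown);
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        System.out.println("idle worker responsive: " + isResponsive(worker, 1000));

        // Simulate a stuck thread: this task blocks until the latch is released
        CountDownLatch release = new CountDownLatch(1);
        worker.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("blocked worker responsive: " + isResponsive(worker, 200));

        release.countDown();
        worker.shutdown();
    }
}
```

The probe carries no work of its own. Its only purpose is to prove that the queue is still being drained, which is the same reason the Watchdog posts its checker to each monitored thread.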
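3.3 Periodic checks
- By default the check interval is 15s and the maximum wait time is 60s.
- Different states represent different wait times: WAITING (waited less than 15s), WAITED_UNTIL_PRE_WATCHDOG (waited 15s to 60s), OVERDUE (waited more than 60s), and COMPLETED (the check has finished).
- The first time a checker is found past the 15s mark (a wait of roughly 15s to 30s), the Watchdog collects stack traces but does not restart. Once the wait exceeds 60s, it collects stack traces and restarts immediately.
- Detection method: the Watchdog calls the monitor function of each monitored object on the monitored thread and judges the timeout by how long the call takes to return.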
private void run() {
    boolean waitedHalf = false;
    while (true) {
        List<HandlerChecker> blockedCheckers = Collections.emptyList();
        ...
        boolean doWaitedPreDump = false;
        // Watchdog timeout (60s by default)
        final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
        // Watchdog check interval (15s by default)
        final long checkIntervalMillis = watchdogTimeoutMillis / PRE_WATCHDOG_TIMEOUT_RATIO;
        ...
        synchronized (mLock) {
            long sfHangTime;
            long timeout = checkIntervalMillis;
            ...
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
                // Schedule a check on each HandlerChecker. This also moves the pending
                // monitors from mMonitorQueue into mMonitors and clears mMonitorQueue
                hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
                        .orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
            }
            ...
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                ...
                try {
                    // Wait for one check interval (15s)
                    mLock.wait(timeout);
                    // Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                ...
                timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
            }
            ...
            if (sfHangTime > TIME_SF_WAIT * 2) {
                ...
            } else {
                // Derive a state from how long each HandlerChecker has been waiting:
                // WAITING (less than 15s), WAITED_UNTIL_PRE_WATCHDOG (15s to 60s),
                // OVERDUE (more than 60s), COMPLETED (finished)
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // Every monitored thread answered its heartbeat in time:
                    // reset waitedHalf and go to the next iteration
                    ...
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // A monitored thread has not answered yet but has waited less than 15s:
                    // keep waiting, nothing else to do in this iteration
                    continue;
                } else if (waitState == WAITED_UNTIL_PRE_WATCHDOG) {
                    // A monitored thread has been silent for 15s to 60s
                    if (!waitedHalf) {
                        // First time this state is seen (a wait of roughly 15s to 30s):
                        // prepare a pre-watchdog dump
                        Slog.i(TAG, "WAITED_UNTIL_PRE_WATCHDOG");
                        waitedHalf = true;
                        // Get the blocked checkers
                        blockedCheckers = getCheckersWithStateLocked(WAITED_UNTIL_PRE_WATCHDOG);
                        subject = describeCheckersLocked(blockedCheckers);
                        pids = new ArrayList<>(mInterestingJavaPids);
                        doWaitedPreDump = true;
                    } else {
                        // Already dumped for this stall (a wait of roughly 30s to 60s): keep waiting
                        continue;
                    }
                } else {
                    // A monitored thread has not answered within 60s
                    // something is overdue!
                    blockedCheckers = getCheckersWithStateLocked(OVERDUE);
                    subject = describeCheckersLocked(blockedCheckers);
                    allowRestart = mAllowRestart;
                    pids = new ArrayList<>(mInterestingJavaPids);
                }
            }
        } // END synchronized (mLock)
        // Write the stack traces to the log
        logWatchog(doWaitedPreDump, subject, pids);
        if (doWaitedPreDump) {
            // This was only the pre-watchdog dump: go to the next iteration without restarting
            continue;
        }
        ...
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            // Diagnose the blocked checkers and log the details
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            ...
            exceptionHang.WDTMatterJava(330);
            if (mSfHang) {
                ...
            } else {
                // Kill the current process, i.e. the process that hosts the Watchdog thread (system_server)
                Process.killProcess(Process.myPid());
            }
            // Terminate the currently running VM
            System.exit(10);
        }
        waitedHalf = false;
    }
}
public static class HandlerChecker implements Runnable {
    public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
        mWaitMaxMillis = handlerCheckerTimeoutMillis;
        if (mCompleted) {
            // Move the monitors from mMonitorQueue into mMonitors, then clear mMonitorQueue
            mMonitors.addAll(mMonitorQueue);
            mMonitorQueue.clear();
        }
        ...
    }
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (mLock) {
                mCurrentMonitor = mMonitors.get(i);
            }
            // Check the monitored object. If what it guards is stuck, this call blocks as well
            mCurrentMonitor.monitor();
        }
        synchronized (mLock) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}
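The mapping from wait time to state described in this section can be sketched as a small pure function. This is an illustration under stated assumptions, not the framework source: the class CheckerStateSketch and its methods are hypothetical, and the ratio of 4 is taken from the 60s timeout and 15s interval quoted above.

```java
final class CheckerStateSketch {
    /** States ordered from least to most severe. */
    enum State { COMPLETED, WAITING, WAITED_UNTIL_PRE_WATCHDOG, OVERDUE }

    // Assumption: 60s timeout / 15s check interval
    static final int PRE_WATCHDOG_TIMEOUT_RATIO = 4;

    /** Classifies a single checker by how long it has been waiting. */
    static State classify(boolean completed, long waitedMillis, long maxWaitMillis) {
        if (completed) {
            return State.COMPLETED;
        }
        if (waitedMillis < maxWaitMillis / PRE_WATCHDOG_TIMEOUT_RATIO) {
            return State.WAITING;
        }
        if (waitedMillis < maxWaitMillis) {
            return State.WAITED_UNTIL_PRE_WATCHDOG;
        }
        return State.OVERDUE;
    }

    /** The overall state is the most severe state among all checkers. */
    static State worst(State... states) {
        State result = State.COMPLETED;
        for (State s : states) {
            if (s.ordinal() > result.ordinal()) {
                result = s;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long max = 60_000;
        System.out.println(classify(false, 10_000, max)); // WAITING
        System.out.println(classify(false, 20_000, max)); // WAITED_UNTIL_PRE_WATCHDOG
        System.out.println(classify(false, 70_000, max)); // OVERDUE
        System.out.println(worst(State.COMPLETED, State.WAITING)); // WAITING
    }
}
```

Taking the most severe state means one stuck thread is enough to move the whole Watchdog toward a dump or a restart, even when every other thread is healthy.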
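4: Summary
4.1 How to add a specific thread to Watchdog monitoring
Answer: follow the approach used by ActivityManagerService (AMS).
4.1.1 Implement the Watchdog.Monitor interface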
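public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback, ActivityManagerGlobalLock {
    public void monitor() {
        // Acquire and immediately release the AMS lock. This call blocks for as long as another thread holds the lock
        synchronized (this) { }
    }
}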
4.1.2 Register itself with the Watchdog during initialization
public ActivityManagerService(Context systemContext, ActivityTaskManagerService atm) {
    ...
    // AMS registers itself as a monitor and adds its handler thread to the Watchdog
    Watchdog.getInstance().addMonitor(this);
    Watchdog.getInstance().addThread(mHandler);
    ...
}
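Why an empty synchronized block is enough for monitor() can be shown with a self-contained sketch. Everything here is hypothetical and exists only for illustration: the local Monitor interface stands in for Watchdog.Monitor, and DemoService and checkOnce are not framework names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockMonitorSketch {
    /** Local stand-in for Watchdog.Monitor (hypothetical). */
    interface Monitor {
        void monitor();
    }

    /** A service whose monitor() returns only once its own lock is free. */
    static final class DemoService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { }
        }
    }

    /** Runs monitor() on a helper thread; true if it returned within timeoutMillis. */
    static boolean checkOnce(Monitor target, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            target.monitor();
            done.countDown();
        }, "checker");
        checker.setDaemon(true); // do not keep the VM alive if the lock never frees up
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DemoService service = new DemoService();
        System.out.println("lock free, check passed: " + checkOnce(service, 1000));
        synchronized (service) {
            // Simulates a thread that is stuck while holding the service lock
            System.out.println("lock held, check passed: " + checkOnce(service, 200));
        }
    }
}
```

The check needs no knowledge of what the service is doing. If any thread holds the lock for too long, whether through a deadlock or a slow call made under the lock, monitor() simply fails to return in time.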
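4.2 Advantages
4.2.1 Avoiding false kills
The Watchdog distinguishes mild timeouts from severe ones, so a short burst of high load does not cause an unnecessary restart.
4.2.2 Performance overhead
The check interval (15 seconds by default) balances detection latency against resource consumption.
4.2.3 Deadlock detection
By checking the heartbeats of multiple services together, the Watchdog can reveal deadlocks that span services.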