Two ways a container gets OOM-killed on Linux
I have hit container instances being OOM-killed in production a few times, so this post summarizes the two conditions that trigger the Linux OOM killer. My VM runs Ubuntu 24.04 with 4 GB of RAM, and both cases are reproduced on it below.
Case 1: The host runs out of physical memory
When the combined memory demand of all programs on a Linux host exceeds physical memory plus swap, the kernel's OOM killer has to kill some processes to free memory and keep the system running. The process with the highest oom_score is the one that gets killed.
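The score is exposed per process in /proc, so you can see who is currently most at risk. A quick sketch (plain /proc reads, nothing container-specific) that lists the five processes the OOM killer ranks highest:

# Higher oom_score = more likely to be killed first
for p in /proc/[0-9]*; do
    echo "$(cat $p/oom_score 2>/dev/null) $(cat $p/comm 2>/dev/null)"
done | sort -rn | head -5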
A program that allocates 100 MB of RSS (resident physical memory) every second:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

#define BLOCK_SIZE (100*1024*1024)

int main(int argc, char **argv)
{
	int i;
	char *p1;

	/* Allocate BLOCK_SIZE (100 MB) every second until the kernel kills us.
	 * Note: the command-line argument passed in the examples below is not
	 * read; allocation is unbounded.  The memset matters: malloc alone only
	 * reserves virtual address space, and without touching the pages the
	 * RSS would never grow. */
	for (i = 0; ; i++) {
		p1 = malloc(BLOCK_SIZE);
		if (p1 == NULL) {
			perror("malloc");
			return 1;
		}
		memset(p1, 0x00, BLOCK_SIZE);
		printf("set to %d Mbytes\n", i * 100);
		sleep(1);
	}
	return 0;
}
Dockerfile
FROM ubuntu:24.04
COPY ./mem-alloc/mem_alloc /
CMD ["/mem_alloc", "2000"]
Makefile
all: image

mem_alloc: mem-alloc/mem_alloc.c
	gcc -o mem-alloc/mem_alloc mem-alloc/mem_alloc.c

image: mem_alloc
	docker build -t registry/mem_alloc:v1 .

clean:
	rm -f mem-alloc/mem_alloc
	docker stop mem_alloc; docker rm mem_alloc; docker rmi registry/mem_alloc:v1
After running sudo make, start the container without specifying a memory limit, deliberately letting it eat all of the host's memory:
sudo docker run --privileged -it registry/mem_alloc:v1 /bin/bash
Inside the container, run:
root@4c81480d3366:/# echo 1 > /sys/fs/cgroup/memory.oom.group
root@96bd973a3a59:/# ./mem_alloc 1000
set to 0 Mbytes
set to 100 Mbytes
set to 200 Mbytes
set to 300 Mbytes
set to 400 Mbytes
set to 500 Mbytes
Watching the docker stats output, the container is killed once it has consumed all of the VM's memory.
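Two standard host-side tools are enough to watch it happen (run in a second terminal):

# per-container memory usage, refreshed live by docker
sudo docker stats
# or watch total available host memory shrink every second
watch -n1 free -m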
Run dmesg:
[86949.786621] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-96bd973a3a598d2fe5dfda4213cbb2ea028b0a4c3922fb50830daa12a9a4ecba.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-96bd973a3a598d2fe5dfda4213cbb2ea028b0a4c3922fb50830daa12a9a4ecba.scope,task=mem_alloc,pid=13545,uid=0
[86949.786647] Out of memory: Killed process 13545 (mem_alloc) total-vm:6556108kB, anon-rss:3656448kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:12968kB oom_score_adj:0
[86949.788874] Tasks in /system.slice/docker-96bd973a3a598d2fe5dfda4213cbb2ea028b0a4c3922fb50830daa12a9a4ecba.scope are going to be killed due to memory.oom.group set
[86949.788889] Out of memory: Killed process 13518 (bash) total-vm:4296kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:0
[86949.789235] Out of memory: Killed process 13545 (mem_alloc) total-vm:6556108kB, anon-rss:3656448kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:12968kB oom_score_adj:0
[86951.427580] docker0: port 1(veth1e6a35b) entered disabled state
[86951.427667] vethce0aa4c: renamed from eth0
[86951.445966] docker0: port 1(veth1e6a35b) entered disabled state
[86951.446626] veth1e6a35b (unregistering): left allmulticast mode
[86951.446705] veth1e6a35b (unregistering): left promiscuous mode
[86951.446707] docker0: port 1(veth1e6a35b) entered disabled state
This container was killed by the host because its RSS had grown to roughly 3.6 GB, exhausting all available physical memory on the 4 GB VM; note global_oom and CONSTRAINT_NONE in the log above.
K8s configuration best practice
In production, configure a container's memory limit equal to its request; don't overcommit. Once the host's physical memory runs short, some container is guaranteed to be killed by the host.
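A minimal sketch of enforcing this with kubectl ("myapp" is a placeholder deployment name):

# memory limit equals request: the scheduler reserves what the pod may actually use
kubectl set resources deployment myapp --requests=memory=2Gi --limits=memory=2Gi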
Case 2: A process in the container exceeds the cgroup limit
If the memory used by a container instance exceeds its cgroup (control group) limit, the kernel kills it directly.
Everything else stays the same; add the memory-limit flags -m 2000m --memory-swap 2000m to the container start command:
sudo docker run -m 2000m --memory-swap 2000m --privileged -it registry/mem_alloc:v1 /bin/bash
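Before running the test, you can verify the limit took effect; on this cgroup v2 setup it shows up inside the container as memory.max (2000m = 2097152000 bytes):

cat /sys/fs/cgroup/memory.max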
Inside the container, run:
root@4c81480d3366:/# echo 1 > /sys/fs/cgroup/memory.oom.group
root@96bd973a3a59:/# ./mem_alloc 1000
set to 0 Mbytes
set to 100 Mbytes
set to 200 Mbytes
set to 300 Mbytes
set to 400 Mbytes
set to 500 Mbytes
Watching the docker stats output, the container is killed once its memory usage hits the cgroup limit.
Inspect the kernel log with dmesg:
[88880.963934] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-38bdd956b75425c816afa4f5d3071bc4b175065cf405414c41a23ebf44e5f2fa.scope,mems_allowed=0,oom_memcg=/system.slice/docker-38bdd956b75425c816afa4f5d3071bc4b175065cf405414c41a23ebf44e5f2fa.scope,task_memcg=/system.slice/docker-38bdd956b75425c816afa4f5d3071bc4b175065cf405414c41a23ebf44e5f2fa.scope,task=mem_alloc,pid=13890,uid=0
[88880.963948] Memory cgroup out of memory: Killed process 13890 (mem_alloc) total-vm:2050332kB, anon-rss:2042752kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:4116kB oom_score_adj:0
[88880.967339] Tasks in /system.slice/docker-38bdd956b75425c816afa4f5d3071bc4b175065cf405414c41a23ebf44e5f2fa.scope are going to be killed due to memory.oom.group set
[88880.967351] Memory cgroup out of memory: Killed process 13859 (bash) total-vm:4296kB, anon-rss:384kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:48kB oom_score_adj:0
[88880.968337] Memory cgroup out of memory: Killed process 13890 (mem_alloc) total-vm:2050332kB, anon-rss:2042752kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:4116kB oom_score_adj:0
[88881.076584] docker0: port 1(veth0cb68cb) entered disabled state
[88881.076763] veth6ef9720: renamed from eth0
[88881.094585] docker0: port 1(veth0cb68cb) entered disabled state
[88881.096610] veth0cb68cb (unregistering): left allmulticast mode
[88881.096714] veth0cb68cb (unregistering): left promiscuous mode
[88881.096717] docker0: port 1(veth0cb68cb) entered disabled state
The kernel log keyword differs from case 1: this time it reads "Memory cgroup out of memory", and the constraint is CONSTRAINT_MEMCG rather than CONSTRAINT_NONE.
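That makes the two cases easy to tell apart after the fact with a one-liner against the kernel log:

# CONSTRAINT_NONE  -> global (host-level) OOM, case 1
# CONSTRAINT_MEMCG -> cgroup-limit OOM, case 2
sudo dmesg | grep 'oom-kill:constraint'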
JVM configuration best practice
A JVM process must keep its heap size under control: an oversized heap easily pushes the JVM's total memory (the heap plus metaspace, thread stacks, and other native allocations) past the container limit. Let the heap scale with the container's memory; 65% to 75% is a reasonable range. (The flags below are only supported from Java 8u191 onward; if they don't take effect, check your version.)
-XX:+UseContainerSupport
-XX:InitialRAMPercentage=70.0
-XX:MaxRAMPercentage=70.0
-XX:MinRAMPercentage=70.0
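To check what heap the JVM actually picked up, the standard -XX:+PrintFlagsFinal diagnostic works; run it inside the container with the same memory limit as production:

java -XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0 -XX:+PrintFlagsFinal -version | grep MaxHeapSize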
The memory.oom.group parameter explained
/sys/fs/cgroup/memory.oom.group
A read-write single value file which exists on non-root cgroups. The default value is “0”.
Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.
With this parameter set to 1, when memory runs out the whole cgroup is treated as one unit and every process in it is killed together. With it set to 0, only the process with the highest oom_score inside the cgroup is killed.
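You can read the current value back the same way it was set in the steps above; inside the container:

cat /sys/fs/cgroup/memory.oom.group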
References
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html