* Re: how to make memory.memsw.failcnt is nonzero [not found] <4EFADFF8.5020703@cn.fujitsu.com> @ 2012-01-03 16:04 ` Michal Hocko 2012-01-06 9:47 ` Peng Haitao 2012-01-30 2:34 ` Peng Haitao 0 siblings, 2 replies; 7+ messages in thread From: Michal Hocko @ 2012-01-03 16:04 UTC (permalink / raw) To: Peng Haitao; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML [Let's add some people to the CC list] Hi, sorry for the late reply (some vacation and holiday) On Wed 28-12-11 17:23:04, Peng Haitao wrote: > > memory.memsw.failcnt shows the number of memory+Swap hits limits. > So I think when memory+swap usage is equal to limit, memsw.failcnt should be nonzero. > > I test as follows: > > # uname -a > Linux K-test 3.2.0-rc7-17-g371de6e #2 SMP Wed Dec 28 12:02:52 CST 2011 x86_64 x86_64 x86_64 GNU/Linux > # mkdir /cgroup/memory/group > # cd /cgroup/memory/group/ > # echo 10M > memory.limit_in_bytes > # echo 10M > memory.memsw.limit_in_bytes > # echo $$ > tasks > # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > Killed > # cat memory.memsw.failcnt > 0 > # grep "failcnt" /var/log/messages | tail -2 > Dec 28 17:05:52 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 21 > Dec 28 17:05:52 K-test kernel: memory+swap: usage 10240kB, limit 10240kB, failcnt 0 > > memory+swap usage is equal to limit, but memsw.failcnt is zero. > > I change memory.memsw.limit_in_bytes to 15M. > > # echo 15M > memory.memsw.limit_in_bytes > # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > Killed > # grep "failcnt" /var/log/messages | tail -2 > Dec 28 17:08:45 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 86 > Dec 28 17:08:45 K-test kernel: memory+swap: usage 10240kB, limit 15360kB, failcnt 0 > # cat memory.memsw.failcnt > 0 > > The limit is 15M, but memory+swap usage also is 10M. > I think memory+swap usage should be 15M and memsw.failcnt should be nonzero. > > This is a kernel bug or I misunderstand memory+swap? 
Well, we might end up with memory.failcnt > 0 while memory.memsw.failcnt == 0
quite easily, and that would happen in cases like the one you describe above,
when there is heavy page cache pressure on the hard limit without much
anonymous memory, so there is not much that could be swapped out.

Please note that memsw.limit_in_bytes is triggered only if we have
consumed some swap space already (and the feature is in fact primarily
intended to stop extensive swap usage).

It goes like this: If we trigger the hard limit (memory.limit_in_bytes) then
we start direct reclaim (with swap available). If we trigger the memsw
limit then we try to reclaim without swap available. We will OOM if we
cannot reclaim enough to satisfy the respective limit.

The other part of the answer is: yes, there is something wrong going
on here, because we definitely shouldn't OOM. The workload is
single-threaded and we have plenty of page cache that could be reclaimed
easily. On the other hand we end up with:

# echo $$ > tasks
/dev/memctl/a# echo 10M > memory.limit_in_bytes
/dev/memctl/a# echo 10M > memory.memsw.limit_in_bytes
/dev/memctl/a# dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M
Killed
/dev/memctl/a# cat memory.stat
cache 9265152
rss 143360
mapped_file 0
pgpgin 3352
pgpgout 1055
swap 0
pgfault 798
pgmajfault 1
inactive_anon 12288
active_anon 114688
inactive_file 9261056
active_file 4096
unevictable 0
[...]

So there is almost 10M of page cache that we can simply reclaim. If we
use a 40M limit then we are OK. So it looks like the small limit somehow
tricks our math in the reclaim path and we think there is nothing to
reclaim.
I will look into this.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

^ permalink raw reply [flat|nested] 7+ messages in thread
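The reclaim rules described in the message above can be sketched as a toy model in C. This is only an illustration, not kernel code: the names group_state, limit_hit, reclaimable() and would_oom() are made up for the example, and the model assumes that all page cache is reclaimable and that anonymous memory can be swapped out only while free swap space remains.

```c
#include <stdbool.h>
#include <stddef.h>

/* Which limit the failed charge ran into (illustrative names). */
enum limit_hit { HIT_HARD_LIMIT, HIT_MEMSW_LIMIT };

/* Simplified view of one memory cgroup, all values in bytes. */
struct group_state {
	size_t file_cache;	/* page cache, assumed fully reclaimable */
	size_t anon;		/* anonymous memory */
	size_t swap_free;	/* swap space still available */
};

/* Upper bound on what reclaim could free for this kind of limit hit. */
size_t reclaimable(const struct group_state *g, enum limit_hit hit)
{
	size_t total = g->file_cache;

	if (hit == HIT_HARD_LIMIT) {
		/* Swapping helps here: anon pages leave memory for swap. */
		total += g->anon < g->swap_free ? g->anon : g->swap_free;
	}
	/*
	 * For the memsw limit, swapping only moves pages from "memory"
	 * to "swap" and leaves memory+swap unchanged, so reclaim runs
	 * without swap and only the page cache counts.
	 */
	return total;
}

/* In this model, OOM happens only if reclaim cannot cover the request. */
bool would_oom(const struct group_state *g, enum limit_hit hit, size_t need)
{
	return reclaimable(g, hit) < need;
}
```

Fed with the memory.stat values quoted above (cache 9265152, rss 143360, no swap in use), would_oom() for a single 4096-byte page returns false in this model, which is consistent with the expectation that the dd workload should not have been OOM-killed.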
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-03 16:04 ` how to make memory.memsw.failcnt is nonzero Michal Hocko @ 2012-01-06 9:47 ` Peng Haitao 2012-01-06 10:12 ` Michal Hocko 2012-01-30 2:34 ` Peng Haitao 1 sibling, 1 reply; 7+ messages in thread From: Peng Haitao @ 2012-01-06 9:47 UTC (permalink / raw) To: Michal Hocko; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML Michal Hocko said the following on 2012-1-4 0:04: >> # echo 15M > memory.memsw.limit_in_bytes >> # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M >> Killed >> # grep "failcnt" /var/log/messages | tail -2 >> Dec 28 17:08:45 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 86 >> Dec 28 17:08:45 K-test kernel: memory+swap: usage 10240kB, limit 15360kB, failcnt 0 >> # cat memory.memsw.failcnt >> 0 >> >> The limit is 15M, but memory+swap usage also is 10M. >> I think memory+swap usage should be 15M and memsw.failcnt should be nonzero. >> > So there is almost 10M of page cache that we can simply reclaim. If we > use 40M limit then we are OK. So this looks like the small limit somehow > tricks our math in the reclaim path and we think there is nothing to > reclaim. > I will look into this. Thanks for you reply. If there is something wrong, I think the bug will be in mem_cgroup_do_charge() of mm/memcontrol.c 2210 ret = res_counter_charge(&memcg->res, csize, &fail_res); 2211 2212 if (likely(!ret)) { 2213 if (!do_swap_account) 2214 return CHARGE_OK; 2215 ret = res_counter_charge(&memcg->memsw, csize, &fail_res); 2216 if (likely(!ret)) 2217 return CHARGE_OK; 2218 2219 res_counter_uncharge(&memcg->res, csize); 2220 mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); 2221 flags |= MEM_CGROUP_RECLAIM_NOSWAP; 2222 } else 2223 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); When hit memory.limit_in_bytes, res_counter_charge() will return -ENOMEM, this will execute line 2222: } else. 
But I think that when memory.limit_in_bytes is hit, the function should go on
to check memory.memsw.limit_in_bytes as well.
Is this reasoning OK?

-- 
Best Regards,
Peng

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-06 9:47 ` Peng Haitao @ 2012-01-06 10:12 ` Michal Hocko 2012-01-30 2:47 ` Peng Haitao 0 siblings, 1 reply; 7+ messages in thread From: Michal Hocko @ 2012-01-06 10:12 UTC (permalink / raw) To: Peng Haitao; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML On Fri 06-01-12 17:47:10, Peng Haitao wrote: > > Michal Hocko said the following on 2012-1-4 0:04: > >> # echo 15M > memory.memsw.limit_in_bytes > >> # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > >> Killed > >> # grep "failcnt" /var/log/messages | tail -2 > >> Dec 28 17:08:45 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 86 > >> Dec 28 17:08:45 K-test kernel: memory+swap: usage 10240kB, limit 15360kB, failcnt 0 > >> # cat memory.memsw.failcnt > >> 0 > >> > >> The limit is 15M, but memory+swap usage also is 10M. > >> I think memory+swap usage should be 15M and memsw.failcnt should be nonzero. > >> > > So there is almost 10M of page cache that we can simply reclaim. If we > > use 40M limit then we are OK. So this looks like the small limit somehow > > tricks our math in the reclaim path and we think there is nothing to > > reclaim. > > I will look into this. > > Thanks for you reply. > If there is something wrong, I think the bug will be in mem_cgroup_do_charge() > of mm/memcontrol.c > > 2210 ret = res_counter_charge(&memcg->res, csize, &fail_res); > 2211 > 2212 if (likely(!ret)) { > 2213 if (!do_swap_account) > 2214 return CHARGE_OK; > 2215 ret = res_counter_charge(&memcg->memsw, csize, &fail_res); > 2216 if (likely(!ret)) > 2217 return CHARGE_OK; > 2218 > 2219 res_counter_uncharge(&memcg->res, csize); > 2220 mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); > 2221 flags |= MEM_CGROUP_RECLAIM_NOSWAP; > 2222 } else > 2223 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); > > When hit memory.limit_in_bytes, res_counter_charge() will return -ENOMEM, > this will execute line 2222: } else. 
> But I think when hit memory.limit_in_bytes, the function should determine further
> to memory.memsw.limit_in_bytes.
> This think is OK?

I don't think so. We have an invariant (the hard limit is "stronger" than
the memsw limit):

	memory.limit_in_bytes <= memory.memsw.limit_in_bytes

so when we hit the hard limit we do not have to consider the memsw
resource counter, because:
a) we already have to do reclaim for the hard limit
b) we check whether we might swap out later on in
   mem_cgroup_hierarchical_reclaim (root_memcg->memsw_is_minimum), so we
   will not end up swapping just to make the hard limit OK and then go
   over the memsw limit.

Please also note that we will retry charging after reclaim if there is a
chance to meet the limit.
Makes sense?
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

^ permalink raw reply [flat|nested] 7+ messages in thread
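The charge order discussed in this exchange can be modelled with two plain counters. The sketch below is a simplification written for illustration and is not the kernel's res_counter code: struct counter, struct toy_memcg, counter_charge(), counter_uncharge() and toy_charge() are hypothetical names, and reclaim and the retry after reclaim are left out entirely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a resource counter, values in bytes. */
struct counter {
	size_t usage;
	size_t limit;
	size_t failcnt;
};

/* Returns 0 on success; on failure bumps failcnt and returns -1. */
int counter_charge(struct counter *c, size_t n)
{
	if (c->usage + n > c->limit) {
		c->failcnt++;
		return -1;
	}
	c->usage += n;
	return 0;
}

void counter_uncharge(struct counter *c, size_t n)
{
	c->usage -= n;
}

enum charge_result { CHARGE_OK, OVER_HARD_LIMIT, OVER_MEMSW_LIMIT };

/* res models memory usage, memsw models memory+swap usage. */
struct toy_memcg {
	struct counter res;
	struct counter memsw;
};

enum charge_result toy_charge(struct toy_memcg *m, size_t n)
{
	/* The memory counter is charged first ... */
	if (counter_charge(&m->res, n) != 0)
		return OVER_HARD_LIMIT;	/* memsw is never touched here */

	/* ... and memory+swap only if that succeeded. */
	if (counter_charge(&m->memsw, n) == 0)
		return CHARGE_OK;

	/* Roll back the memory charge; the memsw limit was the one hit. */
	counter_uncharge(&m->res, n);
	return OVER_MEMSW_LIMIT;
}
```

With both limits at 10M and no swap in use, the two usage values move in lockstep, so the first counter always fails first and only its failcnt grows. The second failcnt can increase only when memory+swap usage is ahead of memory usage, that is, when some swap is already in use.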
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-06 10:12 ` Michal Hocko @ 2012-01-30 2:47 ` Peng Haitao 2012-01-30 7:24 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 7+ messages in thread From: Peng Haitao @ 2012-01-30 2:47 UTC (permalink / raw) To: Michal Hocko; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML Michal Hocko said the following on 2012-1-6 18:12: >> If there is something wrong, I think the bug will be in mem_cgroup_do_charge() >> of mm/memcontrol.c >> >> 2210 ret = res_counter_charge(&memcg->res, csize, &fail_res); >> 2211 >> 2212 if (likely(!ret)) { ... >> 2221 flags |= MEM_CGROUP_RECLAIM_NOSWAP; >> 2222 } else >> 2223 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); >> >> When hit memory.limit_in_bytes, res_counter_charge() will return -ENOMEM, >> this will execute line 2222: } else. >> But I think when hit memory.limit_in_bytes, the function should determine further >> to memory.memsw.limit_in_bytes. >> This think is OK? > > I don't think so. We have an invariant (hard limit is "stronger" than > memsw limit) memory.limit_in_bytes <= memory.memsw.limit_in_bytes so > when we hit the hard limit we do not have to consider memsw because > resource counter: > a) we already have to do reclaim for hard limit > b) we check whether we might swap out later on in > mem_cgroup_hierarchical_reclaim (root_memcg->memsw_is_minimum) so we > will not end up swapping just to make hard limit ok and go over memsw > limit. > > Please also note that we will retry charging after reclaim if there is a > chance to meet the limit. > Makes sense? Yeah. But I want to test memory.memsw.failcnt is nonzero, how steps? Thanks. -- Best Regards, Peng ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-30 2:47 ` Peng Haitao @ 2012-01-30 7:24 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 7+ messages in thread From: KAMEZAWA Hiroyuki @ 2012-01-30 7:24 UTC (permalink / raw) To: Peng Haitao; +Cc: Michal Hocko, cgroups, Johannes Weiner, linux-mm, LKML On Mon, 30 Jan 2012 10:47:33 +0800 Peng Haitao <penght@cn.fujitsu.com> wrote: > > Michal Hocko said the following on 2012-1-6 18:12: > >> If there is something wrong, I think the bug will be in mem_cgroup_do_charge() > >> of mm/memcontrol.c > >> > >> 2210 ret = res_counter_charge(&memcg->res, csize, &fail_res); > >> 2211 > >> 2212 if (likely(!ret)) { > ... > >> 2221 flags |= MEM_CGROUP_RECLAIM_NOSWAP; > >> 2222 } else > >> 2223 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); > >> > >> When hit memory.limit_in_bytes, res_counter_charge() will return -ENOMEM, > >> this will execute line 2222: } else. > >> But I think when hit memory.limit_in_bytes, the function should determine further > >> to memory.memsw.limit_in_bytes. > >> This think is OK? > > > > I don't think so. We have an invariant (hard limit is "stronger" than > > memsw limit) memory.limit_in_bytes <= memory.memsw.limit_in_bytes so > > when we hit the hard limit we do not have to consider memsw because > > resource counter: > > a) we already have to do reclaim for hard limit > > b) we check whether we might swap out later on in > > mem_cgroup_hierarchical_reclaim (root_memcg->memsw_is_minimum) so we > > will not end up swapping just to make hard limit ok and go over memsw > > limit. > > > > Please also note that we will retry charging after reclaim if there is a > > chance to meet the limit. > > Makes sense? > > Yeah. > > But I want to test memory.memsw.failcnt is nonzero, how steps? > Thanks. > Here is a quick hacked test program. see below. A rough test. 
[root@bluextal memcg_test]# cgcreate -g memory:X
[root@bluextal memcg_test]# cgset -r memory.limit_in_bytes=200M X
[root@bluextal memcg_test]# cgset -r memory.memsw.limit_in_bytes=300M X
[root@bluextal memcg_test]# cgexec -g memory:X ./check 200 300
[root@bluextal memcg_test]# echo 0 > /cgroup/memory/X/memory.memsw.failcnt
[root@bluextal memcg_test]# cat /cgroup/memory/X/memory.memsw.failcnt
0
[root@bluextal memcg_test]# cgexec -g memory:X ./check 200 300
Killed    <---- OOM Killed.
[root@bluextal memcg_test]# cat /cgroup/memory/X/memory.memsw.failcnt
17        <---- memsw failcnt up.

The easy way is:
  1. allocate memory in Anon.
  2. kick out anon memory to swap as much as possible by file I/O. (*1)
  3. delete file cache by some way (I used unlink() here.)        (*2)
  4. allocate anon memory.

The important points are (*1) and (*2). See the program below.

You can prevent OOM (freeze-at-oom) by
[root@bluextal memcg_test]# cgset -r memory.oom_control=1 X

Here is the memory.stat at OOM.

[root@bluextal test]# cat /cgroup/memory/X/memory.stat
cache 0
rss 209666048
mapped_file 0
pgpgin 30567
pgpgout 72381
swap 104906752
<snip>
hierarchical_memory_limit 209715200
hierarchical_memsw_limit 314572800

rss+cache < memory.limit
rss+swap == memsw.limit.
==
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	char filename[] = "./tmpfile-for-test";
	unsigned long mem_size, memsw_size, file_size, len;
	int fd;
	char *addr, *buf;

	if (argc < 3)
		return 1;
	mem_size = atoi(argv[1]);	/* memory limit, in MBytes */
	memsw_size = atoi(argv[2]);	/* memory+swap limit, in MBytes */

	if (memsw_size < 100)
		return 0;
	mem_size *= 1024 * 1024;
	memsw_size *= 1024 * 1024;
	memsw_size = memsw_size - 10 * 1024 * 1024; /* 10M Bytes of margin */

	addr = mmap(NULL, memsw_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return 1;
	/* allocate pages and cause swap out */
	memset(addr, 0, memsw_size);

	/* create file, this will make more swaps. */
	file_size = mem_size * 80 / 100;
	fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0644);
	buf = calloc(1, 1024 * 1024);
	if (fd < 0 || !buf)
		return 1;
	for (len = 0; len < file_size; len += 1024 * 1024)
		write(fd, buf, 1024 * 1024);

	/* read the file again, twice */
	lseek(fd, 0, SEEK_SET);
	for (len = 0; len < file_size; len += 1024 * 1024)
		read(fd, buf, 1024 * 1024);
	lseek(fd, 0, SEEK_SET);
	for (len = 0; len < file_size; len += 1024 * 1024)
		read(fd, buf, 1024 * 1024);

	/* drop the file cache, then allocate more anon memory */
	unlink(filename);
	addr = malloc(9 * 1024 * 1024);
	if (!addr)
		return 1;
	memset(addr, 0, 9 * 1024 * 1024);
	printf("done\n");
	sleep(100);
	return 0;
}

^ permalink raw reply [flat|nested] 7+ messages in thread
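The two relations stated under the memory.stat dump in this message (rss+cache < memory.limit, rss+swap == memsw.limit) can be checked with a few lines of arithmetic. The helper below is a small sketch for that purpose only; struct memcg_stat and the three function names are invented for the example, and the values are meant to be copied by hand from memory.stat rather than parsed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected memory.stat fields, in bytes (hypothetical container). */
struct memcg_stat {
	uint64_t cache;
	uint64_t rss;
	uint64_t swap;
	uint64_t mem_limit;	/* hierarchical_memory_limit */
	uint64_t memsw_limit;	/* hierarchical_memsw_limit */
};

/* True if memory usage (rss + cache) is still below the hard limit. */
bool under_hard_limit(const struct memcg_stat *s)
{
	return s->rss + s->cache < s->mem_limit;
}

/* True if memory+swap usage sits exactly at the memsw limit. */
bool at_memsw_limit(const struct memcg_stat *s)
{
	return s->rss + s->cache + s->swap == s->memsw_limit;
}

/* Bytes left before the hard limit; call only if under_hard_limit(). */
uint64_t hard_limit_headroom(const struct memcg_stat *s)
{
	return s->mem_limit - (s->rss + s->cache);
}
```

For the dump above, 209666048 + 104906752 = 314572800, which is exactly the 300M memsw limit, while rss + cache stays 49152 bytes below the 200M hard limit. That is the state in which a further charge fails on the memsw counter rather than on the memory counter.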
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-03 16:04 ` how to make memory.memsw.failcnt is nonzero Michal Hocko 2012-01-06 9:47 ` Peng Haitao @ 2012-01-30 2:34 ` Peng Haitao 2012-01-30 8:46 ` Michal Hocko 1 sibling, 1 reply; 7+ messages in thread From: Peng Haitao @ 2012-01-30 2:34 UTC (permalink / raw) To: Michal Hocko; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML Michal Hocko said the following on 2012-1-4 0:04: > On Wed 28-12-11 17:23:04, Peng Haitao wrote: >> >> memory.memsw.failcnt shows the number of memory+Swap hits limits. >> So I think when memory+swap usage is equal to limit, memsw.failcnt should be nonzero. >> >> I test as follows: >> >> # uname -a >> Linux K-test 3.2.0-rc7-17-g371de6e #2 SMP Wed Dec 28 12:02:52 CST 2011 x86_64 x86_64 x86_64 GNU/Linux >> # mkdir /cgroup/memory/group >> # cd /cgroup/memory/group/ >> # echo 10M > memory.limit_in_bytes >> # echo 10M > memory.memsw.limit_in_bytes >> # echo $$ > tasks >> # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M >> Killed >> # cat memory.memsw.failcnt >> 0 >> # grep "failcnt" /var/log/messages | tail -2 >> Dec 28 17:05:52 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 21 >> Dec 28 17:05:52 K-test kernel: memory+swap: usage 10240kB, limit 10240kB, failcnt 0 >> >> memory+swap usage is equal to limit, but memsw.failcnt is zero. >> > Please note that memsw.limit_in_bytes is triggered only if we have > consumed some swap space already (and the feature is primarily intended > to stop extensive swap usage in fact). > It goes like this: If we trigger hard limit (memory.limit_in_bytes) then > we start the direct reclaim (with swap available). If we trigger memsw > limit then we try to reclaim without swap available. We will OOM if we > cannot reclaim enough to satisfy the respective limit. > > The other part of the answer is, yes there is something wrong going > on her because we definitely shouldn't OOM. 
The workload is a single > threaded and we have a plenty of page cache that could be reclaimed > easily. On the other hand we end up with: > # echo $$ > tasks > /dev/memctl/a# echo 10M > memory.limit_in_bytes > /dev/memctl/a# echo 10M > memory.memsw.limit_in_bytes > /dev/memctl/a# dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > Killed > /dev/memctl/a# cat memory.stat > cache 9265152 > [...] > > So there is almost 10M of page cache that we can simply reclaim. If we > use 40M limit then we are OK. So this looks like the small limit somehow > tricks our math in the reclaim path and we think there is nothing to > reclaim. > I will look into this. Have any conclusion for this? Thanks. -- Best Regards, Peng ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: how to make memory.memsw.failcnt is nonzero 2012-01-30 2:34 ` Peng Haitao @ 2012-01-30 8:46 ` Michal Hocko 0 siblings, 0 replies; 7+ messages in thread From: Michal Hocko @ 2012-01-30 8:46 UTC (permalink / raw) To: Peng Haitao; +Cc: cgroups, kamezawa.hiroyu, Johannes Weiner, linux-mm, LKML On Mon 30-01-12 10:34:49, Peng Haitao wrote: > > Michal Hocko said the following on 2012-1-4 0:04: > > On Wed 28-12-11 17:23:04, Peng Haitao wrote: > >> > >> memory.memsw.failcnt shows the number of memory+Swap hits limits. > >> So I think when memory+swap usage is equal to limit, memsw.failcnt should be nonzero. > >> > >> I test as follows: > >> > >> # uname -a > >> Linux K-test 3.2.0-rc7-17-g371de6e #2 SMP Wed Dec 28 12:02:52 CST 2011 x86_64 x86_64 x86_64 GNU/Linux > >> # mkdir /cgroup/memory/group > >> # cd /cgroup/memory/group/ > >> # echo 10M > memory.limit_in_bytes > >> # echo 10M > memory.memsw.limit_in_bytes > >> # echo $$ > tasks > >> # dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > >> Killed > >> # cat memory.memsw.failcnt > >> 0 > >> # grep "failcnt" /var/log/messages | tail -2 > >> Dec 28 17:05:52 K-test kernel: memory: usage 10240kB, limit 10240kB, failcnt 21 > >> Dec 28 17:05:52 K-test kernel: memory+swap: usage 10240kB, limit 10240kB, failcnt 0 > >> > >> memory+swap usage is equal to limit, but memsw.failcnt is zero. > >> > > Please note that memsw.limit_in_bytes is triggered only if we have > > consumed some swap space already (and the feature is primarily intended > > to stop extensive swap usage in fact). > > It goes like this: If we trigger hard limit (memory.limit_in_bytes) then > > we start the direct reclaim (with swap available). If we trigger memsw > > limit then we try to reclaim without swap available. We will OOM if we > > cannot reclaim enough to satisfy the respective limit. > > > > The other part of the answer is, yes there is something wrong going > > on her because we definitely shouldn't OOM. 
The workload is a single > > threaded and we have a plenty of page cache that could be reclaimed > > easily. On the other hand we end up with: > > # echo $$ > tasks > > /dev/memctl/a# echo 10M > memory.limit_in_bytes > > /dev/memctl/a# echo 10M > memory.memsw.limit_in_bytes > > /dev/memctl/a# dd if=/dev/zero of=/tmp/temp_file count=20 bs=1M > > Killed > > /dev/memctl/a# cat memory.stat > > cache 9265152 > > [...] > > > > So there is almost 10M of page cache that we can simply reclaim. If we > > use 40M limit then we are OK. So this looks like the small limit somehow > > tricks our math in the reclaim path and we think there is nothing to > > reclaim. > > I will look into this. > > Have any conclusion for this? I am sorry, but I didn't get to this. The last two months were really busy and I am leaving for a long vacation next week. It's still on my todo list... > Thanks. -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic ^ permalink raw reply [flat|nested] 7+ messages in thread