* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast
@ 2022-09-13 13:09 yong
  2022-09-13 13:54 ` Greg KH
  0 siblings, 1 reply; 33+ messages in thread
From: yong @ 2022-09-13 13:09 UTC (permalink / raw)
  To: jaewon31.kim, gregkh, mhocko; +Cc: linux-kernel, stable, wang.yong12

Hello,
This patch is required to be patched in linux-5.4.y and linux-4.19.y.

In addition to that, the following two patches are somewhat related:

3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx
9282012 page_alloc: fix invalid watermark check on a negative value

thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast
  2022-09-13 13:09 [PATCH v4] page_alloc: consider highatomic reserve in watermark fast yong
@ 2022-09-13 13:54 ` Greg KH
  2022-09-14  0:46   ` yong w
  0 siblings, 1 reply; 33+ messages in thread
From: Greg KH @ 2022-09-13 13:54 UTC (permalink / raw)
  To: yong; +Cc: jaewon31.kim, mhocko, linux-kernel, stable, wang.yong12

On Tue, Sep 13, 2022 at 09:09:47PM +0800, yong wrote:
> Hello,
> This patch is required to be patched in linux-5.4.y and linux-4.19.y.

What is "this patch"?  There is no context here :(

> In addition to that, the following two patches are somewhat related:
>
> 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx
> 9282012 page_alloc: fix invalid watermark check on a negative value

In what way?  What should be done here by us?

confused,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast
  2022-09-13 13:54 ` Greg KH
@ 2022-09-14  0:46   ` yong w
  2022-09-16  9:40     ` Greg KH
       [not found]     ` <CGME20220916094017epcas1p1deed4041f897d2bf0e0486554d79b3af@epcms1p4>
  0 siblings, 2 replies; 33+ messages in thread
From: yong w @ 2022-09-14  0:46 UTC (permalink / raw)
  To: Greg KH; +Cc: jaewon31.kim, mhocko, linux-kernel, stable, wang.yong12

On Tue, Sep 13, 2022 at 9:54 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Sep 13, 2022 at 09:09:47PM +0800, yong wrote:
> > Hello,
> > This patch is required to be patched in linux-5.4.y and linux-4.19.y.
>
> What is "this patch"?  There is no context here :(
>
Sorry, I forgot to quote the original patch.  The patch is:

f27ce0e page_alloc: consider highatomic reserve in watermark fast

> > In addition to that, the following two patches are somewhat related:
> >
> > 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx
> > 9282012 page_alloc: fix invalid watermark check on a negative value
>
> In what way?  What should be done here by us?
>
I think these two patches should also be merged.

The classzone_idx parameter is used in the zone_watermark_fast
function, and 3334a45 uses ac->high_zoneidx for classzone_idx.
"9282012 page_alloc: fix invalid watermark check on a negative
value" fixes an issue introduced by f27ce0e.

> confused,
>
> greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast
  2022-09-14  0:46 ` yong w
@ 2022-09-16  9:40   ` Greg KH
  2022-09-16 17:05     ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
       [not found]     ` <CGME20220916094017epcas1p1deed4041f897d2bf0e0486554d79b3af@epcms1p4>
  1 sibling, 1 reply; 33+ messages in thread
From: Greg KH @ 2022-09-16  9:40 UTC (permalink / raw)
  To: yong w; +Cc: jaewon31.kim, mhocko, linux-kernel, stable, wang.yong12

On Wed, Sep 14, 2022 at 08:46:15AM +0800, yong w wrote:
> On Tue, Sep 13, 2022 at 9:54 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Tue, Sep 13, 2022 at 09:09:47PM +0800, yong wrote:
> > > Hello,
> > > This patch is required to be patched in linux-5.4.y and linux-4.19.y.
> >
> > What is "this patch"?  There is no context here :(
> >
> Sorry, I forgot to quote the original patch.  The patch is:
>
> f27ce0e page_alloc: consider highatomic reserve in watermark fast
>
> > > In addition to that, the following two patches are somewhat related:
> > >
> > > 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx
> > > 9282012 page_alloc: fix invalid watermark check on a negative value
> >
> > In what way?  What should be done here by us?
> >
> I think these two patches should also be merged.
>
> The classzone_idx parameter is used in the zone_watermark_fast
> function, and 3334a45 uses ac->high_zoneidx for classzone_idx.
> "9282012 page_alloc: fix invalid watermark check on a negative
> value" fixes an issue introduced by f27ce0e.

Ok, I need an ack by all the developers involved in those commits, as
well as the subsystem maintainer so that I know it's ok to take them.

Can you provide a series of backported and tested patches so that they
are easy to review?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19
  2022-09-16  9:40 ` Greg KH
@ 2022-09-16 17:05   ` wangyong
  2022-09-16 17:05     ` [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
  ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: wangyong @ 2022-09-16 17:05 UTC (permalink / raw)
  To: gregkh; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, yongw.pur

Here are the corresponding backports to 4.19.
And fix classzone_idx context differences causing patch merge conflicts.

Jaewon Kim (2):
  page_alloc: consider highatomic reserve in watermark fast
  page_alloc: fix invalid watermark check on a negative value

Joonsoo Kim (1):
  mm/page_alloc: use ac->high_zoneidx for classzone_idx

 mm/internal.h   |  2 +-
 mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++------------------------
 2 files changed, 41 insertions(+), 30 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx
  2022-09-16 17:05 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
@ 2022-09-16 17:05   ` wangyong
  2022-09-16 17:09     ` kernel test robot
  2022-09-16 17:05   ` [PATCH 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: wangyong @ 2022-09-16 17:05 UTC (permalink / raw)
  To: gregkh
  Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12,
	yongw.pur, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Minchan Kim, Mel Gorman, Linus Torvalds

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Patch series "integrate classzone_idx and high_zoneidx", v5.

This patchset is a followup to the problem reported and discussed two
years ago [1, 2].  The problem this patchset solves is related to the
classzone_idx on NUMA systems.  It causes a problem when the lowmem
reserve protection exists for some zones on a node that do not exist
on other nodes.

This problem was reported two years ago, and, at that time, the
solution got general agreement [2].  But it was not upstreamed.

[1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
[2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com

This patch (of 2):

Currently, we use classzone_idx to calculate lowmem reserve protection
for an allocation request.  This classzone_idx causes a problem on NUMA
systems when the lowmem reserve protection exists for some zones on a
node that do not exist on other nodes.

Before further explanation, I should first clarify how the
classzone_idx and the high_zoneidx are computed.
- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
  represents the index of the highest zone the allocation can use

- classzone_idx was supposed to be the index of the highest zone on the
  local node that the allocation can use, that is actually available in
  the system

Think about the following example.  Node 0 has 4 populated zones,
DMA/DMA32/NORMAL/MOVABLE.  Node 1 has 1 populated zone, NORMAL.  Some
zones, such as MOVABLE, don't exist on node 1 and this makes the
following difference.

Assume that there is an allocation request whose gfp_zone(gfp_mask) is
the zone, MOVABLE.  Then, its high_zoneidx is 3.  If this allocation is
initiated on node 0, its classzone_idx is 3 since the actually
available/usable zone on local (node 0) is MOVABLE.  If this allocation
is initiated on node 1, its classzone_idx is 2 since the actually
available/usable zone on local (node 1) is NORMAL.

You can see that the classzone_idx of the allocation request differs
according to its starting node, even if the high_zoneidx is the same.

Think more about these two allocation requests.  If they are processed
on local, there is no problem.  However, if an allocation initiated on
node 1 is processed on remote, in this example, at the NORMAL zone on
node 0, due to memory shortage, a problem occurs.  Their different
classzone_idx leads to a different lowmem reserve and then a different
min watermark.  See the following example.

root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats
  ...
  pages free     3965
        min      5
        low      8
        high     11
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 2961, 4928, 5440)
...
Node 0, zone    DMA32
  pages free     757955
        min      1129
        low      1887
        high     2645
        spanned  1044480
        present  782303
        managed  758116
        protection: (0, 0, 1967, 2479)
...
Node 0, zone   Normal
  pages free     459806
        min      750
        low      1253
        high     1756
        spanned  524288
        present  524288
        managed  503620
        protection: (0, 0, 0, 4096)
...
Node 0, zone  Movable
  pages free     130759
        min      195
        low      326
        high     457
        spanned  1966079
        present  131072
        managed  131072
        protection: (0, 0, 0, 0)
...
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone    DMA32
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone   Normal
  per-node stats
  ...
  pages free     233277
        min      383
        low      640
        high     897
        spanned  262144
        present  262144
        managed  257744
        protection: (0, 0, 0, 0)
...
Node 1, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  262144
        present  0
        managed  0
        protection: (0, 0, 0, 0)

- the static min watermark for the NORMAL zone on node 0 is 750.
- the lowmem reserve for a request with classzone_idx 3 at the NORMAL
  zone on node 0 is 4096.
- the lowmem reserve for a request with classzone_idx 2 at the NORMAL
  zone on node 0 is 0.

So, the overall min watermark is:
allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846
allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750

An allocation initiated on node 1 will take precedence over one
initiated on node 0 because the min watermark of the former allocation
is lower.  So, an allocation initiated on node 1 could succeed on node
0 when an allocation initiated on node 0 could not, and this could
cause too many numa_miss allocations.  Then, performance could be
degraded.

Recently, there was a regression report about this problem on CMA
patches since CMA memory is placed in ZONE_MOVABLE by those patches.  I
checked that the problem disappears with this fix that uses
high_zoneidx for classzone_idx.

http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

Using high_zoneidx for classzone_idx is a more consistent way than the
previous approach because the system's memory layout doesn't affect it.
With this patch, both classzone_idx values in the above example will be
3, so they will have the same min watermark.
allocation initiated on node 0: 750 + 4096 = 4846
allocation initiated on node 1: 750 + 4096 = 4846

One could wonder if there is a side effect that an allocation initiated
on node 1 will use a higher bar when the allocation is handled on
local, since classzone_idx could be higher than before.  It will not
happen because a zone without managed pages doesn't contribute to
lowmem_reserve at all.

Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3a2e973..922a173 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -123,7 +123,7 @@ struct alloc_context {
 	bool spread_dirty_pages;
 };
 
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+#define ac_classzone_idx(ac) (ac->high_zoneidx)
 
 /*
  * Locate the struct page for both the matching buddy in our
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx
  2022-09-16 17:05 ` [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
@ 2022-09-16 17:09   ` kernel test robot
  0 siblings, 0 replies; 33+ messages in thread
From: kernel test robot @ 2022-09-16 17:09 UTC (permalink / raw)
  To: wangyong; +Cc: stable, kbuild-all

Hi,

Thanks for your patch.

FYI: kernel test robot notices the stable kernel rule is not satisfied.

Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx
Link: https://lore.kernel.org/stable/1663347949-20389-2-git-send-email-wang.yong12%40zte.com.cn

The check is based on
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCH 2/3] page_alloc: consider highatomic reserve in watermark fast
  2022-09-16 17:05 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
  2022-09-16 17:05   ` [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
@ 2022-09-16 17:05   ` wangyong
  2022-09-16 17:05   ` [PATCH 3/3] page_alloc: fix invalid watermark check on a negative value wangyong
  2022-09-20 17:41   ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH
  3 siblings, 0 replies; 33+ messages in thread
From: wangyong @ 2022-09-16 17:05 UTC (permalink / raw)
  To: gregkh
  Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12,
	yongw.pur, Andrew Morton, Johannes Weiner, Yong-Taek Lee,
	Linus Torvalds

From: Jaewon Kim <jaewon31.kim@samsung.com>

zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages").  The commit
simply checks if free pages are above the watermark, without additional
calculation such as reducing the watermark.

It considered free cma pages but it did not consider highatomic
reserved pages.  This may incur exhaustion of free pages except high
order atomic free pages.

Assume that the reserved_highatomic pageblock is bigger than watermark
min, and there are only a few free pages except high order atomic free.
Because zone_watermark_fast passes the allocation without considering
high order atomic free, normal reclaimable allocations like
GFP_HIGHUSER will consume all the free pages.  Then finally an order-0
atomic allocation may fail on allocation.

This means watermark min is not protected against non-atomic
allocation.  An order-0 atomic allocation with ALLOC_HARDER can
unwantedly fail.  Additionally a __GFP_MEMALLOC allocation with
ALLOC_NO_WATERMARKS can also fail.

To avoid the problem, zone_watermark_fast should consider highatomic
reserve.
If the actual size of high atomic free is counted accurately like cma
free, we may use it.  In this patch just use nr_reserved_highatomic.
Additionally introduce __zone_watermark_unusable_free to factor out the
common parts between zone_watermark_fast and __zone_watermark_ok.

This is an example of ALLOC_HARDER allocation failure using a v4.19
based kernel.

 Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
 Call trace:
 [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
 [<ffffff8008223320>] warn_alloc+0xd8/0x12c
 [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
 [<ffffff800827f6e8>] new_slab+0x128/0x604
 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
 [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
 Mem-Info:
 active_anon:102061 inactive_anon:81551 isolated_anon:0
  active_file:59102 inactive_file:68924 isolated_file:64
  unevictable:611 dirty:63 writeback:0 unstable:0
  slab_reclaimable:13324 slab_unreclaimable:44354
  mapped:83015 shmem:4858 pagetables:26316 bounce:0
  free:2727 free_pcp:1035 free_cma:178
 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB
  inactive_file:275696kB unevictable:2444kB isolated(anon):0kB
  isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB
  shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:10908kB min:6192kB low:44388kB high:47060kB
  active_anon:409160kB inactive_anon:325924kB active_file:235820kB
  inactive_file:276628kB unevictable:2444kB writepending:252kB
  present:3076096kB managed:2673676kB mlocked:2444kB
  kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB
  local_pcp:40kB free_cma:712kB
 lowmem_reserve[]: 0 0
 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H)
  0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
 138826 total pagecache pages
 5460 pages in swap cache
 Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

This is an example of ALLOC_NO_WATERMARKS allocation failure using a
v4.14 based kernel.

 kswapd0: page allocation failure: order:0,
  mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
 kswapd0 cpuset=/ mems_allowed=0
 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
 Call trace:
 [<0000000000000000>] dump_backtrace+0x0/0x248
 [<0000000000000000>] show_stack+0x18/0x20
 [<0000000000000000>] __dump_stack+0x20/0x28
 [<0000000000000000>] dump_stack+0x68/0x90
 [<0000000000000000>] warn_alloc+0x104/0x198
 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
 [<0000000000000000>] zs_malloc+0x148/0x3d0
 [<0000000000000000>] zram_bvec_rw+0x410/0x798
 [<0000000000000000>] zram_rw_page+0x88/0xdc
 [<0000000000000000>] bdev_write_page+0x70/0xbc
 [<0000000000000000>] __swap_writepage+0x58/0x37c
 [<0000000000000000>] swap_writepage+0x40/0x4c
 [<0000000000000000>] shrink_page_list+0xc30/0xf48
 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
 [<0000000000000000>] shrink_node_memcg+0x23c/0x618
 [<0000000000000000>] shrink_node+0x1c8/0x304
 [<0000000000000000>] kswapd+0x680/0x7c4
 [<0000000000000000>] kthread+0x110/0x120
 [<0000000000000000>] ret_from_fork+0x10/0x18
 Mem-Info:
 active_anon:111826 inactive_anon:65557 isolated_anon:0
  active_file:44260 inactive_file:83422 isolated_file:0
  unevictable:4158 dirty:117 writeback:0 unstable:0
  slab_reclaimable:13943 slab_unreclaimable:43315
  mapped:102511 shmem:3299 pagetables:19566 bounce:0
  free:3510 free_pcp:553 free_cma:0
 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB
  inactive_file:333688kB unevictable:16632kB isolated(anon):0kB
  isolated(file):0kB mapped:410044kB dirty:468kB writeback:0kB
  shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:14040kB min:7440kB low:94500kB high:98136kB
  reserved_highatomic:32768KB active_anon:447336kB
  inactive_anon:261668kB active_file:177572kB inactive_file:333768kB
  unevictable:16632kB writepending:480kB present:4081664kB
  managed:3637088kB mlocked:16632kB kernel_stack:47072kB
  pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB
  free_cma:0kB
 [ 4738.329607] lowmem_reserve[]: 0 0
 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H)
  6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB

This is a trace log which shows GFP_HIGHUSER consuming free pages right
before ALLOC_NO_WATERMARKS.

  <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page=          (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

[jaewon31.kim@samsung.com: remove redundant code for high-order]
  Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c35403..237463d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3130,6 +3130,29 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
+static inline long __zone_watermark_unusable_free(struct zone *z,
+				unsigned int order, unsigned int alloc_flags)
+{
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	long unusable_free = (1 << order) - 1;
+
+	/*
+	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * the high-atomic reserves. This will over-estimate the size of the
+	 * atomic reserve but it avoids a search.
+	 */
+	if (likely(!alloc_harder))
+		unusable_free += z->nr_reserved_highatomic;
+
+#ifdef CONFIG_CMA
+	/* If allocation can't use CMA areas don't use free CMA pages */
+	if (!(alloc_flags & ALLOC_CMA))
+		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
+#endif
+
+	return unusable_free;
+}
+
 /*
  * Return true if free base pages are above 'mark'. For high-order checks it
  * will return true of the order-0 watermark is reached and there is at least
@@ -3145,19 +3168,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
-	free_pages -= (1 << order) - 1;
+	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 
-	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
-	 * the high-atomic reserves. This will over-estimate the size of the
-	 * atomic reserve but it avoids a search.
-	 */
-	if (likely(!alloc_harder)) {
-		free_pages -= z->nr_reserved_highatomic;
-	} else {
+	if (unlikely(alloc_harder)) {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
 		 * users on the grounds that it's definitely going to be in
@@ -3170,13 +3186,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			min -= min / 4;
 	}
 
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -3225,24 +3234,22 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
 {
-	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
+	long free_pages;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+	free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
-	 * need to be calculated. There is a corner case where the check
-	 * passes but only the high-order atomic reserve are free. If
-	 * the caller is !atomic then it'll uselessly search the free
-	 * list. That corner case is then slower but it is harmless.
+	 * need to be calculated.
 	 */
-	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
-		return true;
+	if (!order) {
+		long fast_free;
+
+		fast_free = free_pages;
+		fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags);
+		if (fast_free > mark + z->lowmem_reserve[classzone_idx])
+			return true;
+	}
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx,
 					alloc_flags, free_pages);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread
* [PATCH 3/3] page_alloc: fix invalid watermark check on a negative value
  2022-09-16 17:05 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
  2022-09-16 17:05   ` [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
  2022-09-16 17:05   ` [PATCH 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong
@ 2022-09-16 17:05   ` wangyong
  2022-09-20 17:41   ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH
  3 siblings, 0 replies; 33+ messages in thread
From: wangyong @ 2022-09-16 17:05 UTC (permalink / raw)
  To: gregkh
  Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12,
	yongw.pur, Minchan Kim, Baoquan He, Vlastimil Babka,
	Johannes Weiner, Yong-Taek Lee, stable, Andrew Morton

From: Jaewon Kim <jaewon31.kim@samsung.com>

There was a report that a task was waiting at throttle_direct_reclaim,
and pgscan_direct_throttle in vmstat was increasing.

This is a bug where zone_watermark_fast returns true even when the free
is very low.  The commit f27ce0e14088 ("page_alloc: consider highatomic
reserve in watermark fast") changed the watermark fast path to consider
the highatomic reserve, but it did not handle the negative value case
which can happen when the reserved_highatomic pageblock is bigger than
the actual free.

If the watermark is considered ok for the negative value, allocating
contexts for order-0 will consume all free pages without direct
reclaim, and finally free pages may become depleted except highatomic
free.  Then allocating contexts may fall into throttle_direct_reclaim.
This symptom may easily happen in a system where wmark min is low and
other reclaimers like kswapd do not make free pages quickly.

Handle the negative case by using MIN.
Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Reported-by: GyeongHwan Hong <gh21.hong@samsung.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yong-Taek Lee <ytk.lee@samsung.com>
Cc: <stable@vger.kerenl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/page_alloc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 237463d..d6d8a37 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3243,11 +3243,15 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	 * need to be calculated.
 	 */
 	if (!order) {
-		long fast_free;
+		long usable_free;
+		long reserved;
 
-		fast_free = free_pages;
-		fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags);
-		if (fast_free > mark + z->lowmem_reserve[classzone_idx])
+		usable_free = free_pages;
+		reserved = __zone_watermark_unusable_free(z, 0, alloc_flags);
+
+		/* reserved may over estimate high-atomic reserves. */
+		usable_free -= min(usable_free, reserved);
+		if (usable_free > mark + z->lowmem_reserve[classzone_idx])
 			return true;
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread
* Re: [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19
  2022-09-16 17:05 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
                     ` (2 preceding siblings ...)
  2022-09-16 17:05 ` [PATCH 3/3] page_alloc: fix invalid watermark check on a negative value wangyong
@ 2022-09-20 17:41   ` Greg KH
  2022-09-25 10:35     ` [PATCH v2 " wangyong
  3 siblings, 1 reply; 33+ messages in thread
From: Greg KH @ 2022-09-20 17:41 UTC (permalink / raw)
  To: wangyong; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12

On Fri, Sep 16, 2022 at 10:05:46AM -0700, wangyong wrote:
> Here are the corresponding backports to 4.19.
> And fix classzone_idx context differences causing patch merge conflicts.
>
> Jaewon Kim (2):
>   page_alloc: consider highatomic reserve in watermark fast
>   page_alloc: fix invalid watermark check on a negative value
>
> Joonsoo Kim (1):
>   mm/page_alloc: use ac->high_zoneidx for classzone_idx
>
>  mm/internal.h   |  2 +-
>  mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++------------------------
>  2 files changed, 41 insertions(+), 30 deletions(-)
>
> --
> 2.7.4
>

What are the git commit ids of these commits?  That needs to be in the
commit changelog.

Also you did not sign off on the backports, please fix that up when you
resend this series.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19
  2022-09-20 17:41 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH
@ 2022-09-25 10:35   ` wangyong
  2022-09-25 10:35     ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
  ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: wangyong @ 2022-09-25 10:35 UTC (permalink / raw)
  To: gregkh; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, yongw.pur

Here are the corresponding backports to 4.19.
And fix classzone_idx context differences causing patch merge conflicts.

Original commit IDs:
3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx
f27ce0e page_alloc: consider highatomic reserve in watermark fast
9282012 page_alloc: fix invalid watermark check on a negative value

Changes from v1:
- Add commit information of the original patches.

Jaewon Kim (2):
  page_alloc: consider highatomic reserve in watermark fast
  page_alloc: fix invalid watermark check on a negative value

Joonsoo Kim (1):
  mm/page_alloc: use ac->high_zoneidx for classzone_idx

 mm/internal.h   |  2 +-
 mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++------------------------
 2 files changed, 41 insertions(+), 30 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx 2022-09-25 10:35 ` [PATCH v2 " wangyong @ 2022-09-25 10:35 ` wangyong 2022-09-25 10:36 ` kernel test robot 2022-09-25 11:00 ` Greg KH 2022-09-25 10:35 ` [PATCH v2 stable-4.19 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong ` (2 subsequent siblings) 3 siblings, 2 replies; 33+ messages in thread From: wangyong @ 2022-09-25 10:35 UTC (permalink / raw) To: gregkh Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, yongw.pur, Joonsoo Kim, Andrew Morton, Johannes Weiner, Minchan Kim, Mel Gorman, Linus Torvalds From: Joonsoo Kim <iamjoonsoo.kim@lge.com> [ backport of commit 3334a45eb9e2bb040c880ef65e1d72357a0a008b ] Patch series "integrate classzone_idx and high_zoneidx", v5. This patchset is followup of the problem reported and discussed two years ago [1, 2]. The problem this patchset solves is related to the classzone_idx on the NUMA system. It causes a problem when the lowmem reserve protection exists for some zones on a node that do not exist on other nodes. This problem was reported two years ago, and, at that time, the solution got general agreements [2]. But it was not upstreamed. [1]: http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop [2]: http://lkml.kernel.org/r/1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com This patch (of 2): Currently, we use classzone_idx to calculate lowmem reserve proetection for an allocation request. This classzone_idx causes a problem on NUMA systems when the lowmem reserve protection exists for some zones on a node that do not exist on other nodes. Before further explanation, I should first clarify how to compute the classzone_idx and the high_zoneidx. 
- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and represents the index of the highest zone the allocation can use - classzone_idx was supposed to be the index of the highest zone on the local node that the allocation can use, that is actually available in the system Think about following example. Node 0 has 4 populated zone, DMA/DMA32/NORMAL/MOVABLE. Node 1 has 1 populated zone, NORMAL. Some zones, such as MOVABLE, doesn't exist on node 1 and this makes following difference. Assume that there is an allocation request whose gfp_zone(gfp_mask) is the zone, MOVABLE. Then, it's high_zoneidx is 3. If this allocation is initiated on node 0, it's classzone_idx is 3 since actually available/usable zone on local (node 0) is MOVABLE. If this allocation is initiated on node 1, it's classzone_idx is 2 since actually available/usable zone on local (node 1) is NORMAL. You can see that classzone_idx of the allocation request are different according to their starting node, even if their high_zoneidx is the same. Think more about these two allocation requests. If they are processed on local, there is no problem. However, if allocation is initiated on node 1 are processed on remote, in this example, at the NORMAL zone on node 0, due to memory shortage, problem occurs. Their different classzone_idx leads to different lowmem reserve and then different min watermark. See the following example. root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo Node 0, zone DMA per-node stats ... pages free 3965 min 5 low 8 high 11 spanned 4095 present 3998 managed 3977 protection: (0, 2961, 4928, 5440) ... Node 0, zone DMA32 pages free 757955 min 1129 low 1887 high 2645 spanned 1044480 present 782303 managed 758116 protection: (0, 0, 1967, 2479) ... Node 0, zone Normal pages free 459806 min 750 low 1253 high 1756 spanned 524288 present 524288 managed 503620 protection: (0, 0, 0, 4096) ... 
Node 0, zone Movable pages free 130759 min 195 low 326 high 457 spanned 1966079 present 131072 managed 131072 protection: (0, 0, 0, 0) ... Node 1, zone DMA pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1006, 1006) Node 1, zone DMA32 pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1006, 1006) Node 1, zone Normal per-node stats ... pages free 233277 min 383 low 640 high 897 spanned 262144 present 262144 managed 257744 protection: (0, 0, 0, 0) ... Node 1, zone Movable pages free 0 min 0 low 0 high 0 spanned 262144 present 0 managed 0 protection: (0, 0, 0, 0) - static min watermark for the NORMAL zone on node 0 is 750. - lowmem reserve for the request with classzone idx 3 at the NORMAL on node 0 is 4096. - lowmem reserve for the request with classzone idx 2 at the NORMAL on node 0 is 0. So, overall min watermark is: allocation initiated on node 0 (classzone_idx 3): 750 + 4096 = 4846 allocation initiated on node 1 (classzone_idx 2): 750 + 0 = 750 Allocation initiated on node 1 will have some precedence than allocation initiated on node 0 because min watermark of the former allocation is lower than the other. So, allocation initiated on node 1 could succeed on node 0 when allocation initiated on node 0 could not, and, this could cause too many numa_miss allocation. Then, performance could be downgraded. Recently, there was a regression report about this problem on CMA patches since CMA memory are placed in ZONE_MOVABLE by those patches. I checked that problem is disappeared with this fix that uses high_zoneidx for classzone_idx. http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop Using high_zoneidx for classzone_idx is more consistent way than previous approach because system's memory layout doesn't affect anything to it. With this patch, both classzone_idx on above example will be 3 so will have the same min watermark. 
allocation initiated on node 0: 750 + 4096 = 4846 allocation initiated on node 1: 750 + 4096 = 4846 One could wonder if there is a side effect that allocation initiated on node 1 will use higher bar when allocation is handled on local since classzone_idx could be higher than before. It will not happen because the zone without managed page doesn't contributes lowmem_reserve at all. Reported-by: Ye Xiaolong <xiaolong.ye@intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Ye Xiaolong <xiaolong.ye@intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/1587095923-7515-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1587095923-7515-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/internal.h b/mm/internal.h index 3a2e973..922a173 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -123,7 +123,7 @@ struct alloc_context { bool spread_dirty_pages; }; -#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref) +#define ac_classzone_idx(ac) (ac->high_zoneidx) /* * Locate the struct page for both the matching buddy in our -- 2.7.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
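The min-watermark arithmetic in the changelog above can be reproduced with a small standalone C model (a sketch only, not kernel code; the values are taken from the /proc/zoneinfo dump quoted earlier):

```c
#include <assert.h>

/* Illustrative model: the effective min watermark an allocation must
 * clear in a zone is the zone's static min plus the lowmem reserve
 * entry indexed by the allocation's classzone_idx. With classzone_idx
 * derived from the starting node (pre-patch), two requests with the
 * same high_zoneidx can see different bars; with ac->high_zoneidx
 * (post-patch) they see the same one. */
static long effective_min_watermark(long zone_min,
                                    const long *lowmem_reserve,
                                    int classzone_idx)
{
        return zone_min + lowmem_reserve[classzone_idx];
}
```

Using Node 0's NORMAL zone from the example (min 750, protection (0, 0, 0, 4096)): classzone_idx 3 gives 750 + 4096 = 4846, while classzone_idx 2 gives 750 + 0 = 750, which is the asymmetry the patch removes.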
* Re: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx 2022-09-25 10:35 ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong @ 2022-09-25 10:36 ` kernel test robot 2022-09-25 11:00 ` Greg KH 1 sibling, 0 replies; 33+ messages in thread From: kernel test robot @ 2022-09-25 10:36 UTC (permalink / raw) To: wangyong; +Cc: stable, kbuild-all Hi, Thanks for your patch. FYI: kernel test robot notices the stable kernel rule is not satisfied. Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.' Subject: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx Link: https://lore.kernel.org/stable/20220925103529.13716-2-yongw.pur%40gmail.com The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html -- 0-DAY CI Kernel Test Service https://01.org/lkp ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx 2022-09-25 10:35 ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong 2022-09-25 10:36 ` kernel test robot @ 2022-09-25 11:00 ` Greg KH 2022-09-25 14:32 ` yong w 1 sibling, 1 reply; 33+ messages in thread From: Greg KH @ 2022-09-25 11:00 UTC (permalink / raw) To: wangyong Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, Joonsoo Kim, Andrew Morton, Johannes Weiner, Minchan Kim, Mel Gorman, Linus Torvalds On Sun, Sep 25, 2022 at 03:35:27AM -0700, wangyong wrote: > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > [ backport of commit 3334a45eb9e2bb040c880ef65e1d72357a0a008b ] This is from 5.8. What about the 5.4.y kernel? Why would someone upgrading from 4.19.y to 5.4.y suffer a regression here? And why wouldn't someone who has this issue just not use 5.10.y instead? What prevents someone from moving off of 4.19.y at this point in time? thanks, greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx 2022-09-25 11:00 ` Greg KH @ 2022-09-25 14:32 ` yong w 2022-09-26 6:46 ` Greg KH 0 siblings, 1 reply; 33+ messages in thread From: yong w @ 2022-09-25 14:32 UTC (permalink / raw) To: Greg KH, jaewon31.kim Cc: linux-kernel, mhocko, stable, wang.yong12, Joonsoo Kim, Andrew Morton, Johannes Weiner, Minchan Kim, Mel Gorman, Linus Torvalds Greg KH <gregkh@linuxfoundation.org> 于2022年9月25日周日 19:00写道: > > On Sun, Sep 25, 2022 at 03:35:27AM -0700, wangyong wrote: > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > > > [ backport of commit 3334a45eb9e2bb040c880ef65e1d72357a0a008b ] > > This is from 5.8. What about the 5.4.y kernel? Why would someone > upgrading from 4.19.y to 5.4.y suffer a regression here? > I encountered this problem on 4.19, but I haven't encountered it on 5.4. However, this should be a common problem, so 5.4 may also need to be merged. Hello, Joonsoo, what do you think? > And why wouldn't someone who has this issue just not use 5.10.y instead? > What prevents someone from moving off of 4.19.y at this point in time? > This is a solution, but upgrading the kernel version requires time and overhead, so use the patch is the most effective way, if there is. Thanks. > thanks, > > greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx 2022-09-25 14:32 ` yong w @ 2022-09-26 6:46 ` Greg KH 0 siblings, 0 replies; 33+ messages in thread From: Greg KH @ 2022-09-26 6:46 UTC (permalink / raw) To: yong w Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, Joonsoo Kim, Andrew Morton, Johannes Weiner, Minchan Kim, Mel Gorman, Linus Torvalds On Sun, Sep 25, 2022 at 10:32:32PM +0800, yong w wrote: > Greg KH <gregkh@linuxfoundation.org> 于2022年9月25日周日 19:00写道: > > > > On Sun, Sep 25, 2022 at 03:35:27AM -0700, wangyong wrote: > > > From: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > > > > > [ backport of commit 3334a45eb9e2bb040c880ef65e1d72357a0a008b ] > > > > This is from 5.8. What about the 5.4.y kernel? Why would someone > > upgrading from 4.19.y to 5.4.y suffer a regression here? > > > I encountered this problem on 4.19, but I haven't encountered it on 5.4. > However, this should be a common problem, so 5.4 may also need to be > merged. > > Hello, Joonsoo, what do you think? > > > And why wouldn't someone who has this issue just not use 5.10.y instead? > > What prevents someone from moving off of 4.19.y at this point in time? > > > This is a solution, but upgrading the kernel version requires time and overhead, > so use the patch is the most effective way, if there is. You will have to move off of 4.19 soon anyway, so why delay the change? thanks, greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH v2 stable-4.19 2/3] page_alloc: consider highatomic reserve in watermark fast 2022-09-25 10:35 ` [PATCH v2 " wangyong 2022-09-25 10:35 ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong @ 2022-09-25 10:35 ` wangyong 2022-09-25 10:35 ` [PATCH v2 stable-4.19 3/3] page_alloc: fix invalid watermark check on a negative value wangyong 2022-10-02 15:37 ` [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH 3 siblings, 0 replies; 33+ messages in thread From: wangyong @ 2022-09-25 10:35 UTC (permalink / raw) To: gregkh Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, yongw.pur, Andrew Morton, Johannes Weiner, Yong-Taek Lee, Linus Torvalds From: Jaewon Kim <jaewon31.kim@samsung.com> [ backport of commit f27ce0e14088b23f8d54ae4a44f70307ec420e64 ] zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm, page_alloc: shortcut watermark checks for order-0 pages"). The commit simply checks if free pages is bigger than watermark without additional calculation such like reducing watermark. It considered free cma pages but it did not consider highatomic reserved. This may incur exhaustion of free pages except high order atomic free pages. Assume that reserved_highatomic pageblock is bigger than watermark min, and there are only few free pages except high order atomic free. Because zone_watermark_fast passes the allocation without considering high order atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume all the free pages. Then finally order-0 atomic allocation may fail on allocation. This means watermark min is not protected against non-atomic allocation. The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed. Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also can be failed. To avoid the problem, zone_watermark_fast should consider highatomic reserve. 
If the actual size of high atomic free is counted accurately like cma free, we may use it. On this patch just use nr_reserved_highatomic. Additionally introduce __zone_watermark_unusable_free to factor out common parts between zone_watermark_fast and __zone_watermark_ok. This is an example of ALLOC_HARDER allocation failure using v4.19 based kernel. Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null) Call trace: [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0 [<ffffff8008223320>] warn_alloc+0xd8/0x12c [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250 [<ffffff800827f6e8>] new_slab+0x128/0x604 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c Mem-Info: active_anon:102061 inactive_anon:81551 isolated_anon:0 active_file:59102 inactive_file:68924 isolated_file:64 unevictable:611 dirty:63 writeback:0 unstable:0 slab_reclaimable:13324 slab_unreclaimable:44354 mapped:83015 shmem:4858 pagetables:26316 bounce:0 free:2727 free_pcp:1035 free_cma:178 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB lowmem_reserve[]: 0 0 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB 138826 total pagecache pages 5460 pages in swap cache Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142 This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14 based kernel. kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null) kswapd0 cpuset=/ mems_allowed=0 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1 Call trace: [<0000000000000000>] dump_backtrace+0x0/0x248 [<0000000000000000>] show_stack+0x18/0x20 [<0000000000000000>] __dump_stack+0x20/0x28 [<0000000000000000>] dump_stack+0x68/0x90 [<0000000000000000>] warn_alloc+0x104/0x198 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0 [<0000000000000000>] zs_malloc+0x148/0x3d0 [<0000000000000000>] zram_bvec_rw+0x410/0x798 [<0000000000000000>] zram_rw_page+0x88/0xdc [<0000000000000000>] bdev_write_page+0x70/0xbc [<0000000000000000>] __swap_writepage+0x58/0x37c [<0000000000000000>] swap_writepage+0x40/0x4c [<0000000000000000>] shrink_page_list+0xc30/0xf48 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c [<0000000000000000>] shrink_node_memcg+0x23c/0x618 [<0000000000000000>] shrink_node+0x1c8/0x304 [<0000000000000000>] kswapd+0x680/0x7c4 [<0000000000000000>] kthread+0x110/0x120 [<0000000000000000>] ret_from_fork+0x10/0x18 Mem-Info: active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a 
slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB This is trace log which shows GFP_HIGHUSER consumes free pages right before ALLOC_NO_WATERMARKS. <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 
889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE [jaewon31.kim@samsung.com: remove redundant code for high-order] Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++++------------------------- 1 file changed, 36 insertions(+), 29 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9c35403..237463d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3130,6 +3130,29 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) #endif /* CONFIG_FAIL_PAGE_ALLOC */ +static inline long __zone_watermark_unusable_free(struct zone *z, + unsigned int order, unsigned int alloc_flags) +{ + const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); + long unusable_free = (1 << order) - 1; + + /* + * If the caller does not have rights to ALLOC_HARDER then subtract + * the high-atomic reserves. 
This will over-estimate the size of the + * atomic reserve but it avoids a search. + */ + if (likely(!alloc_harder)) + unusable_free += z->nr_reserved_highatomic; + +#ifdef CONFIG_CMA + /* If allocation can't use CMA areas don't use free CMA pages */ + if (!(alloc_flags & ALLOC_CMA)) + unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES); +#endif + + return unusable_free; +} + /* * Return true if free base pages are above 'mark'. For high-order checks it * will return true of the order-0 watermark is reached and there is at least @@ -3145,19 +3168,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); /* free_pages may go negative - that's OK */ - free_pages -= (1 << order) - 1; + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); if (alloc_flags & ALLOC_HIGH) min -= min / 2; - /* - * If the caller does not have rights to ALLOC_HARDER then subtract - * the high-atomic reserves. This will over-estimate the size of the - * atomic reserve but it avoids a search. - */ - if (likely(!alloc_harder)) { - free_pages -= z->nr_reserved_highatomic; - } else { + if (unlikely(alloc_harder)) { /* * OOM victims can try even harder than normal ALLOC_HARDER * users on the grounds that it's definitely going to be in @@ -3170,13 +3186,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, min -= min / 4; } - -#ifdef CONFIG_CMA - /* If allocation can't use CMA areas don't use free CMA pages */ - if (!(alloc_flags & ALLOC_CMA)) - free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); -#endif - /* * Check watermarks for an order-0 allocation request. 
If these * are not met, then a high-order request also cannot go ahead @@ -3225,24 +3234,22 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, static inline bool zone_watermark_fast(struct zone *z, unsigned int order, unsigned long mark, int classzone_idx, unsigned int alloc_flags) { - long free_pages = zone_page_state(z, NR_FREE_PAGES); - long cma_pages = 0; + long free_pages; -#ifdef CONFIG_CMA - /* If allocation can't use CMA areas don't use free CMA pages */ - if (!(alloc_flags & ALLOC_CMA)) - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); -#endif + free_pages = zone_page_state(z, NR_FREE_PAGES); /* * Fast check for order-0 only. If this fails then the reserves - * need to be calculated. There is a corner case where the check - * passes but only the high-order atomic reserve are free. If - * the caller is !atomic then it'll uselessly search the free - * list. That corner case is then slower but it is harmless. + * need to be calculated. */ - if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx]) - return true; + if (!order) { + long fast_free; + + fast_free = free_pages; + fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags); + if (fast_free > mark + z->lowmem_reserve[classzone_idx]) + return true; + } return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, free_pages); -- 2.7.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v2 stable-4.19 3/3] page_alloc: fix invalid watermark check on a negative value 2022-09-25 10:35 ` [PATCH v2 " wangyong 2022-09-25 10:35 ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong 2022-09-25 10:35 ` [PATCH v2 stable-4.19 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong @ 2022-09-25 10:35 ` wangyong 2022-10-02 15:37 ` [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH 3 siblings, 0 replies; 33+ messages in thread From: wangyong @ 2022-09-25 10:35 UTC (permalink / raw) To: gregkh Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12, yongw.pur, Minchan Kim, Baoquan He, Vlastimil Babka, Johannes Weiner, Yong-Taek Lee, stable, Andrew Morton From: Jaewon Kim <jaewon31.kim@samsung.com> [ backport of commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 ] There was a report that a task is waiting at the throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was increasing. This is a bug where zone_watermark_fast returns true even when the free is very low. The commit f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") changed the watermark fast to consider highatomic reserve. But it did not handle a negative value case which can be happened when reserved_highatomic pageblock is bigger than the actual free. If watermark is considered as ok for the negative value, allocating contexts for order-0 will consume all free pages without direct reclaim, and finally free page may become depleted except highatomic free. Then allocating contexts may fall into throttle_direct_reclaim. This symptom may easily happen in a system where wmark min is low and other reclaimers like kswapd does not make free pages quickly. Handle the negative case by using MIN. 
Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reported-by: GyeongHwan Hong <gh21.hong@samsung.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/page_alloc.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 237463d..d6d8a37 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3243,11 +3243,15 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, * need to be calculated. */ if (!order) { - long fast_free; + long usable_free; + long reserved; - fast_free = free_pages; - fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags); - if (fast_free > mark + z->lowmem_reserve[classzone_idx]) + usable_free = free_pages; + reserved = __zone_watermark_unusable_free(z, 0, alloc_flags); + + /* reserved may over estimate high-atomic reserves. */ + usable_free -= min(usable_free, reserved); + if (usable_free > mark + z->lowmem_reserve[classzone_idx]) return true; } -- 2.7.4 ^ permalink raw reply related [flat|nested] 33+ messages in thread
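The negative-value failure mode fixed above comes down to a signed/unsigned comparison; a minimal standalone C sketch of the buggy and fixed checks (illustrative values, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Pre-fix behavior: when the over-estimated high-atomic reserve
 * exceeds the actual free count, free_pages - reserved goes negative.
 * Compared against an unsigned mark, the negative long is promoted to
 * a huge unsigned value and the watermark check wrongly passes. */
static bool watermark_fast_buggy(long free_pages, long reserved,
                                 unsigned long mark)
{
        long fast_free = free_pages - reserved;

        return fast_free > mark; /* -400 promotes to ~ULONG_MAX: true */
}

/* Post-fix behavior: clamp the reserve to the usable free with min()
 * so the subtraction can never go negative. */
static bool watermark_fast_fixed(long free_pages, long reserved,
                                 unsigned long mark)
{
        long usable_free = free_pages;
        long clamped = (usable_free < reserved) ? usable_free : reserved;

        /* reserved may over-estimate high-atomic reserves */
        usable_free -= clamped;
        return (unsigned long)usable_free > mark;
}
```

With 100 free pages, a 500-page reserve estimate, and a mark of 750, the buggy check passes (letting order-0 allocations drain the zone without direct reclaim) while the fixed check correctly fails.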
* Re: [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 2022-09-25 10:35 ` [PATCH v2 " wangyong ` (2 preceding siblings ...) 2022-09-25 10:35 ` [PATCH v2 stable-4.19 3/3] page_alloc: fix invalid watermark check on a negative value wangyong @ 2022-10-02 15:37 ` Greg KH [not found] ` <CAOH5QeB2EqpqQd6fw-P199w8K8-3QNv_t-u_Wn1BLnfaSscmCg@mail.gmail.com> 3 siblings, 1 reply; 33+ messages in thread From: Greg KH @ 2022-10-02 15:37 UTC (permalink / raw) To: wangyong; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12 On Sun, Sep 25, 2022 at 03:35:26AM -0700, wangyong wrote: > Here are the corresponding backports to 4.19. > And fix classzone_idx context differences causing patch merge conflicts. > > Original commit IDS: > 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx > f27ce0e page_alloc: consider highatomic reserve in watermark fast > 9282012 page_alloc: fix invalid watermark check on a negative value > > Changes from v1: > - Add commit information of the original patches. None of these have your signed-off-by on them showing that the backport came from you and that you are responsible for them. So even if we did think they were valid to backport, I can't take them as-is :( thanks, greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
[parent not found: <CAOH5QeB2EqpqQd6fw-P199w8K8-3QNv_t-u_Wn1BLnfaSscmCg@mail.gmail.com>]
* Re: [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 [not found] ` <CAOH5QeB2EqpqQd6fw-P199w8K8-3QNv_t-u_Wn1BLnfaSscmCg@mail.gmail.com> @ 2022-10-07 16:41 ` Greg KH 2022-10-10 15:47 ` yong w 0 siblings, 1 reply; 33+ messages in thread From: Greg KH @ 2022-10-07 16:41 UTC (permalink / raw) To: yong w; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12 A: http://en.wikipedia.org/wiki/Top_post Q: Were do I find info about this thing called top-posting? A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing in e-mail? A: No. Q: Should I include quotations after my reply? http://daringfireball.net/2007/07/on_top On Fri, Oct 07, 2022 at 04:53:50PM +0800, yong w wrote: > Is it ok to add my signed-off-by? my signed-off-by is as follows: > > Signed-off-by: wangyong <wang.yong12@zte.com.cn> For obvious reasons, I can not take that from a random gmail account (nor should ZTE want me to do that.) Please fix up your email systems and do this properly and send the series again. thanks, greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 2022-10-07 16:41 ` Greg KH @ 2022-10-10 15:47 ` yong w 2022-10-10 15:58 ` Greg KH 0 siblings, 1 reply; 33+ messages in thread From: yong w @ 2022-10-10 15:47 UTC (permalink / raw) To: Greg KH; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12 Greg KH <gregkh@linuxfoundation.org> 于2022年10月8日周六 00:40写道: > > A: http://en.wikipedia.org/wiki/Top_post > Q: Were do I find info about this thing called top-posting? > A: Because it messes up the order in which people normally read text. > Q: Why is top-posting such a bad thing? > A: Top-posting. > Q: What is the most annoying thing in e-mail? > > A: No. > Q: Should I include quotations after my reply? > > http://daringfireball.net/2007/07/on_top > > > On Fri, Oct 07, 2022 at 04:53:50PM +0800, yong w wrote: > > Is it ok to add my signed-off-by? my signed-off-by is as follows: > > > > Signed-off-by: wangyong <wang.yong12@zte.com.cn> > > For obvious reasons, I can not take that from a random gmail account > (nor should ZTE want me to do that.) > > Please fix up your email systems and do this properly and send the > series again. > > thanks, > > greg k-h Sorry, our mail system cannot send external mail for some reason. And this is my email, I can receive an email and reply to it. thanks. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 2022-10-10 15:47 ` yong w @ 2022-10-10 15:58 ` Greg KH 0 siblings, 0 replies; 33+ messages in thread From: Greg KH @ 2022-10-10 15:58 UTC (permalink / raw) To: yong w; +Cc: jaewon31.kim, linux-kernel, mhocko, stable, wang.yong12 On Mon, Oct 10, 2022 at 11:47:30PM +0800, yong w wrote: > Greg KH <gregkh@linuxfoundation.org> wrote on Sat, Oct 8, 2022 at 00:40: > > > > A: http://en.wikipedia.org/wiki/Top_post > > Q: Were do I find info about this thing called top-posting? > > A: Because it messes up the order in which people normally read text. > > Q: Why is top-posting such a bad thing? > > A: Top-posting. > > Q: What is the most annoying thing in e-mail? > > > > A: No. > > Q: Should I include quotations after my reply? > > > > http://daringfireball.net/2007/07/on_top > > > > > > On Fri, Oct 07, 2022 at 04:53:50PM +0800, yong w wrote: > > > Is it ok to add my signed-off-by? my signed-off-by is as follows: > > > > > > Signed-off-by: wangyong <wang.yong12@zte.com.cn> > > > > For obvious reasons, I can not take that from a random gmail account > > (nor should ZTE want me to do that.) > > > > Please fix up your email systems and do this properly and send the > > series again. > > > > thanks, > > > > greg k-h > > Sorry, our mail system cannot send external mail for some reason. Sorry, then I can not attribute anything to your company. It has already been warned that it can not continue to contribute to the Linux kernel and some of your gmail "aliases" have been banned from the mailing lists. Please fix your email system so that you can properly contribute to Linux. > And this is my email; I can receive email and reply to it. Yes, but I have no proof that your ZTE email is correct, only that this is a random gmail.com address :( What would you do if you were in my situation? Please work with your IT group to fix their email systems. 
good luck, greg k-h ^ permalink raw reply [flat|nested] 33+ messages in thread
[parent not found: <CGME20220916094017epcas1p1deed4041f897d2bf0e0486554d79b3af@epcms1p4>]
* RE: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast [not found] ` <CGME20220916094017epcas1p1deed4041f897d2bf0e0486554d79b3af@epcms1p4> @ 2022-09-18 1:41 ` Jaewon Kim 2022-09-19 13:21 ` yong w 0 siblings, 1 reply; 33+ messages in thread From: Jaewon Kim @ 2022-09-18 1:41 UTC (permalink / raw) To: Greg KH, yong w Cc: Jaewon Kim, mhocko, linux-kernel, stable, wang.yong12, YongTaek Lee >On Wed, Sep 14, 2022 at 08:46:15AM +0800, yong w wrote: >> Greg KH <gregkh@linuxfoundation.org> wrote on Tue, Sep 13, 2022 at 21:54: >> >> > >> > On Tue, Sep 13, 2022 at 09:09:47PM +0800, yong wrote: >> > > Hello, >> > > This patch is required to be patched in linux-5.4.y and linux-4.19.y. >> > >> > What is "this patch"? There is no context here :( >> > >> Sorry, I forgot to quote the original patch. the patch is as follows >> >> f27ce0e page_alloc: consider highatomic reserve in watermark fast >> >> > > In addition to that, the following two patches are somewhat related: >> > > >> > > 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx >> > > 9282012 page_alloc: fix invalid watermark check on a negative value >> > >> > In what way? What should be done here by us? >> > >> >> I think these two patches should also be merged. >> >> The classzone_idx parameter is used in the zone_watermark_fast >> function, and 3334a45 uses ac->high_zoneidx for classzone_idx. >> "9282012 page_alloc: fix invalid watermark check on a negative >> value" fixes issues introduced by f27ce0e > >Ok, I need an ack by all the developers involved in those commits, as >well as the subsystem maintainer so that I know it's ok to take them. > >Can you provide a series of backported and tested patches so that they >are easy to review? > >thanks, > >greg k-h Hello I didn't know my Ack is needed to merge it. Acked-by: Jaewon Kim <jaewon31.kim@samsung.com> I don't understand well why the commit f27ce0e has dependency on 3334a45, though. 
Thank you Jaewon Kim ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2022-09-18 1:41 ` [PATCH v4] page_alloc: consider highatomic reserve in watermark fast Jaewon Kim @ 2022-09-19 13:21 ` yong w 0 siblings, 0 replies; 33+ messages in thread From: yong w @ 2022-09-19 13:21 UTC (permalink / raw) To: jaewon31.kim Cc: Greg KH, mhocko, linux-kernel, stable, wang.yong12, YongTaek Lee Jaewon Kim <jaewon31.kim@samsung.com> wrote on Mon, Sep 19, 2022 at 09:08: > > >On Wed, Sep 14, 2022 at 08:46:15AM +0800, yong w wrote: > >> Greg KH <gregkh@linuxfoundation.org> wrote on Tue, Sep 13, 2022 at 21:54: > >> > >> > > >> > On Tue, Sep 13, 2022 at 09:09:47PM +0800, yong wrote: > >> > > Hello, > >> > > This patch is required to be patched in linux-5.4.y and linux-4.19.y. > >> > > >> > What is "this patch"? There is no context here :( > >> > > >> Sorry, I forgot to quote the original patch. the patch is as follows > >> > >> f27ce0e page_alloc: consider highatomic reserve in watermark fast > >> > >> > > In addition to that, the following two patches are somewhat related: > >> > > > >> > > 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx > >> > > 9282012 page_alloc: fix invalid watermark check on a negative value > >> > > >> > In what way? What should be done here by us? > >> > > >> > >> I think these two patches should also be merged. > >> > >> The classzone_idx parameter is used in the zone_watermark_fast > >> function, and 3334a45 uses ac->high_zoneidx for classzone_idx. > >> "9282012 page_alloc: fix invalid watermark check on a negative > >> value" fixes issues introduced by f27ce0e > > > >Ok, I need an ack by all the developers involved in those commits, as > >well as the subsystem maintainer so that I know it's ok to take them. > > > >Can you provide a series of backported and tested patches so that they > >are easy to review? > > > >thanks, > > > >greg k-h > > Hello I didn't know my Ack is needed to merge it. 
> > Acked-by: Jaewon Kim <jaewon31.kim@samsung.com> > > I don't understand well why the commit f27ce0e has dependency on 3334a45, though. > Hello, the classzone_idx is used in the zone_watermark_fast function, and there will be conflicts when f27ce0e is merged. Looking back, the following two patches adjust the classzone_idx parameter. 3334a45 mm/page_alloc: use ac->high_zoneidx for classzone_idx 97a225e mm/page_alloc: integrate classzone_idx and high_zoneidx and 3334a45 is the key modification. Actually, I think 3334a45 can be merged or not. Thanks. ^ permalink raw reply [flat|nested] 33+ messages in thread
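[Editorial note: the backport conflict discussed above is mechanical rather than semantic. Upstream renamed the index parameter that zone_watermark_fast() uses into lowmem_reserve[] (classzone_idx became high_zoneidx/highest_zoneidx), so f27ce0e's hunks no longer apply verbatim on 4.19. The sketch below is a simplified userspace model, not kernel code, illustrating that the rename leaves the computation itself unchanged — which is why merging 3334a45 is optional as far as behavior goes:]

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 3

/* Simplified stand-in for struct zone; only the field the check needs. */
struct zone_model {
	long lowmem_reserve[MAX_NR_ZONES];
};

/* 4.19-era signature: the index parameter is named classzone_idx. */
static bool watermark_check_classzone(const struct zone_model *z, long free_pages,
				      long mark, int classzone_idx)
{
	return free_pages > mark + z->lowmem_reserve[classzone_idx];
}

/* Post-rename signature: identical check, parameter named highest_zoneidx. */
static bool watermark_check_highidx(const struct zone_model *z, long free_pages,
				    long mark, int highest_zoneidx)
{
	return free_pages > mark + z->lowmem_reserve[highest_zoneidx];
}
```

The two functions are line-for-line the same apart from the parameter name; the conflict is purely textual, in the diff context lines.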
[parent not found: <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcas1p1.samsung.com>]
* [PATCH v4] page_alloc: consider highatomic reserve in watermark fast [not found] <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcas1p1.samsung.com> @ 2020-06-19 23:59 ` Jaewon Kim 2020-06-19 12:42 ` Baoquan He ` (3 more replies) 0 siblings, 4 replies; 33+ messages in thread From: Jaewon Kim @ 2020-06-19 23:59 UTC (permalink / raw) To: vbabka, bhe, mgorman, minchan, mgorman, hannes, akpm Cc: linux-mm, linux-kernel, jaewon31.kim, ytk.lee, cmlaika.kim, Jaewon Kim zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm, page_alloc: shortcut watermark checks for order-0 pages"). The commit simply checks if free pages is bigger than watermark without additional calculation such like reducing watermark. It considered free cma pages but it did not consider highatomic reserved. This may incur exhaustion of free pages except high order atomic free pages. Assume that reserved_highatomic pageblock is bigger than watermark min, and there are only few free pages except high order atomic free. Because zone_watermark_fast passes the allocation without considering high order atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume all the free pages. Then finally order-0 atomic allocation may fail on allocation. This means watermark min is not protected against non-atomic allocation. The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed. Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also can be failed. To avoid the problem, zone_watermark_fast should consider highatomic reserve. If the actual size of high atomic free is counted accurately like cma free, we may use it. On this patch just use nr_reserved_highatomic. Additionally introduce __zone_watermark_unusable_free to factor out common parts between zone_watermark_fast and __zone_watermark_ok. This is an example of ALLOC_HARDER allocation failure using v4.19 based kernel. 
Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null) Call trace: [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0 [<ffffff8008223320>] warn_alloc+0xd8/0x12c [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250 [<ffffff800827f6e8>] new_slab+0x128/0x604 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c Mem-Info: active_anon:102061 inactive_anon:81551 isolated_anon:0 active_file:59102 inactive_file:68924 isolated_file:64 unevictable:611 dirty:63 writeback:0 unstable:0 slab_reclaimable:13324 slab_unreclaimable:44354 mapped:83015 shmem:4858 pagetables:26316 bounce:0 free:2727 free_pcp:1035 free_cma:178 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB lowmem_reserve[]: 0 0 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB 138826 total pagecache pages 5460 pages in swap cache Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142 This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14 based kernel. 
kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null) kswapd0 cpuset=/ mems_allowed=0 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1 Call trace: [<0000000000000000>] dump_backtrace+0x0/0x248 [<0000000000000000>] show_stack+0x18/0x20 [<0000000000000000>] __dump_stack+0x20/0x28 [<0000000000000000>] dump_stack+0x68/0x90 [<0000000000000000>] warn_alloc+0x104/0x198 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0 [<0000000000000000>] zs_malloc+0x148/0x3d0 [<0000000000000000>] zram_bvec_rw+0x410/0x798 [<0000000000000000>] zram_rw_page+0x88/0xdc [<0000000000000000>] bdev_write_page+0x70/0xbc [<0000000000000000>] __swap_writepage+0x58/0x37c [<0000000000000000>] swap_writepage+0x40/0x4c [<0000000000000000>] shrink_page_list+0xc30/0xf48 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c [<0000000000000000>] shrink_node_memcg+0x23c/0x618 [<0000000000000000>] shrink_node+0x1c8/0x304 [<0000000000000000>] kswapd+0x680/0x7c4 [<0000000000000000>] kthread+0x110/0x120 [<0000000000000000>] ret_from_fork+0x10/0x18 Mem-Info: active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB This is trace log which shows GFP_HIGHUSER consumes free pages right before ALLOC_NO_WATERMARKS. <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 
889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- v4: change description only; typo and log v3: change log in description to one having reserved_highatomic change comment in code v2: factor out common part v1: consider highatomic reserve --- mm/page_alloc.c | 66 +++++++++++++++++++++++++++---------------------- 1 file changed, 36 insertions(+), 30 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 48eb0f1410d4..fe83f88ce188 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) } ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); +static inline long __zone_watermark_unusable_free(struct zone *z, + unsigned int order, unsigned int alloc_flags) +{ + const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); + long unusable_free = (1 << order) - 1; + + /* + * If the caller does not have rights to ALLOC_HARDER then subtract + * the high-atomic reserves. This will over-estimate the size of the + * atomic reserve but it avoids a search. + */ + if (likely(!alloc_harder)) + unusable_free += z->nr_reserved_highatomic; + +#ifdef CONFIG_CMA + /* If allocation can't use CMA areas don't use free CMA pages */ + if (!(alloc_flags & ALLOC_CMA)) + unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES); +#endif + + return unusable_free; +} + /* * Return true if free base pages are above 'mark'. 
For high-order checks it * will return true of the order-0 watermark is reached and there is at least @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); /* free_pages may go negative - that's OK */ - free_pages -= (1 << order) - 1; + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); if (alloc_flags & ALLOC_HIGH) min -= min / 2; - /* - * If the caller does not have rights to ALLOC_HARDER then subtract - * the high-atomic reserves. This will over-estimate the size of the - * atomic reserve but it avoids a search. - */ - if (likely(!alloc_harder)) { - free_pages -= z->nr_reserved_highatomic; - } else { + if (unlikely(alloc_harder)) { /* * OOM victims can try even harder than normal ALLOC_HARDER * users on the grounds that it's definitely going to be in @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, min -= min / 4; } - -#ifdef CONFIG_CMA - /* If allocation can't use CMA areas don't use free CMA pages */ - if (!(alloc_flags & ALLOC_CMA)) - free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); -#endif - /* * Check watermarks for an order-0 allocation request. If these * are not met, then a high-order request also cannot go ahead @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx, unsigned int alloc_flags) { - long free_pages = zone_page_state(z, NR_FREE_PAGES); - long cma_pages = 0; + long free_pages; + long unusable_free; -#ifdef CONFIG_CMA - /* If allocation can't use CMA areas don't use free CMA pages */ - if (!(alloc_flags & ALLOC_CMA)) - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); -#endif + free_pages = zone_page_state(z, NR_FREE_PAGES); + unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags); /* * Fast check for order-0 only. 
If this fails then the reserves - * need to be calculated. There is a corner case where the check - * passes but only the high-order atomic reserve are free. If - * the caller is !atomic then it'll uselessly search the free - * list. That corner case is then slower but it is harmless. + * need to be calculated. */ - if (!order && (free_pages - cma_pages) > - mark + z->lowmem_reserve[highest_zoneidx]) - return true; + if (!order) { + long fast_free = free_pages - unusable_free; + + if (fast_free > mark + z->lowmem_reserve[highest_zoneidx]) + return true; + } return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, free_pages); -- 2.17.1 ^ permalink raw reply related [flat|nested] 33+ messages in thread
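[Editorial note: to make the failure mode in the logs above concrete, here is a small userspace model of the order-0 fast check before and after this patch. Field and flag names mirror the kernel for readability, but the structure and values are illustrative, not kernel code. Without subtracting nr_reserved_highatomic, a normal allocation passes the watermark even when almost every remaining free page sits in the highatomic reserve:]

```c
#include <assert.h>
#include <stdbool.h>

#define ALLOC_HARDER 0x1
#define ALLOC_CMA    0x2

/* Simplified stand-in for struct zone (page counts, not kB). */
struct zone_model {
	long nr_free_pages;          /* NR_FREE_PAGES */
	long nr_reserved_highatomic; /* highatomic reserve */
	long nr_free_cma;            /* NR_FREE_CMA_PAGES */
	long watermark_min;
	long lowmem_reserve;
};

/* Pre-f27ce0e fast check: only unusable CMA pages are subtracted. */
static bool zone_watermark_fast_old(const struct zone_model *z, unsigned int alloc_flags)
{
	long cma = (alloc_flags & ALLOC_CMA) ? 0 : z->nr_free_cma;

	return (z->nr_free_pages - cma) > z->watermark_min + z->lowmem_reserve;
}

/* Post-f27ce0e fast check: callers without ALLOC_HARDER also subtract the
 * highatomic reserve, mirroring __zone_watermark_unusable_free(). */
static bool zone_watermark_fast_new(const struct zone_model *z, unsigned int alloc_flags)
{
	long unusable = 0;

	if (!(alloc_flags & ALLOC_HARDER))
		unusable += z->nr_reserved_highatomic;
	if (!(alloc_flags & ALLOC_CMA))
		unusable += z->nr_free_cma;

	return (z->nr_free_pages - unusable) > z->watermark_min + z->lowmem_reserve;
}
```

With numbers roughly matching the v4.14 log above (free:14040kB ≈ 3510 pages, reserved_highatomic:32768kB = 8192 pages, min:7440kB = 1860 pages), the old check lets a GFP_HIGHUSER order-0 allocation through (3510 > 1860) even though the free list is almost entirely (H) blocks, while the new check sends it to the slow path; an ALLOC_HARDER caller still passes.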
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-19 23:59 ` Jaewon Kim @ 2020-06-19 12:42 ` Baoquan He 2020-06-22 8:55 ` Mel Gorman ` (2 subsequent siblings) 3 siblings, 0 replies; 33+ messages in thread From: Baoquan He @ 2020-06-19 12:42 UTC (permalink / raw) To: Jaewon Kim Cc: vbabka, mgorman, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, ytk.lee, cmlaika.kim On 06/20/20 at 08:59am, Jaewon Kim wrote: ... > kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE > > Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> > Suggested-by: Minchan Kim <minchan@kernel.org> > Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> > Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Baoquan He <bhe@redhat.com> > --- > v4: change description only; typo and log > v3: change log in description to one having reserved_highatomic > change comment in code > v2: factor out common part > v1: consider highatomic reserve > --- > mm/page_alloc.c | 66 +++++++++++++++++++++++++++---------------------- > 1 file changed, 36 insertions(+), 30 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 48eb0f1410d4..fe83f88ce188 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) > } > ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); > > +static inline long __zone_watermark_unusable_free(struct zone *z, > + unsigned int order, unsigned int alloc_flags) > +{ > + const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); > + long unusable_free = (1 << order) - 1; > + > + /* > + * If the caller does not have rights to ALLOC_HARDER then subtract > + * the high-atomic reserves. This will over-estimate the size of the > + * atomic reserve but it avoids a search. 
> + */ > + if (likely(!alloc_harder)) > + unusable_free += z->nr_reserved_highatomic; > + > +#ifdef CONFIG_CMA > + /* If allocation can't use CMA areas don't use free CMA pages */ > + if (!(alloc_flags & ALLOC_CMA)) > + unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES); > +#endif > + > + return unusable_free; > +} > + > /* > * Return true if free base pages are above 'mark'. For high-order checks it > * will return true of the order-0 watermark is reached and there is at least > @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); > > /* free_pages may go negative - that's OK */ > - free_pages -= (1 << order) - 1; > + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); > > if (alloc_flags & ALLOC_HIGH) > min -= min / 2; > > - /* > - * If the caller does not have rights to ALLOC_HARDER then subtract > - * the high-atomic reserves. This will over-estimate the size of the > - * atomic reserve but it avoids a search. > - */ > - if (likely(!alloc_harder)) { > - free_pages -= z->nr_reserved_highatomic; > - } else { > + if (unlikely(alloc_harder)) { > /* > * OOM victims can try even harder than normal ALLOC_HARDER > * users on the grounds that it's definitely going to be in > @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > min -= min / 4; > } > > - > -#ifdef CONFIG_CMA > - /* If allocation can't use CMA areas don't use free CMA pages */ > - if (!(alloc_flags & ALLOC_CMA)) > - free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); > -#endif > - > /* > * Check watermarks for an order-0 allocation request. 
If these > * are not met, then a high-order request also cannot go ahead > @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, > unsigned long mark, int highest_zoneidx, > unsigned int alloc_flags) > { > - long free_pages = zone_page_state(z, NR_FREE_PAGES); > - long cma_pages = 0; > + long free_pages; > + long unusable_free; > > -#ifdef CONFIG_CMA > - /* If allocation can't use CMA areas don't use free CMA pages */ > - if (!(alloc_flags & ALLOC_CMA)) > - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); > -#endif > + free_pages = zone_page_state(z, NR_FREE_PAGES); > + unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags); > > /* > * Fast check for order-0 only. If this fails then the reserves > - * need to be calculated. There is a corner case where the check > - * passes but only the high-order atomic reserve are free. If > - * the caller is !atomic then it'll uselessly search the free > - * list. That corner case is then slower but it is harmless. > + * need to be calculated. > */ > - if (!order && (free_pages - cma_pages) > > - mark + z->lowmem_reserve[highest_zoneidx]) > - return true; > + if (!order) { > + long fast_free = free_pages - unusable_free; > + > + if (fast_free > mark + z->lowmem_reserve[highest_zoneidx]) > + return true; > + } > > return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, > free_pages); > -- > 2.17.1 > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-19 23:59 ` Jaewon Kim 2020-06-19 12:42 ` Baoquan He @ 2020-06-22 8:55 ` Mel Gorman 2020-06-22 9:11 ` Michal Hocko [not found] ` <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcms1p6> 3 siblings, 0 replies; 33+ messages in thread From: Mel Gorman @ 2020-06-22 8:55 UTC (permalink / raw) To: Jaewon Kim Cc: vbabka, bhe, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, ytk.lee, cmlaika.kim On Sat, Jun 20, 2020 at 08:59:58AM +0900, Jaewon Kim wrote: > zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm, > page_alloc: shortcut watermark checks for order-0 pages"). The commit > simply checks if free pages is bigger than watermark without additional > calculation such like reducing watermark. > > It considered free cma pages but it did not consider highatomic > reserved. This may incur exhaustion of free pages except high order > atomic free pages. > > Assume that reserved_highatomic pageblock is bigger than watermark min, > and there are only few free pages except high order atomic free. Because > zone_watermark_fast passes the allocation without considering high order > atomic free, normal reclaimable allocation like GFP_HIGHUSER will > consume all the free pages. Then finally order-0 atomic allocation may > fail on allocation. > > This means watermark min is not protected against non-atomic allocation. > The order-0 atomic allocation with ALLOC_HARDER unwantedly can be > failed. Additionally the __GFP_MEMALLOC allocation with > ALLOC_NO_WATERMARKS also can be failed. > > To avoid the problem, zone_watermark_fast should consider highatomic > reserve. If the actual size of high atomic free is counted accurately > like cma free, we may use it. On this patch just use > nr_reserved_highatomic. Additionally introduce > __zone_watermark_unusable_free to factor out common parts between > zone_watermark_fast and __zone_watermark_ok. 
> > This is an example of ALLOC_HARDER allocation failure using v4.19 based > kernel. > > Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null) > Call trace: > [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0 > [<ffffff8008223320>] warn_alloc+0xd8/0x12c > [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250 > [<ffffff800827f6e8>] new_slab+0x128/0x604 > [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670 > [<ffffff800827ba00>] __kmalloc+0x2f8/0x310 > [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc > [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144 > [<ffffff80084ad880>] security_sid_to_context+0x10/0x18 > [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28 > [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70 > [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c > Mem-Info: > active_anon:102061 inactive_anon:81551 isolated_anon:0 > active_file:59102 inactive_file:68924 isolated_file:64 > unevictable:611 dirty:63 writeback:0 unstable:0 > slab_reclaimable:13324 slab_unreclaimable:44354 > mapped:83015 shmem:4858 pagetables:26316 bounce:0 > free:2727 free_pcp:1035 free_cma:178 > Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no > Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB > lowmem_reserve[]: 0 0 > Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB > 138826 total pagecache pages > 5460 pages in swap cache > Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142 > > This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14 > based kernel. > > kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null) > kswapd0 cpuset=/ mems_allowed=0 > CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1 > Call trace: > [<0000000000000000>] dump_backtrace+0x0/0x248 > [<0000000000000000>] show_stack+0x18/0x20 > [<0000000000000000>] __dump_stack+0x20/0x28 > [<0000000000000000>] dump_stack+0x68/0x90 > [<0000000000000000>] warn_alloc+0x104/0x198 > [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0 > [<0000000000000000>] zs_malloc+0x148/0x3d0 > [<0000000000000000>] zram_bvec_rw+0x410/0x798 > [<0000000000000000>] zram_rw_page+0x88/0xdc > [<0000000000000000>] bdev_write_page+0x70/0xbc > [<0000000000000000>] __swap_writepage+0x58/0x37c > [<0000000000000000>] swap_writepage+0x40/0x4c > [<0000000000000000>] shrink_page_list+0xc30/0xf48 > [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c > [<0000000000000000>] shrink_node_memcg+0x23c/0x618 > [<0000000000000000>] shrink_node+0x1c8/0x304 > [<0000000000000000>] kswapd+0x680/0x7c4 > [<0000000000000000>] kthread+0x110/0x120 > [<0000000000000000>] ret_from_fork+0x10/0x18 > Mem-Info: > active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a 
unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0 > Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0 > Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB > > This is trace log which shows GFP_HIGHUSER consumes free pages right > before ALLOC_NO_WATERMARKS. > > <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 
889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > <...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO > kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE > > Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> > Suggested-by: Minchan Kim <minchan@kernel.org> > Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> > Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-19 23:59 ` Jaewon Kim 2020-06-19 12:42 ` Baoquan He 2020-06-22 8:55 ` Mel Gorman @ 2020-06-22 9:11 ` Michal Hocko [not found] ` <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcms1p6> 3 siblings, 0 replies; 33+ messages in thread From: Michal Hocko @ 2020-06-22 9:11 UTC (permalink / raw) To: Jaewon Kim Cc: vbabka, bhe, mgorman, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, ytk.lee, cmlaika.kim On Sat 20-06-20 08:59:58, Jaewon Kim wrote: [...] > @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); > > /* free_pages may go negative - that's OK */ > - free_pages -= (1 << order) - 1; > + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); > > if (alloc_flags & ALLOC_HIGH) > min -= min / 2; > > - /* > - * If the caller does not have rights to ALLOC_HARDER then subtract > - * the high-atomic reserves. This will over-estimate the size of the > - * atomic reserve but it avoids a search. > - */ > - if (likely(!alloc_harder)) { > - free_pages -= z->nr_reserved_highatomic; > - } else { > + if (unlikely(alloc_harder)) { > /* > * OOM victims can try even harder than normal ALLOC_HARDER > * users on the grounds that it's definitely going to be in [...] 
> @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, > unsigned long mark, int highest_zoneidx, > unsigned int alloc_flags) > { > - long free_pages = zone_page_state(z, NR_FREE_PAGES); > - long cma_pages = 0; > + long free_pages; > + long unusable_free; > > -#ifdef CONFIG_CMA > - /* If allocation can't use CMA areas don't use free CMA pages */ > - if (!(alloc_flags & ALLOC_CMA)) > - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); > -#endif > + free_pages = zone_page_state(z, NR_FREE_PAGES); > + unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags); > > /* > * Fast check for order-0 only. If this fails then the reserves > - * need to be calculated. There is a corner case where the check > - * passes but only the high-order atomic reserve are free. If > - * the caller is !atomic then it'll uselessly search the free > - * list. That corner case is then slower but it is harmless. > + * need to be calculated. > */ > - if (!order && (free_pages - cma_pages) > > - mark + z->lowmem_reserve[highest_zoneidx]) > - return true; > + if (!order) { > + long fast_free = free_pages - unusable_free; > + > + if (fast_free > mark + z->lowmem_reserve[highest_zoneidx]) > + return true; > + } There is no user of unusable_free for order > 0. With your current code __zone_watermark_unusable_free would be called twice for high-order allocations unless the compiler tries to be clever. But more importantly, I have a hard time following why we need both zone_watermark_fast and zone_watermark_ok now. They should be essentially the same for anything but order == 0. For order 0 the only difference between the two is that zone_watermark_ok checks for ALLOC_HIGH resp. ALLOC_HARDER, ALLOC_OOM. So what exactly is fast about the former and why do we need it these days? 
> > return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, > free_pages); > -- > 2.17.1 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
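The order-0 fast path under discussion can be modeled outside the kernel. The sketch below is an illustration only, not mm/page_alloc.c: the struct and helper are simplified stand-ins for the kernel's types, and the page counts are hypothetical values loosely taken from the report earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the patched order-0 fast check: pages the caller
 * cannot actually use -- the highatomic reserve (unless the caller may
 * dig harder), free CMA pages (unless CMA is allowed), and the
 * (1 << order) - 1 slack -- are subtracted before comparing against
 * the watermark plus the lowmem reserve. All values are in pages.
 */
struct zone_model {
	long nr_free;
	long nr_free_cma;
	long nr_reserved_highatomic;
	long lowmem_reserve;
};

static long model_unusable_free(const struct zone_model *z, unsigned int order,
				bool alloc_cma, bool alloc_harder)
{
	long unusable = (1L << order) - 1;

	if (!alloc_harder)
		unusable += z->nr_reserved_highatomic;
	if (!alloc_cma)
		unusable += z->nr_free_cma;
	return unusable;
}

static bool watermark_fast_order0(const struct zone_model *z, long mark,
				  bool alloc_cma, bool alloc_harder)
{
	long fast_free = z->nr_free -
			 model_unusable_free(z, 0, alloc_cma, alloc_harder);

	return fast_free > mark + z->lowmem_reserve;
}
```

With numbers resembling the report (nr_free=3650 pages, a 32768kB = 8192-page highatomic reserve, min:7440kB = 1860 pages), an ordinary order-0 request now fails the fast check instead of passing on reserve pages it is not allowed to take, while a harder-trying caller still passes.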
[parent not found: <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcms1p6>]
* RE: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast [not found] ` <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcms1p6> @ 2020-06-22 9:40 ` 김재원 2020-06-22 10:04 ` Mel Gorman 0 siblings, 1 reply; 33+ messages in thread From: 김재원 @ 2020-06-22 9:40 UTC (permalink / raw) To: Michal Hocko, 김재원 Cc: vbabka, bhe, mgorman, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, 이용택, 김철민 >On Sat 20-06-20 08:59:58, Jaewon Kim wrote: >[...] >> @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, >> const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); >> >> /* free_pages may go negative - that's OK */ >> - free_pages -= (1 << order) - 1; >> + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); >> >> if (alloc_flags & ALLOC_HIGH) >> min -= min / 2; >> >> - /* >> - * If the caller does not have rights to ALLOC_HARDER then subtract >> - * the high-atomic reserves. This will over-estimate the size of the >> - * atomic reserve but it avoids a search. >> - */ >> - if (likely(!alloc_harder)) { >> - free_pages -= z->nr_reserved_highatomic; >> - } else { >> + if (unlikely(alloc_harder)) { >> /* >> * OOM victims can try even harder than normal ALLOC_HARDER >> * users on the grounds that it's definitely going to be in >[...] 
>> @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, >> unsigned long mark, int highest_zoneidx, >> unsigned int alloc_flags) >> { >> - long free_pages = zone_page_state(z, NR_FREE_PAGES); >> - long cma_pages = 0; >> + long free_pages; >> + long unusable_free; >> >> -#ifdef CONFIG_CMA >> - /* If allocation can't use CMA areas don't use free CMA pages */ >> - if (!(alloc_flags & ALLOC_CMA)) >> - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); >> -#endif >> + free_pages = zone_page_state(z, NR_FREE_PAGES); >> + unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags); >> >> /* >> * Fast check for order-0 only. If this fails then the reserves >> - * need to be calculated. There is a corner case where the check >> - * passes but only the high-order atomic reserve are free. If >> - * the caller is !atomic then it'll uselessly search the free >> - * list. That corner case is then slower but it is harmless. >> + * need to be calculated. >> */ >> - if (!order && (free_pages - cma_pages) > >> - mark + z->lowmem_reserve[highest_zoneidx]) >> - return true; >> + if (!order) { >> + long fast_free = free_pages - unusable_free; >> + >> + if (fast_free > mark + z->lowmem_reserve[highest_zoneidx]) >> + return true; >> + } > >There is no user of unusable_free for order > 0. With you current code >__zone_watermark_unusable_free would be called twice for high order allocations unless compiler tries to be clever.. Yes, you're right. The following call could be moved so it is done only for order-0: unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags); Let me fix it in v5. > >But more importantly, I have hard time to follow why we need both >zone_watermark_fast and zone_watermark_ok now. They should be >essentially the same for anything but order == 0. For order 0 the >only difference between the two is that zone_watermark_ok checks for >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. 
So what is exactly fast about >the former and why do we need it these days? > I think the author, Mel, may answer. But I think wmark_fast may be fast by 1) not checking more conditions on the watermark and 2) being inline rather than a function. According to the description of commit 48ee5f3696f6, it seems to bring about a 4% improvement. >> >> return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, >> free_pages); >> -- >> 2.17.1 >> > >-- >Michal Hocko >SUSE Labs > ^ permalink raw reply [flat|nested] 33+ messages in thread
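The fix agreed to above -- computing the unusable-free count only on the order-0 path -- can be sketched with a call counter to show that higher orders skip the extra work. Everything here is an illustrative stand-in, not the actual v5 diff; the slow-path fallback (`__zone_watermark_ok`) is deliberately elided.

```c
#include <assert.h>
#include <stdbool.h>

static int unusable_calls;	/* counts invocations of the "expensive" helper */

static long model_unusable_free(void)
{
	unusable_calls++;
	return 8192;	/* hypothetical highatomic reserve, in pages */
}

/*
 * Sketch of the v5 shape: only the order-0 fast path pays for the
 * unusable-free computation; other orders fall straight through to
 * the slow path, which this sketch elides (it just returns false).
 */
static bool model_watermark_fast(long nr_free, unsigned int order, long mark)
{
	if (!order) {
		long fast_free = nr_free - model_unusable_free();

		if (fast_free > mark)
			return true;
	}
	/* slow path (__zone_watermark_ok) would run here */
	return false;
}
```

Calling this once with order 0 and once with order 1 leaves the counter at one, which is exactly the point of Michal's remark: with the v4 code the helper would have run for the order-1 request as well.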
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-22 9:40 ` 김재원 @ 2020-06-22 10:04 ` Mel Gorman 2020-06-22 14:23 ` Michal Hocko 0 siblings, 1 reply; 33+ messages in thread From: Mel Gorman @ 2020-06-22 10:04 UTC (permalink / raw) To: 김재원 Cc: Michal Hocko, vbabka, bhe, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, 이용택, 김철민 On Mon, Jun 22, 2020 at 06:40:20PM +0900, 김재원 wrote: > >But more importantly, I have hard time to follow why we need both > >zone_watermark_fast and zone_watermark_ok now. They should be > >essentially the same for anything but order == 0. For order 0 the > >only difference between the two is that zone_watermark_ok checks for > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about > >the former and why do we need it these days? > > > > I think the author, Mel, may ansewr. But I think the wmark_fast may > fast by 1) not checking more condition about wmark and 2) using inline > rather than function. According to description on commit 48ee5f3696f6, > it seems to bring about 4% improvement. > The original intent was that watermark checks were expensive as some of the calculations are only necessary when a zone is relatively low on memory and the check does not always have to be 100% accurate. This is probably still true given that __zone_watermark_ok() makes a number of calculations depending on alloc flags even if a zone is almost completely free. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-22 10:04 ` Mel Gorman @ 2020-06-22 14:23 ` Michal Hocko 2020-06-22 16:25 ` Mel Gorman 0 siblings, 1 reply; 33+ messages in thread From: Michal Hocko @ 2020-06-22 14:23 UTC (permalink / raw) To: Mel Gorman Cc: 김재원, vbabka, bhe, minchan, mgorman, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, 이용택, 김철민 On Mon 22-06-20 11:04:39, Mel Gorman wrote: > On Mon, Jun 22, 2020 at 06:40:20PM +0900, 김재원 wrote: > > >But more importantly, I have hard time to follow why we need both > > >zone_watermark_fast and zone_watermark_ok now. They should be > > >essentially the same for anything but order == 0. For order 0 the > > >only difference between the two is that zone_watermark_ok checks for > > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about > > >the former and why do we need it these days? > > > > > > > I think the author, Mel, may ansewr. But I think the wmark_fast may > > fast by 1) not checking more condition about wmark and 2) using inline > > rather than function. According to description on commit 48ee5f3696f6, > > it seems to bring about 4% improvement. > > > > The original intent was that watermark checks were expensive as some of the > calculations are only necessary when a zone is relatively low on memory > and the check does not always have to be 100% accurate. This is probably > still true given that __zone_watermark_ok() makes a number of calculations > depending on alloc flags even if a zone is almost completely free. OK, so we are talking about if (alloc_flags & ALLOC_HIGH) min -= min / 2; if (unlikely(alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) { /* * OOM victims can try even harder than normal ALLOC_HARDER * users on the grounds that it's definitely going to be in * the exit path shortly and free memory. Any allocation it * makes during the free path will be small and short-lived. */ if (alloc_flags & ALLOC_OOM) min -= min / 2; else min -= min / 4; } Is this something even measurable and something that would justify such complex code? If we really want to keep it even after these changes, which are making the two closer in cost, then can we have it documented at least? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
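The min adjustments Michal quotes compose as simple integer arithmetic. A hedged sketch follows; the `M_ALLOC_*` bits are illustrative stand-ins, not the kernel's real `ALLOC_*` values (those live in mm/internal.h):

```c
#include <assert.h>

/* Illustrative flag bits, not the kernel's ALLOC_* definitions. */
#define M_ALLOC_HIGH	0x1
#define M_ALLOC_HARDER	0x2
#define M_ALLOC_OOM	0x4

/*
 * Mirrors the quoted logic: ALLOC_HIGH halves min, then OOM victims
 * get another half off while ordinary ALLOC_HARDER callers get a
 * quarter off. Watermarks are in pages.
 */
static long effective_min(long min, unsigned int flags)
{
	if (flags & M_ALLOC_HIGH)
		min -= min / 2;

	if (flags & (M_ALLOC_HARDER | M_ALLOC_OOM)) {
		if (flags & M_ALLOC_OOM)
			min -= min / 2;
		else
			min -= min / 4;
	}
	return min;
}
```

Starting from the reported min of 1860 pages, an ALLOC_HIGH caller checks against 930, an ALLOC_HIGH|ALLOC_OOM victim against 465, and a plain ALLOC_HARDER caller against 1395 -- two branches and at most two divisions, which is the "complexity" being weighed here.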
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-22 14:23 ` Michal Hocko @ 2020-06-22 16:25 ` Mel Gorman 2020-06-23 7:11 ` Michal Hocko 0 siblings, 1 reply; 33+ messages in thread From: Mel Gorman @ 2020-06-22 16:25 UTC (permalink / raw) To: Michal Hocko Cc: Mel Gorman, 김재원, vbabka, bhe, minchan, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, 이용택, 김철민 On Mon, Jun 22, 2020 at 04:23:04PM +0200, Michal Hocko wrote: > On Mon 22-06-20 11:04:39, Mel Gorman wrote: > > On Mon, Jun 22, 2020 at 06:40:20PM +0900, 김재원 wrote: > > > >But more importantly, I have hard time to follow why we need both > > > >zone_watermark_fast and zone_watermark_ok now. They should be > > > >essentially the same for anything but order == 0. For order 0 the > > > >only difference between the two is that zone_watermark_ok checks for > > > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about > > > >the former and why do we need it these days? > > > > > > > > > > I think the author, Mel, may ansewr. But I think the wmark_fast may > > > fast by 1) not checking more condition about wmark and 2) using inline > > > rather than function. According to description on commit 48ee5f3696f6, > > > it seems to bring about 4% improvement. > > > > > > > The original intent was that watermark checks were expensive as some of the > > calculations are only necessary when a zone is relatively low on memory > > and the check does not always have to be 100% accurate. This is probably > > still true given that __zone_watermark_ok() makes a number of calculations > > depending on alloc flags even if a zone is almost completely free. > > OK, so we are talking about > if (alloc_flags & ALLOC_HIGH) > min -= min / 2; > > if (unlikely((alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) { > /* > * OOM victims can try even harder than normal ALLOC_HARDER > * users on the grounds that it's definitely going to be in > * the exit path shortly and free memory. 
Any allocation it > * makes during the free path will be small and short-lived. > */ > if (alloc_flags & ALLOC_OOM) > min -= min / 2; > else > min -= min / 4; > } > > Is this something even measurable and something that would justify a > complex code? If we really want to keep it even after these changes > which are making the two closer in the cost then can we have it > documented at least? It was originally documented as being roughly 4% for a page allocator micro-benchmark but that was 4 years ago and I do not even remember what type of machine that was on. Chances are the relative cost is different now but I haven't measured it as the microbenchmark in question doesn't even compile with recent kernels. For many allocations, the bulk of the allocation cost is zeroing the page so I have no particular objection to zone_watermark_fast being removed if it makes the code easier to read. While I have not looked recently, the cost of allocation in general and the increasing scope of the zone->lock with larger NUMA nodes for high-order allocations like THP are more of a concern than two branches and potentially two minor calculations. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v4] page_alloc: consider highatomic reserve in watermark fast 2020-06-22 16:25 ` Mel Gorman @ 2020-06-23 7:11 ` Michal Hocko 0 siblings, 0 replies; 33+ messages in thread From: Michal Hocko @ 2020-06-23 7:11 UTC (permalink / raw) To: Mel Gorman Cc: Mel Gorman, 김재원, vbabka, bhe, minchan, hannes, akpm, linux-mm, linux-kernel, jaewon31.kim, 이용택, 김철민 On Mon 22-06-20 17:25:01, Mel Gorman wrote: > On Mon, Jun 22, 2020 at 04:23:04PM +0200, Michal Hocko wrote: > > On Mon 22-06-20 11:04:39, Mel Gorman wrote: > > > On Mon, Jun 22, 2020 at 06:40:20PM +0900, 김재원 wrote: > > > > >But more importantly, I have hard time to follow why we need both > > > > >zone_watermark_fast and zone_watermark_ok now. They should be > > > > >essentially the same for anything but order == 0. For order 0 the > > > > >only difference between the two is that zone_watermark_ok checks for > > > > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about > > > > >the former and why do we need it these days? > > > > > > > > > > > > > I think the author, Mel, may ansewr. But I think the wmark_fast may > > > > fast by 1) not checking more condition about wmark and 2) using inline > > > > rather than function. According to description on commit 48ee5f3696f6, > > > > it seems to bring about 4% improvement. > > > > > > > > > > The original intent was that watermark checks were expensive as some of the > > > calculations are only necessary when a zone is relatively low on memory > > > and the check does not always have to be 100% accurate. This is probably > > > still true given that __zone_watermark_ok() makes a number of calculations > > > depending on alloc flags even if a zone is almost completely free. 
> > > > OK, so we are talking about > > if (alloc_flags & ALLOC_HIGH) > > min -= min / 2; > > > > if (unlikely((alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) { > > /* > > * OOM victims can try even harder than normal ALLOC_HARDER > > * users on the grounds that it's definitely going to be in > > * the exit path shortly and free memory. Any allocation it > > * makes during the free path will be small and short-lived. > > */ > > if (alloc_flags & ALLOC_OOM) > > min -= min / 2; > > else > > min -= min / 4; > > } > > > > Is this something even measurable and something that would justify a > > complex code? If we really want to keep it even after these changes > > which are making the two closer in the cost then can we have it > > documented at least? > > It was originally documented as being roughly 4% for a page allocator > micro-benchmark but that was 4 years ago and I do not even remember what > type of machine that was on. Chances are the relative cost is different > now but I haven't measured it as the microbenchmark in question doesn't > even compile with recent kernels. Thanks for the clarification. > For many allocations, the bulk of the > allocation cost is zeroing the page so I have no particular objection > to zone_watermark_fast being removed if it makes the code easier to > read. While I have not looked recently, the cost of allocation in general > and the increasing scope of the zone->lock with larger NUMA nodes for > high-order allocations like THP are more of a concern than two branches > and potentially two minor calculations. OK, then I would rather go with the code simplification for the future maintainability. If somebody can test this and provide good numbers then we can reintroduce a fast check. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2022-10-10 15:58 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-13 13:09 [PATCH v4] page_alloc: consider highatomic reserve in watermark fast yong
2022-09-13 13:54 ` Greg KH
2022-09-14 0:46 ` yong w
2022-09-16 9:40 ` Greg KH
2022-09-16 17:05 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 wangyong
2022-09-16 17:05 ` [PATCH 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
2022-09-16 17:09 ` kernel test robot
2022-09-16 17:05 ` [PATCH 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong
2022-09-16 17:05 ` [PATCH 3/3] page_alloc: fix invalid watermark check on a negative value wangyong
2022-09-20 17:41 ` [PATCH stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH
2022-09-25 10:35 ` [PATCH v2 " wangyong
2022-09-25 10:35 ` [PATCH v2 stable-4.19 1/3] mm/page_alloc: use ac->high_zoneidx for classzone_idx wangyong
2022-09-25 10:36 ` kernel test robot
2022-09-25 11:00 ` Greg KH
2022-09-25 14:32 ` yong w
2022-09-26 6:46 ` Greg KH
2022-09-25 10:35 ` [PATCH v2 stable-4.19 2/3] page_alloc: consider highatomic reserve in watermark fast wangyong
2022-09-25 10:35 ` [PATCH v2 stable-4.19 3/3] page_alloc: fix invalid watermark check on a negative value wangyong
2022-10-02 15:37 ` [PATCH v2 stable-4.19 0/3] page_alloc: consider highatomic reserve in watermark fast backports to 4.19 Greg KH
[not found] ` <CAOH5QeB2EqpqQd6fw-P199w8K8-3QNv_t-u_Wn1BLnfaSscmCg@mail.gmail.com>
2022-10-07 16:41 ` Greg KH
2022-10-10 15:47 ` yong w
2022-10-10 15:58 ` Greg KH
[not found] ` <CGME20220916094017epcas1p1deed4041f897d2bf0e0486554d79b3af@epcms1p4>
2022-09-18 1:41 ` [PATCH v4] page_alloc: consider highatomic reserve in watermark fast Jaewon Kim
2022-09-19 13:21 ` yong w
[not found] <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcas1p1.samsung.com>
2020-06-19 23:59 ` Jaewon Kim
2020-06-19 12:42 ` Baoquan He
2020-06-22 8:55 ` Mel Gorman
2020-06-22 9:11 ` Michal Hocko
[not found] ` <CGME20200619055816epcas1p184da90b01aff559fe3cd690ebcd921ca@epcms1p6>
2020-06-22 9:40 ` 김재원
2020-06-22 10:04 ` Mel Gorman
2020-06-22 14:23 ` Michal Hocko
2020-06-22 16:25 ` Mel Gorman
2020-06-23 7:11 ` Michal Hocko