* [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data @ 2016-12-12 5:59 Jia He 2016-12-12 5:59 ` [PATCH RFC 1/1] " Jia He 0 siblings, 1 reply; 23+ messages in thread From: Jia He @ 2016-12-12 5:59 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi, Jia He In commit b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics"), it reconstructed the code to reduce the branch miss rate. Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) z->node would not be equal to preferred_zone->node. That seems to be incorrect. Here is what I catch, dumpstack() is triggered when z->node == preferred_zone->node and z->node != numa_node_id() z=5,prefer=5,local=4, flag&OTHER_NODE=0 [c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500 [c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100 [c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180 [c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0 [c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240 [c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290 [c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4] [c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4] [c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260 [c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4] [c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230 [c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180 [c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0 [c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110 [c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0 Before this patch, the numa_miss and numa_foreign looked very odd: linux:~ # numastat node0 node1 node2 node3 node4 node5 node6 numa_hit 42216 0 0 0 96755 0 0 numa_miss 1 718 711 726 860 712 719 numa_foreign 1 718 711 726 860 712 719 interleave_hit 631 638 632 641 621 633 636 local_node 42216 0 0 0 96755 0 0 other_node 0 0 0 0 0 0 0 After this patch linux:~ # numastat node0 node1 node2 node3 node4 node5 node6 numa_hit 177891 718 711 726 60302 712 719 numa_miss 0 196944 237222 253424 0 36265 0 numa_foreign 723855 0 0 0 0 0 0 interleave_hit 631 638 632 641 621 633 636 local_node 177891 0 0 0 59444 0 0 other_node 0 718 711 726 858 712 719 Jia He (1): mm, page_alloc: fix incorrect zone_statistics data mm/page_alloc.c | 3 +++ 1 file changed, 3 insertions(+) -- 2.5.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-12 5:59 [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data Jia He @ 2016-12-12 5:59 ` Jia He 2016-12-20 9:18 ` Michal Hocko 2016-12-20 12:31 ` Mel Gorman 0 siblings, 2 replies; 23+ messages in thread From: Jia He @ 2016-12-12 5:59 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi, Jia He In commit b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics"), it reconstructed codes to reduce the branch miss rate. Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) z->node would not be equal to preferred_zone->node. That seems to be incorrect. Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") Signed-off-by: Jia He <hejianet@gmail.com> --- mm/page_alloc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6de9440..474757e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2568,6 +2568,9 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, if (z->node == local_nid) { __inc_zone_state(z, NUMA_HIT); __inc_zone_state(z, local_stat); + } else if (z->node == preferred_zone->node) { + __inc_zone_state(z, NUMA_HIT); + __inc_zone_state(z, NUMA_OTHER); } else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); -- 2.5.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-12 5:59 ` [PATCH RFC 1/1] " Jia He @ 2016-12-20 9:18 ` Michal Hocko 2016-12-20 13:10 ` Mel Gorman ` (2 more replies) 2016-12-20 12:31 ` Mel Gorman 1 sibling, 3 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-20 9:18 UTC (permalink / raw) To: Jia He Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Joonsoo Kim, Taku Izumi On Mon 12-12-16 13:59:07, Jia He wrote: > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > z->node would not be equal to preferred_zone->node. That seems to be > incorrect. I am sorry but I have hard time following the changelog. It is clear that you are trying to fix a missed NUMA_{HIT,OTHER} accounting but it is not really clear when such thing happens. You are adding preferred_zone->node check. preferred_zone is the first zone in the requested zonelist. So for the most allocations it is a node from the local node. But if something request an explicit numa node (without __GFP_OTHER_NODE which would be the majority I suspect) then we could indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the referenced patch indeed caused an unintended change of accounting AFAIU. If this is correct then it should be a part of the changelog. I also cannot say I would like the fix. First of all I am not sure __GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the flag any different from an explicit __alloc_pages_node(non_local_nid)? In both cases we ask for an allocation on a remote node and successful allocation is a NUMA_HIT and NUMA_OTHER. That being said, why cannot we simply do the following? As a bonus, we can get rid of a barely used __GFP_OTHER_NODE. Also the number of branches will stay same. --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 429855be6ec9..f035d5c8b864 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order) * Update NUMA hit/miss statistics * * Must be called with interrupts disabled. - * - * When __GFP_OTHER_NODE is set assume the node of the preferred - * zone is the local node. This is useful for daemons who allocate - * memory on behalf of other processes. */ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags) { #ifdef CONFIG_NUMA - int local_nid = numa_node_id(); - enum zone_stat_item local_stat = NUMA_LOCAL; - - if (unlikely(flags & __GFP_OTHER_NODE)) { - local_stat = NUMA_OTHER; - local_nid = preferred_zone->node; - } + if (z->node == preferred_zone->node) { + enum zone_stat_item local_stat = NUMA_LOCAL; - if (z->node == local_nid) { __inc_zone_state(z, NUMA_HIT); + if (z->node != numa_node_id()) + local_stat = NUMA_OTHER; __inc_zone_state(z, local_stat); } else { __inc_zone_state(z, NUMA_MISS); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 9:18 ` Michal Hocko @ 2016-12-20 13:10 ` Mel Gorman 2016-12-20 13:26 ` Michal Hocko 2016-12-20 15:13 ` Vlastimil Babka 2016-12-21 3:01 ` hejianet 2 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2016-12-20 13:10 UTC (permalink / raw) To: Michal Hocko Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > On Mon 12-12-16 13:59:07, Jia He wrote: > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > z->node would not be equal to preferred_zone->node. That seems to be > > incorrect. > > I am sorry but I have hard time following the changelog. It is clear > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > but it is not really clear when such thing happens. You are adding > preferred_zone->node check. preferred_zone is the first zone in the > requested zonelist. So for the most allocations it is a node from the > local node. But if something request an explicit numa node (without > __GFP_OTHER_NODE which would be the majority I suspect) then we could > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > referenced patch indeed caused an unintended change of accounting AFAIU. > This is a similar concern to what I had. If the preferred zone, which is the first valid usable zone, is not a "hit" for the statistics then I don't know what "hit" is meant to mean. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 13:10 ` Mel Gorman @ 2016-12-20 13:26 ` Michal Hocko 2016-12-20 14:28 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-20 13:26 UTC (permalink / raw) To: Mel Gorman Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue 20-12-16 13:10:40, Mel Gorman wrote: > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > z->node would not be equal to preferred_zone->node. That seems to be > > > incorrect. > > > > I am sorry but I have hard time following the changelog. It is clear > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > but it is not really clear when such thing happens. You are adding > > preferred_zone->node check. preferred_zone is the first zone in the > > requested zonelist. So for the most allocations it is a node from the > > local node. But if something request an explicit numa node (without > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > This is a similar concern to what I had. If the preferred zone, which is > the first valid usable zone, is not a "hit" for the statistics then I > don't know what "hit" is meant to mean. But the first valid usable zone is defined based on the requested numa node. Unless the requested node is memoryless then we should have a hit, no? -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 13:26 ` Michal Hocko @ 2016-12-20 14:28 ` Mel Gorman 2016-12-20 14:35 ` Michal Hocko 2016-12-20 14:42 ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman 0 siblings, 2 replies; 23+ messages in thread From: Mel Gorman @ 2016-12-20 14:28 UTC (permalink / raw) To: Michal Hocko Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: > On Tue 20-12-16 13:10:40, Mel Gorman wrote: > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > > z->node would not be equal to preferred_zone->node. That seems to be > > > > incorrect. > > > > > > I am sorry but I have hard time following the changelog. It is clear > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > > but it is not really clear when such thing happens. You are adding > > > preferred_zone->node check. preferred_zone is the first zone in the > > > requested zonelist. So for the most allocations it is a node from the > > > local node. But if something request an explicit numa node (without > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > > > > This is a similar concern to what I had. If the preferred zone, which is > > the first valid usable zone, is not a "hit" for the statistics then I > > don't know what "hit" is meant to mean. > > But the first valid usable zone is defined based on the requested numa > node. Unless the requested node is memoryless then we should have a hit, > no? > Should be. If the local node is memoryless then there would be a difference between hit and whether it's local or not but that to me is a little useless. A local vs remote page allocated has a specific meaning and consequence. It's hard to see how hit can be meaningfully interpreted if there are memoryless nodes. I don't have a strong objection to the patch so I didn't nak it, I'm just not convinced it matters. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 14:28 ` Mel Gorman @ 2016-12-20 14:35 ` Michal Hocko 2016-12-20 14:49 ` Vlastimil Babka 2016-12-20 14:54 ` Mel Gorman 2016-12-20 14:42 ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman 1 sibling, 2 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-20 14:35 UTC (permalink / raw) To: Mel Gorman Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue 20-12-16 14:28:45, Mel Gorman wrote: > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: > > On Tue 20-12-16 13:10:40, Mel Gorman wrote: > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > > > z->node would not be equal to preferred_zone->node. That seems to be > > > > > incorrect. > > > > > > > > I am sorry but I have hard time following the changelog. It is clear > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > > > but it is not really clear when such thing happens. You are adding > > > > preferred_zone->node check. preferred_zone is the first zone in the > > > > requested zonelist. So for the most allocations it is a node from the > > > > local node. But if something request an explicit numa node (without > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > > > > > > > This is a similar concern to what I had. If the preferred zone, which is > > > the first valid usable zone, is not a "hit" for the statistics then I > > > don't know what "hit" is meant to mean. > > > > But the first valid usable zone is defined based on the requested numa > > node. Unless the requested node is memoryless then we should have a hit, > > no? > > > > Should be. If the local node is memoryless then there would be a difference > between hit and whether it's local or not but that to me is a little > useless. A local vs remote page allocated has a specific meaning and > consequence. It's hard to see how hit can be meaningfully interpreted if > there are memoryless nodes. I don't have a strong objection to the patch > so I didn't nak it, I'm just not convinced it matters. So what do you think about http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz I think that we should get rid of __GFP_OTHER_NODE thingy. It is just one off thing and the gfp space it rather precious. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 14:35 ` Michal Hocko @ 2016-12-20 14:49 ` Vlastimil Babka 2016-12-20 14:54 ` Mel Gorman 1 sibling, 0 replies; 23+ messages in thread From: Vlastimil Babka @ 2016-12-20 14:49 UTC (permalink / raw) To: Michal Hocko, Mel Gorman Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Johannes Weiner, Joonsoo Kim, Taku Izumi, Andi Kleen On 12/20/2016 03:35 PM, Michal Hocko wrote: > On Tue 20-12-16 14:28:45, Mel Gorman wrote: >> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: >>> On Tue 20-12-16 13:10:40, Mel Gorman wrote: >>>> On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: >>>>> On Mon 12-12-16 13:59:07, Jia He wrote: >>>>>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in >>>>>> zone_statistics"), it reconstructed codes to reduce the branch miss rate. >>>>>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) >>>>>> z->node would not be equal to preferred_zone->node. That seems to be >>>>>> incorrect. >>>>> >>>>> I am sorry but I have hard time following the changelog. It is clear >>>>> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting >>>>> but it is not really clear when such thing happens. You are adding >>>>> preferred_zone->node check. preferred_zone is the first zone in the >>>>> requested zonelist. So for the most allocations it is a node from the >>>>> local node. But if something request an explicit numa node (without >>>>> __GFP_OTHER_NODE which would be the majority I suspect) then we could >>>>> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the >>>>> referenced patch indeed caused an unintended change of accounting AFAIU. >>>>> >>>> >>>> This is a similar concern to what I had. If the preferred zone, which is >>>> the first valid usable zone, is not a "hit" for the statistics then I >>>> don't know what "hit" is meant to mean. >>> >>> But the first valid usable zone is defined based on the requested numa >>> node. Unless the requested node is memoryless then we should have a hit, >>> no? >>> >> >> Should be. If the local node is memoryless then there would be a difference >> between hit and whether it's local or not but that to me is a little >> useless. A local vs remote page allocated has a specific meaning and >> consequence. It's hard to see how hit can be meaningfully interpreted if >> there are memoryless nodes. I don't have a strong objection to the patch >> so I didn't nak it, I'm just not convinced it matters. > > So what do you think about > http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz > > I think that we should get rid of __GFP_OTHER_NODE thingy. It is just > one off thing and the gfp space it rather precious. Let's CC Andi who introduced it by commit 78afd5612deb8. Personally I agree that the reasoning provided by that commit does not justify the troubles. We already have the HIT and MISS counters to record if we allocated on the node we explicitly wanted/preferred (local or remote). The LOCAL and OTHER should thus be true local/remote statistics, why fake them in some rare cases such as khugepaged? The only other case is do_huge_pmd_wp_page_fallback() where it perhaps makes even less sense. No others were added in 5 years so I think the flag really didn't catch on, let's get rid of it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 14:35 ` Michal Hocko 2016-12-20 14:49 ` Vlastimil Babka @ 2016-12-20 14:54 ` Mel Gorman 2016-12-21 7:57 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2016-12-20 14:54 UTC (permalink / raw) To: Michal Hocko Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote: > On Tue 20-12-16 14:28:45, Mel Gorman wrote: > > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: > > > On Tue 20-12-16 13:10:40, Mel Gorman wrote: > > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > > > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > > > > z->node would not be equal to preferred_zone->node. That seems to be > > > > > > incorrect. > > > > > > > > > > I am sorry but I have hard time following the changelog. It is clear > > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > > > > but it is not really clear when such thing happens. You are adding > > > > > preferred_zone->node check. preferred_zone is the first zone in the > > > > > requested zonelist. So for the most allocations it is a node from the > > > > > local node. But if something request an explicit numa node (without > > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > > > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > > > > > > > > > > This is a similar concern to what I had. If the preferred zone, which is > > > > the first valid usable zone, is not a "hit" for the statistics then I > > > > don't know what "hit" is meant to mean. > > > > > > But the first valid usable zone is defined based on the requested numa > > > node. Unless the requested node is memoryless then we should have a hit, > > > no? > > > > > > > Should be. If the local node is memoryless then there would be a difference > > between hit and whether it's local or not but that to me is a little > > useless. A local vs remote page allocated has a specific meaning and > > consequence. It's hard to see how hit can be meaningfully interpreted if > > there are memoryless nodes. I don't have a strong objection to the patch > > so I didn't nak it, I'm just not convinced it matters. > > So what do you think about > http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz > This doesn't appear to resolve for me and I've 30 minutes left before being offline for 4 days so didn't go digging. > I think that we should get rid of __GFP_OTHER_NODE thingy. It is just > one off thing and the gfp space it rather precious. > However, broadly speaking, I'd be ok with getting rid of __GFP_OTHER_NODE altogether and making it truely only about local vs remote hits because those are the ones that matter in terms of performance. If a user has memoryless nodes or policies that allow local CPUs but forbid local memory and they need to debug an issue, they're going to need tracepoints anyway. Hit/miss/other is not sufficient for most interesting problems involving local or remote memory usage. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 14:54 ` Mel Gorman @ 2016-12-21 7:57 ` Michal Hocko 2016-12-21 8:06 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-21 7:57 UTC (permalink / raw) To: Mel Gorman Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue 20-12-16 14:54:35, Mel Gorman wrote: > On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote: > > On Tue 20-12-16 14:28:45, Mel Gorman wrote: > > > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: > > > > On Tue 20-12-16 13:10:40, Mel Gorman wrote: > > > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > > > > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > > > > > z->node would not be equal to preferred_zone->node. That seems to be > > > > > > > incorrect. > > > > > > > > > > > > I am sorry but I have hard time following the changelog. It is clear > > > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > > > > > but it is not really clear when such thing happens. You are adding > > > > > > preferred_zone->node check. preferred_zone is the first zone in the > > > > > > requested zonelist. So for the most allocations it is a node from the > > > > > > local node. But if something request an explicit numa node (without > > > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > > > > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > > > > > > > > > > > > > This is a similar concern to what I had. If the preferred zone, which is > > > > > the first valid usable zone, is not a "hit" for the statistics then I > > > > > don't know what "hit" is meant to mean. > > > > > > > > But the first valid usable zone is defined based on the requested numa > > > > node. Unless the requested node is memoryless then we should have a hit, > > > > no? > > > > > > > > > > Should be. If the local node is memoryless then there would be a difference > > > between hit and whether it's local or not but that to me is a little > > > useless. A local vs remote page allocated has a specific meaning and > > > consequence. It's hard to see how hit can be meaningfully interpreted if > > > there are memoryless nodes. I don't have a strong objection to the patch > > > so I didn't nak it, I'm just not convinced it matters. > > > > So what do you think about > > http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz > > > > This doesn't appear to resolve for me and I've 30 minutes left before > being offline for 4 days so didn't go digging. OK, it seems that it didn't go to the lkml so it didn't get to the archive indexed by the message id. I will send the two patches as a reply to this email for reference. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/2] mm: fix remote numa hits statistics 2016-12-21 7:57 ` Michal Hocko @ 2016-12-21 8:06 ` Michal Hocko 2016-12-21 8:06 ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-21 8:06 UTC (permalink / raw) To: linux-mm Cc: Mel Gorman, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner, Michal Hocko From: Michal Hocko <mhocko@suse.com> Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") has an unintentional side effect that remote node allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are many of these potentially because the flag is used very rarely while we have many users of __alloc_pages_node. Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a follow up patch) and treat all allocations that were satisfied from the preferred zone's node as NUMA_HITS because this is the same node we requested the allocation from in most cases. If this is not the local node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. One downsize would be that an allocation request for a node which is outside of the mempolicy nodemask would be reported as a hit which is a bit weird but that was the case before b9f00e147f27 already. Reported-by: Jia He <hejianet@gmail.com> Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/page_alloc.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f6d5b73e1d7c..506946a902c5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order) * Update NUMA hit/miss statistics * * Must be called with interrupts disabled. - * - * When __GFP_OTHER_NODE is set assume the node of the preferred - * zone is the local node. This is useful for daemons who allocate - * memory on behalf of other processes. */ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags) { #ifdef CONFIG_NUMA - int local_nid = numa_node_id(); - enum zone_stat_item local_stat = NUMA_LOCAL; - - if (unlikely(flags & __GFP_OTHER_NODE)) { - local_stat = NUMA_OTHER; - local_nid = preferred_zone->node; - } + if (z->node == preferred_zone->node) { + enum zone_stat_item local_stat = NUMA_LOCAL; - if (z->node == local_nid) { __inc_zone_state(z, NUMA_HIT); + if (z->node != numa_node_id()) + local_stat = NUMA_OTHER; __inc_zone_state(z, local_stat); } else { __inc_zone_state(z, NUMA_MISS); -- 2.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE 2016-12-21 8:06 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko @ 2016-12-21 8:06 ` Michal Hocko 2017-01-02 14:18 ` Vlastimil Babka 2016-12-29 11:46 ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman 2017-01-02 14:16 ` Vlastimil Babka 2 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-21 8:06 UTC (permalink / raw) To: linux-mm Cc: Mel Gorman, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner, Michal Hocko From: Michal Hocko <mhocko@suse.com> The flag has been introduced by 78afd5612deb ("mm: add __GFP_OTHER_NODE flag") to allow proper accounting of remote node allocations done by kernel daemons on behalf of a process - e.g. khugepaged. After "mm: fix remote numa hits statistics" we do not need and actually use the flag so we can safely remove it because all allocations which are satisfied from their "home" node are accounted properly. Signed-off-by: Michal Hocko <mhocko@suse.com> --- include/linux/gfp.h | 13 +++---------- include/trace/events/mmflags.h | 1 - mm/huge_memory.c | 3 +-- mm/khugepaged.c | 5 ++--- mm/page_alloc.c | 5 ++--- tools/perf/builtin-kmem.c | 1 - 6 files changed, 8 insertions(+), 20 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 4175dca4ac39..7806a8f80abc 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -38,9 +38,8 @@ struct vm_area_struct; #define ___GFP_ACCOUNT 0x100000u #define ___GFP_NOTRACK 0x200000u #define ___GFP_DIRECT_RECLAIM 0x400000u -#define ___GFP_OTHER_NODE 0x800000u -#define ___GFP_WRITE 0x1000000u -#define ___GFP_KSWAPD_RECLAIM 0x2000000u +#define ___GFP_WRITE 0x800000u +#define ___GFP_KSWAPD_RECLAIM 0x1000000u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -172,11 +171,6 @@ struct vm_area_struct; * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of * distinguishing in the source between false positives and allocations that * cannot be supported (e.g. page tables). - * - * __GFP_OTHER_NODE is for allocations that are on a remote node but that - * should not be accounted for as a remote allocation in vmstat. A - * typical user would be khugepaged collapsing a huge page on a remote - * node. */ #define __GFP_COLD ((__force gfp_t)___GFP_COLD) #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) @@ -184,10 +178,9 @@ struct vm_area_struct; #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT 26 +#define __GFP_BITS_SHIFT 25 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 5a81ab48a2fb..556a0efa8298 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -48,7 +48,6 @@ {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ - {(unsigned long)__GFP_OTHER_NODE, "__GFP_OTHER_NODE"} \ #define show_gfp_flags(flags) \ (flags) ? __print_flags(flags, "|", \ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f3c2040edbb1..8206abf4ac03 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -918,8 +918,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, } for (i = 0; i < HPAGE_PMD_NR; i++) { - pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE | - __GFP_OTHER_NODE, vma, + pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma, vmf->address, page_to_nid(page)); if (unlikely(!pages[i] || mem_cgroup_try_charge(pages[i], vma->vm_mm, diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e32389a97030..211974a3992b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -943,7 +943,7 @@ static void collapse_huge_page(struct mm_struct *mm, VM_BUG_ON(address & ~HPAGE_PMD_MASK); /* Only allocate from the target node */ - gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE; + gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; /* * Before allocating the hugepage, release the mmap_sem read lock. @@ -1326,8 +1326,7 @@ static void collapse_shmem(struct mm_struct *mm, VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); /* Only allocate from the target node */ - gfp = alloc_hugepage_khugepaged_gfpmask() | - __GFP_OTHER_NODE | __GFP_THISNODE; + gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; new_page = khugepaged_alloc_page(hpage, gfp, node); if (!new_page) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 506946a902c5..647e940e6921 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2584,8 +2584,7 @@ int __isolate_free_page(struct page *page, unsigned int order) * * Must be called with interrupts disabled. */ -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, - gfp_t flags) +static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) { #ifdef CONFIG_NUMA if (z->node == preferred_zone->node) { @@ -2666,7 +2665,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, } __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); - zone_statistics(preferred_zone, zone, gfp_flags); + zone_statistics(preferred_zone, zone); local_irq_restore(flags); VM_BUG_ON_PAGE(bad_range(zone, page), page); diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index d426dcb18ce9..33b959d47545 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -645,7 +645,6 @@ static const struct { { "__GFP_RECLAIM", "R" }, { "__GFP_DIRECT_RECLAIM", "DR" }, { "__GFP_KSWAPD_RECLAIM", "KR" }, - { "__GFP_OTHER_NODE", "ON" }, }; static size_t max_gfp_len; -- 2.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE 2016-12-21 8:06 ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko @ 2017-01-02 14:18 ` Vlastimil Babka 0 siblings, 0 replies; 23+ messages in thread From: Vlastimil Babka @ 2017-01-02 14:18 UTC (permalink / raw) To: Michal Hocko, linux-mm Cc: Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner, Michal Hocko, Andi Kleen [CC Andi who introduced __GFP_OTHER_NODE] On 12/21/2016 09:06 AM, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > The flag has been introduced by 78afd5612deb ("mm: add __GFP_OTHER_NODE > flag") to allow proper accounting of remote node allocations done by > kernel daemons on behalf of a process - e.g. khugepaged. > > After "mm: fix remote numa hits statistics" we do not need and actually > use the flag so we can safely remove it because all allocations which > are satisfied from their "home" node are accounted properly. > > Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > include/linux/gfp.h | 13 +++---------- > include/trace/events/mmflags.h | 1 - > mm/huge_memory.c | 3 +-- > mm/khugepaged.c | 5 ++--- > mm/page_alloc.c | 5 ++--- > tools/perf/builtin-kmem.c | 1 - > 6 files changed, 8 insertions(+), 20 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 4175dca4ac39..7806a8f80abc 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -38,9 +38,8 @@ struct vm_area_struct; > #define ___GFP_ACCOUNT 0x100000u > #define ___GFP_NOTRACK 0x200000u > #define ___GFP_DIRECT_RECLAIM 0x400000u > -#define ___GFP_OTHER_NODE 0x800000u > -#define ___GFP_WRITE 0x1000000u > -#define ___GFP_KSWAPD_RECLAIM 0x2000000u > +#define ___GFP_WRITE 0x800000u > +#define ___GFP_KSWAPD_RECLAIM 0x1000000u > /* If the above are modified, __GFP_BITS_SHIFT may need updating */ > > /* > @@ -172,11 +171,6 @@ struct vm_area_struct; > * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of > * distinguishing in the source between false positives and allocations that > * cannot be supported (e.g. page tables). > - * > - * __GFP_OTHER_NODE is for allocations that are on a remote node but that > - * should not be accounted for as a remote allocation in vmstat. A > - * typical user would be khugepaged collapsing a huge page on a remote > - * node. > */ > #define __GFP_COLD ((__force gfp_t)___GFP_COLD) > #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) > @@ -184,10 +178,9 @@ struct vm_area_struct; > #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) > #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > -#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) > > /* Room for N __GFP_FOO bits */ > -#define __GFP_BITS_SHIFT 26 > +#define __GFP_BITS_SHIFT 25 > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h > index 5a81ab48a2fb..556a0efa8298 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -48,7 +48,6 @@ > {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ > {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ > {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ > - {(unsigned long)__GFP_OTHER_NODE, "__GFP_OTHER_NODE"} \ > > #define show_gfp_flags(flags) \ > (flags) ? __print_flags(flags, "|", \ > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index f3c2040edbb1..8206abf4ac03 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -918,8 +918,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd, > } > > for (i = 0; i < HPAGE_PMD_NR; i++) { > - pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE | > - __GFP_OTHER_NODE, vma, > + pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma, > vmf->address, page_to_nid(page)); > if (unlikely(!pages[i] || > mem_cgroup_try_charge(pages[i], vma->vm_mm, > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index e32389a97030..211974a3992b 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -943,7 +943,7 @@ static void collapse_huge_page(struct mm_struct *mm, > VM_BUG_ON(address & ~HPAGE_PMD_MASK); > > /* Only allocate from the target node */ > - gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE; > + gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > /* > * Before allocating the hugepage, release the mmap_sem read lock. > @@ -1326,8 +1326,7 @@ static void collapse_shmem(struct mm_struct *mm, > VM_BUG_ON(start & (HPAGE_PMD_NR - 1)); > > /* Only allocate from the target node */ > - gfp = alloc_hugepage_khugepaged_gfpmask() | > - __GFP_OTHER_NODE | __GFP_THISNODE; > + gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > new_page = khugepaged_alloc_page(hpage, gfp, node); > if (!new_page) { > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 506946a902c5..647e940e6921 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2584,8 +2584,7 @@ int __isolate_free_page(struct page *page, unsigned int order) > * > * Must be called with interrupts disabled. > */ > -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, > - gfp_t flags) > +static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > { > #ifdef CONFIG_NUMA > if (z->node == preferred_zone->node) { > @@ -2666,7 +2665,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, > } > > __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); > - zone_statistics(preferred_zone, zone, gfp_flags); > + zone_statistics(preferred_zone, zone); > local_irq_restore(flags); > > VM_BUG_ON_PAGE(bad_range(zone, page), page); > diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c > index d426dcb18ce9..33b959d47545 100644 > --- a/tools/perf/builtin-kmem.c > +++ b/tools/perf/builtin-kmem.c > @@ -645,7 +645,6 @@ static const struct { > { "__GFP_RECLAIM", "R" }, > { "__GFP_DIRECT_RECLAIM", "DR" }, > { "__GFP_KSWAPD_RECLAIM", "KR" }, > - { "__GFP_OTHER_NODE", "ON" }, > }; > > static size_t max_gfp_len; > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] mm: fix remote numa hits statistics 2016-12-21 8:06 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko 2016-12-21 8:06 ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko @ 2016-12-29 11:46 ` Mel Gorman 2016-12-29 12:28 ` Michal Hocko 2017-01-02 14:16 ` Vlastimil Babka 2 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2016-12-29 11:46 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner, Michal Hocko On Wed, Dec 21, 2016 at 09:06:52AM +0100, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches > in zone_statistics") has an unintentional side effect that remote node > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are > many of these potentially because the flag is used very rarely while > we have many users of __alloc_pages_node. > > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a > follow up patch) and treat all allocations that were satisfied from the > preferred zone's node as NUMA_HITS because this is the same node we > requested the allocation from in most cases. If this is not the local > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. > > One downsize would be that an allocation request for a node which is > outside of the mempolicy nodemask would be reported as a hit which is a > bit weird but that was the case before b9f00e147f27 already. > > Reported-by: Jia He <hejianet@gmail.com> > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") > Signed-off-by: Michal Hocko <mhocko@suse.com> For both patches; Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] mm: fix remote numa hits statistics 2016-12-29 11:46 ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman @ 2016-12-29 12:28 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-29 12:28 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner On Thu 29-12-16 11:46:01, Mel Gorman wrote: > On Wed, Dec 21, 2016 at 09:06:52AM +0100, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches > > in zone_statistics") has an unintentional side effect that remote node > > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and > > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are > > many of these potentially because the flag is used very rarely while > > we have many users of __alloc_pages_node. > > > > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a > > follow up patch) and treat all allocations that were satisfied from the > > preferred zone's node as NUMA_HITS because this is the same node we > > requested the allocation from in most cases. If this is not the local > > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. > > > > One downsize would be that an allocation request for a node which is > > outside of the mempolicy nodemask would be reported as a hit which is a > > bit weird but that was the case before b9f00e147f27 already. > > > > Reported-by: Jia He <hejianet@gmail.com> > > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > For both patches; > > Acked-by: Mel Gorman <mgorman@suse.de> Thanks! I will give it a week for others to get back to it after holiday and then resubmit. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] mm: fix remote numa hits statistics 2016-12-21 8:06 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko 2016-12-21 8:06 ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko 2016-12-29 11:46 ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman @ 2017-01-02 14:16 ` Vlastimil Babka 2017-01-02 14:46 ` Michal Hocko 2 siblings, 1 reply; 23+ messages in thread From: Vlastimil Babka @ 2017-01-02 14:16 UTC (permalink / raw) To: Michal Hocko, linux-mm Cc: Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner, Michal Hocko On 12/21/2016 09:06 AM, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches > in zone_statistics") has an unintentional side effect that remote node > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are > many of these potentially because the flag is used very rarely while > we have many users of __alloc_pages_node. > > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a > follow up patch) and treat all allocations that were satisfied from the > preferred zone's node as NUMA_HITS because this is the same node we > requested the allocation from in most cases. If this is not the local > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. > > One downsize would be that an allocation request for a node which is > outside of the mempolicy nodemask would be reported as a hit which is a > bit weird but that was the case before b9f00e147f27 already. > > Reported-by: Jia He <hejianet@gmail.com> > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") > Signed-off-by: Michal Hocko <mhocko@suse.com> cbmc tells me that this patch is not equal to pre-commit b9f00e147f27 (in situation where __GFP_OTHER_NODE is not passed, as that's the only relevant scenario after your patch), which seems unintended. counter example: numa_node_id() == 1 preferred_zone on node 0 allocated from zone on node 1 pre-b9f00e147f27: allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL preferred zone (node 0) increased NUMA_FOREIGN (that looks correct to me) your patch: allocated zone (node 1) increased NUMA_MISS preferred zone (node 0) increased NUMA_FOREIGN i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was an allocation local to the CPU (albeit a MISS wrt the preferred node). Vlastimil > --- > mm/page_alloc.c | 16 ++++------------ > 1 file changed, 4 insertions(+), 12 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f6d5b73e1d7c..506946a902c5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order) > * Update NUMA hit/miss statistics > * > * Must be called with interrupts disabled. > - * > - * When __GFP_OTHER_NODE is set assume the node of the preferred > - * zone is the local node. This is useful for daemons who allocate > - * memory on behalf of other processes. > */ > static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, > gfp_t flags) > { > #ifdef CONFIG_NUMA > - int local_nid = numa_node_id(); > - enum zone_stat_item local_stat = NUMA_LOCAL; > - > - if (unlikely(flags & __GFP_OTHER_NODE)) { > - local_stat = NUMA_OTHER; > - local_nid = preferred_zone->node; > - } > + if (z->node == preferred_zone->node) { > + enum zone_stat_item local_stat = NUMA_LOCAL; > > - if (z->node == local_nid) { > __inc_zone_state(z, NUMA_HIT); > + if (z->node != numa_node_id()) > + local_stat = NUMA_OTHER; > __inc_zone_state(z, local_stat); > } else { > __inc_zone_state(z, NUMA_MISS); > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] mm: fix remote numa hits statistics 2017-01-02 14:16 ` Vlastimil Babka @ 2017-01-02 14:46 ` Michal Hocko 2017-01-02 15:07 ` Vlastimil Babka 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2017-01-02 14:46 UTC (permalink / raw) To: Vlastimil Babka Cc: linux-mm, Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner On Mon 02-01-17 15:16:23, Vlastimil Babka wrote: > On 12/21/2016 09:06 AM, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches > > in zone_statistics") has an unintentional side effect that remote node > > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and > > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are > > many of these potentially because the flag is used very rarely while > > we have many users of __alloc_pages_node. > > > > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a > > follow up patch) and treat all allocations that were satisfied from the > > preferred zone's node as NUMA_HITS because this is the same node we > > requested the allocation from in most cases. If this is not the local > > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. > > > > One downsize would be that an allocation request for a node which is > > outside of the mempolicy nodemask would be reported as a hit which is a > > bit weird but that was the case before b9f00e147f27 already. > > > > Reported-by: Jia He <hejianet@gmail.com> > > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > cbmc tells me that this patch is not equal to pre-commit b9f00e147f27 > (in situation where __GFP_OTHER_NODE is not passed, as that's the only > relevant scenario after your patch), which seems unintended. > > counter example: > numa_node_id() == 1 > preferred_zone on node 0 > allocated from zone on node 1 > > pre-b9f00e147f27: > allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL > preferred zone (node 0) increased NUMA_FOREIGN > > (that looks correct to me) > > your patch: > allocated zone (node 1) increased NUMA_MISS > preferred zone (node 0) increased NUMA_FOREIGN > > i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was > an allocation local to the CPU (albeit a MISS wrt the preferred node). I guess the following should fix that, right? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 647e940e6921..ea60dc06d280 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2587,17 +2587,18 @@ int __isolate_free_page(struct page *page, unsigned int order) static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) { #ifdef CONFIG_NUMA - if (z->node == preferred_zone->node) { - enum zone_stat_item local_stat = NUMA_LOCAL; + enum zone_stat_item local_stat = NUMA_LOCAL; + if (z->node != numa_node_id()) + local_stat = NUMA_OTHER; + + if (z->node == preferred_zone->node) __inc_zone_state(z, NUMA_HIT); - if (z->node != numa_node_id()) - local_stat = NUMA_OTHER; - __inc_zone_state(z, local_stat); - } else { + else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); } + __inc_zone_state(z, local_stat); #endif } -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] mm: fix remote numa hits statistics 2017-01-02 14:46 ` Michal Hocko @ 2017-01-02 15:07 ` Vlastimil Babka 0 siblings, 0 replies; 23+ messages in thread From: Vlastimil Babka @ 2017-01-02 15:07 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi, Johannes Weiner On 01/02/2017 03:46 PM, Michal Hocko wrote: > On Mon 02-01-17 15:16:23, Vlastimil Babka wrote: >> On 12/21/2016 09:06 AM, Michal Hocko wrote: >>> From: Michal Hocko <mhocko@suse.com> >>> >>> Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches >>> in zone_statistics") has an unintentional side effect that remote node >>> allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and >>> NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are >>> many of these potentially because the flag is used very rarely while >>> we have many users of __alloc_pages_node. >>> >>> Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a >>> follow up patch) and treat all allocations that were satisfied from the >>> preferred zone's node as NUMA_HITS because this is the same node we >>> requested the allocation from in most cases. If this is not the local >>> node then we just account it as NUMA_OTHER rather than NUMA_LOCAL. >>> >>> One downsize would be that an allocation request for a node which is >>> outside of the mempolicy nodemask would be reported as a hit which is a >>> bit weird but that was the case before b9f00e147f27 already. >>> >>> Reported-by: Jia He <hejianet@gmail.com> >>> Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics") >>> Signed-off-by: Michal Hocko <mhocko@suse.com> >> >> cbmc tells me that this patch is not equal to pre-commit b9f00e147f27 >> (in situation where __GFP_OTHER_NODE is not passed, as that's the only >> relevant scenario after your patch), which seems unintended. >> >> counter example: >> numa_node_id() == 1 >> preferred_zone on node 0 >> allocated from zone on node 1 >> >> pre-b9f00e147f27: >> allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL >> preferred zone (node 0) increased NUMA_FOREIGN >> >> (that looks correct to me) >> >> your patch: >> allocated zone (node 1) increased NUMA_MISS >> preferred zone (node 0) increased NUMA_FOREIGN >> >> i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was >> an allocation local to the CPU (albeit a MISS wrt the preferred node). > > I guess the following should fix that, right? Yes, it does. With that, you can add: Reviewed-by: Vlastimil Babka <vbabka@suse.cz> (which is a lie, since the computer did that ;) > --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 647e940e6921..ea60dc06d280 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2587,17 +2587,18 @@ int __isolate_free_page(struct page *page, unsigned int order) > static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) > { > #ifdef CONFIG_NUMA > - if (z->node == preferred_zone->node) { > - enum zone_stat_item local_stat = NUMA_LOCAL; > + enum zone_stat_item local_stat = NUMA_LOCAL; > > + if (z->node != numa_node_id()) > + local_stat = NUMA_OTHER; > + > + if (z->node == preferred_zone->node) > __inc_zone_state(z, NUMA_HIT); > - if (z->node != numa_node_id()) > - local_stat = NUMA_OTHER; > - __inc_zone_state(z, local_stat); > - } else { > + else { > __inc_zone_state(z, NUMA_MISS); > __inc_zone_state(preferred_zone, NUMA_FOREIGN); > } > + __inc_zone_state(z, local_stat); > #endif > } > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 14:28 ` Mel Gorman 2016-12-20 14:35 ` Michal Hocko @ 2016-12-20 14:42 ` Mel Gorman 1 sibling, 0 replies; 23+ messages in thread From: Mel Gorman @ 2016-12-20 14:42 UTC (permalink / raw) To: Michal Hocko Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Johannes Weiner, Joonsoo Kim, Taku Izumi On Tue, Dec 20, 2016 at 02:28:45PM +0000, Mel Gorman wrote: > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote: > > On Tue 20-12-16 13:10:40, Mel Gorman wrote: > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote: > > > > On Mon 12-12-16 13:59:07, Jia He wrote: > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > > > > > z->node would not be equal to preferred_zone->node. That seems to be > > > > > incorrect. > > > > > > > > I am sorry but I have hard time following the changelog. It is clear > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > > > > but it is not really clear when such thing happens. You are adding > > > > preferred_zone->node check. preferred_zone is the first zone in the > > > > requested zonelist. So for the most allocations it is a node from the > > > > local node. But if something request an explicit numa node (without > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > > > > referenced patch indeed caused an unintended change of accounting AFAIU. > > > > > > > > > > This is a similar concern to what I had. If the preferred zone, which is > > > the first valid usable zone, is not a "hit" for the statistics then I > > > don't know what "hit" is meant to mean. > > > > But the first valid usable zone is defined based on the requested numa > > node. Unless the requested node is memoryless then we should have a hit, > > no? > > > > Should be. If the local node is memoryless then there would be a difference > between hit and whether it's local or not but that to me is a little > useless. A local vs remote page allocated has a specific meaning and > consequence. It's hard to see how hit can be meaningfully interpreted if > there are memoryless nodes. I don't have a strong objection to the patch > so I didn't nak it, I'm just not convinced it matters. > That said, it's also hard to interpret "hit" when memory policies forbid a memory node but allows the CPUs to be used. Local and remote can be interpreted regardless of machine, "hit" has meaning depending on the configuration and interpreting it is weird. It had some value when zone_reclaim_mode was always enabled because a low hit rate would imply zone_reclaim_mode was broken but that's a corner case. I would prefer to get rid of it because interpreting it is such a mess if there wasn't tools already collecting the data. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 9:18 ` Michal Hocko 2016-12-20 13:10 ` Mel Gorman @ 2016-12-20 15:13 ` Vlastimil Babka 2016-12-21 3:01 ` hejianet 2 siblings, 0 replies; 23+ messages in thread From: Vlastimil Babka @ 2016-12-20 15:13 UTC (permalink / raw) To: Michal Hocko, Jia He Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman, Johannes Weiner, Joonsoo Kim, Taku Izumi [-- Attachment #1: Type: text/plain, Size: 2451 bytes --] On 12/20/2016 10:18 AM, Michal Hocko wrote: > On Mon 12-12-16 13:59:07, Jia He wrote: >> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in >> zone_statistics"), it reconstructed codes to reduce the branch miss rate. >> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) >> z->node would not be equal to preferred_zone->node. That seems to be >> incorrect. > > I am sorry but I have hard time following the changelog. It is clear > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > but it is not really clear when such thing happens. You are adding > preferred_zone->node check. preferred_zone is the first zone in the > requested zonelist. So for the most allocations it is a node from the > local node. But if something request an explicit numa node (without > __GFP_OTHER_NODE which would be the majority I suspect) then we could > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > referenced patch indeed caused an unintended change of accounting AFAIU. Since I could not quite wrap my head around all possible combinations, I've got lazy and employed cbmc, inspired by [1] (test file attached if anyone wants to play with it) and tried to test whether both commit b9f00e147f27, and Jia He's patch are equivalent to pre-b9f00e147f27. The tool found counter examples for both cases (there might be more scenarios, but it only reports the first one it hits). Note that decoding the output is not exactly trivial... [1] http://paulmck.livejournal.com/38997.html ============== Counterexample for commit b9f00e147f27 vs pre-b9f00e147f27: numa_node_id() == 1 preferred_zone on node 0 allocated from zone on node 1 without __GFP_OTHER_NODE pre-b9f00e147f27: allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL preferred zone (node 0) increased NUMA_FOREIGN (that looks correct to me) b9f00e147f27: allocated zone (node 1) got NUMA_HIT and NUMA_LOCAL there's no NUMA_FOREIGN (that's wrong wrt HIT and FOREIGN) ============== Counterexample for Jia He's patch vs pre-b9f00e147f27: numa_node_id() == 0 preferred_zone on node 0 allocated from zone on node 1 without __GFP_OTHER_NODE pre-b9f00e147f27: allocated zone (node 1) increased NUMA_MISS and NUMA_OTHER preferred zone (node 0) increased NUMA_FOREIGN (that looks correct to me) Jia He's patch: allocated zone (node 1) increased NUMA_MISS preferred zone (node 0) increased NUMA_FOREIGN (it's missing NUMA_OTHER) [-- Attachment #2: test-numa.c --] [-- Type: text/x-csrc, Size: 3798 bytes --] #include <stdio.h> #define NODES 3 #define ZONES 2 enum zone_stat_item { NUMA_HIT, NUMA_MISS, NUMA_LOCAL, NUMA_OTHER, NUMA_FOREIGN, ZONE_STAT_ITEMS }; struct node; struct zone { struct node *zone_pgdat; int node; int vm_stat[ZONE_STAT_ITEMS]; }; struct node { int nid; struct zone node_zones[ZONES]; }; typedef int gfp_t; #define __GFP_OTHER_NODE 1 #define unlikely(x) (x) int local_nid; int numa_node_id() { return local_nid; } static void __inc_zone_state(struct zone *z, enum zone_stat_item stat) { z->vm_stat[stat]++; } // before b9f00e147f27 void zone_statistics1(struct zone *preferred_zone, struct zone *z, gfp_t flags) { if (z->zone_pgdat == preferred_zone->zone_pgdat) { __inc_zone_state(z, NUMA_HIT); } else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); } if (z->node == ((flags & __GFP_OTHER_NODE) ? preferred_zone->node : numa_node_id())) __inc_zone_state(z, NUMA_LOCAL); else __inc_zone_state(z, NUMA_OTHER); } // after b9f00e147f27 void zone_statistics2(struct zone *preferred_zone, struct zone *z, gfp_t flags) { int local_nid = numa_node_id(); enum zone_stat_item local_stat = NUMA_LOCAL; if (unlikely(flags & __GFP_OTHER_NODE)) { local_stat = NUMA_OTHER; local_nid = preferred_zone->node; } if (z->node == local_nid) { __inc_zone_state(z, NUMA_HIT); __inc_zone_state(z, local_stat); } else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); } } // after Jia He's patch static inline void zone_statistics3(struct zone *preferred_zone, struct zone *z, gfp_t flags) { int local_nid = numa_node_id(); enum zone_stat_item local_stat = NUMA_LOCAL; if (unlikely(flags & __GFP_OTHER_NODE)) { local_stat = NUMA_OTHER; local_nid = preferred_zone->node; } if (z->node == local_nid) { __inc_zone_state(z, NUMA_HIT); __inc_zone_state(z, local_stat); } else if (z->node == preferred_zone->node) { __inc_zone_state(z, NUMA_HIT); __inc_zone_state(z, NUMA_OTHER); } else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); } } void init_node(struct node * n, int nid) { n->nid = nid; for (int i = 0; i < ZONES; i++) { struct zone * z = &n->node_zones[i]; z->zone_pgdat = n; z->node = nid; for (int j = 0; j < ZONE_STAT_ITEMS; j++) z->vm_stat[j] = 0; } } struct node nodes1[NODES]; struct node nodes2[NODES]; void check_stats() { for (int i = 0; i < NODES; i++) for (int j = 0; j < ZONES; j++) for (int k = 0; k < ZONE_STAT_ITEMS; k++) assert (nodes1[i].node_zones[j].vm_stat[k] == nodes2[i].node_zones[j].vm_stat[k]); } int main(int argc, char *argv[]) { for (int i = 0; i < NODES; i++) { init_node(&nodes1[i], i); init_node(&nodes2[i], i); } local_nid = ((unsigned int)argv[1]) % NODES; int zone_nid = ((unsigned int)argv[2]) % NODES; int pzone_nid = ((unsigned int)argv[3]) % NODES; int zid = ((unsigned int)argv[4]) % ZONES; int pzid = ((unsigned int)argv[5]) % ZONES; /* we should not allocate from higher than preferred zone */ if (zid > pzid) zid = pzid; gfp_t flags = ((unsigned int)argv[6]) & __GFP_OTHER_NODE; zone_statistics1(&nodes1[pzone_nid].node_zones[pzid], &nodes1[zone_nid].node_zones[zid], flags); zone_statistics3(&nodes2[pzone_nid].node_zones[pzid], &nodes2[zone_nid].node_zones[zid], flags); check_stats(); } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 9:18 ` Michal Hocko 2016-12-20 13:10 ` Mel Gorman 2016-12-20 15:13 ` Vlastimil Babka @ 2016-12-21 3:01 ` hejianet 2 siblings, 0 replies; 23+ messages in thread From: hejianet @ 2016-12-21 3:01 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner, Joonsoo Kim, Taku Izumi On 20/12/2016 5:18 PM, Michal Hocko wrote: > On Mon 12-12-16 13:59:07, Jia He wrote: >> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in >> zone_statistics"), it reconstructed codes to reduce the branch miss rate. >> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) >> z->node would not be equal to preferred_zone->node. That seems to be >> incorrect. > I am sorry but I have hard time following the changelog. It is clear > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting > but it is not really clear when such thing happens. You are adding > preferred_zone->node check. preferred_zone is the first zone in the > requested zonelist. So for the most allocations it is a node from the > local node. But if something request an explicit numa node (without > __GFP_OTHER_NODE which would be the majority I suspect) then we could > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the > referenced patch indeed caused an unintended change of accounting AFAIU. > > If this is correct then it should be a part of the changelog. I also > cannot say I would like the fix. First of all I am not sure > __GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the > flag any different from an explicit __alloc_pages_node(non_local_nid)? > In both cases we ask for an allocation on a remote node and successful > allocation is a NUMA_HIT and NUMA_OTHER. > > That being said, why cannot we simply do the following? As a bonus, we > can get rid of a barely used __GFP_OTHER_NODE. Also the number of > branches will stay same. Yes, I agree maybe we can get rid of __GFP_OTHER_NODE if no objections Seems currently it is only used for hugepage and statistics > --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 429855be6ec9..f035d5c8b864 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order) > * Update NUMA hit/miss statistics > * > * Must be called with interrupts disabled. > - * > - * When __GFP_OTHER_NODE is set assume the node of the preferred > - * zone is the local node. This is useful for daemons who allocate > - * memory on behalf of other processes. > */ > static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, > gfp_t flags) > { > #ifdef CONFIG_NUMA > - int local_nid = numa_node_id(); > - enum zone_stat_item local_stat = NUMA_LOCAL; > - > - if (unlikely(flags & __GFP_OTHER_NODE)) { > - local_stat = NUMA_OTHER; > - local_nid = preferred_zone->node; > - } > + if (z->node == preferred_zone->node) { > + enum zone_stat_item local_stat = NUMA_LOCAL; > > - if (z->node == local_nid) { > __inc_zone_state(z, NUMA_HIT); > + if (z->node != numa_node_id()) > + local_stat = NUMA_OTHER; > __inc_zone_state(z, local_stat); > } else { > __inc_zone_state(z, NUMA_MISS); I thought the logic here is different Here is the zone_statistics() before introducing __GFP_OTHER_NODE: if (z->zone_pgdat == preferred_zone->zone_pgdat) { __inc_zone_state(z, NUMA_HIT); } else { __inc_zone_state(z, NUMA_MISS); __inc_zone_state(preferred_zone, NUMA_FOREIGN); } if (z->node == numa_node_id()) __inc_zone_state(z, NUMA_LOCAL); else __inc_zone_state(z, NUMA_OTHER); B.R. Jia -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-12 5:59 ` [PATCH RFC 1/1] " Jia He 2016-12-20 9:18 ` Michal Hocko @ 2016-12-20 12:31 ` Mel Gorman 2016-12-21 3:07 ` hejianet 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2016-12-20 12:31 UTC (permalink / raw) To: Jia He Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote: > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in > zone_statistics"), it reconstructed codes to reduce the branch miss rate. > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) > z->node would not be equal to preferred_zone->node. That seems to be > incorrect. > > Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in > zone_statistics") > > Signed-off-by: Jia He <hejianet@gmail.com> This is slightly curious. It appear it would only occur if a process was running on a node that was outside the memory policy. Can you confirm that is the case? If so, your patch is a a semantic curiousity because it's actually impossible for a NUMA allocation to be local and the definition of "HIT" is fuzzy enough to be useless. I won't object to the patch but it makes me trust "hit" even less than I already do for any analysis. Note that after this mail that I'll be unavailable by mail until early new years. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data 2016-12-20 12:31 ` Mel Gorman @ 2016-12-21 3:07 ` hejianet 0 siblings, 0 replies; 23+ messages in thread From: hejianet @ 2016-12-21 3:07 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka, Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi On 20/12/2016 8:31 PM, Mel Gorman wrote: > On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote: >> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in >> zone_statistics"), it reconstructed codes to reduce the branch miss rate. >> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE) >> z->node would not be equal to preferred_zone->node. That seems to be >> incorrect. >> >> Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in >> zone_statistics") >> >> Signed-off-by: Jia He <hejianet@gmail.com> > This is slightly curious. It appear it would only occur if a process was > running on a node that was outside the memory policy. Can you confirm > that is the case? Yes, here is what I caught: dumpstack() is triggered when z->node(5) == preferred_zone->node(5) and z->node != numa_node_id(4) without flag GET_OTHER_NODE It is not a rare case. I saw hundreds of such cases when kernel boots up [c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500 [c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100 [c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180 [c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0 [c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240 [c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290 [c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4] [c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4] [c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260 [c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4] [c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230 [c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180 [c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0 [c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110 [c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0 > If so, your patch is a a semantic curiousity because it's actually > impossible for a NUMA allocation to be local and the definition of "HIT" > is fuzzy enough to be useless. > > I won't object to the patch but it makes me trust "hit" even less than I > already do for any analysis. > > Note that after this mail that I'll be unavailable by mail until early > new years. > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-01-02 15:07 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-12 5:59 [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data Jia He 2016-12-12 5:59 ` [PATCH RFC 1/1] " Jia He 2016-12-20 9:18 ` Michal Hocko 2016-12-20 13:10 ` Mel Gorman 2016-12-20 13:26 ` Michal Hocko 2016-12-20 14:28 ` Mel Gorman 2016-12-20 14:35 ` Michal Hocko 2016-12-20 14:49 ` Vlastimil Babka 2016-12-20 14:54 ` Mel Gorman 2016-12-21 7:57 ` Michal Hocko 2016-12-21 8:06 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko 2016-12-21 8:06 ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko 2017-01-02 14:18 ` Vlastimil Babka 2016-12-29 11:46 ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman 2016-12-29 12:28 ` Michal Hocko 2017-01-02 14:16 ` Vlastimil Babka 2017-01-02 14:46 ` Michal Hocko 2017-01-02 15:07 ` Vlastimil Babka 2016-12-20 14:42 ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman 2016-12-20 15:13 ` Vlastimil Babka 2016-12-21 3:01 ` hejianet 2016-12-20 12:31 ` Mel Gorman 2016-12-21 3:07 ` hejianet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).