linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data
@ 2016-12-12  5:59 Jia He
  2016-12-12  5:59 ` [PATCH RFC 1/1] " Jia He
  0 siblings, 1 reply; 23+ messages in thread
From: Jia He @ 2016-12-12  5:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko,
	Johannes Weiner, Joonsoo Kim, Taku Izumi, Jia He


In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), it reconstructed the code to reduce the branch miss rate.
Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
 z->node would not be equal to preferred_zone->node. That seems to be
incorrect.

Here is what I catch, dumpstack() is triggered when z->node ==
preferred_zone->node and z->node != numa_node_id()

z=5,prefer=5,local=4, flag&OTHER_NODE=0
[c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500
[c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100
[c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180
[c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0
[c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240
[c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290
[c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4]
[c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4]
[c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260
[c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4]
[c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230
[c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180
[c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0
[c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110
[c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0

Before this patch, the numa_miss and numa_foreign looked very odd:
linux:~ # numastat
                           node0           node1           node2           node3           node4           node5           node6
numa_hit                   42216               0               0               0           96755               0               0
numa_miss                      1             718             711             726             860             712             719
numa_foreign                   1             718             711             726             860             712             719
interleave_hit               631             638             632             641             621             633             636
local_node                 42216               0               0               0           96755               0               0
other_node                     0               0               0               0               0               0               0

After this patch
linux:~ # numastat  
                           node0           node1           node2           node3           node4           node5           node6
numa_hit                  177891             718             711             726           60302             712             719
numa_miss                      0          196944          237222          253424               0           36265               0
numa_foreign              723855               0               0               0               0               0               0
interleave_hit               631             638             632             641             621             633             636
local_node                177891               0               0               0           59444               0               0
other_node                     0             718             711             726             858             712             719
Jia He (1):
  mm, page_alloc: fix incorrect zone_statistics data

 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

-- 
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-12  5:59 [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data Jia He
@ 2016-12-12  5:59 ` Jia He
  2016-12-20  9:18   ` Michal Hocko
  2016-12-20 12:31   ` Mel Gorman
  0 siblings, 2 replies; 23+ messages in thread
From: Jia He @ 2016-12-12  5:59 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Michal Hocko,
	Johannes Weiner, Joonsoo Kim, Taku Izumi, Jia He

In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics"), it reconstructed codes to reduce the branch miss rate.
Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
 z->node would not be equal to preferred_zone->node. That seems to be
incorrect.

Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
zone_statistics")

Signed-off-by: Jia He <hejianet@gmail.com>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6de9440..474757e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2568,6 +2568,9 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 	if (z->node == local_nid) {
 		__inc_zone_state(z, NUMA_HIT);
 		__inc_zone_state(z, local_stat);
+	} else if (z->node == preferred_zone->node) {
+		__inc_zone_state(z, NUMA_HIT);
+		__inc_zone_state(z, NUMA_OTHER);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
 		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
-- 
2.5.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-12  5:59 ` [PATCH RFC 1/1] " Jia He
@ 2016-12-20  9:18   ` Michal Hocko
  2016-12-20 13:10     ` Mel Gorman
                       ` (2 more replies)
  2016-12-20 12:31   ` Mel Gorman
  1 sibling, 3 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-20  9:18 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Joonsoo Kim, Taku Izumi

On Mon 12-12-16 13:59:07, Jia He wrote:
> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>  z->node would not be equal to preferred_zone->node. That seems to be
> incorrect.

I am sorry but I have hard time following the changelog. It is clear
that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
but it is not really clear when such thing happens. You are adding
preferred_zone->node check. preferred_zone is the first zone in the
requested zonelist. So for the most allocations it is a node from the
local node. But if something request an explicit numa node (without
__GFP_OTHER_NODE which would be the majority I suspect) then we could
indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
referenced patch indeed caused an unintended change of accounting AFAIU.

If this is correct then it should be a part of the changelog. I also
cannot say I would like the fix. First of all I am not sure
__GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the
flag any different from an explicit __alloc_pages_node(non_local_nid)?
In both cases we ask for an allocation on a remote node and successful
allocation is a NUMA_HIT and NUMA_OTHER.

That being said, why cannot we simply do the following? As a bonus, we
can get rid of a barely used __GFP_OTHER_NODE. Also the number of
branches will stay same.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 429855be6ec9..f035d5c8b864 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
  * Update NUMA hit/miss statistics
  *
  * Must be called with interrupts disabled.
- *
- * When __GFP_OTHER_NODE is set assume the node of the preferred
- * zone is the local node. This is useful for daemons who allocate
- * memory on behalf of other processes.
  */
 static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 								gfp_t flags)
 {
 #ifdef CONFIG_NUMA
-	int local_nid = numa_node_id();
-	enum zone_stat_item local_stat = NUMA_LOCAL;
-
-	if (unlikely(flags & __GFP_OTHER_NODE)) {
-		local_stat = NUMA_OTHER;
-		local_nid = preferred_zone->node;
-	}
+	if (z->node == preferred_zone->node) {
+		enum zone_stat_item local_stat = NUMA_LOCAL;
 
-	if (z->node == local_nid) {
 		__inc_zone_state(z, NUMA_HIT);
+		if (z->node != numa_node_id())
+			local_stat = NUMA_OTHER;
 		__inc_zone_state(z, local_stat);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-12  5:59 ` [PATCH RFC 1/1] " Jia He
  2016-12-20  9:18   ` Michal Hocko
@ 2016-12-20 12:31   ` Mel Gorman
  2016-12-21  3:07     ` hejianet
  1 sibling, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2016-12-20 12:31 UTC (permalink / raw)
  To: Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi

On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote:
> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>  z->node would not be equal to preferred_zone->node. That seems to be
> incorrect.
> 
> Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> zone_statistics")
> 
> Signed-off-by: Jia He <hejianet@gmail.com>

This is slightly curious. It appear it would only occur if a process was
running on a node that was outside the memory policy. Can you confirm
that is the case?

If so, your patch is a a semantic curiousity because it's actually
impossible for a NUMA allocation to be local and the definition of "HIT"
is fuzzy enough to be useless.

I won't object to the patch but it makes me trust "hit" even less than I
already do for any analysis.

Note that after this mail that I'll be unavailable by mail until early
new years.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20  9:18   ` Michal Hocko
@ 2016-12-20 13:10     ` Mel Gorman
  2016-12-20 13:26       ` Michal Hocko
  2016-12-20 15:13     ` Vlastimil Babka
  2016-12-21  3:01     ` hejianet
  2 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2016-12-20 13:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
> > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> >  z->node would not be equal to preferred_zone->node. That seems to be
> > incorrect.
> 
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.
> 

This is a similar concern to what I had. If the preferred zone, which is
the first valid usable zone, is not a "hit" for the statistics then I
don't know what "hit" is meant to mean.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 13:10     ` Mel Gorman
@ 2016-12-20 13:26       ` Michal Hocko
  2016-12-20 14:28         ` Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-20 13:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > incorrect.
> > 
> > I am sorry but I have hard time following the changelog. It is clear
> > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > but it is not really clear when such thing happens. You are adding
> > preferred_zone->node check. preferred_zone is the first zone in the
> > requested zonelist. So for the most allocations it is a node from the
> > local node. But if something request an explicit numa node (without
> > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > referenced patch indeed caused an unintended change of accounting AFAIU.
> > 
> 
> This is a similar concern to what I had. If the preferred zone, which is
> the first valid usable zone, is not a "hit" for the statistics then I
> don't know what "hit" is meant to mean.

But the first valid usable zone is defined based on the requested numa
node. Unless the requested node is memoryless then we should have a hit,
no?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 13:26       ` Michal Hocko
@ 2016-12-20 14:28         ` Mel Gorman
  2016-12-20 14:35           ` Michal Hocko
  2016-12-20 14:42           ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman
  0 siblings, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2016-12-20 14:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > > incorrect.
> > > 
> > > I am sorry but I have hard time following the changelog. It is clear
> > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > but it is not really clear when such thing happens. You are adding
> > > preferred_zone->node check. preferred_zone is the first zone in the
> > > requested zonelist. So for the most allocations it is a node from the
> > > local node. But if something request an explicit numa node (without
> > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > 
> > 
> > This is a similar concern to what I had. If the preferred zone, which is
> > the first valid usable zone, is not a "hit" for the statistics then I
> > don't know what "hit" is meant to mean.
> 
> But the first valid usable zone is defined based on the requested numa
> node. Unless the requested node is memoryless then we should have a hit,
> no?
> 

Should be. If the local node is memoryless then there would be a difference
between hit and whether it's local or not but that to me is a little
useless. A local vs remote page allocated has a specific meaning and
consequence. It's hard to see how hit can be meaningfully interpreted if
there are memoryless nodes. I don't have a strong objection to the patch
so I didn't nak it, I'm just not convinced it matters.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 14:28         ` Mel Gorman
@ 2016-12-20 14:35           ` Michal Hocko
  2016-12-20 14:49             ` Vlastimil Babka
  2016-12-20 14:54             ` Mel Gorman
  2016-12-20 14:42           ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman
  1 sibling, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-20 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > > > incorrect.
> > > > 
> > > > I am sorry but I have hard time following the changelog. It is clear
> > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > but it is not really clear when such thing happens. You are adding
> > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > requested zonelist. So for the most allocations it is a node from the
> > > > local node. But if something request an explicit numa node (without
> > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > 
> > > 
> > > This is a similar concern to what I had. If the preferred zone, which is
> > > the first valid usable zone, is not a "hit" for the statistics then I
> > > don't know what "hit" is meant to mean.
> > 
> > But the first valid usable zone is defined based on the requested numa
> > node. Unless the requested node is memoryless then we should have a hit,
> > no?
> > 
> 
> Should be. If the local node is memoryless then there would be a difference
> between hit and whether it's local or not but that to me is a little
> useless. A local vs remote page allocated has a specific meaning and
> consequence. It's hard to see how hit can be meaningfully interpreted if
> there are memoryless nodes. I don't have a strong objection to the patch
> so I didn't nak it, I'm just not convinced it matters.

So what do you think about
http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz

I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
one off thing and the gfp space it rather precious.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 14:28         ` Mel Gorman
  2016-12-20 14:35           ` Michal Hocko
@ 2016-12-20 14:42           ` Mel Gorman
  1 sibling, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2016-12-20 14:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue, Dec 20, 2016 at 02:28:45PM +0000, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > > > incorrect.
> > > > 
> > > > I am sorry but I have hard time following the changelog. It is clear
> > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > but it is not really clear when such thing happens. You are adding
> > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > requested zonelist. So for the most allocations it is a node from the
> > > > local node. But if something request an explicit numa node (without
> > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > 
> > > 
> > > This is a similar concern to what I had. If the preferred zone, which is
> > > the first valid usable zone, is not a "hit" for the statistics then I
> > > don't know what "hit" is meant to mean.
> > 
> > But the first valid usable zone is defined based on the requested numa
> > node. Unless the requested node is memoryless then we should have a hit,
> > no?
> > 
> 
> Should be. If the local node is memoryless then there would be a difference
> between hit and whether it's local or not but that to me is a little
> useless. A local vs remote page allocated has a specific meaning and
> consequence. It's hard to see how hit can be meaningfully interpreted if
> there are memoryless nodes. I don't have a strong objection to the patch
> so I didn't nak it, I'm just not convinced it matters.
> 

That said, it's also hard to interpret "hit" when memory policies
forbid a memory node but allows the CPUs to be used. Local and remote
can be interpreted regardless of machine, "hit" has meaning depending on
the configuration and interpreting it is weird. It had some value when
zone_reclaim_mode was always enabled because a low hit rate would imply
zone_reclaim_mode was broken but that's a corner case. I would prefer to
get rid of it because interpreting it is such a mess if there wasn't tools
already collecting the data.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 14:35           ` Michal Hocko
@ 2016-12-20 14:49             ` Vlastimil Babka
  2016-12-20 14:54             ` Mel Gorman
  1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2016-12-20 14:49 UTC (permalink / raw)
  To: Michal Hocko, Mel Gorman
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Joonsoo Kim, Taku Izumi, Andi Kleen

On 12/20/2016 03:35 PM, Michal Hocko wrote:
> On Tue 20-12-16 14:28:45, Mel Gorman wrote:
>> On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
>>> On Tue 20-12-16 13:10:40, Mel Gorman wrote:
>>>> On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
>>>>> On Mon 12-12-16 13:59:07, Jia He wrote:
>>>>>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>>>>>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>>>>>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>>>>>>  z->node would not be equal to preferred_zone->node. That seems to be
>>>>>> incorrect.
>>>>>
>>>>> I am sorry but I have hard time following the changelog. It is clear
>>>>> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
>>>>> but it is not really clear when such thing happens. You are adding
>>>>> preferred_zone->node check. preferred_zone is the first zone in the
>>>>> requested zonelist. So for the most allocations it is a node from the
>>>>> local node. But if something request an explicit numa node (without
>>>>> __GFP_OTHER_NODE which would be the majority I suspect) then we could
>>>>> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
>>>>> referenced patch indeed caused an unintended change of accounting AFAIU.
>>>>>
>>>>
>>>> This is a similar concern to what I had. If the preferred zone, which is
>>>> the first valid usable zone, is not a "hit" for the statistics then I
>>>> don't know what "hit" is meant to mean.
>>>
>>> But the first valid usable zone is defined based on the requested numa
>>> node. Unless the requested node is memoryless then we should have a hit,
>>> no?
>>>
>>
>> Should be. If the local node is memoryless then there would be a difference
>> between hit and whether it's local or not but that to me is a little
>> useless. A local vs remote page allocated has a specific meaning and
>> consequence. It's hard to see how hit can be meaningfully interpreted if
>> there are memoryless nodes. I don't have a strong objection to the patch
>> so I didn't nak it, I'm just not convinced it matters.
> 
> So what do you think about
> http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz
> 
> I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
> one off thing and the gfp space it rather precious.

Let's CC Andi who introduced it by commit 78afd5612deb8.
Personally I agree that the reasoning provided by that commit does not
justify the troubles. We already have the HIT and MISS counters to
record if we allocated on the node we explicitly wanted/preferred (local
or remote). The LOCAL and OTHER should thus be true local/remote
statistics, why fake them in some rare cases such as khugepaged? The
only other case is do_huge_pmd_wp_page_fallback() where it perhaps makes
even less sense. No others were added in 5 years so I think the flag
really didn't catch on, let's get rid of it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 14:35           ` Michal Hocko
  2016-12-20 14:49             ` Vlastimil Babka
@ 2016-12-20 14:54             ` Mel Gorman
  2016-12-21  7:57               ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2016-12-20 14:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote:
> On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > > > > incorrect.
> > > > > 
> > > > > I am sorry but I have hard time following the changelog. It is clear
> > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > > but it is not really clear when such thing happens. You are adding
> > > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > > requested zonelist. So for the most allocations it is a node from the
> > > > > local node. But if something request an explicit numa node (without
> > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > > 
> > > > 
> > > > This is a similar concern to what I had. If the preferred zone, which is
> > > > the first valid usable zone, is not a "hit" for the statistics then I
> > > > don't know what "hit" is meant to mean.
> > > 
> > > But the first valid usable zone is defined based on the requested numa
> > > node. Unless the requested node is memoryless then we should have a hit,
> > > no?
> > > 
> > 
> > Should be. If the local node is memoryless then there would be a difference
> > between hit and whether it's local or not but that to me is a little
> > useless. A local vs remote page allocated has a specific meaning and
> > consequence. It's hard to see how hit can be meaningfully interpreted if
> > there are memoryless nodes. I don't have a strong objection to the patch
> > so I didn't nak it, I'm just not convinced it matters.
> 
> So what do you think about
> http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz
> 

This doesn't appear to resolve for me and I've 30 minutes left before
being offline for 4 days so didn't go digging. 

> I think that we should get rid of __GFP_OTHER_NODE thingy. It is just
> one off thing and the gfp space it rather precious.
> 

However, broadly speaking, I'd be ok with getting rid of
__GFP_OTHER_NODE altogether and making it truely only about local vs
remote hits because those are the ones that matter in terms of
performance. If a user has memoryless nodes or policies that allow local
CPUs but forbid local memory and they need to debug an issue, they're
going to need tracepoints anyway. Hit/miss/other is not sufficient for
most interesting problems involving local or remote memory usage.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20  9:18   ` Michal Hocko
  2016-12-20 13:10     ` Mel Gorman
@ 2016-12-20 15:13     ` Vlastimil Babka
  2016-12-21  3:01     ` hejianet
  2 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2016-12-20 15:13 UTC (permalink / raw)
  To: Michal Hocko, Jia He
  Cc: linux-mm, linux-kernel, Andrew Morton, Mel Gorman,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

[-- Attachment #1: Type: text/plain, Size: 2451 bytes --]

On 12/20/2016 10:18 AM, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>>  z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
> 
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.

Since I could not quite wrap my head around all possible combinations,
I've got lazy and employed cbmc, inspired by [1] (test file attached if
anyone wants to play with it) and tried to test whether both commit
b9f00e147f27, and Jia He's patch are equivalent to pre-b9f00e147f27. The
tool found counter examples for both cases (there might be more
scenarios, but it only reports the first one it hits). Note that
decoding the output is not exactly trivial...

[1] http://paulmck.livejournal.com/38997.html

==============
Counterexample for commit b9f00e147f27 vs pre-b9f00e147f27:

numa_node_id() == 1
preferred_zone on node 0
allocated from zone on node 1
without __GFP_OTHER_NODE

pre-b9f00e147f27:
allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL
preferred zone (node 0) increased NUMA_FOREIGN

(that looks correct to me)

b9f00e147f27:
allocated zone (node 1) got NUMA_HIT and NUMA_LOCAL
there's no NUMA_FOREIGN

(that's wrong wrt HIT and FOREIGN)

==============
Counterexample for Jia He's patch vs pre-b9f00e147f27:

numa_node_id() == 0
preferred_zone on node 0
allocated from zone on node 1
without __GFP_OTHER_NODE

pre-b9f00e147f27:
allocated zone (node 1) increased NUMA_MISS and NUMA_OTHER
preferred zone (node 0) increased NUMA_FOREIGN

(that looks correct to me)

Jia He's patch:
allocated zone (node 1) increased NUMA_MISS
preferred zone (node 0) increased NUMA_FOREIGN

(it's missing NUMA_OTHER)

[-- Attachment #2: test-numa.c --]
[-- Type: text/x-csrc, Size: 3798 bytes --]

#include <stdio.h>

#define NODES	3
#define ZONES	2

enum zone_stat_item {
	NUMA_HIT,
	NUMA_MISS,
	NUMA_LOCAL,
	NUMA_OTHER,
	NUMA_FOREIGN,
	ZONE_STAT_ITEMS
};

struct node;
struct zone {
	struct node *zone_pgdat;
	int node;
	int vm_stat[ZONE_STAT_ITEMS];
};

struct node {
	int nid;
	struct zone node_zones[ZONES];
};

typedef int gfp_t;
#define __GFP_OTHER_NODE	1
#define unlikely(x) (x)

int local_nid;
int numa_node_id()
{
	return local_nid;
}

static void __inc_zone_state(struct zone *z, enum zone_stat_item stat)
{
	z->vm_stat[stat]++;
}

// before b9f00e147f27
void zone_statistics1(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{
	if (z->zone_pgdat == preferred_zone->zone_pgdat) {
		__inc_zone_state(z, NUMA_HIT);
	} else {
		__inc_zone_state(z, NUMA_MISS);
		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
	}

	if (z->node == ((flags & __GFP_OTHER_NODE) ?
		preferred_zone->node : numa_node_id()))
                __inc_zone_state(z, NUMA_LOCAL);
        else
                __inc_zone_state(z, NUMA_OTHER);
}

// after b9f00e147f27
void zone_statistics2(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{
        int local_nid = numa_node_id();
        enum zone_stat_item local_stat = NUMA_LOCAL;

        if (unlikely(flags & __GFP_OTHER_NODE)) {
                local_stat = NUMA_OTHER;
                local_nid = preferred_zone->node;
        }

        if (z->node == local_nid) {
                __inc_zone_state(z, NUMA_HIT);
                __inc_zone_state(z, local_stat);
        } else {
                __inc_zone_state(z, NUMA_MISS);
                __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
}

// after Jia He's patch
static inline void zone_statistics3(struct zone *preferred_zone, struct zone *z,
                                                                gfp_t flags)
{
        int local_nid = numa_node_id();
        enum zone_stat_item local_stat = NUMA_LOCAL;

        if (unlikely(flags & __GFP_OTHER_NODE)) {
                local_stat = NUMA_OTHER;
                local_nid = preferred_zone->node;
        }

        if (z->node == local_nid) {
                __inc_zone_state(z, NUMA_HIT);
                __inc_zone_state(z, local_stat);
        } else if (z->node == preferred_zone->node) {
                __inc_zone_state(z, NUMA_HIT);
                __inc_zone_state(z, NUMA_OTHER);
        } else {
                __inc_zone_state(z, NUMA_MISS);
                __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
}


void init_node(struct node * n, int nid)
{
	n->nid = nid;
	for (int i = 0; i < ZONES; i++) {
		struct zone * z = &n->node_zones[i];

		z->zone_pgdat = n;
		z->node = nid;
		for (int j = 0; j < ZONE_STAT_ITEMS; j++)
			z->vm_stat[j] = 0;
	}
}

struct node nodes1[NODES];
struct node nodes2[NODES];

void check_stats()
{
	for (int i = 0; i < NODES; i++)
		for (int j = 0; j < ZONES; j++)
			for (int k = 0; k < ZONE_STAT_ITEMS; k++)
				assert (nodes1[i].node_zones[j].vm_stat[k]
				     == nodes2[i].node_zones[j].vm_stat[k]);

}

int main(int argc, char *argv[])
{

	for (int i = 0; i < NODES; i++) {
		init_node(&nodes1[i], i);
		init_node(&nodes2[i], i);
	}

	local_nid = ((unsigned int)argv[1]) % NODES;
	int zone_nid = ((unsigned int)argv[2]) % NODES;
	int pzone_nid = ((unsigned int)argv[3]) % NODES;
	int zid = ((unsigned int)argv[4]) % ZONES;
	int pzid = ((unsigned int)argv[5]) % ZONES;

	/* we should not allocate from higher than preferred zone */
	if (zid > pzid)
		zid = pzid;

	gfp_t flags = ((unsigned int)argv[6]) & __GFP_OTHER_NODE;

	zone_statistics1(&nodes1[pzone_nid].node_zones[pzid],
			 &nodes1[zone_nid].node_zones[zid],
			 flags);

	zone_statistics3(&nodes2[pzone_nid].node_zones[pzid],
			 &nodes2[zone_nid].node_zones[zid],
			 flags);

	check_stats();
}


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20  9:18   ` Michal Hocko
  2016-12-20 13:10     ` Mel Gorman
  2016-12-20 15:13     ` Vlastimil Babka
@ 2016-12-21  3:01     ` hejianet
  2 siblings, 0 replies; 23+ messages in thread
From: hejianet @ 2016-12-21  3:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Mel Gorman, Johannes Weiner, Joonsoo Kim, Taku Izumi



On 20/12/2016 5:18 PM, Michal Hocko wrote:
> On Mon 12-12-16 13:59:07, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>>   z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
> I am sorry but I have hard time following the changelog. It is clear
> that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> but it is not really clear when such thing happens. You are adding
> preferred_zone->node check. preferred_zone is the first zone in the
> requested zonelist. So for the most allocations it is a node from the
> local node. But if something request an explicit numa node (without
> __GFP_OTHER_NODE which would be the majority I suspect) then we could
> indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> referenced patch indeed caused an unintended change of accounting AFAIU.
>
> If this is correct then it should be a part of the changelog. I also
> cannot say I would like the fix. First of all I am not sure
> __GFP_OTHER_NODE is a good idea at all. How is an explicit usage of the
> flag any different from an explicit __alloc_pages_node(non_local_nid)?
> In both cases we ask for an allocation on a remote node and successful
> allocation is a NUMA_HIT and NUMA_OTHER.
>
> That being said, why cannot we simply do the following? As a bonus, we
> can get rid of a barely used __GFP_OTHER_NODE. Also the number of
> branches will stay same.
Yes, I agree maybe we can get rid of __GFP_OTHER_NODE if no objections
Seems currently it is only used for hugepage and statistics
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 429855be6ec9..f035d5c8b864 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
>    * Update NUMA hit/miss statistics
>    *
>    * Must be called with interrupts disabled.
> - *
> - * When __GFP_OTHER_NODE is set assume the node of the preferred
> - * zone is the local node. This is useful for daemons who allocate
> - * memory on behalf of other processes.
>    */
>   static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>   								gfp_t flags)
>   {
>   #ifdef CONFIG_NUMA
> -	int local_nid = numa_node_id();
> -	enum zone_stat_item local_stat = NUMA_LOCAL;
> -
> -	if (unlikely(flags & __GFP_OTHER_NODE)) {
> -		local_stat = NUMA_OTHER;
> -		local_nid = preferred_zone->node;
> -	}
> +	if (z->node == preferred_zone->node) {
> +		enum zone_stat_item local_stat = NUMA_LOCAL;
>   
> -	if (z->node == local_nid) {
>   		__inc_zone_state(z, NUMA_HIT);
> +		if (z->node != numa_node_id())
> +			local_stat = NUMA_OTHER;
>   		__inc_zone_state(z, local_stat);
>   	} else {
>   		__inc_zone_state(z, NUMA_MISS);
I thought the logic here is different
Here is the zone_statistics() before introducing __GFP_OTHER_NODE:

if (z->zone_pgdat == preferred_zone->zone_pgdat) {
         __inc_zone_state(z, NUMA_HIT);
     } else {
         __inc_zone_state(z, NUMA_MISS);
         __inc_zone_state(preferred_zone, NUMA_FOREIGN);
     }
     if (z->node == numa_node_id())
         __inc_zone_state(z, NUMA_LOCAL);
     else
         __inc_zone_state(z, NUMA_OTHER);

B.R.
Jia

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 12:31   ` Mel Gorman
@ 2016-12-21  3:07     ` hejianet
  0 siblings, 0 replies; 23+ messages in thread
From: hejianet @ 2016-12-21  3:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Michal Hocko, Johannes Weiner, Joonsoo Kim, Taku Izumi



On 20/12/2016 8:31 PM, Mel Gorman wrote:
> On Mon, Dec 12, 2016 at 01:59:07PM +0800, Jia He wrote:
>> In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics"), it reconstructed codes to reduce the branch miss rate.
>> Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
>>   z->node would not be equal to preferred_zone->node. That seems to be
>> incorrect.
>>
>> Fixes: commit b9f00e147f27 ("mm, page_alloc: reduce branches in
>> zone_statistics")
>>
>> Signed-off-by: Jia He <hejianet@gmail.com>
> This is slightly curious. It appear it would only occur if a process was
> running on a node that was outside the memory policy. Can you confirm
> that is the case?
Yes, here is what I caught:

dumpstack() is triggered when z->node(5) == preferred_zone->node(5) and z->node != numa_node_id(4)
without flag GET_OTHER_NODE

It is not a rare case. I saw hundreds of such cases when kernel boots up
[c000000cdcaef440] [c0000000002e88cc] cache_grow_begin+0xcc/0x500
[c000000cdcaef6f0] [c0000000002ecb44] do_tune_cpucache+0x64/0x100
[c000000cdcaef750] [c0000000002ecc7c] enable_cpucache+0x9c/0x180
[c000000cdcaef7d0] [c0000000002ed01c] __kmem_cache_create+0x1ec/00x2c0
[c000000cdcaef820] [c000000000291c98] create_cache+0xb8/0x240
[c000000cdcaef890] [c000000000291fa8] kmem_cache_create+0x188/0x2290
[c000000cdcaef950] [d000000011dc5c70] ext4_mb_init+0x3c0/0x5e0 [eext4]
[c000000cdcaef9f0] [d000000011daaedc] ext4_fill_super+0x266c/0x33390 [ext4]
[c000000cdcaefb30] [c000000000328b8c] mount_bdev+0x22c/0x260
[c000000cdcaefbd0] [d000000011da1fa8] ext4_mount+0x48/0x60 [ext4]
[c000000cdcaefc10] [c00000000032a11c] mount_fs+0x8c/0x230
[c000000cdcaefcb0] [c000000000351f98] vfs_kern_mount+0x78/0x180
[c000000cdcaefd00] [c000000000356d68] do_mount+0x258/0xea0
[c000000cdcaefde0] [c000000000357da0] SyS_mount+0xa0/0x110
[c000000cdcaefe30] [c00000000000bd84] system_call+0x38/0xe0


> If so, your patch is a a semantic curiousity because it's actually
> impossible for a NUMA allocation to be local and the definition of "HIT"
> is fuzzy enough to be useless.
>
> I won't object to the patch but it makes me trust "hit" even less than I
> already do for any analysis.
>
> Note that after this mail that I'll be unavailable by mail until early
> new years.
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data
  2016-12-20 14:54             ` Mel Gorman
@ 2016-12-21  7:57               ` Michal Hocko
  2016-12-21  8:06                 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-21  7:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jia He, linux-mm, linux-kernel, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Joonsoo Kim, Taku Izumi

On Tue 20-12-16 14:54:35, Mel Gorman wrote:
> On Tue, Dec 20, 2016 at 03:35:02PM +0100, Michal Hocko wrote:
> > On Tue 20-12-16 14:28:45, Mel Gorman wrote:
> > > On Tue, Dec 20, 2016 at 02:26:43PM +0100, Michal Hocko wrote:
> > > > On Tue 20-12-16 13:10:40, Mel Gorman wrote:
> > > > > On Tue, Dec 20, 2016 at 10:18:14AM +0100, Michal Hocko wrote:
> > > > > > On Mon 12-12-16 13:59:07, Jia He wrote:
> > > > > > > In commit b9f00e147f27 ("mm, page_alloc: reduce branches in
> > > > > > > zone_statistics"), it reconstructed codes to reduce the branch miss rate.
> > > > > > > Compared with the original logic, it assumed if !(flag & __GFP_OTHER_NODE)
> > > > > > >  z->node would not be equal to preferred_zone->node. That seems to be
> > > > > > > incorrect.
> > > > > > 
> > > > > > I am sorry but I have hard time following the changelog. It is clear
> > > > > > that you are trying to fix a missed NUMA_{HIT,OTHER} accounting
> > > > > > but it is not really clear when such thing happens. You are adding
> > > > > > preferred_zone->node check. preferred_zone is the first zone in the
> > > > > > requested zonelist. So for the most allocations it is a node from the
> > > > > > local node. But if something request an explicit numa node (without
> > > > > > __GFP_OTHER_NODE which would be the majority I suspect) then we could
> > > > > > indeed end up accounting that as a NUMA_MISS, NUMA_FOREIGN so the
> > > > > > referenced patch indeed caused an unintended change of accounting AFAIU.
> > > > > > 
> > > > > 
> > > > > This is a similar concern to what I had. If the preferred zone, which is
> > > > > the first valid usable zone, is not a "hit" for the statistics then I
> > > > > don't know what "hit" is meant to mean.
> > > > 
> > > > But the first valid usable zone is defined based on the requested numa
> > > > node. Unless the requested node is memoryless then we should have a hit,
> > > > no?
> > > > 
> > > 
> > > Should be. If the local node is memoryless then there would be a difference
> > > between hit and whether it's local or not but that to me is a little
> > > useless. A local vs remote page allocated has a specific meaning and
> > > consequence. It's hard to see how hit can be meaningfully interpreted if
> > > there are memoryless nodes. I don't have a strong objection to the patch
> > > so I didn't nak it, I'm just not convinced it matters.
> > 
> > So what do you think about
> > http://lkml.kernel.org/r/20161220091814.GC3769@dhcp22.suse.cz
> > 
> 
> This doesn't appear to resolve for me and I've 30 minutes left before
> being offline for 4 days so didn't go digging. 

OK, it seems that it didn't go to the lkml so it didn't get to the
archive indexed by the message id. I will send the two patches as a
reply to this email for reference.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/2] mm: fix remote numa hits statistics
  2016-12-21  7:57               ` Michal Hocko
@ 2016-12-21  8:06                 ` Michal Hocko
  2016-12-21  8:06                   ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko
                                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-21  8:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
in zone_statistics") has an unintentional side effect that remote node
allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
many of these potentially because the flag is used very rarely while
we have many users of __alloc_pages_node.

Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
follow up patch) and treat all allocations that were satisfied from the
preferred zone's node as NUMA_HITS because this is the same node we
requested the allocation from in most cases. If this is not the local
node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

One downsize would be that an allocation request for a node which is
outside of the mempolicy nodemask would be reported as a hit which is a
bit weird but that was the case before b9f00e147f27 already.

Reported-by: Jia He <hejianet@gmail.com>
Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6d5b73e1d7c..506946a902c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
  * Update NUMA hit/miss statistics
  *
  * Must be called with interrupts disabled.
- *
- * When __GFP_OTHER_NODE is set assume the node of the preferred
- * zone is the local node. This is useful for daemons who allocate
- * memory on behalf of other processes.
  */
 static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 								gfp_t flags)
 {
 #ifdef CONFIG_NUMA
-	int local_nid = numa_node_id();
-	enum zone_stat_item local_stat = NUMA_LOCAL;
-
-	if (unlikely(flags & __GFP_OTHER_NODE)) {
-		local_stat = NUMA_OTHER;
-		local_nid = preferred_zone->node;
-	}
+	if (z->node == preferred_zone->node) {
+		enum zone_stat_item local_stat = NUMA_LOCAL;
 
-	if (z->node == local_nid) {
 		__inc_zone_state(z, NUMA_HIT);
+		if (z->node != numa_node_id())
+			local_stat = NUMA_OTHER;
 		__inc_zone_state(z, local_stat);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
-- 
2.10.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE
  2016-12-21  8:06                 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko
@ 2016-12-21  8:06                   ` Michal Hocko
  2017-01-02 14:18                     ` Vlastimil Babka
  2016-12-29 11:46                   ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman
  2017-01-02 14:16                   ` Vlastimil Babka
  2 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2016-12-21  8:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The flag has been introduced by 78afd5612deb ("mm: add __GFP_OTHER_NODE
flag") to allow proper accounting of remote node allocations done by
kernel daemons on behalf of a process - e.g. khugepaged.

After "mm: fix remote numa hits statistics" we do not need and actually
use the flag so we can safely remove it because all allocations which
are satisfied from their "home" node are accounted properly.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h            | 13 +++----------
 include/trace/events/mmflags.h |  1 -
 mm/huge_memory.c               |  3 +--
 mm/khugepaged.c                |  5 ++---
 mm/page_alloc.c                |  5 ++---
 tools/perf/builtin-kmem.c      |  1 -
 6 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..7806a8f80abc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -38,9 +38,8 @@ struct vm_area_struct;
 #define ___GFP_ACCOUNT		0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_DIRECT_RECLAIM	0x400000u
-#define ___GFP_OTHER_NODE	0x800000u
-#define ___GFP_WRITE		0x1000000u
-#define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#define ___GFP_WRITE		0x800000u
+#define ___GFP_KSWAPD_RECLAIM	0x1000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -172,11 +171,6 @@ struct vm_area_struct;
  * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of
  *   distinguishing in the source between false positives and allocations that
  *   cannot be supported (e.g. page tables).
- *
- * __GFP_OTHER_NODE is for allocations that are on a remote node but that
- *   should not be accounted for as a remote allocation in vmstat. A
- *   typical user would be khugepaged collapsing a huge page on a remote
- *   node.
  */
 #define __GFP_COLD	((__force gfp_t)___GFP_COLD)
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
@@ -184,10 +178,9 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
-#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT 25
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab48a2fb..556a0efa8298 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -48,7 +48,6 @@
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
 	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"},\
-	{(unsigned long)__GFP_OTHER_NODE,	"__GFP_OTHER_NODE"}	\
 
 #define show_gfp_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f3c2040edbb1..8206abf4ac03 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -918,8 +918,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
-					       __GFP_OTHER_NODE, vma,
+		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma,
 					       vmf->address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_try_charge(pages[i], vma->vm_mm,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e32389a97030..211974a3992b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -943,7 +943,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
+	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
 
 	/*
 	 * Before allocating the hugepage, release the mmap_sem read lock.
@@ -1326,8 +1326,7 @@ static void collapse_shmem(struct mm_struct *mm,
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
 	/* Only allocate from the target node */
-	gfp = alloc_hugepage_khugepaged_gfpmask() |
-		__GFP_OTHER_NODE | __GFP_THISNODE;
+	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
 
 	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 506946a902c5..647e940e6921 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2584,8 +2584,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
  *
  * Must be called with interrupts disabled.
  */
-static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
-								gfp_t flags)
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
 	if (z->node == preferred_zone->node) {
@@ -2666,7 +2665,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 	}
 
 	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone, gfp_flags);
+	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
 
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index d426dcb18ce9..33b959d47545 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -645,7 +645,6 @@ static const struct {
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
 	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
-	{ "__GFP_OTHER_NODE",		"ON" },
 };
 
 static size_t max_gfp_len;
-- 
2.10.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] mm: fix remote numa hits statistics
  2016-12-21  8:06                 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko
  2016-12-21  8:06                   ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko
@ 2016-12-29 11:46                   ` Mel Gorman
  2016-12-29 12:28                     ` Michal Hocko
  2017-01-02 14:16                   ` Vlastimil Babka
  2 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2016-12-29 11:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner, Michal Hocko

On Wed, Dec 21, 2016 at 09:06:52AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
> in zone_statistics") has an unintentional side effect that remote node
> allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
> NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
> many of these potentially because the flag is used very rarely while
> we have many users of __alloc_pages_node.
> 
> Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
> follow up patch) and treat all allocations that were satisfied from the
> preferred zone's node as NUMA_HITS because this is the same node we
> requested the allocation from in most cases. If this is not the local
> node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
> 
> One downsize would be that an allocation request for a node which is
> outside of the mempolicy nodemask would be reported as a hit which is a
> bit weird but that was the case before b9f00e147f27 already.
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

For both patches;

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] mm: fix remote numa hits statistics
  2016-12-29 11:46                   ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman
@ 2016-12-29 12:28                     ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-29 12:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Vlastimil Babka, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner

On Thu 29-12-16 11:46:01, Mel Gorman wrote:
> On Wed, Dec 21, 2016 at 09:06:52AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
> > in zone_statistics") has an unintentional side effect that remote node
> > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
> > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
> > many of these potentially because the flag is used very rarely while
> > we have many users of __alloc_pages_node.
> > 
> > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
> > follow up patch) and treat all allocations that were satisfied from the
> > preferred zone's node as NUMA_HITS because this is the same node we
> > requested the allocation from in most cases. If this is not the local
> > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
> > 
> > One downsize would be that an allocation request for a node which is
> > outside of the mempolicy nodemask would be reported as a hit which is a
> > bit weird but that was the case before b9f00e147f27 already.
> > 
> > Reported-by: Jia He <hejianet@gmail.com>
> > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> For both patches;
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks! I will give it a week for others to get back to it after holiday
and then resubmit.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] mm: fix remote numa hits statistics
  2016-12-21  8:06                 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko
  2016-12-21  8:06                   ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko
  2016-12-29 11:46                   ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman
@ 2017-01-02 14:16                   ` Vlastimil Babka
  2017-01-02 14:46                     ` Michal Hocko
  2 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2017-01-02 14:16 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi,
	Johannes Weiner, Michal Hocko

On 12/21/2016 09:06 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
> in zone_statistics") has an unintentional side effect that remote node
> allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
> NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
> many of these potentially because the flag is used very rarely while
> we have many users of __alloc_pages_node.
> 
> Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
> follow up patch) and treat all allocations that were satisfied from the
> preferred zone's node as NUMA_HITS because this is the same node we
> requested the allocation from in most cases. If this is not the local
> node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
> 
> One downsize would be that an allocation request for a node which is
> outside of the mempolicy nodemask would be reported as a hit which is a
> bit weird but that was the case before b9f00e147f27 already.
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

cbmc tells me that this patch is not equal to pre-commit b9f00e147f27
(in situation where __GFP_OTHER_NODE is not passed, as that's the only
relevant scenario after your patch), which seems unintended.

counter example:
numa_node_id() == 1
preferred_zone on node 0
allocated from zone on node 1

pre-b9f00e147f27:
allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL
preferred zone (node 0) increased NUMA_FOREIGN

(that looks correct to me)

your patch:
allocated zone (node 1) increased NUMA_MISS
preferred zone (node 0) increased NUMA_FOREIGN

i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was
an allocation local to the CPU (albeit a MISS wrt the preferred node).

Vlastimil

> ---
>  mm/page_alloc.c | 16 ++++------------
>  1 file changed, 4 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f6d5b73e1d7c..506946a902c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2583,25 +2583,17 @@ int __isolate_free_page(struct page *page, unsigned int order)
>   * Update NUMA hit/miss statistics
>   *
>   * Must be called with interrupts disabled.
> - *
> - * When __GFP_OTHER_NODE is set assume the node of the preferred
> - * zone is the local node. This is useful for daemons who allocate
> - * memory on behalf of other processes.
>   */
>  static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>  								gfp_t flags)
>  {
>  #ifdef CONFIG_NUMA
> -	int local_nid = numa_node_id();
> -	enum zone_stat_item local_stat = NUMA_LOCAL;
> -
> -	if (unlikely(flags & __GFP_OTHER_NODE)) {
> -		local_stat = NUMA_OTHER;
> -		local_nid = preferred_zone->node;
> -	}
> +	if (z->node == preferred_zone->node) {
> +		enum zone_stat_item local_stat = NUMA_LOCAL;
>  
> -	if (z->node == local_nid) {
>  		__inc_zone_state(z, NUMA_HIT);
> +		if (z->node != numa_node_id())
> +			local_stat = NUMA_OTHER;
>  		__inc_zone_state(z, local_stat);
>  	} else {
>  		__inc_zone_state(z, NUMA_MISS);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE
  2016-12-21  8:06                   ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko
@ 2017-01-02 14:18                     ` Vlastimil Babka
  0 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2017-01-02 14:18 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He, Taku Izumi,
	Johannes Weiner, Michal Hocko, Andi Kleen

[CC Andi who introduced __GFP_OTHER_NODE]

On 12/21/2016 09:06 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The flag has been introduced by 78afd5612deb ("mm: add __GFP_OTHER_NODE
> flag") to allow proper accounting of remote node allocations done by
> kernel daemons on behalf of a process - e.g. khugepaged.
> 
> After "mm: fix remote numa hits statistics" we do not need and actually
> use the flag so we can safely remove it because all allocations which
> are satisfied from their "home" node are accounted properly.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/gfp.h            | 13 +++----------
>  include/trace/events/mmflags.h |  1 -
>  mm/huge_memory.c               |  3 +--
>  mm/khugepaged.c                |  5 ++---
>  mm/page_alloc.c                |  5 ++---
>  tools/perf/builtin-kmem.c      |  1 -
>  6 files changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4175dca4ac39..7806a8f80abc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -38,9 +38,8 @@ struct vm_area_struct;
>  #define ___GFP_ACCOUNT		0x100000u
>  #define ___GFP_NOTRACK		0x200000u
>  #define ___GFP_DIRECT_RECLAIM	0x400000u
> -#define ___GFP_OTHER_NODE	0x800000u
> -#define ___GFP_WRITE		0x1000000u
> -#define ___GFP_KSWAPD_RECLAIM	0x2000000u
> +#define ___GFP_WRITE		0x800000u
> +#define ___GFP_KSWAPD_RECLAIM	0x1000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -172,11 +171,6 @@ struct vm_area_struct;
>   * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of
>   *   distinguishing in the source between false positives and allocations that
>   *   cannot be supported (e.g. page tables).
> - *
> - * __GFP_OTHER_NODE is for allocations that are on a remote node but that
> - *   should not be accounted for as a remote allocation in vmstat. A
> - *   typical user would be khugepaged collapsing a huge page on a remote
> - *   node.
>   */
>  #define __GFP_COLD	((__force gfp_t)___GFP_COLD)
>  #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
> @@ -184,10 +178,9 @@ struct vm_area_struct;
>  #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
>  #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
> -#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
>  
>  /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT 26
> +#define __GFP_BITS_SHIFT 25
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /*
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 5a81ab48a2fb..556a0efa8298 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -48,7 +48,6 @@
>  	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
>  	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
>  	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"},\
> -	{(unsigned long)__GFP_OTHER_NODE,	"__GFP_OTHER_NODE"}	\
>  
>  #define show_gfp_flags(flags)						\
>  	(flags) ? __print_flags(flags, "|",				\
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f3c2040edbb1..8206abf4ac03 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -918,8 +918,7 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>  	}
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> -		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
> -					       __GFP_OTHER_NODE, vma,
> +		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma,
>  					       vmf->address, page_to_nid(page));
>  		if (unlikely(!pages[i] ||
>  			     mem_cgroup_try_charge(pages[i], vma->vm_mm,
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e32389a97030..211974a3992b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -943,7 +943,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
>  	/* Only allocate from the target node */
> -	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
> +	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>  
>  	/*
>  	 * Before allocating the hugepage, release the mmap_sem read lock.
> @@ -1326,8 +1326,7 @@ static void collapse_shmem(struct mm_struct *mm,
>  	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>  
>  	/* Only allocate from the target node */
> -	gfp = alloc_hugepage_khugepaged_gfpmask() |
> -		__GFP_OTHER_NODE | __GFP_THISNODE;
> +	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>  
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 506946a902c5..647e940e6921 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2584,8 +2584,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
>   *
>   * Must be called with interrupts disabled.
>   */
> -static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
> -								gfp_t flags)
> +static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
>  {
>  #ifdef CONFIG_NUMA
>  	if (z->node == preferred_zone->node) {
> @@ -2666,7 +2665,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  	}
>  
>  	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> -	zone_statistics(preferred_zone, zone, gfp_flags);
> +	zone_statistics(preferred_zone, zone);
>  	local_irq_restore(flags);
>  
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
> index d426dcb18ce9..33b959d47545 100644
> --- a/tools/perf/builtin-kmem.c
> +++ b/tools/perf/builtin-kmem.c
> @@ -645,7 +645,6 @@ static const struct {
>  	{ "__GFP_RECLAIM",		"R" },
>  	{ "__GFP_DIRECT_RECLAIM",	"DR" },
>  	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
> -	{ "__GFP_OTHER_NODE",		"ON" },
>  };
>  
>  static size_t max_gfp_len;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] mm: fix remote numa hits statistics
  2017-01-02 14:16                   ` Vlastimil Babka
@ 2017-01-02 14:46                     ` Michal Hocko
  2017-01-02 15:07                       ` Vlastimil Babka
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2017-01-02 14:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner

On Mon 02-01-17 15:16:23, Vlastimil Babka wrote:
> On 12/21/2016 09:06 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
> > in zone_statistics") has an unintentional side effect that remote node
> > allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
> > NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
> > many of these potentially because the flag is used very rarely while
> > we have many users of __alloc_pages_node.
> > 
> > Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
> > follow up patch) and treat all allocations that were satisfied from the
> > preferred zone's node as NUMA_HITS because this is the same node we
> > requested the allocation from in most cases. If this is not the local
> > node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
> > 
> > One downsize would be that an allocation request for a node which is
> > outside of the mempolicy nodemask would be reported as a hit which is a
> > bit weird but that was the case before b9f00e147f27 already.
> > 
> > Reported-by: Jia He <hejianet@gmail.com>
> > Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> cbmc tells me that this patch is not equal to pre-commit b9f00e147f27
> (in situation where __GFP_OTHER_NODE is not passed, as that's the only
> relevant scenario after your patch), which seems unintended.
> 
> counter example:
> numa_node_id() == 1
> preferred_zone on node 0
> allocated from zone on node 1
> 
> pre-b9f00e147f27:
> allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL
> preferred zone (node 0) increased NUMA_FOREIGN
> 
> (that looks correct to me)
> 
> your patch:
> allocated zone (node 1) increased NUMA_MISS
> preferred zone (node 0) increased NUMA_FOREIGN
> 
> i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was
> an allocation local to the CPU (albeit a MISS wrt the preferred node).

I guess the following should fix that, right?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 647e940e6921..ea60dc06d280 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2587,17 +2587,18 @@ int __isolate_free_page(struct page *page, unsigned int order)
 static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 {
 #ifdef CONFIG_NUMA
-	if (z->node == preferred_zone->node) {
-		enum zone_stat_item local_stat = NUMA_LOCAL;
+	enum zone_stat_item local_stat = NUMA_LOCAL;
 
+	if (z->node != numa_node_id())
+		local_stat = NUMA_OTHER;
+
+	if (z->node == preferred_zone->node)
 		__inc_zone_state(z, NUMA_HIT);
-		if (z->node != numa_node_id())
-			local_stat = NUMA_OTHER;
-		__inc_zone_state(z, local_stat);
-	} else {
+	else {
 		__inc_zone_state(z, NUMA_MISS);
 		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
 	}
+	__inc_zone_state(z, local_stat);
 #endif
 }
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] mm: fix remote numa hits statistics
  2017-01-02 14:46                     ` Michal Hocko
@ 2017-01-02 15:07                       ` Vlastimil Babka
  0 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2017-01-02 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Mel Gorman, Andrew Morton, Joonsoo Kim, Jia He,
	Taku Izumi, Johannes Weiner

On 01/02/2017 03:46 PM, Michal Hocko wrote:
> On Mon 02-01-17 15:16:23, Vlastimil Babka wrote:
>> On 12/21/2016 09:06 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> Jia He has noticed that b9f00e147f27 ("mm, page_alloc: reduce branches
>>> in zone_statistics") has an unintentional side effect that remote node
>>> allocation requests are accounted as NUMA_MISS rathat than NUMA_HIT and
>>> NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE. There are
>>> many of these potentially because the flag is used very rarely while
>>> we have many users of __alloc_pages_node.
>>>
>>> Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
>>> follow up patch) and treat all allocations that were satisfied from the
>>> preferred zone's node as NUMA_HITS because this is the same node we
>>> requested the allocation from in most cases. If this is not the local
>>> node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.
>>>
>>> One downsize would be that an allocation request for a node which is
>>> outside of the mempolicy nodemask would be reported as a hit which is a
>>> bit weird but that was the case before b9f00e147f27 already.
>>>
>>> Reported-by: Jia He <hejianet@gmail.com>
>>> Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> cbmc tells me that this patch is not equal to pre-commit b9f00e147f27
>> (in situation where __GFP_OTHER_NODE is not passed, as that's the only
>> relevant scenario after your patch), which seems unintended.
>>
>> counter example:
>> numa_node_id() == 1
>> preferred_zone on node 0
>> allocated from zone on node 1
>>
>> pre-b9f00e147f27:
>> allocated zone (node 1) increased NUMA_MISS and NUMA_LOCAL
>> preferred zone (node 0) increased NUMA_FOREIGN
>>
>> (that looks correct to me)
>>
>> your patch:
>> allocated zone (node 1) increased NUMA_MISS
>> preferred zone (node 0) increased NUMA_FOREIGN
>>
>> i.e. it's missing NUMA_LOCAL on node 1, which is IMHO wrong as this was
>> an allocation local to the CPU (albeit a MISS wrt the preferred node).
> 
> I guess the following should fix that, right?

Yes, it does.

With that, you can add:

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

(which is a lie, since the computer did that ;)

> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 647e940e6921..ea60dc06d280 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2587,17 +2587,18 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
>  {
>  #ifdef CONFIG_NUMA
> -	if (z->node == preferred_zone->node) {
> -		enum zone_stat_item local_stat = NUMA_LOCAL;
> +	enum zone_stat_item local_stat = NUMA_LOCAL;
>  
> +	if (z->node != numa_node_id())
> +		local_stat = NUMA_OTHER;
> +
> +	if (z->node == preferred_zone->node)
>  		__inc_zone_state(z, NUMA_HIT);
> -		if (z->node != numa_node_id())
> -			local_stat = NUMA_OTHER;
> -		__inc_zone_state(z, local_stat);
> -	} else {
> +	else {
>  		__inc_zone_state(z, NUMA_MISS);
>  		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
>  	}
> +	__inc_zone_state(z, local_stat);
>  #endif
>  }
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-01-02 15:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-12  5:59 [PATCH RFC 0/1] mm, page_alloc: fix incorrect zone_statistics data Jia He
2016-12-12  5:59 ` [PATCH RFC 1/1] " Jia He
2016-12-20  9:18   ` Michal Hocko
2016-12-20 13:10     ` Mel Gorman
2016-12-20 13:26       ` Michal Hocko
2016-12-20 14:28         ` Mel Gorman
2016-12-20 14:35           ` Michal Hocko
2016-12-20 14:49             ` Vlastimil Babka
2016-12-20 14:54             ` Mel Gorman
2016-12-21  7:57               ` Michal Hocko
2016-12-21  8:06                 ` [PATCH 1/2] mm: fix remote numa hits statistics Michal Hocko
2016-12-21  8:06                   ` [PATCH 2/2] mm: get rid of __GFP_OTHER_NODE Michal Hocko
2017-01-02 14:18                     ` Vlastimil Babka
2016-12-29 11:46                   ` [PATCH 1/2] mm: fix remote numa hits statistics Mel Gorman
2016-12-29 12:28                     ` Michal Hocko
2017-01-02 14:16                   ` Vlastimil Babka
2017-01-02 14:46                     ` Michal Hocko
2017-01-02 15:07                       ` Vlastimil Babka
2016-12-20 14:42           ` [PATCH RFC 1/1] mm, page_alloc: fix incorrect zone_statistics data Mel Gorman
2016-12-20 15:13     ` Vlastimil Babka
2016-12-21  3:01     ` hejianet
2016-12-20 12:31   ` Mel Gorman
2016-12-21  3:07     ` hejianet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).