linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: fix pgalloc_stall on unpopulated zone
@ 2016-07-13  2:24 Minchan Kim
  2016-07-13  9:25 ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Minchan Kim @ 2016-07-13  2:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-kernel, linux-mm, Minchan Kim

If we use sc->reclaim_idx for accounting pgstall, it can increase
the count on unpopulated zone, for example, movable zone(but
my system doesn't have movable zone) if allocation request were
GFP_HIGHUSER_MOVABLE. It doesn't make no sense.

This patch fixes it so that it can account it on first populated
zone at or below highest_zoneidx of the request.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 fs/buffer.c          | 2 +-
 include/linux/swap.h | 3 ++-
 mm/page_alloc.c      | 3 ++-
 mm/vmscan.c          | 5 +++--
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 46b3568..69841f4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -268,7 +268,7 @@ static void free_more_memory(void)
 						gfp_zone(GFP_NOFS), NULL);
 		if (z->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-						GFP_NOFS, NULL);
+					GFP_NOFS, NULL, gfp_zone(GFP_NOFS));
 	}
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cc753c6..935f7e1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -309,7 +309,8 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 /* linux/mm/vmscan.c */
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-					gfp_t gfp_mask, nodemask_t *mask);
+					gfp_t gfp_mask, nodemask_t *mask,
+					enum zone_type classzone_idx);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80c9b9a..5f20d4b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3305,7 +3305,8 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	current->reclaim_state = &reclaim_state;
 
 	progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
-								ac->nodemask);
+				ac->nodemask,
+				zonelist_zone_idx(ac->preferred_zoneref));
 
 	current->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c538a8c..1f91e2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2855,13 +2855,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				enum zone_type classzone_idx)
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = classzone_idx,
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix pgalloc_stall on unpopulated zone
  2016-07-13  2:24 [PATCH] mm: fix pgalloc_stall on unpopulated zone Minchan Kim
@ 2016-07-13  9:25 ` Mel Gorman
  2016-07-14  1:11   ` Minchan Kim
  0 siblings, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2016-07-13  9:25 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, Jul 13, 2016 at 11:24:13AM +0900, Minchan Kim wrote:
> If we use sc->reclaim_idx for accounting pgstall, it can increase
> the count on unpopulated zone, for example, movable zone(but
> my system doesn't have movable zone) if allocation request were
> GFP_HIGHUSER_MOVABLE. It doesn't make no sense.
> 

I wanted to track the highest zone allowed by each allocation regardless
of what the zone population state was. Otherwise, consider the following
on a NUMA system

1. An allocation request arrives for GFP_HIGHUSER_MOVABLE that stalls
2. System has two nodes, node 0 with ZONE_NORMAL, node 1 with ZONE_HIGHMEM
3. If the allocating process is on node 0, the stall is accounted on ZONE_NORMAL
4. If the allocatinn process is on node 1, the stall is accounted on ZONE_HIGHMEM

Multiple runs of the same workload on the same machine will see stall
statistics on different zones and renders the stat useless. This is
difficult to analyse because stalls accounted for on ZONE_NORMAL may or
may not be zone-constrained allocations.

The patch means that the vmstat accounting and tracepoint data is also
out of sync. One thing I wanted to be able to do was

1. Observe that there are alloc stalls on DMA32 or some other low zone
2. Activate mm_vmscan_direct_reclaim_begin, filter on classzone_idx ==
   DMA32 and identify the source of the lowmem allocations

If your patch is applied, I cannot depend on the stall stats any more
and the tracepoint is required to determine if there really any
zone-contrained allocations. It can be *inferred* from the skip stats
but only if such skips occurred and that is not guaranteed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix pgalloc_stall on unpopulated zone
  2016-07-13  9:25 ` Mel Gorman
@ 2016-07-14  1:11   ` Minchan Kim
  2016-07-14  8:51     ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Minchan Kim @ 2016-07-14  1:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, Jul 13, 2016 at 10:25:04AM +0100, Mel Gorman wrote:
> On Wed, Jul 13, 2016 at 11:24:13AM +0900, Minchan Kim wrote:
> > If we use sc->reclaim_idx for accounting pgstall, it can increase
> > the count on unpopulated zone, for example, movable zone(but
> > my system doesn't have movable zone) if allocation request were
> > GFP_HIGHUSER_MOVABLE. It doesn't make no sense.
> > 
> 
> I wanted to track the highest zone allowed by each allocation regardless
> of what the zone population state was. Otherwise, consider the following
> on a NUMA system
> 
> 1. An allocation request arrives for GFP_HIGHUSER_MOVABLE that stalls
> 2. System has two nodes, node 0 with ZONE_NORMAL, node 1 with ZONE_HIGHMEM
> 3. If the allocating process is on node 0, the stall is accounted on ZONE_NORMAL
> 4. If the allocatinn process is on node 1, the stall is accounted on ZONE_HIGHMEM
> 
> Multiple runs of the same workload on the same machine will see stall
> statistics on different zones and renders the stat useless. This is
> difficult to analyse because stalls accounted for on ZONE_NORMAL may or
> may not be zone-constrained allocations.

Fair enough.
For the allocstall, it would be better to show the requested zone.

> 
> The patch means that the vmstat accounting and tracepoint data is also
> out of sync. One thing I wanted to be able to do was
> 
> 1. Observe that there are alloc stalls on DMA32 or some other low zone
> 2. Activate mm_vmscan_direct_reclaim_begin, filter on classzone_idx ==
>    DMA32 and identify the source of the lowmem allocations
> 
> If your patch is applied, I cannot depend on the stall stats any more
> and the tracepoint is required to determine if there really any
> zone-contrained allocations. It can be *inferred* from the skip stats
> but only if such skips occurred and that is not guaranteed.

Just a nit:

Hmm, can't we omit classzone_idx in mm_vm_scan_direct_begin_template?
Because every functions already have gfp_flags so that we can classzone_idx
via gfp_zone(gfp_flags) without passing it.


> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix pgalloc_stall on unpopulated zone
  2016-07-14  1:11   ` Minchan Kim
@ 2016-07-14  8:51     ` Mel Gorman
  2016-07-14  8:58       ` Minchan Kim
  0 siblings, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2016-07-14  8:51 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

On Thu, Jul 14, 2016 at 10:11:19AM +0900, Minchan Kim wrote:
> > The patch means that the vmstat accounting and tracepoint data is also
> > out of sync. One thing I wanted to be able to do was
> > 
> > 1. Observe that there are alloc stalls on DMA32 or some other low zone
> > 2. Activate mm_vmscan_direct_reclaim_begin, filter on classzone_idx ==
> >    DMA32 and identify the source of the lowmem allocations
> > 
> > If your patch is applied, I cannot depend on the stall stats any more
> > and the tracepoint is required to determine if there really any
> > zone-contrained allocations. It can be *inferred* from the skip stats
> > but only if such skips occurred and that is not guaranteed.
> 
> Just a nit:
> 
> Hmm, can't we omit classzone_idx in mm_vm_scan_direct_begin_template?
> Because every functions already have gfp_flags so that we can classzone_idx
> via gfp_zone(gfp_flags) without passing it.
> 

We could but it's potentially wrong. classzone_idx *should* be derived
from the gfp_flags but it's possible a bug would lead it to be another
value. The saving from passing it in is marginal at best.

If it's omitted from the tracepoint itself, there is a small amount of
disk saving which is potentially significant if there is a lot of direct
reclaim. Unfortunately, it also makes it harder to filter that
tracepoint because the filter rules must be an implementation of
gfp_zone.

Right now I believe the saving is marginal and the cost of potentially
using the wrong information or making the filtering harder offsets that
marginal saving.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mm: fix pgalloc_stall on unpopulated zone
  2016-07-14  8:51     ` Mel Gorman
@ 2016-07-14  8:58       ` Minchan Kim
  0 siblings, 0 replies; 5+ messages in thread
From: Minchan Kim @ 2016-07-14  8:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm

On Thu, Jul 14, 2016 at 09:51:53AM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2016 at 10:11:19AM +0900, Minchan Kim wrote:
> > > The patch means that the vmstat accounting and tracepoint data is also
> > > out of sync. One thing I wanted to be able to do was
> > > 
> > > 1. Observe that there are alloc stalls on DMA32 or some other low zone
> > > 2. Activate mm_vmscan_direct_reclaim_begin, filter on classzone_idx ==
> > >    DMA32 and identify the source of the lowmem allocations
> > > 
> > > If your patch is applied, I cannot depend on the stall stats any more
> > > and the tracepoint is required to determine if there really any
> > > zone-contrained allocations. It can be *inferred* from the skip stats
> > > but only if such skips occurred and that is not guaranteed.
> > 
> > Just a nit:
> > 
> > Hmm, can't we omit classzone_idx in mm_vm_scan_direct_begin_template?
> > Because every functions already have gfp_flags so that we can classzone_idx
> > via gfp_zone(gfp_flags) without passing it.
> > 
> 
> We could but it's potentially wrong. classzone_idx *should* be derived
> from the gfp_flags but it's possible a bug would lead it to be another
> value. The saving from passing it in is marginal at best.
> 
> If it's omitted from the tracepoint itself, there is a small amount of
> disk saving which is potentially significant if there is a lot of direct
> reclaim. Unfortunately, it also makes it harder to filter that
> tracepoint because the filter rules must be an implementation of
> gfp_zone.
> 
> Right now I believe the saving is marginal and the cost of potentially
> using the wrong information or making the filtering harder offsets that
> marginal saving.

Agreed.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-07-14  8:58 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-13  2:24 [PATCH] mm: fix pgalloc_stall on unpopulated zone Minchan Kim
2016-07-13  9:25 ` Mel Gorman
2016-07-14  1:11   ` Minchan Kim
2016-07-14  8:51     ` Mel Gorman
2016-07-14  8:58       ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).