[PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
@ 2015-03-27 19:28 Nishanth Aravamudan
  2015-03-27 19:39 ` Nishanth Aravamudan
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-03-27 19:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: anton, linuxppc-dev, linux-mm, linux-kernel, Andrew Morton,
	Johannes Weiner, Michal Hocko, Rik van Riel, Dan Streetman

Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
reserves if node has no ZONE_NORMAL") from Mel.

We have a system with the following topology:

(0) root @ br30p03: /root
# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3 
  0:  10  20  20 
  2:  20  10  20 
  3:  20  20  10 

Node 2 has no free memory, because:

# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 
1

This leads to the following zoneinfo:

Node 2, zone      DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages:

echo 37 > /proc/sys/vm/nr_hugepages

The echo enver returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling
wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
pfmemalloc_watermark_ok() in turn checks all zones on the node and see
if the there are any reserves, and if so, then indicates the watermarks
are ok, by seeing if there are sufficient free pages.

675becce15 added a condition already for memoryless nodes. In this case,
though, the node has memory, it is just all consumed (and not
recliamable). Effectively, though, the result is the same on this
call to pfmemalloc_watermark_ok() and thus seems like a reasonable
additional condition.

With this change, the afore-mentioned 16M hugepage allocation succeeds
and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcd90c8..033c2b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)

        for (i = 0; i <= ZONE_NORMAL; i++) {
                zone = &pgdat->node_zones[i];
-               if (!populated_zone(zone))
+               if (!populated_zone(zone) || !zone_reclaimable(zone))
                        continue;

                pfmemalloc_reserve += min_wmark_pages(zone);

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
  2015-03-27 19:28 [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones Nishanth Aravamudan
@ 2015-03-27 19:39 ` Nishanth Aravamudan
  2015-03-27 19:58 ` Dan Streetman
  2015-03-27 20:17 ` Dave Hansen
  2 siblings, 0 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-03-27 19:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: anton, linuxppc-dev, linux-mm, linux-kernel, Andrew Morton,
	Johannes Weiner, Michal Hocko, Rik van Riel, Dan Streetman

[ Sorry, typo'd anton's address ]

On 27.03.2015 [12:28:50 -0700], Nishanth Aravamudan wrote:
> Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> reserves if node has no ZONE_NORMAL") from Mel.
> 
> We have a system with the following topology:
> 
> (0) root @ br30p03: /root
> # numactl -H
> available: 3 nodes (0,2-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> 23 24 25 26 27 28 29 30 31
> node 0 size: 28273 MB
> node 0 free: 27323 MB
> node 2 cpus:
> node 2 size: 16384 MB
> node 2 free: 0 MB
> node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 3 size: 30533 MB
> node 3 free: 13273 MB
> node distances:
> node   0   2   3 
>   0:  10  20  20 
>   2:  20  10  20 
>   3:  20  20  10 
> 
> Node 2 has no free memory, because:
> 
> # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages 
> 1
> 
> This leads to the following zoneinfo:
> 
> Node 2, zone      DMA
>   pages free     0
>         min      1840
>         low      2300
>         high     2760
>         scanned  0
>         spanned  262144
>         present  262144
>         managed  262144
> ...
>   all_unreclaimable: 1
> 
> If one then attempts to allocate some normal 16M hugepages:
> 
> echo 37 > /proc/sys/vm/nr_hugepages
> 
> The echo enver returns and kswapd2 consumes CPU cycles.
> 
> This is because throttle_direct_reclaim ends up calling
> wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> pfmemalloc_watermark_ok() in turn checks all zones on the node and see
> if the there are any reserves, and if so, then indicates the watermarks
> are ok, by seeing if there are sufficient free pages.
> 
> 675becce15 added a condition already for memoryless nodes. In this case,
> though, the node has memory, it is just all consumed (and not
> recliamable). Effectively, though, the result is the same on this
> call to pfmemalloc_watermark_ok() and thus seems like a reasonable
> additional condition.
> 
> With this change, the afore-mentioned 16M hugepage allocation succeeds
> and correctly round-robins between Nodes 1 and 3.
> 
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dcd90c8..033c2b7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
> -               if (!populated_zone(zone))
> +               if (!populated_zone(zone) || !zone_reclaimable(zone))
>                         continue;
>  
>                 pfmemalloc_reserve += min_wmark_pages(zone);
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
  2015-03-27 19:28 [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones Nishanth Aravamudan
  2015-03-27 19:39 ` Nishanth Aravamudan
@ 2015-03-27 19:58 ` Dan Streetman
  2015-03-27 20:17 ` Dave Hansen
  2 siblings, 0 replies; 15+ messages in thread
From: Dan Streetman @ 2015-03-27 19:58 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Mel Gorman, anton, linuxppc-dev, Linux-MM, linux-kernel,
	Andrew Morton, Johannes Weiner, Michal Hocko, Rik van Riel

On Fri, Mar 27, 2015 at 3:28 PM, Nishanth Aravamudan
<nacc@linux.vnet.ibm.com> wrote:
> Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> reserves if node has no ZONE_NORMAL") from Mel.
>
> We have a system with the following topology:
>
> (0) root @ br30p03: /root
> # numactl -H
> available: 3 nodes (0,2-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> 23 24 25 26 27 28 29 30 31
> node 0 size: 28273 MB
> node 0 free: 27323 MB
> node 2 cpus:
> node 2 size: 16384 MB
> node 2 free: 0 MB
> node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 3 size: 30533 MB
> node 3 free: 13273 MB
> node distances:
> node   0   2   3
>   0:  10  20  20
>   2:  20  10  20
>   3:  20  20  10
>
> Node 2 has no free memory, because:
>
> # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> 1
>
> This leads to the following zoneinfo:
>
> Node 2, zone      DMA
>   pages free     0
>         min      1840
>         low      2300
>         high     2760
>         scanned  0
>         spanned  262144
>         present  262144
>         managed  262144
> ...
>   all_unreclaimable: 1
>
> If one then attempts to allocate some normal 16M hugepages:
>
> echo 37 > /proc/sys/vm/nr_hugepages
>
> The echo enver returns and kswapd2 consumes CPU cycles.
>
> This is because throttle_direct_reclaim ends up calling
> wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> pfmemalloc_watermark_ok() in turn checks all zones on the node and see
> if the there are any reserves, and if so, then indicates the watermarks
> are ok, by seeing if there are sufficient free pages.
>
> 675becce15 added a condition already for memoryless nodes. In this case,
> though, the node has memory, it is just all consumed (and not
> recliamable). Effectively, though, the result is the same on this
> call to pfmemalloc_watermark_ok() and thus seems like a reasonable
> additional condition.
>
> With this change, the afore-mentioned 16M hugepage allocation succeeds
> and correctly round-robins between Nodes 1 and 3.
>
> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Reviewed-by: Dan Streetman <ddstreet@ieee.org>

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dcd90c8..033c2b7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
> -               if (!populated_zone(zone))
> +               if (!populated_zone(zone) || !zone_reclaimable(zone))
>                         continue;
>
>                 pfmemalloc_reserve += min_wmark_pages(zone);
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones
  2015-03-27 19:28 [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones Nishanth Aravamudan
  2015-03-27 19:39 ` Nishanth Aravamudan
  2015-03-27 19:58 ` Dan Streetman
@ 2015-03-27 20:17 ` Dave Hansen
  2015-03-27 22:23   ` [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages Nishanth Aravamudan
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2015-03-27 20:17 UTC (permalink / raw)
  To: Nishanth Aravamudan, Mel Gorman
  Cc: anton, linuxppc-dev, linux-mm, linux-kernel, Andrew Morton,
	Johannes Weiner, Michal Hocko, Rik van Riel, Dan Streetman

On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
> -               if (!populated_zone(zone))
> +               if (!populated_zone(zone) || !zone_reclaimable(zone))
>                         continue;
>  
>                 pfmemalloc_reserve += min_wmark_pages(zone);

Do you really want zone_reclaimable()?  Or do you want something more
direct like "zone_reclaimable_pages(zone) == 0"?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-03-27 20:17 ` Dave Hansen
@ 2015-03-27 22:23   ` Nishanth Aravamudan
  2015-03-31  9:48     ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-03-27 22:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, anton, linuxppc-dev, linux-mm, linux-kernel,
	Andrew Morton, Johannes Weiner, Michal Hocko, Rik van Riel,
	Dan Streetman

On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> > @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >  
> >         for (i = 0; i <= ZONE_NORMAL; i++) {
> >                 zone = &pgdat->node_zones[i];
> > -               if (!populated_zone(zone))
> > +               if (!populated_zone(zone) || !zone_reclaimable(zone))
> >                         continue;
> >  
> >                 pfmemalloc_reserve += min_wmark_pages(zone);
> 
> Do you really want zone_reclaimable()?  Or do you want something more
> direct like "zone_reclaimable_pages(zone) == 0"?

Yeah, I guess in my testing this worked out to be the same, since
zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
always be false. Thanks!

Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
reserves if node has no ZONE_NORMAL") from Mel.

We have a system with the following topology:

# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:
# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone      DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages via

echo 37 > /proc/sys/vm/nr_hugepages

The echo never returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling
wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
pfmemalloc_watermark_ok() in turn checks all zones on the node if there
are any reserves, and if so, then indicates the watermarks are ok, by
seeing if there are sufficient free pages.

675becce15 added a condition already for memoryless nodes. In this case,
though, the node has memory, it is just all consumed (and not
reclaimable). Effectively, though, the result is the same on this call
to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
condition.

With this change, the afore-mentioned 16M hugepage allocation attempt
succeeds and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

---
v1 -> v2:
  Check against zone_reclaimable_pages, rather zone_reclaimable, based
  upon feedback from Dave Hansen.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..c627fa4c991f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) ||
+		    zone_reclaimable_pages(zone) == 0)
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-03-27 22:23   ` [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages Nishanth Aravamudan
@ 2015-03-31  9:48     ` Michal Hocko
  2015-04-03  7:57       ` Vlastimil Babka
  2015-04-03 17:43       ` Nishanth Aravamudan
  0 siblings, 2 replies; 15+ messages in thread
From: Michal Hocko @ 2015-03-31  9:48 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Dave Hansen, Mel Gorman, anton, linuxppc-dev, linux-mm,
	linux-kernel, Andrew Morton, Johannes Weiner, Rik van Riel,
	Dan Streetman

On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> > On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> > > @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > >  
> > >         for (i = 0; i <= ZONE_NORMAL; i++) {
> > >                 zone = &pgdat->node_zones[i];
> > > -               if (!populated_zone(zone))
> > > +               if (!populated_zone(zone) || !zone_reclaimable(zone))
> > >                         continue;
> > >  
> > >                 pfmemalloc_reserve += min_wmark_pages(zone);
> > 
> > Do you really want zone_reclaimable()?  Or do you want something more
> > direct like "zone_reclaimable_pages(zone) == 0"?
> 
> Yeah, I guess in my testing this worked out to be the same, since
> zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> always be false. Thanks!
> 
> Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> reserves if node has no ZONE_NORMAL") from Mel.
> 
> We have a system with the following topology:
> 
> # numactl -H
> available: 3 nodes (0,2-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> 23 24 25 26 27 28 29 30 31
> node 0 size: 28273 MB
> node 0 free: 27323 MB
> node 2 cpus:
> node 2 size: 16384 MB
> node 2 free: 0 MB
> node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> node 3 size: 30533 MB
> node 3 free: 13273 MB
> node distances:
> node   0   2   3
>   0:  10  20  20
>   2:  20  10  20
>   3:  20  20  10
> 
> Node 2 has no free memory, because:
> # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> 1
> 
> This leads to the following zoneinfo:
> 
> Node 2, zone      DMA
>   pages free     0
>         min      1840
>         low      2300
>         high     2760
>         scanned  0
>         spanned  262144
>         present  262144
>         managed  262144
> ...
>   all_unreclaimable: 1

Blee, this is a weird configuration.

> If one then attempts to allocate some normal 16M hugepages via
> 
> echo 37 > /proc/sys/vm/nr_hugepages
> 
> The echo never returns and kswapd2 consumes CPU cycles.
> 
> This is because throttle_direct_reclaim ends up calling
> wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> are any reserves, and if so, then indicates the watermarks are ok, by
> seeing if there are sufficient free pages.
> 
> 675becce15 added a condition already for memoryless nodes. In this case,
> though, the node has memory, it is just all consumed (and not
> reclaimable). Effectively, though, the result is the same on this call
> to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> condition.
> 
> With this change, the afore-mentioned 16M hugepage allocation attempt
> succeeds and correctly round-robins between Nodes 1 and 3.

I am just wondering whether this is the right/complete fix. Don't we
need a similar treatment at more places?
I would expect kswapd would be looping endlessly because the zone
wouldn't be balanced obviously. But I would be wrong... because
pgdat_balanced is doing this:
		/*
		 * A special case here:
		 *
		 * balance_pgdat() skips over all_unreclaimable after
		 * DEF_PRIORITY. Effectively, it considers them balanced so
		 * they must be considered balanced here as well!
		 */
		if (!zone_reclaimable(zone)) {
			balanced_pages += zone->managed_pages;
			continue;
		}

and zone_reclaimable is false for you as you didn't have any
zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
zones right?) and so the kswapd would be woken up easily. So it looks
like a mess.

There are possibly other places which rely on populated_zone or
for_each_populated_zone without checking reclaimability. Are those
working as expected?

That being said. I am not objecting to this patch. I am just trying to
wrap my head around possible issues from such a weird configuration and
all the consequences.

> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

The patch as is doesn't seem to be harmful.

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> v1 -> v2:
>   Check against zone_reclaimable_pages, rather zone_reclaimable, based
>   upon feedback from Dave Hansen.

Dunno, but shouldn't we use the same thing here and in pgdat_balanced?
zone_reclaimable_pages seems to be used only from zone_reclaimable().

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd71bac..c627fa4c991f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!populated_zone(zone))
> +		if (!populated_zone(zone) ||
> +		    zone_reclaimable_pages(zone) == 0)
>  			continue;
>  
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-03-31  9:48     ` Michal Hocko
@ 2015-04-03  7:57       ` Vlastimil Babka
  2015-04-03 17:45         ` Nishanth Aravamudan
  2015-04-03 17:43       ` Nishanth Aravamudan
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2015-04-03  7:57 UTC (permalink / raw)
  To: Michal Hocko, Nishanth Aravamudan
  Cc: Dave Hansen, Mel Gorman, anton, linuxppc-dev, linux-mm,
	linux-kernel, Andrew Morton, Johannes Weiner, Rik van Riel,
	Dan Streetman

On 03/31/2015 11:48 AM, Michal Hocko wrote:
> On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
>> On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
>>> On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
>>>> @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>>>>
>>>>          for (i = 0; i <= ZONE_NORMAL; i++) {
>>>>                  zone = &pgdat->node_zones[i];
>>>> -               if (!populated_zone(zone))
>>>> +               if (!populated_zone(zone) || !zone_reclaimable(zone))
>>>>                          continue;
>>>>
>>>>                  pfmemalloc_reserve += min_wmark_pages(zone);
>>>
>>> Do you really want zone_reclaimable()?  Or do you want something more
>>> direct like "zone_reclaimable_pages(zone) == 0"?
>>
>> Yeah, I guess in my testing this worked out to be the same, since
>> zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
>> always be false. Thanks!
>>
>> Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
>> reserves if node has no ZONE_NORMAL") from Mel.
>>
>> We have a system with the following topology:
>>
>> # numactl -H
>> available: 3 nodes (0,2-3)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
>> 23 24 25 26 27 28 29 30 31
>> node 0 size: 28273 MB
>> node 0 free: 27323 MB
>> node 2 cpus:
>> node 2 size: 16384 MB
>> node 2 free: 0 MB
>> node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 3 size: 30533 MB
>> node 3 free: 13273 MB
>> node distances:
>> node   0   2   3
>>    0:  10  20  20
>>    2:  20  10  20
>>    3:  20  20  10
>>
>> Node 2 has no free memory, because:
>> # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
>> 1
>>
>> This leads to the following zoneinfo:
>>
>> Node 2, zone      DMA
>>    pages free     0
>>          min      1840
>>          low      2300
>>          high     2760
>>          scanned  0
>>          spanned  262144
>>          present  262144
>>          managed  262144
>> ...
>>    all_unreclaimable: 1
>
> Blee, this is a weird configuration.
>
>> If one then attempts to allocate some normal 16M hugepages via
>>
>> echo 37 > /proc/sys/vm/nr_hugepages
>>
>> The echo never returns and kswapd2 consumes CPU cycles.
>>
>> This is because throttle_direct_reclaim ends up calling
>> wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
>> pfmemalloc_watermark_ok() in turn checks all zones on the node if there
>> are any reserves, and if so, then indicates the watermarks are ok, by
>> seeing if there are sufficient free pages.
>>
>> 675becce15 added a condition already for memoryless nodes. In this case,
>> though, the node has memory, it is just all consumed (and not
>> reclaimable). Effectively, though, the result is the same on this call
>> to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
>> condition.
>>
>> With this change, the afore-mentioned 16M hugepage allocation attempt
>> succeeds and correctly round-robins between Nodes 1 and 3.
>
> I am just wondering whether this is the right/complete fix. Don't we
> need a similar treatment at more places?
> I would expect kswapd would be looping endlessly because the zone
> wouldn't be balanced obviously. But I would be wrong... because
> pgdat_balanced is doing this:
> 		/*
> 		 * A special case here:
> 		 *
> 		 * balance_pgdat() skips over all_unreclaimable after
> 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> 		 * they must be considered balanced here as well!
> 		 */
> 		if (!zone_reclaimable(zone)) {
> 			balanced_pages += zone->managed_pages;
> 			continue;
> 		}
>
> and zone_reclaimable is false for you as you didn't have any
> zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> zones right?) and so the kswapd would be woken up easily. So it looks
> like a mess.

Yeah, looks like a much cleaner/complete solution would be to remove 
such zones from zonelists. But that means covering all situations when 
these hugepages are allocated/removed and the approach then looks 
similar to memory hotplug.
Also I'm not sure if the ability to actually allocate the reserved 
hugepage would be impossible due to not being reachable by a zonelist...

> There are possibly other places which rely on populated_zone or
> for_each_populated_zone without checking reclaimability. Are those
> working as expected?

Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point in 
waking it up just to let it immediately go to sleep again.

> That being said. I am not objecting to this patch. I am just trying to
> wrap my head around possible issues from such a weird configuration and
> all the consequences.
>
>> Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
>
> The patch as is doesn't seem to be harmful.
>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
>
>> ---
>> v1 -> v2:
>>    Check against zone_reclaimable_pages, rather zone_reclaimable, based
>>    upon feedback from Dave Hansen.
>
> Dunno, but shouldn't we use the same thing here and in pgdat_balanced?
> zone_reclaimable_pages seems to be used only from zone_reclaimable().

pgdat_balanced() has a different goal than pfmemalloc_watermark_ok() and 
needs to match what balance_pgdat() does, which includes considering 
NR_PAGES_SCANNED through zone_reclaimable(). For the situation 
considered in this patch, result of zone_reclaimable() will match the 
test zone_reclaimable_pages(zone) == 0, so it is fine I think.

What I find somewhat worrying though is that we could potentially break 
the pfmemalloc_watermark_ok() test in situations where 
zone_reclaimable_pages(zone) == 0 is a transient situation (and not a 
permanently allocated hugepage). In that case, the throttling is 
supposed to help system recover, and we might be breaking that ability 
with this patch, no?

>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 5e8eadd71bac..c627fa4c991f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>>
>>   	for (i = 0; i <= ZONE_NORMAL; i++) {
>>   		zone = &pgdat->node_zones[i];
>> -		if (!populated_zone(zone))
>> +		if (!populated_zone(zone) ||
>> +		    zone_reclaimable_pages(zone) == 0)
>>   			continue;
>>
>>   		pfmemalloc_reserve += min_wmark_pages(zone);
>>
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-03-31  9:48     ` Michal Hocko
  2015-04-03  7:57       ` Vlastimil Babka
@ 2015-04-03 17:43       ` Nishanth Aravamudan
  2015-04-03 18:24         ` Michal Hocko
  1 sibling, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-04-03 17:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Mel Gorman, anton, linuxppc-dev, linux-mm,
	linux-kernel, Andrew Morton, Johannes Weiner, Rik van Riel,
	Dan Streetman

On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
> On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> > On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> > > On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> > > > @@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > > >  
> > > >         for (i = 0; i <= ZONE_NORMAL; i++) {
> > > >                 zone = &pgdat->node_zones[i];
> > > > -               if (!populated_zone(zone))
> > > > +               if (!populated_zone(zone) || !zone_reclaimable(zone))
> > > >                         continue;
> > > >  
> > > >                 pfmemalloc_reserve += min_wmark_pages(zone);
> > > 
> > > Do you really want zone_reclaimable()?  Or do you want something more
> > > direct like "zone_reclaimable_pages(zone) == 0"?
> > 
> > Yeah, I guess in my testing this worked out to be the same, since
> > zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> > always be false. Thanks!
> > 
> > Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> > reserves if node has no ZONE_NORMAL") from Mel.
> > 
> > We have a system with the following topology:
> > 
> > # numactl -H
> > available: 3 nodes (0,2-3)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > 23 24 25 26 27 28 29 30 31
> > node 0 size: 28273 MB
> > node 0 free: 27323 MB
> > node 2 cpus:
> > node 2 size: 16384 MB
> > node 2 free: 0 MB
> > node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> > node 3 size: 30533 MB
> > node 3 free: 13273 MB
> > node distances:
> > node   0   2   3
> >   0:  10  20  20
> >   2:  20  10  20
> >   3:  20  20  10
> > 
> > Node 2 has no free memory, because:
> > # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> > 1
> > 
> > This leads to the following zoneinfo:
> > 
> > Node 2, zone      DMA
> >   pages free     0
> >         min      1840
> >         low      2300
> >         high     2760
> >         scanned  0
> >         spanned  262144
> >         present  262144
> >         managed  262144
> > ...
> >   all_unreclaimable: 1
> 
> Blee, this is a weird configuration.

Yep, super gross. It's relatively rare in the field, thankfully. But 16G
pages definitely make it pretty likely to hit (as in, I've seen it once
before :)
 
> > If one then attempts to allocate some normal 16M hugepages via
> > 
> > echo 37 > /proc/sys/vm/nr_hugepages
> > 
> > The echo never returns and kswapd2 consumes CPU cycles.
> > 
> > This is because throttle_direct_reclaim ends up calling
> > wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> > pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> > are any reserves, and if so, then indicates the watermarks are ok, by
> > seeing if there are sufficient free pages.
> > 
> > 675becce15 added a condition already for memoryless nodes. In this case,
> > though, the node has memory, it is just all consumed (and not
> > reclaimable). Effectively, though, the result is the same on this call
> > to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> > condition.
> > 
> > With this change, the afore-mentioned 16M hugepage allocation attempt
> > succeeds and correctly round-robins between Nodes 1 and 3.
> 
> I am just wondering whether this is the right/complete fix. Don't we
> need a similar treatment at more places?

Almost certainly needs an audit. Exhausted nodes are tough to reproduce
easily (fully exhausted, that is), for me.

> I would expect kswapd would be looping endlessly because the zone
> wouldn't be balanced obviously. But I would be wrong... because
> pgdat_balanced is doing this:
> 		/*
> 		 * A special case here:
> 		 *
> 		 * balance_pgdat() skips over all_unreclaimable after
> 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> 		 * they must be considered balanced here as well!
> 		 */
> 		if (!zone_reclaimable(zone)) {
> 			balanced_pages += zone->managed_pages;
> 			continue;
> 		}
> 
> and zone_reclaimable is false for you as you didn't have any
> zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> zones right?) and so the kswapd would be woken up easily. So it looks
> like a mess.

My understanding, and I could easily be wrong, is that kswapd2 (node 2
is the exhausted one) spins endlessly, because the reclaim logic sees
that we are reclaiming from somewhere but the allocation request for
node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
never complete, so we just continue to reclaim.

> There are possibly other places which rely on populated_zone or
> for_each_populated_zone without checking reclaimability. Are those
> working as expected?

Not yet verified, admittedly.

> That being said. I am not objecting to this patch. I am just trying to
> wrap my head around possible issues from such a weird configuration and
> all the consequences.

Yeah, there are almost certainly more. Luckily, our test organization is
hammering this configuration, so hopefully I'll get reports about
further issues soon, if there are any, with the patch applied.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-04-03  7:57       ` Vlastimil Babka
@ 2015-04-03 17:45         ` Nishanth Aravamudan
  2015-05-05 22:09           ` Nishanth Aravamudan
  0 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-04-03 17:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Dave Hansen, Mel Gorman, anton, linuxppc-dev,
	linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Rik van Riel, Dan Streetman

On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote:
> On 03/31/2015 11:48 AM, Michal Hocko wrote:
> >On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> >>On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> >>>On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> >>>>@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >>>>
> >>>>         for (i = 0; i <= ZONE_NORMAL; i++) {
> >>>>                 zone = &pgdat->node_zones[i];
> >>>>-               if (!populated_zone(zone))
> >>>>+               if (!populated_zone(zone) || !zone_reclaimable(zone))
> >>>>                         continue;
> >>>>
> >>>>                 pfmemalloc_reserve += min_wmark_pages(zone);
> >>>
> >>>Do you really want zone_reclaimable()?  Or do you want something more
> >>>direct like "zone_reclaimable_pages(zone) == 0"?
> >>
> >>Yeah, I guess in my testing this worked out to be the same, since
> >>zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> >>always be false. Thanks!
> >>
> >>Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> >>reserves if node has no ZONE_NORMAL") from Mel.
> >>
> >>We have a system with the following topology:
> >>
> >># numactl -H
> >>available: 3 nodes (0,2-3)
> >>node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> >>23 24 25 26 27 28 29 30 31
> >>node 0 size: 28273 MB
> >>node 0 free: 27323 MB
> >>node 2 cpus:
> >>node 2 size: 16384 MB
> >>node 2 free: 0 MB
> >>node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> >>node 3 size: 30533 MB
> >>node 3 free: 13273 MB
> >>node distances:
> >>node   0   2   3
> >>   0:  10  20  20
> >>   2:  20  10  20
> >>   3:  20  20  10
> >>
> >>Node 2 has no free memory, because:
> >># cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> >>1
> >>
> >>This leads to the following zoneinfo:
> >>
> >>Node 2, zone      DMA
> >>   pages free     0
> >>         min      1840
> >>         low      2300
> >>         high     2760
> >>         scanned  0
> >>         spanned  262144
> >>         present  262144
> >>         managed  262144
> >>...
> >>   all_unreclaimable: 1
> >
> >Blee, this is a weird configuration.
> >
> >>If one then attempts to allocate some normal 16M hugepages via
> >>
> >>echo 37 > /proc/sys/vm/nr_hugepages
> >>
> >>The echo never returns and kswapd2 consumes CPU cycles.
> >>
> >>This is because throttle_direct_reclaim ends up calling
> >>wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> >>pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> >>are any reserves, and if so, then indicates the watermarks are ok, by
> >>seeing if there are sufficient free pages.
> >>
> >>675becce15 added a condition already for memoryless nodes. In this case,
> >>though, the node has memory, it is just all consumed (and not
> >>reclaimable). Effectively, though, the result is the same on this call
> >>to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> >>condition.
> >>
> >>With this change, the afore-mentioned 16M hugepage allocation attempt
> >>succeeds and correctly round-robins between Nodes 1 and 3.
> >
> >I am just wondering whether this is the right/complete fix. Don't we
> >need a similar treatment at more places?
> >I would expect kswapd would be looping endlessly because the zone
> >wouldn't be balanced obviously. But I would be wrong... because
> >pgdat_balanced is doing this:
> >		/*
> >		 * A special case here:
> >		 *
> >		 * balance_pgdat() skips over all_unreclaimable after
> >		 * DEF_PRIORITY. Effectively, it considers them balanced so
> >		 * they must be considered balanced here as well!
> >		 */
> >		if (!zone_reclaimable(zone)) {
> >			balanced_pages += zone->managed_pages;
> >			continue;
> >		}
> >
> >and zone_reclaimable is false for you as you didn't have any
> >zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> >would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> >zones right?) and so the kswapd would be woken up easily. So it looks
> >like a mess.
> 
> Yeah, looks like a much cleaner/complete solution would be to remove
> such zones from zonelists. But that means covering all situations
> when these hugepages are allocated/removed and the approach then
> looks similar to memory hotplug.
> Also I'm not sure if the ability to actually allocate the reserved
> hugepage would be impossible due to not being reachable by a
> zonelist...
> 
> >There are possibly other places which rely on populated_zone or
> >for_each_populated_zone without checking reclaimability. Are those
> >working as expected?
> 
> Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point
> in waking it up just to let it immediately go to sleep again.
> 
> >That being said. I am not objecting to this patch. I am just trying to
> >wrap my head around possible issues from such a weird configuration and
> >all the consequences.
> >
> >>Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> >
> >The patch as is doesn't seem to be harmful.
> >
> >Reviewed-by: Michal Hocko <mhocko@suse.cz>
> >
> >>---
> >>v1 -> v2:
> >>   Check against zone_reclaimable_pages, rather zone_reclaimable, based
> >>   upon feedback from Dave Hansen.
> >
> >Dunno, but shouldn't we use the same thing here and in pgdat_balanced?
> >zone_reclaimable_pages seems to be used only from zone_reclaimable().
> 
> pgdat_balanced() has a different goal than pfmemalloc_watermark_ok()
> and needs to match what balance_pgdat() does, which includes
> considering NR_PAGES_SCANNED through zone_reclaimable(). For the
> situation considered in this patch, result of zone_reclaimable()
> will match the test zone_reclaimable_pages(zone) == 0, so it is fine
> I think.

Right.

> What I find somewhat worrying though is that we could potentially
> break the pfmemalloc_watermark_ok() test in situations where
> zone_reclaimable_pages(zone) == 0 is a transient situation (and not
> a permanently allocated hugepage). In that case, the throttling is
> supposed to help system recover, and we might be breaking that
> ability with this patch, no?

Well, if it's transient, we'll skip it this time through, and once there
are reclaimable pages, we should notice it again.

I'm not familiar enough with this logic, so I'll read through the code
again soon to see if your concern is valid, as best I can.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-04-03 17:43       ` Nishanth Aravamudan
@ 2015-04-03 18:24         ` Michal Hocko
  2015-04-03 18:50           ` Nishanth Aravamudan
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2015-04-03 18:24 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Dave Hansen, Mel Gorman, anton, linuxppc-dev, linux-mm,
	linux-kernel, Andrew Morton, Johannes Weiner, Rik van Riel,
	Dan Streetman

On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote:
> On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
[...]
> > I would expect kswapd would be looping endlessly because the zone
> > wouldn't be balanced obviously. But I would be wrong... because
> > pgdat_balanced is doing this:
> > 		/*
> > 		 * A special case here:
> > 		 *
> > 		 * balance_pgdat() skips over all_unreclaimable after
> > 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> > 		 * they must be considered balanced here as well!
> > 		 */
> > 		if (!zone_reclaimable(zone)) {
> > 			balanced_pages += zone->managed_pages;
> > 			continue;
> > 		}
> > 
> > and zone_reclaimable is false for you as you didn't have any
> > zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> > would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> > zones right?) and so the kswapd would be woken up easily. So it looks
> > like a mess.
> 
> My understanding, and I could easily be wrong, is that kswapd2 (node 2
> is the exhausted one) spins endlessly, because the reclaim logic sees
> that we are reclaiming from somewhere but the allocation request for
> node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
> never complete, so we just continue to reclaim.

__GFP_THISNODE would be waking up kswapd2 again and again, that is true.
I am just wondering whether we will have any __GFP_THISNODE allocations
for a node without CPUs (numa_node_id() shouldn't return such a node
AFAICS). Maybe if somebody is bound to Node2 explicitly but I would
consider this as a misconfiguration.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-04-03 18:24         ` Michal Hocko
@ 2015-04-03 18:50           ` Nishanth Aravamudan
  0 siblings, 0 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-04-03 18:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Mel Gorman, anton, linuxppc-dev, linux-mm,
	linux-kernel, Andrew Morton, Johannes Weiner, Rik van Riel,
	Dan Streetman

On 03.04.2015 [20:24:45 +0200], Michal Hocko wrote:
> On Fri 03-04-15 10:43:57, Nishanth Aravamudan wrote:
> > On 31.03.2015 [11:48:29 +0200], Michal Hocko wrote:
> [...]
> > > I would expect kswapd would be looping endlessly because the zone
> > > wouldn't be balanced obviously. But I would be wrong... because
> > > pgdat_balanced is doing this:
> > > 		/*
> > > 		 * A special case here:
> > > 		 *
> > > 		 * balance_pgdat() skips over all_unreclaimable after
> > > 		 * DEF_PRIORITY. Effectively, it considers them balanced so
> > > 		 * they must be considered balanced here as well!
> > > 		 */
> > > 		if (!zone_reclaimable(zone)) {
> > > 			balanced_pages += zone->managed_pages;
> > > 			continue;
> > > 		}
> > > 
> > > and zone_reclaimable is false for you as you didn't have any
> > > zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> > > would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> > > zones right?) and so the kswapd would be woken up easily. So it looks
> > > like a mess.
> > 
> > My understanding, and I could easily be wrong, is that kswapd2 (node 2
> > is the exhausted one) spins endlessly, because the reclaim logic sees
> > that we are reclaiming from somewhere but the allocation request for
> > node 2 (which is __GFP_THISNODE for hugepages, not GFP_THISNODE) will
> > never complete, so we just continue to reclaim.
> 
> __GFP_THISNODE would be waking up kswapd2 again and again, that is true.

Right, one idea I had for this was ensuring that we perform reclaim with
somehow some knowledge of __GFP_THISNODE -- that is it needs to be
somewhat targetted in order to actually help satisfy the current
allocation. But it got pretty hairy fast and I didn't want to break the
world :)

> I am just wondering whether we will have any __GFP_THISNODE allocations
> for a node without CPUs (numa_node_id() shouldn't return such a node
> AFAICS). Maybe if somebody is bound to Node2 explicitly but I would
> consider this as a misconfiguration.

Right, I'd need to check what happens if in our setup you taskset to
node2 and tried to force memory to be local -- I think you'd either be
killed immediately, or the kernel will just disagree with your binding
since it's invalid (e.g., that will happen if you try to bind to a
memoryless node, I think).

Keep in mind that although in my config node2 had no CPUs, that's not a
hard & fast requirement. I do believe in a previous iteration of this
bug, the exhausted node had no free memory but did have cpus assigned to
it.

-Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-04-03 17:45         ` Nishanth Aravamudan
@ 2015-05-05 22:09           ` Nishanth Aravamudan
  2015-05-06  9:28             ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-05-05 22:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Dave Hansen, Mel Gorman, anton, linuxppc-dev,
	linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Rik van Riel, Dan Streetman

On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
> On 03.04.2015 [09:57:35 +0200], Vlastimil Babka wrote:
> > On 03/31/2015 11:48 AM, Michal Hocko wrote:
> > >On Fri 27-03-15 15:23:50, Nishanth Aravamudan wrote:
> > >>On 27.03.2015 [13:17:59 -0700], Dave Hansen wrote:
> > >>>On 03/27/2015 12:28 PM, Nishanth Aravamudan wrote:
> > >>>>@@ -2585,7 +2585,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > >>>>
> > >>>>         for (i = 0; i <= ZONE_NORMAL; i++) {
> > >>>>                 zone = &pgdat->node_zones[i];
> > >>>>-               if (!populated_zone(zone))
> > >>>>+               if (!populated_zone(zone) || !zone_reclaimable(zone))
> > >>>>                         continue;
> > >>>>
> > >>>>                 pfmemalloc_reserve += min_wmark_pages(zone);
> > >>>
> > >>>Do you really want zone_reclaimable()?  Or do you want something more
> > >>>direct like "zone_reclaimable_pages(zone) == 0"?
> > >>
> > >>Yeah, I guess in my testing this worked out to be the same, since
> > >>zone_reclaimable_pages(zone) is 0 and so zone_reclaimable(zone) will
> > >>always be false. Thanks!
> > >>
> > >>Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
> > >>reserves if node has no ZONE_NORMAL") from Mel.
> > >>
> > >>We have a system with the following topology:
> > >>
> > >># numactl -H
> > >>available: 3 nodes (0,2-3)
> > >>node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > >>23 24 25 26 27 28 29 30 31
> > >>node 0 size: 28273 MB
> > >>node 0 free: 27323 MB
> > >>node 2 cpus:
> > >>node 2 size: 16384 MB
> > >>node 2 free: 0 MB
> > >>node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> > >>node 3 size: 30533 MB
> > >>node 3 free: 13273 MB
> > >>node distances:
> > >>node   0   2   3
> > >>   0:  10  20  20
> > >>   2:  20  10  20
> > >>   3:  20  20  10
> > >>
> > >>Node 2 has no free memory, because:
> > >># cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
> > >>1
> > >>
> > >>This leads to the following zoneinfo:
> > >>
> > >>Node 2, zone      DMA
> > >>   pages free     0
> > >>         min      1840
> > >>         low      2300
> > >>         high     2760
> > >>         scanned  0
> > >>         spanned  262144
> > >>         present  262144
> > >>         managed  262144
> > >>...
> > >>   all_unreclaimable: 1
> > >
> > >Blee, this is a weird configuration.
> > >
> > >>If one then attempts to allocate some normal 16M hugepages via
> > >>
> > >>echo 37 > /proc/sys/vm/nr_hugepages
> > >>
> > >>The echo never returns and kswapd2 consumes CPU cycles.
> > >>
> > >>This is because throttle_direct_reclaim ends up calling
> > >>wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
> > >>pfmemalloc_watermark_ok() in turn checks all zones on the node if there
> > >>are any reserves, and if so, then indicates the watermarks are ok, by
> > >>seeing if there are sufficient free pages.
> > >>
> > >>675becce15 added a condition already for memoryless nodes. In this case,
> > >>though, the node has memory, it is just all consumed (and not
> > >>reclaimable). Effectively, though, the result is the same on this call
> > >>to pfmemalloc_watermark_ok() and thus seems like a reasonable additional
> > >>condition.
> > >>
> > >>With this change, the afore-mentioned 16M hugepage allocation attempt
> > >>succeeds and correctly round-robins between Nodes 1 and 3.
> > >
> > >I am just wondering whether this is the right/complete fix. Don't we
> > >need a similar treatment at more places?
> > >I would expect kswapd would be looping endlessly because the zone
> > >wouldn't be balanced obviously. But I would be wrong... because
> > >pgdat_balanced is doing this:
> > >		/*
> > >		 * A special case here:
> > >		 *
> > >		 * balance_pgdat() skips over all_unreclaimable after
> > >		 * DEF_PRIORITY. Effectively, it considers them balanced so
> > >		 * they must be considered balanced here as well!
> > >		 */
> > >		if (!zone_reclaimable(zone)) {
> > >			balanced_pages += zone->managed_pages;
> > >			continue;
> > >		}
> > >
> > >and zone_reclaimable is false for you as you didn't have any
> > >zone_reclaimable_pages(). But wakeup_kswapd doesn't do this check so it
> > >would see !zone_balanced() AFAICS (build_zonelists doesn't ignore those
> > >zones right?) and so the kswapd would be woken up easily. So it looks
> > >like a mess.
> > 
> > Yeah, looks like a much cleaner/complete solution would be to remove
> > such zones from zonelists. But that means covering all situations
> > when these hugepages are allocated/removed and the approach then
> > looks similar to memory hotplug.
> > Also I'm not sure if the ability to actually allocate the reserved
> > hugepage would be impossible due to not being reachable by a
> > zonelist...
> > 
> > >There are possibly other places which rely on populated_zone or
> > >for_each_populated_zone without checking reclaimability. Are those
> > >working as expected?
> > 
> > Yeah. At least the wakeup_kswapd case should be fixed IMHO. No point
> > in waking it up just to let it immediately go to sleep again.
> > 
> > >That being said. I am not objecting to this patch. I am just trying to
> > >wrap my head around possible issues from such a weird configuration and
> > >all the consequences.
> > >
> > >>Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> > >
> > >The patch as is doesn't seem to be harmful.
> > >
> > >Reviewed-by: Michal Hocko <mhocko@suse.cz>
> > >
> > >>---
> > >>v1 -> v2:
> > >>   Check against zone_reclaimable_pages, rather zone_reclaimable, based
> > >>   upon feedback from Dave Hansen.
> > >
> > >Dunno, but shouldn't we use the same thing here and in pgdat_balanced?
> > >zone_reclaimable_pages seems to be used only from zone_reclaimable().
> > 
> > pgdat_balanced() has a different goal than pfmemalloc_watermark_ok()
> > and needs to match what balance_pgdat() does, which includes
> > considering NR_PAGES_SCANNED through zone_reclaimable(). For the
> > situation considered in this patch, result of zone_reclaimable()
> > will match the test zone_reclaimable_pages(zone) == 0, so it is fine
> > I think.
> 
> Right.
> 
> > What I find somewhat worrying though is that we could potentially
> > break the pfmemalloc_watermark_ok() test in situations where
> > zone_reclaimable_pages(zone) == 0 is a transient situation (and not
> > a permanently allocated hugepage). In that case, the throttling is
> > supposed to help system recover, and we might be breaking that
> > ability with this patch, no?
> 
> Well, if it's transient, we'll skip it this time through, and once there
> are reclaimable pages, we should notice it again.
> 
> I'm not familiar enough with this logic, so I'll read through the code
> again soon to see if your concern is valid, as best I can.

In reviewing the code, I think that transiently unreclaimable zones will
lead to some higher direct reclaim rates and possible contention, but
shouldn't cause any major harm. The likelihood of that situation, as
well, in a non-reserved memory setup like the one I described, seems
exceedingly low.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-05-05 22:09           ` Nishanth Aravamudan
@ 2015-05-06  9:28             ` Vlastimil Babka
  2015-05-08 22:47               ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2015-05-06  9:28 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Michal Hocko, Dave Hansen, Mel Gorman, anton, linuxppc-dev,
	linux-mm, linux-kernel, Andrew Morton, Johannes Weiner,
	Rik van Riel, Dan Streetman

On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote:
> On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
>>> What I find somewhat worrying though is that we could potentially
>>> break the pfmemalloc_watermark_ok() test in situations where
>>> zone_reclaimable_pages(zone) == 0 is a transient situation (and not
>>> a permanently allocated hugepage). In that case, the throttling is
>>> supposed to help system recover, and we might be breaking that
>>> ability with this patch, no?
>>
>> Well, if it's transient, we'll skip it this time through, and once there
>> are reclaimable pages, we should notice it again.
>>
>> I'm not familiar enough with this logic, so I'll read through the code
>> again soon to see if your concern is valid, as best I can.
>
> In reviewing the code, I think that transiently unreclaimable zones will
> lead to some higher direct reclaim rates and possible contention, but
> shouldn't cause any major harm. The likelihood of that situation, as
> well, in a non-reserved memory setup like the one I described, seems
> exceedingly low.

OK, I guess when a reasonably configured system has nothing to reclaim, 
it's already busted and throttling won't change much.

Consider the patch Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Thanks,
> Nish
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-05-06  9:28             ` Vlastimil Babka
@ 2015-05-08 22:47               ` Andrew Morton
  2015-05-08 23:18                 ` Nishanth Aravamudan
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2015-05-08 22:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Nishanth Aravamudan, Michal Hocko, Dave Hansen, Mel Gorman,
	anton, linuxppc-dev, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Dan Streetman

On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote:
> > On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
> >>> What I find somewhat worrying though is that we could potentially
> >>> break the pfmemalloc_watermark_ok() test in situations where
> >>> zone_reclaimable_pages(zone) == 0 is a transient situation (and not
> >>> a permanently allocated hugepage). In that case, the throttling is
> >>> supposed to help system recover, and we might be breaking that
> >>> ability with this patch, no?
> >>
> >> Well, if it's transient, we'll skip it this time through, and once there
> >> are reclaimable pages, we should notice it again.
> >>
> >> I'm not familiar enough with this logic, so I'll read through the code
> >> again soon to see if your concern is valid, as best I can.
> >
> > In reviewing the code, I think that transiently unreclaimable zones will
> > lead to some higher direct reclaim rates and possible contention, but
> > shouldn't cause any major harm. The likelihood of that situation, as
> > well, in a non-reserved memory setup like the one I described, seems
> > exceedingly low.
> 
> OK, I guess when a reasonably configured system has nothing to reclaim, 
> it's already busted and throttling won't change much.
> 
> Consider the patch Acked-by: Vlastimil Babka <vbabka@suse.cz>

OK, thanks, I'll move this patch into the queue for 4.2-rc1.

Or is it important enough to merge into 4.1?



From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Subject: mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages

Based upon 675becce15 ("mm: vmscan: do not throttle based on pfmemalloc
reserves if node has no ZONE_NORMAL") from Mel.

We have a system with the following topology:

# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24 25 26 27 28 29 30 31
node 0 size: 28273 MB
node 0 free: 27323 MB
node 2 cpus:
node 2 size: 16384 MB
node 2 free: 0 MB
node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 3 size: 30533 MB
node 3 free: 13273 MB
node distances:
node   0   2   3
  0:  10  20  20
  2:  20  10  20
  3:  20  20  10

Node 2 has no free memory, because:
# cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
1

This leads to the following zoneinfo:

Node 2, zone      DMA
  pages free     0
        min      1840
        low      2300
        high     2760
        scanned  0
        spanned  262144
        present  262144
        managed  262144
...
  all_unreclaimable: 1

If one then attempts to allocate some normal 16M hugepages via

echo 37 > /proc/sys/vm/nr_hugepages

The echo never returns and kswapd2 consumes CPU cycles.

This is because throttle_direct_reclaim ends up calling
wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...). 
pfmemalloc_watermark_ok() in turn checks all zones on the node if there
are any reserves, and if so, then indicates the watermarks are ok, by
seeing if there are sufficient free pages.

675becce15 added a condition already for memoryless nodes.  In this case,
though, the node has memory, it is just all consumed (and not
reclaimable).  Effectively, though, the result is the same on this call to
pfmemalloc_watermark_ok() and thus seems like a reasonable additional
condition.

With this change, the afore-mentioned 16M hugepage allocation attempt
succeeds and correctly round-robins between Nodes 1 and 3.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anton Blanchard <anton@samba.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/vmscan.c~mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-reclaimable-pages mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-do-not-throttle-based-on-pfmemalloc-reserves-if-node-has-no-reclaimable-pages
+++ a/mm/vmscan.c
@@ -2646,7 +2646,8 @@ static bool pfmemalloc_watermark_ok(pg_d
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone))
+		if (!populated_zone(zone) ||
+		    zone_reclaimable_pages(zone) == 0)
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
_


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages
  2015-05-08 22:47               ` Andrew Morton
@ 2015-05-08 23:18                 ` Nishanth Aravamudan
  0 siblings, 0 replies; 15+ messages in thread
From: Nishanth Aravamudan @ 2015-05-08 23:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Michal Hocko, Dave Hansen, Mel Gorman, anton,
	linuxppc-dev, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Dan Streetman

On 08.05.2015 [15:47:26 -0700], Andrew Morton wrote:
> On Wed, 06 May 2015 11:28:12 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> > On 05/06/2015 12:09 AM, Nishanth Aravamudan wrote:
> > > On 03.04.2015 [10:45:56 -0700], Nishanth Aravamudan wrote:
> > >>> What I find somewhat worrying though is that we could potentially
> > >>> break the pfmemalloc_watermark_ok() test in situations where
> > >>> zone_reclaimable_pages(zone) == 0 is a transient situation (and not
> > >>> a permanently allocated hugepage). In that case, the throttling is
> > >>> supposed to help system recover, and we might be breaking that
> > >>> ability with this patch, no?
> > >>
> > >> Well, if it's transient, we'll skip it this time through, and once there
> > >> are reclaimable pages, we should notice it again.
> > >>
> > >> I'm not familiar enough with this logic, so I'll read through the code
> > >> again soon to see if your concern is valid, as best I can.
> > >
> > > In reviewing the code, I think that transiently unreclaimable zones will
> > > lead to some higher direct reclaim rates and possible contention, but
> > > shouldn't cause any major harm. The likelihood of that situation, as
> > > well, in a non-reserved memory setup like the one I described, seems
> > > exceedingly low.
> > 
> > OK, I guess when a reasonably configured system has nothing to reclaim, 
> > it's already busted and throttling won't change much.
> > 
> > Consider the patch Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> OK, thanks, I'll move this patch into the queue for 4.2-rc1.

Thank you!

> Or is it important enough to merge into 4.1?

I think 4.2 is sufficient, but I wonder now if I should have included a
stable tag? The issue has been around for a while and there's a
relatively easily workaround (use the per-node sysfs files to manually
round-robin around the exhausted node) in older kernels, so I had
decided against it before.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-05-08 23:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-27 19:28 [PATCH] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable zones Nishanth Aravamudan
2015-03-27 19:39 ` Nishanth Aravamudan
2015-03-27 19:58 ` Dan Streetman
2015-03-27 20:17 ` Dave Hansen
2015-03-27 22:23   ` [PATCH v2] mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages Nishanth Aravamudan
2015-03-31  9:48     ` Michal Hocko
2015-04-03  7:57       ` Vlastimil Babka
2015-04-03 17:45         ` Nishanth Aravamudan
2015-05-05 22:09           ` Nishanth Aravamudan
2015-05-06  9:28             ` Vlastimil Babka
2015-05-08 22:47               ` Andrew Morton
2015-05-08 23:18                 ` Nishanth Aravamudan
2015-04-03 17:43       ` Nishanth Aravamudan
2015-04-03 18:24         ` Michal Hocko
2015-04-03 18:50           ` Nishanth Aravamudan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).