linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI
@ 2020-07-01 15:26 Dave Hansen
  2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

A previous cleanup accidentally changed the vm.zone_reclaim_mode ABI.

This series restores the ABI and then reorganizes the code to make
the ABI more obvious.  Since the single-patch v1[1], I've:

 * Restored the RECLAIM_ZONE naming, comment and Documentation now
   that the implicit checks for it are known.
 * Move RECLAIM_* definitions to a uapi header
 * Add a node_reclaim_enabled() helper

 Documentation/admin-guide/sysctl/vm.rst |   10 +++++-----
 include/linux/swap.h                    |    7 +++++++
 include/uapi/linux/mempolicy.h          |    7 +++++++
 mm/khugepaged.c                         |    2 +-
 mm/page_alloc.c                         |    2 +-
 mm/vmscan.c                             |    3 ---
 6 files changed, 21 insertions(+), 10 deletions(-)

1. https://lore.kernel.org/linux-mm/20200626003459.D8E015CA@viggo.jf.intel.com/

Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI
  2020-07-01 15:26 [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Dave Hansen
@ 2020-07-01 15:26 ` Dave Hansen
  2020-07-01 20:03   ` David Rientjes
  2020-07-02 11:28   ` Huang, Ying
  2020-07-01 15:26 ` [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header Dave Hansen
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai, stable


From: Dave Hansen <dave.hansen@linux.intel.com>

I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
sysctl.  Like a good kernel developer, I also went to go update the
documentation.  I noticed that the bits in the documentation didn't
match the bits in the #defines.

The VM never explicitly checks the RECLAIM_ZONE bit.  The bit is,
however implicitly checked when checking 'node_reclaim_mode==0'.
The RECLAIM_ZONE #define was removed in a cleanup.  That, by itself
is fine.

But, when the bit was removed (bit 0) the _other_ bit locations also
got changed.  That's not OK because the bit values are documented to
mean one specific thing and users surely rely on them meaning that one
thing and not changing from kernel to kernel.  The end result is that
if someone had a script that did:

	sysctl vm.zone_reclaim_mode=1

That script went from doing nothing to writing out pages during
node reclaim after the commit in question.  That's not great.

Put the bits back the way they were and add a comment so something
like this is a bit harder to do again.  Update the documentation to
make it clear that the first bit is ignored.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: stable@vger.kernel.org
---

 b/Documentation/admin-guide/sysctl/vm.rst |   10 +++++-----
 b/mm/vmscan.c                             |    9 +++++++--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi	2020-07-01 08:22:11.354955336 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-07-01 08:22:11.360955336 -0700
@@ -948,11 +948,11 @@ that benefit from having their data cach
 left disabled as the caching effect is likely to be more important than
 data locality.
 
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction.  The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction.  The page allocator will take additional actions before
+allocating off node pages.
 
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi	2020-07-01 08:22:11.356955336 -0700
+++ b/mm/vmscan.c	2020-07-01 08:22:11.362955336 -0700
@@ -4090,8 +4090,13 @@ module_init(kswapd_init)
  */
 int node_reclaim_mode __read_mostly;
 
-#define RECLAIM_WRITE (1<<0)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1)	/* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI.  New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header
  2020-07-01 15:26 [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Dave Hansen
  2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
@ 2020-07-01 15:26 ` Dave Hansen
  2020-07-01 15:46   ` Ben Widawsky
  2020-07-01 20:03   ` David Rientjes
  2020-07-01 15:26 ` [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks Dave Hansen
  2020-07-01 16:00 ` [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Ben Widawsky
  3 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai


From: Dave Hansen <dave.hansen@linux.intel.com>

It is currently not obvious that the RECLAIM_* bits are part of the
uapi since they are defined in vmscan.c.  Move them to a uapi header
to make it obvious.

This should have no functional impact.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>

--

Note: This is not cc'd to stable.  It does not fix any bugs.
---

 b/include/uapi/linux/mempolicy.h |    7 +++++++
 b/mm/vmscan.c                    |    8 --------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff -puN include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi	2020-07-01 08:22:12.502955333 -0700
+++ b/include/uapi/linux/mempolicy.h	2020-07-01 08:22:12.508955333 -0700
@@ -62,5 +62,12 @@ enum {
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI.  New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi	2020-07-01 08:22:12.504955333 -0700
+++ b/mm/vmscan.c	2020-07-01 08:22:12.509955333 -0700
@@ -4091,14 +4091,6 @@ module_init(kswapd_init)
 int node_reclaim_mode __read_mostly;
 
 /*
- * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
- * ABI.  New bits are OK, but existing bits can never change.
- */
-#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
-
-/*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
  * a zone.
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
  2020-07-01 15:26 [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Dave Hansen
  2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
  2020-07-01 15:26 ` [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header Dave Hansen
@ 2020-07-01 15:26 ` Dave Hansen
  2020-07-01 20:03   ` David Rientjes
  2020-07-01 16:00 ` [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Ben Widawsky
  3 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 15:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai


From: Dave Hansen <dave.hansen@linux.intel.com>

RECLAIM_ZONE was assumed to be unused because it was never explicitly
used in the kernel.  However, there were a number of places where it
was checked implicitly by checking 'node_reclaim_mode' for a zero
value.

These zero checks are not great because it is not obvious what a zero
mode *means* in the code.  Replace them with a helper which makes it
more obvious: node_reclaim_enabled().

This helper also provides a handy place to explicitly check the
RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
obvious where the bit can affect behavior.

This should have no functional impact.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Daniel Wagner <dwagner@suse.de>

--

Note: This is not cc'd to stable.  It does not fix any bugs.
---

 b/include/linux/swap.h |    7 +++++++
 b/mm/khugepaged.c      |    2 +-
 b/mm/page_alloc.c      |    2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff -puN include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper include/linux/swap.h
--- a/include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper	2020-07-01 08:22:13.650955330 -0700
+++ b/include/linux/swap.h	2020-07-01 08:22:13.659955330 -0700
@@ -12,6 +12,7 @@
 #include <linux/fs.h>
 #include <linux/atomic.h>
 #include <linux/page-flags.h>
+#include <uapi/linux/mempolicy.h>
 #include <asm/page.h>
 
 struct notifier_block;
@@ -374,6 +375,12 @@ extern int sysctl_min_slab_ratio;
 #define node_reclaim_mode 0
 #endif
 
+static inline bool node_reclaim_enabled(void)
+{
+	/* Is any node_reclaim_mode bit set? */
+	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+}
+
 extern void check_move_unevictable_pages(struct pagevec *pvec);
 
 extern int kswapd_run(int nid);
diff -puN mm/khugepaged.c~mm-vmscan-node_reclaim_mode_helper mm/khugepaged.c
--- a/mm/khugepaged.c~mm-vmscan-node_reclaim_mode_helper	2020-07-01 08:22:13.652955330 -0700
+++ b/mm/khugepaged.c	2020-07-01 08:22:13.660955330 -0700
@@ -709,7 +709,7 @@ static bool khugepaged_scan_abort(int ni
 	 * If node_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
-	if (!node_reclaim_mode)
+	if (!node_reclaim_enabled())
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
diff -puN mm/page_alloc.c~mm-vmscan-node_reclaim_mode_helper mm/page_alloc.c
--- a/mm/page_alloc.c~mm-vmscan-node_reclaim_mode_helper	2020-07-01 08:22:13.655955330 -0700
+++ b/mm/page_alloc.c	2020-07-01 08:22:13.662955330 -0700
@@ -3733,7 +3733,7 @@ retry:
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (node_reclaim_mode == 0 ||
+			if (!node_reclaim_enabled() ||
 			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
 				continue;
 
_

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header
  2020-07-01 15:26 ` [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header Dave Hansen
@ 2020-07-01 15:46   ` Ben Widawsky
  2020-07-01 15:56     ` Dave Hansen
  2020-07-01 20:03   ` David Rientjes
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Widawsky @ 2020-07-01 15:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, alex.shi, dwagner, tobin, cl, akpm,
	ying.huang, dan.j.williams, cai

On 20-07-01 08:26:24, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c.  Move them to a uapi header
> to make it obvious.
> 
> This should have no functional impact.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: "Tobin C. Harding" <tobin@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Daniel Wagner <dwagner@suse.de>
> 
> --
> 
> Note: This is not cc'd to stable.  It does not fix any bugs.
> ---
> 
>  b/include/uapi/linux/mempolicy.h |    7 +++++++
>  b/mm/vmscan.c                    |    8 --------
>  2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff -puN include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~mm-vmscan-move-RECLAIM-bits-to-uapi	2020-07-01 08:22:12.502955333 -0700
> +++ b/include/uapi/linux/mempolicy.h	2020-07-01 08:22:12.508955333 -0700
> @@ -62,5 +62,12 @@ enum {
>  #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
>  #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
>  
> +/*
> + * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
> + * ABI.  New bits are OK, but existing bits can never change.
> + */
> +#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
> +#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */

Have you considered turning this into an enum while moving it?

>  
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff -puN mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-move-RECLAIM-bits-to-uapi	2020-07-01 08:22:12.504955333 -0700
> +++ b/mm/vmscan.c	2020-07-01 08:22:12.509955333 -0700
> @@ -4091,14 +4091,6 @@ module_init(kswapd_init)
>  int node_reclaim_mode __read_mostly;
>  
>  /*
> - * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
> - * ABI.  New bits are OK, but existing bits can never change.
> - */
> -#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> -
> -/*
>   * Priority for NODE_RECLAIM. This determines the fraction of pages
>   * of a node considered for each zone_reclaim. 4 scans 1/16th of
>   * a zone.
> _

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header
  2020-07-01 15:46   ` Ben Widawsky
@ 2020-07-01 15:56     ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 15:56 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

On 7/1/20 8:46 AM, Ben Widawsky wrote:
>> +/*
>> + * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>> + * ABI.  New bits are OK, but existing bits can never change.
>> + */
>> +#define RECLAIM_ZONE  (1<<0)	/* Run shrink_inactive_list on the zone */
>> +#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> Have you considered turning this into an enum while moving it?

The thought occurred to me, but all of the other bits in the uapi file
were defined this way.  I decided to not not attempt to buck the trend
in their new home.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI
  2020-07-01 15:26 [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Dave Hansen
                   ` (2 preceding siblings ...)
  2020-07-01 15:26 ` [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks Dave Hansen
@ 2020-07-01 16:00 ` Ben Widawsky
  3 siblings, 0 replies; 15+ messages in thread
From: Ben Widawsky @ 2020-07-01 16:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, alex.shi, dwagner, tobin, cl, akpm,
	ying.huang, dan.j.williams, cai

On 20-07-01 08:26:21, Dave Hansen wrote:
> A previous cleanup accidentally changed the vm.zone_reclaim_mode ABI.
> 
> This series restores the ABI and then reorganizes the code to make
> the ABI more obvious.  Since the single-patch v1[1], I've:
> 
>  * Restored the RECLAIM_ZONE naming, comment and Documentation now
>    that the implicit checks for it are known.
>  * Move RECLAIM_* definitions to a uapi header
>  * Add a node_reclaim_enabled() helper
> 
>  Documentation/admin-guide/sysctl/vm.rst |   10 +++++-----
>  include/linux/swap.h                    |    7 +++++++
>  include/uapi/linux/mempolicy.h          |    7 +++++++
>  mm/khugepaged.c                         |    2 +-
>  mm/page_alloc.c                         |    2 +-
>  mm/vmscan.c                             |    3 ---
>  6 files changed, 21 insertions(+), 10 deletions(-)
> 
> 1. https://lore.kernel.org/linux-mm/20200626003459.D8E015CA@viggo.jf.intel.com/
> 
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: "Tobin C. Harding" <tobin@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Daniel Wagner <dwagner@suse.de>

Series is:
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>

I was more thorough this time in checking all uses of node_reclaim_mode :-). I
do think in patch 2/3, using an enum would be a little better, which I've
mentioned there.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
  2020-07-01 15:26 ` [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks Dave Hansen
@ 2020-07-01 20:03   ` David Rientjes
  2020-07-01 20:04     ` Ben Widawsky
  0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2020-07-01 20:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

On Wed, 1 Jul 2020, Dave Hansen wrote:

> diff -puN include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper include/linux/swap.h
> --- a/include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper	2020-07-01 08:22:13.650955330 -0700
> +++ b/include/linux/swap.h	2020-07-01 08:22:13.659955330 -0700
> @@ -12,6 +12,7 @@
>  #include <linux/fs.h>
>  #include <linux/atomic.h>
>  #include <linux/page-flags.h>
> +#include <uapi/linux/mempolicy.h>
>  #include <asm/page.h>
>  
>  struct notifier_block;
> @@ -374,6 +375,12 @@ extern int sysctl_min_slab_ratio;
>  #define node_reclaim_mode 0
>  #endif
>  
> +static inline bool node_reclaim_enabled(void)
> +{
> +	/* Is any node_reclaim_mode bit set? */
> +	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +}
> +
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
>  
>  extern int kswapd_run(int nid);

If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
zonelist order before falling back to the next zone in the page allocator.  
The sysctl doesn't enforce any max value :/  I dont know if there is any 
such user, but this would break them if there is.

Should this simply be return !!node_reclaim_mode?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI
  2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
@ 2020-07-01 20:03   ` David Rientjes
  2020-07-02 11:28   ` Huang, Ying
  1 sibling, 0 replies; 15+ messages in thread
From: David Rientjes @ 2020-07-01 20:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai, stable

On Wed, 1 Jul 2020, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl.  Like a good kernel developer, I also went to go update the
> documentation.  I noticed that the bits in the documentation didn't
> match the bits in the #defines.
> 
> The VM never explicitly checks the RECLAIM_ZONE bit.  The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup.  That, by itself
> is fine.
> 
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed.  That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel.  The end result is that
> if someone had a script that did:
> 
> 	sysctl vm.zone_reclaim_mode=1
> 
> That script went from doing nothing to writing out pages during
> node reclaim after the commit in question.  That's not great.
> 
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again.  Update the documentation to
> make it clear that the first bit is ignored.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: "Tobin C. Harding" <tobin@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: stable@vger.kernel.org

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header
  2020-07-01 15:26 ` [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header Dave Hansen
  2020-07-01 15:46   ` Ben Widawsky
@ 2020-07-01 20:03   ` David Rientjes
  1 sibling, 0 replies; 15+ messages in thread
From: David Rientjes @ 2020-07-01 20:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

On Wed, 1 Jul 2020, Dave Hansen wrote:

> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c.  Move them to a uapi header
> to make it obvious.
> 
> This should have no functional impact.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: "Tobin C. Harding" <tobin@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Daniel Wagner <dwagner@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
  2020-07-01 20:03   ` David Rientjes
@ 2020-07-01 20:04     ` Ben Widawsky
  2020-07-01 21:29       ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Widawsky @ 2020-07-01 20:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, linux-kernel, linux-mm, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

On 20-07-01 13:03:01, David Rientjes wrote:
> On Wed, 1 Jul 2020, Dave Hansen wrote:
> 
> > diff -puN include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper include/linux/swap.h
> > --- a/include/linux/swap.h~mm-vmscan-node_reclaim_mode_helper	2020-07-01 08:22:13.650955330 -0700
> > +++ b/include/linux/swap.h	2020-07-01 08:22:13.659955330 -0700
> > @@ -12,6 +12,7 @@
> >  #include <linux/fs.h>
> >  #include <linux/atomic.h>
> >  #include <linux/page-flags.h>
> > +#include <uapi/linux/mempolicy.h>
> >  #include <asm/page.h>
> >  
> >  struct notifier_block;
> > @@ -374,6 +375,12 @@ extern int sysctl_min_slab_ratio;
> >  #define node_reclaim_mode 0
> >  #endif
> >  
> > +static inline bool node_reclaim_enabled(void)
> > +{
> > +	/* Is any node_reclaim_mode bit set? */
> > +	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> > +}
> > +
> >  extern void check_move_unevictable_pages(struct pagevec *pvec);
> >  
> >  extern int kswapd_run(int nid);
> 
> If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
> today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
> zonelist order before falling back to the next zone in the page allocator.  
> The sysctl doesn't enforce any max value :/  I dont know if there is any 
> such user, but this would break them if there is.
> 
> Should this simply be return !!node_reclaim_mode?
> 

I don't think so because I don't think anything else validates the unused bits
remain unused.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
  2020-07-01 20:04     ` Ben Widawsky
@ 2020-07-01 21:29       ` Dave Hansen
  2020-07-01 22:01         ` David Rientjes
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2020-07-01 21:29 UTC (permalink / raw)
  To: David Rientjes, Dave Hansen, linux-kernel, linux-mm, alex.shi,
	dwagner, tobin, cl, akpm, ying.huang, dan.j.williams, cai

On 7/1/20 1:04 PM, Ben Widawsky wrote:
>> +static inline bool node_reclaim_enabled(void)
>> +{
>> +	/* Is any node_reclaim_mode bit set? */
>> +	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
>> +}
>> +
>>  extern void check_move_unevictable_pages(struct pagevec *pvec);
>>  
>>  extern int kswapd_run(int nid);
> If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
> today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
> zonelist order before falling back to the next zone in the page allocator.  
> The sysctl doesn't enforce any max value :/  I dont know if there is any 
> such user, but this would break them if there is.
> 
> Should this simply be return !!node_reclaim_mode?

You're right that there _could_ be a user-visible behavior change here.
 But, if there were a change it would be for a bit which wasn't even
mentioned in the documentation.  Somebody would have had to look at the
doc mentioning 1,2,4 and written an 8.  If they did that, they're asking
for trouble because we could have defined the '8' bit to do nasty things
like auto-demote all your memory. :)

I'll mention it in the changelog, but I still think we should check the
actual, known bits rather than check for 0.

BTW, in the hardware, they almost invariably make unused bits "reserved"
and do mean things like #GP if someone tries to set them.  This is a
case where the kernel probably should have done the same.  It would have
saved us the trouble of asking these questions now.  Maybe we should
even do that going forward.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
  2020-07-01 21:29       ` Dave Hansen
@ 2020-07-01 22:01         ` David Rientjes
  0 siblings, 0 replies; 15+ messages in thread
From: David Rientjes @ 2020-07-01 22:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-kernel, linux-mm, alex.shi, dwagner, tobin,
	cl, akpm, ying.huang, dan.j.williams, cai

On Wed, 1 Jul 2020, Dave Hansen wrote:

> On 7/1/20 1:04 PM, Ben Widawsky wrote:
> >> +static inline bool node_reclaim_enabled(void)
> >> +{
> >> +	/* Is any node_reclaim_mode bit set? */
> >> +	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> >> +}
> >> +
> >>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> >>  
> >>  extern int kswapd_run(int nid);
> > If a user writes a bit that isn't a RECLAIM_* bit to vm.zone_reclaim_mode 
> > today, it acts as though RECLAIM_ZONE is enabled: we try to reclaim in 
> > zonelist order before falling back to the next zone in the page allocator.  
> > The sysctl doesn't enforce any max value :/  I dont know if there is any 
> > such user, but this would break them if there is.
> > 
> > Should this simply be return !!node_reclaim_mode?
> 
> You're right that there _could_ be a user-visible behavior change here.
>  But, if there were a change it would be for a bit which wasn't even
> mentioned in the documentation.  Somebody would have had to look at the
> doc mentioning 1,2,4 and written an 8.  If they did that, they're asking
> for trouble because we could have defined the '8' bit to do nasty things
> like auto-demote all your memory. :)
> 
> I'll mention it in the changelog, but I still think we should check the
> actual, known bits rather than check for 0.
> 
> BTW, in the hardware, they almost invariably make unused bits "reserved"
> and do mean things like #GP if someone tries to set them.  This is a
> case where the kernel probably should have done the same.  It would have
> saved us the trouble of asking these questions now.  Maybe we should
> even do that going forward.
> 

Maybe enforce it in a sysctl handler so the user catches any errors, which 
would be better than silently accepting some policy that doesn't exist?

RECLAIM_UNMAP and/or RECLAIM_WRITE should likely get -EINVAL if attempted 
to be set without RECLAIM_ZONE as well: they are no-ops without 
RECLAIM_ZONE.  This would likely have caught something wrong with commit 
648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE") if it 
would have already been in place.

I don't feel strongly about this, so feel free to ignore.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI
  2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
  2020-07-01 20:03   ` David Rientjes
@ 2020-07-02 11:28   ` Huang, Ying
  2020-07-02 14:36     ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2020-07-02 11:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, dan.j.williams, cai, stable

Dave Hansen <dave.hansen@linux.intel.com> writes:

> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl.  Like a good kernel developer, I also went to go update the
> documentation.  I noticed that the bits in the documentation didn't
> match the bits in the #defines.
>
> The VM never explicitly checks the RECLAIM_ZONE bit.  The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup.  That, by itself
> is fine.
>
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed.  That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel.  The end result is that
> if someone had a script that did:
>
> 	sysctl vm.zone_reclaim_mode=1
>
> That script went from doing nothing

Per my understanding, this script would have enabled node reclaim for
clean unmapped pages before commit 648b5cf368e0 ("mm/vmscan: remove
unused RECLAIM_OFF/RECLAIM_ZONE").  So we should revise the description
here?

> to writing out pages during
> node reclaim after the commit in question.  That's not great.
>
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again.  Update the documentation to
> make it clear that the first bit is ignored.
>

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI
  2020-07-02 11:28   ` Huang, Ying
@ 2020-07-02 14:36     ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-07-02 14:36 UTC (permalink / raw)
  To: Huang, Ying, Dave Hansen
  Cc: linux-kernel, linux-mm, ben.widawsky, alex.shi, dwagner, tobin,
	cl, akpm, dan.j.williams, cai, stable

On 7/2/20 4:28 AM, Huang, Ying wrote:
>> But, when the bit was removed (bit 0) the _other_ bit locations also
>> got changed.  That's not OK because the bit values are documented to
>> mean one specific thing and users surely rely on them meaning that one
>> thing and not changing from kernel to kernel.  The end result is that
>> if someone had a script that did:
>>
>> 	sysctl vm.zone_reclaim_mode=1
>>
>> That script went from doing nothing
> Per my understanding, this script would have enabled node reclaim for
> clean unmapped pages before commit 648b5cf368e0 ("mm/vmscan: remove
> unused RECLAIM_OFF/RECLAIM_ZONE").  So we should revise the description
> here?

Yes, you're right.  I updated the patch with the updated understanding
about the implicit use of the bit but didn't update the changelog.  I'll
do that for v3.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-07-02 14:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01 15:26 [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Dave Hansen
2020-07-01 15:26 ` [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI Dave Hansen
2020-07-01 20:03   ` David Rientjes
2020-07-02 11:28   ` Huang, Ying
2020-07-02 14:36     ` Dave Hansen
2020-07-01 15:26 ` [PATCH 2/3] mm/vmscan: move RECLAIM* bits to uapi header Dave Hansen
2020-07-01 15:46   ` Ben Widawsky
2020-07-01 15:56     ` Dave Hansen
2020-07-01 20:03   ` David Rientjes
2020-07-01 15:26 ` [PATCH 3/3] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks Dave Hansen
2020-07-01 20:03   ` David Rientjes
2020-07-01 20:04     ` Ben Widawsky
2020-07-01 21:29       ` Dave Hansen
2020-07-01 22:01         ` David Rientjes
2020-07-01 16:00 ` [PATCH 0/3] [v2] Repair and clean up vm.zone_reclaim_mode sysctl ABI Ben Widawsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).