* [RFC PATCH v3 00/15] Slab Movable Objects (SMO)
@ 2019-04-11  1:34 Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 01/15] slub: Add isolate() and migrate() methods Tobin C. Harding
                   ` (14 more replies)
  0 siblings, 15 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Hi,

Another iteration of the SMO patch set implementing suggestions from Al
and Willy on the last version as well as some feedback from comments on
the recent LWN article.

Applies on top of Linus' tree (tag: v5.1-rc4).

This is a patch set implementing movable objects within the SLUB
allocator.  This is work based on Christopher Lameter's patch set:

 https://lore.kernel.org/patchwork/project/lkml/list/?series=377335

The original code logic is from that set and implemented by Christopher.
Clean up, refactoring, documentation, and additional features by myself.
Responsibility for any bugs remaining falls solely with myself.

Patch #9 has changes to the XArray migration function as suggested by
Matthew, thank you.

The only other changes to this version are to the dcache code.

dcache
------

It was noted on LWN that calling the dcache migration function
'd_migrate' is a misnomer because we are _not_ trying to migrate the
dentry objects but rather only free them.  As noted by Al, dentry (and
inode) objects are inherently not relocatable.  What we are trying to
achieve here is, rather, to attempt to free a select group of dentry
objects.  The dcache patches are not intended to be a silver bullet
fixing all fragmentation within the dentry slab cache.  Instead we are
trying to make a non-invasive attempt at freeing up pages sparsely used
by the dentry slab cache.  This may be useful for a number of reasons
e.g. we _may_ be able to free a page that is stopping high order page
allocations.  This would be a useful capability.

Since this is only something that _may_ help, the aim is to be
non-intrusive.  This version of the set adds a config option to
selectively build in the SMO code for the dcache.  Without this option
the only change this set makes to the dcache is adding a constructor.
With the constructor doing the spin_lock_init() it is hoped this will at
best be a performance gain and at worst NOT be a performance reduction.
Benchmarking has found this to be the case; results are included below.

Patches #14 and #15 can be rolled into a single patch if #15 is found
favourable.

Changes since v2:

 - Improve the XArray migration function (thanks Matthew)
 - Fix the dcache constructor (thanks Alexander)
 - Rename the d_migrate function to d_partial_shrink (open to
   suggested improvement)
 - Totally re-write the dcache migration function based on schooling by Al


Thanks for looking at this,
Tobin.


=============================
dcache SMO patch benchmarking
=============================

Process
=======

We use 5.1-rc4 as the baseline.  We benchmark the SMO patch set with
and without CONFIG_DCACHE_SMO.  Without CONFIG_DCACHE_SMO the patch set
just adds a constructor to the dcache; no other code is added to the
build.  Building with CONFIG_DCACHE_SMO adds code to enable object
migration for the dcache.

cmd = `time find / -name fname-no-exist`
drop_caches = `echo 2 > /proc/sys/vm/drop_caches`

1. Boot system
2. Run $cmd
3. Run $drop_caches
4. Run $cmd


Bare metal results
------------------

Machine: x86_64
Kernel configured with::

	make defconfig


- rc4 kernel (baseline)::

	time find / -name fname-no-exist

	real	0m29.799s
	user	0m1.519s
	sys	0m10.825s

	echo 2 > /proc/sys/vm/drop_caches 

	time find / -name fname-no-exist

	real	0m6.828s
	user	0m0.952s
	sys	0m5.824s


- rc4 kernel with SMO patch set and !CONFIG_DCACHE_SMO::

	time find / -name fname-no-exist

	real	0m30.075s
	user	0m1.480s
	sys	0m10.754s

	echo 2 > /proc/sys/vm/drop_caches 
	time find / -name fname-no-exist

	real	0m6.626s
	user	0m0.917s
	sys	0m5.661s


- rc4 kernel with SMO patch set and CONFIG_DCACHE_SMO::

	time find / -name fname-no-exist

	real	0m30.637s
	user	0m1.516s
	sys	0m11.603s

	echo 2 > /proc/sys/vm/drop_caches 

	time find / -name fname-no-exist

	real	0m6.886s
	user	0m0.932s
	sys	0m5.907s


Qemu results
------------

Host machine: x86_64

Qemu kernel configured with::

	make defconfig
	make kvmconfig

Qemu invoked with::

    qemu-system-x86_64 \
      -enable-kvm \
      -m 4G \
      -hda arch.qcow \
      -kernel $kernel \
      -serial stdio \
      -display none \
      -append 'root=/dev/sda1 console=ttyS0 rw'

- rc4 kernel (baseline)::

	time find / -name fname-no-exist

	real	0m0.929s
	user	0m0.096s
	sys	0m0.168s

	echo 2 > /proc/sys/vm/drop_caches 
	time find / -name fname-no-exist

	real	0m0.249s
	user	0m0.112s
	sys	0m0.133s

- rc4 kernel with SMO patch set and !CONFIG_DCACHE_SMO::

	time find / -name fname-no-exist

	real	0m1.018s
	user	0m0.095s
	sys	0m0.151s

	echo 2 > /proc/sys/vm/drop_caches 
	time find / -name fname-no-exist

	real	0m0.191s
	user	0m0.083s
	sys	0m0.105s


- rc4 kernel with SMO patch set and CONFIG_DCACHE_SMO::

	time find / -name fname-no-exist

	real	0m0.763s
	user	0m0.091s
	sys	0m0.165s

	echo 2 > /proc/sys/vm/drop_caches 
	time find / -name fname-no-exist

	real	0m0.192s
	user	0m0.062s
	sys	0m0.126s


I am not very experienced with benchmarking; if this is grossly
incorrect please do not hesitate to yell at me.  Any suggestions on
more/better benchmarking are most appreciated.

Thanks,
Tobin.


Tobin C. Harding (15):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  xarray: Implement migration function for objects
  tools/testing/slab: Add XArray movable objects tests
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  dcache: Add CONFIG_DCACHE_SMO

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c                                 | 106 ++-
 include/linux/slab.h                        |  71 ++
 include/linux/slub_def.h                    |  10 +
 lib/radix-tree.c                            |  13 +
 lib/xarray.c                                |  49 ++
 mm/Kconfig                                  |  14 +
 mm/slab_common.c                            |   2 +-
 mm/slub.c                                   | 819 ++++++++++++++++++--
 tools/testing/slab/Makefile                 |  10 +
 tools/testing/slab/slub_defrag.c            | 567 ++++++++++++++
 tools/testing/slab/slub_defrag.py           | 451 +++++++++++
 tools/testing/slab/slub_defrag_xarray.c     | 211 +++++
 tools/vm/slabinfo.c                         |  51 +-
 14 files changed, 2295 insertions(+), 93 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c
 create mode 100755 tools/testing/slab/slub_defrag.py
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

-- 
2.21.0



* [RFC PATCH v3 01/15] slub: Add isolate() and migrate() methods
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 02/15] tools/vm/slabinfo: Add support for -C and -M options Tobin C. Harding
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Add the two methods needed for moving objects and enable the display of
the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add a function to set up the callback methods for a slab
cache.

Add empty functions for SLAB/SLOB. The API is generic so it could be
theoretically implemented for these allocators as well.

Change sysfs 'ctor' field to be 'ops' to contain all the callback
operations defined for a slab cache.  Display the existing 'ctor'
callback in the 'ops' field's contents along with the 'isolate' and
'migrate' callbacks.
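
To make the calling convention concrete, here is a minimal userspace
sketch of how a cache user might register the two callbacks.  It is not
part of this patch: struct kmem_cache is reduced to a stand-in, and
model_setup_mobility(), struct demo_obj and the demo_*() callbacks are
hypothetical names.  Unlike kmem_cache_setup_mobility(), the model
returns an int so that the missing-ctor rejection is visible to the
caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct kmem_cache;

/* Callback types mirroring the prototypes added to slab.h. */
typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr);
typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
				     int nr, int node, void *private_data);

/* Reduced userspace stand-in for the kernel structure (assumption). */
struct kmem_cache {
	const char *name;
	void (*ctor)(void *);
	kmem_cache_isolate_func *isolate;
	kmem_cache_migrate_func *migrate;
};

/*
 * Model of kmem_cache_setup_mobility(): a cache without a ctor is
 * rejected.  Returns 0 on success, -1 on rejection (the kernel
 * function returns void and only logs).
 */
static int model_setup_mobility(struct kmem_cache *s,
				kmem_cache_isolate_func *isolate,
				kmem_cache_migrate_func *migrate)
{
	assert(s != NULL);
	if (!s->ctor) {
		fprintf(stderr, "%s: require constructor to setup mobility\n",
			s->name);
		return -1;
	}
	s->isolate = isolate;
	s->migrate = migrate;
	return 0;
}

/* Hypothetical object type with a refcount that isolate() can pin. */
struct demo_obj {
	int refcount;
	int value;
};

static void demo_ctor(void *p)
{
	struct demo_obj *o = p;

	o->refcount = 0;
	o->value = 0;
}

/* Pin every object still present in the array; no private data needed. */
static void *demo_isolate(struct kmem_cache *s, void **ptr, int nr)
{
	(void)s;
	for (int i = 0; i < nr; i++) {
		struct demo_obj *o = ptr[i];

		if (o)
			o->refcount++;
	}
	return NULL;
}

/*
 * A real migrate() would allocate, copy, fix up pointers and free.
 * This sketch only drops the pin taken by demo_isolate().
 */
static void demo_migrate(struct kmem_cache *s, void **ptr, int nr,
			 int node, void *private_data)
{
	(void)s;
	(void)node;
	(void)private_data;
	for (int i = 0; i < nr; i++) {
		struct demo_obj *o = ptr[i];

		if (o)
			o->refcount--;
	}
}
```

A caller would set .ctor = demo_ctor on its cache, then call
model_setup_mobility(&cache, demo_isolate, demo_migrate); the allocator
side later invokes cache.isolate() under its locks and cache.migrate()
once they are dropped.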

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 include/linux/slab.h     | 70 ++++++++++++++++++++++++++++++++++++++++
 include/linux/slub_def.h |  3 ++
 mm/slub.c                | 59 +++++++++++++++++++++++++++++----
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..886fc130334d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
+/*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable
+ * mobile objects and targeted reclaim in slab caches.
+ */
+
+/**
+ * typedef kmem_cache_isolate_func - Object isolation callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to isolate.
+ * @nr: Number of objects in @ptr array.
+ *
+ * The purpose of kmem_cache_isolate_func() is to pin each object so that
+ * they cannot be freed until kmem_cache_migrate_func() has processed
+ * them. This may be accomplished by increasing the refcount or setting
+ * a flag.
+ *
+ * The object pointer array passed is also passed to
+ * kmem_cache_migrate_func().  The function may remove objects from the
+ * array by setting pointers to %NULL. This is useful if we can
+ * determine that an object is being freed because
+ * kmem_cache_isolate_func() was called when the subsystem was calling
+ * kmem_cache_free().  In that case it is not necessary to increase the
+ * refcount or specially mark the object because the release of the slab
+ * lock will lead to the immediate freeing of the object.
+ *
+ * Context: Called with locks held so that the slab objects cannot be
+ *          freed.  We are in an atomic context and no slab operations
+ *          may be performed.
+ * Return: A pointer that is passed to the migrate function. If any
+ *         objects cannot be touched at this point then the pointer may
+ *         indicate a failure and then the migration function can simply
+ *         remove the references that were already obtained. The private
+ *         data could be used to track the objects that were already pinned.
+ */
+typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int nr);
+
+/**
+ * typedef kmem_cache_migrate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to migrate.
+ * @nr: Number of objects in @ptr array.
+ * @node: The NUMA node where the object should be allocated.
+ * @private: The pointer returned by kmem_cache_isolate_func().
+ *
+ * This function is responsible for migrating objects.  Typically, for
+ * each object in the input array you will want to allocate a new
+ * object, copy the original object, update any pointers, and free the
+ * old object.
+ *
+ * After this function returns all pointers to the old object should now
+ * point to the new object.
+ *
+ * Context: Called with no locks held and interrupts enabled.  Sleeping
+ *          is possible.  Any operation may be performed.
+ */
+typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
+				     int nr, int node, void *private);
+
+/*
+ * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func,
+			       kmem_cache_migrate_func);
+#else
+static inline void
+kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func isolate,
+			  kmem_cache_migrate_func migrate) {}
+#endif
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..2879a2f5f8eb 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
 	gfp_t allocflags;	/* gfp flags to use on each alloc */
 	int refcount;		/* Refcount for slab cache destroy */
 	void (*ctor)(void *);
+	kmem_cache_isolate_func *isolate;
+	kmem_cache_migrate_func *migrate;
+
 	unsigned int inuse;		/* Offset to metadata */
 	unsigned int align;		/* Alignment */
 	unsigned int red_left_pad;	/* Left redzone padding size */
diff --git a/mm/slub.c b/mm/slub.c
index d30ede89f4a6..ae44d640b8c1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4326,6 +4326,33 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	return err;
 }
 
+void kmem_cache_setup_mobility(struct kmem_cache *s,
+			       kmem_cache_isolate_func isolate,
+			       kmem_cache_migrate_func migrate)
+{
+	/*
+	 * Mobile objects must have a ctor otherwise the object may be
+	 * in an undefined state on allocation.  Since the object may
+	 * need to be inspected by the migration function at any time
+	 * after allocation we must ensure that the object always has a
+	 * defined state.
+	 */
+	if (!s->ctor) {
+		pr_err("%s: require constructor to setup mobility\n", s->name);
+		return;
+	}
+
+	s->isolate = isolate;
+	s->migrate = migrate;
+
+	/*
+	 * Sadly serialization requirements currently mean that we have
+	 * to disable fast cmpxchg based processing.
+	 */
+	s->flags &= ~__CMPXCHG_DOUBLE;
+}
+EXPORT_SYMBOL(kmem_cache_setup_mobility);
+
 void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
@@ -5010,13 +5037,33 @@ static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
 }
 SLAB_ATTR(cpu_partial);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static int op_show(char *buf, const char *txt, unsigned long addr)
 {
-	if (!s->ctor)
-		return 0;
-	return sprintf(buf, "%pS\n", s->ctor);
+	int x = 0;
+
+	x += sprintf(buf, "%s : ", txt);
+	x += sprint_symbol(buf + x, addr);
+	x += sprintf(buf + x, "\n");
+
+	return x;
+}
+
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
+{
+	int x = 0;
+
+	if (s->ctor)
+		x += op_show(buf + x, "ctor", (unsigned long)s->ctor);
+
+	if (s->isolate)
+		x += op_show(buf + x, "isolate", (unsigned long)s->isolate);
+
+	if (s->migrate)
+		x += op_show(buf + x, "migrate", (unsigned long)s->migrate);
+
+	return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -5429,7 +5476,7 @@ static struct attribute *slab_attrs[] = {
 	&objects_partial_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
+	&ops_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&hwcache_align_attr.attr,
-- 
2.21.0



* [RFC PATCH v3 02/15] tools/vm/slabinfo: Add support for -C and -M options
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 01/15] slub: Add isolate() and migrate() methods Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 03/15] slub: Sort slab cache list Tobin C. Harding
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have a migrate function).
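
The detection that -C and -M rely on can be sketched in isolation.
This assumes the 'ops' attribute text uses the "name : symbol" line
format produced by op_show() in patch 01; parse_ops() and struct
ops_flags are hypothetical names, and any symbol strings fed to it are
made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Flags derived from the text of /sys/kernel/slab/<cache>/ops. */
struct ops_flags {
	int ctor;	/* cache has a constructor */
	int movable;	/* cache has a migrate callback */
};

/*
 * Scan the ops text for the "ctor :" and "migrate :" markers, the same
 * substring test the slabinfo change performs on its read buffer.
 */
static struct ops_flags parse_ops(const char *buffer)
{
	struct ops_flags f = { 0, 0 };

	assert(buffer != NULL);
	if (strstr(buffer, "ctor :"))
		f.ctor = 1;
	if (strstr(buffer, "migrate :"))
		f.movable = 1;
	return f;
}
```

Note that a cache is treated as movable based on 'migrate' alone; the
'isolate' line is not inspected.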

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/vm/slabinfo.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
 	unsigned int hwcache_align, object_size, objs_per_slab;
 	unsigned int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
+	int movable, ctor;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
 	unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-	printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-		"slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+	printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n"
+	       "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] [slab-regexp]\n"
+
 		"-a|--aliases           Show aliases\n"
 		"-A|--activity          Most active slabs first\n"
 		"-B|--Bytes             Show size in bytes\n"
+		"-C|--ctor              Show slabs with ctors\n"
 		"-D|--display-active    Switch line format to activity\n"
 		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
 		"-L|--Loss              Sort by loss\n"
+		"-M|--movable           Show caches that support movable objects\n"
 		"-n|--numa              Show NUMA information\n"
 		"-N|--lines=K           Show the first K slabs\n"
 		"-o|--ops               Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
 	if (show_empty && s->slabs)
 		return;
 
+	if (show_ctor && !s->ctor)
+		return;
+
+	if (show_movable && !s->movable)
+		return;
+
 	if (sort_loss == 0)
 		store_size(size_str, slab_size(s));
 	else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
 		*p++ = '*';
 	if (s->cache_dma)
 		*p++ = 'd';
+	if (s->ctor)
+		*p++ = 'C';
+	if (s->movable)
+		*p++ = 'M';
 	if (s->hwcache_align)
 		*p++ = 'A';
 	if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
 		printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
 			s->name, s->objects, s->object_size, size_str, dist_str,
 			s->objs_per_slab, s->order,
-			s->slabs ? (s->partial * 100) / s->slabs : 100,
+			s->slabs ? (s->partial * 100) /
+					(s->slabs * s->objs_per_slab) : 100,
 			s->slabs ? (s->objects * s->object_size * 100) /
 				(s->slabs * (page_size << s->order)) : 100,
 			flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
 			slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
 			chdir("..");
+			if (read_slab_obj(slab, "ops")) {
+				if (strstr(buffer, "ctor :"))
+					slab->ctor = 1;
+				if (strstr(buffer, "migrate :"))
+					slab->movable = 1;
+			}
+
 			if (slab->name[0] == ':')
 				alias_targets++;
 			slab++;
@@ -1332,6 +1356,8 @@ static void xtotals(void)
 }
 
 struct option opts[] = {
+	{ "ctor", no_argument, NULL, 'C' },
+	{ "movable", no_argument, NULL, 'M' },
 	{ "aliases", no_argument, NULL, 'a' },
 	{ "activity", no_argument, NULL, 'A' },
 	{ "debug", optional_argument, NULL, 'd' },
@@ -1367,7 +1393,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXBU",
+	while ((c = getopt_long(argc, argv, "aACd::Defhil1MnoprstvzTSN:LXBU",
 						opts, NULL)) != -1)
 		switch (c) {
 		case '1':
@@ -1376,6 +1402,9 @@ int main(int argc, char *argv[])
 		case 'a':
 			show_alias = 1;
 			break;
+		case 'C':
+			show_ctor = 1;
+			break;
 		case 'A':
 			sort_active = 1;
 			break;
@@ -1399,6 +1428,9 @@ int main(int argc, char *argv[])
 		case 'i':
 			show_inverted = 1;
 			break;
+		case 'M':
+			show_movable = 1;
+			break;
 		case 'n':
 			show_numa = 1;
 			break;
-- 
2.21.0



* [RFC PATCH v3 03/15] slub: Sort slab cache list
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 01/15] slub: Add isolate() and migrate() methods Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 02/15] tools/vm/slabinfo: Add support for -C and -M options Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 04/15] slub: Slab defrag core Tobin C. Harding
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.
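
The ordering invariant can be illustrated with a small userspace model.
This is only a sketch: the list helpers are simplified re-implementations
rather than the kernel's list.h, and struct cache, cache_register(),
cache_set_movable() and count_movable_prefix() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_insert(struct list_head *n, struct list_head *prev,
			struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	list_insert(n, head->prev, head);
}

/* Unlink n and re-insert it directly after head. */
static void list_move_head(struct list_head *n, struct list_head *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_insert(n, head, head->next);
}

struct cache {
	struct list_head list;	/* first member, so a cast recovers the cache */
	const char *name;
	int movable;
};

/* New caches go to the tail, as create_cache() does after this patch. */
static void cache_register(struct cache *c, struct list_head *caches)
{
	c->movable = 0;
	list_add_tail(&c->list, caches);
}

/* Enabling mobility moves the cache to the front of the list. */
static void cache_set_movable(struct cache *c, struct list_head *caches)
{
	assert(c->list.next && c->list.prev);	/* must be registered */
	c->movable = 1;
	list_move_head(&c->list, caches);
}

/* Walk from the head and stop at the first non-movable cache. */
static int count_movable_prefix(const struct list_head *caches)
{
	int n = 0;

	for (const struct list_head *p = caches->next; p != caches; p = p->next) {
		const struct cache *c = (const struct cache *)p;

		if (!c->movable)
			break;
		n++;
	}
	return n;
}
```

Because movable caches always sit in front of every non-movable one, a
scan like count_movable_prefix() visits only the movable caches plus at
most one other entry.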

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 mm/slab_common.c | 2 +-
 mm/slub.c        | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
 		goto out_free_cache;
 
 	s->refcount = 1;
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	memcg_link_cache(s);
 out:
 	if (err)
diff --git a/mm/slub.c b/mm/slub.c
index ae44d640b8c1..f6b0e4a395ef 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4342,6 +4342,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 		return;
 	}
 
+	mutex_lock(&slab_mutex);
+
 	s->isolate = isolate;
 	s->migrate = migrate;
 
@@ -4350,6 +4352,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 	 * to disable fast cmpxchg based processing.
 	 */
 	s->flags &= ~__CMPXCHG_DOUBLE;
+
+	list_move(&s->list, &slab_caches);	/* Move to top */
+
+	mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
-- 
2.21.0



* [RFC PATCH v3 04/15] slub: Slab defrag core
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (2 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 03/15] slub: Sort slab cache list Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 05/15] tools/vm/slabinfo: Add remote node defrag ratio output Tobin C. Harding
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be used by
partial slab pages.  This under-utilisation is bad not only because it
wastes memory but also because if the system is under memory pressure
higher order allocations may become difficult to satisfy.  If we can
defrag slab caches we can alleviate these problems.

Implement Slab Movable Objects in order to defragment slab caches.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

	slabinfo <cache> -s

3. Conditionally via the use of kmem_defrag_slabs().

- Use Slab Movable Objects when shrinking cache.

Currently when the kernel calls kmem_cache_shrink() we curate the
partial slabs list.  We still do this if object migration is not enabled
for the cache.  If, however, SMO is enabled we attempt to move objects in
partially full slabs in order to defragment the cache.  Shrink attempts
to move all objects in order to reduce the cache to a single partial
slab for each node.

- Add conditional per node defrag via new function:

	kmem_defrag_slabs(int node).

kmem_defrag_slabs() attempts to defragment all slab caches for the node.
Defragmentation is done conditionally, dependent on MAX_PARTIAL _AND_
defrag_used_ratio.

   Caches are only considered for defragmentation if the number of
   partial slabs exceeds MAX_PARTIAL (per node).

   Also, defragmentation only occurs if the usage ratio of the slab is
   lower than the configured percentage (sysfs field added in this
   patch).  Fragmentation ratios are measured by calculating the
   percentage of objects in use compared to the total number of objects
   that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_defrag_slabs() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting the defrag ratio to 0:

	echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

- Add a defrag ratio sysfs field and set it to 30% by default. A limit
of 30% specifies that more than 3 out of 10 available slots for objects
need to be in use, otherwise slab defragmentation will be attempted on
the remaining objects.
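
The two conditions can be written out as plain arithmetic.  This is a
sketch under stated assumptions, not the code in mm/slub.c: the helper
names are hypothetical, the MAX_PARTIAL limit is passed in as a
parameter, and a strict "less than" comparison is assumed for the ratio
check (following the ABI text), so the exact boundary behaviour of the
real implementation may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Percentage of object slots in a slab page that are in use. */
static unsigned int slab_used_percent(unsigned int inuse, unsigned int objects)
{
	assert(objects > 0 && inuse <= objects);
	return inuse * 100 / objects;
}

/*
 * A cache is only considered if its per-node partial list is longer
 * than the limit (MAX_PARTIAL in the description above).
 */
static bool cache_is_candidate(unsigned long nr_partial,
			       unsigned long max_partial)
{
	return nr_partial > max_partial;
}

/*
 * A slab page is worth defragmenting if its usage is below the
 * configured ratio.  A ratio of 0 disables defragmentation.
 */
static bool slab_wants_defrag(unsigned int inuse, unsigned int objects,
			      unsigned int defrag_used_ratio)
{
	if (defrag_used_ratio == 0)
		return false;
	return slab_used_percent(inuse, objects) < defrag_used_ratio;
}
```

With the default ratio of 30, a slab page holding 2 of 10 objects is
selected under this model while one holding 4 of 10 is left alone.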

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

	void kmem_cache_setup_mobility(struct kmem_cache *,
				       kmem_cache_isolate_func,
			               kmem_cache_migrate_func);

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 include/linux/slab.h                        |   1 +
 include/linux/slub_def.h                    |   7 +
 mm/slub.c                                   | 385 ++++++++++++++++----
 4 files changed, 334 insertions(+), 73 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..7770c03be6b4 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,20 @@ Description:
 		list.  It can be written to clear the current count.
 		Available when CONFIG_SLUB_STATS is enabled.
 
+What:		/sys/kernel/slab/cache/defrag_used_ratio
+Date:		February 2019
+KernelVersion:	5.0
+Contact:	Christoph Lameter <cl@linux-foundation.org>
+		Pekka Enberg <penberg@cs.helsinki.fi>,
+Description:
+		The defrag_used_ratio file allows the control of how aggressive
+		slab fragmentation reduction works at reclaiming objects from
+		sparsely populated slabs. This is a percentage. If a slab has
+		less than this percentage of objects allocated then reclaim will
+		attempt to reclaim objects so that the whole slab page can be
+		freed. 0% specifies no reclaim attempt (defrag disabled), 100%
+		specifies attempt to reclaim all pages.  The default is 30%.
+
 What:		/sys/kernel/slab/cache/deactivate_to_tail
 Date:		February 2008
 KernelVersion:	2.6.25
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 886fc130334d..4bf381b34829 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 			void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_defrag_slabs(int node);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2879a2f5f8eb..34c6f1250652 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,13 @@ struct kmem_cache {
 	unsigned int red_left_pad;	/* Left redzone padding size */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+	int defrag_used_ratio;	/*
+				 * Ratio used to check against the
+				 * percentage of objects allocated in a
+				 * slab page.  If less than this ratio
+				 * is allocated then reclaim attempts
+				 * are made.
+				 */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 	struct work_struct kobj_remove_work;
diff --git a/mm/slub.c b/mm/slub.c
index f6b0e4a395ef..e601c804ed79 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -354,6 +354,12 @@ static __always_inline void slab_lock(struct page *page)
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
+static __always_inline int slab_trylock(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	return bit_spin_trylock(PG_locked, &page->flags);
+}
+
 static __always_inline void slab_unlock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
@@ -3643,6 +3649,7 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 
 	set_cpu_partial(s);
 
+	s->defrag_used_ratio = 30;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -3959,79 +3966,6 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
-#define SHRINK_PROMOTE_MAX 32
-
-/*
- * kmem_cache_shrink discards empty slabs and promotes the slabs filled
- * up most to the head of the partial lists. New allocations will then
- * fill those up and thus they can be removed from the partial lists.
- *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
- */
-int __kmem_cache_shrink(struct kmem_cache *s)
-{
-	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	struct list_head discard;
-	struct list_head promote[SHRINK_PROMOTE_MAX];
-	unsigned long flags;
-	int ret = 0;
-
-	flush_all(s);
-	for_each_kmem_cache_node(s, node, n) {
-		INIT_LIST_HEAD(&discard);
-		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
-			INIT_LIST_HEAD(promote + i);
-
-		spin_lock_irqsave(&n->list_lock, flags);
-
-		/*
-		 * Build lists of slabs to discard or promote.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			int free = page->objects - page->inuse;
-
-			/* Do not reread page->inuse */
-			barrier();
-
-			/* We do not keep full slabs on the list */
-			BUG_ON(free <= 0);
-
-			if (free == page->objects) {
-				list_move(&page->lru, &discard);
-				n->nr_partial--;
-			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->lru, promote + free - 1);
-		}
-
-		/*
-		 * Promote the slabs filled up most to the head of the
-		 * partial list.
-		 */
-		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
-			list_splice(promote + i, &n->partial);
-
-		spin_unlock_irqrestore(&n->list_lock, flags);
-
-		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, lru)
-			discard_slab(s, page);
-
-		if (slabs_node(s, node))
-			ret = 1;
-	}
-
-	return ret;
-}
-
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
@@ -4326,6 +4260,287 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	return err;
 }
 
+/*
+ * Allocate a slab scratch space that is sufficient to keep pointers to
+ * individual objects for all objects in cache and also a bitmap for the
+ * objects (used to mark which objects are active).
+ */
+static inline void *alloc_scratch(struct kmem_cache *s)
+{
+	unsigned int size = oo_objects(s->max);
+
+	return kmalloc(size * sizeof(void *) +
+		       BITS_TO_LONGS(size) * sizeof(unsigned long),
+		       GFP_KERNEL);
+}
+
+/*
+ * move_slab_page() - Move all objects in the given slab.
+ * @page: The slab we are working on.
+ * @scratch: Pointer to scratch space.
+ * @node: The target node to move objects to.
+ *
+ * If the target node is not the current node then the object is moved
+ * to the target node.  If the target node is the current node then this
+ * is an effective way of defragmentation since the current slab page
+ * with its objects is exempt from allocation.
+ */
+static void move_slab_page(struct page *page, void *scratch, int node)
+{
+	unsigned long objects;
+	struct kmem_cache *s;
+	unsigned long flags;
+	unsigned long *map;
+	void *private;
+	int count;
+	void *p;
+	void **vector = scratch;
+	void *addr = page_address(page);
+
+	local_irq_save(flags);
+	slab_lock(page);
+
+	BUG_ON(!PageSlab(page)); /* Must be a slab page */
+	BUG_ON(!page->frozen);	 /* Slab must have been frozen earlier */
+
+	s = page->slab_cache;
+	objects = page->objects;
+	map = scratch + objects * sizeof(void **);
+
+	/* Determine used objects */
+	bitmap_fill(map, objects);
+	for (p = page->freelist; p; p = get_freepointer(s, p))
+		__clear_bit(slab_index(p, s, addr), map);
+
+	/* Build vector of pointers to objects */
+	count = 0;
+	memset(vector, 0, objects * sizeof(void **));
+	for_each_object(p, s, addr, objects)
+		if (test_bit(slab_index(p, s, addr), map))
+			vector[count++] = p;
+
+	if (s->isolate)
+		private = s->isolate(s, vector, count);
+	else
+		/* Objects do not need to be isolated */
+		private = NULL;
+
+	/*
+	 * Pinned the objects. Now we can drop the slab lock. The slab
+	 * is frozen so it cannot vanish from under us nor will
+	 * allocations be performed on the slab. However, unlocking the
+	 * slab will allow concurrent slab_frees to proceed. So the
+	 * subsystem must have a way to tell from the content of the
+	 * object that it was freed.
+	 *
+	 * If neither RCU nor ctor is being used then the object may be
+	 * modified by the allocator after being freed which may disrupt
+	 * the ability of the migrate function to tell if the object is
+	 * free or not.
+	 */
+	slab_unlock(page);
+	local_irq_restore(flags);
+
+	/* Perform callback to move the objects */
+	s->migrate(s, vector, count, node, private);
+}
+
+/*
+ * kmem_cache_defrag() - Defragment node.
+ * @s: cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @ratio: The defrag ratio (percentage, between 0 and 100).
+ *
+ * Release slabs with zero objects and try to call the migration function
+ * for slabs with less than the 'ratio' percentage of objects allocated.
+ *
+ * Moved objects are allocated on @target_node.
+ *
+ * Return: The number of partial slabs left on @node after the
+ *         operation.
+ */
+static unsigned long kmem_cache_defrag(struct kmem_cache *s,
+				       int node, int target_node, int ratio)
+{
+	struct kmem_cache_node *n = get_node(s, node);
+	struct page *page, *page2;
+	LIST_HEAD(move_list);
+	unsigned long flags;
+
+	if (node == target_node && n->nr_partial <= 1) {
+		/*
+		 * Trying to reduce fragmentation on a node but there is
+		 * only a single or no partial slab page. This is already
+		 * the optimal object density that we can reach.
+		 */
+		return n->nr_partial;
+	}
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
+			continue;
+
+		if (page->inuse) {
+			if (page->inuse > ratio * page->objects / 100) {
+				slab_unlock(page);
+				/*
+				 * Skip slab because the object density
+				 * in the slab page is high enough.
+				 */
+				continue;
+			}
+
+			list_move(&page->lru, &move_list);
+			if (s->migrate) {
+				/* Stop page being considered for allocations */
+				n->nr_partial--;
+				page->frozen = 1;
+			}
+			slab_unlock(page);
+		} else {	/* Empty slab page */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+		}
+	}
+
+	if (!s->migrate) {
+		/*
+		 * No defrag method. By simply putting the move_list at
+		 * the end of the partial list we can let them simmer
+		 * longer and thus increase the chance of all objects
+		 * being reclaimed.
+		 */
+		list_splice(&move_list, n->partial.prev);
+	}
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	if (s->migrate && !list_empty(&move_list)) {
+		void **scratch = alloc_scratch(s);
+		if (scratch) {
+			/* Try to remove / move the objects left */
+			list_for_each_entry(page, &move_list, lru) {
+				if (page->inuse)
+					move_slab_page(page, scratch, target_node);
+			}
+			kfree(scratch);
+		}
+
+		/* Inspect results and dispose of pages */
+		spin_lock_irqsave(&n->list_lock, flags);
+		list_for_each_entry_safe(page, page2, &move_list, lru) {
+			list_del(&page->lru);
+			slab_lock(page);
+			page->frozen = 0;
+
+			if (page->inuse) {
+				/*
+				 * Objects left in slab page, move it to the
+				 * tail of the partial list to increase the
+				 * chance that the freeing of the remaining
+				 * objects will free the slab page.
+				 */
+				n->nr_partial++;
+				list_add_tail(&page->lru, &n->partial);
+				slab_unlock(page);
+			} else {
+				slab_unlock(page);
+				discard_slab(s, page);
+			}
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+
+	return n->nr_partial;
+}
+
+/**
+ * kmem_defrag_slabs() - Defrag slab caches.
+ * @node: The node to defrag or -1 for all nodes.
+ *
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ *
+ * Return: The total number of partial slabs in migratable caches left
+ *         on @node after the operation.
+ */
+unsigned long kmem_defrag_slabs(int node)
+{
+	struct kmem_cache *s;
+	unsigned long left = 0;
+	int nid;
+
+	if (node >= MAX_NUMNODES)
+		return -EINVAL;
+
+	/*
+	 * kmem_defrag_slabs() may be called from the reclaim path which
+	 * may be called for any page allocator alloc. So there is the
+	 * danger that we get called in a situation where slub already
+	 * acquired the slab_mutex for other purposes.
+	 */
+	if (!mutex_trylock(&slab_mutex))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * Defragmentable caches come first. If the slab cache is
+		 * not defragmentable then we can stop traversing the list.
+		 */
+		if (!s->migrate)
+			break;
+
+		if (node >= 0) {
+			if (s->node[node]->nr_partial > MAX_PARTIAL) {
+				left += kmem_cache_defrag(s, node, node,
+							  s->defrag_used_ratio);
+			}
+			continue;
+		}
+
+		for_each_node_state(nid, N_NORMAL_MEMORY) {
+			if (s->node[nid]->nr_partial > MAX_PARTIAL) {
+				left += kmem_cache_defrag(s, nid, nid,
+							  s->defrag_used_ratio);
+			}
+		}
+	}
+	mutex_unlock(&slab_mutex);
+	return left;
+}
+EXPORT_SYMBOL(kmem_defrag_slabs);
+
+/**
+ * __kmem_cache_shrink() - Shrink a cache.
+ * @s: The cache to shrink.
+ *
+ * Reduces the memory footprint of a slab cache by as much as possible.
+ *
+ * This works by:
+ *  1. Removing empty slabs from the partial list.
+ *  2. Migrating slab objects to denser slab pages if the slab cache
+ *  supports migration.  If not, reorganizing the partial list so that
+ *  more densely allocated slab pages come first.
+ *
+ * Not called directly, called by kmem_cache_shrink().
+ */
+int __kmem_cache_shrink(struct kmem_cache *s)
+{
+	int node;
+	int left = 0;
+
+	flush_all(s);
+	for_each_node_state(node, N_NORMAL_MEMORY)
+		left += kmem_cache_defrag(s, node, node, 100);
+
+	return left;
+}
+EXPORT_SYMBOL(__kmem_cache_shrink);
+
 void kmem_cache_setup_mobility(struct kmem_cache *s,
 			       kmem_cache_isolate_func isolate,
 			       kmem_cache_migrate_func migrate)
@@ -5177,6 +5392,29 @@ static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
 }
 SLAB_ATTR_RO(destroy_by_rcu);
 
+static ssize_t defrag_used_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->defrag_used_ratio);
+}
+
+static ssize_t defrag_used_ratio_store(struct kmem_cache *s,
+				       const char *buf, size_t length)
+{
+	unsigned long ratio;
+	int err;
+
+	err = kstrtoul(buf, 10, &ratio);
+	if (err)
+		return err;
+
+	if (ratio > 100)
+		return -EINVAL;
+
+	s->defrag_used_ratio = ratio;
+	return length;
+}
+SLAB_ATTR(defrag_used_ratio);
+
 #ifdef CONFIG_SLUB_DEBUG
 static ssize_t slabs_show(struct kmem_cache *s, char *buf)
 {
@@ -5501,6 +5739,7 @@ static struct attribute *slab_attrs[] = {
 	&validate_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&defrag_used_ratio_attr.attr,
 #endif
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
-- 
2.21.0



* [RFC PATCH v3 05/15] tools/vm/slabinfo: Add remote node defrag ratio output
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (3 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 04/15] slub: Slab defrag core Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 06/15] tools/vm/slabinfo: Add defrag_used_ratio output Tobin C. Harding
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Add output line for NUMA remote node defrag ratio.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/vm/slabinfo.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index cbfc56c44c2f..d2c22f9ee2d8 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
 	unsigned int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
 	int movable, ctor;
+	int remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
 	unsigned long free_fastpath, free_slowpath;
@@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode)
 	if (skip_zero && !s->slabs)
 		return;
 
+	if (mode) {
+		printf("\nNUMA remote node defrag ratio: %3d\n",
+		       s->remote_node_defrag_ratio);
+	}
 	if (!line) {
 		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
 		for(node = 0; node <= highest_node; node++)
@@ -1272,6 +1277,8 @@ static void read_slab_dir(void)
 			slab->cpu_partial_free = get_obj("cpu_partial_free");
 			slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
+			slab->remote_node_defrag_ratio =
+					get_obj("remote_node_defrag_ratio");
 			chdir("..");
 			if (read_slab_obj(slab, "ops")) {
 				if (strstr(buffer, "ctor :"))
-- 
2.21.0



* [RFC PATCH v3 06/15] tools/vm/slabinfo: Add defrag_used_ratio output
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (4 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 05/15] tools/vm/slabinfo: Add remote node defrag ratio output Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 07/15] tools/testing/slab: Add object migration test module Tobin C. Harding
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Add output for the newly added defrag_used_ratio sysfs knob.
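
As a rough illustration of what the reported percentage means, the
sketch below mirrors the skip condition used by the slab defrag core
patch earlier in this series (inuse > ratio * objects / 100).  The
function name is hypothetical and this is a userspace approximation,
not the kernel code itself.

```c
#include <assert.h>

/*
 * Returns 1 when a partial slab is sparse enough to be considered for
 * defragmentation.  Mirrors the kernel's skip test:
 *	inuse > ratio * objects / 100  =>  slab is dense enough, skip it
 * is_defrag_candidate() is a hypothetical name used for illustration.
 */
static int is_defrag_candidate(unsigned int inuse, unsigned int objects,
			       int ratio)
{
	assert(ratio >= 0 && ratio <= 100);

	if (inuse == 0)
		return 0;	/* empty slabs are discarded, not migrated */

	return inuse <= (unsigned int)ratio * objects / 100;
}
```

With the default ratio of 30 and 20 objects per slab, a slab holding up
to 6 objects would be a candidate and one holding 7 or more would be
skipped.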

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/vm/slabinfo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
 	unsigned int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
 	int movable, ctor;
+	int defrag_used_ratio;
 	int remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->movable)
+		printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
 			slab->remote_node_defrag_ratio =
 					get_obj("remote_node_defrag_ratio");
+			slab->defrag_used_ratio = get_obj("defrag_used_ratio");
 			chdir("..");
 			if (read_slab_obj(slab, "ops")) {
 				if (strstr(buffer, "ctor :"))
-- 
2.21.0



* [RFC PATCH v3 07/15] tools/testing/slab: Add object migration test module
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (5 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 06/15] tools/vm/slabinfo: Add defrag_used_ratio output Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 08/15] tools/testing/slab: Add object migration test suite Tobin C. Harding
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

We just implemented slab movable objects for the SLUB allocator.  We
should test that code.  In order to do so we need to be able to do a
number of things:

 - Create a cache
 - Enable Slab Movable Objects for the cache
 - Allocate objects to the cache
 - Free objects from within specific slabs of the cache

We can do all this via a loadable module.

Add a module that defines functions that can be triggered from userspace
via a debugfs entry. From the source:

  /*
   * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
   *
   * This module is used for testing the SLUB allocator.  Enables
   * userspace to run kernel functions via a debugfs file.
   *
   *   debugfs: /sys/kernel/debug/smo/callfn (write only)
   *
   * String written to `callfn` is parsed by the module and associated
   * function is called.  See fn_tab for mapping of strings to functions.
   */

References to allocated objects are kept by the module in a linked list
so that userspace can control which object to free.

We introduce the following four functions via the function table:

  "enable": Enables object migration for the test cache.
  "alloc X": Allocates X objects
  "free X [Y]": Frees X objects starting at list position Y (default Y==0)
  "test": Runs [stress] tests from within the module (see below).

       {"enable", smo_enable_cache_mobility},
       {"alloc", smo_alloc_objects},
       {"free", smo_free_object},
       {"test", smo_run_module_tests},
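
The dispatch idea can be sketched in userspace C as follows.  This is
only an approximation under stated assumptions: the handlers do_alloc()
and do_free() are hypothetical stand-ins, and the sketch matches whole
command words with strcmp(), which may differ from how the module
itself matches commands.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEFAULT_ARG -1	/* stands in for "argument not supplied" */

struct command {
	const char *name;
	int (*fn)(int arg0, int arg1);
};

/* Hypothetical handler: reports how many objects would be allocated. */
static int do_alloc(int nr, int node)
{
	(void)node;
	return nr == DEFAULT_ARG ? 1 : nr;
}

/* Hypothetical handler: reports the list position freeing starts at. */
static int do_free(int nr, int pos)
{
	(void)nr;
	return pos == DEFAULT_ARG ? 0 : pos;
}

static const struct command table[] = {
	{"alloc", do_alloc},
	{"free", do_free},
};

/*
 * Match the leading word of @line against the table, parse up to two
 * optional decimal arguments and call the handler.  Returns the
 * handler's result, or -1 when no command matches.
 */
static int dispatch(const char *line)
{
	char name[16];
	int arg0 = DEFAULT_ARG, arg1 = DEFAULT_ARG;
	size_t i;

	assert(line != NULL);

	/* Arguments that sscanf() does not reach keep their defaults */
	if (sscanf(line, "%15s %d %d", name, &arg0, &arg1) < 1)
		return -1;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(name, table[i].name) == 0)
			return table[i].fn(arg0, arg1);
	}
	return -1;
}
```

Calling dispatch("alloc 21") runs the alloc handler with the second
argument left at its default, which is the same shape as writing
'alloc 21' to callfn.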

Freeing from the start of the list creates a hole in the slab being
freed from (i.e. creates a partial slab).  The results of running these
commands can be seen using `slabinfo` (available in tools/vm/):

	gcc -o slabinfo tools/vm/slabinfo.c

Stress tests can be run from within the module.  These tests are
internal to the module because we verify that object references are
still good after object migration.  These are called 'stress' tests
because it is intended that they create/free a lot of objects.
Userspace can control the number of objects to create, default is 1000.

Example test session
--------------------

Relevant /proc/slabinfo column headers:

  name   <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

  # mount -t debugfs none /sys/kernel/debug/
  $ cd path/to/linux/tools/testing/slab; make
  ...

  # insmod slub_defrag.ko
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test               0      0    392   20    2

From this we can see that the module created cache 'smo_test' with 20
objects per slab and 2 pages per slab (and cache is currently empty).

We can play with the slab allocator manually:

  # cd /sys/kernel/debug/smo
  # echo 'alloc 21' > callfn
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test              21     40    392   20    2

We see here that 21 active objects have been allocated creating 2
slabs (40 total objects).
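
The arithmetic behind those numbers is a round-up division.  A minimal
sketch, assuming the cache starts empty and slabs are filled one after
another (helper names are hypothetical; a live system may also hold
per-cpu slabs that change the picture):

```c
#include <assert.h>

/* Number of slabs needed to hold @nr_objs objects (rounds up). */
static unsigned int slabs_needed(unsigned int nr_objs,
				 unsigned int objs_per_slab)
{
	assert(objs_per_slab > 0);
	return (nr_objs + objs_per_slab - 1) / objs_per_slab;
}

/* Total object slots in those slabs, i.e. the <num_objs> column. */
static unsigned int total_slots(unsigned int nr_objs,
				unsigned int objs_per_slab)
{
	return slabs_needed(nr_objs, objs_per_slab) * objs_per_slab;
}
```

For 21 objects at 20 objects per slab this gives 2 slabs and 40 slots,
matching the /proc/slabinfo line above.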

  # slabinfo smo_test --report

  Slabcache: smo_test         Aliases:  0 Order :  1 Objects: 21

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
  SlabObj:     392  Full   :       1   Redzoning     : On   Used :    1176
  SlabSiz:    8192  Partial:       1   Poisoning     : On   Loss :   15208
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    7056
  Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Now free an object from the first slot of the first slab:

  # echo 'free 1' > callfn
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test              20     40    392   20    2

  # slabinfo smo_test --report

  Slabcache: smo_test         Aliases:  0 Order :  1 Objects: 20

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :      56  Total  :       2   Sanity Checks : On   Total:   16384
  SlabObj:     392  Full   :       0   Redzoning     : On   Used :    1120
  SlabSiz:    8192  Partial:       2   Poisoning     : On   Loss :   15264
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    6720
  Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     704

Calling shrink now on the cache does nothing because object migration is
not enabled (output omitted).  If we enable object migration and then
shrink the cache, we expect the object from the second slab to be moved
to the first slot in the first slab and the second slab to be removed
from the partial list.

  # echo 'enable' > callfn
  # slabinfo smo_test --shrink
  # slabinfo smo_test --report

  Slabcache: smo_test         Aliases:  0 Order :  1 Objects: 20
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :      56  Total  :       1   Sanity Checks : On   Total:    8192
  SlabObj:     392  Full   :       1   Redzoning     : On   Used :    1120
  SlabSiz:    8192  Partial:       0   Poisoning     : On   Loss :    7072
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig:    6720
  Align  :       8  Objects:      20   Tracing       : Off  Lpadd:     352

We can run the stress tests (with the default number of objects):

  # cd /sys/kernel/debug/smo
  # echo 'test' > callfn
  [    3.576617] smo: test using nr_objs: 1000 keep: 10
  [    3.580169] smo: Module tests completed successfully
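
The `keep: 10` in the log means the test keeps one object out of every
10 and frees the rest, using an `i % keep == 0` check.  A small
userspace sketch of which IDs survive (function names are hypothetical,
IDs are assumed to run from 0 to nr_objs - 1 as in the module):

```c
#include <assert.h>

/* Whether the object with @id is kept rather than freed. */
static int is_kept(int id, int keep)
{
	assert(keep > 0);
	return id % keep == 0;
}

/* Number of objects left after freeing all but 1 in every @keep. */
static int survivors(int nr_objs, int keep)
{
	int i, kept = 0;

	for (i = 0; i < nr_objs; i++) {
		if (is_kept(i, keep))
			kept++;
	}
	return kept;
}
```

With the defaults (1000 objects, keep 10) this leaves 100 objects with
IDs 0, 10, 20, ... 990 spread thinly across the slabs, which is the
fragmented state the shrink is then expected to compact.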

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/testing/slab/Makefile      |  10 +
 tools/testing/slab/slub_defrag.c | 566 +++++++++++++++++++++++++++++++
 2 files changed, 576 insertions(+)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 tools/testing/slab/slub_defrag.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
new file mode 100644
index 000000000000..440c2e3e356f
--- /dev/null
+++ b/tools/testing/slab/Makefile
@@ -0,0 +1,10 @@
+obj-m += slub_defrag.o
+
+KTREE=../../..
+
+all:
+	make -C ${KTREE} M=$(PWD) modules
+
+clean:
+	make -C ${KTREE} M=$(PWD) clean
+
diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
new file mode 100644
index 000000000000..4a5c24394b96
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/list.h>
+#include <linux/gfp.h>
+#include <linux/debugfs.h>
+#include <linux/numa.h>
+
+/*
+ * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
+ *
+ * This module is used for testing the SLUB allocator.  Enables
+ * userspace to run kernel functions via a debugfs file.
+ *
+ *   debugfs: /sys/kernel/debug/smo/callfn (write only)
+ *
+ * String written to `callfn` is parsed by the module and associated
+ * function is called.  See fn_tab for mapping of strings to functions.
+ */
+
+/* debugfs commands accept two optional arguments */
+#define SMO_CMD_DEFAUT_ARG -1
+
+#define SMO_DEBUGFS_DIR "smo"
+struct dentry *smo_debugfs_root;
+
+#define SMO_CACHE_NAME "smo_test"
+static struct kmem_cache *cachep;
+
+struct smo_slub_object {
+	struct list_head list;
+	char buf[32];		/* Unused except to control size of object */
+	long id;
+};
+
+/* Our list of allocated objects */
+LIST_HEAD(objects);
+
+static void list_add_to_objects(struct smo_slub_object *so)
+{
+	/*
+	 * We free from the front of the list so store at the
+	 * tail in order to put holes in the cache when we free.
+	 */
+	list_add_tail(&so->list, &objects);
+}
+
+/**
+ * smo_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smo_object_ctor(void *ptr)
+{
+	struct smo_slub_object *so = ptr;
+
+	INIT_LIST_HEAD(&so->list);
+	memset(so->buf, 0, sizeof(so->buf));
+	so->id = -1;
+}
+
+/**
+ * smo_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smo_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+		       int node, void *private)
+{
+	struct smo_slub_object **so_objs = (struct smo_slub_object **)objs;
+	struct smo_slub_object *so_old, *so_new;
+	int i;
+
+	for (i = 0; i < size; i++) {
+		so_old = so_objs[i];
+
+		so_new = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+		if (!so_new) {
+			pr_debug("kmem_cache_alloc failed\n");
+			return;
+		}
+
+		/* Copy object */
+		so_new->id = so_old->id;
+
+		/* Update references to old object */
+		list_del(&so_old->list);
+		list_add_to_objects(so_new);
+
+		kmem_cache_free(cachep, so_old);
+	}
+}
+
+static int smo_enable_cache_mobility(int _unused, int __unused)
+{
+	/* Enable movable objects: BOOM! */
+	kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+	pr_info("smo: kmem_cache %s defrag enabled\n", SMO_CACHE_NAME);
+	return 0;
+}
+
+/*
+ * smo_alloc_objects() - Allocate objects and store reference.
+ * @nr_objs: Number of objects to allocate.
+ * @node: NUMA node to allocate objects on.
+ *
+ * Allocates @nr_objs smo_slub_objects.  Stores a reference to them in
+ * the global list of objects (at the tail of the list).
+ *
+ * Return: The number of objects allocated.
+ */
+static int smo_alloc_objects(int nr_objs, int node)
+{
+	struct smo_slub_object *so;
+	int i;
+
+	/* Set sane parameters if no args passed in */
+	if (nr_objs == SMO_CMD_DEFAUT_ARG)
+		nr_objs = 1;
+	if (node == SMO_CMD_DEFAUT_ARG)
+		node = NUMA_NO_NODE;
+
+	for (i = 0; i < nr_objs; i++) {
+		if (node == NUMA_NO_NODE)
+			so = kmem_cache_alloc(cachep, GFP_KERNEL);
+		else
+			so = kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
+		if (!so) {
+			pr_err("smo: Failed to alloc object %d of %d\n", i, nr_objs);
+			return i;
+		}
+		list_add_to_objects(so);
+	}
+	return nr_objs;
+}
+
+/*
+ * smo_free_object() - Frees objects starting at a list position.
+ * @nr_objs: Number of objects to free.
+ * @pos: Position in global list to start freeing.
+ *
+ * Iterates over the global list to position @pos then frees @nr_objs objects
+ * from there (or to end of list).  Does nothing if @pos >= list length.
+ *
+ * Calling with @nr_objs==0 frees all objects starting at @pos.
+ *
+ * Return: Number of objects freed.
+ */
+static int smo_free_object(int nr_objs, int pos)
+{
+	struct smo_slub_object *cur, *tmp;
+	int deleted = 0;
+	int i = 0;
+
+	/* Set sane parameters if no args passed in */
+	if (nr_objs == SMO_CMD_DEFAUT_ARG)
+		nr_objs = 1;
+	if (pos == SMO_CMD_DEFAUT_ARG)
+		pos = 0;
+
+	list_for_each_entry_safe(cur, tmp, &objects, list) {
+		if (i < pos) {
+			i++;
+			continue;
+		}
+
+		list_del(&cur->list);
+		kmem_cache_free(cachep, cur);
+		deleted++;
+		if (deleted == nr_objs)
+			break;
+	}
+	return deleted;
+}
+
+static int index_for_expected_id(long *expected, int size, long id)
+{
+	int i;
+
+	/* Array is unsorted, just iterate the whole thing */
+	for (i = 0; i < size; i++) {
+		if (expected[i] == id)
+			return i;
+	}
+	return -1;		/* Not found */
+}
+
+static int assert_have_objects(int nr_objs, int keep)
+{
+	struct smo_slub_object *cur;
+	long *expected;		/* Array of expected IDs */
+	int nr_ids;		/* Length of array */
+	long id;
+	int index, i;
+
+	nr_ids = nr_objs / keep + 1;
+
+	expected = kmalloc_array(nr_ids, sizeof(long), GFP_KERNEL);
+	if (!expected)
+		return -ENOMEM;
+
+	id = 0;
+	for (i = 0; i < nr_ids; i++) {
+		expected[i] = id;
+		id += keep;
+	}
+
+	list_for_each_entry(cur, &objects, list) {
+		index = index_for_expected_id(expected, nr_ids, cur->id);
+		if (index < 0) {
+			pr_err("smo: ID not found: %ld\n", cur->id);
+			return -1;
+		}
+
+		if (expected[index] == -1) {
+			pr_err("smo: ID already encountered: %ld\n", cur->id);
+			return -1;
+		}
+		expected[index] = -1;
+	}
+	return 0;
+}
+
+/*
+ * smo_run_module_tests() - Runs unit tests from within the module
+ * @nr_objs: Number of objects to allocate.
+ * @keep: Free all but 1 in @keep objects.
+ *
+ * Allocates @nr_objs then iterates over the allocated objects
+ * freeing all but 1 out of every @keep objects i.e. for @keep==10
+ * keeps the first object then frees the next 9.
+ *
+ * Caller is responsible for ensuring that the cache has at most a
+ * single slab on the partial list without any objects in it.  This is
+ * easy enough to ensure, just call this when the module is freshly
+ * loaded.
+ */
+static int smo_run_module_tests(int nr_objs, int keep)
+{
+	struct smo_slub_object *so;
+	struct smo_slub_object *cur, *tmp;
+	long i;
+
+	if (!list_empty(&objects)) {
+		pr_err("smo: test requires clean module state\n");
+		return -1;
+	}
+
+	/* Set sane parameters if no args passed in */
+	if (nr_objs == SMO_CMD_DEFAUT_ARG)
+		nr_objs = 1000;
+	if (keep == SMO_CMD_DEFAUT_ARG)
+		keep = 10;
+
+	pr_info("smo: test using nr_objs: %d keep: %d\n", nr_objs, keep);
+
+	/* Perhaps we got called like this 'test 1000' */
+	if (keep == 0) {
+		pr_err("Usage: test <nr_objs> <keep>\n");
+		return -1;
+	}
+
+	/* Test constructor */
+	so = kmem_cache_alloc(cachep, GFP_KERNEL);
+	if (!so) {
+		pr_err("smo: Failed to alloc object\n");
+		return -1;
+	}
+	if (so->id != -1) {
+		pr_err("smo: Initial state incorrect\n");
+		return -1;
+	}
+	kmem_cache_free(cachep, so);
+
+	/*
+	 * Test that object migration is correctly implemented by module
+	 *
+	 * This gives us confidence that if new code correctly enables
+	 * object migration (via correct implementation of migrate and
+	 * isolate functions) then the slub allocator code that does
+	 * object migration is correct.
+	 */
+
+	for (i = 0; i < nr_objs; i++) {
+		so = kmem_cache_alloc(cachep, GFP_KERNEL);
+		if (!so) {
+			pr_err("smo: Failed to alloc object %ld of %d\n",
+			       i, nr_objs);
+			return -1;
+		}
+		so->id = (long)i;
+		list_add_to_objects(so);
+	}
+
+	assert_have_objects(nr_objs, 1);
+
+	i = 0;
+	list_for_each_entry_safe(cur, tmp, &objects, list) {
+		if (i++ % keep == 0)
+			continue;
+
+		list_del(&cur->list);
+		kmem_cache_free(cachep, cur);
+	}
+
+	/* Verify shrink does nothing when migration is not enabled */
+	kmem_cache_shrink(cachep);
+	assert_have_objects(nr_objs, 1);
+
+	/* Now test shrink */
+	kmem_cache_setup_mobility(cachep, NULL, smo_cache_migrate);
+	kmem_cache_shrink(cachep);
+	/*
+	 * Because of how migrate function deletes and adds objects to
+	 * the objects list we have no way of knowing the order.  We
+	 * want to confirm that we have all the objects after shrink
+	 * that we had before we did the shrink.
+	 */
+	assert_have_objects(nr_objs, keep);
+
+	/* cleanup */
+	list_for_each_entry_safe(cur, tmp, &objects, list) {
+		list_del(&cur->list);
+		kmem_cache_free(cachep, cur);
+	}
+	kmem_cache_shrink(cachep); /* Remove empty slabs from partial list */
+
+	pr_info("smo: Module tests completed successfully\n");
+	return 0;
+}
+
+/*
+ * struct functions - Map command to a function pointer.
+ */
+struct functions {
+	char *fn_name;
+	int (*fn_ptr)(int arg0, int arg1);
+} fn_tab[] = {
+	/*
+	 * Because of the way we parse the function table no command
+	 * may have another command as its prefix.
+	 *  i.e. this will break: 'foo'  and 'foobar'
+	 */
+	{"enable", smo_enable_cache_mobility},
+	{"alloc", smo_alloc_objects},
+	{"free", smo_free_object},
+	{"test", smo_run_module_tests},
+};
+
+#define FN_TAB_SIZE (sizeof(fn_tab) / sizeof(struct functions))
+
+/*
+ * parse_cmd_buf() - Gets command and arguments from the command string.
+ * @buf: Buffer containing the command string.
+ * @cmd: Out parameter, pointer to the command.
+ * @arg1: Out parameter, stores the first argument.
+ * @arg2: Out parameter, stores the second argument.
+ *
+ * Parses and tokenizes the input command buffer. Stores a pointer to the
+ * command (start of @buf) in @cmd.  Stores the converted long values for
+ * argument 1 and 2 in the respective out parameters @arg1 and @arg2.
+ *
+ * Since arguments are optional, if they are not found the default values are
+ * returned.  In order for the caller to differentiate defaults from arguments
+ * of the same value the number of arguments parsed is returned.
+ *
+ * Return: Number of arguments found.
+ */
+static int parse_cmd_buf(char *buf, char **cmd, long *arg1, long *arg2)
+{
+	int found;
+	char *ptr;
+	int ret;
+
+	*arg1 = SMO_CMD_DEFAUT_ARG;
+	*arg2 = SMO_CMD_DEFAUT_ARG;
+	found = 0;
+
+	/* Jump over the command, check if there are any args */
+	ptr = strsep(&buf, " ");
+	if (!ptr || !buf)
+		return found;
+
+	ptr = strsep(&buf, " ");
+	ret = kstrtol(ptr, 10, arg1);
+	if (ret < 0) {
+		pr_err("failed to convert arg, defaulting to %d. (%s)\n",
+		       SMO_CMD_DEFAUT_ARG, ptr);
+		return found;
+	}
+	found++;
+	if (!buf)		/* No second arg */
+		return found;
+
+	ptr = strsep(&buf, " ");
+	ret = kstrtol(ptr, 10, arg2);
+	if (ret < 0) {
+		pr_err("failed to convert arg, defaulting to %d. (%s)\n",
+		       SMO_CMD_DEFAUT_ARG, ptr);
+		return found;
+	}
+	found++;
+
+	return found;
+}
+
+/*
+ * call_function() - Calls the function described by str.
+ * @str: '<cmd> [<arg>]'
+ *
+ * Does table lookup on <cmd>, calls the appropriate function passing
+ * <arg> as the argument.  Optional arg defaults to 1.
+ */
+static void call_function(char *str)
+{
+	char *cmd;
+	long arg1 = 0;
+	long arg2 = 0;
+	int i;
+
+	if (!str)
+		return;
+
+	(void)parse_cmd_buf(str, &cmd, &arg1, &arg2);
+
+	for (i = 0; i < FN_TAB_SIZE; i++) {
+		char *fn_name = fn_tab[i].fn_name;
+
+		if (strcmp(fn_name, str) == 0) {
+			fn_tab[i].fn_ptr(arg1, arg2);
+			return;	/* All done */
+		}
+	}
+
+	pr_err("failed to call function for cmd: %s\n", str);
+}
+
+/*
+ * smo_callfn_debugfs_write() - debugfs write function.
+ * @file: User file
+ * @ubuf: Userspace buffer
+ * @len: Length of the user space buffer
+ * @off: Offset within the file
+ *
+ * Used for triggering functions by writing command to debugfs file.
+ *
+ *   echo '<cmd> <arg>'  > /sys/kernel/debug/smo/callfn
+ *
+ * Return: Number of bytes copied if request succeeds,
+ *	   the corresponding error code otherwise.
+ */
+static ssize_t smo_callfn_debugfs_write(struct file *file,
+					const char __user *ubuf,
+					size_t len,
+					loff_t *off)
+{
+	char *kbuf;
+	int nbytes = 0;
+
+	if (*off != 0 || len == 0)
+		return -EINVAL;
+
+	kbuf = kzalloc(len + 1, GFP_KERNEL);	/* +1 for NUL */
+	if (!kbuf)
+		return -ENOMEM;
+
+	nbytes = strncpy_from_user(kbuf, ubuf, len);
+	if (nbytes < 0)
+		goto out;
+
+	if (nbytes > 0 && kbuf[nbytes - 1] == '\n')
+		kbuf[nbytes - 1] = '\0';
+
+	call_function(kbuf);	/* Tokenizes kbuf */
+out:
+	kfree(kbuf);
+	return nbytes;
+}
+
+const struct file_operations fops_callfn_debugfs = {
+	.owner = THIS_MODULE,
+	.write = smo_callfn_debugfs_write,
+};
+
+static int __init smo_debugfs_init(void)
+{
+	struct dentry *d;
+
+	smo_debugfs_root = debugfs_create_dir(SMO_DEBUGFS_DIR, NULL);
+	d = debugfs_create_file("callfn", 0200, smo_debugfs_root, NULL,
+				&fops_callfn_debugfs);
+	if (IS_ERR(d))
+		return PTR_ERR(d);
+
+	return 0;
+}
+
+static void __exit smo_debugfs_cleanup(void)
+{
+	debugfs_remove_recursive(smo_debugfs_root);
+}
+
+static int __init smo_cache_init(void)
+{
+	cachep = kmem_cache_create(SMO_CACHE_NAME,
+				   sizeof(struct smo_slub_object),
+				   0, 0, smo_object_ctor);
+	if (!cachep)
+		return -1;
+
+	return 0;
+}
+
+static void __exit smo_cache_cleanup(void)
+{
+	struct smo_slub_object *cur, *tmp;
+
+	list_for_each_entry_safe(cur, tmp, &objects, list) {
+		list_del(&cur->list);
+		kmem_cache_free(cachep, cur);
+	}
+	kmem_cache_destroy(cachep);
+}
+
+static int __init smo_init(void)
+{
+	int ret;
+
+	ret = smo_cache_init();
+	if (ret) {
+		pr_err("smo: Failed to create cache\n");
+		return ret;
+	}
+	pr_info("smo: Created kmem_cache: %s\n", SMO_CACHE_NAME);
+
+	ret = smo_debugfs_init();
+	if (ret) {
+		pr_err("smo: Failed to init debugfs\n");
+		return ret;
+	}
+	pr_info("smo: Created debugfs directory: /sys/kernel/debug/%s\n",
+		SMO_DEBUGFS_DIR);
+
+	pr_info("smo: Test module loaded\n");
+	return 0;
+}
+module_init(smo_init);
+
+static void __exit smo_exit(void)
+{
+	smo_debugfs_cleanup();
+	smo_cache_cleanup();
+
+	pr_info("smo: Test module removed\n");
+}
+module_exit(smo_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tobin C. Harding");
+MODULE_DESCRIPTION("SLUB Movable Objects test module.");
-- 
2.21.0



* [RFC PATCH v3 08/15] tools/testing/slab: Add object migration test suite
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (6 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 07/15] tools/testing/slab: Add object migration test module Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 09/15] xarray: Implement migration function for objects Tobin C. Harding
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects.  Tests are better when they
are automated.

Add automated testing via a python script for SLUB movable objects.
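
As a rough illustration of what the script measures: it reads per-cache
counters from single-line files under /sys/kernel/slab/<cache>/ that
look like '20 N0=20', and derives the number of full slabs as total
slabs minus partial slabs.  The sketch below mirrors that parsing.  The
helper names parse_slab_counter() and full_slabs() are hypothetical and
the sample lines are made up to match the format; this is not the code
in the patch.

```python
def parse_slab_counter(line):
    """Return the leading total from a slab sysfs line such as '20 N0=20'."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty sysfs line")
    # The first token is the cache-wide total; the rest are per-node counts.
    return int(tokens[0])


def full_slabs(slabs_line, partial_line):
    """Full slabs are not exported directly: total slabs minus partial slabs."""
    return parse_slab_counter(slabs_line) - parse_slab_counter(partial_line)


if __name__ == "__main__":
    # Sample lines (assumed values) in the format the script expects.
    print(parse_slab_counter("20 N0=20\n"))
    print(full_slabs("2 N0=2\n", "1 N0=1\n"))
```

Splitting on whitespace rather than a single space also tolerates the
trailing newline that readline() leaves on the last token.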

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py
  <tests are quiet, no output on success>

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...
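
The two tests reported above check the same three steps and differ only
in what shrink is expected to do.  A minimal sketch of the expected
counters, assuming the sequence used by the suite (allocate one object
more than fits in a slab, free one object from the first slab, then
shrink).  SlabState and expected_states() are hypothetical names used
only for this illustration.

```python
from collections import namedtuple

SlabState = namedtuple(
    "SlabState",
    "objects_active objects_total slabs_partial slabs_full slabs_total")


def expected_states(objects_per_slab, movable):
    """Return the expected (after_alloc, after_free, after_shrink) states."""
    one_over = objects_per_slab + 1

    # One full slab plus a second slab holding a single object.
    after_alloc = SlabState(one_over, objects_per_slab * 2, 1, 1, 2)

    # Freeing one object from the first slab makes both slabs partial.
    after_free = SlabState(one_over - 1, objects_per_slab * 2, 2, 0, 2)

    if movable:
        # The lone object is moved into the free slot; one slab is released.
        after_shrink = SlabState(one_over - 1, objects_per_slab, 0, 1, 1)
    else:
        # Without migration, shrink cannot consolidate the two slabs.
        after_shrink = after_free

    return after_alloc, after_free, after_shrink


if __name__ == "__main__":
    for movable in (False, True):
        print(movable, expected_states(20, movable)[2])
```

With 20 objects per slab, as in the debug output above, the movable
case ends with a single full slab of 20 objects while the non-movable
case keeps both partial slabs.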

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
 	char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py b/tools/testing/slab/slub_defrag.py
new file mode 100755
index 000000000000..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debug/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#    associated with command is called.
+#  - Defines 4 commands (all commands operate on smo_test cache):
+#     - 'test': Runs module stress tests.
+#     - 'alloc N': Allocates N slub objects
+#     - 'free N POS': Frees N objects starting at POS (see below)
+#     - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Freeing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are freed, the
+# POS (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#    - Creates multiple partial slabs as explained above.
+#    - Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache; otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False         # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+    print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+    if debug:
+        print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+    return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+    return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+    user = run_shell_get_stdout('whoami')
+    if user != b'root\n':
+        eprint("Please run script as root")
+        sys.exit(1)
+
+
+def mount_debugfs():
+    mounted = False
+
+    # Check if debugfs is mounted at a known mount point.
+    ret = run_shell('mount -l | grep /sys/kernel/debug > /dev/null 2>&1')
+    if ret != 0:
+        run_shell('mount -t debugfs none /sys/kernel/debug/')
+        mounted = True
+        dprint("Mounted debugfs on /sys/kernel/debug")
+
+    return mounted
+
+
+def umount_debugfs():
+    dprint("Un-mounting debugfs")
+    run_shell('umount /sys/kernel/debug')
+
+
+def load_module():
+    """Loads the test module.
+
+    We need a clean slab state to start with so module must
+    be loaded by the test suite.
+    """
+    ret = run_shell('lsmod | grep %s > /dev/null' % MODULE_NAME)
+    if ret == 0:
+        eprint("Please unload slub_defrag module before running test suite")
+        return -1
+
+    dprint('Loading module ...')
+    ret = run_shell('insmod %s.ko' % MODULE_NAME)
+    if ret != 0:                # ret==1 on error
+        return -1
+
+    dprint("Slab cache %s created" % CACHE_NAME)
+    return 0
+
+
+def unload_module():
+    ret = run_shell('lsmod | grep %s > /dev/null' % MODULE_NAME)
+    if ret == 0:
+        dprint('Removing module %s ...' % MODULE_NAME)
+        run_shell('rmmod %s > /dev/null 2>&1' % MODULE_NAME)
+
+
+def get_sysfs_value(filename):
+    """
+    Parse slab sysfs files (single line: '20 N0=20')
+    """
+    sysfs_path = '/sys/kernel/slab/%s/%s' % (CACHE_NAME, filename)
+    with open(sysfs_path, "r") as f:
+        s = f.readline()
+    tokens = s.split(" ")
+
+    return int(tokens[0])
+
+
+def get_nr_objects_active():
+    return get_sysfs_value('objects')
+
+
+def get_nr_objects_total():
+    return get_sysfs_value('total_objects')
+
+
+def get_nr_slabs_total():
+    return get_sysfs_value('slabs')
+
+
+def get_nr_slabs_partial():
+    return get_sysfs_value('partial')
+
+
+def get_nr_slabs_full():
+    return get_nr_slabs_total() - get_nr_slabs_partial()
+
+
+def get_slab_config():
+    """Get relevant information from sysfs."""
+    global objects_per_slab
+
+    objects_per_slab = get_sysfs_value('objs_per_slab')
+    if objects_per_slab < 0:
+        return -1
+
+    dprint("Objects per slab: %d" % objects_per_slab)
+    return 0
+
+
+def verify_state(nr_objects_active, nr_objects_total,
+                 nr_slabs_partial, nr_slabs_full, nr_slabs_total, msg=''):
+    err = 0
+    got_nr_objects_active = get_nr_objects_active()
+    got_nr_objects_total = get_nr_objects_total()
+    got_nr_slabs_partial = get_nr_slabs_partial()
+    got_nr_slabs_full = get_nr_slabs_full()
+    got_nr_slabs_total = get_nr_slabs_total()
+
+    if got_nr_objects_active != nr_objects_active:
+        err = -1
+
+    if got_nr_objects_total != nr_objects_total:
+        err = -2
+
+    if got_nr_slabs_partial != nr_slabs_partial:
+        err = -3
+
+    if got_nr_slabs_full != nr_slabs_full:
+        err = -4
+
+    if got_nr_slabs_total != nr_slabs_total:
+        err = -5
+
+    if err != 0:
+        dprint("Verify state: %s" % msg)
+        dprint("  what\t\t\twant\tgot")
+        dprint("-----------------------------------------")
+        dprint("  %s\t%d\t%d" % ('nr_objects_active', nr_objects_active, got_nr_objects_active))
+        dprint("  %s\t%d\t%d" % ('nr_objects_total', nr_objects_total, got_nr_objects_total))
+        dprint("  %s\t%d\t%d" % ('nr_slabs_partial', nr_slabs_partial, got_nr_slabs_partial))
+        dprint("  %s\t\t%d\t%d" % ('nr_slabs_full', nr_slabs_full, got_nr_slabs_full))
+        dprint("  %s\t%d\t%d\n" % ('nr_slabs_total', nr_slabs_total, got_nr_slabs_total))
+
+    return err
+
+
+def exec_via_sysfs(command):
+    ret = run_shell('echo %s > /sys/kernel/debug/smo/callfn' % command)
+    if ret != 0:
+        eprint("Failed to echo command to debugfs: %s" % command)
+
+    return ret
+
+
+def enable_movable_objects():
+    return exec_via_sysfs('enable')
+
+
+def alloc(n):
+    exec_via_sysfs("alloc %d" % n)
+
+
+def free(n, pos=0):
+    exec_via_sysfs('free %d %d' % (n, pos))
+
+
+def shrink():
+    ret = run_shell('slabinfo %s -s' % CACHE_NAME)
+    if ret != 0:
+        eprint("Failed to execute slabinfo -s")
+
+
+def sanity_checks():
+    # Verify everything is 0 to start with.
+    return verify_state(0, 0, 0, 0, 0, "sanity check")
+
+
+def test_non_movable():
+    one_over = objects_per_slab + 1
+
+    dprint("testing slab 'smo_test' prior to enabling movable objects ...")
+
+    alloc(one_over)
+
+    objects_active = one_over
+    objects_total = objects_per_slab * 2
+    slabs_partial = 1
+    slabs_full = 1
+    slabs_total = 2
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "non-movable: initial allocation")
+    if ret != 0:
+        eprint("test_non_movable: failed to verify initial state")
+        return -1
+
+    # Free object from first slot of first slab.
+    free(1)
+    objects_active = one_over - 1
+    objects_total = objects_per_slab * 2
+    slabs_partial = 2
+    slabs_full = 0
+    slabs_total = 2
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "non-movable: after free")
+    if ret != 0:
+        eprint("test_non_movable: failed to verify after free")
+        return -1
+
+    # Non-movable cache, shrink should have no effect.
+    shrink()
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "non-movable: after shrink")
+    if ret != 0:
+        eprint("test_non_movable: failed to verify after shrink")
+        return -1
+
+    # Cleanup
+    free(objects_per_slab)
+    shrink()
+
+    dprint("verified non-movable slabs are NOT shrinkable")
+    return 0
+
+
+def test_movable():
+    one_over = objects_per_slab + 1
+
+    dprint("testing slab 'smo_test' after enabling movable objects ...")
+
+    alloc(one_over)
+
+    objects_active = one_over
+    objects_total = objects_per_slab * 2
+    slabs_partial = 1
+    slabs_full = 1
+    slabs_total = 2
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "movable: initial allocation")
+    if ret != 0:
+        eprint("test_movable: failed to verify initial state")
+        return -1
+
+    # Free object from first slot of first slab.
+    free(1)
+    objects_active = one_over - 1
+    objects_total = objects_per_slab * 2
+    slabs_partial = 2
+    slabs_full = 0
+    slabs_total = 2
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "movable: after free")
+    if ret != 0:
+        eprint("test_movable: failed to verify after free")
+        return -1
+
+    # movable cache, shrink should move objects and free slab.
+    shrink()
+    objects_active = one_over - 1
+    objects_total = objects_per_slab * 1
+    slabs_partial = 0
+    slabs_full = 1
+    slabs_total = 1
+    ret = verify_state(objects_active, objects_total,
+                       slabs_partial, slabs_full, slabs_total,
+                       "movable: after shrink")
+    if ret != 0:
+        eprint("test_movable: failed to verify after shrink")
+        return -1
+
+    # Cleanup
+    free(objects_per_slab)
+    shrink()
+
+    dprint("verified movable slabs are shrinkable")
+    return 0
+
+
+def dprint_start_test(test):
+    dprint("Running %s ..." % test)
+
+
+def dprint_done():
+    dprint("")
+
+
+def run_test(fn, desc):
+    dprint_start_test(desc)
+    ret = fn()
+    if ret < 0:
+        fail_test(desc)
+    dprint_done()
+
+
+# Load and unload the module for this test to ensure clean state.
+def run_module_stress_test():
+    dprint("Running module stress test (see dmesg for additional test output) ...")
+
+    unload_module()
+    ret = load_module()
+    if ret < 0:
+        cleanup_and_exit(ret)
+
+    exec_via_sysfs("test")
+
+    unload_module()
+
+    dprint()
+
+
+def fail_test(msg):
+    eprint("\nFAIL: test failed: '%s' ... aborting\n" % msg)
+    cleanup_and_exit(1)
+
+
+def display_help():
+    print("Usage: %s [OPTIONS]\n" % path.basename(sys.argv[0]))
+    print("\tRuns defrag test suite (a.k.a. SLUB Movable Objects)\n")
+    print("OPTIONS:")
+    print("\t-d | --debug       Enable verbose debug output")
+    print("\t-h | --help        Print this help and exit")
+
+
+def cleanup_and_exit(return_code):
+    global debugfs_mounted
+
+    if debugfs_mounted:
+        umount_debugfs()
+
+    unload_module()
+
+    sys.exit(return_code)
+
+
+def main():
+    global debug, debugfs_mounted
+
+    if len(sys.argv) > 1:
+        if sys.argv[1] == '-h' or sys.argv[1] == '--help':
+            display_help()
+            sys.exit(0)
+
+        if sys.argv[1] == '-d' or sys.argv[1] == '--debug':
+            debug = True
+
+    assert_root()
+
+    # Use cleanup_and_exit() instead of sys.exit() after mounting debugfs.
+    debugfs_mounted = mount_debugfs()
+
+    # Loads and unloads the module.
+    run_module_stress_test()
+
+    ret = load_module()
+    if (ret < 0):
+        cleanup_and_exit(ret)
+
+    ret = get_slab_config()
+    if (ret != 0):
+        fail_test("get slab config details")
+
+    run_test(sanity_checks, "sanity checks")
+
+    run_test(test_non_movable, "test non-movable")
+
+    ret = enable_movable_objects()
+    if (ret != 0):
+        fail_test("enable movable objects")
+
+    run_test(test_movable, "test movable")
+
+    cleanup_and_exit(0)
+
+if __name__ == "__main__":
+    main()
-- 
2.21.0



* [RFC PATCH v3 09/15] xarray: Implement migration function for objects
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (7 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 08/15] tools/testing/slab: Add object migration test suite Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 10/15] tools/testing/slab: Add XArray movable objects tests Tobin C. Harding
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

Implement functions to migrate objects. This is based on initial code by
Matthew Wilcox and was modified to work with slab object migration.

This patch cannot be merged until all radix tree & IDR users are
converted to the XArray because xa_nodes and radix tree nodes share the
same slab cache (thanks Matthew).
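
The pointer fix-ups involved in moving a tree node can be sketched with
a toy model: copy the node, re-point each child's parent at the copy,
then update the slot that referenced the old node (the parent's slot,
or the tree head for a root node).  This is only an illustration under
simplifying assumptions.  The Node and Tree classes and migrate_node()
are hypothetical, and the sketch ignores locking, RCU, the private
list, and NUMA placement that the real code has to handle.

```python
class Node:
    """Toy tree node: knows its parent and its offset in the parent."""

    def __init__(self, parent=None, offset=0, nslots=4):
        self.parent = parent
        self.offset = offset
        self.slots = [None] * nslots


class Tree:
    def __init__(self):
        self.head = None


def migrate_node(tree, old):
    """Replace 'old' with a fresh copy and return the copy."""
    new = Node(old.parent, old.offset, len(old.slots))
    new.slots = list(old.slots)

    # Children must now refer to the new node as their parent.
    for child in new.slots:
        if isinstance(child, Node):
            child.parent = new

    # Finally, swing the one reference that pointed at the old node.
    if new.parent is None:
        tree.head = new
    else:
        new.parent.slots[new.offset] = new
    return new


if __name__ == "__main__":
    tree = Tree()
    root = Node()
    tree.head = root
    child = Node(parent=root, offset=1)
    root.slots[1] = child
    moved = migrate_node(tree, child)
    print(root.slots[1] is moved, moved is not child)
```

After the call nothing reachable from the tree refers to the old node,
which is what makes it safe to free in this toy model.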

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 lib/radix-tree.c | 13 +++++++++++++
 lib/xarray.c     | 49 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 14d51548bea6..9412c2853726 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1613,6 +1613,17 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+extern void xa_object_migrate(struct xa_node *node, int numa_node);
+
+static void radix_tree_migrate(struct kmem_cache *s, void **objects, int nr,
+			       int node, void *private)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		xa_object_migrate(objects[i], node);
+}
+
 void __init radix_tree_init(void)
 {
 	int ret;
@@ -1627,4 +1638,6 @@ void __init radix_tree_init(void)
 	ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead",
 					NULL, radix_tree_cpu_dead);
 	WARN_ON(ret < 0);
+	kmem_cache_setup_mobility(radix_tree_node_cachep, NULL,
+				  radix_tree_migrate);
 }
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..731dd3d8ddb8 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1971,6 +1971,55 @@ void xa_destroy(struct xarray *xa)
 }
 EXPORT_SYMBOL(xa_destroy);
 
+void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+	struct xarray *xa = READ_ONCE(node->array);
+	void __rcu **slot;
+	struct xa_node *new_node;
+	int i;
+
+	/* Freed or not yet in tree then skip */
+	if (!xa || xa == XA_RCU_FREE)
+		return;
+
+	new_node = kmem_cache_alloc_node(radix_tree_node_cachep,
+					 GFP_KERNEL, numa_node);
+	if (!new_node)
+		return;
+
+	xa_lock_irq(xa);
+
+	/* Check again..... */
+	if (xa != node->array) {
+		node = new_node;
+		goto unlock;
+	}
+
+	memcpy(new_node, node, sizeof(struct xa_node));
+
+	if (list_empty(&node->private_list))
+		INIT_LIST_HEAD(&new_node->private_list);
+	else
+		list_replace(&node->private_list, &new_node->private_list);
+
+	for (i = 0; i < XA_CHUNK_SIZE; i++) {
+		void *x = xa_entry_locked(xa, new_node, i);
+
+		if (xa_is_node(x))
+			rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+	}
+	if (!new_node->parent)
+		slot = &xa->xa_head;
+	else
+		slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+	rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+	xa_unlock_irq(xa);
+	xa_node_free(node);
+	rcu_barrier();
+}
+
 #ifdef XA_DEBUG
 void xa_dump_node(const struct xa_node *node)
 {
-- 
2.21.0



* [RFC PATCH v3 10/15] tools/testing/slab: Add XArray movable objects tests
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (8 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 09/15] xarray: Implement migration function for objects Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 11/15] slub: Enable moving objects to/from specific nodes Tobin C. Harding
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

We just implemented movable objects for the XArray.  Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen simply by using `slabinfo` on a running machine, since
the radix tree is typically in use on a running machine and will have
partial slabs.  For repeated testing we can use the test module to
simulate a workload on the XArray and then use `slabinfo` to check that
object migration is functioning.

If testing on a freshly spun up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session
--------------------

Relevant /proc/slabinfo column headers:

  name   <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

The slabinfo report for radix_tree_node prior to testing:

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8352
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :     576  Total  :     497   Sanity Checks : On   Total: 8142848
  SlabObj:     912  Full   :     473   Redzoning     : On   Used : 4810752
  SlabSiz:   16384  Partial:      24   Poisoning     : On   Loss : 3332096
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig: 2806272
  Align  :       8  Objects:      17   Tracing       : Off  Lpadd:  437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8442
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :     576  Total  :     499   Sanity Checks : On   Total: 8175616
  SlabObj:     912  Full   :     484   Redzoning     : On   Used : 4862592
  SlabSiz:   16384  Partial:      15   Poisoning     : On   Loss : 3313024
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig: 2836512
  Align  :       8  Objects:      17   Tracing       : Off  Lpadd:  439120

Now we can shrink the radix_tree_node cache:

  # slabinfo radix_tree_node --shrink
  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8515
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :     576  Total  :     501   Sanity Checks : On   Total: 8208384
  SlabObj:     912  Full   :     500   Redzoning     : On   Used : 4904640
  SlabSiz:   16384  Partial:       1   Poisoning     : On   Loss : 3303744
  Loss   :     336  CpuSlab:       0   Tracking      : On   Lalig: 2861040
  Align  :       8  Objects:      17   Tracing       : Off  Lpadd:  440880

Note the single remaining partial slab.
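
For repeated runs it can be handy to pull the slab counts out of the
report text instead of reading them by eye.  A minimal sketch, assuming
the report layout shown above.  slab_counts() is a hypothetical helper
and not part of `slabinfo`; the BEFORE and AFTER strings are excerpts
copied from the first and last reports in this message.

```python
import re


def slab_counts(report):
    """Extract the 'Slabs' column (Total, Full, Partial) from report text."""
    counts = {}
    for key in ("Total", "Full", "Partial"):
        # The Slabs column appears before the Memory column on each line,
        # so the first match for 'Total' is the slab total, not the bytes.
        match = re.search(r"\b%s\s*:\s*(\d+)" % key, report)
        if match is None:
            raise ValueError("missing %s in report" % key)
        counts[key.lower()] = int(match.group(1))
    return counts


BEFORE = """\
  Object :     576  Total  :     497   Sanity Checks : On   Total: 8142848
  SlabObj:     912  Full   :     473   Redzoning     : On   Used : 4810752
  SlabSiz:   16384  Partial:      24   Poisoning     : On   Loss : 3332096
"""

AFTER = """\
  Object :     576  Total  :     501   Sanity Checks : On   Total: 8208384
  SlabObj:     912  Full   :     500   Redzoning     : On   Used : 4904640
  SlabSiz:   16384  Partial:       1   Poisoning     : On   Loss : 3303744
"""

if __name__ == "__main__":
    print(slab_counts(BEFORE)["partial"], slab_counts(AFTER)["partial"])
```

In both excerpts Full plus Partial equals Total, and the partial count
drops from 24 before the test to 1 after the shrink.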

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 tools/testing/slab/Makefile             |   2 +-
 tools/testing/slab/slub_defrag_xarray.c | 211 ++++++++++++++++++++++++
 2 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o
 
 KTREE=../../..
 
diff --git a/tools/testing/slab/slub_defrag_xarray.c b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index 000000000000..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/list.h>
+#include <linux/gfp.h>
+#include <linux/xarray.h>
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray() */
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+	long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+	struct smox_thing *thing;
+
+	thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+	if (!thing)
+		return ERR_PTR(-ENOMEM);
+
+	thing->id = id;
+	return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void smox_object_ctor(void *ptr)
+{
+	struct smox_thing *thing = ptr;
+
+	thing->id = -1;
+}
+
+/**
+ * smox_cache_migrate() - kmem_cache migrate function.
+ * @cp: kmem_cache pointer.
+ * @objs: Array of pointers to objects to migrate.
+ * @size: Number of objects in @objs.
+ * @node: NUMA node where the object should be allocated.
+ * @private: Pointer returned by kmem_cache_isolate_func().
+ */
+void smox_cache_migrate(struct kmem_cache *cp, void **objs, int size,
+			int node, void *private)
+{
+	struct smox_thing **ptrs = (struct smox_thing **)objs;
+	struct smox_thing *old, *new;
+	struct smox_thing *thing;
+	unsigned long index;
+	void *entry;
+	int i;
+
+	for (i = 0; i < size; i++) {
+		old = ptrs[i];
+
+		new = kmem_cache_alloc(cachep, GFP_KERNEL);
+		if (!new) {
+			pr_debug("kmem_cache_alloc failed\n");
+			return;
+		}
+
+		new->id = old->id;
+
+		/* Update reference the brain dead way */
+		xa_for_each(&things, index, thing) {
+			if (thing == old) {
+				entry = xa_store(&things, index, new, GFP_KERNEL);
+				if (entry != old) {
+					pr_err("failed to exchange new/old\n");
+					return;
+				}
+			}
+		}
+		kmem_cache_free(cachep, old);
+	}
+}
+
+/*
+ * test_smo_xarray() - Run some tests using an XArray.
+ */
+static int test_smo_xarray(void)
+{
+	const int keep = 6; /* Free 5 out of 6 items */
+	const int nr_items = 10000;
+	struct smox_thing *thing;
+	unsigned long index;
+	void *entry;
+	int expected;
+	int i;
+
+	/*
+	 * Populate XArray, this adds to the radix_tree_node cache as
+	 * well as the smox_test cache.
+	 */
+	for (i = 0; i < nr_items; i++) {
+		thing = alloc_thing(i);
+		entry = xa_store(&things, i, thing, GFP_KERNEL);
+		if (xa_is_err(entry)) {
+			pr_err("smox: failed to allocate entry: %d\n", i);
+			return -ENOMEM;
+		}
+	}
+
+	/* Now free items, putting holes in both caches. */
+	for (i = 0; i < nr_items; i++) {
+		if (i % keep == 0)
+			continue;
+
+		thing = xa_erase(&things, i);
+		if (xa_is_err(thing))
+			pr_err("smox: error erasing entry: %d\n", i);
+		kmem_cache_free(cachep, thing);
+	}
+
+	expected = 0;
+	xa_for_each(&things, index, thing) {
+		if (thing->id != expected || index != expected) {
+			pr_err("smox: error; got %ld want %d at %ld\n",
+			       thing->id, expected, index);
+			return -1;
+		}
+		expected += keep;
+	}
+
+	/*
+	 * Leave caches sparsely allocated.  Shrink caches manually with:
+	 *
+	 *   slabinfo radix_tree_node --shrink
+	 *   slabinfo smox_test --shrink
+	 */
+
+	return 0;
+}
+
+static int __init smox_cache_init(void)
+{
+	cachep = kmem_cache_create(SMOX_CACHE_NAME,
+				   sizeof(struct smox_thing),
+				   0, 0, smox_object_ctor);
+	if (!cachep)
+		return -1;
+
+	return 0;
+}
+
+static void __exit smox_cache_cleanup(void)
+{
+	struct smox_thing *thing;
+	unsigned long i;
+
+	xa_for_each(&things, i, thing) {
+		kmem_cache_free(cachep, thing);
+	}
+	xa_destroy(&things);
+	kmem_cache_destroy(cachep);
+}
+
+static int __init smox_init(void)
+{
+	int ret;
+
+	ret = smox_cache_init();
+	if (ret) {
+		pr_err("smo_xarray: failed to create cache\n");
+		return ret;
+	}
+	pr_info("smo_xarray: created kmem_cache: %s\n", SMOX_CACHE_NAME);
+
+	kmem_cache_setup_mobility(cachep, NULL, smox_cache_migrate);
+	pr_info("smo_xarray: kmem_cache %s defrag enabled\n", SMOX_CACHE_NAME);
+
+	/*
+	 * Running this test consumes memory unless you shrink the
+	 * radix_tree_node cache manually with `slabinfo`.
+	 */
+	ret = test_smo_xarray();
+	if (ret)
+		pr_warn("test_smo_xarray failed: %d\n", ret);
+
+	pr_info("smo_xarray: module loaded successfully\n");
+	return 0;
+}
+module_init(smox_init);
+
+static void __exit smox_exit(void)
+{
+	smox_cache_cleanup();
+
+	pr_info("smo_xarray: module removed\n");
+}
+module_exit(smox_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tobin C. Harding");
+MODULE_DESCRIPTION("SMO XArray test module.");
-- 
2.21.0



* [RFC PATCH v3 11/15] slub: Enable moving objects to/from specific nodes
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (9 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 10/15] tools/testing/slab: Add XArray movable objects tests Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 12/15] slub: Enable balancing slabs across nodes Tobin C. Harding
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

We have just implemented Slab Movable Objects (object migration).
Currently object migration is used to defrag a cache.  On NUMA systems
it would be nice to be able to control the source and destination nodes
when moving objects.

Add CONFIG_SMO_NODE to guard this feature.  CONFIG_SMO_NODE selects
CONFIG_SLUB_DEBUG because we use the full list.  Leave it like this for
the RFC so the patch is less cluttered to review; the full list should
be separated out of CONFIG_SLUB_DEBUG before doing a PATCH version.

Implement moving all objects (including those in full slabs) to a
specific node.  Expose this functionality to userspace via a sysfs entry.

Add sysfs entry:

   /sys/kernel/slab/<cache>/move

With this users get access to the following functionality:

 - Move all objects to specified node.

   	echo "N1" > move

 - Move all objects from specified node to other specified
   node (from N1 -> to N2):

   	echo "N1 N2" > move

This also enables shrinking slabs on a specific node:

   	echo "N1 N1" > move
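
For illustration only, the input syntax described above can be modelled
in userspace C.  This is a sketch under assumptions, not the parser in
this patch: parse_node_spec() and parse_move_input() are hypothetical
names, and unlike the kernel code it returns the number of node IDs
found so the caller can tell the one-argument form from the
two-argument form.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): parse one node specifier of
 * the form "N<id>", "n<id>" or "<id>" starting at *p.  On success store
 * the ID in *out, advance *p past the digits and return 0.
 */
static int parse_node_spec(const char **p, long *out)
{
	const char *s = *p;
	char *end;
	long val;

	if (*s == 'N' || *s == 'n')
		s++;
	/* Reject signs, whitespace and empty input before calling strtol(). */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtol(s, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	*out = val;
	*p = end;
	return 0;
}

/*
 * Parse "N1" or "N1 N2", with an optional trailing newline as added by
 * echo.  Return the number of node IDs parsed (1 or 2) or a negative
 * errno value.  Exactly one space may separate the two specifiers.
 */
static int parse_move_input(const char *buf, long *arg0, long *arg1)
{
	const char *p = buf;
	int count = 1;
	int ret;

	if (!buf)
		return -EINVAL;

	ret = parse_node_spec(&p, arg0);
	if (ret)
		return ret;

	if (*p == ' ') {
		p++;
		ret = parse_node_spec(&p, arg1);
		if (ret)
			return ret;
		count = 2;
	}

	if (*p == '\n')
		p++;

	return *p == '\0' ? count : -EINVAL;
}
```

A caller would treat a return value of 1 as "move all objects to node
arg0" and 2 as "move objects from node arg0 to node arg1"; the node IDs
still have to be validated against the nodes that actually have memory.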

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 mm/Kconfig |   7 ++
 mm/slub.c  | 249 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 256 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..47040d939f3b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,13 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config SMO_NODE
+       bool "Enable per node control of Slab Movable Objects"
+       depends on SLUB && SYSFS
+       select SLUB_DEBUG
+       help
+         On NUMA systems enable moving objects to and from a specified node.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/slub.c b/mm/slub.c
index e601c804ed79..e4f3dde443f5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4345,6 +4345,106 @@ static void move_slab_page(struct page *page, void *scratch, int node)
 	s->migrate(s, vector, count, node, private);
 }
 
+#ifdef CONFIG_SMO_NODE
+/*
+ * kmem_cache_move() - Attempt to move all slab objects.
+ * @s: The cache we are working on.
+ * @node: The node to move objects away from.
+ * @target_node: The node to move objects on to.
+ *
+ * Attempts to move all objects (partial slabs and full slabs) to target
+ * node.
+ *
+ * Context: Takes the list_lock.
+ * Return: The number of slabs remaining on node.
+ */
+static unsigned long kmem_cache_move(struct kmem_cache *s,
+				     int node, int target_node)
+{
+	struct kmem_cache_node *n = get_node(s, node);
+	LIST_HEAD(move_list);
+	struct page *page, *page2;
+	unsigned long flags;
+	void **scratch;
+
+	if (!s->migrate) {
+		pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+		goto out;
+	}
+
+	scratch = alloc_scratch(s);
+	if (!scratch)
+		goto out;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
+			continue;
+
+		if (page->inuse) {
+			list_move(&page->lru, &move_list);
+			/* Stop page being considered for allocations */
+			n->nr_partial--;
+			page->frozen = 1;
+
+			slab_unlock(page);
+		} else {	/* Empty slab page */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+		}
+	}
+	list_for_each_entry_safe(page, page2, &n->full, lru) {
+		if (!slab_trylock(page))
+			continue;
+
+		list_move(&page->lru, &move_list);
+		page->frozen = 1;
+		slab_unlock(page);
+	}
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	list_for_each_entry(page, &move_list, lru) {
+		if (page->inuse)
+			move_slab_page(page, scratch, target_node);
+	}
+	kfree(scratch);
+
+	/* Bail here to save taking the list_lock */
+	if (list_empty(&move_list))
+		goto out;
+
+	/* Inspect results and dispose of pages */
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &move_list, lru) {
+		list_del(&page->lru);
+		slab_lock(page);
+		page->frozen = 0;
+
+		if (page->inuse) {
+			if (page->inuse == page->objects) {
+				list_add(&page->lru, &n->full);
+				slab_unlock(page);
+			} else {
+				n->nr_partial++;
+				list_add_tail(&page->lru, &n->partial);
+				slab_unlock(page);
+			}
+		} else {
+			slab_unlock(page);
+			discard_slab(s, page);
+		}
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+out:
+	return atomic_long_read(&n->nr_slabs);
+}
+#endif	/* CONFIG_SMO_NODE */
+
 /*
  * kmem_cache_defrag() - Defragment node.
  * @s: cache we are working on.
@@ -4459,6 +4559,32 @@ static unsigned long kmem_cache_defrag(struct kmem_cache *s,
 	return n->nr_partial;
 }
 
+#ifdef CONFIG_SMO_NODE
+/*
+ * kmem_cache_move_to_node() - Move all slab objects to node.
+ * @s: The cache we are working on.
+ * @node: The target node to move objects to.
+ *
+ * Attempt to move all slab objects from all nodes to @node.
+ *
+ * Return: The total number of slabs left on emptied nodes.
+ */
+static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node)
+{
+	unsigned long left = 0;
+	int nid;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		if (nid == node)
+			continue;
+
+		left += kmem_cache_move(s, nid, node);
+	}
+
+	return left;
+}
+#endif
+
 /**
  * kmem_defrag_slabs() - Defrag slab caches.
  * @node: The node to defrag or -1 for all nodes.
@@ -5603,6 +5729,126 @@ static ssize_t shrink_store(struct kmem_cache *s,
 }
 SLAB_ATTR(shrink);
 
+#ifdef CONFIG_SMO_NODE
+static ssize_t move_show(struct kmem_cache *s, char *buf)
+{
+	return 0;
+}
+
+/*
+ * parse_move_store_input() - Parse buf getting integer arguments.
+ * @buf: Buffer to parse.
+ * @length: Length of @buf.
+ * @arg0: Return parameter, first argument.
+ * @arg1: Return parameter, second argument.
+ *
+ * Parses the input from user write to sysfs file 'move'.  Input string
+ * should contain either one or two node specifiers of form Nx where x
+ * is an integer specifying the NUMA node ID.  'N' or 'n' may be used.
+ * n/N may be omitted.
+ *
+ * e.g.
+ *     echo 'N1' > /sys/kernel/slab/cache/move
+ * or
+ *     echo 'N0 N2' > /sys/kernel/slab/cache/move
+ *
+ * Regex matching accepted forms: '[nN]?[0-9]+( [nN]?[0-9]+)?'
+ *
+ * FIXME: This is really fragile.  Input must be exactly correct,
+ *        spurious whitespace causes parse errors.
+ *
+ * Return: 0 if an argument was successfully converted, or an error code.
+ */
+static ssize_t parse_move_store_input(const char *buf, size_t length,
+				      long *arg0, long *arg1)
+{
+	char *s, *save, *ptr;
+	int ret = 0;
+
+	if (!buf)
+		return -EINVAL;
+
+	s = kstrdup(buf, GFP_KERNEL);
+	if (!s)
+		return -ENOMEM;
+	save = s;
+
+	if (s[length - 1] == '\n') {
+		s[length - 1] = '\0';
+		length--;
+	}
+
+	ptr = strsep(&s, " ");
+	if (!ptr || strcmp(ptr, "") == 0) {
+		ret = 0;
+		goto out;
+	}
+
+	if (*ptr == 'N' || *ptr == 'n')
+		ptr++;
+	ret = kstrtol(ptr, 10, arg0);
+	if (ret < 0)
+		goto out;
+
+	if (s) {
+		if (*s == 'N' || *s == 'n')
+			s++;
+		ret = kstrtol(s, 10, arg1);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+out:
+	kfree(save);
+	return ret;
+}
+
+static bool is_valid_node(int node)
+{
+	int nid;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		if (nid == node)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * move_store() - Move objects between nodes.
+ * @s: The cache we are working on.
+ * @buf: String received.
+ * @length: Length of @buf.
+ *
+ * Writes to /sys/kernel/slab/<cache>/move are interpreted as follows:
+ *
+ *  echo "N1" > move       : Move all objects (from all nodes) to node 1.
+ *  echo "N0 N1" > move    : Move all objects from node 0 to node 1.
+ *
+ * 'N' may be omitted.
+ */
+static ssize_t move_store(struct kmem_cache *s, const char *buf, size_t length)
+{
+	long arg0 = -1;
+	long arg1 = -1;
+	int ret;
+
+	ret = parse_move_store_input(buf, length, &arg0, &arg1);
+	if (ret < 0)
+		return -EINVAL;
+
+	if (is_valid_node(arg0) && is_valid_node(arg1))
+		(void)kmem_cache_move(s, arg0, arg1);
+	else if (is_valid_node(arg0))
+		(void)kmem_cache_move_to_node(s, arg0);
+
+	/* FIXME: What should we be returning here? */
+	return length;
+}
+SLAB_ATTR(move);
+#endif	/* CONFIG_SMO_NODE */
+
 #ifdef CONFIG_NUMA
 static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
@@ -5727,6 +5973,9 @@ static struct attribute *slab_attrs[] = {
 	&reclaim_account_attr.attr,
 	&destroy_by_rcu_attr.attr,
 	&shrink_attr.attr,
+#ifdef CONFIG_SMO_NODE
+	&move_attr.attr,
+#endif
 	&slabs_cpu_partial_attr.attr,
 #ifdef CONFIG_SLUB_DEBUG
 	&total_objects_attr.attr,
-- 
2.21.0



* [RFC PATCH v3 12/15] slub: Enable balancing slabs across nodes
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (10 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 11/15] slub: Enable moving objects to/from specific nodes Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 13/15] dcache: Provide a dentry constructor Tobin C. Harding
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

We have just implemented Slab Movable Objects (SMO).  On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs.  Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

 1. Move all objects to node 0 (this has the effect of defragmenting the
    cache).

 2. Calculate the desired number of slabs for each node (this is done
    using the approximation nr_slabs / nr_nodes).

 3. Loop over the nodes moving the desired number of slabs from node 0
    to the node.
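
The arithmetic of the three steps can be sketched in userspace C.  This
is only a model under assumptions: balance_slabs() is a hypothetical
name, slabs[] stands in for the per-node slab counts of one cache, and
the model assumes every move succeeds and that step 1 does not change
the total number of slabs (in the real code defragmentation may reduce
it, and moving is best effort).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model the balancing steps on an array of per-node slab counts.
 * Hypothetical helper: it models only the counting, no objects move.
 */
static void balance_slabs(unsigned long *slabs, size_t nr_nodes)
{
	unsigned long total = 0;
	unsigned long desired;
	size_t nid;

	if (nr_nodes == 0)
		return;

	/* Step 1: move everything to node 0. */
	for (nid = 0; nid < nr_nodes; nid++) {
		total += slabs[nid];
		slabs[nid] = 0;
	}
	slabs[0] = total;

	/* Step 2: integer division, so the remainder stays on node 0. */
	desired = total / nr_nodes;

	/* Step 3: hand 'desired' slabs from node 0 to every other node. */
	for (nid = 1; nid < nr_nodes; nid++) {
		slabs[nid] = desired;
		slabs[0] -= desired;
	}
}
```

Because of the integer division node 0 keeps its own share plus the
remainder, so it can end up with up to nr_nodes - 1 more slabs than the
other nodes.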

This feature is conditionally built in with CONFIG_SMO_NODE because we
need the full list (we enable SLUB_DEBUG to get this).  A future
version may separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry.  Add sysfs
entry:

       /sys/kernel/slab/<cache>/balance

Writing '1' to this file triggers a balance; no other value is accepted.

This feature relies on SMO being enabled for the cache.  This is done,
after the isolate/migrate functions have been defined, with a call to:

	kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 mm/slub.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index e4f3dde443f5..a5c48c41d72b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4583,6 +4583,109 @@ static unsigned long kmem_cache_move_to_node(struct kmem_cache *s, int node)
 
 	return left;
 }
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to @target_node.
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node.  This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+				  int node, int target_node, long num)
+{
+	struct kmem_cache_node *n = get_node(s, node);
+	LIST_HEAD(move_list);
+	struct page *page, *page2;
+	unsigned long flags;
+	void **scratch;
+	long done = 0;
+
+	if (node == target_node)
+		return -EINVAL;
+
+	scratch = alloc_scratch(s);
+	if (!scratch)
+		return -ENOMEM;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->full, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
+			continue;
+
+		list_move(&page->lru, &move_list);
+		page->frozen = 1;
+		slab_unlock(page);
+
+		if (++done >= num)
+			break;
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	list_for_each_entry(page, &move_list, lru) {
+		if (page->inuse)
+			move_slab_page(page, scratch, target_node);
+	}
+	kfree(scratch);
+
+	/* Inspect results and dispose of pages */
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &move_list, lru) {
+		list_del(&page->lru);
+		slab_lock(page);
+		page->frozen = 0;
+
+		if (page->inuse) {
+			/*
+			 * This is best effort only, if slab still has
+			 * objects just put it back on the partial list.
+			 */
+			n->nr_partial++;
+			list_add_tail(&page->lru, &n->partial);
+			slab_unlock(page);
+		} else {
+			slab_unlock(page);
+			discard_slab(s, page);
+		}
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+	struct kmem_cache_node *n = get_node(s, 0);
+	unsigned long desired_nr_slabs_per_node;
+	unsigned long nr_slabs;
+	int nr_nodes = 0;
+	int nid;
+
+	(void)kmem_cache_move_to_node(s, 0);
+
+	for_each_node_state(nid, N_NORMAL_MEMORY)
+		nr_nodes++;
+
+	nr_slabs = atomic_long_read(&n->nr_slabs);
+	desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		if (nid == 0)
+			continue;
+
+		kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
+	}
+}
 #endif
 
 /**
@@ -5847,6 +5950,22 @@ static ssize_t move_store(struct kmem_cache *s, const char *buf, size_t length)
 	return length;
 }
 SLAB_ATTR(move);
+
+static ssize_t balance_show(struct kmem_cache *s, char *buf)
+{
+	return 0;
+}
+
+static ssize_t balance_store(struct kmem_cache *s,
+			     const char *buf, size_t length)
+{
+	if (buf[0] == '1')
+		kmem_cache_balance_nodes(s);
+	else
+		return -EINVAL;
+	return length;
+}
+SLAB_ATTR(balance);
 #endif	/* CONFIG_SMO_NODE */
 
 #ifdef CONFIG_NUMA
@@ -5975,6 +6094,7 @@ static struct attribute *slab_attrs[] = {
 	&shrink_attr.attr,
 #ifdef CONFIG_SMO_NODE
 	&move_attr.attr,
+	&balance_attr.attr,
 #endif
 	&slabs_cpu_partial_attr.attr,
 #ifdef CONFIG_SLUB_DEBUG
-- 
2.21.0



* [RFC PATCH v3 13/15] dcache: Provide a dentry constructor
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (11 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 12/15] slub: Enable balancing slabs across nodes Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects Tobin C. Harding
  2019-04-11  1:34 ` [RFC PATCH v3 15/15] dcache: Add CONFIG_DCACHE_SMO Tobin C. Harding
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

In order to support object migration on the dentry cache we need to have
a well-defined object state at all times.  Without a constructor the
object would have a random state after allocation.

Provide a dentry constructor.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 fs/dcache.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf4743..606cfca20d42 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1603,6 +1603,16 @@ void d_invalidate(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
+static void dcache_ctor(void *p)
+{
+	struct dentry *dentry = p;
+
+	/* Mimic lockref_mark_dead() */
+	dentry->d_lockref.count = -128;
+
+	spin_lock_init(&dentry->d_lock);
+}
+
 /**
  * __d_alloc	-	allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1658,7 +1668,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 
 	dentry->d_lockref.count = 1;
 	dentry->d_flags = 0;
-	spin_lock_init(&dentry->d_lock);
+
 	seqcount_init(&dentry->d_seq);
 	dentry->d_inode = NULL;
 	dentry->d_parent = dentry;
@@ -3091,14 +3101,17 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
-	/*
-	 * A constructor could be added for stable state like the lists,
-	 * but it is probably not worth it because of the cache nature
-	 * of the dcache.
-	 */
-	dentry_cache = KMEM_CACHE_USERCOPY(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
-		d_iname);
+	slab_flags_t flags =
+		SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT;
+
+	dentry_cache =
+		kmem_cache_create_usercopy("dentry",
+					   sizeof(struct dentry),
+					   __alignof__(struct dentry),
+					   flags,
+					   offsetof(struct dentry, d_iname),
+					   sizeof_field(struct dentry, d_iname),
+					   dcache_ctor);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
-- 
2.21.0



* [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (12 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 13/15] dcache: Provide a dentry constructor Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  2019-04-11  2:33   ` Al Viro
  2019-04-11  1:34 ` [RFC PATCH v3 15/15] dcache: Add CONFIG_DCACHE_SMO Tobin C. Harding
  14 siblings, 1 reply; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable; however, under
some conditions they can be freed.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system; it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

Implement isolate and 'migrate' functions for the dentry slab cache.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 fs/dcache.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 606cfca20d42..5c707ed9ab5a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -30,6 +30,7 @@
 #include <linux/bit_spinlock.h>
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3068,6 +3069,74 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+	struct dentry *dentry;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		__dget(dentry);
+	}
+
+	return NULL;		/* No need for private data */
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to migrate.
+ * @nr: Number of objects in @v.
+ * @node: The NUMA node where new object should be allocated.
+ * @private: Returned by d_isolate() (currently %NULL).
+ *
+ * Dentry objects _can not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
+		      int node, void *_unused)
+{
+	struct dentry *dentry;
+	LIST_HEAD(dispose);
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		spin_lock(&dentry->d_lock);
+		dentry->d_lockref.count--;
+
+		if (dentry->d_lockref.count > 0 ||
+		    dentry->d_flags & DCACHE_SHRINK_LIST) {
+			spin_unlock(&dentry->d_lock);
+			continue;
+		}
+
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+
+		d_shrink_add(dentry, &dispose);
+
+		spin_unlock(&dentry->d_lock);
+	}
+
+	if (!list_empty(&dispose))
+		shrink_dentry_list(&dispose);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3113,6 +3182,8 @@ static void __init dcache_init(void)
 					   sizeof_field(struct dentry, d_iname),
 					   dcache_ctor);
 
+	kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
 		return;
-- 
2.21.0



* [RFC PATCH v3 15/15] dcache: Add CONFIG_DCACHE_SMO
  2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
                   ` (13 preceding siblings ...)
  2019-04-11  1:34 ` [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects Tobin C. Harding
@ 2019-04-11  1:34 ` Tobin C. Harding
  14 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tobin C. Harding, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

In an attempt to make the SMO patchset as non-invasive as possible add a
config option CONFIG_DCACHE_SMO (under "Memory Management options") for
enabling SMO for the dcache.  Without this option the dcache constructor
is used but no other code is built in.  With this option enabled, slab
mobility is enabled and the isolate/migrate functions are built in.

Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache via
Slab Movable Objects infrastructure.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 fs/dcache.c | 4 ++++
 mm/Kconfig  | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c707ed9ab5a..5ef68b78b457 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3069,6 +3069,7 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+#ifdef CONFIG_DCACHE_SMO
 /*
  * d_isolate() - Dentry isolation callback function.
  * @s: The dentry cache.
@@ -3136,6 +3137,7 @@ static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
 	if (!list_empty(&dispose))
 		shrink_dentry_list(&dispose);
 }
+#endif	/* CONFIG_DCACHE_SMO */
 
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
@@ -3182,7 +3184,9 @@ static void __init dcache_init(void)
 					   sizeof_field(struct dentry, d_iname),
 					   dcache_ctor);
 
+#ifdef CONFIG_DCACHE_SMO
 	kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+#endif
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff --git a/mm/Kconfig b/mm/Kconfig
index 47040d939f3b..92fc27ad3472 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -265,6 +265,13 @@ config SMO_NODE
        help
          On NUMA systems enable moving objects to and from a specified node.
 
+config DCACHE_SMO
+       bool "Enable Slab Movable Objects for the dcache"
+       depends on SLUB
+       help
+         Under memory pressure we can try to free dentry slab cache objects from
+         the partial slab list if this is enabled.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
-- 
2.21.0



* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  1:34 ` [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects Tobin C. Harding
@ 2019-04-11  2:33   ` Al Viro
  2019-04-11  2:48     ` Tobin C. Harding
  0 siblings, 1 reply; 28+ messages in thread
From: Al Viro @ 2019-04-11  2:33 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Andrew Morton, Roman Gushchin, Alexander Viro, Christoph Hellwig,
	Pekka Enberg, David Rientjes, Joonsoo Kim, Christopher Lameter,
	Matthew Wilcox, Miklos Szeredi, Andreas Dilger, Waiman Long,
	Tycho Andersen, Theodore Ts'o, Andi Kleen, David Chinner,
	Nick Piggin, Rik van Riel, Hugh Dickins, Jonathan Corbet,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, Apr 11, 2019 at 11:34:40AM +1000, Tobin C. Harding wrote:
> +/*
> + * d_isolate() - Dentry isolation callback function.
> + * @s: The dentry cache.
> + * @v: Vector of pointers to the objects to isolate.
> + * @nr: Number of objects in @v.
> + *
> + * The slab allocator is holding off frees. We can safely examine
> + * the object without the danger of it vanishing from under us.
> + */
> +static void *d_isolate(struct kmem_cache *s, void **v, int nr)
> +{
> +	struct dentry *dentry;
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		__dget(dentry);
> +	}
> +
> +	return NULL;		/* No need for private data */
> +}

Huh?  This is completely wrong; what you need is collecting the ones
with zero refcount (and not on shrink lists) into a private list.
*NOT* bumping the refcounts at all.  And do it in your isolate thing.

> +static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
> +		      int node, void *_unused)
> +{
> +	struct dentry *dentry;
> +	LIST_HEAD(dispose);
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		spin_lock(&dentry->d_lock);
> +		dentry->d_lockref.count--;
> +
> +		if (dentry->d_lockref.count > 0 ||
> +		    dentry->d_flags & DCACHE_SHRINK_LIST) {
> +			spin_unlock(&dentry->d_lock);
> +			continue;
> +		}
> +
> +		if (dentry->d_flags & DCACHE_LRU_LIST)
> +			d_lru_del(dentry);
> +
> +		d_shrink_add(dentry, &dispose);
> +
> +		spin_unlock(&dentry->d_lock);
> +	}

Basically, that loop (sans jerking the refcount up and down) should
get moved into d_isolate().
> +
> +	if (!list_empty(&dispose))
> +		shrink_dentry_list(&dispose);
> +}

... with this left in d_partial_shrink().  And you obviously need some way
to pass the list from the former to the latter...
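
As a rough illustration of the split being suggested, here is a
userspace mock in C.  Everything in it is an assumption made for the
sketch: struct mock_dentry, mock_isolate() and mock_partial_shrink() are
hypothetical stand-ins, there is no locking, and "freeing" only sets a
flag.  It shows the shape of the idea (isolate collects unused objects
on a private list without touching refcounts, the second callback
disposes of that list), not real dcache code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-in for struct dentry; all fields are hypothetical. */
struct mock_dentry {
	int refcount;
	bool on_shrink_list;
	bool freed;
	struct mock_dentry *next;	/* linkage for the private list */
};

/*
 * Isolate step: collect objects with zero refcount that are not already
 * on a shrink list.  No refcount is bumped.  The returned list head
 * plays the role of the 'private' pointer passed to the second callback.
 */
static void *mock_isolate(struct mock_dentry **v, int nr)
{
	struct mock_dentry *dispose = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		struct mock_dentry *d = v[i];

		if (d->refcount > 0 || d->on_shrink_list)
			continue;

		d->on_shrink_list = true;
		d->next = dispose;
		dispose = d;
	}

	return dispose;
}

/*
 * Second step: dispose of whatever the isolate step collected.
 * Returns the number of objects "freed".
 */
static int mock_partial_shrink(void *private)
{
	struct mock_dentry *d = private;
	int freed = 0;

	while (d) {
		struct mock_dentry *next = d->next;

		d->next = NULL;
		d->freed = true;
		freed++;
		d = next;
	}

	return freed;
}
```

In-use objects and objects already owned by another shrink list are
simply skipped, so the second step never has to re-check them.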


* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  2:33   ` Al Viro
@ 2019-04-11  2:48     ` Tobin C. Harding
  2019-04-11  4:47       ` Al Viro
  0 siblings, 1 reply; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  2:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Apr 11, 2019 at 03:33:22AM +0100, Al Viro wrote:
> On Thu, Apr 11, 2019 at 11:34:40AM +1000, Tobin C. Harding wrote:
> > +/*
> > + * d_isolate() - Dentry isolation callback function.
> > + * @s: The dentry cache.
> > + * @v: Vector of pointers to the objects to isolate.
> > + * @nr: Number of objects in @v.
> > + *
> > + * The slab allocator is holding off frees. We can safely examine
> > + * the object without the danger of it vanishing from under us.
> > + */
> > +static void *d_isolate(struct kmem_cache *s, void **v, int nr)
> > +{
> > +	struct dentry *dentry;
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		dentry = v[i];
> > +		__dget(dentry);
> > +	}
> > +
> > +	return NULL;		/* No need for private data */
> > +}
> 
> Huh?  This is compeletely wrong; what you need is collecting the ones
> with zero refcount (and not on shrink lists) into a private list.
> *NOT* bumping the refcounts at all.  And do it in your isolate thing.

Oh, so putting entries on a shrink list is enough to pin them?

> 
> > +static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
> > +		      int node, void *_unused)
> > +{
> > +	struct dentry *dentry;
> > +	LIST_HEAD(dispose);
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		dentry = v[i];
> > +		spin_lock(&dentry->d_lock);
> > +		dentry->d_lockref.count--;
> > +
> > +		if (dentry->d_lockref.count > 0 ||
> > +		    dentry->d_flags & DCACHE_SHRINK_LIST) {
> > +			spin_unlock(&dentry->d_lock);
> > +			continue;
> > +		}
> > +
> > +		if (dentry->d_flags & DCACHE_LRU_LIST)
> > +			d_lru_del(dentry);
> > +
> > +		d_shrink_add(dentry, &dispose);
> > +
> > +		spin_unlock(&dentry->d_lock);
> > +	}
> 
> Basically, that loop (sans jerking the refcount up and down) should
> get moved into d_isolate().
> > +
> > +	if (!list_empty(&dispose))
> > +		shrink_dentry_list(&dispose);
> > +}
> 
> ... with this left in d_partial_shrink().  And you obviously need some way
> to pass the list from the former to the latter...

Easy enough, we have a void * return value from the isolate function
just for this purpose.
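
A rough userspace sketch of that hand-off, assuming the isolate callback
returns an opaque pointer that is later passed to the second callback.
Everything named toy_* is made up for illustration and is not kernel
API; the calloc() in the isolate step is a userspace convenience only
and says nothing about what is allowed in the real callback context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dentry; only what the sketch needs. */
struct toy_dentry {
	int count;			/* models ->d_lockref.count */
	bool on_shrink_list;		/* models DCACHE_SHRINK_LIST */
	bool disposed;			/* set once the sketch has "freed" it */
	struct toy_dentry *next;	/* link on the private list */
};

/* Private data handed from the isolate step to the shrink step. */
struct toy_dispose {
	struct toy_dentry *head;
};

/*
 * Isolate step: collect objects with zero refcount that are not already
 * on somebody else's list.  Refcounts are not touched at all.
 */
static void *toy_isolate(void **v, int nr)
{
	struct toy_dispose *dispose = calloc(1, sizeof(*dispose));

	if (!dispose)
		return NULL;
	for (int i = 0; i < nr; i++) {
		struct toy_dentry *d = v[i];

		if (d->count > 0 || d->on_shrink_list)
			continue;
		d->on_shrink_list = true;
		d->next = dispose->head;
		dispose->head = d;
	}
	return dispose;
}

/* Shrink step: receives the private list and disposes of its entries. */
static int toy_partial_shrink(void *private)
{
	struct toy_dispose *dispose = private;
	int nr = 0;

	if (!dispose)
		return 0;
	while (dispose->head) {
		struct toy_dentry *d = dispose->head;

		dispose->head = d->next;
		d->on_shrink_list = false;
		d->disposed = true;
		nr++;
	}
	free(dispose);
	return nr;
}
```

Calling toy_isolate() on a vector and feeding its return value to
toy_partial_shrink() disposes of only the unreferenced entries that
were not already on another list.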

Thanks Al, hackety hack ...


	Tobin
	

* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  2:48     ` Tobin C. Harding
@ 2019-04-11  4:47       ` Al Viro
  2019-04-11  5:05         ` Tobin C. Harding
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Al Viro @ 2019-04-11  4:47 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Apr 11, 2019 at 12:48:21PM +1000, Tobin C. Harding wrote:

> Oh, so putting entries on a shrink list is enough to pin them?

Not exactly pin, but __dentry_kill() has this:
        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                dentry->d_flags |= DCACHE_MAY_FREE;
                can_free = false;
        }
        spin_unlock(&dentry->d_lock);
        if (likely(can_free))
                dentry_free(dentry);
and shrink_dentry_list() - this:
                        if (dentry->d_lockref.count < 0)
                                can_free = dentry->d_flags & DCACHE_MAY_FREE;
                        spin_unlock(&dentry->d_lock);
                        if (can_free)
                                dentry_free(dentry);
			continue;
so if dentry destruction comes before we get around to
shrink_dentry_list(), it'll stop short of dentry_free() and mark it for
shrink_dentry_list() to do just dentry_free(); if it overlaps with
shrink_dentry_list(), but doesn't progress all the way to freeing,
we will
	* have dentry removed from shrink list
	* notice the negative ->d_count (i.e. that it has already reached
__dentry_kill())
	* see that __dentry_kill() is not through with tearing the sucker
apart (no DCACHE_MAY_FREE set)
... and just leave it alone, letting __dentry_kill() do the rest of its
thing - it's already off the shrink list, so __dentry_kill() will do
everything, including dentry_free().

The reason for that dance is the locking - shrink list belongs to whoever
has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
even try to remove the victim from there; it does all the teardown
(detaches from inode, unhashes, etc.) and leaves removal from the shrink
list and actual freeing to the owner of shrink list.  That way we don't
have to protect all shrink lists with a single lock (contention on it would
be painful) and we don't have to play with per-shrink-list locks and
all the attendant headaches (those lists usually live on stack frame
of some function, so just having the lock next to the list_head would
do us no good, etc.).  Much easier to have the shrink_dentry_list()
do all the manipulations...

The bottom line is, once it's on a shrink list, it'll stay there
until shrink_dentry_list().  It may get extra references after
being inserted there (e.g. be found by hash lookup), it may drop
those, whatever - it won't get freed until we run shrink_dentry_list().
If it ends up with extra references, no problem - shrink_dentry_list()
will just kick it off the shrink list and leave it alone.
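
The handshake above can be modelled in userspace roughly as follows.
This is only a sketch: the toy_* names and flag values are invented,
locking and RCU are left out, and toy_dentry_kill() collapses the whole
teardown into a single step.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flags named after the kernel ones; the values are made up. */
#define TOY_SHRINK_LIST	0x1u	/* sits on somebody's shrink list */
#define TOY_MAY_FREE	0x2u	/* teardown finished, list owner may free */

struct toy_dentry {
	int count;		/* models ->d_lockref.count, < 0 once killed */
	unsigned int flags;
	bool freed;		/* stands in for dentry_free() having run */
};

static void toy_free(struct toy_dentry *d)
{
	assert(!d->freed);	/* a double free would be a bug in the model */
	d->freed = true;
}

/* Models the tail of __dentry_kill(): decide who does the freeing. */
static void toy_dentry_kill(struct toy_dentry *d)
{
	bool can_free = true;

	d->count = -1;			/* dead from now on */
	if (d->flags & TOY_SHRINK_LIST) {
		d->flags |= TOY_MAY_FREE;	/* leave it to the list owner */
		can_free = false;
	}
	if (can_free)
		toy_free(d);
}

/* Models what the owner of a shrink list does with one entry. */
static void toy_shrink_one(struct toy_dentry *d)
{
	d->flags &= ~TOY_SHRINK_LIST;	/* only the owner removes entries */
	if (d->count < 0) {
		/* already killed; free only if the teardown is complete */
		if (d->flags & TOY_MAY_FREE)
			toy_free(d);
		return;
	}
	if (d->count > 0)
		return;			/* picked up references, leave alone */
	toy_dentry_kill(d);		/* off the list now, so it frees itself */
}
```

Whichever side runs last ends up calling toy_free(), and it is called
at most once per object.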

Note, BTW, that umount coming between isolate and drop is not a problem;
it calls shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
finds something on (another) shrink list, it won't put it to the shrink
list of its own, but it will make note of that and repeat the scan in
such case.  So if we find something with zero refcount and not on
shrink list, we can move it to our shrink list and be sure that its
superblock won't go away under us...

* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  4:47       ` Al Viro
@ 2019-04-11  5:05         ` Tobin C. Harding
  2019-04-11 20:01         ` Al Viro
  2019-04-11 21:02         ` Al Viro
  2 siblings, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-04-11  5:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:
> On Thu, Apr 11, 2019 at 12:48:21PM +1000, Tobin C. Harding wrote:
> 
> > Oh, so putting entries on a shrink list is enough to pin them?
> 
> Not exactly pin, but __dentry_kill() has this:
>         if (dentry->d_flags & DCACHE_SHRINK_LIST) {
>                 dentry->d_flags |= DCACHE_MAY_FREE;
>                 can_free = false;
>         }
>         spin_unlock(&dentry->d_lock);
>         if (likely(can_free))
>                 dentry_free(dentry);
> and shrink_dentry_list() - this:
>                         if (dentry->d_lockref.count < 0)
>                                 can_free = dentry->d_flags & DCACHE_MAY_FREE;
>                         spin_unlock(&dentry->d_lock);
>                         if (can_free)
>                                 dentry_free(dentry);
> 			continue;
> so if dentry destruction comes before we get around to
> shrink_dentry_list(), it'll stop short of dentry_free() and mark it for
> shrink_dentry_list() to do just dentry_free(); if it overlaps with
> shrink_dentry_list(), but doesn't progress all the way to freeing,
> we will
> 	* have dentry removed from shrink list
> 	* notice the negative ->d_count (i.e. that it has already reached
> __dentry_kill())
> 	* see that __dentry_kill() is not through with tearing the sucker
> apart (no DCACHE_MAY_FREE set)
> ... and just leave it alone, letting __dentry_kill() do the rest of its
> thing - it's already off the shrink list, so __dentry_kill() will do
> everything, including dentry_free().
> 
> The reason for that dance is the locking - shrink list belongs to whoever
> has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
> even try to remove the victim from there; it does all the teardown
> (detaches from inode, unhashes, etc.) and leaves removal from the shrink
> list and actual freeing to the owner of shrink list.  That way we don't
> have to protect all shrink lists a single lock (contention on it would
> be painful) and we don't have to play with per-shrink-list locks and
> all the attendant headaches (those lists usually live on stack frame
> of some function, so just having the lock next to the list_head would
> do us no good, etc.).  Much easier to have the shrink_dentry_list()
> do all the manipulations...
> 
> The bottom line is, once it's on a shrink list, it'll stay there
> until shrink_dentry_list().  It may get extra references after
> being inserted there (e.g. be found by hash lookup), it may drop
> those, whatever - it won't get freed until we run shrink_dentry_list().
> If it ends up with extra references, no problem - shrink_dentry_list()
> will just kick it off the shrink list and leave it alone.
> 
> Note, BTW, that umount coming between isolate and drop is not a problem;
> it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
> finds something on (another) shrink list, it won't put it to the shrink
> list of its own, but it will make note of that and repeat the scan in
> such case.  So if we find something with zero refcount and not on
> shrink list, we can move it to our shrink list and be sure that its
> superblock won't go away under us...

Man, that was good to read.  Thanks for taking the time to write this.


	Tobin

* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  4:47       ` Al Viro
  2019-04-11  5:05         ` Tobin C. Harding
@ 2019-04-11 20:01         ` Al Viro
  2019-04-11 21:02         ` Al Viro
  2 siblings, 0 replies; 28+ messages in thread
From: Al Viro @ 2019-04-11 20:01 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:

> The reason for that dance is the locking - shrink list belongs to whoever
> has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
> even try to remove the victim from there; it does all the teardown
> (detaches from inode, unhashes, etc.) and leaves removal from the shrink
> list and actual freeing to the owner of shrink list.  That way we don't
> have to protect all shrink lists a single lock (contention on it would
> be painful) and we don't have to play with per-shrink-list locks and
> all the attendant headaches (those lists usually live on stack frame
> of some function, so just having the lock next to the list_head would
> do us no good, etc.).  Much easier to have the shrink_dentry_list()
> do all the manipulations...
> 
> The bottom line is, once it's on a shrink list, it'll stay there
> until shrink_dentry_list().  It may get extra references after
> being inserted there (e.g. be found by hash lookup), it may drop
> those, whatever - it won't get freed until we run shrink_dentry_list().
> If it ends up with extra references, no problem - shrink_dentry_list()
> will just kick it off the shrink list and leave it alone.

FWIW, here's a braindump of sorts on the late stages of dentry
lifecycle (cut'n'paste from the local notes, with minimal editing;
I think the outright obscenities are all gone, but not much is done
beyond that):

        Events at the end of life

__dentry_kill() is called.  This is the point of no return; the victim
has no counting references left, no new ones are coming and we are
committed to tearing it down.  Caller is holding the following locks:
	a) ->d_lock on dentry itself
	b) ->i_lock on its inode, if dentry is positive
	c) ->d_lock on its parent, if dentry has a parent.
Acquiring those in the sane order (a nests inside of c, which nests inside of b)
can be rather convoluted, but that's the responsibility of callers.

State of dentry at that point:
        * it must not be a PAR_LOOKUP one, if it ever had been.  [See section
on PAR_LOOKUP state, specifically the need to exit that state before
dropping the last reference; <<the section in question is in too disorganised
state to include it here>>].
	* ->d_count is either 0 (eviction pathways - d_prune_aliases(),
shrink_dentry_list()) or 1 (when we are disposing of the last reference
and want it evicted rather than retained - dentry_kill(), called by
dput() or shrink_dentry_list()).  Note that ->d_lock stabilizes ->d_count.
        * its ->d_subdirs must be already empty (or we would've had
counting references from those).  Again, stabilized by ->d_lock.

We can detect dentries having reached that state by observing (under ->d_lock)
a negative ->d_count - that's the very first thing __dentry_kill() does.

At that point ->d_prune() is called - that's the last chance for a filesystem
to see a doomed dentry more or less intact.

After that dentry passes through several stages of teardown:
        * if dentry had been on LRU list, it is removed from there.
        * if dentry had been hashed, it is unhashed (and ->d_seq is
bumped)
        * dentry is made unreachable via d_child
        * dentry is made negative; if it used to be positive, inode
reference is dropped.  That's another place where filesystem might
get a chance to play (->d_iput(), as always for transitions from
positive to negative).  At that stage all spinlocks are dropped.
	* final filesystem call: ->d_release().  That's the time
to release whatever data structures filesystem might've had augmenting
that dentry.  NOTE: lockless accesses are still possible at that
point, so anything needed for those (->d_hash(), ->d_compare(),
lockless case of ->d_revalidate(), lockless case of ->d_manage())
MUST NOT be freed without an RCU delay.

At that stage dentry is essentially a dead body.  It might still
have lockless references hanging around and it might be on someone's
shrink list, but that's it.  The next stage is body disposal,
either immediately (if not on anyone's shrink list) or once
the owner of shrink list in question gets around to
shrink_dentry_list().

Disposal is done in dentry_free().  For dentries not on any
shrink list it's called directly from __dentry_kill().  That's
the normal case.  For dentries currently on some shrink list
__dentry_kill() marks the dentry as fully dead (DCACHE_MAY_FREE)
and leaves it for eventual shrink_dentry_list() to feed to
dentry_free().

Once dentry_free() is called, there can be only lockless references.
At that point the only things left in the sucker are
	* name (->d_name)
	* superblock it belongs to (->d_sb; won't be freed without
an RCU delay and neither will its file_system_type)
	* methods' table (->d_op)
	* ->d_flags and ->d_seq
	* parent's address (->d_parent; not pinned anymore - its
ownership is passed to caller, which proceeds to drop the reference.
However, parent will also not be freed without an RCU delay,
so lockless users can safely dereference it)
	* ->d_fsdata, if the filesystem had seen fit to leave it
around (see above re RCU delays for destroying anything used
by lockless methods)

Generally we don't get around to actually freeing dentry
(in __d_free()/__d_free_external()) without an RCU delay.

There is one important case where we *do* expedited freeing -
pipes and sockets (to be more precise, the stuff created by
alloc_file_pseudo()).  Those can't have lockless references
at all - they are never hashed, they are not anyone's parents
and they can't be a starting point of a lockless pathwalk
(see path_init() for details).  And they are created and
destroyed often enough to make RCU delays a noticeable burden.
So for those we do freeing immediately.  In -next it's
marked by DCACHE_NORCU in flags; in mainline it's a bit of
a mess at the moment.

The reason for __d_free/__d_free_external separation is
somewhat subtle.  We obviously need an RCU delay between
dentry_free() and freeing an external name, but why not
do the "drop refcount on external name and free if it hits
zero" right in __d_free()?  The thing is, we need an RCU
delay between the last decrement of extname refcount and
its freeing.  Suppose we have two dentries that happen
to share an extname.  Initially:

d1->d_name.name == d2->d_name.name == &ext->name; ext->count == 2

CPU1:
dentry_free(d1)
call_rcu() schedules __d_free()

CPU2:
d_path() on child of d2: rcu_read_lock(),
start walking towards root, copying names
get to d2, pick d2->d_name.name (i.e. ext->name)

CPU3:
rename d2, dropping a reference to its old name.
ext->count is 1 now, nothing freed.

CPU2:
start copying ext->name[]

... and scheduled __d_free() runs, dropping the last reference to
ext and freeing it.  The reason is that call_rcu() has happened
*BEFORE* rcu_read_lock(), so we get no protection whatsoever.

In other words, we need the decrement and check of external name
refcount before the RCU delay.  We could do the decrement and
check in __d_free(), but that would demand an additional RCU
delay for freeing.  It's cheaper to do decrement-and-check right
in dentry_free() and make the decision whether to free there.
Thus the two variants of __d_free() - one for "need to free
the external name", another for "no external name or not the
last reference to it".

In the scenario above the actual kernel gets ext->count to 1
in the dentry_free(d1) and schedules plain __d_free().  Then,
when we rename d2, dropping the other reference gets ext->count
to 0 and we use kfree_rcu() to schedule its freeing.  And _that_
happens after ->d_name switch, so either d_path() doesn't see
ext at all, or we are guaranteed that RCU delay before freeing
ext has started after rcu_read_lock() has been done by d_path().
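
A userspace sketch of the decrement-and-check placement described
above, using C11 atomics.  The toy_* names are hypothetical and the two
enum values merely stand in for scheduling __d_free() versus
__d_free_external(); no RCU is modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared external name. */
struct toy_extname {
	atomic_int count;	/* number of dentries using this name */
};

/* Which deferred free the caller would schedule. */
enum toy_free_kind {
	TOY_FREE_DENTRY_ONLY,		/* no extname, or not the last user */
	TOY_FREE_DENTRY_AND_NAME,	/* this dentry dropped the last ref */
};

/* Drop one reference; true if it was the last one. */
static bool toy_extname_put(struct toy_extname *ext)
{
	return atomic_fetch_sub(&ext->count, 1) == 1;
}

/*
 * Models the decision made in dentry_free(): the decrement and the
 * check happen here, before any deferred callback is scheduled, so the
 * callback itself never has to look at the refcount.
 */
static enum toy_free_kind toy_dentry_free(struct toy_extname *ext)
{
	if (ext && toy_extname_put(ext))
		return TOY_FREE_DENTRY_AND_NAME;
	return TOY_FREE_DENTRY_ONLY;
}
```

With a name shared by two dentries, freeing the first one yields
TOY_FREE_DENTRY_ONLY; the name goes away only when the second user
drops its reference.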

* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11  4:47       ` Al Viro
  2019-04-11  5:05         ` Tobin C. Harding
  2019-04-11 20:01         ` Al Viro
@ 2019-04-11 21:02         ` Al Viro
  2019-06-29  4:08           ` Al Viro
  2 siblings, 1 reply; 28+ messages in thread
From: Al Viro @ 2019-04-11 21:02 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel, Linus Torvalds

On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:

> Note, BTW, that umount coming between isolate and drop is not a problem;
> it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
> finds something on (another) shrink list, it won't put it to the shrink
> list of its own, but it will make note of that and repeat the scan in
> such case.  So if we find something with zero refcount and not on
> shrink list, we can move it to our shrink list and be sure that its
> superblock won't go away under us...

Aaaarrgghhh...  No, we can't.  Look: we get one candidate dentry in isolate
phase.  We put it into shrink list.  umount(2) comes and calls
shrink_dcache_for_umount(), which calls shrink_dcache_parent(root).
In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on
that one dentry.  Fine, it's gone - before shrink_dcache_parent() even
sees it.  Now shrink_dentry_list() holds a reference to its parent and
is about to drop it in
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
And dropped it will be, but... shrink_dcache_parent() has finished the
scan, without finding *anything* with zero refcount - the thing that used
to be on the shrink list was already gone before shrink_dcache_parent()
has gotten there and the reference to parent was not dropped yet.  So
shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the
tree and complains loudly about "busy" dentries (that parent we hadn't
finished dropping), and then we proceed with filesystem shutdown.
In the meanwhile, dentry_kill() finally gets to killing dentry and
triggers an unexpected late call of ->d_iput() on a filesystem that
has already been far enough into shutdown - far enough to destroy the
data structures needed for that sucker.

The reason we don't hit that problem with regular memory shrinker is
this:
                unregister_shrinker(&s->s_shrink);
                fs->kill_sb(s);
in deactivate_locked_super().  IOW, shrinker for this fs is gone
before we get around to shutdown.  And so are all normal sources
of dentry eviction for that fs.

Your earlier variants all suffer the same problem - picking a page
shared by dentries from several superblocks can run into trouble
if it overlaps with umount of one of those.

Fuck...  One variant of solution would be to have per-superblock
struct kmem_cache to be used for dentries of that superblock.
However,
	* we'd need to prevent them getting merged
	* it would add per-superblock memory costs (for struct
kmem_cache and associated structures)
	* it might mean more pages eaten by the dentries -
on average half a page per superblock (more if there are very
few dentries on that superblock)

OTOH, it might actually improve the memory footprint - all
dentries sharing a page would be from the same superblock,
so the use patterns might be more similar, which might
lower the fragmentation...

Hell knows...  I'd like to hear an opinion from VM folks on
that one.  Comments?

* Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects
  2019-04-11 21:02         ` Al Viro
@ 2019-06-29  4:08           ` Al Viro
  2019-06-29  4:38             ` shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects) Al Viro
  0 siblings, 1 reply; 28+ messages in thread
From: Al Viro @ 2019-06-29  4:08 UTC (permalink / raw)
  To: Tobin C. Harding
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel, Linus Torvalds

On Thu, Apr 11, 2019 at 10:02:00PM +0100, Al Viro wrote:

> Aaaarrgghhh...  No, we can't.  Look: we get one candidate dentry in isolate
> phase.  We put it into shrink list.  umount(2) comes and calls
> shrink_dcache_for_umount(), which calls shrink_dcache_parent(root).
> In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on
> that one dentry.  Fine, it's gone - before shrink_dcache_parent() even
> sees it.  Now shrink_dentry_list() holds a reference to its parent and
> is about to drop it in
>                 dentry = parent;
>                 while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
>                         dentry = dentry_kill(dentry);
> And dropped it will be, but... shrink_dcache_parent() has finished the
> scan, without finding *anything* with zero refcount - the thing that used
> to be on the shrink list was already gone before shrink_dcache_parent()
> has gotten there and the reference to parent was not dropped yet.  So
> shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the
> tree and complains loudly about "busy" dentries (that parent we hadn't
> finished dropping), and then we proceed with filesystem shutdown.
> In the meanwhile, dentry_kill() finally gets to killing dentry and
> triggers an unexpected late call of ->d_iput() on a filesystem that
> has already been far enough into shutdown - far enough to destroy the
> data structures needed for that sucker.
> 
> The reason we don't hit that problem with regular memory shrinker is
> this:
>                 unregister_shrinker(&s->s_shrink);
>                 fs->kill_sb(s);
> in deactivate_locked_super().  IOW, shrinker for this fs is gone
> before we get around to shutdown.  And so are all normal sources
> of dentry eviction for that fs.
> 
> Your earlier variants all suffer the same problem - picking a page
> shared by dentries from several superblocks can run into trouble
> if it overlaps with umount of one of those.

FWIW, I think I see a kinda-sorta sane solution.  Namely, add

static void __dput_to_list(struct dentry *dentry, struct list_head *list)
{
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		/* let the owner of the list it's on deal with it */
		--dentry->d_lockref.count;
	} else {
		if (dentry->d_flags & DCACHE_LRU_LIST)
			d_lru_del(dentry);
		if (!--dentry->d_lockref.count)
			d_shrink_add(dentry, list);
	}
}

and have
shrink_dentry_list() do this at the end of the loop:
                d_shrink_del(dentry);
                parent = dentry->d_parent;
		/* both dentry and parent are locked at that point */
		if (parent != dentry) {
			/*
			 * We need to prune ancestors too. This is necessary to
			 * prevent quadratic behavior of shrink_dcache_parent(),
			 * but is also expected to be beneficial in reducing
			 * dentry cache fragmentation.
			 */
			__dput_to_list(parent, list);
		}
		__dentry_kill(dentry);
        }

instead of
                d_shrink_del(dentry);
                parent = dentry->d_parent;
                __dentry_kill(dentry);
                if (parent == dentry)
                        continue;
                /*
                 * We need to prune ancestors too. This is necessary to prevent
                 * quadratic behavior of shrink_dcache_parent(), but is also
                 * expected to be beneficial in reducing dentry cache
                 * fragmentation.
                 */
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
        }
we have there now.  Linus, do you see any problems with that change?  AFAICS,
that should avoid the problem described above.  Moreover, it seems to allow
a fun API addition:

void dput_to_list(struct dentry *dentry, struct list_head *list)
{
	rcu_read_lock();
	if (likely(fast_dput(dentry))) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();
	if (!retain_dentry(dentry))
		__dput_to_list(dentry, list);
	spin_unlock(&dentry->d_lock);
}

which allows one to take an empty list, do a bunch of dput_to_list() (under spinlocks,
etc.), then, once we are in better locking conditions, shrink_dentry_list()
to take them all out.  I can see applications for that in e.g. fs/namespace.c -
quite a bit of kludges with ->mnt_ex_mountpoint would be killable that way,
and there would be a chance to transfer the contribution to ->d_count of
mountpoint from struct mount to struct mountpoint (i.e. make any number of
mounts on the same mountpoint dentry contribute only 1 to its ->d_count,
not the number of such mounts).
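
A userspace model of that flow, as a sketch only: toy_dput_to_list()
mirrors the logic of __dput_to_list() above and skips the fast_dput()
and retain_dentry() steps, and all toy_* names are invented.  The point
it tries to show is that the parent's reference is dropped onto the same
list before the child is killed, so no reference to the parent is held
outside of a shrink list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dentry. */
struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself for the root */
	struct toy_dentry *next;	/* link on a shrink list */
	int count;			/* models ->d_lockref.count */
	bool on_lru;
	bool on_shrink_list;
	bool killed;
};

struct toy_list {
	struct toy_dentry *head;
};

static void toy_shrink_add(struct toy_dentry *d, struct toy_list *list)
{
	d->next = list->head;
	list->head = d;
	d->on_shrink_list = true;
}

/* Drop one reference; park the dentry on @list if that was the last. */
static void toy_dput_to_list(struct toy_dentry *d, struct toy_list *list)
{
	if (d->on_shrink_list) {
		/* let the owner of the list it's on deal with it */
		--d->count;
	} else {
		d->on_lru = false;
		if (!--d->count)
			toy_shrink_add(d, list);
	}
}

/* Dispose of everything on @list, feeding parents back onto it. */
static int toy_shrink_list(struct toy_list *list)
{
	int killed = 0;

	while (list->head) {
		struct toy_dentry *d = list->head;

		list->head = d->next;
		d->next = NULL;
		d->on_shrink_list = false;
		if (d->count > 0)
			continue;	/* regained references, leave alone */
		if (d->parent != d)
			toy_dput_to_list(d->parent, list);
		d->killed = true;
		killed++;
	}
	return killed;
}
```

Dropping the last reference to a leaf and then running toy_shrink_list()
takes out the leaf and every ancestor whose only reference came from
the child being killed.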

* shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)
  2019-06-29  4:08           ` Al Viro
@ 2019-06-29  4:38             ` Al Viro
  2019-06-29 19:06               ` Al Viro
  0 siblings, 1 reply; 28+ messages in thread
From: Al Viro @ 2019-06-29  4:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel, Linus Torvalds

On Sat, Jun 29, 2019 at 05:08:44AM +0100, Al Viro wrote:
> > The reason we don't hit that problem with regular memory shrinker is
> > this:
> >                 unregister_shrinker(&s->s_shrink);
> >                 fs->kill_sb(s);
> > in deactivate_locked_super().  IOW, shrinker for this fs is gone
> > before we get around to shutdown.  And so are all normal sources
> > of dentry eviction for that fs.
> > 
> > Your earlier variants all suffer the same problem - picking a page
> > shared by dentries from several superblocks can run into trouble
> > if it overlaps with umount of one of those.

PS: the problem is not gone in the next iteration of the patchset in
question.  The patch I'm proposing (including dput_to_list() and _ONLY_
compile-tested) follows.  Comments?

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..dfe21a649c96 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@ void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@ static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }
 

* Re: shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)
  2019-06-29  4:38             ` shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects) Al Viro
@ 2019-06-29 19:06               ` Al Viro
  2019-06-29 22:29                 ` Al Viro
  2019-07-01  9:26                 ` Tobin C. Harding
  0 siblings, 2 replies; 28+ messages in thread
From: Al Viro @ 2019-06-29 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Sat, Jun 29, 2019 at 05:38:03AM +0100, Al Viro wrote:

> PS: the problem is not gone in the next iteration of the patchset in
> question.  The patch I'm proposing (including dput_to_list() and _ONLY_
> compile-tested) follows.  Comments?

FWIW, there's another unpleasantness in the whole thing.  Suppose we have
picked a page full of dentries, all with refcount 0.  We decide to
evict all of them.  As it turns out, they are from two filesystems.
Filesystem 1 is NFS on a server, with currently downed hub on the way
to it.  Filesystem 2 is local.  We attempt to evict an NFS dentry and
get stuck - tons of dirty data with no way to flush it to the server.
In the meantime, admin tries to unmount the local filesystem.  And
gets stuck as well, since umount can't do anything to its dentries
that happen to sit in our shrink list.

I wonder if the root of the problem here isn't in shrink_dcache_for_umount();
all it really needs is to have everything on that fs with refcount 0
dragged through __dentry_kill().  If something had been on a shrink
list, __dentry_kill() will just leave behind a struct dentry completely
devoid of any connection to superblock, other dentries, filesystem
type, etc. - it's just a piece of memory that won't be freed until
the owner of shrink list finally gets around to it.  Which can happen
at any point - all they'll do to it is dentry_free(), and that doesn't
need any fs-related data structures.
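
A toy userspace model of that husk behaviour, as a sketch only: every
name below (toy_sb, toy_dentry, toy_dentry_kill, toy_owner_reaps) is made
up for illustration, and locking, RCU and refcounts are left out entirely.
It only shows that a dentry killed while sitting on somebody's shrink list
loses its superblock link at once, while the memory is released later by
the list owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel structures. */
struct toy_sb {
	int nr_dentries;		/* dentries still attached to this fs */
};

struct toy_dentry {
	struct toy_sb *sb;		/* NULL once the dentry has been killed */
	bool on_shrink_list;		/* sitting on somebody's shrink list */
	bool freed;			/* memory has been released */
};

/* Models __dentry_kill(): sever all fs links, free only if nobody lists it. */
static void toy_dentry_kill(struct toy_dentry *d)
{
	assert(d->sb != NULL);
	d->sb->nr_dentries--;
	d->sb = NULL;
	if (!d->on_shrink_list)
		d->freed = true;	/* nobody else holds it: free right away */
	/* otherwise a husk is left behind for the owner of the shrink list */
}

/* Models the shrink list owner finally getting around to the entry. */
static void toy_owner_reaps(struct toy_dentry *d)
{
	assert(d->on_shrink_list);
	d->on_shrink_list = false;
	if (d->sb)
		toy_dentry_kill(d);	/* still alive: ordinary eviction */
	else
		d->freed = true;	/* husk: nothing fs-related to touch */
}
```

In this model an umount-like caller can bring nr_dentries to zero without
waiting for the list owner; the owner's later visit only releases memory.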

The logic in shrink_dcache_parent() is
	collect everything evictable into a shrink list
	if anything found - kick it out and repeat the scan
	otherwise, if something had been on other's shrink list
		repeat the scan

I wonder if after the "no evictable candidates, but something
on other's shrink lists" we ought to do something along the
lines of
	rcu_read_lock
	walk it, doing
		if dentry has zero refcount
			if it's not on a shrink list,
				move it to ours
			else
				store its address in 'victim'
				end the walk
	if no victim found
		rcu_read_unlock
	else
		lock victim for __dentry_kill
		rcu_read_unlock
		if it's still alive
			if it's not IS_ROOT
				if parent is not on shrink list
					decrement parent's refcount
					put it on our list
				else
					decrement parent's refcount
			__dentry_kill(victim)
		else
			unlock
	if our list is non-empty
		shrink_dentry_list on it
in there...
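
The outline above can be tried out on a flat array in userspace.  This is
a rough sketch under loud assumptions: toy_dentry, toy_list, toy_put_parent
and toy_fallback_pass are hypothetical names, the tree walk is replaced by
an array scan, and all locking and RCU (hence the "if it's still alive"
recheck) is omitted because the model is single-threaded.  A parent is
queued only when its count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which shrink list, if any, a toy dentry currently sits on. */
enum toy_list { LIST_NONE, LIST_OURS, LIST_OTHER };

/* Hypothetical stand-in for struct dentry. */
struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself for a root */
	int count;			/* models the refcount */
	enum toy_list list;
	bool dead;			/* already went through the kill step */
};

/* Drop one reference to a parent; claim it if now unreferenced and unlisted. */
static void toy_put_parent(struct toy_dentry *p)
{
	assert(p->count > 0);
	if (--p->count == 0 && p->list == LIST_NONE)
		p->list = LIST_OURS;
}

/*
 * One fallback pass: zero-count entries on no list move to ours; the
 * first zero-count entry on somebody else's list ends the walk and is
 * killed in place.  Returns the victim, or NULL if there was none.
 */
static struct toy_dentry *toy_fallback_pass(struct toy_dentry *d, size_t n)
{
	struct toy_dentry *victim = NULL;

	for (size_t i = 0; i < n && !victim; i++) {
		if (d[i].dead || d[i].count != 0)
			continue;
		if (d[i].list == LIST_NONE)
			d[i].list = LIST_OURS;	/* move it to ours */
		else if (d[i].list == LIST_OTHER)
			victim = &d[i];		/* end the walk */
	}
	if (!victim)
		return NULL;
	if (victim->parent != victim)		/* not a root */
		toy_put_parent(victim->parent);
	victim->dead = true;	/* stands in for __dentry_kill(victim) */
	return victim;
}
```

Calling toy_fallback_pass() repeatedly until it returns NULL, and then
evicting whatever ended up on LIST_OURS, corresponds to the last two lines
of the outline.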


* Re: shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)
  2019-06-29 19:06               ` Al Viro
@ 2019-06-29 22:29                 ` Al Viro
  2019-06-29 22:34                   ` Al Viro
  2019-07-01  9:26                 ` Tobin C. Harding
  1 sibling, 1 reply; 28+ messages in thread
From: Al Viro @ 2019-06-29 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Sat, Jun 29, 2019 at 08:06:24PM +0100, Al Viro wrote:
> I wonder if after the "no evictable candidates, but something
> on other's shrink lists" we ought to do something along the
> lines of
> 	rcu_read_lock
> 	walk it, doing
> 		if dentry has zero refcount
> 			if it's not on a shrink list,
> 				move it to ours
> 			else
> 				store its address in 'victim'
> 				end the walk
> 	if no victim found
> 		rcu_read_unlock
> 	else
> 		lock victim for __dentry_kill
> 		rcu_read_unlock
> 		if it's still alive
> 			if it's not IS_ROOT
> 				if parent is not on shrink list
> 					decrement parent's refcount
> 					put it on our list
> 				else
> 					decrement parent's refcount
> 			__dentry_kill(victim)
> 		else
> 			unlock
> 	if our list is non-empty
> 		shrink_dentry_list on it
> in there...

Like this (again, only build-tested):

Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists

Currently, running into a shrink list that contains dentries from different
filesystems can cause several unpleasant things for shrink_dcache_parent()
and for umount(2).

The first problem is that there's a window during shrink_dentry_list() between
__dentry_kill() taking a victim out and dropping the reference to its parent.  During
that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
(or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
eviction candidates and no indication that it needs to wait for some
shrink_dentry_list() to proceed further.

That applies to any shrink list that might intersect with the subtree we are
trying to shrink; the only reason it does not blow up on umount(2) in the mainline
is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().

Another problem happens if something in a mixed-filesystem shrink list gets
stuck in e.g. iput(), causing umount of an unrelated fs to spin waiting for
the stuck shrinker to get around to our dentries.

Solution:
	1) have shrink_dentry_list() decrement the parent's refcount and
make sure it's on a shrink list (ours unless it already had been on some
other) before calling __dentry_kill().  That eliminates the window when
shrink_dcache_parent() would've blown past the entire subtree without
noticing anything with zero refcount not on shrink lists.
	2) when shrink_dcache_parent() has found no eviction candidates,
but some dentries are still sitting on shrink lists, rather than
repeating the scan in hope that shrinkers have progressed, scan looking
for something on shrink lists with zero refcount.  If such a thing is
found, grab rcu_read_lock() and stop the scan, with caller locking
it for eviction, dropping out of RCU and doing __dentry_kill(), with
the same treatment for parent as shrink_dentry_list() would do.

Note that right now mixed-filesystem shrink lists do not occur, so this
is not a mainline bug.  However, there's a bunch of uses for such
beasts (e.g. the "try and evict everything we can out of given page"
patches; there are potential uses in mount-related code, considerably
simplifying the life in fs/namespace.c, etc.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..43480f516329 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@ void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@ static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }
 
@@ -1444,8 +1461,11 @@ int d_set_mounted(struct dentry *dentry)
 
 struct select_data {
 	struct dentry *start;
+	union {
+		long found;
+		struct dentry *victim;
+	};
 	struct list_head dispose;
-	int found;
 };
 
 static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
@@ -1477,6 +1497,37 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	return ret;
 }
 
+static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
+{
+	struct select_data *data = _data;
+	enum d_walk_ret ret = D_WALK_CONTINUE;
+
+	if (data->start == dentry)
+		goto out;
+
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		if (!dentry->d_lockref.count) {
+			rcu_read_lock();
+			data->victim = dentry;
+			return D_WALK_QUIT;
+		}
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!dentry->d_lockref.count)
+			d_shrink_add(dentry, &data->dispose);
+	}
+	/*
+	 * We can return to the caller if we have found some (this
+	 * ensures forward progress). We'll be coming back to find
+	 * the rest.
+	 */
+	if (!list_empty(&data->dispose))
+		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
+out:
+	return ret;
+}
+
 /**
  * shrink_dcache_parent - prune dcache
  * @parent: parent of entries to prune
@@ -1486,12 +1537,9 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 void shrink_dcache_parent(struct dentry *parent)
 {
 	for (;;) {
-		struct select_data data;
+		struct select_data data = {.start = parent};
 
 		INIT_LIST_HEAD(&data.dispose);
-		data.start = parent;
-		data.found = 0;
-
 		d_walk(parent, &data, select_collect);
 
 		if (!list_empty(&data.dispose)) {
@@ -1502,6 +1550,21 @@ void shrink_dcache_parent(struct dentry *parent)
 		cond_resched();
 		if (!data.found)
 			break;
+		d_walk(parent, &data, select_collect2);
+		if (data.victim) {
+			struct dentry *parent;
+			if (!shrink_lock_dentry(data.victim)) {
+				rcu_read_unlock();
+			} else {
+				rcu_read_unlock();
+				parent = data.victim->d_parent;
+				if (parent != data.victim)
+					__dput_to_list(parent, &data.dispose);
+				__dentry_kill(data.victim);
+			}
+		}
+		if (!list_empty(&data.dispose))
+			shrink_dentry_list(&data.dispose);
 	}
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
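
The window described in the commit message can be illustrated with a small
userspace model.  This is a sketch, not kernel code: toy_dentry and the
toy_* helpers are hypothetical, LRU handling is dropped from the
__dput_to_list() stand-in, and "a concurrent scan" is reduced to a
predicate evaluated halfway through each ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct dentry. */
struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself for a root */
	int count;			/* models d_lockref.count */
	bool on_shrink_list;		/* models DCACHE_SHRINK_LIST */
	bool dead;			/* already went through the kill step */
};

/* Models __dput_to_list() without the LRU part. */
static void toy_dput_to_list(struct toy_dentry *d)
{
	assert(d->count > 0);
	if (--d->count == 0 && !d->on_shrink_list)
		d->on_shrink_list = true;	/* queued on our list */
}

/*
 * What a concurrent scan can learn from a dentry: true if it is either
 * an eviction candidate or visibly on somebody's shrink list.
 */
static bool toy_visible_to_scan(const struct toy_dentry *d)
{
	return !d->dead && (d->count == 0 || d->on_shrink_list);
}

/* Old ordering, first half: victim killed, parent reference still held. */
static void toy_old_kill_first(struct toy_dentry *child)
{
	child->dead = true;
}

/* New ordering, first half: parent reference dropped, victim not yet killed. */
static void toy_new_put_first(struct toy_dentry *child)
{
	if (child->parent != child)
		toy_dput_to_list(child->parent);
}
```

Halfway through the old ordering neither the dead child nor its parent
(count still 1, on no list) is visible to the scan; halfway through the new
ordering the parent already sits on a shrink list with a zero count.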


* Re: shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)
  2019-06-29 22:29                 ` Al Viro
@ 2019-06-29 22:34                   ` Al Viro
  0 siblings, 0 replies; 28+ messages in thread
From: Al Viro @ 2019-06-29 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tobin C. Harding, Andrew Morton, Roman Gushchin, Alexander Viro,
	Christoph Hellwig, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Sat, Jun 29, 2019 at 11:29:45PM +0100, Al Viro wrote:

> Like this (again, only build-tested):
... and with obvious braino fixed,

Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists

Currently, running into a shrink list that contains dentries from different
filesystems can cause several unpleasant things for shrink_dcache_parent()
and for umount(2).

The first problem is that there's a window during shrink_dentry_list() between
__dentry_kill() taking a victim out and dropping the reference to its parent.  During
that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
(or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
eviction candidates and no indication that it needs to wait for some
shrink_dentry_list() to proceed further.

That applies to any shrink list that might intersect with the subtree we are
trying to shrink; the only reason it does not blow up on umount(2) in the mainline
is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().

Another problem happens if something in a mixed-filesystem shrink list gets
stuck in e.g. iput(), causing umount of an unrelated fs to spin waiting for
the stuck shrinker to get around to our dentries.

Solution:
    1) have shrink_dentry_list() decrement the parent's refcount and
make sure it's on a shrink list (ours unless it already had been on some
other) before calling __dentry_kill().  That eliminates the window when
shrink_dcache_parent() would've blown past the entire subtree without
noticing anything with zero refcount not on shrink lists.
    2) when shrink_dcache_parent() has found no eviction candidates,
but some dentries are still sitting on shrink lists, rather than
repeating the scan in hope that shrinkers have progressed, scan looking
for something on shrink lists with zero refcount.  If such a thing is
found, grab rcu_read_lock() and stop the scan, with caller locking
it for eviction, dropping out of RCU and doing __dentry_kill(), with
the same treatment for parent as shrink_dentry_list() would do.

Note that right now mixed-filesystem shrink lists do not occur, so this
is not a mainline bug.  However, there's a bunch of uses for such
beasts (e.g. the "try and evict everything we can out of given page"
patches; there are potential uses in mount-related code, considerably
simplifying the life in fs/namespace.c, etc.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..4b50e09ee950 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@ void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@ static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }
 
@@ -1444,8 +1461,11 @@ int d_set_mounted(struct dentry *dentry)
 
 struct select_data {
 	struct dentry *start;
+	union {
+		long found;
+		struct dentry *victim;
+	};
 	struct list_head dispose;
-	int found;
 };
 
 static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
@@ -1477,6 +1497,37 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	return ret;
 }
 
+static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
+{
+	struct select_data *data = _data;
+	enum d_walk_ret ret = D_WALK_CONTINUE;
+
+	if (data->start == dentry)
+		goto out;
+
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		if (!dentry->d_lockref.count) {
+			rcu_read_lock();
+			data->victim = dentry;
+			return D_WALK_QUIT;
+		}
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!dentry->d_lockref.count)
+			d_shrink_add(dentry, &data->dispose);
+	}
+	/*
+	 * We can return to the caller if we have found some (this
+	 * ensures forward progress). We'll be coming back to find
+	 * the rest.
+	 */
+	if (!list_empty(&data->dispose))
+		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
+out:
+	return ret;
+}
+
 /**
  * shrink_dcache_parent - prune dcache
  * @parent: parent of entries to prune
@@ -1486,12 +1537,9 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 void shrink_dcache_parent(struct dentry *parent)
 {
 	for (;;) {
-		struct select_data data;
+		struct select_data data = {.start = parent};
 
 		INIT_LIST_HEAD(&data.dispose);
-		data.start = parent;
-		data.found = 0;
-
 		d_walk(parent, &data, select_collect);
 
 		if (!list_empty(&data.dispose)) {
@@ -1502,6 +1550,22 @@ void shrink_dcache_parent(struct dentry *parent)
 		cond_resched();
 		if (!data.found)
 			break;
+		data.victim = NULL;
+		d_walk(parent, &data, select_collect2);
+		if (data.victim) {
+			struct dentry *parent;
+			if (!shrink_lock_dentry(data.victim)) {
+				rcu_read_unlock();
+			} else {
+				rcu_read_unlock();
+				parent = data.victim->d_parent;
+				if (parent != data.victim)
+					__dput_to_list(parent, &data.dispose);
+				__dentry_kill(data.victim);
+			}
+		}
+		if (!list_empty(&data.dispose))
+			shrink_dentry_list(&data.dispose);
 	}
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/internal.h b/fs/internal.h
index 0010889f2e85..68f132cf2664 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -160,6 +160,7 @@ extern int d_set_mounted(struct dentry *dentry);
 extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 extern struct dentry *d_alloc_cursor(struct dentry *);
 extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
+extern void dput_to_list(struct dentry *, struct list_head *);
 
 /*
  * read_write.c




* Re: shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)
  2019-06-29 19:06               ` Al Viro
  2019-06-29 22:29                 ` Al Viro
@ 2019-07-01  9:26                 ` Tobin C. Harding
  1 sibling, 0 replies; 28+ messages in thread
From: Tobin C. Harding @ 2019-07-01  9:26 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Tobin C. Harding, Andrew Morton, Roman Gushchin,
	Alexander Viro, Christoph Hellwig, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Christopher Lameter, Matthew Wilcox, Miklos Szeredi,
	Andreas Dilger, Waiman Long, Tycho Andersen, Theodore Ts'o,
	Andi Kleen, David Chinner, Nick Piggin, Rik van Riel,
	Hugh Dickins, Jonathan Corbet, linux-mm, linux-fsdevel,
	linux-kernel

On Sat, Jun 29, 2019 at 08:06:24PM +0100, Al Viro wrote:
> On Sat, Jun 29, 2019 at 05:38:03AM +0100, Al Viro wrote:
> 
> > PS: the problem is not gone in the next iteration of the patchset in
> > question.  The patch I'm proposing (including dput_to_list() and _ONLY_
> > compile-tested) follows.  Comments?
> 
> FWIW, there's another unpleasantness in the whole thing.  Suppose we have
> picked a page full of dentries, all with refcount 0.  We decide to
> evict all of them.  As it turns out, they are from two filesystems.
> Filesystem 1 is NFS on a server, with currently downed hub on the way
> to it.  Filesystem 2 is local.  We attempt to evict an NFS dentry and
> get stuck - tons of dirty data with no way to flush them on server.
> In the meanwhile, admin tries to unmount the local filesystem.  And
> gets stuck as well, since umount can't do anything to its dentries
> that happen to sit in our shrink list.
> 
> I wonder if the root of problem here isn't in shrink_dcache_for_umount();
> all it really needs is to have everything on that fs with refcount 0
> dragged through __dentry_kill().  If something had been on a shrink
> list, __dentry_kill() will just leave behind a struct dentry completely
> devoid of any connection to superblock, other dentries, filesystem
> type, etc. - it's just a piece of memory that won't be freed until
> the owner of shrink list finally gets around to it.  Which can happen
> at any point - all they'll do to it is dentry_free(), and that doesn't
> need any fs-related data structures.
> 
> The logics in shrink_dcache_parent() is
> 	collect everything evictable into a shrink list
> 	if anything found - kick it out and repeat the scan
> 	otherwise, if something had been on other's shrink list
> 		repeat the scan
> 
> I wonder if after the "no evictable candidates, but something
> on other's shrink lists" we ought to do something along the
> lines of
> 	rcu_read_lock
> 	walk it, doing
> 		if dentry has zero refcount
> 			if it's not on a shrink list,
> 				move it to ours
> 			else
> 				store its address in 'victim'
> 				end the walk
> 	if no victim found
> 		rcu_read_unlock
> 	else
> 		lock victim for __dentry_kill
> 		rcu_read_unlock
> 		if it's still alive
> 			if it's not IS_ROOT
> 				if parent is not on shrink list
> 					decrement parent's refcount
> 					put it on our list
> 				else
> 					decrement parent's refcount
> 			__dentry_kill(victim)
> 		else
> 			unlock
> 	if our list is non-empty
> 		shrink_dentry_list on it
> in there...

Thanks for still thinking about this, Al.  I don't have much of an idea
what to do with your comments until I can grok them fully, but I
wanted to acknowledge having read them.

Thanks,
Tobin.


Thread overview: 28+ messages
2019-04-11  1:34 [RFC PATCH v3 00/15] Slab Movable Objects (SMO) Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 01/15] slub: Add isolate() and migrate() methods Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 02/15] tools/vm/slabinfo: Add support for -C and -M options Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 03/15] slub: Sort slab cache list Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 04/15] slub: Slab defrag core Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 05/15] tools/vm/slabinfo: Add remote node defrag ratio output Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 06/15] tools/vm/slabinfo: Add defrag_used_ratio output Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 07/15] tools/testing/slab: Add object migration test module Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 08/15] tools/testing/slab: Add object migration test suite Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 09/15] xarray: Implement migration function for objects Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 10/15] tools/testing/slab: Add XArray movable objects tests Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 11/15] slub: Enable moving objects to/from specific nodes Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 12/15] slub: Enable balancing slabs across nodes Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 13/15] dcache: Provide a dentry constructor Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects Tobin C. Harding
2019-04-11  2:33   ` Al Viro
2019-04-11  2:48     ` Tobin C. Harding
2019-04-11  4:47       ` Al Viro
2019-04-11  5:05         ` Tobin C. Harding
2019-04-11 20:01         ` Al Viro
2019-04-11 21:02         ` Al Viro
2019-06-29  4:08           ` Al Viro
2019-06-29  4:38             ` shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects) Al Viro
2019-06-29 19:06               ` Al Viro
2019-06-29 22:29                 ` Al Viro
2019-06-29 22:34                   ` Al Viro
2019-07-01  9:26                 ` Tobin C. Harding
2019-04-11  1:34 ` [RFC PATCH v3 15/15] dcache: Add CONFIG_DCACHE_SMO Tobin C. Harding
