* [RFC 0/6] Slab Fragmentation Reduction V16
@ 2017-03-07 21:24 Christoph Lameter
  2017-03-07 21:24 ` [RFC 1/6] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

V15->V16
- Reworked core logic against 4.11 kernel code
- Just the bare bones for Matthew to have the ability to review
  the patches and to see how slab defrag could work with the radix
  tree and/or new xarrays. Skip reclaim integration etc etc.

V14->V15
- The lost version ... I posted it in 2010 but the material is nowhere
  to be found on my backups.

V13->V14
- Rediff against linux-next on request of Andrew
- TestSetPageLocked -> trylock_page conversion.

Slab fragmentation is mainly an issue if Linux is used as a fileserver
and large amounts of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that only contain one or a few objects. In
extreme cases the performance of a machine will become sluggish since
we are continually running reclaim without much success.
Slab defragmentation adds the capability to recover the memory that
is wasted.

Memory reclaim for the following slab caches is possible:

1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
   filesystems than the currently supported ext2/3/4, reiserfs, XFS
   and proc)
3. buffer_heads

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

	updatedb

Updatedb scans all files on the system which causes a high inode and dentry
use. After updatedb is complete we need to go back to the regular use
patterns (typical on my machine: kernel compiles). Those need the memory now
for different purposes. The inodes and dentries used for updatedb will
gradually be aged by the dentry/inode reclaim algorithm which will free
up the dentries and inode entries randomly through the slabs that were
allocated. As a result the slabs will become sparsely populated. If they
become empty then they can be freed but a lot of them will remain sparsely
populated. That is where slab defrag comes in: it removes the objects from
the slabs with just a few entries, reclaiming more memory for other uses.
In the simplest case (as provided here) this is done by simply reclaiming
the objects.

However, if the logic in the kick() function is made more
sophisticated then we will be able to move the objects out of the slabs.
When a slab cache is fragmented, objects can be allocated without going to
the page allocator because a large number of free slots are available. Moving
an object will reduce fragmentation in the slab the object is moved to.

V12->v13:
- Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
- Fix uninitialized variable issue

V11->V12:
- Pekka and I fixed various minor issues pointed out by Andrew.
- Split ext2/3/4 defrag support patches.
- Add more documentation
- Revise the way that slab defrag is triggered from reclaim. No longer
  use a timeout but track the amount of slab reclaim done by the shrinkers.
  Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
- Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
  /proc/sys/vm/slab_defrag_count (for global reclaim).
- Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
- Add a patch that obsoletes SLAB and explains why SLOB does not support
  defrag (Either of those could be theoretically equipped to support
  slab defrag in some way but it seems that Andrew/Linus want to reduce
  the number of slab allocators).

V10->V11
- Simplify determination when to reclaim: Just scan over all partials
  and check if they are sparsely populated.
- Add support for performance counters
- Rediff on top of current slab-mm.
- Reduce frequency of scanning. A look at the stats showed that we
  were calling into reclaim very frequently when the system was under
  memory pressure which slowed things down. Various measures to
  avoid scanning the partial list too frequently were added and the
  earlier (expensive) method of determining the defrag ratio of the slab
  cache as a whole was dropped. I think this addresses the issues that
  Mel saw with V10.

V9->V10
- Rediff against upstream

V8->V9
- Rediff against 2.6.24-rc6-mm1

V7->V8
- Rediff against 2.6.24-rc3-mm2

V6->V7
- Rediff against 2.6.24-rc2-mm1
- Remove lumpy reclaim support. No point anymore given that the antifrag
  handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
  Targeted reclaim never triggers. This has to wait until we make
  slabs movable or we need to perform a special version of lumpy reclaim
  in SLUB while we scan the partial lists for slabs to kick out.
  Removal simplifies handling significantly since we
  get to slabs in a more controlled way via the partial lists.
  The patchset now provides pure reduction of fragmentation levels.
- SLAB/SLOB: Provide inlines that do nothing
- Fix various smaller issues that were brought up during review of V6.

V5->V6
- Rediff against 2.6.24-rc2 + mm slub patches.
- Add reviewed by lines.
- Take out the experimental code to make slab pages movable. That
  has to wait until this has been considered by Mel.

V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
  per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
  improved by slab defragmentation.

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
  have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
  functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
  a new dentry flag that indicates that a dentry is not in the process
  of being freed or allocated.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [RFC 1/6] slub: Replace ctor field with ops field in /sys/slab/*
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-07 21:24 ` [RFC 2/6] slub: Add defrag_ratio field and sysfs support Christoph Lameter
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: ctor_to_ops --]
[-- Type: text/plain, Size: 1428 bytes --]

Create an ops field in /sys/slab/*/ops to contain all the operations defined
on a slab. This will be used to display the additional operations that will
be defined soon to enable defragmentation.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -4942,13 +4942,18 @@ static ssize_t cpu_partial_store(struct
 }
 SLAB_ATTR(cpu_partial);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
+	int x = 0;
+
 	if (!s->ctor)
 		return 0;
-	return sprintf(buf, "%pS\n", s->ctor);
+
+	if (s->ctor)
+		x += sprintf(buf + x, "ctor : %pS\n", s->ctor);
+	return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -5356,7 +5361,7 @@ static struct attribute *slab_attrs[] =
 	&objects_partial_attr.attr,
 	&partial_attr.attr,
 	&cpu_slabs_attr.attr,
-	&ctor_attr.attr,
+	&ops_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
 	&hwcache_align_attr.attr,

* [RFC 2/6] slub: Add defrag_ratio field and sysfs support
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
  2017-03-07 21:24 ` [RFC 1/6] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-07 21:24 ` [RFC 3/6] slub: Add get() and kick() methods Christoph Lameter
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: defrag_ratio --]
[-- Type: text/plain, Size: 3698 bytes --]

The defrag_ratio is used to set the threshold at which defragmentation
should be attempted on a slab page.

The allocation ratio is measured by the percentage of the available slots
allocated.

Add a defrag ratio field and set it to 30% by default. A limit of 30% specifies
that fewer than 3 out of 10 available slots for objects must be in use before
slab defragmentation runs.
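
As a rough illustration of the threshold, the check can be done in integer
arithmetic without a division. This is only a sketch:
slab_below_defrag_ratio() is a hypothetical user-space helper and not a
function added by this patch. It uses the same comparison that the defrag
core patch later in this series applies to page->inuse and page->objects.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): returns true when a slab
 * page with `inuse` allocated objects out of `objects` slots is below the
 * `defrag_ratio` percentage and would therefore be a defrag candidate.
 */
static bool slab_below_defrag_ratio(unsigned int inuse, unsigned int objects,
				    unsigned int defrag_ratio)
{
	assert(objects > 0 && inuse <= objects);

	/* Integer form of: inuse / objects < defrag_ratio / 100 */
	return inuse * 100 < defrag_ratio * objects;
}
```

With the default of 30, a slab with 2 of 10 slots in use qualifies while a
slab with 3 of 10 slots in use does not.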

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 Documentation/ABI/testing/sysfs-kernel-slab |   13 +++++++++++++
 include/linux/slub_def.h                    |    6 ++++++
 mm/slub.c                                   |   23 +++++++++++++++++++++++
 3 files changed, 42 insertions(+)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3596,6 +3596,7 @@ static int kmem_cache_open(struct kmem_c
 	else
 		s->cpu_partial = 30;
 
+	s->defrag_ratio = 30;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -5057,6 +5058,27 @@ static ssize_t reserved_show(struct kmem
 }
 SLAB_ATTR_RO(reserved);
 
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+				const char *buf, size_t length)
+{
+	unsigned long ratio;
+	int err;
+
+	err = kstrtoul(buf, 10, &ratio);
+	if (err)
+		return err;
+
+	if (ratio < 100)
+		s->defrag_ratio = ratio;
+	return length;
+}
+SLAB_ATTR(defrag_ratio);
+
 #ifdef CONFIG_SLUB_DEBUG
 static ssize_t slabs_show(struct kmem_cache *s, char *buf)
 {
@@ -5381,6 +5403,7 @@ static struct attribute *slab_attrs[] =
 	&validate_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
+	&defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_ZONE_DMA
 	&cache_dma_attr.attr,
Index: linux/Documentation/ABI/testing/sysfs-kernel-slab
===================================================================
--- linux.orig/Documentation/ABI/testing/sysfs-kernel-slab
+++ linux/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,19 @@ Description:
 		list.  It can be written to clear the current count.
 		Available when CONFIG_SLUB_STATS is enabled.
 
+What:		/sys/kernel/slab/cache/defrag_ratio
+Date:		August 2017
+KernelVersion:	4.13
+Contact:	Christoph Lameter <cl@linux-foundation.org>,
+		Pekka Enberg <penberg@cs.helsinki.fi>
+Description:
+		The defrag_ratio file controls how aggressively slab
+		fragmentation reduction reclaims objects from sparsely
+		populated slabs. This is a percentage. If a slab
+		contains less than this percentage of objects then reclaim
+		will attempt to reclaim objects so that the whole slab
+		page can be freed. The default is 30%.
+
 What:		/sys/kernel/slab/cache/deactivate_to_tail
 Date:		February 2008
 KernelVersion:	2.6.25
Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -82,6 +82,13 @@ struct kmem_cache {
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
 	int red_left_pad;	/* Left redzone padding size */
+
+	int defrag_ratio;	/*
+				 * Ratio used to check the percentage of
+				 * objects allocated in a slab page.
+				 * If less than this ratio is allocated
+				 * then reclaim attempts are made.
+				 */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif

* [RFC 3/6] slub: Add get() and kick() methods
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
  2017-03-07 21:24 ` [RFC 1/6] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
  2017-03-07 21:24 ` [RFC 2/6] slub: Add defrag_ratio field and sysfs support Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-07 21:24 ` [RFC 4/6] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: get_and_kick --]
[-- Type: text/plain, Size: 5928 bytes --]

V15->V16
 - Disable CMPXCHG_DOUBLE mode if these methods are specified.
   Maybe we can find another safer way later that can use the
   cmpxchg double fast mode.

Add the two methods needed for defragmentation and add the display of the
methods via the sysfs interface (/sys/slab/*/ops).

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add functions to setup the defrag methods for a slab cache.

Add empty functions for SLAB/SLOB. The API is generic so it
could be theoretically implemented for either allocator.
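
To make the intended division of labour between the two methods more
concrete, here is a user-space sketch of a callback pair. Everything named
demo_* is hypothetical and only there for illustration, and the refcount
convention (0 = being torn down, 1 = held only by the cache) is an
assumption of this example. The two typedefs are copied from the slab.h
hunk of this patch; struct kmem_cache is left opaque.

```c
#include <assert.h>
#include <stddef.h>

struct kmem_cache;	/* opaque stand-in for the kernel type */

typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);

/* Hypothetical cached object used only by this sketch. */
struct demo_obj {
	int refcount;	/* 0: dying, 1: only the cache holds it */
	int evicted;	/* set by demo_kick() when the object was dropped */
};

/* Declaring through the typedefs checks that the signatures match. */
static kmem_defrag_get_func demo_get;
static kmem_defrag_kick_func demo_kick;

/* get(): pin every live object, remove dying ones from the array. */
static void *demo_get(struct kmem_cache *s, int nr, void **v)
{
	int i;

	(void)s;
	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		struct demo_obj *o = v[i];

		if (!o)
			continue;
		if (o->refcount == 0)
			v[i] = NULL;	/* already on its way out */
		else
			o->refcount++;
	}
	return NULL;	/* this sketch needs no private state */
}

/* kick(): drop the pin and evict objects nobody else is using. */
static void demo_kick(struct kmem_cache *s, int nr, void **v, void *private)
{
	int i;

	(void)s;
	(void)private;
	for (i = 0; i < nr; i++) {
		struct demo_obj *o = v[i];

		if (!o)
			continue;
		if (--o->refcount == 1)
			o->evicted = 1;	/* only the cache reference is left */
	}
}
```

In the kernel such a pair would be registered once with
kmem_cache_setup_defrag(s, demo_get, demo_kick) after the cache has been
created with a constructor.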

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slab.h     |   50 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/slub_def.h |    3 ++
 mm/slub.c                |   29 ++++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 1 deletion(-)

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -76,6 +76,9 @@ struct kmem_cache {
 	gfp_t allocflags;	/* gfp flags to use on each alloc */
 	int refcount;		/* Refcount for slab cache destroy */
 	void (*ctor)(void *);
+	kmem_defrag_get_func *get;
+	kmem_defrag_kick_func *kick;
+
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
 	int reserved;		/* Reserved bytes at the end of slabs */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3439,6 +3439,8 @@ static int calculate_sizes(struct kmem_c
 	else
 		s->flags &= ~__OBJECT_POISON;
 
+	if (s->ctor || s->kick || s->get)
+		return 1;
 
 	/*
 	 * If we are Redzoning then check if there is some space between the
@@ -4258,6 +4260,25 @@ int __kmem_cache_create(struct kmem_cach
 	return err;
 }
 
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+	kmem_defrag_get_func get, kmem_defrag_kick_func kick)
+{
+	/*
+	 * Defragmentable slabs must have a ctor otherwise objects may be
+	 * in an undetermined state after they are allocated.
+	 */
+	BUG_ON(!s->ctor);
+	s->get = get;
+	s->kick = kick;
+	/*
+	 * Sadly serialization requirements currently mean that we have
+	 * to disable fast cmpxchg based processing.
+	 */
+	s->flags &= ~__CMPXCHG_DOUBLE;
+
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
 void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
 	struct kmem_cache *s;
@@ -4952,6 +4973,20 @@ static ssize_t ops_show(struct kmem_cach
 
 	if (s->ctor)
 		x += sprintf(buf + x, "ctor : %pS\n", s->ctor);
+
+	if (s->get) {
+		x += sprintf(buf + x, "get : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->get);
+		x += sprintf(buf + x, "\n");
+	}
+
+	if (s->kick) {
+		x += sprintf(buf + x, "kick : ");
+		x += sprint_symbol(buf + x,
+				(unsigned long)s->kick);
+		x += sprintf(buf + x, "\n");
+	}
 	return x;
 }
 SLAB_ATTR_RO(ops);
Index: linux/include/linux/slab.h
===================================================================
--- linux.orig/include/linux/slab.h
+++ linux/include/linux/slab.h
@@ -135,6 +135,59 @@ void memcg_deactivate_kmem_caches(struct
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
 /*
+ * Function types passed to kmem_cache_setup_defrag() to enable
+ * defragmentation and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_defrag_get_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_defrag_get_func()
+ * is to obtain a stable refcount on the objects, so that they cannot be
+ * removed until kmem_defrag_kick_func() has handled them.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects for which we need references.
+ *
+ * Returns a pointer that is passed to the kick function. If any objects
+ * cannot be moved then the pointer may indicate a failure and
+ * then kick can simply remove the references that were already obtained.
+ *
+ * The object pointer array passed is also passed to kmem_defrag_kick_func().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is already about
+ * to be removed. In that case it is often impossible to obtain the necessary
+ * refcount.
+ */
+typedef void *kmem_defrag_get_func(struct kmem_cache *, int, void **);
+
+/*
+ * kmem_defrag_kick_func() is called with no locks held and interrupts
+ * enabled. Sleeping is possible. Any operation may be performed in kick().
+ * kick() should free all the objects in the pointer array.
+ *
+ * Parameters passed are the number of objects in the array, the array of
+ * pointers to the objects and the pointer returned by kmem_defrag_get_func().
+ *
+ * Success is checked by examining the number of remaining objects in the slab.
+ */
+typedef void kmem_defrag_kick_func(struct kmem_cache *, int, void **, void *);
+
+/*
+ * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
+						kmem_defrag_kick_func);
+#else
+static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
+	kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+#endif
+
+/*
+ * Allocator specific definitions. These are mainly used to establish optimized
+ * ways to convert kmalloc() calls to kmem_cache_alloc() invocations by
+ * selecting the appropriate general cache at compile time.
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
  *

* [RFC 4/6] slub: Sort slab cache list and establish maximum objects for defrag slabs
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
                   ` (2 preceding siblings ...)
  2017-03-07 21:24 ` [RFC 3/6] slub: Add get() and kick() methods Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-07 21:24 ` [RFC 5/6] slub: Slab defrag core Christoph Lameter
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: sort_and_max --]
[-- Type: text/plain, Size: 3021 bytes --]

It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.

Determine the maximum number of objects in defragmentable slabs. This allows
the sizing of the array holding refs to objects in a slab later.
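
For reference, the size of the scratch area that follows from this maximum
can be worked out in user space as below. This is a sketch under
assumptions: scratch_bytes() is a hypothetical name, and BITS_PER_LONG and
BITS_TO_LONGS() are redefined here to mimic the kernel macros of the same
name. The formula is the one used by alloc_scratch() in this patch: one
pointer per object plus a bitmap with one bit per object.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* User-space stand-ins for the kernel macros of the same name. */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/*
 * Hypothetical helper: bytes needed to hold max_objects object pointers
 * followed by a bitmap of max_objects bits, rounded up to whole longs.
 */
static size_t scratch_bytes(unsigned int max_objects)
{
	assert(max_objects > 0);

	return max_objects * sizeof(void *) +
	       BITS_TO_LONGS(max_objects) * sizeof(unsigned long);
}
```

The bitmap part only grows by one long for every BITS_PER_LONG objects, so
the pointer array dominates the allocation.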

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |   26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -194,6 +194,9 @@ static inline bool kmem_cache_has_cpu_pa
 #define __OBJECT_POISON		0x80000000UL /* Poison object */
 #define __CMPXCHG_DOUBLE	0x40000000UL /* Use cmpxchg_double */
 
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
 /*
  * Tracking user of a slab.
  */
@@ -2715,6 +2718,7 @@ redo:
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
 		memset(object, 0, s->object_size);
 
+	list_add_tail(&s->list, &slab_caches);
 	slab_post_alloc_hook(s, gfpflags, 1, &object);
 
 	return object;
@@ -4260,22 +4264,44 @@ int __kmem_cache_create(struct kmem_cach
 	return err;
 }
 
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+	return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+		BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+		GFP_KERNEL);
+}
+
 void kmem_cache_setup_defrag(struct kmem_cache *s,
 	kmem_defrag_get_func get, kmem_defrag_kick_func kick)
 {
+	int max_objects = oo_objects(s->max);
+
 	/*
 	 * Defragmentable slabs must have a ctor otherwise objects may be
 	 * in an undetermined state after they are allocated.
 	 */
 	BUG_ON(!s->ctor);
+	mutex_lock(&slab_mutex);
+
 	s->get = get;
 	s->kick = kick;
+
 	/*
 	 * Sadly serialization requirements currently mean that we have
 	 * to disable fast cmpxchg based processing.
 	 */
 	s->flags &= ~__CMPXCHG_DOUBLE;
 
+	list_move(&s->list, &slab_caches);	/* Move to top */
+	if (max_objects > max_defrag_slab_objects)
+		max_defrag_slab_objects = max_objects;
+
+	mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_defrag);
 
Index: linux/mm/slab_common.c
===================================================================
--- linux.orig/mm/slab_common.c
+++ linux/mm/slab_common.c
@@ -384,7 +384,7 @@ static struct kmem_cache *create_cache(c
 		goto out_free_cache;
 
 	s->refcount = 1;
-	list_add(&s->list, &slab_caches);
+	list_add_tail(&s->list, &slab_caches);
 	memcg_link_cache(s);
 out:
 	if (err)

* [RFC 5/6] slub: Slab defrag core
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
                   ` (3 preceding siblings ...)
  2017-03-07 21:24 ` [RFC 4/6] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-07 22:03   ` Matthew Wilcox
  2017-03-07 21:24 ` [RFC 6/6] slub: Extend slabinfo to support -D and -F options Christoph Lameter
  2017-03-08 14:34 ` [RFC 0/6] Slab Fragmentation Reduction V16 Michal Hocko
  6 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: defrag_core --]
[-- Type: text/plain, Size: 10809 bytes --]

Slab defragmentation may occur:

1. Unconditionally when the kernel calls kmem_cache_shrink() on a slab
   cache.

2. Through the use of the slabinfo command.

3. Per node defrag conditionally when kmem_cache_defrag(<node>) is called
   (can be called from reclaim code with a later patch).

   Defragmentation is only performed if the usage ratio of the slab
   is lower than the configured percentage. Usage ratios are measured
   by calculating the percentage of objects in use compared to the total
   number of objects that the slab page can accommodate.

   The scanning of slab caches is optimized because the
   defragmentable slabs come first on the list. Thus we can terminate scans
   on the first slab encountered that does not support defragmentation.

   kmem_cache_defrag() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

A couple of functions must be set up via a call to kmem_cache_setup_defrag()
in order for a slab cache to support defragmentation. These are

kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))

	Must obtain a reference to the listed objects. SLUB guarantees that
	the objects are still allocated. However, other threads may be blocked
	in slab_free() attempting to free objects in the slab. These may succeed
	as soon as get() returns to the slab allocator. The function must
	be able to detect such situations and skip the objects that are about
	to be freed (for example by clearing the corresponding entry in the
	objects array).

	No slab operations may be performed in get(). Interrupts
	are disabled. What can be done is very limited. The slab lock
	for the page that contains the object is taken. Any attempt to perform
	a slab operation may lead to a deadlock.

	kmem_defrag_get_func returns a private pointer that is passed to
	kmem_defrag_kick_func(). Should we be unable to obtain all references
	then that pointer may indicate to the kick() function that it should
	not attempt any object removal or move but simply remove the
	reference counts.

kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
							void *get_result))

	After SLUB has established references to the objects in a
	slab it will then drop all locks and use kick() to move objects out
	of the slab. The existence of the object is guaranteed by virtue of
	the earlier obtained references via kmem_defrag_get_func(). The
	callback may perform any slab operation since no locks are held at
	the time of call.

	The callback should remove the object from the slab in some way. This
	may be accomplished by reclaiming the object and then running
	kmem_cache_free() or reallocating it and then running
	kmem_cache_free(). Reallocation is advantageous because the partial
	slabs were just sorted to have the partial slabs with the most objects
	first. Reallocation is likely to result in filling up a slab in
	addition to freeing up one slab. A filled up slab can also be removed
	from the partial list. So there could be a double effect.

	kmem_defrag_kick_func() does not return a result. SLUB will check
	the number of remaining objects in the slab. If all objects were
	removed then the slab is freed and we have reduced the overall
	fragmentation of the slab cache.
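
The step that precedes get() can be tried out in user space. The sketch
below mirrors what kmem_cache_vacate() in this patch does before invoking
the callbacks: start from a bitmap with every slot marked used, clear the
slots found on the free list, then collect pointers to the remaining
objects. SLOTS, collect_used() and the representation of the free list as
an array of slot indexes are assumptions made for this example; the kernel
walks page->freelist instead.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 8		/* hypothetical number of objects per slab page */

/*
 * Collect pointers to the allocated objects of one slab. `slab` is the
 * base address of SLOTS objects of `size` bytes each and `free_idx`
 * lists the `nfree` slot indexes that are on the free list. Returns the
 * number of pointers stored in `vector`.
 */
static int collect_used(char *slab, size_t size, const int *free_idx,
			int nfree, void **vector)
{
	unsigned long map = (1UL << SLOTS) - 1;	/* all slots start as used */
	int i, count = 0;

	for (i = 0; i < nfree; i++) {
		assert(free_idx[i] >= 0 && free_idx[i] < SLOTS);
		map &= ~(1UL << free_idx[i]);	/* free slots are skipped */
	}

	for (i = 0; i < SLOTS; i++)
		if (map & (1UL << i))
			vector[count++] = slab + i * size;

	return count;
}
```

The resulting vector and count are what get() and kick() receive as their
objects array and nr arguments.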

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slab.h |    3 
 mm/slub.c            |  265 ++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 215 insertions(+), 53 deletions(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -318,6 +318,12 @@ static __always_inline void slab_lock(st
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
+static __always_inline int slab_trylock(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	return bit_spin_trylock(PG_locked, &page->flags);
+}
+
 static __always_inline void slab_unlock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
@@ -4276,6 +4282,228 @@ static inline void *alloc_scratch(void)
 		GFP_KERNEL);
 }
 
+/*
+ * Vacate all objects in the given slab.
+ *
+ * The scratch area passed in is sufficient to hold one pointer (void *)
+ * per object in the slab plus a bitmap with one bit per object, see
+ * alloc_scratch().
+ */
+static void kmem_cache_vacate(struct page *page, void *scratch)
+{
+	void **vector = scratch;
+	void *p;
+	void *addr = page_address(page);
+	struct kmem_cache *s;
+	unsigned long *map;
+	int count;
+	void *private;
+	unsigned long flags;
+	unsigned long objects;
+
+	local_irq_save(flags);
+	slab_lock(page);
+
+	BUG_ON(!PageSlab(page));	/* Must be a slab page */
+	BUG_ON(!page->frozen);	/* Slab must have been frozen earlier */
+
+	s = page->slab_cache;
+	objects = page->objects;
+	map = scratch + objects * sizeof(void **);
+
+	/* Determine used objects */
+	bitmap_fill(map, objects);
+	for (p = page->freelist; p; p = get_freepointer(s, p))
+		__clear_bit(slab_index(p, s, addr), map);
+
+	/* Build vector of pointers to objects */
+	count = 0;
+	memset(vector, 0, objects * sizeof(void **));
+	for_each_object(p, s, addr, objects)
+		if (test_bit(slab_index(p, s, addr), map))
+			vector[count++] = p;
+
+	private = s->get(s, count, vector);
+
+	/*
+	 * Got references. Now we can drop the slab lock. The slab
+	 * is frozen so it cannot vanish from under us nor will
+	 * allocations be performed on the slab. However, unlocking the
+	 * slab will allow concurrent slab_frees to proceed.
+	 */
+	slab_unlock(page);
+	local_irq_restore(flags);
+
+	/*
+	 * Perform the KICK callbacks to remove the objects.
+	 */
+	s->kick(s, count, vector, private);
+}
+
+/*
+ * Shrink the slab cache on a particular node of the cache
+ * by releasing slabs with zero objects and trying to reclaim
+ * slabs with less than the configured percentage of objects allocated.
+ */
+static unsigned long __shrink(struct kmem_cache *s, int node,
+							unsigned long limit)
+{
+	unsigned long flags;
+	struct page *page, *page2;
+	LIST_HEAD(zaplist);
+	int freed = 0;
+	struct kmem_cache_node *n = get_node(s, node);
+
+	if (n->nr_partial <= limit)
+		return 0;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
+			continue;
+
+		if (page->inuse) {
+			if (page->inuse * 100 >=
+					s->defrag_ratio * page->objects) {
+				slab_unlock(page);
+				/* Slab contains enough objects */
+				continue;
+			}
+
+			list_move(&page->lru, &zaplist);
+			if (s->kick) {
+				/* Remove page from being considered for allocations */
+				n->nr_partial--;
+				page->frozen = 1;
+			}
+			slab_unlock(page);
+		} else {
+			/* Empty slab page */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+			freed++;
+		}
+	}
+
+	if (!s->kick)
+		/*
+		 * No defrag method. By simply putting the zaplist at the
+		 * end of the partial list we can let them simmer longer
+		 * and thus increase the chance of all objects being
+		 * reclaimed.
+		 *
+		 * We have effectively sorted the partial list and put
+		 * the slabs with more objects first. As soon as they
+		 * are allocated they are going to be removed from the
+		 * partial list.
+		 */
+		list_splice(&zaplist, n->partial.prev);
+
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	if (s->kick && !list_empty(&zaplist)) {
+		void **scratch = alloc_scratch();
+		struct page *page;
+		struct page *page2;
+
+		if (scratch) {
+			/* Try to remove / move the objects left */
+			list_for_each_entry(page, &zaplist, lru) {
+				if (page->inuse)
+					kmem_cache_vacate(page, scratch);
+			}
+			kfree(scratch);
+		}
+
+		/* Inspect results and dispose of pages */
+		spin_lock_irqsave(&n->list_lock, flags);
+		list_for_each_entry_safe(page, page2, &zaplist, lru) {
+			slab_lock(page);
+			page->frozen = 0;
+
+			if (page->inuse) {
+
+				/* Still objects left */
+				n->nr_partial++;
+				list_add_tail(&page->lru, &n->partial);
+				slab_unlock(page);
+
+			} else {
+
+				/* Success */
+				slab_unlock(page);
+				discard_slab(s, page);
+				freed++;
+			}
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+	return freed;
+}
+
+/*
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+	struct kmem_cache *s;
+	unsigned long slabs = 0;
+
+	/*
+	 * kmem_cache_defrag may be called from the reclaim path which may be
+	 * called for any page allocator alloc. So there is the danger that we
+	 * get called in a situation where slub already acquired the slab_mutex
+	 * for other purposes.
+	 */
+	if (!mutex_trylock(&slab_mutex))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		unsigned long reclaimed = 0;
+
+		/*
+		 * Defragmentable caches come first. If the slab cache is not
+		 * defragmentable then we can stop traversing the list.
+		 */
+		if (!s->kick)
+			break;
+
+		if (node == -1) {
+			int nid;
+
+			for_each_node_state(nid, N_NORMAL_MEMORY)
+				reclaimed += __shrink(s, nid, MAX_PARTIAL);
+		} else
+			reclaimed = __shrink(s, node, MAX_PARTIAL);
+
+		slabs += reclaimed;
+	}
+	mutex_unlock(&slab_mutex);
+	return slabs;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/*
+ * kmem_cache_shrink removes empty slabs from the partial lists.
+ * If the slab cache supports defragmentation then objects are
+ * reclaimed.
+ */
+int kmem_cache_shrink(struct kmem_cache *s)
+{
+	int node;
+
+	flush_all(s);
+	for_each_node_state(node, N_NORMAL_MEMORY)
+		__shrink(s, node, 0);
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
 void kmem_cache_setup_defrag(struct kmem_cache *s,
 	kmem_defrag_get_func get, kmem_defrag_kick_func kick)
 {
Index: linux/include/linux/slab.h
===================================================================
--- linux.orig/include/linux/slab.h
+++ linux/include/linux/slab.h
@@ -175,13 +175,16 @@ typedef void kmem_defrag_kick_func(struc
 
 /*
  * kmem_cache_setup_defrag() is used to setup callbacks for a slab cache.
+ * kmem_cache_defrag() performs the actual defragmentation.
  */
 #ifdef CONFIG_SLUB
 void kmem_cache_setup_defrag(struct kmem_cache *, kmem_defrag_get_func,
 						kmem_defrag_kick_func);
+int kmem_cache_defrag(int node);
 #else
 static inline void kmem_cache_setup_defrag(struct kmem_cache *s,
 	kmem_defrag_get_func get, kmem_defrag_kick_func kick) {}
+static inline int kmem_cache_defrag(int node) { return 0; }
 #endif
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC 6/6] slub: Extend slabinfo to support -D and -F options
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
                   ` (4 preceding siblings ...)
  2017-03-07 21:24 ` [RFC 5/6] slub: Slab defrag core Christoph Lameter
@ 2017-03-07 21:24 ` Christoph Lameter
  2017-03-08 14:34 ` [RFC 0/6] Slab Fragmentation Reduction V16 Michal Hocko
  6 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter @ 2017-03-07 21:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

[-- Attachment #1: extend_slabinfo --]
[-- Type: text/plain, Size: 5742 bytes --]

-F lists caches that support defragmentation

-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of available object slots that are in use.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 Documentation/vm/slabinfo.c |   48 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

Index: linux/tools/vm/slabinfo.c
===================================================================
--- linux.orig/tools/vm/slabinfo.c
+++ linux/tools/vm/slabinfo.c
@@ -32,6 +32,8 @@ struct slabinfo {
 	int hwcache_align, object_size, objs_per_slab;
 	int sanity_checks, slab_size, store_user, trace;
 	int order, poison, reclaim_account, red_zone;
+	int defrag, ctor;
+	int defrag_ratio, remote_node_defrag_ratio;
 	unsigned long partial, objects, slabs, objects_partial, objects_total;
 	unsigned long alloc_fastpath, alloc_slowpath;
 	unsigned long free_fastpath, free_slowpath;
@@ -66,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_defrag;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -107,14 +111,16 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-	printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-		"slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+	printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 2017 Jump Trading LLC.\n\n"
+		"slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
 		"-a|--aliases           Show aliases\n"
 		"-A|--activity          Most active slabs first\n"
 		"-d<options>|--debug=<options> Set/Clear Debug options\n"
+		"-C|--ctor              Show slabs with ctors\n"
 		"-D|--display-active    Switch line format to activity\n"
 		"-e|--empty             Show empty slabs\n"
 		"-f|--first-alias       Show first alias\n"
+		"-F|--defrag            Show defragmentable caches\n"
 		"-h|--help              Show usage information\n"
 		"-i|--inverted          Inverted list\n"
 		"-l|--slabs             Show slabs\n"
@@ -366,7 +372,7 @@ static void slab_numa(struct slabinfo *s
 		return;
 
 	if (!line) {
-		printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+		printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
 		for(node = 0; node <= highest_node; node++)
 			printf(" %4d", node);
 		printf("\n----------------------");
@@ -375,6 +381,7 @@ static void slab_numa(struct slabinfo *s
 		printf("\n");
 	}
 	printf("%-21s ", mode ? "All slabs" : s->name);
+	printf("%3d ", s->remote_node_defrag_ratio);
 	for(node = 0; node <= highest_node; node++) {
 		char b[20];
 
@@ -532,6 +539,8 @@ static void report(struct slabinfo *s)
 		printf("** Slabs are destroyed via RCU\n");
 	if (s->reclaim_account)
 		printf("** Reclaim accounting active\n");
+	if (s->defrag)
+		printf("** Defragmentation at %d%%\n", s->defrag_ratio);
 
 	printf("\nSizes (bytes)     Slabs              Debug                Memory\n");
 	printf("------------------------------------------------------------------------\n");
@@ -579,6 +588,12 @@ static void slabcache(struct slabinfo *s
 	if (show_empty && s->slabs)
 		return;
 
+	if (show_defrag && !s->defrag)
+		return;
+
+	if (show_ctor && !s->ctor)
+		return;
+
 	if (sort_loss == 0)
 		store_size(size_str, slab_size(s));
 	else
@@ -593,6 +608,10 @@ static void slabcache(struct slabinfo *s
 		*p++ = '*';
 	if (s->cache_dma)
 		*p++ = 'd';
+	if (s->defrag)
+		*p++ = 'F';
+	if (s->ctor)
+		*p++ = 'C';
 	if (s->hwcache_align)
 		*p++ = 'A';
 	if (s->poison)
@@ -627,7 +646,8 @@ static void slabcache(struct slabinfo *s
 		printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
 			s->name, s->objects, s->object_size, size_str, dist_str,
 			s->objs_per_slab, s->order,
-			s->slabs ? (s->partial * 100) / s->slabs : 100,
+			s->slabs ? (s->partial * 100) /
+					(s->slabs * s->objs_per_slab) : 100,
 			s->slabs ? (s->objects * s->object_size * 100) /
 				(s->slabs * (page_size << s->order)) : 100,
 			flags);
@@ -1246,7 +1266,17 @@ static void read_slab_dir(void)
 			slab->cpu_partial_free = get_obj("cpu_partial_free");
 			slab->alloc_node_mismatch = get_obj("alloc_node_mismatch");
 			slab->deactivate_bypass = get_obj("deactivate_bypass");
+			slab->defrag_ratio = get_obj("defrag_ratio");
+			slab->remote_node_defrag_ratio =
+					get_obj("remote_node_defrag_ratio");
 			chdir("..");
+			if (read_slab_obj(slab, "ops")) {
+				if (strstr(buffer, "ctor :"))
+					slab->ctor = 1;
+				if (strstr(buffer, "kick :"))
+					slab->defrag = 1;
+			}
+
 			if (slab->name[0] == ':')
 				alias_targets++;
 			slab++;
@@ -1323,6 +1353,8 @@ static void xtotals(void)
 }
 
 struct option opts[] = {
+	{ "ctor", no_argument, NULL, 'C' },
+	{ "defrag", no_argument, NULL, 'F' },
 	{ "aliases", no_argument, NULL, 'a' },
 	{ "activity", no_argument, NULL, 'A' },
 	{ "debug", optional_argument, NULL, 'd' },
@@ -1357,7 +1389,7 @@ int main(int argc, char *argv[])
 
 	page_size = getpagesize();
 
-	while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXB",
+	while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTSN:LXB",
 						opts, NULL)) != -1)
 		switch (c) {
 		case '1':
@@ -1413,6 +1445,12 @@ int main(int argc, char *argv[])
 		case 'z':
 			skip_zero = 0;
 			break;
+		case 'C':
+			show_ctor = 1;
+			break;
+		case 'F':
+			show_defrag = 1;
+			break;
 		case 'T':
 			show_totals = 1;
 			break;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 5/6] slub: Slab defrag core
  2017-03-07 21:24 ` [RFC 5/6] slub: Slab defrag core Christoph Lameter
@ 2017-03-07 22:03   ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2017-03-07 22:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Pekka Enberg, akpm, Mel Gorman, andi, Rik van Riel

On Tue, Mar 07, 2017 at 03:24:34PM -0600, Christoph Lameter wrote:
> kmem_defrag_get_func (void *get(struct kmem_cache *s, int nr, void **objects))
> 
> 	Must obtain a reference to the listed objects. SLUB guarantees that
> 	the objects are still allocated. However, other threads may be blocked
> 	in slab_free() attempting to free objects in the slab. These may succeed
> 	as soon as get() returns to the slab allocator. The function must
> 	be able to detect such situations and void the attempts to free such
> 	objects (by for example voiding the corresponding entry in the objects
> 	array).
> 
> 	No slab operations may be performed in get(). Interrupts
> 	are disabled. What can be done is very limited. The slab lock
> 	for the page that contains the object is taken. Any attempt to perform
> 	a slab operation may lead to a deadlock.
> 
> 	kmem_defrag_get_func returns a private pointer that is passed to
> 	kmem_defrag_kick_func(). Should we be unable to obtain all references
> 	then that pointer may indicate to the kick() function that it should
> 	not attempt any object removal or move but simply remove the
> 	reference counts.

I think calling it 'get' is overly prescriptive of how an implementation should
work.  Perhaps 'test'?  And returning ERR_PTR if we cannot free all objects?

> kmem_defrag_kick_func (void kick(struct kmem_cache *, int nr, void **objects,
> 							void *get_result))
> 
> 	After SLUB has established references to the objects in a
> 	slab it will then drop all locks and use kick() to move objects out
> 	of the slab. The existence of the object is guaranteed by virtue of
> 	the earlier obtained references via kmem_defrag_get_func(). The
> 	callback may perform any slab operation since no locks are held at
> 	the time of call.
> 
> 	The callback should remove the object from the slab in some way. This
> 	may be accomplished by reclaiming the object and then running
> 	kmem_cache_free() or reallocating it and then running
> 	kmem_cache_free(). Reallocation is advantageous because the partial
> 	slabs were just sorted to have the partial slabs with the most objects
> 	first. Reallocation is likely to result in filling up a slab in
> 	addition to freeing up one slab. A filled up slab can also be removed
> 	from the partial list. So there could be a double effect.
> 
> 	kmem_defrag_kick_func() does not return a result. SLUB will check
> 	the number of remaining objects in the slab. If all objects were
> 	removed then the slab is freed and we have reduced the overall
> 	fragmentation of the slab cache.

I think 'kick' is a bad name.  'evict', maybe?

Also, xarray, dcache and the inode cache all use RCU to free objects, so
perhaps a sentence or two in here about that would be beneficial ...

	If objects are freed to this slab using RCU, the evict function
	should call rcu_barrier() before returning to ensure that all
	objects have been returned and the slab page can be freed.

> +	private = s->get(s, count, vector);
> +
> +	/*
> +	 * Got references. Now we can drop the slab lock. The slab
> +	 * is frozen so it cannot vanish from under us nor will
> +	 * allocations be performed on the slab. However, unlocking the
> +	 * slab will allow concurrent slab_frees to proceed.
> +	 */
> +	slab_unlock(page);
> +	local_irq_restore(flags);
> +
> +	/*
> +	 * Perform the KICK callbacks to remove the objects.
> +	 */
> +	s->kick(s, count, vector, private);

	private = s->test(vector, count);
	slab_unlock(page);
	local_irq_restore(flags);
	if (!IS_ERR(private))
		s->evict(vector, count, private);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/6] Slab Fragmentation Reduction V16
  2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
                   ` (5 preceding siblings ...)
  2017-03-07 21:24 ` [RFC 6/6] slub: Extend slabinfo to support -D and -F options Christoph Lameter
@ 2017-03-08 14:34 ` Michal Hocko
  2017-03-08 15:58   ` Christoph Lameter
  6 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-03-08 14:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, Pekka Enberg, akpm, Mel Gorman, andi,
	Rik van Riel

On Tue 07-03-17 15:24:29, Christoph Lameter wrote:
> V15->V16
> - Reworked core logic against 4.11 kernel code
> - Just the bare bones for Matthew to have the ability to review
>   the patches and to see how slab defrag could work with the radix
>   tree and/or new xarrays. Skip reclaim integration etc etc.

JFTR the previous version was posted here: https://lwn.net/Articles/371892/
and Dave had some concerns https://lkml.org/lkml/2010/2/8/329 which led
to a different approach and design of the slab shrinking
https://lkml.org/lkml/2010/2/8/329.

I haven't looked at this series yet but have those concerns been
addressed/considered?

> 
> V14->V15
> - The lost version ... I posted it in 2010 but the material is nowhere
>   to be found on my backups.
> 
> V13->V14
> - Rediff against linux-next on request of Andrew
> - TestSetPageLocked -> trylock_page conversion.
> 
> Slab fragmentation is mainly an issue if Linux is used as a fileserver
> and large amounts of dentries, inodes and buffer heads accumulate. In some
> load situations the slabs become very sparsely populated so that a lot of
> memory is wasted by slabs that only contain one or a few objects. In
> extreme cases the performance of a machine will become sluggish since
> we are continually running reclaim without much success.
> Slab defragmentation adds the capability to recover the memory that
> is wasted.
> 
> Memory reclaim for the following slab caches is possible:
> 
> 1. dentry cache
> 2. inode cache (with a generic interface to allow easy setup of more
>    filesystems than the currently supported ext2/3/4 reiserfs, XFS
>    and proc)
> 3. buffer_heads
> 
> One typical mechanism that triggers slab defragmentation on my systems
> is the daily run of
> 
> 	updatedb
> 
> Updatedb scans all files on the system which causes a high inode and dentry
> use. After updatedb is complete we need to go back to the regular use
> patterns (typical on my machine: kernel compiles). Those need the memory now
> for different purposes. The inodes and dentries used for updatedb will
> gradually be aged by the dentry/inode reclaim algorithm which will free
> up the dentries and inode entries randomly through the slabs that were
> allocated. As a result the slabs will become sparsely populated. If they
> become empty then they can be freed but a lot of them will remain sparsely
> populated. That is where slab defrag comes in: It removes the objects from
> the slabs with just a few entries reclaiming more memory for other uses.
> In the simplest case (as provided here) this is done by simply reclaiming
> the objects.
> 
> However, if the logic in the kick() function is made more
> sophisticated then we will be able to move the objects out of the slabs.
> When a slab cache is fragmented, objects can be allocated without going to
> the page allocator because a large number of free slots are available. Moving
> an object will reduce fragmentation in the slab the object is moved to.
> 
> V12->v13:
> - Rebase onto Linux 2.6.27-rc1 (deal with page flags conversion, ctor parameters etc)
> - Fix uninitialized variable issue
> 
> V11->V12:
> - Pekka and I fixed various minor issues pointed out by Andrew.
> - Split ext2/3/4 defrag support patches.
> - Add more documentation
> - Revise the way that slab defrag is triggered from reclaim. No longer
>   use a timeout but track the amount of slab reclaim done by the shrinkers.
>   Add a field in /proc/sys/vm/slab_defrag_limit to control the threshold.
> - Display current slab_defrag_counters in /proc/zoneinfo (for a zone) and
>   /proc/sys/vm/slab_defrag_count (for global reclaim).
> - Add new config value slab_defrag_limit to /proc/sys/vm/slab_defrag_limit
> - Add a patch that obsoletes SLAB and explains why SLOB does not support
>   defrag (Either of those could be theoretically equipped to support
>   slab defrag in some way but it seems that Andrew/Linus want to reduce
>   the number of slab allocators).
> 
> V10->V11
> - Simplify determination when to reclaim: Just scan over all partials
>   and check if they are sparsely populated.
> - Add support for performance counters
> - Rediff on top of current slab-mm.
> - Reduce frequency of scanning. A look at the stats showed that we
>   were calling into reclaim very frequently when the system was under
>   memory pressure which slowed things down. Various measures to
>   avoid scanning the partial list too frequently were added and the
>   earlier (expensive) method of determining the defrag ratio of the slab
>   cache as a whole was dropped. I think this addresses the issues that
>   Mel saw with V10.
> 
> V9->V10
> - Rediff against upstream
> 
> V8->V9
> - Rediff against 2.6.24-rc6-mm1
> 
> V7->V8
> - Rediff against 2.6.24-rc3-mm2
> 
> V6->V7
> - Rediff against 2.6.24-rc2-mm1
> - Remove lumpy reclaim support. No point anymore given that the antifrag
>   handling in 2.6.24-rc2 puts reclaimable slabs into different sections.
>   Targeted reclaim never triggers. This has to wait until we make
>   slabs movable or we need to perform a special version of lumpy reclaim
>   in SLUB while we scan the partial lists for slabs to kick out.
>   Removal simplifies handling significantly since we
>   get to slabs in a more controlled way via the partial lists.
>   The patchset now provides pure reduction of fragmentation levels.
> - SLAB/SLOB: Provide inlines that do nothing
> - Fix various smaller issues that were brought up during review of V6.
> 
> V5->V6
> - Rediff against 2.6.24-rc2 + mm slub patches.
> - Add reviewed by lines.
> - Take out the experimental code to make slab pages movable. That
>   has to wait until this has been considered by Mel.
> 
> V4->V5:
> - Support lumpy reclaim for slabs
> - Support reclaim via slab_shrink()
> - Add constructors to ensure a consistent object state at all times.
> 
> V3->V4:
> - Optimize scan for slabs that need defragmentation
> - Add /sys/slab/*/defrag_ratio to allow setting defrag limits
>   per slab.
> - Add support for buffer heads.
> - Describe how the cleanup after the daily updatedb can be
>   improved by slab defragmentation.
> 
> V2->V3
> - Support directory reclaim
> - Add infrastructure to trigger defragmentation after slab shrinking if we
>   have slabs with a high degree of fragmentation.
> 
> V1->V2
> - Clean up control flow using a state variable. Simplify API. Back to 2
>   functions that now take arrays of objects.
> - Inode defrag support for a set of filesystems
> - Fix up dentry defrag support to work on negative dentries by adding
>   a new dentry flag that indicates that a dentry is not in the process
>   of being freed or allocated.
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/6] Slab Fragmentation Reduction V16
  2017-03-08 14:34 ` [RFC 0/6] Slab Fragmentation Reduction V16 Michal Hocko
@ 2017-03-08 15:58   ` Christoph Lameter
  2017-03-13  9:15     ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2017-03-08 15:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, linux-mm, Pekka Enberg, akpm, Mel Gorman, andi,
	Rik van Riel

On Wed, 8 Mar 2017, Michal Hocko wrote:

> JFTR the previous version was posted here: https://lwn.net/Articles/371892/
> and Dave had some concerns https://lkml.org/lkml/2010/2/8/329 which led
> to a different approach and design of the slab shrinking
> https://lkml.org/lkml/2010/2/8/329.
>
> I haven't looked at this series yet but have those concerns been
> addressed/considered?

Well yes this has been discussed for a couple of years. The basic approach
is not only needed for the file systems (like what Chinner was focusing
on) but in general for slab caches. The objection was regarding the
integration into the slab reclaim logic in vmscan.c and the filesystem
reclaim in general.

Dave and Matthew were at linux.conf.au and we agreed to first try it with
the radix tree and then generalize from there.  The reclaim logic
was a bit hacky and we will have to find some better way to
integrate this.

There is a video on youtube capturing the discussion (My talk on movable
kernel objects).


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/6] Slab Fragmentation Reduction V16
  2017-03-08 15:58   ` Christoph Lameter
@ 2017-03-13  9:15     ` Michal Hocko
  2017-03-13  9:16       ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-03-13  9:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, Pekka Enberg, akpm, Mel Gorman, andi,
	Rik van Riel

On Wed 08-03-17 09:58:58, Christoph Lameter wrote:
> On Wed, 8 Mar 2017, Michal Hocko wrote:
> 
> > JFTR the previous version was posted here: https://lwn.net/Articles/371892/
> > and Dave had some concerns https://lkml.org/lkml/2010/2/8/329 which led
> > to a different approach and design of the slab shrinking
> > https://lkml.org/lkml/2010/2/8/329.
> >
> > I haven't looked at this series yet but have those concerns been
> > addressed/considered?
> 
> Well yes this has been discussed for a couple of years. The basic approach
> is not only needed for the file systems (like what Chinner was focusing
> on) but in general for slab caches. The objection was regarding the
> integration into the slab reclaim logic in vmscan.c and the filesystem
> reclaim in general.
> 
> Dave and Matthew were at linux.conf.au and we agreed to first try it with
> the radix tree and then generalize from there.  The reclaim logic
> was a bit hacky and we will have to find some better way to
> integrate this.
> 
> There is a video on youtube capturing the discussion (My talk on movable
> kernel objects).

Hmm, OK. There seems to be a slot to discuss this at LSFMM this year so
I hope we can discuss your proposal there.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 0/6] Slab Fragmentation Reduction V16
  2017-03-13  9:15     ` Michal Hocko
@ 2017-03-13  9:16       ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-03-13  9:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Matthew Wilcox, linux-mm, Pekka Enberg, akpm, Mel Gorman, andi,
	Rik van Riel

On Mon 13-03-17 10:15:15, Michal Hocko wrote:
> On Wed 08-03-17 09:58:58, Christoph Lameter wrote:
> > On Wed, 8 Mar 2017, Michal Hocko wrote:
> > 
> > > JFTR the previous version was posted here: https://lwn.net/Articles/371892/
> > > and Dave had some concerns https://lkml.org/lkml/2010/2/8/329 which led
> > > to a different approach and design of the slab shrinking
> > > https://lkml.org/lkml/2010/2/8/329.
> > >
> > > I haven't looked at this series yet but have those concerns been
> > > addressed/considered?
> > 
> > Well yes this has been discussed for a couple of years. The basic approach
> > is not only needed for the file systems (like what Chinner was focusing
> > on) but in general for slab caches. The objection was regarding the
> > integration into the slab reclaim logic in vmscan.c and the filesystem
> > reclaim in general.
> > 
> > Dave and Matthew were at linux.conf.au and we agreed to first try it with
> > the radix tree and then generalize from there.  The reclaim logic
> > was a bit hacky and we will have to find some better way to
> > integrate this.
> > 
> > There is a video on youtube capturing the discussion (My talk on movable
> > kernel objects).
> 
> Hmm, OK. There seems to be a slot to discuss this at LSFMM this year so
> I hope we can discuss your proposal there.

Btw. it would be great if you could summarize the discussion you had at
LCA here as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-03-13  9:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-07 21:24 [RFC 0/6] Slab Fragmentation Reduction V16 Christoph Lameter
2017-03-07 21:24 ` [RFC 1/6] slub: Replace ctor field with ops field in /sys/slab/* Christoph Lameter
2017-03-07 21:24 ` [RFC 2/6] slub: Add defrag_ratio field and sysfs support Christoph Lameter
2017-03-07 21:24 ` [RFC 3/6] slub: Add get() and kick() methods Christoph Lameter
2017-03-07 21:24 ` [RFC 4/6] slub: Sort slab cache list and establish maximum objects for defrag slabs Christoph Lameter
2017-03-07 21:24 ` [RFC 5/6] slub: Slab defrag core Christoph Lameter
2017-03-07 22:03   ` Matthew Wilcox
2017-03-07 21:24 ` [RFC 6/6] slub: Extend slabinfo to support -D and -F options Christoph Lameter
2017-03-08 14:34 ` [RFC 0/6] Slab Fragmentation Reduction V16 Michal Hocko
2017-03-08 15:58   ` Christoph Lameter
2017-03-13  9:15     ` Michal Hocko
2017-03-13  9:16       ` Michal Hocko
