* [LSF/MM TOPIC] Movable memory and reliable higher order allocations
@ 2017-02-28 21:32 Christoph Lameter
  2017-02-28 23:17 ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2017-02-28 21:32 UTC (permalink / raw)
  To: linux-mm; +Cc: Jesper Dangaard Brouer, riel, Mel Gorman

This has come up lots of times. We talked about it at linux.conf.au
again and agreed to try to make the radix tree movable. Sadly I have not
had enough time yet to make progress on this, but reliable higher order
allocations, or some other solution, are needed for performance reasons
in many places. Recently there was demand from developers in the network
stack and in the RDMA subsystem for large contiguous allocations. I
would like to talk about this (yes, for the gazillionth time) to see
what avenues there are to make progress.


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-02-28 21:32 [LSF/MM TOPIC] Movable memory and reliable higher order allocations Christoph Lameter
@ 2017-02-28 23:17 ` Matthew Wilcox
  2017-03-02  4:12   ` Matthew Wilcox
  2017-03-02 17:00   ` Christoph Lameter
  0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2017-02-28 23:17 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Tue, Feb 28, 2017 at 03:32:12PM -0600, Christoph Lameter wrote:
> This has come up lots of times. We talked about this at linux.conf.au
> again and agreed to try to make the radix tree movable.

The radix tree is not movable given its current API.  In order to move
a node, we need to be able to lock the tree to prevent simultaneous
modification by another CPU.  But the radix tree API makes callers
responsible for their own locking -- we don't even know if it's locked
by a mutex or a spinlock, much less which lock protects this tree.
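
To illustrate (a sketch, not taken from any real radix tree user --
struct my_cache and its lock are made up):

	struct my_cache {
		spinlock_t lock;	/* could just as well be a mutex */
		struct radix_tree_root tree;
	};

	static int my_cache_insert(struct my_cache *c, unsigned long index,
				   void *item)
	{
		int err;

		/* Only this caller knows about c->lock; the tree does not */
		spin_lock(&c->lock);
		err = radix_tree_insert(&c->tree, index, item);
		spin_unlock(&c->lock);
		return err;
	}

A reclaimer handed one of this tree's nodes has no way to find c->lock,
so it cannot exclude concurrent modification while it copies the node.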

This was one of my motivations for the xarray.  The xarray handles its own
locking, so we can always lock out other CPUs from modifying the array.
We still have to take care of RCU walkers, but that's straightforward
to handle.  I have a prototype patch for the radix tree (ignoring the
locking problem), so I can port that over to the xarray and post that
for comment tomorrow.

Also the xarray doesn't use huge numbers of preallocated nodes, so
that'll reduce the pressure on the memory allocator somewhat.
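
For contrast, the xarray embeds its own spinlock, so (modulo the
prototype API still being in flux) a store looks like this, and reclaim
can always find the right lock to take:

	static DEFINE_XARRAY(my_array);		/* embeds xa_lock */

	static int stash(unsigned long index, struct page *page)
	{
		/* xa_store() takes and drops my_array's xa_lock internally */
		void *old = xa_store(&my_array, index, page, GFP_KERNEL);

		return xa_is_err(old) ? xa_err(old) : 0;
	}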


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-02-28 23:17 ` Matthew Wilcox
@ 2017-03-02  4:12   ` Matthew Wilcox
  2017-03-02 17:26     ` Christoph Lameter
  2017-03-02 17:00   ` Christoph Lameter
  1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2017-03-02  4:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Tue, Feb 28, 2017 at 03:17:33PM -0800, Matthew Wilcox wrote:
> This was one of my motivations for the xarray.  The xarray handles its own
> locking, so we can always lock out other CPUs from modifying the array.
> We still have to take care of RCU walkers, but that's straightforward
> to handle.  I have a prototype patch for the radix tree (ignoring the
> locking problem), so I can port that over to the xarray and post that
> for comment tomorrow.

This should do the trick ... untested.

I use the ->array member of the xa_node to distinguish between the
three states of the node -- allocated, in use, waiting for rcu free.
Most of the nodes will be in use, and most of them can be moved.

Let me know whether the assumptions I listed above xa_reclaim() are
reasonable ... also, do you want me returning a bool to indicate whether
I freed the node, or is that not useful because you'll know that anyway?
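
In other words (names as in the patch below):

	node->array == NULL		allocated, not yet in any tree
	node->array == XA_RCU_FREE	removed, waiting for an RCU grace period
	node->array == <some xarray>	in use, part of that tree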

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 646ff84b4444..931f17a69807 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -380,6 +380,12 @@ static inline bool xa_is_relative(void *entry)
 #define XA_RESTART_FIND		((struct xa_node *)1)
 #define XA_RESTART_NEXT		((struct xa_node *)2)
 
+/*
+ * Also not an array entry.  This is found in node->array and informs
+ * the reclaim routine that the node is waiting for RCU free.
+ */
+#define XA_RCU_FREE		((struct xarray *)1)
+
 static inline bool xa_is_retry(void *entry)
 {
 	return unlikely(entry == XA_RETRY_ENTRY);
@@ -423,14 +429,14 @@ static inline void *xa_entry_locked(const struct xarray *xa,
 					lockdep_is_held(&xa->xa_lock));
 }
 
-static inline void *xa_parent(const struct xarray *xa,
+static inline struct xa_node *xa_parent(const struct xarray *xa,
 		const struct xa_node *node)
 {
 	return rcu_dereference_check(node->parent,
 					lockdep_is_held(&xa->xa_lock));
 }
 
-static inline void *xa_parent_locked(const struct xarray *xa,
+static inline struct xa_node *xa_parent_locked(const struct xarray *xa,
 		const struct xa_node *node)
 {
 	return rcu_dereference_protected(node->parent,
diff --git a/lib/xarray.c b/lib/xarray.c
index fd33b5b91013..d8004c8014c9 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -182,6 +182,7 @@ static void xa_node_ctor(void *p)
 	memset(&node->tags, 0, sizeof(node->tags));
 	memset(&node->slots, 0, sizeof(node->slots));
 	INIT_LIST_HEAD(&node->private_list);
+	node->array = NULL;
 }
 
 static void xa_node_rcu_free(struct rcu_head *head)
@@ -194,9 +195,79 @@ static void xa_node_rcu_free(struct rcu_head *head)
 
 static void xa_node_free(struct xa_node *node)
 {
+	node->array = XA_RCU_FREE;
 	call_rcu(&node->rcu_head, xa_node_rcu_free);
 }
 
+/*
+ * We rely on the following assumptions:
+ *  - The slab allocator calls us in process context with IRQs enabled and
+ *    no locks held (not even the RCU lock)
+ *  - We can allocate a replacement using GFP_KERNEL
+ *  - If the victim is freed while reclaim is running,
+ *    - The slab allocator will not overwrite any fields in the victim
+ *    - The page will not be returned to the page allocator until we return
+ *    - The victim will not be reallocated until we return
+ */
+static bool xa_reclaim(void *arg)
+{
+	struct xa_node *node, *victim = arg;
+	struct xarray *xa = READ_ONCE(victim->array);
+	void __rcu **slot;
+	unsigned int i;
+
+	/* Node has been allocated, but not yet placed in a tree. */
+	if (!xa)
+		return false;
+	/* If the node has already been freed, we only need to wait for RCU */
+	if (xa == XA_RCU_FREE)
+		goto out;
+
+	node = kmem_cache_alloc(xa_node_cache, GFP_KERNEL);
+	if (!node)
+		return false;
+
+	xa_lock_irq(xa);
+
+	/* Might have been freed since we last checked */
+	if (victim->array == XA_RCU_FREE)
+		goto freed;
+
+	/* Can't grab the LRU list lock here */
+	if (!list_empty(&victim->private_list))
+		goto busy;
+
+	memcpy(node, victim, sizeof(*node));
+	INIT_LIST_HEAD(&node->private_list);
+	for (i = 0; i < XA_CHUNK_SIZE; i++) {
+		void *entry = xa_entry_locked(xa, node, i);
+		if (xa_is_node(entry))
+			rcu_assign_pointer(xa_node(entry)->parent, node);
+	}
+	if (!node->parent)
+		slot = &xa->xa_head;
+	else
+		slot = &xa_parent_locked(xa, node)->slots[node->offset];
+	rcu_assign_pointer(*slot, xa_mk_node(node));
+	xa_unlock_irq(xa);
+	xa_node_free(victim);
+
+out:
+	rcu_barrier();
+	return true;
+
+freed:
+	/* Victim already queued for RCU free; discard the replacement */
+	xa_unlock_irq(xa);
+	kmem_cache_free(xa_node_cache, node);
+	goto out;
+
+busy:
+	xa_unlock_irq(xa);
+	kmem_cache_free(xa_node_cache, node);
+	return false;
+}
+
 /**
  * xas_destroy() - Dispose of any resources used during the xarray operation
  * @xas: Array operation state.


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-02-28 23:17 ` Matthew Wilcox
  2017-03-02  4:12   ` Matthew Wilcox
@ 2017-03-02 17:00   ` Christoph Lameter
  1 sibling, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2017-03-02 17:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Tue, 28 Feb 2017, Matthew Wilcox wrote:

> The radix tree is not movable given its current API.  In order to move
> a node, we need to be able to lock the tree to prevent simultaneous
> modification by another CPU.  But the radix tree API makes callers
> responsible for their own locking -- we don't even know if it's locked
> by a mutex or a spinlock, much less which lock protects this tree.
>
> This was one of my motivations for the xarray.  The xarray handles its own
> locking, so we can always lock out other CPUs from modifying the array.
> We still have to take care of RCU walkers, but that's straightforward
> to handle.  I have a prototype patch for the radix tree (ignoring the
> locking problem), so I can port that over to the xarray and post that
> for comment tomorrow.

Great. Thanks.


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-03-02  4:12   ` Matthew Wilcox
@ 2017-03-02 17:26     ` Christoph Lameter
  2017-03-02 20:55       ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2017-03-02 17:26 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Wed, 1 Mar 2017, Matthew Wilcox wrote:

> Let me know whether the assumptions I listed above xa_reclaim() are
> reasonable ... also, do you want me returning a bool to indicate whether
> I freed the node, or is that not useful because you'll know that anyway?

The way that the slab logic works is in two phases:

1. Callback

void *defrag_get_func(struct kmem_cache *s, int nr, void **objects)

Locks are held. Interrupts are disabled. No slab operations may be
performed, and any concurrent operation on the slab page will block.

The callback must establish a stable reference to the slab objects.
Generally this means an additional refcount is taken so that free
operations will not remove the object. This is required to ensure that
frees will not interfere with reclaim processing.

The get() will return a pointer to a private data structure that is passed
on to the second function. Before that callback the slab allocator will
drop all the locks. If the function returns NULL then that is an
indication that the objects are in use and that a reclaim operation cannot
be performed. No refcount has been taken.

This is required to have a stable array of objects to work on. If the
objects could be freed at any time then the objects could not be inspected
for state nor could an array of pointers to the objects be passed on for
future processing.

2. Callback

defrag_reclaim_func(struct kmem_cache *s, int nr, void **objects, void *private)

Here anything may be done. Free the objects or reallocate them (calling
kmalloc or so to allocate another object to move it to). On return the
slab allocator will inspect the slab page and if there are no objects
remaining then the slab page will be freed.

During processing the slab page is exempt from allocation, and thus
objects can only be removed from it until processing is complete.
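
A sketch of what a cache using these two callbacks might look like (the
struct foo cache is made up for illustration; only the callback
signatures are as described above):

	struct foo {
		atomic_t refcount;
		/* ... payload ... */
	};

	static void *foo_defrag_get(struct kmem_cache *s, int nr,
				    void **objects)
	{
		int i;

		/* Locks held, IRQs off: only pin the objects here */
		for (i = 0; i < nr; i++) {
			struct foo *f = objects[i];

			if (!atomic_inc_not_zero(&f->refcount)) {
				/* Object already being freed: give up */
				while (--i >= 0) {
					f = objects[i];
					atomic_dec(&f->refcount);
				}
				return NULL;
			}
		}
		return (void *)1;	/* no private state needed */
	}

	static void foo_defrag_reclaim(struct kmem_cache *s, int nr,
				       void **objects, void *private)
	{
		int i;

		/* Locks dropped: we may sleep, allocate and free here */
		for (i = 0; i < nr; i++) {
			struct foo *f = objects[i];

			/* Free only if we held the last reference */
			if (atomic_dec_and_test(&f->refcount))
				kmem_cache_free(s, f);
		}
	}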


> +/*
> + * We rely on the following assumptions:
> + *  - The slab allocator calls us in process context with IRQs enabled and
> + *    no locks held (not even the RCU lock)

This is true for the second callback.

> + *  - We can allocate a replacement using GFP_KERNEL
> + *  - If the victim is freed while reclaim is running,
> + *    - The slab allocator will not overwrite any fields in the victim
> + *    - The page will not be returned to the page allocator until we return
> + *    - The victim will not be reallocated until we return

The victim cannot be freed during processing since the first callback
established a reference. The second callback must free the object if
possible and drop the reference count.

> + */
> +static bool xa_reclaim(void *arg)

Ok lets assume that this is the second callback.

> +{
> +	struct xa_node *node, *victim = arg;
> +	struct xarray *xa = READ_ONCE(victim->array);
> +	void __rcu **slot;
> +	unsigned int i;
> +
> +	/* Node has been allocated, but not yet placed in a tree. */
> +	if (!xa)
> +		return false;
> +	/* If the node has already been freed, we only need to wait for RCU */
> +	if (xa == XA_RCU_FREE)
> +		goto out;

Hmmm... We really need a refcount here.

> +	node = kmem_cache_alloc(xa_node_cache, GFP_KERNEL);
> +	if (!node)
> +		return false;
> +
> +	xa_lock_irq(xa);

The lock may be held for the set of objects being processed.

> +
> +	/* Might have been freed since we last checked */
> +	if (victim->array == XA_RCU_FREE)
> +		goto freed;
> +
> +	/* Can't grab the LRU list lock here */
> +	if (!list_empty(&victim->private_list))
> +		goto busy;
> +
> +	memcpy(node, victim, sizeof(*node));
> +	INIT_LIST_HEAD(&node->private_list);
> +	for (i = 0; i < XA_CHUNK_SIZE; i++) {
> +		void *entry = xa_entry_locked(xa, node, i);
> +		if (xa_is_node(entry))
> +			rcu_assign_pointer(xa_node(entry)->parent, node);
> +	}
> +	if (!node->parent)
> +		slot = &xa->xa_head;
> +	else
> +		slot = &xa_parent_locked(xa, node)->slots[node->offset];
> +	rcu_assign_pointer(*slot, xa_mk_node(node));
> +	xa_unlock_irq(xa);
> +	xa_node_free(victim);
> +
> +out:
> +	rcu_barrier();
> +	return true;
> +
> +freed:
> +	/* Victim already queued for RCU free; discard the replacement */
> +	xa_unlock_irq(xa);
> +	kmem_cache_free(xa_node_cache, node);
> +	goto out;
> +
> +busy:
> +	xa_unlock_irq(xa);
> +	kmem_cache_free(xa_node_cache, node);
> +	return false;
> +}


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-03-02 17:26     ` Christoph Lameter
@ 2017-03-02 20:55       ` Matthew Wilcox
  2017-03-03 15:24         ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2017-03-02 20:55 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Thu, Mar 02, 2017 at 11:26:39AM -0600, Christoph Lameter wrote:
> On Wed, 1 Mar 2017, Matthew Wilcox wrote:
> 
> > Let me know whether the assumptions I listed above xa_reclaim() are
> > reasonable ... also, do you want me returning a bool to indicate whether
> > I freed the node, or is that not useful because you'll know that anyway?
> 
> The way that the slab logic works is in two phases:

We may need to negotiate the API a little ;-)

> 1. Callback
> 
> void *defrag_get_func(struct kmem_cache *s, int nr, void **objects)

OK, so you're passing me an array of pointers of length 'nr'?
It's conventional to put 'objects' before 'nr' -- see release_pages()
and vm_map_ram().

> Locks are held. Interrupts are disabled. No slab operations may be
> performed, and any concurrent operation on the slab page will block.
> 
> The callback must establish a stable reference to the slab objects.
> Generally this means an additional refcount is taken so that free
> operations will not remove the object. This is required to ensure that
> frees will not interfere with reclaim processing.

I don't currently have a way to do that.  There is a refcount on the node,
but if somebody does an operation which causes the node to be removed
from the tree (something like splatting a huge page over the top of it),
we ignore the refcount and free the node.  However, since it's been in
the tree, we pass it to RCU to free, so if you hold the RCU read lock in
addition to your other locks, the xarray can satisfy your requirements
that the object not be handed back to slab.

That takes care of nodes currently in the tree and nodes handed to RCU.
It doesn't take care of nodes which have been recently allocated and
not yet inserted into the tree.  I've got no way of freeing them, nor
of preventing them from being freed.

> The get() will return a pointer to a private data structure that is passed
> on to the second function. Before that callback the slab allocator will
> drop all the locks. If the function returns NULL then that is an
> indication that the objects are in use and that a reclaim operation cannot
> be performed. No refcount has been taken.

I don't think I have a useful private data structure that I can create.
I assume returning (void *)1 will be acceptable.

> This is required to have a stable array of objects to work on. If the
> objects could be freed at any time then the objects could not be inspected
> for state nor could an array of pointers to the objects be passed on for
> future processing.

If I can free some, but not all of the objects, is that worth doing,
or should I return NULL here?

> 2. Callback
> 
> defrag_reclaim_func(struct kmem_cache *s, int nr, void **objects, void *private)

You missed the return type here ... assuming it's void.

> Here anything may be done. Free the objects or reallocate them (calling
> kmalloc or so to allocate another object to move it to). On return the
> slab allocator will inspect the slab page and if there are no objects
> remaining then the slab page will be freed.

I have to reallocate; I have no way of knowing what my user is using
the xarray for, so I can't throw away nodes.

> During processing the slab page is exempt from allocation, and thus
> objects can only be removed from it until processing is complete.

That's great for me.

> > +/*
> > + * We rely on the following assumptions:
> > + *  - The slab allocator calls us in process context with IRQs enabled and
> > + *    no locks held (not even the RCU lock)
> 
> This is true for the second callback.
> 
> > + *  - We can allocate a replacement using GFP_KERNEL
> > + *  - If the victim is freed while reclaim is running,
> > + *    - The slab allocator will not overwrite any fields in the victim
> > + *    - The page will not be returned to the page allocator until we return
> > + *    - The victim will not be reallocated until we return
> 
> The victim cannot be freed during processing since the first callback
> established a reference. The second callback must free the object if
> possible and drop the reference count.

Most of the frees are going to be coming via call_rcu().  I think that
actually satisfies your requirements.

> > + */
> > +static bool xa_reclaim(void *arg)
> 
> Ok lets assume that this is the second callback.

Yes, it at least approximates your second callback.

> > +{
> > +	struct xa_node *node, *victim = arg;
> > +	struct xarray *xa = READ_ONCE(victim->array);
> > +	void __rcu **slot;
> > +	unsigned int i;
> > +
> > +	/* Node has been allocated, but not yet placed in a tree. */
> > +	if (!xa)
> > +		return false;
> > +	/* If the node has already been freed, we only need to wait for RCU */
> > +	if (xa == XA_RCU_FREE)
> > +		goto out;
> 
> Hmmm... We really need a refcount here.
> 
> > +	node = kmem_cache_alloc(xa_node_cache, GFP_KERNEL);
> > +	if (!node)
> > +		return false;
> > +
> > +	xa_lock_irq(xa);
> 
> The lock may be held for the set of objects being processed.

The objects may well be in different xarrays, so I can't hold the lock
across the entire collection of objects you're asking to free.

I understand your desire to batch up all the objects on a page and ask
the reclaimer to free them all, but is the additional complexity worth
the performance gains you're expecting to see?


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-03-02 20:55       ` Matthew Wilcox
@ 2017-03-03 15:24         ` Christoph Lameter
  2017-03-03 20:39           ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Lameter @ 2017-03-03 15:24 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Thu, 2 Mar 2017, Matthew Wilcox wrote:

> We may need to negotiate the API a little ;-)

Well, let's continue the fun then.

>
> > 1. Callback
> >
> > void *defrag_get_func(struct kmem_cache *s, int nr, void **objects)
>
> OK, so you're passing me an array of pointers of length 'nr'?
> It's conventional to put 'objects' before 'nr' -- see release_pages()
> and vm_map_ram()

Ok.

> > Locks are held. Interrupts are disabled. No slab operations may be
> > performed, and any concurrent operation on the slab page will block.
> >
> > The callback must establish a stable reference to the slab objects.
> > Generally this means an additional refcount is taken so that free
> > operations will not remove the object. This is required to ensure that
> > frees will not interfere with reclaim processing.
>
> I don't currently have a way to do that.  There is a refcount on the node,
> but if somebody does an operation which causes the node to be removed
> from the tree (something like splatting a huge page over the top of it),
> we ignore the refcount and free the node.  However, since it's been in
> the tree, we pass it to RCU to free, so if you hold the RCU read lock in
> addition to your other locks, the xarray can satisfy your requirements
> that the object not be handed back to slab.

We need a general solution here. A refcount on the object is the common
way to provide an existence guarantee. Holding the RCU read lock in a
function that performs slab operations, or during lengthy object
inspection that calls a variety of VM operations, is not advisable.

> That takes care of nodes currently in the tree and nodes handed to RCU.
> It doesn't take care of nodes which have been recently allocated and
> not yet inserted into the tree.  I've got no way of freeing them, nor
> of preventing them from being freed.

The function can fail if you encounter such objects. You do not have to
free any objects that are currently busy.

> > The get() will return a pointer to a private data structure that is passed
> > on to the second function. Before that callback the slab allocator will
> > drop all the locks. If the function returns NULL then that is an
> > indication that the objects are in use and that a reclaim operation cannot
> > be performed. No refcount has been taken.
>
> I don't think I have a useful private data structure that I can create.
> I assume returning (void *)1 will be acceptable.

Yep.

> > This is required to have a stable array of objects to work on. If the
> > objects could be freed at any time then the objects could not be inspected
> > for state nor could an array of pointers to the objects be passed on for
> > future processing.
>
> If I can free some, but not all of the objects, is that worth doing,
> or should I return NULL here?

The objects are all from the same slab page. If you cannot free one,
then the whole slab page must remain. It is advantageous not to free
objects: the slab can then be used for more allocations and filled up
again.

> > 2. Callback
> >
> > defrag_reclaim_func(struct kmem_cache *s, int nr, void **objects, void *private)
>
> You missed the return type here ... assuming it's void.

Yes.

> > Here anything may be done. Free the objects or reallocate them (calling
> > kmalloc or so to allocate another object to move it to). On return the
> > slab allocator will inspect the slab page and if there are no objects
> > remaining then the slab page will be freed.
>
> I have to reallocate; I have no way of knowing what my user is using
> the xarray for, so I can't throw away nodes.

That is fine.

> > During processing the slab page is exempt from allocation, and thus
> > objects can only be removed from it until processing is complete.
>
> That's great for me.

All right.

> > > +/*
> > > + * We rely on the following assumptions:
> > > + *  - The slab allocator calls us in process context with IRQs enabled and
> > > + *    no locks held (not even the RCU lock)
> >
> > This is true for the second callback.
> >
> > > + *  - We can allocate a replacement using GFP_KERNEL
> > > + *  - If the victim is freed while reclaim is running,
> > > + *    - The slab allocator will not overwrite any fields in the victim
> > > + *    - The page will not be returned to the page allocator until we return
> > > + *    - The victim will not be reallocated until we return
> >
> > The victim cannot be freed during processing since the first callback
> > established a reference. The second callback must free the object if
> > possible and drop the reference count.
>
> Most of the frees are going to be coming via call_rcu().  I think that
> actually satisfies your requirements.

I doubt that we can hold the RCU read lock for the entirety of the
processing. There are other slab caches that also need to use this
(dentries and inodes) and all of those are refcount based.

> The objects may well be in different xarrays, so I can't hold the lock
> across the entire collection of objects you're asking to free.
>
> I understand your desire to batch up all the objects on a page and ask
> the reclaimer to free them all, but is the additional complexity worth
> the performance gains you're expecting to see?

Depends on the number of objects in a slab page. I think for starters we
can avoid the complexity and just process objects one by one. However, a
slab page holds a number of allocated objects, and the slab functions
are geared to process them together, since the slab page containing them
needs to be exempted from allocations, locked, and so on.


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-03-03 15:24         ` Christoph Lameter
@ 2017-03-03 20:39           ` Matthew Wilcox
  2017-03-06 14:53             ` Christoph Lameter
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2017-03-03 20:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Fri, Mar 03, 2017 at 09:24:23AM -0600, Christoph Lameter wrote:
> > We may need to negotiate the API a little ;-)
> 
> Well lets continue the fun then.

It is a fun little dance!  It'd help if you posted your current code;
I'm trying to reason about what you're probably doing and why, and a
bit less guesswork would make it easier.

> > > Locks are held. Interrupts are disabled. No slab operations may be
> > > performed and any operations on the slab page will cause that the
> > > concurrent access to block.
> > >
> > > The callback must establish a stable reference to the slab objects.
> > > Meaning generally a additional refcount is added so that any free
> > > operations will not remove the object. This is required in order to ensure
> > > that free operations will not interfere with reclaim processing.
> >
> > I don't currently have a way to do that.  There is a refcount on the node,
> > but if somebody does an operation which causes the node to be removed
> > from the tree (something like splatting a huge page over the top of it),
> > we ignore the refcount and free the node.  However, since it's been in
> > the tree, we pass it to RCU to free, so if you hold the RCU read lock in
> > addition to your other locks, the xarray can satisfy your requirements
> > that the object not be handed back to slab.
> 
> We need a general solution here. Objects having a refcount is the common
> way to provide an existence guarantee. Holding rcu_locks in a
> function that performs slab operations or lenghty object inspection
> calling a variety of VM operations is not advisable.

Even if I had a refcount, it wouldn't solve your problem.  Look at
the dcache:

        if (!(dentry->d_flags & DCACHE_RCUACCESS))
                __d_free(&dentry->d_u.d_rcu);
        else
                call_rcu(&dentry->d_u.d_rcu, __d_free);

and the inode freeing routine is much the same:

        if (inode->i_sb->s_op->destroy_inode)
                inode->i_sb->s_op->destroy_inode(inode);
        else
                call_rcu(&inode->i_rcu, i_callback);

So all three of the most important reclaimable caches free their data
using RCU.  And once an object has gone onto the RCU lists, there's no
refcount that's going to avoid it being passed from RCU to the slab.
Your best bet for avoiding having somebody call kmem_cache_free() on
one of the objects in your list is to hold off RCU.

Of course, I now realise that taking the RCU read lock is not going
to help.  Your critical section will not pre-date all callers of
call_rcu(), so we can have a situation like this:

CPU A		CPU B		CPU C
read_lock
get node
		spin_lock
		call_rcu
		spin_unlock
read stale data
				read_lock
				mark slab page as blocking
read_unlock
		kmem_cache_free()
				read_unlock

and CPU B is going to block in softirq context.  Nasty.  I also don't see
how to avoid it.  Unless by "block", you mean "will spin on slab_lock()",
which isn't too bad, I suppose.

> > > This is required to have a stable array of objects to work on. If the
> > > objects could be freed at any time then the objects could not be inspected
> > > for state nor could an array of pointers to the objects be passed on for
> > > future processing.
> >
> > If I can free some, but not all of the objects, is that worth doing,
> > or should I return NULL here?
> 
> The objects are all from the same slab page. If you cannot free one,
> then the whole slab page must remain. It is advantageous not to free
> objects: the slab can then be used for more allocations and filled up
> again.

OK.  So how about we have the following functions:

bool can_free(void **objects, unsigned int nr);
void reclaim(void **objects, unsigned int nr);

I don't think the kmem_cache is actually useful to any of the callees.
And until we have a user, let's not complicate the interface with the
ability to pass a private data structure around -- again, I don't see
it being useful for dentries or inodes.

The callee can take references or whatever else is useful to mark
objects as being targeted for reclaim in 'can_free', but may not sleep,
and should not take a long time to execute (because we're potentially
delaying somebody in irq context).

In reclaim, anything goes, no locks are held by slab, kmem_cache_alloc
can be called.  When reclaim() returns, slab will evaluate the state
of the page and free it back to the page allocator if everything is
freed.
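
Concretely, for the xarray that might look something like this (a
sketch; the hook registration is omitted, and xa_reclaim() is the
function from my earlier patch):

	static bool xa_can_free(void **objects, unsigned int nr)
	{
		unsigned int i;

		/* May not sleep: only cheap checks are allowed here */
		for (i = 0; i < nr; i++) {
			struct xa_node *node = objects[i];

			/* On an LRU list; we cannot take that lock here */
			if (!list_empty(&node->private_list))
				return false;
		}
		return true;
	}

	static void xa_reclaim_objects(void **objects, unsigned int nr)
	{
		unsigned int i;

		/* Anything goes: xa_reclaim() allocates, locks and copies */
		for (i = 0; i < nr; i++)
			xa_reclaim(objects[i]);
	}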


* Re: [LSF/MM TOPIC] Movable memory and reliable higher order allocations
  2017-03-03 20:39           ` Matthew Wilcox
@ 2017-03-06 14:53             ` Christoph Lameter
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Lameter @ 2017-03-06 14:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, Jesper Dangaard Brouer, riel, Mel Gorman

On Fri, 3 Mar 2017, Matthew Wilcox wrote:

> OK.  So how about we have the following functions:
>
> bool can_free(void **objects, unsigned int nr);
> void reclaim(void **objects, unsigned int nr);
>
> The callee can take references or whetever else is useful to mark
> objects as being targetted for reclaim in 'can_free', but may not sleep,
> and should not take a long time to execute (because we're potentially
> delaying somebody in irq context).
>
> In reclaim, anything goes, no locks are held by slab, kmem_cache_alloc
> can be called.  When reclaim() returns, slab will evaluate the state
> of the page and free it back to the page allocator if everything is
> freed.

Ok. That is pretty much how it works. (Aside from the naming: the
refcounting is just what is commonly done to provide existence
guarantees; you can do something else.)

The old patchset is available at https://lwn.net/Articles/371892/


Thread overview: 9 messages

2017-02-28 21:32 [LSF/MM TOPIC] Movable memory and reliable higher order allocations Christoph Lameter
2017-02-28 23:17 ` Matthew Wilcox
2017-03-02  4:12   ` Matthew Wilcox
2017-03-02 17:26     ` Christoph Lameter
2017-03-02 20:55       ` Matthew Wilcox
2017-03-03 15:24         ` Christoph Lameter
2017-03-03 20:39           ` Matthew Wilcox
2017-03-06 14:53             ` Christoph Lameter
2017-03-02 17:00   ` Christoph Lameter
