linux-kernel.vger.kernel.org archive mirror
* [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
@ 2015-11-03  6:57 Alexey Kardashevskiy
  2015-11-03 14:39 ` Steven Rostedt
  0 siblings, 1 reply; 10+ messages in thread
From: Alexey Kardashevskiy @ 2015-11-03  6:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexey Kardashevskiy, Steven Rostedt, Paul Mackerras,
	David Gibson, linux-kernel

This defines list_for_each_entry_lockless. This allows safe list
traversal in cases where lockdep invocation is unwanted, such as
real mode (MMU is off).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is for VFIO acceleration in POWERKVM for pSeries guests.
There is a KVM instance. There also can be some VFIO (PCI passthru)
devices attached to a KVM guest.

To perform DMA, a pSeries guest registers DMA memory by calling
hypercalls explicitly, at a rate close to one or two hcalls per
network packet, i.e. very often. When a guest does a hypercall
(which is just an assembly instruction), the host kernel receives it
in real mode (MMU is off). When real mode fails to handle it,
the host enables the MMU and retries the hcall in virtual mode.

A logical bus ID (LIOBN) is the target ID for these hypercalls.

Each VFIO device belongs to an IOMMU group. Each group has an address
translation table. It is allowed to have multiple IOMMU groups (i.e.
multiple tables) under the same LIOBN.

So effectively every DMA hcall has to update one or more TCE tables
attached to the same LIOBN. RCU is used to update/traverse this list
safely.

Using RCU as is in virtual mode is fine. Lockdep works, etc.
list_add_rcu() is used to populate the list;
list_del_rcu() + call_rcu() are used to remove groups from the list.
These operations can happen at runtime as a result of PCI hotplug/unplug
in guests.

Using RCU as is in real mode is not fine as some RCU checks can lock up
the system and in real mode we won't even have a chance to see any
debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.

A previous version of this patch defined list_for_each_entry_rcu_notrace(),
but it was proposed to use list_entry_lockless() instead. However, the
comment for lockless_dereference() suggests it is only a good idea if
"lifetime is managed by something other than RCU", whereas in my case the
lifetime is managed by RCU.

So what would be the correct approach here? Thanks.
---
 include/linux/rculist.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 17c6b1f..a83a924 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
 		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_lockless - iterate over rcu list of given type
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_struct within the struct.
+ *
+ * This list-traversal primitive may safely run concurrently
+ */
+#define list_entry_lockless(ptr, type, member) \
+	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+
+#define list_for_each_entry_lockless(pos, head, member) \
+	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
+		&pos->member != (head); \
+		pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
-- 
2.5.0.rc3
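As an aside for readers who want to try the semantics of the proposed macros outside the kernel, here is a minimal userspace model. This is a sketch, not kernel code: the list plumbing, the demo types, and the lockless_dereference() stand-in are invented for illustration (the real lockless_dereference() additionally provides dependency ordering via smp_read_barrier_depends()); only the shape of the two macros mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in: a plain volatile read. The kernel version also
 * carries a dependency barrier, which this sketch omits. */
#define lockless_dereference(p) (*(volatile __typeof__(p) *)&(p))
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head { struct list_head *next, *prev; };

/* The two macros from the patch (typeof spelled __typeof__ here) */
#define list_entry_lockless(ptr, type, member) \
	container_of((__typeof__(ptr))lockless_dereference(ptr), type, member)

#define list_for_each_entry_lockless(pos, head, member) \
	for (pos = list_entry_lockless((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = list_entry_lockless(pos->member.next, __typeof__(*pos), member))

/* Stand-in for the per-LIOBN TCE table list described above */
struct tce_table { int id; struct list_head node; };

/* Minimal head insertion, matching what list_add_rcu() does structurally */
static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

/* Build a two-element list and sum the ids via lockless traversal */
int demo_sum(void)
{
	static struct list_head head = { &head, &head };
	static struct tce_table a = { .id = 1 }, b = { .id = 2 };
	struct tce_table *pos;
	int sum = 0;

	list_add(&a.node, &head);
	list_add(&b.node, &head);

	list_for_each_entry_lockless(pos, &head, node)
		sum += pos->id;
	return sum;
}
```

As with list_add_rcu(), insertion is at the head, so traversal visits elements in reverse insertion order.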


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-03  6:57 [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu Alexey Kardashevskiy
@ 2015-11-03 14:39 ` Steven Rostedt
  2015-11-06  2:17   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2015-11-03 14:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Paul E. McKenney, Paul Mackerras, David Gibson, linux-kernel

On Tue,  3 Nov 2015 17:57:05 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This defines list_for_each_entry_lockless. This allows safe list
> traversal in cases where lockdep invocation is unwanted, such as
> real mode (MMU is off).
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is for VFIO acceleration in POWERKVM for pSeries guests.
> There is a KVM instance. There also can be some VFIO (PCI passthru)
> devices attached to a KVM guest.
> 
> To perform DMA, a pSeries guest registers DMA memory by calling
> hypercalls explicitly, at a rate close to one or two hcalls per
> network packet, i.e. very often. When a guest does a hypercall
> (which is just an assembly instruction), the host kernel receives it
> in real mode (MMU is off). When real mode fails to handle it,
> the host enables the MMU and retries the hcall in virtual mode.
> 
> A logical bus ID (LIOBN) is the target ID for these hypercalls.
> 
> Each VFIO device belongs to an IOMMU group. Each group has an address
> translation table. It is allowed to have multiple IOMMU groups (i.e.
> multiple tables) under the same LIOBN.
> 
> So effectively every DMA hcall has to update one or more TCE tables
> attached to the same LIOBN. RCU is used to update/traverse this list
> safely.
> 
> Using RCU as is in virtual mode is fine. Lockdep works, etc.
> list_add_rcu() is used to populate the list;
> list_del_rcu() + call_rcu() are used to remove groups from the list.
> These operations can happen at runtime as a result of PCI hotplug/unplug
> in guests.
> 
> Using RCU as is in real mode is not fine as some RCU checks can lock up
> the system and in real mode we won't even have a chance to see any
> debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.
> 
> A previous version of this patch defined list_for_each_entry_rcu_notrace(),
> but it was proposed to use list_entry_lockless() instead. However, the
> comment for lockless_dereference() suggests it is only a good idea if
> "lifetime is managed by something other than RCU", whereas in my case the
> lifetime is managed by RCU.
> 
> So what would be the correct approach here? Thanks.

If the only use case for this so far is in POWERKVM, perhaps it should
be defined specifically in arch/powerpc, so as not to confuse others
into using it.

Or, if you do imagine that this can be used in other scenarios, then a
much more detailed comment must be added to the kerneldoc section.
list_for_each_entry_rcu() should really be used 99.99% of the time
in the kernel. This looks to be an extreme exception. I hate to add a
generic helper for something that will only be used in one location.

-- Steve


> ---
>  include/linux/rculist.h | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index 17c6b1f..a83a924 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
>  		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>  
>  /**
> + * list_for_each_entry_lockless - iterate over rcu list of given type
> + * @pos:	the type * to use as a loop cursor.
> + * @head:	the head for your list.
> + * @member:	the name of the list_struct within the struct.
> + *
> + * This list-traversal primitive may safely run concurrently
> + */
> +#define list_entry_lockless(ptr, type, member) \
> +	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
> +
> +#define list_for_each_entry_lockless(pos, head, member) \
> +	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
> +		&pos->member != (head); \
> +		pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> +
> +/**
>   * list_for_each_entry_continue_rcu - continue iteration over list of given type
>   * @pos:	the type * to use as a loop cursor.
>   * @head:	the head for your list.



* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-03 14:39 ` Steven Rostedt
@ 2015-11-06  2:17   ` Alexey Kardashevskiy
  2015-11-18 19:13     ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Alexey Kardashevskiy @ 2015-11-06  2:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Paul Mackerras, David Gibson, linux-kernel

On 11/04/2015 01:39 AM, Steven Rostedt wrote:
> On Tue,  3 Nov 2015 17:57:05 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>> This defines list_for_each_entry_lockless. This allows safe list
>> traversal in cases where lockdep invocation is unwanted, such as
>> real mode (MMU is off).
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> This is for VFIO acceleration in POWERKVM for pSeries guests.
>> There is a KVM instance. There also can be some VFIO (PCI passthru)
>> devices attached to a KVM guest.
>>
>> To perform DMA, a pSeries guest registers DMA memory by calling
>> hypercalls explicitly, at a rate close to one or two hcalls per
>> network packet, i.e. very often. When a guest does a hypercall
>> (which is just an assembly instruction), the host kernel receives it
>> in real mode (MMU is off). When real mode fails to handle it,
>> the host enables the MMU and retries the hcall in virtual mode.
>>
>> A logical bus ID (LIOBN) is the target ID for these hypercalls.
>>
>> Each VFIO device belongs to an IOMMU group. Each group has an address
>> translation table. It is allowed to have multiple IOMMU groups (i.e.
>> multiple tables) under the same LIOBN.
>>
>> So effectively every DMA hcall has to update one or more TCE tables
>> attached to the same LIOBN. RCU is used to update/traverse this list
>> safely.
>>
>> Using RCU as is in virtual mode is fine. Lockdep works, etc.
>> list_add_rcu() is used to populate the list;
>> list_del_rcu() + call_rcu() are used to remove groups from the list.
>> These operations can happen at runtime as a result of PCI hotplug/unplug
>> in guests.
>>
>> Using RCU as is in real mode is not fine as some RCU checks can lock up
>> the system and in real mode we won't even have a chance to see any
>> debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.
>>
>> A previous version of this patch defined list_for_each_entry_rcu_notrace(),
>> but it was proposed to use list_entry_lockless() instead. However, the
>> comment for lockless_dereference() suggests it is only a good idea if
>> "lifetime is managed by something other than RCU", whereas in my case the
>> lifetime is managed by RCU.
>>
>> So what would be the correct approach here? Thanks.
>
> If the only use case for this so far is in POWERKVM, perhaps it should
> be defined specifically in arch/powerpc, so as not to confuse others
> into using it.
>
> Or, if you do imagine that this can be used in other scenarios, then a
> much more detailed comment must be added to the kerneldoc section.
> list_for_each_entry_rcu() should really be used 99.99% of the time
> in the kernel. This looks to be an extreme exception. I hate to add a
> generic helper for something that will only be used in one location.


No, I cannot really imagine other uses, and I can move it somewhere in arch/powerpc.

Still, is my approach correct? What does the comment for
lockless_dereference() actually mean - that it won't work together with
RCU at all, or is it there to force people not to use it, as
"list_for_each_entry_rcu() should really be used 99.99% of the time"? :)



>
> -- Steve
>
>
>> ---
>>   include/linux/rculist.h | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index 17c6b1f..a83a924 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
>>   		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
>>
>>   /**
>> + * list_for_each_entry_lockless - iterate over rcu list of given type
>> + * @pos:	the type * to use as a loop cursor.
>> + * @head:	the head for your list.
>> + * @member:	the name of the list_struct within the struct.
>> + *
>> + * This list-traversal primitive may safely run concurrently
>> + */
>> +#define list_entry_lockless(ptr, type, member) \
>> +	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
>> +
>> +#define list_for_each_entry_lockless(pos, head, member) \
>> +	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
>> +		&pos->member != (head); \
>> +		pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
>> +
>> +/**
>>    * list_for_each_entry_continue_rcu - continue iteration over list of given type
>>    * @pos:	the type * to use as a loop cursor.
>>    * @head:	the head for your list.
>


-- 
Alexey


* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-06  2:17   ` Alexey Kardashevskiy
@ 2015-11-18 19:13     ` Paul E. McKenney
  2015-11-29 23:39       ` Paul Mackerras
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2015-11-18 19:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Steven Rostedt, Paul Mackerras, David Gibson, linux-kernel

On Fri, Nov 06, 2015 at 01:17:17PM +1100, Alexey Kardashevskiy wrote:
> On 11/04/2015 01:39 AM, Steven Rostedt wrote:
> >On Tue,  3 Nov 2015 17:57:05 +1100
> >Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> >>This defines list_for_each_entry_lockless. This allows safe list
> >>traversal in cases where lockdep invocation is unwanted, such as
> >>real mode (MMU is off).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>---
> >>
> >>This is for VFIO acceleration in POWERKVM for pSeries guests.
> >>There is a KVM instance. There also can be some VFIO (PCI passthru)
> >>devices attached to a KVM guest.
> >>
> >>To perform DMA, a pSeries guest registers DMA memory by calling
> >>hypercalls explicitly, at a rate close to one or two hcalls per
> >>network packet, i.e. very often. When a guest does a hypercall
> >>(which is just an assembly instruction), the host kernel receives it
> >>in real mode (MMU is off). When real mode fails to handle it,
> >>the host enables the MMU and retries the hcall in virtual mode.
> >>
> >>A logical bus ID (LIOBN) is the target ID for these hypercalls.
> >>
> >>Each VFIO device belongs to an IOMMU group. Each group has an address
> >>translation table. It is allowed to have multiple IOMMU groups (i.e.
> >>multiple tables) under the same LIOBN.
> >>
> >>So effectively every DMA hcall has to update one or more TCE tables
> >>attached to the same LIOBN. RCU is used to update/traverse this list
> >>safely.
> >>
> >>Using RCU as is in virtual mode is fine. Lockdep works, etc.
> >>list_add_rcu() is used to populate the list;
> >>list_del_rcu() + call_rcu() are used to remove groups from the list.
> >>These operations can happen at runtime as a result of PCI hotplug/unplug
> >>in guests.
> >>
> >>Using RCU as is in real mode is not fine as some RCU checks can lock up
> >>the system and in real mode we won't even have a chance to see any
> >>debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.
> >>
> >>A previous version of this patch defined list_for_each_entry_rcu_notrace(),
> >>but it was proposed to use list_entry_lockless() instead. However, the
> >>comment for lockless_dereference() suggests it is only a good idea if
> >>"lifetime is managed by something other than RCU", whereas in my case the
> >>lifetime is managed by RCU.
> >>
> >>So what would be the correct approach here? Thanks.
> >
> >If the only use case for this so far is in POWERKVM, perhaps it should
> >be defined specifically in arch/powerpc, so as not to confuse others
> >into using it.
> >
> >Or, if you do imagine that this can be used in other scenarios, then a
> >much more detailed comment must be added to the kerneldoc section.
> >list_for_each_entry_rcu() should really be used 99.99% of the time
> >in the kernel. This looks to be an extreme exception. I hate to add a
> >generic helper for something that will only be used in one location.
> 
> No, I cannot really imagine other uses, and I can move it somewhere in arch/powerpc.
> 
> Still, is my approach correct? What does the comment for
> lockless_dereference() actually mean - that it won't work together with
> RCU at all, or is it there to force people not to use it, as
> "list_for_each_entry_rcu() should really be used 99.99% of the time"? :)

Well, it depends...

The key difference between lockless_dereference() and rcu_dereference()
is that lockless_dereference() won't complain if used outside of
an RCU read-side critical section.  When there is no RCU read-side
critical section, lockless_dereference() cannot rely on RCU's normal
action of keeping the data item around.  Therefore, when you are using
lockless_dereference(), you have to have some mechanism other than RCU
to keep the data item around.  The usual approach is for the data item
to never be freed, for example, if data is only ever added to the list
and never removed.  Other approaches include reference counting, hazard
pointers, hashed arrays of locks, garbage collectors, transactional
memory, and so on.

Of these other approaches:

1.	Reference counting is quite hard to use.

2.	Hazard pointers are easier to use than reference counting, but
	still require that races with deletion force the traversal
	to be retried from the top level.

3.	Hashed arrays of locks pose interesting deadlock issues, and
	also suffer from poor locality of reference, which in turn results in
	not-so-good performance and scalability.

4.	As far as I know, garbage collectors are not available in the
	Linux kernel.

5.	Full-blown hardware transactional memory is only available as s390's
	constrained transactions.  (Yes, I know about x86 and powerpc's
	transactional memory, and in particular, I know that it needs a
	software fallback, which needs to be some other item on this list.)
	The software transactional memory implementations that I am aware
	of suffer from not-so-good performance and scalability.

So, in the Linux kernel, it is common to use RCU.
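The first of those cases - elements that are never freed - can be modeled in userspace with C11 atomics. This is an illustrative sketch only (all names here are invented): a release store stands in for rcu_assign_pointer(), and a consume load stands in for the dependency ordering that lockless_dereference() provides; no rcu_read_lock() appears anywhere on the reader side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { int val; struct node *_Atomic next; };

/* Add-only list: nodes come from static storage and are never freed,
 * so no grace period is ever needed to protect readers. */
static struct node *_Atomic list_head;
static struct node pool[8];
static int pool_used;

/* Writer: initialize the node, then publish it with a release store
 * (models rcu_assign_pointer()). */
void publish(int val)
{
	struct node *n = &pool[pool_used++];

	n->val = val;
	atomic_store_explicit(&n->next,
			      atomic_load_explicit(&list_head, memory_order_relaxed),
			      memory_order_relaxed);
	atomic_store_explicit(&list_head, n, memory_order_release);
}

/* Reader: no lock, no RCU read-side critical section; the consume loads
 * model lockless_dereference()'s dependency ordering. */
int sum_lockless(void)
{
	int sum = 0;
	struct node *n;

	for (n = atomic_load_explicit(&list_head, memory_order_consume);
	     n != NULL;
	     n = atomic_load_explicit(&n->next, memory_order_consume))
		sum += n->val;
	return sum;
}
```

Because nodes are never reclaimed, a reader that races with publish() can at worst miss the newest element; it can never follow a dangling pointer.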

In any case, the docbook comment "This list-traversal primitive may
safely run concurrently" does need some help.  Maybe something like
the following?

 * This list-traversal primitive is similar to list_for_each_entry_rcu(),
 * but is intended for lists whose elements are never removed.


All well and good, but the other thing you might mean is that elements
-do- get removed, but in some environment that cannot be preempted.

In this case, you could use synchronize_rcu_sched() or call_rcu_sched()
to wait for all readers.  If you also had readers running in one
of the normal kernel environments, they could be protected by
rcu_read_lock_sched() and rcu_read_unlock_sched() or any pair of
primitives that disable then re-enable preemption (e.g., local_irq_save()
and local_irq_restore()).  Accesses from one of the normal kernel
environments would use rcu_dereference_sched(), but this primitive's
calls to lockdep might not work so well in your special kernel
environment.

In this case, the docbook comment might be as follows:

 * This list-traversal primitive is similar to list_for_each_entry_rcu(),
 * but is intended for lists whose elements are never removed on the
 * one hand and for non-preemptible special environments where lockdep
 * cannot be invoked on the other.  In the case of the special environments,
 * use synchronize_rcu_sched() and call_rcu_sched() to wait for readers.


Or are you doing something else entirely?

							Thanx, Paul

> >-- Steve
> >
> >
> >>---
> >>  include/linux/rculist.h | 16 ++++++++++++++++
> >>  1 file changed, 16 insertions(+)
> >>
> >>diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >>index 17c6b1f..a83a924 100644
> >>--- a/include/linux/rculist.h
> >>+++ b/include/linux/rculist.h
> >>@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
> >>  		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
> >>
> >>  /**
> >>+ * list_for_each_entry_lockless - iterate over rcu list of given type
> >>+ * @pos:	the type * to use as a loop cursor.
> >>+ * @head:	the head for your list.
> >>+ * @member:	the name of the list_struct within the struct.
> >>+ *
> >>+ * This list-traversal primitive may safely run concurrently
> >>+ */
> >>+#define list_entry_lockless(ptr, type, member) \
> >>+	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
> >>+
> >>+#define list_for_each_entry_lockless(pos, head, member) \
> >>+	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
> >>+		&pos->member != (head); \
> >>+		pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> >>+
> >>+/**
> >>   * list_for_each_entry_continue_rcu - continue iteration over list of given type
> >>   * @pos:	the type * to use as a loop cursor.
> >>   * @head:	the head for your list.
> >
> 
> 
> -- 
> Alexey
> 



* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-18 19:13     ` Paul E. McKenney
@ 2015-11-29 23:39       ` Paul Mackerras
  2015-11-30 20:30         ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Mackerras @ 2015-11-29 23:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexey Kardashevskiy, Steven Rostedt, David Gibson, linux-kernel

On Wed, Nov 18, 2015 at 11:13:28AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 06, 2015 at 01:17:17PM +1100, Alexey Kardashevskiy wrote:
[snip]
> > Still, is my approach correct? What does the comment for
> > lockless_dereference() actually mean - that it won't work together with
> > RCU at all, or is it there to force people not to use it, as
> > "list_for_each_entry_rcu() should really be used 99.99% of the time"? :)
> 
> Well, it depends...
> 
> The key difference between lockless_dereference() and rcu_dereference()
> is that lockless_dereference() won't complain if used outside of
> an RCU read-side critical section.  When there is no RCU read-side
> critical section, lockless_dereference() cannot rely on RCU's normal
> action of keeping the data item around.  Therefore, when you are using
> lockless_dereference(), you have to have some mechanism other than RCU
> to keep the data item around.  The usual approach is for the data item
> to never be freed, for example, if data is only ever added to the list
> and never removed.  Other approaches include reference counting, hazard
> pointers, hashed arrays of locks, garbage collectors, transactional
> memory, and so on.

So, the situation is that we have an RCU-protected list, which in this
case we are traversing without modifying the list.  We are in a
restricted environment (hypervisor real mode) where we can't be
preempted, both because interrupts are hard-disabled and because our
caller has done preempt_disable().  In this restricted environment we
can access the linear mapping (thus kmalloc'd data) but not the
vmalloc or ioremap regions.

Thus we are not formally in an RCU read-side critical section, though
we believe that having preemption disabled gives us equivalent
protection.  Probably what we should do is to add a
rcu_read_lock/unlock pair in a function higher up the call chain
so that we are actually in a RCU read-side critical section.

Then the only reason not to use list_for_each_entry_rcu would be that
we don't trust the checking machinery not to ever access vmalloc'd
data.  In other words, we want a list_for_each_entry_nocheck
or list_for_each_entry_restricted which omits all the lockdep
checking.  That would need a list_entry_rcu_nocheck which would need a
thing like rcu_dereference_raw that does no lockdep checking - which
is where I thought you suggested lockless_dereference.
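To make the "omits all the lockdep checking" idea concrete, here is a toy userspace model in which a counter stands in for the lockdep machinery: a checked accessor would bump it on every dereference (as rcu_dereference() would consult lockdep), while the raw accessor used by the hypothetical list_for_each_entry_nocheck() never touches it. Every name here, and the "check" itself, is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

static int lockdep_calls;	/* stands in for the lockdep machinery */

/* Models rcu_dereference(): performs the debug check on every load */
#define checked_dereference(p)	(lockdep_calls++, (p))
/* Models rcu_dereference_raw(): no debug check at all */
#define raw_dereference(p)	(p)

struct list_head { struct list_head *next, *prev; };
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical "nocheck" variants built on the raw accessor */
#define list_entry_nocheck(ptr, type, member) \
	container_of(raw_dereference(ptr), type, member)

#define list_for_each_entry_nocheck(pos, head, member) \
	for (pos = list_entry_nocheck((head)->next, __typeof__(*pos), member); \
	     &pos->member != (head); \
	     pos = list_entry_nocheck(pos->member.next, __typeof__(*pos), member))

struct item { int id; struct list_head node; };

/* Traverse a one-element list; return how many debug checks were made */
int demo_nocheck(void)
{
	static struct list_head head = { &head, &head };
	static struct item a = { .id = 1 };
	struct item *pos;
	int seen = 0;

	a.node.next = &head;
	a.node.prev = &head;
	head.next = &a.node;
	head.prev = &a.node;

	list_for_each_entry_nocheck(pos, &head, node)
		seen += pos->id;

	assert(seen == 1);
	return lockdep_calls;	/* zero: the raw accessor skipped the check */
}
```

Swapping raw_dereference() for checked_dereference() in list_entry_nocheck() would recover the checked behavior, which is exactly the relationship between the proposed nocheck traversal and list_for_each_entry_rcu().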

So, what name do you like for these primitives, and where should they
go?

Thanks,
Paul.


* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-29 23:39       ` Paul Mackerras
@ 2015-11-30 20:30         ` Paul E. McKenney
  2015-12-06  2:19           ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2015-11-30 20:30 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexey Kardashevskiy, Steven Rostedt, David Gibson, linux-kernel

On Mon, Nov 30, 2015 at 10:39:15AM +1100, Paul Mackerras wrote:
> On Wed, Nov 18, 2015 at 11:13:28AM -0800, Paul E. McKenney wrote:
> > On Fri, Nov 06, 2015 at 01:17:17PM +1100, Alexey Kardashevskiy wrote:
> [snip]
> > > Still, is my approach correct? What does the comment for
> > > lockless_dereference() actually mean - that it won't work together with
> > > RCU at all, or is it there to force people not to use it, as
> > > "list_for_each_entry_rcu() should really be used 99.99% of the time"? :)
> > 
> > Well, it depends...
> > 
> > The key difference between lockless_dereference() and rcu_dereference()
> > is that lockless_dereference() won't complain if used outside of
> > an RCU read-side critical section.  When there is no RCU read-side
> > critical section, lockless_dereference() cannot rely on RCU's normal
> > action of keeping the data item around.  Therefore, when you are using
> > lockless_dereference(), you have to have some mechanism other than RCU
> > to keep the data item around.  The usual approach is for the data item
> > to never be freed, for example, if data is only ever added to the list
> > and never removed.  Other approaches include reference counting, hazard
> > pointers, hashed arrays of locks, garbage collectors, transactional
> > memory, and so on.
> 
> So, the situation is that we have an RCU-protected list, which in this
> > case we are traversing without modifying the list.  We are in a
> restricted environment (hypervisor real mode) where we can't be
> preempted, both because interrupts are hard-disabled and because our
> caller has done preempt_disable().  In this restricted environment we
> can access the linear mapping (thus kmalloc'd data) but not the
> vmalloc or ioremap regions.
> 
> > Thus we are not formally in an RCU read-side critical section, though
> we believe that having preemption disabled gives us equivalent
> protection.  Probably what we should do is to add a
> rcu_read_lock/unlock pair in a function higher up the call chain
> so that we are actually in a RCU read-side critical section.
> 
> Then the only reason not to use list_for_each_entry_rcu would be that
> we don't trust the checking machinery not to ever access vmalloc'd
> data.  In other words, we want a list_for_each_entry_nocheck
> or list_for_each_entry_restricted which omits all the lockdep
> checking.  That would need a list_entry_rcu_nocheck which would need a
> thing like rcu_dereference_raw that does no lockdep checking - which
> is where I thought you suggested lockless_dereference.
> 
> So, what name do you like for these primitives, and where should they
> go?

I am good with list_for_each_entry_lockless() and with its using
lockless_dereference().  I am also good with this going into
include/linux/rculist.h.  That said...

o	The name of the macro was last given as list_entry_lockless()
	rather than the name list_for_each_entry_lockless() that was
	used in the docbook comment.  The latter is preferable because
	it is consistent with the other list-traversal macros.

o	The docbook comment (and I quote) "This list-traversal
	primitive may safely run concurrently" is clearly inadequate.
	The points it needs to make are:

	1.	This locklessly traverses a list.

	2.	It may be used when items are never removed from the list.

	3.	It may also be used in environments where preemption
		is implicitly disabled, but where lockdep cannot safely
		be invoked.  In this case, after an element is removed,
		the synchronize_sched(), synchronize_sched_expedited(),
		or call_rcu_sched() functions must be used to wait for
		the needed grace period.

o	The commit log is also inadequate, though given a good patch,
	I can fix this.  It needs some of the information that we have
	been sending back and forth in this email thread.
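A kerneldoc comment covering those three points might read roughly as follows (a wording sketch only, not an agreed-upon text):

```c
/**
 * list_for_each_entry_lockless - iterate locklessly over a list
 * @pos:	the type * to use as a loop cursor.
 * @head:	the head for your list.
 * @member:	the name of the list_struct within the struct.
 *
 * This list-traversal primitive locklessly traverses the list, and may
 * be used when items are never removed from the list.  It may also be
 * used in environments where preemption is implicitly disabled but where
 * lockdep cannot safely be invoked; in that case, after an element is
 * removed, synchronize_sched(), synchronize_sched_expedited(), or
 * call_rcu_sched() must be used to wait for the needed grace period.
 */
```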

It feels to me like we are going around in circles here.  Someone needs
to submit an updated patch to move this forward:  I cannot in good faith
accept the November 3rd patch.  Are you or Alexey going to send this
patch, or should I produce my best guess as to what you guys need?

							Thanx, Paul



* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-11-30 20:30         ` Paul E. McKenney
@ 2015-12-06  2:19           ` Paul E. McKenney
  2015-12-08  5:20             ` Paul Mackerras
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2015-12-06  2:19 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexey Kardashevskiy, Steven Rostedt, David Gibson, linux-kernel

On Mon, Nov 30, 2015 at 12:30:06PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 30, 2015 at 10:39:15AM +1100, Paul Mackerras wrote:
> > On Wed, Nov 18, 2015 at 11:13:28AM -0800, Paul E. McKenney wrote:
> > > On Fri, Nov 06, 2015 at 01:17:17PM +1100, Alexey Kardashevskiy wrote:
> > [snip]
> > > > Still, is my approach correct? What does the comment for
> > > > lockless_dereference() actually mean - that it won't work together with
> > > > RCU at all, or is it there to force people not to use it, as
> > > > "list_for_each_entry_rcu() should really be used 99.99% of the time"? :)
> > > 
> > > Well, it depends...
> > > 
> > > The key difference between lockless_dereference() and rcu_dereference()
> > > is that lockless_dereference() won't complain if used outside of
> > > an RCU read-side critical section.  When there is no RCU read-side
> > > critical section, lockless_dereference() cannot rely on RCU's normal
> > > action of keeping the data item around.  Therefore, when you are using
> > > lockless_dereference(), you have to have some mechanism other than RCU
> > > to keep the data item around.  The usual approach is for the data item
> > > to never be freed, for example, if data is only ever added to the list
> > > and never removed.  Other approaches include reference counting, hazard
> > > pointers, hashed arrays of locks, garbage collectors, transactional
> > > memory, and so on.
> > 
> > So, the situation is that we have an RCU-protected list, which in this
> > case we are traversing without modifying the list.  We are in an
> > restricted environment (hypervisor real mode) where we can't be
> > preempted, both because interrupts are hard-disabled and because our
> > caller has done preempt_disable().  In this restricted environment we
> > can access the linear mapping (thus kmalloc'd data) but not the
> > vmalloc or ioremap regions.
> > 
> > > Thus we are not formally in an RCU read-side critical section, though
> > we believe that having preemption disabled gives us equivalent
> > protection.  Probably what we should do is to add a
> > rcu_read_lock/unlock pair in a function higher up the call chain
> > so that we are actually in a RCU read-side critical section.
> > 
> > Then the only reason not to use list_for_each_entry_rcu would be that
> > we don't trust the checking machinery not to ever access vmalloc'd
> > data.  In other words, we want a list_for_each_entry_nocheck
> > or list_for_each_entry_restricted which omits all the lockdep
> > checking.  That would need a list_entry_rcu_nocheck which would need a
> > thing like rcu_dereference_raw that does no lockdep checking - which
> > is where I thought you suggested lockless_dereference.
> > 
> > So, what name do you like for these primitives, and where should they
> > go?
> 
> I am good with list_for_each_entry_lockless() and with its using
> lockless_dereference().  I am also good with this going into
> include/linux/rculist.h.  That said...
> 
> o	The name of the macro was last given as list_entry_lockless()
> 	rather than the name list_for_each_entry_lockless() that was
> 	used in the docbook comment.  The latter is preferable because
> 	it is consistent with the other list-traversal macros.
> 
> o	The docbook comment (and I quote) "This list-traversal
> 	primitive may safely run concurrently" is clearly inadequate.
> 	The points it needs to make are:
> 
> 	1.	This locklessly traverses a list.
> 
> 	2.	It may be used when items are never removed from the list.
> 
> 	3.	It may also be used in environments where preemption
> 		is implicitly disabled, but where lockdep cannot safely
> 		be invoked.  In this case, after an element is removed,
> 		the synchronize_sched(), synchronize_sched_expedited(),
> 		or call_rcu_sched() functions must be used to wait for
> 		the needed grace period.
> 
> o	The commit log is also inadequate, though given a good patch,
> 	I can fix this.  It needs some of the information that we have
> 	been sending back and forth in this email thread.
> 
> It feels to me like we are going around in circles here.  Someone needs
> to submit an updated patch to move this forward:  I cannot in good faith
> accept the November 3rd patch.  Are you or Alexey going to send this
> patch, or should I produce my best guess as to what you guys need?

As in the following?  (And yes, I was confused by the fact that the
docbook header was in front of a differently-named macro!)

							Thanx, Paul

------------------------------------------------------------------------

commit 428ac14295ea49853afa136c9120cc6fb554a030
Author: Alexey Kardashevskiy <aik@ozlabs.ru>
Date:   Sat Dec 5 18:14:19 2015 -0800

    list: Add lockless list traversal primitives
    
    Although list_for_each_entry_rcu() can in theory be used anywhere
    preemption is disabled, it can result in calls to lockdep, which cannot
    be used in certain constrained execution environments, such as exception
    handlers that do not map the entire kernel into their address spaces.
    This commit therefore adds list_entry_lockless() and
    list_for_each_entry_lockless(), which never invoke lockdep and can
    therefore safely be used from these constrained environments, but only
    as long as those environments are non-preemptible (or items are never
    deleted from the list).
    
    Use synchronize_sched(), call_rcu_sched(), or synchronize_sched_expedited()
    in updates for the needed grace periods.  Of course, if items are never
    deleted from the list, there is no need to wait for grace periods.
    
    Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 5ed540986019..1fad79861e14 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -305,6 +305,42 @@ static inline void list_splice_init_rcu(struct list_head *list,
 		pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_entry_lockless - get the struct for this entry
+ * @ptr:        the &struct list_head pointer.
+ * @type:       the type of the struct this is embedded in.
+ * @member:     the name of the list_head within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu(), but requires some implicit RCU
+ * read-side guarding.  One example is running within a special
+ * exception-time environment where preemption is disabled and where
+ * lockdep cannot be invoked (in which case updaters must use RCU-sched,
+ * as in synchronize_sched(), call_rcu_sched(), and friends).  Another
+ * example is when items are added to the list, but never deleted.
+ */
+#define list_entry_lockless(ptr, type, member) \
+	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+
+/**
+ * list_for_each_entry_lockless - iterate over rcu list of given type
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_struct within the struct.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu(), but requires some implicit RCU
+ * read-side guarding.  One example is running within a special
+ * exception-time environment where preemption is disabled and where
+ * lockdep cannot be invoked (in which case updaters must use RCU-sched,
+ * as in synchronize_sched(), call_rcu_sched(), and friends).  Another
+ * example is when items are added to the list, but never deleted.
+ */
+#define list_for_each_entry_lockless(pos, head, member) \
+	for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
+	     &pos->member != (head); \
+	     pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
+
+/**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-12-06  2:19           ` Paul E. McKenney
@ 2015-12-08  5:20             ` Paul Mackerras
  2015-12-08  5:46               ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Mackerras @ 2015-12-08  5:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexey Kardashevskiy, Steven Rostedt, David Gibson, linux-kernel

On Sat, Dec 05, 2015 at 06:19:46PM -0800, Paul E. McKenney wrote:
> 
> As in the following?  (And yes, I was confused by the fact that the
> docbook header was in front of a differently-named macro!)
> 
> 							Thanx, Paul

That looks great!  Have you committed to one of your branches?

Thanks,
Paul.

* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-12-08  5:20             ` Paul Mackerras
@ 2015-12-08  5:46               ` Paul E. McKenney
  2015-12-22  7:08                 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2015-12-08  5:46 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexey Kardashevskiy, Steven Rostedt, David Gibson, linux-kernel

On Tue, Dec 08, 2015 at 04:20:03PM +1100, Paul Mackerras wrote:
> On Sat, Dec 05, 2015 at 06:19:46PM -0800, Paul E. McKenney wrote:
> > 
> > As in the following?  (And yes, I was confused by the fact that the
> > docbook header was in front of a differently-named macro!)
> 
> That looks great!  Have you committed to one of your branches?

Indeed I have, though quite recently.  Please see branch rcu/dev of:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Or, more specifically, this commit: 69b907297f4e

If your testing goes well, I will push this into the upcoming merge
window.

							Thanx, Paul


* Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu
  2015-12-08  5:46               ` Paul E. McKenney
@ 2015-12-22  7:08                 ` Alexey Kardashevskiy
  0 siblings, 0 replies; 10+ messages in thread
From: Alexey Kardashevskiy @ 2015-12-22  7:08 UTC (permalink / raw)
  To: paulmck, Paul Mackerras; +Cc: Steven Rostedt, David Gibson, linux-kernel

On 12/08/2015 04:46 PM, Paul E. McKenney wrote:
> On Tue, Dec 08, 2015 at 04:20:03PM +1100, Paul Mackerras wrote:
>> On Sat, Dec 05, 2015 at 06:19:46PM -0800, Paul E. McKenney wrote:
>>>
>>> As in the following?  (And yes, I was confused by the fact that the
>>> docbook header was in front of a differently-named macro!)
>>
>> That looks great!  Have you committed to one of your branches?
>
> Indeed I have, though quite recently.  Please see branch rcu/dev of:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> Or, more specifically, this commit: 69b907297f4e
>
> If your testing goes well, I will push this into the upcoming merge
> window.

Sorry, it took a little longer than I expected. It works just fine, thanks!


-- 
Alexey

end of thread, other threads:[~2015-12-22  7:08 UTC | newest]

Thread overview: 10+ messages
2015-11-03  6:57 [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu Alexey Kardashevskiy
2015-11-03 14:39 ` Steven Rostedt
2015-11-06  2:17   ` Alexey Kardashevskiy
2015-11-18 19:13     ` Paul E. McKenney
2015-11-29 23:39       ` Paul Mackerras
2015-11-30 20:30         ` Paul E. McKenney
2015-12-06  2:19           ` Paul E. McKenney
2015-12-08  5:20             ` Paul Mackerras
2015-12-08  5:46               ` Paul E. McKenney
2015-12-22  7:08                 ` Alexey Kardashevskiy
