* [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup)
@ 2011-07-27 19:10 Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 1/8] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

Code review showed that the hierarchical order of the leaf
CFS runqueues introduced by commit 67e8625 ("sched: Introduce
hierarchal order on shares update list") is not yet achieved.
(See the description of the third patch for an example.)

During the discussion of possible solutions [1], Paul Turner
mentioned an "ideal approach" to solve that.

This is the second iteration of the patch-set that tries to
realize this "ideal" approach. 

The first iteration got some positive feedback from Paul E. McKenney
regarding its "advanced" RCU usage. His negative feedback -- missing
comments -- should now be addressed.

Changes since v1:
- rebased against v3.0
- included follow-up patches 4 to 8 (demonstrating the purpose of patch 1)
- patch 1 should be complete this time
- moved more functionality to rculist.h (see patch 2+3)
- more comments everywhere


Patch 1: After inventing __list_link(), I realized that this
	 function already existed, but with a different name.

	 This patch just does some renaming. Not really needed,
	 but if I use the old naming in patch 2 it's really 
	 hard to understand what's actually going on.

	 It also helps to increase the readability of the existing
	 code, see patches 6-8.

Patch 2: This introduces new list functions to splice RCU lists
	 and handle deleted RCU list entries.

Patch 3: The actual bugfix.

Patch 4+5: Follow-ups to patch 1. Some more renaming and use of
	   appropriate functions.

Patch 6: Another follow-up to patch 1, improving the readability of
	 the list routines a bit.

Patch 7+8: Follow-ups to patch 2. Make use of the introduced
           functionality in the already existing code.


Comments?

Regards
Jan


[1] http://lkml.org/lkml/2011/7/18/86



Jan H. Schönherr (8):
  list, treewide: Rename __list_del() to __list_link()
  rcu: More rcu-variants for list manipulation
  sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  list, treewide: Rename __list_del_entry() to __list_del()
  treewide: Use __list_del() instead of __list_link()
  list: Make use of __list_link()
  rcu: Make use of __list_link() and __list_link_rcu()
  rcu: Rewrite and rename list_splice_init_rcu()

 drivers/char/ipmi/ipmi_msghandler.c  |    2 +-
 drivers/firewire/core-topology.c     |    2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c |    4 +-
 fs/btrfs/volumes.c                   |    2 +-
 include/linux/list.h                 |   80 ++++++++----------
 include/linux/rculist.h              |  155 +++++++++++++++++++++++++++++----
 kernel/mutex.h                       |    2 +-
 kernel/sched_fair.c                  |   93 ++++++++++++++++-----
 kernel/timer.c                       |    2 +-
 lib/list_debug.c                     |    8 +-
 net/ipv4/cipso_ipv4.c                |    2 +-
 11 files changed, 257 insertions(+), 95 deletions(-)

-- 
1.7.6


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 1/8] list, treewide: Rename __list_del() to __list_link()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation Jan H. Schönherr
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

The purpose of __list_del() is to make two list entries point to each
other. It does not actually remove anything; it can be used to delete an
entry, as is done within __list_del_entry().

This commit renames __list_del() to __list_link() as this new name better
reflects what the function is actually doing.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 drivers/firewire/core-topology.c     |    2 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c |    4 ++--
 include/linux/list.h                 |   11 +++++------
 include/linux/rculist.h              |    2 +-
 kernel/mutex.h                       |    2 +-
 kernel/timer.c                       |    2 +-
 lib/list_debug.c                     |    2 +-
 net/ipv4/cipso_ipv4.c                |    2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/firewire/core-topology.c b/drivers/firewire/core-topology.c
index 193ed92..b7cd1db 100644
--- a/drivers/firewire/core-topology.c
+++ b/drivers/firewire/core-topology.c
@@ -290,7 +290,7 @@ static struct fw_node *build_tree(struct fw_card *card,
 		}
 
 		/* Pop the child nodes off the stack and push the new node. */
-		__list_del(h->prev, &stack);
+		__list_link(h->prev, &stack);
 		list_add_tail(&node->link, &stack);
 		stack_depth += 1 - child_port_count;
 
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index d948575..fd4443a 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -331,7 +331,7 @@ restart:
 		/* We can only remove NUM_PAGES_TO_ALLOC at a time. */
 		if (freed_pages >= NUM_PAGES_TO_ALLOC) {
 			/* remove range of pages from the pool */
-			__list_del(p->lru.prev, &pool->list);
+			__list_link(p->lru.prev, &pool->list);
 
 			ttm_pool_update_free_locked(pool, freed_pages);
 			/**
@@ -366,7 +366,7 @@ restart:
 
 	/* remove range of pages from the pool */
 	if (freed_pages) {
-		__list_del(&p->lru, &pool->list);
+		__list_link(&p->lru, &pool->list);
 
 		ttm_pool_update_free_locked(pool, freed_pages);
 		nr_free -= freed_pages;
diff --git a/include/linux/list.h b/include/linux/list.h
index cc6d2aa..9d093cc 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -77,13 +77,12 @@ static inline void list_add_tail(struct list_head *new, struct list_head *head)
 }
 
 /*
- * Delete a list entry by making the prev/next entries
+ * Link two list entries by making the prev/next entries
  * point to each other.
  *
- * This is only for internal list manipulation where we know
- * the prev/next entries already!
+ * This is only for internal list manipulation.
  */
-static inline void __list_del(struct list_head * prev, struct list_head * next)
+static inline void __list_link(struct list_head *prev, struct list_head *next)
 {
 	next->prev = prev;
 	prev->next = next;
@@ -98,12 +97,12 @@ static inline void __list_del(struct list_head * prev, struct list_head * next)
 #ifndef CONFIG_DEBUG_LIST
 static inline void __list_del_entry(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 }
 
 static inline void list_del(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 	entry->next = LIST_POISON1;
 	entry->prev = LIST_POISON2;
 }
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index e3beb31..748401b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -108,7 +108,7 @@ static inline void list_add_tail_rcu(struct list_head *new,
  */
 static inline void list_del_rcu(struct list_head *entry)
 {
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
diff --git a/kernel/mutex.h b/kernel/mutex.h
index 4115fbf..bcac8d1 100644
--- a/kernel/mutex.h
+++ b/kernel/mutex.h
@@ -14,7 +14,7 @@
 #define spin_unlock_mutex(lock, flags) \
 		do { spin_unlock(lock); (void)(flags); } while (0)
 #define mutex_remove_waiter(lock, waiter, ti) \
-		__list_del((waiter)->list.prev, (waiter)->list.next)
+		__list_link((waiter)->list.prev, (waiter)->list.next)
 
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
diff --git a/kernel/timer.c b/kernel/timer.c
index 8cff361..7ecae26 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -615,7 +615,7 @@ static inline void detach_timer(struct timer_list *timer,
 
 	debug_deactivate(timer);
 
-	__list_del(entry->prev, entry->next);
+	__list_link(entry->prev, entry->next);
 	if (clear_pending)
 		entry->next = NULL;
 	entry->prev = LIST_POISON2;
diff --git a/lib/list_debug.c b/lib/list_debug.c
index b8029a5..7e37695 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -56,7 +56,7 @@ void __list_del_entry(struct list_head *entry)
 		"but was %p\n", entry, next->prev))
 		return;
 
-	__list_del(prev, next);
+	__list_link(prev, next);
 }
 EXPORT_SYMBOL(__list_del_entry);
 
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 2b3c23c..5803807 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -348,7 +348,7 @@ static int cipso_v4_cache_check(const unsigned char *key,
 			if (entry->activity > prev_entry->activity &&
 			    entry->activity - prev_entry->activity >
 			    CIPSO_V4_CACHE_REORDERLIMIT) {
-				__list_del(entry->list.prev, entry->list.next);
+				__list_link(entry->list.prev, entry->list.next);
 				__list_add(&entry->list,
 					   prev_entry->list.prev,
 					   &prev_entry->list);
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 1/8] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-29  8:41   ` Jan Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit introduces some more *_rcu variants of list manipulation
functions:

* __list_link_rcu()

  RCU-aware counterpart to __list_link()

* __list_splice_rcu(), list_splice_tail_init_rcu()

  RCU-aware non-blocking splicing.

  The existing rcu-splice function allows the list to be spliced to have
  concurrent readers that are not from the destination list. Therefore,
  it has to block until these readers are off the list before the
  splicing can be done. The new functions forbid such readers and can
  thus work without blocking.

And a new function:

* list_add_tail_nobackref()

  This function is similar to list_add_tail(), with the difference that
  the references from the newly added element back to the list head are
  not maintained: neither new->next nor -- in case new becomes the first
  entry -- new->prev is changed. This is mainly useful when you are not
  allowed to overwrite them because the nodes logically belong to an
  RCU-protected list from which they were removed and there might still
  be readers on these nodes.

  As this function only makes sense with RCU-protected lists, it is
  added to rculist.h, although the function itself is completely unaware
  of RCU.
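
For illustration only (not part of this patch), a minimal usage sketch
combining the functions above. The names are made up; the requirements
a)-c) documented at list_add_tail_nobackref() must hold, in particular
neither entry's next pointer may lead to the other one:

static void readd_deleted_entries(struct list_head *a,
				  struct list_head *b,
				  struct list_head *head)
{
	LIST_HEAD(tmp);

	/*
	 * a and b were deleted earlier from the RCU-protected list head
	 * and may still have readers on them; their next pointers still
	 * lead back into that list.
	 */
	list_add_tail_nobackref(a, &tmp);
	list_add_tail_nobackref(b, &tmp);

	/* Publish both entries at the tail of head in one go. */
	list_splice_tail_init_rcu(&tmp, head);
}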

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
Changes since v1:
- added comments
- added list_add_tail_nobackref()
---
 include/linux/rculist.h |  122 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 122 insertions(+), 0 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 748401b..445c4f2 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -25,6 +25,29 @@
 #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
 
 /*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * The next pointer of @prev is assigned RCU-aware. So, this
+ * function is primarily intended to publish new nodes starting
+ * at @next within the RCU-protected list @prev.
+ *
+ * This is only for internal list manipulation, when everything
+ * else, i. e. a link from the nodes given by @next back to the
+ * list of @prev, is already set up.
+ *
+ * Concurrent readers are basically allowed, concurrent writers
+ * are forbidden.
+ */
+static inline void __list_link_rcu(struct list_head *prev,
+					struct list_head *next)
+{
+	rcu_assign_pointer(list_next_rcu(prev), next);
+	next->prev = prev;
+}
+
+
+/*
  * Insert a new entry between two known consecutive entries.
  *
  * This is only for internal list manipulation where we know
@@ -159,6 +182,105 @@ static inline void list_replace_rcu(struct list_head *old,
 }
 
 /**
+ * list_add_tail_nobackref - add element to list without setting up
+ *			     back references
+ * @new: element to add
+ * @head: list to add the element to
+ *
+ * @new is appended to @head.
+ *
+ * In contrast to list_add_tail(), this function does not maintain the
+ * references back to @head. So, neither @new->next will be changed, nor
+ * -- in case @new becomes the first entry of @head -- @new->prev.
+ *
+ * This is useful when reorganizing deleted elements of a RCU-protected
+ * list as concurrent readers must always find their way back to the list.
+ * When used in this context, special care must be taken in order to not
+ * disturb concurrent readers on @head (or @new):
+ * a) @new->next must lead back to the list
+ * b) @new->next must not lead to any node already on @head
+ * c) @new must be already known to existing readers on @head
+ *
+ * Point a) is the normal RCU invariant.
+ *
+ * Point b) prevents capturing readers in cycles.
+ *
+ * Point c) is necessary as this function itself is not RCU-aware: if a
+ * concurrent reader would reach an unknown node it might see uninitialized
+ * data in that node. Basically this requires that @new was not published
+ * for the first time after any of the nodes given by @head.
+ *
+ * Concurrent writers are forbidden.
+ */
+static inline void list_add_tail_nobackref(struct list_head *new,
+						struct list_head *head)
+{
+	if (list_empty(head)) {
+		/* If @head is empty, simply let it point to @new */
+		head->next = new;
+		head->prev = new;
+	} else {
+		/*
+		 * If @head has elements, append @new to the last element and
+		 * advance the tail pointer of @head.
+		 */
+		__list_link(head->prev, new);
+		head->prev = new;
+	}
+}
+
+/*
+ * Low level function for splicing.
+ *
+ * @prev and @next must point into the same RCU-protected list.
+ *
+ * All elements between (not including) @prev and @next are replaced
+ * by the nodes given by @list.
+ *
+ * It is safe to have concurrent readers of the @prev/@next list
+ * on any of the referenced nodes. There must be no reader not
+ * from @prev/@next within @list.
+ * There must be no concurrent writers on @list or @prev/@next.
+ *
+ * This is only for internal list manipulation. If elements of
+ * @prev/@next are deleted, their prev-pointers are not poisoned.
+ */
+static inline void __list_splice_rcu(struct list_head *list,
+					struct list_head *prev,
+					struct list_head *next)
+{
+	/* Prepare link back to @prev/@next */
+	__list_link(list->prev, next);
+
+	/*
+	 * Publish new nodes, implicitly deleting everything between
+	 * @prev and @next.
+	 */
+	__list_link_rcu(prev, list->next);
+}
+
+/**
+ * list_splice_tail_init_rcu - splice a list into a RCU-protected list
+ * @list: the list to be spliced
+ * @head: the RCU-protected list to splice into
+ *
+ * The contents of @list are inserted before @head. After completion
+ * @list is empty.
+ *
+ * It is safe to have concurrent readers of @head. There must be no
+ * concurrent readers not from @head on @list.
+ * There must be no concurrent writers on @list or @head.
+ */
+static inline void list_splice_tail_init_rcu(struct list_head *list,
+						struct list_head *head)
+{
+	if (!list_empty(list)) {
+		__list_splice_rcu(list, head->prev, head);
+		INIT_LIST_HEAD(list);
+	}
+}
+
+/**
  * list_splice_init_rcu - splice an RCU-protected list into an existing list.
  * @list:	the RCU-protected list to splice
  * @head:	the place in the list to splice the first list into
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 1/8] list, treewide: Rename __list_del() to __list_link() Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-08-02 13:50   ` Peter Zijlstra
  2011-07-27 19:10 ` [PATCH RFCv2 4/8] list, treewide: Rename __list_del_entry() to __list_del() Jan H. Schönherr
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

In the case of an on_list ancestor we may incorrectly place the child to
the right of a great-ancestor on the list.

Consider:

      A
     / \     Here, t1A results in A->cfs_rq being on_list, however when
    B  t1A   we start enqueuing from C this will not be visible.  This is
   /         compounded by the fact that on_list expiration may also be out
  C          of order, punching holes in the tree.
 /
t1C

Prevent this by making additions to the leaf_cfs_rq_list
position-independent.

This is done by collecting additions to this list within the
enqueue_task_fair() path until we reach the top or find an on_list
ancestor. All additions are then spliced into the leaf_cfs_rq_list at
once.

The additions cannot be collected in a list by the normal means, as
that would violate the RCU guarantee that the next pointer of an
element always leads back to the list as long as there could be
concurrent readers.  However, we are still allowed to modify the next
pointer of deleted entries as long as we make sure that they eventually
lead back to the list. That is, if we have to collect more than one
entry in a row, it is legal to set the next pointer of the first
collected entry to the next one. (We can do this even without using
*_rcu functions, as there are no new elements on that path.)

Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

---
Changes since v1:
- moved some more code to include/linux/rculist.h
- added comment regarding RCU
---
 kernel/sched_fair.c |   93 ++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c768588..1487649 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -143,26 +143,75 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
-	if (!cfs_rq->on_list) {
-		/*
-		 * Ensure we either appear before our parent (if already
-		 * enqueued) or force our parent to appear after us when it is
-		 * enqueued.  The fact that we always enqueue bottom-up
-		 * reduces this to two cases.
-		 */
-		if (cfs_rq->tg->parent &&
-		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
-			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		} else {
-			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
-				&rq_of(cfs_rq)->leaf_cfs_rq_list);
-		}
+	if (cfs_rq->on_list)
+		return;
+
+	/*
+	 * Carefully collect leaf-cfs entries.
+	 *
+	 * There are basically two cases:
+	 * 1. cfs_rq has not yet been on the leaf-list.
+	 * 2. cfs_rq has been deleted previously from the leaf_list.
+	 *
+	 * In case 2, we might still have concurrent readers.
+	 *
+	 * Therefore, the requirements of list_add_tail_nobackref() are
+	 * fulfilled:
+	 *
+	 * a) If there are concurrent readers, ->next must lead back to the
+	 *    list.
+	 *
+	 *    We can only have readers after case 2. After case 2, only case 2
+	 *    can follow. The next pointers of case 2 nodes always lead back to
+	 *    the list.
+	 *
+	 * b) If there are concurrent readers, ->next must not lead to any
+	 *    already collected node.
+	 *
+	 *    As we collect nodes always bottom-up, all already collected nodes
+	 *    must be below this node in the task group hierarchy.  The
+	 *    ordering constraint of the leaf list guarantees that next
+	 *    pointers only lead to nodes further up in the hierarchy (or to
+	 *    unrelated nodes). Neither deleting nodes nor the manipulations
+	 *    done here change that. Thus, we cannot reach an already collected
+	 *    node.
+	 *
+	 * c) If there are concurrent readers, they must already know this
+	 *    node.
+	 *
+	 *    If we have to add case 1 nodes, they are collected in the
+	 *    beginning and cannot be reached by readers until they are
+	 *    spliced. Furthermore, after they are spliced, we will not
+	 *    encounter more case 1 nodes higher up in the task group
+	 *    hierarchy. For this reason any reader on an earlier collected
+	 *    case 2 node must know all nodes that we collect later.
+	 */
+	list_add_tail_nobackref(&cfs_rq->leaf_cfs_rq_list, leaf_cfs_rqs);
 
-		cfs_rq->on_list = 1;
+	/*
+	 * If our parent is on_list or if there is no parent, then splice all
+	 * entries collected so far at the correct position into the
+	 * leaf_cfs_rq_list.
+	 *
+	 * If our parent is not on the list, it will be collected during the
+	 * next call to this function.
+	 */
+	if (cfs_rq->tg->parent) {
+		int cpu = cpu_of(rq_of(cfs_rq));
+		struct cfs_rq *parent_cfs_rq = cfs_rq->tg->parent->cfs_rq[cpu];
+		if (parent_cfs_rq->on_list) {
+			list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&parent_cfs_rq->leaf_cfs_rq_list);
+		}
+	} else {
+		list_splice_tail_init_rcu(leaf_cfs_rqs,
+					&rq_of(cfs_rq)->leaf_cfs_rq_list);
 	}
+
+	cfs_rq->on_list = 1;
 }
 
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
@@ -276,7 +325,8 @@ static inline struct cfs_rq *cpu_cfs_rq(struct cfs_rq *cfs_rq, int this_cpu)
 	return &cpu_rq(this_cpu)->cfs;
 }
 
-static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq,
+					struct list_head *leaf_cfs_rqs)
 {
 }
 
@@ -998,8 +1048,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
 
-	if (cfs_rq->nr_running == 1)
-		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies_last(struct sched_entity *se)
@@ -1326,12 +1374,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct list_head leaf_cfs_rqs;
+
+	INIT_LIST_HEAD(&leaf_cfs_rqs);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		if (cfs_rq->nr_running == 1)
+			list_add_leaf_cfs_rq(cfs_rq, &leaf_cfs_rqs);
 		flags = ENQUEUE_WAKEUP;
 	}
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 4/8] list, treewide: Rename __list_del_entry() to __list_del()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (2 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 5/8] treewide: Use __list_del() instead of __list_link() Jan H. Schönherr
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Now that the name __list_del has been freed, we can use this more
appropriate name for __list_del_entry(), making it a bit more
consistent with __list_add().

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 include/linux/list.h |   10 +++++-----
 lib/list_debug.c     |    6 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 9d093cc..5319099 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -95,7 +95,7 @@ static inline void __list_link(struct list_head *prev, struct list_head *next)
  * in an undefined state.
  */
 #ifndef CONFIG_DEBUG_LIST
-static inline void __list_del_entry(struct list_head *entry)
+static inline void __list_del(struct list_head *entry)
 {
 	__list_link(entry->prev, entry->next);
 }
@@ -107,7 +107,7 @@ static inline void list_del(struct list_head *entry)
 	entry->prev = LIST_POISON2;
 }
 #else
-extern void __list_del_entry(struct list_head *entry);
+extern void __list_del(struct list_head *entry);
 extern void list_del(struct list_head *entry);
 #endif
 
@@ -140,7 +140,7 @@ static inline void list_replace_init(struct list_head *old,
  */
 static inline void list_del_init(struct list_head *entry)
 {
-	__list_del_entry(entry);
+	__list_del(entry);
 	INIT_LIST_HEAD(entry);
 }
 
@@ -151,7 +151,7 @@ static inline void list_del_init(struct list_head *entry)
  */
 static inline void list_move(struct list_head *list, struct list_head *head)
 {
-	__list_del_entry(list);
+	__list_del(list);
 	list_add(list, head);
 }
 
@@ -163,7 +163,7 @@ static inline void list_move(struct list_head *list, struct list_head *head)
 static inline void list_move_tail(struct list_head *list,
 				  struct list_head *head)
 {
-	__list_del_entry(list);
+	__list_del(list);
 	list_add_tail(list, head);
 }
 
diff --git a/lib/list_debug.c b/lib/list_debug.c
index 7e37695..9ca2d3a 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -35,7 +35,7 @@ void __list_add(struct list_head *new,
 }
 EXPORT_SYMBOL(__list_add);
 
-void __list_del_entry(struct list_head *entry)
+void __list_del(struct list_head *entry)
 {
 	struct list_head *prev, *next;
 
@@ -58,7 +58,7 @@ void __list_del_entry(struct list_head *entry)
 
 	__list_link(prev, next);
 }
-EXPORT_SYMBOL(__list_del_entry);
+EXPORT_SYMBOL(__list_del);
 
 /**
  * list_del - deletes entry from list.
@@ -68,7 +68,7 @@ EXPORT_SYMBOL(__list_del_entry);
  */
 void list_del(struct list_head *entry)
 {
-	__list_del_entry(entry);
+	__list_del(entry);
 	entry->next = LIST_POISON1;
 	entry->prev = LIST_POISON2;
 }
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 5/8] treewide: Use __list_del() instead of __list_link()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (3 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 4/8] list, treewide: Rename __list_del_entry() to __list_del() Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 6/8] list: Make use " Jan H. Schönherr
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

There are a few cases where __list_link() is actually used
to delete a single list entry like this:

__list_link(entry->prev, entry->next);

For this case, we have the function __list_del() which
does exactly the same, but states the purpose more clearly:

__list_del(entry);

This commit changes these occurrences.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 include/linux/list.h  |    2 +-
 kernel/mutex.h        |    2 +-
 kernel/timer.c        |    2 +-
 net/ipv4/cipso_ipv4.c |    2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 5319099..86a5f3f 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -102,7 +102,7 @@ static inline void __list_del(struct list_head *entry)
 
 static inline void list_del(struct list_head *entry)
 {
-	__list_link(entry->prev, entry->next);
+	__list_del(entry);
 	entry->next = LIST_POISON1;
 	entry->prev = LIST_POISON2;
 }
diff --git a/kernel/mutex.h b/kernel/mutex.h
index bcac8d1..39c02e8 100644
--- a/kernel/mutex.h
+++ b/kernel/mutex.h
@@ -14,7 +14,7 @@
 #define spin_unlock_mutex(lock, flags) \
 		do { spin_unlock(lock); (void)(flags); } while (0)
 #define mutex_remove_waiter(lock, waiter, ti) \
-		__list_link((waiter)->list.prev, (waiter)->list.next)
+		__list_del(&(waiter)->list)
 
 #ifdef CONFIG_SMP
 static inline void mutex_set_owner(struct mutex *lock)
diff --git a/kernel/timer.c b/kernel/timer.c
index 7ecae26..867a7ee 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -615,7 +615,7 @@ static inline void detach_timer(struct timer_list *timer,
 
 	debug_deactivate(timer);
 
-	__list_link(entry->prev, entry->next);
+	__list_del(entry);
 	if (clear_pending)
 		entry->next = NULL;
 	entry->prev = LIST_POISON2;
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 5803807..552603f 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -348,7 +348,7 @@ static int cipso_v4_cache_check(const unsigned char *key,
 			if (entry->activity > prev_entry->activity &&
 			    entry->activity - prev_entry->activity >
 			    CIPSO_V4_CACHE_REORDERLIMIT) {
-				__list_link(entry->list.prev, entry->list.next);
+				__list_del(&entry->list);
 				__list_add(&entry->list,
 					   prev_entry->list.prev,
 					   &prev_entry->list);
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 6/8] list: Make use of __list_link()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (4 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 5/8] treewide: Use __list_del() instead of __list_link() Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 7/8] rcu: Make use of __list_link() and __list_link_rcu() Jan H. Schönherr
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit improves the readability of the code by rewriting the
list primitives so that they make use of __list_link() instead of
modifying the prev/next pointers directly.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 include/linux/list.h |   65 ++++++++++++++++++++++---------------------------
 1 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 86a5f3f..ab69007 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -21,10 +21,24 @@
 #define LIST_HEAD(name) \
 	struct list_head name = LIST_HEAD_INIT(name)
 
+/*
+ * Link two list entries by making the prev/next entries
+ * point to each other.
+ *
+ * This is only for internal list manipulation.
+ */
+static inline void __list_link(struct list_head *prev, struct list_head *next)
+{
+	next->prev = prev;
+	prev->next = next;
+}
+
+/*
+ * Initialize a list head.
+ */
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
-	list->next = list;
-	list->prev = list;
+	__list_link(list, list);
 }
 
 /*
@@ -38,10 +52,8 @@ static inline void __list_add(struct list_head *new,
 			      struct list_head *prev,
 			      struct list_head *next)
 {
-	next->prev = new;
-	new->next = next;
-	new->prev = prev;
-	prev->next = new;
+	__list_link(new, next);
+	__list_link(prev, new);
 }
 #else
 extern void __list_add(struct list_head *new,
@@ -76,18 +88,6 @@ static inline void list_add_tail(struct list_head *new, struct list_head *head)
 	__list_add(new, head->prev, head);
 }
 
-/*
- * Link two list entries by making the prev/next entries
- * point to each other.
- *
- * This is only for internal list manipulation.
- */
-static inline void __list_link(struct list_head *prev, struct list_head *next)
-{
-	next->prev = prev;
-	prev->next = next;
-}
-
 /**
  * list_del - deletes entry from list.
  * @entry: the element to delete from the list.
@@ -121,10 +121,8 @@ extern void list_del(struct list_head *entry);
 static inline void list_replace(struct list_head *old,
 				struct list_head *new)
 {
-	new->next = old->next;
-	new->next->prev = new;
-	new->prev = old->prev;
-	new->prev->next = new;
+	__list_link(new, old->next);
+	__list_link(old->prev, new);
 }
 
 static inline void list_replace_init(struct list_head *old,
@@ -233,12 +231,13 @@ static inline void __list_cut_position(struct list_head *list,
 		struct list_head *head, struct list_head *entry)
 {
 	struct list_head *new_first = entry->next;
-	list->next = head->next;
-	list->next->prev = list;
-	list->prev = entry;
-	entry->next = list;
-	head->next = new_first;
-	new_first->prev = head;
+
+	/* Setup "list" */
+	__list_link(list, head->next);
+	__list_link(entry, list);
+
+	/* Delete from "head" */
+	__list_link(head, new_first);
 }
 
 /**
@@ -273,14 +272,8 @@ static inline void __list_splice(const struct list_head *list,
 				 struct list_head *prev,
 				 struct list_head *next)
 {
-	struct list_head *first = list->next;
-	struct list_head *last = list->prev;
-
-	first->prev = prev;
-	prev->next = first;
-
-	last->next = next;
-	next->prev = last;
+	__list_link(prev, list->next);
+	__list_link(list->prev, next);
 }
 
 /**
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 7/8] rcu: Make use of __list_link() and __list_link_rcu()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (5 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 6/8] list: Make use " Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-07-27 19:10 ` [PATCH RFCv2 8/8] rcu: Rewrite and rename list_splice_init_rcu() Jan H. Schönherr
  2011-08-03 21:05 ` [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan Schönherr
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

This commit rewrites __list_add_rcu() and list_replace_rcu() to make
use of the new __list_link*() functions, improving the readability
of the code.

Compared to the previous code, the assignments of the prev pointers and
the call to rcu_assign_pointer() are reordered. This is not a problem,
however, as readers are forbidden to use the prev pointers.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 include/linux/rculist.h |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 445c4f2..c5db505 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -56,10 +56,8 @@ static inline void __list_link_rcu(struct list_head *prev,
 static inline void __list_add_rcu(struct list_head *new,
 		struct list_head *prev, struct list_head *next)
 {
-	new->next = next;
-	new->prev = prev;
-	rcu_assign_pointer(list_next_rcu(prev), new);
-	next->prev = new;
+	__list_link(new, next);
+	__list_link_rcu(prev, new);
 }
 
 /**
@@ -174,10 +172,8 @@ static inline void hlist_del_init_rcu(struct hlist_node *n)
 static inline void list_replace_rcu(struct list_head *old,
 				struct list_head *new)
 {
-	new->next = old->next;
-	new->prev = old->prev;
-	rcu_assign_pointer(list_next_rcu(new->prev), new);
-	new->next->prev = new;
+	__list_link(new, old->next);
+	__list_link_rcu(old->prev, new);
 	old->prev = LIST_POISON2;
 }
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH RFCv2 8/8] rcu: Rewrite and rename list_splice_init_rcu()
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (6 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 7/8] rcu: Make use of __list_link() and __list_link_rcu() Jan H. Schönherr
@ 2011-07-27 19:10 ` Jan H. Schönherr
  2011-08-03 21:05 ` [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan Schönherr
  8 siblings, 0 replies; 15+ messages in thread
From: Jan H. Schönherr @ 2011-07-27 19:10 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel,
	Jan H. Schönherr

From: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>

Rewrite list_splice_init_rcu() to make use of the newly introduced
__list_splice_rcu(). Also, add "_sync" to the name, so that we can
differentiate whether the list to be spliced is allowed to have its own
concurrent readers: with "_sync", readers not belonging to the
destination list are allowed and the function will block; without
"_sync", readers not belonging to the destination list are forbidden and
the function will not block.
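
For illustration only (not part of this patch), the intended usage
difference. The first call mirrors the splice from patch 3, the second
one the ipmi call site below; "collected" is a made-up private list and
the writer-side lock is assumed to be held in both cases:

/* Readers only come from the destination list: non-blocking variant. */
list_splice_tail_init_rcu(&collected, &rq->leaf_cfs_rq_list);

/* The spliced list may still have its own readers: blocking variant. */
list_splice_init_rcu_sync(&intf->cmd_rcvrs, &list, synchronize_rcu);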

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
---
 drivers/char/ipmi/ipmi_msghandler.c |    2 +-
 fs/btrfs/volumes.c                  |    2 +-
 include/linux/rculist.h             |   19 +++++++++----------
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 58c0e63..62b48e8 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -502,7 +502,7 @@ static void clean_up_interface_data(ipmi_smi_t intf)
 	 */
 	mutex_lock(&intf->cmd_rcvrs_mutex);
 	INIT_LIST_HEAD(&list);
-	list_splice_init_rcu(&intf->cmd_rcvrs, &list, synchronize_rcu);
+	list_splice_init_rcu_sync(&intf->cmd_rcvrs, &list, synchronize_rcu);
 	mutex_unlock(&intf->cmd_rcvrs_mutex);
 
 	list_for_each_entry_safe(rcvr, rcvr2, &list, link)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19450bc..e1d5149 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1440,7 +1440,7 @@ static int btrfs_prepare_sprout(struct btrfs_trans_handle *trans,
 	mutex_init(&seed_devices->device_list_mutex);
 
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
-	list_splice_init_rcu(&fs_devices->devices, &seed_devices->devices,
+	list_splice_init_rcu_sync(&fs_devices->devices, &seed_devices->devices,
 			      synchronize_rcu);
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index c5db505..b695e88 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -277,7 +277,8 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 }
 
 /**
- * list_splice_init_rcu - splice an RCU-protected list into an existing list.
+ * list_splice_init_rcu_sync - splice an RCU-protected list into an existing
+ *			       RCU-protected list.
  * @list:	the RCU-protected list to splice
  * @head:	the place in the list to splice the first list into
  * @sync:	function to sync: synchronize_rcu(), synchronize_sched(), ...
@@ -293,18 +294,19 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
  *	based on call_rcu() could be created.  But only if -really-
  *	needed -- there is no shortage of RCU API members.
  */
-static inline void list_splice_init_rcu(struct list_head *list,
+static inline void list_splice_init_rcu_sync(struct list_head *list,
 					struct list_head *head,
 					void (*sync)(void))
 {
-	struct list_head *first = list->next;
-	struct list_head *last = list->prev;
-	struct list_head *at = head->next;
+	struct list_head tmp;
 
 	if (list_empty(head))
 		return;
 
-	/* "first" and "last" tracking list, so initialize it. */
+	tmp.next = list->next;
+	tmp.prev = list->prev;
+
+	/* "tmp" is tracking "list", so we can safely initialize "list". */
 
 	INIT_LIST_HEAD(list);
 
@@ -325,10 +327,7 @@ static inline void list_splice_init_rcu(struct list_head *list,
 	 * this function.
 	 */
 
-	last->next = at;
-	rcu_assign_pointer(list_next_rcu(head), first);
-	first->prev = head;
-	at->prev = last;
+	__list_splice_rcu(&tmp, head, head->next);
 }
 
 /**
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation
  2011-07-27 19:10 ` [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation Jan H. Schönherr
@ 2011-07-29  8:41   ` Jan Schönherr
  2011-08-02 13:39     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Schönherr @ 2011-07-29  8:41 UTC (permalink / raw)
  To: Paul Turner, Paul E. McKenney
  Cc: Ingo Molnar, Peter Zijlstra, Dipankar Sarma, linux-kernel

On 27.07.2011 21:10, Jan H. Schönherr wrote:
>  /**
> + * list_add_tail_nobackref - add element to list without setting up
> + *			     back references
> + * @new: element to add
> + * @head: list to add the element to
> + *
> + * @new is appended to @head.
> + *
> + * In contrast to list_add_tail(), this function does not maintain the
> + * references back to @head. So, neither @new->next will be changed, nor
> + * -- in case @new becomes the first entry of @head -- @new->prev.
> + *
> + * This is useful when reorganizing deleted elements of a RCU-protected
> + * list as concurrent readers must always find their way back to the list.
> + * When used in this context, special care must be taken in order to not
> + * disturb concurrent readers on @head (or @new):
> + * a) @new->next must lead back to the list
> + * b) @new->next must not lead to any node already on @head
> + * c) @new must be already known to existing readers on @head

This might actually race with physical deletion. Consider a list:

   HEAD --> A --> B --> C --> D --> HEAD

1. Remove C from list, then D:

   HEAD --> A --> B --------------> HEAD
                        C --> D ----^

2. Attempt to physically delete D by using call_rcu() or similar.
   This is delayed until all already running readers have finished.

3. Let a new reader begin to traverse the list, advancing until A.

4. Remove A.

   HEAD --------> B --------------> HEAD
            A ----^      C --> D ----^

5. Call list_add_tail_nobackref() for A, then C.

   HEAD --------> B --------------> HEAD
   LIST --> A ---------> C --> D ----^

6. Finish step 2, really deleting D.

7. Let the new reader continue its traversal.

This reader will hit D, which is obviously bad.


I think we can avoid this by modifying the next pointers
of the deleted elements and letting them point to HEAD.
That is:

5a. Call list_add_tail_nobackref() for A.

   HEAD --------> B --------------> HEAD
                        C --> D ----^
   LIST --> A -----------------------^

5b. Call list_add_tail_nobackref() for C.

   HEAD --------> B --------------> HEAD
                              D ----^
   LIST --> A --------> C -----------^
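
A rough sketch of how such a collection step could look (the function
name is invented and not part of this series; RCU-aware assignments are
used when publishing, so a reader on the old tail never sees a node
whose next pointer does not yet lead back to the list):

static inline void list_collect_tail(struct list_head *entry,
				     struct list_head *collect,
				     struct list_head *head)
{
	/* Let entry lead straight back to the list it was deleted from. */
	rcu_assign_pointer(list_next_rcu(entry), head);

	if (list_empty(collect)) {
		/* Start the collection with entry. */
		collect->next = entry;
	} else {
		/* Publish entry to readers stuck on the old tail. */
		entry->prev = collect->prev;
		rcu_assign_pointer(list_next_rcu(collect->prev), entry);
	}
	collect->prev = entry;
}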



However, this (and also the previous version) might
cause readers to skip elements that are on the list
(B in the example above). Can we tolerate that?

My current guess would be: no.

Thinking more about that, we already have that case:


   HEAD --> A --> B --> C --> HEAD

1. Let a reader traverse until A.

2. Remove A.

   HEAD --------> B --> C --> HEAD
            A ----^

3. Add A after B.

   HEAD --------> B     C --> HEAD
                  +> A -^

4. Let reader continue traversal, skipping B.

(Similarly, if we had removed C and re-added it between A and B,
we could have traversed B twice.)

So either the presented solution in this patch series
is valid, or -- when we are not allowed to re-add
elements which might still have readers -- the current
handling of CFS leaf runqueues has an additional problem
besides mixing up the order sometimes...

Paul? [Both of you, actually. :)]

Regards
Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 2/8] rcu: More rcu-variants for list manipulation
  2011-07-29  8:41   ` Jan Schönherr
@ 2011-08-02 13:39     ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2011-08-02 13:39 UTC (permalink / raw)
  To: Jan Schönherr
  Cc: Paul Turner, Paul E. McKenney, Ingo Molnar, Dipankar Sarma, linux-kernel

On Fri, 2011-07-29 at 10:41 +0200, Jan Schönherr wrote:
> On 27.07.2011 21:10, Jan H. Schönherr wrote:
> >  /**
> > + * list_add_tail_nobackref - add element to list without setting up
> > + *			     back references
> > + * @new: element to add
> > + * @head: list to add the element to
> > + *
> > + * @new is appended to @head.
> > + *
> > + * In contrast to list_add_tail(), this function does not maintain the
> > + * references back to @head. So, neither @new->next will be changed, nor
> > + * -- in case @new becomes the first entry of @head -- @new->prev.
> > + *
> > + * This is useful when reorganizing deleted elements of a RCU-protected
> > + * list as concurrent readers must always find their way back to the list.
> > + * When used in this context, special care must be taken in order to not
> > + * disturb concurrent readers on @head (or @new):
> > + * a) @new->next must lead back to the list
> > + * b) @new->next must not lead to any node already on @head
> > + * c) @new must be already known to existing readers on @head
> 
> This might actually race with physical deletion. Consider a list:
> 
>    HEAD --> A --> B --> C --> D --> HEAD
> 
> 1. Remove C from list, then D:
> 
>    HEAD --> A --> B --------------> HEAD
>                         C --> D ----^
> 
> 2. Attempt to physically delete D by using call_rcu() or similar.
>    This is delayed until all already running readers have finished.
> 
> 3. Let a new reader begin to traverse the list, advancing until A.
> 
> 4. Remove A.
> 
>    HEAD --------> B --------------> HEAD
>             A ----^      C --> D ----^
> 
> 5. Call list_add_tail_nobackref() for A, then C.
> 
>    HEAD --------> B --------------> HEAD
>    LIST --> A ---------> C --> D ----^
> 
> 6. Finish step 2, really deleting D.
> 
> 7. Let the new reader continue its traversal.
> 
> This reader will hit D, which is obviously bad.
> 
> 
> I think, we can avoid this by modifying the next pointers
> of the deleted elements and letting them point to HEAD.
> That is:
> 
> 5a. Call list_add_tail_nobackref() for A.
> 
>    HEAD --------> B --------------> HEAD
>                         C --> D ----^
>    LIST --> A -----------------------^
> 
> 5b. Call list_add_tail_nobackref() for C.
> 
>    HEAD --------> B --------------> HEAD
>                               D ----^
>    LIST --> A --------> C -----------^
> 
> 
> 
> However, this (and also the previous version) might
> cause readers to skip elements that are on the list
> (B in the example above). Can we tolerate that?
> 
> My current guess would be: no.
> 
> Thinking more about that, we already have that case:
> 
> 
>    HEAD --> A --> B --> C --> HEAD
> 
> 1. Let a reader traverse until A.
> 
> 2. Remove A.
> 
>    HEAD --------> B --> C --> HEAD
>             A ----^
> 
> 3. Add A after B.
> 
>    HEAD --------> B     C --> HEAD
>                   +> A -^
> 
> 4. Let reader continue traversal, skipping B.
> 
> (Similarly, if we had removed C and re-added it between A and B,
> we could have traversed B twice.)
> 
> So either the presented solution in this patch series
> is valid, or -- when we are not allowed to re-add
> elements which might still have readers -- the current
> handling of CFS leaf runqueues has an additional problem
> besides mixing up the order sometimes...
> 
> Paul? [Both of you, actually. :)]

*head still reels a little*

Right, so traditional wisdom dictates not re-using entries that might
possibly still be in use. I remember us having had a lot of 'fun' with
that in kernel/smp.c..

See commit 6dc19899958e420a931274b94019e267e2396d3e and the follow ups
from Milton Miller.

Our use-case makes this hard because we need to re-insert the entry as
soon as a task wakes up again.. ho humm..

I don't think we really mind missing an entry or visiting a few twice,
as long as there's a solid termination condition for each iteration. It's
load-balancing after all; if we mess up, we'll fix it up next time
round ;-)



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-07-27 19:10 ` [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq() Jan H. Schönherr
@ 2011-08-02 13:50   ` Peter Zijlstra
  2011-08-03 20:44     ` Jan Schönherr
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2011-08-02 13:50 UTC (permalink / raw)
  To: Jan H. Schönherr
  Cc: Ingo Molnar, Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel

On Wed, 2011-07-27 at 21:10 +0200, Jan H. Schönherr wrote:
> +       /*
> +        * Carefully collect leaf-cfs entries.
> +        *
> +        * There are basically two cases:
> +        * 1. cfs_rq has not yet been on the leaf-list.
> +        * 2. cfs_rq has been deleted previously from the leaf_list.
> +        *
> +        * In case 2, we might still have concurrent readers.
> +        *
> +        * Therefore, the requirements of list_add_tail_nobackref() are
> +        * fulfilled:
> +        *
> +        * a) If there are concurrent readers, ->next must lead back to the
> +        *    list.
> +        *
> +        *    We can only have readers after case 2. After case 2, only case 2
> +        *    can follow. The next pointers of case 2 nodes always lead back to
> +        *    the list.
> +        *
> +        * b) If there are concurrent readers, ->next must not lead to any
> +        *    already collected node.
> +        *
> +        *    As we collect nodes always bottom-up, all already collected nodes
> +        *    must be below this node in the task group hierarchy.  The
> +        *    ordering constraint of the leaf list guarantees that next
> +        *    pointers only lead to nodes further up in the hierarchy (or to
> +        *    unrelated nodes). Neither deleting nodes nor the manipulations
> +        *    done here change that. Thus, we cannot reach an already collected
> +        *    node.
> +        *
> +        * c) If there are concurrent readers, they must already know this
> +        *    node.
> +        *
> +        *    If we have to add case 1 nodes, they are collected in the
> +        *    beginning and cannot be reached by readers until they are
> +        *    spliced. Furthermore, after they are spliced, we will not
> +        *    encounter more case 1 nodes higher up in the task group
> +        *    hierarchy. For this reason any reader on an earlier collected
> +        *    case 2 node must know all nodes that we collect later.
> +        */
> +       list_add_tail_nobackref(&cfs_rq->leaf_cfs_rq_list, leaf_cfs_rqs); 

I think there's an argument for not adding _nobackref and simply
open-coding the operation here. Could there possibly be another user
that wants this? 

Furthermore, since it's tricky as hell, every site would want a comment
like the above explaining exactly what and why, and when you put in that
much effort, you might as well write the list-op itself too.

Bit of a barrier to shooting your foot off or so..

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 3/8] sched: Handle on_list ancestor in list_add_leaf_cfs_rq()
  2011-08-02 13:50   ` Peter Zijlstra
@ 2011-08-03 20:44     ` Jan Schönherr
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Schönherr @ 2011-08-03 20:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel

On 02.08.2011 15:50, Peter Zijlstra wrote:
> On Wed, 2011-07-27 at 21:10 +0200, Jan H. Schönherr wrote:
[...]
>> +        * c) If there are concurrent readers, they must already know this
>> +        *    node.
>> +        *
>> +        *    If we have to add case 1 nodes, they are collected in the
>> +        *    beginning and cannot be reached by readers until they are
>> +        *    spliced. Furthermore, after they are spliced, we will not
>> +        *    encounter more case 1 nodes higher up in the task group
>> +        *    hierarchy. For this reason any reader on an earlier collected
>> +        *    case 2 node must know all nodes that we collect later.
>> +        */
>> +       list_add_tail_nobackref(&cfs_rq->leaf_cfs_rq_list, leaf_cfs_rqs); 
> 
> I think there's an argument for not adding _nobackref and simply
> open-coding the operation here. Could there possibly be another user
> that wants this? 
> 
> Furthermore, since its tricky like hell every site would want a comment
> like the above explaining exactly what and why, and when you put in that
> much effort, you might as well write the list-op itself too.

Will do.

However, when reassigning the next pointers of deleted nodes to
non-deleted nodes (e.g. the list head itself) as outlined in the other
mail, we'll have to use RCU-aware assignments to really prevent the
race with physical deletion. Condition c) quoted above will then be
unnecessary.

Regards
Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup)
  2011-07-27 19:10 [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan H. Schönherr
                   ` (7 preceding siblings ...)
  2011-07-27 19:10 ` [PATCH RFCv2 8/8] rcu: Rewrite and rename list_splice_init_rcu() Jan H. Schönherr
@ 2011-08-03 21:05 ` Jan Schönherr
  2011-08-03 21:35   ` Peter Zijlstra
  8 siblings, 1 reply; 15+ messages in thread
From: Jan Schönherr @ 2011-08-03 21:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel

Hi Peter,

On 27.07.2011 21:10, Jan H. Schönherr wrote:
> Patch 1: After inventing __list_link(), I realized that this
> 	 function already existed, but with a different name.
> 
> 	 This patch just does some renaming. Not really needed,
> 	 but if I use the old naming in patch 2 it's really 
> 	 hard to understand what's actually going on.
> 
> 	 It also helps to increase the readability of the existing
> 	 code, see patches 6-8.
> 
> Patch 2: This introduces new list functions to splice RCU lists
> 	 and handle deleted RCU list entries.
> 
> Patch 3: The actual bugfix.
> 
> Patch 4+5: Follow-ups to patch 1. Some more renaming and use of
> 	   appropriate functions.
> 
> Patch 6: Another follow-up to patch 1, improving the readability of
> 	 the list routines a bit.
> 
> Patch 7+8: Follow-ups to patch 2. Make use of the introduced
>            functionality in the already existing code.


I am wondering whether v3 should consist of basically only
patches 2 and 3, i.e. the minimal approach, or whether you would
take all of them.

If you prefer the minimal version, I would make another patch
set out of the other patches. But as there seems to be no official
maintainer for list-related functionality, I would appreciate
some hints on who I should put on the To/Cc list.

Regards
Jan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup)
  2011-08-03 21:05 ` [PATCH RFCv2 0/8] sched: Enforce order of leaf CFS runqueues (and list cleanup) Jan Schönherr
@ 2011-08-03 21:35   ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2011-08-03 21:35 UTC (permalink / raw)
  To: Jan Schönherr
  Cc: Ingo Molnar, Paul Turner, Paul E. McKenney, Dipankar Sarma, linux-kernel

On Wed, 2011-08-03 at 23:05 +0200, Jan Schönherr wrote:
> Hi Peter,
> 
> On 27.07.2011 21:10, Jan H. Schönherr wrote:
> > Patch 1: After inventing __list_link(), I realized that this
> > 	 function already existed, but with a different name.
> > 
> > 	 This patch just does some renaming. Not really needed,
> > 	 but if I use the old naming in patch 2 it's really 
> > 	 hard to understand what's actually going on.
> > 
> > 	 It also helps to increase the readability of the existing
> > 	 code, see patches 6-8.
> > 
> > Patch 2: This introduces new list functions to splice RCU lists
> > 	 and handle deleted RCU list entries.
> > 
> > Patch 3: The actual bugfix.
> > 
> > Patch 4+5: Follow-ups to patch 1. Some more renaming and use of
> > 	   appropriate functions.
> > 
> > Patch 6: Another follow-up to patch 1, improving the readability of
> > 	 the list routines a bit.
> > 
> > Patch 7+8: Follow-ups to patch 2. Make use of the introduced
> >            functionality in the already existing code.
> 
> 
> I am wondering, whether v3 should consist basically only of
> patches 2 and 3, i. e. the minimal approach, or if you would
> take all of them?
> 
> If you prefer the minimal version, I would make another patch
> set out of the other patches. But as there seems to be no official
> maintainer for list related functionality, I would appreciate
> some hints who I should put on the TO/CC list.

Ha, good question. So what we can do is this: I can take the minimal
patch-set through the scheduler tree once we un-confuse ourselves with
those list ops and at least one of the Pauls has had a look too (such
tricksy code deserves more eyeballs).

After that we can maybe trick Andrew Morton into carrying the generic
cleanups etc.; at any rate, try and Cc everybody whose files you're
touching on the various patches (scripts/get_maintainer.pl --no-git or
so should give you an idea).

Alternatively we could try and trick Ingo into pushing them, but
generally Andrew is maintainer of misc. stuff.

^ permalink raw reply	[flat|nested] 15+ messages in thread
