linux-kernel.vger.kernel.org archive mirror
* ipc,sem: sysv semaphore scalability
@ 2013-03-20 19:55 Rik van Riel
  2013-03-20 19:55 ` [PATCH 1/7] ipc: remove bogus lock comment for ipc_checkid Rik van Riel
                   ` (14 more replies)
  0 siblings, 15 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod

Include lkml in the CC: this time... *sigh*
---8<---

This series makes the sysv semaphore code more scalable,
by reducing the time the semaphore lock is held, and making
the locking more scalable for semaphore arrays with multiple
semaphores.

The first four patches were written by Davidlohr Bueso, and
reduce the hold time of the semaphore lock.

The last three patches change the sysv semaphore code locking
to be more fine grained, providing a performance boost when
multiple semaphores in a semaphore array are being manipulated
simultaneously.
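
For reference, semop-multi runs N threads against one array of N
semaphores. The actual semop-multi.c source is not included in this
posting, so the program below is only a rough approximation of that kind
of workload (single-sop semop() calls, one semaphore per thread -- all
names in it are made up for illustration):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define NSEMS 16	/* N threads, N semaphores */

static int semid;

static void *worker(void *arg)
{
	unsigned short num = (unsigned short)(long)arg;
	struct sembuf up   = { .sem_num = num, .sem_op =  1, .sem_flg = 0 };
	struct sembuf down = { .sem_num = num, .sem_op = -1, .sem_flg = 0 };

	for (;;) {
		/* each call manipulates exactly one semaphore */
		semop(semid, &up, 1);
		semop(semid, &down, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NSEMS];
	long i;

	semid = semget(IPC_PRIVATE, NSEMS, 0600 | IPC_CREAT);
	if (semid < 0) {
		perror("semget");
		exit(1);
	}
	for (i = 0; i < NSEMS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	pause();	/* the real benchmark counts operations over a fixed interval */
	return 0;
}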

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores look like this:

	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
threads			patches		rwlock patches	v3 patches
10	610652		726325		1783589		2142206
20	341570		365699		1520453		1977878
30	288102		307037		1498167		2037995
40	290714		305955		1612665		2256484
50	288620		312890		1733453		2650292
60	289987		306043		1649360		2388008
70	291298		306347		1723167		2717486
80	290948		305662		1729545		2763582
90	290996		306680		1736021		2757524
100	292243		306700		1773700		3059159



* [PATCH 1/7] ipc: remove bogus lock comment for ipc_checkid
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-20 19:55 ` [PATCH 2/7] ipc: introduce obtaining a lockless ipc object Rik van Riel
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

From: Davidlohr Bueso <davidlohr.bueso@hp.com>

There is no reason to be holding the ipc lock while
reading ipcp->seq, hence remove the misleading comment.

Also simplify the return value for the function.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 ipc/util.h |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/ipc/util.h b/ipc/util.h
index eeb79a1..ac1480a 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -150,14 +150,9 @@ static inline int ipc_buildid(int id, int seq)
 	return SEQ_MULTIPLIER * seq + id;
 }
 
-/*
- * Must be called with ipcp locked
- */
 static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
 {
-	if (uid / SEQ_MULTIPLIER != ipcp->seq)
-		return 1;
-	return 0;
+	return uid / SEQ_MULTIPLIER != ipcp->seq;
 }
 
 static inline void ipc_lock_by_ptr(struct kern_ipc_perm *perm)
-- 
1.7.7.6



* [PATCH 2/7] ipc: introduce obtaining a lockless ipc object
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
  2013-03-20 19:55 ` [PATCH 1/7] ipc: remove bogus lock comment for ipc_checkid Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-20 19:55 ` [PATCH 3/7] ipc: introduce lockless pre_down ipcctl Rik van Riel
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

From: Davidlohr Bueso <davidlohr.bueso@hp.com>

Through ipc_lock() and therefore ipc_lock_check() we currently
return the locked ipc object. This is not necessary in all situations
and can therefore cause unnecessary ipc lock contention.

Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
functions that only look up and return the ipc object.

Both these functions must be called within the RCU read critical section.
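
The intended calling pattern (an illustrative fragment, not lifted from
any single caller; later patches in this series use it in the semctl and
semtimedop paths) looks roughly like this:

	struct kern_ipc_perm *ipcp;
	int err = 0;

	rcu_read_lock();
	ipcp = ipc_obtain_object_check(ids, id);
	if (IS_ERR(ipcp)) {
		err = PTR_ERR(ipcp);
		goto out;
	}

	/* permission/security checks and read-only accesses happen here,
	 * under the RCU read lock only -- no ipc spinlock is taken */
out:
	rcu_read_unlock();
	return err;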

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
---
 ipc/util.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++-----------
 ipc/util.h |    2 +
 2 files changed, 60 insertions(+), 13 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 464a8ab..65c3d6c 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -668,6 +668,28 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
 }
 
 /**
+ * ipc_obtain_object
+ * @ids: ipc identifier set
+ * @id: ipc id to look for
+ *
+ * Look for an id in the ipc ids idr and return associated ipc object.
+ *
+ * Call inside the RCU critical section.
+ * The ipc object is *not* locked on exit.
+ */
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id)
+{
+	struct kern_ipc_perm *out;
+	int lid = ipcid_to_idx(id);
+
+	out = idr_find(&ids->ipcs_idr, lid);
+	if (!out)
+		return ERR_PTR(-EINVAL);
+
+	return out;
+}
+
+/**
  * ipc_lock - Lock an ipc structure without rw_mutex held
  * @ids: IPC identifier set
  * @id: ipc id to look for
@@ -680,27 +702,50 @@ void ipc64_perm_to_ipc_perm (struct ipc64_perm *in, struct ipc_perm *out)
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
 {
 	struct kern_ipc_perm *out;
-	int lid = ipcid_to_idx(id);
-
+	
 	rcu_read_lock();
-	out = idr_find(&ids->ipcs_idr, lid);
-	if (out == NULL) {
-		rcu_read_unlock();
-		return ERR_PTR(-EINVAL);
-	}
+	out = ipc_obtain_object(ids, id);
+	if (IS_ERR(out))
+		goto err1;
 
 	spin_lock(&out->lock);
-	
+
 	/* ipc_rmid() may have already freed the ID while ipc_lock
 	 * was spinning: here verify that the structure is still valid
 	 */
-	if (out->deleted) {
-		spin_unlock(&out->lock);
-		rcu_read_unlock();
-		return ERR_PTR(-EINVAL);
-	}
+	if (out->deleted)
+		goto err0;
 
 	return out;
+err0:
+	spin_unlock(&out->lock);
+err1:
+	rcu_read_unlock();
+	return ERR_PTR(-EINVAL);
+}
+
+/**
+ * ipc_obtain_object_check
+ * @ids: ipc identifier set
+ * @id: ipc id to look for
+ *
+ * Similar to ipc_obtain_object() but also checks
+ * the ipc object reference counter.
+ *
+ * Call inside the RCU critical section.
+ * The ipc object is *not* locked on exit.
+ */
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id)
+{
+	struct kern_ipc_perm *out = ipc_obtain_object(ids, id);
+
+	if (IS_ERR(out))
+		goto out;
+
+	if (ipc_checkid(out, id))
+		return ERR_PTR(-EIDRM);
+out:
+	return out;
 }
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
diff --git a/ipc/util.h b/ipc/util.h
index ac1480a..bfc8d4e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -123,6 +123,7 @@ void ipc_rcu_getref(void *ptr);
 void ipc_rcu_putref(void *ptr);
 
 struct kern_ipc_perm *ipc_lock(struct ipc_ids *, int);
+struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id);
 
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
@@ -168,6 +169,7 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 }
 
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
+struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 			struct ipc_ops *ops, struct ipc_params *params);
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
-- 
1.7.7.6



* [PATCH 3/7] ipc: introduce lockless pre_down ipcctl
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
  2013-03-20 19:55 ` [PATCH 1/7] ipc: remove bogus lock comment for ipc_checkid Rik van Riel
  2013-03-20 19:55 ` [PATCH 2/7] ipc: introduce obtaining a lockless ipc object Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-20 19:55 ` [PATCH 4/7] ipc,sem: do not hold ipc lock more than necessary Rik van Riel
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

From: Davidlohr Bueso <davidlohr.bueso@hp.com>

Various forms of ipc use the ipcctl_pre_down() function to
retrieve an ipc object and check permissions, mostly for IPC_RMID
and IPC_SET commands.

Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
The locking version is retained, but now calls the lockless version
internally; its semantics are unchanged, so the change is transparent
to all ipc callers.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 ipc/util.c |   31 ++++++++++++++++++++++++++-----
 ipc/util.h |    3 +++
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/ipc/util.c b/ipc/util.c
index 65c3d6c..6a98e62 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -825,11 +825,28 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 				      struct ipc64_perm *perm, int extra_perm)
 {
 	struct kern_ipc_perm *ipcp;
+
+	ipcp = ipcctl_pre_down_nolock(ns, ids, id, cmd, perm, extra_perm);
+	if (IS_ERR(ipcp))
+		goto out;
+
+	spin_lock(&ipcp->lock);
+out:
+	return ipcp;
+}
+
+struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+					     struct ipc_ids *ids, int id, int cmd,
+					     struct ipc64_perm *perm, int extra_perm)
+{
 	kuid_t euid;
-	int err;
+	int err = -EPERM;
+	struct kern_ipc_perm *ipcp;
 
 	down_write(&ids->rw_mutex);
-	ipcp = ipc_lock_check(ids, id);
+	rcu_read_lock();
+
+	ipcp = ipc_obtain_object_check(ids, id);
 	if (IS_ERR(ipcp)) {
 		err = PTR_ERR(ipcp);
 		goto out_up;
@@ -838,17 +855,21 @@ struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 	audit_ipc_obj(ipcp);
 	if (cmd == IPC_SET)
 		audit_ipc_set_perm(extra_perm, perm->uid,
-					 perm->gid, perm->mode);
+				   perm->gid, perm->mode);
 
 	euid = current_euid();
 	if (uid_eq(euid, ipcp->cuid) || uid_eq(euid, ipcp->uid)  ||
 	    ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return ipcp;
 
-	err = -EPERM;
-	ipc_unlock(ipcp);
 out_up:
+	/*
+	 * Unsuccessful lookup, unlock and return
+	 * the corresponding error.
+	 */
+	rcu_read_unlock();
 	up_write(&ids->rw_mutex);
+
 	return ERR_PTR(err);
 }
 
diff --git a/ipc/util.h b/ipc/util.h
index bfc8d4e..13d92fe 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -128,6 +128,9 @@ struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id);
 void kernel_to_ipc64_perm(struct kern_ipc_perm *in, struct ipc64_perm *out);
 void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out);
 int ipc_update_perm(struct ipc64_perm *in, struct kern_ipc_perm *out);
+struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
+					     struct ipc_ids *ids, int id, int cmd,
+					     struct ipc64_perm *perm, int extra_perm);
 struct kern_ipc_perm *ipcctl_pre_down(struct ipc_namespace *ns,
 				      struct ipc_ids *ids, int id, int cmd,
 				      struct ipc64_perm *perm, int extra_perm);
-- 
1.7.7.6



* [PATCH 4/7] ipc,sem: do not hold ipc lock more than necessary
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (2 preceding siblings ...)
  2013-03-20 19:55 ` [PATCH 3/7] ipc: introduce lockless pre_down ipcctl Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-20 19:55 ` [PATCH 5/7] ipc,sem: open code and rename sem_lock Rik van Riel
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel, Emmanuel Benisty

From: Davidlohr Bueso <davidlohr.bueso@hp.com>

I have corrected the two locking bugs in this patch, but
left it otherwise untouched - Rik.
---8<---
Instead of holding the ipc lock for permissions and security
checks, among others, only acquire it when necessary.
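
The shape of the change (a schematic sketch only; sem_update_example() is
a hypothetical name used purely to show the ordering, the real changes are
in the semctl_*() and semtimedop() hunks below):

/* lookup and checks run under rcu_read_lock() only; the spinlock is
 * taken just around the actual update */
static int sem_update_example(struct ipc_namespace *ns, int semid)
{
	struct sem_array *sma;
	int err;

	rcu_read_lock();
	sma = sem_obtain_object_check(ns, semid);
	if (IS_ERR(sma)) {
		rcu_read_unlock();
		return PTR_ERR(sma);
	}

	err = -EACCES;
	if (ipcperms(ns, &sma->sem_perm, S_IWUGO))
		goto out_unlock;

	err = security_sem_semctl(sma, SETVAL);
	if (err)
		goto out_unlock;

	ipc_lock_object(&sma->sem_perm);	/* only now take the spinlock */
	/* ... modify the semaphore array ... */
	sem_unlock(sma);
	return 0;

out_unlock:
	rcu_read_unlock();
	return err;
}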

Some numbers....

1) With Rik's semop-multi.c microbenchmark we can see the following
results:

Baseline (3.9-rc1):
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 151452270, ops/sec 5048409

+  59.40%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+   6.14%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   3.84%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   3.64%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   2.06%            a.out  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
+   1.86%            a.out  [kernel.kallsyms]  [k] ipc_lock

With this patchset:
cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
total operations: 273156400, ops/sec 9105213

+  18.54%            a.out  [kernel.kallsyms]  [k] _raw_spin_lock
+  11.72%            a.out  [kernel.kallsyms]  [k] sys_semtimedop
+   7.70%            a.out  [kernel.kallsyms]  [k] ipc_has_perm.isra.21
+   6.58%            a.out  [kernel.kallsyms]  [k] avc_has_perm_flags
+   6.54%            a.out  [kernel.kallsyms]  [k] __audit_syscall_exit
+   4.71%            a.out  [kernel.kallsyms]  [k] ipc_obtain_object_check

2) On an Oracle swingbench DSS (data mining) workload the improvements
are not as dramatic as with Rik's benchmark, but we can still see some
positive numbers. For an 8 socket machine the following are the
percentages of %sys time spent in the ipc lock:

Baseline (3.9-rc1):
100 swingbench users: 8.74%
400 swingbench users: 21.86%
800 swingbench users: 84.35%

With this patchset:
100 swingbench users: 8.11%
400 swingbench users: 19.93%
800 swingbench users: 77.69%

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Acked-by: Michel Lespinasse <walken@google.com>
CC: Rik van Riel <riel@redhat.com>
CC: Jason Low <jason.low2@hp.com>
CC: Emmanuel Benisty <benisty.e@gmail.com>
---
 ipc/sem.c  |  155 ++++++++++++++++++++++++++++++++++++++++++------------------
 ipc/util.h |    5 ++
 2 files changed, 114 insertions(+), 46 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1..ff3d4ff 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -204,13 +204,34 @@ static inline struct sem_array *sem_lock(struct ipc_namespace *ns, int id)
 	return container_of(ipcp, struct sem_array, sem_perm);
 }
 
+static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
+{
+	struct kern_ipc_perm *ipcp = ipc_obtain_object(&sem_ids(ns), id);
+
+	if (IS_ERR(ipcp))
+		return ERR_CAST(ipcp);
+
+	return container_of(ipcp, struct sem_array, sem_perm);
+}
+
 static inline struct sem_array *sem_lock_check(struct ipc_namespace *ns,
 						int id)
 {
 	struct kern_ipc_perm *ipcp = ipc_lock_check(&sem_ids(ns), id);
 
 	if (IS_ERR(ipcp))
-		return (struct sem_array *)ipcp;
+		return ERR_CAST(ipcp);
+
+	return container_of(ipcp, struct sem_array, sem_perm);
+}
+
+static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns,
+							int id)
+{
+	struct kern_ipc_perm *ipcp = ipc_obtain_object_check(&sem_ids(ns), id);
+
+	if (IS_ERR(ipcp))
+		return ERR_CAST(ipcp);
 
 	return container_of(ipcp, struct sem_array, sem_perm);
 }
@@ -234,6 +255,16 @@ static inline void sem_putref(struct sem_array *sma)
 	ipc_unlock(&(sma)->sem_perm);
 }
 
+/*
+ * Call inside the rcu read section.
+ */
+static inline void sem_getref(struct sem_array *sma)
+{
+	spin_lock(&(sma)->sem_perm.lock);
+	ipc_rcu_getref(sma);
+	ipc_unlock(&(sma)->sem_perm);
+}
+
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
 {
 	ipc_rmid(&sem_ids(ns), &s->sem_perm);
@@ -842,18 +873,25 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
 	case SEM_STAT:
 	{
 		struct semid64_ds tbuf;
-		int id;
+		int id = 0;
+
+		memset(&tbuf, 0, sizeof(tbuf));
 
 		if (cmd == SEM_STAT) {
-			sma = sem_lock(ns, semid);
-			if (IS_ERR(sma))
-				return PTR_ERR(sma);
+			rcu_read_lock();
+			sma = sem_obtain_object(ns, semid);
+			if (IS_ERR(sma)) {
+				err = PTR_ERR(sma);
+				goto out_unlock;
+			}
 			id = sma->sem_perm.id;
 		} else {
-			sma = sem_lock_check(ns, semid);
-			if (IS_ERR(sma))
-				return PTR_ERR(sma);
-			id = 0;
+			rcu_read_lock();
+			sma = sem_obtain_object_check(ns, semid);
+			if (IS_ERR(sma)) {
+				err = PTR_ERR(sma);
+				goto out_unlock;
+			}
 		}
 
 		err = -EACCES;
@@ -864,13 +902,11 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
 		if (err)
 			goto out_unlock;
 
-		memset(&tbuf, 0, sizeof(tbuf));
-
 		kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
 		tbuf.sem_otime  = sma->sem_otime;
 		tbuf.sem_ctime  = sma->sem_ctime;
 		tbuf.sem_nsems  = sma->sem_nsems;
-		sem_unlock(sma);
+		rcu_read_unlock();
 		if (copy_semid_to_user (arg.buf, &tbuf, version))
 			return -EFAULT;
 		return id;
@@ -879,7 +915,7 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
 		return -EINVAL;
 	}
 out_unlock:
-	sem_unlock(sma);
+	rcu_read_unlock();
 	return err;
 }
 
@@ -888,27 +924,34 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 {
 	struct sem_array *sma;
 	struct sem* curr;
-	int err;
+	int err, nsems;
 	ushort fast_sem_io[SEMMSL_FAST];
 	ushort* sem_io = fast_sem_io;
-	int nsems;
 	struct list_head tasks;
 
-	sma = sem_lock_check(ns, semid);
-	if (IS_ERR(sma))
+	INIT_LIST_HEAD(&tasks);
+
+	rcu_read_lock();
+	sma = sem_obtain_object_check(ns, semid);
+	if (IS_ERR(sma)) {
+		rcu_read_unlock();
 		return PTR_ERR(sma);
+	}
 
-	INIT_LIST_HEAD(&tasks);
 	nsems = sma->sem_nsems;
 
 	err = -EACCES;
 	if (ipcperms(ns, &sma->sem_perm,
-			(cmd == SETVAL || cmd == SETALL) ? S_IWUGO : S_IRUGO))
-		goto out_unlock;
+		     (cmd == SETVAL || cmd == SETALL) ? S_IWUGO : S_IRUGO)) {
+		rcu_read_unlock();
+		goto out_wakeup;
+	}
 
 	err = security_sem_semctl(sma, cmd);
-	if (err)
-		goto out_unlock;
+	if (err) {
+		rcu_read_unlock();
+		goto out_wakeup;
+	}
 
 	err = -EACCES;
 	switch (cmd) {
@@ -918,7 +961,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		int i;
 
 		if(nsems > SEMMSL_FAST) {
-			sem_getref_and_unlock(sma);
+			sem_getref(sma);
 
 			sem_io = ipc_alloc(sizeof(ushort)*nsems);
 			if(sem_io == NULL) {
@@ -934,6 +977,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 			}
 		}
 
+		spin_lock(&sma->sem_perm.lock);
 		for (i = 0; i < sma->sem_nsems; i++)
 			sem_io[i] = sma->sem_base[i].semval;
 		sem_unlock(sma);
@@ -947,7 +991,8 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		int i;
 		struct sem_undo *un;
 
-		sem_getref_and_unlock(sma);
+		ipc_rcu_getref(sma);
+		rcu_read_unlock();
 
 		if(nsems > SEMMSL_FAST) {
 			sem_io = ipc_alloc(sizeof(ushort)*nsems);
@@ -997,6 +1042,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	if(semnum < 0 || semnum >= nsems)
 		goto out_unlock;
 
+	spin_lock(&sma->sem_perm.lock);
 	curr = &sma->sem_base[semnum];
 
 	switch (cmd) {
@@ -1034,10 +1080,11 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		goto out_unlock;
 	}
 	}
+
 out_unlock:
 	sem_unlock(sma);
+out_wakeup:
 	wake_up_sem_queue_do(&tasks);
-
 out_free:
 	if(sem_io != fast_sem_io)
 		ipc_free(sem_io, sizeof(ushort)*nsems);
@@ -1088,29 +1135,35 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
 			return -EFAULT;
 	}
 
-	ipcp = ipcctl_pre_down(ns, &sem_ids(ns), semid, cmd,
-			       &semid64.sem_perm, 0);
+	ipcp = ipcctl_pre_down_nolock(ns, &sem_ids(ns), semid, cmd,
+				      &semid64.sem_perm, 0);
 	if (IS_ERR(ipcp))
 		return PTR_ERR(ipcp);
 
 	sma = container_of(ipcp, struct sem_array, sem_perm);
 
 	err = security_sem_semctl(sma, cmd);
-	if (err)
+	if (err) {
+		rcu_read_unlock();
 		goto out_unlock;
+	}
 
 	switch(cmd){
 	case IPC_RMID:
+		ipc_lock_object(&sma->sem_perm);
 		freeary(ns, ipcp);
 		goto out_up;
 	case IPC_SET:
+		ipc_lock_object(&sma->sem_perm);
 		err = ipc_update_perm(&semid64.sem_perm, ipcp);
 		if (err)
 			goto out_unlock;
 		sma->sem_ctime = get_seconds();
 		break;
 	default:
+		rcu_read_unlock();
 		err = -EINVAL;
+		goto out_up;
 	}
 
 out_unlock:
@@ -1248,16 +1301,18 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	spin_unlock(&ulp->lock);
 	if (likely(un!=NULL))
 		goto out;
-	rcu_read_unlock();
 
 	/* no undo structure around - allocate one. */
 	/* step 1: figure out the size of the semaphore array */
-	sma = sem_lock_check(ns, semid);
-	if (IS_ERR(sma))
+	sma = sem_obtain_object_check(ns, semid);
+	if (IS_ERR(sma)) {
+		rcu_read_unlock();
 		return ERR_CAST(sma);
+	}
 
 	nsems = sma->sem_nsems;
-	sem_getref_and_unlock(sma);
+	ipc_rcu_getref(sma);
+	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
 	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
@@ -1392,7 +1447,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 	INIT_LIST_HEAD(&tasks);
 
-	sma = sem_lock_check(ns, semid);
+	rcu_read_lock();
+	sma = sem_obtain_object_check(ns, semid);
 	if (IS_ERR(sma)) {
 		if (un)
 			rcu_read_unlock();
@@ -1400,6 +1456,24 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		goto out_free;
 	}
 
+	error = -EFBIG;
+	if (max >= sma->sem_nsems) {
+		rcu_read_unlock();
+		goto out_wakeup;
+	}
+
+	error = -EACCES;
+	if (ipcperms(ns, &sma->sem_perm, alter ? S_IWUGO : S_IRUGO)) {
+		rcu_read_unlock();
+		goto out_wakeup;
+	}
+
+	error = security_sem_semop(sma, sops, nsops, alter);
+	if (error) {
+		rcu_read_unlock();
+		goto out_wakeup;
+	}
+
 	/*
 	 * semid identifiers are not unique - find_alloc_undo may have
 	 * allocated an undo structure, it was invalidated by an RMID
@@ -1408,6 +1482,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	 * "un" itself is guaranteed by rcu.
 	 */
 	error = -EIDRM;
+	ipc_lock_object(&sma->sem_perm);
 	if (un) {
 		if (un->semid == -1) {
 			rcu_read_unlock();
@@ -1425,18 +1500,6 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		}
 	}
 
-	error = -EFBIG;
-	if (max >= sma->sem_nsems)
-		goto out_unlock_free;
-
-	error = -EACCES;
-	if (ipcperms(ns, &sma->sem_perm, alter ? S_IWUGO : S_IRUGO))
-		goto out_unlock_free;
-
-	error = security_sem_semop(sma, sops, nsops, alter);
-	if (error)
-		goto out_unlock_free;
-
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {
 		if (alter && error == 0)
@@ -1539,7 +1602,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 out_unlock_free:
 	sem_unlock(sma);
-
+out_wakeup:
 	wake_up_sem_queue_do(&tasks);
 out_free:
 	if(sops != fast_sops)
diff --git a/ipc/util.h b/ipc/util.h
index 13d92fe..c36b997 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -171,6 +171,11 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)
 	rcu_read_unlock();
 }
 
+static inline void ipc_lock_object(struct kern_ipc_perm *perm)
+{
+	spin_lock(&perm->lock);
+}
+
 struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
 struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids *ids, int id);
 int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
-- 
1.7.7.6



* [PATCH 5/7] ipc,sem: open code and rename sem_lock
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (3 preceding siblings ...)
  2013-03-20 19:55 ` [PATCH 4/7] ipc,sem: do not hold ipc lock more than necessary Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-22  1:14   ` Davidlohr Bueso
  2013-03-20 19:55 ` [PATCH 6/7] ipc,sem: have only one list in struct sem_queue Rik van Riel
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel, Rik van Riel

Rename sem_lock to sem_obtain_lock, so we can introduce a sem_lock
function later that only locks the sem_array and does nothing else.

Open code the locking from ipc_lock in sem_obtain_lock, so we can
introduce finer grained locking for the sem_array in the next patch.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index ff3d4ff..d25f9b6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -194,14 +194,29 @@ void __init sem_init (void)
  * sem_lock_(check_) routines are called in the paths where the rw_mutex
  * is not held.
  */
-static inline struct sem_array *sem_lock(struct ipc_namespace *ns, int id)
+static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id)
 {
-	struct kern_ipc_perm *ipcp = ipc_lock(&sem_ids(ns), id);
+	struct kern_ipc_perm *ipcp;
 
+	rcu_read_lock();
+	ipcp = ipc_obtain_object(&sem_ids(ns), id);
 	if (IS_ERR(ipcp))
-		return (struct sem_array *)ipcp;
+		goto err1;
+
+	spin_lock(&ipcp->lock);
+
+	/* ipc_rmid() may have already freed the ID while sem_lock
+	 * was spinning: verify that the structure is still valid
+	 */
+	if (ipcp->deleted)
+		goto err0;
 
 	return container_of(ipcp, struct sem_array, sem_perm);
+err0:
+	spin_unlock(&ipcp->lock);
+err1:
+	rcu_read_unlock();
+	return ERR_PTR(-EINVAL);
 }
 
 static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int id)
@@ -1562,7 +1577,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		goto out_free;
 	}
 
-	sma = sem_lock(ns, semid);
+	sma = sem_obtain_lock(ns, semid);
 
 	/*
 	 * Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing.
-- 
1.7.7.6



* [PATCH 6/7] ipc,sem: have only one list in struct sem_queue
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (4 preceding siblings ...)
  2013-03-20 19:55 ` [PATCH 5/7] ipc,sem: open code and rename sem_lock Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-22  1:14   ` Davidlohr Bueso
  2013-03-20 19:55 ` [PATCH 7/7] ipc,sem: fine grained locking for semtimedop Rik van Riel
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel, Rik van Riel

Having only one list in struct sem_queue, and only queueing simple
semaphore operations on the list for the semaphore involved, allows
us to introduce finer grained locking for semtimedop.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   65 +++++++++++++++++++++++++++++++-----------------------------
 1 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index d25f9b6..468e2c1 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -99,7 +99,6 @@ struct sem {
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
-	struct list_head	simple_list; /* queue of pending operations */
 	struct list_head	list;	 /* queue of pending operations */
 	struct task_struct	*sleeper; /* this process */
 	struct sem_undo		*undo;	 /* undo structure */
@@ -517,7 +516,7 @@ static void wake_up_sem_queue_prepare(struct list_head *pt,
 	q->status = IN_WAKEUP;
 	q->pid = error;
 
-	list_add_tail(&q->simple_list, pt);
+	list_add_tail(&q->list, pt);
 }
 
 /**
@@ -535,7 +534,7 @@ static void wake_up_sem_queue_do(struct list_head *pt)
 	int did_something;
 
 	did_something = !list_empty(pt);
-	list_for_each_entry_safe(q, t, pt, simple_list) {
+	list_for_each_entry_safe(q, t, pt, list) {
 		wake_up_process(q->sleeper);
 		/* q can disappear immediately after writing q->status. */
 		smp_wmb();
@@ -548,9 +547,7 @@ static void wake_up_sem_queue_do(struct list_head *pt)
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
 {
 	list_del(&q->list);
-	if (q->nsops == 1)
-		list_del(&q->simple_list);
-	else
+	if (q->nsops > 1)
 		sma->complex_count--;
 }
 
@@ -603,9 +600,9 @@ static int check_restart(struct sem_array *sma, struct sem_queue *q)
 	}
 	/*
 	 * semval is 0. Check if there are wait-for-zero semops.
-	 * They must be the first entries in the per-semaphore simple queue
+	 * They must be the first entries in the per-semaphore queue
 	 */
-	h = list_first_entry(&curr->sem_pending, struct sem_queue, simple_list);
+	h = list_first_entry(&curr->sem_pending, struct sem_queue, list);
 	BUG_ON(h->nsops != 1);
 	BUG_ON(h->sops[0].sem_num != q->sops[0].sem_num);
 
@@ -625,8 +622,9 @@ static int check_restart(struct sem_array *sma, struct sem_queue *q)
  * @pt: list head for the tasks that must be woken up.
  *
  * update_queue must be called after a semaphore in a semaphore array
- * was modified. If multiple semaphore were modified, then @semnum
- * must be set to -1.
+ * was modified. If multiple semaphores were modified, update_queue must
+ * be called with semnum = -1, as well as with the number of each modified
+ * semaphore.
  * The tasks that must be woken up are added to @pt. The return code
  * is stored in q->pid.
  * The function return 1 if at least one semop was completed successfully.
@@ -636,30 +634,19 @@ static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt)
 	struct sem_queue *q;
 	struct list_head *walk;
 	struct list_head *pending_list;
-	int offset;
 	int semop_completed = 0;
 
-	/* if there are complex operations around, then knowing the semaphore
-	 * that was modified doesn't help us. Assume that multiple semaphores
-	 * were modified.
-	 */
-	if (sma->complex_count)
-		semnum = -1;
-
-	if (semnum == -1) {
+	if (semnum == -1)
 		pending_list = &sma->sem_pending;
-		offset = offsetof(struct sem_queue, list);
-	} else {
+	else
 		pending_list = &sma->sem_base[semnum].sem_pending;
-		offset = offsetof(struct sem_queue, simple_list);
-	}
 
 again:
 	walk = pending_list->next;
 	while (walk != pending_list) {
 		int error, restart;
 
-		q = (struct sem_queue *)((char *)walk - offset);
+		q = container_of(walk, struct sem_queue, list);
 		walk = walk->next;
 
 		/* If we are scanning the single sop, per-semaphore list of
@@ -718,9 +705,18 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
 	if (sma->complex_count || sops == NULL) {
 		if (update_queue(sma, -1, pt))
 			otime = 1;
+	}
+
+	if (!sops) {
+		/* No semops; something special is going on. */
+		for (i = 0; i < sma->sem_nsems; i++) {
+			if (update_queue(sma, i, pt))
+				otime = 1;
+		}
 		goto done;
 	}
 
+	/* Check the semaphores that were modified. */
 	for (i = 0; i < nsops; i++) {
 		if (sops[i].sem_op > 0 ||
 			(sops[i].sem_op < 0 &&
@@ -791,6 +787,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	struct sem_queue *q, *tq;
 	struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm);
 	struct list_head tasks;
+	int i;
 
 	/* Free the existing undo structures for this semaphore set.  */
 	assert_spin_locked(&sma->sem_perm.lock);
@@ -809,6 +806,13 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 		unlink_queue(sma, q);
 		wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
 	}
+	for (i = 0; i < sma->sem_nsems; i++) {
+		struct sem *sem = sma->sem_base + i;
+		list_for_each_entry_safe(q, tq, &sem->sem_pending, list) {
+			unlink_queue(sma, q);
+			wake_up_sem_queue_prepare(&tasks, q, -EIDRM);
+		}
+	}
 
 	/* Remove the semaphore set from the IDR */
 	sem_rmid(ns, sma);
@@ -1532,21 +1536,20 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	queue.undo = un;
 	queue.pid = task_tgid_vnr(current);
 	queue.alter = alter;
-	if (alter)
-		list_add_tail(&queue.list, &sma->sem_pending);
-	else
-		list_add(&queue.list, &sma->sem_pending);
 
 	if (nsops == 1) {
 		struct sem *curr;
 		curr = &sma->sem_base[sops->sem_num];
 
 		if (alter)
-			list_add_tail(&queue.simple_list, &curr->sem_pending);
+			list_add_tail(&queue.list, &curr->sem_pending);
 		else
-			list_add(&queue.simple_list, &curr->sem_pending);
+			list_add(&queue.list, &curr->sem_pending);
 	} else {
-		INIT_LIST_HEAD(&queue.simple_list);
+		if (alter)
+			list_add_tail(&queue.list, &sma->sem_pending);
+		else
+			list_add(&queue.list, &sma->sem_pending);
 		sma->complex_count++;
 	}
 
-- 
1.7.7.6



* [PATCH 7/7] ipc,sem: fine grained locking for semtimedop
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (5 preceding siblings ...)
  2013-03-20 19:55 ` [PATCH 6/7] ipc,sem: have only one list in struct sem_queue Rik van Riel
@ 2013-03-20 19:55 ` Rik van Riel
  2013-03-22  1:14   ` Davidlohr Bueso
  2013-03-22 23:01   ` Michel Lespinasse
  2013-03-20 20:49 ` ipc,sem: sysv semaphore scalability Linus Torvalds
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-20 19:55 UTC (permalink / raw)
  To: torvalds
  Cc: davidlohr.bueso, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel, Rik van Riel

Introduce finer grained locking for semtimedop, to handle the
common case of a program wanting to manipulate one semaphore
from an array with multiple semaphores.

If the call is a semop manipulating just one semaphore in
an array with multiple semaphores, only take the lock for
that semaphore itself.

If the call needs to manipulate multiple semaphores, or
another caller is in a transaction that manipulates multiple
semaphores, the sem_array lock is taken, as well as all the
locks for the individual semaphores.
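
From userspace, the distinction is simply how many operations one semop()
call passes in. A minimal example (not part of the patch; illustration
only):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	int semid = semget(IPC_PRIVATE, 4, 0600 | IPC_CREAT);
	struct sembuf one = { .sem_num = 2, .sem_op = 1, .sem_flg = 0 };
	struct sembuf two[] = {
		{ .sem_num = 0, .sem_op = 1, .sem_flg = 0 },
		{ .sem_num = 3, .sem_op = 1, .sem_flg = 0 },
	};

	if (semid < 0) {
		perror("semget");
		return 1;
	}

	/* one sop on one semaphore: with this patch only semaphore 2's
	 * per-semaphore lock is taken (assuming no complex ops pending) */
	semop(semid, &one, 1);

	/* two sops in one call: the sem_array lock plus every
	 * per-semaphore lock is taken, as described above */
	semop(semid, two, 2);

	semctl(semid, 0, IPC_RMID);
	return 0;
}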

On a 24 CPU system, performance numbers with the semop-multi
test with N threads and N semaphores look like this:

	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
threads			patches		rwlock patches	v3 patches
10	610652		726325		1783589		2142206
20	341570		365699		1520453		1977878
30	288102		307037		1498167		2037995
40	290714		305955		1612665		2256484
50	288620		312890		1733453		2650292
60	289987		306043		1649360		2388008
70	291298		306347		1723167		2717486
80	290948		305662		1729545		2763582
90	290996		306680		1736021		2757524
100	292243		306700		1773700		3059159

Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 ipc/sem.c |  152 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 105 insertions(+), 47 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 468e2c1..483eb6b 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -94,6 +94,7 @@
 struct sem {
 	int	semval;		/* current value */
 	int	sempid;		/* pid of last operation */
+	spinlock_t	lock;	/* spinlock for fine-grained semtimedop */
 	struct list_head sem_pending; /* pending single-sop operations */
 };
 
@@ -137,7 +138,6 @@ struct sem_undo_list {
 
 #define sem_ids(ns)	((ns)->ids[IPC_SEM_IDS])
 
-#define sem_unlock(sma)		ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *);
@@ -190,19 +190,83 @@ void __init sem_init (void)
 }
 
 /*
+ * If the sem_array contains just one semaphore, or if multiple
+ * semops are performed in one syscall, or if there are complex
+ * operations pending, the whole sem_array is locked.
+ * If one semop is performed on an array with multiple semaphores,
+ * get a shared lock on the array, and lock the individual semaphore.
+ *
+ * Carefully guard against sma->complex_count changing between zero
+ * and non-zero while we are spinning for the lock. The value of
+ * sma->complex_count cannot change while we are holding the lock,
+ * so sem_unlock should be fine.
+ */
+static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
+			      int nsops)
+{
+	int locknum;
+	if (nsops == 1 && !sma->complex_count) {
+		struct sem *sem = sma->sem_base + sops->sem_num;
+
+		/* Lock just the semaphore we are interested in. */
+		spin_lock(&sem->lock);
+
+		/*
+		 * If sma->complex_count was set while we were spinning,
+		 * we may need to look at things we did not lock here.
+		 */
+		if (unlikely(sma->complex_count)) {
+			spin_unlock(&sma->sem_perm.lock);
+			goto lock_all;
+		}
+		locknum = sops->sem_num;
+	} else {
+		int i;
+		/* Lock the sem_array, and all the semaphore locks */
+ lock_all:
+		spin_lock(&sma->sem_perm.lock);
+		for (i = 0; i < sma->sem_nsems; i++) {
+			struct sem *sem = sma->sem_base + i;
+			spin_lock(&sem->lock);
+		}
+		locknum = -1;
+	}
+	return locknum;
+}
+
+static inline void sem_unlock(struct sem_array *sma, int locknum)
+{
+	if (locknum == -1) {
+		int i;
+		for (i = 0; i < sma->sem_nsems; i++) {
+			struct sem *sem = sma->sem_base + i;
+			spin_unlock(&sem->lock);
+		}
+		spin_unlock(&sma->sem_perm.lock);
+	} else {
+		struct sem *sem = sma->sem_base + locknum;
+		spin_unlock(&sem->lock);
+	}
+	rcu_read_unlock();
+}
+
+/*
  * sem_lock_(check_) routines are called in the paths where the rw_mutex
  * is not held.
  */
-static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id)
+static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns,
+			int id, struct sembuf *sops, int nsops, int *locknum)
 {
 	struct kern_ipc_perm *ipcp;
+	struct sem_array *sma;
 
 	rcu_read_lock();
 	ipcp = ipc_obtain_object(&sem_ids(ns), id);
 	if (IS_ERR(ipcp))
 		goto err1;
 
-	spin_lock(&ipcp->lock);
+	sma = container_of(ipcp, struct sem_array, sem_perm);
+	*locknum = sem_lock(sma, sops, nsops);
 
 	/* ipc_rmid() may have already freed the ID while sem_lock
 	 * was spinning: verify that the structure is still valid
@@ -210,9 +274,9 @@ static inline struct sem_array *sem_obtain_lock(struct ipc_namespace *ns, int id
 	if (ipcp->deleted)
 		goto err0;
 
-	return container_of(ipcp, struct sem_array, sem_perm);
+	return sma;
 err0:
-	spin_unlock(&ipcp->lock);
+	sem_unlock(sma, *locknum);
 err1:
 	rcu_read_unlock();
 	return ERR_PTR(-EINVAL);
@@ -228,17 +292,6 @@ static inline struct sem_array *sem_obtain_object(struct ipc_namespace *ns, int
 	return container_of(ipcp, struct sem_array, sem_perm);
 }
 
-static inline struct sem_array *sem_lock_check(struct ipc_namespace *ns,
-						int id)
-{
-	struct kern_ipc_perm *ipcp = ipc_lock_check(&sem_ids(ns), id);
-
-	if (IS_ERR(ipcp))
-		return ERR_CAST(ipcp);
-
-	return container_of(ipcp, struct sem_array, sem_perm);
-}
-
 static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns,
 							int id)
 {
@@ -252,21 +305,21 @@ static inline struct sem_array *sem_obtain_object_check(struct ipc_namespace *ns
 
 static inline void sem_lock_and_putref(struct sem_array *sma)
 {
-	ipc_lock_by_ptr(&sma->sem_perm);
+	rcu_read_lock();
+	sem_lock(sma, NULL, -1);
 	ipc_rcu_putref(sma);
 }
 
 static inline void sem_getref_and_unlock(struct sem_array *sma)
 {
 	ipc_rcu_getref(sma);
-	ipc_unlock(&(sma)->sem_perm);
+	sem_unlock(sma, -1);
 }
 
 static inline void sem_putref(struct sem_array *sma)
 {
-	ipc_lock_by_ptr(&sma->sem_perm);
-	ipc_rcu_putref(sma);
-	ipc_unlock(&(sma)->sem_perm);
+	sem_lock_and_putref(sma);
+	sem_unlock(sma, -1);
 }
 
 /*
@@ -274,9 +327,9 @@ static inline void sem_putref(struct sem_array *sma)
  */
 static inline void sem_getref(struct sem_array *sma)
 {
-	spin_lock(&(sma)->sem_perm.lock);
+	sem_lock(sma, NULL, -1);
 	ipc_rcu_getref(sma);
-	ipc_unlock(&(sma)->sem_perm);
+	sem_unlock(sma, -1);
 }
 
 static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
@@ -369,15 +422,18 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 
 	sma->sem_base = (struct sem *) &sma[1];
 
-	for (i = 0; i < nsems; i++)
+	for (i = 0; i < nsems; i++) {
 		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
+		spin_lock_init(&sma->sem_base[i].lock);
+		spin_lock(&sma->sem_base[i].lock);
+	}
 
 	sma->complex_count = 0;
 	INIT_LIST_HEAD(&sma->sem_pending);
 	INIT_LIST_HEAD(&sma->list_id);
 	sma->sem_nsems = nsems;
 	sma->sem_ctime = get_seconds();
-	sem_unlock(sma);
+	sem_unlock(sma, -1);
 
 	return sma->sem_perm.id;
 }
@@ -816,7 +872,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 
 	/* Remove the semaphore set from the IDR */
 	sem_rmid(ns, sma);
-	sem_unlock(sma);
+	sem_unlock(sma, -1);
 
 	wake_up_sem_queue_do(&tasks);
 	ns->used_sems -= sma->sem_nsems;
@@ -990,16 +1046,16 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 
 			sem_lock_and_putref(sma);
 			if (sma->sem_perm.deleted) {
-				sem_unlock(sma);
+				sem_unlock(sma, -1);
 				err = -EIDRM;
 				goto out_free;
 			}
 		}
 
-		spin_lock(&sma->sem_perm.lock);
+		sem_lock(sma, NULL, -1);
 		for (i = 0; i < sma->sem_nsems; i++)
 			sem_io[i] = sma->sem_base[i].semval;
-		sem_unlock(sma);
+		sem_unlock(sma, -1);
 		err = 0;
 		if(copy_to_user(array, sem_io, nsems*sizeof(ushort)))
 			err = -EFAULT;
@@ -1036,7 +1092,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 		}
 		sem_lock_and_putref(sma);
 		if (sma->sem_perm.deleted) {
-			sem_unlock(sma);
+			sem_unlock(sma, -1);
 			err = -EIDRM;
 			goto out_free;
 		}
@@ -1061,7 +1117,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	if(semnum < 0 || semnum >= nsems)
 		goto out_unlock;
 
-	spin_lock(&sma->sem_perm.lock);
+	sem_lock(sma, NULL, -1);
 	curr = &sma->sem_base[semnum];
 
 	switch (cmd) {
@@ -1101,7 +1157,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 	}
 
 out_unlock:
-	sem_unlock(sma);
+	sem_unlock(sma, -1);
 out_wakeup:
 	wake_up_sem_queue_do(&tasks);
 out_free:
@@ -1169,11 +1225,11 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
 
 	switch(cmd){
 	case IPC_RMID:
-		ipc_lock_object(&sma->sem_perm);
+		sem_lock(sma, NULL, -1);
 		freeary(ns, ipcp);
 		goto out_up;
 	case IPC_SET:
-		ipc_lock_object(&sma->sem_perm);
+		sem_lock(sma, NULL, -1);
 		err = ipc_update_perm(&semid64.sem_perm, ipcp);
 		if (err)
 			goto out_unlock;
@@ -1186,7 +1242,7 @@ static int semctl_down(struct ipc_namespace *ns, int semid,
 	}
 
 out_unlock:
-	sem_unlock(sma);
+	sem_unlock(sma, -1);
 out_up:
 	up_write(&sem_ids(ns).rw_mutex);
 	return err;
@@ -1343,7 +1399,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	/* step 3: Acquire the lock on semaphore array */
 	sem_lock_and_putref(sma);
 	if (sma->sem_perm.deleted) {
-		sem_unlock(sma);
+		sem_unlock(sma, -1);
 		kfree(new);
 		un = ERR_PTR(-EIDRM);
 		goto out;
@@ -1371,7 +1427,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 success:
 	spin_unlock(&ulp->lock);
 	rcu_read_lock();
-	sem_unlock(sma);
+	sem_unlock(sma, -1);
 out:
 	return un;
 }
@@ -1411,7 +1467,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	struct sembuf fast_sops[SEMOPM_FAST];
 	struct sembuf* sops = fast_sops, *sop;
 	struct sem_undo *un;
-	int undos = 0, alter = 0, max;
+	int undos = 0, alter = 0, max, locknum;
 	struct sem_queue queue;
 	unsigned long jiffies_left = 0;
 	struct ipc_namespace *ns;
@@ -1501,7 +1557,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	 * "un" itself is guaranteed by rcu.
 	 */
 	error = -EIDRM;
-	ipc_lock_object(&sma->sem_perm);
+	locknum = sem_lock(sma, sops, nsops);
 	if (un) {
 		if (un->semid == -1) {
 			rcu_read_unlock();
@@ -1558,7 +1614,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 
 sleep_again:
 	current->state = TASK_INTERRUPTIBLE;
-	sem_unlock(sma);
+	sem_unlock(sma, locknum);
 
 	if (timeout)
 		jiffies_left = schedule_timeout(jiffies_left);
@@ -1580,7 +1636,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		goto out_free;
 	}
 
-	sma = sem_obtain_lock(ns, semid);
+	sma = sem_obtain_lock(ns, semid, sops, nsops, &locknum);
 
 	/*
 	 * Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing.
@@ -1619,7 +1675,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	unlink_queue(sma, &queue);
 
 out_unlock_free:
-	sem_unlock(sma);
+	sem_unlock(sma, locknum);
 out_wakeup:
 	wake_up_sem_queue_do(&tasks);
 out_free:
@@ -1693,12 +1749,14 @@ void exit_sem(struct task_struct *tsk)
 			semid = -1;
 		 else
 			semid = un->semid;
-		rcu_read_unlock();
 
-		if (semid == -1)
+		if (semid == -1) {
+			rcu_read_unlock();
 			break;
+		}
 
-		sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);
+		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
+		sem_lock(sma, NULL, -1);
 
 		/* exit_sem raced with IPC_RMID, nothing to do */
 		if (IS_ERR(sma))
@@ -1709,7 +1767,7 @@ void exit_sem(struct task_struct *tsk)
 			/* exit_sem raced with IPC_RMID+semget() that created
 			 * exactly the same semid. Nothing to do.
 			 */
-			sem_unlock(sma);
+			sem_unlock(sma, -1);
 			continue;
 		}
 
@@ -1749,7 +1807,7 @@ void exit_sem(struct task_struct *tsk)
 		/* maybe some queued-up processes were waiting for this */
 		INIT_LIST_HEAD(&tasks);
 		do_smart_update(sma, NULL, 0, 1, &tasks);
-		sem_unlock(sma);
+		sem_unlock(sma, -1);
 		wake_up_sem_queue_do(&tasks);
 
 		kfree_rcu(un, rcu);
-- 
1.7.7.6



* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (6 preceding siblings ...)
  2013-03-20 19:55 ` [PATCH 7/7] ipc,sem: fine grained locking for semtimedop Rik van Riel
@ 2013-03-20 20:49 ` Linus Torvalds
  2013-03-20 20:56   ` Linus Torvalds
  2013-03-20 20:57   ` Davidlohr Bueso
  2013-03-21 21:10 ` Andrew Morton
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-20 20:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Andrew Morton,
	hhuang, Low, Jason, Michel Lespinasse, Larry Woodman, Vinod,
	Chegu

On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel <riel@surriel.com> wrote:
>
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

The series looks sane to me, and I like how each individual step is
pretty small and makes sense.

It *would* be lovely to see this run with the actual Swingbench
numbers. The microbenchmark always looked much nicer. Do the
additional multi-semaphore scalability patches on top of Davidlohr's
patches help with the swingbench issue, or are we still totally
swamped by the ipc lock there?

Maybe there were already numbers for that, but the last swingbench
numbers I can actually recall was from before the finer-grained
locking..

And obviously, getting this tested so that there aren't any more
missed wakeups etc would be lovely. I'm assuming the plan is that this
all goes through Andrew? Do we have big semop users who could test it
on real loads? Considering that I *suspect* the main users are things
like Oracle etc, I'd assume that there's some RH lab or partner or
similar that is interested in making sure this not only helps, but
also that it doesn't break anything ;)

                 Linus


* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 20:49 ` ipc,sem: sysv semaphore scalability Linus Torvalds
@ 2013-03-20 20:56   ` Linus Torvalds
  2013-03-20 20:57   ` Davidlohr Bueso
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-20 20:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Linux Kernel Mailing List, Andrew Morton,
	hhuang, Low, Jason, Michel Lespinasse, Larry Woodman, Vinod,
	Chegu

On Wed, Mar 20, 2013 at 1:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It *would* be lovely to see this run with the actual Swingbench
> numbers. The microbenchmark always looked much nicer. Do the
> additional multi-semaphore scalability patches on top of Davidlohr's
> patches help with the swingbench issue, or are we still totally
> swamped by the ipc lock there?
>
> Maybe there were already numbers for that, but the last swingbench
> numbers I can actually recall was from before the finer-grained
> locking..

Ok, and if the spinlock is still a big deal even with the finer
granularity, it might be interesting to hear if Michel's fast locks
make a difference. I'm guessing that this series might actually make
it easier/cleaner to do the semaphore locking using another lock,
since the ipc_lock got split up and out...

I think Michel did it for some socket code too. I think that was
fairly independent and was for netperf.

            Linus


* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 20:49 ` ipc,sem: sysv semaphore scalability Linus Torvalds
  2013-03-20 20:56   ` Linus Torvalds
@ 2013-03-20 20:57   ` Davidlohr Bueso
  1 sibling, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-20 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Linux Kernel Mailing List, Andrew Morton, hhuang,
	Low, Jason, Michel Lespinasse, Larry Woodman, Vinod, Chegu

On Wed, 2013-03-20 at 13:49 -0700, Linus Torvalds wrote:
> On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel <riel@surriel.com> wrote:
> >
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> The series looks sane to me, and I like how each individual step is
> pretty small and makes sense.
> 
> It *would* be lovely to see this run with the actual Swingbench
> numbers. The microbenchmark always looked much nicer. Do the
> additional multi-semaphore scalability patches on top of Davidlohr's
> patches help with the swingbench issue, or are we still totally
> swamped by the ipc lock there?

Yes, I'm testing this patchset with my swingbench workloads. I should
have some numbers by today or tomorrow.

> 
> Maybe there were already numbers for that, but the last swingbench
> numbers I can actually recall was from before the finer-grained
> locking..

Right, I couldn't get Oracle to run with the previous patches;
hopefully the bug(s) are now addressed.

Thanks,
Davidlohr



* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (7 preceding siblings ...)
  2013-03-20 20:49 ` ipc,sem: sysv semaphore scalability Linus Torvalds
@ 2013-03-21 21:10 ` Andrew Morton
  2013-03-21 21:47   ` Peter Hurley
                     ` (2 more replies)
  2013-03-22  1:12 ` Davidlohr Bueso
                   ` (5 subsequent siblings)
  14 siblings, 3 replies; 129+ messages in thread
From: Andrew Morton @ 2013-03-21 21:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, hhuang, jason.low2,
	walken, lwoodman, chegu_vinod, Peter Hurley, Dave Jones

On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel <riel@surriel.com> wrote:

> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Bueso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.

These patches conflict pretty badly with Peter's:

ipc-clamp-with-min.patch
ipc-separate-msg-allocation-from-userspace-copy.patch
ipc-tighten-msg-copy-loops.patch
ipc-set-efault-as-default-error-in-load_msg.patch
ipc-remove-msg-handling-from-queue-scan.patch
ipc-implement-msg_copy-as-a-new-receive-mode.patch
ipc-simplify-msg-list-search.patch
ipc-refactor-msg-list-search-into-separate-function.patch
ipc-refactor-msg-list-search-into-separate-function-fix.patch

(they're at http://ozlabs.org/~akpm/mmots/broken-out/)



We're in a bit of a mess at present.

Last month Peter sent a ten-patch series which fixed an oops (Subject:
"ipc MSG_COPY fixes").  The series did other stuff, so we merged into
mainline just two bugfix patches.  Then davej hit a trinity oops
(Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining
eight patches in Peter's series fixed that up.  Nobody has yet
identified which of the eight does the good deed.  So we need to
either:

a) revert the original two fixes "ipc: fix potential oops when src
   msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than
   max" or 

b) try reverting "ipc: don't allocate a copy larger than max", as
   "ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty
   harmless or

c) work out which of the remaining 8 fixes the new oops and merge that or

d) merge all 8 fixes or

e) something else.

Whichever way we go, we should get a wiggle on - this has been hanging
around for too long.  Dave, do you have time to determine whether
reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
than max") fixes things up?



* Re: ipc,sem: sysv semaphore scalability
  2013-03-21 21:10 ` Andrew Morton
@ 2013-03-21 21:47   ` Peter Hurley
  2013-03-21 21:50   ` Peter Hurley
  2013-03-26 19:28   ` Dave Jones
  2 siblings, 0 replies; 129+ messages in thread
From: Peter Hurley @ 2013-03-21 21:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones

On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> > 
> > The first four patches were written by Davidlohr Bueso, and
> > reduce the hold time of the semaphore lock.
> > 
> > The last three patches change the sysv semaphore code locking
> > to be more fine grained, providing a performance boost when
> > multiple semaphores in a semaphore array are being manipulated
> > simultaneously.
> 
> These patches conflict pretty badly with Peter's:
> 
> ipc-clamp-with-min.patch
> ipc-separate-msg-allocation-from-userspace-copy.patch
> ipc-tighten-msg-copy-loops.patch
> ipc-set-efault-as-default-error-in-load_msg.patch
> ipc-remove-msg-handling-from-queue-scan.patch
> ipc-implement-msg_copy-as-a-new-receive-mode.patch
> ipc-simplify-msg-list-search.patch
> ipc-refactor-msg-list-search-into-separate-function.patch
> ipc-refactor-msg-list-search-into-separate-function-fix.patch
> 
> (they're at http://ozlabs.org/~akpm/mmots/broken-out/)
> 
> 
> 
> We're in a bit of a mess at present.
> 
> Last month Peter sent a ten-patch series which fixed an oops (Subject:
> "ipc MSG_COPY fixes").  The series did other stuff, so we merged into
> mainline just two bugfix patches.  Then davej hit a trinity oops
> (Subject: "ipc/testmsg GPF.") and testing confirmed that the remaining
> eight patches in Peter's series fixed that up.

Just to clarify the history.

A while back on linux-next both Dave and I were hitting ipc oopses on
trinity, which was fallout from the MSG_COPY implementation.

One of those oopses was the ipc/testmsg oops.

I was trying to push out the new tty ldisc patchset and trinity is a
decent tool for exploring race conditions (which the tty layer was full
of), so the oopses were in my way.

Because it was nearly impossible to statically analyze the existing ipc
code, I just started cleaning up. As I did, I found the two
by-then-obvious bug fixes. Also, assigning the 'msg' variable in the
search loop looked suspicious so I factored that search loop out as
well. That was the whole 10-patch "ipc MSG_COPY fixes" series.

I wasn't getting the ipc/testmsg bug anymore and I let Dave know, so all
was good.

When Andrew asked me if the whole series needed to go into 3.9, I said I
didn't know.

So when just 2 of the 10 patches went in, Dave started to hit the
ipc/testmsg bug again on 3.9.

> Nobody has yet
> identified which of the eight does the good deed.  So we need to
> either:
> 
> a) revert the original two fixes "ipc: fix potential oops when src
>    msg > 4k w/ MSG_COPY" and "ipc: don't allocate a copy larger than
>    max" or 
> 
> b) try reverting "ipc: don't allocate a copy larger than max", as
>    "ipc: fix potential oops when src msg > 4k w/ MSG_COPY" looks pretty
>    harmless or
> 
> c) work out which of the remaining 8 fixes the new oops and merge that or
> 
> d) merge all 8 fixes or
> 
> e) something else.


The fix is in the other 8 patches. But I have to say, the base code was
such a mess I have no idea which of the other 8 patches fixes the
ipc/testmsg oops or how.

I guarantee reverting those 2 fixes from my series will not make the
ipc/testmsg oops go away.


> Whichever way we go, we should get a wiggle on - this has been hanging
> around for too long.  Dave, do you have time to determine whether
> reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
> than max") fixes things up?




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-21 21:10 ` Andrew Morton
  2013-03-21 21:47   ` Peter Hurley
@ 2013-03-21 21:50   ` Peter Hurley
  2013-03-21 22:01     ` Andrew Morton
  2013-03-26 19:28   ` Dave Jones
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Hurley @ 2013-03-21 21:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones

On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> > 
> > The first four patches were written by Davidlohr Buesso, and
> > reduce the hold time of the semaphore lock.
> > 
> > The last three patches change the sysv semaphore code locking
> > to be more fine grained, providing a performance boost when
> > multiple semaphores in a semaphore array are being manipulated
> > simultaneously.
> 
> These patches conflict pretty badly with Peter's:

On one point I'm a little confused: my series has been in linux-next for
a while. On what tree is this series based?



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-21 21:50   ` Peter Hurley
@ 2013-03-21 22:01     ` Andrew Morton
  2013-03-22  3:38       ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Andrew Morton @ 2013-03-21 22:01 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones

On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley <peter@hurleysoftware.com> wrote:

> On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
> > On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel <riel@surriel.com> wrote:
> > 
> > > This series makes the sysv semaphore code more scalable,
> > > by reducing the time the semaphore lock is held, and making
> > > the locking more scalable for semaphore arrays with multiple
> > > semaphores.
> > > 
> > > The first four patches were written by Davidlohr Buesso, and
> > > reduce the hold time of the semaphore lock.
> > > 
> > > The last three patches change the sysv semaphore code locking
> > > to be more fine grained, providing a performance boost when
> > > multiple semaphores in a semaphore array are being manipulated
> > > simultaneously.
> > 
> > These patches conflict pretty badly with Peter's:
> 
> On one point I'm a little confused: my series has been in linux-next for
> a while. On what tree is this series based?

It'll be based on mainline.  People often forget to peek into
linux-next when preparing patches.  In the great majority of cases
that's OK.  Occasionally, we lose...


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (8 preceding siblings ...)
  2013-03-21 21:10 ` Andrew Morton
@ 2013-03-22  1:12 ` Davidlohr Bueso
  2013-03-22  1:23   ` Linus Torvalds
  2013-03-22  7:30 ` Mike Galbraith
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-22  1:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: akpm, hhuang, jason.low2, walken, lwoodman, chegu_vinod,
	linux-kernel, torvalds

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> 	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
> threads			patches		rwlock patches	v3 patches
> 10	610652		726325		1783589		2142206
> 20	341570		365699		1520453		1977878
> 30	288102		307037		1498167		2037995
> 40	290714		305955		1612665		2256484
> 50	288620		312890		1733453		2650292
> 60	289987		306043		1649360		2388008
> 70	291298		306347		1723167		2717486
> 80	290948		305662		1729545		2763582
> 90	290996		306680		1736021		2757524
> 100	292243		306700		1773700		3059159
> 

After testing these patches with my Oracle Swingbench DSS workload, I
can say that there are significant improvements. The ipc lock contention
was reduced drastically, especially with larger numbers of benchmark
users. As a result, the overall %sys time went down as well.
Furthermore, throughput (in transactions per second) increased.

TPS:
100 users: 1257.21 (vanilla)    2805.06 (v3 patchset)
400 users: 1437.57 (vanilla)    2664.67 (v3 patchset)
800 users: 1236.89 (vanilla)    2750.73 (v3 patchset)

ipc lock contention:
100 users:  8.74%  (vanilla)    3.17% (v3 patchset)
400 users:  21.86% (vanilla)    5.23% (v3 patchset)
800 users:  84.35% (vanilla)    7.39% (v3 patchset)

As seen with perf, the ipc lock isn't even the main source of contention
anymore. Also, no matter how many benchmark users there are, the lock's
main user is semctl_main().

100 users:
    3.17%           oracle  [kernel.kallsyms]   [k] _raw_spin_lock                            
                     |
                     --- _raw_spin_lock
                        |          
                        |--50.53%-- sem_lock
                        |          |          
                        |          |--82.60%-- semctl_main
                        |           --17.40%-- sys_semtimedop

400 users:
    5.23%           oracle  [kernel.kallsyms]   [k] _raw_spin_lock                            
                     |
                     --- _raw_spin_lock
                        |          
                        |--75.81%-- sem_lock
                        |          |          
                        |          |--94.09%-- semctl_main
                        |           --5.91%-- sys_semtimedop


800 users:
     7.39%           oracle  [kernel.kallsyms]   [k] _raw_spin_lock                            
                     |
                     --- _raw_spin_lock
                        |          
                        |--81.71%-- sem_lock
                        |          |          
                        |          |--64.98%-- semctl_main
                        |           --35.02%-- sys_semtimedop


Thanks,
Davidlohr



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 5/7] ipc,sem: open code and rename sem_lock
  2013-03-20 19:55 ` [PATCH 5/7] ipc,sem: open code and rename sem_lock Rik van Riel
@ 2013-03-22  1:14   ` Davidlohr Bueso
  0 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-22  1:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Rename sem_lock to sem_obtain_lock, so we can introduce a sem_lock
> function later that only locks the sem_array and does nothing else.
> 
> Open code the locking from ipc_lock in sem_obtain_lock, so we can
> introduce finer grained locking for the sem_array in the next patch.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 6/7] ipc,sem: have only one list in struct sem_queue
  2013-03-20 19:55 ` [PATCH 6/7] ipc,sem: have only one list in struct sem_queue Rik van Riel
@ 2013-03-22  1:14   ` Davidlohr Bueso
  0 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-22  1:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Having only one list in struct sem_queue, and only queueing simple
> semaphore operations on the list for the semaphore involved, allows
> us to introduce finer grained locking for semtimedop.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 7/7] ipc,sem: fine grained locking for semtimedop
  2013-03-20 19:55 ` [PATCH 7/7] ipc,sem: fine grained locking for semtimedop Rik van Riel
@ 2013-03-22  1:14   ` Davidlohr Bueso
  2013-03-22 23:01   ` Michel Lespinasse
  1 sibling, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-22  1:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod, Rik van Riel

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Introduce finer grained locking for semtimedop, to handle the
> common case of a program wanting to manipulate one semaphore
> from an array with multiple semaphores.
> 
> If the call is a semop manipulating just one semaphore in
> an array with multiple semaphores, only take the lock for
> that semaphore itself.
> 
> If the call needs to manipulate multiple semaphores, or
> another caller is in a transaction that manipulates multiple
> semaphores, the sem_array lock is taken, as well as all the
> locks for the individual semaphores.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> 	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
> threads			patches		rwlock patches	v3 patches
> 10	610652		726325		1783589		2142206
> 20	341570		365699		1520453		1977878
> 30	288102		307037		1498167		2037995
> 40	290714		305955		1612665		2256484
> 50	288620		312890		1733453		2650292
> 60	289987		306043		1649360		2388008
> 70	291298		306347		1723167		2717486
> 80	290948		305662		1729545		2763582
> 90	290996		306680		1736021		2757524
> 100	292243		306700		1773700		3059159
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>

Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-22  1:12 ` Davidlohr Bueso
@ 2013-03-22  1:23   ` Linus Torvalds
  2013-03-22  3:40     ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-22  1:23 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Rik van Riel, Andrew Morton, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Linux Kernel Mailing List

On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
>
> ipc lock contention:
> 100 users:  8.74%  (vanilla)    3.17% (v3 patchset)
> 400 users:  21.86% (vanilla)    5.23% (v3 patchset)
> 800 users:  84.35% (vanilla)    7.39% (v3 patchset)

Ok, I'd call that pretty much "solved". Sure, it's still visible, but
for being a benchmark that apparently does little else than pound on
those sysv semaphores, I think we can consider it pretty much fine.
I'm going to assume that anybody who actually then does any real work
(ie a database) is never going to see even close to this bad
contention.

Good job, Rik. I'm assuming we'll be merging this during the 3.10
merge window, and hopefully the merge conflicts will be sorted out
too. Rik, Peter, can you look at each others patches and see if you
can get that sorted out for Andrew?

            Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-21 22:01     ` Andrew Morton
@ 2013-03-22  3:38       ` Rik van Riel
  0 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-22  3:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Hurley, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones

On 03/21/2013 06:01 PM, Andrew Morton wrote:
> On Thu, 21 Mar 2013 17:50:05 -0400 Peter Hurley <peter@hurleysoftware.com> wrote:
>
>> On Thu, 2013-03-21 at 14:10 -0700, Andrew Morton wrote:
>>> On Wed, 20 Mar 2013 15:55:30 -0400 Rik van Riel <riel@surriel.com> wrote:
>>>
>>>> This series makes the sysv semaphore code more scalable,
>>>> by reducing the time the semaphore lock is held, and making
>>>> the locking more scalable for semaphore arrays with multiple
>>>> semaphores.
>>>>
>>>> The first four patches were written by Davidlohr Buesso, and
>>>> reduce the hold time of the semaphore lock.
>>>>
>>>> The last three patches change the sysv semaphore code locking
>>>> to be more fine grained, providing a performance boost when
>>>> multiple semaphores in a semaphore array are being manipulated
>>>> simultaneously.
>>>
>>> These patches conflict pretty badly with Peter's:
>>
>> On one point I'm a little confused: my series has been in linux-next for
>> a while. On what tree is this series based?
>
> It'll be based on mainline.  People often forget to peek into
> linux-next when preparing patches.  In the great majority of cases
> that's OK.  Occasionally, we lose...

I'll be happy to rebase the series onto linux-next, to make
merging easier for you.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-22  1:23   ` Linus Torvalds
@ 2013-03-22  3:40     ` Rik van Riel
  0 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-22  3:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Andrew Morton, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Linux Kernel Mailing List

On 03/21/2013 09:23 PM, Linus Torvalds wrote:
> On Thu, Mar 21, 2013 at 6:12 PM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
>>
>> ipc lock contention:
>> 100 users:  8.74%  (vanilla)    3.17% (v3 patchset)
>> 400 users:  21.86% (vanilla)    5.23% (v3 patchset)
>> 800 users:  84.35% (vanilla)    7.39% (v3 patchset)
>
> Ok, I'd call that pretty much "solved". Sure, it's still visible, but
> for being a benchmark that apparently does little else than pound on
> those sysv semaphores, I think we can consider it pretty much fine.
> I'm going to assume that anybody who actually then does any real work
> (ie a database) is never going to see even close to this bad
> contention.
>
> Good job, Rik. I'm assuming we'll be merging this during the 3.10
> merge window, and hopefully the merge conflicts will be sorted out
> too. Rik, Peter, can you look at each others patches and see if you
> can get that sorted out for Andrew?

Will do.

I will rebase this series on top of what is in linux-next.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (9 preceding siblings ...)
  2013-03-22  1:12 ` Davidlohr Bueso
@ 2013-03-22  7:30 ` Mike Galbraith
  2013-03-22 11:04 ` Emmanuel Benisty
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 129+ messages in thread
From: Mike Galbraith @ 2013-03-22  7:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote: 
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> 	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
> threads			patches		rwlock patches	v3 patches
> 10	610652		726325		1783589		2142206
> 20	341570		365699		1520453		1977878
> 30	288102		307037		1498167		2037995
> 40	290714		305955		1612665		2256484
> 50	288620		312890		1733453		2650292
> 60	289987		306043		1649360		2388008
> 70	291298		306347		1723167		2717486
> 80	290948		305662		1729545		2763582
> 90	290996		306680		1736021		2757524
> 100	292243		306700		1773700		3059159

Hi Rik,

I plugged this set into the enterprise -rt kernel, and beat on four boxen.
I ran into no trouble while giving the boxen a generic drubbing, fwtw.

Some semop-multi -rt numbers for an abby-normal 8 node box, and a
mundane 4 node box below.


32 cores+SMT, 3.0.68-rt92

numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 32733 MB
node 0 free: 29910 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 32768 MB
node 1 free: 30396 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 32768 MB
node 2 free: 30568 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 32767 MB
node 3 free: 28706 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 

SCHED_OTHER
       -v3 set        +v3 set
threads
10      438485        1744599
20      376411        1580813
30      396639        1546841
40      449062        2152861
50      477594        2431344
60      446453        1874476
70      578705        2047884
80      607928        2144898
90      662136        2171074
100     622889        2295879
200     709668        2867273
300     661902        3008695
400     641758        3273250
500     614117        3403775

SCHED_FIFO
       -v3 set        +v3 set
threads
10     158656          914343
20      99784         1133775
30      84245         1099604
40      89725         1756577
50      85697         1607893
60      84033         1467272
70      86833         1979405
80      91451         1922250
90      92484         1990699
100     90691         2067705
200    103692         2308652
300    101161         2376921
400    103722         2417580
500    108838         2443349


64 core box (poor thing, smi free zone though), 3.0.68-rt92

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 8181 MB
node 0 free: 6309 MB
node distances:
node   0
  0:  10

SCHED_OTHER
       -v3 set        +v3 set
threads
10      677534        2304417
20      451507        1873474
30      356876        1542688
40      329585        1500392
50      415761        1318676
60      403482        1380644
70      394089        1185165
80      407408        1191834
90      445764        1249156
100     430823        1245573
200     425470        1421686
300     427092        1480379
400     497900        1516599
500     421927        1551309

SCHED_FIFO
10      323560        1882543
20      226257        1806862
30      187851        1263163
40      205337         881331
50      196806         766666
60      193218         612709
70      209432        1241797                            
80      240445        1269146
90      219865        1482649
100     227970        1473038
200     201354        1719977
300     183728        1823074
400     175051        1808792
500     243708        1849803



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (10 preceding siblings ...)
  2013-03-22  7:30 ` Mike Galbraith
@ 2013-03-22 11:04 ` Emmanuel Benisty
  2013-03-22 15:37   ` Linus Torvalds
  2013-03-22 17:51 ` Davidlohr Bueso
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-22 11:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod

Hi Rik,

On Thu, Mar 21, 2013 at 2:55 AM, Rik van Riel <riel@surriel.com> wrote:
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

I was trying your patchset and my machine died while building a
package. I could reproduce the bug the (only) two times I tried.
There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg

Thanks.
-- Emmanuel

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-22 11:04 ` Emmanuel Benisty
@ 2013-03-22 15:37   ` Linus Torvalds
  2013-03-23  3:19     ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-22 15:37 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> I was trying your patchset and my machine died while building a
> package. I could reproduce the bug the (only) two times I tried.
> There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg

Hmm. The original oops may well have scrolled off the window; what
remains just indicates that something is corrupted in slab, which isn't
likely to give many hints...

Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and
CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages?

                  Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (11 preceding siblings ...)
  2013-03-22 11:04 ` Emmanuel Benisty
@ 2013-03-22 17:51 ` Davidlohr Bueso
  2013-03-25 20:21 ` Sasha Levin
  2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
  14 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-22 17:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, linux-kernel, akpm, hhuang, jason.low2, walken,
	lwoodman, chegu_vinod

On Wed, 2013-03-20 at 15:55 -0400, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.
> 
> The first four patches were written by Davidlohr Buesso, and
> reduce the hold time of the semaphore lock.
> 
> The last three patches change the sysv semaphore code locking
> to be more fine grained, providing a performance boost when
> multiple semaphores in a semaphore array are being manipulated
> simultaneously.
> 
> On a 24 CPU system, performance numbers with the semop-multi
> test with N threads and N semaphores, look like this:
> 
> 	vanilla		Davidlohr's	Davidlohr's +	Davidlohr's +
> threads			patches		rwlock patches	v3 patches
> 10	610652		726325		1783589		2142206
> 20	341570		365699		1520453		1977878
> 30	288102		307037		1498167		2037995
> 40	290714		305955		1612665		2256484
> 50	288620		312890		1733453		2650292
> 60	289987		306043		1649360		2388008
> 70	291298		306347		1723167		2717486
> 80	290948		305662		1729545		2763582
> 90	290996		306680		1736021		2757524
> 100	292243		306700		1773700		3059159
> 

Some results with semop-multi on my 4 core laptop:

	vanilla		v3 patchset
threads
10       5094473         10289146
20       5079946         10187923
30       5041258         10660635
40       4942786         10876009
50       5076437         10759434
60       5139024         10797032
70       5103811         10698323
80       5094850          9959675
90       5085774         10054844
100      4939547          9798291





^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 7/7] ipc,sem: fine grained locking for semtimedop
  2013-03-20 19:55 ` [PATCH 7/7] ipc,sem: fine grained locking for semtimedop Rik van Riel
  2013-03-22  1:14   ` Davidlohr Bueso
@ 2013-03-22 23:01   ` Michel Lespinasse
  2013-03-22 23:38     ` Rik van Riel
  2013-03-22 23:42     ` [PATCH 7/7 part3] fix for sem_lock Rik van Riel
  1 sibling, 2 replies; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-22 23:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, lwoodman, chegu_vinod, Rik van Riel

Sorry for the late reply; I've been swamped and am behind on my upstream mail.

On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel <riel@surriel.com> wrote:
> +static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
> +                             int nsops)
> +{
> +       int locknum;
> +       if (nsops == 1 && !sma->complex_count) {
> +               struct sem *sem = sma->sem_base + sops->sem_num;
> +
> +               /* Lock just the semaphore we are interested in. */
> +               spin_lock(&sem->lock);
> +
> +               /*
> +                * If sma->complex_count was set while we were spinning,
> +                * we may need to look at things we did not lock here.
> +                */
> +               if (unlikely(sma->complex_count)) {
> +                       spin_unlock(&sma->sem_perm.lock);

I believe this should be spin_unlock(&sem->lock) instead ?

> +                       goto lock_all;
> +               }
> +               locknum = sops->sem_num;
> +       } else {
> +               int i;
> +               /* Lock the sem_array, and all the semaphore locks */
> + lock_all:
> +               spin_lock(&sma->sem_perm.lock);
> +               for (i = 0; i < sma->sem_nsems; i++) {

Do we have to lock every sem from the array instead of just the ones
that are being operated on in sops ?
(I'm not sure either way, as I don't fully understand the queueing of
complex ops)

If we want to keep the loop as is, then we may at least remove the
sops argument to sem_lock() since we only care about nsops.

> +                       struct sem *sem = sma->sem_base + i;
> +                       spin_lock(&sem->lock);
> +               }
> +               locknum = -1;
> +       }
> +       return locknum;
> +}

That's all I have. Very nice test results BTW!

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH 7/7] ipc,sem: fine grained locking for semtimedop
  2013-03-22 23:01   ` Michel Lespinasse
@ 2013-03-22 23:38     ` Rik van Riel
  2013-03-22 23:42     ` [PATCH 7/7 part3] fix for sem_lock Rik van Riel
  1 sibling, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-22 23:38 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod

On 03/22/2013 07:01 PM, Michel Lespinasse wrote:
> Sorry for the late reply; I've been swamped and am behind on my upstream mail.
>
> On Wed, Mar 20, 2013 at 12:55 PM, Rik van Riel <riel@surriel.com> wrote:
>> +static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>> +                             int nsops)
>> +{
>> +       int locknum;
>> +       if (nsops == 1 && !sma->complex_count) {
>> +               struct sem *sem = sma->sem_base + sops->sem_num;
>> +
>> +               /* Lock just the semaphore we are interested in. */
>> +               spin_lock(&sem->lock);
>> +
>> +               /*
>> +                * If sma->complex_count was set while we were spinning,
>> +                * we may need to look at things we did not lock here.
>> +                */
>> +               if (unlikely(sma->complex_count)) {
>> +                       spin_unlock(&sma->sem_perm.lock);
>
> I believe this should be spin_unlock(&sem->lock) instead ?

You are right. Good catch.

I'll send a one-liner fix for this to Andrew Morton,
since he already applied the patches in -mm.

>> +                       goto lock_all;
>> +               }
>> +               locknum = sops->sem_num;
>> +       } else {
>> +               int i;
>> +               /* Lock the sem_array, and all the semaphore locks */
>> + lock_all:
>> +               spin_lock(&sma->sem_perm.lock);
>> +               for (i = 0; i < sma->sem_nsems; i++) {
>
> Do we have to lock every sem from the array instead of just the ones
> that are being operated on in sops ?
> (I'm not sure either way, as I don't fully understand the queueing of
> complex ops)

We should be able to get away with locking just the ones that are
being operated on.  However, that does require we remember which
ones we locked, so we know which ones to unlock again.

I do not know whether that additional complexity is worthwhile,
especially considering that in Davidlohr's profile, the major
caller locking everybody appears to be semctl, which passes no
semops into the kernel but simply operates on all semaphores
at once (SETALL).
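
(Illustration only, not code from the series: a rough sketch of what
"lock just the ones being operated on, and remember them" could look
like. The helper names and the caller-supplied bitmap are assumptions
made for this sketch, and the complex_count handling of the real
sem_lock() is left out entirely.)

        static void sem_lock_sops(struct sem_array *sma, struct sembuf *sops,
                                  int nsops, unsigned long *locked)
        {
                int i;

                /* Mark the semaphores this operation touches. */
                bitmap_zero(locked, sma->sem_nsems);
                for (i = 0; i < nsops; i++)
                        set_bit(sops[i].sem_num, locked);

                /* Take the locks in ascending index order to avoid ABBA deadlocks. */
                for_each_set_bit(i, locked, sma->sem_nsems)
                        spin_lock(&sma->sem_base[i].lock);
        }

        static void sem_unlock_sops(struct sem_array *sma, unsigned long *locked)
        {
                int i;

                for_each_set_bit(i, locked, sma->sem_nsems)
                        spin_unlock(&sma->sem_base[i].lock);
        }

The bitmap bookkeeping (and allocating the bitmap for large arrays) is
the extra complexity being weighed above.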

> If we want to keep the loop as is, then we may at least remove the
> sops argument to sem_lock() since we only care about nsops.

We need to know exactly which semaphore to lock when we
are locking only one. The sops argument is used to figure
out which one.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH 7/7 part3] fix for sem_lock
  2013-03-22 23:01   ` Michel Lespinasse
  2013-03-22 23:38     ` Rik van Riel
@ 2013-03-22 23:42     ` Rik van Riel
  1 sibling, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-22 23:42 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, lwoodman, chegu_vinod, Rik van Riel


> > +               /*
> > +                * If sma->complex_count was set while we were spinning,
> > +                * we may need to look at things we did not lock here.
> > +                */
> > +               if (unlikely(sma->complex_count)) {
> > +                       spin_unlock(&sma->sem_perm.lock);
> 
> I believe this should be spin_unlock(&sem->lock) instead ?

Michel, thanks for spotting this!

Andrew, could you fold this fix into my patch 7/7 before submitting
things for 3.10?  Thank you.

--->8---
Fix a typo in sem_lock.  Of course we need to unlock the local
semaphore lock before jumping to lock_all, in the rare case that
somebody started a complex operation while we were spinning on
the spinlock.

Can be folded into patch 7/7 before merging

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
---
 ipc/sem.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index a4b93fb..450248e 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -346,7 +346,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 * we may need to look at things we did not lock here.
 		 */
 		if (unlikely(sma->complex_count)) {
-			spin_unlock(&sma->sem_perm.lock);
+			spin_unlock(&sem->lock);
 			goto lock_all;
 		}
 		locknum = sops->sem_num;

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-22 15:37   ` Linus Torvalds
@ 2013-03-23  3:19     ` Emmanuel Benisty
  2013-03-23 19:45       ` Linus Torvalds
  2013-05-04 15:55       ` Jörn Engel
  0 siblings, 2 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-23  3:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

Hi Linus,

On Fri, Mar 22, 2013 at 10:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 22, 2013 at 4:04 AM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>>
>> I was trying your patchset and my machine died while building a
>> package. I could reproduce the bug the (only) two times I tried.
>> There's a poor quality picture here: http://i.imgur.com/MuYuyQC.jpg
>
> Hmm. The original oops may well have scrolled off the window, what
> remains is just indicating something is corrupted in slab. Which isn't
> likely to give much hints..
>
> Can you try to reproduce this with CONFIG_SLUB_DEBUG=y and
> CONFIG_SLUB_DEBUG_ON=y, and see if there are any earlier messages?

I could reproduce it but could you please let me know what would be
the right tools I should use to catch the original oops?
This is what I got but I doubt it will be helpful:
http://i.imgur.com/Mewi1hC.jpg

Thanks.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-23  3:19     ` Emmanuel Benisty
@ 2013-03-23 19:45       ` Linus Torvalds
  2013-03-24 13:46         ` Emmanuel Benisty
  2013-05-04 15:55       ` Jörn Engel
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-23 19:45 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> I could reproduce it but could you please let me know what would be
> the right tools I should use to catch the original oops?
> This is what I got but I doubt it will be helpful:
> http://i.imgur.com/Mewi1hC.jpg

In this case, I think the best thing to do would be to comment out all
of drm_warn_on_modeset_not_all_locked(), because those warnings make
the original problem (which probably caused the locking problem it is
warning about in the first place) scroll away.

That said, you should also take the oneliner fix that Rik posted to
patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply
that, just to make sure that you aren't possibly hitting a bug with
the shared-memory locking introduced by that (unusual) case.

                    Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-23 19:45       ` Linus Torvalds
@ 2013-03-24 13:46         ` Emmanuel Benisty
  2013-03-24 17:10           ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-24 13:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Sun, Mar 24, 2013 at 2:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 22, 2013 at 8:19 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>>
>> I could reproduce it but could you please let me know what would be
>> the right tools I should use to catch the original oops?
>> This is what I got but I doubt it will be helpful:
>> http://i.imgur.com/Mewi1hC.jpg
>
> In this case, I think the best thing to do would be to comment out all
> of drm_warn_on_modeset_not_all_locked(), because those warnings make
> the original problem (that probably caused the lock problem in the
> first place that it is warning about) scroll away.
>
> That said, you should also take the oneliner fix that Rik posted to
> patch 7 (subject line: "[PATCH 7/7 part3] fix for sem_lock") and apply
> that, just to make sure that you aren't possibly hitting a bug with
> the shared-memory locking introduced by that (unusual) case.

Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1
patches): http://i.imgur.com/BebGZxV.jpg

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-24 13:46         ` Emmanuel Benisty
@ 2013-03-24 17:10           ` Linus Torvalds
  2013-03-25 13:47             ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-24 17:10 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Sun, Mar 24, 2013 at 6:46 AM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> Thanks Linus. I hope I got this right, here's the result (3.9-rc4, 7+1
> patches): http://i.imgur.com/BebGZxV.jpg

Ok, that's *slightly* more informative, but not much. At least now we
see the actual page fault information, and see what the bad
dereference was.

It seems to be a branch through the rcu list "->func" pointer in the
rcu callbacks, and the ->func pointer has been corrupted. Instead of
being a valid kernel pointer (or a "kfree_rcu_offset" marker, which is
a small number between 0-4096), it has the odd value
"0x0000006400000064". Two words of decimal "100", in other words.

That's not one of the usual "use-after-free" patterns or anything like
that, so I don't see what it would be. So I have to admit to not
really having any better clue about what is going on. Sometimes the
corruption pattern gives a hint of what it was that overwrote it, but
not here...
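
(Aside, added for context rather than taken from the mail: the
"kfree_rcu_offset" marker works roughly as sketched below. kfree_rcu()
stores the offset of the rcu_head within the enclosing object in ->func,
and the callback dispatcher treats small values as such offsets. Names
are paraphrased, not the exact kernel source.)

        static inline bool is_kfree_rcu_offset(unsigned long offset)
        {
                /* Small values are offsets of the rcu_head inside the object. */
                return offset < 4096;
        }

        static void invoke_one_rcu_callback(struct rcu_head *head)
        {
                unsigned long offset = (unsigned long)head->func;

                if (is_kfree_rcu_offset(offset))
                        kfree((void *)head - offset);   /* kfree_rcu() case */
                else
                        head->func(head);               /* call_rcu() case */
        }

A ->func value of 0x0000006400000064 is neither a small offset nor a
valid kernel text address, so the dispatcher ends up branching to
garbage, which is consistent with the fault described above.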

And you never see this problem without Rik's patches? Could you bisect
*which* patch it starts with? Are the first four ones ok (the moving
of the locking around, but without the fine-grained ones), for
example?

Another thing to try might be to enable SLUB debugging (ie make sure that all of

  CONFIG_SLUB_DEBUG=y
  CONFIG_SLUB=y
  CONFIG_SLUB_DEBUG_ON=y

are set in your kernel config), which might help pin things down a
bit. Sometimes that makes any allocation problems show up earlier in
the path, so that it's more obvious who screwed up.

              Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-24 17:10           ` Linus Torvalds
@ 2013-03-25 13:47             ` Emmanuel Benisty
  2013-03-25 14:00               ` Rik van Riel
                                 ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-25 13:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> And you never see this problem without Rik's patches?

No, never.

> Could you bisect
> *which* patch it starts with? Are the first four ones ok (the moving
> of the locking around, but without the fine-grained ones), for
> example?

With the first four patches only, I got some X server freeze (just tried once).

> Another thing to try might be to enable SLUB debugging (ie make sure that all of
>
>   CONFIG_SLUB_DEBUG=y
>   CONFIG_SLUB=y
>   CONFIG_SLUB_DEBUG_ON=y
>
> are set in your kernel config)

My bad, I forgot to enable this in my former build, sorry. Here's what
I get now:
http://i.imgur.com/G6H8KgD.jpg

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 13:47             ` Emmanuel Benisty
@ 2013-03-25 14:00               ` Rik van Riel
  2013-03-25 14:03                 ` Rik van Riel
  2013-03-25 14:01               ` Rik van Riel
  2013-03-26 17:59               ` Davidlohr Bueso
  2 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-25 14:00 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> And you never see this problem without Rik's patches?
>
> No, never.
>
>> Could you bisect
>> *which* patch it starts with? Are the first four ones ok (the moving
>> of the locking around, but without the fine-grained ones), for
>> example?
>
> With the first four patches only, I got some X server freeze (just tried once).

Could you try booting with panic=1 so the kernel panics on the first
oops?

Maybe that way (if we are lucky) we will be able to capture the first
oops, and maybe get an idea of what causes the problem.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 13:47             ` Emmanuel Benisty
  2013-03-25 14:00               ` Rik van Riel
@ 2013-03-25 14:01               ` Rik van Riel
  2013-03-25 14:21                 ` Emmanuel Benisty
  2013-03-26 17:59               ` Davidlohr Bueso
  2 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-25 14:01 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> And you never see this problem without Rik's patches?
>
> No, never.
>
>> Could you bisect
>> *which* patch it starts with? Are the first four ones ok (the moving
>> of the locking around, but without the fine-grained ones), for
>> example?
>
> With the first four patches only, I got some X server freeze (just tried once).

Another random question, just to rule something out. What video driver
are you using on your system?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 14:00               ` Rik van Riel
@ 2013-03-25 14:03                 ` Rik van Riel
  2013-03-25 15:20                   ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-25 14:03 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On 03/25/2013 10:00 AM, Rik van Riel wrote:
> On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:
>> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> And you never see this problem without Rik's patches?
>>
>> No, never.
>>
>>> Could you bisect
>>> *which* patch it starts with? Are the first four ones ok (the moving
>>> of the locking around, but without the fine-grained ones), for
>>> example?
>>
>> With the first four patches only, I got some X server freeze (just
>> tried once).
>
> Could you try booting with panic=1 so the kernel panics on the first
> oops?

Sorry that should be "oops=panic"

> Maybe that way (if we are lucky) we will be able to capture the first
> oops, and maybe get an idea of what causes the problem.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 14:01               ` Rik van Riel
@ 2013-03-25 14:21                 ` Emmanuel Benisty
  0 siblings, 0 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-25 14:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Mon, Mar 25, 2013 at 9:01 PM, Rik van Riel <riel@surriel.com> wrote:
> On 03/25/2013 09:47 AM, Emmanuel Benisty wrote:
>>
>> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> And you never see this problem without Rik's patches?
>>
>>
>> No, never.
>>
>>> Could you bisect
>>> *which* patch it starts with? Are the first four ones ok (the moving
>>> of the locking around, but without the fine-grained ones), for
>>> example?
>>
>>
>> With the first four patches only, I got some X server freeze (just tried
>> once).
>
>
> Another random question, just to rule something out. What video driver
> are you using on your system?

intel on i3 330m

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 14:03                 ` Rik van Riel
@ 2013-03-25 15:20                   ` Emmanuel Benisty
  2013-03-25 15:53                     ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-25 15:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel <riel@surriel.com> wrote:
>>> With the first four patches only, I got some X server freeze (just
>>> tried once).
>>
>>
>> Could you try booting with panic=1 so the kernel panics on the first
>> oops?
>
>
> Sorry that should be "oops=panic"
>
>
>> Maybe that way (if we are lucky) we will be able to capture the first
>> oops, and maybe get an idea of what causes the problem.

Sorry Rik, I get all kinds of weird behaviors (wireless dies, compiling
gets stuck and is impossible to kill, can't kill X) with the 4
patches+oops=panic, but no trace. Here is the result with the 7+1
patches and an oops=panic boot: http://i.imgur.com/1jep1qx.jpg

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 15:20                   ` Emmanuel Benisty
@ 2013-03-25 15:53                     ` Rik van Riel
  2013-03-25 17:09                       ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-25 15:53 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On 03/25/2013 11:20 AM, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel <riel@surriel.com> wrote:
>>>> With the first four patches only, I got some X server freeze (just
>>>> tried once).
>>>
>>>
>>> Could you try booting with panic=1 so the kernel panics on the first
>>> oops?
>>
>>
>> Sorry that should be "oops=panic"
>>
>>
>>> Maybe that way (if we are lucky) we will be able to capture the first
>>> oops, and maybe get an idea of what causes the problem.
>
> Sorry Rik, I get all kinds of weird behaviors (wireless dies, compiling
> gets stuck and is impossible to kill, can't kill X) with the 4
> patches+oops=panic, but no trace. Here is the result with the 7+1
> patches and an oops=panic boot: http://i.imgur.com/1jep1qx.jpg

This may be a stupid question, but do you re-compile and re-install
the kernel modules every time you change the kernel?

The behaviour you report with just the first four patches is so
random, it sounds almost like a mismatched data structure between
compiles...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 15:53                     ` Rik van Riel
@ 2013-03-25 17:09                       ` Emmanuel Benisty
  0 siblings, 0 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-25 17:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Davidlohr Bueso, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Mon, Mar 25, 2013 at 10:53 PM, Rik van Riel <riel@surriel.com> wrote:
> On 03/25/2013 11:20 AM, Emmanuel Benisty wrote:
>>
>> On Mon, Mar 25, 2013 at 9:03 PM, Rik van Riel <riel@surriel.com> wrote:
>>>>>
>>>>> With the first four patches only, I got some X server freeze (just
>>>>> tried once).
>>>>
>>>>
>>>>
>>>> Could you try booting with panic=1 so the kernel panics on the first
>>>> oops?
>>>
>>>
>>>
>>> Sorry that should be "oops=panic"
>>>
>>>
>>>> Maybe that way (if we are lucky) we will be able to capture the first
>>>> oops, and maybe get an idea of what causes the problem.
>>
>>
>> Sorry Rik, I get all kinds of weird behaviors (wireless dies, compiling
>> gets stuck and is impossible to kill, can't kill X) with the 4
>> patches+oops=panic, but no trace. Here is the result with the 7+1
>> patches and an oops=panic boot: http://i.imgur.com/1jep1qx.jpg
>
>
> This may be a stupid question, but do you re-compile and re-install
> the kernel modules every time you change the kernel?
>
> The behaviour you report with just the first four patches is so
> random, it sounds almost like a mismatched data structure between
> compiles...

Yes it's OK. I even started from scratch just in case but I'm still
getting the same weird things. Everything works just fine with a build
from Linus' tree which I normally use. I'll try the patches on another
machine tomorrow.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (12 preceding siblings ...)
  2013-03-22 17:51 ` Davidlohr Bueso
@ 2013-03-25 20:21 ` Sasha Levin
  2013-03-25 20:38   ` [PATCH -mm -next] ipc,sem: fix lockdep false positive Rik van Riel
  2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
  14 siblings, 1 reply; 129+ messages in thread
From: Sasha Levin @ 2013-03-25 20:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones

On 03/20/2013 03:55 PM, Rik van Riel wrote:
> Include lkml in the CC: this time... *sigh*
> ---8<---
> 
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

Hi Rik,

I'm getting the following false positives from lockdep:

[   80.492995] =============================================
[   80.494052] [ INFO: possible recursive locking detected ]
[   80.494878] 3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315 Tainted: G        W
[   80.496228] ---------------------------------------------
[   80.497171] trinity-child9/7210 is trying to acquire lock:
[   80.497934]  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [<ffffffff8192da37>] newary+0x1c7/0x2a0
[   80.499202]
[   80.499202] but task is already holding lock:
[   80.500031]  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [<ffffffff8192da37>] newary+0x1c7/0x2a0
[   80.500031]
[   80.500031] other info that might help us debug this:
[   80.500031]  Possible unsafe locking scenario:
[   80.500031]
[   80.500031]        CPU0
[   80.500031]        ----
[   80.500031]   lock(&(&sma->sem_base[i].lock)->rlock);
[   80.500031]   lock(&(&sma->sem_base[i].lock)->rlock);
[   80.500031]
[   80.500031]  *** DEADLOCK ***
[   80.500031]
[   80.500031]  May be due to missing lock nesting notation
[   80.500031]
[   80.500031] 4 locks held by trinity-child9/7210:
[   80.500031]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8192a422>] ipcget+0x72/0x340
[   80.500031]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81929b65>] ipc_addid+0x35/0x230
[   80.500031]  #2:  (&(&new->lock)->rlock){+.+...}, at: [<ffffffff81929c00>] ipc_addid+0xd0/0x230
[   80.500031]  #3:  (&(&sma->sem_base[i].lock)->rlock){+.+...}, at: [<ffffffff8192da37>] newary+0x1c7/0x2a0
[   80.500031]
[   80.500031] stack backtrace:
[   80.500031] Pid: 7210, comm: trinity-child9 Tainted: G        W    3.9.0-rc4-next-20130325-sasha-00044-gcb6ef58 #315
[   80.500031] Call Trace:
[   80.500031]  [<ffffffff8117f65e>] __lock_acquire+0xc6e/0x1e50
[   80.500031]  [<ffffffff819ff225>] ? idr_get_empty_slot+0x255/0x3c0
[   80.500031]  [<ffffffff8117ca1e>] ? mark_held_locks+0x12e/0x150
[   80.500031]  [<ffffffff811810ba>] lock_acquire+0x1aa/0x240
[   80.500031]  [<ffffffff8192da37>] ? newary+0x1c7/0x2a0
[   80.500031]  [<ffffffff83d8768b>] _raw_spin_lock+0x3b/0x70
[   80.500031]  [<ffffffff8192da37>] ? newary+0x1c7/0x2a0
[   80.500031]  [<ffffffff8192da37>] newary+0x1c7/0x2a0
[   80.500031]  [<ffffffff8192a422>] ? ipcget+0x72/0x340
[   80.500031]  [<ffffffff8192a5d6>] ipcget+0x226/0x340
[   80.500031]  [<ffffffff81930215>] SyS_semget+0x65/0x80
[   80.500031]  [<ffffffff8192d870>] ? semctl_down.constprop.8+0x320/0x320
[   80.500031]  [<ffffffff8192c770>] ? wake_up_sem_queue_do+0xa0/0xa0
[   80.500031]  [<ffffffff8192c690>] ? SyS_msgrcv+0x20/0x20
[   80.500031]  [<ffffffff83d90898>] tracesys+0xe1/0xe6

The code is:

        for (i = 0; i < nsems; i++) {
                INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
                spin_lock_init(&sma->sem_base[i].lock);
                spin_lock(&sma->sem_base[i].lock);  <---- here
        }
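
(Editorial illustration, not from the report: lockdep derives a
spinlock's class from the spin_lock_init() call site, so every
sem_base[i].lock initialized in that loop shares a single class, and
taking a second lock of a class while one is already held is reported
as possible recursion unless it is annotated. A minimal sketch of the
same pattern, with made-up names:)

        struct item {
                spinlock_t lock;
        };

        static struct item items[2];

        static void lock_two_of_one_class(void)
        {
                int i;

                /* One spin_lock_init() call site => one lockdep class for both. */
                for (i = 0; i < 2; i++)
                        spin_lock_init(&items[i].lock);

                spin_lock(&items[0].lock);
                spin_lock(&items[1].lock);   /* "possible recursive locking detected" */
                spin_unlock(&items[1].lock);
                spin_unlock(&items[0].lock);
        }

The spin_lock_nested() annotation in the follow-up patch is meant to
tell lockdep that this kind of nesting is intentional.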


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 20:21 ` Sasha Levin
@ 2013-03-25 20:38   ` Rik van Riel
  2013-03-25 21:42     ` Michel Lespinasse
  0 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-25 20:38 UTC (permalink / raw)
  To: Sasha Levin
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Dave Jones, benisty.e

On Mon, 25 Mar 2013 16:21:22 -0400
Sasha Levin <sasha.levin@oracle.com> wrote:

> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > Include lkml in the CC: this time... *sigh*
> > ---8<---
> > 
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> I'm getting the following false positives from lockdep:

Does this patch fix it?

Andrew, this looks like another one for the queue...
---8<---
Subject: [PATCH -mm -next] ipc,sem: fix lockdep false positive

When locking all the semaphores inside a sem_array, the kernel ends up
locking a large number of locks with identical lockdep status. This
trips up lockdep.  Annotate the code to prevent such warnings.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 450248e..f46441a 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -357,7 +357,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		spin_lock(&sma->sem_perm.lock);
 		for (i = 0; i < sma->sem_nsems; i++) {
 			struct sem *sem = sma->sem_base + i;
-			spin_lock(&sem->lock);
+			spin_lock_nested(&sem->lock, SINGLE_DEPTH_NESTING);
 		}
 		locknum = -1;
 	}
@@ -558,7 +558,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	for (i = 0; i < nsems; i++) {
 		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
 		spin_lock_init(&sma->sem_base[i].lock);
-		spin_lock(&sma->sem_base[i].lock);
+		spin_lock_nested(&sma->sem_base[i].lock, SINGLE_DEPTH_NESTING);
 	}
 
 	sma->complex_count = 0;


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 20:38   ` [PATCH -mm -next] ipc,sem: fix lockdep false positive Rik van Riel
@ 2013-03-25 21:42     ` Michel Lespinasse
  2013-03-25 21:51       ` Michel Lespinasse
                         ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-25 21:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Dave Jones, benisty.e,
	Peter Zijlstra, Ingo Molnar

On Mon, Mar 25, 2013 at 1:38 PM, Rik van Riel <riel@surriel.com> wrote:
> On Mon, 25 Mar 2013 16:21:22 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
>
>> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>> > Include lkml in the CC: this time... *sigh*
>> > ---8<---
>> >
>> > This series makes the sysv semaphore code more scalable,
>> > by reducing the time the semaphore lock is held, and making
>> > the locking more scalable for semaphore arrays with multiple
>> > semaphores.
>>
>> Hi Rik,
>>
>> I'm getting the following false positives from lockdep:
>
> Does this patch fix it?

I'll be surprised if it does, because we don't actually have single
depth nesting here...
Adding Peter & Ingo for advice about how to proceed
(the one solution I know would involve using arch_spin_lock() directly
to bypass the lockdep checks, but there's got to be a better way...)

> Andrew, this looks like another one for the queue...
> ---8<---
> Subject: [PATCH -mm -next] ipc,sem: fix lockdep false positive
>
> When locking all the semaphores inside a sem_array, the kernel ends up
> locking a large number of locks with identical lockdep status. This
> trips up lockdep.  Annotate the code to prevent such warnings.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  ipc/sem.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 450248e..f46441a 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -357,7 +357,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>                 spin_lock(&sma->sem_perm.lock);
>                 for (i = 0; i < sma->sem_nsems; i++) {
>                         struct sem *sem = sma->sem_base + i;
> -                       spin_lock(&sem->lock);
> +                       spin_lock_nested(&sem->lock, SINGLE_DEPTH_NESTING);
>                 }
>                 locknum = -1;
>         }
> @@ -558,7 +558,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
>         for (i = 0; i < nsems; i++) {
>                 INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
>                 spin_lock_init(&sma->sem_base[i].lock);
> -               spin_lock(&sma->sem_base[i].lock);
> +               spin_lock_nested(&sma->sem_base[i].lock, SINGLE_DEPTH_NESTING);
>         }
>
>         sma->complex_count = 0;
>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 21:42     ` Michel Lespinasse
@ 2013-03-25 21:51       ` Michel Lespinasse
  2013-03-25 21:56         ` Sasha Levin
  2013-03-25 21:52       ` Sasha Levin
  2013-03-26 13:19       ` Peter Zijlstra
  2 siblings, 1 reply; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-25 21:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Dave Jones, benisty.e,
	Peter Zijlstra, Ingo Molnar

On Mon, Mar 25, 2013 at 2:42 PM, Michel Lespinasse <walken@google.com> wrote:
> I'll be surprised if it does, because we don't actually have single
> depth nesting here...
> Adding Peter & Ingo for advice about how to proceed
> (the one solution I know would involve using arch_spin_lock() directly
> to bypass the lockdep checks, but there's got to be a better way...)

Maybe spin_lock_nest_lock() can help too. I'm not sure, the feature is
undocumented.
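
For reference, applied to sem_lock() the pattern would look roughly like the
sketch below; treat it as an illustration of the API, not a tested patch:

	/*
	 * Sketch only: spin_lock_nest_lock() tells lockdep that acquisition
	 * of the per-semaphore locks is serialized by the already-held array
	 * lock, so holding many locks of the same class does not trip the
	 * nesting checks.
	 */
	spin_lock(&sma->sem_perm.lock);
	for (i = 0; i < sma->sem_nsems; i++) {
		struct sem *sem = sma->sem_base + i;
		spin_lock_nest_lock(&sem->lock, &sma->sem_perm.lock);
	}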

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 21:42     ` Michel Lespinasse
  2013-03-25 21:51       ` Michel Lespinasse
@ 2013-03-25 21:52       ` Sasha Levin
  2013-03-26 13:19       ` Peter Zijlstra
  2 siblings, 0 replies; 129+ messages in thread
From: Sasha Levin @ 2013-03-25 21:52 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Dave Jones, benisty.e,
	Peter Zijlstra, Ingo Molnar

On 03/25/2013 05:42 PM, Michel Lespinasse wrote:
> On Mon, Mar 25, 2013 at 1:38 PM, Rik van Riel <riel@surriel.com> wrote:
>> > On Mon, 25 Mar 2013 16:21:22 -0400
>> > Sasha Levin <sasha.levin@oracle.com> wrote:
>> >
>>> >> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>>>> >> > Include lkml in the CC: this time... *sigh*
>>>> >> > ---8<---
>>>> >> >
>>>> >> > This series makes the sysv semaphore code more scalable,
>>>> >> > by reducing the time the semaphore lock is held, and making
>>>> >> > the locking more scalable for semaphore arrays with multiple
>>>> >> > semaphores.
>>> >>
>>> >> Hi Rik,
>>> >>
>>> >> I'm getting the following false positives from lockdep:
>> >
>> > Does this patch fix it?
> I'll be surprised if it does, because we don't actually have single
> depth nesting here...
> Adding Peter & Ingo for advice about how to proceed
> (the one solution I know would involve using arch_spin_lock() directly
> to bypass the lockdep checks, but there's got to be a better way...)

Yeah, it did not.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 21:51       ` Michel Lespinasse
@ 2013-03-25 21:56         ` Sasha Levin
  0 siblings, 0 replies; 129+ messages in thread
From: Sasha Levin @ 2013-03-25 21:56 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Dave Jones, benisty.e,
	Peter Zijlstra, Ingo Molnar

On 03/25/2013 05:51 PM, Michel Lespinasse wrote:
> On Mon, Mar 25, 2013 at 2:42 PM, Michel Lespinasse <walken@google.com> wrote:
>> I'll be surprised if it does, because we don't actually have single
>> depth nesting here...
>> Adding Peter & Ingo for advice about how to proceed
>> (the one solution I know would involve using arch_spin_lock() directly
>> to bypass the lockdep checks, but there's got to be a better way...)
> 
> Maybe spin_lock_nest_lock() can help too. I'm not sure, the feature is
> undocumented.
> 

I think we should name the locks properly (using 'key') and initialize their
lockdep_map using lockdep_init_map instead of letting spin_lock_init() pass
"&sma->sem_base[i].lock" as the name.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-25 21:42     ` Michel Lespinasse
  2013-03-25 21:51       ` Michel Lespinasse
  2013-03-25 21:52       ` Sasha Levin
@ 2013-03-26 13:19       ` Peter Zijlstra
  2013-03-26 13:40         ` Michel Lespinasse
  2013-03-26 14:25         ` [PATCH " Rik van Riel
  2 siblings, 2 replies; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-26 13:19 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Mon, 2013-03-25 at 14:42 -0700, Michel Lespinasse wrote:
> depth nesting here...
> Adding Peter & Ingo for advice about how to proceed

> > +++ b/ipc/sem.c
> > @@ -357,7 +357,7 @@ static inline int sem_lock(struct sem_array
> *sma, struct sembuf *sops,
> >                 spin_lock(&sma->sem_perm.lock);
> >                 for (i = 0; i < sma->sem_nsems; i++) {
> >                         struct sem *sem = sma->sem_base + i;
> > -                       spin_lock(&sem->lock);
> > +                       spin_lock_nested(&sem->lock,
> SINGLE_DEPTH_NESTING);
> >                 }
> >                 locknum = -1;
> >         }

Right, so as walken said, this isn't going to work right.

I need a little more information as I've not really paid much attention
to this stuff. Firstly, is there a limit to sem_nsems or is this a
random user specified number? Secondly do we care about lock order at
all, or is array order the only order that counts?




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 13:19       ` Peter Zijlstra
@ 2013-03-26 13:40         ` Michel Lespinasse
  2013-03-26 14:27           ` Peter Zijlstra
  2013-03-26 14:25         ` [PATCH " Rik van Riel
  1 sibling, 1 reply; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-26 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Tue, Mar 26, 2013 at 6:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2013-03-25 at 14:42 -0700, Michel Lespinasse wrote:
>> depth nesting here...
>> Adding Peter & Ingo for advice about how to proceed
>
>> > +++ b/ipc/sem.c
>> > @@ -357,7 +357,7 @@ static inline int sem_lock(struct sem_array
>> *sma, struct sembuf *sops,
>> >                 spin_lock(&sma->sem_perm.lock);
>> >                 for (i = 0; i < sma->sem_nsems; i++) {
>> >                         struct sem *sem = sma->sem_base + i;
>> > -                       spin_lock(&sem->lock);
>> > +                       spin_lock_nested(&sem->lock,
>> SINGLE_DEPTH_NESTING);
>> >                 }
>> >                 locknum = -1;
>> >         }
>
> Right, so as walken said, this isn't going to work right.
>
> I need a little more information as I've not really paid much attention
> to this stuff. Firstly, is there a limit to sem_nsems or is this a
> random user specified number? Secondly do we care about lock order at
> all, or is array order the only order that counts?

sem_nsems is user provided as the array size in some semget system
call. It's the size of an ipc semaphore array.

complex semop operations take the array's lock plus every semaphore's
lock; simple semop operations (operating on a single semaphore) only
take that one semaphore's lock.

AFAIK no operations take more than one semaphore lock except for the
case where they are all taken (in which case the array's lock is also
taken). Ideally it'd be best to have lockdep verify that this usage
pattern is respected, though.
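
To make the two cases concrete, a small userspace illustration (array size
and semaphore numbers here are arbitrary):

	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	void example(void)
	{
		/* one 64-semaphore array; sem_nsems is entirely user chosen */
		int semid = semget(IPC_PRIVATE, 64, IPC_CREAT | 0600);

		/* "simple" operation: one sembuf, only semaphore 3 is involved */
		struct sembuf one = { .sem_num = 3, .sem_op = -1, .sem_flg = 0 };
		semop(semid, &one, 1);

		/*
		 * "complex" operation: two sembufs that must succeed or fail
		 * atomically, so the kernel serializes against the whole array.
		 */
		struct sembuf two[2] = {
			{ .sem_num = 3, .sem_op = -1, .sem_flg = 0 },
			{ .sem_num = 7, .sem_op =  1, .sem_flg = 0 },
		};
		semop(semid, two, 2);
	}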

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 13:19       ` Peter Zijlstra
  2013-03-26 13:40         ` Michel Lespinasse
@ 2013-03-26 14:25         ` Rik van Riel
  1 sibling, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-26 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On 03/26/2013 09:19 AM, Peter Zijlstra wrote:
> On Mon, 2013-03-25 at 14:42 -0700, Michel Lespinasse wrote:
>> depth nesting here...
>> Adding Peter & Ingo for advice about how to proceed
>
>>> +++ b/ipc/sem.c
>>> @@ -357,7 +357,7 @@ static inline int sem_lock(struct sem_array
>> *sma, struct sembuf *sops,
>>>                  spin_lock(&sma->sem_perm.lock);
>>>                  for (i = 0; i < sma->sem_nsems; i++) {
>>>                          struct sem *sem = sma->sem_base + i;
>>> -                       spin_lock(&sem->lock);
>>> +                       spin_lock_nested(&sem->lock,
>> SINGLE_DEPTH_NESTING);
>>>                  }
>>>                  locknum = -1;
>>>          }
>
> Right, so as walken said, this isn't going to work right.
>
> I need a little more information as I've not really paid much attention
> to this stuff. Firstly, is there a limit to sem_nsems or is this a
> random user specified number? Secondly do we care about lock order at
> all, or is array order the only order that counts?

It is a user specified number, and we either lock only one
of the semaphore locks, or we lock all of them (in array
order, after taking the semaphore array lock).

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 13:40         ` Michel Lespinasse
@ 2013-03-26 14:27           ` Peter Zijlstra
  2013-03-26 15:19             ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-26 14:27 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Tue, 2013-03-26 at 06:40 -0700, Michel Lespinasse wrote:

> sem_nsems is user provided as the array size in some semget system
> call. It's the size of an ipc semaphore array.

So we're basically adding a random (big) number to preempt_count
(obviously while preemption is disabled), seems rather costly and
undesirable.

> complex semop operations take the array's lock plus every semaphore
> locks; simple semop operations (operating on a single semaphore) only
> take that one semaphore's lock.

Right, standard global/local lock like stuff. Is there a way we can add
a r/o test to the 'local' lock operation and avoid doing the above?

Maybe something like:

void sma_lock(struct sem_array *sma) /* global */
{
	int i;

	sma->global_locked = 1;
	smp_wmb(); /* can we merge with the LOCK ? */
	spin_lock(&sma->global_lock);

	/* wait for all local locks to go away */
	for (i = 0; i < sma->sem_nsems; i++)
		spin_unlock_wait(&sma->sem_base[i].lock);
}

void sma_lock_one(struct sem_array *sma, int nr) /* local */
{
	smp_rmb(); /* pairs with wmb in sma_lock() */
	if (unlikely(sma->global_locked)) { /* wait for global lock */
		while (sma->global_locked)
			spin_unlock_wait(&sma->global_lock);
	}
	spin_lock(&sma->sem_base[nr].lock);
}

This still has the problem of a non-preemptible section of O(sem_nsems) 
(with the avg wait-time on the local lock). Could we make the global 
lock a sleeping lock?




^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 14:27           ` Peter Zijlstra
@ 2013-03-26 15:19             ` Rik van Riel
  2013-03-27  8:40               ` Peter Zijlstra
  2013-03-27  8:42               ` Peter Zijlstra
  0 siblings, 2 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-26 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On 03/26/2013 10:27 AM, Peter Zijlstra wrote:
> On Tue, 2013-03-26 at 06:40 -0700, Michel Lespinasse wrote:
>
>> sem_nsems is user provided as the array size in some semget system
>> call. It's the size of an ipc semaphore array.
>
> So we're basically adding a random (big) number to preempt_count
> (obviously while preemption is disabled), seems rather costly and
> undesirable.
 >
>> complex semop operations take the array's lock plus every semaphore
>> locks; simple semop operations (operating on a single semaphore) only
>> take that one semaphore's lock.
>
> Right, standard global/local lock like stuff. Is there a way we can add
> a r/o test to the 'local' lock operation and avoid doing the above?

That makes me wonder, how did mm_take_all_locks used to work before
we turned the anon_vma lock into a mutex?

The code used to use spin_lock_nest_lock, but still has the potential
to overflow the preempt counter. How did that ever work right?

> Maybe something like:
>
> void sma_lock(struct sem_array *sma) /* global */
> {
> 	int i;
>
> 	sma->global_locked = 1;
> 	smp_wmb(); /* can we merge with the LOCK ? */
> 	spin_lock(&sma->global_lock);
>
> 	/* wait for all local locks to go away */
> 	for (i = 0; i < sma->sem_nsems; i++)
> 		spin_unlock_wait(&sem->sem_base[i]->lock);	
> }
>
> void sma_lock_one(struct sem_array *sma, int nr) /* local */
> {
> 	smp_rmb(); /* pairs with wmb in sma_lock() */
> 	if (unlikely(sma->global_locked)) { /* wait for global lock */
> 		while (sma->global_locked)
> 			spin_unlock_wait(&sma->global_lock);
> 	}
> 	spin_lock(&sma->sem_base[nr]->lock);
> }

That is essentially an open-coded version of the global rwlock scheme
I originally proposed: the global path takes the lock for write, while
the single-semaphore path takes the global lock for read and then one
of the semaphore spinlocks.

I could certainly implement and test the above, unless Linus
thinks it's too ugly to live :)

> This still has the problem of a non-preemptible section of O(sem_nsems)
> (with the avg wait-time on the local lock). Could we make the global
> lock a sleeping lock?

Not without breaking your scheme above :)

I suppose making things into a sleeping lock should be possible,
but that is another major change in this code. I would rather do
things in smaller steps...

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
                   ` (13 preceding siblings ...)
  2013-03-25 20:21 ` Sasha Levin
@ 2013-03-26 17:33 ` Sasha Levin
  2013-03-26 17:51   ` Davidlohr Bueso
                     ` (2 more replies)
  14 siblings, 3 replies; 129+ messages in thread
From: Sasha Levin @ 2013-03-26 17:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney,
	Andrew Morton

On 03/20/2013 03:55 PM, Rik van Riel wrote:
> This series makes the sysv semaphore code more scalable,
> by reducing the time the semaphore lock is held, and making
> the locking more scalable for semaphore arrays with multiple
> semaphores.

Hi Rik,

Another issue that came up is:

[   96.347341] ================================================
[   96.348085] [ BUG: lock held when returning to user space! ]
[   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
[   96.360300] ------------------------------------------------
[   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
[   96.362019] 1 lock held by trinity-child9/7583:
[   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0

It seems that we can leave semtimedop without releasing the rcu read lock.

I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it
seems that this behaviour is actually intentional?

        rcu_read_lock();
        sma = sem_obtain_object_check(ns, semid);
        if (IS_ERR(sma)) {
                if (un)
                        rcu_read_unlock();
                error = PTR_ERR(sma);
                goto out_free;
        }

When I've looked at that it seems that not releasing the read lock was (very)
intentional.

After that, the only code path that would release the lock starts with:

        if (un) {
		...

So we won't release the lock at all if un is NULL?


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
@ 2013-03-26 17:51   ` Davidlohr Bueso
  2013-03-26 18:07     ` Sasha Levin
  2013-03-26 17:55   ` ipc,sem: sysv semaphore scalability Paul E. McKenney
  2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  2 siblings, 1 reply; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-26 17:51 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rik van Riel, torvalds, linux-kernel, akpm, hhuang, jason.low2,
	walken, lwoodman, chegu_vinod, Paul E. McKenney

On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] ================================================
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
> [   96.360300] ------------------------------------------------
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it
> seems that this behaviour is actually intentional?
> 
>         rcu_read_lock();
>         sma = sem_obtain_object_check(ns, semid);
>         if (IS_ERR(sma)) {
>                 if (un)
>                         rcu_read_unlock();
>                 error = PTR_ERR(sma);
>                 goto out_free;
>         }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.

This logic was from the original code, which I also found to be quite
confusing.

> 
> After that, the only code path that would release the lock starts with:
> 
>         if (un) {
> 		...
> 
> So we won't release the lock at all if un is NULL?
> 

Not necessarily, we do release everything at the end of the function: 

out_unlock_free:
	sem_unlock(sma, locknum);

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
  2013-03-26 17:51   ` Davidlohr Bueso
@ 2013-03-26 17:55   ` Paul E. McKenney
  2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  2 siblings, 0 replies; 129+ messages in thread
From: Paul E. McKenney @ 2013-03-26 17:55 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, walken, lwoodman, chegu_vinod

On Tue, Mar 26, 2013 at 01:33:07PM -0400, Sasha Levin wrote:
> On 03/20/2013 03:55 PM, Rik van Riel wrote:
> > This series makes the sysv semaphore code more scalable,
> > by reducing the time the semaphore lock is held, and making
> > the locking more scalable for semaphore arrays with multiple
> > semaphores.
> 
> Hi Rik,
> 
> Another issue that came up is:
> 
> [   96.347341] ================================================
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
> [   96.360300] ------------------------------------------------
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it
> seems that this behaviour is actually intentional?
> 
>         rcu_read_lock();
>         sma = sem_obtain_object_check(ns, semid);
>         if (IS_ERR(sma)) {
>                 if (un)
>                         rcu_read_unlock();
>                 error = PTR_ERR(sma);
>                 goto out_free;
>         }
> 
> When I've looked at that it seems that not releasing the read lock was (very)
> intentional.
> 
> After that, the only code path that would release the lock starts with:
> 
>         if (un) {
> 		...
> 
> So we won't release the lock at all if un is NULL?

Intentions notwithstanding, it is absolutely required to exit any and
all RCU read-side critical sections prior to going into user mode.

I suggest removing the "if (un)".
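
In code, that amounts to making the unlock on this error path unconditional,
which is also what the cleanup patch later in this thread ends up doing:

	rcu_read_lock();
	sma = sem_obtain_object_check(ns, semid);
	if (IS_ERR(sma)) {
		rcu_read_unlock();	/* always drop the RCU read lock */
		error = PTR_ERR(sma);
		goto out_free;
	}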

							Thanx, Paul


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-25 13:47             ` Emmanuel Benisty
  2013-03-25 14:00               ` Rik van Riel
  2013-03-25 14:01               ` Rik van Riel
@ 2013-03-26 17:59               ` Davidlohr Bueso
  2013-03-26 18:14                 ` Rik van Riel
  2013-03-26 18:35                 ` Andrew Morton
  2 siblings, 2 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-26 17:59 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Rik van Riel, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > And you never see this problem without Rik's patches?
> 
> No, never.
> 
> > Could you bisect
> > *which* patch it starts with? Are the first four ones ok (the moving
> > of the locking around, but without the fine-grained ones), for
> > example?
> 
> With the first four patches only, I got some X server freeze (just tried once).

Going over the code again, I found a potential recursive spinlock scenario. 
Andrew, if you have no objections, please queue this up.

Thanks.

---8<---

From: Davidlohr Bueso <davidlohr.bueso@hp.com>
Subject: [PATCH] ipc, sem: prevent possible deadlock

In semctl_main(), when cmd == GETALL, we're locking
sma->sem_perm.lock (through sem_lock_and_putref), yet
after the conditional, we lock it again.
Unlock sma right after exiting the conditional.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 ipc/sem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ipc/sem.c b/ipc/sem.c
index 1a2913d..f257afe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
 				err = -EIDRM;
 				goto out_free;
 			}
+			sem_unlock(sma, -1);
 		}
 
 		sem_lock(sma, NULL, -1);
-- 
1.7.11.7




^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 17:51   ` Davidlohr Bueso
@ 2013-03-26 18:07     ` Sasha Levin
  2013-03-26 18:17       ` Rik van Riel
  2013-03-26 20:00       ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  0 siblings, 2 replies; 129+ messages in thread
From: Sasha Levin @ 2013-03-26 18:07 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Rik van Riel, torvalds, linux-kernel, akpm, hhuang, jason.low2,
	walken, lwoodman, chegu_vinod, Paul E. McKenney

On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:
> On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
>> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>>> This series makes the sysv semaphore code more scalable,
>>> by reducing the time the semaphore lock is held, and making
>>> the locking more scalable for semaphore arrays with multiple
>>> semaphores.
>>
>> Hi Rik,
>>
>> Another issue that came up is:
>>
>> [   96.347341] ================================================
>> [   96.348085] [ BUG: lock held when returning to user space! ]
>> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
>> [   96.360300] ------------------------------------------------
>> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
>> [   96.362019] 1 lock held by trinity-child9/7583:
>> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
>>
>> It seems that we can leave semtimedop without releasing the rcu read lock.
>>
>> I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it
>> seems that this behaviour is actually intentional?
>>
>>         rcu_read_lock();
>>         sma = sem_obtain_object_check(ns, semid);
>>         if (IS_ERR(sma)) {
>>                 if (un)
>>                         rcu_read_unlock();
>>                 error = PTR_ERR(sma);
>>                 goto out_free;
>>         }
>>
>> When I've looked at that it seems that not releasing the read lock was (very)
>> intentional.
> 
> This logic was from the original code, which I also found to be quite
> confusing.

I wasn't getting this warning with the old code, so there was probably something
else that triggers this now.

>>
>> After that, the only code path that would release the lock starts with:
>>
>>         if (un) {
>> 		...
>>
>> So we won't release the lock at all if un is NULL?
>>
> 
> Not necessarily, we do release everything at the end of the function: 
> 
> out_unlock_free:
> 	sem_unlock(sma, locknum);

Ow, there's an rcu_read_unlock() in sem_unlock()? This complicates things even
more, I suspect. If un is non-NULL we'll be unlocking the rcu lock twice?

	if (un->semid == -1) {
		rcu_read_unlock();
		goto out_unlock_free;
	}
[...]
	out_unlock_free:
	        sem_unlock(sma, locknum);


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 17:59               ` Davidlohr Bueso
@ 2013-03-26 18:14                 ` Rik van Riel
  2013-03-26 18:35                 ` Andrew Morton
  1 sibling, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-26 18:14 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Emmanuel Benisty, Linus Torvalds, Linux Kernel Mailing List,
	Andrew Morton, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On 03/26/2013 01:59 PM, Davidlohr Bueso wrote:

> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> Subject: [PATCH] ipc, sem: prevent possible deadlock
>
> In semctl_main(), when cmd == GETALL, we're locking
> sma->sem_perm.lock (through sem_lock_and_putref), yet
> after the conditional, we lock it again.
> Unlock sma right after exiting the conditional.
>
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 18:07     ` Sasha Levin
@ 2013-03-26 18:17       ` Rik van Riel
  2013-03-26 20:00       ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  1 sibling, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-26 18:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Davidlohr Bueso, torvalds, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney

On 03/26/2013 02:07 PM, Sasha Levin wrote:
> On 03/26/2013 01:51 PM, Davidlohr Bueso wrote:
>> On Tue, 2013-03-26 at 13:33 -0400, Sasha Levin wrote:
>>> On 03/20/2013 03:55 PM, Rik van Riel wrote:
>>>> This series makes the sysv semaphore code more scalable,
>>>> by reducing the time the semaphore lock is held, and making
>>>> the locking more scalable for semaphore arrays with multiple
>>>> semaphores.
>>>
>>> Hi Rik,
>>>
>>> Another issue that came up is:
>>>
>>> [   96.347341] ================================================
>>> [   96.348085] [ BUG: lock held when returning to user space! ]
>>> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
>>> [   96.360300] ------------------------------------------------
>>> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
>>> [   96.362019] 1 lock held by trinity-child9/7583:
>>> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
>>>
>>> It seems that we can leave semtimedop without releasing the rcu read lock.
>>>
>>> I'm a bit confused by what's going on in semtimedop with regards to rcu read lock, it
>>> seems that this behaviour is actually intentional?
>>>
>>>          rcu_read_lock();
>>>          sma = sem_obtain_object_check(ns, semid);
>>>          if (IS_ERR(sma)) {
>>>                  if (un)
>>>                          rcu_read_unlock();
>>>                  error = PTR_ERR(sma);
>>>                  goto out_free;
>>>          }
>>>
>>> When I've looked at that it seems that not releasing the read lock was (very)
>>> intentional.
>>
>> This logic was from the original code, which I also found to be quite
>> confusing.
>
> I wasn't getting this warning with the old code, so there was probably something
> else that triggers this now.
>
>>>
>>> After that, the only code path that would release the lock starts with:
>>>
>>>          if (un) {
>>> 		...
>>>
>>> So we won't release the lock at all if un is NULL?
>>>
>>
>> Not necessarily, we do release everything at the end of the function:
>>
>> out_unlock_free:
>> 	sem_unlock(sma, locknum);
>
> Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
> more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?

It is uglier than you think...

On success, find_alloc_undo will call rcu_read_lock, so we have the
rcu_read_lock held twice :(

Some of the ipc code is quite ugly, but making too many large changes
at once is just asking for trouble. I suspect we're going to have to
untangle this one bit at a time...


-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 17:59               ` Davidlohr Bueso
  2013-03-26 18:14                 ` Rik van Riel
@ 2013-03-26 18:35                 ` Andrew Morton
  2013-04-16 23:30                   ` Andrew Morton
  1 sibling, 1 reply; 129+ messages in thread
From: Andrew Morton @ 2013-03-26 18:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Emmanuel Benisty, Linus Torvalds, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Tue, 26 Mar 2013 10:59:27 -0700 Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:

> On Mon, 2013-03-25 at 20:47 +0700, Emmanuel Benisty wrote:
> > On Mon, Mar 25, 2013 at 12:10 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > And you never see this problem without Rik's patches?
> > 
> > No, never.
> > 
> > > Could you bisect
> > > *which* patch it starts with? Are the first four ones ok (the moving
> > > of the locking around, but without the fine-grained ones), for
> > > example?
> > 
> > With the first four patches only, I got some X server freeze (just tried once).
> 
> Going over the code again, I found a potential recursive spinlock scenario. 
> Andrew, if you have no objections, please queue this up.
> 
> Thanks.
> 
> ---8<---
> 
> From: Davidlohr Bueso <davidlohr.bueso@hp.com>
> Subject: [PATCH] ipc, sem: prevent possible deadlock
> 
> In semctl_main(), when cmd == GETALL, we're locking
> sma->sem_perm.lock (through sem_lock_and_putref), yet
> after the conditional, we lock it again.
> Unlock sma right after exiting the conditional.
> 
> Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
> ---
>  ipc/sem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 1a2913d..f257afe 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1243,6 +1243,7 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
>  				err = -EIDRM;
>  				goto out_free;
>  			}
> +			sem_unlock(sma, -1);
>  		}
>  
>  		sem_lock(sma, NULL, -1);

Looks right. 

Do we need the locking at all?  What does it actually do?

			sem_lock_and_putref(sma);
			if (sma->sem_perm.deleted) {
				sem_unlock(sma, -1);
				err = -EIDRM;
				goto out_free;
			}
			sem_unlock(sma, -1);

We're taking the lock, testing an int and then dropping the lock. 
What's the point in that?


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-21 21:10 ` Andrew Morton
  2013-03-21 21:47   ` Peter Hurley
  2013-03-21 21:50   ` Peter Hurley
@ 2013-03-26 19:28   ` Dave Jones
  2013-03-26 19:43     ` Andrew Morton
  2 siblings, 1 reply; 129+ messages in thread
From: Dave Jones @ 2013-03-26 19:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Peter Hurley

On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:

 > Whichever way we go, we should get a wiggle on - this has been hanging
 > around for too long.  Dave, do you have time to determine whether
 > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > than max") fixes things up?

Ok, with that reverted it's been grinding away for a few hours without incident.
Normally I see the oops within a minute or so.

	Dave

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 19:28   ` Dave Jones
@ 2013-03-26 19:43     ` Andrew Morton
  2013-03-29 16:17       ` Dave Jones
  2013-03-29 19:01       ` Dave Jones
  0 siblings, 2 replies; 129+ messages in thread
From: Andrew Morton @ 2013-03-26 19:43 UTC (permalink / raw)
  To: Dave Jones
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Peter Hurley

On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones <davej@redhat.com> wrote:

> On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
> 
>  > Whichever way we go, we should get a wiggle on - this has been hanging
>  > around for too long.  Dave, do you have time to determine whether
>  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
>  > than max") fixes things up?
> 
> Ok, with that reverted it's been grinding away for a few hours without incident.
> Normally I see the oops within a minute or so.
> 

OK, thanks, I queued a revert:

From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "ipc: don't allocate a copy larger than max"

Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
during trinity testing.

Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/msg.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max ipc/msg.c
--- a/ipc/msg.c~revert-ipc-dont-allocate-a-copy-larger-than-max
+++ a/ipc/msg.c
@@ -820,17 +820,15 @@ long do_msgrcv(int msqid, void __user *b
 	struct msg_msg *copy = NULL;
 	unsigned long copy_number = 0;
 
-	ns = current->nsproxy->ipc_ns;
-
 	if (msqid < 0 || (long) bufsz < 0)
 		return -EINVAL;
 	if (msgflg & MSG_COPY) {
-		copy = prepare_copy(buf, min_t(size_t, bufsz, ns->msg_ctlmax),
-				    msgflg, &msgtyp, &copy_number);
+		copy = prepare_copy(buf, bufsz, msgflg, &msgtyp, &copy_number);
 		if (IS_ERR(copy))
 			return PTR_ERR(copy);
 	}
 	mode = convert_mode(&msgtyp, msgflg);
+	ns = current->nsproxy->ipc_ns;
 
 	msq = msg_lock_check(ns, msqid);
 	if (IS_ERR(msq)) {
_


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-26 18:07     ` Sasha Levin
  2013-03-26 18:17       ` Rik van Riel
@ 2013-03-26 20:00       ` Rik van Riel
  2013-04-05  4:38         ` Mike Galbraith
  1 sibling, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-26 20:00 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Davidlohr Bueso, torvalds, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney

On Tue, 26 Mar 2013 14:07:14 -0400
Sasha Levin <sasha.levin@oracle.com> wrote:

> > Not necessarily, we do release everything at the end of the function: 
> >     out_unlock_free:
> > 	sem_unlock(sma, locknum);
> 
> Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
> more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?

Sasha, this patch should resolve the RCU tangle, by making sure
we only ever take the rcu_read_lock once in semtimedop.

---8<---

The ipc semaphore code has a nasty RCU locking tangle, with both
find_alloc_undo and semtimedop taking the rcu_read_lock(). The
code can be cleaned up somewhat by only taking the rcu_read_lock
once.

There are no other callers to find_alloc_undo.

This should also solve the trinity issue reported by Sasha Levin.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   31 +++++++++----------------------
 1 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f46441a..2ec2945 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1646,22 +1646,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 			alter = 1;
 	}
 
+	INIT_LIST_HEAD(&tasks);
+
 	if (undos) {
+		/* On success, find_alloc_undo takes the rcu_read_lock */
 		un = find_alloc_undo(ns, semid);
 		if (IS_ERR(un)) {
 			error = PTR_ERR(un);
 			goto out_free;
 		}
-	} else
+	} else {
 		un = NULL;
+		rcu_read_lock();
+	}
 
-	INIT_LIST_HEAD(&tasks);
-
-	rcu_read_lock();
 	sma = sem_obtain_object_check(ns, semid);
 	if (IS_ERR(sma)) {
-		if (un)
-			rcu_read_unlock();
+		rcu_read_unlock();
 		error = PTR_ERR(sma);
 		goto out_free;
 	}
@@ -1693,22 +1694,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	 */
 	error = -EIDRM;
 	locknum = sem_lock(sma, sops, nsops);
-	if (un) {
-		if (un->semid == -1) {
-			rcu_read_unlock();
-			goto out_unlock_free;
-		} else {
-			/*
-			 * rcu lock can be released, "un" cannot disappear:
-			 * - sem_lock is acquired, thus IPC_RMID is
-			 *   impossible.
-			 * - exit_sem is impossible, it always operates on
-			 *   current (or a dead task).
-			 */
-
-			rcu_read_unlock();
-		}
-	}
+	if (un && un->semid == -1)
+		goto out_unlock_free;
 
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 15:19             ` Rik van Riel
@ 2013-03-27  8:40               ` Peter Zijlstra
  2013-03-27  8:42               ` Peter Zijlstra
  1 sibling, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-27  8:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Tue, 2013-03-26 at 11:19 -0400, Rik van Riel wrote:
> That makes me wonder, how did mm_take_all_locks used to work before
> we turned the anon_vma lock into a mutex?
> 
> The code used to use spin_lock_nest_lock, but still has the potential
> to overflow the preempt counter. How did that ever work right?

It did trigger a bunch of warnings, but early on it was understood that
KVM would have 'few' vmas when starting and registering the
mmu_notifier thing... then KVM bloated into insanity.

But aside from the warnings, if you overflow the regular preempt_count
bits, nothing really bad happens because you start poking at softirq
nesting, then hardirq etc.; all of those also disable preemption.

You'll get a few 'unexpected' side-effects for things like
serving_softirq()/in_irq() or whatever those functions are called, but
other than that things mostly work.
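
For reference, the preempt_count layout in kernels of this era looks roughly
like the following (per include/linux/hardirq.h; treat the exact widths as
approximate):

	/*
	 * PREEMPT_MASK: 0x000000ff  - preempt_disable() nesting
	 * SOFTIRQ_MASK: 0x0000ff00  - softirq nesting
	 * HARDIRQ_MASK: 0x03ff0000  - hardirq nesting
	 *     NMI_MASK: 0x04000000  - in NMI
	 *
	 * With preemption counting enabled, each spin_lock() bumps the low
	 * byte by one, so locking more than 255 semaphores carries into the
	 * softirq field: preemption stays disabled throughout, but checks
	 * like in_softirq() start reporting 'unexpected' values.
	 */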

I don't particularly like overflowing preempt count, but it's mostly
harmless (up to a point). The much worse offender in my book is the
duration of the preempt_disable section thus created.

Especially with everything in user control, you can basically create an
arbitrarily long non-preempt section with the semops.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-26 15:19             ` Rik van Riel
  2013-03-27  8:40               ` Peter Zijlstra
@ 2013-03-27  8:42               ` Peter Zijlstra
  2013-03-27 11:22                 ` Michel Lespinasse
                                   ` (2 more replies)
  1 sibling, 3 replies; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-27  8:42 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Tue, 2013-03-26 at 11:19 -0400, Rik van Riel wrote:
> > Maybe something like:
> >
> > void sma_lock(struct sem_array *sma) /* global */
> > {
> >       int i;
> >
> >       sma->global_locked = 1;
> >       smp_wmb(); /* can we merge with the LOCK ? */
> >       spin_lock(&sma->global_lock);
> >
> >       /* wait for all local locks to go away */
> >       for (i = 0; i < sma->sem_nsems; i++)
> >               spin_unlock_wait(&sem->sem_base[i]->lock);      
> > }
> >
> > void sma_lock_one(struct sem_array *sma, int nr) /* local */
> > {
> >       smp_rmb(); /* pairs with wmb in sma_lock() */
> >       if (unlikely(sma->global_locked)) { /* wait for global lock */
> >               while (sma->global_locked)
> >                       spin_unlock_wait(&sma->global_lock);
> >       }
> >       spin_lock(&sma->sem_base[nr]->lock);
> > }

I since realized there's an ordering problem with ->global_locked, we
need to use spin_is_locked() or somesuch.

Two competing sma_lock() operations will screw over the separate
variable.

> 
> > This still has the problem of a non-preemptible section of
> O(sem_nsems)
> > (with the avg wait-time on the local lock). Could we make the global
> > lock a sleeping lock?
> 
> Not without breaking your scheme above :)

How would making sma->global_lock a mutex wreck anything?


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-27  8:42               ` Peter Zijlstra
@ 2013-03-27 11:22                 ` Michel Lespinasse
  2013-03-27 12:02                   ` Peter Zijlstra
  2013-03-27 20:00                 ` Rik van Riel
  2013-03-28 20:23                 ` [PATCH v2 " Rik van Riel
  2 siblings, 1 reply; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-27 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Wed, Mar 27, 2013 at 1:42 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2013-03-26 at 11:19 -0400, Rik van Riel wrote:
>> > Maybe something like:
>> >
>> > void sma_lock(struct sem_array *sma) /* global */
>> > {
>> >       int i;
>> >
>> >       sma->global_locked = 1;
>> >       smp_wmb(); /* can we merge with the LOCK ? */
>> >       spin_lock(&sma->global_lock);
>> >
>> >       /* wait for all local locks to go away */
>> >       for (i = 0; i < sma->sem_nsems; i++)
>> >               spin_unlock_wait(&sem->sem_base[i]->lock);
>> > }
>> >
>> > void sma_lock_one(struct sem_array *sma, int nr) /* local */
>> > {
>> >       smp_rmb(); /* pairs with wmb in sma_lock() */
>> >       if (unlikely(sma->global_locked)) { /* wait for global lock */
>> >               while (sma->global_locked)
>> >                       spin_unlock_wait(&sma->global_lock);
>> >       }
>> >       spin_lock(&sma->sem_base[nr]->lock);
>> > }
>
> I since realized there's an ordering problem with ->global_locked, we
> need to use spin_is_locked() or somesuch.
>
> Two competing sma_lock() operations will screw over the separate
> variable.
>
>>
>> > This still has the problem of a non-preemptible section of
>> O(sem_nsems)
>> > (with the avg wait-time on the local lock). Could we make the global
>> > lock a sleeping lock?
>>
>> Not without breaking your scheme above :)
>
> How would making sma->global_lock a mutex wreck anything?

I don't remember the details (rik probably will), but rcu is also
already involved, so there is a non trivial chance that it would...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-27 11:22                 ` Michel Lespinasse
@ 2013-03-27 12:02                   ` Peter Zijlstra
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-27 12:02 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Wed, 2013-03-27 at 04:22 -0700, Michel Lespinasse wrote:
> >> Not without breaking your scheme above :)
> >
> > How would making sma->global_lock a mutex wreck anything?
> 
> I don't remember the details (rik probably will), but rcu is also
> already involved, so there is a non trivial chance that it would...

Ah, that might be true, the callsite might not be able to deal with it,
but the proposed global/local lock thing doesn't care -- which is how I
read Rik's comment above.


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: fix lockdep false positive
  2013-03-27  8:42               ` Peter Zijlstra
  2013-03-27 11:22                 ` Michel Lespinasse
@ 2013-03-27 20:00                 ` Rik van Riel
  2013-03-28 20:23                 ` [PATCH v2 " Rik van Riel
  2 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-27 20:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On 03/27/2013 04:42 AM, Peter Zijlstra wrote:
> On Tue, 2013-03-26 at 11:19 -0400, Rik van Riel wrote:
>>> Maybe something like:
>>>
>>> void sma_lock(struct sem_array *sma) /* global */
>>> {
>>>        int i;
>>>
>>>        sma->global_locked = 1;
>>>        smp_wmb(); /* can we merge with the LOCK ? */
>>>        spin_lock(&sma->global_lock);
>>>
>>>        /* wait for all local locks to go away */
>>>        for (i = 0; i < sma->sem_nsems; i++)
>>>                spin_unlock_wait(&sem->sem_base[i]->lock);
>>> }
>>>
>>> void sma_lock_one(struct sem_array *sma, int nr) /* local */
>>> {
>>>        smp_rmb(); /* pairs with wmb in sma_lock() */
>>>        if (unlikely(sma->global_locked)) { /* wait for global lock */
>>>                while (sma->global_locked)
>>>                        spin_unlock_wait(&sma->global_lock);
>>>        }
>>>        spin_lock(&sma->sem_base[nr]->lock);
>>> }
>
> I since realized there's an ordering problem with ->global_locked, we
> need to use spin_is_locked() or somesuch.
>
> Two competing sma_lock() operations will screw over the separate
> variable.

There may be another problem with your idea.

If there are two single locks coming in for the same semaphore,
the first one holds the lock, while the second one is spinning
on the lock.

The global lock is spinning on spin_unlock_wait, which may end
up finishing after the first single lock holder unlocks, right
before the second single lock holder grabs the lock.

At that point, you have both a process that thinks it holds the
global lock, and a process that thinks it holds a single lock.

To prevent against this, the single lock probably needs to test
whether the global lock is locked, after it acquires the local
lock.

If the global lock is locked, it needs to unlock its single lock,
and then do a spin_unlock_wait on the global lock, before trying
again from the start.

Would that work?
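
Spelled out, that idea reads roughly like the sketch below (untested, field
names as in the earlier patches):

	void sma_lock_one(struct sem_array *sma, int nr) /* local */
	{
	again:
		/* don't even try while someone holds the global lock */
		spin_unlock_wait(&sma->sem_perm.lock);

		spin_lock(&sma->sem_base[nr].lock);

		/* a global locker may have slipped in before our spin_lock */
		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
			spin_unlock(&sma->sem_base[nr].lock);
			goto again;
		}
	}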


^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
  2013-03-26 17:51   ` Davidlohr Bueso
  2013-03-26 17:55   ` ipc,sem: sysv semaphore scalability Paul E. McKenney
@ 2013-03-28 15:32   ` Rik van Riel
  2013-03-28 21:05     ` Davidlohr Bueso
                       ` (2 more replies)
  2 siblings, 3 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-28 15:32 UTC (permalink / raw)
  To: Sasha Levin
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney

On Tue, 26 Mar 2013 13:33:07 -0400
Sasha Levin <sasha.levin@oracle.com> wrote:

> [   96.347341] ================================================
> [   96.348085] [ BUG: lock held when returning to user space! ]
> [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
> [   96.360300] ------------------------------------------------
> [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
> [   96.362019] 1 lock held by trinity-child9/7583:
> [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
> 
> It seems that we can leave semtimedop without releasing the rcu read lock.

Sasha, this patch untangles the RCU locking with find_alloc_undo,
and should fix the above issue. As a side benefit, this makes the
code a little cleaner.

Next up: implement locking in a way that does not trigger any 
lockdep warnings...

---8<---

Subject: ipc,sem: untangle RCU locking with find_alloc_undo

The ipc semaphore code has a nasty RCU locking tangle, with both
find_alloc_undo and semtimedop taking the rcu_read_lock(). The
code can be cleaned up somewhat by only taking the rcu_read_lock
once.

The only caller of find_alloc_undo is in semtimedop.

This should solve the trinity issue reported by Sasha Levin.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   31 +++++++++----------------------
 1 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f46441a..2ec2945 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1646,22 +1646,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 			alter = 1;
 	}
 
+	INIT_LIST_HEAD(&tasks);
+
 	if (undos) {
+		/* On success, find_alloc_undo takes the rcu_read_lock */
 		un = find_alloc_undo(ns, semid);
 		if (IS_ERR(un)) {
 			error = PTR_ERR(un);
 			goto out_free;
 		}
-	} else
+	} else {
 		un = NULL;
+		rcu_read_lock();
+	}
 
-	INIT_LIST_HEAD(&tasks);
-
-	rcu_read_lock();
 	sma = sem_obtain_object_check(ns, semid);
 	if (IS_ERR(sma)) {
-		if (un)
-			rcu_read_unlock();
+		rcu_read_unlock();
 		error = PTR_ERR(sma);
 		goto out_free;
 	}
@@ -1693,22 +1694,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	 */
 	error = -EIDRM;
 	locknum = sem_lock(sma, sops, nsops);
-	if (un) {
-		if (un->semid == -1) {
-			rcu_read_unlock();
-			goto out_unlock_free;
-		} else {
-			/*
-			 * rcu lock can be released, "un" cannot disappear:
-			 * - sem_lock is acquired, thus IPC_RMID is
-			 *   impossible.
-			 * - exit_sem is impossible, it always operates on
-			 *   current (or a dead task).
-			 */
-
-			rcu_read_unlock();
-		}
-	}
+	if (un && un->semid == -1)
+		goto out_unlock_free;
 
 	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
 	if (error <= 0) {

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-27  8:42               ` Peter Zijlstra
  2013-03-27 11:22                 ` Michel Lespinasse
  2013-03-27 20:00                 ` Rik van Riel
@ 2013-03-28 20:23                 ` Rik van Riel
  2013-03-29  2:50                   ` Michel Lespinasse
  2 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-28 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michel Lespinasse, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Wed, 27 Mar 2013 09:42:30 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> I since realized there's an ordering problem with ->global_locked, we
> need to use spin_is_locked() or somesuch.
> 
> Two competing sma_lock() operations will screw over the separate
> variable.

I created a worse version of the stress test, one that will have
the first thread always do two semops at a time (creating a complex
operation), and have all the other threads do one at a time. All
the threads issue down and then up, so there is a lot of sleeping
and waking back up with this test.
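
For reference, a minimal userspace sketch of that kind of test (not the
actual test program; the thread count, semaphore count and iteration
count here are made up for illustration): thread 0 always does a
two-operation semop, the other threads do single-semaphore down/up pairs.

	#include <pthread.h>
	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	#define NSEMS     16
	#define NTHREADS  16
	#define ITERS     1000000

	static int semid;

	static void sem_updown(int num, int op)
	{
		struct sembuf sb = { .sem_num = num, .sem_op = op, .sem_flg = 0 };
		semop(semid, &sb, 1);
	}

	/* thread 0: two semops at a time -> complex operation path */
	static void *complex_worker(void *arg)
	{
		struct sembuf sb[2];
		int i;

		for (i = 0; i < ITERS; i++) {
			sb[0] = (struct sembuf){ .sem_num = 0, .sem_op = -1 };
			sb[1] = (struct sembuf){ .sem_num = 1, .sem_op = -1 };
			semop(semid, sb, 2);		/* down both atomically */
			sb[0].sem_op = sb[1].sem_op = 1;
			semop(semid, sb, 2);		/* up both again */
		}
		return NULL;
	}

	/* remaining threads: one semop at a time -> per-semaphore lock path */
	static void *simple_worker(void *arg)
	{
		int num = (long)arg % NSEMS;
		int i;

		for (i = 0; i < ITERS; i++) {
			sem_updown(num, -1);		/* down: may sleep */
			sem_updown(num, 1);		/* up: wakes a sleeper */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];
		long i;

		semid = semget(IPC_PRIVATE, NSEMS, IPC_CREAT | 0600);
		for (i = 0; i < NSEMS; i++)
			sem_updown(i, 1);		/* start each semaphore at 1 */

		pthread_create(&tid[0], NULL, complex_worker, NULL);
		for (i = 1; i < NTHREADS; i++)
			pthread_create(&tid[i], NULL, simple_worker, (void *)i);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);

		semctl(semid, 0, IPC_RMID);
		return 0;
	}

Something like gcc -O2 -pthread is enough to build it.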

The patch below should fix the last lockdep issue (the other issue
being fixed by the RCU patch), and is running the various semop
tests just fine on a 48 CPU system.

The locking is a little more complicated than I would have liked,
but a few hundred million semaphore operations suggest there are
probably no race conditions left...

---8<---

Subject: [PATCH -mm -next] ipc,sem: change locking scheme to make lockdep happy

Unfortunately the locking scheme originally proposed has false positives
with lockdep.  This can be fixed by changing the code to only ever take
one lock, and making sure that other relevant locks are not locked, before
entering a critical section.

For the "global lock" case, this is done by taking the sem_array lock,
and then (potentially) waiting for all the semaphore's spinlocks to be
unlocked.

For the "local lock" case, we wait on the sem_array's lock to be free,
before taking the semaphore local lock. To prevent races, we need to
check again after we have taken the local lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   55 ++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 36500a6..87b74d5 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -320,24 +320,39 @@ void __init sem_init (void)
 }
 
 /*
- * If the sem_array contains just one semaphore, or if multiple
- * semops are performed in one syscall, or if there are complex
- * operations pending, the whole sem_array is locked.
- * If one semop is performed on an array with multiple semaphores,
- * get a shared lock on the array, and lock the individual semaphore.
+ * If the request contains only one semaphore operation, and there are
+ * no complex transactions pending, lock only the semaphore involved.
+ * Otherwise, lock the entire semaphore array, since we either have
+ * multiple semaphores in our own semops, or we need to look at
+ * semaphores from other pending complex operations.
  *
  * Carefully guard against sma->complex_count changing between zero
  * and non-zero while we are spinning for the lock. The value of
  * sma->complex_count cannot change while we are holding the lock,
  * so sem_unlock should be fine.
+ *
+ * The global lock path checks that all the local locks have been released,
+ * checking each local lock once. This means that the local lock paths
+ * cannot start their critical sections while the global lock is held.
  */
 static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 			      int nsops)
 {
 	int locknum;
+ again:
 	if (nsops == 1 && !sma->complex_count) {
 		struct sem *sem = sma->sem_base + sops->sem_num;
 
+		/*
+		 * Another process is holding the global lock on the
+		 * sem_array. Wait for that process to release the lock,
+		 * before acquiring our lock.
+		 */
+		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
+			spin_unlock_wait(&sma->sem_perm.lock);
+			goto again;
+		}
+
 		/* Lock just the semaphore we are interested in. */
 		spin_lock(&sem->lock);
 
@@ -347,17 +362,33 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		if (unlikely(sma->complex_count)) {
 			spin_unlock(&sem->lock);
-			goto lock_all;
+			goto lock_array;
+		}
+
+		/*
+		 * Another process is holding the global lock on the
+		 * sem_array; we cannot enter our critical section,
+		 * but have to wait for the global lock to be released.
+		 */
+		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
+			spin_unlock(&sem->lock);
+			goto again;
 		}
+
 		locknum = sops->sem_num;
 	} else {
 		int i;
-		/* Lock the sem_array, and all the semaphore locks */
- lock_all:
+		/*
+		 * Lock the semaphore array, and wait for all of the
+		 * individual semaphore locks to go away.  The code
+		 * above ensures no new single-lock holders will enter
+		 * their critical section while the array lock is held.
+		 */
+ lock_array:
 		spin_lock(&sma->sem_perm.lock);
 		for (i = 0; i < sma->sem_nsems; i++) {
 			struct sem *sem = sma->sem_base + i;
-			spin_lock(&sem->lock);
+			spin_unlock_wait(&sem->lock);
 		}
 		locknum = -1;
 	}
@@ -367,11 +398,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
-		int i;
-		for (i = 0; i < sma->sem_nsems; i++) {
-			struct sem *sem = sma->sem_base + i;
-			spin_unlock(&sem->lock);
-		}
 		spin_unlock(&sma->sem_perm.lock);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -558,7 +584,6 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	for (i = 0; i < nsems; i++) {
 		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
 		spin_lock_init(&sma->sem_base[i].lock);
-		spin_lock(&sma->sem_base[i].lock);
 	}
 
 	sma->complex_count = 0;

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
@ 2013-03-28 21:05     ` Davidlohr Bueso
  2013-03-29  1:00     ` Michel Lespinasse
  2013-03-30 13:35     ` Sasha Levin
  2 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-28 21:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, torvalds, linux-kernel, akpm, hhuang, jason.low2,
	walken, lwoodman, chegu_vinod, Paul E. McKenney

On Thu, 2013-03-28 at 11:32 -0400, Rik van Riel wrote:
> On Tue, 26 Mar 2013 13:33:07 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
> > [   96.347341] ================================================
> > [   96.348085] [ BUG: lock held when returning to user space! ]
> > [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
> > [   96.360300] ------------------------------------------------
> > [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
> > [   96.362019] 1 lock held by trinity-child9/7583:
> > [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
> > 
> > It seems that we can leave semtimedop without releasing the rcu read lock.
> 
> Sasha, this patch untangles the RCU locking with find_alloc_undo,
> and should fix the above issue. As a side benefit, this makes the
> code a little cleaner.
> 
> Next up: implement locking in a way that does not trigger any 
> lockdep warnings...
> 
> ---8<---
> 
> Subject: ipc,sem: untangle RCU locking with find_alloc_undo
> 
> The ipc semaphore code has a nasty RCU locking tangle, with both
> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
> code can be cleaned up somewhat by only taking the rcu_read_lock
> once.

indeed!

> 
> The only caller of find_alloc_undo is in semtimedop.
> 
> This should solve the trinity issue reported by Sasha Levin.
> 
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  ipc/sem.c |   31 +++++++++----------------------
>  1 files changed, 9 insertions(+), 22 deletions(-)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index f46441a..2ec2945 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1646,22 +1646,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>  			alter = 1;
>  	}
>  
> +	INIT_LIST_HEAD(&tasks);
> +
>  	if (undos) {
> +		/* On success, find_alloc_undo takes the rcu_read_lock */
>  		un = find_alloc_undo(ns, semid);

find_alloc_undo() has some nested rcu_read_locks of its own. We can
simplify that as well. Will look into it, but don't want to introduce
any more changes until we address all the issues with the patchset and
know that it behaves.

>  		if (IS_ERR(un)) {
>  			error = PTR_ERR(un);
>  			goto out_free;
>  		}
> -	} else
> +	} else {
>  		un = NULL;
> +		rcu_read_lock();
> +	}
>  
> -	INIT_LIST_HEAD(&tasks);
> -
> -	rcu_read_lock();
>  	sma = sem_obtain_object_check(ns, semid);
>  	if (IS_ERR(sma)) {
> -		if (un)
> -			rcu_read_unlock();
> +		rcu_read_unlock();
>  		error = PTR_ERR(sma);
>  		goto out_free;
>  	}
> @@ -1693,22 +1694,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>  	 */
>  	error = -EIDRM;
>  	locknum = sem_lock(sma, sops, nsops);
> -	if (un) {
> -		if (un->semid == -1) {
> -			rcu_read_unlock();
> -			goto out_unlock_free;
> -		} else {
> -			/*
> -			 * rcu lock can be released, "un" cannot disappear:
> -			 * - sem_lock is acquired, thus IPC_RMID is
> -			 *   impossible.
> -			 * - exit_sem is impossible, it always operates on
> -			 *   current (or a dead task).
> -			 */
> -
> -			rcu_read_unlock();
> -		}
> -	}
> +	if (un && un->semid == -1)
> +		goto out_unlock_free;

Yeah, I was tempted to do something much like this, but didn't want
to change any existing logic. Hopefully we can get away with this and it
fixes Sasha's issue.

>  
>  	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
>  	if (error <= 0) {

Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  2013-03-28 21:05     ` Davidlohr Bueso
@ 2013-03-29  1:00     ` Michel Lespinasse
  2013-03-29  1:14       ` Sasha Levin
  2013-03-30 13:35     ` Sasha Levin
  2 siblings, 1 reply; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-29  1:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Paul E. McKenney

On Thu, Mar 28, 2013 at 8:32 AM, Rik van Riel <riel@surriel.com> wrote:
> The ipc semaphore code has a nasty RCU locking tangle, with both
> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
> code can be cleaned up somewhat by only taking the rcu_read_lock
> once.
>
> The only caller of find_alloc_undo is in semtimedop.
>
> This should solve the trinity issue reported by Sasha Levin.
>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Looks good^Wbetter. I have nothing specific to say other than I've
been staring at it for 10 minutes :)

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-29  1:00     ` Michel Lespinasse
@ 2013-03-29  1:14       ` Sasha Levin
  0 siblings, 0 replies; 129+ messages in thread
From: Sasha Levin @ 2013-03-29  1:14 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, akpm,
	hhuang, jason.low2, lwoodman, chegu_vinod, Paul E. McKenney

On 03/28/2013 09:00 PM, Michel Lespinasse wrote:
> On Thu, Mar 28, 2013 at 8:32 AM, Rik van Riel <riel@surriel.com> wrote:
>> The ipc semaphore code has a nasty RCU locking tangle, with both
>> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
>> code can be cleaned up somewhat by only taking the rcu_read_lock
>> once.
>>
>> The only caller of find_alloc_undo is in semtimedop.
>>
>> This should solve the trinity issue reported by Sasha Levin.
>>
>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
> 
> Looks good^Wbetter. I have nothing specific to say other than I've
> been staring at it for 10 minutes :)
> 
> Reviewed-by: Michel Lespinasse <walken@google.com>

And the warnings are gone:

	Tested-by: Sasha Levin <sasha.levin@oracle.com>


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-28 20:23                 ` [PATCH v2 " Rik van Riel
@ 2013-03-29  2:50                   ` Michel Lespinasse
  2013-03-29  9:57                     ` Peter Zijlstra
                                       ` (2 more replies)
  0 siblings, 3 replies; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-29  2:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Thu, Mar 28, 2013 at 1:23 PM, Rik van Riel <riel@surriel.com> wrote:
> Subject: [PATCH -mm -next] ipc,sem: change locking scheme to make lockdep happy
>
> Unfortunately the locking scheme originally proposed has false positives
> with lockdep.  This can be fixed by changing the code to only ever take
> one lock, and making sure that other relevant locks are not locked, before
> entering a critical section.
>
> For the "global lock" case, this is done by taking the sem_array lock,
> and then (potentially) waiting for all the semaphore's spinlocks to be
> unlocked.
>
> For the "local lock" case, we wait on the sem_array's lock to be free,
> before taking the semaphore local lock. To prevent races, we need to
> check again after we have taken the local lock.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>

TL;DR: The locking algorithm is not familiar to me, but it seems
sound. There are some implementation details I don't like. More
below...

> ---
>  ipc/sem.c |   55 ++++++++++++++++++++++++++++++++++++++++---------------
>  1 files changed, 40 insertions(+), 15 deletions(-)
>
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 36500a6..87b74d5 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -320,24 +320,39 @@ void __init sem_init (void)
>  }
>
>  /*
> - * If the sem_array contains just one semaphore, or if multiple
> - * semops are performed in one syscall, or if there are complex
> - * operations pending, the whole sem_array is locked.
> - * If one semop is performed on an array with multiple semaphores,
> - * get a shared lock on the array, and lock the individual semaphore.
> + * If the request contains only one semaphore operation, and there are
> + * no complex transactions pending, lock only the semaphore involved.
> + * Otherwise, lock the entire semaphore array, since we either have
> + * multiple semaphores in our own semops, or we need to look at
> + * semaphores from other pending complex operations.
>   *
>   * Carefully guard against sma->complex_count changing between zero
>   * and non-zero while we are spinning for the lock. The value of
>   * sma->complex_count cannot change while we are holding the lock,
>   * so sem_unlock should be fine.
> + *
> + * The global lock path checks that all the local locks have been released,
> + * checking each local lock once. This means that the local lock paths
> + * cannot start their critical sections while the global lock is held.
>   */
>  static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>                               int nsops)
>  {
>         int locknum;
> + again:
>         if (nsops == 1 && !sma->complex_count) {
>                 struct sem *sem = sma->sem_base + sops->sem_num;
>
> +               /*
> +                * Another process is holding the global lock on the
> +                * sem_array. Wait for that process to release the lock,
> +                * before acquiring our lock.
> +                */
> +               if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
> +                       spin_unlock_wait(&sma->sem_perm.lock);
> +                       goto again;
> +               }
> +

So, there are a few things I don't like about spin_unlock_wait():

1- From a lock ordering point of view, it is strictly equivalent to
taking the lock and then releasing it - and yet, lockdep won't catch
any deadlocks that involve spin_unlock_wait. (Not your fault here,
this should be fixed as a separate change in lockdep. I manually
looked at the lock ordering here and found it safe).

2- With the current ticket lock implementation, a stream of lockers
can starve spin_unlock_wait() forever. Once again, not your fault and
I suspect this could be fixed - I expect spin_unlock_wait() callers
actually only want to know that the lock has been passed on, not that
it actually got to an unlocked state.

3- Regarding your actual use here - I find it confusing to call
spin_unlock_wait() before holding any other lock. The pattern I expect
to see is that people take one lock, then see that the other lock they
want is already taken, so they release the first lock and wait on the
second. So, I'd suggest we remove the sem_perm.lock checks here and
deal with this in a retry path later down.
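
To make the contrast concrete, here is a rough sketch of the two
orderings, where a stands for the per-semaphore lock and b for
sma->sem_perm.lock (illustration only, not code from the patch; both
variants return with a held and b observed unlocked):

	/* v2 patch: wait for b to be free before taking a at all */
	static void lock_wait_first(spinlock_t *a, spinlock_t *b)
	{
		for (;;) {
			if (spin_is_locked(b)) {
				spin_unlock_wait(b);
				continue;
			}
			spin_lock(a);
			if (!spin_is_locked(b))
				break;		/* critical section may start */
			spin_unlock(a);
		}
	}

	/* pattern suggested above: take a first, back off if b is held */
	static void lock_then_wait(spinlock_t *a, spinlock_t *b)
	{
		for (;;) {
			spin_lock(a);
			if (!spin_is_locked(b))
				break;
			spin_unlock(a);
			spin_unlock_wait(b);
		}
	}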

>                 /* Lock just the semaphore we are interested in. */
>                 spin_lock(&sem->lock);
>
> @@ -347,17 +362,33 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>                  */
>                 if (unlikely(sma->complex_count)) {
>                         spin_unlock(&sem->lock);
> -                       goto lock_all;
> +                       goto lock_array;
> +               }
> +
> +               /*
> +                * Another process is holding the global lock on the
> +                * sem_array; we cannot enter our critical section,
> +                * but have to wait for the global lock to be released.
> +                */
> +               if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
> +                       spin_unlock(&sem->lock);
> +                       goto again;

This is IMO where the spin_unlock_wait(&sma->sem_perm.lock) would
belong - right before the goto again.

Also - I think there is a risk that an endless stream of complex
semops could starve a simple semop here, as it would always find the
sem_perm.lock to be locked ??? One easy way to guarantee progress
would be to goto lock_array instead; however there is then the issue
that a complex semop could force an endless stream of following simple
semops to take the lock_array path. I'm not sure which of these
problems is preferable to have...

>                 }
> +
>                 locknum = sops->sem_num;
>         } else {
>                 int i;
> -               /* Lock the sem_array, and all the semaphore locks */
> - lock_all:
> +               /*
> +                * Lock the semaphore array, and wait for all of the
> +                * individual semaphore locks to go away.  The code
> +                * above ensures no new single-lock holders will enter
> +                * their critical section while the array lock is held.
> +                */
> + lock_array:
>                 spin_lock(&sma->sem_perm.lock);
>                 for (i = 0; i < sma->sem_nsems; i++) {
>                         struct sem *sem = sma->sem_base + i;
> -                       spin_lock(&sem->lock);
> +                       spin_unlock_wait(&sem->lock);
>                 }
>                 locknum = -1;
>         }

Subtle, but it'll work (modulo the starvation issue I mentioned).

Cheers,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29  2:50                   ` Michel Lespinasse
@ 2013-03-29  9:57                     ` Peter Zijlstra
  2013-03-29 13:21                       ` Michel Lespinasse
  2013-03-29 12:07                     ` Rik van Riel
  2013-03-29 13:55                     ` [PATCH v3 " Rik van Riel
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Zijlstra @ 2013-03-29  9:57 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Thu, 2013-03-28 at 19:50 -0700, Michel Lespinasse wrote:
> So, there are a few things I don't like about spin_unlock_wait():
> 
> 1- From a lock ordering point of view, it is strictly equivalent to
> taking the lock and then releasing it - and yet, lockdep won't catch
> any deadlocks that involve spin_unlock_wait. (Not your fault here,
> this should be fixed as a separate change in lockdep. I manually
> looked at the lock ordering here and found it safe).

Ooh, I never noticed that, but indeed this shouldn't be hard to cure.

> 2- With the current ticket lock implementation, a stream of lockers
> can starve spin_unlock_wait() forever. Once again, not your fault and
> I suspect this could be fixed - I expect spin_unlock_wait() callers
> actually only want to know that the lock has been passed on, not that
> it actually got to an unlocked state.

I suppose the question is do we want to fix it or have both semantics
and use lock+unlock where appropriate.



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29  2:50                   ` Michel Lespinasse
  2013-03-29  9:57                     ` Peter Zijlstra
@ 2013-03-29 12:07                     ` Rik van Riel
  2013-03-29 13:08                       ` Michel Lespinasse
  2013-03-29 13:55                     ` [PATCH v3 " Rik van Riel
  2 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-29 12:07 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On 03/28/2013 10:50 PM, Michel Lespinasse wrote:
> On Thu, Mar 28, 2013 at 1:23 PM, Rik van Riel <riel@surriel.com> wrote:
>> Subject: [PATCH -mm -next] ipc,sem: change locking scheme to make lockdep happy
>>
>> Unfortunately the locking scheme originally proposed has false positives
>> with lockdep.  This can be fixed by changing the code to only ever take
>> one lock, and making sure that other relevant locks are not locked, before
>> entering a critical section.
>>
>> For the "global lock" case, this is done by taking the sem_array lock,
>> and then (potentially) waiting for all the semaphore's spinlocks to be
>> unlocked.
>>
>> For the "local lock" case, we wait on the sem_array's lock to be free,
>> before taking the semaphore local lock. To prevent races, we need to
>> check again after we have taken the local lock.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>
> TL;DR: The locking algorithm is not familiar to me, but it seems
> sound. There are some implementation details I don't like. More
> below...
>
>> ---
>>   ipc/sem.c |   55 ++++++++++++++++++++++++++++++++++++++++---------------
>>   1 files changed, 40 insertions(+), 15 deletions(-)
>>
>> diff --git a/ipc/sem.c b/ipc/sem.c
>> index 36500a6..87b74d5 100644
>> --- a/ipc/sem.c
>> +++ b/ipc/sem.c
>> @@ -320,24 +320,39 @@ void __init sem_init (void)
>>   }
>>
>>   /*
>> - * If the sem_array contains just one semaphore, or if multiple
>> - * semops are performed in one syscall, or if there are complex
>> - * operations pending, the whole sem_array is locked.
>> - * If one semop is performed on an array with multiple semaphores,
>> - * get a shared lock on the array, and lock the individual semaphore.
>> + * If the request contains only one semaphore operation, and there are
>> + * no complex transactions pending, lock only the semaphore involved.
>> + * Otherwise, lock the entire semaphore array, since we either have
>> + * multiple semaphores in our own semops, or we need to look at
>> + * semaphores from other pending complex operations.
>>    *
>>    * Carefully guard against sma->complex_count changing between zero
>>    * and non-zero while we are spinning for the lock. The value of
>>    * sma->complex_count cannot change while we are holding the lock,
>>    * so sem_unlock should be fine.
>> + *
>> + * The global lock path checks that all the local locks have been released,
>> + * checking each local lock once. This means that the local lock paths
>> + * cannot start their critical sections while the global lock is held.
>>    */
>>   static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>>                                int nsops)
>>   {
>>          int locknum;
>> + again:
>>          if (nsops == 1 && !sma->complex_count) {
>>                  struct sem *sem = sma->sem_base + sops->sem_num;
>>
>> +               /*
>> +                * Another process is holding the global lock on the
>> +                * sem_array. Wait for that process to release the lock,
>> +                * before acquiring our lock.
>> +                */
>> +               if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
>> +                       spin_unlock_wait(&sma->sem_perm.lock);
>> +                       goto again;
>> +               }
>> +
>
> So, there are a few things I don't like about spin_unlock_wait():
>
> 1- From a lock ordering point of view, it is strictly equivalent to
> taking the lock and then releasing it - and yet, lockdep won't catch
> any deadlocks that involve spin_unlock_wait. (Not your fault here,
> this should be fixed as a separate change in lockdep. I manually
> looked at the lock ordering here and found it safe).
>
> 2- With the current ticket lock implementation, a stream of lockers
> can starve spin_unlock_wait() forever. Once again, not your fault and
> I suspect this could be fixed - I expect spin_unlock_wait() callers
> actually only want to know that the lock has been passed on, not that
> it actually got to an unlocked state.
>
> 3- Regarding your actual use here - I find it confusing to call
> spin_unlock_wait() before holding any other lock. The pattern I expect
> to see is that people take one lock, then see that the other lock they
> want is already taken, so they release the first lock and wait on the
> second. So, I'd suggest we remove the sem_perm.lock checks here and
> deal with this in a retry path later down.
>
>>                  /* Lock just the semaphore we are interested in. */
>>                  spin_lock(&sem->lock);
>>
>> @@ -347,17 +362,33 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
>>                   */
>>                  if (unlikely(sma->complex_count)) {
>>                          spin_unlock(&sem->lock);
>> -                       goto lock_all;
>> +                       goto lock_array;
>> +               }
>> +
>> +               /*
>> +                * Another process is holding the global lock on the
>> +                * sem_array; we cannot enter our critical section,
>> +                * but have to wait for the global lock to be released.
>> +                */
>> +               if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
>> +                       spin_unlock(&sem->lock);
>> +                       goto again;
>
> This is IMO where the spin_unlock_wait(&sma->sem_perm.lock) would
> belong - right before the goto again.

That is where I had it initially. I may have gotten too clever
and worked on keeping more accesses read-only. If you want, I
can move it back here and re-submit the patch :)

> Also - I think there is a risk that an endless stream of complex
> semops could starve a simple semop here, as it would always find the
> sem_perm.lock to be locked ??? One easy way to guarantee progress
> would be to goto lock_array instead; however there is then the issue
> that a complex semop could force an endless stream of following simple
> semops to take the lock_array path. I'm not sure which of these
> problems is preferable to have...

If starvation turns out to be an issue, there is an even
simpler solution:

	if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
		spin_unlock(&sem->lock);
		spin_lock(&sma->sem_perm.lock);
		spin_lock(&sem->lock);
		spin_unlock(&sma->sem_perm.lock);
	}

Followed by unconditionally doing the critical section for
holding a single semaphore's lock, because we know that a
subsequent taker of sma->sem_perm.lock will either grab a
different semaphore's spinlock, or wait on the same semaphore's
spinlock, or wait for us to unlock our spinlock.
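
Sketched in context, the single-semaphore fast path would then look
roughly like this (an illustration of the idea above, not a tested
patch):

	if (nsops == 1 && !sma->complex_count) {
		struct sem *sem = sma->sem_base + sops->sem_num;

		spin_lock(&sem->lock);

		if (unlikely(sma->complex_count)) {
			spin_unlock(&sem->lock);
			goto lock_array;
		}

		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
			/* queue up behind the global lock holder instead
			 * of spinning; this bounds the wait even with a
			 * steady stream of complex operations */
			spin_unlock(&sem->lock);
			spin_lock(&sma->sem_perm.lock);
			spin_lock(&sem->lock);
			spin_unlock(&sma->sem_perm.lock);
		}

		/* safe to enter the single-semaphore critical section */
		locknum = sops->sem_num;
	}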

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29 12:07                     ` Rik van Riel
@ 2013-03-29 13:08                       ` Michel Lespinasse
  2013-03-29 13:24                         ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-29 13:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Fri, Mar 29, 2013 at 5:07 AM, Rik van Riel <riel@surriel.com> wrote:
> On 03/28/2013 10:50 PM, Michel Lespinasse wrote:
>> On Thu, Mar 28, 2013 at 1:23 PM, Rik van Riel <riel@surriel.com> wrote:
>>>                  if (unlikely(sma->complex_count)) {
>>>                          spin_unlock(&sem->lock);
>>> -                       goto lock_all;
>>> +                       goto lock_array;
>>> +               }
>>> +
>>> +               /*
>>> +                * Another process is holding the global lock on the
>>> +                * sem_array; we cannot enter our critical section,
>>> +                * but have to wait for the global lock to be released.
>>> +                */
>>> +               if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
>>> +                       spin_unlock(&sem->lock);
>>> +                       goto again;
>>
>> This is IMO where the spin_unlock_wait(&sma->sem_perm.lock) would
>> belong - right before the goto again.
>
> That is where I had it initially. I may have gotten too clever
> and worked on keeping more accesses read-only. If you want, I
> can move it back here and re-submit the patch :)

I think I would prefer that - I feel having it earlier serves little
purpose, as you still need to check it here, so we might as well be
optimistic and check it only here. Or maybe I've missed the benefit of
having it earlier - I just don't see it.

>> Also - I think there is a risk that an endless stream of complex
>> semops could starve a simple semop here, as it would always find the
>> sem_perm.lock to be locked ??? One easy way to guarantee progress
>> would be to goto lock_array instead; however there is then the issue
>> that a complex semop could force an endless stream of following simple
>> semops to take the lock_array path. I'm not sure which of these
>> problems is preferable to have...
>
> If starvation turns out to be an issue, there is an even
> simpler solution:
>
>         if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
>                 spin_unlock(&sem->lock);
>                 spin_lock(&sma->sem_perm.lock);
>                 spin_lock(&sem->lock);
>                 spin_unlock(&sma->sem_perm.lock);
>         }

This is kinda nice - certainly nicer than falling back to the lock_array case.

It still makes it possible (though less likely) that each simple semop
taking this path might make the next one take it too, though. So, I'm
not sure how to balance that against the starvation possibility. I'll
leave it up to you then :)

Cheers,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29  9:57                     ` Peter Zijlstra
@ 2013-03-29 13:21                       ` Michel Lespinasse
  0 siblings, 0 replies; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-29 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Fri, Mar 29, 2013 at 2:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2013-03-28 at 19:50 -0700, Michel Lespinasse wrote:
>> So, there are a few things I don't like about spin_unlock_wait():
>>
>> 1- From a lock ordering point of view, it is strictly equivalent to
>> taking the lock and then releasing it - and yet, lockdep won't catch
>> any deadlocks that involve spin_unlock_wait. (Not your fault here,
>> this should be fixed as a separate change in lockdep. I manually
>> looked at the lock ordering here and found it safe).
>
> Ooh, I never noticed that, but indeed this shouldn't be hard to cure.
>
>> 2- With the current ticket lock implementation, a stream of lockers
>> can starve spin_unlock_wait() forever. Once again, not your fault and
>> I suspect this could be fixed - I expect spin_unlock_wait() callers
>> actually only want to know that the lock has been passed on, not that
>> it actually got to an unlocked state.
>
> I suppose the question is do we want to fix it or have both semantics
> and use lock+unlock where appropriate.

We'd have to look at the users to be sure, but I strongly expect they
don't need to get in line waiting - it's sufficient to just wait for
the head of the queue to move (or for the queue to be empty).
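
As a toy illustration of the difference (the structure and function
names below are made up, this is not the kernel's ticket lock code):

	struct toy_ticket_lock {
		unsigned short head;	/* ticket currently being served */
		unsigned short tail;	/* next ticket to be handed out */
	};

	/* current spin_unlock_wait() semantics: wait for a fully unlocked
	 * state; a steady stream of new lockers keeps tail ahead of head
	 * and can starve this loop forever */
	static void toy_unlock_wait(struct toy_ticket_lock *l)
	{
		while (ACCESS_ONCE(l->head) != ACCESS_ONCE(l->tail))
			cpu_relax();
	}

	/* what the callers arguably want: wait only until the holder we
	 * observed has passed the lock on, i.e. head has advanced; this
	 * is bounded even under a constant stream of new lockers */
	static void toy_unlock_wait_handoff(struct toy_ticket_lock *l)
	{
		unsigned short head = ACCESS_ONCE(l->head);

		if (head == ACCESS_ONCE(l->tail))
			return;			/* it was not locked at all */
		while (ACCESS_ONCE(l->head) == head)
			cpu_relax();
	}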

There are actually very few users - just drivers/ata/libata-eh.c for
the spin_unlock_wait() function, and a couple more (kernel/task_work.c
and kernel/exit.c) for the raw_spin_unlock_wait variant. Guess I'm not
the only one to dislike that function :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH v2 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29 13:08                       ` Michel Lespinasse
@ 2013-03-29 13:24                         ` Rik van Riel
  0 siblings, 0 replies; 129+ messages in thread
From: Rik van Riel @ 2013-03-29 13:24 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On 03/29/2013 09:08 AM, Michel Lespinasse wrote:

>> That is where I had it initially. I may have gotten too clever
>> and worked on keeping more accesses read-only. If you want, I
>> can move it back here and re-submit the patch :)
>
> I think I would prefer that - I feel having it earlier serves little
> purpose, as you still need to check it here, so we might as well be
> optimistic and check it only here. Or maybe I've missed the benefit of
> having it earlier - I just don't see it.

OK, will do.

> It still makes it possible (though less likely) that each simple semop
> taking this path might make the next one take it too, though. So, I'm
> not sure how to balance that against the starvation possibility. I'll
> leave it up to you then :)

Let's optimistically assume that userland is using the semaphores
for synchronization. That rather limits the potential for starvation.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [PATCH v3 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29  2:50                   ` Michel Lespinasse
  2013-03-29  9:57                     ` Peter Zijlstra
  2013-03-29 12:07                     ` Rik van Riel
@ 2013-03-29 13:55                     ` Rik van Riel
  2013-03-29 13:59                       ` Michel Lespinasse
  2 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-29 13:55 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Thu, 28 Mar 2013 19:50:47 -0700
Michel Lespinasse <walken@google.com> wrote:

> This is IMO where the spin_unlock_wait(&sma->sem_perm.lock) would
> belong - right before the goto again.

Here is the slightly more optimistic (and probably more readable)
version of the patch:

---8<---
Unfortunately the locking scheme originally proposed has false positives
with lockdep.  This can be fixed by changing the code to only ever take
one lock, and making sure that other relevant locks are not locked, before
entering a critical section.

For the "global lock" case, this is done by taking the sem_array lock,
and then (potentially) waiting for all the semaphore's spinlocks to be
unlocked.

For the "local lock" case, we wait on the sem_array's lock to be free,
before taking the semaphore local lock. To prevent races, we need to
check again after we have taken the local lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 ipc/sem.c |   46 +++++++++++++++++++++++++++++++---------------
 1 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 36500a6..5142171 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -320,21 +320,26 @@ void __init sem_init (void)
 }
 
 /*
- * If the sem_array contains just one semaphore, or if multiple
- * semops are performed in one syscall, or if there are complex
- * operations pending, the whole sem_array is locked.
- * If one semop is performed on an array with multiple semaphores,
- * get a shared lock on the array, and lock the individual semaphore.
+ * If the request contains only one semaphore operation, and there are
+ * no complex transactions pending, lock only the semaphore involved.
+ * Otherwise, lock the entire semaphore array, since we either have
+ * multiple semaphores in our own semops, or we need to look at
+ * semaphores from other pending complex operations.
  *
  * Carefully guard against sma->complex_count changing between zero
  * and non-zero while we are spinning for the lock. The value of
  * sma->complex_count cannot change while we are holding the lock,
  * so sem_unlock should be fine.
+ *
+ * The global lock path checks that all the local locks have been released,
+ * checking each local lock once. This means that the local lock paths
+ * cannot start their critical sections while the global lock is held.
  */
 static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 			      int nsops)
 {
 	int locknum;
+ again:
 	if (nsops == 1 && !sma->complex_count) {
 		struct sem *sem = sma->sem_base + sops->sem_num;
 
@@ -347,17 +352,34 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		 */
 		if (unlikely(sma->complex_count)) {
 			spin_unlock(&sem->lock);
-			goto lock_all;
+			goto lock_array;
+		}
+
+		/*
+		 * Another process is holding the global lock on the
+		 * sem_array; we cannot enter our critical section,
+		 * but have to wait for the global lock to be released.
+		 */
+		if (unlikely(spin_is_locked(&sma->sem_perm.lock))) {
+			spin_unlock(&sem->lock);
+			spin_unlock_wait(&sma->sem_perm.lock);
+			goto again;
 		}
+
 		locknum = sops->sem_num;
 	} else {
 		int i;
-		/* Lock the sem_array, and all the semaphore locks */
- lock_all:
+		/*
+		 * Lock the semaphore array, and wait for all of the
+		 * individual semaphore locks to go away.  The code
+		 * above ensures no new single-lock holders will enter
+		 * their critical section while the array lock is held.
+		 */
+ lock_array:
 		spin_lock(&sma->sem_perm.lock);
 		for (i = 0; i < sma->sem_nsems; i++) {
 			struct sem *sem = sma->sem_base + i;
-			spin_lock(&sem->lock);
+			spin_unlock_wait(&sem->lock);
 		}
 		locknum = -1;
 	}
@@ -367,11 +389,6 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
-		int i;
-		for (i = 0; i < sma->sem_nsems; i++) {
-			struct sem *sem = sma->sem_base + i;
-			spin_unlock(&sem->lock);
-		}
 		spin_unlock(&sma->sem_perm.lock);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -558,7 +575,6 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	for (i = 0; i < nsems; i++) {
 		INIT_LIST_HEAD(&sma->sem_base[i].sem_pending);
 		spin_lock_init(&sma->sem_base[i].lock);
-		spin_lock(&sma->sem_base[i].lock);
 	}
 
 	sma->complex_count = 0;

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH v3 -mm -next] ipc,sem: fix lockdep false positive
  2013-03-29 13:55                     ` [PATCH v3 " Rik van Riel
@ 2013-03-29 13:59                       ` Michel Lespinasse
  0 siblings, 0 replies; 129+ messages in thread
From: Michel Lespinasse @ 2013-03-29 13:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Sasha Levin, torvalds, davidlohr.bueso,
	linux-kernel, akpm, hhuang, jason.low2, lwoodman, chegu_vinod,
	Dave Jones, benisty.e, Ingo Molnar

On Fri, Mar 29, 2013 at 6:55 AM, Rik van Riel <riel@surriel.com> wrote:
> On Thu, 28 Mar 2013 19:50:47 -0700
> Michel Lespinasse <walken@google.com> wrote:
>
>> This is IMO where the spin_unlock_wait(&sma->sem_perm.lock) would
>> belong - right before the goto again.
>
> Here is the slightly more optimistic (and probably more readable)
> version of the patch:
>
> ---8<---
> Unfortunately the locking scheme originally proposed has false positives
> with lockdep.  This can be fixed by changing the code to only ever take
> one lock, and making sure that other relevant locks are not locked, before
> entering a critical section.
>
> For the "global lock" case, this is done by taking the sem_array lock,
> and then (potentially) waiting for all the semaphore's spinlocks to be
> unlocked.
>
> For the "local lock" case, we wait on the sem_array's lock to be free,
> before taking the semaphore local lock. To prevent races, we need to
> check again after we have taken the local lock.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>

Looks good.

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 19:43     ` Andrew Morton
@ 2013-03-29 16:17       ` Dave Jones
  2013-03-29 18:00         ` Linus Torvalds
                           ` (2 more replies)
  2013-03-29 19:01       ` Dave Jones
  1 sibling, 3 replies; 129+ messages in thread
From: Dave Jones @ 2013-03-29 16:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Peter Hurley

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones <davej@redhat.com> wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without incident.
 > > Normally I see the oops within a minute or so.
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton <akpm@linux-foundation.org>
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

Now that I have that reverted, I'm not seeing msgrcv traces any more, but 
I've started seeing this..

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: l2tp_ppp l2tp_netlink l2tp_core llc2 phonet netrom rose af_key af_rxrpc caif_socket caif can_raw cmtp kernelcapi can_bcm can nfnetlink ipt_ULOG scsi_transport_iscsi af_802154 ax25 atm ipx pppoe pppox x25 nfc irda ppp_generic p8023 slhc p8022 appletalk decnet crc_ccitt rds psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec raid0 snd_pcm rfkill microcode serio_raw pcspkr snd_page_alloc edac_core snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 3 
Pid: 1850, comm: trinity-child37 Tainted: G    B        3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff812c20fb>]  [<ffffffff812c20fb>] free_msg+0x2b/0x40
RSP: 0018:ffff8800a1d3bdd0  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff810b6ced
RBP: ffff8800a1d3bde0 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88009997e620
R13: ffffffff81c7ace0 R14: ffff8800caf359d8 R15: ffffffff81c7b024
FS:  00007f2d7be64740(0000) GS:ffff88012ac00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f8bd7bb6000 CR3: 00000000a1f0a000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child37 (pid: 1850, threadinfo ffff8800a1d3a000, task ffff8800a1e62490)
Stack:
 6b6b6b6b6b6b6b6b ffff8800caf35928 ffff8800a1d3be18 ffffffff812c289f
 0000000000000000 ffffffff81c7ace0 ffff8800caf35928 0000000000000000
 ffff8800a1d3be28 ffff8800a1d3bec8 ffffffff812c2a93 ffff8800a1d3be40
Call Trace:
 [<ffffffff812c289f>] freeque+0xcf/0x140
 [<ffffffff812c2a93>] msgctl_down.constprop.9+0x183/0x200
 [<ffffffff810767cf>] ? up_read+0x1f/0x40
 [<ffffffff816c8f94>] ? __do_page_fault+0x214/0x5b0
 [<ffffffff810b94be>] ? lock_release_non_nested+0x23e/0x320
 [<ffffffff812c2da9>] sys_msgctl+0x139/0x400
 [<ffffffff816c5d4d>] ? retint_swapgs+0xe/0x13
 [<ffffffff810b6c55>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134b39e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
Code: 66 66 66 66 90 55 48 89 e5 41 54 49 89 fc 53 e8 fc 5e 01 00 49 8b 5c 24 20 4c 89 e7 e8 8f af ed ff 48 85 db 75 05 eb 13 4c 89 e3 <4c> 8b 23 48 89 df e8 7a af ed ff 4d 85 e4 75 ed 5b 41 5c 5d c3 


(Taint is from an ext4 double-free I just reported in a separate thread)

decoded..

   0:	66 66 66 66 90       	data32 data32 data32 xchg %ax,%ax
   5:	55                   	push   %rbp
   6:	48 89 e5             	mov    %rsp,%rbp
   9:	41 54                	push   %r12
   b:	49 89 fc             	mov    %rdi,%r12
   e:	53                   	push   %rbx
   f:	e8 fc 5e 01 00       	callq  0x15f10
  14:	49 8b 5c 24 20       	mov    0x20(%r12),%rbx
  19:	4c 89 e7             	mov    %r12,%rdi
  1c:	e8 8f af ed ff       	callq  0xffffffffffedafb0
  21:	48 85 db             	test   %rbx,%rbx
  24:	75 05                	jne    0x2b
  26:	eb 13                	jmp    0x3b
  28:	4c 89 e3             	mov    %r12,%rbx
  2b:*	4c 8b 23             	mov    (%rbx),%r12     <-- trapping instruction
  2e:	48 89 df             	mov    %rbx,%rdi
  31:	e8 7a af ed ff       	callq  0xffffffffffedafb0
  36:	4d 85 e4             	test   %r12,%r12
  39:	75 ed                	jne    0x28
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	5d                   	pop    %rbp
  3f:	c3                   	retq   

Disassembly of free_msg shows..

        seg = msg->next;
        kfree(msg);
        while (seg != NULL) {
                struct msg_msgseg *tmp = seg->next;
 30b:   4c 8b 23                mov    (%rbx),%r12
                kfree(seg);
 30e:   48 89 df                mov    %rbx,%rdi
 311:   e8 00 00 00 00          callq  316 <free_msg+0x36>


Looks like seg was already kfree'd.
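
Putting the fragment and the disassembly together, the loop reconstructs
to roughly this (annotations mine):

	seg = msg->next;			/* the 0x20(%r12) load above */
	kfree(msg);
	while (seg != NULL) {
		struct msg_msgseg *tmp = seg->next;	/* trapping load: seg holds the
							 * slab poison 0x6b6b6b6b6b6b6b6b */
		kfree(seg);
		seg = tmp;				/* the mov %r12,%rbx in the dump */
	}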

	Dave


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 16:17       ` Dave Jones
@ 2013-03-29 18:00         ` Linus Torvalds
  2013-03-29 18:04           ` Dave Jones
  2013-03-29 18:43         ` Linus Torvalds
  2013-03-29 20:41         ` Linus Torvalds
  2 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 18:00 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Rik van Riel, Linus Torvalds,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones <davej@redhat.com> wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>
> Looks like seg was already kfree'd.

Just to clarify: is this you testing -mm that has Davidlohr's and
Rik's scalability patches? Or mainline?

         Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 18:00         ` Linus Torvalds
@ 2013-03-29 18:04           ` Dave Jones
  2013-03-29 18:10             ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Dave Jones @ 2013-03-29 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 11:00:27AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > >
 > > Looks like seg was already kfree'd.
 > 
 > Just to clarify: is this you testing -mm that has Davidlohr's and
 > Rik's scalability patches? Or mainline?

mainline. Your current tree.

	Dave

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 18:04           ` Dave Jones
@ 2013-03-29 18:10             ` Linus Torvalds
  0 siblings, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 18:10 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 11:04 AM, Dave Jones <davej@redhat.com> wrote:
>
> mainline. Your current tree.

Ok, that's what I thought you were generally testing, just wanted to
verify. The subject kind of implied otherwise..

          Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 16:17       ` Dave Jones
  2013-03-29 18:00         ` Linus Torvalds
@ 2013-03-29 18:43         ` Linus Torvalds
  2013-03-29 19:06           ` Dave Jones
                             ` (2 more replies)
  2013-03-29 20:41         ` Linus Torvalds
  2 siblings, 3 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 18:43 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley, Stanislav Kinsbursky

[-- Attachment #1: Type: text/plain, Size: 682 bytes --]

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones <davej@redhat.com> wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC

Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
disable it?

I think I found at least one bug in the MSG_COPY stuff: it leaks the
"copy" allocation if

    mode == SEARCH_LESSEQUAL

but maybe I'm misreading it. And that shouldn't cause the problem you
see, but it's indicative of how badly tested and thought through the
MSG_COPY code is.

Totally UNTESTED leak fix appended. Stanislav?

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 626 bytes --]

 ipc/msg.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..b841508556cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
 						msg = copy_msg(msg, copy);
 						if (IS_ERR(msg))
 							goto out_unlock;
+						copy = NULL;
 						break;
 					}
 				} else
@@ -976,10 +977,9 @@ out_unlock:
 			break;
 		}
 	}
-	if (IS_ERR(msg)) {
-		free_copy(copy);
+	free_copy(copy);
+	if (IS_ERR(msg))
 		return PTR_ERR(msg);
-	}
 
 	bufsz = msg_handler(buf, msg, bufsz);
 	free_msg(msg);

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 19:43     ` Andrew Morton
  2013-03-29 16:17       ` Dave Jones
@ 2013-03-29 19:01       ` Dave Jones
  2013-05-03 15:03         ` Peter Hurley
  1 sibling, 1 reply; 129+ messages in thread
From: Dave Jones @ 2013-03-29 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, torvalds, davidlohr.bueso, linux-kernel, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Peter Hurley

On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
 > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones <davej@redhat.com> wrote:
 > 
 > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
 > > 
 > >  > Whichever way we go, we should get a wiggle on - this has been hanging
 > >  > around for too long.  Dave, do you have time to determine whether
 > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
 > >  > than max") fixes things up?
 > > 
 > > Ok, with that reverted it's been grinding away for a few hours without incident.
 > > Normally I see the oops within a minute or so.
 > > 
 > 
 > OK, thanks, I queued a revert:
 > 
 > From: Andrew Morton <akpm@linux-foundation.org>
 > Subject: revert "ipc: don't allocate a copy larger than max"
 > 
 > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > during trinity testing.

I owe Peter an apology. I just hit it again with that backed out.
Andrew, might as well drop that revert.

BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff812c24ca>]  [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
RSP: 0018:ffff880117bb5e88  EFLAGS: 00010246
RAX: ffffffffffffffff RBX: 0000000000000004 RCX: 0000000000000078
RDX: 0000000000000004 RSI: fffffffffffffffe RDI: 000000000000000f
RBP: ffff880117bb5e88 R08: 0000000000000004 R09: 0000000000000001
R10: ffff880117bb8000 R11: 0000000000000001 R12: fffffffffffffffe
R13: ffff88010fd10308 R14: ffff88010fd10258 R15: ffffffffffffffff
FS:  00007fa89c256740(0000) GS:ffff88012aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000f CR3: 000000010f76c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child20 (pid: 958, threadinfo ffff880117bb4000, task ffff880117bb8000)
Stack:
 ffff880117bb5f68 ffffffff812c3746 0000000000000000 ffff880117bb8000
 ffff880117bb8000 ffff880117bb8000 ffffffff81c7ace0 ffffffff812c2430
 0000000000000004 0000000000000000 000000000000ffff 00000000652a928e
Call Trace:
 [<ffffffff812c3746>] do_msgrcv+0x1a6/0x5f0
 [<ffffffff812c2430>] ? msg_security+0x10/0x10
 [<ffffffff810b6c55>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134b39e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff812c3ba5>] sys_msgrcv+0x15/0x20
 [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60

I think I wasn't seeing that this last week because I had inadvertently disabled DEBUG_PAGEALLOC

and.. we're back to square one.

	Dave


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 18:43         ` Linus Torvalds
@ 2013-03-29 19:06           ` Dave Jones
  2013-03-29 19:13             ` Linus Torvalds
  2013-03-29 19:26             ` Linus Torvalds
  2013-03-29 19:33           ` Peter Hurley
  2013-04-01  7:40           ` Stanislav Kinsbursky
  2 siblings, 2 replies; 129+ messages in thread
From: Dave Jones @ 2013-03-29 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley, Stanislav Kinsbursky

On Fri, Mar 29, 2013 at 11:43:25AM -0700, Linus Torvalds wrote:
 > On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > Now that I have that reverted, I'm not seeing msgrcv traces any more, but
 > > I've started seeing this..
 > >
 > > general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 > 
 > Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
 > disable it?
 > 
 > I think I found at least one bug in the MSG_COPY stuff: it leaks the
 > "copy" allocation if
 > 
 >     mode == SEARCH_LESSEQUAL
 > 
 > but maybe I'm misreading it. And that shouldn't cause the problem you
 > see, but it's indicative of how badly tested and thought through the
 > MSG_COPY code is.
 > 
 > Totally UNTESTED leak fix appended. Stanislav?

I'll give it a shot.

Btw, something that's really bothering me is just how much bogus
'follow-on' spew we have lately. I'm not sure what changed, but it
seems to have gotten worse.

Here's an oops I just hit..

BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
PGD 10fd95067 PUD 10f767067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
CPU 2 
Pid: 958, comm: trinity-child20 Not tainted 3.9.0-rc4+ #7 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff812c24ca>]  [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
RSP: 0018:ffff880117bb5e88  EFLAGS: 00010246
RAX: ffffffffffffffff RBX: 0000000000000004 RCX: 0000000000000078
RDX: 0000000000000004 RSI: fffffffffffffffe RDI: 000000000000000f
RBP: ffff880117bb5e88 R08: 0000000000000004 R09: 0000000000000001
R10: ffff880117bb8000 R11: 0000000000000001 R12: fffffffffffffffe
R13: ffff88010fd10308 R14: ffff88010fd10258 R15: ffffffffffffffff
FS:  00007fa89c256740(0000) GS:ffff88012aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000f CR3: 000000010f76c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child20 (pid: 958, threadinfo ffff880117bb4000, task ffff880117bb8000)
Stack:
 ffff880117bb5f68 ffffffff812c3746 0000000000000000 ffff880117bb8000
 ffff880117bb8000 ffff880117bb8000 ffffffff81c7ace0 ffffffff812c2430
 0000000000000004 0000000000000000 000000000000ffff 00000000652a928e
Call Trace:
 [<ffffffff812c3746>] do_msgrcv+0x1a6/0x5f0
 [<ffffffff812c2430>] ? msg_security+0x10/0x10
 [<ffffffff810b6c55>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134b39e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff812c3ba5>] sys_msgrcv+0x15/0x20
 [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
Code: c3 48 c7 c0 f2 ff ff ff eb e5 0f 1f 80 00 00 00 00 66 66 66 66 90 55 83 fa 02 48 89 e5 74 3a 7e 28 83 fa 03 74 13 83 fa 04 75 0a <48> 39 37 b8 01 00 00 00 7e 02 31 c0 5d c3 48 3b 37 74 f7 b8 01 
RIP  [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
 RSP <ffff880117bb5e88>
CR2: 000000000000000f
---[ end trace 8f0713d2aacb6aa3 ]---
BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic(): 1, irqs_disabled(): 0, pid: 958, name: trinity-child20
INFO: lockdep is turned off.
Pid: 958, comm: trinity-child20 Tainted: G      D      3.9.0-rc4+ #7
Call Trace:
 [<ffffffff8107dba5>] __might_sleep+0x145/0x200
 [<ffffffff816c215a>] down_read+0x2a/0xa0
 [<ffffffff8105fa14>] exit_signals+0x24/0x130
 [<ffffffff8104b80c>] do_exit+0xbc/0xd10
 [<ffffffff81048b25>] ? kmsg_dump+0x1b5/0x230
 [<ffffffff81048995>] ? kmsg_dump+0x25/0x230
 [<ffffffff816c6b6c>] oops_end+0x9c/0xe0
 [<ffffffff816b7a40>] no_context+0x268/0x275
 [<ffffffff816b7ac5>] __bad_area_nosemaphore+0x78/0x1d1
 [<ffffffff816b7c31>] bad_area_nosemaphore+0x13/0x15
 [<ffffffff816c90e6>] __do_page_fault+0x366/0x5b0
 [<ffffffff810b72b5>] ? __lock_acquire+0x2e5/0x1a00
 [<ffffffff810b3348>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff8134b3dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff816c933e>] do_page_fault+0xe/0x10
 [<ffffffff816c5fa2>] page_fault+0x22/0x30
 [<ffffffff812c24ca>] ? testmsg.isra.5+0x1a/0x60
 [<ffffffff812d80b6>] ? security_msg_queue_msgrcv+0x16/0x20
 [<ffffffff812c3746>] do_msgrcv+0x1a6/0x5f0
 [<ffffffff812c2430>] ? msg_security+0x10/0x10
 [<ffffffff810b6c55>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134b39e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff812c3ba5>] sys_msgrcv+0x15/0x20
 [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
note: trinity-child20[958] exited with preempt_count 1
BUG: scheduling while atomic: trinity-child20/958/0x10000002
INFO: lockdep is turned off.
Modules linked in: phonet netrom llc2 af_key rose af_rxrpc caif_socket caif can_raw cmtp kernelcapi ipt_ULOG nfnetlink can_bcm can scsi_transport_iscsi af_802154 irda ax25 atm ipx x25 p8023 p8022 appletalk pppoe decnet pppox ppp_generic nfc rds slhc psnap crc_ccitt llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables raid0 snd_hda_codec_realtek snd_hda_intel btusb snd_hda_codec bluetooth microcode serio_raw snd_pcm edac_core pcspkr snd_page_alloc rfkill snd_timer snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm radeon backlight drm_kms_helper ttm
Pid: 958, comm: trinity-child20 Tainted: G      D      3.9.0-rc4+ #7
Call Trace:
 [<ffffffff816b8792>] __schedule_bug+0x61/0x70
 [<ffffffff816c38af>] __schedule+0x8df/0x950
 [<ffffffff81083bb8>] __cond_resched+0x18/0x30
 [<ffffffff816c3dea>] _cond_resched+0x3a/0x50
 [<ffffffff8117712f>] munlock_vma_pages_range+0xbf/0xe0
 [<ffffffff8117b6e7>] exit_mmap+0x57/0x160
 [<ffffffff811a9b5e>] ? __khugepaged_exit+0xee/0x130
 [<ffffffff8119c485>] ? kmem_cache_free+0x335/0x380
 [<ffffffff811a9b5e>] ? __khugepaged_exit+0xee/0x130
 [<ffffffff81042047>] mmput+0x77/0x100
 [<ffffffff8104b9e1>] do_exit+0x291/0xd10
 [<ffffffff81048b25>] ? kmsg_dump+0x1b5/0x230
 [<ffffffff81048995>] ? kmsg_dump+0x25/0x230
 [<ffffffff816c6b6c>] oops_end+0x9c/0xe0
 [<ffffffff816b7a40>] no_context+0x268/0x275
 [<ffffffff816b7ac5>] __bad_area_nosemaphore+0x78/0x1d1
 [<ffffffff816b7c31>] bad_area_nosemaphore+0x13/0x15
 [<ffffffff816c90e6>] __do_page_fault+0x366/0x5b0
 [<ffffffff810b72b5>] ? __lock_acquire+0x2e5/0x1a00
 [<ffffffff810b3348>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff8134b3dd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff816c933e>] do_page_fault+0xe/0x10
 [<ffffffff816c5fa2>] page_fault+0x22/0x30
 [<ffffffff812c24ca>] ? testmsg.isra.5+0x1a/0x60
 [<ffffffff812d80b6>] ? security_msg_queue_msgrcv+0x16/0x20
 [<ffffffff812c3746>] do_msgrcv+0x1a6/0x5f0
 [<ffffffff812c2430>] ? msg_security+0x10/0x10
 [<ffffffff810b6c55>] ? trace_hardirqs_on_caller+0x115/0x1a0
 [<ffffffff8134b39e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff812c3ba5>] sys_msgrcv+0x15/0x20
 [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b



That's a ton of not-very-interesting crap that makes me thankful
I have a serial console. Most users aren't so lucky. Is there
any way we can silence all that if we've just oopsed ?

Related: is there any value in printing out all the ? symbols on
kernels with frame pointers enabled ?

	Dave


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:06           ` Dave Jones
@ 2013-03-29 19:13             ` Linus Torvalds
  2013-03-29 19:26             ` Linus Torvalds
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 19:13 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley,
	Stanislav Kinsbursky

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones <davej@redhat.com> wrote:
>
> Btw, something that's really bothering me is just how much bogus
> 'follow-on' spew we have lately. I'm not sure what changed, but it
> seems to have gotten worse.

.. we have many more sanity checks, and they tend to trigger in the
case of us killing somebody holding locks etc.

> Related: is there any value in printing out all the ? symbols on
> kernels with frame pointers enabled ?

I've found them useful. You can often see what happened just before.

Of course, I'd still prefer gcc not generate unnecessarily big stack
frames (which is one of the bigger reasons for old stuff shining
through) but as things are, they often give hints about what happened
just before.

          Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:06           ` Dave Jones
  2013-03-29 19:13             ` Linus Torvalds
@ 2013-03-29 19:26             ` Linus Torvalds
  2013-03-29 19:36               ` Peter Hurley
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 19:26 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley,
	Stanislav Kinsbursky

On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones <davej@redhat.com> wrote:
>
> Here's an oops I just hit..
>
> BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
> IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60

Btw, looking at the code leading up to this, what the f*ck is wrong
with the IPC stuff?

It's using the generic list stuff for most of the lists, but then it
open-codes the accesses.

So instead of using

   for_each_entry(walk_msg, &msq->q_messages, m_list) {
      ..
   }

the ipc/msg.c code does all that by hand, with

   tmp = msq->q_messages.next;
   while (tmp != &msq->q_messages) {
      struct msg_msg *walk_msg;

      walk_msg = list_entry(tmp, struct msg_msg, m_list);
      ...
      tmp = tmp->next;
   }
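
With the generic helper (list_for_each_entry() from <linux/list.h>) the
same walk would be roughly -- sketch only, keeping the names from the
snippet above:

   struct msg_msg *walk_msg;

   list_for_each_entry(walk_msg, &msq->q_messages, m_list) {
      /* same per-message matching logic as the open-coded loop */
   }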

Ugh. The code is near unreadable. And then it has magic memory
barriers etc, implying that it doesn't lock the data structures, but
no comments about them. See expunge_all() and pipelined_send().

The code seems entirely random, and it's badly set up (annoyance of
the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
you have to spend more effort looking for them, and (b) they won't get
inlined).

Clearly nobody has cared for the crazy IPC message code in a long time.

              Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 18:43         ` Linus Torvalds
  2013-03-29 19:06           ` Dave Jones
@ 2013-03-29 19:33           ` Peter Hurley
  2013-03-29 19:54             ` Linus Torvalds
  2013-04-01  7:40           ` Stanislav Kinsbursky
  2 siblings, 1 reply; 129+ messages in thread
From: Peter Hurley @ 2013-03-29 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Stanislav Kinsbursky

On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
> I think I found at least one bug in the MSG_COPY stuff: it leaks the
> "copy" allocation if
> 
>     mode == SEARCH_LESSEQUAL
> 
> but maybe I'm misreading it.

Yes, you're misreading it. copy_msg() returns the 'copy' address when
copying is successful. So this patch double-frees 'copy'.

Andrew has had the patches that fix this for a month but because they're
fairly extensive he didn't want to apply them for 3.9.

Regards,
Peter



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:26             ` Linus Torvalds
@ 2013-03-29 19:36               ` Peter Hurley
  2013-04-02 16:08                 ` Sasha Levin
  0 siblings, 1 reply; 129+ messages in thread
From: Peter Hurley @ 2013-03-29 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Stanislav Kinsbursky

On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > Here's an oops I just hit..
> >
> > BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
> > IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
> 
> Btw, looking at the code leading up to this, what the f*ck is wrong
> with the IPC stuff?
> 
> It's using the generic list stuff for most of the lists, but then it
> open-codes the accesses.
> 
> So instead of using
> 
>    for_each_entry(walk_msg, &msq->q_messages, m_list) {
>       ..
>    }
> 
> the ipc/msg.c code does all that by hand, with
> 
>    tmp = msq->q_messages.next;
>    while (tmp != &msq->q_messages) {
>       struct msg_msg *walk_msg;
> 
>       walk_msg = list_entry(tmp, struct msg_msg, m_list);
>       ...
>       tmp = tmp->next;
>    }
> 
> Ugh. The code is near unreadable. And then it has magic memory
> barriers etc, implying that it doesn't lock the data structures, but
> no comments about them. See expunge_all() and pipelined_send().
> 
> The code seems entirely random, and it's badly set up (annoyance of
> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
> you have to spend more effort looking for them, and (b) they won't get
> inlined).
> 
> Clearly nobody has cared for the crazy IPC message code in a long time.

That's exactly what my patch series does: clean this mess up.

This is what I wrote to Andrew a couple of days ago.

On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
> I just figured out how the queue is being corrupted and why my series
> fixes it.
> 
> 
> With MSG_COPY set, the queue scan can exit with the local variable 'msg'
> pointing to a real msg if the msg_counter never reaches the copy_number.
> 
> The remaining execution looks like this:
> 
> 	if (!IS_ERR(msg)) {
> 		....
> 		if (msgflg & MSG_COPY)
> 			goto out_unlock;
> 		....
> 
> out_unlock:
> 			msg_unlock(msq);
> 			break;
> 		}
> 	}
> 	if (IS_ERR(msg))
> 		....
> 
> 	bufsz = msg_handler();
> 	free_msg(msg);			<<---- msg never unlinked
> 
> 
> Since the msg should not have been found (because it failed the match
> criteria), the if (!IS_ERR(msg)) clause should never have executed.
> 
> That's why my refactor fixes resolve this; because msg is not
> inadvertently treated as a found msg.
> 
> But let's be honest; the real bug here is the poor structure of this
> function that even made this possible. The deepest nesting executes a
> goto to a label in the middle of an if clause. Yuck! No wonder this
> thing's fragile.
> 
> So my recommendation still stands. The series that fixes this has been
> getting tested in linux-next for a month. Fixing this some other way is
> just asking for more trouble.
> 
> But why not just revert MSG_COPY altogether for 3.9?

Regards,
Peter Hurley


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:33           ` Peter Hurley
@ 2013-03-29 19:54             ` Linus Torvalds
  0 siblings, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 19:54 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Stanislav Kinsbursky

On Fri, Mar 29, 2013 at 12:33 PM, Peter Hurley <peter@hurleysoftware.com> wrote:
> On Fri, 2013-03-29 at 11:43 -0700, Linus Torvalds wrote:
>> I think I found at least one bug in the MSG_COPY stuff: it leaks the
>> "copy" allocation if
>>
>>     mode == SEARCH_LESSEQUAL
>>
>> but maybe I'm misreading it.
>
> Yes, you're misreading it. copy_msg() returns the 'copy' address when
> copying is successful. So this patch double-frees 'copy'.

No it doesn't.

The case where "copy_msg()" *is* called and is successful, msg gets
set to "copy", my patch sets "copy" to NULL, so no, it doesn't
double-free.

That said, it looks like if MSG_COPY is set, we cannot trigger the
"mode == SEARCH_LESSEQUAL" case, because it forces msgtyp to 0, which
in turn forces mode = SEARCH_ANY.

So the actual leak case seems to not be possible, although that's
*really* hard to see from looking at the code. The code is just an
unreadable mess.

              Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 16:17       ` Dave Jones
  2013-03-29 18:00         ` Linus Torvalds
  2013-03-29 18:43         ` Linus Torvalds
@ 2013-03-29 20:41         ` Linus Torvalds
  2013-03-29 21:12           ` Linus Torvalds
  2 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 20:41 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Rik van Riel, Linus Torvalds,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones <davej@redhat.com> wrote:
>
> Now that I have that reverted, I'm not seeing msgrcv traces any more, but
> I've started seeing this..
>
> general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> RIP: 0010:[<ffffffff812c20fb>]  [<ffffffff812c20fb>] free_msg+0x2b/0x40
> Call Trace:
>  [<ffffffff812c289f>] freeque+0xcf/0x140
>  [<ffffffff812c2a93>] msgctl_down.constprop.9+0x183/0x200
>  [<ffffffff812c2da9>] sys_msgctl+0x139/0x400
>  [<ffffffff816cd942>] system_call_fastpath+0x16/0x1b
>
> Looks like seg was already kfree'd.

Hmm.

I have a suspicion.

The use of ipc_rcu_getref/ipc_rcu_putref() seems a bit iffy.

In particular, the refcount is not an atomic variable, so we
absolutely *depend* on the spinlock for it.

However, looking at "freeque()", that's not actually the case. It
releases the message queue spinlock *before* it does the
ipc_rcu_putref(), and it does that because the thing has become
unreachable (it did a msg_rmid(), which will set ->deleted, which in
turn will mean that nobody should successfully look it up any more).

HOWEVER.

While the "deleted" flag is serialized, the actual refcount is not. So
in *parallel* with the freeque() call, we may have some other user
that does something like

                ...
                ipc_rcu_getref(msq);
                msg_unlock(msq);
                schedule();

                ipc_lock_by_ptr(&msq->q_perm);
                ipc_rcu_putref(msq);
                if (msq->q_perm.deleted) {
                        err = -EIDRM;
                        goto out_unlock_free;
                }
                ...


which got the lock for the "deleted" test, so that part is all fine,
but notice the "ipc_rcu_putref()". It can happen at the same time that
freeque() also does its own ipc_rcu_putref().

So now refcount may get buggered, resulting in random early reuse,
double free's or leaking of the msq.

There may be some reason I don't see why this cannot happen, but it
does look suspicious. I wonder if the refcount should be an atomic
value.
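
As a rough sketch of what an atomic refcount would look like
(hypothetical struct and helper names -- the real ipc_rcu bookkeeping
lives in a header placed in front of the object via container_of()
games in ipc/util.c, so the actual layout differs):

   /* needs <linux/atomic.h> and <linux/rcupdate.h> */
   struct ipc_rcu_sketch {                 /* illustrative only */
           struct rcu_head rcu;
           atomic_t refcount;
   };

   static void sketch_getref(struct ipc_rcu_sketch *p)
   {
           atomic_inc(&p->refcount);
   }

   static void sketch_putref(struct ipc_rcu_sketch *p,
                             void (*free_fn)(struct rcu_head *head))
   {
           /* last reference schedules the RCU free, no spinlock needed */
           if (atomic_dec_and_test(&p->refcount))
                   call_rcu(&p->rcu, free_fn);
   }

With something like that, freeque() dropping its reference after
msg_unlock() could not corrupt the count even if it raced with a
concurrent putref.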

The alternative would be to make sure the thing is always locked (and
in a rcu-read-safe region) before putref/getref. The only place (apart
from the initial allocation, which doesn't matter, because nothing can
see it if that path fails) seems to be that freeque(), but I didn't
check everything.

Moving the

    msg_unlock(msq);

to the end of freeque() might be the way to go. It's going to cause us
to hold the lock for longer, but I'm not sure we care for that path.

Guys, am I missing something? This kind of refcounting problem might
well explain the rcu-freeing-time bugs reported with the scalability
patches: long-time race that just got *much* easier to expose with the
higher level of parallelism?

               Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 20:41         ` Linus Torvalds
@ 2013-03-29 21:12           ` Linus Torvalds
  2013-03-29 23:16             ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 21:12 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Rik van Riel, Linus Torvalds,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The alternative would be to make sure the thing is always locked (and
> in a rcu-read-safe region) before putref/getref. The only place (apart
> from the initial allocation, which doesn't matter, because nothing can
> see it if that path fails) seems to be that freeque(), but I didn't
> check everything.
>
> Moving the
>
>     msg_unlock(msq);
>
> to the end of freeque() might be the way to go. It's going to cause us
> to hold the lock for longer, but I'm not sure we care for that path.

Uhhuh. I think shm_destroy() has the same pattern. And I think that
one has locking reasons why it has to do the shm_unlock() before
tearing some things down, although I'm not sure..

The good news is that shm doesn't seem to have any users of
ipc_rcu_get/putref(), so I don't see anything to race against. So it
has the same buggy pattern, but I don't think it can trigger anything.

And ipc/sem.c has the same bug-pattern in freeary(). It does
"sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does*
seem to have code that it can race against (sem_lock_and_putref()).
The whole "sem_putref()" tries to be careful and gets the lock for the
last ref, but getting the lock doesn't help when freeary() does the
refcount access without it.

The semaphore case seems to argue fairly strongly for an "atomic_t
refcount", because right now it does a lot of "sem_putref()" stuff
that just gets the lock for the putref. So not only is that
insufficient (if it races against freeary()), but it's also more
expensive than just an atomic refcount would have been.

I dunno. I'm still not sure this is triggerable, but it looks bad. But
both the semaphore case and the msg cases seem to be solvable by
moving the unlock down, and shm seem to have no getref/putref users to
race with, so this (whitespace-damaged) patch *may* be sufficient:

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
        expunge_all(msq, -EIDRM);
        ss_wakeup(&msq->q_senders, 1);
        msg_rmid(ns, msq);
-       msg_unlock(msq);

        tmp = msq->q_messages.next;
        while (tmp != &msq->q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
        atomic_sub(msq->q_cbytes, &ns->msg_bytes);
        security_msg_queue_free(msq);
        ipc_rcu_putref(msq);
+       msg_unlock(msq);
 }

 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)

        /* Remove the semaphore set from the IDR */
        sem_rmid(ns, sma);
-       sem_unlock(sma);

        wake_up_sem_queue_do(&tasks);
        ns->used_sems -= sma->sem_nsems;
        security_sem_free(sma);
        ipc_rcu_putref(sma);
+       sem_unlock(sma);
 }

 static unsigned long copy_semid_to_user(void __user *buf, struct
semid64_ds *in, int version)

(I didn't check very carefully, it's possible that we end up having
some locking problem if we move the unlocks down later, but it *looks*
fine)

Anybody?

                Linus

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 21:12           ` Linus Torvalds
@ 2013-03-29 23:16             ` Linus Torvalds
  2013-03-30  1:36               ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-29 23:16 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Rik van Riel, Linus Torvalds,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu, Peter Hurley,
	Emmanuel Benisty

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

On Fri, Mar 29, 2013 at 2:12 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I dunno. I'm still not sure this is triggerable, but it looks bad. But
> both the semaphore case and the msg cases seem to be solvable by
> moving the unlock down, and shm seem to have no getref/putref users to
> race with, so this (whitespace-damaged) patch *may* be sufficient:

Well, the patch doesn't seem to cause any problems, at least neither
lockdep nor spinlock sleep debugging complains. I have no idea whether
it actually fixes any problems, though.

I do wonder if this might explain the problem Emmanuel saw. A double
free of an RCU-freeable object would possibly result in exactly the
kind of mess that Emmanuel reported with the semaphore scalability
patches.

Emmanuel, can you try the attached patch? I think it applies cleanly
on top of the scalability series too without any changes, but I didn't
check if the patches perhaps changed some of the naming or something.

              Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 1207 bytes --]

 ipc/msg.c | 2 +-
 ipc/sem.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	expunge_all(msq, -EIDRM);
 	ss_wakeup(&msq->q_senders, 1);
 	msg_rmid(ns, msq);
-	msg_unlock(msq);
 
 	tmp = msq->q_messages.next;
 	while (tmp != &msq->q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	atomic_sub(msq->q_cbytes, &ns->msg_bytes);
 	security_msg_queue_free(msq);
 	ipc_rcu_putref(msq);
+	msg_unlock(msq);
 }
 
 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 
 	/* Remove the semaphore set from the IDR */
 	sem_rmid(ns, sma);
-	sem_unlock(sma);
 
 	wake_up_sem_queue_do(&tasks);
 	ns->used_sems -= sma->sem_nsems;
 	security_sem_free(sma);
 	ipc_rcu_putref(sma);
+	sem_unlock(sma);
 }
 
 static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in, int version)

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 23:16             ` Linus Torvalds
@ 2013-03-30  1:36               ` Emmanuel Benisty
  2013-03-30  2:08                 ` Davidlohr Bueso
  2013-03-30  2:09                 ` Linus Torvalds
  0 siblings, 2 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-30  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

Hi Linus,

On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Emmanuel, can you try the attached patch? I think it applies cleanly
> on top of the scalability series too without any changes, but I didn't
> check if the patches perhaps changed some of the naming or something.

I had to slightly modify the patch since it wouldn't match the changes
introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
hope that was the right thing to do. So, what I tried was: original 7
patches + the one liner + your patch blindly modified by me on the top
of 3.9-rc4 and I'm still having twilight zone issues.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  1:36               ` Emmanuel Benisty
@ 2013-03-30  2:08                 ` Davidlohr Bueso
  2013-03-30  3:02                   ` Emmanuel Benisty
  2013-03-30  2:09                 ` Linus Torvalds
  1 sibling, 1 reply; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-30  2:08 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Sat, 2013-03-30 at 08:36 +0700, Emmanuel Benisty wrote:
> Hi Linus,
> 
> On Sat, Mar 30, 2013 at 6:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > Emmanuel, can you try the attached patch? I think it applies cleanly
> > on top of the scalability series too without any changes, but I didn't
> > check if the patches perhaps changed some of the naming or something.
> 
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Not sure which one liner you refer to, but, if you haven't already done
so, please try with these fixes (queued in linux-next):

http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065adddd9ec31

I've been trying to reproduce your twilight zone problem on five
different machines now without any luck. Is there anything you're doing
to trigger the issue? Does the machine boot ok and then do weird things,
say after X starts, open some program, etc?

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  1:36               ` Emmanuel Benisty
  2013-03-30  2:08                 ` Davidlohr Bueso
@ 2013-03-30  2:09                 ` Linus Torvalds
  2013-03-30  2:55                   ` Davidlohr Bueso
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-30  2:09 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> I had to slightly modify the patch since it wouldn't match the changes
> introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> hope that was the right thing to do. So, what I tried was: original 7
> patches + the one liner + your patch blindly modified by me on the top
> of 3.9-rc4 and I'm still having twilight zone issues.

Ok, please send your patch so that I can double-check what you did,
but it was simple enough that you probably did the right thing.

Sad. Your case definitely looks like a double rcu-free, as shown by
the fact that when you enabled SLUB debugging the oops happened with
the use-after-free pattern (it's __rcu_reclaim() doing the
"head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
has already been free'd once).

So ipc_rcu_putref() and a refcounting error looked very promising as a
potential explanation.

The 'un' undo structure is also free'd with rcu, but the locking
around that seems much more robust. The undo entry is on two lists
(sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
ulp->lock). But those locks are actually tested with
assert_spin_locked() in all the relevant places, and the code actually
looks sane. So I had high hopes for ipc_rcu_putref()...

Hmm. Except for exit_sem() that does odd things. You have preemption
enabled, don't you? exit_sem() does a lookup of the first list_proc
entry under rcu_read_lock to look up un->semid, and then it drops the
rcu read lock. At which point "un" is no longer reliable, I think. But
then it still uses "un->semid", rather than the stable value it looked
up under the rcu read lock. Which looks bogus.
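
For reference, the loop head in question looks roughly like this in the
pre-series exit_sem() (paraphrased from memory, not verbatim), and the
suspect part is that final use of un->semid after the rcu_read_unlock():

    rcu_read_lock();
    un = list_entry_rcu(ulp->list_proc.next, struct sem_undo, list_proc);
    if (&un->list_proc == &ulp->list_proc)
            semid = -1;
    else
            semid = un->semid;      /* copy taken while still under RCU */
    rcu_read_unlock();

    if (semid == -1)
            break;

    sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid); /* <-- vs. "semid" */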

So I'd like you to test a few more things:

 (a) In exit_sem(), can you change the

         sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);

     to use just "semid" rather than "un->semid", because I don't
think "un" is stable here.

 (b) does the problem go away if you change disable CONFIG_PREEMPT
(perhaps to PREEMPT_NONE or PREEMPT_VOLUNTARY?)

             Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  2:09                 ` Linus Torvalds
@ 2013-03-30  2:55                   ` Davidlohr Bueso
  0 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-30  2:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Emmanuel Benisty, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, 2013-03-29 at 19:09 -0700, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 6:36 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
> >
> > I had to slightly modify the patch since it wouldn't match the changes
> > introduced by 7-7-ipc-sem-fine-grained-locking-for-semtimedop.patch,
> > hope that was the right thing to do. So, what I tried was: original 7
> > patches + the one liner + your patch blindly modified by me on the top
> > of 3.9-rc4 and I'm still having twilight zone issues.
> 
> Ok, please send your patch so that I can double-check what you did,
> but it was simple enough that you probably did the right thing.
> 
> Sad. Your case definitely looks like a double rcu-free, as shown by
> the fact that when you enabled SLUB debugging the oops happened with
> the use-after-free pattern (it's __rcu_reclaim() doing the
> "head->func(head);" thing, and "func" is 0x6b6b6b6b6b6b6b6b, so "head"
> has already been free'd once).
> 
> So ipc_rcu_putref() and a refcounting error looked very promising as a
> potential explanation.
> 
> The 'un' undo structure is also free'd with rcu, but the locking
> around that seems much more robust. The undo entry is on two lists
> (sma->list_id, under sma->sem_perm.lock and ulp->list_proc, under
> ulp->lock). But those locks are actually tested with
> assert_spin_locked() in all the relevant places, and the code actually
> looks sane. So I had high hopes for ipc_rcu_putref()...
> 
> Hmm. Except for exit_sem() that does odd things. You have preemption
> enabled, don't you? exit_sem() does a lookup of the first list_proc
> entry under rcu_read_lock to look up un->semid, and then it drops the
> rcu read lock. At which point "un" is no longer reliable, I think. But
> then it still uses "un->semid", rather than the stable value it looked
> up under the rcu read lock. Which looks bogus.
> 
> So I'd like you to test a few more things:
> 
>  (a) In exit_sem(), can you change the
> 
>          sma = sem_lock_check(tsk->nsproxy->ipc_ns, un->semid);
> 
>      to use just "semid" rather than "un->semid", because I don't
> think "un" is stable here.

Well that's not really the case in the new code. We don't drop the rcu
read lock until the end of the loop, in sem_unlock(). However, I just
noticed that we're checking sma for error after trying to acquire
sma->sem_perm.lock:

		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
		sem_lock(sma, NULL, -1);

		/* exit_sem raced with IPC_RMID, nothing to do */
		if (IS_ERR(sma))
			continue;

The IS_ERR(sma) check should be right after the sem_obtain_object_check() call instead.
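
Something along these lines, based on the snippet above (sketch only;
whether the early continue also has to drop the RCU read lock depends
on how the surrounding loop takes it):

		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);

		/* exit_sem raced with IPC_RMID, nothing to do */
		if (IS_ERR(sma))
			continue;	/* may also need an rcu_read_unlock() here */

		sem_lock(sma, NULL, -1);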



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  2:08                 ` Davidlohr Bueso
@ 2013-03-30  3:02                   ` Emmanuel Benisty
  2013-03-30  3:46                     ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-30  3:02 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

Hi Davidlohr,

On Sat, Mar 30, 2013 at 9:08 AM, Davidlohr Bueso <davidlohr.bueso@hp.com> wrote:
> Not sure which one liner you refer to, but, if you haven't already done
> so, please try with these fixes (queued in linux-next):
>
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=a9cead0347283f3e72a39e7b76a3cc479b048e51
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=4db64b89525ac357cba754c3120065adddd9ec31
>
> I've been trying to reproduce your twilight zone problem on five
> different machines now without any luck. Is there anything you're doing
> to trigger the issue? Does the machine boot ok and then do weird things,
> say after X starts, open some program, etc?

I was missing a9cead0, thanks. What I usually do is start a
standard session, which looks like this:

init-+-5*[agetty]
     |-bash---startx---xinit-+-X---2*[{X}]
     |                       `-dwm---sh---sleep
     |-bash---chromium-+-chromium
     |                 |-chromium-+-chromium
     |                 |          `-2*[{chromium}]
     |
|-chromium-sandbo---chromium---chromium---4*[chromium---4*[{chromium}]]
     |                 `-65*[{chromium}]
     |-crond
     |-dbus-daemon
     |-klogd
     |-syslogd
     |-tmux-+-alsamixer
     |      |-bash---bash
     |      |-bash
     |      |-htop
     |      `-newsbeuter---{newsbeuter}
     |-udevd
     |-urxvtd-+-bash---pstree
     |        `-bash---tmux
     `-wpa_supplicant

Then I start building a random package and the problems start. They
may also happen without compiling but this seems to trigger the bug
quite quickly. Anyway, some progress here, I hope: dmesg seems to be
willing to reveal some secrets (using some pastebin service since this
is pretty big):

https://gist.github.com/anonymous/5275120

Thanks.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  3:02                   ` Emmanuel Benisty
@ 2013-03-30  3:46                     ` Linus Torvalds
  2013-03-30  4:33                       ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-30  3:46 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> Then I start building a random package and the problems start. They
> may also happen without compiling but this seems to trigger the bug
> quite quickly.

I suspect it's about preemption, and the build just results in enough
scheduling load that you start hitting whatever race there is.

> Anyway, some progress here, I hope: dmesg seems to be
> willing to reveal some secrets (using some pastebin service since this
> is pretty big):
>
> https://gist.github.com/anonymous/5275120

That looks like exactly the exit_sem() bug that Davidlohr was talking
about, where the

                /* exit_sem raced with IPC_RMID, nothing to do */
                if (IS_ERR(sma))
                        continue;

should be moved to *before* the

                sem_lock(sma, NULL, -1);

call. And apparently the bug I had found is already fixed in -next.

               Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  3:46                     ` Linus Torvalds
@ 2013-03-30  4:33                       ` Emmanuel Benisty
  2013-03-30  5:10                         ` Linus Torvalds
  2013-03-31  5:01                         ` Davidlohr Bueso
  0 siblings, 2 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-30  4:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

[-- Attachment #1: Type: text/plain, Size: 1571 bytes --]

On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>>
>> Then I start building a random package and the problems start. They
>> may also happen without compiling but this seems to trigger the bug
>> quite quickly.
>
> I suspect it's about preemption, and the build just results in enough
> scheduling load that you start hitting whatever race there is.
>
>> Anyway, some progress here, I hope: dmesg seems to be
>> willing to reveal some secrets (using some pastebin service since this
>> is pretty big):
>>
>> https://gist.github.com/anonymous/5275120
>
> That looks like exactly the exit_sem() bug that Davidlohr was talking
> about, where the
>
>                 /* exit_sem raced with IPC_RMID, nothing to do */
>                 if (IS_ERR(sma))
>                         continue;
>
> should be moved to *before* the
>
>                 sem_lock(sma, NULL, -1);
>
> call. And apparently the bug I had found is already fixed in -next.

I just tried the 7 original patches + the 2 one liners from -next +
modified Linus' patch (attached) on the top of 3.9-rc4 using
PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
above. I was building two packages at the same time, went away for 30
seconds, came back and everything froze as soon as I touched the
laptop's touchpad. Maybe a coincidence but anyway... Another shot in
the dark, I had this weird message when trying to build gcc:
semop(2): encountered an error: Identifier removed

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 1215 bytes --]

 ipc/msg.c | 2 +-
 ipc/sem.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	expunge_all(msq, -EIDRM);
 	ss_wakeup(&msq->q_senders, 1);
 	msg_rmid(ns, msq);
-	msg_unlock(msq);
 
 	tmp = msq->q_messages.next;
 	while (tmp != &msq->q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 	atomic_sub(msq->q_cbytes, &ns->msg_bytes);
 	security_msg_queue_free(msq);
 	ipc_rcu_putref(msq);
+	msg_unlock(msq);
 }
 
 /*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 
 	/* Remove the semaphore set from the IDR */
 	sem_rmid(ns, sma);
-	sem_unlock(sma, -1);
 
 	wake_up_sem_queue_do(&tasks);
 	ns->used_sems -= sma->sem_nsems;
 	security_sem_free(sma);
 	ipc_rcu_putref(sma);
+	sem_unlock(sma, -1);
 }
 
 static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in, int version)

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  4:33                       ` Emmanuel Benisty
@ 2013-03-30  5:10                         ` Linus Torvalds
  2013-03-30  5:57                           ` Emmanuel Benisty
  2013-03-31  5:01                         ` Davidlohr Bueso
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-30  5:10 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Fri, Mar 29, 2013 at 9:33 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached)

.. that patch looks fine.

> on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

This came from the gcc build?

That's just crazy. No normal app uses sysv semaphores. I have an older
gcc build environment, and some grepping shows it has some ipc
semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
but that should be very very minor.

You seem to trigger errors really trivially easily, which is really
odd. It's sounding less and less like some subtle race, and more like
the error just happens all the time. If you can make even the gcc
build generate errors, I don't think they can be some rare blue-moon
thing.

I notice that your dmesg says that your kernel is compiled by
gcc-4.8.1 prerelease. Is there any chance that you could try to
install a known-stable gcc, like 4.7.2 or something. It's entirely
possible that it's a kernel bug even if it's triggered by some more
aggressive compiler optimization or something, but it would be really
good to try to see if this might be gcc-specific.

For example, I wonder if your gcc might miscompile idr_alloc() or
something, so that we get the same ID for different ipc objects. That
would certainly potentially cause chaos.

Hmm?

                      Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  5:10                         ` Linus Torvalds
@ 2013-03-30  5:57                           ` Emmanuel Benisty
  2013-03-30 17:22                             ` Linus Torvalds
  0 siblings, 1 reply; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-30  5:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>> Another shot in
>> the dark, I had this weird message when trying to build gcc:
>> semop(2): encountered an error: Identifier removed
>
> This came from the gcc build?

yes, very early in the build process, IIRC this line was repeated a
few times and the build just stalled.

> That's just crazy. No normal app uses sysv semaphores. I have an older
> gcc build environment, and some grepping shows it has some ipc
> semaphore use in the libstdc++ testsuite, and some libmudflap hooks,
> but that should be very very minor.

> I notice that your dmesg says that your kernel is compiled by
> gcc-4.8.1 prerelease. Is there any chance that you could try to
> install a known-stable gcc, like 4.7.2 or something. It's entirely
> possible that it's a kernel bug even if it's triggered by some more
> aggressive compiler optimization or something, but it would be really
> good to try to see if this might be gcc-specific.

I built a kernel on another machine which has 4.7.2 installed; that
kernel oops'd as well:
http://i.imgur.com/uk6gmq1.jpg

FWIW, I have a few things disabled in my config so here is the one I
used, maybe I'm missing something (but again, everything works
perfectly with your tree):
https://gist.github.com/anonymous/5275566

Lastly, just FTR, I tried Andrew's 3.9-rc4-mm1 and got the same
issues, unsurprisingly I guess.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
  2013-03-28 21:05     ` Davidlohr Bueso
  2013-03-29  1:00     ` Michel Lespinasse
@ 2013-03-30 13:35     ` Sasha Levin
  2013-03-31  1:30       ` Rik van Riel
  2 siblings, 1 reply; 129+ messages in thread
From: Sasha Levin @ 2013-03-30 13:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney

On 03/28/2013 11:32 AM, Rik van Riel wrote:
> On Tue, 26 Mar 2013 13:33:07 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
>> > [   96.347341] ================================================
>> > [   96.348085] [ BUG: lock held when returning to user space! ]
>> > [   96.348834] 3.9.0-rc4-next-20130326-sasha-00011-gbcb2313 #318 Tainted: G        W
>> > [   96.360300] ------------------------------------------------
>> > [   96.361084] trinity-child9/7583 is leaving the kernel with locks still held!
>> > [   96.362019] 1 lock held by trinity-child9/7583:
>> > [   96.362610]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192eafb>] SYSC_semtimedop+0x1fb/0xec0
>> > 
>> > It seems that we can leave semtimedop without releasing the rcu read lock.
> Sasha, this patch untangles the RCU locking with find_alloc_undo,
> and should fix the above issue. As a side benefit, this makes the
> code a little cleaner.
> 
> Next up: implement locking in a way that does not trigger any 
> lockdep warnings...

The following is mostly unrelated to this patch but close enough:

semctl_main() has a snippet that looks like this:

        err = -EINVAL;
        if(semnum < 0 || semnum >= nsems)
                goto out_unlock;

        sem_lock(sma, NULL, -1);

Which means we'll try unlocking the sma without trying to lock it first.
It makes lockdep unhappy:

[   95.528492] =====================================
[   95.529251] [ BUG: bad unlock balance detected! ]
[   95.529897] 3.9.0-rc4-next-20130328-sasha-00014-g91a3267 #319 Tainted: G        W
[   95.530190] -------------------------------------
[   95.530190] trinity-child14/9123 is trying to release lock (&(&new->lock)->rlock) at:
[   95.530190] [<ffffffff8192f8e4>] semctl_main+0xe54/0xf00
[   95.530190] but there are no more locks to release!
[   95.530190]
[   95.530190] other info that might help us debug this:
[   95.530190] 1 lock held by trinity-child14/9123:
[   95.530190]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8192ea90>] semctl_main+0x0/0xf00
[   95.530190]
[   95.530190] stack backtrace:
[   95.530190] Pid: 9123, comm: trinity-child14 Tainted: G        W    3.9.0-rc4-next-20130328-sasha-00014-g91a3267 #319
[   95.530190] Call Trace:
[   95.530190]  [<ffffffff8192f8e4>] ? semctl_main+0xe54/0xf00
[   95.530190]  [<ffffffff8117b7e6>] print_unlock_imbalance_bug+0xf6/0x110
[   95.530190]  [<ffffffff8192f8e4>] ? semctl_main+0xe54/0xf00
[   95.530190]  [<ffffffff81180a35>] lock_release_non_nested+0xd5/0x320
[   95.530190]  [<ffffffff8122e3ab>] ? __do_fault+0x42b/0x530
[   95.530190]  [<ffffffff81179da2>] ? get_lock_stats+0x22/0x70
[   95.530190]  [<ffffffff81179e5e>] ? put_lock_stats.isra.14+0xe/0x40
[   95.530190]  [<ffffffff8192f8e4>] ? semctl_main+0xe54/0xf00
[   95.530190]  [<ffffffff81180f1e>] lock_release+0x29e/0x3b0
[   95.530190]  [<ffffffff819451f4>] ? security_ipc_permission+0x14/0x20
[   95.530190]  [<ffffffff83daf33e>] _raw_spin_unlock+0x1e/0x60
[   95.530190]  [<ffffffff8192f8e4>] semctl_main+0xe54/0xf00
[   95.530190]  [<ffffffff8192ea90>] ? SYSC_semtimedop+0xe30/0xe30
[   95.530190]  [<ffffffff8109d188>] ? kvm_clock_read+0x38/0x70
[   95.530190]  [<ffffffff8114feb5>] ? sched_clock_local+0x25/0xa0
[   95.530190]  [<ffffffff811500e8>] ? sched_clock_cpu+0xf8/0x110
[   95.530190]  [<ffffffff83db3814>] ? __do_page_fault+0x514/0x5e0
[   95.530190]  [<ffffffff81179da2>] ? get_lock_stats+0x22/0x70
[   95.530190]  [<ffffffff81179e5e>] ? put_lock_stats.isra.14+0xe/0x40
[   95.530190]  [<ffffffff83db3814>] ? __do_page_fault+0x514/0x5e0
[   95.530190]  [<ffffffff8113dc0e>] ? up_read+0x1e/0x40
[   95.530190]  [<ffffffff83db3814>] ? __do_page_fault+0x514/0x5e0
[   95.530190]  [<ffffffff811c6500>] ? rcu_eqs_exit_common+0x60/0x260
[   95.530190]  [<ffffffff81202b9d>] ? user_enter+0xfd/0x130
[   95.530190]  [<ffffffff81202c85>] ? user_exit+0xb5/0xe0
[   95.530190]  [<ffffffff8192faf9>] SyS_semctl+0x69/0x430
[   95.530190]  [<ffffffff81076ea0>] ? syscall_trace_enter+0x20/0x2e0
[   95.530190]  [<ffffffff83db7d18>] tracesys+0xe1/0xe6

I'm thinking that the solution is as simple as:

diff --git a/ipc/sem.c b/ipc/sem.c
index 6e109ef..ac36671 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1333,8 +1333,10 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
        /* GETVAL, GETPID, GETNCTN, GETZCNT: fall-through */
        }
        err = -EINVAL;
-       if(semnum < 0 || semnum >= nsems)
-               goto out_unlock;
+       if(semnum < 0 || semnum >= nsems) {
+               rcu_read_unlock();
+               goto out_wakeup;
+       }

        sem_lock(sma, NULL, -1);
        curr = &sma->sem_base[semnum];

But I'm not 100% sure I'm not messing up anything else.


Thanks,
Sasha

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  5:57                           ` Emmanuel Benisty
@ 2013-03-30 17:22                             ` Linus Torvalds
  2013-03-31  2:38                               ` Emmanuel Benisty
  0 siblings, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-03-30 17:22 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>
>> This came from the gcc build?
>
> yes, very early in the build process, IIRC this line was repeated a
> few times and the build just stalled.

Ok, we're bringing out the crazy hacks now.

The attached patch is just insane, doesn't really even work in
general, and only even compiles on 64-bit. But it should work in
*practice* to find if somebody adds the same RCU head to the RCU lists
twice, and ignore the second time it happens (and give a warning that
hopefully pinpoints the backtrace).

It's ugly. It's broken. It may not work. In other words, I'm not proud
of it. But you seem to be the only one able to trigger the issue
easily, willing to try crazy crap, so "tag, you're it". Maybe this
gives us more information. And maybe it doesn't, and I'm totally wrong
about the whole "rcu head added twice" theory.

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 2253 bytes --]

 include/linux/types.h | 1 +
 include/net/dst.h     | 2 +-
 kernel/rcu.h          | 1 +
 kernel/rcutree.c      | 6 ++++++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index 4d118ba11349..3f0d9daff906 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -209,6 +209,7 @@ struct ustat {
 struct callback_head {
 	struct callback_head *next;
 	void (*func)(struct callback_head *head);
+	unsigned long magic;
 };
 #define rcu_head callback_head
 
diff --git a/include/net/dst.h b/include/net/dst.h
index 1f8fd109e225..6f8acd031948 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -89,7 +89,7 @@ struct dst_entry {
 	 * (L1_CACHE_SIZE would be too much)
 	 */
 #ifdef CONFIG_64BIT
-	long			__pad_to_align_refcnt[2];
+	long			__pad_to_align_refcnt[1];
 #endif
 	/*
 	 * __refcnt wants to be on a different cache line from
diff --git a/kernel/rcu.h b/kernel/rcu.h
index 7f8e7590e3e5..0381ed3721bb 100644
--- a/kernel/rcu.h
+++ b/kernel/rcu.h
@@ -98,6 +98,7 @@ static inline bool __rcu_reclaim(char *rn, struct rcu_head *head)
 {
 	unsigned long offset = (unsigned long)head->func;
 
+	head->magic = 0;
 	if (__is_kfree_rcu_offset(offset)) {
 		RCU_TRACE(trace_rcu_invoke_kfree_callback(rn, head, offset));
 		kfree((void *)head - offset);
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5b8ad827fd86..80f9cfb63748 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2221,6 +2221,9 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 	}
 }
 
+/* Unlikely bit-pattern to check double RCU calls! */
+#define RCU_HEAD_MAGIC ((unsigned long)(0xfeeddead1acef8edll))
+
 /*
  * Helper function for call_rcu() and friends.  The cpu argument will
  * normally be -1, indicating "currently running CPU".  It may specify
@@ -2235,9 +2238,12 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	struct rcu_data *rdp;
 
 	WARN_ON_ONCE((unsigned long)head & 0x3); /* Misaligned rcu_head! */
+	if (WARN_ON_ONCE(head->magic == RCU_HEAD_MAGIC))
+		return;
 	debug_rcu_head_queue(head);
 	head->func = func;
 	head->next = NULL;
+	head->magic = RCU_HEAD_MAGIC;
 
 	/*
 	 * Opportunistically note grace-period endings and beginnings.

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-30 13:35     ` Sasha Levin
@ 2013-03-31  1:30       ` Rik van Riel
  2013-03-31  4:09         ` Davidlohr Bueso
  0 siblings, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-31  1:30 UTC (permalink / raw)
  To: Sasha Levin
  Cc: torvalds, davidlohr.bueso, linux-kernel, akpm, hhuang,
	jason.low2, walken, lwoodman, chegu_vinod, Paul E. McKenney

On 03/30/2013 09:35 AM, Sasha Levin wrote:

> I'm thinking that the solution is as simple as:

Your patch is absolutely correct.  All it needs now is your
signed-off-by, so Andrew can merge it into -mm :)

Reviewed-by: Rik van Riel <riel@redhat.com>

> diff --git a/ipc/sem.c b/ipc/sem.c
> index 6e109ef..ac36671 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1333,8 +1333,10 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum,
>          /* GETVAL, GETPID, GETNCTN, GETZCNT: fall-through */
>          }
>          err = -EINVAL;
> -       if(semnum < 0 || semnum >= nsems)
> -               goto out_unlock;
> +       if(semnum < 0 || semnum >= nsems) {
> +               rcu_read_unlock();
> +               goto out_wakeup;
> +       }
>
>          sem_lock(sma, NULL, -1);
>          curr = &sma->sem_base[semnum];
>
> But I'm not 100% sure if I don't mess up anything else.

I checked the surrounding code; it all looks fine.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30 17:22                             ` Linus Torvalds
@ 2013-03-31  2:38                               ` Emmanuel Benisty
  0 siblings, 0 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-31  2:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davidlohr Bueso, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

Hi Linus,

On Sun, Mar 31, 2013 at 12:22 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Mar 29, 2013 at 10:57 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
>> On Sat, Mar 30, 2013 at 12:10 PM, Linus Torvalds
>>>
>>> This came from the gcc build?
>>
>> yes, very early in the build process, IIRC this line was repeated a
>> few times and the build just stalled.
>
> Ok, we're bringing out the crazy hacks now.
>
> The attached patch is just insane, doesn't really even work in
> general, and only even compiles on 64-bit. But it should work in
> *practice* to find if somebody adds the same RCU head to the RCU lists
> twice, and ignore the second time it happens (and give a warning that
> hopefully pinpoints the backtrace).
>
> It's ugly. It's broken. It may not work. In other words, I'm not proud
> of it. But you seem to be the only one able to trigger the issue
> easily, willing to try crazy crap, so "tag, you're it". Maybe this
> gives us more information. And maybe it doesn't, and I'm totally wrong
> about the whole "rcu head added twice" theory.

That's all I could get so far:
https://gist.github.com/anonymous/5279255

Losing wireless is generally the starting signal for the controlled
demolition of the machine.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-31  1:30       ` Rik van Riel
@ 2013-03-31  4:09         ` Davidlohr Bueso
  0 siblings, 0 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-31  4:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, torvalds, linux-kernel, akpm, hhuang, jason.low2,
	walken, lwoodman, chegu_vinod, Paul E. McKenney

On Sat, 2013-03-30 at 21:30 -0400, Rik van Riel wrote:
> On 03/30/2013 09:35 AM, Sasha Levin wrote:
> 
> > I'm thinking that the solution is as simple as:
> 
> Your patch is absolutely correct.  All it needs now is your
> signed-off-by, so Andrew can merge it into -mm :)
> 
> Reviewed-by: Rik van Riel <riel@redhat.com>

Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-30  4:33                       ` Emmanuel Benisty
  2013-03-30  5:10                         ` Linus Torvalds
@ 2013-03-31  5:01                         ` Davidlohr Bueso
  2013-03-31 13:45                           ` Rik van Riel
  2013-03-31 17:02                           ` Emmanuel Benisty
  1 sibling, 2 replies; 129+ messages in thread
From: Davidlohr Bueso @ 2013-03-31  5:01 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Sat, 2013-03-30 at 11:33 +0700, Emmanuel Benisty wrote:
> On Sat, Mar 30, 2013 at 10:46 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Fri, Mar 29, 2013 at 8:02 PM, Emmanuel Benisty <benisty.e@gmail.com> wrote:
> >>
> >> Then I start building a random package and the problems start. They
> >> may also happen without compiling but this seems to trigger the bug
> >> quite quickly.
> >
> > I suspect it's about preemption, and the build just results in enough
> > scheduling load that you start hitting whatever race there is.
> >
> >> Anyway, some progress here, I hope: dmesg seems to be
> >> willing to reveal some secrets (using some pastebin service since this
> >> is pretty big):
> >>
> >> https://gist.github.com/anonymous/5275120
> >
> > That looks like exactly the exit_sem() bug that Davidlohr was talking
> > about, where the
> >
> >                 /* exit_sem raced with IPC_RMID, nothing to do */
> >                 if (IS_ERR(sma))
> >                         continue;
> >
> > should be moved to *before* the
> >
> >                 sem_lock(sma, NULL, -1);
> >
> > call. And apparently the bug I had found is already fixed in -next.
> 
> I just tried the 7 original patches + the 2 one liners from -next +
> modified Linus' patch (attached) on the top of 3.9-rc4 using
> PREEMPT_NONE and after moving sem_lock(sma, NULL, -1) as explained
> above. I was building two packages at the same time, went away for 30
> seconds, came back and everything froze as soon as I touched the
> laptop's touchpad. Maybe a coincidence but anyway... Another shot in
> the dark, I had this weird message when trying to build gcc:
> semop(2): encountered an error: Identifier removed

*sigh*. I had high hopes for this being the bug triggering your issue,
especially after seeing exit_sem() in the trace.

Emmanuel, just to be sure, do your changes reflect the patch below?
Especially dropping the rcu read lock before the continue statement
(sorry for not mentioning this in the last email).

Anyway, this is still a bug. Andrew, the patch below applies to
linux-next; please queue it up if you don't have any objections.

Thanks,
Davidlohr

---8<---
From: Davidlohr Bueso <davidlohr.bueso@hp.com>
Subject: [PATCH] ipc, sem: do not call sem_lock when bogus sma

In exit_sem() we attempt to acquire the sma->sem_perm.lock by calling
sem_lock() immediately after obtaining sma. However, if sma isn't valid,
then calling sem_lock() will tend to do bad things.

Move the sma error check right after the sem_obtain_object_check() call instead.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
---
 ipc/sem.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index f257afe..74cedfe 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
 		struct sem_array *sma;
 		struct sem_undo *un;
 		struct list_head tasks;
-		int semid;
-		int i;
+		int semid, i;
 
 		rcu_read_lock();
 		un = list_entry_rcu(ulp->list_proc.next,
@@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
 		}
 
 		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
-		sem_lock(sma, NULL, -1);
-
 		/* exit_sem raced with IPC_RMID, nothing to do */
-		if (IS_ERR(sma))
+		if (IS_ERR(sma)) {
+			rcu_read_unlock();
 			continue;
+		}
 
+		sem_lock(sma, NULL, -1);
 		un = __lookup_undo(ulp, semid);
 		if (un == NULL) {
 			/* exit_sem raced with IPC_RMID+semget() that created
-- 
1.7.11.7




^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-31  5:01                         ` Davidlohr Bueso
@ 2013-03-31 13:45                           ` Rik van Riel
  2013-03-31 17:10                             ` Linus Torvalds
  2013-03-31 17:02                           ` Emmanuel Benisty
  1 sibling, 1 reply; 129+ messages in thread
From: Rik van Riel @ 2013-03-31 13:45 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Emmanuel Benisty, Linus Torvalds, Dave Jones, Andrew Morton,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On 03/31/2013 01:01 AM, Davidlohr Bueso wrote:

> diff --git a/ipc/sem.c b/ipc/sem.c
> index f257afe..74cedfe 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1867,8 +1867,7 @@ void exit_sem(struct task_struct *tsk)
>   		struct sem_array *sma;
>   		struct sem_undo *un;
>   		struct list_head tasks;
> -		int semid;
> -		int i;
> +		int semid, i;
>
>   		rcu_read_lock();
>   		un = list_entry_rcu(ulp->list_proc.next,
> @@ -1884,12 +1883,13 @@ void exit_sem(struct task_struct *tsk)
>   		}
>
>   		sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);

Should we use "semid" here, like Linus suggested, instead of "un->semid"?

> -		sem_lock(sma, NULL, -1);
> -
>   		/* exit_sem raced with IPC_RMID, nothing to do */
> -		if (IS_ERR(sma))
> +		if (IS_ERR(sma)) {
> +			rcu_read_unlock();
>   			continue;
> +		}
>
> +		sem_lock(sma, NULL, -1);
>   		un = __lookup_undo(ulp, semid);
>   		if (un == NULL) {
>   			/* exit_sem raced with IPC_RMID+semget() that created
>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-31  5:01                         ` Davidlohr Bueso
  2013-03-31 13:45                           ` Rik van Riel
@ 2013-03-31 17:02                           ` Emmanuel Benisty
  1 sibling, 0 replies; 129+ messages in thread
From: Emmanuel Benisty @ 2013-03-31 17:02 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Linus Torvalds, Dave Jones, Andrew Morton, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

Hi Davidlohr,

On Sun, Mar 31, 2013 at 12:01 PM, Davidlohr Bueso
<davidlohr.bueso@hp.com> wrote:
> Specially dropping the rcu read lock before the continue statement
> (sorry for not mentioning this in the last email).

I was indeed missing this, thanks. Still the same issues, however...
I'll do some more testing on the same machine but with a totally
different environment, hopefully by tomorrow.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-31 13:45                           ` Rik van Riel
@ 2013-03-31 17:10                             ` Linus Torvalds
  0 siblings, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-03-31 17:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Davidlohr Bueso, Emmanuel Benisty, Dave Jones, Andrew Morton,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley

On Sun, Mar 31, 2013 at 6:45 AM, Rik van Riel <riel@surriel.com> wrote:
>
> Should we use "semid" here, like Linus suggested, instead of "un->semid"?

As Davidlohr noted, in linux-next the rcu read-lock is held over the
whole thing, so no, un->semid should be stable once "un" has been
re-looked-up under the semaphore lock.

In mainline, the problem is that the "sem_lock_check()" is done with
"un->semid" *after* we've dropped the RCU read-lock, so "un" at that
point is not reliable (it could be free'd at any time underneath us).

That said, I really *really* hate what both mainline and linux-next do
with the RCU read lock, and linux-next is arguably worse.

The whole "take the RCU lock in one place, and release it in another"
is confusing and bug-prone as hell. And linux-next made it worse: now
sem_lock() no longer takes the read-lock (it expects the caller to
take it), but sem_unlock() still drops the read-lock. This is all just
f*cking crazy.

The rule should be that the rcu read-lock is always taken and released at
the same "level". For example, find_alloc_undo() should just be called
with (and unconditionally return with) the rcu read-lock held, and if
it needs to actually do an allocation, it can drop the rcu lock for
the duration of the allocation.
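
A minimal sketch of that shape (kernel-style C; lookup_or_alloc_undo() and
lookup_undo() are hypothetical names, just to illustrate the balanced
locking rule, not the actual ipc/sem.c code):

/* assumes <linux/rcupdate.h>, <linux/slab.h>, <linux/err.h> */
static struct sem_undo *lookup_or_alloc_undo(struct sem_undo_list *ulp,
					     int semid, int nsems)
{
	struct sem_undo *un, *new;

	/* entered with rcu_read_lock() held; returns with it held */
	un = lookup_undo(ulp, semid);	/* hypothetical RCU-safe lookup */
	if (un)
		return un;

	/* drop RCU only for the duration of the GFP_KERNEL allocation */
	rcu_read_unlock();
	new = kzalloc(sizeof(*new) + sizeof(short) * nsems, GFP_KERNEL);
	rcu_read_lock();
	if (!new)
		return ERR_PTR(-ENOMEM);

	/* in real code: re-validate and insert under the proper locks here */
	return new;
}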

This whole "conditional locking" depending on error returns and on
whether we have undo's etc is bug-prone and confusing. And when you
have totally different locking rules for "sem_lock()" vs
"sem_unlock()", you know you're confused.

                        Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 18:43         ` Linus Torvalds
  2013-03-29 19:06           ` Dave Jones
  2013-03-29 19:33           ` Peter Hurley
@ 2013-04-01  7:40           ` Stanislav Kinsbursky
  2 siblings, 0 replies; 129+ messages in thread
From: Stanislav Kinsbursky @ 2013-04-01  7:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Andrew Morton, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu, Peter Hurley, sds

29.03.2013 22:43, Linus Torvalds wrote:
> On Fri, Mar 29, 2013 at 9:17 AM, Dave Jones<davej@redhat.com>  wrote:
>> >
>> >Now that I have that reverted, I'm not seeing msgrcv traces any more, but
>> >I've started seeing this..
>> >
>> >general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Do you have CONFIG_CHECKPOINT_RESTORE enabled? Does it go away if you
> disable it?
>
> I think I foud at least one bug in the MSG_COPY stuff: it leaks the
> "copy" allocation if
>
>      mode == SEARCH_LESSEQUAL
>

Hello, Linus.
Sorry, but I don't see a copy allocation leak.
A dummy message is always allocated if msgflg has the MSG_COPY flag set.
Also, prepare_copy() uses msgtyp as the number of the message to copy, and thus sets it to 0.

> but maybe I'm misreading it. And that shouldn't cause the problem you
> see, but it's indicative of how badly tested and thought through the
> MSG_COPY code is.
>
> Totally UNTESTED leak fix appended. Stanislav?
>

I don't see how this patch can help. We should not release it until the copy
is done in msg_handler, because msg is equal to copy.

The dummy copy message is released either by free_copy() (if msg is an error)
or by free_msg().

But there are definitely two issues here:

1) Poor SELinux support for message
copying. This issue was addressed by Stephen Smalley here:

https://lkml.org/lkml/2013/2/6/663

But it looks like he didn't send the patch to Andrew.

2) Copy leak and queue corruption in case the message to copy wasn't found
(this was mentioned by Peter in another thread; thanks for catching this,
Peter!), because msg will be a valid pointer and all the "message copy"
cleanup logic doesn't work.

I like Peter's cleanup and fix series.  But if that looks like too many changes
for this old code, I have another small patch which should fix the issue:

     ipc: set msg back to -EAGAIN if copy wasn't performed

     Make sure that the msg pointer is set back to an error value if the MSG_COPY
     flag is set and the desired message to copy wasn't found. This guarantees that
     msg is either an error pointer or a copy address.
     Otherwise the last message in the queue will be freed without being unlinked
     from the queue (which leads to memory corruption), plus the dummy allocated
     copy won't be released.

     Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf..fede1d0 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -872,6 +872,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
                                                         goto out_unlock;
                                                 break;
                                         }
+                                       msg = ERR_PTR(-EAGAIN);
                                 } else
                                         break;
                                 msg_counter++;

>                       Linus
>
>
> patch.diff
>
>
>   ipc/msg.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/ipc/msg.c b/ipc/msg.c
> index 31cd1bf6af27..b841508556cb 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -870,6 +870,7 @@ long do_msgrcv(int msqid, void __user *buf, size_t bufsz, long msgtyp,
>   						msg = copy_msg(msg, copy);
>   						if (IS_ERR(msg))
>   							goto out_unlock;
> +						copy = NULL;
>   						break;
>   					}
>   				} else
> @@ -976,10 +977,9 @@ out_unlock:
>   			break;
>   		}
>   	}
> -	if (IS_ERR(msg)) {
> -		free_copy(copy);
> +	free_copy(copy);
> +	if (IS_ERR(msg))
>   		return PTR_ERR(msg);
> -	}
>
>   	bufsz = msg_handler(buf, msg, bufsz);
>   	free_msg(msg);
>


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:36               ` Peter Hurley
@ 2013-04-02 16:08                 ` Sasha Levin
  2013-04-02 17:24                   ` Linus Torvalds
  2013-04-02 17:52                   ` Linus Torvalds
  0 siblings, 2 replies; 129+ messages in thread
From: Sasha Levin @ 2013-04-02 16:08 UTC (permalink / raw)
  To: Peter Hurley
  Cc: Linus Torvalds, Dave Jones, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Stanislav Kinsbursky

On 03/29/2013 03:36 PM, Peter Hurley wrote:
> On Fri, 2013-03-29 at 12:26 -0700, Linus Torvalds wrote:
>> On Fri, Mar 29, 2013 at 12:06 PM, Dave Jones <davej@redhat.com> wrote:
>>>
>>> Here's an oops I just hit..
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
>>> IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
>>
>> Btw, looking at the code leading up to this, what the f*ck is wrong
>> with the IPC stuff?
>>
>> It's using the generic list stuff for most of the lists, but then it
>> open-codes the accesses.
>>
>> So instead of using
>>
>>    for_each_entry(walk_msg, &msq->q_messages, m_list) {
>>       ..
>>    }
>>
>> the ipc/msg.c code does all that by hand, with
>>
>>    tmp = msq->q_messages.next;
>>    while (tmp != &msq->q_messages) {
>>       struct msg_msg *walk_msg;
>>
>>       walk_msg = list_entry(tmp, struct msg_msg, m_list);
>>       ...
>>       tmp = tmp->next;
>>    }
>>
>> Ugh. The code is near unreadable. And then it has magic memory
>> barriers etc, implying that it doesn't lock the data structures, but
>> no comments about them. See expunge_all() and pipelined_send().
>>
>> The code seems entirely random, and it's badly set up (annoyance of
>> the day: crazy helper functions in ipc/msgutil.c to make sure that (a)
>> you have to spend more effort looking for them, and (b) they won't get
>> inlined).
>>
>> Clearly nobody has cared for the crazy IPC message code in a long time.
> 
> Exactly that's what my patch series does; clean this mess up.
> 
> This is what I wrote to Andrew a couple of days ago.
> 
> On Tue, 2013-03-26 at 22:33 -0400, Peter Hurley wrote:
>> I just figured out how the queue is being corrupted and why my series
>> fixes it.
>>
>>
>> With MSG_COPY set, the queue scan can exit with the local variable 'msg'
>> pointing to a real msg if the msg_counter never reaches the copy_number.
>>
>> The remaining execution looks like this:
>>
>> 	if (!IS_ERR(msg)) {
>> 		....
>> 		if (msgflg & MSG_COPY)
>> 			goto out_unlock;
>> 		....
>>
>> out_unlock:
>> 			msg_unlock(msq);
>> 			break;
>> 		}
>> 	}
>> 	if (IS_ERR(msg))
>> 		....
>>
>> 	bufsz = msg_handler();
>> 	free_msg(msg);			<<---- msg never unlinked
>>
>>
>> Since the msg should not have been found (because it failed the match
>> criteria), the if (!IS_ERR(msg)) clause should never have executed.
>>
>> That's why my refactor fixes resolve this; because msg is not
>> inadvertently treated as a found msg.
>>
>> But let's be honest; the real bug here is the poor structure of this
>> function that even made this possible. The deepest nesting executes a
>> goto to a label in the middle of an if clause. Yuck! No wonder this
>> thing's fragile.
>>
>> So my recommendation still stands. The series that fixes this has been
>> getting tested in linux-next for a month. Fixing this some other way is
>> just asking for more trouble.
>>
>> But why not just revert MSG_COPY altogether for 3.9?

If you guys are already looking at this, the conversions between size_t,
long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
trigger anything from:

[   33.046572] BUG: unable to handle kernel paging request at ffff88003c2c7000
[   33.047721] IP: [<ffffffff83dbcb34>] bad_from_user+0x4/0x6
[   33.048528] PGD 7232067 PUD 7233067 PMD 3ffff067 PTE 800000003c2c7060
[   33.049506] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   33.050029] Modules linked in:
[   33.050029] CPU 0
[   33.050029] Pid: 6885, comm: a.out Tainted: G        W    3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[   33.050029] RIP: 0010:[<ffffffff83dbcb34>]  [<ffffffff83dbcb34>] bad_from_user+0x4/0x6
[   33.050029] RSP: 0018:ffff88003462be40  EFLAGS: 00010246
[   33.050029] RAX: 0000000000000000 RBX: 00000000fffffffb RCX: 00000000ff06ae2b
[   33.050029] RDX: 00000000fffffffb RSI: 00007fffed36d2a0 RDI: ffff88003c2c7000
[   33.050029] RBP: ffff88003462be88 R08: 0000000000000280 R09: 0000000000000000
[   33.050029] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffffb
[   33.050029] R13: 00007fffed36d2a0 R14: 0000000000000000 R15: 0000000000000000
[   33.050029] FS:  00007f6990044700(0000) GS:ffff88003dc00000(0000) knlGS:0000000000000000
[   33.050029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   33.050029] CR2: ffff88003c2c7000 CR3: 00000000347c8000 CR4: 00000000000406f0
[   33.050029] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   33.050029] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   33.050029] Process a.out (pid: 6885, threadinfo ffff88003462a000, task ffff880034cb3000)
[   33.050029] Stack:
[   33.050029]  ffffffff8192a6a9 ffff88003462be98 ffff88003b331e00 ffff88003ddd01e0
[   33.050029]  0000000000000000 0000000000000000 0000000000000001 0000000000000000
[   33.050029]  0000000000000000 ffff88003462bf68 ffffffff8192bb34 0000000000000000
[   33.050029] Call Trace:
[   33.050029]  [<ffffffff8192a6a9>] ? load_msg+0x59/0x100
[   33.050029]  [<ffffffff8192bb34>] do_msgrcv+0x74/0x5b0
[   33.050029]  [<ffffffff81202c85>] ? user_exit+0xb5/0xe0
[   33.050029]  [<ffffffff8192a750>] ? load_msg+0x100/0x100
[   33.050029]  [<ffffffff8117cdcd>] ? trace_hardirqs_on+0xd/0x10
[   33.050029]  [<ffffffff81076ea0>] ? syscall_trace_enter+0x20/0x2e0
[   33.050029]  [<ffffffff8192c080>] SyS_msgrcv+0x10/0x20
[   33.050029]  [<ffffffff83db7e58>] tracesys+0xe1/0xe6
[   33.050029] Code: e9 1f ee c3 fd b9 f2 ff ff ff e9 28 ee c3 fd b8 f2 ff ff ff e9 2f ee c3 fd ba f2 ff ff ff e9 bf f1 c3 fd 90
90 90 90 89 d1 31 c0 <f3> aa 89 d0 c3 01 ca e9 50 fa c4 fd c1 e1 06 01 ca eb 08 48 8d
[   33.050029] RIP  [<ffffffff83dbcb34>] bad_from_user+0x4/0x6
[   33.050029]  RSP <ffff88003462be40>
[   33.050029] CR2: ffff88003c2c7000
[   33.050029] ---[ end trace 9bba0da8a88b1faa ]---

To:

=============================================================================
[ 1393.475659] BUG kmalloc-4096 (Tainted: G        W   ): Padding overwritten. 0xffff88004e00f8f8-0xffff88004e00ffff
[ 1393.477469] -----------------------------------------------------------------------------
[ 1393.477469]
[ 1393.478980] Disabling lock debugging due to kernel taint
[ 1393.479730] INFO: Slab 0xffffea0001380200 objects=7 used=7 fp=0x          (null) flags=0x1ffc0000004081
[ 1393.480030] Pid: 25258, comm: trinity-child54 Tainted: G    B   W    3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[ 1393.480030] Call Trace:
[ 1393.480030]  [<ffffffff8125a3ca>] slab_err+0xaa/0xd0
[ 1393.480030]  [<ffffffff81179e5e>] ? put_lock_stats.isra.14+0xe/0x40
[ 1393.480030]  [<ffffffff8125af14>] slab_pad_check+0x104/0x170
[ 1393.480030]  [<ffffffff8125b045>] check_slab+0xc5/0xd0
[ 1393.480030]  [<ffffffff83d67748>] free_debug_processing+0x52/0x204
[ 1393.480030]  [<ffffffff83dafc5d>] ? _raw_spin_unlock_irqrestore+0x5d/0xb0
[ 1393.480030]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.480030]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.480030]  [<ffffffff83d67931>] __slab_free+0x37/0x3f7
[ 1393.480030]  [<ffffffff81a2268c>] ? __debug_check_no_obj_freed+0x16c/0x220
[ 1393.480030]  [<ffffffff811c8ad7>] ? rcu_irq_exit+0x1c7/0x260
[ 1393.480030]  [<ffffffff8125c07d>] ? kfree+0x20d/0x330
[ 1393.480030]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.480030]  [<ffffffff8125c137>] kfree+0x2c7/0x330
[ 1393.480030]  [<ffffffff8192a583>] free_msg+0x33/0x40
[ 1393.480030]  [<ffffffff8192a739>] load_msg+0xe9/0x100
[ 1393.480030]  [<ffffffff8192bb34>] do_msgrcv+0x74/0x5b0
[ 1393.480030]  [<ffffffff81202c85>] ? user_exit+0xb5/0xe0
[ 1393.480030]  [<ffffffff8192a750>] ? load_msg+0x100/0x100
[ 1393.480030]  [<ffffffff8117cdcd>] ? trace_hardirqs_on+0xd/0x10
[ 1393.480030]  [<ffffffff81076ea0>] ? syscall_trace_enter+0x20/0x2e0
[ 1393.480030]  [<ffffffff8192c080>] SyS_msgrcv+0x10/0x20
[ 1393.480030]  [<ffffffff83db7e58>] tracesys+0xe1/0xe6
[ 1393.480030] Padding ffff88004e00f8f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.480030] Padding ffff88004e00f908: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.480030] Padding ffff88004e00f918: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.480030] Padding ffff88004e00f928: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f938: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f948: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f958: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f968: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f978: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f988: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f998: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9a8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9b8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9c8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00f9f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fa98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00faa8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fab8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fac8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fad8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fae8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00faf8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fb98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fba8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fbb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fbc8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fbd8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fbe8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fbf8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fc98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fca8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fcb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fcc8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fcd8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fce8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fcf8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fd98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fda8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fdb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fdc8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fdd8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fde8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fdf8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fe98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fea8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00feb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fec8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fed8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fee8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fef8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff08: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff18: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff28: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff38: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff48: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff58: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff68: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff78: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff88: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ff98: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ffa8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ffb8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ffc8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ffd8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00ffe8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.509882] Padding ffff88004e00fff8: 00 00 00 00 00 00 00 00                          ........
[ 1393.509882] FIX kmalloc-4096: Restoring 0xffff88004e00f8f8-0xffff88004e00ffff=0x5a
[ 1393.509882]
[ 1393.689228] =============================================================================
[ 1393.690761] BUG kmalloc-4096 (Tainted: G    B   W   ): Redzone overwritten
[ 1393.690761] -----------------------------------------------------------------------------
[ 1393.690761]
[ 1393.690761] INFO: 0xffff88004e00f7b0-0xffff88004e00f7b7. First byte 0x0 instead of 0xcc
[ 1393.690761] INFO: Slab 0xffffea0001380200 objects=7 used=6 fp=0xffff88004e008000 flags=0x1ffc0000004081
[ 1393.690761] INFO: Object 0xffff88004e00e7b0 @offset=26544 fp=0x          (null)
[ 1393.690761]
[ 1393.690761] Bytes b4 ffff88004e00e7a0: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
[ 1393.690761] Object ffff88004e00e7b0: 48 91 00 4e 00 88 ff ff 6b 6b 6b 6b 6b 6b 6b 6b  H..N....kkkkkkkk
[ 1393.690761] Object ffff88004e00e7c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e7d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e7e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e7f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e800: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e810: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e820: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e830: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e840: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e850: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e860: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e870: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e880: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e890: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e8f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e900: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e910: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e920: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e930: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e940: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e950: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e960: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e970: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e980: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e990: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00e9f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ea90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eaa0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eab0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eac0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ead0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eae0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eaf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eb90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ebb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ebc0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ebd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ebe0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ebf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ec90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00eca0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ecb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ecc0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ecd0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ece0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[ 1393.690761] Object ffff88004e00ecf0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00  kkkkkkkkkkk.....
[ 1393.690761] Object ffff88004e00ed00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ed90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eda0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00edb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00edc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00edd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ede0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00edf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ee90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00ef90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00efa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00efb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00efc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00efd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00efe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00eff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f0f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f1f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f2f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f3f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f410: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f420: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f440: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f450: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f460: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f470: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f4f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f520: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f530: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f540: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f550: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f560: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f570: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f590: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f5f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f650: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f660: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f670: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f6f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f710: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f720: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f730: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f740: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f770: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f790: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Object ffff88004e00f7a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[ 1393.690761] Redzone ffff88004e00f7b0: 00 00 00 00 00 00 00 00                          ........
[ 1393.690761] Padding ffff88004e00f8f0: 00 00 00 00 00 00 00 00                          ........
[ 1393.690761] Pid: 25258, comm: trinity-child54 Tainted: G    B   W    3.9.0-rc4-next-20130328-sasha-00017-g1463000 #321
[ 1393.690761] Call Trace:
[ 1393.690761]  [<ffffffff81259268>] ? print_section+0x38/0x40
[ 1393.690761]  [<ffffffff812593a1>] print_trailer+0x131/0x140
[ 1393.690761]  [<ffffffff812597f4>] check_bytes_and_report+0xc4/0x120
[ 1393.690761]  [<ffffffff8125a781>] check_object+0x51/0x240
[ 1393.690761]  [<ffffffff83d677bd>] free_debug_processing+0xc7/0x204
[ 1393.690761]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.690761]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.690761]  [<ffffffff83d67931>] __slab_free+0x37/0x3f7
[ 1393.690761]  [<ffffffff81a2268c>] ? __debug_check_no_obj_freed+0x16c/0x220
[ 1393.690761]  [<ffffffff811c8ad7>] ? rcu_irq_exit+0x1c7/0x260
[ 1393.690761]  [<ffffffff8125c07d>] ? kfree+0x20d/0x330
[ 1393.690761]  [<ffffffff8192a583>] ? free_msg+0x33/0x40
[ 1393.690761]  [<ffffffff8125c137>] kfree+0x2c7/0x330
[ 1393.690761]  [<ffffffff8192a583>] free_msg+0x33/0x40
[ 1393.690761]  [<ffffffff8192a739>] load_msg+0xe9/0x100
[ 1393.690761]  [<ffffffff8192bb34>] do_msgrcv+0x74/0x5b0
[ 1393.690761]  [<ffffffff81202c85>] ? user_exit+0xb5/0xe0
[ 1393.690761]  [<ffffffff8192a750>] ? load_msg+0x100/0x100
[ 1393.690761]  [<ffffffff8117cdcd>] ? trace_hardirqs_on+0xd/0x10
[ 1393.690761]  [<ffffffff81076ea0>] ? syscall_trace_enter+0x20/0x2e0
[ 1393.690761]  [<ffffffff8192c080>] SyS_msgrcv+0x10/0x20
[ 1393.690761]  [<ffffffff83db7e58>] tracesys+0xe1/0xe6
[ 1393.690761] FIX kmalloc-4096: Restoring 0xffff88004e00f7b0-0xffff88004e00f7b7=0xcc

By just playing with the 'msgsz' parameter with MSG_COPY set.
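
For illustration, a minimal userspace sketch of the kind of call being
described (not trinity itself, and not the exact reproducer) might look
like the following; MSG_COPY is a kernel uapi flag, defined by hand below
to avoid pulling conflicting kernel headers into a libc program, and it
is only honoured on kernels built with CONFIG_CHECKPOINT_RESTORE:

/*
 * Sketch only: msgrcv() with MSG_COPY | IPC_NOWAIT and an absurd msgsz.
 * On a fixed kernel this simply fails (ENOMSG/EINVAL/ENOSYS).
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#ifndef MSG_COPY
#define MSG_COPY 040000		/* value from include/uapi/linux/msg.h */
#endif

int main(void)
{
	struct { long mtype; char mtext[64]; } buf;
	int msqid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

	/* huge msgsz, MSG_COPY set, "index" 0 as the msgtyp argument */
	ssize_t r = msgrcv(msqid, &buf, (size_t)-1, 0,
			   MSG_COPY | IPC_NOWAIT);
	printf("msgrcv returned %zd\n", r);
	return 0;
}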


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-04-02 16:08                 ` Sasha Levin
@ 2013-04-02 17:24                   ` Linus Torvalds
  2013-04-02 17:52                   ` Linus Torvalds
  1 sibling, 0 replies; 129+ messages in thread
From: Linus Torvalds @ 2013-04-02 17:24 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Hurley, Dave Jones, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Stanislav Kinsbursky

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> If you guys are already looking at this, the conversions between size_t,
> long and int in the do_msgrcv/load_msg/alloc_msg code are a mess. You could
> trigger anything from:

Good catch.

Let's just change the "(long)bufsz < 0" into "bufsz > INT_MAX".

I suspect we should change some of the "int" arguments to "size_t" too,
so that we don't have these odd cases of different routines seeing
different values due to subtle casting errors, but in the end we never
really want to let anyone get near these kinds of potential overflow
issues anyway. We already limit normal read/write/sendmsg etc to
INT_MAX (although there we tend to *truncate* to INT_MAX rather than
return an error; either way, I think the simpler patch here is
preferable unless somebody complains).
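
As a rough userspace sketch of the suggested check (not a kernel diff;
check_msgsz() is a made-up helper name), the idea is simply:

#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

static int check_msgsz(size_t bufsz)
{
	/* old form: if ((long)bufsz < 0) return -EINVAL; */
	if (bufsz > INT_MAX)
		return -EINVAL;
	return 0;
}

int main(void)
{
	printf("%d\n", check_msgsz((size_t)INT_MAX + 1));	/* -EINVAL */
	printf("%d\n", check_msgsz(128));			/* 0 */
	return 0;
}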

Comments?

              Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-04-02 16:08                 ` Sasha Levin
  2013-04-02 17:24                   ` Linus Torvalds
@ 2013-04-02 17:52                   ` Linus Torvalds
  2013-04-02 19:53                     ` Sasha Levin
  1 sibling, 1 reply; 129+ messages in thread
From: Linus Torvalds @ 2013-04-02 17:52 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Hurley, Dave Jones, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Stanislav Kinsbursky

On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> By just playing with the 'msgsz' parameter with MSG_COPY set.

Hmm. Looking closer, I suspect you're testing without commit
88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
should limit the size passed in to prepare_copy -> load_copy to
msg_ctlmax.

Now, I think it's possibly still a good idea to limit bufsz to INT_MAX
regardless, but as far as I can see that prepare_copy -> load_copy
path is the only place that can get confused. Everybody else uses
size_t (or "long" in the case of r_maxsize) as far as I can tell.
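
A self-contained sketch of the clamp that commit is described as adding
(the real code lives in the do_msgrcv()/prepare_copy() path and uses
ns->msg_ctlmax; the constant below is only a stand-in for the sysctl)
might look like:

#include <stddef.h>
#include <stdio.h>

static size_t clamp_copy_size(size_t bufsz, size_t msg_ctlmax)
{
	/* whatever msgsz userspace passes with MSG_COPY, the size used
	 * for the copy allocation never exceeds msg_ctlmax */
	return bufsz > msg_ctlmax ? msg_ctlmax : bufsz;
}

int main(void)
{
	size_t msg_ctlmax = 8192;	/* assumed default-ish MSGMAX value */

	printf("%zu\n", clamp_copy_size((size_t)-1, msg_ctlmax));	/* 8192 */
	printf("%zu\n", clamp_copy_size(64, msg_ctlmax));		/* 64 */
	return 0;
}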

        Linus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-04-02 17:52                   ` Linus Torvalds
@ 2013-04-02 19:53                     ` Sasha Levin
  2013-04-02 20:00                       ` Dave Jones
  0 siblings, 1 reply; 129+ messages in thread
From: Sasha Levin @ 2013-04-02 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Hurley, Dave Jones, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Stanislav Kinsbursky

On 04/02/2013 01:52 PM, Linus Torvalds wrote:
> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>
>> By just playing with the 'msgsz' parameter with MSG_COPY set.
> 
> Hmm. Looking closer, I suspect you're testing without commit
> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
> should limit the size passed in to prepare_copy -> load_copy to
> msg_ctlmax.

That commit has a revert in the -next trees, do we need a revert
of the revert?

	commit ff6577a3e714ccae02d4400e989762c19c37b0b3
	Author: Andrew Morton <akpm@linux-foundation.org>
	Date:   Wed Mar 27 10:24:02 2013 +1100
	
	    revert "ipc: don't allocate a copy larger than max"
	
	    Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
	    during trinity testing.
	
	    Cc: Peter Hurley <peter@hurleysoftware.com>
	    Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
	    Reported-by: Dave Jones <davej@redhat.com>
	    Cc: <stable@vger.kernel.org>
	    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>



Thanks,
Sasha

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-04-02 19:53                     ` Sasha Levin
@ 2013-04-02 20:00                       ` Dave Jones
  0 siblings, 0 replies; 129+ messages in thread
From: Dave Jones @ 2013-04-02 20:00 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linus Torvalds, Peter Hurley, Andrew Morton, Rik van Riel,
	Davidlohr Bueso, Linux Kernel Mailing List, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu,
	Stanislav Kinsbursky

On Tue, Apr 02, 2013 at 03:53:01PM -0400, Sasha Levin wrote:
 > On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 > > On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
 > >>
 > >> By just playing with the 'msgsz' parameter with MSG_COPY set.
 > > 
 > > Hmm. Looking closer, I suspect you're testing without commit
 > > 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
 > > should limit the size passed in to prepare_copy -> load_copy to
 > > msg_ctlmax.
 > 
 > That commit has a revert in the -next trees, do we need a revert
 > of the revert?

Yeah, I told Andrew to drop that, but I think he's travelling.

	Dave


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-03-26 20:00       ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
@ 2013-04-05  4:38         ` Mike Galbraith
  2013-04-05 13:21           ` Rik van Riel
  0 siblings, 1 reply; 129+ messages in thread
From: Mike Galbraith @ 2013-04-05  4:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, Davidlohr Bueso, torvalds, linux-kernel, akpm,
	hhuang, jason.low2, walken, lwoodman, chegu_vinod,
	Paul E. McKenney

On Tue, 2013-03-26 at 16:00 -0400, Rik van Riel wrote: 
> On Tue, 26 Mar 2013 14:07:14 -0400
> Sasha Levin <sasha.levin@oracle.com> wrote:
> 
> > > Not necessarily, we do release everything at the end of the function: 
> > >     out_unlock_free:
> > > 	sem_unlock(sma, locknum);
> > 
> > Ow, there's a rcu_read_unlock() in sem_unlock()? This complicates things even
> > more I suspect. If un is non-NULL we'll be unlocking rcu lock twice?
> 
> Sasha, this patch should resolve the RCU tangle, by making sure
> we only ever take the rcu_read_lock once in semtimedop.
> 
> ---8<---
> 
> The ipc semaphore code has a nasty RCU locking tangle, with both
> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
> code can be cleaned up somewhat by only taking the rcu_read_lock
> once.
> 
> There are no other callers to find_alloc_undo.
> 
> This should also solve the trinity issue reported by Sasha Levin.

I take it this is on top of the patchlet Sasha submitted?

(I hit rcu stall banging on patch set in rt with 60 synchronized core
executive model if I let it run long enough, fwiw)

> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  ipc/sem.c |   31 +++++++++----------------------
>  1 files changed, 9 insertions(+), 22 deletions(-)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index f46441a..2ec2945 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1646,22 +1646,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>  			alter = 1;
>  	}
>  
> +	INIT_LIST_HEAD(&tasks);
> +
>  	if (undos) {
> +		/* On success, find_alloc_undo takes the rcu_read_lock */
>  		un = find_alloc_undo(ns, semid);
>  		if (IS_ERR(un)) {
>  			error = PTR_ERR(un);
>  			goto out_free;
>  		}
> -	} else
> +	} else {
>  		un = NULL;
> +		rcu_read_lock();
> +	}
>  
> -	INIT_LIST_HEAD(&tasks);
> -
> -	rcu_read_lock();
>  	sma = sem_obtain_object_check(ns, semid);
>  	if (IS_ERR(sma)) {
> -		if (un)
> -			rcu_read_unlock();
> +		rcu_read_unlock();
>  		error = PTR_ERR(sma);
>  		goto out_free;
>  	}
> @@ -1693,22 +1694,8 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>  	 */
>  	error = -EIDRM;
>  	locknum = sem_lock(sma, sops, nsops);
> -	if (un) {
> -		if (un->semid == -1) {
> -			rcu_read_unlock();
> -			goto out_unlock_free;
> -		} else {
> -			/*
> -			 * rcu lock can be released, "un" cannot disappear:
> -			 * - sem_lock is acquired, thus IPC_RMID is
> -			 *   impossible.
> -			 * - exit_sem is impossible, it always operates on
> -			 *   current (or a dead task).
> -			 */
> -
> -			rcu_read_unlock();
> -		}
> -	}
> +	if (un && un->semid == -1)
> +		goto out_unlock_free;
>  
>  	error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current));
>  	if (error <= 0) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-04-05  4:38         ` Mike Galbraith
@ 2013-04-05 13:21           ` Rik van Riel
  2013-04-05 16:26             ` Mike Galbraith
  2013-04-16 12:37             ` Mike Galbraith
  0 siblings, 2 replies; 129+ messages in thread
From: Rik van Riel @ 2013-04-05 13:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Sasha Levin, Davidlohr Bueso, torvalds, linux-kernel, akpm,
	hhuang, jason.low2, walken, lwoodman, chegu_vinod,
	Paul E. McKenney

On 04/05/2013 12:38 AM, Mike Galbraith wrote:
> On Tue, 2013-03-26 at 16:00 -0400, Rik van Riel wrote:

>> The ipc semaphore code has a nasty RCU locking tangle, with both
>> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
>> code can be cleaned up somewhat by only taking the rcu_read_lock
>> once.
>>
>> There are no other callers to find_alloc_undo.
>>
>> This should also solve the trinity issue reported by Sasha Levin.
>
> I take it this is on top of the patchlet Sasha submitted?

Indeed, and all the other fixes that got submitted :)

> (I hit rcu stall banging on patch set in rt with 60 synchronized core
> executive model if I let it run long enough, fwiw)

What are you using to trigger an rcu stall?

>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>> Signed-off-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-04-05 13:21           ` Rik van Riel
@ 2013-04-05 16:26             ` Mike Galbraith
  2013-04-16 12:37             ` Mike Galbraith
  1 sibling, 0 replies; 129+ messages in thread
From: Mike Galbraith @ 2013-04-05 16:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, Davidlohr Bueso, torvalds, linux-kernel, akpm,
	hhuang, jason.low2, walken, lwoodman, chegu_vinod,
	Paul E. McKenney

On Fri, 2013-04-05 at 09:21 -0400, Rik van Riel wrote: 
> On 04/05/2013 12:38 AM, Mike Galbraith wrote:
> > On Tue, 2013-03-26 at 16:00 -0400, Rik van Riel wrote:
> 
> >> The ipc semaphore code has a nasty RCU locking tangle, with both
> >> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
> >> code can be cleaned up somewhat by only taking the rcu_read_lock
> >> once.
> >>
> >> There are no other callers to find_alloc_undo.
> >>
> >> This should also solve the trinity issue reported by Sasha Levin.
> >
> > I take it this is on top of the patchlet Sasha submitted?
> 
> Indeed, and all the other fixes that got submitted :)
> 
> > (I hit rcu stall banging on patch set in rt with 60 synchronized core
> > executive model if I let it run long enough, fwiw)
> 
> What are you using to trigger an rcu stall?

Running a model of a userspace task scheduler.  That was a fix or so ago
now though.  I'll try the set again on that box.

-Mike


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo
  2013-04-05 13:21           ` Rik van Riel
  2013-04-05 16:26             ` Mike Galbraith
@ 2013-04-16 12:37             ` Mike Galbraith
  1 sibling, 0 replies; 129+ messages in thread
From: Mike Galbraith @ 2013-04-16 12:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Sasha Levin, Davidlohr Bueso, torvalds, linux-kernel, akpm,
	hhuang, jason.low2, walken, lwoodman, chegu_vinod,
	Paul E. McKenney

On Fri, 2013-04-05 at 09:21 -0400, Rik van Riel wrote: 
> On 04/05/2013 12:38 AM, Mike Galbraith wrote:
> > On Tue, 2013-03-26 at 16:00 -0400, Rik van Riel wrote:
> 
> >> The ipc semaphore code has a nasty RCU locking tangle, with both
> >> find_alloc_undo and semtimedop taking the rcu_read_lock(). The
> >> code can be cleaned up somewhat by only taking the rcu_read_lock
> >> once.
> >>
> >> There are no other callers to find_alloc_undo.
> >>
> >> This should also solve the trinity issue reported by Sasha Levin.
> >
> > I take it this is on top of the patchlet Sasha submitted?
> 
> Indeed, and all the other fixes that got submitted :)

I plugged it into my 64 core rt box and beat on it again, no stalls or
any other troubles noted.

-Mike


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-26 18:35                 ` Andrew Morton
@ 2013-04-16 23:30                   ` Andrew Morton
  0 siblings, 0 replies; 129+ messages in thread
From: Andrew Morton @ 2013-04-16 23:30 UTC (permalink / raw)
  To: Davidlohr Bueso, Emmanuel Benisty, Linus Torvalds, Rik van Riel,
	Linux Kernel Mailing List, hhuang, Low, Jason, Michel Lespinasse,
	Larry Woodman, Vinod, Chegu

On Tue, 26 Mar 2013 11:35:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> Do we need the locking at all?  What does it actually do?
> 
> 			sem_lock_and_putref(sma);
> 			if (sma->sem_perm.deleted) {
> 				sem_unlock(sma, -1);
> 				err = -EIDRM;
> 				goto out_free;
> 			}
> 			sem_unlock(sma, -1);
> 
> We're taking the lock, testing an int and then dropping the lock. 
> What's the point in that?

Rikpoke.

The new semctl_main() is now taking a lock, testing
sma->sem_perm.deleted then dropping that lock.  It looks wrong.  What
is that lock testing against?  What prevents .deleted from changing
value 1ns after we dropped that lock?
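
As a rough illustration of the pattern being questioned (pthread
stand-ins, not the kernel code), the lock only makes the flag stable
for the duration of the critical section:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct sem_array_model {
	pthread_mutex_t lock;
	bool deleted;
};

static int check_then_unlock(struct sem_array_model *sma)
{
	pthread_mutex_lock(&sma->lock);
	if (sma->deleted) {			/* valid only under the lock */
		pthread_mutex_unlock(&sma->lock);
		return -EIDRM;
	}
	pthread_mutex_unlock(&sma->lock);
	/* another thread may set ->deleted right here, before the caller
	 * acts on the "not deleted" answer */
	return 0;
}

int main(void)
{
	struct sem_array_model sma = { PTHREAD_MUTEX_INITIALIZER, false };

	return check_then_unlock(&sma) ? 1 : 0;
}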


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-29 19:01       ` Dave Jones
@ 2013-05-03 15:03         ` Peter Hurley
  0 siblings, 0 replies; 129+ messages in thread
From: Peter Hurley @ 2013-05-03 15:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Rik van Riel, torvalds, davidlohr.bueso,
	linux-kernel, hhuang, jason.low2, walken, lwoodman, chegu_vinod

On 03/29/2013 03:01 PM, Dave Jones wrote:
> On Tue, Mar 26, 2013 at 12:43:09PM -0700, Andrew Morton wrote:
>   > On Tue, 26 Mar 2013 15:28:52 -0400 Dave Jones <davej@redhat.com> wrote:
>   >
>   > > On Thu, Mar 21, 2013 at 02:10:58PM -0700, Andrew Morton wrote:
>   > >
>   > >  > Whichever way we go, we should get a wiggle on - this has been hanging
>   > >  > around for too long.  Dave, do you have time to determine whether
>   > >  > reverting 88b9e456b1649722673ff ("ipc: don't allocate a copy larger
>   > >  > than max") fixes things up?
>   > >
>   > > Ok, with that reverted it's been grinding away for a few hours without incident.
>   > > Normally I see the oops within a minute or so.
>   > >
>   >
>   > OK, thanks, I queued a revert:
>   >
>   > From: Andrew Morton <akpm@linux-foundation.org>
>   > Subject: revert "ipc: don't allocate a copy larger than max"
>   >
>   > Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
>   > during trinity testing.
>
> I owe Peter an apology. I just hit it again with that backed out.
> Andrew, might as well drop that revert.
>
> BUG: unable to handle kernel NULL pointer dereference at 000000000000000f
> IP: [<ffffffff812c24ca>] testmsg.isra.5+0x1a/0x60
> [...snip...]
>
> I think I wasn't seeing that this last week because I had inadvertently disabled DEBUG_PAGEALLOC
>
> and.. we're back to square one.
>
> 	Dave

Andrew,

I just realized you're still carrying

	commit 4bea54c91bcc5451f237e6b721b0b35eccd01d17
	Author: Andrew Morton <akpm@linux-foundation.org>
	Date:   Fri Apr 26 10:55:12 2013 +1000

	    revert "ipc: don't allocate a copy larger than max"

	    Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
	    during trinity testing.

	    Cc: Peter Hurley <peter@hurleysoftware.com>
	    Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
	    Reported-by: Dave Jones <davej@redhat.com>
	    Cc: <stable@vger.kernel.org>
	    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Please drop.

As quoted above, the mainline testing that attributed the observed
oops to the reverted patch was invalidated by a config error.

As Linus pointed out below, this patch fixes real bugs.

On 04/02/2013 03:53 PM, Sasha Levin wrote:
 > On 04/02/2013 01:52 PM, Linus Torvalds wrote:
 >> On Tue, Apr 2, 2013 at 9:08 AM, Sasha Levin <sasha.levin@oracle.com> wrote:
 >>>
 >>> By just playing with the 'msgsz' parameter with MSG_COPY set.
 >>
 >> Hmm. Looking closer, I suspect you're testing without commit
 >> 88b9e456b164 ("ipc: don't allocate a copy larger than max"). That
 >> should limit the size passed in to prepare_copy -> load_copy to
 >> msg_ctlmax.
 >
 > That commit has a revert in the -next trees, do we need a revert
 > of the revert?
 >
 > 	commit ff6577a3e714ccae02d4400e989762c19c37b0b3
 > 	Author: Andrew Morton <akpm@linux-foundation.org>
 > 	Date:   Wed Mar 27 10:24:02 2013 +1100
 > 	
 > 	    revert "ipc: don't allocate a copy larger than max"
 > 	
 > 	    Revert 88b9e456b164.  Dave has confirmed that this was causing oopses
 > 	    during trinity testing.
 > 	
 > 	    Cc: Peter Hurley <peter@hurleysoftware.com>
 > 	    Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
 > 	    Reported-by: Dave Jones <davej@redhat.com>
 > 	    Cc: <stable@vger.kernel.org>
 > 	    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-03-23  3:19     ` Emmanuel Benisty
  2013-03-23 19:45       ` Linus Torvalds
@ 2013-05-04 15:55       ` Jörn Engel
  2013-05-04 18:12         ` Borislav Petkov
  1 sibling, 1 reply; 129+ messages in thread
From: Jörn Engel @ 2013-05-04 15:55 UTC (permalink / raw)
  To: Emmanuel Benisty
  Cc: Linus Torvalds, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, Andrew Morton, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu

On Sat, 23 March 2013 10:19:16 +0700, Emmanuel Benisty wrote:
> 
> I could reproduce it but could you please let me know what would be
> the right tools I should use to catch the original oops?
> This is what I got but I doubt it will be helpful:
> http://i.imgur.com/Mewi1hC.jpg

You could use either netconsole or blockconsole.  Netconsole requires
a second machine to capture the information, while blockconsole requires
a USB key or something similar to write to.  In both cases you will get
the entire console output in a file, often including very helpful
messages leading up to the crash.

Blockconsole currently lives here:
https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-05-04 15:55       ` Jörn Engel
@ 2013-05-04 18:12         ` Borislav Petkov
  2013-05-06 14:47           ` Jörn Engel
  0 siblings, 1 reply; 129+ messages in thread
From: Borislav Petkov @ 2013-05-04 18:12 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Emmanuel Benisty, Linus Torvalds, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, Andrew Morton, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu

On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> Blockconsole currently lives here:
> https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/

Tja, if only that were upstream...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: ipc,sem: sysv semaphore scalability
  2013-05-04 18:12         ` Borislav Petkov
@ 2013-05-06 14:47           ` Jörn Engel
  0 siblings, 0 replies; 129+ messages in thread
From: Jörn Engel @ 2013-05-06 14:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Emmanuel Benisty, Linus Torvalds, Rik van Riel, Davidlohr Bueso,
	Linux Kernel Mailing List, Andrew Morton, hhuang, Low, Jason,
	Michel Lespinasse, Larry Woodman, Vinod, Chegu

On Sat, 4 May 2013 20:12:45 +0200, Borislav Petkov wrote:
> On Sat, May 04, 2013 at 11:55:58AM -0400, Jörn Engel wrote:
> > Blockconsole currently lives here:
> > https://git.kernel.org/cgit/linux/kernel/git/joern/bcon2.git/
> 
> Tja, if only that were upstream...

Linus has a pull request.  If he prefers to change code until the
relevant bits fit into 80x25, that is his choice. ;)

Jörn

--
The only good bug is a dead bug.
-- Starship Troopers

^ permalink raw reply	[flat|nested] 129+ messages in thread

end of thread, other threads:[~2013-05-06 16:14 UTC | newest]

Thread overview: 129+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-20 19:55 ipc,sem: sysv semaphore scalability Rik van Riel
2013-03-20 19:55 ` [PATCH 1/7] ipc: remove bogus lock comment for ipc_checkid Rik van Riel
2013-03-20 19:55 ` [PATCH 2/7] ipc: introduce obtaining a lockless ipc object Rik van Riel
2013-03-20 19:55 ` [PATCH 3/7] ipc: introduce lockless pre_down ipcctl Rik van Riel
2013-03-20 19:55 ` [PATCH 4/7] ipc,sem: do not hold ipc lock more than necessary Rik van Riel
2013-03-20 19:55 ` [PATCH 5/7] ipc,sem: open code and rename sem_lock Rik van Riel
2013-03-22  1:14   ` Davidlohr Bueso
2013-03-20 19:55 ` [PATCH 6/7] ipc,sem: have only one list in struct sem_queue Rik van Riel
2013-03-22  1:14   ` Davidlohr Bueso
2013-03-20 19:55 ` [PATCH 7/7] ipc,sem: fine grained locking for semtimedop Rik van Riel
2013-03-22  1:14   ` Davidlohr Bueso
2013-03-22 23:01   ` Michel Lespinasse
2013-03-22 23:38     ` Rik van Riel
2013-03-22 23:42     ` [PATCH 7/7 part3] fix for sem_lock Rik van Riel
2013-03-20 20:49 ` ipc,sem: sysv semaphore scalability Linus Torvalds
2013-03-20 20:56   ` Linus Torvalds
2013-03-20 20:57   ` Davidlohr Bueso
2013-03-21 21:10 ` Andrew Morton
2013-03-21 21:47   ` Peter Hurley
2013-03-21 21:50   ` Peter Hurley
2013-03-21 22:01     ` Andrew Morton
2013-03-22  3:38       ` Rik van Riel
2013-03-26 19:28   ` Dave Jones
2013-03-26 19:43     ` Andrew Morton
2013-03-29 16:17       ` Dave Jones
2013-03-29 18:00         ` Linus Torvalds
2013-03-29 18:04           ` Dave Jones
2013-03-29 18:10             ` Linus Torvalds
2013-03-29 18:43         ` Linus Torvalds
2013-03-29 19:06           ` Dave Jones
2013-03-29 19:13             ` Linus Torvalds
2013-03-29 19:26             ` Linus Torvalds
2013-03-29 19:36               ` Peter Hurley
2013-04-02 16:08                 ` Sasha Levin
2013-04-02 17:24                   ` Linus Torvalds
2013-04-02 17:52                   ` Linus Torvalds
2013-04-02 19:53                     ` Sasha Levin
2013-04-02 20:00                       ` Dave Jones
2013-03-29 19:33           ` Peter Hurley
2013-03-29 19:54             ` Linus Torvalds
2013-04-01  7:40           ` Stanislav Kinsbursky
2013-03-29 20:41         ` Linus Torvalds
2013-03-29 21:12           ` Linus Torvalds
2013-03-29 23:16             ` Linus Torvalds
2013-03-30  1:36               ` Emmanuel Benisty
2013-03-30  2:08                 ` Davidlohr Bueso
2013-03-30  3:02                   ` Emmanuel Benisty
2013-03-30  3:46                     ` Linus Torvalds
2013-03-30  4:33                       ` Emmanuel Benisty
2013-03-30  5:10                         ` Linus Torvalds
2013-03-30  5:57                           ` Emmanuel Benisty
2013-03-30 17:22                             ` Linus Torvalds
2013-03-31  2:38                               ` Emmanuel Benisty
2013-03-31  5:01                         ` Davidlohr Bueso
2013-03-31 13:45                           ` Rik van Riel
2013-03-31 17:10                             ` Linus Torvalds
2013-03-31 17:02                           ` Emmanuel Benisty
2013-03-30  2:09                 ` Linus Torvalds
2013-03-30  2:55                   ` Davidlohr Bueso
2013-03-29 19:01       ` Dave Jones
2013-05-03 15:03         ` Peter Hurley
2013-03-22  1:12 ` Davidlohr Bueso
2013-03-22  1:23   ` Linus Torvalds
2013-03-22  3:40     ` Rik van Riel
2013-03-22  7:30 ` Mike Galbraith
2013-03-22 11:04 ` Emmanuel Benisty
2013-03-22 15:37   ` Linus Torvalds
2013-03-23  3:19     ` Emmanuel Benisty
2013-03-23 19:45       ` Linus Torvalds
2013-03-24 13:46         ` Emmanuel Benisty
2013-03-24 17:10           ` Linus Torvalds
2013-03-25 13:47             ` Emmanuel Benisty
2013-03-25 14:00               ` Rik van Riel
2013-03-25 14:03                 ` Rik van Riel
2013-03-25 15:20                   ` Emmanuel Benisty
2013-03-25 15:53                     ` Rik van Riel
2013-03-25 17:09                       ` Emmanuel Benisty
2013-03-25 14:01               ` Rik van Riel
2013-03-25 14:21                 ` Emmanuel Benisty
2013-03-26 17:59               ` Davidlohr Bueso
2013-03-26 18:14                 ` Rik van Riel
2013-03-26 18:35                 ` Andrew Morton
2013-04-16 23:30                   ` Andrew Morton
2013-05-04 15:55       ` Jörn Engel
2013-05-04 18:12         ` Borislav Petkov
2013-05-06 14:47           ` Jörn Engel
2013-03-22 17:51 ` Davidlohr Bueso
2013-03-25 20:21 ` Sasha Levin
2013-03-25 20:38   ` [PATCH -mm -next] ipc,sem: fix lockdep false positive Rik van Riel
2013-03-25 21:42     ` Michel Lespinasse
2013-03-25 21:51       ` Michel Lespinasse
2013-03-25 21:56         ` Sasha Levin
2013-03-25 21:52       ` Sasha Levin
2013-03-26 13:19       ` Peter Zijlstra
2013-03-26 13:40         ` Michel Lespinasse
2013-03-26 14:27           ` Peter Zijlstra
2013-03-26 15:19             ` Rik van Riel
2013-03-27  8:40               ` Peter Zijlstra
2013-03-27  8:42               ` Peter Zijlstra
2013-03-27 11:22                 ` Michel Lespinasse
2013-03-27 12:02                   ` Peter Zijlstra
2013-03-27 20:00                 ` Rik van Riel
2013-03-28 20:23                 ` [PATCH v2 " Rik van Riel
2013-03-29  2:50                   ` Michel Lespinasse
2013-03-29  9:57                     ` Peter Zijlstra
2013-03-29 13:21                       ` Michel Lespinasse
2013-03-29 12:07                     ` Rik van Riel
2013-03-29 13:08                       ` Michel Lespinasse
2013-03-29 13:24                         ` Rik van Riel
2013-03-29 13:55                     ` [PATCH v3 " Rik van Riel
2013-03-29 13:59                       ` Michel Lespinasse
2013-03-26 14:25         ` [PATCH " Rik van Riel
2013-03-26 17:33 ` ipc,sem: sysv semaphore scalability Sasha Levin
2013-03-26 17:51   ` Davidlohr Bueso
2013-03-26 18:07     ` Sasha Levin
2013-03-26 18:17       ` Rik van Riel
2013-03-26 20:00       ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
2013-04-05  4:38         ` Mike Galbraith
2013-04-05 13:21           ` Rik van Riel
2013-04-05 16:26             ` Mike Galbraith
2013-04-16 12:37             ` Mike Galbraith
2013-03-26 17:55   ` ipc,sem: sysv semaphore scalability Paul E. McKenney
2013-03-28 15:32   ` [PATCH -mm -next] ipc,sem: untangle RCU locking with find_alloc_undo Rik van Riel
2013-03-28 21:05     ` Davidlohr Bueso
2013-03-29  1:00     ` Michel Lespinasse
2013-03-29  1:14       ` Sasha Levin
2013-03-30 13:35     ` Sasha Levin
2013-03-31  1:30       ` Rik van Riel
2013-03-31  4:09         ` Davidlohr Bueso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).