* [PATCH 1/3 v3] dcache: Don't take unnecessary lock in d_count update
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` Waiman Long
` (5 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: , Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent,
Sage Weil, Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.
Without using a lock, multiple threads may update d_count
simultaneously. Therefore, atomic instructions must be used to
ensure consistency except in shrink_dcache_for_umount*() where the
whole superblock is being dismounted and locking is not needed.
The worst case scenarios are:
1. d_lock taken in dput with d_count = 2 in one thread and another
thread comes in to atomically decrement d_count without taking
the lock. This may result in a d_count of 0 with no deleting
action taken.
2. d_lock taken in dput with d_count = 1 in one thread and another
thread comes in to atomically increment d_count without taking
the lock. This may result in the dentry in the deleted state while
having a d_count of 1.
Without taking a lock, we need to make sure the decrementing or
incrementing action should not be taken while other threads are
updating d_count simultaneously. This can be done by using the
atomic cmpxchg instruction which will fail if the underlying value
is changed. If the lock is taken, it should be safe to use a simpler
atomic increment or decrement instruction.
To make sure that the above worst case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.
Finally, the CPU must have an instructional level cmpxchg instruction
or the emulated cmpxchg() function may be too expensive to
use. Therefore, the above mentioned changes will only be applied if
the __HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set with the notation exception
of ARM.
As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequent the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.
This patch has a particular big impact on the short workload of the
AIM7 benchmark with ramdisk filesystem. The table below show the
performance improvement to the JPM (jobs per minutes) throughput due
to this patch on an 8-socket 80-core x86-64 system with a 3.10-rc1
kernel in a 1/2/4/8 node configuration by using numactl to restrict
the execution of the workload on certain nodes.
+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1546813 | 5080126 | +228.4% |
| 4 nodes, HT off | 1663439 | 4703083 | +182.7% |
| 2 nodes, HT off | 2545389 | 3790995 | +48.9% |
| 1 node , HT off | 2374488 | 2394846 | +0.9% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1064184 | 7391382 | +594.6% |
| 4 nodes, HT off | 1362447 | 7325475 | +437.7% |
| 2 nodes, HT off | 1851198 | 4539511 | +145.2% |
| 1 node , HT off | 2477537 | 2457513 | -0.8% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1056243 | 7067282 | +569.1% |
| 4 nodes, HT off | 1354505 | 6930437 | +411.7% |
| 2 nodes, HT off | 1735881 | 4581847 | +164.0% |
| 1 node , HT off | 2511547 | 2496594 | -0.6% |
+-----------------+----------------+-----------------+----------+
It can be seen that with 20 CPUs (2 nodes) or more, this patch can
significantly improve the short workload performance. With only 1
node, the performance is similar with or without the patch. The short
workload also scales pretty well up to 4 nodes with this patch.
A perf call-graph report of the short workload at 1500 users without
the patch on 8 nodes indicates that about 79% of the workload's total
time were spent in the _raw_spin_lock() function. Almost all of which
can be attributed to the following 2 kernel functions:
1. dget_parent (49.91%)
2. dput (49.82%)
With this patch on, the time spent on _raw_spin_lock() is only 1.38%
which is a huge improvement. Of the 1.38%, only a small portion of it
is related to the dcache lock. However, the _raw_spin_lock_irqsave()
of the tty_ldisc_lock now dominates with 24.5% of the total time.
This impact of this patch on other AIM7 workloads were much more
modest. The table below show the mean %change due to this patch on
the same 8-socket system with a 3.10-rc1 kernel.
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | -0.2% | -0.5% | -2.7% |
| five_sec | -1.0% | +2.6% | +2.4% |
| fserver | +1.3% | +1.9% | +1.0% |
| high_systime | +0.1% | +1.3% | +4.8% |
| new_fserver | -1.5% | +0.8% | +1.1% |
| shared | +0.5% | +0.4% | -0.1% |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/dcache.c | 37 ++++++++---------
fs/namei.c | 2 +-
include/linux/dcache.h | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 117 insertions(+), 23 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..470b06f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -467,7 +467,7 @@ relock:
}
if (ref)
- dentry->d_count--;
+ dcount_dec(dentry);
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -515,10 +515,13 @@ void dput(struct dentry *dentry)
repeat:
if (dentry->d_count == 1)
might_sleep();
+ if (dcount_dec_cmpxchg(dentry))
+ return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +538,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
@@ -606,11 +609,13 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dcount_inc(dentry);
}
static inline void __dget(struct dentry *dentry)
{
+ if (dcount_inc_cmpxchg(dentry))
+ return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -620,22 +625,16 @@ struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
-repeat:
- /*
- * Don't need rcu_dereference because we re-check it was correct under
- * the lock.
- */
rcu_read_lock();
- ret = dentry->d_parent;
- spin_lock(&ret->d_lock);
- if (unlikely(ret != dentry->d_parent)) {
- spin_unlock(&ret->d_lock);
+ ret = rcu_dereference(dentry->d_parent);
+ if (dcount_inc_cmpxchg(ret)) {
rcu_read_unlock();
- goto repeat;
+ return ret;
}
+ spin_lock(&ret->d_lock);
rcu_read_unlock();
BUG_ON(!ret->d_count);
- ret->d_count++;
+ dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
}
@@ -765,7 +764,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1970,7 +1969,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}
- dentry->d_count++;
+ dcount_inc(dentry);
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2937,7 +2936,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dcount_dec(dentry);
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2944,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ dcount_dec(this_parent);
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 85e40d1..03dcaed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
- parent->d_count++;
+ dcount_inc(parent);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..99da5e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,99 @@ extern int have_submounts(struct dentry *);
extern void d_rehash(struct dentry *);
/**
+ * dcount_inc, dcount_dec - increment or decrement the dentry reference
+ * count
+ * @dentry: dentry to get a reference to
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_inc((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_dec((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count--;
+#endif
+}
+
+/**
+ * dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ *
+ * N.B. For architectures that do not have a cmpxchg instruction, the
+ * substitute code may not perform better than taking the lock.
+ * So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ if ((oldc = dentry->d_count) > 1) {
+ /*
+ * If reference count > 1, we can safely increment its
+ * value using atomic cmpxchg without locking. If the
+ * operation fails, fall back to using the lock.
+ */
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 1)) {
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
+ * dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ /*
+ * If d_count is bigger than 2, we can safely decrement the
+ * reference count using atomic cmpxchg instruction without locking.
+ * Even if d_lock is taken by a thread running dput() with a parallel
+ * atomic_dec(), the reference count won't go to 0 without anyone
+ * noticing.
+ */
+ if ((oldc = dentry->d_count) > 2) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 2)) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
* d_add - add dentry to hash queues
* @entry: dentry to add
* @inode: The inode to attach to this dentry
@@ -319,7 +412,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dcount_inc(dentry);
}
return ret;
@@ -340,7 +433,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
extern char *dentry_path(struct dentry *, char *, int);
/* Allocation counts.. */
-
/**
* dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
@@ -352,13 +444,16 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dcount_inc(dentry);
return dentry;
}
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
+ if (dcount_inc_cmpxchg(dentry))
+ return dentry;
+
spin_lock(&dentry->d_lock);
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 1/3 v3] dcache: Don't take unnecessary lock in d_count update
@ 2013-05-23 1:37 ` Waiman Long
0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.
Without using a lock, multiple threads may update d_count
simultaneously. Therefore, atomic instructions must be used to
ensure consistency except in shrink_dcache_for_umount*() where the
whole superblock is being dismounted and locking is not needed.
The worst case scenarios are:
1. d_lock taken in dput with d_count = 2 in one thread and another
thread comes in to atomically decrement d_count without taking
the lock. This may result in a d_count of 0 with no deleting
action taken.
2. d_lock taken in dput with d_count = 1 in one thread and another
thread comes in to atomically increment d_count without taking
the lock. This may result in the dentry in the deleted state while
having a d_count of 1.
Without taking a lock, we need to make sure the decrementing or
incrementing action should not be taken while other threads are
updating d_count simultaneously. This can be done by using the
atomic cmpxchg instruction which will fail if the underlying value
is changed. If the lock is taken, it should be safe to use a simpler
atomic increment or decrement instruction.
To make sure that the above worst case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.
Finally, the CPU must have an instructional level cmpxchg instruction
or the emulated cmpxchg() function may be too expensive to
use. Therefore, the above mentioned changes will only be applied if
the __HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set with the notation exception
of ARM.
As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequent the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.
This patch has a particular big impact on the short workload of the
AIM7 benchmark with ramdisk filesystem. The table below show the
performance improvement to the JPM (jobs per minutes) throughput due
to this patch on an 8-socket 80-core x86-64 system with a 3.10-rc1
kernel in a 1/2/4/8 node configuration by using numactl to restrict
the execution of the workload on certain nodes.
+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1546813 | 5080126 | +228.4% |
| 4 nodes, HT off | 1663439 | 4703083 | +182.7% |
| 2 nodes, HT off | 2545389 | 3790995 | +48.9% |
| 1 node , HT off | 2374488 | 2394846 | +0.9% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1064184 | 7391382 | +594.6% |
| 4 nodes, HT off | 1362447 | 7325475 | +437.7% |
| 2 nodes, HT off | 1851198 | 4539511 | +145.2% |
| 1 node , HT off | 2477537 | 2457513 | -0.8% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1056243 | 7067282 | +569.1% |
| 4 nodes, HT off | 1354505 | 6930437 | +411.7% |
| 2 nodes, HT off | 1735881 | 4581847 | +164.0% |
| 1 node , HT off | 2511547 | 2496594 | -0.6% |
+-----------------+----------------+-----------------+----------+
It can be seen that with 20 CPUs (2 nodes) or more, this patch can
significantly improve the short workload performance. With only 1
node, the performance is similar with or without the patch. The short
workload also scales pretty well up to 4 nodes with this patch.
A perf call-graph report of the short workload at 1500 users without
the patch on 8 nodes indicates that about 79% of the workload's total
time were spent in the _raw_spin_lock() function. Almost all of which
can be attributed to the following 2 kernel functions:
1. dget_parent (49.91%)
2. dput (49.82%)
With this patch on, the time spent on _raw_spin_lock() is only 1.38%
which is a huge improvement. Of the 1.38%, only a small portion of it
is related to the dcache lock. However, the _raw_spin_lock_irqsave()
of the tty_ldisc_lock now dominates with 24.5% of the total time.
This impact of this patch on other AIM7 workloads were much more
modest. The table below show the mean %change due to this patch on
the same 8-socket system with a 3.10-rc1 kernel.
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | -0.2% | -0.5% | -2.7% |
| five_sec | -1.0% | +2.6% | +2.4% |
| fserver | +1.3% | +1.9% | +1.0% |
| high_systime | +0.1% | +1.3% | +4.8% |
| new_fserver | -1.5% | +0.8% | +1.1% |
| shared | +0.5% | +0.4% | -0.1% |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/dcache.c | 37 ++++++++---------
fs/namei.c | 2 +-
include/linux/dcache.h | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 117 insertions(+), 23 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..470b06f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -467,7 +467,7 @@ relock:
}
if (ref)
- dentry->d_count--;
+ dcount_dec(dentry);
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -515,10 +515,13 @@ void dput(struct dentry *dentry)
repeat:
if (dentry->d_count == 1)
might_sleep();
+ if (dcount_dec_cmpxchg(dentry))
+ return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +538,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
@@ -606,11 +609,13 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dcount_inc(dentry);
}
static inline void __dget(struct dentry *dentry)
{
+ if (dcount_inc_cmpxchg(dentry))
+ return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -620,22 +625,16 @@ struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
-repeat:
- /*
- * Don't need rcu_dereference because we re-check it was correct under
- * the lock.
- */
rcu_read_lock();
- ret = dentry->d_parent;
- spin_lock(&ret->d_lock);
- if (unlikely(ret != dentry->d_parent)) {
- spin_unlock(&ret->d_lock);
+ ret = rcu_dereference(dentry->d_parent);
+ if (dcount_inc_cmpxchg(ret)) {
rcu_read_unlock();
- goto repeat;
+ return ret;
}
+ spin_lock(&ret->d_lock);
rcu_read_unlock();
BUG_ON(!ret->d_count);
- ret->d_count++;
+ dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
}
@@ -765,7 +764,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1970,7 +1969,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}
- dentry->d_count++;
+ dcount_inc(dentry);
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2937,7 +2936,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dcount_dec(dentry);
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2944,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ dcount_dec(this_parent);
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 85e40d1..03dcaed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
- parent->d_count++;
+ dcount_inc(parent);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..99da5e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,99 @@ extern int have_submounts(struct dentry *);
extern void d_rehash(struct dentry *);
/**
+ * dcount_inc, dcount_dec - increment or decrement the dentry reference
+ * count
+ * @dentry: dentry to get a reference to
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_inc((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_dec((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count--;
+#endif
+}
+
+/**
+ * dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ *
+ * N.B. For architectures that do not have a cmpxchg instruction, the
+ * substitute code may not perform better than taking the lock.
+ * So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ if ((oldc = dentry->d_count) > 1) {
+ /*
+ * If reference count > 1, we can safely increment its
+ * value using atomic cmpxchg without locking. If the
+ * operation fails, fall back to using the lock.
+ */
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 1)) {
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
+ * dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ /*
+ * If d_count is bigger than 2, we can safely decrement the
+ * reference count using atomic cmpxchg instruction without locking.
+ * Even if d_lock is taken by a thread running dput() with a parallel
+ * atomic_dec(), the reference count won't go to 0 without anyone
+ * noticing.
+ */
+ if ((oldc = dentry->d_count) > 2) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 2)) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
* d_add - add dentry to hash queues
* @entry: dentry to add
* @inode: The inode to attach to this dentry
@@ -319,7 +412,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dcount_inc(dentry);
}
return ret;
@@ -340,7 +433,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
extern char *dentry_path(struct dentry *, char *, int);
/* Allocation counts.. */
-
/**
* dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
@@ -352,13 +444,16 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dcount_inc(dentry);
return dentry;
}
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
+ if (dcount_inc_cmpxchg(dentry))
+ return dentry;
+
spin_lock(&dentry->d_lock);
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 1/3 v3] dcache: Don't take unnecessary lock in d_count update
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
2013-05-23 1:37 ` Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` Waiman Long
` (4 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-fsdevel, linux-kernel, autofs, ceph-devel,
linux-cifs, samba-technical, linux-nfs, Chandramouleeswaran,
Aswin, Norton, Scott J, Andi Kleen, Dave Chinner
The current code takes the dentry's d_lock lock whenever the d_count
reference count is being updated. In reality, nothing big really
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.
Without using a lock, multiple threads may update d_count
simultaneously. Therefore, atomic instructions must be used to
ensure consistency except in shrink_dcache_for_umount*() where the
whole superblock is being dismounted and locking is not needed.
The worst case scenarios are:
1. d_lock taken in dput with d_count = 2 in one thread and another
thread comes in to atomically decrement d_count without taking
the lock. This may result in a d_count of 0 with no deleting
action taken.
2. d_lock taken in dput with d_count = 1 in one thread and another
thread comes in to atomically increment d_count without taking
the lock. This may result in the dentry in the deleted state while
having a d_count of 1.
Without taking a lock, we need to make sure the decrementing or
incrementing action should not be taken while other threads are
updating d_count simultaneously. This can be done by using the
atomic cmpxchg instruction which will fail if the underlying value
is changed. If the lock is taken, it should be safe to use a simpler
atomic increment or decrement instruction.
To make sure that the above worst case scenarios will not happen,
the dget() function must take the lock if d_count <= 1. Similarly,
the dput() function must take the lock if d_count <= 2. The cmpxchg()
call to update d_count will be tried twice before falling back to
using the lock as there is a fairly good chance that the cmpxchg()
may fail in a busy situation.
Finally, the CPU must have an instructional level cmpxchg instruction
or the emulated cmpxchg() function may be too expensive to
use. Therefore, the above mentioned changes will only be applied if
the __HAVE_ARCH_CMPXCHG flag is set. Most of the major architectures
supported by Linux have this flag set with the notation exception
of ARM.
As for the performance of the updated reference counting code, it
all depends on whether the cmpxchg instruction is used or not. The
original code has 2 atomic instructions to lock and unlock the
spinlock. The new code path has either 1 atomic cmpxchg instruction
or 3 atomic instructions if the lock has to be taken. Depending on
how frequent the cmpxchg instruction is used (d_count > 1 or 2),
the new code can be faster or slower than the original one.
This patch has a particular big impact on the short workload of the
AIM7 benchmark with ramdisk filesystem. The table below show the
performance improvement to the JPM (jobs per minutes) throughput due
to this patch on an 8-socket 80-core x86-64 system with a 3.10-rc1
kernel in a 1/2/4/8 node configuration by using numactl to restrict
the execution of the workload on certain nodes.
+-----------------+----------------+-----------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | Rate w/o patch | Rate with patch | |
+-----------------+---------------------------------------------+
| | User Range 10 - 100 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1546813 | 5080126 | +228.4% |
| 4 nodes, HT off | 1663439 | 4703083 | +182.7% |
| 2 nodes, HT off | 2545389 | 3790995 | +48.9% |
| 1 node , HT off | 2374488 | 2394846 | +0.9% |
+-----------------+---------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1064184 | 7391382 | +594.6% |
| 4 nodes, HT off | 1362447 | 7325475 | +437.7% |
| 2 nodes, HT off | 1851198 | 4539511 | +145.2% |
| 1 node , HT off | 2477537 | 2457513 | -0.8% |
+-----------------+---------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+---------------------------------------------+
| 8 nodes, HT off | 1056243 | 7067282 | +569.1% |
| 4 nodes, HT off | 1354505 | 6930437 | +411.7% |
| 2 nodes, HT off | 1735881 | 4581847 | +164.0% |
| 1 node , HT off | 2511547 | 2496594 | -0.6% |
+-----------------+----------------+-----------------+----------+
It can be seen that with 20 CPUs (2 nodes) or more, this patch can
significantly improve the short workload performance. With only 1
node, the performance is similar with or without the patch. The short
workload also scales pretty well up to 4 nodes with this patch.
A perf call-graph report of the short workload at 1500 users without
the patch on 8 nodes indicates that about 79% of the workload's total
time were spent in the _raw_spin_lock() function. Almost all of which
can be attributed to the following 2 kernel functions:
1. dget_parent (49.91%)
2. dput (49.82%)
With this patch on, the time spent on _raw_spin_lock() is only 1.38%
which is a huge improvement. Of the 1.38%, only a small portion of it
is related to the dcache lock. However, the _raw_spin_lock_irqsave()
of the tty_ldisc_lock now dominates with 24.5% of the total time.
This impact of this patch on other AIM7 workloads were much more
modest. The table below show the mean %change due to this patch on
the same 8-socket system with a 3.10-rc1 kernel.
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | -0.2% | -0.5% | -2.7% |
| five_sec | -1.0% | +2.6% | +2.4% |
| fserver | +1.3% | +1.9% | +1.0% |
| high_systime | +0.1% | +1.3% | +4.8% |
| new_fserver | -1.5% | +0.8% | +1.1% |
| shared | +0.5% | +0.4% | -0.1% |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/dcache.c | 37 ++++++++---------
fs/namei.c | 2 +-
include/linux/dcache.h | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 117 insertions(+), 23 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..470b06f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -467,7 +467,7 @@ relock:
}
if (ref)
- dentry->d_count--;
+ dcount_dec(dentry);
/*
* inform the fs via d_prune that this dentry is about to be
* unhashed and destroyed.
@@ -515,10 +515,13 @@ void dput(struct dentry *dentry)
repeat:
if (dentry->d_count == 1)
might_sleep();
+ if (dcount_dec_cmpxchg(dentry))
+ return;
+
spin_lock(&dentry->d_lock);
BUG_ON(!dentry->d_count);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -535,7 +538,7 @@ repeat:
dentry->d_flags |= DCACHE_REFERENCED;
dentry_lru_add(dentry);
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
@@ -606,11 +609,13 @@ EXPORT_SYMBOL(d_invalidate);
/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
{
- dentry->d_count++;
+ dcount_inc(dentry);
}
static inline void __dget(struct dentry *dentry)
{
+ if (dcount_inc_cmpxchg(dentry))
+ return;
spin_lock(&dentry->d_lock);
__dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
@@ -620,22 +625,16 @@ struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
-repeat:
- /*
- * Don't need rcu_dereference because we re-check it was correct under
- * the lock.
- */
rcu_read_lock();
- ret = dentry->d_parent;
- spin_lock(&ret->d_lock);
- if (unlikely(ret != dentry->d_parent)) {
- spin_unlock(&ret->d_lock);
+ ret = rcu_dereference(dentry->d_parent);
+ if (dcount_inc_cmpxchg(ret)) {
rcu_read_unlock();
- goto repeat;
+ return ret;
}
+ spin_lock(&ret->d_lock);
rcu_read_unlock();
BUG_ON(!ret->d_count);
- ret->d_count++;
+ dcount_inc(ret);
spin_unlock(&ret->d_lock);
return ret;
}
@@ -765,7 +764,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
while (dentry) {
spin_lock(&dentry->d_lock);
if (dentry->d_count > 1) {
- dentry->d_count--;
+ dcount_dec(dentry);
spin_unlock(&dentry->d_lock);
return;
}
@@ -1970,7 +1969,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
goto next;
}
- dentry->d_count++;
+ dcount_inc(dentry);
found = dentry;
spin_unlock(&dentry->d_lock);
break;
@@ -2937,7 +2936,7 @@ resume:
}
if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
dentry->d_flags |= DCACHE_GENOCIDE;
- dentry->d_count--;
+ dcount_dec(dentry);
}
spin_unlock(&dentry->d_lock);
}
@@ -2945,7 +2944,7 @@ resume:
struct dentry *child = this_parent;
if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
this_parent->d_flags |= DCACHE_GENOCIDE;
- this_parent->d_count--;
+ dcount_dec(this_parent);
}
this_parent = try_to_ascend(this_parent, locked, seq);
if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 85e40d1..03dcaed 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
*/
BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
BUG_ON(!parent->d_count);
- parent->d_count++;
+ dcount_inc(parent);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..99da5e2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -258,6 +258,99 @@ extern int have_submounts(struct dentry *);
extern void d_rehash(struct dentry *);
/**
+ * dcount_inc, dcount_dec - increment or decrement the dentry reference
+ * count
+ * @dentry: dentry to get a reference to
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_inc((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ atomic_dec((atomic_t *)&dentry->d_count);
+#else
+ dentry->d_count--;
+#endif
+}
+
+/**
+ * dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ *
+ * N.B. For architectures that do not have a cmpxchg instruction, the
+ * substitute code may not perform better than taking the lock.
+ * So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ if ((oldc = dentry->d_count) > 1) {
+ /*
+ * If reference count > 1, we can safely increment its
+ * value using atomic cmpxchg without locking. If the
+ * operation fails, fall back to using the lock.
+ */
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 1)) {
+ newc = oldc + 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
+ * dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ * @dentry: dentry to get a reference to
+ * Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+ int oldc, newc;
+
+ /*
+ * If d_count is bigger than 2, we can safely decrement the
+ * reference count using atomic cmpxchg instruction without locking.
+ * Even if d_lock is taken by a thread running dput() with a parallel
+ * atomic_dec(), the reference count won't go to 0 without anyone
+ * noticing.
+ */
+ if ((oldc = dentry->d_count) > 2) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ /* Retry one more time */
+ if (likely((oldc = ACCESS_ONCE(dentry->d_count)) > 2)) {
+ newc = oldc - 1;
+ if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+ return 1;
+ cpu_relax();
+ }
+ }
+#endif
+ return 0;
+}
+
+/**
* d_add - add dentry to hash queues
* @entry: dentry to add
* @inode: The inode to attach to this dentry
@@ -319,7 +412,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
assert_spin_locked(&dentry->d_lock);
if (!read_seqcount_retry(&dentry->d_seq, seq)) {
ret = 1;
- dentry->d_count++;
+ dcount_inc(dentry);
}
return ret;
@@ -340,7 +433,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
extern char *dentry_path(struct dentry *, char *, int);
/* Allocation counts.. */
-
/**
* dget, dget_dlock - get a reference to a dentry
* @dentry: dentry to get a reference to
@@ -352,13 +444,16 @@ extern char *dentry_path(struct dentry *, char *, int);
static inline struct dentry *dget_dlock(struct dentry *dentry)
{
if (dentry)
- dentry->d_count++;
+ dcount_inc(dentry);
return dentry;
}
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry) {
+ if (dcount_inc_cmpxchg(dentry))
+ return dentry;
+
spin_lock(&dentry->d_lock);
dget_dlock(dentry);
spin_unlock(&dentry->d_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 2/3 v3] dcache: introduce a new sequence read/write lock type
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` Waiman Long
` (5 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: , Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent,
Sage Weil, Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The current sequence lock supports 2 types of lock users:
1. A reader who wants a consistent set of information and is willing
to retry if the information changes. The information that the
reader needs cannot contain pointers, because any writer could
invalidate a pointer that a reader was following. This reader
will never block but they may have to retry if a writer is in
progress.
2. A writer who may need to modify content of a data structure. Writer
blocks only if another writer is in progress.
This type of lock is suitable for cases where there are a large number
of readers and much less writers. However, it has a limitation that
reader who may want to follow pointer or cannot tolerate unexpected
changes in the protected data structure must take the writer lock
even if it doesn't need to make any changes.
To more efficiently support this type of readers, a new lock type is
introduced by this patch: sequence read/write lock. Two types of readers
are supported by this new lock:
1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
a writer is in progress. In turn, it blocks a writer if it is in
progress. Multiple readers of this type can proceed concurrently.
Taking this reader lock won't update the sequence number.
This new lock type is a combination of the sequence lock and read/write
lock. Hence it will have the same limitation of a read/write lock that
writers may be starved if there is a lot of contention.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 0 deletions(-)
create mode 100644 include/linux/seqrwlock.h
diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ * retry if the information changes. The information that the reader
+ * need cannot contain pointers, because any writer could invalidate
+ * a pointer that a reader was following. This reader never block but
+ * they may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ * a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ * blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decided to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+ unsigned sequence;
+ rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+ { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x) \
+ do { \
+ (x)->sequence = 0; \
+ rwlock_init(&(x)->lock); \
+ } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+ seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+ write_lock(&sl->lock);
+ ++sl->sequence;
+ smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+ smp_wmb();
+ sl->sequence++;
+ write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+ int ret = write_trylock(&sl->lock);
+
+ if (ret) {
+ ++sl->sequence;
+ smp_wmb();
+ }
+ return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+ read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+ read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+ return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+ unsigned ret;
+
+repeat:
+ ret = ACCESS_ONCE(sl->sequence);
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+ smp_rmb();
+ return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+ smp_rmb();
+ return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQLOCK_H */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 2/3 v3] dcache: introduce a new sequence read/write lock type
@ 2013-05-23 1:37 ` Waiman Long
0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The current sequence lock supports 2 types of lock users:
1. A reader who wants a consistent set of information and is willing
to retry if the information changes. The information that the
reader needs cannot contain pointers, because any writer could
invalidate a pointer that a reader was following. This reader
will never block but they may have to retry if a writer is in
progress.
2. A writer who may need to modify content of a data structure. Writer
blocks only if another writer is in progress.
This type of lock is suitable for cases where there are a large number
of readers and much less writers. However, it has a limitation that
reader who may want to follow pointer or cannot tolerate unexpected
changes in the protected data structure must take the writer lock
even if it doesn't need to make any changes.
To more efficiently support this type of readers, a new lock type is
introduced by this patch: sequence read/write lock. Two types of readers
are supported by this new lock:
1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
a writer is in progress. In turn, it blocks a writer if it is in
progress. Multiple readers of this type can proceed concurrently.
Taking this reader lock won't update the sequence number.
This new lock type is a combination of the sequence lock and read/write
lock. Hence it will have the same limitation of a read/write lock that
writers may be starved if there is a lot of contention.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 0 deletions(-)
create mode 100644 include/linux/seqrwlock.h
diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ * retry if the information changes. The information that the reader
+ * need cannot contain pointers, because any writer could invalidate
+ * a pointer that a reader was following. This reader never block but
+ * they may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ * a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ * blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decided to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+ unsigned sequence;
+ rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+ { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x) \
+ do { \
+ (x)->sequence = 0; \
+ rwlock_init(&(x)->lock); \
+ } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+ seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+ write_lock(&sl->lock);
+ ++sl->sequence;
+ smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+ smp_wmb();
+ sl->sequence++;
+ write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+ int ret = write_trylock(&sl->lock);
+
+ if (ret) {
+ ++sl->sequence;
+ smp_wmb();
+ }
+ return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+ read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+ read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+ return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+ unsigned ret;
+
+repeat:
+ ret = ACCESS_ONCE(sl->sequence);
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+ smp_rmb();
+ return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+ smp_rmb();
+ return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQLOCK_H */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 2/3 v3] dcache: introduce a new sequence read/write lock type
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
` (2 preceding siblings ...)
2013-05-23 1:37 ` Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock Waiman Long
` (2 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-fsdevel, linux-kernel, autofs, ceph-devel,
linux-cifs, samba-technical, linux-nfs, Chandramouleeswaran,
Aswin, Norton, Scott J, Andi Kleen, Dave Chinner
The current sequence lock supports 2 types of lock users:
1. A reader who wants a consistent set of information and is willing
to retry if the information changes. The information that the
reader needs cannot contain pointers, because any writer could
invalidate a pointer that a reader was following. This reader
will never block but they may have to retry if a writer is in
progress.
2. A writer who may need to modify content of a data structure. Writer
blocks only if another writer is in progress.
This type of lock is suitable for cases where there are a large number
of readers and much less writers. However, it has a limitation that
reader who may want to follow pointer or cannot tolerate unexpected
changes in the protected data structure must take the writer lock
even if it doesn't need to make any changes.
To more efficiently support this type of readers, a new lock type is
introduced by this patch: sequence read/write lock. Two types of readers
are supported by this new lock:
1. Reader who has the same behavior as a sequence lock reader.
2. Reader who may need to follow pointers. This reader will block if
a writer is in progress. In turn, it blocks a writer if it is in
progress. Multiple readers of this type can proceed concurrently.
Taking this reader lock won't update the sequence number.
This new lock type is a combination of the sequence lock and read/write
lock. Hence it will have the same limitation of a read/write lock that
writers may be starved if there is a lot of contention.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/linux/seqrwlock.h | 137 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 0 deletions(-)
create mode 100644 include/linux/seqrwlock.h
diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..c6145ff
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,137 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ * retry if the information changes. The information that the reader
+ * need cannot contain pointers, because any writer could invalidate
+ * a pointer that a reader was following. This reader never block but
+ * they may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ * a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ * blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation layered on top of the regular read/write
+ * lock. There is a chance that the writers may be starved by the readers.
+ * So be careful when you decided to use this lock.
+ *
+ * Expected 1st type reader usage:
+ * do {
+ * seq = read_seqrwbegin(&foo);
+ * ...
+ * } while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ * read_seqrwlock(&foo)
+ * ...
+ * read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ * write_seqrwlock(&foo)
+ * ...
+ * write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+ unsigned sequence;
+ rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+ { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x) \
+ do { \
+ (x)->sequence = 0; \
+ rwlock_init(&(x)->lock); \
+ } while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+ seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+ write_lock(&sl->lock);
+ ++sl->sequence;
+ smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+ smp_wmb();
+ sl->sequence++;
+ write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+ int ret = write_trylock(&sl->lock);
+
+ if (ret) {
+ ++sl->sequence;
+ smp_wmb();
+ }
+ return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+ read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+ read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+ return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+ unsigned ret;
+
+repeat:
+ ret = ACCESS_ONCE(sl->sequence);
+ if (unlikely(ret & 1)) {
+ cpu_relax();
+ goto repeat;
+ }
+ smp_rmb();
+ return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+ smp_rmb();
+ return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQLOCK_H */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
` (3 preceding siblings ...)
2013-05-23 1:37 ` Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` Waiman Long
2013-05-23 9:42 ` [PATCH 0/3 v3] dcache: make it more scalable on large system Dave Chinner
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-fsdevel, linux-kernel, autofs, ceph-devel,
linux-cifs, samba-technical, linux-nfs, Chandramouleeswaran,
Aswin, Norton, Scott J, Andi Kleen, Dave Chinner
The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.
It is unlikely that the frequency of filesystem changes and d_path()
name lookup will be high enough to cause writer starvation, the current
limitation of the read/write lock should be acceptable in that case.
All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.
This patch will have merge conflict When applying to kernel version
earlier than 3.10.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/autofs4/waitq.c | 6 ++--
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 83 ++++++++++++++++++++++++-----------------------
fs/nfs/namespace.c | 6 ++--
include/linux/dcache.h | 4 +-
kernel/auditsc.c | 4 +-
7 files changed, 56 insertions(+), 55 deletions(-)
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..b0c266f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1489,7 +1489,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
retry:
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1539,7 +1539,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
- if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+ if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
"expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5699b50..b672c02 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
cifs_bp_rename_retry:
namelen = dfsplen;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
- if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+ if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cifs_dbg(FYI, "did not end path lookup where expected. namelen=%ddfsplen=%d\n",
namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 470b06f..c96bdb1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
#include <asm/uaccess.h>
#include <linux/security.h>
#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
@@ -1009,7 +1010,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ (!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1038,7 +1039,7 @@ int have_submounts(struct dentry *parent)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
@@ -1081,23 +1082,23 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 0; /* No mount points found in tree */
positive:
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 1;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
EXPORT_SYMBOL(have_submounts);
@@ -1124,7 +1125,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
int found = 0;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1189,10 +1190,10 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return found;
rename_retry:
@@ -1201,7 +1202,7 @@ rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
@@ -1816,7 +1817,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -1884,11 +1885,11 @@ struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
unsigned seq;
do {
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return dentry;
}
EXPORT_SYMBOL(d_lookup);
@@ -1902,7 +1903,7 @@ EXPORT_SYMBOL(d_lookup);
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
@@ -1934,7 +1935,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -2309,9 +2310,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
*/
void d_move(struct dentry *dentry, struct dentry *target)
{
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
__d_move(dentry, target);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);
@@ -2439,7 +2440,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
alias = __d_find_alias(inode, 0);
if (alias) {
actual = alias;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
if (d_ancestor(alias, dentry)) {
/* Check for loops */
@@ -2449,7 +2450,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
/* Is this an anonymous mountpoint that we
* could splice into our tree? */
__d_materialise_dentry(dentry, alias);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
__d_drop(alias);
goto found;
} else {
@@ -2457,7 +2458,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
* aliasing. This drops inode->i_lock */
actual = __d_unalias(inode, dentry, alias);
}
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
if (IS_ERR(actual)) {
if (PTR_ERR(actual) == -ELOOP)
pr_warn_ratelimited(
@@ -2602,9 +2603,9 @@ char *__d_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2623,9 +2624,9 @@ char *d_absolute_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error > 1)
@@ -2691,9 +2692,9 @@ char *d_path(const struct path *path, char *buf, int buflen)
get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2761,9 +2762,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
return retval;
}
@@ -2774,7 +2775,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2782,7 +2783,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2821,7 +2822,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2829,7 +2830,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2850,7 +2851,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}
@@ -2887,7 +2888,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
do {
/* for restarting inner loop in case of seq retry */
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
/*
* Need rcu_readlock to protect against the d_parent trashing
* due to d_move
@@ -2898,7 +2899,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
else
result = 0;
rcu_read_unlock();
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return result;
}
@@ -2910,7 +2911,7 @@ void d_genocide(struct dentry *root)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = root;
spin_lock(&this_parent->d_lock);
@@ -2953,17 +2954,17 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
*--end = '\0';
buflen--;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
while (1) {
spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
spin_unlock(&dentry->d_lock);
dentry = dentry->d_parent;
}
- if (read_seqretry(&rename_lock, seq)) {
+ if (read_seqrwretry(&rename_lock, seq)) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
Elong_unlock:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 99da5e2..5f05815 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
@@ -210,7 +210,7 @@ struct dentry_operations {
#define DCACHE_DENTRY_KILLED 0x100000
-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;
static inline int dname_external(struct dentry *dentry)
{
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3c8a601..d464b67 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1591,7 +1591,7 @@ retry:
drop = NULL;
d = dentry;
rcu_read_lock();
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
for(;;) {
struct inode *inode = d->d_inode;
if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1609,7 +1609,7 @@ retry:
break;
d = parent;
}
- if (unlikely(read_seqretry(&rename_lock, seq) || drop)) { /* in this order */
+ if (unlikely(read_seqrwretry(&rename_lock, seq) || drop)) { /* in this order */
rcu_read_unlock();
if (!drop) {
/* just a race with rename */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
@ 2013-05-23 1:37 ` Waiman Long
2013-05-23 1:37 ` Waiman Long
` (5 subsequent siblings)
6 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: , Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent,
Sage Weil, Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.
It is unlikely that the frequency of filesystem changes and d_path()
name lookup will be high enough to cause writer starvation, the current
limitation of the read/write lock should be acceptable in that case.
All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.
This patch will have merge conflict When applying to kernel version
earlier than 3.10.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/autofs4/waitq.c | 6 ++--
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 83 ++++++++++++++++++++++++-----------------------
fs/nfs/namespace.c | 6 ++--
include/linux/dcache.h | 4 +-
kernel/auditsc.c | 4 +-
7 files changed, 56 insertions(+), 55 deletions(-)
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..b0c266f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1489,7 +1489,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
retry:
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1539,7 +1539,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
- if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+ if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
"expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5699b50..b672c02 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
cifs_bp_rename_retry:
namelen = dfsplen;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
- if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+ if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cifs_dbg(FYI, "did not end path lookup where expected. namelen=%ddfsplen=%d\n",
namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 470b06f..c96bdb1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
#include <asm/uaccess.h>
#include <linux/security.h>
#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
@@ -1009,7 +1010,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ (!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1038,7 +1039,7 @@ int have_submounts(struct dentry *parent)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
@@ -1081,23 +1082,23 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 0; /* No mount points found in tree */
positive:
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 1;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
EXPORT_SYMBOL(have_submounts);
@@ -1124,7 +1125,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
int found = 0;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1189,10 +1190,10 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return found;
rename_retry:
@@ -1201,7 +1202,7 @@ rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
@@ -1816,7 +1817,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -1884,11 +1885,11 @@ struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
unsigned seq;
do {
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return dentry;
}
EXPORT_SYMBOL(d_lookup);
@@ -1902,7 +1903,7 @@ EXPORT_SYMBOL(d_lookup);
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
@@ -1934,7 +1935,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -2309,9 +2310,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
*/
void d_move(struct dentry *dentry, struct dentry *target)
{
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
__d_move(dentry, target);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);
@@ -2439,7 +2440,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
alias = __d_find_alias(inode, 0);
if (alias) {
actual = alias;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
if (d_ancestor(alias, dentry)) {
/* Check for loops */
@@ -2449,7 +2450,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
/* Is this an anonymous mountpoint that we
* could splice into our tree? */
__d_materialise_dentry(dentry, alias);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
__d_drop(alias);
goto found;
} else {
@@ -2457,7 +2458,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
* aliasing. This drops inode->i_lock */
actual = __d_unalias(inode, dentry, alias);
}
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
if (IS_ERR(actual)) {
if (PTR_ERR(actual) == -ELOOP)
pr_warn_ratelimited(
@@ -2602,9 +2603,9 @@ char *__d_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2623,9 +2624,9 @@ char *d_absolute_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error > 1)
@@ -2691,9 +2692,9 @@ char *d_path(const struct path *path, char *buf, int buflen)
get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2761,9 +2762,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
return retval;
}
@@ -2774,7 +2775,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2782,7 +2783,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2821,7 +2822,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2829,7 +2830,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2850,7 +2851,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}
@@ -2887,7 +2888,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
do {
/* for restarting inner loop in case of seq retry */
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
/*
* Need rcu_readlock to protect against the d_parent trashing
* due to d_move
@@ -2898,7 +2899,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
else
result = 0;
rcu_read_unlock();
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return result;
}
@@ -2910,7 +2911,7 @@ void d_genocide(struct dentry *root)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = root;
spin_lock(&this_parent->d_lock);
@@ -2953,17 +2954,17 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
*--end = '\0';
buflen--;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
while (1) {
spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
spin_unlock(&dentry->d_lock);
dentry = dentry->d_parent;
}
- if (read_seqretry(&rename_lock, seq)) {
+ if (read_seqrwretry(&rename_lock, seq)) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
Elong_unlock:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 99da5e2..5f05815 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
@@ -210,7 +210,7 @@ struct dentry_operations {
#define DCACHE_DENTRY_KILLED 0x100000
-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;
static inline int dname_external(struct dentry *dentry)
{
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3c8a601..d464b67 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1591,7 +1591,7 @@ retry:
drop = NULL;
d = dentry;
rcu_read_lock();
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
for(;;) {
struct inode *inode = d->d_inode;
if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1609,7 +1609,7 @@ retry:
break;
d = parent;
}
- if (unlikely(read_seqretry(&rename_lock, seq) || drop)) { /* in this order */
+ if (unlikely(read_seqrwretry(&rename_lock, seq) || drop)) { /* in this order */
rcu_read_unlock();
if (!drop) {
/* just a race with rename */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [PATCH 3/3 v3] dcache: change rename_lock to a sequence read/write lock
@ 2013-05-23 1:37 ` Waiman Long
0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 1:37 UTC (permalink / raw)
To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris
Cc: Waiman Long, linux-cifs, linux-nfs, Norton, Scott J, autofs,
samba-technical, linux-kernel, Chandramouleeswaran, Aswin,
Andi Kleen, Dave Chinner, linux-fsdevel, ceph-devel
The d_path() and related kernel functions currently take a writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to be the new sequence read/write lock, a reader lock
can be taken and multiple d_path() threads can proceed concurrently
without blocking each other.
It is unlikely that the frequency of filesystem changes and d_path()
name lookup will be high enough to cause writer starvation, the current
limitation of the read/write lock should be acceptable in that case.
All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.
This patch will have merge conflict When applying to kernel version
earlier than 3.10.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
fs/autofs4/waitq.c | 6 ++--
fs/ceph/mds_client.c | 4 +-
fs/cifs/dir.c | 4 +-
fs/dcache.c | 83 ++++++++++++++++++++++++-----------------------
fs/nfs/namespace.c | 6 ++--
include/linux/dcache.h | 4 +-
kernel/auditsc.c | 4 +-
7 files changed, 56 insertions(+), 55 deletions(-)
diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 3db70da..3afc4db 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -197,7 +197,7 @@ rename_retry:
buf = *name;
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
spin_lock(&sbi->fs_lock);
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -206,7 +206,7 @@ rename_retry:
if (!len || --len > NAME_MAX) {
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return 0;
}
@@ -222,7 +222,7 @@ rename_retry:
}
spin_unlock(&sbi->fs_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..b0c266f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1489,7 +1489,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
retry:
len = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = dentry; !IS_ROOT(temp);) {
struct inode *inode = temp->d_inode;
@@ -1539,7 +1539,7 @@ retry:
temp = temp->d_parent;
}
rcu_read_unlock();
- if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+ if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
pr_err("build_path did not end path lookup where "
"expected, namelen is %d, pos is %d\n", len, pos);
/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 5699b50..b672c02 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
dfsplen = 0;
cifs_bp_rename_retry:
namelen = dfsplen;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
for (temp = direntry; !IS_ROOT(temp);) {
namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
}
}
rcu_read_unlock();
- if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+ if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
cifs_dbg(FYI, "did not end path lookup where expected. namelen=%ddfsplen=%d\n",
namelen, dfsplen);
/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 470b06f..c96bdb1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
#include <asm/uaccess.h>
#include <linux/security.h>
#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/swap.h>
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
@@ -1009,7 +1010,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
*/
if (new != old->d_parent ||
(old->d_flags & DCACHE_DENTRY_KILLED) ||
- (!locked && read_seqretry(&rename_lock, seq))) {
+ (!locked && read_seqrwretry(&rename_lock, seq))) {
spin_unlock(&new->d_lock);
new = NULL;
}
@@ -1038,7 +1039,7 @@ int have_submounts(struct dentry *parent)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
@@ -1081,23 +1082,23 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 0; /* No mount points found in tree */
positive:
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return 1;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
EXPORT_SYMBOL(have_submounts);
@@ -1124,7 +1125,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
int found = 0;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = parent;
spin_lock(&this_parent->d_lock);
@@ -1189,10 +1190,10 @@ resume:
}
out:
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return found;
rename_retry:
@@ -1201,7 +1202,7 @@ rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
@@ -1816,7 +1817,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -1884,11 +1885,11 @@ struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
unsigned seq;
do {
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
dentry = __d_lookup(parent, name);
if (dentry)
break;
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return dentry;
}
EXPORT_SYMBOL(d_lookup);
@@ -1902,7 +1903,7 @@ EXPORT_SYMBOL(d_lookup);
* __d_lookup is like d_lookup, however it may (rarely) return a
* false-negative result due to unrelated rename activity.
*
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
* however it must be used carefully, eg. with a following d_lookup in
* the case of failure.
*
@@ -1934,7 +1935,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
* It is possible that concurrent renames can mess up our list
* walk here and result in missing our dentry, resulting in the
* false-negative result. d_lookup() protects against concurrent
- * renames using rename_lock seqlock.
+ * renames using rename_lock seqrwlock.
*
* See Documentation/filesystems/path-lookup.txt for more details.
*/
@@ -2309,9 +2310,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
*/
void d_move(struct dentry *dentry, struct dentry *target)
{
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
__d_move(dentry, target);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
}
EXPORT_SYMBOL(d_move);
@@ -2439,7 +2440,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
alias = __d_find_alias(inode, 0);
if (alias) {
actual = alias;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
if (d_ancestor(alias, dentry)) {
/* Check for loops */
@@ -2449,7 +2450,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
/* Is this an anonymous mountpoint that we
* could splice into our tree? */
__d_materialise_dentry(dentry, alias);
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
__d_drop(alias);
goto found;
} else {
@@ -2457,7 +2458,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
* aliasing. This drops inode->i_lock */
actual = __d_unalias(inode, dentry, alias);
}
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
if (IS_ERR(actual)) {
if (PTR_ERR(actual) == -ELOOP)
pr_warn_ratelimited(
@@ -2602,9 +2603,9 @@ char *__d_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2623,9 +2624,9 @@ char *d_absolute_path(const struct path *path,
prepend(&res, &buflen, "\0", 1);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = prepend_path(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error > 1)
@@ -2691,9 +2692,9 @@ char *d_path(const struct path *path, char *buf, int buflen)
get_fs_root(current->fs, &root);
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
error = path_with_deleted(path, &root, &res, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
res = ERR_PTR(error);
@@ -2761,9 +2762,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
{
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
return retval;
}
@@ -2774,7 +2775,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
char *p = NULL;
char *retval;
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (d_unlinked(dentry)) {
p = buf + buflen;
if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2782,7 +2783,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
buflen++;
}
retval = __dentry_path(dentry, buf, buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
if (!IS_ERR(retval) && p)
*p = '/'; /* restore '/' overriden with '\0' */
return retval;
@@ -2821,7 +2822,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -ENOENT;
br_read_lock(&vfsmount_lock);
- write_seqlock(&rename_lock);
+ read_seqrwlock(&rename_lock);
if (!d_unlinked(pwd.dentry)) {
unsigned long len;
char *cwd = page + PAGE_SIZE;
@@ -2829,7 +2830,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
prepend(&cwd, &buflen, "\0", 1);
error = prepend_path(&pwd, &root, &cwd, &buflen);
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
if (error < 0)
@@ -2850,7 +2851,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
error = -EFAULT;
}
} else {
- write_sequnlock(&rename_lock);
+ read_seqrwunlock(&rename_lock);
br_read_unlock(&vfsmount_lock);
}
@@ -2887,7 +2888,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
do {
/* for restarting inner loop in case of seq retry */
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
/*
* Need rcu_readlock to protect against the d_parent trashing
* due to d_move
@@ -2898,7 +2899,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
else
result = 0;
rcu_read_unlock();
- } while (read_seqretry(&rename_lock, seq));
+ } while (read_seqrwretry(&rename_lock, seq));
return result;
}
@@ -2910,7 +2911,7 @@ void d_genocide(struct dentry *root)
unsigned seq;
int locked = 0;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
again:
this_parent = root;
spin_lock(&this_parent->d_lock);
@@ -2953,17 +2954,17 @@ resume:
goto resume;
}
spin_unlock(&this_parent->d_lock);
- if (!locked && read_seqretry(&rename_lock, seq))
+ if (!locked && read_seqrwretry(&rename_lock, seq))
goto rename_retry;
if (locked)
- write_sequnlock(&rename_lock);
+ write_seqrwunlock(&rename_lock);
return;
rename_retry:
if (locked)
goto again;
locked = 1;
- write_seqlock(&rename_lock);
+ write_seqrwlock(&rename_lock);
goto again;
}
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
*--end = '\0';
buflen--;
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
rcu_read_lock();
while (1) {
spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
spin_unlock(&dentry->d_lock);
dentry = dentry->d_parent;
}
- if (read_seqretry(&rename_lock, seq)) {
+ if (read_seqrwretry(&rename_lock, seq)) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
Elong_unlock:
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
- if (read_seqretry(&rename_lock, seq))
+ if (read_seqrwretry(&rename_lock, seq))
goto rename_retry;
Elong:
return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 99da5e2..5f05815 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
#include <linux/rculist.h>
#include <linux/rculist_bl.h>
#include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
#include <linux/cache.h>
#include <linux/rcupdate.h>
@@ -210,7 +210,7 @@ struct dentry_operations {
#define DCACHE_DENTRY_KILLED 0x100000
-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;
static inline int dname_external(struct dentry *dentry)
{
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3c8a601..d464b67 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1591,7 +1591,7 @@ retry:
drop = NULL;
d = dentry;
rcu_read_lock();
- seq = read_seqbegin(&rename_lock);
+ seq = read_seqrwbegin(&rename_lock);
for(;;) {
struct inode *inode = d->d_inode;
if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1609,7 +1609,7 @@ retry:
break;
d = parent;
}
- if (unlikely(read_seqretry(&rename_lock, seq) || drop)) { /* in this order */
+ if (unlikely(read_seqrwretry(&rename_lock, seq) || drop)) { /* in this order */
rcu_read_unlock();
if (!drop) {
/* just a race with rename */
--
1.7.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
2013-05-23 1:37 [PATCH 0/3 v3] dcache: make it more scalable on large system Waiman Long
` (5 preceding siblings ...)
2013-05-23 1:37 ` Waiman Long
@ 2013-05-23 9:42 ` Dave Chinner
2013-05-23 21:34 ` Waiman Long
6 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2013-05-23 9:42 UTC (permalink / raw)
To: Waiman Long
Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
linux-kernel, autofs, ceph-devel, linux-cifs, samba-technical,
linux-nfs, Chandramouleeswaran, Aswin, Norton, Scott J,
Andi Kleen
On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
> Change log:
>
> v2->v3
> - Fix the RCU lock problem found by Al Viro.
> - Rebase the code to the latest v3.10-rc1 linux mainline.
> - Remove patch 4 which may be problematic if the dentry is deleted.
> - Rerun performance measurement using 3.10-rc1 kernel.
>
> v1->v2
> - Include performance improvement in the AIM7 benchmark results because
> of this patch.
> - Modify dget_parent() to avoid taking the lock, if possible, to further
> improve AIM7 benchmark results.
>
> During some perf-record sessions of the kernel running the high_systime
> workload of the AIM7 benchmark, it was found that quite a large portion
> of the spinlock contention was due to the perf_event_mmap_event()
> function itself. This perf kernel function calls d_path() which,
> in turn, call path_get() and dput() indirectly. These 3 functions
> were the hottest functions shown in the perf-report output of
> the _raw_spin_lock() function in an 8-socket system with 80 cores
> (hyperthreading off) with a 3.10-rc1 kernel:
<sigh>
What was it I said about this patchset when you posted it to speed
up an Oracle benchmark back in february? I'll repeat:
"Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor."
And that makes whatever that tracepoint is doing utterly stupid.
Dumping out full paths in a tracepoint is hardly "low overhead", and
that tracepoint should be stomped on from a great height. Sure,
print the filename straight out of the dentry into a tracepoint,
but repeated calculating the full path (probably for the same set of
dentries) is just a dumb thing to do.
Anyway, your numbers show that a single AIM7 microbenchmark goes
better under tracing the specific mmap event that uses d_path(), but
the others are on average a little bit slower. That doesn't convince
me that this is worth doing. Indeed, what happens to performance
when you aren't tracing?
Indeed, have you analysed what makes that
microbenchmark contend so much on the dentry lock while reverse
lookups are occuring? Dentry lock contention does not necessarily
indicate a problem with the dentry locks, and without knowing why
there is contention occuring (esp. compared to the other benchmarks)
we can't really determine if this is a good solution or not...
IOWs, you need more than one microbenchmark that interacts with
some naive monitoring code to justify the complexity these locking
changes introduce....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
2013-05-23 9:42 ` [PATCH 0/3 v3] dcache: make it more scalable on large system Dave Chinner
@ 2013-05-23 21:34 ` Waiman Long
0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 21:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
autofs-u79uwXL29TY76Z2rM5mHXA, ceph-devel-u79uwXL29TY76Z2rM5mHXA,
linux-cifs-u79uwXL29TY76Z2rM5mHXA,
linux-nfs-u79uwXL29TY76Z2rM5mHXA, Chandramouleeswaran, Aswin,
Norton, Scott J, Andi Kleen
On 05/23/2013 05:42 AM, Dave Chinner wrote:
> On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
>> Change log:
>>
>> v2->v3
>> - Fix the RCU lock problem found by Al Viro.
>> - Rebase the code to the latest v3.10-rc1 linux mainline.
>> - Remove patch 4 which may be problematic if the dentry is deleted.
>> - Rerun performance measurement using 3.10-rc1 kernel.
>>
>> v1->v2
>> - Include performance improvement in the AIM7 benchmark results because
>> of this patch.
>> - Modify dget_parent() to avoid taking the lock, if possible, to further
>> improve AIM7 benchmark results.
>>
>> During some perf-record sessions of the kernel running the high_systime
>> workload of the AIM7 benchmark, it was found that quite a large portion
>> of the spinlock contention was due to the perf_event_mmap_event()
>> function itself. This perf kernel function calls d_path() which,
>> in turn, call path_get() and dput() indirectly. These 3 functions
>> were the hottest functions shown in the perf-report output of
>> the _raw_spin_lock() function in an 8-socket system with 80 cores
>> (hyperthreading off) with a 3.10-rc1 kernel:
> What was it I said about this patchset when you posted it to speed
> up an Oracle benchmark back in february? I'll repeat:
>
> "Nobody should be doing reverse dentry-to-name lookups in a quantity
> sufficient for it to become a performance limiting factor."
Thank for the comment, but my point is that it is the d_lock contention
is skewing the data about how much spin lock contention had actually
happened in the workload and it makes it harder to pinpoint problem
areas to look at. This is not about performance, it is about accurate
representation of performance data. Ideally, we want the overhead of
turning on perf instrumentation to be as low as possible.
>
> And that makes whatever that tracepoint is doing utterly stupid.
> Dumping out full paths in a tracepoint is hardly "low overhead", and
> that tracepoint should be stomped on from a great height. Sure,
> print the filename straight out of the dentry into a tracepoint,
> but repeated calculating the full path (probably for the same set of
> dentries) is just a dumb thing to do.
>
> Anyway, your numbers show that a single AIM7 microbenchmark goes
> better under tracing the specific mmap event that uses d_path(), but
> the others are on average a little bit slower. That doesn't convince
> me that this is worth doing. Indeed, what happens to performance
> when you aren't tracing?
>
> Indeed, have you analysed what makes that
> microbenchmark contend so much on the dentry lock while reverse
> lookups are occuring? Dentry lock contention does not necessarily
> indicate a problem with the dentry locks, and without knowing why
> there is contention occuring (esp. compared to the other benchmarks)
> we can't really determine if this is a good solution or not...
What made it contend so much was the large number of CPUs available in
my test system which is a 8-socket Westmere EX machines with 80 cores.
As perf was collecting data from every core, the threads will
unavoidably bump into each other to translate dentries back to the full
paths. The current code only allows one CPU in the d_path() critical
path. My patch will allow all of them to be in the critical path
concurrently.
The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So even a
4-socket machine will have up to 60 cores or 120 virtual CPUs if
hyperthreading is turned on.
> IOWs, you need more than one microbenchmark that interacts with
> some naive monitoring code to justify the complexity these locking
> changes introduce....
The first patch can also independently improve the performance of a
number of AIM7 workloads including almost 7X improvement in the short
workload. More detailed information of these types of performance
benefit was discussed in the patch description of the first patch. I
will try to collect more performance improvement data on other workloads
too.
Thank for the review.
Longman
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
@ 2013-05-23 21:34 ` Waiman Long
0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2013-05-23 21:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen
On 05/23/2013 05:42 AM, Dave Chinner wrote:
> On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
>> Change log:
>>
>> v2->v3
>> - Fix the RCU lock problem found by Al Viro.
>> - Rebase the code to the latest v3.10-rc1 linux mainline.
>> - Remove patch 4 which may be problematic if the dentry is deleted.
>> - Rerun performance measurement using 3.10-rc1 kernel.
>>
>> v1->v2
>> - Include performance improvement in the AIM7 benchmark results because
>> of this patch.
>> - Modify dget_parent() to avoid taking the lock, if possible, to further
>> improve AIM7 benchmark results.
>>
>> During some perf-record sessions of the kernel running the high_systime
>> workload of the AIM7 benchmark, it was found that quite a large portion
>> of the spinlock contention was due to the perf_event_mmap_event()
>> function itself. This perf kernel function calls d_path() which,
>> in turn, call path_get() and dput() indirectly. These 3 functions
>> were the hottest functions shown in the perf-report output of
>> the _raw_spin_lock() function in an 8-socket system with 80 cores
>> (hyperthreading off) with a 3.10-rc1 kernel:
> What was it I said about this patchset when you posted it to speed
> up an Oracle benchmark back in february? I'll repeat:
>
> "Nobody should be doing reverse dentry-to-name lookups in a quantity
> sufficient for it to become a performance limiting factor."
Thank for the comment, but my point is that it is the d_lock contention
is skewing the data about how much spin lock contention had actually
happened in the workload and it makes it harder to pinpoint problem
areas to look at. This is not about performance, it is about accurate
representation of performance data. Ideally, we want the overhead of
turning on perf instrumentation to be as low as possible.
>
> And that makes whatever that tracepoint is doing utterly stupid.
> Dumping out full paths in a tracepoint is hardly "low overhead", and
> that tracepoint should be stomped on from a great height. Sure,
> print the filename straight out of the dentry into a tracepoint,
> but repeated calculating the full path (probably for the same set of
> dentries) is just a dumb thing to do.
>
> Anyway, your numbers show that a single AIM7 microbenchmark goes
> better under tracing the specific mmap event that uses d_path(), but
> the others are on average a little bit slower. That doesn't convince
> me that this is worth doing. Indeed, what happens to performance
> when you aren't tracing?
>
> Indeed, have you analysed what makes that
> microbenchmark contend so much on the dentry lock while reverse
> lookups are occuring? Dentry lock contention does not necessarily
> indicate a problem with the dentry locks, and without knowing why
> there is contention occuring (esp. compared to the other benchmarks)
> we can't really determine if this is a good solution or not...
What made it contend so much was the large number of CPUs available in
my test system which is a 8-socket Westmere EX machines with 80 cores.
As perf was collecting data from every core, the threads will
unavoidably bump into each other to translate dentries back to the full
paths. The current code only allows one CPU in the d_path() critical
path. My patch will allow all of them to be in the critical path
concurrently.
The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So even a
4-socket machine will have up to 60 cores or 120 virtual CPUs if
hyperthreading is turned on.
> IOWs, you need more than one microbenchmark that interacts with
> some naive monitoring code to justify the complexity these locking
> changes introduce....
The first patch can also independently improve the performance of a
number of AIM7 workloads including almost 7X improvement in the short
workload. More detailed information of these types of performance
benefit was discussed in the patch description of the first patch. I
will try to collect more performance improvement data on other workloads
too.
Thank for the review.
Longman
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
2013-05-23 21:34 ` Waiman Long
(?)
@ 2013-05-27 2:09 ` Dave Chinner
2013-05-29 15:55 ` Waiman Long
-1 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2013-05-27 2:09 UTC (permalink / raw)
To: Waiman Long
Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen
On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
> On 05/23/2013 05:42 AM, Dave Chinner wrote:
> >On Wed, May 22, 2013 at 09:37:25PM -0400, Waiman Long wrote:
> >>Change log:
> >>
> >>v2->v3
> >> - Fix the RCU lock problem found by Al Viro.
> >> - Rebase the code to the latest v3.10-rc1 linux mainline.
> >> - Remove patch 4 which may be problematic if the dentry is deleted.
> >> - Rerun performance measurement using 3.10-rc1 kernel.
> >>
> >>v1->v2
> >> - Include performance improvement in the AIM7 benchmark results because
> >> of this patch.
> >> - Modify dget_parent() to avoid taking the lock, if possible, to further
> >> improve AIM7 benchmark results.
> >>
> >>During some perf-record sessions of the kernel running the high_systime
> >>workload of the AIM7 benchmark, it was found that quite a large portion
> >>of the spinlock contention was due to the perf_event_mmap_event()
> >>function itself. This perf kernel function calls d_path() which,
> >>in turn, call path_get() and dput() indirectly. These 3 functions
> >>were the hottest functions shown in the perf-report output of
> >>the _raw_spin_lock() function in an 8-socket system with 80 cores
> >>(hyperthreading off) with a 3.10-rc1 kernel:
> >What was it I said about this patchset when you posted it to speed
> >up an Oracle benchmark back in february? I'll repeat:
> >
> >"Nobody should be doing reverse dentry-to-name lookups in a quantity
> >sufficient for it to become a performance limiting factor."
>
> Thank for the comment, but my point is that it is the d_lock
> contention is skewing the data about how much spin lock contention
> had actually happened in the workload and it makes it harder to
> pinpoint problem areas to look at. This is not about performance, it
> is about accurate representation of performance data. Ideally, we
> want the overhead of turning on perf instrumentation to be as low as
> possible.
Right. But d_path will never be "low overhead", and as such it
shouldn't be used by perf.
> >And that makes whatever that tracepoint is doing utterly stupid.
> >Dumping out full paths in a tracepoint is hardly "low overhead", and
> >that tracepoint should be stomped on from a great height. Sure,
> >print the filename straight out of the dentry into a tracepoint,
> >but repeated calculating the full path (probably for the same set of
> >dentries) is just a dumb thing to do.
> >
> >Anyway, your numbers show that a single AIM7 microbenchmark goes
> >better under tracing the specific mmap event that uses d_path(), but
> >the others are on average a little bit slower. That doesn't convince
> >me that this is worth doing. Indeed, what happens to performance
> >when you aren't tracing?
> >
> >Indeed, have you analysed what makes that
> >microbenchmark contend so much on the dentry lock while reverse
> >lookups are occuring? Dentry lock contention does not necessarily
> >indicate a problem with the dentry locks, and without knowing why
> >there is contention occuring (esp. compared to the other benchmarks)
> >we can't really determine if this is a good solution or not...
>
> What made it contend so much was the large number of CPUs available
> in my test system which is a 8-socket Westmere EX machines with 80
> cores. As perf was collecting data from every core, the threads will
> unavoidably bump into each other to translate dentries back to the
> full paths. The current code only allows one CPU in the d_path()
> critical path. My patch will allow all of them to be in the critical
> path concurrently.
Yes, I know *how* contention occurs and what your patch does. I'm
asking you to explain *why* we need to fix this specific workload,
and why the tracepoint that calls d_path() actually needs to do
that.
Your numbers indicate that your patchset decreases peformance in the
common, non-d_path intensive workloads, so why should we add all
this complexity to optimise a slow path at the expense of
significant complexity and lower performance in the fast
path?
> The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So
> even a 4-socket machine will have up to 60 cores or 120 virtual CPUs
> if hyperthreading is turned on.
I don't see why that matters. I've been dealing with systems much
larger than that since early 2002, adn the process for delaing with
lock contention hasn't changed much in the last 10 years. First we
need to determine what you are doing is something that is sane,
determining whether there's a simpler fix than changing locking, and
whether it's actually any faster in the common case we care
about....
> >IOWs, you need more than one microbenchmark that interacts with
> >some naive monitoring code to justify the complexity these locking
> >changes introduce....
> The first patch can also independently improve the performance of a
> number of AIM7 workloads including almost 7X improvement in the
> short workload. More detailed information of these types of
> performance benefit was discussed in the patch description of the
> first patch. I will try to collect more performance improvement data
> on other workloads too.
>
> Thank for the review.
I haven't reviewed anything. All I've made comments about you
needing to justifying the complexity of the change.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/3 v3] dcache: make it more scalable on large system
2013-05-27 2:09 ` Dave Chinner
@ 2013-05-29 15:55 ` Waiman Long
[not found] ` <51A624E2.3000301-VXdhtT5mjnY@public.gmane.org>
0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2013-05-29 15:55 UTC (permalink / raw)
To: Dave Chinner
Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ian Kent, Sage Weil,
Steve French, Trond Myklebust, Eric Paris, linux-fsdevel,
linux-kernel, autofs, ceph-devel, linux-cifs, linux-nfs,
Chandramouleeswaran, Aswin, Norton, Scott J, Andi Kleen
On 05/26/2013 10:09 PM, Dave Chinner wrote:
> On Thu, May 23, 2013 at 05:34:23PM -0400, Waiman Long wrote:
>> On 05/23/2013 05:42 AM, Dave Chinner wrote:
>>>
>>> What was it I said about this patchset when you posted it to speed
>>> up an Oracle benchmark back in february? I'll repeat:
>>>
>>> "Nobody should be doing reverse dentry-to-name lookups in a quantity
>>> sufficient for it to become a performance limiting factor."
>> Thank for the comment, but my point is that it is the d_lock
>> contention is skewing the data about how much spin lock contention
>> had actually happened in the workload and it makes it harder to
>> pinpoint problem areas to look at. This is not about performance, it
>> is about accurate representation of performance data. Ideally, we
>> want the overhead of turning on perf instrumentation to be as low as
>> possible.
> Right. But d_path will never be "low overhead", and as such it
> shouldn't be used by perf.
The d_path() is called by perf_event_mmap_event() which translates VMA
to its file path for memory segments backed by files. As perf is not
just for sampling data within the kernel, it can also be used for
checking access pattern in the user space. As a result, it needs to map
VMAs back to the backing files to access their symbols information. If
d_path() is not the right function to call for this purpose, what other
alternatives do we have?
There may be ways to reduce calls to d_path() by doing some kind of
caching, but that will certainly increase the complexity of the perf code.
>>> And that makes whatever that tracepoint is doing utterly stupid.
>>> Dumping out full paths in a tracepoint is hardly "low overhead", and
>>> that tracepoint should be stomped on from a great height. Sure,
>>> print the filename straight out of the dentry into a tracepoint,
>>> but repeated calculating the full path (probably for the same set of
>>> dentries) is just a dumb thing to do.
>>>
>>> Anyway, your numbers show that a single AIM7 microbenchmark goes
>>> better under tracing the specific mmap event that uses d_path(), but
>>> the others are on average a little bit slower. That doesn't convince
>>> me that this is worth doing. Indeed, what happens to performance
>>> when you aren't tracing?
>>>
>>> Indeed, have you analysed what makes that
>>> microbenchmark contend so much on the dentry lock while reverse
>>> lookups are occuring? Dentry lock contention does not necessarily
>>> indicate a problem with the dentry locks, and without knowing why
>>> there is contention occuring (esp. compared to the other benchmarks)
>>> we can't really determine if this is a good solution or not...
>> What made it contend so much was the large number of CPUs available
>> in my test system which is a 8-socket Westmere EX machines with 80
>> cores. As perf was collecting data from every core, the threads will
>> unavoidably bump into each other to translate dentries back to the
>> full paths. The current code only allows one CPU in the d_path()
>> critical path. My patch will allow all of them to be in the critical
>> path concurrently.
> Yes, I know *how* contention occurs and what your patch does. I'm
> asking you to explain *why* we need to fix this specific workload,
> and why the tracepoint that calls d_path() actually needs to do
> that.
My patch set consists of 2 different changes. The first one is to avoid
taking the d_lock lock when updating the reference count in the
dentries. This particular change also benefit some other workloads that
are filesystem intensive. One particular example is the short workload
in the AIM7 benchmark. One of the job type in the short workload is
"misc_rtns_1" which calls security functions like getpwnam(),
getpwuid(), getgrgid() a couple of times. These functions open the
/etc/passwd or /etc/group files, read their content and close the files.
It is the intensive open/read/close sequence from multiple threads that
is causing 80%+ contention in the d_lock on a system with large number
of cores. The MIT's MOSBench paper also outlined dentry reference
counting as a scalability roadblock for its Apache and Exim tests. So
that change will certainly help workloads that are dcache reference
counting intensive.
The second rename_lock change is mainly for reducing the d_path lock
contention for perf and some other applications that may need to access
/proc system files like maps or numa_maps frequently. If you think the
rename_lock change is not worth the effort to just benefit perf, I would
still like to see the first one go in as it can certainly benefit other
workload.
> Your numbers indicate that your patchset decreases peformance in the
> common, non-d_path intensive workloads, so why should we add all
> this complexity to optimise a slow path at the expense of
> significant complexity and lower performance in the fast
> path?
My numbers didn't actually indicate a performance regression in other
common workloads. They just indicate that there wasn't a significant
changes in performance as + or - a few percentages here and there are
within the margin of errors for the tests that I used.
>> The upcoming Ivy-Bridge EX can have up to 15 cores per socket. So
>> even a 4-socket machine will have up to 60 cores or 120 virtual CPUs
>> if hyperthreading is turned on.
> I don't see why that matters. I've been dealing with systems much
> larger than that since early 2002, adn the process for delaing with
> lock contention hasn't changed much in the last 10 years. First we
> need to determine what you are doing is something that is sane,
> determining whether there's a simpler fix than changing locking, and
> whether it's actually any faster in the common case we care
> about....
I certainly agree with that. My testing indicated no change in
performance for the common case, but significant speed-up in some
scenarios with large number of CPUs.
Regards,
Longman
^ permalink raw reply [flat|nested] 42+ messages in thread