* [PATCH 0/4] dcache: make Oracle more scalable on large systems
@ 2013-02-19 18:50 Waiman Long
  2013-02-19 18:50 ` [PATCH 1/4] dcache: Don't take unnecessary lock in d_count update Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-19 18:50 UTC (permalink / raw)
  To: linux-fsdevel, Alexander Viro; +Cc: Waiman Long, linux-kernel

It was found that the Oracle database software issues a lot of calls
to the seq_path() kernel function, which translates a (dentry, mnt)
pair into an absolute path. The seq_path() function eventually takes
the following two locks:

1. dentry->d_lock (spinlock) from dget()/dput()
2. rename_lock    (seqlock)  from d_path()

With a lot of database activity, the spinning on these two locks takes
a major portion of the kernel time and slows down the database software.

This set of patches is designed to minimize the locking overhead of
this code path and improve Oracle performance on systems with a large
number of CPUs.

The current kernel takes the dentry->d_lock lock whenever it wants to
increment or decrement the d_count reference count. However, nothing
significant really happens until the reference count goes all the way
down to 1 or 0. In fact, we don't need to take the lock when the
reference count is greater than 1. Instead, the atomic cmpxchg()
function can be used to increment or decrement the count in these
situations. For safety, the other reference count update operations
have to be changed to use atomic instructions as well.
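
In essence, the lockless fast path looks like the following simplified
sketch (the actual helpers added by patch 1 below are
dcount_inc_cmpxchg() and dcount_dec_cmpxchg()):

	int oldc = dentry->d_count;

	/* lockless increment is safe only while the count stays above 1 */
	if (oldc > 1 && cmpxchg(&dentry->d_count, oldc, oldc + 1) == oldc)
		return;				/* fast path, no d_lock */

	spin_lock(&dentry->d_lock);		/* fall back to the lock */
	dentry->d_count++;
	spin_unlock(&dentry->d_lock);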

The rename_lock is a sequence lock. The d_path() function takes the
writer lock because it needs to traverse different dentries through
pointers to build the full path name; hence it can't tolerate changes
to those pointers. But taking the writer lock also prevents multiple
d_path() calls from proceeding concurrently.

A solution is to introduce a new lock type with a second kind of
reader that can block writers - the sequence read/write lock
(seqrwlock). The d_path() and related functions are then changed to
take the reader lock instead of the writer lock, allowing multiple
d_path() operations to proceed concurrently.
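
For example, with the new lock, the locking in __d_path() becomes
(as patch 3 below implements):

	read_seqrwlock(&rename_lock);	/* blocks rename, not other readers */
	error = prepend_path(path, root, &res, &buflen);
	read_seqrwunlock(&rename_lock);

instead of the current write_seqlock()/write_sequnlock() pair.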

Performance testing was done using the Oracle SLOB benchmark with the
latest 11.2.0.3 release of Oracle on a 3.8-rc3 kernel. Database files
were put in a tmpfs partition to minimize physical I/O overhead. Huge
pages were used with 30GB of SGA. The test machine was an 8-socket,
80-core HP Proliant DL980 with 1TB of memory and hyperthreading off.
The tests were run 5 times and the averages were taken.

The patches have only a slight positive impact on logical read
performance. The impact on write (redo size) performance, however,
is much greater. The redo size is a proxy for how much database
writing has happened, so a larger value means a higher transaction
rate.

+---------+---------+-------------+------------+----------+
| Readers | Writers | Redo Size   | Redo Size  | % Change |
|         |         | w/o patch   | with patch |          |
|         |         |   (MB/s)    |   (MB/s)   |          |
+---------+---------+-------------+------------+----------+
|    8    |   64    |     802     |    903     |  12.6%   |
|   32    |   64    |     798     |    892     |  11.8%   |
|   80    |   64    |     658     |    714     |   8.5%   |
|  128    |   64    |     748     |    907     |  21.3%   |
+---------+---------+-------------+------------+----------+

The table below shows the %system and %user times reported by Oracle's
AWR tool, as well as the %time spent in kernel spinlock code, with the
patch (outside parentheses) and without it (inside parentheses).

+---------+---------+------------+------------+------------+
| Readers | Writers |  % System  |   % User   | % spinlock |
+---------+---------+------------+------------+------------+
|   32    |    0    |  0.3(0.3)  | 39.0(39.0) |  6.3(17.4) |
|   80    |    0    |  0.7(0.7)  | 97.4(94.2) |  2.9(31.7) |
|  128    |    0    |  1.4(1.4)  | 34.4(32.2) | 43.5(62.2) |
|   32    |   64    |  3.8(3.5)  | 55.4(53.6) |  9.1(35.0) |
|   80    |   64    |  3.0(2.9)  | 94.4(93.9) |  4.5(38.8) |
|  128    |   64    |  4.7(4.3)  | 38.2(40.3) | 34.8(58.7) |
+---------+---------+------------+------------+------------+

The following tests with multiple threads were also run on kernels
with and without the patch, both on the DL980 and on a PC with a
4-core i5 processor:

1. find $HOME -size 0b
2. cat /proc/*/maps /proc/*/numa_maps
3. git diff

For both the find-size and cat-maps tests, the performance difference
with a hot cache was within a few percentage points and hence within
the margin of error. Single-thread performance was slightly worse,
but multithreaded performance was generally a bit better. Apparently,
reference count updates aren't a significant factor in those tests.
Their perf traces indicate that there was less spinlock contention in
functions like dput(), but the functions themselves ran a little bit
longer on average.

The git-diff test showed no difference in performance. There was a
slight increase in system time, compensated by a slight decrease in
user time.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Waiman Long (4):
  dcache: Don't take unnecessary lock in d_count update
  dcache: introduce a new sequence read/write lock type
  dcache: change rename_lock to a sequence read/write lock
  dcache: don't need to take d_lock in prepend_path()

 fs/autofs4/waitq.c        |    6 +-
 fs/ceph/mds_client.c      |    4 +-
 fs/cifs/dir.c             |    4 +-
 fs/dcache.c               |  113 +++++++++++++++++++------------------
 fs/namei.c                |    2 +-
 fs/nfs/namespace.c        |    6 +-
 include/linux/dcache.h    |  109 ++++++++++++++++++++++++++++++++++--
 include/linux/seqrwlock.h |  138 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/auditsc.c          |    5 +-
 9 files changed, 315 insertions(+), 72 deletions(-)
 create mode 100644 include/linux/seqrwlock.h


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] dcache: Don't take unnecessary lock in d_count update
  2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
@ 2013-02-19 18:50 ` Waiman Long
  2013-02-19 18:50 ` [PATCH 2/4] dcache: introduce a new sequence read/write lock type Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-19 18:50 UTC (permalink / raw)
  To: linux-fsdevel, Alexander Viro, Jeff Layton, Miklos Szeredi
  Cc: Waiman Long, linux-kernel

The current code takes the dentry's d_lock whenever the d_count
reference count is being updated. In reality, nothing significant
happens until d_count goes to 0 in dput(). So it is not necessary to
take the lock if the reference count won't go to 0.

Without a lock, multiple threads may update d_count simultaneously.
Therefore, atomic instructions must be used to ensure consistency,
except in shrink_dcache_for_umount*(), where the whole superblock is
being unmounted and locking is not needed.

The worst-case scenarios are:

1. d_lock is taken in dput() with d_count = 2 in one thread while
   another thread atomically decrements d_count without taking the
   lock. This may result in a d_count of 0 with no deleting action
   taken.

2. d_lock is taken in dput() with d_count = 1 in one thread while
   another thread atomically increments d_count without taking the
   lock. This may result in the dentry being in the deleted state
   while having a d_count of 1.

Without taking a lock, we need to make sure that a decrement or
increment is not performed while other threads are updating d_count
simultaneously. This can be done with the atomic cmpxchg instruction,
which fails if the underlying value has changed. If the lock is taken,
it is safe to use a simpler atomic increment or decrement instruction.

To make sure that the above worst-case scenarios cannot happen, the
dget() function must take the lock if d_count <= 1. Similarly, the
dput() function must take the lock if d_count <= 2. The cmpxchg()
update of d_count is tried twice before falling back to the lock,
as there is a fairly good chance that a cmpxchg() may fail in a busy
situation.
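
Putting these rules together, the dget() fast path becomes the
following simplified sketch (the full version, including the retry
logic, is in the patch below):

	static inline struct dentry *dget(struct dentry *dentry)
	{
		if (dentry) {
			/* lockless increment, only when d_count > 1 */
			if (dcount_inc_cmpxchg(dentry))
				return dentry;
			/* d_count <= 1 or cmpxchg failed: take the lock */
			spin_lock(&dentry->d_lock);
			dcount_inc(dentry);
			spin_unlock(&dentry->d_lock);
		}
		return dentry;
	}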

Finally, the CPU must have an instruction-level cmpxchg, or the
emulated cmpxchg() function may be too expensive to use. Therefore,
the above changes are only applied if the __HAVE_ARCH_CMPXCHG flag is
set. Most of the major architectures supported by Linux have this
flag set, with the notable exception of ARM.

As for the performance of the updated reference counting code, it all
depends on whether the cmpxchg instruction is used. The original code
executes 2 atomic instructions to lock and unlock the spinlock. The
new code path executes either 1 atomic cmpxchg instruction, or 3
atomic instructions if the lock has to be taken. Depending on how
frequently the cmpxchg path is used (d_count > 1 or 2), the new code
can be faster or slower than the original.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 fs/dcache.c            |   23 ++++++----
 fs/namei.c             |    2 +-
 include/linux/dcache.h |  105 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 117 insertions(+), 13 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 19153a0..20cc789 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -484,7 +484,7 @@ relock:
 	}
 
 	if (ref)
-		dentry->d_count--;
+		dcount_dec(dentry);
 	/*
 	 * if dentry was on the d_lru list delete it from there.
 	 * inform the fs via d_prune that this dentry is about to be
@@ -530,10 +530,13 @@ void dput(struct dentry *dentry)
 repeat:
 	if (dentry->d_count == 1)
 		might_sleep();
+	if (dcount_dec_cmpxchg(dentry))
+		return;
+
 	spin_lock(&dentry->d_lock);
 	BUG_ON(!dentry->d_count);
 	if (dentry->d_count > 1) {
-		dentry->d_count--;
+		dcount_dec(dentry);
 		spin_unlock(&dentry->d_lock);
 		return;
 	}
@@ -550,7 +553,7 @@ repeat:
 	dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
-	dentry->d_count--;
+	dcount_dec(dentry);
 	spin_unlock(&dentry->d_lock);
 	return;
 
@@ -621,11 +624,13 @@ EXPORT_SYMBOL(d_invalidate);
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
 {
-	dentry->d_count++;
+	dcount_inc(dentry);
 }
 
 static inline void __dget(struct dentry *dentry)
 {
+	if (dcount_inc_cmpxchg(dentry))
+		return;
 	spin_lock(&dentry->d_lock);
 	__dget_dlock(dentry);
 	spin_unlock(&dentry->d_lock);
@@ -650,7 +655,7 @@ repeat:
 	}
 	rcu_read_unlock();
 	BUG_ON(!ret->d_count);
-	ret->d_count++;
+	dcount_inc(ret);
 	spin_unlock(&ret->d_lock);
 	return ret;
 }
@@ -782,7 +787,7 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	while (dentry) {
 		spin_lock(&dentry->d_lock);
 		if (dentry->d_count > 1) {
-			dentry->d_count--;
+			dcount_dec(dentry);
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -1980,7 +1985,7 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 				goto next;
 		}
 
-		dentry->d_count++;
+		dcount_inc(dentry);
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
@@ -2984,7 +2989,7 @@ resume:
 		}
 		if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
 			dentry->d_flags |= DCACHE_GENOCIDE;
-			dentry->d_count--;
+			dcount_dec(dentry);
 		}
 		spin_unlock(&dentry->d_lock);
 	}
@@ -2992,7 +2997,7 @@ resume:
 		struct dentry *child = this_parent;
 		if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
 			this_parent->d_flags |= DCACHE_GENOCIDE;
-			this_parent->d_count--;
+			dcount_dec(this_parent);
 		}
 		this_parent = try_to_ascend(this_parent, locked, seq);
 		if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 43a97ee..a1d5a36 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -537,7 +537,7 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
 		 */
 		BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
 		BUG_ON(!parent->d_count);
-		parent->d_count++;
+		dcount_inc(parent);
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&parent->d_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c1754b5..09495ba 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -255,6 +255,103 @@ extern int have_submounts(struct dentry *);
 extern void d_rehash(struct dentry *);
 
 /**
+ *	dcount_inc, dcount_dec - increment or decrement the dentry reference
+ *				 count
+ *	@dentry: dentry to get a reference to
+ */
+static inline void dcount_inc(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+	atomic_inc((atomic_t *)&dentry->d_count);
+#else
+	dentry->d_count++;
+#endif
+}
+
+static inline void dcount_dec(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+	atomic_dec((atomic_t *)&dentry->d_count);
+#else
+	dentry->d_count--;
+#endif
+}
+
+/**
+ *	dcount_inc_cmpxchg - increment dentry reference count using cmpxchg
+ *	@dentry: dentry to get a reference to
+ *	Return : 0 on failure, else 1
+ *
+ *	N.B. For architectures that do not have a cmpxchg instruction, the
+ *	     substitute code may not perform better than taking the lock.
+ *	     So this cmpxchg code path is disabled for those cases.
+ */
+static inline int dcount_inc_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+	int oldc = dentry->d_count,
+	    newc;
+
+	if (oldc > 1) {
+		/*
+		 * If reference count > 1, we can safely increment its
+		 * value using atomic cmpxchg without locking. If the
+		 * operation fails, fall back to using the lock.
+		 */
+		newc = oldc + 1;
+		if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+			return 1;
+		cpu_relax();
+		/* Retry one more time */
+		oldc = ACCESS_ONCE(dentry->d_count);
+		if (likely(oldc > 1)) {
+			newc = oldc + 1;
+			if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+				return 1;
+			cpu_relax();
+		}
+	}
+#endif
+	return 0;
+}
+
+/**
+ *	dcount_dec_cmpxchg - decrement dentry reference count using cmpxchg
+ *	@dentry: dentry to drop a reference from
+ *	Return : 0 on failure, else 1
+ */
+static inline int dcount_dec_cmpxchg(struct dentry *dentry)
+{
+#ifdef __HAVE_ARCH_CMPXCHG
+	int oldc = dentry->d_count,
+	    newc;
+
+	/*
+	 * If d_count is bigger than 2, we can safely decrement the
+	 * reference count using atomic cmpxchg instruction without locking.
+	 * Even if d_lock is taken by a thread running dput() with a parallel
+	 * atomic_dec(), the reference count won't go to 0 without anyone
+	 * noticing.
+	 */
+	if (oldc > 2) {
+		newc = oldc - 1;
+		if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+			return 1;
+		cpu_relax();
+		/* Retry one more time */
+		oldc = ACCESS_ONCE(dentry->d_count);
+		if (likely(oldc > 2)) {
+			newc = oldc - 1;
+			if (cmpxchg(&dentry->d_count, oldc, newc) == oldc)
+				return 1;
+			cpu_relax();
+		}
+	}
+#endif
+	return 0;
+}
+
+/**
  * d_add - add dentry to hash queues
  * @entry: dentry to add
  * @inode: The inode to attach to this dentry
@@ -316,7 +413,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
 	assert_spin_locked(&dentry->d_lock);
 	if (!read_seqcount_retry(&dentry->d_seq, seq)) {
 		ret = 1;
-		dentry->d_count++;
+		dcount_inc(dentry);
 	}
 
 	return ret;
@@ -338,7 +435,6 @@ extern char *dentry_path_raw(struct dentry *, char *, int);
 extern char *dentry_path(struct dentry *, char *, int);
 
 /* Allocation counts.. */
-
 /**
  *	dget, dget_dlock -	get a reference to a dentry
  *	@dentry: dentry to get a reference to
@@ -350,13 +446,16 @@ extern char *dentry_path(struct dentry *, char *, int);
 static inline struct dentry *dget_dlock(struct dentry *dentry)
 {
 	if (dentry)
-		dentry->d_count++;
+		dcount_inc(dentry);
 	return dentry;
 }
 
 static inline struct dentry *dget(struct dentry *dentry)
 {
 	if (dentry) {
+		if (dcount_inc_cmpxchg(dentry))
+			return dentry;
+
 		spin_lock(&dentry->d_lock);
 		dget_dlock(dentry);
 		spin_unlock(&dentry->d_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] dcache: introduce a new sequence read/write lock type
  2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
  2013-02-19 18:50 ` [PATCH 1/4] dcache: Don't take unnecessary lock in d_count update Waiman Long
@ 2013-02-19 18:50 ` Waiman Long
  2013-02-19 18:50   ` Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-19 18:50 UTC (permalink / raw)
  To: linux-fsdevel, Alexander Viro; +Cc: Waiman Long, linux-kernel

The current sequence lock supports 2 types of lock users:

1. A reader who wants a consistent set of information and is willing
   to retry if the information changes. The information that the
   reader needs cannot contain pointers, because any writer could
   invalidate a pointer that a reader was following. This reader
   never blocks, but it may have to retry if a writer is in progress.
2. A writer who may need to modify the content of a data structure.
   A writer blocks only if another writer is in progress.

This type of lock is suitable for cases where there are a large number
of readers and far fewer writers. However, it has a limitation: a
reader who wants to follow pointers, or who cannot tolerate unexpected
changes in the protected data structure, must take the writer lock
even if it doesn't need to make any changes.

To support this kind of reader more efficiently, this patch introduces
a new lock type: the sequence read/write lock. Two types of readers
are supported by the new lock:

1. A reader who has the same behavior as a sequence lock reader.
2. A reader who may need to follow pointers. This reader blocks if a
   writer is in progress and, in turn, blocks writers while it is in
   progress. Multiple readers of this type can proceed concurrently.
   Taking this reader lock won't update the sequence number.

This new lock type is a combination of the sequence lock and the
read/write lock. Hence it has the same limitation as a read/write
lock: writers may be starved if there is a lot of contention.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/linux/seqrwlock.h |  138 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 138 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/seqrwlock.h

diff --git a/include/linux/seqrwlock.h b/include/linux/seqrwlock.h
new file mode 100644
index 0000000..3ff5119
--- /dev/null
+++ b/include/linux/seqrwlock.h
@@ -0,0 +1,138 @@
+#ifndef __LINUX_SEQRWLOCK_H
+#define __LINUX_SEQRWLOCK_H
+/*
+ * Sequence Read/Write Lock
+ * ------------------------
+ * This new lock type is a combination of the sequence lock and read/write
+ * lock. Three types of lock users are supported:
+ * 1. Reader who wants a consistent set of information and is willing to
+ *    retry if the information changes. The information that the reader
+ *    needs cannot contain pointers, because any writer could invalidate
+ *    a pointer that a reader was following. This reader never blocks,
+ *    but it may have to retry if a writer is in progress.
+ * 2. Reader who may need to follow pointers. This reader will block if
+ *    a writer is in progress.
+ * 3. Writer who may need to modify content of a data structure. Writer
+ *    blocks if another writer or the 2nd type of reader is in progress.
+ *
+ * The current implementation is layered on top of the regular read/write
+ * lock. There is a chance that writers may be starved by readers, so be
+ * careful when you decide to use this lock.
+ *
+ * Expected 1st type reader usage:
+ *	do {
+ *	    seq = read_seqrwbegin(&foo);
+ *	...
+ *	} while (read_seqrwretry(&foo, seq));
+ *
+ * Expected 2nd type reader usage:
+ *	read_seqrwlock(&foo)
+ *	...
+ *	read_seqrwunlock(&foo)
+ *
+ * Expected writer usage:
+ *	write_seqrwlock(&foo)
+ *	...
+ *	write_seqrwunlock(&foo)
+ *
+ * Based on the seqlock.h file
+ * by Waiman Long
+ */
+
+#include <linux/rwlock.h>
+#include <linux/preempt.h>
+#include <asm/processor.h>
+
+typedef struct {
+	unsigned sequence;
+	rwlock_t lock;
+} seqrwlock_t;
+
+#define __SEQRWLOCK_UNLOCKED(lockname) \
+		 { 0, __RW_LOCK_UNLOCKED(lockname) }
+
+#define seqrwlock_init(x)				\
+	do {						\
+		(x)->sequence = 0;			\
+		rwlock_init(&(x)->lock);		\
+	} while (0)
+
+#define DEFINE_SEQRWLOCK(x) \
+		seqrwlock_t x = __SEQRWLOCK_UNLOCKED(x)
+
+/* For writer:
+ * Lock out other writers and 2nd type of readers and update the sequence
+ * number. Don't need preempt_disable() because that is in the read_lock and
+ * write_lock already.
+ */
+static inline void write_seqrwlock(seqrwlock_t *sl)
+{
+	write_lock(&sl->lock);
+	++sl->sequence;
+	smp_wmb();
+}
+
+static inline void write_seqrwunlock(seqrwlock_t *sl)
+{
+	smp_wmb();
+	sl->sequence++;
+	write_unlock(&sl->lock);
+}
+
+static inline int write_tryseqrwlock(seqrwlock_t *sl)
+{
+	int ret = write_trylock(&sl->lock);
+
+	if (ret) {
+		++sl->sequence;
+		smp_wmb();
+	}
+	return ret;
+}
+
+/* For 2nd type of reader:
+ * Lock out other writers, but don't update the sequence number
+ */
+static inline void read_seqrwlock(seqrwlock_t *sl)
+{
+	read_lock(&sl->lock);
+}
+
+static inline void read_seqrwunlock(seqrwlock_t *sl)
+{
+	read_unlock(&sl->lock);
+}
+
+static inline int read_tryseqrwlock(seqrwlock_t *sl)
+{
+	return read_trylock(&sl->lock);
+}
+
+/* Start of read calculation -- fetch last complete writer token */
+static __always_inline unsigned read_seqrwbegin(const seqrwlock_t *sl)
+{
+	unsigned ret;
+
+repeat:
+	ret = ACCESS_ONCE(sl->sequence);
+	if (unlikely(ret & 1)) {
+		cpu_relax();
+		goto repeat;
+	}
+	smp_rmb();
+	return ret;
+}
+
+/*
+ * Test if reader processed invalid data.
+ *
+ * If sequence value changed then writer changed data while in section.
+ */
+static __always_inline int
+read_seqrwretry(const seqrwlock_t *sl, unsigned start)
+{
+	smp_rmb();
+	return unlikely(sl->sequence != start);
+}
+
+#endif /* __LINUX_SEQRWLOCK_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] dcache: change rename_lock to a sequence read/write lock
  2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
                   ` (2 preceding siblings ...)
  2013-02-19 18:50   ` Waiman Long
@ 2013-02-19 18:50 ` Waiman Long
  2013-02-19 18:50 ` [PATCH 4/4] dcache: don't need to take d_lock in prepend_path() Waiman Long
  2013-02-21 23:38 ` [PATCH 0/4] dcache: make Oracle more scalable on large systems Dave Chinner
  5 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-19 18:50 UTC (permalink / raw)
  To: linux-fsdevel, Alexander Viro, Ian Kent, Sage Weil, Steve French,
	Trond Myklebust, Eric Paris, Jeff Layton, Miklos Szeredi
  Cc: Waiman Long, linux-kernel, autofs, ceph-devel, linux-cifs,
	samba-technical, linux-nfs

The d_path() and related kernel functions currently take the writer
lock on rename_lock because they need to follow pointers. By changing
rename_lock to the new sequence read/write lock, a reader lock can be
taken instead, and multiple d_path() threads can proceed concurrently
without blocking each other.

It is unlikely that the frequency of filesystem changes and d_path()
name lookups will be high enough to cause writer starvation, so the
current limitation of the read/write lock should be acceptable here.

All the sites where rename_lock is referenced were modified to use the
sequence read/write lock declaration and access functions.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 fs/autofs4/waitq.c     |    6 ++--
 fs/ceph/mds_client.c   |    4 +-
 fs/cifs/dir.c          |    4 +-
 fs/dcache.c            |   87 ++++++++++++++++++++++++-----------------------
 fs/nfs/namespace.c     |    6 ++--
 include/linux/dcache.h |    4 +-
 kernel/auditsc.c       |    5 ++-
 7 files changed, 59 insertions(+), 57 deletions(-)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 03bc1d3..95eee02 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -199,7 +199,7 @@ rename_retry:
 	buf = *name;
 	len = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	rcu_read_lock();
 	spin_lock(&sbi->fs_lock);
 	for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
@@ -208,7 +208,7 @@ rename_retry:
 	if (!len || --len > NAME_MAX) {
 		spin_unlock(&sbi->fs_lock);
 		rcu_read_unlock();
-		if (read_seqretry(&rename_lock, seq))
+		if (read_seqrwretry(&rename_lock, seq))
 			goto rename_retry;
 		return 0;
 	}
@@ -224,7 +224,7 @@ rename_retry:
 	}
 	spin_unlock(&sbi->fs_lock);
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 
 	return len;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9165eb8..da6bd2c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1458,7 +1458,7 @@ char *ceph_mdsc_build_path(struct dentry *dentry, int *plen, u64 *base,
 
 retry:
 	len = 0;
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	rcu_read_lock();
 	for (temp = dentry; !IS_ROOT(temp);) {
 		struct inode *inode = temp->d_inode;
@@ -1508,7 +1508,7 @@ retry:
 		temp = temp->d_parent;
 	}
 	rcu_read_unlock();
-	if (pos != 0 || read_seqretry(&rename_lock, seq)) {
+	if (pos != 0 || read_seqrwretry(&rename_lock, seq)) {
 		pr_err("build_path did not end path lookup where "
 		       "expected, namelen is %d, pos is %d\n", len, pos);
 		/* presumably this is only possible if racing with a
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 8719bbe..4842523 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -96,7 +96,7 @@ build_path_from_dentry(struct dentry *direntry)
 		dfsplen = 0;
 cifs_bp_rename_retry:
 	namelen = dfsplen;
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	rcu_read_lock();
 	for (temp = direntry; !IS_ROOT(temp);) {
 		namelen += (1 + temp->d_name.len);
@@ -136,7 +136,7 @@ cifs_bp_rename_retry:
 		}
 	}
 	rcu_read_unlock();
-	if (namelen != dfsplen || read_seqretry(&rename_lock, seq)) {
+	if (namelen != dfsplen || read_seqrwretry(&rename_lock, seq)) {
 		cFYI(1, "did not end path lookup where expected. namelen=%d "
 			"dfsplen=%d", namelen, dfsplen);
 		/* presumably this is only possible if racing with a rename
diff --git a/fs/dcache.c b/fs/dcache.c
index 20cc789..b1487e2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -29,6 +29,7 @@
 #include <asm/uaccess.h>
 #include <linux/security.h>
 #include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
 #include <linux/swap.h>
 #include <linux/bootmem.h>
 #include <linux/fs_struct.h>
@@ -82,7 +83,7 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+__cacheline_aligned_in_smp DEFINE_SEQRWLOCK(rename_lock);
 
 EXPORT_SYMBOL(rename_lock);
 
@@ -1030,7 +1031,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq
 	 */
 	if (new != old->d_parent ||
 		 (old->d_flags & DCACHE_DENTRY_KILLED) ||
-		 (!locked && read_seqretry(&rename_lock, seq))) {
+		 (!locked && read_seqrwretry(&rename_lock, seq))) {
 		spin_unlock(&new->d_lock);
 		new = NULL;
 	}
@@ -1059,7 +1060,7 @@ int have_submounts(struct dentry *parent)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = parent;
 
@@ -1102,23 +1103,23 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return 0; /* No mount points found in tree */
 positive:
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return 1;
 
 rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 EXPORT_SYMBOL(have_submounts);
@@ -1145,7 +1146,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
 	int found = 0;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = parent;
 	spin_lock(&this_parent->d_lock);
@@ -1210,10 +1211,10 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return found;
 
 rename_retry:
@@ -1222,7 +1223,7 @@ rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 
@@ -1832,7 +1833,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
 	 * It is possible that concurrent renames can mess up our list
 	 * walk here and result in missing our dentry, resulting in the
 	 * false-negative result. d_lookup() protects against concurrent
-	 * renames using rename_lock seqlock.
+	 * renames using rename_lock seqrwlock.
 	 *
 	 * See Documentation/filesystems/path-lookup.txt for more details.
 	 */
@@ -1900,11 +1901,11 @@ struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
 	unsigned seq;
 
         do {
-                seq = read_seqbegin(&rename_lock);
+		seq = read_seqrwbegin(&rename_lock);
                 dentry = __d_lookup(parent, name);
                 if (dentry)
 			break;
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqrwretry(&rename_lock, seq));
 	return dentry;
 }
 EXPORT_SYMBOL(d_lookup);
@@ -1918,7 +1919,7 @@ EXPORT_SYMBOL(d_lookup);
  * __d_lookup is like d_lookup, however it may (rarely) return a
  * false-negative result due to unrelated rename activity.
  *
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
  * however it must be used carefully, eg. with a following d_lookup in
  * the case of failure.
  *
@@ -1950,7 +1951,7 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 	 * It is possible that concurrent renames can mess up our list
 	 * walk here and result in missing our dentry, resulting in the
 	 * false-negative result. d_lookup() protects against concurrent
-	 * renames using rename_lock seqlock.
+	 * renames using rename_lock seqrwlock.
 	 *
 	 * See Documentation/filesystems/path-lookup.txt for more details.
 	 */
@@ -2327,9 +2328,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	__d_move(dentry, target);
-	write_sequnlock(&rename_lock);
+	write_seqrwunlock(&rename_lock);
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2467,7 +2468,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 		alias = __d_find_alias(inode, 0);
 		if (alias) {
 			actual = alias;
-			write_seqlock(&rename_lock);
+			write_seqrwlock(&rename_lock);
 
 			if (d_ancestor(alias, dentry)) {
 				/* Check for loops */
@@ -2477,7 +2478,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 				/* Is this an anonymous mountpoint that we
 				 * could splice into our tree? */
 				__d_materialise_dentry(dentry, alias);
-				write_sequnlock(&rename_lock);
+				write_seqrwunlock(&rename_lock);
 				__d_drop(alias);
 				goto found;
 			} else {
@@ -2485,7 +2486,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 				 * aliasing. This drops inode->i_lock */
 				actual = __d_unalias(inode, dentry, alias);
 			}
-			write_sequnlock(&rename_lock);
+			write_seqrwunlock(&rename_lock);
 			if (IS_ERR(actual)) {
 				if (PTR_ERR(actual) == -ELOOP)
 					pr_warn_ratelimited(
@@ -2632,9 +2633,9 @@ char *__d_path(const struct path *path,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	if (error < 0)
 		return ERR_PTR(error);
@@ -2651,9 +2652,9 @@ char *d_absolute_path(const struct path *path,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = prepend_path(path, &root, &res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	if (error > 1)
 		error = -EINVAL;
@@ -2717,11 +2718,11 @@ char *d_path(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = path_with_deleted(path, &root, &res, &buflen);
 	if (error < 0)
 		res = ERR_PTR(error);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	path_put(&root);
 	return res;
 }
@@ -2746,11 +2747,11 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = path_with_deleted(path, &root, &res, &buflen);
 	if (error > 0)
 		error = prepend_unreachable(&res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	path_put(&root);
 	if (error)
 		res =  ERR_PTR(error);
@@ -2817,9 +2818,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	return retval;
 }
@@ -2830,7 +2831,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2838,7 +2839,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
@@ -2876,7 +2877,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	get_fs_root_and_pwd(current->fs, &root, &pwd);
 
 	error = -ENOENT;
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		char *cwd = page + PAGE_SIZE;
@@ -2884,7 +2885,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &root, &cwd, &buflen);
-		write_sequnlock(&rename_lock);
+		read_seqrwunlock(&rename_lock);
 
 		if (error < 0)
 			goto out;
@@ -2904,7 +2905,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 				error = -EFAULT;
 		}
 	} else {
-		write_sequnlock(&rename_lock);
+		read_seqrwunlock(&rename_lock);
 	}
 
 out:
@@ -2940,7 +2941,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 
 	do {
 		/* for restarting inner loop in case of seq retry */
-		seq = read_seqbegin(&rename_lock);
+		seq = read_seqrwbegin(&rename_lock);
 		/*
 		 * Need rcu_readlock to protect against the d_parent trashing
 		 * due to d_move
@@ -2951,7 +2952,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 		else
 			result = 0;
 		rcu_read_unlock();
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqrwretry(&rename_lock, seq));
 
 	return result;
 }
@@ -2963,7 +2964,7 @@ void d_genocide(struct dentry *root)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = root;
 	spin_lock(&this_parent->d_lock);
@@ -3006,17 +3007,17 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return;
 
 rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
 	*--end = '\0';
 	buflen--;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	rcu_read_lock();
 	while (1) {
 		spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
 		spin_unlock(&dentry->d_lock);
 		dentry = dentry->d_parent;
 	}
-	if (read_seqretry(&rename_lock, seq)) {
+	if (read_seqrwretry(&rename_lock, seq)) {
 		spin_unlock(&dentry->d_lock);
 		rcu_read_unlock();
 		goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
 Elong_unlock:
 	spin_unlock(&dentry->d_lock);
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 09495ba..ed54134 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
 #include <linux/rculist.h>
 #include <linux/rculist_bl.h>
 #include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 
@@ -207,7 +207,7 @@ struct dentry_operations {
 
 #define DCACHE_DENTRY_KILLED	0x100000
 
-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;
 
 static inline int dname_external(struct dentry *dentry)
 {
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index a371f85..d4a7bf2 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1892,7 +1892,7 @@ retry:
 	drop = NULL;
 	d = dentry;
 	rcu_read_lock();
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	for(;;) {
 		struct inode *inode = d->d_inode;
 		if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1910,7 +1910,8 @@ retry:
 			break;
 		d = parent;
 	}
-	if (unlikely(read_seqretry(&rename_lock, seq) || drop)) {  /* in this order */
+	if (unlikely(read_seqrwretry(&rename_lock, seq) ||
+	    drop)) { /* in this order */
 		rcu_read_unlock();
 		if (!drop) {
 			/* just a race with rename */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

 	 */
 	if (new != old->d_parent ||
 		 (old->d_flags & DCACHE_DENTRY_KILLED) ||
-		 (!locked && read_seqretry(&rename_lock, seq))) {
+		 (!locked && read_seqrwretry(&rename_lock, seq))) {
 		spin_unlock(&new->d_lock);
 		new = NULL;
 	}
@@ -1059,7 +1060,7 @@ int have_submounts(struct dentry *parent)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = parent;
 
@@ -1102,23 +1103,23 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return 0; /* No mount points found in tree */
 positive:
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return 1;
 
 rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 EXPORT_SYMBOL(have_submounts);
@@ -1145,7 +1146,7 @@ static int select_parent(struct dentry *parent, struct list_head *dispose)
 	int found = 0;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = parent;
 	spin_lock(&this_parent->d_lock);
@@ -1210,10 +1211,10 @@ resume:
 	}
 out:
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return found;
 
 rename_retry:
@@ -1222,7 +1223,7 @@ rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 
@@ -1832,7 +1833,7 @@ struct dentry *__d_lookup_rcu(const struct dentry *parent,
 	 * It is possible that concurrent renames can mess up our list
 	 * walk here and result in missing our dentry, resulting in the
 	 * false-negative result. d_lookup() protects against concurrent
-	 * renames using rename_lock seqlock.
+	 * renames using rename_lock seqrwlock.
 	 *
 	 * See Documentation/filesystems/path-lookup.txt for more details.
 	 */
@@ -1900,11 +1901,11 @@ struct dentry *d_lookup(struct dentry *parent, struct qstr *name)
 	unsigned seq;
 
         do {
-                seq = read_seqbegin(&rename_lock);
+		seq = read_seqrwbegin(&rename_lock);
                 dentry = __d_lookup(parent, name);
                 if (dentry)
 			break;
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqrwretry(&rename_lock, seq));
 	return dentry;
 }
 EXPORT_SYMBOL(d_lookup);
@@ -1918,7 +1919,7 @@ EXPORT_SYMBOL(d_lookup);
  * __d_lookup is like d_lookup, however it may (rarely) return a
  * false-negative result due to unrelated rename activity.
  *
- * __d_lookup is slightly faster by avoiding rename_lock read seqlock,
+ * __d_lookup is slightly faster by avoiding rename_lock read seqrwlock,
  * however it must be used carefully, eg. with a following d_lookup in
  * the case of failure.
  *
@@ -1950,7 +1951,7 @@ struct dentry *__d_lookup(struct dentry *parent, struct qstr *name)
 	 * It is possible that concurrent renames can mess up our list
 	 * walk here and result in missing our dentry, resulting in the
 	 * false-negative result. d_lookup() protects against concurrent
-	 * renames using rename_lock seqlock.
+	 * renames using rename_lock seqrwlock.
 	 *
 	 * See Documentation/filesystems/path-lookup.txt for more details.
 	 */
@@ -2327,9 +2328,9 @@ static void __d_move(struct dentry * dentry, struct dentry * target)
  */
 void d_move(struct dentry *dentry, struct dentry *target)
 {
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	__d_move(dentry, target);
-	write_sequnlock(&rename_lock);
+	write_seqrwunlock(&rename_lock);
 }
 EXPORT_SYMBOL(d_move);
 
@@ -2467,7 +2468,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 		alias = __d_find_alias(inode, 0);
 		if (alias) {
 			actual = alias;
-			write_seqlock(&rename_lock);
+			write_seqrwlock(&rename_lock);
 
 			if (d_ancestor(alias, dentry)) {
 				/* Check for loops */
@@ -2477,7 +2478,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 				/* Is this an anonymous mountpoint that we
 				 * could splice into our tree? */
 				__d_materialise_dentry(dentry, alias);
-				write_sequnlock(&rename_lock);
+				write_seqrwunlock(&rename_lock);
 				__d_drop(alias);
 				goto found;
 			} else {
@@ -2485,7 +2486,7 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
 				 * aliasing. This drops inode->i_lock */
 				actual = __d_unalias(inode, dentry, alias);
 			}
-			write_sequnlock(&rename_lock);
+			write_seqrwunlock(&rename_lock);
 			if (IS_ERR(actual)) {
 				if (PTR_ERR(actual) == -ELOOP)
 					pr_warn_ratelimited(
@@ -2632,9 +2633,9 @@ char *__d_path(const struct path *path,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	if (error < 0)
 		return ERR_PTR(error);
@@ -2651,9 +2652,9 @@ char *d_absolute_path(const struct path *path,
 	int error;
 
 	prepend(&res, &buflen, "\0", 1);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = prepend_path(path, &root, &res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	if (error > 1)
 		error = -EINVAL;
@@ -2717,11 +2718,11 @@ char *d_path(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = path_with_deleted(path, &root, &res, &buflen);
 	if (error < 0)
 		res = ERR_PTR(error);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	path_put(&root);
 	return res;
 }
@@ -2746,11 +2747,11 @@ char *d_path_with_unreachable(const struct path *path, char *buf, int buflen)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	get_fs_root(current->fs, &root);
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	error = path_with_deleted(path, &root, &res, &buflen);
 	if (error > 0)
 		error = prepend_unreachable(&res, &buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	path_put(&root);
 	if (error)
 		res =  ERR_PTR(error);
@@ -2817,9 +2818,9 @@ char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 
 	return retval;
 }
@@ -2830,7 +2831,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2838,7 +2839,7 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
+	read_seqrwunlock(&rename_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
@@ -2876,7 +2877,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 	get_fs_root_and_pwd(current->fs, &root, &pwd);
 
 	error = -ENOENT;
-	write_seqlock(&rename_lock);
+	read_seqrwlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		char *cwd = page + PAGE_SIZE;
@@ -2884,7 +2885,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &root, &cwd, &buflen);
-		write_sequnlock(&rename_lock);
+		read_seqrwunlock(&rename_lock);
 
 		if (error < 0)
 			goto out;
@@ -2904,7 +2905,7 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 				error = -EFAULT;
 		}
 	} else {
-		write_sequnlock(&rename_lock);
+		read_seqrwunlock(&rename_lock);
 	}
 
 out:
@@ -2940,7 +2941,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 
 	do {
 		/* for restarting inner loop in case of seq retry */
-		seq = read_seqbegin(&rename_lock);
+		seq = read_seqrwbegin(&rename_lock);
 		/*
 		 * Need rcu_readlock to protect against the d_parent trashing
 		 * due to d_move
@@ -2951,7 +2952,7 @@ int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
 		else
 			result = 0;
 		rcu_read_unlock();
-	} while (read_seqretry(&rename_lock, seq));
+	} while (read_seqrwretry(&rename_lock, seq));
 
 	return result;
 }
@@ -2963,7 +2964,7 @@ void d_genocide(struct dentry *root)
 	unsigned seq;
 	int locked = 0;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 again:
 	this_parent = root;
 	spin_lock(&this_parent->d_lock);
@@ -3006,17 +3007,17 @@ resume:
 		goto resume;
 	}
 	spin_unlock(&this_parent->d_lock);
-	if (!locked && read_seqretry(&rename_lock, seq))
+	if (!locked && read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 	if (locked)
-		write_sequnlock(&rename_lock);
+		write_seqrwunlock(&rename_lock);
 	return;
 
 rename_retry:
 	if (locked)
 		goto again;
 	locked = 1;
-	write_seqlock(&rename_lock);
+	write_seqrwlock(&rename_lock);
 	goto again;
 }
 
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index fc8dc20..0eca871 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -60,7 +60,7 @@ rename_retry:
 	*--end = '\0';
 	buflen--;
 
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	rcu_read_lock();
 	while (1) {
 		spin_lock(&dentry->d_lock);
@@ -76,7 +76,7 @@ rename_retry:
 		spin_unlock(&dentry->d_lock);
 		dentry = dentry->d_parent;
 	}
-	if (read_seqretry(&rename_lock, seq)) {
+	if (read_seqrwretry(&rename_lock, seq)) {
 		spin_unlock(&dentry->d_lock);
 		rcu_read_unlock();
 		goto rename_retry;
@@ -117,7 +117,7 @@ rename_retry:
 Elong_unlock:
 	spin_unlock(&dentry->d_lock);
 	rcu_read_unlock();
-	if (read_seqretry(&rename_lock, seq))
+	if (read_seqrwretry(&rename_lock, seq))
 		goto rename_retry;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 09495ba..ed54134 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -6,7 +6,7 @@
 #include <linux/rculist.h>
 #include <linux/rculist_bl.h>
 #include <linux/spinlock.h>
-#include <linux/seqlock.h>
+#include <linux/seqrwlock.h>
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 
@@ -207,7 +207,7 @@ struct dentry_operations {
 
 #define DCACHE_DENTRY_KILLED	0x100000
 
-extern seqlock_t rename_lock;
+extern seqrwlock_t rename_lock;
 
 static inline int dname_external(struct dentry *dentry)
 {
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index a371f85..d4a7bf2 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1892,7 +1892,7 @@ retry:
 	drop = NULL;
 	d = dentry;
 	rcu_read_lock();
-	seq = read_seqbegin(&rename_lock);
+	seq = read_seqrwbegin(&rename_lock);
 	for(;;) {
 		struct inode *inode = d->d_inode;
 		if (inode && unlikely(!hlist_empty(&inode->i_fsnotify_marks))) {
@@ -1910,7 +1910,8 @@ retry:
 			break;
 		d = parent;
 	}
-	if (unlikely(read_seqretry(&rename_lock, seq) || drop)) {  /* in this order */
+	if (unlikely(read_seqrwretry(&rename_lock, seq) ||
+	    drop)) { /* in this order */
 		rcu_read_unlock();
 		if (!drop) {
 			/* just a race with rename */
-- 
1.7.1

* [PATCH 4/4] dcache: don't need to take d_lock in prepend_path()
  2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
                   ` (3 preceding siblings ...)
  2013-02-19 18:50 ` Waiman Long
@ 2013-02-19 18:50 ` Waiman Long
  2013-02-21 23:38 ` [PATCH 0/4] dcache: make Oracle more scalable on large systems Dave Chinner
  5 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-19 18:50 UTC (permalink / raw)
  To: linux-fsdevel, Alexander Viro; +Cc: Waiman Long, linux-kernel

The d_lock spinlock was taken in prepend_path() to protect
dentry->d_name from being changed underneath it. As the caller of
prepend_path() has to take rename_lock before calling into it, there
is no chance that d_name will change during the walk. The d_lock
spinlock would only be needed if rename_lock were not held.
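
To double-check the reasoning, here is a small stand-alone C program
(an editorial sketch, not derived from the patch) that demonstrates
the invariant being relied on: when every writer that can change a
name takes the write side of a single rwlock, as d_move() does with
rename_lock, a reader holding the read side can copy the name without
any per-object lock.

    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static pthread_rwlock_t rename_lock = PTHREAD_RWLOCK_INITIALIZER;
    static char name[16] = "alpha";

    /* Stand-in for d_move(): every rename takes the write side. */
    static void *renamer(void *arg)
    {
        int i;

        (void)arg;
        for (i = 0; i < 100000; i++) {
            pthread_rwlock_wrlock(&rename_lock);
            strcpy(name, (i & 1) ? "alpha" : "omega");
            pthread_rwlock_unlock(&rename_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        char snap[16];
        int i;

        if (pthread_create(&t, NULL, renamer, NULL))
            return 1;
        for (i = 0; i < 100000; i++) {
            /* stand-in for a prepend_path() caller */
            pthread_rwlock_rdlock(&rename_lock);
            strcpy(snap, name);         /* no per-name lock needed */
            pthread_rwlock_unlock(&rename_lock);
            assert(!strcmp(snap, "alpha") ||
                   !strcmp(snap, "omega"));
        }
        pthread_join(t, NULL);
        puts("no torn names observed");
        return 0;
    }

If the read lock around the copy is removed, the assert will likely
catch a torn name sooner or later, which is exactly the failure the
rename_lock convention prevents.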

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 fs/dcache.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b1487e2..0e911fc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2547,6 +2547,7 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  * @buflen: pointer to buffer length
  *
  * Caller holds the rename_lock.
+ * There is no need to lock the dentry as its name cannot be changed.
  */
 static int prepend_path(const struct path *path,
 			const struct path *root,
@@ -2573,9 +2574,7 @@ static int prepend_path(const struct path *path,
 		}
 		parent = dentry->d_parent;
 		prefetch(parent);
-		spin_lock(&dentry->d_lock);
 		error = prepend_name(buffer, buflen, &dentry->d_name);
-		spin_unlock(&dentry->d_lock);
 		if (!error)
 			error = prepend(buffer, buflen, "/", 1);
 		if (error)
-- 
1.7.1


* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
                   ` (4 preceding siblings ...)
  2013-02-19 18:50 ` [PATCH 4/4] dcache: don't need to take d_lock in prepend_path() Waiman Long
@ 2013-02-21 23:38 ` Dave Chinner
  2013-02-22  0:13   ` Andi Kleen
  5 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2013-02-21 23:38 UTC (permalink / raw)
  To: Waiman Long; +Cc: linux-fsdevel, Alexander Viro, linux-kernel

On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
> It was found that the Oracle database software issues a lot of call
> to the seq_path() kernel function which translates a (dentry, mnt)
> pair to an absolute path. The seq_path() function will eventually
> take the following two locks:

Nobody should be doing reverse dentry-to-name lookups in a quantity
sufficient for it to become a performance limiting factor. What is
the Oracle DB actually using this path for?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-21 23:38 ` [PATCH 0/4] dcache: make Oracle more scalable on large systems Dave Chinner
@ 2013-02-22  0:13   ` Andi Kleen
  2013-02-22  4:13     ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2013-02-22  0:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Waiman Long, linux-fsdevel, Alexander Viro, linux-kernel

Dave Chinner <david@fromorbit.com> writes:

> On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
>> It was found that the Oracle database software issues a lot of call
>> to the seq_path() kernel function which translates a (dentry, mnt)
>> pair to an absolute path. The seq_path() function will eventually
>> take the following two locks:
>
> Nobody should be doing reverse dentry-to-name lookups in a quantity
> sufficient for it to become a performance limiting factor. What is
> the Oracle DB actually using this path for?

Yes calling d_path frequently is usually a bug elsewhere.
Is that through /proc ? 

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-22  0:13   ` Andi Kleen
@ 2013-02-22  4:13     ` Waiman Long
  2013-02-22 23:00       ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2013-02-22  4:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On 02/21/2013 07:13 PM, Andi Kleen wrote:
> Dave Chinner<david@fromorbit.com>  writes:
>
>> On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
>>> It was found that the Oracle database software issues a lot of call
>>> to the seq_path() kernel function which translates a (dentry, mnt)
>>> pair to an absolute path. The seq_path() function will eventually
>>> take the following two locks:
>> Nobody should be doing reverse dentry-to-name lookups in a quantity
>> sufficient for it to become a performance limiting factor. What is
>> the Oracle DB actually using this path for?
> Yes calling d_path frequently is usually a bug elsewhere.
> Is that through /proc ?
>
> -Andi
>
>
A sample strace of Oracle indicates that it opens a lot of /proc
filesystem files, such as stat and maps, many times while running.
Oracle has a very detailed system performance reporting infrastructure
in place that reports almost every aspect of system performance through
its AWR reporting tool or the browser-based enterprise manager. Maybe
that is why it is hitting this performance bottleneck.
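
For reference, the access pattern described above is easy to
reproduce with a trivial loop (a hedged sketch; the file choice and
iteration count are arbitrary). Reading /proc/<pid>/maps makes the
kernel format the path of every file-backed mapping through
seq_path(), which is exactly the code path that hits rename_lock and
the d_lock reference-count updates.

    #include <stdio.h>

    int main(void)
    {
        char buf[4096];
        int i;

        for (i = 0; i < 100000; i++) {
            FILE *f = fopen("/proc/self/maps", "r");

            if (!f)
                return 1;
            /* each file-backed mapping line is formatted in the
             * kernel via seq_path() */
            while (fgets(buf, sizeof(buf), f))
                ;
            fclose(f);
        }
        return 0;
    }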

Regards,
Longman

* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-22  4:13     ` Waiman Long
@ 2013-02-22 23:00       ` Dave Chinner
  2013-02-23  0:13         ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2013-02-22 23:00 UTC (permalink / raw)
  To: Waiman Long; +Cc: Andi Kleen, linux-fsdevel, Alexander Viro, linux-kernel

On Thu, Feb 21, 2013 at 11:13:27PM -0500, Waiman Long wrote:
> On 02/21/2013 07:13 PM, Andi Kleen wrote:
> >Dave Chinner<david@fromorbit.com>  writes:
> >
> >>On Tue, Feb 19, 2013 at 01:50:55PM -0500, Waiman Long wrote:
> >>>It was found that the Oracle database software issues a lot of call
> >>>to the seq_path() kernel function which translates a (dentry, mnt)
> >>>pair to an absolute path. The seq_path() function will eventually
> >>>take the following two locks:
> >>Nobody should be doing reverse dentry-to-name lookups in a quantity
> >>sufficient for it to become a performance limiting factor. What is
> >>the Oracle DB actually using this path for?
> >Yes calling d_path frequently is usually a bug elsewhere.
> >Is that through /proc ?
> >
> >-Andi
> >
> >
> A sample strace of Oracle indicates that it opens a lot of /proc
> filesystem files, such as stat and maps, many times while
> running. Oracle has a very detailed system performance reporting
> infrastructure in place that reports almost every aspect of system
> performance through its AWR reporting tool or the browser-based
> enterprise manager. Maybe that is why it is hitting this
> performance bottleneck.

That seems to me like an application problem - poking at what the
kernel is doing via diagnostic interfaces so often that it gets in
the way of the kernel actually doing stuff is not a problem the
kernel can solve.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-22 23:00       ` Dave Chinner
@ 2013-02-23  0:13         ` Andi Kleen
  2013-02-28 20:39           ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2013-02-23  0:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Andi Kleen, linux-fsdevel, Alexander Viro, linux-kernel

> That seems to me like an application problem - poking at what the
> kernel is doing via diagnostic interfaces so often that it gets in
> the way of the kernel actually doing stuff is not a problem the
> kernel can solve.

I agree with you that the application shouldn't be doing that, but
if there is a cheap way to lower the d_path overhead that is also
attractive.  There will always be applications doing broken things.
One less scaling problem in the kernel is a good thing.

But the real fix in this case is to fix the application.

-Andi

* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-23  0:13         ` Andi Kleen
@ 2013-02-28 20:39           ` Waiman Long
  2013-02-28 23:13             ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2013-02-28 20:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On 02/22/2013 07:13 PM, Andi Kleen wrote:
>> That seems to me like an application problem - poking at what the
>> kernel is doing via diagnostic interfaces so often that it gets in
>> the way of the kernel actually doing stuff is not a problem the
>> kernel can solve.
> I agree with you that the application shouldn't be doing that, but
> if there is a cheap way to lower the d_path overhead that is also
> attractive.  There will always be applications doing broken things.
> One less scaling problem in the kernel is a good thing.
>
> But the real fix in this case is to fix the application.
>
> -Andi
Further investigation into the d_path() bottleneck revealed some
interesting facts about Oracle. First of all, the invocation of the
d_path() kernel function is concentrated in a few processes rather
than distributed across many of them. In a one-minute test run, the
following three standard long-running Oracle processes called into
d_path():

1. MMNL - Memory monitor light (gathers and stores AWR statistics) [272]
2. CKPT - Checkpoint process [17]
3. DBRM - DB resource manager (new in 11g) [16]

The numbers within [] are the number of times d_path() was called,
which is not much for a one-minute interval. Beyond those standard
processes, Oracle also seems to spawn transient processes (lasting a
few seconds) periodically to issue a bunch of d_path() calls (about
1000) within a short time before they die. I am not sure what the
purpose of those processes is. In a one-minute interval, 2-7 of those
transient processes may be spawned, depending presumably on the
activity level. Most of the d_path() calls last for about 1ms; a
couple last for more than 10ms.

Other system daemons that call into d_path() include irqbalance and
automount. irqbalance issues about 2000 d_path() calls in a minute in
a bursty fashion. The contribution of automount is only about 50 in
the same period, which is not really significant. Regular commands
like cp and ps may also issue a couple of d_path() calls per
invocation.

As I was using the "perf record --call-graph" command to profile the
Oracle application, I found that another major user of the d_path()
function happens to be perf_event_mmap_event() in the perf-event
subsystem. It accounted for about 10% of the total d_path() calls, so
the metrics that I collected were skewed a bit.

I suspect that the impact of my patch on Oracle write performance
is mostly due to its effect on the open() system call, which has to
update the reference count on the dentry. In the collected perf
traces, a certain portion of the spinlock time was consumed by dput()
and path_get(), whose two major callers are d_path() and the open()
system call. In a test run with no writers, I saw significantly fewer
open() calls in the perf trace and hence much less impact on Oracle
performance.

I do agree that Oracle should probably fix the application to issue
fewer calls to the d_path() function. However, I would argue that my
patch will still be useful for the following reasons:

1. Changing how the reference counting works (patch 1 of 4) will certainly
    help in situations where processes issue intensive batches of file
    system operations, as is the case here (a rough sketch of this fast
    path follows the list).
2. Changing the rename_lock to use a sequence r/w lock (patches 2-4 of 4)
    will help to minimize the overhead of the perf-event subsystem when it
    is activated with the call-graph feature, which is a common
    configuration.
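
A rough user-space rendering of the fast path from point 1 (an
assumption-laden sketch; the actual patch operates on dentry->d_count
with constraints not shown here): take the lock only when the count
is about to cross the boundary where real work happens.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Lockless fast path for taking a reference: succeed as long as
     * the count stays above zero; a false return tells the caller to
     * fall back to the locked slow path. */
    static bool ref_get_fast(atomic_int *count)
    {
        int old = atomic_load(count);

        while (old > 0) {
            /* on failure, 'old' is reloaded with the current value */
            if (atomic_compare_exchange_weak(count, &old, old + 1))
                return true;
        }
        return false;
    }

    /* Lockless fast path for dropping a reference: only safe when
     * the count cannot reach zero, because hitting zero triggers
     * teardown that needs the lock. */
    static bool ref_put_fast(atomic_int *count)
    {
        int old = atomic_load(count);

        while (old > 1) {
            if (atomic_compare_exchange_weak(count, &old, old - 1))
                return true;
        }
        return false;
    }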

Regards,
Longman


* Re: [PATCH 0/4] dcache: make Oracle more scalable on large systems
  2013-02-28 20:39           ` Waiman Long
@ 2013-02-28 23:13             ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2013-02-28 23:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Chinner, linux-fsdevel, Alexander Viro, linux-kernel

On 02/28/2013 03:39 PM, Waiman Long wrote:
>
> activity level. Most of the d_path() calls last for about 1ms; a
> couple last for more than 10ms.
>

A correction. The time unit here should be us, not ms. Sorry for the 
mistake.

-Longman

Thread overview: 14+ messages
2013-02-19 18:50 [PATCH 0/4] dcache: make Oracle more scalable on large systems Waiman Long
2013-02-19 18:50 ` [PATCH 1/4] dcache: Don't take unncessary lock in d_count update Waiman Long
2013-02-19 18:50 ` [PATCH 2/4] dcache: introduce a new sequence read/write lock type Waiman Long
2013-02-19 18:50 ` [PATCH 3/4] dcache: change rename_lock to a sequence read/write lock Waiman Long
2013-02-19 18:50   ` Waiman Long
2013-02-19 18:50 ` Waiman Long
2013-02-19 18:50 ` [PATCH 4/4] dcache: don't need to take d_lock in prepend_path() Waiman Long
2013-02-21 23:38 ` [PATCH 0/4] dcache: make Oracle more scalable on large systems Dave Chinner
2013-02-22  0:13   ` Andi Kleen
2013-02-22  4:13     ` Waiman Long
2013-02-22 23:00       ` Dave Chinner
2013-02-23  0:13         ` Andi Kleen
2013-02-28 20:39           ` Waiman Long
2013-02-28 23:13             ` Waiman Long
