* [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
From: Dave Chinner @ 2015-01-07 22:25 UTC
  To: xfs; +Cc: linux-fsdevel, linux-mm

Hi folks,

This patch set is an attempt to stop the XFS truncate and hole-punch
code from racing with page faults that enter the IO path. Serialising
these paths has traditionally been deadlock prone due to the inversion
of the filesystem IO path locks and the mmap_sem.
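
For illustration, the traditional inversion looks roughly like this
(a sketch of the well-known ABBA pattern, not a trace from this
series):

	Task A: write(2) syscall		Task B: page fault
	  xfs_ilock(iolock)			  down_read(mmap_sem)
	  copy_from_user() faults		  ->fault/->page_mkwrite
	    down_read(mmap_sem)			    wants the iolock

Task A waits on the mmap_sem that task B holds, while task B waits
on the iolock that task A holds, so neither path can safely take the
other's lock first.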

To avoid this issue, I have introduced a new "i_mmaplock" rwsem into
the XFS code similar to the IO lock, but this lock is only taken in
the mmap fault paths on entry into the filesystem (i.e. ->fault and
->page_mkwrite).

The concept is that if we invalidate the page cache over a range
after taking both the existing i_iolock and the new i_mmaplock, we
have closed off every vector for repopulation of the page cache
over the invalidated range until one of the IO or mmap locks is
dropped. i.e. we can guarantee that neither the syscall IO path nor
page faults will race with whatever operation the filesystem is
performing...
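
Roughly, the exclusion pattern this enables for an operation like
hole punch looks like the following sketch (illustrative only, not
code quoted from this series; error handling elided):

	/*
	 * With both locks held exclusive, flush and invalidate the
	 * cached range. Nothing can repopulate the page cache over
	 * this range until both locks have been dropped, so the
	 * extent manipulation runs with no IO or faults in flight.
	 */
	xfs_ilock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);

	filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset,
				     offset + len - 1);
	truncate_pagecache_range(VFS_I(ip), offset, offset + len - 1);

	/* ... modify the extent map (truncate, punch, shift) ... */

	xfs_iunlock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);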

The introduction of a new lock is necessary to avoid deadlocks due
to mmap_sem entanglement. It has a defined lock order during page
faults of:

mmap_sem
-> i_mmaplock (read)
   -> page lock
      -> i_lock (get blocks)

This lock is then taken by any extent manipulation code in XFS in
addition to the IO lock, which gives the lock ordering:

i_iolock (write)
-> i_mmaplock (write)
   -> page lock (data writeback, page invalidation)
      -> i_lock (data writeback)
   -> i_lock (modification transaction)

Hence we have consistent lock ordering (which has been validated so
far by testing with lockdep enabled) for page fault IO vs
truncate, hole punch, extent shifts, etc.

This patchset passes xfstests and various benchmark and stress
workloads, so the real question now is:

	What have I missed?

Comments, thoughts, flames?

-Dave.

* [RFC PATCH 1/6] xfs: introduce mmap/truncate lock
From: Dave Chinner @ 2015-01-07 22:25 UTC
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Right now we cannot sanely serialise mmap against truncate or hole
punch. ->page_mkwrite is not able to take the locks that the read IO
path normally takes (i.e. the inode iolock) because that could result
in lock inversions (read: iolock -> page fault -> page_mkwrite ->
iolock), and so we cannot use an IO path lock to serialise page
write faults against truncate operations.

Instead, introduce a new lock that is used *only* in the
->page_mkwrite path and is the equivalent of the iolock. The lock
ordering in a page fault is i_mmaplock -> page lock -> i_lock, and
so in truncate we can take i_iolock -> i_mmaplock and thereby lock
out new write faults while the truncation is in progress.

Because the i_mmaplock is taken outside the page lock, we can hold
it across all the same operations we hold the i_iolock for. The only
difference is that we never hold the i_mmaplock in the normal IO
path and so can never page fault while holding it. Hence there are
no recursion issues on the i_mmaplock and we can use it to serialise
page fault IO against inode modification operations that affect the
IO path.

This patch introduces the i_mmaplock infrastructure, lockdep
annotations and initialisation/destruction code. Use of the new lock
will be in subsequent patches.
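
For context, the fault-side usage added by the following patches
wraps the generic handlers like so:

	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
	error = filemap_fault(vma, vmf);	/* or block_page_mkwrite() */
	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);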

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_inode.h | 29 +++++++++++++-----
 fs/xfs/xfs_super.c |  2 ++
 3 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 400791a..573b49c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -150,6 +150,8 @@ xfs_ilock(
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
+	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
+	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
@@ -159,6 +161,11 @@ xfs_ilock(
 	else if (lock_flags & XFS_IOLOCK_SHARED)
 		mraccess_nested(&ip->i_iolock, XFS_IOLOCK_DEP(lock_flags));
 
+	if (lock_flags & XFS_MMAPLOCK_EXCL)
+		mrupdate_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));
+	else if (lock_flags & XFS_MMAPLOCK_SHARED)
+		mraccess_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));
+
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrupdate_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
 	else if (lock_flags & XFS_ILOCK_SHARED)
@@ -191,6 +198,8 @@ xfs_ilock_nowait(
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
+	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
+	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
@@ -202,21 +211,35 @@ xfs_ilock_nowait(
 		if (!mrtryaccess(&ip->i_iolock))
 			goto out;
 	}
+
+	if (lock_flags & XFS_MMAPLOCK_EXCL) {
+		if (!mrtryupdate(&ip->i_mmaplock))
+			goto out_undo_iolock;
+	} else if (lock_flags & XFS_MMAPLOCK_SHARED) {
+		if (!mrtryaccess(&ip->i_mmaplock))
+			goto out_undo_iolock;
+	}
+
 	if (lock_flags & XFS_ILOCK_EXCL) {
 		if (!mrtryupdate(&ip->i_lock))
-			goto out_undo_iolock;
+			goto out_undo_mmaplock;
 	} else if (lock_flags & XFS_ILOCK_SHARED) {
 		if (!mrtryaccess(&ip->i_lock))
-			goto out_undo_iolock;
+			goto out_undo_mmaplock;
 	}
 	return 1;
 
- out_undo_iolock:
+out_undo_mmaplock:
+	if (lock_flags & XFS_MMAPLOCK_EXCL)
+		mrunlock_excl(&ip->i_mmaplock);
+	else if (lock_flags & XFS_MMAPLOCK_SHARED)
+		mrunlock_shared(&ip->i_mmaplock);
+out_undo_iolock:
 	if (lock_flags & XFS_IOLOCK_EXCL)
 		mrunlock_excl(&ip->i_iolock);
 	else if (lock_flags & XFS_IOLOCK_SHARED)
 		mrunlock_shared(&ip->i_iolock);
- out:
+out:
 	return 0;
 }
 
@@ -244,6 +267,8 @@ xfs_iunlock(
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
+	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
+	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
@@ -254,6 +279,11 @@ xfs_iunlock(
 	else if (lock_flags & XFS_IOLOCK_SHARED)
 		mrunlock_shared(&ip->i_iolock);
 
+	if (lock_flags & XFS_MMAPLOCK_EXCL)
+		mrunlock_excl(&ip->i_mmaplock);
+	else if (lock_flags & XFS_MMAPLOCK_SHARED)
+		mrunlock_shared(&ip->i_mmaplock);
+
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrunlock_excl(&ip->i_lock);
 	else if (lock_flags & XFS_ILOCK_SHARED)
@@ -271,11 +301,14 @@ xfs_ilock_demote(
 	xfs_inode_t		*ip,
 	uint			lock_flags)
 {
-	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL));
-	ASSERT((lock_flags & ~(XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
+	ASSERT((lock_flags &
+		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
 
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrdemote(&ip->i_lock);
+	if (lock_flags & XFS_MMAPLOCK_EXCL)
+		mrdemote(&ip->i_mmaplock);
 	if (lock_flags & XFS_IOLOCK_EXCL)
 		mrdemote(&ip->i_iolock);
 
@@ -294,6 +327,12 @@ xfs_isilocked(
 		return rwsem_is_locked(&ip->i_lock.mr_lock);
 	}
 
+	if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) {
+		if (!(lock_flags & XFS_MMAPLOCK_SHARED))
+			return !!ip->i_mmaplock.mr_writer;
+		return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
+	}
+
 	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
 		if (!(lock_flags & XFS_IOLOCK_SHARED))
 			return !!ip->i_iolock.mr_writer;
@@ -314,14 +353,27 @@ int xfs_lock_delays;
 #endif
 
 /*
- * Bump the subclass so xfs_lock_inodes() acquires each lock with
- * a different value
+ * Bump the subclass so xfs_lock_inodes() acquires each lock with a different
+ * value. This shouldn't be called for page fault locking, but we also need to
+ * ensure we don't overrun the number of lockdep subclasses for the iolock or
+ * mmaplock as that is limited to 12 by the mmap lock lockdep annotations.
  */
 static inline int
 xfs_lock_inumorder(int lock_mode, int subclass)
 {
-	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
+	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
+		ASSERT(subclass + XFS_LOCK_INUMORDER <
+			(1 << (XFS_MMAPLOCK_SHIFT - XFS_IOLOCK_SHIFT)));
 		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_IOLOCK_SHIFT;
+	}
+
+	if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) {
+		ASSERT(subclass + XFS_LOCK_INUMORDER <
+			(1 << (XFS_ILOCK_SHIFT - XFS_MMAPLOCK_SHIFT)));
+		lock_mode |= (subclass + XFS_LOCK_INUMORDER) <<
+							XFS_MMAPLOCK_SHIFT;
+	}
+
 	if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))
 		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_ILOCK_SHIFT;
 
@@ -440,10 +492,10 @@ again:
 }
 
 /*
- * xfs_lock_two_inodes() can only be used to lock one type of lock
- * at a time - the iolock or the ilock, but not both at once. If
- * we lock both at once, lockdep will report false positives saying
- * we have violated locking orders.
+ * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
+ * the iolock, the mmaplock or the ilock, but not more than one at a time. If we
+ * lock more than one at a time, lockdep will report false positives saying we
+ * have violated locking orders.
  */
 void
 xfs_lock_two_inodes(
@@ -455,8 +507,12 @@ xfs_lock_two_inodes(
 	int			attempts = 0;
 	xfs_log_item_t		*lp;
 
-	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
-		ASSERT((lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)) == 0);
+	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
+		ASSERT(!(lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
+		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+	} else if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
+		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
+
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
 	if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index de97ccc..8e7a12a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -56,6 +56,7 @@ typedef struct xfs_inode {
 	struct xfs_inode_log_item *i_itemp;	/* logging information */
 	mrlock_t		i_lock;		/* inode lock */
 	mrlock_t		i_iolock;	/* inode IO lock */
+	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
 	atomic_t		i_pincount;	/* inode pin count */
 	spinlock_t		i_flags_lock;	/* inode i_flags lock */
 	/* Miscellaneous state. */
@@ -263,15 +264,20 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 #define	XFS_IOLOCK_SHARED	(1<<1)
 #define	XFS_ILOCK_EXCL		(1<<2)
 #define	XFS_ILOCK_SHARED	(1<<3)
+#define	XFS_MMAPLOCK_EXCL	(1<<4)
+#define	XFS_MMAPLOCK_SHARED	(1<<5)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
-				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED)
+				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
 	{ XFS_IOLOCK_SHARED,	"IOLOCK_SHARED" }, \
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
-	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }
+	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
+	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
 
 
 /*
@@ -302,17 +308,26 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
 #define XFS_IOLOCK_SHIFT	16
 #define	XFS_IOLOCK_PARENT	(XFS_LOCK_PARENT << XFS_IOLOCK_SHIFT)
 
+#define XFS_MMAPLOCK_SHIFT	20
+
 #define XFS_ILOCK_SHIFT		24
 #define	XFS_ILOCK_PARENT	(XFS_LOCK_PARENT << XFS_ILOCK_SHIFT)
 #define	XFS_ILOCK_RTBITMAP	(XFS_LOCK_RTBITMAP << XFS_ILOCK_SHIFT)
 #define	XFS_ILOCK_RTSUM		(XFS_LOCK_RTSUM << XFS_ILOCK_SHIFT)
 
-#define XFS_IOLOCK_DEP_MASK	0x00ff0000
+#define XFS_IOLOCK_DEP_MASK	0x000f0000
+#define XFS_MMAPLOCK_DEP_MASK	0x00f00000
 #define XFS_ILOCK_DEP_MASK	0xff000000
-#define XFS_LOCK_DEP_MASK	(XFS_IOLOCK_DEP_MASK | XFS_ILOCK_DEP_MASK)
-
-#define XFS_IOLOCK_DEP(flags)	(((flags) & XFS_IOLOCK_DEP_MASK) >> XFS_IOLOCK_SHIFT)
-#define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
+#define XFS_LOCK_DEP_MASK	(XFS_IOLOCK_DEP_MASK | \
+				 XFS_MMAPLOCK_DEP_MASK | \
+				 XFS_ILOCK_DEP_MASK)
+
+#define XFS_IOLOCK_DEP(flags)	(((flags) & XFS_IOLOCK_DEP_MASK) \
+					>> XFS_IOLOCK_SHIFT)
+#define XFS_MMAPLOCK_DEP(flags)	(((flags) & XFS_MMAPLOCK_DEP_MASK) \
+					>> XFS_MMAPLOCK_SHIFT)
+#define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) \
+					>> XFS_ILOCK_SHIFT)
 
 /*
  * For multiple groups support: if S_ISGID bit is set in the parent
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index afd6bae..40d2ac5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -986,6 +986,8 @@ xfs_fs_inode_init_once(
 	atomic_set(&ip->i_pincount, 0);
 	spin_lock_init(&ip->i_flags_lock);
 
+	mrlock_init(&ip->i_mmaplock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
+		     "xfsino", ip->i_ino);
 	mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
 		     "xfsino", ip->i_ino);
 }
-- 
2.0.0

* [RFC PATCH 2/6] xfs: use i_mmaplock on read faults
From: Dave Chinner @ 2015-01-07 22:25 UTC
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Take the i_mmaplock over read page faults. These come through the
->fault callout, so we need to wrap the generic implementation
with the i_mmaplock. While there, add a tracepoint for the read
fault as it passes through XFS.

This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
-> i_lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 28 +++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h |  2 ++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 13e974e..87535e6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1349,6 +1349,32 @@ xfs_file_llseek(
 	}
 }
 
+/*
+ * Locking for serialisation of IO during page faults. This results in a lock
+ * ordering of:
+ *
+ * mmap_sem (MM)
+ *   i_mmap_lock (XFS - truncate serialisation)
+ *     page_lock (MM)
+ *       i_lock (XFS - extent map serialisation)
+ */
+STATIC int
+xfs_filemap_fault(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	int			error;
+
+	trace_xfs_filemap_fault(ip);
+
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	error = filemap_fault(vma, vmf);
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+
+	return error;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read		= new_sync_read,
@@ -1381,7 +1407,7 @@ const struct file_operations xfs_dir_file_operations = {
 };
 
 static const struct vm_operations_struct xfs_file_vm_ops = {
-	.fault		= filemap_fault,
+	.fault		= xfs_filemap_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= xfs_vm_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 51372e3..c496153 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -685,6 +685,8 @@ DEFINE_INODE_EVENT(xfs_inode_set_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 
+DEFINE_INODE_EVENT(xfs_filemap_fault);
+
 DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
 	TP_ARGS(ip, caller_ip),
-- 
2.0.0

* [RFC PATCH 3/6] xfs: use i_mmaplock on write faults
From: Dave Chinner @ 2015-01-07 22:25 UTC
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Take the i_mmaplock over write page faults. These come through the
->page_mkwrite callout, so we need to wrap those calls with the
i_mmaplock.

This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
-> i_lock.

Also, move the page_mkwrite wrapper to the same region of xfs_file.c
as the read fault wrappers and add a tracepoint.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 39 ++++++++++++++++++++++++---------------
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 87535e6..e6e7e75 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -961,20 +961,6 @@ xfs_file_mmap(
 }
 
 /*
- * mmap()d file has taken write protection fault and is being made
- * writable. We can set the page state up correctly for a writable
- * page, which means we can do correct delalloc accounting (ENOSPC
- * checking!) and unwritten extent mapping.
- */
-STATIC int
-xfs_vm_page_mkwrite(
-	struct vm_area_struct	*vma,
-	struct vm_fault		*vmf)
-{
-	return block_page_mkwrite(vma, vmf, xfs_get_blocks);
-}
-
-/*
  * This type is designed to indicate the type of offset we would like
  * to search from page cache for xfs_seek_hole_data().
  */
@@ -1375,6 +1361,29 @@ xfs_filemap_fault(
 	return error;
 }
 
+/*
+ * mmap()d file has taken write protection fault and is being made writable. We
+ * can set the page state up correctly for a writable page, which means we can
+ * do correct delalloc accounting (ENOSPC checking!) and unwritten extent
+ * mapping.
+ */
+STATIC int
+xfs_filemap_page_mkwrite(
+	struct vm_area_struct	*vma,
+	struct vm_fault		*vmf)
+{
+	struct xfs_inode	*ip = XFS_I(vma->vm_file->f_mapping->host);
+	int			error;
+
+	trace_xfs_filemap_page_mkwrite(ip);
+
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	error = block_page_mkwrite(vma, vmf, xfs_get_blocks);
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+
+	return error;
+}
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read		= new_sync_read,
@@ -1409,6 +1418,6 @@ const struct file_operations xfs_dir_file_operations = {
 static const struct vm_operations_struct xfs_file_vm_ops = {
 	.fault		= xfs_filemap_fault,
 	.map_pages	= filemap_map_pages,
-	.page_mkwrite	= xfs_vm_page_mkwrite,
+	.page_mkwrite	= xfs_filemap_page_mkwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index c496153..b1e059b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -686,6 +686,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);
 
 DEFINE_INODE_EVENT(xfs_filemap_fault);
+DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);
 
 DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip),
-- 
2.0.0

* [RFC PATCH 4/6] xfs: take i_mmap_lock on extent manipulation operations
  2015-01-07 22:25 ` Dave Chinner
@ 2015-01-07 22:25   ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-07 22:25 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Now that the i_mmaplock is held across the page fault IO path, we
can add exclusion for extent manipulation operations by taking the
lock in the paths that directly modify extent maps. This includes
truncate, hole punching and other fallocate based operations. These
operations now take both the i_iolock and the i_mmaplock in
exclusive mode, thereby ensuring that all IO and page faults block,
without holding any page locks, while the extent manipulation is in
progress.

This gives us a lock order during truncate of i_iolock ->
i_mmaplock -> page_lock -> i_lock, i.e. the same lock order that
the iolock provides on the normal IO path, but without involving
the mmap_sem.
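
To make the exclusion concrete, here is a minimal sketch (not part
of this patch, just simplified from the series as a whole) of the
two sides of the new lock:

	/* page fault side (->fault, ->page_mkwrite): shared mode */
	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
	/* ... fault in or write-enable the page ... */
	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

	/* extent manipulation side: exclusive, blocks all new faults */
	xfs_ilock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
	/* ... invalidate page cache, modify the extent map ... */
	xfs_iunlock(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);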

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c  | 4 ++--
 fs/xfs/xfs_ioctl.c | 4 ++--
 fs/xfs/xfs_iops.c  | 6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e6e7e75..b08c9e6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -794,7 +794,7 @@ xfs_file_fallocate(
 		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
 		return -EOPNOTSUPP;
 
-	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		error = xfs_free_file_space(ip, offset, len);
 		if (error)
@@ -874,7 +874,7 @@ xfs_file_fallocate(
 	}
 
 out_unlock:
-	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a183198..8810959 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -634,7 +634,7 @@ xfs_ioc_space(
 	if (error)
 		return error;
 
-	xfs_ilock(ip, XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
@@ -751,7 +751,7 @@ xfs_ioc_space(
 	error = xfs_trans_commit(tp, 0);
 
 out_unlock:
-	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 	mnt_drop_write_file(filp);
 	return error;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8be5bb5..f491860 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -768,7 +768,7 @@ xfs_setattr_size(
 	if (error)
 		return error;
 
-	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
+	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
 	ASSERT(S_ISREG(ip->i_d.di_mode));
 	ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
 		ATTR_MTIME_SET|ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
@@ -984,9 +984,9 @@ xfs_vn_setattr(
 	int			error;
 
 	if (iattr->ia_valid & ATTR_SIZE) {
-		xfs_ilock(ip, XFS_IOLOCK_EXCL);
+		xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 		error = xfs_setattr_size(ip, iattr);
-		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+		xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
 	} else {
 		error = xfs_setattr_nonsize(ip, iattr, 0);
 	}
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 5/6] xfs: xfs_setattr_size no longer races with page faults
  2015-01-07 22:25 ` Dave Chinner
@ 2015-01-07 22:25   ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-07 22:25 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Now that truncate locks out new page faults, we no longer need to do
special writeback hacks in truncate to work around potential races
between page faults, page cache truncation and file size updates to
ensure we get write page faults for extending truncates on sub-page
block size filesystems. Hence we can remove the code in
xfs_setattr_size() that handles this and update the comments around
the code that handles page cache truncation and size updates to
reflect the new reality.
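
With faults locked out, the truncate sequence reduces to the
following (a simplified sketch only; the caller already holds
XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL as per the previous patch):

	error = block_truncate_page(inode->i_mapping, newsize,
				    xfs_get_blocks);
	if (error)
		return error;
	/* no page fault can repopulate or redirty this range now */
	truncate_setsize(inode, newsize);

	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);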

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c | 56 ++++++++++++++-----------------------------------------
 1 file changed, 14 insertions(+), 42 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index f491860..a3b130f 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -841,55 +841,27 @@ xfs_setattr_size(
 	inode_dio_wait(inode);
 
 	/*
-	 * Do all the page cache truncate work outside the transaction context
-	 * as the "lock" order is page lock->log space reservation.  i.e.
-	 * locking pages inside the transaction can ABBA deadlock with
-	 * writeback. We have to do the VFS inode size update before we truncate
-	 * the pagecache, however, to avoid racing with page faults beyond the
-	 * new EOF they are not serialised against truncate operations except by
-	 * page locks and size updates.
+	 * We've already locked out new page faults, so now we can safely remove
+	 * pages from the page cache knowing they won't get refaulted until we
+	 * drop the XFS_MMAPLOCK_EXCL lock after the extent manipulations are
+	 * complete. The truncate_setsize() call also cleans partial EOF page
+	 * PTEs on extending truncates and hence ensures sub-page block size
+	 * filesystems are correctly handled, too.
 	 *
-	 * Hence we are in a situation where a truncate can fail with ENOMEM
-	 * from xfs_trans_reserve(), but having already truncated the in-memory
-	 * version of the file (i.e. made user visible changes). There's not
-	 * much we can do about this, except to hope that the caller sees ENOMEM
-	 * and retries the truncate operation.
+	 * We have to do all the page cache truncate work outside the
+	 * transaction context as the "lock" order is page lock->log space
+	 * reservation as defined by extent allocation in the writeback path.
+	 * Hence a truncate can fail with ENOMEM from xfs_trans_reserve(), but
+	 * having already truncated the in-memory version of the file (i.e. made
+	 * user visible changes). There's not much we can do about this, except
+	 * to hope that the caller sees ENOMEM and retries the truncate
+	 * operation.
 	 */
 	error = block_truncate_page(inode->i_mapping, newsize, xfs_get_blocks);
 	if (error)
 		return error;
 	truncate_setsize(inode, newsize);
 
-	/*
-	 * The "we can't serialise against page faults" pain gets worse.
-	 *
-	 * If the file is mapped then we have to clean the page at the old EOF
-	 * when extending the file. Extending the file can expose changes the
-	 * underlying page mapping (e.g. from beyond EOF to a hole or
-	 * unwritten), and so on the next attempt to write to that page we need
-	 * to remap it for write. i.e. we need .page_mkwrite() to be called.
-	 * Hence we need to clean the page to clean the pte and so a new write
-	 * fault will be triggered appropriately.
-	 *
-	 * If we do it before we change the inode size, then we can race with a
-	 * page fault that maps the page with exactly the same problem. If we do
-	 * it after we change the file size, then a new page fault can come in
-	 * and allocate space before we've run the rest of the truncate
-	 * transaction. That's kinda grotesque, but it's better than have data
-	 * over a hole, and so that's the lesser evil that has been chosen here.
-	 *
-	 * The real solution, however, is to have some mechanism for locking out
-	 * page faults while a truncate is in progress.
-	 */
-	if (newsize > oldsize && mapping_mapped(VFS_I(ip)->i_mapping)) {
-		error = filemap_write_and_wait_range(
-				VFS_I(ip)->i_mapping,
-				round_down(oldsize, PAGE_CACHE_SIZE),
-				round_up(oldsize, PAGE_CACHE_SIZE) - 1);
-		if (error)
-			return error;
-	}
-
 	tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_SIZE);
 	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_itruncate, 0, 0);
 	if (error)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 6/6] xfs: lock out page faults from extent swap operations
  2015-01-07 22:25 ` Dave Chinner
@ 2015-01-07 22:25   ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-07 22:25 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Extent swap is another extent manipulation operation that we need
to ensure does not race against mmap page faults. The current code
returns -EBUSY if the file is mapped prior to the swap being done,
but it could still race against new page faults while the swap is
in progress. Hence we should take XFS_MMAPLOCK_EXCL for this
operation, too.
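
Because xfs_lock_two_inodes() can only take one lock class at a
time (it orders the two inodes by inode number within each class),
the swap path takes the two classes in separate, correctly ordered
calls. Roughly:

	/* both iolocks first, then both mmaplocks */
	xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
	xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);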

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 22a5dcb..1420caf 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1599,13 +1599,6 @@ xfs_swap_extent_flush(
 	/* Verify O_DIRECT for ftmp */
 	if (VFS_I(ip)->i_mapping->nrpages)
 		return -EINVAL;
-
-	/*
-	 * Don't try to swap extents on mmap()d files because we can't lock
-	 * out races against page faults safely.
-	 */
-	if (mapping_mapped(VFS_I(ip)->i_mapping))
-		return -EBUSY;
 	return 0;
 }
 
@@ -1633,13 +1626,14 @@ xfs_swap_extents(
 	}
 
 	/*
-	 * Lock up the inodes against other IO and truncate to begin with.
-	 * Then we can ensure the inodes are flushed and have no page cache
-	 * safely. Once we have done this we can take the ilocks and do the rest
-	 * of the checks.
+	 * Lock the inodes against other IO, page faults and truncate to
+	 * begin with.  Then we can ensure the inodes are flushed and have no
+	 * page cache safely. Once we have done this we can take the ilocks and
+	 * do the rest of the checks.
 	 */
-	lock_flags = XFS_IOLOCK_EXCL;
+	lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
+	xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);
 
 	/* Verify that both files have the same format */
 	if ((ip->i_d.di_mode & S_IFMT) != (tip->i_d.di_mode & S_IFMT)) {
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
  2015-01-07 22:25 ` Dave Chinner
@ 2015-01-08 11:34   ` Jan Kara
  -1 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2015-01-08 11:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-mm

  Hi,

On Thu 08-01-15 09:25:37, Dave Chinner wrote:
> This patch set is an attempt to address issues with XFS
> truncate and hole-punch code from racing with page faults that enter
> the IO path. This is traditionally deadlock prone due to the
> inversion of filesystem IO path locks and the mmap_sem.
> 
> To avoid this issue, I have introduced a new "i_mmaplock" rwsem into
> the XFS code similar to the IO lock, but this lock is only taken in
> the mmap fault paths on entry into the filesystem (i.e. ->fault and
> ->page_mkwrite).
> 
> The concept is that if we invalidate the page cache over a range
> after taking both the existing i_iolock and the new i_mmaplock, we
> will have prevented any vector for repopulation of the page cache
> over the invalidated range until one of the io and mmap locks has
> been dropped. i.e. we can guarantee that both the syscall IO path
> and page faults won't race with whatever operation the filesystem is
> performing...
> 
> The introduction of a new lock is necessary to avoid deadlocks due
> to mmap_sem entanglement. It has a defined lock order during page
> faults of:
> 
> mmap_sem
> -> i_mmaplock (read)
>    -> page lock
>       -> i_ilock (get blocks)
> 
> This lock is then taken by any extent manipulation code in XFS in
> addition to the IO lock which has the lock ordering of
> 
> i_iolock (write)
> -> i_mmaplock (write)
>    -> page lock (data writeback, page invalidation)
>       -> i_lock (data writeback)
>    -> i_lock (modification transaction)
> 
> Hence we have consistent lock ordering (which has been validated so
> far by testing with lockdep enabled) for page fault IO vs
> truncate, hole punch, extent shifts, etc.
> 
> This patchset passes xfstests and various benchmarks and stress
> workloads, so the real question is now:
> 
> 	What have I missed?
> 
> Comments, thoughts, flames?
  I had a look at the patches and as far as I can tell this should work
fine (at least from the VFS / MM POV).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
  2015-01-07 22:25 ` Dave Chinner
@ 2015-01-08 12:24   ` Christoph Hellwig
  -1 siblings, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2015-01-08 12:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-mm

> This patchset passes xfstests and various benchmarks and stress
> workloads, so the real question is now:
> 
> 	What have I missed?
> 
> Comments, thoughts, flames?

Why is this done in XFS and not in generic code?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
  2015-01-08 12:24   ` Christoph Hellwig
@ 2015-01-08 21:45     ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-08 21:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 08, 2015 at 04:24:48AM -0800, Christoph Hellwig wrote:
> > This patchset passes xfstests and various benchmarks and stress
> > workloads, so the real question is now:
> > 
> > 	What have I missed?
> > 
> > Comments, thoughts, flames?
> 
> Why is this done in XFS and not in generic code?

We might be able to do that. Indeed, if the concept is considered
sound, then this was the next question I was going to ask everyone.
I just don't know enough about other filesystem locking to be able
to say "this will always work", hence the wide distribution of the
RFC.  Different filesystems have different locking heirarchies, and
so there may be some are not able to use this technique
(cluster/network fs?)....

In the end, however, the main reason I decided on doing it in XFS
first was things like the swap extent operation, which requires us to
lock multiple locks on two inodes in a specific order. We already
have all the infrastructure in XFS to enforce and *validate at
runtime* the specific lock ordering required, so it just made it a
no-brainer to do it this way first.
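
For example, the extent manipulation paths can validate that
contract at runtime (this assert is already in the series):

	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));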

We also have several entry points in XFS that don't go through the
VFS but need this page fault serialisation, and they currently
only use XFS internal locks to serialise against IO.  Again, doing
it in XFS first is the easy-to-validate solution.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
  2015-01-08 12:24   ` Christoph Hellwig
@ 2015-01-12 17:42     ` Jan Kara
  -1 siblings, 0 replies; 43+ messages in thread
From: Jan Kara @ 2015-01-12 17:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, xfs, linux-fsdevel, linux-mm

On Thu 08-01-15 04:24:48, Christoph Hellwig wrote:
> > This patchset passes xfstests and various benchmarks and stress
> > workloads, so the real question is now:
> > 
> > 	What have I missed?
> > 
> > Comments, thoughts, flames?
> 
> Why is this done in XFS and not in generic code?
  I was also thinking about this. In the end I decided not to propose this
since the new rw-lock would grow struct inode and is actually necessary
only for filesystems implementing hole punching AFAICS. And that isn't
supported by that many filesystems. So an fs-private implementation,
which isn't that complicated, looked like a reasonable solution to me...
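
For illustration only, the generic version would mean growing every
inode with something like this (hypothetical field, not something
anyone has actually proposed):

	struct inode {
		...
		struct rw_semaphore	i_mmaplock; /* faults vs truncate */
		...
	};

i.e. one more rwsem per cached inode, whether or not the filesystem
supports hole punching.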

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion
  2015-01-12 17:42     ` Jan Kara
@ 2015-01-21 22:26       ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-21 22:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, xfs, linux-fsdevel, linux-mm

On Mon, Jan 12, 2015 at 06:42:58PM +0100, Jan Kara wrote:
> On Thu 08-01-15 04:24:48, Christoph Hellwig wrote:
> > > This patchset passes xfstests and various benchmarks and stress
> > > workloads, so the real question is now:
> > > 
> > > 	What have I missed?
> > > 
> > > Comments, thoughts, flames?
> > 
> > Why is this done in XFS and not in generic code?
>   I was also thinking about this. In the end I decided not to propose this
> since the new rw-lock would grow struct inode and is actually necessary
> only for filesystems implementing hole punching AFAICS. And that isn't
> supported by that many filesystems. So fs private implementation which
> isn't that complicated looked like a reasonable solution to me...

Ok, so it seems that doing this in the filesystem itself as an
initial solution is the way to move forward. Given that, this
patchset has run through regression and stress testing for a couple
of weeks without uncovering problems, so now I'm looking for reviews
so I can commit it. Anyone?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/6] xfs: introduce mmap/truncate lock
  2015-01-07 22:25   ` Dave Chinner
@ 2015-01-22 13:09     ` Brian Foster
  -1 siblings, 0 replies; 43+ messages in thread
From: Brian Foster @ 2015-01-22 13:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 08, 2015 at 09:25:38AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now we cannot serialise mmap against truncate or hole punch
> sanely. ->page_mkwrite is not able to take locks that the read IO
> path normally takes (i.e. the inode iolock) because that could
> result in lock inversions (read - iolock - page fault - page_mkwrite
> - iolock) and so we cannot use an IO path lock to serialise page
> write faults against truncate operations.
> 
> Instead, introduce a new lock that is used *only* in the
> ->page_mkwrite path that is the equivalent of the iolock. The lock
> ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
> and so in truncate we can i_iolock -> i_mmaplock and so lock out
> new write faults during the process of truncation.
> 
> Because i_mmap_lock is outside the page lock, we can hold it across
> all the same operations we hold the i_iolock for. The only
> difference is that we never hold the i_mmaplock in the normal IO
> path and so do not ever have the possibility that we can page fault
> inside it. Hence there are no recursion issues on the i_mmap_lock
> and so we can use it to serialise page fault IO against inode
> modification operations that affect the IO path.
> 
> This patch introduces the i_mmaplock infrastructure, lockdep
> annotations and initialisation/destruction code. Use of the new lock
> will be in subsequent patches.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_inode.h | 29 +++++++++++++-----
>  fs/xfs/xfs_super.c |  2 ++
>  3 files changed, 95 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 400791a..573b49c 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -150,6 +150,8 @@ xfs_ilock(
>  	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
>  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> +	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
> +	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));

The comment that precedes xfs_ilock() explains the locks that exist
within the inode, locking order, etc. We should probably update it to
explain how i_mmap_lock fits in as well (e.g., text from the commit log
description would suffice, imo).

>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
> @@ -159,6 +161,11 @@ xfs_ilock(
>  	else if (lock_flags & XFS_IOLOCK_SHARED)
>  		mraccess_nested(&ip->i_iolock, XFS_IOLOCK_DEP(lock_flags));
>  
> +	if (lock_flags & XFS_MMAPLOCK_EXCL)
> +		mrupdate_nested(&ip->i_mmaplock, XFS_IOLOCK_DEP(lock_flags));
> +	else if (lock_flags & XFS_MMAPLOCK_SHARED)
> +		mraccess_nested(&ip->i_mmaplock, XFS_IOLOCK_DEP(lock_flags));
> +

XFS_MMAPLOCK_DEP()?

>  	if (lock_flags & XFS_ILOCK_EXCL)
>  		mrupdate_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
>  	else if (lock_flags & XFS_ILOCK_SHARED)
> @@ -191,6 +198,8 @@ xfs_ilock_nowait(
>  	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
>  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> +	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
> +	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
> @@ -202,21 +211,35 @@ xfs_ilock_nowait(
>  		if (!mrtryaccess(&ip->i_iolock))
>  			goto out;
>  	}
> +
> +	if (lock_flags & XFS_MMAPLOCK_EXCL) {
> +		if (!mrtryupdate(&ip->i_mmaplock))
> +			goto out_undo_iolock;
> +	} else if (lock_flags & XFS_MMAPLOCK_SHARED) {
> +		if (!mrtryaccess(&ip->i_mmaplock))
> +			goto out_undo_iolock;
> +	}
> +
>  	if (lock_flags & XFS_ILOCK_EXCL) {
>  		if (!mrtryupdate(&ip->i_lock))
> -			goto out_undo_iolock;
> +			goto out_undo_mmaplock;
>  	} else if (lock_flags & XFS_ILOCK_SHARED) {
>  		if (!mrtryaccess(&ip->i_lock))
> -			goto out_undo_iolock;
> +			goto out_undo_mmaplock;
>  	}
>  	return 1;
>  
> - out_undo_iolock:
> +out_undo_mmaplock:
> +	if (lock_flags & XFS_MMAPLOCK_EXCL)
> +		mrunlock_excl(&ip->i_mmaplock);
> +	else if (lock_flags & XFS_MMAPLOCK_SHARED)
> +		mrunlock_shared(&ip->i_mmaplock);
> +out_undo_iolock:
>  	if (lock_flags & XFS_IOLOCK_EXCL)
>  		mrunlock_excl(&ip->i_iolock);
>  	else if (lock_flags & XFS_IOLOCK_SHARED)
>  		mrunlock_shared(&ip->i_iolock);
> - out:
> +out:
>  	return 0;
>  }
>  
> @@ -244,6 +267,8 @@ xfs_iunlock(
>  	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
>  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> +	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
> +	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
> @@ -254,6 +279,11 @@ xfs_iunlock(
>  	else if (lock_flags & XFS_IOLOCK_SHARED)
>  		mrunlock_shared(&ip->i_iolock);
>  
> +	if (lock_flags & XFS_MMAPLOCK_EXCL)
> +		mrunlock_excl(&ip->i_mmaplock);
> +	else if (lock_flags & XFS_MMAPLOCK_SHARED)
> +		mrunlock_shared(&ip->i_mmaplock);
> +
>  	if (lock_flags & XFS_ILOCK_EXCL)
>  		mrunlock_excl(&ip->i_lock);
>  	else if (lock_flags & XFS_ILOCK_SHARED)
> @@ -271,11 +301,14 @@ xfs_ilock_demote(
>  	xfs_inode_t		*ip,
>  	uint			lock_flags)
>  {
> -	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL));
> -	ASSERT((lock_flags & ~(XFS_IOLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> +	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
> +	ASSERT((lock_flags &
> +		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
>  
>  	if (lock_flags & XFS_ILOCK_EXCL)
>  		mrdemote(&ip->i_lock);
> +	if (lock_flags & XFS_MMAPLOCK_EXCL)
> +		mrdemote(&ip->i_mmaplock);
>  	if (lock_flags & XFS_IOLOCK_EXCL)
>  		mrdemote(&ip->i_iolock);
>  
> @@ -294,6 +327,12 @@ xfs_isilocked(
>  		return rwsem_is_locked(&ip->i_lock.mr_lock);
>  	}
>  
> +	if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) {
> +		if (!(lock_flags & XFS_MMAPLOCK_SHARED))
> +			return !!ip->i_mmaplock.mr_writer;
> +		return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
> +	}
> +
>  	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
>  		if (!(lock_flags & XFS_IOLOCK_SHARED))
>  			return !!ip->i_iolock.mr_writer;
> @@ -314,14 +353,27 @@ int xfs_lock_delays;
>  #endif
>  
>  /*
> - * Bump the subclass so xfs_lock_inodes() acquires each lock with
> - * a different value
> + * Bump the subclass so xfs_lock_inodes() acquires each lock with a different
> + * value. This shouldn't be called for page fault locking, but we also need to
> + * ensure we don't overrun the number of lockdep subclasses for the iolock or
> + * mmaplock as that is limited to 12 by the mmap lock lockdep annotations.
>   */
>  static inline int
>  xfs_lock_inumorder(int lock_mode, int subclass)
>  {
> -	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
> +	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
> +		ASSERT(subclass + XFS_LOCK_INUMORDER <
> +			(1 << (XFS_MMAPLOCK_SHIFT - XFS_IOLOCK_SHIFT)));
>  		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_IOLOCK_SHIFT;
> +	}
> +
> +	if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) {
> +		ASSERT(subclass + XFS_LOCK_INUMORDER <
> +			(1 << (XFS_ILOCK_SHIFT - XFS_MMAPLOCK_SHIFT)));
> +		lock_mode |= (subclass + XFS_LOCK_INUMORDER) <<
> +							XFS_MMAPLOCK_SHIFT;
> +	}
> +
>  	if (lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))
>  		lock_mode |= (subclass + XFS_LOCK_INUMORDER) << XFS_ILOCK_SHIFT;
>  
> @@ -440,10 +492,10 @@ again:
>  }
>  
>  /*
> - * xfs_lock_two_inodes() can only be used to lock one type of lock
> - * at a time - the iolock or the ilock, but not both at once. If
> - * we lock both at once, lockdep will report false positives saying
> - * we have violated locking orders.
> + * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
> + * the iolock, the mmaplock or the ilock, but not more than one at a time. If we
> + * lock more than one at a time, lockdep will report false positives saying we
> + * have violated locking orders.
>   */
>  void
>  xfs_lock_two_inodes(
> @@ -455,8 +507,12 @@ xfs_lock_two_inodes(
>  	int			attempts = 0;
>  	xfs_log_item_t		*lp;
>  
> -	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
> -		ASSERT((lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)) == 0);
> +	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
> +		ASSERT(!(lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
> +		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +	} else if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
> +		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> +

Should this last branch not also check for iolock flags? If not, how is
that consistent with the function comment above?

Brian

>  	ASSERT(ip0->i_ino != ip1->i_ino);
>  
>  	if (ip0->i_ino > ip1->i_ino) {
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index de97ccc..8e7a12a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -56,6 +56,7 @@ typedef struct xfs_inode {
>  	struct xfs_inode_log_item *i_itemp;	/* logging information */
>  	mrlock_t		i_lock;		/* inode lock */
>  	mrlock_t		i_iolock;	/* inode IO lock */
> +	mrlock_t		i_mmaplock;	/* inode mmap IO lock */
>  	atomic_t		i_pincount;	/* inode pin count */
>  	spinlock_t		i_flags_lock;	/* inode i_flags lock */
>  	/* Miscellaneous state. */
> @@ -263,15 +264,20 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>  #define	XFS_IOLOCK_SHARED	(1<<1)
>  #define	XFS_ILOCK_EXCL		(1<<2)
>  #define	XFS_ILOCK_SHARED	(1<<3)
> +#define	XFS_MMAPLOCK_EXCL	(1<<4)
> +#define	XFS_MMAPLOCK_SHARED	(1<<5)
>  
>  #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
> -				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED)
> +				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> +				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
>  
>  #define XFS_LOCK_FLAGS \
>  	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
>  	{ XFS_IOLOCK_SHARED,	"IOLOCK_SHARED" }, \
>  	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
> -	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }
> +	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
> +	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
> +	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
>  
>  
>  /*
> @@ -302,17 +308,26 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>  #define XFS_IOLOCK_SHIFT	16
>  #define	XFS_IOLOCK_PARENT	(XFS_LOCK_PARENT << XFS_IOLOCK_SHIFT)
>  
> +#define XFS_MMAPLOCK_SHIFT	20
> +
>  #define XFS_ILOCK_SHIFT		24
>  #define	XFS_ILOCK_PARENT	(XFS_LOCK_PARENT << XFS_ILOCK_SHIFT)
>  #define	XFS_ILOCK_RTBITMAP	(XFS_LOCK_RTBITMAP << XFS_ILOCK_SHIFT)
>  #define	XFS_ILOCK_RTSUM		(XFS_LOCK_RTSUM << XFS_ILOCK_SHIFT)
>  
> -#define XFS_IOLOCK_DEP_MASK	0x00ff0000
> +#define XFS_IOLOCK_DEP_MASK	0x000f0000
> +#define XFS_MMAPLOCK_DEP_MASK	0x00f00000
>  #define XFS_ILOCK_DEP_MASK	0xff000000
> -#define XFS_LOCK_DEP_MASK	(XFS_IOLOCK_DEP_MASK | XFS_ILOCK_DEP_MASK)
> -
> -#define XFS_IOLOCK_DEP(flags)	(((flags) & XFS_IOLOCK_DEP_MASK) >> XFS_IOLOCK_SHIFT)
> -#define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
> +#define XFS_LOCK_DEP_MASK	(XFS_IOLOCK_DEP_MASK | \
> +				 XFS_MMAPLOCK_DEP_MASK | \
> +				 XFS_ILOCK_DEP_MASK)
> +
> +#define XFS_IOLOCK_DEP(flags)	(((flags) & XFS_IOLOCK_DEP_MASK) \
> +					>> XFS_IOLOCK_SHIFT)
> +#define XFS_MMAPLOCK_DEP(flags)	(((flags) & XFS_MMAPLOCK_DEP_MASK) \
> +					>> XFS_MMAPLOCK_SHIFT)
> +#define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) \
> +					>> XFS_ILOCK_SHIFT)
>  
>  /*
>   * For multiple groups support: if S_ISGID bit is set in the parent
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index afd6bae..40d2ac5 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -986,6 +986,8 @@ xfs_fs_inode_init_once(
>  	atomic_set(&ip->i_pincount, 0);
>  	spin_lock_init(&ip->i_flags_lock);
>  
> +	mrlock_init(&ip->i_mmaplock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
> +		     "xfsino", ip->i_ino);
>  	mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER,
>  		     "xfsino", ip->i_ino);
>  }
> -- 
> 2.0.0
> 
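
For anyone tracing the bit layout in the xfs_inode.h hunk above: each
lock class gets its own subclass field in the flags word, 4 bits for
the iolock (bits 16-19) and the mmaplock (bits 20-23), 8 bits for the
ilock (bits 24-31), which is also where the headroom asserts in
xfs_lock_inumorder() come from. A worked example using only constants
from this patch (an illustrative sketch, not part of the change):

	/* subclass 3 on the mmaplock, exclusive */
	uint lock_flags = XFS_MMAPLOCK_EXCL | (3 << XFS_MMAPLOCK_SHIFT);

	/*
	 * XFS_MMAPLOCK_DEP(lock_flags)
	 *	== ((lock_flags & 0x00f00000) >> 20) == 3
	 *
	 * Headroom: XFS_ILOCK_SHIFT - XFS_MMAPLOCK_SHIFT == 4 bits,
	 * so subclass + XFS_LOCK_INUMORDER must stay below 1 << 4 == 16.
	 */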

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 4/6] xfs: take i_mmap_lock on extent manipulation operations
  2015-01-07 22:25   ` Dave Chinner
@ 2015-01-22 13:23     ` Brian Foster
  -1 siblings, 0 replies; 43+ messages in thread
From: Brian Foster @ 2015-01-22 13:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 08, 2015 at 09:25:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the i_mmap_lock held across the page fault IO
> path, we add extent manipulation operation exclusion by adding
> the lock to the paths that directly modify extent maps. This
> includes truncate, hole punching and other fallocate based
> operations. The operations will now take both the i_iolock and the
> i_mmaplock in exclusive mode, thereby ensuring that all IO and page
> faults block without holding any page locks while the extent
> manipulation is in progress.
> 
> This gives us the lock order during truncate of i_iolock ->
> i_mmaplock -> page_lock -> i_lock, hence providing the same
> lock order as the iolock provides the normal IO path without
> involving the mmap_sem.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_file.c  | 4 ++--
>  fs/xfs/xfs_ioctl.c | 4 ++--
>  fs/xfs/xfs_iops.c  | 6 +++---
>  3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e6e7e75..b08c9e6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -794,7 +794,7 @@ xfs_file_fallocate(
>  		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
>  		return -EOPNOTSUPP;
>  
> -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> +	xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		error = xfs_free_file_space(ip, offset, len);
>  		if (error)
> @@ -874,7 +874,7 @@ xfs_file_fallocate(
>  	}
>  
>  out_unlock:
> -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> +	xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index a183198..8810959 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -634,7 +634,7 @@ xfs_ioc_space(
>  	if (error)
>  		return error;
>  
> -	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> +	xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  
>  	switch (bf->l_whence) {
>  	case 0: /*SEEK_SET*/
> @@ -751,7 +751,7 @@ xfs_ioc_space(
>  	error = xfs_trans_commit(tp, 0);
>  
>  out_unlock:
> -	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> +	xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  	mnt_drop_write_file(filp);
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8be5bb5..f491860 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -768,7 +768,7 @@ xfs_setattr_size(
>  	if (error)
>  		return error;
>  
> -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));

Only debug code of course, but xfs_isilocked() doesn't appear to support
what is intended by this call (e.g., verification of multiple locks).
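
As written, xfs_isilocked() returns from whichever lock-class branch
matches first, so a combined-flags call only ever verifies one of the
locks. A rough sketch of a multi-class variant (hypothetical, named
here for illustration only, and not necessarily the eventual fix):

	/* hypothetical helper: verify every lock class requested */
	static int
	xfs_isilocked_all(
		struct xfs_inode	*ip,
		uint			lock_flags)
	{
		if ((lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) &&
		    !xfs_isilocked(ip, lock_flags &
				    (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)))
			return 0;
		if ((lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) &&
		    !xfs_isilocked(ip, lock_flags &
				    (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)))
			return 0;
		if ((lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) &&
		    !xfs_isilocked(ip, lock_flags &
				    (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)))
			return 0;
		return 1;
	}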

Brian

>  	ASSERT(S_ISREG(ip->i_d.di_mode));
>  	ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
>  		ATTR_MTIME_SET|ATTR_KILL_PRIV|ATTR_TIMES_SET)) == 0);
> @@ -984,9 +984,9 @@ xfs_vn_setattr(
>  	int			error;
>  
>  	if (iattr->ia_valid & ATTR_SIZE) {
> -		xfs_ilock(ip, XFS_IOLOCK_EXCL);
> +		xfs_ilock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  		error = xfs_setattr_size(ip, iattr);
> -		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
> +		xfs_iunlock(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL);
>  	} else {
>  		error = xfs_setattr_nonsize(ip, iattr, 0);
>  	}
> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 6/6] xfs: lock out page faults from extent swap operations
  2015-01-07 22:25   ` Dave Chinner
@ 2015-01-22 13:41     ` Brian Foster
  -1 siblings, 0 replies; 43+ messages in thread
From: Brian Foster @ 2015-01-22 13:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 08, 2015 at 09:25:43AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Extent swap operations are another extent manipulation operation
> that we need to ensure does not race against mmap page faults. The
> current code returns if the file is mapped prior to the swap being
> done, but it could potentially race against new page faults while
> the swap is in progress. Hence we should use the XFS_MMAPLOCK_EXCL
> for this operation, too.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 22a5dcb..1420caf 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1599,13 +1599,6 @@ xfs_swap_extent_flush(
>  	/* Verify O_DIRECT for ftmp */
>  	if (VFS_I(ip)->i_mapping->nrpages)
>  		return -EINVAL;
> -
> -	/*
> -	 * Don't try to swap extents on mmap()d files because we can't lock
> -	 * out races against page faults safely.
> -	 */
> -	if (mapping_mapped(VFS_I(ip)->i_mapping))
> -		return -EBUSY;
>  	return 0;
>  }
>  
> @@ -1633,13 +1626,14 @@ xfs_swap_extents(
>  	}
>  
>  	/*
> -	 * Lock up the inodes against other IO and truncate to begin with.
> -	 * Then we can ensure the inodes are flushed and have no page cache
> -	 * safely. Once we have done this we can take the ilocks and do the rest
> -	 * of the checks.
> +	 * Lock the inodes against other IO, page faults and truncate to
> +	 * begin with.  Then we can ensure the inodes are flushed and have no
> +	 * page cache safely. Once we have done this we can take the ilocks and
> +	 * do the rest of the checks.
>  	 */
> -	lock_flags = XFS_IOLOCK_EXCL;
> +	lock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  	xfs_lock_two_inodes(ip, tip, XFS_IOLOCK_EXCL);
> +	xfs_lock_two_inodes(ip, tip, XFS_MMAPLOCK_EXCL);
>  
>  	/* Verify that both files have the same format */
>  	if ((ip->i_d.di_mode & S_IFMT) != (tip->i_d.di_mode & S_IFMT)) {

Not introduced by this patch, but it looks like we have a couple
out_trans_cancel->out_unlock error paths after the inodes are joined to
the transaction (with lock transfer) that can result in double unlocks.
We might as well fix that up here one way or another as well...
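
The hazard pattern, sketched abstractly (the labels and lock flags here
are placeholders for illustration, not quoted from the function):

	xfs_trans_ijoin(tp, ip, lock_flags);	/* tp now owns the unlock */
	...
	if (error)
		goto out_trans_cancel;
	...
out_trans_cancel:
	xfs_trans_cancel(tp, 0);	/* cancel drops the joined locks */
out_unlock:
	xfs_iunlock(ip, lock_flags);	/* second unlock of the same lock */
	return error;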

Brian

> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/6] xfs: introduce mmap/truncate lock
  2015-01-22 13:09     ` Brian Foster
@ 2015-01-22 21:30       ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-22 21:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 22, 2015 at 08:09:06AM -0500, Brian Foster wrote:
> On Thu, Jan 08, 2015 at 09:25:38AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Right now we cannot serialise mmap against truncate or hole punch
> > sanely. ->page_mkwrite is not able to take locks that the read IO
> > path normally takes (i.e. the inode iolock) because that could
> > result in lock inversions (read - iolock - page fault - page_mkwrite
> > - iolock) and so we cannot use an IO path lock to serialise page
> > write faults against truncate operations.
.....
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -150,6 +150,8 @@ xfs_ilock(
> >  	 */
> >  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
> >  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> > +	ASSERT((lock_flags & (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL)) !=
> > +	       (XFS_MMAPLOCK_SHARED | XFS_MMAPLOCK_EXCL));
> 
> The comment that precedes xfs_ilock() explains the locks that exist
> within the inode, locking order, etc. We should probably update it to
> explain how i_mmap_lock fits in as well (e.g., text from the commit log
> description would suffice, imo).

*nod*. Will fix.

> >  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
> >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_DEP_MASK)) == 0);
> > @@ -159,6 +161,11 @@ xfs_ilock(
> >  	else if (lock_flags & XFS_IOLOCK_SHARED)
> >  		mraccess_nested(&ip->i_iolock, XFS_IOLOCK_DEP(lock_flags));
> >  
> > +	if (lock_flags & XFS_MMAPLOCK_EXCL)
> > +		mrupdate_nested(&ip->i_mmaplock, XFS_IOLOCK_DEP(lock_flags));
> > +	else if (lock_flags & XFS_MMAPLOCK_SHARED)
> > +		mraccess_nested(&ip->i_mmaplock, XFS_IOLOCK_DEP(lock_flags));
> > +
> 
> XFS_MMAPLOCK_DEP()?

Good catch.
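
(i.e. the mmaplock nesting calls should pass the mmaplock's own dep
value; a sketch of the corrected hunk, assuming nothing else in the
function changes:)

	if (lock_flags & XFS_MMAPLOCK_EXCL)
		mrupdate_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));
	else if (lock_flags & XFS_MMAPLOCK_SHARED)
		mraccess_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags));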

> > @@ -455,8 +507,12 @@ xfs_lock_two_inodes(
> >  	int			attempts = 0;
> >  	xfs_log_item_t		*lp;
> >  
> > -	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))
> > -		ASSERT((lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)) == 0);
> > +	if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
> > +		ASSERT(!(lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
> > +		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +	} else if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))
> > +		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> > +
> 
> Should this last branch not also check for iolock flags? If not, how is
> that consistent with the function comment above?

If we hit that else branch, we already know that the lock mode
does not contain IOLOCK flags. :)
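
Spelled out, the "else" makes the IOLOCK check implicit. An equivalent
sketch with the implied assertion written in as a comment:

	} else if (lock_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) {
		/* implied: !(lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) */
		ASSERT(!(lock_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
	}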

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 4/6] xfs: take i_mmap_lock on extent manipulation operations
  2015-01-22 13:23     ` Brian Foster
@ 2015-01-22 21:32       ` Dave Chinner
  -1 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-01-22 21:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-fsdevel, linux-mm

On Thu, Jan 22, 2015 at 08:23:08AM -0500, Brian Foster wrote:
> On Thu, Jan 08, 2015 at 09:25:41AM +1100, Dave Chinner wrote:
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 8be5bb5..f491860 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -768,7 +768,7 @@ xfs_setattr_size(
> >  	if (error)
> >  		return error;
> >  
> > -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> > +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
> 
> Only debug code of course, but xfs_isilocked() doesn't appear to support
> what is intended by this call (e.g., verification of multiple locks).

Ah, right. Didn't think that one through properly. I'll fix it up.
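
(One simple option, an assumption here rather than the final fix, is
to assert each lock class separately, since xfs_isilocked() verifies
one class per call:)

	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));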

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-01-22 21:32 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-07 22:25 [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 1/6] xfs: introduce mmap/truncate lock Dave Chinner
2015-01-22 13:09   ` Brian Foster
2015-01-22 21:30     ` Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 2/6] xfs: use i_mmaplock on read faults Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 3/6] xfs: use i_mmaplock on write faults Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 4/6] xfs: take i_mmap_lock on extent manipulation operations Dave Chinner
2015-01-22 13:23   ` Brian Foster
2015-01-22 21:32     ` Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 5/6] xfs: xfs_setattr_size no longer races with page faults Dave Chinner
2015-01-07 22:25 ` [RFC PATCH 6/6] xfs: lock out page faults from extent swap operations Dave Chinner
2015-01-22 13:41   ` Brian Foster
2015-01-08 11:34 ` [RFC PATCH 0/6] xfs: truncate vs page fault IO exclusion Jan Kara
2015-01-08 12:24 ` Christoph Hellwig
2015-01-08 21:45   ` Dave Chinner
2015-01-12 17:42   ` Jan Kara
2015-01-21 22:26     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.