* [PATCH 0/6 RFC] Mapping range lock
@ 2013-01-31 21:49 ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

  Hello,

  As I promised in my LSF/MM summit proposal, here are initial patches
implementing the mapping range lock. ext3 is converted to fully use the
range locks; converting other filesystems shouldn't be difficult, but I want
to spend time on that only after we are sure this is what we want. The part
below is copied from the LSF/MM proposal, followed by some performance
numbers.

There are several different motivations for implementing mapping range
locking:

a) Punch hole is currently racy wrt mmap (a page can be faulted in within the
   punched range after the page cache has been invalidated), leading to nasty
   results such as fs corruption (we can end up writing to an already freed
   block), user exposure of uninitialized data, etc. To fix this we need a new
   mechanism serializing hole punching and page faults.

b) There is an uncomfortable number of mechanisms serializing various paths
   manipulating the pagecache and the data underlying it. We have i_mutex,
   the page lock, checks for pages beyond EOF in the pagefault code, and
   i_dio_count for direct IO. Different pairs of operations are serialized by
   different mechanisms and not all the cases are covered. Case (a) above is
   likely the worst, but DIO vs buffered IO isn't ideal either (we provide
   only limited consistency). Range locking should somewhat simplify
   serialization of pagecache operations: i_dio_count can be removed
   completely, and i_mutex to a certain extent (we still need something for
   things like timestamp updates, and possibly for i_size changes, although
   I think those can be dealt with).

c) i_mutex doesn't allow any parallelism of operations using it and some
   filesystems work around this for specific cases (e.g. DIO reads). Using
   range locking allows concurrent operations (e.g. writes, DIO) on
   different parts of the file. Of course, range locking itself isn't
   enough to make the parallelism possible. Filesystems still have to
   somehow deal with the concurrency when manipulating inode allocation
   data. But the range locking at least provides a common VFS mechanism for
   the serialization the VFS itself needs, and it's up to each filesystem to
   serialize more if it needs to.

How it works:
------------

The general idea is that a range lock for range x-y prevents creation of pages
in that range.
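
For reference, the basic locking pattern using the API from patch 1/6 and the
mapping_lock field added in patch 2/6 looks like this (a minimal sketch only;
'first' and 'last' stand for the page cache indices delimiting the range):

	struct range_lock rl;

	range_lock_init(&rl, first, last);
	/* Sleeps until no conflicting range is held */
	range_lock(&mapping->mapping_lock, &rl);
	/* ... create / evict pages with indices in [first, last] ... */
	range_unlock(&mapping->mapping_lock, &rl);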

In practice this means:
----------------------

All read paths adding a page to the page cache, as well as
grab_cache_page_write_begin(), first take the range lock for the index, then
insert the locked page, and finally unlock the range. See below for why
buffered IO uses range locks on a per-page basis.

DIO takes the range lock covering the pages in a bio at the moment it submits
that bio. The pagecache is then truncated and the bio submitted. The range
lock is released once the bio completes.

Punch hole for range x-y takes the range lock for that range before truncating
the page cache, and the lock is released after the filesystem blocks for the
range have been freed.

Truncate to size x is equivalent to punch hole for the range x - ~0UL.

The reason why we take the range lock for buffered IO on a per-page basis and
for DIO separately for each bio is lock ordering with mmap_sem. Page faults
need to instantiate the page under mmap_sem. That establishes mmap_sem > range
lock. Buffered IO takes mmap_sem when prefaulting pages, so we cannot hold the
range lock at that moment. Similarly, get_user_pages() in the DIO code takes
mmap_sem, so we have to be sure not to hold the range lock when calling it.
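
To illustrate the resulting ordering (a schematic sketch, not code from the
patches):

	/*
	 * Lock ordering sketch (illustration only):
	 *
	 *   page fault                      buffered IO / DIO submission
	 *   ----------                      ----------------------------
	 *   down_read(&mm->mmap_sem)        must not hold any range lock here:
	 *   range_lock(mapping, index)        prefaulting / get_user_pages()
	 *   instantiate the page              takes mmap_sem
	 *   range_unlock(mapping, index)    range lock taken only afterwards,
	 *   up_read(&mm->mmap_sem)            per page resp. per bio
	 */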

How much does it cost:
---------------------

There's a memory cost: an extra pointer and spinlock in struct address_space,
64 bytes on the stack for buffered IO, truncate, and punch hole, and a
dynamically allocated 72-byte structure for each BIO submitted by direct IO.

And there's a CPU cost. I measured it on an 8 CPU machine with 4 GB of memory,
with ext2 (yes, I added support for ext2 as well and used it for the
measurements since especially the write results are much less noisy) over a 1G
ramdisk. The workloads were generated by FIO: 1) read an 800 MB file,
2) overwrite an 800 MB file, 3) mmap read an 800 MB file. Each test was run 30
times.
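
The read job corresponds roughly to an FIO job file like the one below
(parameters here are indicative only, not the exact job file used; the
overwrite and mmap tests differ only in rw= and ioengine=):

	[global]
	filename=/mnt/ramdisk/testfile
	size=800m
	bs=4k
	direct=0

	[seq-read]
	rw=read
	ioengine=sync

	; overwrite test: rw=write, ioengine=sync
	; mmap read test: rw=read,  ioengine=mmap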

The results are here (times to complete in ms):
	Vanilla				Range Locks
	Avg		Stddev		Avg		Stddev
READ	1133.566667	11.954590	1137.06666	7.827019
WRITE	1069.300000	7.996458	1101.200000	8.607748
MMAP	1416.733333	28.459250	1421.900000	30.636960

So we see the READ and MMAP time changes are in the noise (although for reads
there seems to be about a 1% cost if I compare more tests); for WRITE the cost
barely stands out of the noise at ~3% (here I verified with perf what's going
on, and indeed the range_lock() and range_unlock() calls together cost close
to 3% of CPU time).

So the cost is noticeable. Is it a problem? Maybe, not sure... We could
likely optimize the case where the locked range is a single page, but I
wanted to start simple and get some feedback first.

								Honza


* [PATCH 1/6] lib: Implement range locks
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

Implement range locking using interval tree.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/range_lock.h |   51 ++++++++++++++++++++++++++++
 lib/Makefile               |    4 +-
 lib/range_lock.c           |   78 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 131 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/range_lock.h
 create mode 100644 lib/range_lock.c

diff --git a/include/linux/range_lock.h b/include/linux/range_lock.h
new file mode 100644
index 0000000..fe258a5
--- /dev/null
+++ b/include/linux/range_lock.h
@@ -0,0 +1,51 @@
+/*
+ * Range locking
+ *
+ * We allow exclusive locking of arbitrary ranges. We guarantee that each
+ * range is locked only after all conflicting range locks requested previously
+ * have been unlocked. Thus we achieve fairness and avoid livelocks.
+ *
+ * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
+ * total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_LOCK_H
+#define _LINUX_RANGE_LOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+
+struct task_struct;
+
+struct range_lock {
+	struct interval_tree_node node;
+	struct task_struct *task;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+};
+
+struct range_lock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+};
+
+#define RANGE_LOCK_INITIALIZER(start, end) {\
+	.node = {\
+		.start = (start),\
+		.end = (end)\
+	}\
+}
+
+static inline void range_lock_tree_init(struct range_lock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+}
+void range_lock_init(struct range_lock *lock, unsigned long start,
+		     unsigned long end);
+void range_lock(struct range_lock_tree *tree, struct range_lock *lock);
+void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 02ed6c0..04a9caa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o
+	 earlycpio.o interval_tree.o range_lock.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
@@ -144,7 +144,7 @@ lib-$(CONFIG_LIBFDT) += $(libfdt_files)
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
-interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
+interval_tree_test-objs := interval_tree_test_main.o
 
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
diff --git a/lib/range_lock.c b/lib/range_lock.c
new file mode 100644
index 0000000..1cb119b
--- /dev/null
+++ b/lib/range_lock.c
@@ -0,0 +1,78 @@
+/*
+ * Implementation of range locks.
+ *
+ * We keep interval tree of locked and to-be-locked ranges. When new range lock
+ * is requested, we add its interval to the tree and store number of intervals
+ * intersecting it to 'blocking_ranges'.
+ *
+ * When a range is unlocked, we again walk intervals that intersect with the
+ * unlocked one and decrement their 'blocking_ranges'.  We wake up owner of any
+ * range lock whose 'blocking_ranges' drops to 0.
+ */
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/spinlock.h>
+#include <linux/range_lock.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+void range_lock_init(struct range_lock *lock, unsigned long start,
+		     unsigned long end)
+{
+	lock->node.start = start;
+	lock->node.last = end;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+}
+EXPORT_SYMBOL(range_lock_init);
+
+void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		lock->blocking_ranges++;
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+	interval_tree_insert(&lock->node, &tree->root);
+	/* Do we need to go to sleep? */
+	while (lock->blocking_ranges) {
+		lock->task = current;
+		__set_current_state(TASK_UNINTERRUPTIBLE);
+		spin_unlock_irqrestore(&tree->lock, flags);
+		schedule();
+		spin_lock_irqsave(&tree->lock, flags);
+	}
+	spin_unlock_irqrestore(&tree->lock, flags);
+}
+EXPORT_SYMBOL(range_lock);
+
+static void range_lock_unblock(struct range_lock *lock)
+{
+	if (!--lock->blocking_ranges)
+		wake_up_process(lock->task);
+}
+
+void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	interval_tree_remove(&lock->node, &tree->root);
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		range_lock_unblock((struct range_lock *)node);
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+	spin_unlock_irqrestore(&tree->lock, flags);
+}
+EXPORT_SYMBOL(range_unlock);
-- 
1.7.1
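
To make the intended semantics concrete, here is a hypothetical usage sketch
(illustration only, not part of the patch); overlapping ranges serialize,
disjoint ranges proceed without waiting:

	/*
	 * Illustration only. Assumes a tree set up elsewhere with
	 * range_lock_tree_init(tree).
	 */
	static void range_lock_example(struct range_lock_tree *tree)
	{
		struct range_lock a, c;

		range_lock_init(&a, 0, 9);
		range_lock_init(&c, 100, 200);	/* disjoint from a */

		range_lock(tree, &a);	/* no conflicting range -> granted at once */
		/*
		 * A range_lock() on, say, [5, 15] from another task would now
		 * sleep: its blocking_ranges counter stays non-zero until
		 * range_unlock(tree, &a) wakes it.
		 */
		range_lock(tree, &c);	/* disjoint from a -> granted at once */

		range_unlock(tree, &c);
		range_unlock(tree, &a);
	}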



* [PATCH 2/6] fs: Take mapping lock in generic read paths
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

Add a mapping lock to struct address_space and grab it in all paths that
create pages in the page cache in order to read data into them. That means
the buffered read, readahead, and page fault code.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/inode.c              |    2 ++
 include/linux/fs.h      |    4 ++++
 include/linux/pagemap.h |    2 ++
 mm/filemap.c            |   21 ++++++++++++++++++---
 mm/readahead.c          |    8 ++++----
 5 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 14084b7..85db16c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -168,6 +168,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->private_data = NULL;
 	mapping->backing_dev_info = &default_backing_dev_info;
 	mapping->writeback_index = 0;
+	range_lock_tree_init(&mapping->mapping_lock);
 
 	/*
 	 * If the block_device provides a backing_dev_info for client
@@ -513,6 +514,7 @@ void clear_inode(struct inode *inode)
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
+	BUG_ON(inode->i_data.mapping_lock.root.rb_node);
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7617ee0..2027d25 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/range_lock.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -420,6 +421,9 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	void			*private_data;	/* ditto */
+	struct range_lock_tree	mapping_lock;	/* Lock protecting creation /
+						 * eviction of pages from
+						 * the mapping */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6da609d..ba81ea9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -537,6 +537,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
+int add_to_page_cache_read(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..4826cb4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -491,6 +491,20 @@ int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
+int add_to_page_cache_read(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask)
+{
+	struct range_lock mapping_lock;
+	int ret;
+
+	range_lock_init(&mapping_lock, offset, offset);
+	range_lock(&mapping->mapping_lock, &mapping_lock);
+	ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
+	range_unlock(&mapping->mapping_lock, &mapping_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(add_to_page_cache_read);
+
 #ifdef CONFIG_NUMA
 struct page *__page_cache_alloc(gfp_t gfp)
 {
@@ -1274,7 +1288,7 @@ no_cached_page:
 			desc->error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping,
+		error = add_to_page_cache_read(page, mapping,
 						index, GFP_KERNEL);
 		if (error) {
 			page_cache_release(page);
@@ -1493,7 +1507,8 @@ static int page_cache_read(struct file *file, pgoff_t offset)
 		if (!page)
 			return -ENOMEM;
 
-		ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+		ret = add_to_page_cache_read(page, mapping, offset,
+						GFP_KERNEL);
 		if (ret == 0)
 			ret = mapping->a_ops->readpage(file, page);
 		else if (ret == -EEXIST)
@@ -1790,7 +1805,7 @@ repeat:
 		page = __page_cache_alloc(gfp | __GFP_COLD);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
-		err = add_to_page_cache_lru(page, mapping, index, gfp);
+		err = add_to_page_cache_read(page, mapping, index, gfp);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			if (err == -EEXIST)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..28a5e40 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -89,7 +89,7 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 	while (!list_empty(pages)) {
 		page = list_to_page(pages);
 		list_del(&page->lru);
-		if (add_to_page_cache_lru(page, mapping,
+		if (add_to_page_cache_read(page, mapping,
 					page->index, GFP_KERNEL)) {
 			read_cache_pages_invalidate_page(mapping, page);
 			continue;
@@ -126,11 +126,11 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = list_to_page(pages);
+
 		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping,
-					page->index, GFP_KERNEL)) {
+		if (!add_to_page_cache_read(page, mapping,
+					page->index, GFP_KERNEL))
 			mapping->a_ops->readpage(filp, page);
-		}
 		page_cache_release(page);
 	}
 	ret = 0;
-- 
1.7.1



* [PATCH 3/6] fs: Provide function to take mapping lock in buffered write path
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

Add a flag to grab_cache_page_write_begin() which makes the function grab the
mapping range lock while creating the page in the page cache. Callers that
don't need special lock ordering for the mapping range lock (e.g. xfs) can
use this flag.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h |    2 ++
 mm/filemap.c       |    7 +++++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2027d25..1e0a1e4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -281,6 +281,8 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_LOCK_MAPPING		0x0008 /* Lock mapping range where
+						* page is created */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 4826cb4..f4a9e9a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2266,6 +2266,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+	struct range_lock mapping_lock;
 
 	gfp_mask = mapping_gfp_mask(mapping);
 	if (mapping_cap_account_dirty(mapping))
@@ -2280,8 +2281,14 @@ repeat:
 	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
+	if (flags & AOP_FLAG_LOCK_MAPPING) {
+		range_lock_init(&mapping_lock, index, index);
+		range_lock(&mapping->mapping_lock, &mapping_lock);
+	}
 	status = add_to_page_cache_lru(page, mapping, index,
 						GFP_KERNEL & ~gfp_notmask);
+	if (flags & AOP_FLAG_LOCK_MAPPING)
+		range_unlock(&mapping->mapping_lock, &mapping_lock);
 	if (unlikely(status)) {
 		page_cache_release(page);
 		if (status == -EEXIST)
-- 
1.7.1
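
From a caller's point of view, a filesystem ->write_begin just ORs in the flag
before calling grab_cache_page_write_begin() (a schematic sketch along the
lines of the ext3 conversion in patch 6/6; 'foo' is a placeholder filesystem):

	/* Schematic sketch only; mirrors what the ext3 conversion in 6/6 does. */
	static int foo_write_begin(struct file *file, struct address_space *mapping,
				   loff_t pos, unsigned len, unsigned flags,
				   struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
		struct page *page;

		/* Take the mapping range lock for 'index' around page creation */
		flags |= AOP_FLAG_LOCK_MAPPING;
		page = grab_cache_page_write_begin(mapping, index, flags);
		if (!page)
			return -ENOMEM;
		*pagep = page;
		/* ... usual block allocation / buffer preparation ... */
		return 0;
	}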



* [PATCH 4/6] fs: Don't call dio_cleanup() before submitting all bios
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

do_blockdev_direct_IO() can call dio_cleanup() before submitting
all bios. This is inconvenient for us because we need to keep a
preallocated structure in sdio which we attach to the bio on submit, and
it is natural to clean up an unused allocation in dio_cleanup().

Since dio_cleanup() is called again after submitting the last bio, it is
enough to just remove the first dio_cleanup() call.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/direct-io.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index cf5b44b..3a430f3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1209,10 +1209,8 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 			((sdio.final_block_in_request - sdio.block_in_file) <<
 					blkbits);
 
-		if (retval) {
-			dio_cleanup(dio, &sdio);
+		if (retval)
 			break;
-		}
 	} /* end iovec loop */
 
 	if (retval == -ENOTBLK) {
-- 
1.7.1



* [PATCH 5/6] fs: Take mapping lock during direct IO
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

Make the direct IO code grab the mapping range lock for the range under IO
just before the bio is submitted, and release the lock once the IO completes.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/direct-io.c |   67 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3a430f3..1127ca5 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -56,10 +56,13 @@
  * blocksize.
  */
 
+struct dio_bio_data;
+
 /* dio_state only used in the submission path */
 
 struct dio_submit {
 	struct bio *bio;		/* bio under assembly */
+	struct dio_bio_data *bio_data;	/* structure to be attached to the bio*/
 	unsigned blkbits;		/* doesn't change */
 	unsigned blkfactor;		/* When we're using an alignment which
 					   is finer than the filesystem's soft
@@ -143,7 +146,17 @@ struct dio {
 	struct page *pages[DIO_PAGES];	/* page buffer */
 } ____cacheline_aligned_in_smp;
 
+/*
+ * Structure associated with each submitted bio to provide back pointer and
+ * lock for the range accessed by the bio.
+ */
+struct dio_bio_data {
+	struct dio *dio;
+	struct range_lock lock;
+};
+
 static struct kmem_cache *dio_cache __read_mostly;
+static struct kmem_cache *dio_bio_data_cache __read_mostly;
 
 /*
  * How many pages are in the queue?
@@ -275,10 +288,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio);
  */
 static void dio_bio_end_aio(struct bio *bio, int error)
 {
-	struct dio *dio = bio->bi_private;
+	struct dio_bio_data *bio_data = bio->bi_private;
+	struct dio *dio = bio_data->dio;
 	unsigned long remaining;
 	unsigned long flags;
 
+	range_unlock(&dio->inode->i_mapping->mapping_lock, &bio_data->lock);
+	kmem_cache_free(dio_bio_data_cache, bio_data);
 	/* cleanup the bio */
 	dio_bio_complete(dio, bio);
 
@@ -298,14 +314,17 @@ static void dio_bio_end_aio(struct bio *bio, int error)
  * The BIO completion handler simply queues the BIO up for the process-context
  * handler.
  *
- * During I/O bi_private points at the dio.  After I/O, bi_private is used to
- * implement a singly-linked list of completed BIOs, at dio->bio_list.
+ * During I/O bi_private points at the dio_data.  After I/O, bi_private is used
+ * to implement a singly-linked list of completed BIOs, at dio->bio_list.
  */
 static void dio_bio_end_io(struct bio *bio, int error)
 {
-	struct dio *dio = bio->bi_private;
+	struct dio_bio_data *bio_data = bio->bi_private;
+	struct dio *dio = bio_data->dio;
 	unsigned long flags;
 
+	range_unlock(&dio->inode->i_mapping->mapping_lock, &bio_data->lock);
+	kmem_cache_free(dio_bio_data_cache, bio_data);
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	bio->bi_private = dio->bio_list;
 	dio->bio_list = bio;
@@ -325,7 +344,8 @@ static void dio_bio_end_io(struct bio *bio, int error)
  */
 void dio_end_io(struct bio *bio, int error)
 {
-	struct dio *dio = bio->bi_private;
+	struct dio_bio_data *bio_data = bio->bi_private;
+	struct dio *dio = bio_data->dio;
 
 	if (dio->is_async)
 		dio_bio_end_aio(bio, error);
@@ -369,8 +389,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 {
 	struct bio *bio = sdio->bio;
 	unsigned long flags;
-
-	bio->bi_private = dio;
+	loff_t start = sdio->logical_offset_in_bio;
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	dio->refcount++;
@@ -380,10 +399,30 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 		bio_set_pages_dirty(bio);
 
 	if (sdio->submit_io)
-		sdio->submit_io(dio->rw, bio, dio->inode,
-			       sdio->logical_offset_in_bio);
-	else
+		sdio->submit_io(dio->rw, bio, dio->inode, start);
+	else {
+		struct address_space *mapping = dio->inode->i_mapping;
+		loff_t end = sdio->logical_offset_in_bio + bio->bi_size - 1;
+
+		sdio->bio_data->dio = dio;
+		range_lock_init(&sdio->bio_data->lock,
+			start >> PAGE_CACHE_SHIFT, end >> PAGE_CACHE_SHIFT);
+		range_lock(&mapping->mapping_lock, &sdio->bio_data->lock);
+		/*
+		 * Once we hold mapping range lock writeout and invalidation
+		 * cannot race with page faults of buffered IO.
+		 */
+		filemap_write_and_wait_range(mapping, start, end);
+		if (dio->rw == WRITE && mapping->nrpages) {
+			invalidate_inode_pages2_range(mapping,
+				start >> PAGE_CACHE_SHIFT,
+				end >> PAGE_CACHE_SHIFT);
+		}
+		bio->bi_private = sdio->bio_data;
+		sdio->bio_data = NULL;
+
 		submit_bio(dio->rw, bio);
+	}
 
 	sdio->bio = NULL;
 	sdio->boundary = 0;
@@ -397,6 +436,8 @@ static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (dio_pages_present(sdio))
 		page_cache_release(dio_get_page(dio, sdio));
+	if (sdio->bio_data)
+		kmem_cache_free(dio_bio_data_cache, sdio->bio_data);
 }
 
 /*
@@ -600,6 +641,11 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
 	nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
 	nr_pages = min(nr_pages, BIO_MAX_PAGES);
 	BUG_ON(nr_pages <= 0);
+	sdio->bio_data = kmem_cache_alloc(dio_bio_data_cache, GFP_KERNEL);
+	if (!sdio->bio_data) {
+		ret = -ENOMEM;
+		goto out;
+	}
 	dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
 	sdio->boundary = 0;
 out:
@@ -1307,6 +1353,7 @@ EXPORT_SYMBOL(__blockdev_direct_IO);
 static __init int dio_init(void)
 {
 	dio_cache = KMEM_CACHE(dio, SLAB_PANIC);
+	dio_bio_data_cache = KMEM_CACHE(dio_bio_data, SLAB_PANIC);
 	return 0;
 }
 module_init(dio_init)
-- 
1.7.1
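
To summarize the lifecycle of the per-bio lock introduced above (a descriptive
recap of the code in this patch, not additional code):

	/*
	 * dio_new_bio()     - preallocate a dio_bio_data for the bio
	 * dio_bio_submit()  - range_lock() the page cache range covered by the
	 *                     bio, write out and (for writes) invalidate that
	 *                     range, attach the dio_bio_data via bi_private,
	 *                     then submit_bio()
	 * dio_bio_end_io() /
	 * dio_bio_end_aio() - range_unlock() and free the dio_bio_data
	 * dio_cleanup()     - frees a preallocated dio_bio_data that never got
	 *                     attached to a bio
	 */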



* [PATCH 6/6] ext3: Convert ext3 to use mapping lock
  2013-01-31 21:49 ` Jan Kara
@ 2013-01-31 21:49   ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-01-31 21:49 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, linux-mm, Jan Kara

Convert the filesystem to use the mapping lock when truncating files and doing
buffered writes. The rest is handled in the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index b176d42..1f5d5f1 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1261,6 +1261,7 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
+	flags |= AOP_FLAG_LOCK_MAPPING;
 
 retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
@@ -3273,6 +3274,8 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 	struct inode *inode = dentry->d_inode;
 	int error, rc = 0;
 	const unsigned int ia_valid = attr->ia_valid;
+	struct range_lock mapping_lock;
+	int range_locked = 0;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
@@ -3314,6 +3317,12 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
 		handle_t *handle;
 
+		range_lock_init(&mapping_lock,
+				attr->ia_size >> PAGE_CACHE_SHIFT,
+				ULONG_MAX);
+		range_lock(&inode->i_mapping->mapping_lock, &mapping_lock);
+		range_locked = 1;
+
 		handle = ext3_journal_start(inode, 3);
 		if (IS_ERR(handle)) {
 			error = PTR_ERR(handle);
@@ -3351,6 +3360,12 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 	    attr->ia_size != i_size_read(inode)) {
 		truncate_setsize(inode, attr->ia_size);
 		ext3_truncate(inode);
+
+		if (range_locked) {
+			range_unlock(&inode->i_mapping->mapping_lock,
+				     &mapping_lock);
+			range_locked = 0;
+		}
 	}
 
 	setattr_copy(inode, attr);
@@ -3360,6 +3375,8 @@ int ext3_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = ext3_acl_chmod(inode);
 
 err_out:
+	if (range_locked)
+		range_unlock(&inode->i_mapping->mapping_lock, &mapping_lock);
 	ext3_std_error(inode->i_sb, error);
 	if (!error)
 		error = rc;
-- 
1.7.1



* Re: [PATCH 1/6] lib: Implement range locks
  2013-01-31 21:49   ` Jan Kara
@ 2013-01-31 23:57     ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2013-01-31 23:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, linux-mm

On Thu, 31 Jan 2013 22:49:49 +0100
Jan Kara <jack@suse.cz> wrote:

> Implement range locking using interval tree.
> 
> ...
>
> +void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +	interval_tree_insert(&lock->node, &tree->root);
> +	/* Do we need to go to sleep? */
> +	while (lock->blocking_ranges) {
> +		lock->task = current;
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		schedule();
> +		spin_lock_irqsave(&tree->lock, flags);
> +	}
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +}
>
> ...
>
> +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	interval_tree_remove(&lock->node, &tree->root);
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		range_lock_unblock((struct range_lock *)node);
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +}

What are the worst-case interrupt-off durations here?

I note that the new exported functions in this patchset are
refreshingly free of documentation ;)
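Something like the following minimal kerneldoc sketch would presumably be
enough - the wording here is only illustrative, based on the behaviour
visible in the code quoted above:

/**
 * range_lock - lock a range of a mapping for exclusive use
 * @tree: range lock tree of the mapping, protected by @tree->lock
 * @lock: range to lock, prepared beforehand with range_lock_init()
 *
 * Inserts @lock into @tree and sleeps (uninterruptibly) until every range
 * that was already in the tree and overlaps [@lock->node.start,
 * @lock->node.last] has been released with range_unlock().
 */

/**
 * range_unlock - release a range previously locked with range_lock()
 * @tree: range lock tree the range was locked in
 * @lock: range to release
 *
 * Removes @lock from @tree and wakes up any overlapping waiter for which
 * this was the last blocking range.
 */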


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/6] fs: Take mapping lock in generic read paths
  2013-01-31 21:49   ` Jan Kara
@ 2013-01-31 23:59     ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2013-01-31 23:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, linux-mm

On Thu, 31 Jan 2013 22:49:50 +0100
Jan Kara <jack@suse.cz> wrote:

> Add mapping lock to struct address_space and grab it in all paths
> creating pages in page cache to read data into them. That means buffered
> read, readahead, and page fault code.

Boy, this does look expensive in both speed and space.

As you pointed out in [0/n], it's 2-3%.  As always with pagecache
stuff, the cost of filling the page generally swamps any inefficiencies
in preparing that page.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-01-31 21:49 ` Jan Kara
@ 2013-02-01  0:07   ` Andrew Morton
  -1 siblings, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2013-02-01  0:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, linux-mm

On Thu, 31 Jan 2013 22:49:48 +0100
Jan Kara <jack@suse.cz> wrote:

> There are several different motivations for implementing mapping range
> locking:
> 
> a) Punch hole is currently racy wrt mmap (page can be faulted in in the
>    punched range after page cache has been invalidated) leading to nasty
>    results as fs corruption (we can end up writing to already freed block),
>    user exposure of uninitialized data, etc. To fix this we need some new
>    mechanism of serializing hole punching and page faults.

This one doesn't seem very exciting - perhaps there are local fixes
which can be made?

> b) There is an uncomfortable number of mechanisms serializing various paths
>    manipulating pagecache and data underlying it. We have i_mutex, page lock,
>    checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
>    Different pairs of operations are serialized by different mechanisms and
>    not all the cases are covered. Case (a) above is likely the worst but DIO
>    vs buffered IO isn't ideal either (we provide only limited consistency).
>    The range locking should somewhat simplify serialization of pagecache
>    operations. So i_dio_count can be removed completely, i_mutex to certain
>    extent (we still need something for things like timestamp updates,
>    possibly for i_size changes although those can be dealt with I think).

Those would be nice cleanups and simplifications, to make kernel
developers' lives easier.  And there is value in this, but doing this
means our users incur real costs.

I'm rather uncomfortable with changes which make our lives easier at the
expense of our users.  If we had an infinite amount of labor, we
wouldn't do this.  In reality we have finite labor, but a small cost
dispersed amongst millions or billions of users becomes a very large
cost.

> c) i_mutex doesn't allow any paralellism of operations using it and some
>    filesystems workaround this for specific cases (e.g. DIO reads). Using
>    range locking allows for concurrent operations (e.g. writes, DIO) on
>    different parts of the file. Of course, range locking itself isn't
>    enough to make the parallelism possible. Filesystems still have to
>    somehow deal with the concurrency when manipulating inode allocation
>    data. But the range locking at least provides a common VFS mechanism for
>    serialization VFS itself needs and it's upto each filesystem to
>    serialize more if it needs to.

That would be useful to end-users, but I'm having trouble predicting
*how* useful.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-01  0:07   ` Andrew Morton
@ 2013-02-04  9:29     ` Zheng Liu
  -1 siblings, 0 replies; 46+ messages in thread
From: Zheng Liu @ 2013-02-04  9:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Thu, Jan 31, 2013 at 04:07:57PM -0800, Andrew Morton wrote:
[snip]
> > c) i_mutex doesn't allow any paralellism of operations using it and some
> >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> >    range locking allows for concurrent operations (e.g. writes, DIO) on
> >    different parts of the file. Of course, range locking itself isn't
> >    enough to make the parallelism possible. Filesystems still have to
> >    somehow deal with the concurrency when manipulating inode allocation
> >    data. But the range locking at least provides a common VFS mechanism for
> >    serialization VFS itself needs and it's upto each filesystem to
> >    serialize more if it needs to.
> 
> That would be useful to end-users, but I'm having trouble predicting
> *how* useful.

Hello Andrew,

I believe this would be useful for end-users, at least for DIO users such as
databases.  Currently, when we want to issue a direct IO, i_mutex will
serialize concurrent readers/writers.  Some filesystems (xfs and ext4) try
not to take this lock under some conditions to improve performance,
especially latency, because then we don't need to wait on it.  But there are
still cases where it must be taken.  So range locking lets us reduce the
contention as far as possible.
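As a minimal sketch (using the interface from these patches; the function
name and the concrete page ranges are just for illustration), two DIO
writers touching disjoint ranges of one file no longer wait for each other,
while i_mutex would have serialized them on the whole inode:

static void dio_disjoint_ranges_sketch(struct address_space *mapping)
{
        struct range_lock rl_a, rl_b;

        range_lock_init(&rl_a, 0, 255);         /* writer A: pages 0-255 */
        range_lock_init(&rl_b, 256, 511);       /* writer B: pages 256-511 */

        range_lock(&mapping->mapping_lock, &rl_a);
        range_lock(&mapping->mapping_lock, &rl_b);      /* no overlap, no sleep */

        /* ... each writer invalidates its pages and submits its bio ... */

        range_unlock(&mapping->mapping_lock, &rl_b);
        range_unlock(&mapping->mapping_lock, &rl_a);
}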

Regards,
						- Zheng

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-01  0:07   ` Andrew Morton
@ 2013-02-04 12:38     ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-04 12:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:48 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > There are several different motivations for implementing mapping range
> > locking:
> > 
> > a) Punch hole is currently racy wrt mmap (page can be faulted in in the
> >    punched range after page cache has been invalidated) leading to nasty
> >    results as fs corruption (we can end up writing to already freed block),
> >    user exposure of uninitialized data, etc. To fix this we need some new
> >    mechanism of serializing hole punching and page faults.
> 
> This one doesn't seem very exciting - perhaps there are local fixes
> which can be made?
  I agree this probably won't be triggered by accident since punch hole
uses are limited. But a malicious user is a different thing...

Regarding a local fix - local in what sense? We could fix it inside each
filesystem separately, but the number of filesystems supporting punch hole
is growing, so I don't think it's a good decision for each of them to devise
its own synchronization mechanism. Fixing it 'locally' in the sense that we
fix just the mmap vs punch hole race is possible, but we need some
synchronization of page fault and punch hole - likely in the form of an
rwsem where the page fault takes the reader side and punch hole the writer
side. So this "minimal" fix requires an additional rwsem in struct
address_space and also incurs some cost in the page fault path. That cost is
likely lower than the one of range locking, but it is not zero.
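A minimal sketch of that alternative, assuming a new rwsem in struct
address_space (the invalidate_lock name and the two helper functions are
made up here just for illustration; truncate_pagecache_range() is the
existing kernel helper):

/* page fault path: faults in the mapping can still run concurrently */
static int fault_path_sketch(struct address_space *mapping)
{
        down_read(&mapping->invalidate_lock);
        /* ... find or instantiate the page and map it ... */
        up_read(&mapping->invalidate_lock);
        return 0;
}

/* punch hole path: excludes all page faults on this mapping */
static void punch_hole_path_sketch(struct inode *inode, loff_t start,
                                   loff_t end)
{
        struct address_space *mapping = inode->i_mapping;

        down_write(&mapping->invalidate_lock);
        truncate_pagecache_range(inode, start, end);
        /* ... free the filesystem blocks backing the range ... */
        up_write(&mapping->invalidate_lock);
}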

> > b) There is an uncomfortable number of mechanisms serializing various paths
> >    manipulating pagecache and data underlying it. We have i_mutex, page lock,
> >    checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
> >    Different pairs of operations are serialized by different mechanisms and
> >    not all the cases are covered. Case (a) above is likely the worst but DIO
> >    vs buffered IO isn't ideal either (we provide only limited consistency).
> >    The range locking should somewhat simplify serialization of pagecache
> >    operations. So i_dio_count can be removed completely, i_mutex to certain
> >    extent (we still need something for things like timestamp updates,
> >    possibly for i_size changes although those can be dealt with I think).
> 
> Those would be nice cleanups and simplifications, to make kernel
> developers' lives easier.  And there is value in this, but doing this
> means our users incur real costs.
> 
> I'm rather uncomfortable changes which make our lives easier at the
> expense of our users.  If we had an infinite amount of labor, we
> wouldn't do this.  In reality we have finite labor, but a small cost
> dispersed amongst millions or billions of users becomes a very large
> cost.
  I agree there's a cost (as with everything) and personally I feel the
cost is larger than I'd like so we mostly agree on that. OTOH I don't quite
buy the argument "multiplied by millions or billions of users" - the more
machines running the code, the more wealth these machines hopefully
generate ;-). The additional cost starts to matter when it makes the code
not worth using for some purposes. But this is really philosophy :)

> > c) i_mutex doesn't allow any paralellism of operations using it and some
> >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> >    range locking allows for concurrent operations (e.g. writes, DIO) on
> >    different parts of the file. Of course, range locking itself isn't
> >    enough to make the parallelism possible. Filesystems still have to
> >    somehow deal with the concurrency when manipulating inode allocation
> >    data. But the range locking at least provides a common VFS mechanism for
> >    serialization VFS itself needs and it's upto each filesystem to
> >    serialize more if it needs to.
> 
> That would be useful to end-users, but I'm having trouble predicting
> *how* useful.
  As Zheng said, there are people interested in this for DIO. Currently
filesystems each invent their own tweaks to avoid the serialization at
least for the easiest cases.
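For illustration only (this is not any particular filesystem's code), such a
tweak typically boils down to a check like the sketch below: take the
exclusive lock only when the DIO write could allocate or extend, and let
aligned overwrites run in parallel. A real filesystem would also check that
the blocks are already allocated:

static bool dio_write_needs_exclusive_lock(struct inode *inode, loff_t pos,
                                           size_t len)
{
        unsigned int blkmask = (1 << inode->i_blkbits) - 1;

        /* extending the file changes i_size -> serialize fully */
        if (pos + len > i_size_read(inode))
                return true;
        /* unaligned IO may need zeroing around the edges -> serialize */
        if ((pos | len) & blkmask)
                return true;
        /* aligned, non-extending overwrite can run without the big lock */
        return false;
}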

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/6] fs: Take mapping lock in generic read paths
  2013-01-31 23:59     ` Andrew Morton
@ 2013-02-04 12:47       ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-04 12:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Thu 31-01-13 15:59:40, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:50 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > Add mapping lock to struct address_space and grab it in all paths
> > creating pages in page cache to read data into them. That means buffered
> > read, readahead, and page fault code.
> 
> Boy, this does look expensive in both speed and space.
  I'm not sure I'll be able to do much with the space cost but hopefully
the CPU cost could be reduced.

> As you pointed out in [0/n], it's 2-3%.  As always with pagecache
> stuff, the cost of filling the page generally swamps any inefficiencies
> in preparing that page.
  Yes, I measured it with a ramdisk-backed fs exactly to remove the cost
of filling the page from the picture. But there are systems where IO is CPU
bound (e.g. when you have PCIe-attached devices), and although the
additional cost of the block layer will further hide the cost of the page
cache itself, I assume the added 2-3% incurred by the page cache will be
measurable on such systems. That's why I'd like to reduce the CPU cost
of range locking.
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] lib: Implement range locks
  2013-01-31 23:57     ` Andrew Morton
@ 2013-02-04 16:41       ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-04 16:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Thu 31-01-13 15:57:26, Andrew Morton wrote:
> On Thu, 31 Jan 2013 22:49:49 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > Implement range locking using interval tree.
> > 
> > ...
> >
> > +void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> > +{
> > +	struct interval_tree_node *node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&tree->lock, flags);
> > +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> > +					lock->node.last);
> > +	while (node) {
> > +		lock->blocking_ranges++;
> > +		node = interval_tree_iter_next(node, lock->node.start,
> > +					       lock->node.last);
> > +	}
> > +	interval_tree_insert(&lock->node, &tree->root);
> > +	/* Do we need to go to sleep? */
> > +	while (lock->blocking_ranges) {
> > +		lock->task = current;
> > +		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		spin_unlock_irqrestore(&tree->lock, flags);
> > +		schedule();
> > +		spin_lock_irqsave(&tree->lock, flags);
> > +	}
> > +	spin_unlock_irqrestore(&tree->lock, flags);
> > +}
> >
> > ...
> >
> > +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> > +{
> > +	struct interval_tree_node *node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&tree->lock, flags);
> > +	interval_tree_remove(&lock->node, &tree->root);
> > +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> > +					lock->node.last);
> > +	while (node) {
> > +		range_lock_unblock((struct range_lock *)node);
> > +		node = interval_tree_iter_next(node, lock->node.start,
> > +					       lock->node.last);
> > +	}
> > +	spin_unlock_irqrestore(&tree->lock, flags);
> > +}
> 
> What are the worst-case interrupt-off durations here?
  Good question. In theory, it could be relatively long, e.g. when there
are lots of DIOs in flight against a range which is locked. I'll do some
measurements to get some idea.
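The reason it can get long is that range_unlock() wakes every overlapping
waiter while still holding the tree spinlock with interrupts off. The helper
doing that isn't quoted above; it presumably amounts to something like the
following sketch (treat the exact body as an assumption), i.e. one wake-up
per blocked range:

static void range_lock_unblock(struct range_lock *lock)
{
        /* one blocker of @lock went away; wake the waiter when none remain */
        if (!--lock->blocking_ranges)
                wake_up_process(lock->task);
}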
 
> I note that the new exported functions in this patchset are
> refreshingly free of documentation ;)
  They are *so* obvious ;) Point taken... Thanks for review.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-04 12:38     ` Jan Kara
@ 2013-02-05 23:25       ` Dave Chinner
  -1 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-02-05 23:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, LKML, linux-fsdevel, linux-mm

On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > c) i_mutex doesn't allow any paralellism of operations using it and some
> > >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> > >    range locking allows for concurrent operations (e.g. writes, DIO) on
> > >    different parts of the file. Of course, range locking itself isn't
> > >    enough to make the parallelism possible. Filesystems still have to
> > >    somehow deal with the concurrency when manipulating inode allocation
> > >    data. But the range locking at least provides a common VFS mechanism for
> > >    serialization VFS itself needs and it's upto each filesystem to
> > >    serialize more if it needs to.
> > 
> > That would be useful to end-users, but I'm having trouble predicting
> > *how* useful.
>   As Zheng said, there are people interested in this for DIO. Currently
> filesystems each invent their own tweaks to avoid the serialization at
> least for the easiest cases.

The thing is, this won't replace the locking those filesystems use
to parallelise DIO - it just adds another layer of locking they'll
need to use. The locks filesystems like XFS use to serialise IO
against hole punch also serialise against many more internal
functions and so if these range locks don't have the same capability
we're going to have to retain those locks even after the range locks
are introduced. It basically means we're going to have two layers
of range locks - one for IO sanity and atomicity, and then this
layer just for hole punch vs mmap.

As I've said before, what we really need in XFS is IO range locks
because we need to be able to serialise operations against IO in
progress, not page cache operations in progress.  IOWs, locking at
the mapping tree level does not provide the right exclusion
semantics we need to get rid of the existing filesystem locking that
allows concurrent IO to be managed.  Hence the XFS IO path locking
suddenly becomes 4 locks deep:

	i_mutex
	  XFS_IOLOCK_{SHARED,EXCL}
	    mapping range lock
	      XFS_ILOCK_{SHARED,EXCL}

That's because the buffered IO path uses per-page lock ranges and to
provide atomicity of read vs write, read vs truncate, etc we still
need to use the XFS_IOLOCK_EXCL to provide this functionality.

Hence I really think we need to be driving this lock outwards to
where the i_mutex currently sits, turning it into an *IO range
lock*, and not an inner-level mapping range lock, i.e. flattening the
locking to:

	io_range_lock(off, len)
	  fs internal inode metadata modification lock

Yes, I know this causes problems with mmap and locking orders, but
perhaps we should be trying to get that fixed first because it
simplifies the whole locking schema we need for filesystems to
behave sanely. i.e. shouldn't we be aiming to simplify things
as we rework locking rather than making them more complex?

IOWs, I think the "it's a mapping range lock" approach is not the
right level to be providing IO exclusion semantics. After all, it's
entire IO ranges that we need to provide -atomic- exclusion for...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-05 23:25       ` Dave Chinner
@ 2013-02-06 19:25         ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-06 19:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel, linux-mm

On Wed 06-02-13 10:25:12, Dave Chinner wrote:
> On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> > On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > > c) i_mutex doesn't allow any paralellism of operations using it and some
> > > >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> > > >    range locking allows for concurrent operations (e.g. writes, DIO) on
> > > >    different parts of the file. Of course, range locking itself isn't
> > > >    enough to make the parallelism possible. Filesystems still have to
> > > >    somehow deal with the concurrency when manipulating inode allocation
> > > >    data. But the range locking at least provides a common VFS mechanism for
> > > >    serialization VFS itself needs and it's upto each filesystem to
> > > >    serialize more if it needs to.
> > > 
> > > That would be useful to end-users, but I'm having trouble predicting
> > > *how* useful.
> >   As Zheng said, there are people interested in this for DIO. Currently
> > filesystems each invent their own tweaks to avoid the serialization at
> > least for the easiest cases.
> 
> The thing is, this won't replace the locking those filesystems use
> to parallelise DIO - it just adds another layer of locking they'll
> need to use. The locks filesystems like XFS use to serialise IO
> against hole punch also serialise against many more internal
> functions and so if these range locks don't have the same capability
> we're going to have to retain those locks even after the range locks
> are introduced. It basically means we're going to have two layers
> of range locks - one for IO sanity and atomicity, and then this
> layer just for hole punch vs mmap.
> 
> As i've said before, what we really need in XFS is IO range locks
> because we need to be able to serialise operations against IO in
> progress, not page cache operations in progress.
  Hum, I'm not sure I follow you here. So the mapping tree lock + PageLocked +
PageWriteback serialize all IO for the part of the file underlying the page.
I.e. at most one of truncate (punch hole), DIO, writeback, buffered write,
buffered read, page fault can run on that part of the file. So how come that
doesn't provide enough serialization for XFS?

Ah, is it the problem that if two threads do overlapping buffered writes
to a file then we can end up with data mixed from the two writes (if we
didn't have something like i_mutex)?

> IOWs, locking at
> the mapping tree level does not provide the right exclusion
> semantics we need to get rid of the existing filesystem locking that
> allows concurrent IO to be managed.  Hence the XFS IO path locking
> suddenly because 4 locks deep:
> 
> 	i_mutex
> 	  XFS_IOLOCK_{SHARED,EXCL}
> 	    mapping range lock
> 	      XFS_ILOCK_{SHARED,EXCL}
> 
> That's because the buffered IO path uses per-page lock ranges and to
> provide atomicity of read vs write, read vs truncate, etc we still
> need to use the XFS_IOLOCK_EXCL to provide this functionality.
>
> Hence I really think we need to be driving this lock outwards to
> where the i_mutex currently sits, turning it into an *IO range
> lock*, and not an inner-level mapping range lock. i.e flattening the
> locking to:
> 
> 	io_range_lock(off, len)
> 	  fs internal inode metadata modification lock
  If I get you right, your IO range lock would be more or less what the
current mapping tree lock is for DIO and truncate, but for buffered IO you'd
want to release it after the read / write is finished? That is possible to
do if we keep the per-page granularity, but that's what you seem to have
problems with.
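For example, holding the range across a whole buffered write would look
roughly like the sketch below. write_with_io_range_lock() and
do_buffered_write() are hypothetical stand-ins (the latter for the usual
per-page copy loop), and this deliberately ignores the mmap_sem ordering
problem you acknowledge below:

static ssize_t write_with_io_range_lock(struct file *file, loff_t pos,
                                        size_t len)
{
        struct address_space *mapping = file->f_mapping;
        struct range_lock rl;
        ssize_t ret;

        /* lock the whole range of the write, not page by page */
        range_lock_init(&rl, pos >> PAGE_CACHE_SHIFT,
                        (pos + len - 1) >> PAGE_CACHE_SHIFT);
        range_lock(&mapping->mapping_lock, &rl);
        ret = do_buffered_write(file, pos, len);        /* hypothetical */
        range_unlock(&mapping->mapping_lock, &rl);
        return ret;
}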

> Yes, I know this causes problems with mmap and locking orders, but
> perhaps we should be trying to get that fixed first because it
> simplifies the whole locking schema we need for filesystems to
> behave sanely. i.e. shouldn't we be aiming to simplify things
> as we rework locking rather than make the more complex?
  Yes. I was looking at how we could free filesystems from mmap_sem locking
issues but as Al Viro put it, the use of mmap_sem is a mess which isn't
easy to untangle. I want to have a look at it but I fear it's going to be a
long run...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-06 19:25         ` Jan Kara
@ 2013-02-07  2:43           ` Dave Chinner
  -1 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2013-02-07  2:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, LKML, linux-fsdevel, linux-mm

On Wed, Feb 06, 2013 at 08:25:34PM +0100, Jan Kara wrote:
> On Wed 06-02-13 10:25:12, Dave Chinner wrote:
> > On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> > > On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > > > c) i_mutex doesn't allow any paralellism of operations using it and some
> > > > >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> > > > >    range locking allows for concurrent operations (e.g. writes, DIO) on
> > > > >    different parts of the file. Of course, range locking itself isn't
> > > > >    enough to make the parallelism possible. Filesystems still have to
> > > > >    somehow deal with the concurrency when manipulating inode allocation
> > > > >    data. But the range locking at least provides a common VFS mechanism for
> > > > >    serialization VFS itself needs and it's upto each filesystem to
> > > > >    serialize more if it needs to.
> > > > 
> > > > That would be useful to end-users, but I'm having trouble predicting
> > > > *how* useful.
> > >   As Zheng said, there are people interested in this for DIO. Currently
> > > filesystems each invent their own tweaks to avoid the serialization at
> > > least for the easiest cases.
> > 
> > The thing is, this won't replace the locking those filesystems use
> > to parallelise DIO - it just adds another layer of locking they'll
> > need to use. The locks filesystems like XFS use to serialise IO
> > against hole punch also serialise against many more internal
> > functions and so if these range locks don't have the same capability
> > we're going to have to retain those locks even after the range locks
> > are introduced. It basically means we're going to have two layers
> > of range locks - one for IO sanity and atomicity, and then this
> > layer just for hole punch vs mmap.
> > 
> > As i've said before, what we really need in XFS is IO range locks
> > because we need to be able to serialise operations against IO in
> > progress, not page cache operations in progress.
>   Hum, I'm not sure I follow you here. So mapping tree lock + PageLocked +
> PageWriteback serialize all IO for part of the file underlying the page.
> I.e. at most one of truncate (punch hole), DIO, writeback, buffered write,
> buffered read, page fault can run on that part of file.

Right, it serialises page cache operations sufficiently to avoid
page cache coherence problems, but it does not serialise operations
sufficiently to provide atomicity between operations that should be
atomic w.r.t. each other.

> So how come it
> doesn't provide enough serialization for XFS?
> 
> Ah, is it the problem that if two threads do overlapping buffered writes
> to a file then we can end up with data mixed from the two writes (if we
> didn't have something like i_mutex)?

That's one case of specific concern - the POSIX write() atomicity
guarantee - but it indicates the cause of many of my other concerns,
too. e.g. write vs prealloc, write vs punch, read vs truncate, write
vs truncate, buffered vs direct write, etc.

Basically, page-cache granularity locking for buffered IO means that
it cannot be wholly serialised against any other operation in
progress. That means we can't use the range lock to provide a
barrier to guarantee that no IO is currently in progress at all, and
hence it doesn't provide the IO barrier semantics we need for
various operations within XFS.

An example of this is that the online defrag ioctl requires that the
copyin + mtime updates in the write path be atomic w.r.t. the swap extents
ioctl, so that it can detect concurrent modification of the file being
defragged and abort. The page cache based range locks simply don't
provide this coverage, and so we'd need to maintain the IO operation
locking we currently have to provide this exclusion.

Truncate is something I also see as particularly troublesome,
because the i_mutex currently provides mutual exclusion against the
operational range of a buffered write (i.e. at the .aio_write level)
and not page granularity like this patch set would result in. Hence
the behaviour of write vs truncate races could change quite
significantly. e.g.  instead of "write completes then truncate" or
"truncate completes then write", we could have "partial write,
truncate, write continues and completes" resulting in a bunch of
zeros inside the range the write call wrote to. The application
won't even realise that the data it wrote was corrupted by the
racing truncate.....

IOWs, I think that the fundamental unit of atomicity we need here is
the operational range of the syscall, i.e. that each of the protected
operations needs to execute atomically as a whole with respect to
each other, not in a piecemeal fashion where some use whole range
locking and others use fine-grained page-range locking...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/6 RFC] Mapping range lock
  2013-02-07  2:43           ` Dave Chinner
@ 2013-02-07 11:06             ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-07 11:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel, linux-mm

On Thu 07-02-13 13:43:42, Dave Chinner wrote:
> On Wed, Feb 06, 2013 at 08:25:34PM +0100, Jan Kara wrote:
> > On Wed 06-02-13 10:25:12, Dave Chinner wrote:
> > > On Mon, Feb 04, 2013 at 01:38:31PM +0100, Jan Kara wrote:
> > > > On Thu 31-01-13 16:07:57, Andrew Morton wrote:
> > > > > > c) i_mutex doesn't allow any paralellism of operations using it and some
> > > > > >    filesystems workaround this for specific cases (e.g. DIO reads). Using
> > > > > >    range locking allows for concurrent operations (e.g. writes, DIO) on
> > > > > >    different parts of the file. Of course, range locking itself isn't
> > > > > >    enough to make the parallelism possible. Filesystems still have to
> > > > > >    somehow deal with the concurrency when manipulating inode allocation
> > > > > >    data. But the range locking at least provides a common VFS mechanism for
> > > > > >    serialization VFS itself needs and it's upto each filesystem to
> > > > > >    serialize more if it needs to.
> > > > > 
> > > > > That would be useful to end-users, but I'm having trouble predicting
> > > > > *how* useful.
> > > >   As Zheng said, there are people interested in this for DIO. Currently
> > > > filesystems each invent their own tweaks to avoid the serialization at
> > > > least for the easiest cases.
> > > 
> > > The thing is, this won't replace the locking those filesystems use
> > > to parallelise DIO - it just adds another layer of locking they'll
> > > need to use. The locks filesystems like XFS use to serialise IO
> > > against hole punch also serialise against many more internal
> > > functions and so if these range locks don't have the same capability
> > > we're going to have to retain those locks even after the range locks
> > > are introduced. It basically means we're going to have two layers
> > > of range locks - one for IO sanity and atomicity, and then this
> > > layer just for hole punch vs mmap.
> > > 
> > > As i've said before, what we really need in XFS is IO range locks
> > > because we need to be able to serialise operations against IO in
> > > progress, not page cache operations in progress.
> >   Hum, I'm not sure I follow you here. So mapping tree lock + PageLocked +
> > PageWriteback serialize all IO for part of the file underlying the page.
> > I.e. at most one of truncate (punch hole), DIO, writeback, buffered write,
> > buffered read, page fault can run on that part of file.
> 
> Right, it serialises page cache operations sufficient to avoid
> page cache coherence problems, but it does not serialise operations
> sufficiently to provide atomicity between operations that should be
> atomic w.r.t. each other.
> 
> > So how come it
> > doesn't provide enough serialization for XFS?
> > 
> > Ah, is it the problem that if two threads do overlapping buffered writes
> > to a file then we can end up with data mixed from the two writes (if we
> > didn't have something like i_mutex)?
> 
> That's one case of specific concern - the POSIX write() atomicity
> guarantee
  So I was searching for this both yesterday and today and I didn't find
any comment anywhere explaining how concurrent writes should behave. But
regardless of how the spec defines it, we have provided write vs write
exclusion so far and I can imagine that changing this could break some
applications. So here I agree we probably have no choice...

> - but it indicates the cause of many of my other concerns,
> too. e.g. write vs prealloc, write vs punch, read vs truncate, write
> vs truncate, buffered vs direct write, etc.
> 
> Basically, page-cache granularity locking for buffered IO means that
> it cannot be wholly serialised against any other operation in
> progress. That means we can't use the range lock to provide a
> barrier to guarantee that no IO is currently in progress at all, and
> hence it doesn't provide the IO barrier semantics we need for
> various operations within XFS.
  Well, I never really thought about a write() call as a single operation
but rather as a sequence of block-sized operations. But I agree this is a
developer / implementation-centric view and doesn't make much sense from
the user POV. Users could reasonably expect that if they write to a range
and punch a hole in the same range, they either see the whole range empty
or the whole range written. The same goes for the truncate problem you
mention below. The page granularity of buffered IO locking doesn't provide
that level of consistency.

So I agree with you that if we want to get rid of i_mutex we'd need to lock
the whole range specified by the syscall (at least for writes; for reads I'm
not that convinced, because that would heavily reduce the amount of
parallelism we have in the code now for some workloads - there I think the
page granularity would be better, and consistency would be unchanged from
what we have now).
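
For the read side, keeping the page granularity would mean something roughly
like the sketch below per page being instantiated (mapping->mapping_lock is
a made-up field name, shown only to contrast with locking the whole syscall
range; page, index and error are the usual read-path locals):

	struct range_lock rl;

	/* lock only the single page index being added to the page cache */
	range_lock_init(&rl, index, index);
	range_lock(&mapping->mapping_lock, &rl);
	error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
	range_unlock(&mapping->mapping_lock, &rl);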

But if we are going to do the locking for the range specified by the
syscall, it's going to make the write path about 2% heavier on CPU for a
small-writes workload, according to my measurements. And I'm not sure that's
acceptable :( Sigh... But I guess I'll code it without handling mmap for now
and see exactly what the situation is (how much we gain from dropping
i_mutex etc.).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/6] fs: Take mapping lock in generic read paths
  2013-02-04 12:47       ` Jan Kara
@ 2013-02-08 14:59         ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-08 14:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Mon 04-02-13 13:47:15, Jan Kara wrote:
> On Thu 31-01-13 15:59:40, Andrew Morton wrote:
> > On Thu, 31 Jan 2013 22:49:50 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Add mapping lock to struct address_space and grab it in all paths
> > > creating pages in page cache to read data into them. That means buffered
> > > read, readahead, and page fault code.
> > 
> > Boy, this does look expensive in both speed and space.
>   I'm not sure I'll be able to do much with the space cost but hopefully
> the CPU cost could be reduced.
> 
> > As you pointed out in [0/n], it's 2-3%.  As always with pagecache
> > stuff, the cost of filling the page generally swamps any inefficiencies
> > in preparing that page.
>   Yes, I measured it with a ramdisk-backed fs precisely to remove the cost
> of filling the page from the picture. But there are systems where IO is CPU
> bound (e.g. when you have PCIe-attached devices), and although the
> additional cost of the block layer will further hide the cost of the page
> cache itself, I assume the added 2-3% incurred by the page cache itself
> will be measurable on such systems. So that's why I'd like to reduce the
> CPU cost of range locking.
  So I played a bit more with the code and I was able to reduce the space
cost to a single pointer in struct address_space and make the impact on the
write path unmeasurable. I still see a ~1% regression in the read path and
I'm not sure why, as the fast path now only adds a test for one value. But
maybe there's a thinko somewhere. Anyway I'm optimistic that at least in
the current form the code could be massaged so that the CPU cost is in the
noise.

I write "in the current form" because, as Dave Chinner pointed out, we need
to lock the whole range used by write() at once to ever have a chance of
dropping i_mutex, and that will require some non-trivial changes. So I'll be
looking into that now...
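
To make "a single pointer" concrete: the idea is that struct address_space
only carries a lazily allocated pointer, and the fast path only tests it -
roughly like the sketch below (the field and helper names are illustrative
only, not necessarily what the next posting will use):

	/*
	 * Illustrative only: one pointer in struct address_space, NULL
	 * until somebody actually needs range exclusion on this mapping.
	 */
	static inline bool mapping_needs_range_lock(struct address_space *mapping)
	{
		return mapping->range_lock_tree != NULL;
	}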

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] lib: Implement range locks
  2013-01-31 21:49   ` Jan Kara
@ 2013-02-11  5:42     ` Michel Lespinasse
  -1 siblings, 0 replies; 46+ messages in thread
From: Michel Lespinasse @ 2013-02-11  5:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, linux-mm

Hi Jan,

On Thu, Jan 31, 2013 at 1:49 PM, Jan Kara <jack@suse.cz> wrote:
> Implement range locking using interval tree.

Yay! I like to see interval trees being put to good use.

> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */

I think the cost is actually O((1+R_int)log(R_all)) as each
interval_tree_iter_{first,next} call is O(log(R_all))

Not that it'll make a huge difference in practice - the cost will be
cheap enough either way.

> +struct range_lock {
> +       struct interval_tree_node node;
> +       struct task_struct *task;
> +       /* Number of ranges which are blocking acquisition of the lock */
s/ranges/previously requested ranges/

I think it's worth writing this down as I originally found this confusing.

BTW, I like how you only count previously requested ranges in order to
guarantee fairness. This was absolutely not obvious to me.

> +#define RANGE_LOCK_INITIALIZER(start, end) {\
> +       .node = {\
> +               .start = (start),\
> +               .end = (end)\
> +       }\
> +}

I have not found any uses of this, but it seems it wouldn't work anyway,
since you want .last instead of .end.

BTW, it's important to make it clear that last is the last value that
is *included* in the interval, not the first value that follows it.

> +void range_lock_init(struct range_lock *lock, unsigned long start,
> +                    unsigned long end);
> +void range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);

Is there a point to separating the init and lock stages ? maybe the API could be
void range_lock(struct range_lock_tree *tree, struct range_lock *lock,
unsigned long start, unsigned long last);
void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);

(I changed end to last because I think end makes it sound like it's
the first value after the interval, while last makes it clear that
it's the last value in the interval)
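
A caller would then do something like this (sketch only, where tree is the
struct range_lock_tree * protecting the mapping):

	struct range_lock rl;

	range_lock(tree, &rl, start, last);	/* init + queue + wait */
	/* ... operate on [start, last] ... */
	range_unlock(tree, &rl);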

> +/*
> + * Implementation of range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'.  We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0.
> + */

May be worth repeating the comment about how this achieves fairness
and avoids livelocks.

> +void range_lock_init(struct range_lock *lock, unsigned long start,
> +                    unsigned long end)
> +{
> +       lock->node.start = start;
> +       lock->node.last = end;
> +       RB_CLEAR_NODE(&lock->node.rb);

I really wish people didn't unnecessarily use RB_CLEAR_NODE before
inserting nodes in an rbtree.
RB_CLEAR_NODE is never necessary unless you want to tag unused nodes
and check them later using RB_EMPTY_NODE.

> +void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +       struct interval_tree_node *node;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&tree->lock, flags);

Are you expecting range locks to be used from hardirq context ? If
not, it may be more appropriate to just use spin_lock_bh ?

> +       node = interval_tree_iter_first(&tree->root, lock->node.start,
> +                                       lock->node.last);
> +       while (node) {
> +               lock->blocking_ranges++;
> +               node = interval_tree_iter_next(node, lock->node.start,
> +                                              lock->node.last);
> +       }

Nitpicking here, but I think this is slightly easier to read as a for loop:
for (node = interval_tree_iter_first(...);
     node;
     node = interval_tree_iter_next(...))
        lock->blocking_ranges++;

> +       /* Do we need to go to sleep? */
> +       while (lock->blocking_ranges) {
> +               lock->task = current;
> +               __set_current_state(TASK_UNINTERRUPTIBLE);
> +               spin_unlock_irqrestore(&tree->lock, flags);
> +               schedule();
> +               spin_lock_irqsave(&tree->lock, flags);
> +       }
> +       spin_unlock_irqrestore(&tree->lock, flags);

I think I would prefer:
        lock->task = tsk = current;
        spin_unlock_irqrestore(&tree->lock, flags);
        while (true) {
                set_task_state(tsk, TASK_UNINTERRUPTIBLE);
                if (!lock->blocking_ranges)
                        break;
                schedule();
        }
        set_task_state(tsk, TASK_RUNNING);

This avoids an unnecessary spinlock acquisition when we obtain the range lock.
(You can optionally choose to avoid the whole thing and just unlock
the spinlock if !lock->blocking_ranges)

> +static void range_lock_unblock(struct range_lock *lock)
> +{
> +       if (!--lock->blocking_ranges)
> +               wake_up_process(lock->task);
> +}
> +
> +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +       struct interval_tree_node *node;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&tree->lock, flags);
> +       interval_tree_remove(&lock->node, &tree->root);
> +       node = interval_tree_iter_first(&tree->root, lock->node.start,
> +                                       lock->node.last);
> +       while (node) {
> +               range_lock_unblock((struct range_lock *)node);
> +               node = interval_tree_iter_next(node, lock->node.start,
> +                                              lock->node.last);
> +       }

Maybe just a personal preference, but I prefer a for loop.
Also, I would prefer container_of() instead of a cast.
Finally, I don't think I see the benefit of having a separate
range_lock_unblock function instead of a couple lines to do it within
the for loop.
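
i.e. something like this inside the loop (sketch):

	struct range_lock *rl = container_of(node, struct range_lock, node);

	if (!--rl->blocking_ranges)
		wake_up_process(rl->task);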

The above may sound critical, but I actually like your proposal a lot :)

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] lib: Implement range locks
  2013-02-11  5:42     ` Michel Lespinasse
@ 2013-02-11 10:27       ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-11 10:27 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Sun 10-02-13 21:42:32, Michel Lespinasse wrote:
> Hi Jan,
> 
> On Thu, Jan 31, 2013 at 1:49 PM, Jan Kara <jack@suse.cz> wrote:
> > Implement range locking using interval tree.
> 
> Yay! I like to see interval trees being put to good use.
  Yeah, you saved me some coding of an interval tree implementation :) The
code I originally planned would be slightly more efficient I think, but
yours is far simpler.

> > +/*
> > + * Range locking
> > + *
> > + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> > + * range is locked only after all conflicting range locks requested previously
> > + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> > + *
> > + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> > + * total number of ranges and R_int is the number of ranges intersecting the
> > + * operated range.
> > + */
> 
> I think the cost is actually O((1+R_int)log(R_all)) as each
> interval_tree_iter_{first,next} call is O(log(R_all))
  Right. I'll fix that in the comment.
 
> Not that it'll make a huge difference in practice - the cost will be
> cheap enough either way.
> 
> > +struct range_lock {
> > +       struct interval_tree_node node;
> > +       struct task_struct *task;
> > +       /* Number of ranges which are blocking acquisition of the lock */
> s/ranges/previously requested ranges/
> 
> I think it's worth writing this down as I originally found this confusing.
> 
> BTW, I like how you only count previously requested ranges in order to
> guarantee fairness. This was absolutely not obvious to me.
  OK, I'll update the comment.

> > +#define RANGE_LOCK_INITIALIZER(start, end) {\
> > +       .node = {\
> > +               .start = (start),\
> > +               .end = (end)\
> > +       }\
> > +}
> 
> I have not found any uses of this, but it seems it wouldn't work as
> you want .last instead of .end
  I'll just delete it I guess.

> BTW, it's important to make it clear that last is the last value that
> is *included* in the interval, not the first value that follows it.
  In current versions I have this noted at function definitions.

> > +void range_lock_init(struct range_lock *lock, unsigned long start,
> > +                    unsigned long end);
> > +void range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> > +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> 
> Is there a point to separating the init and lock stages ? maybe the API could be
> void range_lock(struct range_lock_tree *tree, struct range_lock *lock,
> unsigned long start, unsigned long last);
> void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
  I was thinking about this as well. Currently I don't have a place which
would make it beneficial to separate _init and _lock but I can imagine such
uses (where you don't want to pass the interval information down the stack
and it's easier to pass the whole lock structure). Also it looks a bit
confusing to pass (tree, lock, start, last) to the locking function. So I
left it there.

OTOH I had to somewhat change the API so that the locking phase is now
separated into a "lock_prep" phase, which inserts the node into the tree and
counts blocking ranges, and a "wait" phase, which waits for the blocking
ranges to unlock. The reason for this split is that while "lock_prep" needs
to happen under some lock synchronizing operations on the tree, the "wait"
phase can easily be lockless. So this allows me to remove the knowledge of
how operations on the tree are synchronized from the range locking code
itself. That further allowed me to use mapping->tree_lock for
synchronization and basically reduce the cost of mapping range locking to
close to 0 for buffered IO (just a single lookup in the tree in the fast
path).
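
So the buffered IO fast path becomes roughly the following (the function and
field names are only meant to illustrate the split, the real code may
differ):

	struct range_lock rl;

	range_lock_init(&rl, start, last);
	spin_lock_irq(&mapping->tree_lock);
	/* insert the node into the tree and count conflicting ranges */
	range_lock_prep(mapping->range_lock_tree, &rl);
	spin_unlock_irq(&mapping->tree_lock);
	/* no tree lock needed while waiting for the blockers to go away */
	range_wait_lock(&rl);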

So maybe we want to reduce the number of calls for locking from 3 to 2 by
removing the _init phase. I'm not really decided; for the mapping range lock
itself the lock operation is squashed into 1 call anyway, and we don't have
other users now...

> (I changed end to last because I think end makes it sound like it's
> the first value after the interval, while last makes it clear that
> it's the last value in the interval)
  This may be a useful change. I'll use that I think.

> > +/*
> > + * Implementation of range locks.
> > + *
> > + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> > + * is requested, we add its interval to the tree and store number of intervals
> > + * intersecting it to 'blocking_ranges'.
> > + *
> > + * When a range is unlocked, we again walk intervals that intersect with the
> > + * unlocked one and decrement their 'blocking_ranges'.  We wake up owner of any
> > + * range lock whose 'blocking_ranges' drops to 0.
> > + */
> 
> May be worth repeating the comment about how this achieves fairness
> and avoids livelocks.
  Good idea. Added.

> > +void range_lock_init(struct range_lock *lock, unsigned long start,
> > +                    unsigned long end)
> > +{
> > +       lock->node.start = start;
> > +       lock->node.last = end;
> > +       RB_CLEAR_NODE(&lock->node.rb);
> 
> I really wish people didn't unnecessarily use RB_CLEAR_NODE before
> inserting nodes in an rbtree.
> RB_CLEAR_NODE is never necessary unless you want to tag unused nodes
> and check them later using RB_EMPTY_NODE.
  OK, removed and noted in memory.

> > +void range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> > +{
> > +       struct interval_tree_node *node;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&tree->lock, flags);
> 
> Are you expecting range locks to be used from hardirq context ? If
> not, it may be more appropriate to just use spin_lock_bh ?
  They are used from ->end_io context. I'm actually not sure whether that's
hardirq or just bh (I guess just bh). Anyway I use mapping->tree_lock for
now.

> > +       node = interval_tree_iter_first(&tree->root, lock->node.start,
> > +                                       lock->node.last);
> > +       while (node) {
> > +               lock->blocking_ranges++;
> > +               node = interval_tree_iter_next(node, lock->node.start,
> > +                                              lock->node.last);
> > +       }
> 
> Nitpicking here, but I think this is slightly easier to read as a for loop:
> for (node = interval_tree_iter_first(...);
>      node;
>      node = interval_tree_iter_next(...))
>         lock->blocking_ranges++;
  OK.

> > +       /* Do we need to go to sleep? */
> > +       while (lock->blocking_ranges) {
> > +               lock->task = current;
> > +               __set_current_state(TASK_UNINTERRUPTIBLE);
> > +               spin_unlock_irqrestore(&tree->lock, flags);
> > +               schedule();
> > +               spin_lock_irqsave(&tree->lock, flags);
> > +       }
> > +       spin_unlock_irqrestore(&tree->lock, flags);
> 
> I think I would prefer:
>         lock->task = tsk = current;
>         spin_unlock_irqrestore(&tree->lock, flags);
>         while (true) {
>                 set_task_state(tsk, TASK_UNINTERRUPTIBLE);
>                 if (!lock->blocking_ranges)
>                         break;
>                 schedule();
>         }
>         set_task_state(tsk, TASK_RUNNING);
> 
> This avoids an unnecessary spinlock acquisition when we obtain the range lock.
> (You can optionally choose to avoid the whole thing and just unlock
> the spinlock if !lock->blocking_ranges)
  This code is somewhat different in the latest version...

> > +static void range_lock_unblock(struct range_lock *lock)
> > +{
> > +       if (!--lock->blocking_ranges)
> > +               wake_up_process(lock->task);
> > +}
> > +
> > +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> > +{
> > +       struct interval_tree_node *node;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&tree->lock, flags);
> > +       interval_tree_remove(&lock->node, &tree->root);
> > +       node = interval_tree_iter_first(&tree->root, lock->node.start,
> > +                                       lock->node.last);
> > +       while (node) {
> > +               range_lock_unblock((struct range_lock *)node);
> > +               node = interval_tree_iter_next(node, lock->node.start,
> > +                                              lock->node.last);
> > +       }
> 
> Maybe just a personal preference, but I prefer a for loop.
  OK, although I don't see a difference in this case...

> Also, I would prefer container_of() instead of a cast.
  Ah, that's indeed better.

> Finally, I don't think I see the benefit of having a separate
> range_lock_unblock function instead of a couple lines to do it within
> the for loop.
  Yup.

> The above may sound critical, but I actually like your proposal a lot :)
  Thanks for the detailed review!

> Reviewed-by: Michel Lespinasse <walken@google.com>
  I actually didn't add this because there are some differences in the
current version...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] lib: Implement range locks
  2013-02-11 10:27       ` Jan Kara
@ 2013-02-11 11:03         ` Michel Lespinasse
  -1 siblings, 0 replies; 46+ messages in thread
From: Michel Lespinasse @ 2013-02-11 11:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: LKML, linux-fsdevel, linux-mm

On Mon, Feb 11, 2013 at 2:27 AM, Jan Kara <jack@suse.cz> wrote:
> On Sun 10-02-13 21:42:32, Michel Lespinasse wrote:
>> On Thu, Jan 31, 2013 at 1:49 PM, Jan Kara <jack@suse.cz> wrote:
>> > +void range_lock_init(struct range_lock *lock, unsigned long start,
>> > +                    unsigned long end);
>> > +void range_lock(struct range_lock_tree *tree, struct range_lock *lock);
>> > +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
>>
>> Is there a point to separating the init and lock stages ? maybe the API could be
>> void range_lock(struct range_lock_tree *tree, struct range_lock *lock,
>> unsigned long start, unsigned long last);
>> void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
>   I was thinking about this as well. Currently I don't have a place which
> would make it beneficial to separate _init and _lock but I can imagine such
> uses (where you don't want to pass the interval information down the stack
> and it's easier to pass the whole lock structure). Also it looks a bit
> confusing to pass (tree, lock, start, last) to the locking function. So I
> left it there.
>
> OTOH I had to somewhat change the API so that the locking phase is now
> separated in "lock_prep" phase which inserts the node into the tree and
> counts blocking ranges and "wait" phase which waits for the blocking ranges
> to unlock. The reason for this split is that while "lock_prep" needs to
> happen under some lock synchronizing operations on the tree, "wait" phase
> can be easily lockless. So this allows me to remove the knowledge of how
> operations on the tree are synchronized from range locking code itself.
> That further allowed me to use mapping->tree_lock for synchronization and
> basically reduce the cost of mapping range locking close to 0 for buffered
> IO (just a single tree lookup in the tree in the fast path).

Ah yes, being able to externalize the lock is good.

I think in this case, it makes the most sense for the lock_prep phase to
also initialize the lock node, though.
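
i.e. something along these lines (sketch only):

	void range_lock_prep(struct range_lock_tree *tree, struct range_lock *lock,
			     unsigned long start, unsigned long last)
	{
		lock->node.start = start;
		lock->node.last = last;
		lock->blocking_ranges = 0;
		/* then insert into the tree and count conflicting ranges
		 * exactly as the current code does */
	}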

>> Reviewed-by: Michel Lespinasse <walken@google.com>
>   I actually didn't add this because there are some differences in the
> current version...

Did I miss another posting of yours, or is that coming up ?

Cheers,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/6] lib: Implement range locks
  2013-02-11 11:03         ` Michel Lespinasse
@ 2013-02-11 12:58           ` Jan Kara
  -1 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2013-02-11 12:58 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: Jan Kara, LKML, linux-fsdevel, linux-mm

On Mon 11-02-13 03:03:30, Michel Lespinasse wrote:
> On Mon, Feb 11, 2013 at 2:27 AM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 10-02-13 21:42:32, Michel Lespinasse wrote:
> >> On Thu, Jan 31, 2013 at 1:49 PM, Jan Kara <jack@suse.cz> wrote:
> >> > +void range_lock_init(struct range_lock *lock, unsigned long start,
> >> > +                    unsigned long end);
> >> > +void range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> >> > +void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> >>
> >> Is there a point to separating the init and lock stages? Maybe the API could be
> >> void range_lock(struct range_lock_tree *tree, struct range_lock *lock,
> >> unsigned long start, unsigned long last);
> >> void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> >   I was thinking about this as well. Currently I don't have a place where
> > separating _init and _lock would be beneficial, but I can imagine such uses
> > (where you don't want to pass the interval information down the stack and
> > it's easier to pass the whole lock structure). Also it looks a bit
> > confusing to pass (tree, lock, start, last) to the locking function. So I
> > left it as it was.
> >
> > OTOH I had to somewhat change the API so that the locking phase is now
> > separated into a "lock_prep" phase, which inserts the node into the tree
> > and counts blocking ranges, and a "wait" phase, which waits for the
> > blocking ranges to unlock. The reason for this split is that while
> > "lock_prep" needs to happen under some lock synchronizing operations on
> > the tree, the "wait" phase can easily be lockless. This allows me to
> > remove the knowledge of how operations on the tree are synchronized from
> > the range locking code itself. That further allowed me to use
> > mapping->tree_lock for synchronization and reduce the cost of mapping
> > range locking to close to 0 for buffered IO (just a single lookup in the
> > tree in the fast path).
> 
> Ah yes, being able to externalize the lock is good.
> 
> I think in this case it makes the most sense for the lock_prep phase to
> also initialize the lock node, though.
  I guess so.

> >> Reviewed-by: Michel Lespinasse <walken@google.com>
> >   I actually didn't add this because there are some differences in the
> > current version...
> 
> Did I miss another posting of yours, or is that coming up?
  That will come. But as Dave Chinner pointed out, for buffered writes we
should rather lock the whole range specified in the syscall (to avoid
strange results from a racing truncate / write when i_mutex isn't used),
and that requires us to put the range lock above mmap_sem, which isn't
currently easily possible due to page fault handling... So if the whole
patch set is to go anywhere, I need to solve that somehow.
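
To illustrate the ordering problem, here is a rough sketch of the two paths
involved. It uses the range lock calls from this patch set, but otherwise
hypothetical helpers and a made-up mapping->range_tree field; it is only
meant to show the ABBA ordering, not real code:

/* Buffered write, if it took the range lock for the whole syscall range: */
static ssize_t write_locked_range(struct address_space *mapping,
				  const char __user *buf, size_t len,
				  loff_t pos)
{
	struct range_lock rl;

	/* Order A: range lock covering [pos, pos + len) is taken first... */
	range_lock_init(&rl, pos >> PAGE_SHIFT,
			(pos + len - 1) >> PAGE_SHIFT);
	range_lock(&mapping->range_tree, &rl);

	/*
	 * ...but copying from the user buffer can fault, and the fault
	 * handler runs under mmap_sem, so this path nests
	 * range lock -> mmap_sem.
	 */
	copy_in_pages_stub(mapping, buf, len, pos);

	range_unlock(&mapping->range_tree, &rl);
	return len;
}

/*
 * Page fault path: mmap_sem is already held when the fault handler needs
 * the range lock for the faulting page, i.e. mmap_sem -> range lock,
 * the opposite order.
 */
static int fault_in_page_stub(struct address_space *mapping, pgoff_t index)
{
	struct range_lock rl;

	range_lock_init(&rl, index, index);
	range_lock(&mapping->range_tree, &rl);
	/* instantiate and map the page here */
	range_unlock(&mapping->range_tree, &rl);
	return 0;
}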

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2013-02-11 12:58 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-31 21:49 [PATCH 0/6 RFC] Mapping range lock Jan Kara
2013-01-31 21:49 ` Jan Kara
2013-01-31 21:49 ` [PATCH 1/6] lib: Implement range locks Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-01-31 23:57   ` Andrew Morton
2013-01-31 23:57     ` Andrew Morton
2013-02-04 16:41     ` Jan Kara
2013-02-04 16:41       ` Jan Kara
2013-02-11  5:42   ` Michel Lespinasse
2013-02-11  5:42     ` Michel Lespinasse
2013-02-11 10:27     ` Jan Kara
2013-02-11 10:27       ` Jan Kara
2013-02-11 11:03       ` Michel Lespinasse
2013-02-11 11:03         ` Michel Lespinasse
2013-02-11 12:58         ` Jan Kara
2013-02-11 12:58           ` Jan Kara
2013-01-31 21:49 ` [PATCH 2/6] fs: Take mapping lock in generic read paths Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-01-31 23:59   ` Andrew Morton
2013-01-31 23:59     ` Andrew Morton
2013-02-04 12:47     ` Jan Kara
2013-02-04 12:47       ` Jan Kara
2013-02-08 14:59       ` Jan Kara
2013-02-08 14:59         ` Jan Kara
2013-01-31 21:49 ` [PATCH 3/6] fs: Provide function to take mapping lock in buffered write path Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-01-31 21:49 ` [PATCH 4/6] fs: Don't call dio_cleanup() before submitting all bios Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-01-31 21:49 ` [PATCH 5/6] fs: Take mapping lock during direct IO Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-01-31 21:49 ` [PATCH 6/6] ext3: Convert ext3 to use mapping lock Jan Kara
2013-01-31 21:49   ` Jan Kara
2013-02-01  0:07 ` [PATCH 0/6 RFC] Mapping range lock Andrew Morton
2013-02-01  0:07   ` Andrew Morton
2013-02-04  9:29   ` Zheng Liu
2013-02-04  9:29     ` Zheng Liu
2013-02-04 12:38   ` Jan Kara
2013-02-04 12:38     ` Jan Kara
2013-02-05 23:25     ` Dave Chinner
2013-02-05 23:25       ` Dave Chinner
2013-02-06 19:25       ` Jan Kara
2013-02-06 19:25         ` Jan Kara
2013-02-07  2:43         ` Dave Chinner
2013-02-07  2:43           ` Dave Chinner
2013-02-07 11:06           ` Jan Kara
2013-02-07 11:06             ` Jan Kara
