* [RFC PATCH 0 of 8] O_DIRECT locking rework
@ 2007-02-07  0:32 Chris Mason
  2007-02-07  0:32 ` [PATCH 1 of 8] Introduce a place holder page for the pagecache Chris Mason
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

Hello everyone,

Here's a respin of the O_DIRECT locking changes.  There are some minor
updates, mostly cleanups based on suggestions from Zach Brown and Dave
Chinner.

Linus found a deadlock between the placeholders and the page fault handler
when userland uses an mmap of the file as the O_DIRECT buffer.  I'm
getting around that by capping placeholder extents at DIO_PAGES (64)
intervals when the file has mappings.  This matches the interval used by
get_user_pages and makes sure get_user_pages has been called on a given
range in the address space before placeholders are added.
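
The capping rule can be sketched as a small helper (illustrative only: DIO_PAGES
matches the constant in fs/direct-io.c, but dio_placeholder_span() is an
invented name, not a function from the patch):

```c
#include <assert.h>
#include <limits.h>

#define DIO_PAGES 64	/* pages mapped per get_user_pages() call */

/*
 * If the file has cached pages or is mmap'd anywhere, the O_DIRECT
 * buffer may itself be a mapping of the file.  Capping the placeholder
 * span at the get_user_pages() interval guarantees the user pages are
 * faulted in before placeholders cover them, avoiding the deadlock
 * with the page fault handler.
 */
static unsigned long dio_placeholder_span(unsigned long nrpages,
					  int mapping_mapped)
{
	if (nrpages || mapping_mapped)
		return DIO_PAGES;
	return ULONG_MAX;
}
```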

I also hit a stale data exposure when get_block returns data past the
end of the placeholder extent.  A concurrent readpage could come in and
find metadata before the data had gotten to disk.  The fix is to
only ask get_block for data up to the end of the placeholder.
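
The fix amounts to clamping the size asked of get_block to the end of the
placeholdered range.  As a sketch (the field names follow the later patches,
but the helper itself and the 4k PAGE_CACHE_SHIFT are assumptions):

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12	/* assuming 4k pages */

/*
 * Never let a get_block mapping extend past the last placeholder: a
 * concurrent readpage on the uncovered tail could otherwise find
 * allocated metadata before the data has reached disk.
 */
static unsigned long clamp_to_placeholder(unsigned long fspages_end_off,
					  unsigned long index,
					  unsigned long b_size)
{
	unsigned long max_size;

	max_size = (fspages_end_off - index) << PAGE_CACHE_SHIFT;
	if (b_size > max_size)
		b_size = max_size;
	return b_size;
}
```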

Generally this isn't a problem, since the common case is for a single
placeholder to cover the entire DIO, so the suboptimal allocation result
doesn't happen very often.

-chris



* [PATCH 1 of 8] Introduce a place holder page for the pagecache
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07 17:36   ` Zach Brown
  2007-02-07  0:32 ` [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking Chris Mason
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

Place holder pages reserve ranges of the pagecache for an in-flight O_DIRECT
io.  mm/filemap.c is changed to wait on place holder pages before adding a
page into the page cache, and truncates are changed to wait for all of the
place holder pages to disappear.

Place holder pages can only be examined with the mapping lock held.  They
cannot be locked, and cannot have references increased or decreased on them.

Placeholders can span a range bigger than one page.  The placeholder is
inserted into the radix slot for the end of the range, and the flags field in
the page struct is used to record the start of the range.

A tag is added for the radix tree root (PAGE_CACHE_TAG_EXTENTS).  When
mm/filemap.c finds that tag set, a search for a given index in the pagecache
looks forward to find any placeholder extent that index may intersect.
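
The range encoding can be modeled in a few lines (a minimal illustration, not
the kernel's struct page; only the two fields a placeholder actually uses are
shown):

```c
#include <assert.h>

/*
 * A placeholder occupies a single radix slot: the slot for the END of
 * the range it covers.  index holds that last offset, and the
 * otherwise unused flags field holds the first offset.
 */
struct placeholder {
	unsigned long flags;	/* start of the covered range */
	unsigned long index;	/* end of the covered range (radix slot) */
};

/* the intersection test done by radix_tree_lookup_extent() */
static int placeholder_covers(const struct placeholder *p,
			      unsigned long lookup)
{
	return p->flags <= lookup && lookup <= p->index;
}
```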

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r fc2d683623bb -r 7819e6e3f674 drivers/mtd/devices/block2mtd.c
--- a/drivers/mtd/devices/block2mtd.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/drivers/mtd/devices/block2mtd.c	Tue Feb 06 19:45:28 2007 -0500
@@ -66,7 +66,7 @@ static void cache_readahead(struct addre
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
-		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		page = radix_tree_lookup_extent(&mapping->page_tree, pagei);
 		if (page && (!i))
 			break;
 		if (page)
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/fs.h
--- a/include/linux/fs.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/fs.h	Tue Feb 06 19:45:28 2007 -0500
@@ -490,6 +490,11 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+
+/*
+ * This tag is only valid on the root of the radix tree
+ */
+#define PAGE_CACHE_TAG_EXTENTS 2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/page-flags.h	Tue Feb 06 19:45:28 2007 -0500
@@ -263,4 +263,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page, pgoff_t start, pgoff_t end);
+
 #endif	/* PAGE_FLAGS_H */
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/pagemap.h	Tue Feb 06 19:45:28 2007 -0500
@@ -76,6 +76,9 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+                                  unsigned long start, unsigned long end,
+                                  gfp_t gfp_mask, int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -86,6 +89,12 @@ unsigned find_get_pages_contig(struct ad
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+                             unsigned long offset, unsigned long end);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+			       pgoff_t end);
+
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
@@ -116,6 +125,8 @@ int add_to_page_cache_lru(struct page *p
 				unsigned long index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+					     unsigned long index);
 
 /*
  * Return byte-offset into filesystem object for page.
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/radix-tree.h	Tue Feb 06 19:45:28 2007 -0500
@@ -53,6 +53,7 @@ static inline int radix_tree_is_direct_p
 /*** radix-tree API starts here ***/
 
 #define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_ROOT_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
@@ -168,6 +169,7 @@ radix_tree_gang_lookup_tag(struct radix_
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
 {
diff -r fc2d683623bb -r 7819e6e3f674 lib/radix-tree.c
--- a/lib/radix-tree.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/lib/radix-tree.c	Tue Feb 06 19:45:28 2007 -0500
@@ -468,6 +468,12 @@ void *radix_tree_tag_set(struct radix_tr
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
+
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag)
+{
+	root_tag_set(root, tag);
+}
+EXPORT_SYMBOL(radix_tree_root_tag_set);
 
 /**
  *	radix_tree_tag_clear - clear a tag on a radix tree node
diff -r fc2d683623bb -r 7819e6e3f674 mm/filemap.c
--- a/mm/filemap.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/mm/filemap.c	Tue Feb 06 19:45:28 2007 -0500
@@ -44,6 +44,11 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+static wait_queue_head_t *page_waitqueue(struct page *page);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -421,6 +426,35 @@ int filemap_write_and_wait_range(struct 
 	return err;
 }
 
+/*
+ * When the radix tree has the extent bit set, a lookup needs to search
+ * forward in the tree to find any extent the index might intersect.
+ * When extents are off, a faster radix_tree_lookup can be done instead.
+ *
+ * This does the appropriate lookup based on the PAGE_CACHE_TAG_EXTENTS
+ * bit on the root node
+ */
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+					     unsigned long index)
+{
+	if (radix_tree_tagged(root, PAGE_CACHE_TAG_EXTENTS)) {
+		struct page *p;
+		unsigned int found;
+		found = radix_tree_gang_lookup(root, (void **)(&p), index, 1);
+		if (found) {
+			if (PagePlaceHolder(p)) {
+				pgoff_t start = p->flags;
+				pgoff_t end = p->index;
+				if (end >= index && start <= index)
+					return p;
+			} else if (p->index == index)
+				return p;
+		}
+		return NULL;
+	}
+	return radix_tree_lookup(root, index);
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -437,22 +471,62 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
-
-	if (error == 0) {
-		write_lock_irq(&mapping->tree_lock);
+	int error;
+	struct page *tmp;
+
+	BUG_ON(PagePlaceHolder(page));
+
+again:
+	tmp = NULL;
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+
+	if (error)
+		goto out;
+
+	write_lock_irq(&mapping->tree_lock);
+	/*
+	 * If extents are on for this radix tree, we have to do
+	 * the more expensive search for an overlapping extent
+	 * before we try to insert.
+	 */
+	if (radix_tree_tagged(&mapping->page_tree, PAGE_CACHE_TAG_EXTENTS)) {
+		tmp = radix_tree_lookup_extent(&mapping->page_tree,
+					       offset);
+		if (tmp)
+			error = -EEXIST;
+	}
+	if (!error)
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
-			page_cache_get(page);
-			SetPageLocked(page);
-			page->mapping = mapping;
-			page->index = offset;
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-		}
-		write_unlock_irq(&mapping->tree_lock);
-		radix_tree_preload_end();
-	}
+	if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+		/*
+		 * we need this second search because not every
+		 * placeholder forces the extent bit on the root
+		 */
+		if (!tmp)
+			tmp = radix_tree_lookup_extent(&mapping->page_tree,
+					       offset);
+		if (tmp && PagePlaceHolder(tmp)) {
+			DEFINE_WAIT(wait);
+			wait_queue_head_t *wqh = page_waitqueue(tmp);
+			radix_tree_preload_end();
+			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+			write_unlock_irq(&mapping->tree_lock);
+			io_schedule();
+			finish_wait(wqh, &wait);
+			goto again;
+		}
+	}
+	if (!error) {
+		page_cache_get(page);
+		SetPageLocked(page);
+		page->mapping = mapping;
+		page->index = offset;
+		mapping->nrpages++;
+		__inc_zone_page_state(page, NR_FILE_PAGES);
+	}
+	write_unlock_irq(&mapping->tree_lock);
+	radix_tree_preload_end();
+out:
 	return error;
 }
 EXPORT_SYMBOL(add_to_page_cache);
@@ -516,6 +590,70 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+void wake_up_placeholder_page(struct page *page)
+{
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index (inclusive)
+ *
+ * wait_on_placeholder_pages() will search for and wait on a range of pages
+ * in the mapping
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	DEFINE_WAIT(wait);
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	while(start <= end) {
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+					(void **)pages, start,
+					ARRAY_SIZE(pages));
+		if (!ret)
+			break;
+		for (i = 0; i < ret; i++) {
+			if (PagePlaceHolder(pages[i]) &&
+			    pages[i]->flags <= end) {
+				wait_queue_head_t *wqh;
+				wqh = page_waitqueue(pages[i]);
+				prepare_to_wait(wqh, &wait,
+						TASK_UNINTERRUPTIBLE);
+				read_unlock_irq(&mapping->tree_lock);
+				io_schedule();
+				finish_wait(wqh, &wait);
+				goto again;
+			}
+			start = pages[i]->index + 1;
+			if (pages[i]->index > end)
+				goto done;
+		}
+		if (need_resched()) {
+			read_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+			read_lock_irq(&mapping->tree_lock);
+		}
+	}
+done:
+	read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -532,6 +670,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -568,6 +707,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -580,6 +720,7 @@ void fastcall __lock_page_nosync(struct 
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -597,13 +738,269 @@ struct page * find_get_page(struct addre
 	struct page *page;
 
 	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @placeholder: the placeholder page previously inserted (for verification)
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ *
+ * Any placeholder pages in the range specified are removed.  Any real
+ * pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+			     unsigned long start,
+			     unsigned long end)
+{
+	struct page *page;
+	int ret;
+	int i;
+	unsigned long num;
+	struct page *pages[8];
+
+	write_lock_irq(&mapping->tree_lock);
+	while (start < end) {
+		num = min(ARRAY_SIZE(pages), end - start);
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+						(void **)pages, start, num);
+		for (i = 0; i < ret; i++) {
+			page = pages[i];
+			start = page->index + 1;
+			if (page->index >= end)
+				break;
+			if (PagePlaceHolder(page)) {
+				radix_tree_delete(&mapping->page_tree,
+						  page->index);
+				wake_up_placeholder_page(page);
+				kfree(page);
+			} else {
+				unlock_page(page);
+				page_cache_release(page);
+			}
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a placeholder into multiple slots
+ * in the radix tree.  This could probably use an optimized version
+ * in the radix code.  It may insert fewer than the request number
+ * of placeholders if we need to reschedule or the radix tree needs to
+ * be preloaded again.
+ *
+ * returns zero on error or the number actually inserted.
+ */
+static int insert_placeholder(struct address_space *mapping,
+					 struct page *insert)
+{
+	int err;
+	unsigned int found;
+	struct page *debug_page;
+	/* sanity check, make sure other extents don't exist in this range */
+	found = radix_tree_gang_lookup(&mapping->page_tree,
+				    (void **)(&debug_page),
+				    insert->flags, 1);
+	BUG_ON(found > 0 && debug_page->flags <= (insert->index));
+	err = radix_tree_insert(&mapping->page_tree, insert->index, insert);
+	return err;
+}
+
+
+static struct page *alloc_placeholder(gfp_t gfp_mask)
+{
+	struct page *p;
+	p = kzalloc(sizeof(*p), gfp_mask);
+	if (p)
+		p->mapping = &placeholder_address_space;
+	return p;
+}
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @gfp_mask: page allocation mode
+ * @insert: the page to insert if none is found
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+				  unsigned long start, unsigned long end,
+				  gfp_t gfp_mask, int iowait)
+{
+	int err = 0;
+	int i, ret;
+	unsigned long cur = start;
+	struct page *page;
+	int restart;
+	struct page *insert = NULL;
+	struct page *pages[8];
+	/*
+	 * this gets complicated.  Placeholders and page locks need to
+	 * be taken in order.  We use gang lookup to cut down on the cpu
+	 * cost, but we need to keep track of holes in the results and
+	 * insert placeholders as appropriate.
+	 *
+	 * If a locked page or a placeholder is found, we wait for it and
+	 * pick up where we left off.  If a dirty or PG_writeback page is found
+	 * and iowait==1, we have to drop all of our locks, kick/wait for the
+	 * io and resume again.
+	 */
+repeat:
+	if (!insert) {
+		insert = alloc_placeholder(gfp_mask);
+		if (!insert) {
+			err = -ENOMEM;
+			goto fail;
+		}
+	}
+	if (cur != start )
+		cond_resched();
+	err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (err)
+		goto fail;
+	write_lock_irq(&mapping->tree_lock);
+
+	/* only set the extent tag if we are inserting placeholders for more
+	 * than one page worth of slots.  This way small random ios don't
+	 * suffer from slower lookups.
+	 */
+	if (cur == start && end - start > 1)
+		radix_tree_root_tag_set(&mapping->page_tree,
+					PAGE_CACHE_TAG_EXTENTS);
+repeat_lock:
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+					(void **)pages, cur,
+					min(ARRAY_SIZE(pages), end-cur));
+	for (i = 0 ; i < ret ; i++) {
+		restart = 0;
+		page = pages[i];
+
+		if (PagePlaceHolder(page) && page->flags < end) {
+			DEFINE_WAIT(wait);
+			wait_queue_head_t *wqh = page_waitqueue(page);
+			radix_tree_preload_end();
+			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+			write_unlock_irq(&mapping->tree_lock);
+			io_schedule();
+			finish_wait(wqh, &wait);
+			goto repeat;
+		}
+
+		if (page->index > cur) {
+			unsigned long top = min(end, page->index);
+			insert->index = top - 1;
+			insert->flags = cur;
+			err = insert_placeholder(mapping, insert);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			insert = NULL;
+			if (err)
+				goto fail;
+			cur = top;
+			if (cur < end)
+				goto repeat;
+			else
+				goto done;
+		}
+		if (page->index >= end) {
+			ret = 0;
+			break;
+		}
+		page_cache_get(page);
+		BUG_ON(page->index != cur);
+		BUG_ON(PagePlaceHolder(page));
+		if (TestSetPageLocked(page)) {
+			unsigned long tmpoff = page->index;
+			page_cache_get(page);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			__lock_page(page);
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != tmpoff)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			} else {
+				/* we've locked the page, but  we need to
+				 *  check it for dirty/writeback
+				 */
+				restart = 1;
+			}
+		}
+		if (iowait && (PageDirty(page) || PageWriteback(page))) {
+			unlock_page(page);
+			page_cache_release(page);
+			if (!restart) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+			}
+			err = filemap_write_and_wait_range(mapping,
+						 cur << PAGE_CACHE_SHIFT,
+						 end << PAGE_CACHE_SHIFT);
+			if (err)
+				goto fail;
+			goto repeat;
+		}
+		cur++;
+		if (restart)
+			goto repeat;
+		if (cur >= end)
+			break;
+	}
+
+	/* we haven't yet filled the range */
+	if (cur < end) {
+		/* if the search filled our array, there is more to do. */
+		if (ret && ret == ARRAY_SIZE(pages))
+			goto repeat_lock;
+
+		/* otherwise insert placeholders for the remaining offsets */
+		insert->index = end - 1;
+		insert->flags = cur;
+		err = insert_placeholder(mapping, insert);
+		write_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+		if (err)
+			goto fail;
+		insert = NULL;
+		cur = end;
+	} else {
+		write_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+	}
+done:
+	BUG_ON(cur < end);
+	BUG_ON(cur > end);
+	if (insert)
+		kfree(insert);
+	return err;
+fail:
+	remove_placeholder_pages(mapping, start, cur);
+	if (insert)
+		kfree(insert);
+	return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);
 
 /**
  * find_trylock_page - find and lock a page
@@ -617,8 +1014,8 @@ struct page *find_trylock_page(struct ad
 	struct page *page;
 
 	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -642,8 +1039,18 @@ struct page *find_lock_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 repeat:
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			DEFINE_WAIT(wait);
+			wait_queue_head_t *wqh = page_waitqueue(page);
+			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+			read_unlock_irq(&mapping->tree_lock);
+			io_schedule();
+			finish_wait(wqh, &wait);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -727,14 +1134,25 @@ unsigned find_get_pages(struct address_s
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			    unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while(i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memcpy(&pages[i], &pages[i+1],
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -761,6 +1179,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
@@ -785,14 +1205,25 @@ unsigned find_get_pages_tag(struct addre
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while(i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memcpy(&pages[i], &pages[i+1],
+		                       (ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
@@ -2406,18 +2837,15 @@ generic_file_direct_IO(int rw, struct ki
 			unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
-	retval = filemap_write_and_wait(mapping);
-	if (retval == 0) {
-		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
-						offset, nr_segs);
-		if (rw == WRITE && mapping->nrpages) {
-			pgoff_t end = (offset + write_len - 1)
-						>> PAGE_CACHE_SHIFT;
-			int err = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
-			if (err)
-				retval = err;
-		}
+	retval = mapping->a_ops->direct_IO(rw, iocb, iov,
+					offset, nr_segs);
+	if (rw == WRITE && mapping->nrpages) {
+		pgoff_t end = (offset + write_len - 1)
+					>> PAGE_CACHE_SHIFT;
+		int err = invalidate_inode_pages2_range(mapping,
+				offset >> PAGE_CACHE_SHIFT, end);
+		if (err)
+			retval = err;
 	}
 	return retval;
 }
diff -r fc2d683623bb -r 7819e6e3f674 mm/migrate.c
--- a/mm/migrate.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/mm/migrate.c	Tue Feb 06 19:45:28 2007 -0500
@@ -305,8 +305,12 @@ static int migrate_page_move_mapping(str
 
 	write_lock_irq(&mapping->tree_lock);
 
+	/*
+	 * we don't need to worry about placeholders here,
+	 * the slot in the tree is verified
+	 */
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
- 					page_index(page));
+					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
diff -r fc2d683623bb -r 7819e6e3f674 mm/readahead.c
--- a/mm/readahead.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/mm/readahead.c	Tue Feb 06 19:45:28 2007 -0500
@@ -288,7 +288,8 @@ __do_page_cache_readahead(struct address
 		if (page_offset > end_index)
 			break;
 
-		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		page = radix_tree_lookup_extent(&mapping->page_tree,
+						page_offset);
 		if (page)
 			continue;
 
diff -r fc2d683623bb -r 7819e6e3f674 mm/truncate.c
--- a/mm/truncate.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/mm/truncate.c	Tue Feb 06 19:45:28 2007 -0500
@@ -236,6 +236,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 




* [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
  2007-02-07  0:32 ` [PATCH 1 of 8] Introduce a place holder page for the pagecache Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07 20:11   ` Zach Brown
  2007-02-07  0:32 ` [PATCH 3 of 8] DIO: don't fall back to buffered writes Chris Mason
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

All mutex and semaphore usage is removed from the blockdev_direct_IO paths.
Filesystems can either do this locking on their own, or ask for placeholder
pages.
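
The window of placeholders grows toward the end of the io in span-limited
chunks (dio_lock_page_range() in the diff below).  The advance step can be
sketched standalone (a hypothetical helper mirroring the patch's arithmetic):

```c
#include <assert.h>

/*
 * One step of extending the placeholdered window: end_off grows
 * toward the inclusive last index 'end', never by more than 'span'
 * pages at a time.  Returns the new end_off.
 */
static unsigned long placeholder_advance(unsigned long end_off,
					 unsigned long end,
					 unsigned long span)
{
	unsigned long nr = end - end_off + 1;

	if (nr > span)
		nr = span;
	return end_off + nr;
}
```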

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 7819e6e3f674 -r 5cd028318654 fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 19:45:28 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:49 2007 -0500
@@ -36,6 +36,7 @@
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
+#include <linux/writeback.h>
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -95,6 +96,22 @@ struct dio {
 	struct buffer_head map_bh;	/* last get_block() result */
 
 	/*
+	 * kernel page pinning (placeholders);
+	 */
+	unsigned long fspages_start_off; /* page index where pinning starts */
+
+	/*
+	 * end off is the first page past the end of the pinned range.  If
+	 * no pages or placeholders are pinned down, start_off == end_off
+	 */
+	unsigned long fspages_end_off;
+
+	/*
+	 * how big of a radix extent are we allowed to insert
+	 */
+	unsigned long fspages_span;
+
+	/*
 	 * Deferred addition of a page to the dio.  These variables are
 	 * private to dio_send_cur_page(), submit_page_section() and
 	 * dio_bio_add_page().
@@ -187,7 +204,50 @@ static int dio_refill_pages(struct dio *
 		ret = 0;
 	}
 out:
-	return ret;	
+	return ret;
+}
+
+static void dio_unlock_page_range(struct dio *dio)
+{
+	if (dio->lock_type != DIO_NO_LOCKING) {
+		remove_placeholder_pages(dio->inode->i_mapping,
+					 dio->fspages_start_off,
+					 dio->fspages_end_off);
+		dio->fspages_end_off = dio->fspages_start_off;
+	}
+}
+
+static int dio_lock_page_range(struct dio *dio, struct buffer_head *map_bh,
+				unsigned long index, unsigned long end)
+{
+	struct address_space *mapping = dio->inode->i_mapping;
+	unsigned long max_size;
+	int ret = 0;
+
+	if (dio->lock_type == DIO_NO_LOCKING)
+		return 0;
+
+	while (index >= dio->fspages_end_off) {
+		unsigned long nr = end - dio->fspages_end_off + 1;
+		nr = min(nr, dio->fspages_span);
+		ret = find_or_insert_placeholders(mapping,
+						  dio->fspages_end_off,
+						  dio->fspages_end_off + nr,
+						  GFP_KERNEL, 1);
+		if (ret)
+			break;
+		dio->fspages_end_off += nr;
+	}
+	/*
+	 * if we allow the FS to allocate more than we've placeholdered,
+	 * a concurrent readahead operation will find metadata where the
+	 * corresponding data has never been written.  This will trim
+	 * down amount of data we ask the FS to return.
+	 */
+	max_size = (dio->fspages_end_off - index) << PAGE_CACHE_SHIFT;
+	if (map_bh->b_size > max_size)
+		map_bh->b_size = max_size;
+	return ret;
 }
 
 /*
@@ -246,9 +306,7 @@ static int dio_complete(struct dio *dio,
 	if (dio->end_io && dio->result)
 		dio->end_io(dio->iocb, offset, transferred,
 			    dio->map_bh.b_private);
-	if (dio->lock_type == DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
+	dio_unlock_page_range(dio);
 
 	if (ret == 0)
 		ret = dio->page_errors;
@@ -513,6 +571,8 @@ static int get_more_blocks(struct dio *d
 	unsigned long fs_count;	/* Number of filesystem-sized blocks */
 	unsigned long dio_count;/* Number of dio_block-sized blocks */
 	unsigned long blkmask;
+	unsigned long index;
+	unsigned long end;
 	int create;
 
 	/*
@@ -540,7 +600,14 @@ static int get_more_blocks(struct dio *d
 		} else if (dio->lock_type == DIO_NO_LOCKING) {
 			create = 0;
 		}
-
+	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
+		                        dio->inode->i_blkbits);
+		end = (dio->final_block_in_request >> dio->blkfactor) >>
+		      (PAGE_CACHE_SHIFT - dio->inode->i_blkbits);
+		BUG_ON(index > end);
+		ret = dio_lock_page_range(dio, map_bh, index, end);
+		if (ret)
+			goto error;
 		/*
 		 * For writes inside i_size we forbid block creations: only
 		 * overwrites are permitted.  We fall back to buffered writes
@@ -550,6 +617,7 @@ static int get_more_blocks(struct dio *d
 		ret = (*dio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);
 	}
+error:
 	return ret;
 }
 
@@ -946,9 +1014,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -992,6 +1057,24 @@ direct_io_worker(int rw, struct kiocb *i
 	dio->bio_list = NULL;
 	dio->waiter = NULL;
 
+	if (dio->lock_type != DIO_NO_LOCKING) {
+		dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
+		dio->fspages_end_off = dio->fspages_start_off;
+
+		/* if the mapping is mapped, they may be using a mmap'd portion
+		 * of the file as the buffer for this io.  That will deadlock
+		 * with placeholders because the placeholder code forces the
+		 * page fault handler to block.  The (ugly) solution is to
+		 * limit the span of inserted placeholders to the same
+		 * increment we use for get_user_pages.
+		 */
+		if (inode->i_mapping->nrpages ||
+		    mapping_mapped(inode->i_mapping))
+			dio->fspages_span = DIO_PAGES;
+		else
+			dio->fspages_span = ULONG_MAX;
+	}
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
@@ -1074,14 +1157,6 @@ direct_io_worker(int rw, struct kiocb *i
 	dio_cleanup(dio);
 
 	/*
-	 * All block lookups have been performed. For READ requests
-	 * we can let i_mutex go now that its achieved its purpose
-	 * of protecting us from looking up uninitialized blocks.
-	 */
-	if ((rw == READ) && (dio->lock_type == DIO_LOCKING))
-		mutex_unlock(&dio->inode->i_mutex);
-
-	/*
 	 * The only time we want to leave bios in flight is when a successful
 	 * partial aio read or full aio write have been setup.  In that case
 	 * bio completion will call aio_complete.  The only time it's safe to
@@ -1130,8 +1205,6 @@ direct_io_worker(int rw, struct kiocb *i
  * DIO_LOCKING (simple locking for regular files)
  * For writes we are called under i_mutex and return with i_mutex held, even
  * though it is internally dropped.
- * For reads, i_mutex is not held on entry, but it is taken and dropped before
- * returning.
  *
  * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
  *	uninitialised data, allowing parallel direct readers and writers)
@@ -1156,8 +1229,7 @@ __blockdev_direct_IO(int rw, struct kioc
 	ssize_t retval = -EINVAL;
 	loff_t end = offset;
 	struct dio *dio;
-	int release_i_mutex = 0;
-	int acquire_i_mutex = 0;
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
 
 	if (rw & WRITE)
 		rw = WRITE_SYNC;
@@ -1186,49 +1258,26 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 		}
 	}
-
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
 
+
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
 	 *	neither readers nor writers do any locking at all
 	 * For regular files using DIO_LOCKING,
-	 *	readers need to grab i_mutex and i_alloc_sem
-	 *	writers need to grab i_alloc_sem only (i_mutex is already held)
+	 *	No locks are taken
 	 * For regular files using DIO_OWN_LOCKING,
 	 *	neither readers nor writers take any locks here
 	 */
 	dio->lock_type = dio_lock_type;
-	if (dio_lock_type != DIO_NO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
-			struct address_space *mapping;
-
-			mapping = iocb->ki_filp->f_mapping;
-			if (dio_lock_type != DIO_OWN_LOCKING) {
-				mutex_lock(&inode->i_mutex);
-				release_i_mutex = 1;
-			}
-
-			retval = filemap_write_and_wait_range(mapping, offset,
-							      end - 1);
-			if (retval) {
-				kfree(dio);
-				goto out;
-			}
-
-			if (dio_lock_type == DIO_OWN_LOCKING) {
-				mutex_unlock(&inode->i_mutex);
-				acquire_i_mutex = 1;
-			}
-		}
-
-		if (dio_lock_type == DIO_LOCKING)
-			/* lockdep: not the owner will release it */
-			down_read_non_owner(&inode->i_alloc_sem);
+
+	if (dio->lock_type == DIO_NO_LOCKING && end > offset) {
+		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+		if (retval)
+			goto out;
 	}
 
 	/*
@@ -1242,15 +1291,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
-
-	if (rw == READ && dio_lock_type == DIO_LOCKING)
-		release_i_mutex = 0;
-
 out:
-	if (release_i_mutex)
-		mutex_unlock(&inode->i_mutex);
-	else if (acquire_i_mutex)
-		mutex_lock(&inode->i_mutex);
 	return retval;
 }
 EXPORT_SYMBOL(__blockdev_direct_IO);





* [PATCH 3 of 8] DIO: don't fall back to buffered writes
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
  2007-02-07  0:32 ` [PATCH 1 of 8] Introduce a place holder page for the pagecache Chris Mason
  2007-02-07  0:32 ` [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07  0:32 ` [PATCH 4 of 8] Add flags to control direct IO helpers Chris Mason
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

Placeholder pages allow DIO to use locking rules similar to those of
writepage.  DIO can now fill holes, and it can extend the file via
get_block().

i_mutex can be dropped during writes if we are writing inside i_size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 5cd028318654 -r 1a7105ab9c19 fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:49 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:55 2007 -0500
@@ -1,3 +1,4 @@
+					  GFP_KERNEL, 1);
 /*
  * fs/direct-io.c
  *
@@ -70,6 +71,7 @@ struct dio {
 	int rw;
 	loff_t i_size;			/* i_size when submitted */
 	int lock_type;			/* doesn't change */
+	int reacquire_i_mutex;		/* should we get i_mutex when done? */
 	unsigned blkbits;		/* doesn't change */
 	unsigned blkfactor;		/* When we're using an alignment which
 					   is finer than the filesystem's soft
@@ -308,6 +310,9 @@ static int dio_complete(struct dio *dio,
 			    dio->map_bh.b_private);
 	dio_unlock_page_range(dio);
 
+	if (dio->reacquire_i_mutex)
+		mutex_lock(&dio->inode->i_mutex);
+
 	if (ret == 0)
 		ret = dio->page_errors;
 	if (ret == 0)
@@ -593,13 +598,8 @@ static int get_more_blocks(struct dio *d
 		map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
 		create = dio->rw & WRITE;
-		if (dio->lock_type == DIO_LOCKING) {
-			if (dio->block_in_file < (i_size_read(dio->inode) >>
-							dio->blkbits))
-				create = 0;
-		} else if (dio->lock_type == DIO_NO_LOCKING) {
+		if (dio->lock_type == DIO_NO_LOCKING)
 			create = 0;
-		}
 	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
 		                        dio->inode->i_blkbits);
 		end = (dio->final_block_in_request >> dio->blkfactor) >>
@@ -1289,6 +1289,13 @@ __blockdev_direct_IO(int rw, struct kioc
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
+	/* if our write is inside i_size, we can drop i_mutex */
+	dio->reacquire_i_mutex = 0;
+	if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
+	   end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
+		dio->reacquire_i_mutex = 1;
+		mutex_unlock(&inode->i_mutex);
+	}
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
 out:
diff -r 5cd028318654 -r 1a7105ab9c19 mm/filemap.c
--- a/mm/filemap.c	Tue Feb 06 20:02:49 2007 -0500
+++ b/mm/filemap.c	Tue Feb 06 20:02:55 2007 -0500
@@ -2840,10 +2840,19 @@ generic_file_direct_IO(int rw, struct ki
 	retval = mapping->a_ops->direct_IO(rw, iocb, iov,
 					offset, nr_segs);
 	if (rw == WRITE && mapping->nrpages) {
+		int err;
 		pgoff_t end = (offset + write_len - 1)
 					>> PAGE_CACHE_SHIFT;
-		int err = invalidate_inode_pages2_range(mapping,
-				offset >> PAGE_CACHE_SHIFT, end);
+
+		/* O_DIRECT is allowed to drop i_mutex, so more data
+		 * could have been dirtied by others.  Start io one more
+		 * time
+		 */
+		err = filemap_fdatawrite_range(mapping, offset,
+		                               offset + write_len - 1);
+		if (!err)
+			err = invalidate_inode_pages2_range(mapping,
+					offset >> PAGE_CACHE_SHIFT, end);
 		if (err)
 			retval = err;
 	}




* [PATCH 4 of 8] Add flags to control direct IO helpers
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
                   ` (2 preceding siblings ...)
  2007-02-07  0:32 ` [PATCH 3 of 8] DIO: don't fall back to buffered writes Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07 17:08   ` Suparna Bhattacharya
  2007-02-07  0:32 ` [PATCH 5 of 8] Make ext3 safe for the new DIO locking rules Chris Mason
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

This creates a number of flags so that filesystems can control
blockdev_direct_IO.  It is based on code from Russell Cettelan.

The new flags are:
DIO_CREATE -- always pass create=1 to get_block on writes.  This allows
	      DIO to fill holes in the file.
DIO_PLACEHOLDERS -- use placeholder pages to provide locking against buffered
	            io and truncates.
DIO_DROP_I_MUTEX -- drop i_mutex before starting the mapping, io submission,
		    or io waiting.  The mutex is dropped for AIO as well.

Some API changes are made so that filesystems can have more control
over the DIO features.

__blockdev_direct_IO is more or less renamed to blockdev_direct_IO_flags.
All waiting and invalidating of page cache data is pushed down into
blockdev_direct_IO_flags (and removed from mm/filemap.c)

direct_io_worker is exported into the wild.  Filesystems that want to be
special can pull out the bits of blockdev_direct_IO_flags they care about
and then call direct_io_worker directly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 1a7105ab9c19 -r 04dd7ddd593e fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:55 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1,4 +1,3 @@
-					  GFP_KERNEL, 1);
 /*
  * fs/direct-io.c
  *
@@ -55,13 +54,6 @@
  *
  * If blkfactor is zero then the user's request was aligned to the filesystem's
  * blocksize.
- *
- * lock_type is DIO_LOCKING for regular files on direct-IO-naive filesystems.
- * This determines whether we need to do the fancy locking which prevents
- * direct-IO from being able to read uninitialised disk blocks.  If its zero
- * (blockdev) this locking is not done, and if it is DIO_OWN_LOCKING i_mutex is
- * not held for the entire direct write (taken briefly, initially, during a
- * direct read though, but its never held for the duration of a direct-IO).
  */
 
 struct dio {
@@ -70,8 +62,7 @@ struct dio {
 	struct inode *inode;
 	int rw;
 	loff_t i_size;			/* i_size when submitted */
-	int lock_type;			/* doesn't change */
-	int reacquire_i_mutex;		/* should we get i_mutex when done? */
+	unsigned flags;			/* locking and get_block flags */
 	unsigned blkbits;		/* doesn't change */
 	unsigned blkfactor;		/* When we're using an alignment which
 					   is finer than the filesystem's soft
@@ -211,7 +202,7 @@ out:
 
 static void dio_unlock_page_range(struct dio *dio)
 {
-	if (dio->lock_type != DIO_NO_LOCKING) {
+	if (dio->flags & DIO_PLACEHOLDERS) {
 		remove_placeholder_pages(dio->inode->i_mapping,
 					 dio->fspages_start_off,
 					 dio->fspages_end_off);
@@ -226,7 +217,7 @@ static int dio_lock_page_range(struct di
 	unsigned long max_size;
 	int ret = 0;
 
-	if (dio->lock_type == DIO_NO_LOCKING)
+	if (!(dio->flags & DIO_PLACEHOLDERS))
 		return 0;
 
 	while (index >= dio->fspages_end_off) {
@@ -310,9 +301,6 @@ static int dio_complete(struct dio *dio,
 			    dio->map_bh.b_private);
 	dio_unlock_page_range(dio);
 
-	if (dio->reacquire_i_mutex)
-		mutex_lock(&dio->inode->i_mutex);
-
 	if (ret == 0)
 		ret = dio->page_errors;
 	if (ret == 0)
@@ -597,8 +585,9 @@ static int get_more_blocks(struct dio *d
 		map_bh->b_state = 0;
 		map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
-		create = dio->rw & WRITE;
-		if (dio->lock_type == DIO_NO_LOCKING)
+		if (dio->flags & DIO_CREATE)
+			create = dio->rw & WRITE;
+		else
 			create = 0;
 	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
 		                        dio->inode->i_blkbits);
@@ -1014,19 +1003,41 @@ out:
 	return ret;
 }
 
-static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
+/*
+ * This does all the real work of the direct io.  Most filesystems want to
+ * call blockdev_direct_IO_flags instead, but if you have exotic locking
+ * routines you can call this directly.
+ *
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
+	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
 	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
-{
-	unsigned long user_addr; 
+	int is_async, unsigned dioflags)
+{
+	unsigned long user_addr;
 	unsigned long flags;
 	int seg;
 	ssize_t ret = 0;
 	ssize_t ret2;
 	size_t bytes;
-
+	struct dio *dio;
+
+	if (rw & WRITE)
+		rw = WRITE_SYNC;
+
+	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
+	ret = -ENOMEM;
+	if (!dio)
+		goto out;
+
+	dio->flags = dioflags;
+	dio->is_async = is_async;
 	dio->bio = NULL;
 	dio->inode = inode;
 	dio->rw = rw;
@@ -1057,7 +1068,7 @@ direct_io_worker(int rw, struct kiocb *i
 	dio->bio_list = NULL;
 	dio->waiter = NULL;
 
-	if (dio->lock_type != DIO_NO_LOCKING) {
+	if (dio->flags & DIO_PLACEHOLDERS) {
 		dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
 		dio->fspages_end_off = dio->fspages_start_off;
 
@@ -1192,33 +1203,24 @@ direct_io_worker(int rw, struct kiocb *i
 	} else
 		BUG_ON(ret != -EIOCBQUEUED);
 
+out:
 	return ret;
 }
-
-/*
- * This is a library function for use by filesystem drivers.
- * The locking rules are governed by the dio_lock_type parameter.
- *
- * DIO_NO_LOCKING (no locking, for raw block device access)
- * For writes, i_mutex is not held on entry; it is never taken.
- *
- * DIO_LOCKING (simple locking for regular files)
- * For writes we are called under i_mutex and return with i_mutex held, even
- * though it is internally dropped.
- *
- * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
- *	uninitialised data, allowing parallel direct readers and writers)
- * For writes we are called without i_mutex, return without it, never touch it.
- * For reads we are called under i_mutex and return with i_mutex held, even
- * though it may be internally dropped.
- *
- * Additional i_alloc_sem locking requirements described inline below.
- */
-ssize_t
-__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
-	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
-	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
-	int dio_lock_type)
+EXPORT_SYMBOL(direct_io_worker);
+
+/*
+ * A utility function for blockdev_direct_IO_flags, this checks
+ * alignment of an O_DIRECT iovec against filesystem and blockdevice
+ * requirements.
+ *
+ * It returns a blkbits value that will work for the io, and returns the
+ * end offset of the io (via blkbits_ret and end_ret).
+ *
+ * The function returns 0 if everything will work or -EINVAL on error
+ */
+int check_dio_alignment(struct inode *inode, struct block_device *bdev,
+			const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+			unsigned *blkbits_ret, loff_t *end_ret)
 {
 	int seg;
 	size_t size;
@@ -1226,13 +1228,7 @@ __blockdev_direct_IO(int rw, struct kioc
 	unsigned blkbits = inode->i_blkbits;
 	unsigned bdev_blkbits = 0;
 	unsigned blocksize_mask = (1 << blkbits) - 1;
-	ssize_t retval = -EINVAL;
 	loff_t end = offset;
-	struct dio *dio;
-	struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-	if (rw & WRITE)
-		rw = WRITE_SYNC;
 
 	if (bdev)
 		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
@@ -1242,7 +1238,7 @@ __blockdev_direct_IO(int rw, struct kioc
 			 blkbits = bdev_blkbits;
 		blocksize_mask = (1 << blkbits) - 1;
 		if (offset & blocksize_mask)
-			goto out;
+			return -EINVAL;
 	}
 
 	/* Check the memory alignment.  Blocks cannot straddle pages */
@@ -1254,27 +1250,60 @@ __blockdev_direct_IO(int rw, struct kioc
 			if (bdev)
 				 blkbits = bdev_blkbits;
 			blocksize_mask = (1 << blkbits) - 1;
-			if ((addr & blocksize_mask) || (size & blocksize_mask))  
-				goto out;
+			if ((addr & blocksize_mask) || (size & blocksize_mask))
+				return -EINVAL;
 		}
 	}
-	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
-	retval = -ENOMEM;
-	if (!dio)
+	*end_ret = end;
+	*blkbits_ret = blkbits;
+	return 0;
+}
+EXPORT_SYMBOL(check_dio_alignment);
+
+/*
+ * This is a library function for use by filesystem drivers.
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
+	struct block_device *bdev, const struct iovec *iov, loff_t offset,
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
+	unsigned flags)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	unsigned blkbits = 0;
+	ssize_t retval = -EINVAL;
+	loff_t end = 0;
+	int is_async;
+	int grab_i_mutex = 0;
+
+
+	if (check_dio_alignment(inode, bdev, iov, offset, nr_segs,
+				&blkbits, &end))
 		goto out;
 
-
-	/*
-	 * For block device access DIO_NO_LOCKING is used,
-	 *	neither readers nor writers do any locking at all
-	 * For regular files using DIO_LOCKING,
-	 *	No locks are taken
-	 * For regular files using DIO_OWN_LOCKING,
-	 *	neither readers nor writers take any locks here
-	 */
-	dio->lock_type = dio_lock_type;
-
-	if (dio->lock_type == DIO_NO_LOCKING && end > offset) {
+	if (rw & WRITE) {
+		/*
+		 * If it's a write, unmap all mmappings of the file up-front.
+		 * This will cause any pte dirty bits to be propagated into
+		 * the pageframes for the subsequent filemap_write_and_wait().
+		 */
+		if (mapping_mapped(mapping))
+			unmap_mapping_range(mapping, offset, end - offset, 0);
+		if (end <= i_size_read(inode) && (flags & DIO_DROP_I_MUTEX)) {
+			mutex_unlock(&inode->i_mutex);
+			grab_i_mutex = 1;
+		}
+	}
+	/*
+	 * the placeholder code does filemap_write_and_wait, so if we
+	 * aren't using placeholders we have to do it here
+	 */
+	if (!(flags & DIO_PLACEHOLDERS) && end > offset) {
 		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
 		if (retval)
 			goto out;
@@ -1286,19 +1315,30 @@ __blockdev_direct_IO(int rw, struct kioc
 	 * even for AIO, we need to wait for i/o to complete before
 	 * returning in this case.
 	 */
-	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
+	is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));
 
-	/* if our write is inside i_size, we can drop i_mutex */
-	dio->reacquire_i_mutex = 0;
-	if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
-	   end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
-		dio->reacquire_i_mutex = 1;
-		mutex_unlock(&inode->i_mutex);
-	}
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
-				nr_segs, blkbits, get_block, end_io, dio);
+				nr_segs, blkbits, get_block, end_io, is_async,
+				flags);
 out:
+	if (grab_i_mutex)
+		mutex_lock(&inode->i_mutex);
+
+	if ((rw & WRITE) && mapping->nrpages) {
+		int err;
+		/* O_DIRECT is allowed to drop i_mutex, so more data
+		 * could have been dirtied by others.  Start io one more
+		 * time
+		 */
+		err = filemap_write_and_wait_range(mapping, offset, end - 1);
+		if (!err)
+			err = invalidate_inode_pages2_range(mapping,
+					offset >> PAGE_CACHE_SHIFT,
+					(end - 1) >> PAGE_CACHE_SHIFT);
+		if (!retval && err)
+			retval = err;
+	}
 	return retval;
 }
-EXPORT_SYMBOL(__blockdev_direct_IO);
+EXPORT_SYMBOL(blockdev_direct_IO_flags);
diff -r 1a7105ab9c19 -r 04dd7ddd593e include/linux/fs.h
--- a/include/linux/fs.h	Tue Feb 06 20:02:55 2007 -0500
+++ b/include/linux/fs.h	Tue Feb 06 20:02:56 2007 -0500
@@ -1776,24 +1776,28 @@ static inline void do_generic_file_read(
 }
 
 #ifdef CONFIG_BLOCK
-ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
+int check_dio_alignment(struct inode *inode, struct block_device *bdev,
+                        const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+			                        unsigned *blkbits_ret, loff_t *end_ret);
+
+ssize_t blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
 	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
-	int lock_type);
-
-enum {
-	DIO_LOCKING = 1, /* need locking between buffered and direct access */
-	DIO_NO_LOCKING,  /* bdev; no locking at all between buffered/direct */
-	DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
-};
+	unsigned int dio_flags);
+
+#define DIO_PLACEHOLDERS (1 << 0)  /* insert placeholder pages */
+#define DIO_CREATE	(1 << 1)  /* pass create=1 to get_block when writing */
+#define DIO_DROP_I_MUTEX (1 << 2) /* drop i_mutex during writes */
 
 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
 	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs, get_block_t get_block,
 	dio_iodone_t end_io)
 {
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_block, end_io, DIO_LOCKING);
+	/* locking is on, FS wants to fill holes w/get_block */
+	return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
+				nr_segs, get_block, end_io, DIO_PLACEHOLDERS |
+				DIO_CREATE | DIO_DROP_I_MUTEX);
 }
 
 static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
@@ -1801,17 +1805,9 @@ static inline ssize_t blockdev_direct_IO
 	loff_t offset, unsigned long nr_segs, get_block_t get_block,
 	dio_iodone_t end_io)
 {
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_block, end_io, DIO_NO_LOCKING);
-}
-
-static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
-	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_block_t get_block,
-	dio_iodone_t end_io)
-{
-	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_block, end_io, DIO_OWN_LOCKING);
+	/* locking is off, create is off */
+	return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
+				nr_segs, get_block, end_io, 0);
 }
 #endif
 
diff -r 1a7105ab9c19 -r 04dd7ddd593e mm/filemap.c
--- a/mm/filemap.c	Tue Feb 06 20:02:55 2007 -0500
+++ b/mm/filemap.c	Tue Feb 06 20:02:56 2007 -0500
@@ -40,7 +40,7 @@
 
 #include <asm/mman.h>
 
-static ssize_t
+static inline ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
@@ -2817,46 +2817,12 @@ EXPORT_SYMBOL(generic_file_aio_write);
  * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if something
  * went wrong during pagecache shootdown.
  */
-static ssize_t
+static inline ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	ssize_t retval;
-	size_t write_len = 0;
-
-	/*
-	 * If it's a write, unmap all mmappings of the file up-front.  This
-	 * will cause any pte dirty bits to be propagated into the pageframes
-	 * for the subsequent filemap_write_and_wait().
-	 */
-	if (rw == WRITE) {
-		write_len = iov_length(iov, nr_segs);
-	       	if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
-	}
-
-	retval = mapping->a_ops->direct_IO(rw, iocb, iov,
-					offset, nr_segs);
-	if (rw == WRITE && mapping->nrpages) {
-		int err;
-		pgoff_t end = (offset + write_len - 1)
-					>> PAGE_CACHE_SHIFT;
-
-		/* O_DIRECT is allowed to drop i_mutex, so more data
-		 * could have been dirtied by others.  Start io one more
-		 * time
-		 */
-		err = filemap_fdatawrite_range(mapping, offset,
-		                               offset + write_len - 1);
-		if (!err)
-			err = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
-		if (err)
-			retval = err;
-	}
-	return retval;
+	return iocb->ki_filp->f_mapping->a_ops->direct_IO(rw, iocb, iov,
+							  offset, nr_segs);
 }
 
 /**




* [PATCH 5 of 8] Make ext3 safe for the new DIO locking rules
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
                   ` (3 preceding siblings ...)
  2007-02-07  0:32 ` [PATCH 4 of 8] Add flags to control direct IO helpers Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07  0:32 ` [PATCH 6 of 8] Make reiserfs safe for " Chris Mason
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

This creates a version of ext3_get_block that starts and ends a transaction.

By starting and ending the transaction inside get_block, ext3 avoids
lock inversion problems when the DIO code takes page locks inside
blockdev_direct_IO (transaction locks must always be taken after
page locks).

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 04dd7ddd593e -r 42596f5254ca fs/ext3/inode.c
--- a/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1673,6 +1673,30 @@ static int ext3_releasepage(struct page 
 	return journal_try_to_free_buffers(journal, page, wait);
 }
 
+static int ext3_get_block_direct_IO(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create)
+{
+	int ret = 0;
+	handle_t *handle = ext3_journal_start(inode, DIO_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+	ret = ext3_get_block(inode, iblock, bh_result, create);
+	/*
+	 * Reacquire the handle: ext3_get_block() can restart the transaction
+	 */
+	handle = journal_current_handle();
+	if (handle) {
+		int err;
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
+	}
+out:
+	return ret;
+}
+
 /*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
@@ -1693,39 +1717,58 @@ static ssize_t ext3_direct_IO(int rw, st
 	int orphan = 0;
 	size_t count = iov_length(iov, nr_segs);
 
-	if (rw == WRITE) {
-		loff_t final_size = offset + count;
-
+	if (rw == WRITE && (offset + count > inode->i_size)) { 
 		handle = ext3_journal_start(inode, DIO_CREDITS);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			goto out;
 		}
-		if (final_size > inode->i_size) {
-			ret = ext3_orphan_add(handle, inode);
-			if (ret)
-				goto out_stop;
-			orphan = 1;
-			ei->i_disksize = inode->i_size;
-		}
-	}
-
+		ret = ext3_orphan_add(handle, inode);
+		if (ret) {
+			ext3_journal_stop(handle);
+			goto out;
+		}
+		ei->i_disksize = inode->i_size;
+		ret = ext3_journal_stop(handle);
+		if (ret) {
+			/* something has gone horribly wrong, cleanup
+			 * the orphan list in ram
+			 */
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+		orphan = 1;
+	}
+
+	/*
+	 * the placeholder page code may take a page lock, so we have
+	 * to stop any running transactions before calling
+	 * blockdev_direct_IO.  Use ext3_get_block_direct_IO to start
+	 * and stop a transaction on each get_block call.
+	 */
 	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
-				 ext3_get_block, NULL);
+				 ext3_get_block_direct_IO, NULL);
 
 	/*
 	 * Reacquire the handle: ext3_get_block() can restart the transaction
 	 */
 	handle = journal_current_handle();
 
-out_stop:
-	if (handle) {
+	if (orphan) {
 		int err;
-
-		if (orphan && inode->i_nlink)
+		handle = ext3_journal_start(inode, DIO_CREDITS);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+
+		if (inode->i_nlink)
 			ext3_orphan_del(handle, inode);
-		if (orphan && ret > 0) {
+		if (ret > 0) {
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;




* [PATCH 6 of 8] Make reiserfs safe for new DIO locking rules
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
                   ` (4 preceding siblings ...)
  2007-02-07  0:32 ` [PATCH 5 of 8] Make ext3 safe for the new DIO locking rules Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07  0:32 ` [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls Chris Mason
  2007-02-07  0:32 ` [PATCH 8 of 8] Avoid too many boundary buffers in DIO Chris Mason
  7 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

reiserfs is changed to use a version of reiserfs_get_block that is safe
for filling holes without i_mutex held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 42596f5254ca -r 1ab8a2112a7d fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -469,7 +469,8 @@ static int reiserfs_get_blocks_direct_io
 	bh_result->b_size = (1 << inode->i_blkbits);
 
 	ret = reiserfs_get_block(inode, iblock, bh_result,
-				 create | GET_BLOCK_NO_DANGLE);
+				 create | GET_BLOCK_NO_DANGLE |
+				 GET_BLOCK_NO_IMUX);
 	if (ret)
 		goto out;
 




* [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
                   ` (5 preceding siblings ...)
  2007-02-07  0:32 ` [PATCH 6 of 8] Make reiserfs safe for " Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  2007-02-07  0:32 ` [PATCH 8 of 8] Avoid too many boundary buffers in DIO Chris Mason
  7 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

XFS is changed to use blockdev_direct_IO_flags instead of DIO_OWN_LOCKING.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 1ab8a2112a7d -r f53fd3802dc9 fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1392,19 +1392,16 @@ xfs_vm_direct_IO(
 
 	iocb->private = xfs_alloc_ioend(inode, IOMAP_UNWRITTEN);
 
-	if (rw == WRITE) {
-		ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	} else {
-		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	}
+	/*
+	 * ask DIO not to do any special locking for us, and to always
+	 * pass create=1 to get_block on writes
+	 */
+	ret = blockdev_direct_IO_flags(rw, iocb, inode,
+				       iomap.iomap_target->bt_bdev,
+				       iov, offset, nr_segs,
+				       xfs_get_blocks_direct,
+				       xfs_end_io_direct,
+				       DIO_CREATE);
 
 	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
 		xfs_destroy_ioend(iocb->private);




* [PATCH 8 of 8] Avoid too many boundary buffers in DIO
  2007-02-07  0:32 [RFC PATCH 0 of 8] O_DIRECT locking rework Chris Mason
                   ` (6 preceding siblings ...)
  2007-02-07  0:32 ` [PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls Chris Mason
@ 2007-02-07  0:32 ` Chris Mason
  7 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-07  0:32 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: akpm

Dave Chinner found a 10% performance regression with ext3 when using DIO
to fill holes instead of buffered IO.  On large IOs, the ext3 get_block
routine will send more than a page worth of blocks back to DIO via a
single buffer_head with a large b_size value.

The DIO code iterates through this massive block and tests for a
boundary buffer over and over again.  For every block size unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.

There are two potential fixes.  One is to ignore the boundary bit on
large regions returned by the FS: DIO can't tell which part of the big
region was a boundary, so it may not be a good idea to trust the hint.

This patch just clears the boundary bit after using it once.  It is 10%
faster for a streaming DIO write w/blocksize of 512k on my sata drive.
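The decision the patch makes can be sketched in isolation (illustrative
userspace C, not the kernel code itself; the function names here are
invented for the sketch): only the tail chunk carved out of a large
get_block() mapping keeps the boundary hint, so one boundary bio is
submitted per mapping instead of one per block-size unit.

```c
#include <assert.h>

/*
 * Illustrative sketch: decide whether a chunk carved out of one large
 * get_block() mapping should carry the boundary hint.  Only the last
 * chunk of the mapping should, mirroring the do_holes change.
 */
static int chunk_boundary(unsigned long blocks_available,
			  unsigned long this_chunk_blocks,
			  int map_bh_boundary)
{
	/* boundary only when this chunk is the tail of the mapping */
	if (blocks_available == this_chunk_blocks)
		return map_bh_boundary;
	return 0;
}

/* walk a mapping of 'total' blocks in chunks, counting forced submits */
static int count_submits(unsigned long total, unsigned long chunk,
			 int map_bh_boundary)
{
	int submits = 0;
	unsigned long avail = total;

	while (avail) {
		unsigned long this_chunk = avail < chunk ? avail : chunk;

		if (chunk_boundary(avail, this_chunk, map_bh_boundary))
			submits++;
		avail -= this_chunk;
	}
	return submits;
}
```

With the old behavior every chunk of a boundary mapping forced a
submit; with this rule a 1024-block mapping walked in 8-block chunks
forces exactly one.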

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r f53fd3802dc9 -r d068ea378c04 fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -625,7 +625,6 @@ static int dio_new_bio(struct dio *dio, 
 	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
 	BUG_ON(nr_pages <= 0);
 	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
-	dio->boundary = 0;
 out:
 	return ret;
 }
@@ -679,12 +678,6 @@ static int dio_send_cur_page(struct dio 
 		 */
 		if (dio->final_block_in_bio != dio->cur_page_block)
 			dio_bio_submit(dio);
-		/*
-		 * Submit now if the underlying fs is about to perform a
-		 * metadata read
-		 */
-		if (dio->boundary)
-			dio_bio_submit(dio);
 	}
 
 	if (dio->bio == NULL) {
@@ -701,6 +694,12 @@ static int dio_send_cur_page(struct dio 
 			BUG_ON(ret != 0);
 		}
 	}
+	/*
+	 * Submit now if the underlying fs is about to perform a
+	 * metadata read
+	 */
+	if (dio->boundary)
+		dio_bio_submit(dio);
 out:
 	return ret;
 }
@@ -727,6 +726,10 @@ submit_page_section(struct dio *dio, str
 		unsigned offset, unsigned len, sector_t blocknr)
 {
 	int ret = 0;
+	int boundary = dio->boundary;
+
+	/* don't let dio_send_cur_page do the boundary too soon */
+	dio->boundary = 0;
 
 	if (dio->rw & WRITE) {
 		/*
@@ -743,17 +746,7 @@ submit_page_section(struct dio *dio, str
 		(dio->cur_page_block +
 			(dio->cur_page_len >> dio->blkbits) == blocknr)) {
 		dio->cur_page_len += len;
-
-		/*
-		 * If dio->boundary then we want to schedule the IO now to
-		 * avoid metadata seeks.
-		 */
-		if (dio->boundary) {
-			ret = dio_send_cur_page(dio);
-			page_cache_release(dio->cur_page);
-			dio->cur_page = NULL;
-		}
-		goto out;
+		goto out_send;
 	}
 
 	/*
@@ -772,6 +765,18 @@ submit_page_section(struct dio *dio, str
 	dio->cur_page_offset = offset;
 	dio->cur_page_len = len;
 	dio->cur_page_block = blocknr;
+
+out_send:
+	/*
+	 * If dio->boundary then we want to schedule the IO now to
+	 * avoid metadata seeks.
+	 */
+	if (boundary) {
+		dio->boundary = 1;
+		ret = dio_send_cur_page(dio);
+		page_cache_release(dio->cur_page);
+		dio->cur_page = NULL;
+	}
 out:
 	return ret;
 }
@@ -977,7 +982,16 @@ do_holes:
 			this_chunk_bytes = this_chunk_blocks << blkbits;
 			BUG_ON(this_chunk_bytes == 0);
 
-			dio->boundary = buffer_boundary(map_bh);
+			/*
+			 * get_block may return more than one page worth
+			 * of blocks.  Make sure only the last io we
+			 * send down for this region is a boundary
+			 */
+			if (dio->blocks_available == this_chunk_blocks)
+				dio->boundary = buffer_boundary(map_bh);
+			else
+				dio->boundary = 0;
+
 			ret = submit_page_section(dio, page, offset_in_page,
 				this_chunk_bytes, dio->next_block_for_io);
 			if (ret) {



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4 of 8] Add flags to control direct IO helpers
  2007-02-07  0:32 ` [PATCH 4 of 8] Add flags to control direct IO helpers Chris Mason
@ 2007-02-07 17:08   ` Suparna Bhattacharya
  2007-02-07 18:05     ` Chris Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Suparna Bhattacharya @ 2007-02-07 17:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm

On Tue, Feb 06, 2007 at 08:32:49PM -0400, Chris Mason wrote:
> This creates a number of flags so that filesystems can control
> blockdev_direct_IO.  It is based on code from Russell Cattelan.
> 
> The new flags are:
> DIO_CREATE -- always pass create=1 to get_block on writes.  This allows
> 	      DIO to fill holes in the file.
> DIO_PLACEHOLDERS -- use placeholder pages to provide locking against buffered
> 	            io and truncates.
> DIO_DROP_I_MUTEX -- drop i_mutex before starting the mapping, io submission,
> 		    or io waiting.  The mutex is dropped for AIO as well.
> 
> Some API changes are made so that filesystems can have more control
> over the DIO features.
> 
> __blockdev_direct_IO is more or less renamed to blockdev_direct_IO_flags.
> All waiting and invalidating of page cache data is pushed down into
> blockdev_direct_IO_flags (and removed from mm/filemap.c)
> 
> direct_io_worker is exported into the wild.  Filesystems that want to be
> special can pull out the bits of blockdev_direct_IO_flags they care about
> and then call direct_io_worker directly.
> 
> Signed-off-by: Chris Mason <chris.mason@oracle.com>
> 
> diff -r 1a7105ab9c19 -r 04dd7ddd593e fs/direct-io.c
> --- a/fs/direct-io.c	Tue Feb 06 20:02:55 2007 -0500
> +++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
> @@ -1,4 +1,3 @@
> -					  GFP_KERNEL, 1);
>  /*
>   * fs/direct-io.c
>   *
> @@ -55,13 +54,6 @@
>   *
>   * If blkfactor is zero then the user's request was aligned to the filesystem's
>   * blocksize.
> - *
> - * lock_type is DIO_LOCKING for regular files on direct-IO-naive filesystems.
> - * This determines whether we need to do the fancy locking which prevents
> - * direct-IO from being able to read uninitialised disk blocks.  If its zero
> - * (blockdev) this locking is not done, and if it is DIO_OWN_LOCKING i_mutex is
> - * not held for the entire direct write (taken briefly, initially, during a
> - * direct read though, but its never held for the duration of a direct-IO).
>   */
> 
>  struct dio {
> @@ -70,8 +62,7 @@ struct dio {
>  	struct inode *inode;
>  	int rw;
>  	loff_t i_size;			/* i_size when submitted */
> -	int lock_type;			/* doesn't change */
> -	int reacquire_i_mutex;		/* should we get i_mutex when done? */
> +	unsigned flags;			/* locking and get_block flags */
>  	unsigned blkbits;		/* doesn't change */
>  	unsigned blkfactor;		/* When we're using an alignment which
>  					   is finer than the filesystem's soft
> @@ -211,7 +202,7 @@ out:
> 
>  static void dio_unlock_page_range(struct dio *dio)
>  {
> -	if (dio->lock_type != DIO_NO_LOCKING) {
> +	if (dio->flags & DIO_PLACEHOLDERS) {
>  		remove_placeholder_pages(dio->inode->i_mapping,
>  					 dio->fspages_start_off,
>  					 dio->fspages_end_off);
> @@ -226,7 +217,7 @@ static int dio_lock_page_range(struct di
>  	unsigned long max_size;
>  	int ret = 0;
> 
> -	if (dio->lock_type == DIO_NO_LOCKING)
> +	if (!(dio->flags & DIO_PLACEHOLDERS))
>  		return 0;
> 
>  	while (index >= dio->fspages_end_off) {
> @@ -310,9 +301,6 @@ static int dio_complete(struct dio *dio,
>  			    dio->map_bh.b_private);
>  	dio_unlock_page_range(dio);
> 
> -	if (dio->reacquire_i_mutex)
> -		mutex_lock(&dio->inode->i_mutex);
> -
>  	if (ret == 0)
>  		ret = dio->page_errors;
>  	if (ret == 0)
> @@ -597,8 +585,9 @@ static int get_more_blocks(struct dio *d
>  		map_bh->b_state = 0;
>  		map_bh->b_size = fs_count << dio->inode->i_blkbits;
> 
> -		create = dio->rw & WRITE;
> -		if (dio->lock_type == DIO_NO_LOCKING)
> +		if (dio->flags & DIO_CREATE)
> +			create = dio->rw & WRITE;
> +		else
>  			create = 0;
>  	        index = fs_startblk >> (PAGE_CACHE_SHIFT -
>  		                        dio->inode->i_blkbits);
> @@ -1014,19 +1003,41 @@ out:
>  	return ret;
>  }
> 
> -static ssize_t
> -direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
> -	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
> +/*
> + * This does all the real work of the direct io.  Most filesystems want to
> + * call blockdev_direct_IO_flags instead, but if you have exotic locking
> + * routines you can call this directly.
> + *
> + * The flags parameter is a bitmask of:
> + *
> + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)

A little more explanation about why these options are needed, and examples
of when one would specify each of these options would be good.

Regards
Suparna

> + * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
> + */
> +ssize_t
> +direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
> +	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
>  	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
> -	struct dio *dio)
> -{
> -	unsigned long user_addr; 
> +	int is_async, unsigned dioflags)
> +{
> +	unsigned long user_addr;
>  	unsigned long flags;
>  	int seg;
>  	ssize_t ret = 0;
>  	ssize_t ret2;
>  	size_t bytes;
> -
> +	struct dio *dio;
> +
> +	if (rw & WRITE)
> +		rw = WRITE_SYNC;
> +
> +	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
> +	ret = -ENOMEM;
> +	if (!dio)
> +		goto out;
> +
> +	dio->flags = dioflags;
> +	dio->is_async = is_async;
>  	dio->bio = NULL;
>  	dio->inode = inode;
>  	dio->rw = rw;
> @@ -1057,7 +1068,7 @@ direct_io_worker(int rw, struct kiocb *i
>  	dio->bio_list = NULL;
>  	dio->waiter = NULL;
> 
> -	if (dio->lock_type != DIO_NO_LOCKING) {
> +	if (dio->flags & DIO_PLACEHOLDERS) {
>  		dio->fspages_start_off = offset >> PAGE_CACHE_SHIFT;
>  		dio->fspages_end_off = dio->fspages_start_off;
> 
> @@ -1192,33 +1203,24 @@ direct_io_worker(int rw, struct kiocb *i
>  	} else
>  		BUG_ON(ret != -EIOCBQUEUED);
> 
> +out:
>  	return ret;
>  }
> -
> -/*
> - * This is a library function for use by filesystem drivers.
> - * The locking rules are governed by the dio_lock_type parameter.
> - *
> - * DIO_NO_LOCKING (no locking, for raw block device access)
> - * For writes, i_mutex is not held on entry; it is never taken.
> - *
> - * DIO_LOCKING (simple locking for regular files)
> - * For writes we are called under i_mutex and return with i_mutex held, even
> - * though it is internally dropped.
> - *
> - * DIO_OWN_LOCKING (filesystem provides synchronisation and handling of
> - *	uninitialised data, allowing parallel direct readers and writers)
> - * For writes we are called without i_mutex, return without it, never touch it.
> - * For reads we are called under i_mutex and return with i_mutex held, even
> - * though it may be internally dropped.
> - *
> - * Additional i_alloc_sem locking requirements described inline below.
> - */
> -ssize_t
> -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> -	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
> -	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> -	int dio_lock_type)
> +EXPORT_SYMBOL(direct_io_worker);
> +
> +/*
> + * A utility function for blockdev_direct_IO_flags; this checks
> + * alignment of an O_DIRECT iovec against filesystem and blockdevice
> + * requirements.
> + *
> + * It returns a blkbits value that will work for the io, and returns the
> + * end offset of the io (via blkbits_ret and end_ret).
> + *
> + * The function returns 0 if everything will work or -EINVAL on error
> + */
> +int check_dio_alignment(struct inode *inode, struct block_device *bdev,
> +			const struct iovec *iov, loff_t offset, unsigned long nr_segs,
> +			unsigned *blkbits_ret, loff_t *end_ret)
>  {
>  	int seg;
>  	size_t size;
> @@ -1226,13 +1228,7 @@ __blockdev_direct_IO(int rw, struct kioc
>  	unsigned blkbits = inode->i_blkbits;
>  	unsigned bdev_blkbits = 0;
>  	unsigned blocksize_mask = (1 << blkbits) - 1;
> -	ssize_t retval = -EINVAL;
>  	loff_t end = offset;
> -	struct dio *dio;
> -	struct address_space *mapping = iocb->ki_filp->f_mapping;
> -
> -	if (rw & WRITE)
> -		rw = WRITE_SYNC;
> 
>  	if (bdev)
>  		bdev_blkbits = blksize_bits(bdev_hardsect_size(bdev));
> @@ -1242,7 +1238,7 @@ __blockdev_direct_IO(int rw, struct kioc
>  			 blkbits = bdev_blkbits;
>  		blocksize_mask = (1 << blkbits) - 1;
>  		if (offset & blocksize_mask)
> -			goto out;
> +			return -EINVAL;
>  	}
> 
>  	/* Check the memory alignment.  Blocks cannot straddle pages */
> @@ -1254,27 +1250,60 @@ __blockdev_direct_IO(int rw, struct kioc
>  			if (bdev)
>  				 blkbits = bdev_blkbits;
>  			blocksize_mask = (1 << blkbits) - 1;
> -			if ((addr & blocksize_mask) || (size & blocksize_mask))  
> -				goto out;
> +			if ((addr & blocksize_mask) || (size & blocksize_mask))
> +				return -EINVAL;
>  		}
>  	}
> -	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
> -	retval = -ENOMEM;
> -	if (!dio)
> +	*end_ret = end;
> +	*blkbits_ret = blkbits;
> +	return 0;
> +}
> +EXPORT_SYMBOL(check_dio_alignment);
> +
> +/*
> + * This is a library function for use by filesystem drivers.
> + * The flags parameter is a bitmask of:
> + *
> + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> + * DIO_CREATE (pass create=1 to get_block for filling holes)
> + * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
> + */
> +ssize_t
> +blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
> +	struct block_device *bdev, const struct iovec *iov, loff_t offset,
> +	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> +	unsigned flags)
> +{
> +	struct address_space *mapping = iocb->ki_filp->f_mapping;
> +	unsigned blkbits = 0;
> +	ssize_t retval = -EINVAL;
> +	loff_t end = 0;
> +	int is_async;
> +	int grab_i_mutex = 0;
> +
> +
> +	if (check_dio_alignment(inode, bdev, iov, offset, nr_segs,
> +				&blkbits, &end))
>  		goto out;
> 
> -
> -	/*
> -	 * For block device access DIO_NO_LOCKING is used,
> -	 *	neither readers nor writers do any locking at all
> -	 * For regular files using DIO_LOCKING,
> -	 *	No locks are taken
> -	 * For regular files using DIO_OWN_LOCKING,
> -	 *	neither readers nor writers take any locks here
> -	 */
> -	dio->lock_type = dio_lock_type;
> -
> -	if (dio->lock_type == DIO_NO_LOCKING && end > offset) {
> +	if (rw & WRITE) {
> +		/*
> +		 * If it's a write, unmap all mmappings of the file up-front.
> +		 * This will cause any pte dirty bits to be propagated into
> +		 * the pageframes for the subsequent filemap_write_and_wait().
> +		 */
> +		if (mapping_mapped(mapping))
> +			unmap_mapping_range(mapping, offset, end - offset, 0);
> +		if (end <= i_size_read(inode) && (flags & DIO_DROP_I_MUTEX)) {
> +			mutex_unlock(&inode->i_mutex);
> +			grab_i_mutex = 1;
> +		}
> +	}
> +	/*
> +	 * the placeholder code does filemap_write_and_wait, so if we
> +	 * aren't using placeholders we have to do it here
> +	 */
> +	if (!(flags & DIO_PLACEHOLDERS) && end > offset) {
>  		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
>  		if (retval)
>  			goto out;
> @@ -1286,19 +1315,30 @@ __blockdev_direct_IO(int rw, struct kioc
>  	 * even for AIO, we need to wait for i/o to complete before
>  	 * returning in this case.
>  	 */
> -	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
> +	is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
>  		(end > i_size_read(inode)));
> 
> -	/* if our write is inside i_size, we can drop i_mutex */
> -	dio->reacquire_i_mutex = 0;
> -	if ((rw & WRITE) && dio_lock_type == DIO_LOCKING &&
> -	   end <= i_size_read(inode) && is_sync_kiocb(iocb)) {
> -		dio->reacquire_i_mutex = 1;
> -		mutex_unlock(&inode->i_mutex);
> -	}
>  	retval = direct_io_worker(rw, iocb, inode, iov, offset,
> -				nr_segs, blkbits, get_block, end_io, dio);
> +				nr_segs, blkbits, get_block, end_io, is_async,
> +				flags);
>  out:
> +	if (grab_i_mutex)
> +		mutex_lock(&inode->i_mutex);
> +
> +	if ((rw & WRITE) && mapping->nrpages) {
> +		int err;
> +		/* O_DIRECT is allowed to drop i_mutex, so more data
> +		 * could have been dirtied by others.  Start io one more
> +		 * time
> +		 */
> +		err = filemap_write_and_wait_range(mapping, offset, end - 1);
> +		if (!err)
> +			err = invalidate_inode_pages2_range(mapping,
> +					offset >> PAGE_CACHE_SHIFT,
> +					(end - 1) >> PAGE_CACHE_SHIFT);
> +		if (!retval && err)
> +			retval = err;
> +	}
>  	return retval;
>  }
> -EXPORT_SYMBOL(__blockdev_direct_IO);
> +EXPORT_SYMBOL(blockdev_direct_IO_flags);
> diff -r 1a7105ab9c19 -r 04dd7ddd593e include/linux/fs.h
> --- a/include/linux/fs.h	Tue Feb 06 20:02:55 2007 -0500
> +++ b/include/linux/fs.h	Tue Feb 06 20:02:56 2007 -0500
> @@ -1776,24 +1776,28 @@ static inline void do_generic_file_read(
>  }
> 
>  #ifdef CONFIG_BLOCK
> -ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +int check_dio_alignment(struct inode *inode, struct block_device *bdev,
> +                        const struct iovec *iov, loff_t offset, unsigned long nr_segs,
> +			                        unsigned *blkbits_ret, loff_t *end_ret);
> +
> +ssize_t blockdev_direct_IO_flags(int rw, struct kiocb *iocb, struct inode *inode,
>  	struct block_device *bdev, const struct iovec *iov, loff_t offset,
>  	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> -	int lock_type);
> -
> -enum {
> -	DIO_LOCKING = 1, /* need locking between buffered and direct access */
> -	DIO_NO_LOCKING,  /* bdev; no locking at all between buffered/direct */
> -	DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */
> -};
> +	unsigned int dio_flags);
> +
> +#define DIO_PLACEHOLDERS (1 << 0)  /* insert placeholder pages */
> +#define DIO_CREATE	(1 << 1)  /* pass create=1 to get_block when writing */
> +#define DIO_DROP_I_MUTEX (1 << 2) /* drop i_mutex during writes */
> 
>  static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
>  	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
>  	loff_t offset, unsigned long nr_segs, get_block_t get_block,
>  	dio_iodone_t end_io)
>  {
> -	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> -				nr_segs, get_block, end_io, DIO_LOCKING);
> +	/* locking is on, FS wants to fill holes w/get_block */
> +	return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
> +				nr_segs, get_block, end_io, DIO_PLACEHOLDERS |
> +				DIO_CREATE | DIO_DROP_I_MUTEX);
>  }
> 
>  static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
> @@ -1801,17 +1805,9 @@ static inline ssize_t blockdev_direct_IO
>  	loff_t offset, unsigned long nr_segs, get_block_t get_block,
>  	dio_iodone_t end_io)
>  {
> -	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> -				nr_segs, get_block, end_io, DIO_NO_LOCKING);
> -}
> -
> -static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
> -	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
> -	loff_t offset, unsigned long nr_segs, get_block_t get_block,
> -	dio_iodone_t end_io)
> -{
> -	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> -				nr_segs, get_block, end_io, DIO_OWN_LOCKING);
> +	/* locking is off, create is off */
> +	return blockdev_direct_IO_flags(rw, iocb, inode, bdev, iov, offset,
> +				nr_segs, get_block, end_io, 0);
>  }
>  #endif
> 
> diff -r 1a7105ab9c19 -r 04dd7ddd593e mm/filemap.c
> --- a/mm/filemap.c	Tue Feb 06 20:02:55 2007 -0500
> +++ b/mm/filemap.c	Tue Feb 06 20:02:56 2007 -0500
> @@ -40,7 +40,7 @@
> 
>  #include <asm/mman.h>
> 
> -static ssize_t
> +static inline ssize_t
>  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>  	loff_t offset, unsigned long nr_segs);
> 
> @@ -2817,46 +2817,12 @@ EXPORT_SYMBOL(generic_file_aio_write);
>   * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if something
>   * went wrong during pagecache shootdown.
>   */
> -static ssize_t
> +static inline ssize_t
>  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>  	loff_t offset, unsigned long nr_segs)
>  {
> -	struct file *file = iocb->ki_filp;
> -	struct address_space *mapping = file->f_mapping;
> -	ssize_t retval;
> -	size_t write_len = 0;
> -
> -	/*
> -	 * If it's a write, unmap all mmappings of the file up-front.  This
> -	 * will cause any pte dirty bits to be propagated into the pageframes
> -	 * for the subsequent filemap_write_and_wait().
> -	 */
> -	if (rw == WRITE) {
> -		write_len = iov_length(iov, nr_segs);
> -	       	if (mapping_mapped(mapping))
> -			unmap_mapping_range(mapping, offset, write_len, 0);
> -	}
> -
> -	retval = mapping->a_ops->direct_IO(rw, iocb, iov,
> -					offset, nr_segs);
> -	if (rw == WRITE && mapping->nrpages) {
> -		int err;
> -		pgoff_t end = (offset + write_len - 1)
> -					>> PAGE_CACHE_SHIFT;
> -
> -		/* O_DIRECT is allowed to drop i_mutex, so more data
> -		 * could have been dirtied by others.  Start io one more
> -		 * time
> -		 */
> -		err = filemap_fdatawrite_range(mapping, offset,
> -		                               offset + write_len - 1);
> -		if (!err)
> -			err = invalidate_inode_pages2_range(mapping,
> -					offset >> PAGE_CACHE_SHIFT, end);
> -		if (err)
> -			retval = err;
> -	}
> -	return retval;
> +	return iocb->ki_filp->f_mapping->a_ops->direct_IO(rw, iocb, iov,
> +							  offset, nr_segs);
>  }
> 
>  /**
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1 of 8] Introduce a place holder page for the pagecache
  2007-02-07  0:32 ` [PATCH 1 of 8] Introduce a place holder page for the pagecache Chris Mason
@ 2007-02-07 17:36   ` Zach Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-07 17:36 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm

> Placeholders can span a range bigger than one page.  The placeholder
> is inserted into the radix slot for the end of the range, and the
> flags field in the page struct is used to record the start of the
> range.

Not surprisingly, I like how this makes the interaction between  
buffered and O_DIRECT a lot less magical.  Nice work.

Some comments:

> +void wait_on_placeholder_pages_range(struct address_space *mapping,
> +			       pgoff_t start, pgoff_t end)

...

> +	while(start <= end) {

...

> +			start = pages[i]->index + 1;
> +			if (pages[i]->index > end)
> +				goto done;

It looks like this might be confused by an end and final page->index
of ~0?  Maybe it'd be safer to explicitly deal with this here instead
of relying on safe arguments.

> +		}
> +		if (need_resched()) {
> +			read_unlock_irq(&mapping->tree_lock);
> +			cond_resched();
> +			read_lock_irq(&mapping->tree_lock);
> +		}

We should introduce a cond_resched_lock_irq() to match
cond_resched_lock().
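Such a helper might look like the following sketch (hypothetical: the
name, the read-lock flavor, and the return convention are assumptions
modeled on cond_resched_lock(), not code from this series):

```c
/*
 * Hypothetical helper modeled on cond_resched_lock(): drop a
 * read-held, irq-disabling rwlock around the reschedule point.
 * Returns nonzero if we dropped the lock, so callers know to
 * revalidate any state they derived under it.
 */
static inline int cond_resched_read_lock_irq(rwlock_t *lock)
{
	if (need_resched()) {
		read_unlock_irq(lock);
		cond_resched();
		read_lock_irq(lock);
		return 1;
	}
	return 0;
}
```

Callers like wait_on_placeholder_pages_range() could then replace the
open-coded unlock/resched/relock sequence with a single call.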

> +void remove_placeholder_pages(struct address_space *mapping,
> +			     unsigned long start,
> +			     unsigned long end)
> +{
> +	struct page *page;
> +	int ret;
> +	int i;
> +	unsigned long num;
> +	struct page *pages[8];
> +
> +	write_lock_irq(&mapping->tree_lock);
> +	while (start < end) {

...

> +	write_unlock_irq(&mapping->tree_lock);

And call that cond_resched_lock_irq() in here too?  Since we're
grabbing the lock in here, presumably the caller won't mind if we grab
and release it a few times?

> +static int insert_placeholder(struct address_space *mapping,
> +					 struct page *insert)
> +{

spin_lock_assert(&mapping->tree_lock)?

> +int find_or_insert_placeholders(struct address_space *mapping,
> +				  unsigned long start, unsigned long end,
> +				  gfp_t gfp_mask, int iowait)
> +{

...

> +	/*
> +	 * this gets complicated.

It sure does!  Sprinkling a few comments above the main blocks in the  
loop might be nice.

> +			err = filemap_write_and_wait_range(mapping,
> +						 cur << PAGE_CACHE_SHIFT,
> +						 end << PAGE_CACHE_SHIFT);

This end is exclusive (>= end breaks) but filemap_write_and_wait_range()
-> wait_on_page_writeback_range() wants an argument that is
inclusive (> end breaks)?

> +		if (PagePlaceHolder(page)) {
> +			DEFINE_WAIT(wait);
> +			wait_queue_head_t *wqh = page_waitqueue(page);
> +			prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +			read_unlock_irq(&mapping->tree_lock);
> +			io_schedule();
> +			finish_wait(wqh, &wait);
> +			read_lock_irq(&mapping->tree_lock);

This pattern seems to be repeated quite a few times.  Maybe a helper  
would.. help?  Maybe it's not worth it if it has to do silly little  
pre-schedule tasks differently at each site.

> -	for (i = 0; i < ret; i++)
> -		page_cache_get(pages[i]);
> +	while(i < ret) {
> +		if (PagePlaceHolder(pages[i])) {
> +			/* we can't return a place holder, shift it away */
> +			if (i + 1 < ret) {
> +				memcpy(&pages[i], &pages[i+1],
> +		                       (ret - i - 1) * sizeof(struct page *));
> +			}
> +			ret--;
> +			continue;
> +		} else
> +			page_cache_get(pages[i]);
> +		i++;

It looks like this loop is implemented twice?  How about factoring
it out?  Especially with the risk of off-by-ones in there.  (Though I
don't see problems, it looks correct.)

More on the rest of the series after lunch :).

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4 of 8] Add flags to control direct IO helpers
  2007-02-07 17:08   ` Suparna Bhattacharya
@ 2007-02-07 18:05     ` Chris Mason
  2007-02-08  4:03       ` Suparna Bhattacharya
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-07 18:05 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm

On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > + * The flags parameter is a bitmask of:
> > + *
> > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
> 
> A little more explanation about why these options are needed, and examples
> of when one would specify each of these options would be good.

I'll extend the comments in the patch, but for discussion here:

DIO_PLACEHOLDERS:  placeholders are inserted into the page cache to
synchronize the DIO with buffered writes.  From a locking point of view,
this is similar to inserting and locking pages in the address space
corresponding to the DIO.

Placeholders guard against concurrent allocations and truncates during
the DIO.  You don't need placeholders if truncates and allocations are
impossible (for example, on a block device).

DIO_CREATE: placeholders make it possible for filesystems to safely fill
holes and extend the file via get_block during the DIO.  If DIO_CREATE
is turned on, get_block will be called with create=1, allowing the FS to
allocate blocks during the DIO.

DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
during the DIO and taken again before returning.
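For discussion's sake, the flag semantics condense into a small
compilable sketch (illustrative userspace C; the #define values are
copied from the patch's include/linux/fs.h hunk, but the helper names
are invented here):

```c
#include <assert.h>

/* flag values as defined in the patch */
#define DIO_PLACEHOLDERS (1 << 0)  /* insert placeholder pages */
#define DIO_CREATE       (1 << 1)  /* pass create=1 to get_block on writes */
#define DIO_DROP_I_MUTEX (1 << 2)  /* drop i_mutex during writes */

#define READ  0
#define WRITE 1

/*
 * Mirror of the get_more_blocks() change: get_block sees create=1
 * only for writes, and only when the caller opted into DIO_CREATE.
 */
static int dio_create_arg(int rw, unsigned flags)
{
	if (flags & DIO_CREATE)
		return rw & WRITE;
	return 0;
}

/* the combination the generic blockdev_direct_IO() wrapper passes */
static unsigned generic_file_dio_flags(void)
{
	return DIO_PLACEHOLDERS | DIO_CREATE | DIO_DROP_I_MUTEX;
}
```

A block device caller would pass 0 (no placeholders, create=0, i_mutex
untouched), matching blockdev_direct_IO_no_locking() in the patch.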

-chris


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
  2007-02-07  0:32 ` [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking Chris Mason
@ 2007-02-07 20:11   ` Zach Brown
  2007-02-07 20:22     ` Chris Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Zach Brown @ 2007-02-07 20:11 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm


> +static void dio_unlock_page_range(struct dio *dio)
> +{
> +	if (dio->lock_type != DIO_NO_LOCKING) {
> +		remove_placeholder_pages(dio->inode->i_mapping,
> +					 dio->fspages_start_off,
> +					 dio->fspages_end_off);
> +		dio->fspages_end_off = dio->fspages_start_off;

I wonder if it should call remove_placeholder_pages when _end_off !=  
_start_off.  That feels like a more direct test.

> +	max_size = (dio->fspages_end_off - index) << PAGE_CACHE_SHIFT;
> +	if (map_bh->b_size > max_size)
> +		map_bh->b_size = max_size;

	b_size = min(b_size, math) ?

> +		/* if the mapping is mapped, they may be using a mmap'd portion
> +		 * of the file as the buffer for this io.  That will deadlock
> +		 * with placeholders because the placeholder code forces the
> +		 * page fault handler to block.  The (ugly) solution is to
> +		 * limit the span of inserted placeholders to the same
> +		 * increment we use for get_user_pages.
> +		 */
> +		if (inode->i_mapping->nrpages ||
> +		    mapping_mapped(inode->i_mapping))
> +			dio->fspages_span = DIO_PAGES;
> +		else
> +			dio->fspages_span = ULONG_MAX;

Couldn't the decision to use DIO_PAGES in the presence of a mapping  
race with establishing a mapping?  I fear we have to use DIO_PAGES  
all the time, which at least has the benefit of getting rid of the  
_span member of dio :).

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
  2007-02-07 20:11   ` Zach Brown
@ 2007-02-07 20:22     ` Chris Mason
  2007-02-07 20:34       ` Zach Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2007-02-07 20:22 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-fsdevel, akpm

On Wed, Feb 07, 2007 at 03:11:02PM -0500, Zach Brown wrote:

[ cleanups, sure ]

> >+		if (inode->i_mapping->nrpages ||
> >+		    mapping_mapped(inode->i_mapping))
> >+			dio->fspages_span = DIO_PAGES;
> >+		else
> >+			dio->fspages_span = ULONG_MAX;
> 
> Couldn't the decision to use DIO_PAGES in the presence of a mapping  
> race with establishing a mapping?  I fear we have to use DIO_PAGES  
> all the time, which at least has the benefit of getting rid of the  
> _span member of dio :).

The test case Linus sent me boils down to this:

fd = open(file)
buffer = mmap(fd, 128 pages);
close(fd);
fd = open(file, O_DIRECT);
write(fd, buffer, 66 pages);

I think the deadlock is limited to cases where get_user_pages will get
stuck in filemap_nopage waiting for placeholders inserted by this DIO.
It looks like that can only happen when buffer is mapped at the start of
the dio.  Outside of some futuristic async syscall system, are
there ways for the buffer to get mmaped during the DIO?
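The hazard reduces to a range intersection (illustrative userspace C;
all names invented, page-index arithmetic with exclusive ends): the
fault taken by get_user_pages() can block on a placeholder only when
the file pages backing the mmapped buffer intersect the placeholder
extent of this same DIO.

```c
#include <assert.h>

/*
 * Illustrative sketch: do two page-index ranges intersect?
 * 'end' arguments are exclusive.  If the range backing the user's
 * mmapped buffer intersects the placeholder extent, the page fault
 * from get_user_pages() would wait on a placeholder inserted by the
 * same DIO -- the deadlock under discussion.
 */
static int ranges_intersect(unsigned long a_start, unsigned long a_end,
			    unsigned long b_start, unsigned long b_end)
{
	return a_start < b_end && b_start < a_end;
}

/* the boiled-down test case: buffer maps pages 0..128, DIO covers 0..66 */
static int testcase_deadlocks(void)
{
	return ranges_intersect(0, 128, 0, 66);
}
```

Capping placeholder extents at the get_user_pages() interval keeps the
intersecting pages pinned before their placeholders go in.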

-chris


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2 of 8] Change O_DIRECT to use placeholders instead of i_mutex/i_alloc_sem locking
  2007-02-07 20:22     ` Chris Mason
@ 2007-02-07 20:34       ` Zach Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Zach Brown @ 2007-02-07 20:34 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm

>
> The test case Linus sent me boils down to this:
>
> fd = open(file)
> buffer = mmap(fd, 128 pages);
> close(fd);
> fd = open(file, O_DIRECT);
> write(fd, buffer, 66 pages);

Yeah, though I bet the inner close/open isn't needed.

> I think the deadlock is limited to cases where get_user_pages will get
> stuck in filemap_nopage waiting for placeholders inserted by this DIO.

Yeah.

> It looks like that can only happen when buffer is mapped at the
> start of the dio.

At the *start* of the dio or by the time get_user_pages() is called?

The dio and mmap() aren't serialized, are they?  mmap() just sets up  
the vma, I thought, and will only touch the mmap_sem.

I'm fearing threads racing write(fd, buffer, ) and mmap(buffer,  
MAP_FIXED...).

I might just be missing the locking that serializes them.  If nothing  
else, this should be mentioned in the comment above the code that  
looks like a racy test against the presence of a mapping.

- z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4 of 8] Add flags to control direct IO helpers
  2007-02-07 18:05     ` Chris Mason
@ 2007-02-08  4:03       ` Suparna Bhattacharya
  2007-02-08 12:58         ` Chris Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Suparna Bhattacharya @ 2007-02-08  4:03 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel, akpm

On Wed, Feb 07, 2007 at 01:05:44PM -0500, Chris Mason wrote:
> On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > > + * The flags parameter is a bitmask of:
> > > + *
> > > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
> > 
> > A little more explanation about why these options are needed, and examples
> > of when one would specify each of these options would be good.
> 
> I'll extend the comments in the patch, but for discussion here:
> 
> DIO_PLACEHOLDERS:  placeholders are inserted into the page cache to
> synchronize the DIO with buffered writes.  From a locking point of view,
> this is similar to inserting and locking pages in the address space
> corresponding to the DIO.
> 
> placeholders guard against concurrent allocations and truncates during the DIO.
You don't need placeholders if truncates and allocations are
impossible (for example, on a block device).

Likewise, placeholders may not be needed if the underlying filesystem
already takes care of the locking to synchronize DIO vs buffered.

> 
> DIO_CREATE: placeholders make it possible for filesystems to safely fill
> holes and extend the file via get_block during the DIO.  If DIO_CREATE
> is turned on, get_block will be called with create=1, allowing the FS to
> allocate blocks during the DIO.

When would one NOT specify DIO_CREATE, and what are the implications?
The purpose of an option that does NOT allow the FS to allocate blocks
during DIO is not very intuitive from the standpoint of the caller.
(The block device case could be an example, but there create=1 could not
do any harm or add extra overhead, so why bother?)

Is there still a valid case where we fall back to buffered IO to fill holes?
To me that seems to be the only situation where create=0 must be enforced.

> 
> DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
> during the DIO and taken again before returning.

Again, an example of when one would not specify this (block device and
XFS?) would be useful.

Regards
Suparna

> 
> -chris

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: [PATCH 4 of 8] Add flags to control direct IO helpers
  2007-02-08  4:03       ` Suparna Bhattacharya
@ 2007-02-08 12:58         ` Chris Mason
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2007-02-08 12:58 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: linux-fsdevel, akpm

On Thu, Feb 08, 2007 at 09:33:05AM +0530, Suparna Bhattacharya wrote:
> On Wed, Feb 07, 2007 at 01:05:44PM -0500, Chris Mason wrote:
> > On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > > > + * The flags parameter is a bitmask of:
> > > > + *
> > > > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > > > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
> > > 
> > > A little more explanation about why these options are needed, and examples
> > > of when one would specify each of these options would be good.
> > 
> > I'll extend the comments in the patch, but for discussion here:
> > 
> > DIO_PLACEHOLDERS:  placeholders are inserted into the page cache to
> > synchronize the DIO with buffered writes.  From a locking point of view,
> > this is similar to inserting and locking pages in the address space
> > corresponding to the DIO.
> > 
> > placeholders guard against concurrent allocations and truncates during the DIO.
> You don't need placeholders if truncates and allocations are
> impossible (for example, on a block device).
> 
> Likewise, placeholders may not be needed if the underlying filesystem
> already takes care of the locking to synchronize DIO vs buffered.

True, although I don't think any FS covers 100% of the cases right now.

> 
> > 
> > DIO_CREATE: placeholders make it possible for filesystems to safely fill
> > holes and extend the file via get_block during the DIO.  If DIO_CREATE
> > is turned on, get_block will be called with create=1, allowing the FS to
> > allocate blocks during the DIO.
> 
> When would one NOT specify DIO_CREATE, and what are the implications?
> The purpose of an option that does NOT allow the FS to allocate blocks
> during DIO is not very intuitive from the standpoint of the caller.
> (The block device case could be an example, but there create=1 could not
> do any harm or add extra overhead, so why bother?)

DIO has fallen back to buffered IO for so long that I wanted filesystems
to opt in to create=1 explicitly for now.  A good example is my patch
for ext3, where the ext3 get_block routine needed to be changed to start
a transaction instead of finding the current one in
current->journal_info.  The reiserfs DIO get_block needed to be told not
to expect i_mutex to be held, and so on.

> 
> Is there still a valid case where we fall back to buffered IO to fill holes?
> To me that seems to be the only situation where create=0 must be enforced.

Right, when create=0 we fall back, otherwise we don't.

> 
> > 
> > DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
> > during the DIO and taken again before returning.
> 
> Again, an example of when one would not specify this (block device and
> XFS?) would be useful.

If the FS can't fill a hole or extend the file without i_mutex, or if
the caller has already dropped i_mutex themselves.  I think this is
only XFS right now; the long-term goal is to make placeholders fast
enough for XFS to use.

-chris



* [PATCH 1 of 8] Introduce a place holder page for the pagecache
  2006-12-22  1:45 [PATCH 0 of 8] O_DIRECT locking rework v5 Chris Mason
@ 2006-12-22  1:52 ` Chris Mason
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2006-12-22  1:52 UTC (permalink / raw)
  To: linux-fsdevel, akpm, zach.brown

mm/filemap.c is changed to wait on these before adding a page into the page
cache, and truncates are changed to wait for all of the place holder pages to
disappear.

Place holder pages can only be examined with the mapping lock held.  They
cannot be locked, and cannot have references increased or decreased on them.

Placeholders can span a range bigger than one page.  The placeholder is
inserted into the radix slot for the end of the range, and the flags field in
the page struct is used to record the start of the range.

A tag bit is added to the radix root (PAGE_CACHE_TAG_EXTENTS); when
mm/filemap.c finds that bit set, lookups in the pagecache search forward
to find any placeholder the index may intersect.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

diff -r 511f067627ac -r 4cac7e560b53 drivers/mtd/devices/block2mtd.c
--- a/drivers/mtd/devices/block2mtd.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/drivers/mtd/devices/block2mtd.c	Thu Dec 21 15:31:30 2006 -0500
@@ -66,7 +66,7 @@ static void cache_readahead(struct addre
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
-		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		page = radix_tree_lookup_extent(&mapping->page_tree, pagei);
 		if (page && (!i))
 			break;
 		if (page)
diff -r 511f067627ac -r 4cac7e560b53 include/linux/fs.h
--- a/include/linux/fs.h	Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/fs.h	Thu Dec 21 15:31:30 2006 -0500
@@ -489,6 +489,11 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+
+/*
+ * This tag is only valid on the root of the radix tree
+ */
+#define PAGE_CACHE_TAG_EXTENTS 2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff -r 511f067627ac -r 4cac7e560b53 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/page-flags.h	Thu Dec 21 15:31:30 2006 -0500
@@ -267,4 +267,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page, pgoff_t start, pgoff_t end);
+
 #endif	/* PAGE_FLAGS_H */
diff -r 511f067627ac -r 4cac7e560b53 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/pagemap.h	Thu Dec 21 15:31:30 2006 -0500
@@ -76,6 +76,11 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+                                  struct page **pages, unsigned long start,
+                                  unsigned long end, unsigned long nr,
+                                  gfp_t gfp_mask,
+                                  int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -86,6 +91,15 @@ unsigned find_get_pages_contig(struct ad
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+                             struct page **pages,
+                             unsigned long offset,
+                             unsigned long end,
+                             unsigned long nr);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping, pgoff_t start,
+			       pgoff_t end);
+
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
@@ -116,6 +130,8 @@ int add_to_page_cache_lru(struct page *p
 				unsigned long index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+					     unsigned long index);
 
 /*
  * Return byte-offset into filesystem object for page.
diff -r 511f067627ac -r 4cac7e560b53 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Thu Dec 21 00:20:01 2006 -0800
+++ b/include/linux/radix-tree.h	Thu Dec 21 15:31:30 2006 -0500
@@ -53,6 +53,7 @@ static inline int radix_tree_is_direct_p
 /*** radix-tree API starts here ***/
 
 #define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_ROOT_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
@@ -168,6 +169,7 @@ radix_tree_gang_lookup_tag(struct radix_
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
 {
diff -r 511f067627ac -r 4cac7e560b53 lib/radix-tree.c
--- a/lib/radix-tree.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/lib/radix-tree.c	Thu Dec 21 15:31:30 2006 -0500
@@ -468,6 +468,12 @@ void *radix_tree_tag_set(struct radix_tr
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
+
+void radix_tree_root_tag_set(struct radix_tree_root *root, unsigned int tag)
+{
+	root_tag_set(root, tag);
+}
+EXPORT_SYMBOL(radix_tree_root_tag_set);
 
 /**
  *	radix_tree_tag_clear - clear a tag on a radix tree node
diff -r 511f067627ac -r 4cac7e560b53 mm/filemap.c
--- a/mm/filemap.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/filemap.c	Thu Dec 21 15:31:30 2006 -0500
@@ -44,6 +44,14 @@ generic_file_direct_IO(int rw, struct ki
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+static wait_queue_head_t *page_waitqueue(struct page *page);
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock);
+
+static struct address_space placeholder_address_space;
+#define PagePlaceHolder(page) ((page)->mapping == &placeholder_address_space)
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -421,6 +429,41 @@ int filemap_write_and_wait_range(struct 
 	return err;
 }
 
+/*
+ * When the radix tree has the extent bit set, a lookup needs to search
+ * forward in the tree to find any extent the index might intersect.
+ * When extents are off, a faster radix_tree_lookup can be done instead.
+ *
+ * This does the appropriate lookup based on the PAGE_CACHE_TAG_EXTENTS
+ * bit on the root node
+ */
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+					     unsigned long index)
+{
+	if (radix_tree_tagged(root, PAGE_CACHE_TAG_EXTENTS)) {
+		struct page *p;
+		unsigned int found;
+		found = radix_tree_gang_lookup(root, (void **)(&p), index, 1);
+		if (found) {
+			if (PagePlaceHolder(p)) {
+				pgoff_t start = p->flags;
+				pgoff_t end = p->index;
+				if (end >= index && start <= index)
+					return p;
+				return NULL;
+			} else {
+				if (p->index == index) {
+					return p;
+				}
+				return NULL;
+			}
+		} else
+			return NULL;
+	}
+	return radix_tree_lookup(root, index);
+}
+
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -437,12 +480,38 @@ int add_to_page_cache(struct page *page,
 int add_to_page_cache(struct page *page, struct address_space *mapping,
 		pgoff_t offset, gfp_t gfp_mask)
 {
-	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	int error;
+again:
+	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		struct page *tmp;
 		write_lock_irq(&mapping->tree_lock);
+		/*
+		 * If extents are on for this radix tree, we have to do
+		 * the more expensive search for an overlapping extent
+		 * before we try to insert.
+		 */
+		if (radix_tree_tagged(&mapping->page_tree,
+				      PAGE_CACHE_TAG_EXTENTS)) {
+			tmp = radix_tree_lookup_extent(&mapping->page_tree,
+						       offset);
+			if (tmp && PagePlaceHolder(tmp))
+				goto exists;
+		}
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
-		if (!error) {
+		if (error == -EEXIST && (gfp_mask & __GFP_WAIT)) {
+			tmp = radix_tree_lookup_extent(&mapping->page_tree,
+						       offset);
+			if (tmp && PagePlaceHolder(tmp)) {
+exists:
+				radix_tree_preload_end();
+				/* drops the lock */
+				wait_on_placeholder_page(mapping, tmp, 1);
+				goto again;
+			}
+		}
+		if (!error && !PagePlaceHolder(page)) {
 			page_cache_get(page);
 			SetPageLocked(page);
 			page->mapping = mapping;
@@ -516,6 +585,92 @@ void fastcall wait_on_page_bit(struct pa
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
+/*
+ * Call with either a read lock or a write lock on the mapping tree.
+ *
+ * placeholder pages can't be tested or checked without the tree lock held
+ *
+ * In order to wait for the placeholders without losing a wakeup from someone
+ * removing them, we have to prepare_to_wait before dropping the tree lock.
+ *
+ * The lock is dropped just before waiting for the place holder.  It is not
+ * retaken before returning.
+ */
+static void wait_on_placeholder_page(struct address_space *mapping,
+				     struct page *page,
+				     int write_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (write_lock)
+		write_unlock_irq(&mapping->tree_lock);
+	else
+		read_unlock_irq(&mapping->tree_lock);
+	io_schedule();
+	finish_wait(wqh, &wait);
+}
+
+void wake_up_placeholder_page(struct page *page)
+{
+	__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
+}
+EXPORT_SYMBOL_GPL(wake_up_placeholder_page);
+
+/**
+ * wait_on_placeholder_pages_range - gang placeholder page waiter
+ * @mapping:	The address_space to search
+ * @start:	The starting page index
+ * @end:	The max page index (inclusive)
+ *
+ * wait_on_placeholder_pages_range() will search for and wait on a range
+ * of pages in the mapping
+ *
+ * On return, the range has no placeholder pages sitting in it.
+ */
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end)
+{
+	unsigned int i;
+	unsigned int ret;
+	struct page *pages[8];
+	pgoff_t cur = start;
+	pgoff_t highest = start;
+
+	/*
+	 * we expect a very small number of place holder pages, so
+	 * this code isn't trying to be very fast.
+	 */
+again:
+	read_lock_irq(&mapping->tree_lock);
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pages, cur, ARRAY_SIZE(pages));
+	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i])) {
+			if (pages[i]->flags > end)
+				goto done;
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, pages[i], 0);
+			goto again;
+		}
+		if (pages[i]->index > highest)
+			highest = pages[i]->index;
+		if (pages[i]->index > end)
+			goto done;
+	}
+	if (highest < end && ret == ARRAY_SIZE(pages)) {
+		cur = highest;
+		if (need_resched()) {
+			read_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+		}
+		goto again;
+	}
+done:
+	read_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(wait_on_placeholder_pages_range);
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -532,6 +687,7 @@ EXPORT_SYMBOL(wait_on_page_bit);
  */
 void fastcall unlock_page(struct page *page)
 {
+	BUG_ON(PagePlaceHolder(page));
 	smp_mb__before_clear_bit();
 	if (!TestClearPageLocked(page))
 		BUG();
@@ -568,6 +724,7 @@ void fastcall __lock_page(struct page *p
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -580,6 +737,7 @@ void fastcall __lock_page_nosync(struct 
 void fastcall __lock_page_nosync(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	BUG_ON(PagePlaceHolder(page));
 	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
 							TASK_UNINTERRUPTIBLE);
 }
@@ -597,13 +755,281 @@ struct page * find_get_page(struct addre
 	struct page *page;
 
 	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+	if (page) {
+		if (PagePlaceHolder(page))
+			page = NULL;
+		else
+			page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
+
+/**
+ * remove_placeholder_pages - remove a range of placeholder or locked pages
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ *
+ * Any placeholder pages in the range specified are removed.  Any real
+ * pages are unlocked and released.
+ */
+void remove_placeholder_pages(struct address_space *mapping,
+			     struct page **pages,
+			     unsigned long start,
+			     unsigned long end,
+			     unsigned long nr)
+{
+	struct page *page;
+	int ret;
+	int i;
+	unsigned long num;
+
+	write_lock_irq(&mapping->tree_lock);
+	while (start < end) {
+		num = min(nr, end-start);
+		ret = radix_tree_gang_lookup(&mapping->page_tree,
+						(void **)pages, start, num);
+		for (i = 0; i < ret; i++) {
+			page = pages[i];
+			if (PagePlaceHolder(page)) {
+				if (page->index >= end)
+					break;
+				radix_tree_delete(&mapping->page_tree,
+						  page->index);
+				start = page->index + 1;
+				wake_up_placeholder_page(page);
+				kfree(page);
+			} else {
+				if (page->index >= end)
+					break;
+				unlock_page(page);
+				page_cache_release(page);
+				start = page->index + 1;
+			}
+		}
+		if (need_resched()) {
+			write_unlock_irq(&mapping->tree_lock);
+			cond_resched();
+			write_lock_irq(&mapping->tree_lock);
+		}
+	}
+	write_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL_GPL(remove_placeholder_pages);
+
+/*
+ * a helper function to insert a single placeholder covering a range
+ * of slots in the radix tree.  The placeholder is inserted at the end
+ * index of its range, with the start recorded in insert->flags.
+ *
+ * returns zero on success or the error from radix_tree_insert.
+ */
+static int insert_placeholder(struct address_space *mapping,
+					 struct page *insert)
+{
+	int err;
+	unsigned int found;
+	struct page *debug_page;
+	/* sanity check, make sure other extents don't exist in this range */
+	found = radix_tree_gang_lookup(&mapping->page_tree,
+				    (void **)(&debug_page),
+				    insert->flags, 1);
+	BUG_ON(found > 0 && debug_page->flags <= (insert->index));
+	err = radix_tree_insert(&mapping->page_tree, insert->index, insert);
+	return err;
+}
+
+
+static struct page *alloc_placeholder(gfp_t gfp_mask)
+{
+	struct page *p;
+	p = kmalloc(sizeof(*p), gfp_mask);
+	if (p) {
+		memset(p, 0, sizeof(*p));
+		p->mapping = &placeholder_address_space;
+	}
+	return p;
+}
+
+/**
+ * find_or_insert_placeholders - locate a group of pagecache pages or insert one
+ * @mapping: the page's address_space
+ * @pages: an array of page pointers to use for gang lookups
+ * @start: the search starting point
+ * @end: the search end point (offsets >= end are not touched)
+ * @nr: the size of the pages array.
+ * @gfp_mask: page allocation mode
+ * @iowait: 1 if you want to wait for dirty or writeback pages.
+ *
+ * This locks down a range of offsets in the address space.  Any pages
+ * already present are locked and a placeholder page is inserted into
+ * the radix tree for any offsets without pages.
+ */
+int find_or_insert_placeholders(struct address_space *mapping,
+				  struct page **pages, unsigned long start,
+				  unsigned long end, unsigned long nr,
+				  gfp_t gfp_mask,
+				  int iowait)
+{
+	int err = 0;
+	int i, ret;
+	unsigned long cur = start;
+	struct page *page;
+	int restart;
+	struct page *insert = NULL;
+	/*
+	 * this gets complicated.  Placeholders and page locks need to
+	 * be taken in order.  We use gang lookup to cut down on the cpu
+	 * cost, but we need to keep track of holes in the results and
+	 * insert placeholders as appropriate.
+	 *
+	 * If a locked page or a placeholder is found, we wait for it and
+	 * pick up where we left off.  If a dirty or PG_writeback page is found
+	 * and iowait==1, we have to drop all of our locks, kick/wait for the
+	 * io and resume again.
+	 */
+repeat:
+	if (!insert) {
+		insert = alloc_placeholder(gfp_mask);
+		if (!insert) {
+			err = -ENOMEM;
+			goto fail;
+		}
+	}
+	if (cur != start)
+		cond_resched();
+	err = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	if (err)
+		goto fail;
+	write_lock_irq(&mapping->tree_lock);
+
+	/* only set the extent tag if we are inserting placeholders for more
+	 * than one page worth of slots.  This way small random ios don't
+	 * suffer from slower lookups.
+	 */
+	if (cur == start && end - start > 1)
+		radix_tree_root_tag_set(&mapping->page_tree,
+					PAGE_CACHE_TAG_EXTENTS);
+repeat_lock:
+	ret = radix_tree_gang_lookup(&mapping->page_tree,
+					(void **)pages, cur,
+					min(nr, end-cur));
+	for (i = 0 ; i < ret ; i++) {
+		restart = 0;
+		page = pages[i];
+
+		if (PagePlaceHolder(page) && page->flags < end) {
+			radix_tree_preload_end();
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 1);
+			goto repeat;
+		}
+
+		if (page->index > cur) {
+			unsigned long top = min(end, page->index);
+			insert->index = top - 1;
+			insert->flags = cur;
+			err = insert_placeholder(mapping, insert);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			insert = NULL;
+			if (err)
+				goto fail;
+			cur = top;
+			if (cur < end)
+				goto repeat;
+			else
+				goto done;
+		}
+		if (page->index >= end) {
+			ret = 0;
+			break;
+		}
+		page_cache_get(page);
+		BUG_ON(page->index != cur);
+		BUG_ON(PagePlaceHolder(page));
+		if (TestSetPageLocked(page)) {
+			unsigned long tmpoff = page->index;
+			page_cache_get(page);
+			write_unlock_irq(&mapping->tree_lock);
+			radix_tree_preload_end();
+			__lock_page(page);
+			/* Has the page been truncated while we slept? */
+			if (unlikely(page->mapping != mapping ||
+				     page->index != tmpoff)) {
+				unlock_page(page);
+				page_cache_release(page);
+				goto repeat;
+			} else {
+				/* we've locked the page, but  we need to
+				 *  check it for dirty/writeback
+				 */
+				restart = 1;
+			}
+		}
+		if (iowait && (PageDirty(page) || PageWriteback(page))) {
+			unlock_page(page);
+			page_cache_release(page);
+			if (!restart) {
+				write_unlock_irq(&mapping->tree_lock);
+				radix_tree_preload_end();
+			}
+			err = filemap_write_and_wait_range(mapping,
+						 cur << PAGE_CACHE_SHIFT,
+						 end << PAGE_CACHE_SHIFT);
+			if (err)
+				goto fail;
+			goto repeat;
+		}
+		cur++;
+		if (restart)
+			goto repeat;
+		if (cur >= end)
+			break;
+	}
+
+	/* we haven't yet filled the range */
+	if (cur < end) {
+		/* if the search filled our array, there is more to do. */
+		if (ret && ret == nr)
+			goto repeat_lock;
+
+		/* otherwise insert placeholders for the remaining offsets */
+		insert->index = end - 1;
+		insert->flags = cur;
+		err = insert_placeholder(mapping, insert);
+		write_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+		if (err)
+			goto fail;
+		insert = NULL;
+		cur = end;
+	} else {
+		write_unlock_irq(&mapping->tree_lock);
+		radix_tree_preload_end();
+	}
+done:
+	BUG_ON(cur < end);
+	BUG_ON(cur > end);
+	if (insert)
+		kfree(insert);
+	return err;
+fail:
+	remove_placeholder_pages(mapping, pages, start, cur, nr);
+	if (insert)
+		kfree(insert);
+	return err;
+}
+EXPORT_SYMBOL_GPL(find_or_insert_placeholders);
 
 /**
  * find_trylock_page - find and lock a page
@@ -617,8 +1043,8 @@ struct page *find_trylock_page(struct ad
 	struct page *page;
 
 	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
+	if (page && (PagePlaceHolder(page) || TestSetPageLocked(page)))
 		page = NULL;
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
@@ -642,8 +1068,14 @@ struct page *find_lock_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 repeat:
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = radix_tree_lookup_extent(&mapping->page_tree, offset);
 	if (page) {
+		if (PagePlaceHolder(page)) {
+			/* drops the lock */
+			wait_on_placeholder_page(mapping, page, 0);
+			read_lock_irq(&mapping->tree_lock);
+			goto repeat;
+		}
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -727,14 +1159,25 @@ unsigned find_get_pages(struct address_s
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 			    unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i+1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -761,6 +1204,8 @@ unsigned find_get_pages_contig(struct ad
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
+		if (PagePlaceHolder(pages[i]))
+			break;
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
@@ -785,14 +1230,25 @@ unsigned find_get_pages_tag(struct addre
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int i;
+	unsigned int i = 0;
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
+	while (i < ret) {
+		if (PagePlaceHolder(pages[i])) {
+			/* we can't return a place holder, shift it away */
+			if (i + 1 < ret) {
+				memmove(&pages[i], &pages[i+1],
+					(ret - i - 1) * sizeof(struct page *));
+			}
+			ret--;
+			continue;
+		} else
+			page_cache_get(pages[i]);
+		i++;
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
@@ -2406,18 +2862,15 @@ generic_file_direct_IO(int rw, struct ki
 			unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
-	retval = filemap_write_and_wait(mapping);
-	if (retval == 0) {
-		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
-						offset, nr_segs);
-		if (rw == WRITE && mapping->nrpages) {
-			pgoff_t end = (offset + write_len - 1)
-						>> PAGE_CACHE_SHIFT;
-			int err = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
-			if (err)
-				retval = err;
-		}
+	retval = mapping->a_ops->direct_IO(rw, iocb, iov,
+					offset, nr_segs);
+	if (rw == WRITE && mapping->nrpages) {
+		pgoff_t end = (offset + write_len - 1)
+					>> PAGE_CACHE_SHIFT;
+		int err = invalidate_inode_pages2_range(mapping,
+				offset >> PAGE_CACHE_SHIFT, end);
+		if (err)
+			retval = err;
 	}
 	return retval;
 }
diff -r 511f067627ac -r 4cac7e560b53 mm/migrate.c
--- a/mm/migrate.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/migrate.c	Thu Dec 21 15:31:30 2006 -0500
@@ -305,8 +305,12 @@ static int migrate_page_move_mapping(str
 
 	write_lock_irq(&mapping->tree_lock);
 
+	/*
+	 * we don't need to worry about placeholders here,
+	 * the slot in the tree is verified
+	 */
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
- 					page_index(page));
+					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
diff -r 511f067627ac -r 4cac7e560b53 mm/readahead.c
--- a/mm/readahead.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/readahead.c	Thu Dec 21 15:31:30 2006 -0500
@@ -288,7 +288,8 @@ __do_page_cache_readahead(struct address
 		if (page_offset > end_index)
 			break;
 
-		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		page = radix_tree_lookup_extent(&mapping->page_tree,
+						page_offset);
 		if (page)
 			continue;
 
diff -r 511f067627ac -r 4cac7e560b53 mm/truncate.c
--- a/mm/truncate.c	Thu Dec 21 00:20:01 2006 -0800
+++ b/mm/truncate.c	Thu Dec 21 15:31:30 2006 -0500
@@ -209,6 +209,7 @@ void truncate_inode_pages_range(struct a
 		}
 		pagevec_release(&pvec);
 	}
+	wait_on_placeholder_pages_range(mapping, start, end);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 



