All of lore.kernel.org
 help / color / mirror / Atom feed
* (unknown), 
@ 2010-06-16 16:33 ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin

  Hello,

  here is the fourth version of the writeback livelock avoidance patches
for data integrity writes. To quickly summarize the idea: we tag dirty
pages at the beginning of write_cache_pages with a new TOWRITE tag and
then write only tagged pages to avoid parallel writers to livelock us.
See changelogs of the patches for more details.
  I have tested the patches with fsx and a test program I wrote which
checks that if we crash after fsync, the data is indeed on disk.
  If there are no more concerns, can these patches get merged?

								Honza

  Changes since last version:
- tagging function was changed to stop after given amount of pages to
  avoid keeping tree_lock and irqs disabled for too long
- changed names and updated comments as Andrew suggested
- measured memory impact and reported it in the changelog

  Things suggested but not changed (I want to avoid going in circles ;):
- use tagging also for WB_SYNC_NONE writeback - there's problem with an
  interaction with wbc->nr_to_write. If we tag all dirty pages, we can
  spend too much time tagging when we write only a few pages in the end
  because of nr_to_write. If we tag only say nr_to_write pages, we may
  not have enough pages tagged because some pages are written out by
  someone else and so we would have to restart and tagging would become
  essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
  writeback if we can get rid of nr_to_write. But that's a story for
  a different patch set.
- implement function for clearing several tags (TOWRITE, DIRTY) at once
  - IMHO not worth it because we would save only conversion of page index
  to radix tree offsets. The rest would have to be separate anyways. And
  the interface would be incosistent as well...
- use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
  quite work because __lookup_tag returns only leaf nodes so we'd have to
  implement tree traversal anyways to tag also internal nodes.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* (no subject)
@ 2010-06-16 16:33 ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin

  Hello,

  here is the fourth version of the writeback livelock avoidance patches
for data integrity writes. To quickly summarize the idea: we tag dirty
pages at the beginning of write_cache_pages with a new TOWRITE tag and
then write only tagged pages to avoid parallel writers to livelock us.
See changelogs of the patches for more details.
  I have tested the patches with fsx and a test program I wrote which
checks that if we crash after fsync, the data is indeed on disk.
  If there are no more concerns, can these patches get merged?

								Honza

  Changes since last version:
- tagging function was changed to stop after given amount of pages to
  avoid keeping tree_lock and irqs disabled for too long
- changed names and updated comments as Andrew suggested
- measured memory impact and reported it in the changelog

  Things suggested but not changed (I want to avoid going in circles ;):
- use tagging also for WB_SYNC_NONE writeback - there's problem with an
  interaction with wbc->nr_to_write. If we tag all dirty pages, we can
  spend too much time tagging when we write only a few pages in the end
  because of nr_to_write. If we tag only say nr_to_write pages, we may
  not have enough pages tagged because some pages are written out by
  someone else and so we would have to restart and tagging would become
  essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
  writeback if we can get rid of nr_to_write. But that's a story for
  a different patch set.
- implement function for clearing several tags (TOWRITE, DIRTY) at once
  - IMHO not worth it because we would save only conversion of page index
  to radix tree offsets. The rest would have to be separate anyways. And
  the interface would be incosistent as well...
- use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
  quite work because __lookup_tag returns only leaf nodes so we'd have to
  implement tree traversal anyways to tag also internal nodes.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
  2010-06-16 16:33 ` Jan Kara
@ 2010-06-16 16:33   ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara

Implement function for setting one tag if another tag is set
for each item in given range.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h |    4 ++
 lib/radix-tree.c           |   92 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 55ca73c..a4b00e9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -192,6 +192,10 @@ unsigned int
 radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+		unsigned long *first_indexp, unsigned long last_index,
+		unsigned long nr_to_tag,
+		unsigned int fromtag, unsigned int totag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 05da38b..549ce9c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -609,6 +609,98 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 EXPORT_SYMBOL(radix_tree_tag_get);
 
 /**
+ * radix_tree_range_tag_if_tagged - for each item in given range set given
+ *				   tag if item has another tag set
+ * @root:		radix tree root
+ * @first_indexp:	pointer to a starting index of a range to scan
+ * @last_index:		last index of a range to scan
+ * @nr_to_tag:		maximum number items to tag
+ * @iftag: 		tag index to test
+ * @settag:		tag index to set if tested tag is set
+ *
+ * This function scans range of radix tree from first_index to last_index
+ * (inclusive).  For each item in the range if iftag is set, the function sets
+ * also settag. The function stops either after tagging nr_to_tag items or
+ * after reaching last_index.
+ *
+ * The function returns number of leaves where the tag was set and sets
+ * *first_indexp to the first unscanned index.
+ */
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+                unsigned long *first_indexp, unsigned long last_index,
+                unsigned long nr_to_tag,
+		unsigned int iftag, unsigned int settag)
+{
+	unsigned int height = root->height, shift;
+	unsigned long tagged = 0, index = *first_indexp;
+	struct radix_tree_node *open_slots[height], *slot;
+
+	last_index = min(last_index, radix_tree_maxindex(height));
+	if (index > last_index)
+		return 0;
+	if (!root_tag_get(root, iftag)) {
+		*first_indexp = last_index + 1;
+		return 0;
+	}
+	if (height == 0) {
+		*first_indexp = last_index + 1;
+		root_tag_set(root, settag);
+		return 1;
+	}
+
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+
+	for (;;) {
+		int offset;
+
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (!slot->slots[offset])
+			goto next;
+		if (!tag_get(slot, iftag, offset))
+			goto next;
+		tag_set(slot, settag, offset);
+		if (height == 1) {
+			tagged++;
+			goto next;
+		}
+		/* Go down one level */
+		height--;
+		shift -= RADIX_TREE_MAP_SHIFT;
+		open_slots[height] = slot;
+		slot = slot->slots[offset];
+		continue;
+next:
+		/* Go to next item at level determined by 'shift' */
+		index = ((index >> shift) + 1) << shift;
+		if (index > last_index)
+			break;
+		if (tagged > nr_to_tag)
+			break;
+		while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
+			/*
+			 * We've fully scanned this node. Go up. Because
+			 * last_index is guaranteed to be in the tree, what
+			 * we do below cannot wander astray.
+			 */
+			slot = open_slots[height];
+			height++;
+			shift += RADIX_TREE_MAP_SHIFT;
+		}
+	}
+	/*
+	 * The iftag must have been set somewhere because otherwise
+	 * we would return immediated at the beginning of the function
+	 */
+	root_tag_set(root, settag);
+	*first_indexp = index;
+
+	return tagged;
+}
+EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
+
+
+/**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
  *	@root:		tree root
  *	@index:		index key
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
@ 2010-06-16 16:33   ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara

Implement function for setting one tag if another tag is set
for each item in given range.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/radix-tree.h |    4 ++
 lib/radix-tree.c           |   92 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 55ca73c..a4b00e9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -192,6 +192,10 @@ unsigned int
 radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+		unsigned long *first_indexp, unsigned long last_index,
+		unsigned long nr_to_tag,
+		unsigned int fromtag, unsigned int totag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 05da38b..549ce9c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -609,6 +609,98 @@ int radix_tree_tag_get(struct radix_tree_root *root,
 EXPORT_SYMBOL(radix_tree_tag_get);
 
 /**
+ * radix_tree_range_tag_if_tagged - for each item in given range set given
+ *				   tag if item has another tag set
+ * @root:		radix tree root
+ * @first_indexp:	pointer to a starting index of a range to scan
+ * @last_index:		last index of a range to scan
+ * @nr_to_tag:		maximum number items to tag
+ * @iftag: 		tag index to test
+ * @settag:		tag index to set if tested tag is set
+ *
+ * This function scans range of radix tree from first_index to last_index
+ * (inclusive).  For each item in the range if iftag is set, the function sets
+ * also settag. The function stops either after tagging nr_to_tag items or
+ * after reaching last_index.
+ *
+ * The function returns number of leaves where the tag was set and sets
+ * *first_indexp to the first unscanned index.
+ */
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+                unsigned long *first_indexp, unsigned long last_index,
+                unsigned long nr_to_tag,
+		unsigned int iftag, unsigned int settag)
+{
+	unsigned int height = root->height, shift;
+	unsigned long tagged = 0, index = *first_indexp;
+	struct radix_tree_node *open_slots[height], *slot;
+
+	last_index = min(last_index, radix_tree_maxindex(height));
+	if (index > last_index)
+		return 0;
+	if (!root_tag_get(root, iftag)) {
+		*first_indexp = last_index + 1;
+		return 0;
+	}
+	if (height == 0) {
+		*first_indexp = last_index + 1;
+		root_tag_set(root, settag);
+		return 1;
+	}
+
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+
+	for (;;) {
+		int offset;
+
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (!slot->slots[offset])
+			goto next;
+		if (!tag_get(slot, iftag, offset))
+			goto next;
+		tag_set(slot, settag, offset);
+		if (height == 1) {
+			tagged++;
+			goto next;
+		}
+		/* Go down one level */
+		height--;
+		shift -= RADIX_TREE_MAP_SHIFT;
+		open_slots[height] = slot;
+		slot = slot->slots[offset];
+		continue;
+next:
+		/* Go to next item at level determined by 'shift' */
+		index = ((index >> shift) + 1) << shift;
+		if (index > last_index)
+			break;
+		if (tagged > nr_to_tag)
+			break;
+		while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
+			/*
+			 * We've fully scanned this node. Go up. Because
+			 * last_index is guaranteed to be in the tree, what
+			 * we do below cannot wander astray.
+			 */
+			slot = open_slots[height];
+			height++;
+			shift += RADIX_TREE_MAP_SHIFT;
+		}
+	}
+	/*
+	 * The iftag must have been set somewhere because otherwise
+	 * we would return immediated at the beginning of the function
+	 */
+	root_tag_set(root, settag);
+	*first_indexp = index;
+
+	return tagged;
+}
+EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
+
+
+/**
  *	radix_tree_next_hole    -    find the next hole (not-present entry)
  *	@root:		tree root
  *	@index:		index key
-- 
1.6.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-16 16:33 ` Jan Kara
@ 2010-06-16 16:33   ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to
296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same
(13 and 7 respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   69 +++++++++++++++++++++++++++++++++-----------
 3 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..664674e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a4b00e9..634b8e6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..1cb043e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,40 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end (inclusive) and tags
+ * all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
+ * that write_cache_pages (or whoever calls this function) will then use
+ * TOWRITE tag to identify pages eligible for writeback.  This mechanism is
+ * used to avoid livelocking of writeback by a process steadily creating new
+ * dirty pages in the file (thus it is important for this function to be quick
+ * so that it can tag pages faster than a dirtying process can create them).
+ */
+/*
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ */
+#define WRITEBACK_TAG_BATCH 4096
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	unsigned long tagged;
+
+	do {
+		spin_lock_irq(&mapping->tree_lock);
+		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
+				&start, end, WRITEBACK_TAG_BATCH,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+		spin_unlock_irq(&mapping->tree_lock);
+		cond_resched();
+	} while (tagged >= WRITEBACK_TAG_BATCH);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +854,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -835,6 +876,7 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -851,29 +893,19 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
-
-		/*
-		 * If this is a data integrity sync, cap the writeback to the
-		 * current end of file. Any extension to the file that occurs
-		 * after this is a new write and we don't need to write those
-		 * pages out to fulfil our data integrity requirements. If we
-		 * try to write them out, we can get stuck in this scan until
-		 * the concurrent writer stops adding dirty pages and extending
-		 * EOF.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL &&
-		    wbc->range_end == LLONG_MAX) {
-			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-		}
 	}
-
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1329,6 +1361,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-16 16:33   ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to
296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same
(13 and 7 respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   69 +++++++++++++++++++++++++++++++++-----------
 3 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..664674e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a4b00e9..634b8e6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..1cb043e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,40 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end (inclusive) and tags
+ * all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
+ * that write_cache_pages (or whoever calls this function) will then use
+ * TOWRITE tag to identify pages eligible for writeback.  This mechanism is
+ * used to avoid livelocking of writeback by a process steadily creating new
+ * dirty pages in the file (thus it is important for this function to be quick
+ * so that it can tag pages faster than a dirtying process can create them).
+ */
+/*
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ */
+#define WRITEBACK_TAG_BATCH 4096
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	unsigned long tagged;
+
+	do {
+		spin_lock_irq(&mapping->tree_lock);
+		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
+				&start, end, WRITEBACK_TAG_BATCH,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+		spin_unlock_irq(&mapping->tree_lock);
+		cond_resched();
+	} while (tagged >= WRITEBACK_TAG_BATCH);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +854,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -835,6 +876,7 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -851,29 +893,19 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
-
-		/*
-		 * If this is a data integrity sync, cap the writeback to the
-		 * current end of file. Any extension to the file that occurs
-		 * after this is a new write and we don't need to write those
-		 * pages out to fulfil our data integrity requirements. If we
-		 * try to write them out, we can get stuck in this scan until
-		 * the concurrent writer stops adding dirty pages and extending
-		 * EOF.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL &&
-		    wbc->range_end == LLONG_MAX) {
-			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-		}
 	}
-
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1329,6 +1361,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: your mail
  2010-06-16 16:33 ` Jan Kara
                   ` (2 preceding siblings ...)
  (?)
@ 2010-06-16 22:15 ` Dave Chinner
  2010-06-17  7:43     ` Jan Kara
  -1 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2010-06-16 22:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
>   Hello,
> 
>   here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
>   I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
>   If there are no more concerns, can these patches get merged?

Has it been run through xfstests? I'd suggest doing that at least
with XFS as there are several significant sync sanity tests for XFS
in the suite...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
  2010-06-16 22:15 ` your mail Dave Chinner
@ 2010-06-17  7:43     ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-17  7:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   here is the fourth version of the writeback livelock avoidance patches
> > for data integrity writes. To quickly summarize the idea: we tag dirty
> > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > then write only tagged pages to avoid parallel writers to livelock us.
> > See changelogs of the patches for more details.
> >   I have tested the patches with fsx and a test program I wrote which
> > checks that if we crash after fsync, the data is indeed on disk.
> >   If there are no more concerns, can these patches get merged?
> 
> Has it been run through xfstests? I'd suggest doing that at least
> with XFS as there are several significant sync sanity tests for XFS
> in the suite...
  I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
a test with xfs now.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
@ 2010-06-17  7:43     ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-17  7:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   here is the fourth version of the writeback livelock avoidance patches
> > for data integrity writes. To quickly summarize the idea: we tag dirty
> > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > then write only tagged pages to avoid parallel writers to livelock us.
> > See changelogs of the patches for more details.
> >   I have tested the patches with fsx and a test program I wrote which
> > checks that if we crash after fsync, the data is indeed on disk.
> >   If there are no more concerns, can these patches get merged?
> 
> Has it been run through xfstests? I'd suggest doing that at least
> with XFS as there are several significant sync sanity tests for XFS
> in the suite...
  I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
a test with xfs now.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
  2010-06-16 16:33 ` Jan Kara
@ 2010-06-17  9:11   ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-17  9:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin

  Sorry for replying to my email but I forgot to set a subject while doing
git send-email. So at least set it now.

								Honza

On Wed 16-06-10 18:33:49, Jan Kara wrote:
>   Hello,
> 
>   here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
>   I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
>   If there are no more concerns, can these patches get merged?
> 
> 								Honza
> 
>   Changes since last version:
> - tagging function was changed to stop after given amount of pages to
>   avoid keeping tree_lock and irqs disabled for too long
> - changed names and updated comments as Andrew suggested
> - measured memory impact and reported it in the changelog
> 
>   Things suggested but not changed (I want to avoid going in circles ;):
> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
>   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
>   spend too much time tagging when we write only a few pages in the end
>   because of nr_to_write. If we tag only say nr_to_write pages, we may
>   not have enough pages tagged because some pages are written out by
>   someone else and so we would have to restart and tagging would become
>   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
>   writeback if we can get rid of nr_to_write. But that's a story for
>   a different patch set.
> - implement function for clearing several tags (TOWRITE, DIRTY) at once
>   - IMHO not worth it because we would save only conversion of page index
>   to radix tree offsets. The rest would have to be separate anyways. And
>   the interface would be incosistent as well...
> - use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
>   quite work because __lookup_tag returns only leaf nodes so we'd have to
>   implement tree traversal anyways to tag also internal nodes.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
@ 2010-06-17  9:11   ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-17  9:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin

  Sorry for replying to my email but I forgot to set a subject while doing
git send-email. So at least set it now.

								Honza

On Wed 16-06-10 18:33:49, Jan Kara wrote:
>   Hello,
> 
>   here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
>   I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
>   If there are no more concerns, can these patches get merged?
> 
> 								Honza
> 
>   Changes since last version:
> - tagging function was changed to stop after given amount of pages to
>   avoid keeping tree_lock and irqs disabled for too long
> - changed names and updated comments as Andrew suggested
> - measured memory impact and reported it in the changelog
> 
>   Things suggested but not changed (I want to avoid going in circles ;):
> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
>   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
>   spend too much time tagging when we write only a few pages in the end
>   because of nr_to_write. If we tag only say nr_to_write pages, we may
>   not have enough pages tagged because some pages are written out by
>   someone else and so we would have to restart and tagging would become
>   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
>   writeback if we can get rid of nr_to_write. But that's a story for
>   a different patch set.
> - implement function for clearing several tags (TOWRITE, DIRTY) at once
>   - IMHO not worth it because we would save only conversion of page index
>   to radix tree offsets. The rest would have to be separate anyways. And
>   the interface would be incosistent as well...
> - use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
>   quite work because __lookup_tag returns only leaf nodes so we'd have to
>   implement tree traversal anyways to tag also internal nodes.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
  2010-06-17  7:43     ` Jan Kara
  (?)
@ 2010-06-18  6:11     ` Dave Chinner
  2010-06-18  7:01         ` Nick Piggin
  -1 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2010-06-18  6:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Thu, Jun 17, 2010 at 09:43:50AM +0200, Jan Kara wrote:
> On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> > On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > >   Hello,
> > > 
> > >   here is the fourth version of the writeback livelock avoidance patches
> > > for data integrity writes. To quickly summarize the idea: we tag dirty
> > > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > > then write only tagged pages to avoid parallel writers to livelock us.
> > > See changelogs of the patches for more details.
> > >   I have tested the patches with fsx and a test program I wrote which
> > > checks that if we crash after fsync, the data is indeed on disk.
> > >   If there are no more concerns, can these patches get merged?
> > 
> > Has it been run through xfstests? I'd suggest doing that at least
> > with XFS as there are several significant sync sanity tests for XFS
> > in the suite...
>   I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
> a test with xfs now.

Cool. if there are no problems then I'm happy with this ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
  2010-06-18  6:11     ` Dave Chinner
@ 2010-06-18  7:01         ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-18  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton

On Fri, Jun 18, 2010 at 04:11:11PM +1000, Dave Chinner wrote:
> On Thu, Jun 17, 2010 at 09:43:50AM +0200, Jan Kara wrote:
> > On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> > > On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > > >   Hello,
> > > > 
> > > >   here is the fourth version of the writeback livelock avoidance patches
> > > > for data integrity writes. To quickly summarize the idea: we tag dirty
> > > > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > > > then write only tagged pages to avoid parallel writers to livelock us.
> > > > See changelogs of the patches for more details.
> > > >   I have tested the patches with fsx and a test program I wrote which
> > > > checks that if we crash after fsync, the data is indeed on disk.
> > > >   If there are no more concerns, can these patches get merged?
> > > 
> > > Has it been run through xfstests? I'd suggest doing that at least
> > > with XFS as there are several significant sync sanity tests for XFS
> > > in the suite...
> >   I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
> > a test with xfs now.
> 
> Cool. if there are no problems then I'm happy with this ;)

Agreed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
@ 2010-06-18  7:01         ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-18  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton

On Fri, Jun 18, 2010 at 04:11:11PM +1000, Dave Chinner wrote:
> On Thu, Jun 17, 2010 at 09:43:50AM +0200, Jan Kara wrote:
> > On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> > > On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > > >   Hello,
> > > > 
> > > >   here is the fourth version of the writeback livelock avoidance patches
> > > > for data integrity writes. To quickly summarize the idea: we tag dirty
> > > > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > > > then write only tagged pages to avoid parallel writers to livelock us.
> > > > See changelogs of the patches for more details.
> > > >   I have tested the patches with fsx and a test program I wrote which
> > > > checks that if we crash after fsync, the data is indeed on disk.
> > > >   If there are no more concerns, can these patches get merged?
> > > 
> > > Has it been run through xfstests? I'd suggest doing that at least
> > > with XFS as there are several significant sync sanity tests for XFS
> > > in the suite...
> >   I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
> > a test with xfs now.
> 
> Cool. if there are no problems then I'm happy with this ;)

Agreed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
  2010-06-16 16:33   ` Jan Kara
  (?)
@ 2010-06-18 22:18   ` Andrew Morton
  2010-06-21 12:09       ` Nick Piggin
  -1 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2010-06-18 22:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, npiggin

On Wed, 16 Jun 2010 18:33:50 +0200
Jan Kara <jack@suse.cz> wrote:

> Implement function for setting one tag if another tag is set
> for each item in given range.
> 

These two patches look OK to me.

fwiw I have a userspace test harness for radix-tree.c:
http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
while and updated it somewhat, but it's probably rather bitrotted and
surely needs to be taught how to test the post-2006 additions.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-16 16:33   ` Jan Kara
  (?)
@ 2010-06-18 22:21   ` Andrew Morton
  2010-06-21 12:42       ` Jan Kara
  -1 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2010-06-18 22:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, npiggin

On Wed, 16 Jun 2010 18:33:51 +0200
Jan Kara <jack@suse.cz> wrote:

> We try to avoid livelocks of writeback when some steadily creates
> dirty pages in a mapping we are writing out. For memory-cleaning
> writeback, using nr_to_write works reasonably well but we cannot
> really use it for data integrity writeback. This patch tries to
> solve the problem.
> 
> The idea is simple: Tag all pages that should be written back
> with a special tag (TOWRITE) in the radix tree. This can be done
> rather quickly and thus livelocks should not happen in practice.
> Then we start doing the hard work of locking pages and sending
> them to disk only for those pages that have TOWRITE tag set.
> 
> Note: Adding new radix tree tag grows radix tree node from 288 to
> 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
> However, the number of slab/slub items per page remains the same
> (13 and 7 respectively).
> 
>
> ...
>
> +void tag_pages_for_writeback(struct address_space *mapping,
> +			     pgoff_t start, pgoff_t end)
> +{
> +	unsigned long tagged;
> +
> +	do {
> +		spin_lock_irq(&mapping->tree_lock);
> +		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
> +				&start, end, WRITEBACK_TAG_BATCH,
> +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		cond_resched();
> +	} while (tagged >= WRITEBACK_TAG_BATCH);
> +}

grumble.  (tagged > WRITEBACK_TAG_BATCH) would be a bug, wouldn't it? 
So the ">=" is hiding a bug.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
  2010-06-18 22:18   ` Andrew Morton
@ 2010-06-21 12:09       ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-21 12:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm

On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:50 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > Implement function for setting one tag if another tag is set
> > for each item in given range.
> > 
> 
> These two patches look OK to me.
> 
> fwiw I have a userspace test harness for radix-tree.c:
> http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> while and updated it somewhat, but it's probably rather bitrotted and
> surely needs to be taught how to test the post-2006 additions.
> 

Main thing I did was add RCU support (pretty dumb RCU but it found
a couple of bugs), and add some more tests. I'll try to find it...


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
@ 2010-06-21 12:09       ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-21 12:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm

On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:50 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > Implement function for setting one tag if another tag is set
> > for each item in given range.
> > 
> 
> These two patches look OK to me.
> 
> fwiw I have a userspace test harness for radix-tree.c:
> http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> while and updated it somewhat, but it's probably rather bitrotted and
> surely needs to be taught how to test the post-2006 additions.
> 

Main thing I did was add RCU support (pretty dumb RCU but it found
a couple of bugs), and add some more tests. I'll try to find it...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-18 22:21   ` Andrew Morton
@ 2010-06-21 12:42       ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-21 12:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm, npiggin

On Fri 18-06-10 15:21:28, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:51 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > We try to avoid livelocks of writeback when some steadily creates
> > dirty pages in a mapping we are writing out. For memory-cleaning
> > writeback, using nr_to_write works reasonably well but we cannot
> > really use it for data integrity writeback. This patch tries to
> > solve the problem.
> > 
> > The idea is simple: Tag all pages that should be written back
> > with a special tag (TOWRITE) in the radix tree. This can be done
> > rather quickly and thus livelocks should not happen in practice.
> > Then we start doing the hard work of locking pages and sending
> > them to disk only for those pages that have TOWRITE tag set.
> > 
> > Note: Adding new radix tree tag grows radix tree node from 288 to
> > 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
> > However, the number of slab/slub items per page remains the same
> > (13 and 7 respectively).
> > 
> >
> > ...
> >
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	unsigned long tagged;
> > +
> > +	do {
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
> > +				&start, end, WRITEBACK_TAG_BATCH,
> > +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		cond_resched();
> > +	} while (tagged >= WRITEBACK_TAG_BATCH);
> > +}
> 
> grumble.  (tagged > WRITEBACK_TAG_BATCH) would be a bug, wouldn't it? 
> So the ">=" is hiding a bug.
  Good point. I'll add WARN_ON_ONCE when tagged is > WRITEBACK_TAG_BATCH.
That will make the bug vissible while still continuing the writeback
(because it's a situation in which we can still happily continue). Should
I send you a new version of the patch or will you just fold that one line
in?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-21 12:42       ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-21 12:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm, npiggin

On Fri 18-06-10 15:21:28, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:51 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > We try to avoid livelocks of writeback when some steadily creates
> > dirty pages in a mapping we are writing out. For memory-cleaning
> > writeback, using nr_to_write works reasonably well but we cannot
> > really use it for data integrity writeback. This patch tries to
> > solve the problem.
> > 
> > The idea is simple: Tag all pages that should be written back
> > with a special tag (TOWRITE) in the radix tree. This can be done
> > rather quickly and thus livelocks should not happen in practice.
> > Then we start doing the hard work of locking pages and sending
> > them to disk only for those pages that have TOWRITE tag set.
> > 
> > Note: Adding new radix tree tag grows radix tree node from 288 to
> > 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
> > However, the number of slab/slub items per page remains the same
> > (13 and 7 respectively).
> > 
> >
> > ...
> >
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	unsigned long tagged;
> > +
> > +	do {
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
> > +				&start, end, WRITEBACK_TAG_BATCH,
> > +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		cond_resched();
> > +	} while (tagged >= WRITEBACK_TAG_BATCH);
> > +}
> 
> grumble.  (tagged > WRITEBACK_TAG_BATCH) would be a bug, wouldn't it? 
> So the ">=" is hiding a bug.
  Good point. I'll add WARN_ON_ONCE when tagged is > WRITEBACK_TAG_BATCH.
That will make the bug vissible while still continuing the writeback
(because it's a situation in which we can still happily continue). Should
I send you a new version of the patch or will you just fold that one line
in?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
  2010-06-21 12:09       ` Nick Piggin
@ 2010-06-21 22:43         ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-21 22:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm

On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > > 
> > 
> > These two patches look OK to me.
> > 
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> > 
> 
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
  Please send them my way if you can find them. I'll gladly run those tests
(and extend them to check also my new function).

									Honza
 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
@ 2010-06-21 22:43         ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-21 22:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm

On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > > 
> > 
> > These two patches look OK to me.
> > 
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> > 
> 
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
  Please send them my way if you can find them. I'll gladly run those tests
(and extend them to check also my new function).

									Honza
 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: your mail
  2010-06-16 16:33 ` Jan Kara
@ 2010-06-22  2:59   ` Wu Fengguang
  -1 siblings, 0 replies; 52+ messages in thread
From: Wu Fengguang @ 2010-06-22  2:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
>   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
>   spend too much time tagging when we write only a few pages in the end
>   because of nr_to_write. If we tag only say nr_to_write pages, we may
>   not have enough pages tagged because some pages are written out by
>   someone else and so we would have to restart and tagging would become

This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
writeback triggered by sync(). write_cache_pages() already ignored
nr_to_write for WB_SYNC_ALL.

>   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
>   writeback if we can get rid of nr_to_write. But that's a story for
>   a different patch set.

Besides introducing overheads, it will be a policy change in which the
system loses control to somehow "throttle" writeback of huge files.

So it may be safer to enlarge nr_to_write instead of canceling it totally.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: your mail
@ 2010-06-22  2:59   ` Wu Fengguang
  0 siblings, 0 replies; 52+ messages in thread
From: Wu Fengguang @ 2010-06-22  2:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
>   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
>   spend too much time tagging when we write only a few pages in the end
>   because of nr_to_write. If we tag only say nr_to_write pages, we may
>   not have enough pages tagged because some pages are written out by
>   someone else and so we would have to restart and tagging would become

This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
writeback triggered by sync(). write_cache_pages() already ignored
nr_to_write for WB_SYNC_ALL.

>   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
>   writeback if we can get rid of nr_to_write. But that's a story for
>   a different patch set.

Besides introducing overheads, it will be a policy change in which the
system loses control to somehow "throttle" writeback of huge files.

So it may be safer to enlarge nr_to_write instead of canceling it totally.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: your mail
  2010-06-22  2:59   ` Wu Fengguang
@ 2010-06-22 13:54     ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-22 13:54 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> >   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> >   spend too much time tagging when we write only a few pages in the end
> >   because of nr_to_write. If we tag only say nr_to_write pages, we may
> >   not have enough pages tagged because some pages are written out by
> >   someone else and so we would have to restart and tagging would become
> 
> This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> writeback triggered by sync(). write_cache_pages() already ignored
> nr_to_write for WB_SYNC_ALL.
  We could do that but frankly, I'm not very fond of adding more special
cases to writeback code than strictly necessary...

> >   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> >   writeback if we can get rid of nr_to_write. But that's a story for
> >   a different patch set.
> 
> Besides introducing overheads, it will be a policy change in which the
> system loses control to somehow "throttle" writeback of huge files.
  Yes, but if we guarantee we cannot livelock on a single file, do we care?
Memory management does not care because it's getting rid of dirty pages
which is what it wants. User might care but actually writing out files in
the order they were dirtied (i.e., the order user written them) is quite
natural so it's not a "surprising" behavior. And I don't think we can
assume that data in those small files are more valuable than data in the
large file and thus should be written earlier...
  With the overhead you are right that tagging is more expensive than
checking nr_to_write limit...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: your mail
@ 2010-06-22 13:54     ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-22 13:54 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> >   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> >   spend too much time tagging when we write only a few pages in the end
> >   because of nr_to_write. If we tag only say nr_to_write pages, we may
> >   not have enough pages tagged because some pages are written out by
> >   someone else and so we would have to restart and tagging would become
> 
> This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> writeback triggered by sync(). write_cache_pages() already ignored
> nr_to_write for WB_SYNC_ALL.
  We could do that but frankly, I'm not very fond of adding more special
cases to writeback code than strictly necessary...

> >   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> >   writeback if we can get rid of nr_to_write. But that's a story for
> >   a different patch set.
> 
> Besides introducing overheads, it will be a policy change in which the
> system loses control to somehow "throttle" writeback of huge files.
  Yes, but if we guarantee we cannot livelock on a single file, do we care?
Memory management does not care because it's getting rid of dirty pages
which is what it wants. User might care but actually writing out files in
the order they were dirtied (i.e., the order user written them) is quite
natural so it's not a "surprising" behavior. And I don't think we can
assume that data in those small files are more valuable than data in the
large file and thus should be written earlier...
  With the overhead you are right that tagging is more expensive than
checking nr_to_write limit...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: your mail
  2010-06-22 13:54     ` Jan Kara
  (?)
@ 2010-06-22 14:12     ` Wu Fengguang
  -1 siblings, 0 replies; 52+ messages in thread
From: Wu Fengguang @ 2010-06-22 14:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin

On Tue, Jun 22, 2010 at 09:54:58PM +0800, Jan Kara wrote:
> On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > >   interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > >   spend too much time tagging when we write only a few pages in the end
> > >   because of nr_to_write. If we tag only say nr_to_write pages, we may
> > >   not have enough pages tagged because some pages are written out by
> > >   someone else and so we would have to restart and tagging would become
> > 
> > This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> > writeback triggered by sync(). write_cache_pages() already ignored
> > nr_to_write for WB_SYNC_ALL.
>   We could do that but frankly, I'm not very fond of adding more special
> cases to writeback code than strictly necessary...

So do me. However for this case we only need to broaden the special case test:

                        if (nr_to_write > 0) {
                                nr_to_write--;
                                if (nr_to_write == 0 &&
-                                   wbc->sync_mode == WB_SYNC_NONE) {
+                                   !wbc->for_sync) {

> > >   essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > >   writeback if we can get rid of nr_to_write. But that's a story for
> > >   a different patch set.
> > 
> > Besides introducing overheads, it will be a policy change in which the
> > system loses control to somehow "throttle" writeback of huge files.
>   Yes, but if we guarantee we cannot livelock on a single file, do we care?
> Memory management does not care because it's getting rid of dirty pages
> which is what it wants. User might care but actually writing out files in
> the order they were dirtied (i.e., the order user written them) is quite
> natural so it's not a "surprising" behavior. And I don't think we can
> assume that data in those small files are more valuable than data in the
> large file and thus should be written earlier...

It could be a surprising behavior when, there is a 4GB dirty file and
100 small dirty files. The user may expect the 100 small dirty files
be synced to disk after 30s. However without nr_to_write, they could
be delayed by the 4GB file for 40 more seconds. Now if something goes
wrong in between and the small dirty files happen to include some
.c/.tex/.txt/... files. 

Small files are more likely your precious documents that are typed in
word-by-word and perhaps the most important data you want to protect.
Naturally you'll want them enjoy more priority than large files.

>   With the overhead you are right that tagging is more expensive than
> checking nr_to_write limit...

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
  2010-06-21 12:09       ` Nick Piggin
@ 2010-06-23 13:42         ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-23 13:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm

On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > > 
> > 
> > These two patches look OK to me.
> > 
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> > 
> 
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
  Nick, any luck with finding updated tests?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
@ 2010-06-23 13:42         ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-23 13:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm

On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > > 
> > 
> > These two patches look OK to me.
> > 
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz.  Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> > 
> 
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
  Nick, any luck with finding updated tests?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-08-12 22:28       ` Jan Kara
@ 2010-08-13  7:50         ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2010-08-13  7:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, Andrew Morton, npiggin, david,
	linux-mm

On Fri, Aug 13, 2010 at 12:28:58AM +0200, Jan Kara wrote:
>   And from the values in registers the loop seems to have went astray
> because "index" was zero at the point we entered the loop... looking
> around...  Ah, I see, you create files with 16TB size which creates
> radix tree of such height that radix_tree_maxindex(height) == ~0UL and
> if write_cache_pages() passes in ~0UL as end, we can overflow the index.
> Hmm, I haven't realized that is possible.
>   OK, attached is a patch that should fix the issue.

This seems to fix the case for me.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-08-13  7:50         ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2010-08-13  7:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel, Andrew Morton, npiggin, david,
	linux-mm

On Fri, Aug 13, 2010 at 12:28:58AM +0200, Jan Kara wrote:
>   And from the values in registers the loop seems to have went astray
> because "index" was zero at the point we entered the loop... looking
> around...  Ah, I see, you create files with 16TB size which creates
> radix tree of such height that radix_tree_maxindex(height) == ~0UL and
> if write_cache_pages() passes in ~0UL as end, we can overflow the index.
> Hmm, I haven't realized that is possible.
>   OK, attached is a patch that should fix the issue.

This seems to fix the case for me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-08-12 18:35     ` Christoph Hellwig
@ 2010-08-12 22:28       ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-08-12 22:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, Andrew Morton, npiggin, david, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2400 bytes --]

On Thu 12-08-10 14:35:47, Christoph Hellwig wrote:
> I have an oops with current Linus' tree in xfstests 217 that looks
> like it was caused by this patch:
  Thanks for report!

> 217 149s ...[ 5105.342605] XFS mounting filesystem vdb6
> [ 5105.373481] Ending clean XFS mount for filesystem: vdb6
> [ 5115.405061] XFS mounting filesystem loop0
> [ 5115.548654] Ending clean XFS mount for filesystem: loop0
> [ 5115.588067] BUG: unable to handle kernel paging request at f7f14000
> [ 5115.588067] IP: [<c07224fd>] radix_tree_range_tag_if_tagged+0x15d/0x1c0
> [ 5115.588067] *pde = 00007067 *pte = 00000000 
> [ 5115.588067] Oops: 0000 [#1] SMP 
> [ 5115.588067] last sysfs file:
> /sys/devices/virtual/block/loop0/removable
  We seem to oops at:
                while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
                        /*
                         * We've fully scanned this node. Go up. Because
                         * last_index is guaranteed to be in the tree, what
                         * we do below cannot wander astray.
                         */
>>>>>                   slot = open_slots[height];
                        height++;
                        shift += RADIX_TREE_MAP_SHIFT;
                }

> Entering kdb (current=0xf7868100, pid 15675) on processor 0 Oops: (null) due to oops @ 0xc07224fd
> <d>Modules linked in:
> <c>
> <d>Pid: 15675, comm: mkfs.xfs Not tainted 2.6.35+ #305 /Bochs
> <d>EIP: 0060:[<c07224fd>] EFLAGS: 00010002 CPU: 0
> EIP is at radix_tree_range_tag_if_tagged+0x15d/0x1c0
> <d>EAX: f7f14000 EBX: 00000000 ECX: 482bb4f8 EDX: 0c0748d4
> <d>ESI: 2031756d EDI: 00000000 EBP: c7d41d10 ESP: c7d41cb0
  And from the values in registers the loop seems to have went astray
because "index" was zero at the point we entered the loop... looking
around...  Ah, I see, you create files with 16TB size which creates
radix tree of such height that radix_tree_maxindex(height) == ~0UL and
if write_cache_pages() passes in ~0UL as end, we can overflow the index.
Hmm, I haven't realized that is possible.
  OK, attached is a patch that should fix the issue. There is just still an
issue that *first_indexp will overflow in this case as well and thus we
could in theory loop indefinitely. I'll have to think how to best handle
this overflow - checking in caller is kind of prone to errors...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Fix-overflow-in-radix_tree_range_tag_if_tagged.patch --]
[-- Type: text/x-patch, Size: 1048 bytes --]

>From c2095a0047822139a7f72ba6e766e94acd4affaf Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 13 Aug 2010 00:20:25 +0200
Subject: [PATCH] mm: Fix overflow in radix_tree_range_tag_if_tagged

When radix_tree_maxindex() is ~0UL, it can happen that scanning
overflows index and tree traversal code goes astray reading memory
until it hits unreadable memory. Check for overflow and exit in
that case.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 lib/radix-tree.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 549ce9c..6a6c81d 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -673,7 +673,8 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
 next:
 		/* Go to next item at level determined by 'shift' */
 		index = ((index >> shift) + 1) << shift;
-		if (index > last_index)
+		/* Overflow can happen when last_index is ~0UL... */
+		if (index > last_index || !index)
 			break;
 		if (tagged > nr_to_tag)
 			break;
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-08-12 22:28       ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-08-12 22:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, Andrew Morton, npiggin, david, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2400 bytes --]

On Thu 12-08-10 14:35:47, Christoph Hellwig wrote:
> I have an oops with current Linus' tree in xfstests 217 that looks
> like it was caused by this patch:
  Thanks for report!

> 217 149s ...[ 5105.342605] XFS mounting filesystem vdb6
> [ 5105.373481] Ending clean XFS mount for filesystem: vdb6
> [ 5115.405061] XFS mounting filesystem loop0
> [ 5115.548654] Ending clean XFS mount for filesystem: loop0
> [ 5115.588067] BUG: unable to handle kernel paging request at f7f14000
> [ 5115.588067] IP: [<c07224fd>] radix_tree_range_tag_if_tagged+0x15d/0x1c0
> [ 5115.588067] *pde = 00007067 *pte = 00000000 
> [ 5115.588067] Oops: 0000 [#1] SMP 
> [ 5115.588067] last sysfs file:
> /sys/devices/virtual/block/loop0/removable
  We seem to oops at:
                while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
                        /*
                         * We've fully scanned this node. Go up. Because
                         * last_index is guaranteed to be in the tree, what
                         * we do below cannot wander astray.
                         */
>>>>>                   slot = open_slots[height];
                        height++;
                        shift += RADIX_TREE_MAP_SHIFT;
                }

> Entering kdb (current=0xf7868100, pid 15675) on processor 0 Oops: (null) due to oops @ 0xc07224fd
> <d>Modules linked in:
> <c>
> <d>Pid: 15675, comm: mkfs.xfs Not tainted 2.6.35+ #305 /Bochs
> <d>EIP: 0060:[<c07224fd>] EFLAGS: 00010002 CPU: 0
> EIP is at radix_tree_range_tag_if_tagged+0x15d/0x1c0
> <d>EAX: f7f14000 EBX: 00000000 ECX: 482bb4f8 EDX: 0c0748d4
> <d>ESI: 2031756d EDI: 00000000 EBP: c7d41d10 ESP: c7d41cb0
  And from the values in registers the loop seems to have went astray
because "index" was zero at the point we entered the loop... looking
around...  Ah, I see, you create files with 16TB size which creates
radix tree of such height that radix_tree_maxindex(height) == ~0UL and
if write_cache_pages() passes in ~0UL as end, we can overflow the index.
Hmm, I haven't realized that is possible.
  OK, attached is a patch that should fix the issue. There is just still an
issue that *first_indexp will overflow in this case as well and thus we
could in theory loop indefinitely. I'll have to think how to best handle
this overflow - checking in caller is kind of prone to errors...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Fix-overflow-in-radix_tree_range_tag_if_tagged.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:47   ` Jan Kara
@ 2010-08-12 18:35     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2010-08-12 18:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, npiggin, david, linux-mm

I have an oops with current Linus' tree in xfstests 217 that looks
like it was caused by this patch:

217 149s ...[ 5105.342605] XFS mounting filesystem vdb6
[ 5105.373481] Ending clean XFS mount for filesystem: vdb6
[ 5115.405061] XFS mounting filesystem loop0
[ 5115.548654] Ending clean XFS mount for filesystem: loop0
[ 5115.588067] BUG: unable to handle kernel paging request at f7f14000
[ 5115.588067] IP: [<c07224fd>]
radix_tree_range_tag_if_tagged+0x15d/0x1c0
[ 5115.588067] *pde = 00007067 *pte = 00000000 
[ 5115.588067] Oops: 0000 [#1] SMP 
[ 5115.588067] last sysfs file:
/sys/devices/virtual/block/loop0/removable

Entering kdb (current=0xf7868100, pid 15675) on processor 0 Oops: (null) due to oops @ 0xc07224fd
<d>Modules linked in:
<c>
<d>Pid: 15675, comm: mkfs.xfs Not tainted 2.6.35+ #305 /Bochs
<d>EIP: 0060:[<c07224fd>] EFLAGS: 00010002 CPU: 0
EIP is at radix_tree_range_tag_if_tagged+0x15d/0x1c0
<d>EAX: f7f14000 EBX: 00000000 ECX: 482bb4f8 EDX: 0c0748d4
<d>ESI: 2031756d EDI: 00000000 EBP: c7d41d10 ESP: c7d41cb0
<d> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
<0>Process mkfs.xfs (pid: 15675, ti=c7d40000 task=f7868100
task.ti=c7d40000)
<0>Stack: ffffffff f774a1d8 f774a598 f774aa98 cfd9c598 e7c89458 00000001 00000000
<0> c01e0caf c01e0caf 00000046 00000010 00000000 e7c89568 c7d41d28 c355ad50
<0> ffffffff 00000208 00000007 c7d41cb0 00000003 ffffffff c355ad5c ffffffff
<0>Call Trace:
<0> [<c01e0caf>] ? tag_pages_for_writeback+0x1f/0xc0
<0> [<c01e0caf>] ? tag_pages_for_writeback+0x1f/0xc0
<0> [<c01e0cd3>] ? tag_pages_for_writeback+0x43/0xc0
<0> [<c01e172b>] ? write_cache_pages+0x23b/0x370
<0> [<c01e0980>] ? __writepage+0x0/0x30
<0> [<c0528979>] ? xfs_vm_writepages+0x29/0x50
[0]more> 
Only 'q' or 'Q' are processed at more prompt, input ignored
<0> [<c0528979>] ? xfs_vm_writepages+0x29/0x50
<0> [<c05296e0>] ? xfs_vm_writepage+0x0/0x630
<0> [<c01e187f>] ? generic_writepages+0x1f/0x30
<0> [<c0528992>] ? xfs_vm_writepages+0x42/0x50
<0> [<c01e18a7>] ? do_writepages+0x17/0x30
<0> [<c01da57c>] ? __filemap_fdatawrite_range+0x5c/0x70
<0> [<c01da8a6>] ? filemap_fdatawrite+0x26/0x30
<0> [<c01da8dd>] ? filemap_write_and_wait+0x2d/0x50
<0> [<c052e812>] ? xfs_flushinval_pages+0x72/0xe0
<0> [<c0506c9c>] ? xfs_ilock+0x7c/0xd0
<0> [<c052d7d7>] ? xfs_file_aio_read+0x307/0x340
<0> [<c020e69c>] ? do_sync_read+0x9c/0xd0
<0> [<c0189636>] ? up_read+0x16/0x30
<0> [<c020e957>] ? vfs_read+0x97/0x140
<0> [<c020e600>] ? do_sync_read+0x0/0xd0
<0> [<c0181c18>] ? __task_pid_nr_ns+0x88/0xd0
<0> [<c020f1c3>] ? sys_pread64+0x63/0x80
<0> [<c09cb7ed>] ? syscall_call+0x7/0xb
<0>Code: f0 83 c3 01 d3 e3 39 5d e0 72 51 8b 45 08 39 45 e4 73 49 89 d8
d3 e8 a8 3f 0f 85 47 ff ff ff 8b 75 ec 8d 04 96 90 83 c1 06 89 df <8b>
30 d3 ef 83 c2 01 83 c0 04 83 e7 3f 74 ec e9 27 ff ff ff 8b 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-08-12 18:35     ` Christoph Hellwig
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2010-08-12 18:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, npiggin, david, linux-mm

I have an oops with current Linus' tree in xfstests 217 that looks
like it was caused by this patch:

217 149s ...[ 5105.342605] XFS mounting filesystem vdb6
[ 5105.373481] Ending clean XFS mount for filesystem: vdb6
[ 5115.405061] XFS mounting filesystem loop0
[ 5115.548654] Ending clean XFS mount for filesystem: loop0
[ 5115.588067] BUG: unable to handle kernel paging request at f7f14000
[ 5115.588067] IP: [<c07224fd>]
radix_tree_range_tag_if_tagged+0x15d/0x1c0
[ 5115.588067] *pde = 00007067 *pte = 00000000 
[ 5115.588067] Oops: 0000 [#1] SMP 
[ 5115.588067] last sysfs file:
/sys/devices/virtual/block/loop0/removable

Entering kdb (current=0xf7868100, pid 15675) on processor 0 Oops: (null) due to oops @ 0xc07224fd
<d>Modules linked in:
<c>
<d>Pid: 15675, comm: mkfs.xfs Not tainted 2.6.35+ #305 /Bochs
<d>EIP: 0060:[<c07224fd>] EFLAGS: 00010002 CPU: 0
EIP is at radix_tree_range_tag_if_tagged+0x15d/0x1c0
<d>EAX: f7f14000 EBX: 00000000 ECX: 482bb4f8 EDX: 0c0748d4
<d>ESI: 2031756d EDI: 00000000 EBP: c7d41d10 ESP: c7d41cb0
<d> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
<0>Process mkfs.xfs (pid: 15675, ti=c7d40000 task=f7868100
task.ti=c7d40000)
<0>Stack: ffffffff f774a1d8 f774a598 f774aa98 cfd9c598 e7c89458 00000001 00000000
<0> c01e0caf c01e0caf 00000046 00000010 00000000 e7c89568 c7d41d28 c355ad50
<0> ffffffff 00000208 00000007 c7d41cb0 00000003 ffffffff c355ad5c ffffffff
<0>Call Trace:
<0> [<c01e0caf>] ? tag_pages_for_writeback+0x1f/0xc0
<0> [<c01e0caf>] ? tag_pages_for_writeback+0x1f/0xc0
<0> [<c01e0cd3>] ? tag_pages_for_writeback+0x43/0xc0
<0> [<c01e172b>] ? write_cache_pages+0x23b/0x370
<0> [<c01e0980>] ? __writepage+0x0/0x30
<0> [<c0528979>] ? xfs_vm_writepages+0x29/0x50
[0]more> 
Only 'q' or 'Q' are processed at more prompt, input ignored
<0> [<c0528979>] ? xfs_vm_writepages+0x29/0x50
<0> [<c05296e0>] ? xfs_vm_writepage+0x0/0x630
<0> [<c01e187f>] ? generic_writepages+0x1f/0x30
<0> [<c0528992>] ? xfs_vm_writepages+0x42/0x50
<0> [<c01e18a7>] ? do_writepages+0x17/0x30
<0> [<c01da57c>] ? __filemap_fdatawrite_range+0x5c/0x70
<0> [<c01da8a6>] ? filemap_fdatawrite+0x26/0x30
<0> [<c01da8dd>] ? filemap_write_and_wait+0x2d/0x50
<0> [<c052e812>] ? xfs_flushinval_pages+0x72/0xe0
<0> [<c0506c9c>] ? xfs_ilock+0x7c/0xd0
<0> [<c052d7d7>] ? xfs_file_aio_read+0x307/0x340
<0> [<c020e69c>] ? do_sync_read+0x9c/0xd0
<0> [<c0189636>] ? up_read+0x16/0x30
<0> [<c020e957>] ? vfs_read+0x97/0x140
<0> [<c020e600>] ? do_sync_read+0x0/0xd0
<0> [<c0181c18>] ? __task_pid_nr_ns+0x88/0xd0
<0> [<c020f1c3>] ? sys_pread64+0x63/0x80
<0> [<c09cb7ed>] ? syscall_call+0x7/0xb
<0>Code: f0 83 c3 01 d3 e3 39 5d e0 72 51 8b 45 08 39 45 e4 73 49 89 d8
d3 e8 a8 3f 0f 85 47 ff ff ff 8b 75 ec 8d 04 96 90 83 c1 06 89 df <8b>
30 d3 ef 83 c2 01 83 c0 04 83 e7 3f 74 ec e9 27 ff ff ff 8b 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-24 13:57 [PATCH 0/2 v5] Livelock avoidance for data integrity writes Jan Kara
  2010-06-24 13:57 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
@ 2010-06-24 13:57 ` Jan Kara
  1 sibling, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-24 13:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, linux-mm, npiggin, david, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to
296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same
(13 and 7 respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   70 +++++++++++++++++++++++++++++++++----------
 3 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..664674e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a4b00e9..634b8e6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..d265ef9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,41 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end (inclusive) and tags
+ * all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
+ * that write_cache_pages (or whoever calls this function) will then use
+ * TOWRITE tag to identify pages eligible for writeback.  This mechanism is
+ * used to avoid livelocking of writeback by a process steadily creating new
+ * dirty pages in the file (thus it is important for this function to be quick
+ * so that it can tag pages faster than a dirtying process can create them).
+ */
+/*
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ */
+#define WRITEBACK_TAG_BATCH 4096
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	unsigned long tagged;
+
+	do {
+		spin_lock_irq(&mapping->tree_lock);
+		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
+				&start, end, WRITEBACK_TAG_BATCH,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+		spin_unlock_irq(&mapping->tree_lock);
+		WARN_ON_ONCE(tagged > WRITEBACK_TAG_BATCH);
+		cond_resched();
+	} while (tagged >= WRITEBACK_TAG_BATCH);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +855,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -835,6 +877,7 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -851,29 +894,19 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
-
-		/*
-		 * If this is a data integrity sync, cap the writeback to the
-		 * current end of file. Any extension to the file that occurs
-		 * after this is a new write and we don't need to write those
-		 * pages out to fulfil our data integrity requirements. If we
-		 * try to write them out, we can get stuck in this scan until
-		 * the concurrent writer stops adding dirty pages and extending
-		 * EOF.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL &&
-		    wbc->range_end == LLONG_MAX) {
-			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-		}
 	}
-
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1329,6 +1362,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-24 13:57 [PATCH 0/2 v5] Livelock avoidance for data integrity writes Jan Kara
@ 2010-06-24 13:57 ` Jan Kara
  2010-06-24 13:57 ` Jan Kara
  1 sibling, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-24 13:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: inux-fsdevel, linux-mm, npiggin, david, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Note: Adding new radix tree tag grows radix tree node from 288 to
296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same
(13 and 7 respectively).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   70 +++++++++++++++++++++++++++++++++----------
 3 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..664674e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a4b00e9..634b8e6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..d265ef9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,41 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end (inclusive) and tags
+ * all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
+ * that write_cache_pages (or whoever calls this function) will then use
+ * TOWRITE tag to identify pages eligible for writeback.  This mechanism is
+ * used to avoid livelocking of writeback by a process steadily creating new
+ * dirty pages in the file (thus it is important for this function to be quick
+ * so that it can tag pages faster than a dirtying process can create them).
+ */
+/*
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ */
+#define WRITEBACK_TAG_BATCH 4096
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	unsigned long tagged;
+
+	do {
+		spin_lock_irq(&mapping->tree_lock);
+		tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
+				&start, end, WRITEBACK_TAG_BATCH,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+		spin_unlock_irq(&mapping->tree_lock);
+		WARN_ON_ONCE(tagged > WRITEBACK_TAG_BATCH);
+		cond_resched();
+	} while (tagged >= WRITEBACK_TAG_BATCH);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +855,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -835,6 +877,7 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -851,29 +894,19 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
-
-		/*
-		 * If this is a data integrity sync, cap the writeback to the
-		 * current end of file. Any extension to the file that occurs
-		 * after this is a new write and we don't need to write those
-		 * pages out to fulfil our data integrity requirements. If we
-		 * try to write them out, we can get stuck in this scan until
-		 * the concurrent writer stops adding dirty pages and extending
-		 * EOF.
-		 */
-		if (wbc->sync_mode == WB_SYNC_ALL &&
-		    wbc->range_end == LLONG_MAX) {
-			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-		}
 	}
-
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1329,6 +1362,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-09 23:45   ` Andrew Morton
@ 2010-06-10 12:42     ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-10 12:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, npiggin, david, linux-mm

On Wed 09-06-10 16:45:33, Andrew Morton wrote:
> On Fri,  4 Jun 2010 20:40:54 +0200
> Jan Kara <jack@suse.cz> wrote:
> 
> > -#define RADIX_TREE_MAX_TAGS 2
> > +#define RADIX_TREE_MAX_TAGS 3
> 
> Adds another eight bytes to the radix_tree_node, I think.  What effect
> does this have upon the radix_tree_node_cachep packing for sl[aeiou]b? 
> Please add to changelog if you can work it out ;).
  The sizes of structure are:
32-bit: 288 vs 296
64-bit: 552 vs 560
  I have now checked (running different kernels because I wasn't sure the
computations I do are right) and that gives 7 objects per page with SLAB
and SLUB on a 64-bit kernel. I'll try to get also SLOB numbers for 64-bit
and possibly numbers for 32-bit archs (although it gets a bit tiring to try
all the kernels ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-09 23:41   ` Andrew Morton
@ 2010-06-10 12:31     ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-10 12:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, npiggin, david, linux-mm

On Wed 09-06-10 16:41:15, Andrew Morton wrote:
> On Fri,  4 Jun 2010 20:40:54 +0200
> Jan Kara <jack@suse.cz> wrote:
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index efdfb07..f7ebff8 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  
> >  /*** radix-tree API starts here ***/
> >  
> > -#define RADIX_TREE_MAX_TAGS 2
> > +#define RADIX_TREE_MAX_TAGS 3
> >  
> >  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
> >  struct radix_tree_root {
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index b289310..f590a12 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
> >  }
> >  
> >  /**
> > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> > + * @mapping: address space structure to write
> > + * @start: starting page index
> > + * @end: ending page index (inclusive)
> > + *
> > + * This function scans the page range from @start to @end
> 
> I'd add "inclusive" here as well.  Add it everywhere, very carefully
> (or "exclusive").  it really is a minefield and we've had off-by-ones
> from this before
  Done.

> > and tags all pages
> > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> > + * to identify pages eligible for writeback.  This mechanism is used to avoid
> > + * livelocking of writeback by a process steadily creating new dirty pages in
> > + * the file (thus it is important for this function to be damn quick so that it
> > + * can tag pages faster than a dirtying process can create them).
> > + */
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> > +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +}
> > +EXPORT_SYMBOL(tag_pages_for_writeback);
> 
> Problem.  For how long can this disable interrupts?  Maybe 1TB of dirty
> pagecache before the NMI watchdog starts getting involved?  Could be a
> problem in some situations.
> 
> Easy enough to fix - just walk the range in 1000(?) page hunks,
> dropping the lock and doing cond_resched() each time. 
> radix_tree_gang_tag_if_tagged() will need to return next_index to make
> that efficient with sparse files.
  Yes, Nick had a similar objection. I've already fixed it in my tree the
way you suggest.

> > +/**
> >   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
> >   * @mapping: address space structure to write
> >   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> > @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
> >   * the call was made get new I/O started against them.  If wbc->sync_mode is
> >   * WB_SYNC_ALL then we were called for data integrity and we must wait for
> >   * existing IO to complete.
> > + *
> > + * To avoid livelocks (when other process dirties new pages), we first tag
> > + * pages which should be written back with TOWRITE tag and only then start
> > + * writing them. For data-integrity sync we have to be careful so that we do
> > + * not miss some pages (e.g., because some other process has cleared TOWRITE
> > + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> > + * by the process clearing the DIRTY tag (and submitting the page for IO).
> >   */
> 
> Seems simple enough and I can't think of any bugs which the obvious
> races will cause.
  Thanks for looking into the patches!

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-07 16:09       ` Jan Kara
  (?)
  (?)
@ 2010-06-10  8:12       ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-10  8:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Mon 07-06-10 18:09:03, Jan Kara wrote:
> On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> > On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > >  	done_index = index;
> > >  	while (!done && (index <= end)) {
> > >  		int i;
> > >  
> > > -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> > > -			      PAGECACHE_TAG_DIRTY,
> > > +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
> > >  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> > >  		if (nr_pages == 0)
> > >  			break;
> > 
> > Would it be neat to clear the tag even if we didn't set page to
> > writeback? It should be uncommon case.
>   Yeah, why not.
  Looking at this more, we shouldn't leave any TOWRITE tags dangling in
WB_SYNC_ALL mode - as soon as someone writes the page, he does
set_page_writeback() which clears the tag. Similarly if the page is removed
from the mapping, the tag is cleared. Or am I missing something?

> > > @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
> > >  			radix_tree_tag_clear(&mapping->page_tree,
> > >  						page_index(page),
> > >  						PAGECACHE_TAG_DIRTY);
> > > +		radix_tree_tag_clear(&mapping->page_tree,
> > > +				     page_index(page),
> > > +				     PAGECACHE_TAG_TOWRITE);
> > >  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> > >  	} else {
> > >  		ret = TestSetPageWriteback(page);
> > 
> > It would be nice to have bitwise tag clearing so we can clear multiple
> > at once. Then
> > 
> > clear_tag = PAGECACHE_TAG_TOWRITE;
> > if (!PageDirty(page))
> >   clear_tag |= PAGECACHE_TAG_DIRTY;
> > 
> > That could reduce overhead a bit more.
>   Good idea. Will do.
  On a second thought, will it bring us enough to justify a new interface
(which will be inconsistent with all the other radix tree interfaces
because they use tag numbers and not bitmaps)? Because looking at the code,
all we could save is the transformation of page index into a radix tree
path.  We would have to do all the other work for each tag separately
anyway and it won't probably have any great cache locality either because
radix trees for different tags are separate.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
  2010-06-09 23:41   ` Andrew Morton
@ 2010-06-09 23:45   ` Andrew Morton
  2010-06-10 12:42     ` Jan Kara
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2010-06-09 23:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, npiggin, david, linux-mm

On Fri,  4 Jun 2010 20:40:54 +0200
Jan Kara <jack@suse.cz> wrote:

> -#define RADIX_TREE_MAX_TAGS 2
> +#define RADIX_TREE_MAX_TAGS 3

Adds another eight bytes to the radix_tree_node, I think.  What effect
does this have upon the radix_tree_node_cachep packing for sl[aeiou]b? 
Please add to changelog if you can work it out ;).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
@ 2010-06-09 23:41   ` Andrew Morton
  2010-06-10 12:31     ` Jan Kara
  2010-06-09 23:45   ` Andrew Morton
  1 sibling, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2010-06-09 23:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, npiggin, david, linux-mm

On Fri,  4 Jun 2010 20:40:54 +0200
Jan Kara <jack@suse.cz> wrote:

> We try to avoid livelocks of writeback when some steadily creates
> dirty pages in a mapping we are writing out. For memory-cleaning
> writeback, using nr_to_write works reasonably well but we cannot
> really use it for data integrity writeback. This patch tries to
> solve the problem.
> 
> The idea is simple: Tag all pages that should be written back
> with a special tag (TOWRITE) in the radix tree. This can be done
> rather quickly and thus livelocks should not happen in practice.
> Then we start doing the hard work of locking pages and sending
> them to disk only for those pages that have TOWRITE tag set.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/fs.h         |    1 +
>  include/linux/radix-tree.h |    2 +-
>  mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3428393..fe308f0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -685,6 +685,7 @@ struct block_device {
>   */
>  #define PAGECACHE_TAG_DIRTY	0
>  #define PAGECACHE_TAG_WRITEBACK	1
> +#define PAGECACHE_TAG_TOWRITE	2
>  
>  int mapping_tagged(struct address_space *mapping, int tag);
>  
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index efdfb07..f7ebff8 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
>  
>  /*** radix-tree API starts here ***/
>  
> -#define RADIX_TREE_MAX_TAGS 2
> +#define RADIX_TREE_MAX_TAGS 3
>  
>  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
>  struct radix_tree_root {
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b289310..f590a12 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
>  }
>  
>  /**
> + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> + * @mapping: address space structure to write
> + * @start: starting page index
> + * @end: ending page index (inclusive)
> + *
> + * This function scans the page range from @start to @end

I'd add "inclusive" here as well.  Add it everywhere, very carefully
(or "exclusive").  it really is a minefield and we've had off-by-ones
from this before


> and tags all pages
> + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> + * to identify pages eligible for writeback.  This mechanism is used to avoid
> + * livelocking of writeback by a process steadily creating new dirty pages in
> + * the file (thus it is important for this function to be damn quick so that it
> + * can tag pages faster than a dirtying process can create them).
> + */
> +void tag_pages_for_writeback(struct address_space *mapping,
> +			     pgoff_t start, pgoff_t end)
> +{
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +}
> +EXPORT_SYMBOL(tag_pages_for_writeback);

Problem.  For how long can this disable interrupts?  Maybe 1TB of dirty
pagecache before the NMI watchdog starts getting involved?  Could be a
problem in some situations.

Easy enough to fix - just walk the range in 1000(?) page hunks,
dropping the lock and doing cond_resched() each time. 
radix_tree_gang_tag_if_tagged() will need to return next_index to make
that efficient with sparse files.

> +/**
>   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
>   * @mapping: address space structure to write
>   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
>   * the call was made get new I/O started against them.  If wbc->sync_mode is
>   * WB_SYNC_ALL then we were called for data integrity and we must wait for
>   * existing IO to complete.
> + *
> + * To avoid livelocks (when other process dirties new pages), we first tag
> + * pages which should be written back with TOWRITE tag and only then start
> + * writing them. For data-integrity sync we have to be careful so that we do
> + * not miss some pages (e.g., because some other process has cleared TOWRITE
> + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> + * by the process clearing the DIRTY tag (and submitting the page for IO).
>   */

Seems simple enough and I can't think of any bugs which the obvious
races will cause.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-08  5:29       ` Nick Piggin
@ 2010-06-09 13:04           ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-09 13:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Tue 08-06-10 15:29:37, Nick Piggin wrote:
> On Mon, Jun 07, 2010 at 06:09:03PM +0200, Jan Kara wrote:
> > On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> > > On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > > > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > > > +		tag_pages_for_writeback(mapping, index, end);
> > > 
> > > I wonder if this is too much spinlock latency in a huge dirty file?
> > > Some kid of batching of the operation perhaps would be good?
> >   You mean like copy tags for 4096 pages, then cond_resched the spin lock
> > and continue? That should be doable but it will give tasks that try to
> > livelock us more time (i.e. if there were 4096 tasks creating dirty pages
> > than probably they would be able to livelock us, won't they? Maybe we don't
> > care?).
> 
> Not 100% sure. I think that if we've got the inode in I_SYNC state, it
> should stop cleaning and dirtiers will get throttled.
> 
> Even if writeback was able to continue on that inode, it would be a big
> achievement to dirty then clean pages as fast as we are able to tag them
> in batches of 4096 :)
  In practice, you are probably right that the writers will eventually get
throttled if they were aggressive enough to dirty lots of pages while
we cond_resched the lock...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-09 13:04           ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-09 13:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Tue 08-06-10 15:29:37, Nick Piggin wrote:
> On Mon, Jun 07, 2010 at 06:09:03PM +0200, Jan Kara wrote:
> > On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> > > On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > > > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > > > +		tag_pages_for_writeback(mapping, index, end);
> > > 
> > > I wonder if this is too much spinlock latency in a huge dirty file?
> > > Some kid of batching of the operation perhaps would be good?
> >   You mean like copy tags for 4096 pages, then cond_resched the spin lock
> > and continue? That should be doable but it will give tasks that try to
> > livelock us more time (i.e. if there were 4096 tasks creating dirty pages
> > than probably they would be able to livelock us, won't they? Maybe we don't
> > care?).
> 
> Not 100% sure. I think that if we've got the inode in I_SYNC state, it
> should stop cleaning and dirtiers will get throttled.
> 
> Even if writeback was able to continue on that inode, it would be a big
> achievement to dirty then clean pages as fast as we are able to tag them
> in batches of 4096 :)
  In practice, you are probably right that the writers will eventually get
throttled if they were aggressive enough to dirty lots of pages while
we cond_resched the lock...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-07 16:09       ` Jan Kara
  (?)
@ 2010-06-08  5:29       ` Nick Piggin
  2010-06-09 13:04           ` Jan Kara
  -1 siblings, 1 reply; 52+ messages in thread
From: Nick Piggin @ 2010-06-08  5:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, david, linux-mm

On Mon, Jun 07, 2010 at 06:09:03PM +0200, Jan Kara wrote:
> On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> > On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > > +		tag_pages_for_writeback(mapping, index, end);
> > 
> > I wonder if this is too much spinlock latency in a huge dirty file?
> > Some kid of batching of the operation perhaps would be good?
>   You mean like copy tags for 4096 pages, then cond_resched the spin lock
> and continue? That should be doable but it will give tasks that try to
> livelock us more time (i.e. if there were 4096 tasks creating dirty pages
> than probably they would be able to livelock us, won't they? Maybe we don't
> care?).

Not 100% sure. I think that if we've got the inode in I_SYNC state, it
should stop cleaning and dirtiers will get throttled.

Even if writeback was able to continue on that inode, it would be a big
achievement to dirty then clean pages as fast as we are able to tag them
in batches of 4096 :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-05  1:38     ` Nick Piggin
@ 2010-06-07 16:09       ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-07 16:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > We try to avoid livelocks of writeback when some steadily creates
> > dirty pages in a mapping we are writing out. For memory-cleaning
> > writeback, using nr_to_write works reasonably well but we cannot
> > really use it for data integrity writeback. This patch tries to
> > solve the problem.
> > 
> > The idea is simple: Tag all pages that should be written back
> > with a special tag (TOWRITE) in the radix tree. This can be done
> > rather quickly and thus livelocks should not happen in practice.
> > Then we start doing the hard work of locking pages and sending
> > them to disk only for those pages that have TOWRITE tag set.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/fs.h         |    1 +
> >  include/linux/radix-tree.h |    2 +-
> >  mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 44 insertions(+), 3 deletions(-)
...
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index b289310..f590a12 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
> >  }
> >  
> >  /**
> > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> > + * @mapping: address space structure to write
> > + * @start: starting page index
> > + * @end: ending page index (inclusive)
> > + *
> > + * This function scans the page range from @start to @end and tags all pages
> > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> > + * to identify pages eligible for writeback.  This mechanism is used to avoid
> > + * livelocking of writeback by a process steadily creating new dirty pages in
> > + * the file (thus it is important for this function to be damn quick so that it
> > + * can tag pages faster than a dirtying process can create them).
> > + */
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> > +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +}
> > +EXPORT_SYMBOL(tag_pages_for_writeback);
> > +
> > +/**
> >   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
> >   * @mapping: address space structure to write
> >   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> > @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
> >   * the call was made get new I/O started against them.  If wbc->sync_mode is
> >   * WB_SYNC_ALL then we were called for data integrity and we must wait for
> >   * existing IO to complete.
> > + *
> > + * To avoid livelocks (when other process dirties new pages), we first tag
> > + * pages which should be written back with TOWRITE tag and only then start
> > + * writing them. For data-integrity sync we have to be careful so that we do
> > + * not miss some pages (e.g., because some other process has cleared TOWRITE
> > + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> > + * by the process clearing the DIRTY tag (and submitting the page for IO).
> >   */
> >  int write_cache_pages(struct address_space *mapping,
> >  		      struct writeback_control *wbc, writepage_t writepage,
> > @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
> >  	int cycled;
> >  	int range_whole = 0;
> >  	long nr_to_write = wbc->nr_to_write;
> > +	int tag;
> >  
> >  	pagevec_init(&pvec, 0);
> >  	if (wbc->range_cyclic) {
> > @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
> >  			range_whole = 1;
> >  		cycled = 1; /* ignore range_cyclic tests */
> >  	}
> > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > +		tag = PAGECACHE_TAG_TOWRITE;
> > +	else
> > +		tag = PAGECACHE_TAG_DIRTY;
> >  retry:
> > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > +		tag_pages_for_writeback(mapping, index, end);
> 
> I wonder if this is too much spinlock latency in a huge dirty file?
> Some kid of batching of the operation perhaps would be good?
  You mean like copy tags for 4096 pages, then cond_resched the spin lock
and continue? That should be doable but it will give tasks that try to
livelock us more time (i.e. if there were 4096 tasks creating dirty pages
than probably they would be able to livelock us, won't they? Maybe we don't
care?).

> >  	done_index = index;
> >  	while (!done && (index <= end)) {
> >  		int i;
> >  
> > -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> > -			      PAGECACHE_TAG_DIRTY,
> > +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
> >  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> >  		if (nr_pages == 0)
> >  			break;
> 
> Would it be neat to clear the tag even if we didn't set page to
> writeback? It should be uncommon case.
  Yeah, why not.

> > @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
> >  			radix_tree_tag_clear(&mapping->page_tree,
> >  						page_index(page),
> >  						PAGECACHE_TAG_DIRTY);
> > +		radix_tree_tag_clear(&mapping->page_tree,
> > +				     page_index(page),
> > +				     PAGECACHE_TAG_TOWRITE);
> >  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> >  	} else {
> >  		ret = TestSetPageWriteback(page);
> 
> It would be nice to have bitwise tag clearing so we can clear multiple
> at once. Then
> 
> clear_tag = PAGECACHE_TAG_TOWRITE;
> if (!PageDirty(page))
>   clear_tag |= PAGECACHE_TAG_DIRTY;
> 
> That could reduce overhead a bit more.
  Good idea. Will do.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-07 16:09       ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-07 16:09 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Jan Kara, linux-fsdevel, Andrew Morton, david, linux-mm

On Sat 05-06-10 11:38:02, Nick Piggin wrote:
> On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> > We try to avoid livelocks of writeback when some steadily creates
> > dirty pages in a mapping we are writing out. For memory-cleaning
> > writeback, using nr_to_write works reasonably well but we cannot
> > really use it for data integrity writeback. This patch tries to
> > solve the problem.
> > 
> > The idea is simple: Tag all pages that should be written back
> > with a special tag (TOWRITE) in the radix tree. This can be done
> > rather quickly and thus livelocks should not happen in practice.
> > Then we start doing the hard work of locking pages and sending
> > them to disk only for those pages that have TOWRITE tag set.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  include/linux/fs.h         |    1 +
> >  include/linux/radix-tree.h |    2 +-
> >  mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 44 insertions(+), 3 deletions(-)
...
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index b289310..f590a12 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
> >  }
> >  
> >  /**
> > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> > + * @mapping: address space structure to write
> > + * @start: starting page index
> > + * @end: ending page index (inclusive)
> > + *
> > + * This function scans the page range from @start to @end and tags all pages
> > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> > + * to identify pages eligible for writeback.  This mechanism is used to avoid
> > + * livelocking of writeback by a process steadily creating new dirty pages in
> > + * the file (thus it is important for this function to be damn quick so that it
> > + * can tag pages faster than a dirtying process can create them).
> > + */
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> > +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +}
> > +EXPORT_SYMBOL(tag_pages_for_writeback);
> > +
> > +/**
> >   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
> >   * @mapping: address space structure to write
> >   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> > @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
> >   * the call was made get new I/O started against them.  If wbc->sync_mode is
> >   * WB_SYNC_ALL then we were called for data integrity and we must wait for
> >   * existing IO to complete.
> > + *
> > + * To avoid livelocks (when other process dirties new pages), we first tag
> > + * pages which should be written back with TOWRITE tag and only then start
> > + * writing them. For data-integrity sync we have to be careful so that we do
> > + * not miss some pages (e.g., because some other process has cleared TOWRITE
> > + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> > + * by the process clearing the DIRTY tag (and submitting the page for IO).
> >   */
> >  int write_cache_pages(struct address_space *mapping,
> >  		      struct writeback_control *wbc, writepage_t writepage,
> > @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
> >  	int cycled;
> >  	int range_whole = 0;
> >  	long nr_to_write = wbc->nr_to_write;
> > +	int tag;
> >  
> >  	pagevec_init(&pvec, 0);
> >  	if (wbc->range_cyclic) {
> > @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
> >  			range_whole = 1;
> >  		cycled = 1; /* ignore range_cyclic tests */
> >  	}
> > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > +		tag = PAGECACHE_TAG_TOWRITE;
> > +	else
> > +		tag = PAGECACHE_TAG_DIRTY;
> >  retry:
> > +	if (wbc->sync_mode == WB_SYNC_ALL)
> > +		tag_pages_for_writeback(mapping, index, end);
> 
> I wonder if this is too much spinlock latency in a huge dirty file?
> Some kid of batching of the operation perhaps would be good?
  You mean like copy tags for 4096 pages, then cond_resched the spin lock
and continue? That should be doable but it will give tasks that try to
livelock us more time (i.e. if there were 4096 tasks creating dirty pages
than probably they would be able to livelock us, won't they? Maybe we don't
care?).

> >  	done_index = index;
> >  	while (!done && (index <= end)) {
> >  		int i;
> >  
> > -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> > -			      PAGECACHE_TAG_DIRTY,
> > +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
> >  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> >  		if (nr_pages == 0)
> >  			break;
> 
> Would it be neat to clear the tag even if we didn't set page to
> writeback? It should be uncommon case.
  Yeah, why not.

> > @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
> >  			radix_tree_tag_clear(&mapping->page_tree,
> >  						page_index(page),
> >  						PAGECACHE_TAG_DIRTY);
> > +		radix_tree_tag_clear(&mapping->page_tree,
> > +				     page_index(page),
> > +				     PAGECACHE_TAG_TOWRITE);
> >  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> >  	} else {
> >  		ret = TestSetPageWriteback(page);
> 
> It would be nice to have bitwise tag clearing so we can clear multiple
> at once. Then
> 
> clear_tag = PAGECACHE_TAG_TOWRITE;
> if (!PageDirty(page))
>   clear_tag |= PAGECACHE_TAG_DIRTY;
> 
> That could reduce overhead a bit more.
  Good idea. Will do.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:47   ` Jan Kara
@ 2010-06-05  1:38     ` Nick Piggin
  -1 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-05  1:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, david, linux-mm

On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> We try to avoid livelocks of writeback when some steadily creates
> dirty pages in a mapping we are writing out. For memory-cleaning
> writeback, using nr_to_write works reasonably well but we cannot
> really use it for data integrity writeback. This patch tries to
> solve the problem.
> 
> The idea is simple: Tag all pages that should be written back
> with a special tag (TOWRITE) in the radix tree. This can be done
> rather quickly and thus livelocks should not happen in practice.
> Then we start doing the hard work of locking pages and sending
> them to disk only for those pages that have TOWRITE tag set.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/fs.h         |    1 +
>  include/linux/radix-tree.h |    2 +-
>  mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3428393..fe308f0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -685,6 +685,7 @@ struct block_device {
>   */
>  #define PAGECACHE_TAG_DIRTY	0
>  #define PAGECACHE_TAG_WRITEBACK	1
> +#define PAGECACHE_TAG_TOWRITE	2
>  
>  int mapping_tagged(struct address_space *mapping, int tag);
>  
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index efdfb07..f7ebff8 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
>  
>  /*** radix-tree API starts here ***/
>  
> -#define RADIX_TREE_MAX_TAGS 2
> +#define RADIX_TREE_MAX_TAGS 3
>  
>  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
>  struct radix_tree_root {
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b289310..f590a12 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
>  }
>  
>  /**
> + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> + * @mapping: address space structure to write
> + * @start: starting page index
> + * @end: ending page index (inclusive)
> + *
> + * This function scans the page range from @start to @end and tags all pages
> + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> + * to identify pages eligible for writeback.  This mechanism is used to avoid
> + * livelocking of writeback by a process steadily creating new dirty pages in
> + * the file (thus it is important for this function to be damn quick so that it
> + * can tag pages faster than a dirtying process can create them).
> + */
> +void tag_pages_for_writeback(struct address_space *mapping,
> +			     pgoff_t start, pgoff_t end)
> +{
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +}
> +EXPORT_SYMBOL(tag_pages_for_writeback);
> +
> +/**
>   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
>   * @mapping: address space structure to write
>   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
>   * the call was made get new I/O started against them.  If wbc->sync_mode is
>   * WB_SYNC_ALL then we were called for data integrity and we must wait for
>   * existing IO to complete.
> + *
> + * To avoid livelocks (when other process dirties new pages), we first tag
> + * pages which should be written back with TOWRITE tag and only then start
> + * writing them. For data-integrity sync we have to be careful so that we do
> + * not miss some pages (e.g., because some other process has cleared TOWRITE
> + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> + * by the process clearing the DIRTY tag (and submitting the page for IO).
>   */
>  int write_cache_pages(struct address_space *mapping,
>  		      struct writeback_control *wbc, writepage_t writepage,
> @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int tag;
>  
>  	pagevec_init(&pvec, 0);
>  	if (wbc->range_cyclic) {
> @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
>  			range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +	if (wbc->sync_mode == WB_SYNC_ALL)
> +		tag = PAGECACHE_TAG_TOWRITE;
> +	else
> +		tag = PAGECACHE_TAG_DIRTY;
>  retry:
> +	if (wbc->sync_mode == WB_SYNC_ALL)
> +		tag_pages_for_writeback(mapping, index, end);

I wonder if this is too much spinlock latency in a huge dirty file?
Some kid of batching of the operation perhaps would be good?


>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>  
> -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;

Would it be neat to clear the tag even if we didn't set page to
writeback? It should be uncommon case.


> @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
>  			radix_tree_tag_clear(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_DIRTY);
> +		radix_tree_tag_clear(&mapping->page_tree,
> +				     page_index(page),
> +				     PAGECACHE_TAG_TOWRITE);
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
>  		ret = TestSetPageWriteback(page);

It would be nice to have bitwise tag clearing so we can clear multiple
at once. Then

clear_tag = PAGECACHE_TAG_TOWRITE;
if (!PageDirty(page))
  clear_tag |= PAGECACHE_TAG_DIRTY;

That could reduce overhead a bit more.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-05  1:38     ` Nick Piggin
  0 siblings, 0 replies; 52+ messages in thread
From: Nick Piggin @ 2010-06-05  1:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Andrew Morton, david, linux-mm

On Fri, Jun 04, 2010 at 08:47:11PM +0200, Jan Kara wrote:
> We try to avoid livelocks of writeback when some steadily creates
> dirty pages in a mapping we are writing out. For memory-cleaning
> writeback, using nr_to_write works reasonably well but we cannot
> really use it for data integrity writeback. This patch tries to
> solve the problem.
> 
> The idea is simple: Tag all pages that should be written back
> with a special tag (TOWRITE) in the radix tree. This can be done
> rather quickly and thus livelocks should not happen in practice.
> Then we start doing the hard work of locking pages and sending
> them to disk only for those pages that have TOWRITE tag set.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/fs.h         |    1 +
>  include/linux/radix-tree.h |    2 +-
>  mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3428393..fe308f0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -685,6 +685,7 @@ struct block_device {
>   */
>  #define PAGECACHE_TAG_DIRTY	0
>  #define PAGECACHE_TAG_WRITEBACK	1
> +#define PAGECACHE_TAG_TOWRITE	2
>  
>  int mapping_tagged(struct address_space *mapping, int tag);
>  
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index efdfb07..f7ebff8 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
>  
>  /*** radix-tree API starts here ***/
>  
> -#define RADIX_TREE_MAX_TAGS 2
> +#define RADIX_TREE_MAX_TAGS 3
>  
>  /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
>  struct radix_tree_root {
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b289310..f590a12 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -807,6 +807,30 @@ void __init page_writeback_init(void)
>  }
>  
>  /**
> + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> + * @mapping: address space structure to write
> + * @start: starting page index
> + * @end: ending page index (inclusive)
> + *
> + * This function scans the page range from @start to @end and tags all pages
> + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> + * to identify pages eligible for writeback.  This mechanism is used to avoid
> + * livelocking of writeback by a process steadily creating new dirty pages in
> + * the file (thus it is important for this function to be damn quick so that it
> + * can tag pages faster than a dirtying process can create them).
> + */
> +void tag_pages_for_writeback(struct address_space *mapping,
> +			     pgoff_t start, pgoff_t end)
> +{
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
> +				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +}
> +EXPORT_SYMBOL(tag_pages_for_writeback);
> +
> +/**
>   * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
>   * @mapping: address space structure to write
>   * @wbc: subtract the number of written pages from *@wbc->nr_to_write
> @@ -820,6 +844,13 @@ void __init page_writeback_init(void)
>   * the call was made get new I/O started against them.  If wbc->sync_mode is
>   * WB_SYNC_ALL then we were called for data integrity and we must wait for
>   * existing IO to complete.
> + *
> + * To avoid livelocks (when other process dirties new pages), we first tag
> + * pages which should be written back with TOWRITE tag and only then start
> + * writing them. For data-integrity sync we have to be careful so that we do
> + * not miss some pages (e.g., because some other process has cleared TOWRITE
> + * tag we set). The rule we follow is that TOWRITE tag can be cleared only
> + * by the process clearing the DIRTY tag (and submitting the page for IO).
>   */
>  int write_cache_pages(struct address_space *mapping,
>  		      struct writeback_control *wbc, writepage_t writepage,
> @@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
>  	int cycled;
>  	int range_whole = 0;
>  	long nr_to_write = wbc->nr_to_write;
> +	int tag;
>  
>  	pagevec_init(&pvec, 0);
>  	if (wbc->range_cyclic) {
> @@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
>  			range_whole = 1;
>  		cycled = 1; /* ignore range_cyclic tests */
>  	}
> +	if (wbc->sync_mode == WB_SYNC_ALL)
> +		tag = PAGECACHE_TAG_TOWRITE;
> +	else
> +		tag = PAGECACHE_TAG_DIRTY;
>  retry:
> +	if (wbc->sync_mode == WB_SYNC_ALL)
> +		tag_pages_for_writeback(mapping, index, end);

I wonder if this is too much spinlock latency in a huge dirty file?
Some kid of batching of the operation perhaps would be good?


>  	done_index = index;
>  	while (!done && (index <= end)) {
>  		int i;
>  
> -		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> -			      PAGECACHE_TAG_DIRTY,
> +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
>  			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
>  		if (nr_pages == 0)
>  			break;

Would it be neat to clear the tag even if we didn't set page to
writeback? It should be uncommon case.


> @@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
>  			radix_tree_tag_clear(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_DIRTY);
> +		radix_tree_tag_clear(&mapping->page_tree,
> +				     page_index(page),
> +				     PAGECACHE_TAG_TOWRITE);
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
>  	} else {
>  		ret = TestSetPageWriteback(page);

It would be nice to have bitwise tag clearing so we can clear multiple
at once. Then

clear_tag = PAGECACHE_TAG_TOWRITE;
if (!PageDirty(page))
  clear_tag |= PAGECACHE_TAG_DIRTY;

That could reduce overhead a bit more.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:47 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
@ 2010-06-04 18:47   ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-04 18:47 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3428393..fe308f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index efdfb07..f7ebff8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b289310..f590a12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,30 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end and tags all pages
+ * that have DIRTY tag set with a special TOWRITE tag. The idea is that
+ * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
+ * to identify pages eligible for writeback.  This mechanism is used to avoid
+ * livelocking of writeback by a process steadily creating new dirty pages in
+ * the file (thus it is important for this function to be damn quick so that it
+ * can tag pages faster than a dirtying process can create them).
+ */
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +844,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
@ 2010-06-04 18:47   ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-04 18:47 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3428393..fe308f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index efdfb07..f7ebff8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b289310..f590a12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,30 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end and tags all pages
+ * that have DIRTY tag set with a special TOWRITE tag. The idea is that
+ * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
+ * to identify pages eligible for writeback.  This mechanism is used to avoid
+ * livelocking of writeback by a process steadily creating new dirty pages in
+ * the file (thus it is important for this function to be damn quick so that it
+ * can tag pages faster than a dirtying process can create them).
+ */
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +844,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
  2010-06-04 18:40 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
@ 2010-06-04 18:40 ` Jan Kara
  2010-06-09 23:41   ` Andrew Morton
  2010-06-09 23:45   ` Andrew Morton
  0 siblings, 2 replies; 52+ messages in thread
From: Jan Kara @ 2010-06-04 18:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Andrew Morton, npiggin, david, linux-mm, Jan Kara

We try to avoid livelocks of writeback when some steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.

The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   44 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3428393..fe308f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index efdfb07..f7ebff8 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b289310..f590a12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,30 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end and tags all pages
+ * that have DIRTY tag set with a special TOWRITE tag. The idea is that
+ * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
+ * to identify pages eligible for writeback.  This mechanism is used to avoid
+ * livelocking of writeback by a process steadily creating new dirty pages in
+ * the file (thus it is important for this function to be damn quick so that it
+ * can tag pages faster than a dirtying process can create them).
+ */
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_gang_tag_if_tagged(&mapping->page_tree, start, end,
+				PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +844,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -836,6 +867,7 @@ int write_cache_pages(struct address_space *mapping,
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int tag;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -853,13 +885,18 @@ int write_cache_pages(struct address_space *mapping,
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
 	}
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag = PAGECACHE_TAG_TOWRITE;
+	else
+		tag = PAGECACHE_TAG_DIRTY;
 retry:
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
-		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -1319,6 +1356,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.4.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2010-08-13  7:51 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-16 16:33 (unknown), Jan Kara
2010-06-16 16:33 ` Jan Kara
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
2010-06-16 16:33   ` Jan Kara
2010-06-18 22:18   ` Andrew Morton
2010-06-21 12:09     ` Nick Piggin
2010-06-21 12:09       ` Nick Piggin
2010-06-21 22:43       ` Jan Kara
2010-06-21 22:43         ` Jan Kara
2010-06-23 13:42       ` Jan Kara
2010-06-23 13:42         ` Jan Kara
2010-06-16 16:33 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-16 16:33   ` Jan Kara
2010-06-18 22:21   ` Andrew Morton
2010-06-21 12:42     ` Jan Kara
2010-06-21 12:42       ` Jan Kara
2010-06-16 22:15 ` your mail Dave Chinner
2010-06-17  7:43   ` [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes Jan Kara
2010-06-17  7:43     ` Jan Kara
2010-06-18  6:11     ` Dave Chinner
2010-06-18  7:01       ` Nick Piggin
2010-06-18  7:01         ` Nick Piggin
2010-06-17  9:11 ` Jan Kara
2010-06-17  9:11   ` Jan Kara
2010-06-22  2:59 ` your mail Wu Fengguang
2010-06-22  2:59   ` Wu Fengguang
2010-06-22 13:54   ` Jan Kara
2010-06-22 13:54     ` Jan Kara
2010-06-22 14:12     ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2010-06-24 13:57 [PATCH 0/2 v5] Livelock avoidance for data integrity writes Jan Kara
2010-06-24 13:57 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-24 13:57 ` Jan Kara
2010-06-04 18:47 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
2010-06-04 18:47 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-04 18:47   ` Jan Kara
2010-06-05  1:38   ` Nick Piggin
2010-06-05  1:38     ` Nick Piggin
2010-06-07 16:09     ` Jan Kara
2010-06-07 16:09       ` Jan Kara
2010-06-08  5:29       ` Nick Piggin
2010-06-09 13:04         ` Jan Kara
2010-06-09 13:04           ` Jan Kara
2010-06-10  8:12       ` Jan Kara
2010-08-12 18:35   ` Christoph Hellwig
2010-08-12 18:35     ` Christoph Hellwig
2010-08-12 22:28     ` Jan Kara
2010-08-12 22:28       ` Jan Kara
2010-08-13  7:50       ` Christoph Hellwig
2010-08-13  7:50         ` Christoph Hellwig
2010-06-04 18:40 [PATCH 0/2 RFC v3] Livelock avoidance for data integrity writeback Jan Kara
2010-06-04 18:40 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-09 23:41   ` Andrew Morton
2010-06-10 12:31     ` Jan Kara
2010-06-09 23:45   ` Andrew Morton
2010-06-10 12:42     ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.