* [RFC] [PATCH] Avoid livelock for fsync
@ 2009-10-26 18:13 ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2009-10-26 18:13 UTC (permalink / raw)
  To: WU Fengguang; +Cc: npiggin, Andrew Morton, LKML, linux-mm, hch, chris.mason

[-- Attachment #1: Type: text/plain, Size: 1914 bytes --]

  Hi,

  on my way back from the Kernel Summit, I've coded the attached patch, which
implements livelock avoidance for write_cache_pages. We tag pages that
should be written at the beginning of write_cache_pages and then write
only tagged pages (see the patch for details). The patch is based on Nick's
idea.
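In miniature, the tag-then-write scheme works like this (a toy model: plain flag arrays stand in for the radix-tree DIRTY and TOWRITE tags, and `tag_pages` / `write_tagged` are made-up names for illustration, not the patch's functions):

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy stand-ins for the per-page radix-tree tags used by the patch. */
static bool dirty[NPAGES];
static bool towrite[NPAGES];

/* Phase 1: snapshot the current dirty set by tagging it TOWRITE.
 * This is cheap - one bit per page, no page locking, no I/O. */
static void tag_pages(void)
{
	size_t i;

	for (i = 0; i < NPAGES; i++)
		if (dirty[i])
			towrite[i] = true;
}

/* Phase 2: do the expensive work (locking, submitting I/O) only for
 * tagged pages.  A page dirtied after the snapshot keeps its DIRTY
 * flag but is not tagged, so it cannot extend this writeback pass. */
static int write_tagged(void)
{
	size_t i;
	int written = 0;

	for (i = 0; i < NPAGES; i++) {
		if (!towrite[i])
			continue;
		towrite[i] = false;
		dirty[i] = false;	/* "submit" the page */
		written++;
	}
	return written;
}
```

A page dirtied between the two phases is simply left for the next pass instead of growing the current one, which is the livelock-avoidance property.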
  The next thing I've aimed at with this patch is a simplification of the
current writeback code. Basically, with this patch I think we can just rip
out all of the range_cyclic and nr_to_write (or other "fairness") logic. The
rationale is the following:
  What we want to achieve with the fairness logic is that when a page is
dirtied, it gets written to disk within some reasonable time (like 30s or
so). We track dirty time on a per-inode basis only because keeping it
per-page is simply too expensive. So in this setting, fairness between
inodes really does not make any sense - why should a page in a file be
penalized and written later only because there are lots of other dirty
pages in the file? It is enough to make sure that we don't write one file
indefinitely while new dirty pages are continuously created - and my
patch achieves that.
  So with my patch we can make write_cache_pages always write from
range_start (or 0) to range_end (or EOF) and write all tagged pages. Also,
after changing balance_dirty_pages() so that a throttled process does not
directly submit the IO (Fengguang has the patches for this), we can
completely remove the nr_to_write logic because nothing really uses it
anymore. Thus the requeue_io logic should also go away, etc.
  Fengguang, do you have the series somewhere publicly available? You had
plenty of changes there, and quite a few of them are not needed once the
above is done. So could you maybe separate out the balance_dirty_pages
change so I can base my patch and further simplifications on top of that?
Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Implement-writeback-livelock-avoidance-using-pag.patch --]
[-- Type: text/x-patch, Size: 6000 bytes --]

From cf9ee78989c8db475cc41180eac3dbf74b35980b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Sat, 24 Oct 2009 09:17:52 +0200
Subject: [PATCH] mm: Implement writeback livelock avoidance using page tagging

We go through all sorts of hoops to avoid livelocking when we do
writeback on a mapping and someone steadily creates dirty pages
in that mapping. The motivation of this patch is to implement
a simple way to avoid livelocks (and thus we can rip out all the
methods we used previously).

The idea is simple: tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly, and thus livelocks should not happen in practice.
Then we do the hard work of locking pages and sending them
to disk only for those pages that have the TOWRITE tag set.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h         |    1 +
 include/linux/radix-tree.h |    2 +-
 mm/page-writeback.c        |   70 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2620a8c..6959d65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -679,6 +679,7 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+#define PAGECACHE_TAG_TOWRITE	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index c5da749..a7bc41b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2c5d792..a19fd5a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -803,6 +803,51 @@ void __init page_writeback_init(void)
 }
 
 /**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end and tags all pages
+ * that have DIRTY tag set with a special TOWRITE tag. The idea is that
+ * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
+ * to identify pages eligible for writeback.  This mechanism is used to avoid
+ * livelocking of writeback by a process steadily creating new dirty pages in
+ * the file (thus it is important for this function to be damn quick so that it
+ * can tag pages faster than a dirtying process can create them).
+ */
+void tag_pages_for_writeback(struct address_space *mapping,
+			     pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	int nr_pages, i;
+	struct page *page;
+
+	pagevec_init(&pvec, 0);
+	while (start <= end) {
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &start,
+			      PAGECACHE_TAG_DIRTY,
+			      min(end - start, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+		if (!nr_pages)
+			return;
+
+		spin_lock_irq(&mapping->tree_lock);
+		for (i = 0; i < nr_pages; i++) {
+			page = pvec.pages[i];
+			/* Raced with someone freeing the page? */
+			if (page->mapping != mapping)
+				continue;
+			if (page->index > end)
+				break;
+			radix_tree_tag_set(&mapping->page_tree,
+				page_index(page), PAGECACHE_TAG_TOWRITE);
+		}
+		spin_unlock_irq(&mapping->tree_lock);
+	}
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
  * write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
  * @mapping: address space structure to write
  * @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -816,6 +861,13 @@ void __init page_writeback_init(void)
  * the call was made get new I/O started against them.  If wbc->sync_mode is
  * WB_SYNC_ALL then we were called for data integrity and we must wait for
  * existing IO to complete.
+ *
+ * To avoid livelocks (when another process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that the TOWRITE tag can be cleared
+ * only by the process clearing the DIRTY tag (and submitting the page for IO).
  */
 int write_cache_pages(struct address_space *mapping,
 		      struct writeback_control *wbc, writepage_t writepage,
@@ -856,12 +908,13 @@ int write_cache_pages(struct address_space *mapping,
 		cycled = 1; /* ignore range_cyclic tests */
 	}
 retry:
+	tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
 
 		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-			      PAGECACHE_TAG_DIRTY,
+			      PAGECACHE_TAG_TOWRITE,
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0)
 			break;
@@ -984,6 +1037,18 @@ continue_unlock:
 		wbc->nr_to_write = nr_to_write;
 	}
 
+	/* Debugging aid */
+	{
+	int i;
+	index = 0;
+	end = -1;
+	nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+		      PAGECACHE_TAG_TOWRITE,
+		      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+	for (i = 0; i < nr_pages; i++)
+		printk(KERN_DEBUG "Seeing tagged page %lu, state %lu\n",
+		       pvec.pages[i]->index, pvec.pages[i]->flags);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);
@@ -1327,6 +1392,9 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
+		radix_tree_tag_clear(&mapping->page_tree,
+				     page_index(page),
+				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-26 18:13 ` Jan Kara
@ 2009-10-27  3:39   ` Nick Piggin
  -1 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2009-10-27  3:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: WU Fengguang, Andrew Morton, LKML, linux-mm, hch, chris.mason

On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
>   Hi,
> 
>   on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.
>   The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
>   What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or
> so). We track dirty time on per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between
> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.
>   So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...
>   Fengguang, do you have the series somewhere publicly available? You had
> plenty of changes there, and quite a few of them are not needed once the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.

Like I said (and as we concluded when I last posted my tagging patch),
I think this idea should work fine, but there is perhaps a little bit of
overhead/complexity, so provided we can get some numbers or show a real
improvement in behaviour or a code simplification, I think we could
justify the patch.

I would be interested to know how it goes.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-27  3:39   ` Nick Piggin
@ 2009-10-27  9:17     ` Jan Kara
  -1 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2009-10-27  9:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, WU Fengguang, Andrew Morton, LKML, linux-mm, hch, chris.mason

On Tue 27-10-09 04:39:47, Nick Piggin wrote:
> On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> >   on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> >   The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> >   What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> > so). We track dirty time on per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> >   So with my patch we can make write_cache_pages always write from
> > range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> > after changing balance_dirty_pages() so that a throttled process does not
> > directly submit the IO (Fengguang has the patches for this), we can
> > completely remove the nr_to_write logic because nothing really uses it
> > anymore. Thus also the requeue_io logic should go away etc...
> >   Fengguang, do you have the series somewhere publicly available? You had
> > plenty of changes there, and quite a few of them are not needed once the
> > above is done. So could you maybe separate out the balance_dirty_pages
> > change and I'd base my patch and further simplifications on top of that?
> > Thanks.
> 
> Like I said (and as we concluded when I last posted my tagging patch),
> I think this idea should work fine, but there is perhaps a little bit of
> overhead/complexity so provided that we can get some numbers or show a
> real improvement in behaviour or code simplifications then I think we
> could justify the patch.
  Yes, after I rebase my patch on top of Fengguang's work, I'll also write
the cleanup patch so that we can really see how much simpler the code gets,
and can test what advantages / disadvantages it brings. I'll keep you
updated.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-26 18:13 ` Jan Kara
@ 2009-10-27 13:56   ` Nikanth Karthikesan
  -1 siblings, 0 replies; 20+ messages in thread
From: Nikanth Karthikesan @ 2009-10-27 13:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: WU Fengguang, npiggin, Andrew Morton, LKML, linux-mm, hch, chris.mason

On Monday 26 October 2009 23:43:14 Jan Kara wrote:
>   Hi,
> 
>   on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.

As I understand, livelock can be caused only by dirtying new pages.

So theoretically, if a process can dirty pages faster than we can tag pages
for writeback, even now isn't there a chance of livelock? But if it is really
a very fast operation and livelock is not possible, why not hold the tree_lock
during the entire period of tagging the pages for writeback, i.e., call
tag_pages_for_writeback() under mapping->tree_lock? Would it cause
deadlock/starvation or some other serious problems?

Thanks
Nikanth

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-27 13:56   ` Nikanth Karthikesan
@ 2009-10-27 15:32     ` Jan Kara
  -1 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2009-10-27 15:32 UTC (permalink / raw)
  To: Nikanth Karthikesan
  Cc: Jan Kara, WU Fengguang, npiggin, Andrew Morton, LKML, linux-mm,
	hch, chris.mason

On Tue 27-10-09 19:26:14, Nikanth Karthikesan wrote:
> On Monday 26 October 2009 23:43:14 Jan Kara wrote:
> >   Hi,
> > 
> >   on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written at the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> 
> As I understand, livelock can be caused only by dirtying new pages.
> 
> So theoretically, if a process can dirty pages faster than we can tag pages 
> for writeback, even now isn't there a chance for livelock? But if it is really 
  Yes, theoretically the livelock is still there, but practically I don't
think it's triggerable (the amount of work needed to do either a write(2) or
a page fault is much higher than just looking up a page in the radix tree and
setting one bit there). If the file has lots of dirty pages, I believe a user
can create a few more while we are tagging, but not many...

> a very fast operation and livelock is not possible, why not hold the tree_lock 
> during the entire period of tagging the pages for writeback i.e., call 
> tag_pages_for_writeback() under mapping->tree_lock? Would it cause 
> deadlock/starvation or some other serious problems?
  I'm dropping tree_lock because I don't think I can hold it during
pagevec_lookup_tag. Even if that were worked around, if the file has lots of
dirty pages, it could take us long enough to tag all of them that it would
matter latency-wise for other users of the lock. So I'd leave the code as
is.
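The batching pattern described here - take the lock, tag at most one pagevec's worth of pages, drop the lock, repeat - bounds how long any single lock hold lasts. A toy sketch (counters stand in for mapping->tree_lock and the radix tree; `lock_tree`, `tag_in_batches`, and the constants are made-up names for illustration):

```c
#define BATCH 14	/* stand-in for PAGEVEC_SIZE - 1 + 1 */
#define TOTAL 100	/* dirty pages to tag */

static int lock_acquisitions;	/* how often we took "tree_lock" */
static int tagged;		/* pages tagged "TOWRITE" */

static void lock_tree(void)   { lock_acquisitions++; }
static void unlock_tree(void) { }

/* Tag TOTAL pages in batches, releasing the lock between batches
 * so that other lock users wait for at most BATCH iterations. */
static void tag_in_batches(void)
{
	int done = 0;

	while (done < TOTAL) {
		int i, n = TOTAL - done < BATCH ? TOTAL - done : BATCH;

		lock_tree();
		for (i = 0; i < n; i++)
			tagged++;	/* radix_tree_tag_set() would go here */
		unlock_tree();
		done += n;
	}
}
```

Holding the lock across all 100 pages would be one long acquisition; batching trades that for several short ones, which is the latency argument against tagging the whole range under mapping->tree_lock.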

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-26 18:13 ` Jan Kara
@ 2009-10-28 21:47   ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2009-10-28 21:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: WU Fengguang, npiggin, LKML, linux-mm, hch, chris.mason

On Mon, 26 Oct 2009 19:13:14 +0100
Jan Kara <jack@suse.cz> wrote:

>   Hi,
> 
>   on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written at the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.
>   The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
>   What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or
> so). We track dirty time on per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between
> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.
>   So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...
>   Fengguang, do you have the series somewhere publicly available? You had
> plenty of changes there, and quite a few of them are not needed once the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.
> 

I need to think about this.  Hard.

So I'll defer that and nitpick the implementation instead ;)

My MUA doesn't understand text/x-patch.  Please use text/plain if you
must use attachments?

	
> +/**
> + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> + * @mapping: address space structure to write
> + * @start: starting page index
> + * @end: ending page index (inclusive)
> + *
> + * This function scans the page range from @start to @end and tags all pages
> + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> + * to identify pages eligible for writeback.  This mechanism is used to avoid
> + * livelocking of writeback by a process steadily creating new dirty pages in
> + * the file (thus it is important for this function to be damn quick so that it
> + * can tag pages faster than a dirtying process can create them).
> + */
> +void tag_pages_for_writeback(struct address_space *mapping,
> +			     pgoff_t start, pgoff_t end)
> +{
> +	struct pagevec pvec;
> +	int nr_pages, i;
> +	struct page *page;
> +
> +	pagevec_init(&pvec, 0);
> +	while (start <= end) {
> +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &start,
> +			      PAGECACHE_TAG_DIRTY,
> +			      min(end - start, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> +		if (!nr_pages)
> +			return;
> +
> +		spin_lock_irq(&mapping->tree_lock);
> +		for (i = 0; i < nr_pages; i++) {
> +			page = pvec.pages[i];
> +			/* Raced with someone freeing the page? */
> +			if (page->mapping != mapping)
> +				continue;
> +			if (page->index > end)
> +				break;
> +			radix_tree_tag_set(&mapping->page_tree,
> +				page_index(page), PAGECACHE_TAG_TOWRITE);
> +		}
> +		spin_unlock_irq(&mapping->tree_lock);
> +		pagevec_release(&pvec);
> +	}
> +}
> +EXPORT_SYMBOL(tag_pages_for_writeback);

This is really inefficient.  We do a full tree descent for each dirty
page.

It would be far more efficient to do a combined lookup and set
operation.  Basically that's the same as pagevec_lookup_tag(), only we
set the PAGECACHE_TAG_TOWRITE on each page instead of taking a copy
into the pagevec.

Which makes one wonder: would such an operation require ->tree_lock? 
pagevec_lookup_tag() just uses rcu_read_lock() - what do we need to do
to use lighter locking in the new
radix_tree_gang_lookup_tag_slot_then_set_a_flag()?  Convert tag_set()
and tag_clear() to atomic ops, perhaps?
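The two-phase mechanism being nitpicked here can be illustrated with a small
userspace model (a sketch only, not kernel code: plain bool arrays stand in
for the radix tree's DIRTY and TOWRITE tags, and the page counts are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 64

static bool dirty[NPAGES];    /* models PAGECACHE_TAG_DIRTY   */
static bool towrite[NPAGES];  /* models PAGECACHE_TAG_TOWRITE */

/* Phase 1: snapshot the dirty set, as tag_pages_for_writeback() does. */
static void tag_range(int start, int end)
{
	int i;

	for (i = start; i <= end; i++)
		if (dirty[i])
			towrite[i] = true;
}

/* Phase 2: write only tagged pages.  Even if a racing process
 * re-dirties every page we clean, this pass still terminates,
 * because newly dirtied pages carry DIRTY but not TOWRITE. */
static int write_tagged(int start, int end)
{
	int i, written = 0;

	for (i = start; i <= end; i++) {
		if (!towrite[i])
			continue;
		towrite[i] = false;
		dirty[i] = false;	/* "page written back" */
		written++;
		dirty[i] = true;	/* racing dirtier strikes again */
	}
	return written;
}
```

A second write_tagged() pass without re-tagging writes nothing, which is the
livelock-avoidance property: the amount of work per pass is bounded by the
snapshot, not by the rate at which a dirtier creates new dirty pages.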




^ permalink raw reply	[flat|nested] 20+ messages in thread


* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-28 21:47   ` Andrew Morton
@ 2009-11-02  3:34     ` Nick Piggin
  -1 siblings, 0 replies; 20+ messages in thread
From: Nick Piggin @ 2009-11-02  3:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, WU Fengguang, LKML, linux-mm, hch, chris.mason

On Wed, Oct 28, 2009 at 02:47:31PM -0700, Andrew Morton wrote:
> On Mon, 26 Oct 2009 19:13:14 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> >   Hi,
> > 
> >   on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written in the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> >   The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> >   What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> > so). We track dirty time on per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> >   So with my patch we can make write_cache_pages always write from
> > range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> > after changing balance_dirty_pages() so that a throttled process does not
> > directly submit the IO (Fengguang has the patches for this), we can
> > completely remove the nr_to_write logic because nothing really uses it
> > anymore. Thus also the requeue_io logic should go away etc...
> >   Fengguang, do you have the series somewhere publicly available? You had
> > plenty of changes there, and quite a few of them are not needed once the
> > above is done. So could you maybe separate out the balance_dirty_pages
> > change and I'd base my patch and further simplifications on top of that?
> > Thanks.
> > 
> 
> I need to think about this.  Hard.
> 
> So I'll defer that and nitpick the implementation instead ;)
> 
> My MUA doesn't understand text/x-patch.  Please use text/plain if you
> must use attachments?
> 
> 	
>  /**
> > + * tag_pages_for_writeback - tag pages to be written by write_cache_pages
> > + * @mapping: address space structure to write
> > + * @start: starting page index
> > + * @end: ending page index (inclusive)
> > + *
> > + * This function scans the page range from @start to @end and tags all pages
> > + * that have DIRTY tag set with a special TOWRITE tag. The idea is that
> > + * write_cache_pages (or whoever calls this function) will then use TOWRITE tag
> > + * to identify pages eligible for writeback.  This mechanism is used to avoid
> > + * livelocking of writeback by a process steadily creating new dirty pages in
> > + * the file (thus it is important for this function to be damn quick so that it
> > + * can tag pages faster than a dirtying process can create them).
> > + */
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > +			     pgoff_t start, pgoff_t end)
> > +{
> > +	struct pagevec pvec;
> > +	int nr_pages, i;
> > +	struct page *page;
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (start <= end) {
> > +		nr_pages = pagevec_lookup_tag(&pvec, mapping, &start,
> > +			      PAGECACHE_TAG_DIRTY,
> > +			      min(end - start, (pgoff_t)PAGEVEC_SIZE-1) + 1);
> > +		if (!nr_pages)
> > +			return;
> > +
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		for (i = 0; i < nr_pages; i++) {
> > +			page = pvec.pages[i];
> > +			/* Raced with someone freeing the page? */
> > +			if (page->mapping != mapping)
> > +				continue;
> > +			if (page->index > end)
> > +				break;
> > +			radix_tree_tag_set(&mapping->page_tree,
> > +				page_index(page), PAGECACHE_TAG_TOWRITE);
> > +		}
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		pagevec_release(&pvec);
> > +	}
> > +}
> > +EXPORT_SYMBOL(tag_pages_for_writeback);
> 
> This is really inefficient.  We do a full tree descent for each dirty
> page.
> 
> It would be far more efficient to do a combined lookup and set
> operation.  Basically that's the same as pagevec_lookup_tag(), only we
> set the PAGECACHE_TAG_TOWRITE on each page instead of taking a copy
> into the pagevec.

I had a radix_tree_gang_set_if_tagged operation in my earlier
patchset, which should basically do this.

 
> Which makes one wonder: would such an operation require ->tree_lock? 
> pagevec_lookup_tag() just uses rcu_read_lock() - what do we need to do
> to use lighter locking in the new
> radix_tree_gang_lookup_tag_slot_then_set_a_flag()?  Convert tag_set()
> and tag_clear() to atomic ops, perhaps?

Well that, but the hard part is propagating the tag back to the root
in a coherent way (when other guys are setting and clearing tags in
other nodes). Also, if we have more than a couple of atomic bitops,
then the spinlock will win out in straight line performance (although
scalability could still be better with an unlocked version... but I
think the propagation is the hard part).
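The propagation problem Nick describes can be seen in a toy two-level tree
(a userspace sketch, not the kernel's radix tree): setting a leaf tag is a
cheap unconditional upward walk, but clearing one must re-check every
sibling at each level before the parent tag may drop - the cross-slot check
that is hard to keep coherent without ->tree_lock:

```c
#include <assert.h>
#include <stdbool.h>

#define FANOUT   4
#define NLEAVES  (FANOUT * FANOUT)

static bool leaf_tag[NLEAVES];
static bool node_tag[FANOUT];	/* one interior node per FANOUT leaves */
static bool root_tag;

static void tag_set(int leaf)
{
	leaf_tag[leaf] = true;
	/* Propagation towards the root is unconditional and cheap. */
	node_tag[leaf / FANOUT] = true;
	root_tag = true;
}

static void tag_clear(int leaf)
{
	int node = leaf / FANOUT, i;

	leaf_tag[leaf] = false;
	/* A parent tag may only drop when no sibling still holds it;
	 * with concurrent setters and clearers in other slots, making
	 * this check-then-clear coherent is the hard part. */
	for (i = node * FANOUT; i < (node + 1) * FANOUT; i++)
		if (leaf_tag[i])
			return;
	node_tag[node] = false;
	for (i = 0; i < FANOUT; i++)
		if (node_tag[i])
			return;
	root_tag = false;
}
```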


^ permalink raw reply	[flat|nested] 20+ messages in thread


* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-10-26 18:13 ` Jan Kara
@ 2009-11-03 13:14   ` Wu Fengguang
  -1 siblings, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-11-03 13:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: npiggin, Andrew Morton, LKML, linux-mm, hch, chris.mason, Peter Zijlstra

Hi Jan,

Sorry for being late! For some reason I was only now able to see your email.

On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
>   Hi,
> 
>   on my way back from Kernel Summit, I've coded the attached patch which
> implements livelock avoidance for write_cache_pages. We tag pages that
> should be written in the beginning of write_cache_pages and then write
> only tagged pages (see the patch for details). The patch is based on Nick's
> idea.

Yes, tagging is a very fine-grained way to avoid livelock.
However, I doubt this patch can achieve the simplification goals
listed below..

>   The next thing I've aimed at with this patch is a simplification of
> current writeback code. Basically, with this patch I think we can just rip
> out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> rationale is the following:
>   What we want to achieve with fairness logic is that when a page is
> dirtied, it gets written to disk within some reasonable time (like 30s or

Right.

> so). We track dirty time on per-inode basis only because keeping it
> per-page is simply too expensive. So in this setting fairness between

Right.

> inodes really does not make any sense - why should a page in a file be
> penalized and written later only because there are lots of other dirty
> pages in the file? It is enough to make sure that we don't write one file
> indefinitely when there are new dirty pages continuously created - and my
> patch achieves that.

This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
With current policy, it could be

        sync 4MB of A
        sync B
        sync C
        sync D
        sync 4MB of A
        sync 4MB of A
        ...

And you want to change to

        sync A (all 4GB)
        sync B
        sync C
        sync D

This means the writeback of B, C and D won't be able to start at 30s, but
will be delayed to 80s because of A. This is not entirely fair. IMHO writeback
of big files should not delay small files too much.
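The shift Fengguang describes can be reproduced with a back-of-the-envelope
model; the 50 MB/s disk rate below is an assumed figure chosen to match the
4GB/80s ratio in the example, not anything from the patch:

```c
#include <assert.h>

/* Seconds until file B first gets serviced, given that A is written
 * ahead of it either in one chunk-sized slice (round-robin) or in
 * full (the proposed whole-file policy).  Sizes in MB, rate in MB/s. */
static double b_start_delay(double a_mb, double chunk_mb, double rate_mbps)
{
	double first_pass = chunk_mb < a_mb ? chunk_mb : a_mb;

	return first_pass / rate_mbps;
}
```

With A = 4096 MB at 50 MB/s, the round-robin policy reaches B after one
4 MB slice (under 0.1s), while the whole-file policy makes B wait over 80s.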

>   So with my patch we can make write_cache_pages always write from
> range_start (or 0) to range_end (or EOF) and write all tagged pages. Also
> after changing balance_dirty_pages() so that a throttled process does not
> directly submit the IO (Fengguang has the patches for this), we can
> completely remove the nr_to_write logic because nothing really uses it
> anymore. Thus also the requeue_io logic should go away etc...

For the above reason I think we should think twice about removing
nr_to_write and requeue_io().

>   Fengguang, do you have the series somewhere publicly available? You had
> there a plenty of changes and quite some of them are not needed when the
> above is done. So could you maybe separate out the balance_dirty_pages
> change and I'd base my patch and further simplifications on top of that?
> Thanks.

Sorry, I don't maintain a public git tree. However, it's a good idea to
break the big patchset down into smaller pieces and submit/review them
bit by bit.

I'm on leave tomorrow and will do that after coming back.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread


* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-11-03 13:14   ` Wu Fengguang
@ 2009-11-03 14:56     ` Jan Kara
  -1 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2009-11-03 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, npiggin, Andrew Morton, LKML, linux-mm, hch,
	chris.mason, Peter Zijlstra

  Hi Fengguang,

On Tue 03-11-09 21:14:34, Wu Fengguang wrote:
> Sorry for being late! For some reason I was only now able to see your email.
  No problem :).

> On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> >   Hi,
> > 
> >   on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag pages that
> > should be written in the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> 
> Yes, tagging is a very fine-grained way to avoid livelock.
> However, I doubt this patch can achieve the simplification goals
> listed below..
> 
> >   The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationale is the following:
> >   What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> 
> Right.
> 
> > so). We track dirty time on per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> 
> Right.
> 
> > inodes really does not make any sense - why should a page in a file be
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> 
> This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
> With current policy, it could be
> 
>         sync 4MB of A
>         sync B
>         sync C
>         sync D
>         sync 4MB of A
>         sync 4MB of A
>         ...
> 
> And you want to change to
> 
>         sync A (all 4GB)
>         sync B
>         sync C
>         sync D
> 
> This means the writeback of B, C and D won't be able to start at 30s, but
> will be delayed to 80s because of A. This is not entirely fair. IMHO writeback
> of big files should not delay small files too much.
  Yes, I'm aware of this change. It's just that I'm not sure we really
care. There are a few reasons for this: What advantage does being "fair
among files" bring? A user can only tell the difference if, after a crash,
files he wrote a long time ago are still not on disk. But we shouldn't
accumulate too much dirty data (like minutes' worth of writeback) in caches
anyway... So the difference should not be too big. Also, how is the case
"one big and a few small files" different from the case "many small files",
where being fair among files does not bring anything?
  It's just that I see substantial code complexity and also a performance
impact (because of smaller chunks of sequential IO) in trying to be fair
among files, and I don't really see adequate advantages to that approach.
That's why I'm suggesting we should revisit the decision and possibly go in
a different direction.
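The performance cost of chunked fairness that Jan alludes to can be sketched
with a toy model; the 8 ms seek penalty and 50 MB/s streaming rate below are
assumptions for illustration, not measured numbers:

```c
#include <assert.h>

/* Time to write total_mb of one file when writeback switches away
 * after every chunk_mb, paying roughly one seek to come back. */
static double writeback_seconds(double total_mb, double chunk_mb,
				double rate_mbps, double seek_s)
{
	double passes = total_mb / chunk_mb;

	return total_mb / rate_mbps + passes * seek_s;
}
```

For a 4GB file, 4 MB chunks add about a thousand extra seeks (~8s) compared
to one sequential pass - the "smaller chunks of sequential IO" cost that has
to be weighed against inter-file fairness.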

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] [PATCH] Avoid livelock for fsync
@ 2009-11-03 14:56     ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2009-11-03 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, npiggin, Andrew Morton, LKML, linux-mm, hch,
	chris.mason, Peter Zijlstra

  Hi Fengguang,

On Tue 03-11-09 21:14:34, Wu Fengguang wrote:
> Sorry for being late! - for some reason I was just able to see your email.
  No problem :).

> On Mon, Oct 26, 2009 at 07:13:14PM +0100, Jan Kara wrote:
> >   Hi,
> > 
> >   on my way back from Kernel Summit, I've coded the attached patch which
> > implements livelock avoidance for write_cache_pages. We tag patches that
> > should be written in the beginning of write_cache_pages and then write
> > only tagged pages (see the patch for details). The patch is based on Nick's
> > idea.
> 
> Yes, tagging is a very fine grained way for livelock avoidance.
> However I doubt this patch can achieve the simplification goals
> listed below..
> 
> >   The next thing I've aimed at with this patch is a simplification of
> > current writeback code. Basically, with this patch I think we can just rip
> > out all the range_cyclic and nr_to_write (or other "fairness logic"). The
> > rationalle is following:
> >   What we want to achieve with fairness logic is that when a page is
> > dirtied, it gets written to disk within some reasonable time (like 30s or
> 
> Right.
> 
> > so). We track dirty time on per-inode basis only because keeping it
> > per-page is simply too expensive. So in this setting fairness between
> 
> Right.
> 
> > inodes really does not make any sence - why should be a page in a file
> > penalized and written later only because there are lots of other dirty
> > pages in the file? It is enough to make sure that we don't write one file
> > indefinitely when there are new dirty pages continuously created - and my
> > patch achieves that.
> 
> This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
> With current policy, it could be
> 
>         sync 4MB of A
>         sync B
>         sync C
>         sync D
>         sync 4MB of A
>         sync 4MB of A
>         ...
> 
> And you want to change to
> 
>         sync A (all 4GB)
>         sync B
>         sync C
>         sync D
> 
> This means the writeback of B,C,D won't be able to start at 30s, but
> delayed to 80s because of A. This is not entirely fair. IMHO writeback
> of big files shall not delay small files too much. 
  Yes, I'm aware of this change. It's just that I'm not sure we really
care. There are a few reasons for this: What advantage does it bring that we
are "fair among files"? A user can only tell the difference if, after a
crash, files he wrote a long time ago are still not on disk. But we
shouldn't accumulate too much dirty data (like minutes of writeback) in
caches anyway... So the difference should not be too big. Also, how is the
case "one big and a few small files" different from the case "many small
files", where being fair among files does not bring anything?
  It's just that I see some substantial code complexity and also a
performance impact (because of smaller chunks of sequential IO) in trying
to be fair among files, and I don't really see adequate advantages to that
approach. That's why I'm suggesting we should revisit the decision and
possibly go in a different direction.
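The tag-then-write scheme under discussion can be illustrated with a toy
user-space model (a sketch only - the `towrite` set and the page dict are
illustrative stand-ins for the radix-tree TOWRITE tag, not the kernel code):

```python
# Toy model of tagged writeback: snapshot the set of dirty pages up front,
# then write only those, so pages dirtied during the sync cannot prolong it.

def write_cache_pages(pages):
    """pages: dict mapping page index -> 'dirty' or 'clean'."""
    # Phase 1: tag everything dirty *now* (stand-in for PAGECACHE_TAG_TOWRITE).
    towrite = {idx for idx, state in pages.items() if state == 'dirty'}
    written = []
    # Phase 2: write only tagged pages; concurrent dirtying is ignored.
    for idx in sorted(towrite):
        if pages[idx] == 'dirty':
            pages[idx] = 'clean'
            written.append(idx)
        # Simulate a writer dirtying a new page in the middle of the sync:
        pages[max(pages) + 1] = 'dirty'
    return written

pages = {0: 'dirty', 1: 'clean', 2: 'dirty'}
print(write_cache_pages(pages))  # → [0, 2]; pages dirtied mid-sync stay dirty
```

Without the up-front tagging, a loop that keeps rescanning for dirty pages
would never terminate against a sustained writer - which is exactly the
fsync livelock the patch avoids.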

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [RFC] [PATCH] Avoid livelock for fsync
  2009-11-03 14:56     ` Jan Kara
@ 2009-11-04 11:32       ` Wu Fengguang
  -1 siblings, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-11-04 11:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: npiggin, Andrew Morton, LKML, linux-mm, hch, chris.mason, Peter Zijlstra

Jan,

On Tue, Nov 03, 2009 at 10:56:54PM +0800, Jan Kara wrote:
[snip]
> > > inodes really does not make any sense - why should a page in a file
> > > be penalized and written later only because there are lots of other dirty
> > > pages in the file? It is enough to make sure that we don't write one file
> > > indefinitely when new dirty pages are continuously created - and my
> > > patch achieves that.
> > 
> > This is a big policy change. Imagine dirty files A=4GB, B=C=D=1MB.
> > With current policy, it could be
> > 
> >         sync 4MB of A
> >         sync B
> >         sync C
> >         sync D
> >         sync 4MB of A
> >         sync 4MB of A
> >         ...
> > 
> > And you want to change to
> > 
> >         sync A (all 4GB)
> >         sync B
> >         sync C
> >         sync D
> > 
> > This means the writeback of B, C, D won't be able to start at 30s, but
> > will be delayed to 80s because of A. This is not entirely fair. IMHO the
> > writeback of big files should not delay small files too much.
>   Yes, I'm aware of this change. It's just that I'm not sure we really
> care. There are a few reasons for this: What advantage does it bring that we
> are "fair among files"? A user can only tell the difference if, after a crash,

I'm not all that sure, either. The perception is that big files normally
contain less valuable information per page than small files ;)

After a crash, it's much better to lose one single big file than to lose
all the (big and small) files.

Then again, maybe nobody really cares about that - sync() has always
worked file after file (ignoring nr_to_write) and no one has complained.
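The timing difference being argued here can be checked with a toy
simulation (the 4 MB chunk size comes from the example above; the
1 MB-per-time-unit write rate is an assumption for illustration):

```python
# Toy simulation of the A/B/C/D example: one 4096 MB file and three 1 MB
# files, assuming writeback proceeds at 1 MB per time unit.
files = {'A': 4096, 'B': 1, 'C': 1, 'D': 1}

def finish_times(order):
    """order: list of (name, mb) work items; returns completion time per file."""
    t, remaining, done = 0, dict(files), {}
    for name, mb in order:
        t += mb
        remaining[name] -= mb
        if remaining[name] == 0:
            done[name] = t
    return done

# Current policy: round-robin over the files in 4 MB chunks (nr_to_write).
chunked, rem = [], dict(files)
while any(rem.values()):
    for name in files:
        if rem[name]:
            mb = min(4, rem[name])
            chunked.append((name, mb))
            rem[name] -= mb

# Proposed policy: write each file to completion before moving on.
whole = list(files.items())

print(finish_times(chunked)['B'])  # → 5: B is on disk almost immediately
print(finish_times(whole)['B'])    # → 4097: B waits behind all of A
```

In the simulation A finishes at 4096 vs. 4099 time units; the real win for
A would come from larger unbroken sequential IO, while B, C, D pay in
latency - the trade-off discussed above.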

> files he wrote a long time ago are still not on disk. But we shouldn't
> accumulate too much dirty data (like minutes of writeback) in caches
> anyway... So the difference should not be too big. Also, how is the case
> "one big and a few small files" different from the case "many small
> files", where being fair among files does not bring anything?
>   It's just that I see some substantial code complexity and also a
> performance impact (because of smaller chunks of sequential IO) in trying
> to be fair among files, and I don't really see adequate advantages to
> that approach. That's why I'm suggesting we should revisit the decision
> and possibly go in a different direction.

Anyway, if this is not a big concern, nr_to_write could be removed.

Note that requeue_io() (or requeue_io_wait) still cannot be removed,
because sometimes we have (temporary) problems writing back an inode.

Thanks,
Fengguang



Thread overview: 20+ messages
2009-10-26 18:13 [RFC] [PATCH] Avoid livelock for fsync Jan Kara
2009-10-27  3:39 ` Nick Piggin
2009-10-27  9:17   ` Jan Kara
2009-10-27 13:56 ` Nikanth Karthikesan
2009-10-27 15:32   ` Jan Kara
2009-10-28 21:47 ` Andrew Morton
2009-11-02  3:34   ` Nick Piggin
2009-11-03 13:14 ` Wu Fengguang
2009-11-03 14:56   ` Jan Kara
2009-11-04 11:32     ` Wu Fengguang
