linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] msync improvements
@ 2012-06-15 15:12 Paolo Bonzini
  2012-06-15 15:12 ` [PATCH v2 1/2] msync: support syncing a small part of the file Paolo Bonzini
  2012-06-15 15:12 ` [PATCH v2 2/2] msync: start async writeout when MS_ASYNC Paolo Bonzini
  0 siblings, 2 replies; 7+ messages in thread
From: Paolo Bonzini @ 2012-06-15 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Hugh Dickins, Chris Friesen

These two patches are improvements on the implementation of the msync
system call.  To give some context, I found them while working on
the implementation of a persistent dirty bitmap.

Paolo Bonzini (2):
  msync: support syncing a small part of the file
  msync: start async writeout when MS_ASYNC

 include/linux/fs.h |    3 +-
 mm/fadvise.c       |    2 +-
 mm/filemap.c       |   11 +++++---
 mm/msync.c         |   63 +++++++++++++++++++++++++++++++++-------------------
 4 files changed, 50 insertions(+), 29 deletions(-)



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/2] msync: support syncing a small part of the file
  2012-06-15 15:12 [PATCH v2 0/2] msync improvements Paolo Bonzini
@ 2012-06-15 15:12 ` Paolo Bonzini
  2012-06-22 21:12   ` Andrew Morton
  2012-06-15 15:12 ` [PATCH v2 2/2] msync: start async writeout when MS_ASYNC Paolo Bonzini
  1 sibling, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2012-06-15 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Hugh Dickins, Chris Friesen

msync does not need to flush changes to the entire file, even with MS_SYNC.
Instead, it can use vfs_fsync_range to only synchronize a part of the file.
This is part of the specification; expecting msync to synchronize all the
file would take a very creative interpretation of the manual page as well
as the specification.

In addition, not all metadata has to be synced; msync is closer to
fdatasync than it is to fsync.  So, pass 1 to vfs_fsync_range.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Friesen <chris.friesen@genband.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: fixed off-by-one in vall to vfs_fsync_range.

 mm/msync.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 632df45..505fe99 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -15,7 +15,7 @@
 #include <linux/sched.h>
 
 /*
- * MS_SYNC syncs the entire file - including mappings.
+ * MS_SYNC syncs the specified range - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
@@ -58,6 +58,8 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 	vma = find_vma(mm, start);
 	for (;;) {
 		struct file *file;
+		unsigned long next;
+		loff_t file_offset;
 
 		/* Still start < end. */
 		error = -ENOMEM;
@@ -77,18 +79,23 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			goto out_unlock;
 		}
 		file = vma->vm_file;
-		start = vma->vm_end;
+		next = min(end, vma->vm_end);
 		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
+			file_offset = vma->vm_pgoff * PAGE_SIZE;
 			get_file(file);
 			up_read(&mm->mmap_sem);
-			error = vfs_fsync(file, 0);
+			error = vfs_fsync_range(file,
+				    start - vma->vm_start + file_offset,
+				    next - vma->vm_start + file_offset - 1, 1);
 			fput(file);
+			start = next;
 			if (error || start >= end)
 				goto out;
 			down_read(&mm->mmap_sem);
 			vma = find_vma(mm, start);
 		} else {
+			start = next;
 			if (start >= end) {
 				error = 0;
 				goto out_unlock;
-- 
1.7.1




^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2 2/2] msync: start async writeout when MS_ASYNC
  2012-06-15 15:12 [PATCH v2 0/2] msync improvements Paolo Bonzini
  2012-06-15 15:12 ` [PATCH v2 1/2] msync: support syncing a small part of the file Paolo Bonzini
@ 2012-06-15 15:12 ` Paolo Bonzini
  2012-06-22 21:26   ` Andrew Morton
  1 sibling, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2012-06-15 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Hugh Dickins, Chris Friesen

msync.c says that applications had better use fsync() or fadvise(FADV_DONTNEED)
instead of MS_ASYNC.  Both advices are really bad:

* fsync() can be a replacement for MS_SYNC, not for MS_ASYNC;

* fadvise(FADV_DONTNEED) invalidates the pages completely, which will make
  later accesses expensive.

Even sync_file_range would not be a replacement, because the writeout is
done synchronously and can block for an extended period of time.

Having the possibility to schedule a writeback immediately is an advantage
for the applications.  They can do the same thing that fadvise does,
but without the invalidation part.  The implementation is also similar
to fadvise, but with tag-and-write enabled.

One example is if you are implementing a persistent dirty bitmap.
Whenever you set bits to 1 you need to synchronize it with MS_SYNC, so
that dirtiness is reported properly after a host crash.  If you have set
any bits to 0, getting them to disk is not needed for correctness, but
it is still desirable to save some work after a host crash.  You could
simply use MS_SYNC in a separate thread, but MS_ASYNC provides exactly
the desired semantics and is easily done in the kernel.

If the application does not want to start I/O, it can simply call msync
with flags equal to MS_INVALIDATE.  This one remains a no-op, as it should
be on a reasonable implementation.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Friesen <chris.friesen@genband.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: rebased on new PATCH 1/2

 include/linux/fs.h |    3 +-
 mm/fadvise.c       |    2 +-
 mm/filemap.c       |   11 ++++++---
 mm/msync.c         |   60 ++++++++++++++++++++++++++++++---------------------
 4 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..0aeedb9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2196,7 +2196,8 @@ extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
-				loff_t start, loff_t end, int sync_mode);
+				loff_t start, loff_t end, int sync_mode,
+				bool tagged_writepages);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 469491e..a3579f1 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -118,7 +118,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	case POSIX_FADV_DONTNEED:
 		if (!bdi_write_congested(mapping->backing_dev_info))
 			__filemap_fdatawrite_range(mapping, offset, endbyte,
-						   WB_SYNC_NONE);
+						   WB_SYNC_NONE, 0);
 
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
diff --git a/mm/filemap.c b/mm/filemap.c
index 79c4b2b..641e2a8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -191,6 +191,7 @@ static int sleep_on_page_killable(void *word)
  * @start:	offset in bytes where the range starts
  * @end:	offset in bytes where the range ends (inclusive)
  * @sync_mode:	enable synchronous operation
+ * @tagged_writepages: tag-and-write to avoid livelock (implicit if WB_SYNC_ALL)
  *
  * Start writeback against all of a mapping's dirty pages that lie
  * within the byte offsets <start, end> inclusive.
@@ -201,7 +202,8 @@ static int sleep_on_page_killable(void *word)
  * be waited upon, and not just skipped over.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
-				loff_t end, int sync_mode)
+				loff_t end, int sync_mode,
+				bool tagged_writepages)
 {
 	int ret;
 	struct writeback_control wbc = {
@@ -209,6 +211,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 		.nr_to_write = LONG_MAX,
 		.range_start = start,
 		.range_end = end,
+		.tagged_writepages = tagged_writepages,
 	};
 
 	if (!mapping_cap_writeback_dirty(mapping))
@@ -221,7 +224,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
 {
-	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode, 0);
 }
 
 int filemap_fdatawrite(struct address_space *mapping)
@@ -233,7 +236,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
 int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end)
 {
-	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
+	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL, 1);
 }
 EXPORT_SYMBOL(filemap_fdatawrite_range);
 
@@ -361,7 +364,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
-						 WB_SYNC_ALL);
+						 WB_SYNC_ALL, 1);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = filemap_fdatawait_range(mapping,
diff --git a/mm/msync.c b/mm/msync.c
index 505fe99..4d1f813 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,20 +13,16 @@
 #include <linux/file.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/backing-dev.h>
+#include <linux/writeback.h>
 
 /*
  * MS_SYNC syncs the specified range - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
- *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
+ * MS_ASYNC only starts I/O, as it did up to 2.5.67, but only dirty pages
+ * will now be written.   While the application may run fadvise(FADV_DONTNEED)
+ * against the fd to start async writeout immediately, invalidating the
+ * pages will make later accesses expensive.
  */
 SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 {
@@ -78,30 +74,44 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			error = -EBUSY;
 			goto out_unlock;
 		}
+
+		error = 0;
 		file = vma->vm_file;
 		next = min(end, vma->vm_end);
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
-			file_offset = vma->vm_pgoff * PAGE_SIZE;
-			get_file(file);
-			up_read(&mm->mmap_sem);
-			error = vfs_fsync_range(file,
-				    start - vma->vm_start + file_offset,
-				    next - vma->vm_start + file_offset - 1, 1);
-			fput(file);
-			start = next;
-			if (error || start >= end)
-				goto out;
-			down_read(&mm->mmap_sem);
-			vma = find_vma(mm, start);
-		} else {
+		if (!file || !(vma->vm_flags & VM_SHARED) ||
+		    !(flags & ~MS_INVALIDATE)) {
 			start = next;
 			if (start >= end) {
 				error = 0;
 				goto out_unlock;
 			}
 			vma = vma->vm_next;
+			continue;
+		}
+
+		file_offset = vma->vm_pgoff * PAGE_SIZE;
+		get_file(file);
+		up_read(&mm->mmap_sem);
+		if (flags & MS_SYNC) {
+			error = vfs_fsync_range(file,
+				    start - vma->vm_start + file_offset,
+				    next - vma->vm_start + file_offset - 1, 1);
+		} else {
+			struct address_space *mapping = file->f_mapping;
+			/* end offset is inclusive! */
+			if (mapping &&
+			    !bdi_write_congested(mapping->backing_dev_info))
+				__filemap_fdatawrite_range(mapping,
+				    start - vma->vm_start + file_offset,
+				    next - vma->vm_start + file_offset - 1,
+				    WB_SYNC_NONE, 1);
 		}
+		fput(file);
+		start = next;
+		if (error || start >= end)
+			goto out;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, start);
 	}
 out_unlock:
 	up_read(&mm->mmap_sem);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] msync: support syncing a small part of the file
  2012-06-15 15:12 ` [PATCH v2 1/2] msync: support syncing a small part of the file Paolo Bonzini
@ 2012-06-22 21:12   ` Andrew Morton
  2012-07-02  8:14     ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2012-06-22 21:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, Hugh Dickins, Chris Friesen

On Fri, 15 Jun 2012 17:12:58 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> msync does not need to flush changes to the entire file, even with MS_SYNC.
> Instead, it can use vfs_fsync_range to only synchronize a part of the file.
> This is part of the specification; expecting msync to synchronize all the
> file would take a very creative interpretation of the manual page as well
> as the specification.
> 
> In addition, not all metadata has to be synced; msync is closer to
> fdatasync than it is to fsync.  So, pass 1 to vfs_fsync_range.

I renamed the patch to "msync: switch to syncing only the affected part
of the file" to emphasise that there is a behavioural change here.

Then I deleted the patch.

Please recall my thus-far-utterly-and-irritatingly-ignored comments
about nonlinear mappings.  Presently an msync() on a nonlinear mapping
will work correctly, because we sync the whole file, yes?  And with
this change, we will no longer sync all the pages which were covered by
the affected memory range.  Thus breaking the syscall?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] msync: start async writeout when MS_ASYNC
  2012-06-15 15:12 ` [PATCH v2 2/2] msync: start async writeout when MS_ASYNC Paolo Bonzini
@ 2012-06-22 21:26   ` Andrew Morton
  2012-07-02  8:15     ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2012-06-22 21:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, Hugh Dickins, Chris Friesen

On Fri, 15 Jun 2012 17:12:59 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> msync.c says that applications had better use fsync() or fadvise(FADV_DONTNEED)
> instead of MS_ASYNC.  Both advices are really bad:
> 
> * fsync() can be a replacement for MS_SYNC, not for MS_ASYNC;
> 
> * fadvise(FADV_DONTNEED) invalidates the pages completely, which will make
>   later accesses expensive.
> 
> Even sync_file_range would not be a replacement, because the writeout is
> done synchronously and can block for an extended period of time.

This is just wrong.  sync_file_range() is, within limits, asynchronous
when SYNC_FILE_RANGE_WAIT_* are not used.

> Having the possibility to schedule a writeback immediately is an advantage
> for the applications.

Having this forced upon them is also a disadvantage.  The syscall will
now take longer, consuming more CPU: starting all that IO will add
latency.  It also moves work away from the flusher threads and into the
calling process thus increasing overall runtime and reducing SMP
utilisation.

And as bdi_wrte_congested() is a best-effort, sometime-gets-it-wrong
thing, the patch will introduce quite rare but very long delays where
msync(MS_ASYNC) waits on IO.

>  They can do the same thing that fadvise does,
> but without the invalidation part.  The implementation is also similar
> to fadvise, but with tag-and-write enabled.
> 
> One example is if you are implementing a persistent dirty bitmap.
> Whenever you set bits to 1 you need to synchronize it with MS_SYNC, so
> that dirtiness is reported properly after a host crash.  If you have set
> any bits to 0, getting them to disk is not needed for correctness, but
> it is still desirable to save some work after a host crash.  You could
> simply use MS_SYNC in a separate thread, but MS_ASYNC provides exactly
> the desired semantics and is easily done in the kernel.

This is already the case.  The current msync(MS_ASYNC) will mark the
pages for writeout within a dirty_expire_centisecs period (default 30
seconds).  This has always been why we consider the current MS_ASYNC
implementation to be standards-compliant.

If you think that some applications will *benefit* from having that 30
seconds changed to zero seconds under their feet then please describe
the reasoning.

> If the application does not want to start I/O, it can simply call msync
> with flags equal to MS_INVALIDATE.  This one remains a no-op, as it should
> be on a reasonable implementation.

Using MS_INVALIDATE is a bit of a hack.


I'm just not seeing it, sorry.  The change has risks and downsides and
forces the application to do things which it could already have done,
had it so chosen.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] msync: support syncing a small part of the file
  2012-06-22 21:12   ` Andrew Morton
@ 2012-07-02  8:14     ` Paolo Bonzini
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2012-07-02  8:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Hugh Dickins, Chris Friesen

Il 22/06/2012 23:12, Andrew Morton ha scritto:
> Please recall my thus-far-utterly-and-irritatingly-ignored comments
> about nonlinear mappings.  Presently an msync() on a nonlinear mapping
> will work correctly, because we sync the whole file, yes?  And with
> this change, we will no longer sync all the pages which were covered by
> the affected memory range.  Thus breaking the syscall?

The comment was not ignored, it was missed altogether, and you're
completely right.

Paolo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 2/2] msync: start async writeout when MS_ASYNC
  2012-06-22 21:26   ` Andrew Morton
@ 2012-07-02  8:15     ` Paolo Bonzini
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2012-07-02  8:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Hugh Dickins, Chris Friesen

Il 22/06/2012 23:26, Andrew Morton ha scritto:
>> If the application does not want to start I/O, it can simply call msync
>> with flags equal to MS_INVALIDATE.  This one remains a no-op, as it should
>> be on a reasonable implementation.
> 
> Using MS_INVALIDATE is a bit of a hack.

Right, you could use flags == 0.

I'm not _that_ attached to this patch, I'll send a v3 of the other one
and drop this one.

Paolo

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-07-02  8:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-15 15:12 [PATCH v2 0/2] msync improvements Paolo Bonzini
2012-06-15 15:12 ` [PATCH v2 1/2] msync: support syncing a small part of the file Paolo Bonzini
2012-06-22 21:12   ` Andrew Morton
2012-07-02  8:14     ` Paolo Bonzini
2012-06-15 15:12 ` [PATCH v2 2/2] msync: start async writeout when MS_ASYNC Paolo Bonzini
2012-06-22 21:26   ` Andrew Morton
2012-07-02  8:15     ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).