linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes
@ 2013-08-23  0:03 Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 1/7] mm: Track mappings that have been written via ptes Andy Lutomirski
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Writes via mmap currently update mtime and ctime in ->page_mkwrite.
This hurts both throughput and latency.  In workloads that dirty a
large number of mmapped pages, ->page_mkwrite can be hot and
file_update_time is slow and scales poorly.  Updating timestamps can
also sleep, which hurts latency for real-time workloads.

This is also a correctness issue.  SuS says:

    The st_ctime and st_mtime fields of a file that is mapped with
    MAP_SHARED and PROT_WRITE, will be marked for update at some point
    in the interval between a write reference to the mapped region and
    the next call to msync() with MS_ASYNC or MS_SYNC for that portion
    of the file by any process. If there is no such call, these fields
    may be marked for update at any time after a write reference if
    the underlying file is modified as a result.

Currently, if the same mmapped page is written twice, the timestamp
may not be update at all after the second write, whereas SuS (and
anything using timestamps to invalidate caches, backup data, etc.)
would expect the timestamp to eventually be updated.

This patchset attempts to fix both issues at once.  It adds a new
address_space flag AS_CMTIME that is set atomically whenever the
system transfers a pte dirty bit to a struct page backed by the
address_space.  This can happen with various locks held and when low
on memory.

Later on, a_ops.update_cmtime_deferred is called to tell the FS to
update cmtime due to a previous mmapped write.

The core changes have no effect on unmodified filesystems.  To opt in,
a filesystem should implement .update_cmtime_deferred (most likely by
using generic_update_cmtime_deferred) and must call either
mapping_flush_cmtime or mapping_test_clear_cmtime in .writepages.
Filesystems should avoid updating timestamps in ->page_mkwrite.

The reason that this is not completely automatic is that filesystems
without backing stores do not really fit in to this model.
Eventually, someone can add support.

I've converted ext4, xfs, and btrfs.  Converting most other
filesystems should be straightforward.

I wrote an xfstest for this.  ext4, xfs, and btrfs pass.  It's here:

https://github.com/amluto/xfstests/commit/5fbb72ac799cc44a9c4c6d3919f00a479202c899

This series is pullable from:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Changes from v3:
 - The new address space op is now called update_cmtime_deferred.
   Callers take care of protection from fs freezing and checking
   AS_CMTIME.  I fixed a deadlock in the freezer interaction.
 - Block plugs should be handled better.
 - Fixed an infinite loop in msync(MS_ASYNC).
 - Converted xfs and btrfs.
 - Misc minor cleanups.
 - Fixed a corner case: reclaim or migration could have cleaned all
   pages without updating cmtime.

Changes from v2:
 - The core code now interacts with filesystems only through
   address_space ops, so there should be fewer layering issues.
 - MS_ASYNC is handled correctly.

Changes from v1:
 - inode_update_time_writable now locks against the fs freezer.
 - Minor cleanups.
 - Major changelog improvements.

Andy Lutomirski (7):
  mm: Track mappings that have been written via ptes
  fs: Add inode_update_time_writable
  mm: Allow filesystems to defer cmtime updates
  mm: Scan for dirty ptes and update cmtime on MS_ASYNC
  ext4: Defer mmap cmtime updates
  btrfs: Defer mmap cmtime updates
  xfs: Defer mmap cmtime updates

 fs/btrfs/extent_io.c      |  1 +
 fs/btrfs/inode.c          | 32 +++++++++---------
 fs/buffer.c               |  7 ----
 fs/ext4/inode.c           | 11 +++++--
 fs/inode.c                | 64 +++++++++++++++++++++++++++---------
 fs/xfs/xfs_aops.c         |  1 +
 include/linux/fs.h        |  9 +++++
 include/linux/pagemap.h   | 22 +++++++++++++
 include/linux/writeback.h |  1 +
 mm/memory.c               |  7 +++-
 mm/migrate.c              |  2 ++
 mm/mmap.c                 |  6 +++-
 mm/msync.c                | 84 ++++++++++++++++++++++++++++++++++++++++-------
 mm/page-writeback.c       | 53 +++++++++++++++++++++++++++++-
 mm/rmap.c                 | 27 +++++++++++++--
 mm/vmscan.c               |  1 +
 16 files changed, 272 insertions(+), 56 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/7] mm: Track mappings that have been written via ptes
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 2/7] fs: Add inode_update_time_writable Andy Lutomirski
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This will allow the mm code to figure out when a file has been
changed through a writable mmap.  Future changes will use this
information to update the file timestamp after writes.

This is handled in core mm code for two reasons:

1. Performance.  Setting a bit directly is faster than an indirect
   call to a vma op.

2. Simplicity.  The cmtime bit is set with lots of mm locks held.
   Rather than making filesystems add a new vm operation that needs
   to be aware of locking, it's easier to just get it right in core
   code.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/pagemap.h | 16 ++++++++++++++++
 mm/memory.c             |  7 ++++++-
 mm/rmap.c               | 27 +++++++++++++++++++++++++--
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..9a461ee 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_CMTIME	= __GFP_BITS_SHIFT + 5, /* cmtime update deferred */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -74,6 +75,21 @@ static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
 }
 
+static inline void mapping_set_cmtime(struct address_space * mapping)
+{
+	set_bit(AS_CMTIME, &mapping->flags);
+}
+
+static inline bool mapping_test_cmtime(struct address_space * mapping)
+{
+	return test_bit(AS_CMTIME, &mapping->flags);
+}
+
+static inline bool mapping_test_clear_cmtime(struct address_space * mapping)
+{
+	return test_and_clear_bit(AS_CMTIME, &mapping->flags);
+}
+
 /*
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
diff --git a/mm/memory.c b/mm/memory.c
index 4026841..1737a90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1150,8 +1150,13 @@ again:
 			if (PageAnon(page))
 				rss[MM_ANONPAGES]--;
 			else {
-				if (pte_dirty(ptent))
+				if (pte_dirty(ptent)) {
+					struct address_space *mapping =
+						page_mapping(page);
+					if (mapping)
+						mapping_set_cmtime(mapping);
 					set_page_dirty(page);
+				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index b2e29ac..2e3fb27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -928,6 +928,10 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)
 		}
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (ret)
+		mapping_set_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1179,6 +1183,19 @@ out:
 }
 
 /*
+ * Mark a page's mapping for future cmtime update.  It's safe to call this
+ * on any page, but it only has any effect if the page is backed by a mapping
+ * that uses mapping_test_clear_cmtime to handle file time updates.  This means
+ * that there's no need to call this on for non-VM_SHARED vmas.
+ */
+static void page_set_cmtime(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	if (mapping)
+		mapping_set_cmtime(mapping);
+}
+
+/*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
  */
@@ -1219,8 +1236,11 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
 		set_page_dirty(page);
+		if (vma->vm_flags & VM_SHARED)
+			page_set_cmtime(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1413,8 +1433,11 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		}
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval)) {
 			set_page_dirty(page);
+			if (vma->vm_flags & VM_SHARED)
+				page_set_cmtime(page);
+		}
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/7] fs: Add inode_update_time_writable
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 1/7] mm: Track mappings that have been written via ptes Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-09-04 18:27   ` [PATCH] Ensure prepare_update_cmtime() returns an initialized value (was: Re: [PATCH v4 2/7] fs: Add inode_update_time_writable) Sylvain 'ythier' Hitier
  2013-08-23  0:03 ` [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates Andy Lutomirski
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This is like file_update_time, except that it acts on a struct inode *
instead of a struct file *.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/inode.c         | 64 +++++++++++++++++++++++++++++++++++++++++-------------
 include/linux/fs.h |  1 +
 2 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d6dfb09..2bbcb19 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1637,6 +1637,34 @@ int file_remove_suid(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_suid);
 
+/*
+ * This does the work that's common to file_update_time and
+ * inode_update_time.
+ */
+static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
+{
+	int sync_it;
+
+	/* First try to exhaust all avenues to not sync */
+	if (IS_NOCMTIME(inode))
+		return 0;
+
+	*now = current_fs_time(inode->i_sb);
+	if (!timespec_equal(&inode->i_mtime, now))
+		sync_it = S_MTIME;
+
+	if (!timespec_equal(&inode->i_ctime, now))
+		sync_it |= S_CTIME;
+
+	if (IS_I_VERSION(inode))
+		sync_it |= S_VERSION;
+
+	if (!sync_it)
+		return 0;
+
+	return sync_it;
+}
+
 /**
  *	file_update_time	-	update mtime and ctime time
  *	@file: file accessed
@@ -1654,23 +1682,9 @@ int file_update_time(struct file *file)
 {
 	struct inode *inode = file_inode(file);
 	struct timespec now;
-	int sync_it = 0;
+	int sync_it = prepare_update_cmtime(inode, &now);
 	int ret;
 
-	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
-		return 0;
-
-	now = current_fs_time(inode->i_sb);
-	if (!timespec_equal(&inode->i_mtime, &now))
-		sync_it = S_MTIME;
-
-	if (!timespec_equal(&inode->i_ctime, &now))
-		sync_it |= S_CTIME;
-
-	if (IS_I_VERSION(inode))
-		sync_it |= S_VERSION;
-
 	if (!sync_it)
 		return 0;
 
@@ -1685,6 +1699,26 @@ int file_update_time(struct file *file)
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime time
+ *	@inode: inode accessed
+ *
+ *	This is like file_update_time, but it assumes the mnt is
+ *	writable and not frozen and takes an inode parameter instead.
+ */
+
+int inode_update_time_writable(struct inode *inode)
+{
+	struct timespec now;
+	int sync_it = prepare_update_cmtime(inode, &now);
+
+	if (!sync_it)
+		return 0;
+
+	return update_time(inode, &now, sync_it);
+}
+EXPORT_SYMBOL(inode_update_time_writable);
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..86cf0a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2590,6 +2590,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern int file_update_time(struct file *file);
+extern int inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 1/7] mm: Track mappings that have been written via ptes Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 2/7] fs: Add inode_update_time_writable Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-09-04 14:57   ` Jan Kara
  2013-08-23  0:03 ` [PATCH v4 4/7] mm: Scan for dirty ptes and update cmtime on MS_ASYNC Andy Lutomirski
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, most of which eventually call ->writepages.  (The
   exceptions are vmscan and migration.)

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.

Filesystems are responsible for checking for pending deferred cmtime
updates in .writepages (a helper is provided for this purpose) and
for doing the actual update in .update_cmtime_deferred.

These changes have no effect by themselves; filesystems must opt in
by implementing .update_cmtime_deferred and removing any
file_update_time call in .page_mkwrite.

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/fs.h        |  8 +++++++
 include/linux/pagemap.h   |  6 ++++++
 include/linux/writeback.h |  1 +
 mm/migrate.c              |  2 ++
 mm/mmap.c                 |  6 +++++-
 mm/page-writeback.c       | 53 ++++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c               |  1 +
 7 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f6b0f8b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,14 @@ struct address_space_operations {
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
 
+	/*
+	 * Called when a deferred cmtime update should be applied.
+	 * Implementations should update cmtime.  (As an optional
+	 * optimization, implementaions can call mapping_test_clear_cmtime
+	 * from writepages as well.)
+	 */
+	void (*update_cmtime_deferred)(struct address_space *);
+
 	/* Set a page dirty.  Return true if this dirtied it */
 	int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9a461ee..2647a13 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -90,6 +90,12 @@ static inline bool mapping_test_clear_cmtime(struct address_space * mapping)
 	return test_and_clear_bit(AS_CMTIME, &mapping->flags);
 }
 
+/* Use this one in writepages, etc. */
+extern void mapping_flush_cmtime(struct address_space * mapping);
+
+/* Use this one outside writeback. */
+extern void mapping_flush_cmtime_nowb(struct address_space * mapping);
+
 /*
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..efe4970 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
+void generic_update_cmtime_deferred(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f0c244..e4124e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -627,6 +627,8 @@ static int writeout(struct address_space *mapping, struct page *page)
 		/* unlocked. Relock */
 		lock_page(page);
 
+	mapping_flush_cmtime(mapping);
+
 	return (rc < 0) ? -EIO : -EAGAIN;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..189eb7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,11 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
+	if (vma->vm_file) {
+		if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping)
+			mapping_flush_cmtime_nowb(vma->vm_file->f_mapping);
 		fput(vma->vm_file);
+	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..4ec8c02 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1912,12 +1912,30 @@ int generic_writepages(struct address_space *mapping,
 
 	blk_start_plug(&plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
+	mapping_flush_cmtime(mapping);
 	blk_finish_plug(&plug);
 	return ret;
 }
-
 EXPORT_SYMBOL(generic_writepages);
 
+/**
+ * generic_update_cmtime_deferred - update cmtime after an mmapped write
+ * @mapping: The mapping
+ *
+ * This library function implements .update_cmtime_deferred.  It is unlikely
+ * that any filesystem will want to do anything here except update the time
+ * (using this helper) or nothing at all (by leaving .update_cmtime_deferred
+ * NULL).
+ */
+void generic_update_cmtime_deferred(struct address_space *mapping)
+{
+	struct blk_plug plug;
+	blk_start_plug(&plug);
+	inode_update_time_writable(mapping->host);
+	blk_finish_plug(&plug);
+}
+EXPORT_SYMBOL(generic_update_cmtime_deferred);
+
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
@@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+void mapping_flush_cmtime(struct address_space *mapping)
+{
+	if (mapping_test_clear_cmtime(mapping) &&
+	    mapping->a_ops->update_cmtime_deferred)
+		mapping->a_ops->update_cmtime_deferred(mapping);
+}
+EXPORT_SYMBOL(mapping_flush_cmtime);
+
+void mapping_flush_cmtime_nowb(struct address_space *mapping)
+{
+	/*
+	 * We get called from munmap and msync.  Both calls can race
+	 * with fs freezing.  If the fs is frozen after
+	 * mapping_test_clear_cmtime but before the time update, then
+	 * sync_filesystem will miss the cmtime update (because we
+	 * just cleared it) and we don't be able to write (because the
+	 * fs is frozen).  On the other hand, we can't just return if
+	 * we're in the SB_FREEZE_PAGEFAULT state because our caller
+	 * expects the timestamp to be synchronously updated.  So we
+	 * get write access without blocking, at the SB_FREEZE_FS
+	 * level.  If the fs is already fully frozen, then we already
+	 * know we have nothing to do.
+	 */
+
+	if (!mapping_test_cmtime(mapping))
+		return;  /* Optimization: nothing to do. */
+
+	if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
+		mapping_flush_cmtime(mapping);
+		__sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
+	}
+}
+
 /*
  * For address_spaces which do not use buffers nor write back.
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..3b759e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -429,6 +429,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		res = mapping->a_ops->writepage(page, &wbc);
 		if (res < 0)
 			handle_write_error(mapping, page, res);
+		mapping_flush_cmtime(mapping);
 		if (res == AOP_WRITEPAGE_ACTIVATE) {
 			ClearPageReclaim(page);
 			return PAGE_ACTIVATE;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/7] mm: Scan for dirty ptes and update cmtime on MS_ASYNC
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (2 preceding siblings ...)
  2013-08-23  0:03 ` [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 5/7] ext4: Defer mmap cmtime updates Andy Lutomirski
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This is probably unimportant but improves POSIX compliance.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 mm/msync.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 632df45..a2ee43c 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,13 +13,16 @@
 #include <linux/file.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * Now all it does is ensure that file timestamps get updated, since POSIX
+ * requires it.  We track dirty pages correct without MS_ASYNC.
  *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
@@ -28,6 +31,54 @@
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
+
+static int msync_async_range(struct vm_area_struct *vma,
+			      unsigned long *start, unsigned long end)
+{
+	struct mm_struct *mm;
+	int iters = 0;
+
+	while (*start < end && *start < vma->vm_end && iters < 128) {
+		unsigned int page_mask, page_increm;
+
+		/*
+		 * Require that the pte is writable (because otherwise
+		 * it can't be dirty, so there's nothing to clean).
+		 *
+		 * In theory we could check the pte dirty bit, but this is
+		 * awkward and barely worth it.
+		 */
+		struct page *page = follow_page_mask(vma, *start,
+						     FOLL_GET | FOLL_WRITE,
+						     &page_mask);
+
+		if (page && !IS_ERR(page)) {
+			if (lock_page_killable(page) == 0) {
+				page_mkclean(page);
+				unlock_page(page);
+			}
+			put_page(page);
+		}
+
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		page_increm = 1 + (~(*start >> PAGE_SHIFT) & page_mask);
+		*start += page_increm * PAGE_SIZE;
+		cond_resched();
+		iters++;
+	}
+
+	/* XXX: try to do this only once? */
+	mapping_flush_cmtime_nowb(vma->vm_file->f_mapping);
+
+	/* Give mmap_sem writers a chance. */
+	mm = current->mm;
+	up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	return 0;
+}
+
 SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 {
 	unsigned long end;
@@ -77,18 +128,29 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			goto out_unlock;
 		}
 		file = vma->vm_file;
-		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
-			get_file(file);
-			up_read(&mm->mmap_sem);
-			error = vfs_fsync(file, 0);
-			fput(file);
-			if (error || start >= end)
-				goto out;
-			down_read(&mm->mmap_sem);
+		if (file && vma->vm_flags & VM_SHARED) {
+			if (flags & MS_SYNC) {
+				start = vma->vm_end;
+				get_file(file);
+				up_read(&mm->mmap_sem);
+				error = vfs_fsync(file, 0);
+				fput(file);
+				if (error || start >= end)
+					goto out;
+				down_read(&mm->mmap_sem);
+			} else if ((vma->vm_flags & VM_WRITE) &&
+				   file->f_mapping) {
+				error = msync_async_range(vma, &start, end);
+				if (error || start >= end)
+					goto out_unlock;
+			} else {
+				start = vma->vm_end;
+				if (start >= end)
+					goto out_unlock;
+			}
 			vma = find_vma(mm, start);
 		} else {
+			start = vma->vm_end;
 			if (start >= end) {
 				error = 0;
 				goto out_unlock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 5/7] ext4: Defer mmap cmtime updates
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (3 preceding siblings ...)
  2013-08-23  0:03 ` [PATCH v4 4/7] mm: Scan for dirty ptes and update cmtime on MS_ASYNC Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 6/7] btrfs: " Andy Lutomirski
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/ext4/inode.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd32a2e..2cb2961 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2382,8 +2382,11 @@ static int ext4_writepages(struct address_space *mapping,
 	 * a transaction for special inodes like journal inode on last iput()
 	 * because that could violate lock ordering on umount
 	 */
-	if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+	if (!mapping->nrpages ||
+	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		mapping_flush_cmtime(mapping);
 		return 0;
+	}
 
 	if (ext4_should_journal_data(inode)) {
 		struct blk_plug plug;
@@ -2391,6 +2394,7 @@ static int ext4_writepages(struct address_space *mapping,
 
 		blk_start_plug(&plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
+		mapping_flush_cmtime(mapping);
 		blk_finish_plug(&plug);
 		return ret;
 	}
@@ -2525,6 +2529,7 @@ retry:
 		if (ret)
 			break;
 	}
+	mapping_flush_cmtime(mapping);
 	blk_finish_plug(&plug);
 	if (!ret && !cycled) {
 		cycled = 1;
@@ -3238,6 +3243,7 @@ static const struct address_space_operations ext4_aops = {
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_write_end,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -3254,6 +3260,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_journalled_write_end,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
 	.set_page_dirty		= ext4_journalled_set_page_dirty,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_journalled_invalidatepage,
@@ -3270,6 +3277,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_da_write_begin,
 	.write_end		= ext4_da_write_end,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_da_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -5025,7 +5033,6 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int retries = 0;
 
 	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 6/7] btrfs: Defer mmap cmtime updates
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (4 preceding siblings ...)
  2013-08-23  0:03 ` [PATCH v4 5/7] ext4: Defer mmap cmtime updates Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-08-23  0:03 ` [PATCH v4 7/7] xfs: " Andy Lutomirski
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/btrfs/extent_io.c |  1 +
 fs/btrfs/inode.c     | 32 ++++++++++++++++----------------
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fe443fe..dc2f851 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3756,6 +3756,7 @@ int extent_writepages(struct extent_io_tree *tree,
 				       __extent_writepage, &epd,
 				       flush_write_bio);
 	flush_epd_write_bio(&epd);
+	mapping_flush_cmtime(mapping);
 	return ret;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 021694c..fc51380 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7499,10 +7499,8 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	sb_start_pagefault(inode->i_sb);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
-	if (!ret) {
-		ret = file_update_time(vma->vm_file);
+	if (!ret)
 		reserved = 1;
-	}
 	if (ret) {
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
@@ -8711,22 +8709,24 @@ static struct extent_io_ops btrfs_extent_io_ops = {
  * For now we're avoiding this by dropping bmap.
  */
 static const struct address_space_operations btrfs_aops = {
-	.readpage	= btrfs_readpage,
-	.writepage	= btrfs_writepage,
-	.writepages	= btrfs_writepages,
-	.readpages	= btrfs_readpages,
-	.direct_IO	= btrfs_direct_IO,
-	.invalidatepage = btrfs_invalidatepage,
-	.releasepage	= btrfs_releasepage,
-	.set_page_dirty	= btrfs_set_page_dirty,
-	.error_remove_page = generic_error_remove_page,
+	.readpage		= btrfs_readpage,
+	.writepage		= btrfs_writepage,
+	.writepages		= btrfs_writepages,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
+	.readpages		= btrfs_readpages,
+	.direct_IO		= btrfs_direct_IO,
+	.invalidatepage		= btrfs_invalidatepage,
+	.releasepage		= btrfs_releasepage,
+	.set_page_dirty		= btrfs_set_page_dirty,
+	.error_remove_page	= generic_error_remove_page,
 };
 
 static const struct address_space_operations btrfs_symlink_aops = {
-	.readpage	= btrfs_readpage,
-	.writepage	= btrfs_writepage,
-	.invalidatepage = btrfs_invalidatepage,
-	.releasepage	= btrfs_releasepage,
+	.readpage		= btrfs_readpage,
+	.writepage		= btrfs_writepage,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
+	.invalidatepage 	= btrfs_invalidatepage,
+	.releasepage		= btrfs_releasepage,
 };
 
 static const struct inode_operations btrfs_file_inode_operations = {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 7/7] xfs: Defer mmap cmtime updates
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (5 preceding siblings ...)
  2013-08-23  0:03 ` [PATCH v4 6/7] btrfs: " Andy Lutomirski
@ 2013-08-23  0:03 ` Andy Lutomirski
  2013-08-23  1:12 ` [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  0:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This involves a change to block_page_mkwrite.  xfs is the only user.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/buffer.c       | 7 -------
 fs/xfs/xfs_aops.c | 1 +
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d74335..408677c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2431,13 +2431,6 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
 
 	sb_start_pagefault(sb);
-
-	/*
-	 * Update file times before taking page lock. We may end up failing the
-	 * fault so this update may be superfluous but who really cares...
-	 */
-	file_update_time(vma->vm_file);
-
 	ret = __block_page_mkwrite(vma, vmf, get_block);
 	sb_end_pagefault(sb);
 	return block_page_mkwrite_return(ret);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 596ec71..aa8fbcf 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1668,6 +1668,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
+	.update_cmtime_deferred	= generic_update_cmtime_deferred,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.write_begin		= xfs_vm_write_begin,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (6 preceding siblings ...)
  2013-08-23  0:03 ` [PATCH v4 7/7] xfs: " Andy Lutomirski
@ 2013-08-23  1:12 ` Andy Lutomirski
  2013-09-04 14:06 ` Jan Kara
  2013-09-04 15:08 ` Jan Kara
  9 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-08-23  1:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On 08/22/2013 05:03 PM, Andy Lutomirski wrote:
> Writes via mmap currently update mtime and ctime in ->page_mkwrite.

The subject should be [PATCH v4 0.7]...  Sorry for the cut-and-pasteo.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (7 preceding siblings ...)
  2013-08-23  1:12 ` [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
@ 2013-09-04 14:06 ` Jan Kara
  2013-09-04 15:08 ` Jan Kara
  9 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2013-09-04 14:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On Thu 22-08-13 17:03:16, Andy Lutomirski wrote:
> Writes via mmap currently update mtime and ctime in ->page_mkwrite.
> This hurts both throughput and latency.  In workloads that dirty a
> large number of mmapped pages, ->page_mkwrite can be hot and
> file_update_time is slow and scales poorly.  Updating timestamps can
> also sleep, which hurts latency for real-time workloads.
> 
> This is also a correctness issue.  SuS says:
> 
>     The st_ctime and st_mtime fields of a file that is mapped with
>     MAP_SHARED and PROT_WRITE, will be marked for update at some point
>     in the interval between a write reference to the mapped region and
>     the next call to msync() with MS_ASYNC or MS_SYNC for that portion
>     of the file by any process. If there is no such call, these fields
>     may be marked for update at any time after a write reference if
>     the underlying file is modified as a result.
> 
> Currently, if the same mmapped page is written twice, the timestamp
> may not be update at all after the second write, whereas SuS (and
> anything using timestamps to invalidate caches, backup data, etc.)
> would expect the timestamp to eventually be updated.
> 
> This patchset attempts to fix both issues at once.  It adds a new
> address_space flag AS_CMTIME that is set atomically whenever the
> system transfers a pte dirty bit to a struct page backed by the
> address_space.  This can happen with various locks held and when low
> on memory.
> 
> Later on, a_ops.update_cmtime_deferred is called to tell the FS to
> update cmtime due to a previous mmapped write.
> 
> The core changes have no effect on unmodified filesystems.  To opt in,
> a filesystem should implement .update_cmtime_deferred (most likely by
> using generic_update_cmtime_deferred) and must call either
> mapping_flush_cmtime or mapping_test_clear_cmtime in .writepages.
> Filesystems should avoid updating timestamps in ->page_mkwrite.
> 
> The reason that this is not completely automatic is that filesystems
> without backing stores do not really fit in to this model.
> Eventually, someone can add support.
> 
> I've converted ext4, xfs, and btrfs.  Converting most other
> filesystems should be straightforward.
> 
> I wrote an xfstest for this.  ext4, xfs, and btrfs pass.  It's here:
> 
> https://github.com/amluto/xfstests/commit/5fbb72ac799cc44a9c4c6d3919f00a479202c899
> 
> This series is pullable from:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4
  As a general note, I think you should CC linux-mm@kvack.org on this
series so that mm guys are more likely to notice it. Since the patches
touch mm you should probably get some opinions from them...

								Honza
> 
> Changes from v3:
>  - The new address space op is now called update_cmtime_deferred.
>    Callers take care of protection from fs freezing and checking
>    AS_CMTIME.  I fixed a deadlock in the freezer interaction.
>  - Block plugs should be handled better.
>  - Fixed an infinite loop in msync(MS_ASYNC).
>  - Converted xfs and btrfs.
>  - Misc minor cleanups.
>  - Fixed a corner case: reclaim or migration could have cleaned all
>    pages without updating cmtime.
> 
> Changes from v2:
>  - The core code now interacts with filesystems only through
>    address_space ops, so there should be fewer layering issues.
>  - MS_ASYNC is handled correctly.
> 
> Changes from v1:
>  - inode_update_time_writable now locks against the fs freezer.
>  - Minor cleanups.
>  - Major changelog improvements.
> 
> Andy Lutomirski (7):
>   mm: Track mappings that have been written via ptes
>   fs: Add inode_update_time_writable
>   mm: Allow filesystems to defer cmtime updates
>   mm: Scan for dirty ptes and update cmtime on MS_ASYNC
>   ext4: Defer mmap cmtime updates
>   btrfs: Defer mmap cmtime updates
>   xfs: Defer mmap cmtime updates
> 
>  fs/btrfs/extent_io.c      |  1 +
>  fs/btrfs/inode.c          | 32 +++++++++---------
>  fs/buffer.c               |  7 ----
>  fs/ext4/inode.c           | 11 +++++--
>  fs/inode.c                | 64 +++++++++++++++++++++++++++---------
>  fs/xfs/xfs_aops.c         |  1 +
>  include/linux/fs.h        |  9 +++++
>  include/linux/pagemap.h   | 22 +++++++++++++
>  include/linux/writeback.h |  1 +
>  mm/memory.c               |  7 +++-
>  mm/migrate.c              |  2 ++
>  mm/mmap.c                 |  6 +++-
>  mm/msync.c                | 84 ++++++++++++++++++++++++++++++++++++++++-------
>  mm/page-writeback.c       | 53 +++++++++++++++++++++++++++++-
>  mm/rmap.c                 | 27 +++++++++++++--
>  mm/vmscan.c               |  1 +
>  16 files changed, 272 insertions(+), 56 deletions(-)
> 
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
  2013-08-23  0:03 ` [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates Andy Lutomirski
@ 2013-09-04 14:57   ` Jan Kara
  2013-09-04 17:54     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2013-09-04 14:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On Thu 22-08-13 17:03:19, Andy Lutomirski wrote:
> Filesystems that defer cmtime updates should update cmtime when any
> of these events happen after a write via a mapping:
> 
>  - The mapping is written back to disk.  This happens from all kinds
>    of places, most of which eventually call ->writepages.  (The
>    exceptions are vmscan and migration.)
> 
>  - munmap is called or the mapping is removed when the process exits
> 
>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>    time between an mmaped write and the subsequent msync call.
>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> 
> Filesystems are responsible for checking for pending deferred cmtime
> updates in .writepages (a helper is provided for this purpose) and
> for doing the actual update in .update_cmtime_deferred.
> 
> These changes have no effect by themselves; filesystems must opt in
> by implementing .update_cmtime_deferred and removing any
> file_update_time call in .page_mkwrite.
> 
> This patch does not implement the MS_ASYNC case; that's in the next
> patch.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
...
> +/**
> + * generic_update_cmtime_deferred - update cmtime after an mmapped write
> + * @mapping: The mapping
> + *
> + * This library function implements .update_cmtime_deferred.  It is unlikely
> + * that any filesystem will want to do anything here except update the time
> + * (using this helper) or nothing at all (by leaving .update_cmtime_deferred
> + * NULL).
> + */
> +void generic_update_cmtime_deferred(struct address_space *mapping)
> +{
> +	struct blk_plug plug;
> +	blk_start_plug(&plug);
> +	inode_update_time_writable(mapping->host);
> +	blk_finish_plug(&plug);
> +}
> +EXPORT_SYMBOL(generic_update_cmtime_deferred);
> +
  You can remove the pluggin here. Inode update will likely result in a
single write so there's no point.

> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
>  }
>  EXPORT_SYMBOL(write_one_page);
>  
> +void mapping_flush_cmtime(struct address_space *mapping)
> +{
> +	if (mapping_test_clear_cmtime(mapping) &&
> +	    mapping->a_ops->update_cmtime_deferred)
> +		mapping->a_ops->update_cmtime_deferred(mapping);
> +}
> +EXPORT_SYMBOL(mapping_flush_cmtime);
  Hum, is there a reason for update_cmtime_deferred() operation? I can
hardly imagine anyone will want to do anything else than what
inode_update_time_writable() does so why bother? You mention tmpfs & co.
don't fit into your scheme well with which I agree so let's just keep
file_update_time() in their page_mkwrite() operation. But I don't see a
real need for avoiding the deferred cmtime logic...

> +
> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
> +{
> +	/*
> +	 * We get called from munmap and msync.  Both calls can race
> +	 * with fs freezing.  If the fs is frozen after
> +	 * mapping_test_clear_cmtime but before the time update, then
> +	 * sync_filesystem will miss the cmtime update (because we
> +	 * just cleared it) and we don't be able to write (because the
> +	 * fs is frozen).  On the other hand, we can't just return if
> +	 * we're in the SB_FREEZE_PAGEFAULT state because our caller
> +	 * expects the timestamp to be synchronously updated.  So we
> +	 * get write access without blocking, at the SB_FREEZE_FS
> +	 * level.  If the fs is already fully frozen, then we already
> +	 * know we have nothing to do.
> +	 */
> +
> +	if (!mapping_test_cmtime(mapping))
> +		return;  /* Optimization: nothing to do. */
> +
> +	if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
> +		mapping_flush_cmtime(mapping);
> +		__sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
> +	}
> +}
  This is wrong because SB_FREEZE_FS level is targetted for filesystem
internal use. Also it is racy. mapping_flush_cmtime() ends up calling
mark_inode_dirty() and filesystems such as ext4 or xfs will start a
transaction to store inode in the journal. This gets freeze protection at
SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
SB_FREEZE_FS before this second protection, things will deadlock.

Since the callers of this function hold mmap_sem, using SB_FREEZE_PAGEFAULT
protection would be appropriate. Also since there are just two places that
need the freeze protection I'd be inclined to open code the protection in
the two places rather than hiding it in a special function.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes
  2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
                   ` (8 preceding siblings ...)
  2013-09-04 14:06 ` Jan Kara
@ 2013-09-04 15:08 ` Jan Kara
  2013-09-04 17:33   ` Andy Lutomirski
  9 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2013-09-04 15:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On Thu 22-08-13 17:03:16, Andy Lutomirski wrote:
> Writes via mmap currently update mtime and ctime in ->page_mkwrite.
> This hurts both throughput and latency.  In workloads that dirty a
> large number of mmapped pages, ->page_mkwrite can be hot and
> file_update_time is slow and scales poorly.  Updating timestamps can
> also sleep, which hurts latency for real-time workloads.
  It would help to make your case if you posted the latency comparison
before & after the patchset in this introductory email. We can then see
how significant is the reduction of latency...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes
  2013-09-04 15:08 ` Jan Kara
@ 2013-09-04 17:33   ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-09-04 17:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Wed, Sep 4, 2013 at 8:08 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 22-08-13 17:03:16, Andy Lutomirski wrote:
>> Writes via mmap currently update mtime and ctime in ->page_mkwrite.
>> This hurts both throughput and latency.  In workloads that dirty a
>> large number of mmapped pages, ->page_mkwrite can be hot and
>> file_update_time is slow and scales poorly.  Updating timestamps can
>> also sleep, which hurts latency for real-time workloads.
>   It would help to make your case if you posted the latency comparison
> before & after the patchset in this introductory email. We can then see
> how significant is the reduction of latency...

Will do, although the data from my workload will be a little strange.

I was hoping that Dave Hansen would re-run his benchmark with these
patches applied.  I tried to run it, but it wasn't obvious what the
numbers that spewed out meant.

--Andy

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
  2013-09-04 14:57   ` Jan Kara
@ 2013-09-04 17:54     ` Andy Lutomirski
  2013-09-04 19:20       ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Lutomirski @ 2013-09-04 17:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Wed, Sep 4, 2013 at 7:57 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 22-08-13 17:03:19, Andy Lutomirski wrote:
>> Filesystems that defer cmtime updates should update cmtime when any
>> of these events happen after a write via a mapping:
>>
>>  - The mapping is written back to disk.  This happens from all kinds
>>    of places, most of which eventually call ->writepages.  (The
>>    exceptions are vmscan and migration.)
>>
>>  - munmap is called or the mapping is removed when the process exits
>>
>>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>>    time between an mmaped write and the subsequent msync call.
>>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>>
>> Filesystems are responsible for checking for pending deferred cmtime
>> updates in .writepages (a helper is provided for this purpose) and
>> for doing the actual update in .update_cmtime_deferred.
>>
>> These changes have no effect by themselves; filesystems must opt in
>> by implementing .update_cmtime_deferred and removing any
>> file_update_time call in .page_mkwrite.
>>
>> This patch does not implement the MS_ASYNC case; that's in the next
>> patch.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ...
>> +/**
>> + * generic_update_cmtime_deferred - update cmtime after an mmapped write
>> + * @mapping: The mapping
>> + *
>> + * This library function implements .update_cmtime_deferred.  It is unlikely
>> + * that any filesystem will want to do anything here except update the time
>> + * (using this helper) or nothing at all (by leaving .update_cmtime_deferred
>> + * NULL).
>> + */
>> +void generic_update_cmtime_deferred(struct address_space *mapping)
>> +{
>> +     struct blk_plug plug;
>> +     blk_start_plug(&plug);
>> +     inode_update_time_writable(mapping->host);
>> +     blk_finish_plug(&plug);
>> +}
>> +EXPORT_SYMBOL(generic_update_cmtime_deferred);
>> +
>   You can remove the pluggin here. Inode update will likely result in a
> single write so there's no point.
>
>> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
>>  }
>>  EXPORT_SYMBOL(write_one_page);
>>
>> +void mapping_flush_cmtime(struct address_space *mapping)
>> +{
>> +     if (mapping_test_clear_cmtime(mapping) &&
>> +         mapping->a_ops->update_cmtime_deferred)
>> +             mapping->a_ops->update_cmtime_deferred(mapping);
>> +}
>> +EXPORT_SYMBOL(mapping_flush_cmtime);
>   Hum, is there a reason for update_cmtime_deferred() operation? I can
> hardly imagine anyone will want to do anything else than what
> inode_update_time_writable() does so why bother? You mention tmpfs & co.
> don't fit into your scheme well with which I agree so let's just keep
> file_update_time() in their page_mkwrite() operation. But I don't see a
> real need for avoiding the deferred cmtime logic...

I think there might be odd corner cases.  For example, mmap a tmpfs
file, write it, and unmap it.  Then, an hour later, maybe the system
will be under memory pressure and page out the file.  This could
trigger a surprising time update.  (I'm not sure this can actually
happen on tmpfs, but maybe it would on some other filesystem.)

Does this actually matter?  A flag to turn the feature on or off would
do the trick, but I don't think there's precedent for sticking a flag
in a_ops.

>
>> +
>> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
>> +{
>> +     /*
>> +      * We get called from munmap and msync.  Both calls can race
>> +      * with fs freezing.  If the fs is frozen after
>> +      * mapping_test_clear_cmtime but before the time update, then
>> +      * sync_filesystem will miss the cmtime update (because we
>> +      * just cleared it) and we don't be able to write (because the
>> +      * fs is frozen).  On the other hand, we can't just return if
>> +      * we're in the SB_FREEZE_PAGEFAULT state because our caller
>> +      * expects the timestamp to be synchronously updated.  So we
>> +      * get write access without blocking, at the SB_FREEZE_FS
>> +      * level.  If the fs is already fully frozen, then we already
>> +      * know we have nothing to do.
>> +      */
>> +
>> +     if (!mapping_test_cmtime(mapping))
>> +             return;  /* Optimization: nothing to do. */
>> +
>> +     if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
>> +             mapping_flush_cmtime(mapping);
>> +             __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
>> +     }
>> +}
>   This is wrong because SB_FREEZE_FS level is targetted for filesystem
> internal use. Also it is racy. mapping_flush_cmtime() ends up calling
> mark_inode_dirty() and filesystems such as ext4 or xfs will start a
> transaction to store inode in the journal. This gets freeze protection at
> SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
> SB_FREEZE_FS before this second protection, things will deadlock.

Whoops -- I assumed that it was safe to recursively take freeze
protection at the same level.

I'm worried about the following race:

Thread 1 (in munmap):
Check AS_CMTIME set
sb_start_pagefault

Thread 2 (freezing the fs):
frozen = SB_FREEZE_PAGEFAULT;
sync_filesystem()

Thread 1 is now stuck.  It doesn't need to be, because sync_filesystem
will flush out the cmtime write.  But there doesn't seem to be a clean
mechanism to wait for the freeze to finish.

Is there a clean way to avoid this?  I don't want to return
immediately if a freeze is in progress, because userspace expects that
munmap will update cmtime synchronously.

And ugly but simple solution is:

if (!mapping_test_cmtime(mapping))
    return;  /* Optimization: nothing to do. */

if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
    mapping_flush_cmtime(mapping);
    __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
} else {
    /* Freeze is or was in progress.  The part of freezing from
SB_FREEZE_PAGEFAULT through sync_filesystem holds s_umount for write,
so we can wait for it to finish by taking s_umount for read. */
    down_read(&sb->s_umount);
    up_read(&sb->s_umount);
}

--Andy

>
> Since the callers of this function hold mmap_sem, using SB_FREEZE_PAGEFAULT
> protection would be appropriate. Also since there are just two places that
> need the freeze protection I'd be inclined to open code the protection in
> the two places rather than hiding it in a special function.

Given that this is rather subtle (I've gotten it wrong multiple
times), I'd rather leave it in one place and comment it well.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] Ensure prepare_update_cmtime() returns an initialized value (was: Re: [PATCH v4 2/7] fs: Add inode_update_time_writable)
  2013-08-23  0:03 ` [PATCH v4 2/7] fs: Add inode_update_time_writable Andy Lutomirski
@ 2013-09-04 18:27   ` Sylvain 'ythier' Hitier
  2013-09-04 18:30     ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Sylvain 'ythier' Hitier @ 2013-09-04 18:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andy Lutomirski

Hello,

This is to apply on top of
"[PATCH v4 2/7] fs: Add inode_update_time_writable"
dated Thu, 22 Aug 2013 17:03:18 -0700



Ensure prepare_update_cmtime() returns an initialized value.

While at it:
  - always use sync_it var as a bitfield
  - remove useless test

Signed-off-by: Sylvain "ythier" Hitier <sylvain.hitier@gmail.com>
---
 fs/inode.c |    7 ++-----
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 2bbcb19..edf225c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1643,7 +1643,7 @@ EXPORT_SYMBOL(file_remove_suid);
  */
 static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
 {
-	int sync_it;
+	int sync_it = 0;
 
 	/* First try to exhaust all avenues to not sync */
 	if (IS_NOCMTIME(inode))
@@ -1651,7 +1651,7 @@ static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
 
 	*now = current_fs_time(inode->i_sb);
 	if (!timespec_equal(&inode->i_mtime, now))
-		sync_it = S_MTIME;
+		sync_it |= S_MTIME;
 
 	if (!timespec_equal(&inode->i_ctime, now))
 		sync_it |= S_CTIME;
@@ -1659,9 +1659,6 @@ static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
 	if (IS_I_VERSION(inode))
 		sync_it |= S_VERSION;
 
-	if (!sync_it)
-		return 0;
-
 	return sync_it;
 }
 

Regards,
Sylvain "ythier" Hitier

-- 
Business is about being busy, not being rich...
Lived 777 days in a Debian package => http://en.wikipedia.org/wiki/Apt,_Vaucluse
There's THE room for ideals in this mechanical place!

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] Ensure prepare_update_cmtime() returns an initialized value (was: Re: [PATCH v4 2/7] fs: Add inode_update_time_writable)
  2013-09-04 18:27   ` [PATCH] Ensure prepare_update_cmtime() returns an initialized value (was: Re: [PATCH v4 2/7] fs: Add inode_update_time_writable) Sylvain 'ythier' Hitier
@ 2013-09-04 18:30     ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-09-04 18:30 UTC (permalink / raw)
  To: Sylvain 'ythier' Hitier; +Cc: linux-kernel

I'll integrate that into v5.

--Andy

On Wed, Sep 4, 2013 at 11:27 AM, Sylvain 'ythier' Hitier
<sylvain.hitier@gmail.com> wrote:
> Hello,
>
> This is to apply on top of
> "[PATCH v4 2/7] fs: Add inode_update_time_writable"
> dated Thu, 22 Aug 2013 17:03:18 -0700
>
>
>
> Ensure prepare_update_cmtime() returns an initialized value.
>
> While at it:
>   - always use sync_it var as a bitfield
>   - remove useless test
>
> Signed-off-by: Sylvain "ythier" Hitier <sylvain.hitier@gmail.com>
> ---
>  fs/inode.c |    7 ++-----
>  1 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 2bbcb19..edf225c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1643,7 +1643,7 @@ EXPORT_SYMBOL(file_remove_suid);
>   */
>  static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
>  {
> -       int sync_it;
> +       int sync_it = 0;
>
>         /* First try to exhaust all avenues to not sync */
>         if (IS_NOCMTIME(inode))
> @@ -1651,7 +1651,7 @@ static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
>
>         *now = current_fs_time(inode->i_sb);
>         if (!timespec_equal(&inode->i_mtime, now))
> -               sync_it = S_MTIME;
> +               sync_it |= S_MTIME;
>
>         if (!timespec_equal(&inode->i_ctime, now))
>                 sync_it |= S_CTIME;
> @@ -1659,9 +1659,6 @@ static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
>         if (IS_I_VERSION(inode))
>                 sync_it |= S_VERSION;
>
> -       if (!sync_it)
> -               return 0;
> -
>         return sync_it;
>  }
>
>
> Regards,
> Sylvain "ythier" Hitier
>
> --
> Business is about being busy, not being rich...
> Lived 777 days in a Debian package => http://en.wikipedia.org/wiki/Apt,_Vaucluse
> There's THE room for ideals in this mechanical place!



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
  2013-09-04 17:54     ` Andy Lutomirski
@ 2013-09-04 19:20       ` Jan Kara
  2013-09-04 20:05         ` Andy Lutomirski
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2013-09-04 19:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-kernel, linux-ext4, Dave Chinner,
	Theodore Ts'o, Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Wed 04-09-13 10:54:50, Andy Lutomirski wrote:
> >> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
> >>  }
> >>  EXPORT_SYMBOL(write_one_page);
> >>
> >> +void mapping_flush_cmtime(struct address_space *mapping)
> >> +{
> >> +     if (mapping_test_clear_cmtime(mapping) &&
> >> +         mapping->a_ops->update_cmtime_deferred)
> >> +             mapping->a_ops->update_cmtime_deferred(mapping);
> >> +}
> >> +EXPORT_SYMBOL(mapping_flush_cmtime);
> >   Hum, is there a reason for update_cmtime_deferred() operation? I can
> > hardly imagine anyone will want to do anything else than what
> > inode_update_time_writable() does so why bother? You mention tmpfs & co.
> > don't fit into your scheme well with which I agree so let's just keep
> > file_update_time() in their page_mkwrite() operation. But I don't see a
> > real need for avoiding the deferred cmtime logic...
> 
> I think there might be odd corner cases.  For example, mmap a tmpfs
> file, write it, and unmap it.  Then, an hour later, maybe the system
  If you unmap it then that will handle the update. But if you won't unmap,
you'd get spurious updates of timestamps which would be strange.

> will be under memory pressure and page out the file.  This could
> trigger a surprising time update.  (I'm not sure this can actually
> happen on tmpfs, but maybe it would on some other filesystem.)
> 
> Does this actually matter?  A flag to turn the feature on or off would
> do the trick, but I don't think there's precedent for sticking a flag
> in a_ops.
  Flag in a_ops is ugly. But you can have a flag in 'struct
filesystem_type' which would be reasonable. 

> >> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
> >> +{
> >> +     /*
> >> +      * We get called from munmap and msync.  Both calls can race
> >> +      * with fs freezing.  If the fs is frozen after
> >> +      * mapping_test_clear_cmtime but before the time update, then
> >> +      * sync_filesystem will miss the cmtime update (because we
> >> +      * just cleared it) and we don't be able to write (because the
> >> +      * fs is frozen).  On the other hand, we can't just return if
> >> +      * we're in the SB_FREEZE_PAGEFAULT state because our caller
> >> +      * expects the timestamp to be synchronously updated.  So we
> >> +      * get write access without blocking, at the SB_FREEZE_FS
> >> +      * level.  If the fs is already fully frozen, then we already
> >> +      * know we have nothing to do.
> >> +      */
> >> +
> >> +     if (!mapping_test_cmtime(mapping))
> >> +             return;  /* Optimization: nothing to do. */
> >> +
> >> +     if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
> >> +             mapping_flush_cmtime(mapping);
> >> +             __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
> >> +     }
> >> +}
> >   This is wrong because SB_FREEZE_FS level is targetted for filesystem
> > internal use. Also it is racy. mapping_flush_cmtime() ends up calling
> > mark_inode_dirty() and filesystems such as ext4 or xfs will start a
> > transaction to store inode in the journal. This gets freeze protection at
> > SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
> > SB_FREEZE_FS before this second protection, things will deadlock.
> 
> Whoops -- I assumed that it was safe to recursively take freeze
> protection at the same level.
> 
> I'm worried about the following race:
> 
> Thread 1 (in munmap):
> Check AS_CMTIME set
> sb_start_pagefault
> 
> Thread 2 (freezing the fs):
> frozen = SB_FREEZE_PAGEFAULT;
> sync_filesystem()
> 
> Thread 1 is now stuck.  It doesn't need to be, because sync_filesystem
> will flush out the cmtime write.  But there doesn't seem to be a clean
> mechanism to wait for the freeze to finish.
  OK, I see. Frankly, I'd rather live with msync() and munmap() blocking
while filesystem is frozen than trying to outsmart the freezing logic...
If someone comes up with a usecase where it causes trouble, we can always
improve the logic with some clever tricks.

> Is there a clean way to avoid this?  I don't want to return
> immediately if a freeze is in progress, because userspace expects that
> munmap will update cmtime synchronously.
> 
> And ugly but simple solution is:
> 
> if (!mapping_test_cmtime(mapping))
>     return;  /* Optimization: nothing to do. */
> 
> if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
>     mapping_flush_cmtime(mapping);
>     __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
> } else {
>     /* Freeze is or was in progress.  The part of freezing from
> SB_FREEZE_PAGEFAULT through sync_filesystem holds s_umount for write,
> so we can wait for it to finish by taking s_umount for read. */
>     down_read(&sb->s_umount);
>     up_read(&sb->s_umount);
> }
  Yes, this would probably work but as I said above, I'd prefer the to keep
it simple unless we have a good reason for the complex solution.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates
  2013-09-04 19:20       ` Jan Kara
@ 2013-09-04 20:05         ` Andy Lutomirski
  0 siblings, 0 replies; 18+ messages in thread
From: Andy Lutomirski @ 2013-09-04 20:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-ext4, Dave Chinner, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Wed, Sep 4, 2013 at 12:20 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 04-09-13 10:54:50, Andy Lutomirski wrote:
>> >> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
>> >>  }
>> >>  EXPORT_SYMBOL(write_one_page);
>> >>
>> >> +void mapping_flush_cmtime(struct address_space *mapping)
>> >> +{
>> >> +     if (mapping_test_clear_cmtime(mapping) &&
>> >> +         mapping->a_ops->update_cmtime_deferred)
>> >> +             mapping->a_ops->update_cmtime_deferred(mapping);
>> >> +}
>> >> +EXPORT_SYMBOL(mapping_flush_cmtime);
>> >   Hum, is there a reason for update_cmtime_deferred() operation? I can
>> > hardly imagine anyone will want to do anything else than what
>> > inode_update_time_writable() does so why bother? You mention tmpfs & co.
>> > don't fit into your scheme well with which I agree so let's just keep
>> > file_update_time() in their page_mkwrite() operation. But I don't see a
>> > real need for avoiding the deferred cmtime logic...
>>
>> I think there might be odd corner cases.  For example, mmap a tmpfs
>> file, write it, and unmap it.  Then, an hour later, maybe the system
>   If you unmap it then that will handle the update. But if you won't unmap,
> you'd get spurious updates of timestamps which would be strange.
>
>> will be under memory pressure and page out the file.  This could
>> trigger a surprising time update.  (I'm not sure this can actually
>> happen on tmpfs, but maybe it would on some other filesystem.)
>>
>> Does this actually matter?  A flag to turn the feature on or off would
>> do the trick, but I don't think there's precedent for sticking a flag
>> in a_ops.
>   Flag in a_ops is ugly. But you can have a flag in 'struct
> filesystem_type' which would be reasonable.

OK, will do.

>
>> >> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
>> >> +{
>> >> +     /*
>> >> +      * We get called from munmap and msync.  Both calls can race
>> >> +      * with fs freezing.  If the fs is frozen after
>> >> +      * mapping_test_clear_cmtime but before the time update, then
>> >> +      * sync_filesystem will miss the cmtime update (because we
>> >> +      * just cleared it) and we don't be able to write (because the
>> >> +      * fs is frozen).  On the other hand, we can't just return if
>> >> +      * we're in the SB_FREEZE_PAGEFAULT state because our caller
>> >> +      * expects the timestamp to be synchronously updated.  So we
>> >> +      * get write access without blocking, at the SB_FREEZE_FS
>> >> +      * level.  If the fs is already fully frozen, then we already
>> >> +      * know we have nothing to do.
>> >> +      */
>> >> +
>> >> +     if (!mapping_test_cmtime(mapping))
>> >> +             return;  /* Optimization: nothing to do. */
>> >> +
>> >> +     if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
>> >> +             mapping_flush_cmtime(mapping);
>> >> +             __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
>> >> +     }
>> >> +}
>> >   This is wrong because SB_FREEZE_FS level is targetted for filesystem
>> > internal use. Also it is racy. mapping_flush_cmtime() ends up calling
>> > mark_inode_dirty() and filesystems such as ext4 or xfs will start a
>> > transaction to store inode in the journal. This gets freeze protection at
>> > SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
>> > SB_FREEZE_FS before this second protection, things will deadlock.
>>
>> Whoops -- I assumed that it was safe to recursively take freeze
>> protection at the same level.
>>
>> I'm worried about the following race:
>>
>> Thread 1 (in munmap):
>> Check AS_CMTIME set
>> sb_start_pagefault
>>
>> Thread 2 (freezing the fs):
>> frozen = SB_FREEZE_PAGEFAULT;
>> sync_filesystem()
>>
>> Thread 1 is now stuck.  It doesn't need to be, because sync_filesystem
>> will flush out the cmtime write.  But there doesn't seem to be a clean
>> mechanism to wait for the freeze to finish.
>   OK, I see. Frankly, I'd rather live with msync() and munmap() blocking
> while filesystem is frozen than trying to outsmart the freezing logic...
> If someone comes up with a usecase where it causes trouble, we can always
> improve the logic with some clever tricks.

I'll at least check that it's a shared writable mapping before doing
the flush to avoid blocking on other types of munmap.

--Andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-09-04 20:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-23  0:03 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 1/7] mm: Track mappings that have been written via ptes Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 2/7] fs: Add inode_update_time_writable Andy Lutomirski
2013-09-04 18:27   ` [PATCH] Ensure prepare_update_cmtime() returns an initialized value (was: Re: [PATCH v4 2/7] fs: Add inode_update_time_writable) Sylvain 'ythier' Hitier
2013-09-04 18:30     ` Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates Andy Lutomirski
2013-09-04 14:57   ` Jan Kara
2013-09-04 17:54     ` Andy Lutomirski
2013-09-04 19:20       ` Jan Kara
2013-09-04 20:05         ` Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 4/7] mm: Scan for dirty ptes and update cmtime on MS_ASYNC Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 5/7] ext4: Defer mmap cmtime updates Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 6/7] btrfs: " Andy Lutomirski
2013-08-23  0:03 ` [PATCH v4 7/7] xfs: " Andy Lutomirski
2013-08-23  1:12 ` [PATCH v3 0/5] Rework mtime and ctime updates on mmaped writes Andy Lutomirski
2013-09-04 14:06 ` Jan Kara
2013-09-04 15:08 ` Jan Kara
2013-09-04 17:33   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).