All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/5] Rework mtime and ctime updates on mmaped
@ 2013-08-16 23:22 ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Writes via mmap currently update mtime and ctime in ->page_mkwrite.
This hurts both throughput and latency.  In workloads that dirty a
large number of mmapped pages, ->page_mkwrite can be hot and
file_update_time is slow and scales poorly.  Updating timestamps can
also sleep, which hurts latency for real-time workloads.

This is also a correctness issue.  SuS says:

    The st_ctime and st_mtime fields of a file that is mapped with
    MAP_SHARED and PROT_WRITE, will be marked for update at some point
    in the interval between a write reference to the mapped region and
    the next call to msync() with MS_ASYNC or MS_SYNC for that portion
    of the file by any process. If there is no such call, these fields
    may be marked for update at any time after a write reference if
    the underlying file is modified as a result.

Currently, if the same mmapped page is written twice, the timestamp
may not be update at all after the second write, whereas SuS (and
anything using timestamps to invalidate caches, backup data, etc.)
would expect the timestamp to eventually be updated.

This patchset attempts to fix both issues at once.  It adds a new
address_space flag AS_CMTIME that is set atomically whenever the
system transfers a pte dirty bit to a struct page backed by the
address_space.  This can happen with various locks held and when low
on memory.

Later on, a new address_space op ->flush_cmtime is called at various
points at which a filesystem should update timestamps if the file was
previously modified through mmap.

The core changes have no effect on unmodified filesystems.  To opt in, a filesystem should implement ->flush_ctime (most likely by using generic_flush_cmtime) and should avoid updating timestamps in ->page_mkwrite.

I've converted ext4.  If it works well, it will be easy to convert all
the other filesystems.

Changes from v2:
 - The core code now interacts with filesystems only through
   address_space ops, so there should be fewer layering issues.
 - MS_ASYNC is handled correctly.

Changes from v1:
 - inode_update_time_writable now locks against the fs freezer.
 - Minor cleanups.
 - Major changelog improvements.

Andy Lutomirski (5):
  mm: Track mappings that have been written via ptes
  fs: Add inode_update_time_writable
  mm: Notify filesystems when it's time to apply a deferred cmtime
    update
  mm: Scan for dirty ptes and update cmtime on MS_ASYNC
  ext4: Defer mmap cmtime update until writeback

 fs/ext4/inode.c           |  4 ++-
 fs/inode.c                | 72 +++++++++++++++++++++++++++++++---------
 include/linux/fs.h        | 10 ++++++
 include/linux/pagemap.h   | 11 +++++++
 include/linux/writeback.h |  1 +
 mm/memory.c               |  7 +++-
 mm/mmap.c                 |  9 ++++-
 mm/msync.c                | 83 ++++++++++++++++++++++++++++++++++++++++-------
 mm/page-writeback.c       | 26 +++++++++++++++
 mm/rmap.c                 | 27 +++++++++++++--
 10 files changed, 219 insertions(+), 31 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 0/5] Rework mtime and ctime updates on mmaped
@ 2013-08-16 23:22 ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

Writes via mmap currently update mtime and ctime in ->page_mkwrite.
This hurts both throughput and latency.  In workloads that dirty a
large number of mmapped pages, ->page_mkwrite can be hot and
file_update_time is slow and scales poorly.  Updating timestamps can
also sleep, which hurts latency for real-time workloads.

This is also a correctness issue.  SuS says:

    The st_ctime and st_mtime fields of a file that is mapped with
    MAP_SHARED and PROT_WRITE, will be marked for update at some point
    in the interval between a write reference to the mapped region and
    the next call to msync() with MS_ASYNC or MS_SYNC for that portion
    of the file by any process. If there is no such call, these fields
    may be marked for update at any time after a write reference if
    the underlying file is modified as a result.

Currently, if the same mmapped page is written twice, the timestamp
may not be update at all after the second write, whereas SuS (and
anything using timestamps to invalidate caches, backup data, etc.)
would expect the timestamp to eventually be updated.

This patchset attempts to fix both issues at once.  It adds a new
address_space flag AS_CMTIME that is set atomically whenever the
system transfers a pte dirty bit to a struct page backed by the
address_space.  This can happen with various locks held and when low
on memory.

Later on, a new address_space op ->flush_cmtime is called at various
points at which a filesystem should update timestamps if the file was
previously modified through mmap.

The core changes have no effect on unmodified filesystems.  To opt in, a filesystem should implement ->flush_ctime (most likely by using generic_flush_cmtime) and should avoid updating timestamps in ->page_mkwrite.

I've converted ext4.  If it works well, it will be easy to convert all
the other filesystems.

Changes from v2:
 - The core code now interacts with filesystems only through
   address_space ops, so there should be fewer layering issues.
 - MS_ASYNC is handled correctly.

Changes from v1:
 - inode_update_time_writable now locks against the fs freezer.
 - Minor cleanups.
 - Major changelog improvements.

Andy Lutomirski (5):
  mm: Track mappings that have been written via ptes
  fs: Add inode_update_time_writable
  mm: Notify filesystems when it's time to apply a deferred cmtime
    update
  mm: Scan for dirty ptes and update cmtime on MS_ASYNC
  ext4: Defer mmap cmtime update until writeback

 fs/ext4/inode.c           |  4 ++-
 fs/inode.c                | 72 +++++++++++++++++++++++++++++++---------
 include/linux/fs.h        | 10 ++++++
 include/linux/pagemap.h   | 11 +++++++
 include/linux/writeback.h |  1 +
 mm/memory.c               |  7 +++-
 mm/mmap.c                 |  9 ++++-
 mm/msync.c                | 83 ++++++++++++++++++++++++++++++++++++++++-------
 mm/page-writeback.c       | 26 +++++++++++++++
 mm/rmap.c                 | 27 +++++++++++++--
 10 files changed, 219 insertions(+), 31 deletions(-)

-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 1/5] mm: Track mappings that have been written via ptes
  2013-08-16 23:22 ` Andy Lutomirski
@ 2013-08-16 23:22   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This will allow filesystems to identify when their mappings have
been modified using a pte.  The idea is that, ideally, filesystems will update ctime and mtime sometime after any mmaped write.

This is handled in core mm code for two reasons:

1. Performance.  Setting a bit directly is faster than an indirect
   call to a vma op.

2. Simplicity.  The cmtime bit is set with lots of mm locks held.
   Rather than making filesystems add a new vm operation that needs
   to be aware of locking, it's easier to just get it right in core
   code.

For filesystems that don't use the deferred cmtime update mechanism,
setting the AS_CMTIME bit has no effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/pagemap.h | 11 +++++++++++
 mm/memory.c             |  7 ++++++-
 mm/rmap.c               | 27 +++++++++++++++++++++++++--
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..f98fe2d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_CMTIME	= __GFP_BITS_SHIFT + 5, /* cmtime update deferred */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -74,6 +75,16 @@ static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
 }
 
+static inline void mapping_set_cmtime(struct address_space * mapping)
+{
+	set_bit(AS_CMTIME, &mapping->flags);
+}
+
+static inline bool mapping_test_clear_cmtime(struct address_space * mapping)
+{
+	return test_and_clear_bit(AS_CMTIME, &mapping->flags);
+}
+
 /*
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
diff --git a/mm/memory.c b/mm/memory.c
index 4026841..1737a90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1150,8 +1150,13 @@ again:
 			if (PageAnon(page))
 				rss[MM_ANONPAGES]--;
 			else {
-				if (pte_dirty(ptent))
+				if (pte_dirty(ptent)) {
+					struct address_space *mapping =
+						page_mapping(page);
+					if (mapping)
+						mapping_set_cmtime(mapping);
 					set_page_dirty(page);
+				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index b2e29ac..2e3fb27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -928,6 +928,10 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)
 		}
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (ret)
+		mapping_set_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1179,6 +1183,19 @@ out:
 }
 
 /*
+ * Mark a page's mapping for future cmtime update.  It's safe to call this
+ * on any page, but it only has any effect if the page is backed by a mapping
+ * that uses mapping_test_clear_cmtime to handle file time updates.  This means
+ * that there's no need to call this on for non-VM_SHARED vmas.
+ */
+static void page_set_cmtime(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	if (mapping)
+		mapping_set_cmtime(mapping);
+}
+
+/*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
  */
@@ -1219,8 +1236,11 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
 		set_page_dirty(page);
+		if (vma->vm_flags & VM_SHARED)
+			page_set_cmtime(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1413,8 +1433,11 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		}
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval)) {
 			set_page_dirty(page);
+			if (vma->vm_flags & VM_SHARED)
+				page_set_cmtime(page);
+		}
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 1/5] mm: Track mappings that have been written via ptes
@ 2013-08-16 23:22   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

This will allow filesystems to identify when their mappings have
been modified using a pte.  The idea is that, ideally, filesystems will update ctime and mtime sometime after any mmaped write.

This is handled in core mm code for two reasons:

1. Performance.  Setting a bit directly is faster than an indirect
   call to a vma op.

2. Simplicity.  The cmtime bit is set with lots of mm locks held.
   Rather than making filesystems add a new vm operation that needs
   to be aware of locking, it's easier to just get it right in core
   code.

For filesystems that don't use the deferred cmtime update mechanism,
setting the AS_CMTIME bit has no effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/pagemap.h | 11 +++++++++++
 mm/memory.c             |  7 ++++++-
 mm/rmap.c               | 27 +++++++++++++++++++++++++--
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..f98fe2d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_CMTIME	= __GFP_BITS_SHIFT + 5, /* cmtime update deferred */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -74,6 +75,16 @@ static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
 }
 
+static inline void mapping_set_cmtime(struct address_space * mapping)
+{
+	set_bit(AS_CMTIME, &mapping->flags);
+}
+
+static inline bool mapping_test_clear_cmtime(struct address_space * mapping)
+{
+	return test_and_clear_bit(AS_CMTIME, &mapping->flags);
+}
+
 /*
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
diff --git a/mm/memory.c b/mm/memory.c
index 4026841..1737a90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1150,8 +1150,13 @@ again:
 			if (PageAnon(page))
 				rss[MM_ANONPAGES]--;
 			else {
-				if (pte_dirty(ptent))
+				if (pte_dirty(ptent)) {
+					struct address_space *mapping =
+						page_mapping(page);
+					if (mapping)
+						mapping_set_cmtime(mapping);
 					set_page_dirty(page);
+				}
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index b2e29ac..2e3fb27 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -928,6 +928,10 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)
 		}
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (ret)
+		mapping_set_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1179,6 +1183,19 @@ out:
 }
 
 /*
+ * Mark a page's mapping for future cmtime update.  It's safe to call this
+ * on any page, but it only has any effect if the page is backed by a mapping
+ * that uses mapping_test_clear_cmtime to handle file time updates.  This means
+ * that there's no need to call this on for non-VM_SHARED vmas.
+ */
+static void page_set_cmtime(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+	if (mapping)
+		mapping_set_cmtime(mapping);
+}
+
+/*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
  */
@@ -1219,8 +1236,11 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	if (pte_dirty(pteval)) {
 		set_page_dirty(page);
+		if (vma->vm_flags & VM_SHARED)
+			page_set_cmtime(page);
+	}
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -1413,8 +1433,11 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		}
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
-		if (pte_dirty(pteval))
+		if (pte_dirty(pteval)) {
 			set_page_dirty(page);
+			if (vma->vm_flags & VM_SHARED)
+				page_set_cmtime(page);
+		}
 
 		page_remove_rmap(page);
 		page_cache_release(page);
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-16 23:22 ` Andy Lutomirski
@ 2013-08-16 23:22   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This is like file_update_time, except that it acts on a struct inode *
instead of a struct file *.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |  1 +
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d6dfb09..bc90c12 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1637,6 +1637,34 @@ int file_remove_suid(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_suid);
 
+/*
+ * This does the work that's common to file_update_time and
+ * inode_update_time.
+ */
+static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
+{
+	int sync_it;
+
+	/* First try to exhaust all avenues to not sync */
+	if (IS_NOCMTIME(inode))
+		return 0;
+
+	*now = current_fs_time(inode->i_sb);
+	if (!timespec_equal(&inode->i_mtime, now))
+		sync_it = S_MTIME;
+
+	if (!timespec_equal(&inode->i_ctime, now))
+		sync_it |= S_CTIME;
+
+	if (IS_I_VERSION(inode))
+		sync_it |= S_VERSION;
+
+	if (!sync_it)
+		return 0;
+
+	return sync_it;
+}
+
 /**
  *	file_update_time	-	update mtime and ctime time
  *	@file: file accessed
@@ -1654,23 +1682,9 @@ int file_update_time(struct file *file)
 {
 	struct inode *inode = file_inode(file);
 	struct timespec now;
-	int sync_it = 0;
+	int sync_it = prepare_update_cmtime(inode, &now);
 	int ret;
 
-	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
-		return 0;
-
-	now = current_fs_time(inode->i_sb);
-	if (!timespec_equal(&inode->i_mtime, &now))
-		sync_it = S_MTIME;
-
-	if (!timespec_equal(&inode->i_ctime, &now))
-		sync_it |= S_CTIME;
-
-	if (IS_I_VERSION(inode))
-		sync_it |= S_VERSION;
-
 	if (!sync_it)
 		return 0;
 
@@ -1685,6 +1699,34 @@ int file_update_time(struct file *file)
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime time
+ *	@inode: inode accessed
+ *
+ *	This is like file_update_time, but it assumes the mnt is writable
+ *	and takes an inode parameter instead.  (We need to assume the mnt
+ *	was writable because inodes aren't associated with any particular
+ *	mnt.
+ */
+
+int inode_update_time_writable(struct inode *inode)
+{
+	struct timespec now;
+	int sync_it = prepare_update_cmtime(inode, &now);
+	int ret;
+
+	if (!sync_it)
+		return 0;
+
+	/* sb_start_pagefault and update_time can both sleep. */
+	sb_start_pagefault(inode->i_sb);
+	ret = update_time(inode, &now, sync_it);
+	sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+EXPORT_SYMBOL(inode_update_time_writable);
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..86cf0a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2590,6 +2590,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern int file_update_time(struct file *file);
+extern int inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-16 23:22   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

This is like file_update_time, except that it acts on a struct inode *
instead of a struct file *.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
 include/linux/fs.h |  1 +
 2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d6dfb09..bc90c12 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1637,6 +1637,34 @@ int file_remove_suid(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_suid);
 
+/*
+ * This does the work that's common to file_update_time and
+ * inode_update_time.
+ */
+static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
+{
+	int sync_it;
+
+	/* First try to exhaust all avenues to not sync */
+	if (IS_NOCMTIME(inode))
+		return 0;
+
+	*now = current_fs_time(inode->i_sb);
+	if (!timespec_equal(&inode->i_mtime, now))
+		sync_it = S_MTIME;
+
+	if (!timespec_equal(&inode->i_ctime, now))
+		sync_it |= S_CTIME;
+
+	if (IS_I_VERSION(inode))
+		sync_it |= S_VERSION;
+
+	if (!sync_it)
+		return 0;
+
+	return sync_it;
+}
+
 /**
  *	file_update_time	-	update mtime and ctime time
  *	@file: file accessed
@@ -1654,23 +1682,9 @@ int file_update_time(struct file *file)
 {
 	struct inode *inode = file_inode(file);
 	struct timespec now;
-	int sync_it = 0;
+	int sync_it = prepare_update_cmtime(inode, &now);
 	int ret;
 
-	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
-		return 0;
-
-	now = current_fs_time(inode->i_sb);
-	if (!timespec_equal(&inode->i_mtime, &now))
-		sync_it = S_MTIME;
-
-	if (!timespec_equal(&inode->i_ctime, &now))
-		sync_it |= S_CTIME;
-
-	if (IS_I_VERSION(inode))
-		sync_it |= S_VERSION;
-
 	if (!sync_it)
 		return 0;
 
@@ -1685,6 +1699,34 @@ int file_update_time(struct file *file)
 }
 EXPORT_SYMBOL(file_update_time);
 
+/**
+ *	inode_update_time_writable	-	update mtime and ctime time
+ *	@inode: inode accessed
+ *
+ *	This is like file_update_time, but it assumes the mnt is writable
+ *	and takes an inode parameter instead.  (We need to assume the mnt
+ *	was writable because inodes aren't associated with any particular
+ *	mnt.
+ */
+
+int inode_update_time_writable(struct inode *inode)
+{
+	struct timespec now;
+	int sync_it = prepare_update_cmtime(inode, &now);
+	int ret;
+
+	if (!sync_it)
+		return 0;
+
+	/* sb_start_pagefault and update_time can both sleep. */
+	sb_start_pagefault(inode->i_sb);
+	ret = update_time(inode, &now, sync_it);
+	sb_end_pagefault(inode->i_sb);
+
+	return ret;
+}
+EXPORT_SYMBOL(inode_update_time_writable);
+
 int inode_needs_sync(struct inode *inode)
 {
 	if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..86cf0a4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2590,6 +2590,7 @@ extern int inode_newsize_ok(const struct inode *, loff_t offset);
 extern void setattr_copy(struct inode *inode, const struct iattr *attr);
 
 extern int file_update_time(struct file *file);
+extern int inode_update_time_writable(struct inode *inode);
 
 extern int generic_show_options(struct seq_file *m, struct dentry *root);
 extern void save_mount_options(struct super_block *sb, char *options);
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-16 23:22 ` Andy Lutomirski
@ 2013-08-16 23:22   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, all of which eventually call ->writepages.

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.

Filesystmes that defer cmtime updates should flush them on munmap or
exit.  Finding out that this happened through vm_ops is messy, so
add a new address space op for this.

It's not strictly necessary to call ->flush_cmtime after ->writepages,
but it simplifies the fs code.  As an optional optimization,
filesystems can call mapping_test_clear_cmtime themselves in
->writepages (as long as they're careful to scan all the pages first
-- the cmtime bit may not be set when ->writepages is entered).

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/fs.h        |  9 +++++++++
 include/linux/writeback.h |  1 +
 mm/mmap.c                 |  9 ++++++++-
 mm/page-writeback.c       | 26 ++++++++++++++++++++++++++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f224155 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,15 @@ struct address_space_operations {
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
 
+	/*
+	 * Userspace expects certain system calls to update cmtime if
+	 * a file has been recently written using a shared vma.  In
+	 * cases where cmtime may need to be updated but writepages is
+	 * not called, this is called instead.  (Implementations
+	 * should call mapping_test_clear_cmtime.)
+	 */
+	void (*flush_cmtime)(struct address_space *);
+
 	/* Set a page dirty.  Return true if this dirtied it */
 	int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..f6e8261 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
+void generic_flush_cmtime(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..7ed7700 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,14 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
+	if (vma->vm_file) {
+		if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			if (mapping->a_ops && mapping->a_ops->flush_cmtime)
+				mapping->a_ops->flush_cmtime(mapping);
+		}
 		fput(vma->vm_file);
+	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..9ab8c9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/*
+	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
+	 * mark the mapping for deferred cmtime update.  As an optimization,
+	 * a filesystem can flush the update at the end of ->writepages
+	 * (possibly avoiding a journal transaction, for example), but,
+	 * for simplicity, let filesystems skip that part and just implement
+	 * ->flush_cmtime.
+	 */
+	if (mapping->a_ops->flush_cmtime)
+		mapping->a_ops->flush_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1970,6 +1982,20 @@ int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+/**
+ * generic_flush_cmtime - perform a deferred cmtime update if needed
+ * @mapping: address space structure
+ *
+ * This is a library function, which implements the flush_cmtime()
+ * address_space_operation.
+ */
+void generic_flush_cmtime(struct address_space *mapping)
+{
+	if (mapping_test_clear_cmtime(mapping))
+		inode_update_time_writable(mapping->host);
+}
+EXPORT_SYMBOL(generic_flush_cmtime);
+
 /*
  * For address_spaces which do not use buffers nor write back.
  */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-16 23:22   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, all of which eventually call ->writepages.

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.

Filesystmes that defer cmtime updates should flush them on munmap or
exit.  Finding out that this happened through vm_ops is messy, so
add a new address space op for this.

It's not strictly necessary to call ->flush_cmtime after ->writepages,
but it simplifies the fs code.  As an optional optimization,
filesystems can call mapping_test_clear_cmtime themselves in
->writepages (as long as they're careful to scan all the pages first
-- the cmtime bit may not be set when ->writepages is entered).

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/fs.h        |  9 +++++++++
 include/linux/writeback.h |  1 +
 mm/mmap.c                 |  9 ++++++++-
 mm/page-writeback.c       | 26 ++++++++++++++++++++++++++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f224155 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,15 @@ struct address_space_operations {
 	/* Write back some dirty pages from this mapping. */
 	int (*writepages)(struct address_space *, struct writeback_control *);
 
+	/*
+	 * Userspace expects certain system calls to update cmtime if
+	 * a file has been recently written using a shared vma.  In
+	 * cases where cmtime may need to be updated but writepages is
+	 * not called, this is called instead.  (Implementations
+	 * should call mapping_test_clear_cmtime.)
+	 */
+	void (*flush_cmtime)(struct address_space *);
+
 	/* Set a page dirty.  Return true if this dirtied it */
 	int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..f6e8261 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
+void generic_flush_cmtime(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..7ed7700 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,14 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
+	if (vma->vm_file) {
+		if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			if (mapping->a_ops && mapping->a_ops->flush_cmtime)
+				mapping->a_ops->flush_cmtime(mapping);
+		}
 		fput(vma->vm_file);
+	}
 	mpol_put(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..9ab8c9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/*
+	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
+	 * mark the mapping for deferred cmtime update.  As an optimization,
+	 * a filesystem can flush the update at the end of ->writepages
+	 * (possibly avoiding a journal transaction, for example), but,
+	 * for simplicity, let filesystems skip that part and just implement
+	 * ->flush_cmtime.
+	 */
+	if (mapping->a_ops->flush_cmtime)
+		mapping->a_ops->flush_cmtime(mapping);
+
 	return ret;
 }
 
@@ -1970,6 +1982,20 @@ int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+/**
+ * generic_flush_cmtime - perform a deferred cmtime update if needed
+ * @mapping: address space structure
+ *
+ * This is a library function, which implements the flush_cmtime()
+ * address_space_operation.
+ */
+void generic_flush_cmtime(struct address_space *mapping)
+{
+	if (mapping_test_clear_cmtime(mapping))
+		inode_update_time_writable(mapping->host);
+}
+EXPORT_SYMBOL(generic_flush_cmtime);
+
 /*
  * For address_spaces which do not use buffers nor write back.
  */
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 4/5] mm: Scan for dirty ptes and update cmtime on MS_ASYNC
  2013-08-16 23:22 ` Andy Lutomirski
@ 2013-08-16 23:22   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

This is probably unimportant but improves POSIX compliance.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 mm/msync.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 72 insertions(+), 11 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 632df45..9e41acd 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,13 +13,16 @@
 #include <linux/file.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * Now all it does is ensure that file timestamps get updated, since POSIX
+ * requires it.  We track dirty pages correct without MS_ASYNC.
  *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
@@ -28,6 +31,57 @@
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
+
+static int msync_async_range(struct vm_area_struct *vma,
+			      unsigned long *start, unsigned long end)
+{
+	struct mm_struct *mm;
+	struct address_space *mapping;
+	int iters = 0;
+
+	while (*start < end && *start < vma->vm_end && iters < 128) {
+		unsigned int page_mask, page_increm;
+
+		/*
+		 * Require that the pte writable (because otherwise it can't
+		 * be dirty, so there's nothing to clean).
+		 *
+		 * In theory we could check the pte dirty bit, but this is
+		 * awkward and barely worth it.
+		 */
+		struct page *page = follow_page_mask(vma, *start,
+						     FOLL_GET | FOLL_WRITE,
+						     &page_mask);
+
+		if (page && !IS_ERR(page)) {
+			if (lock_page_killable(page) == 0) {
+				page_mkclean(page);
+				unlock_page(page);
+			}
+			put_page(page);
+		}
+
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		page_increm = 1 + (~(*start >> PAGE_SHIFT) & page_mask);
+		*start += page_increm * PAGE_SIZE;
+		cond_resched();
+		iters++;
+	}
+
+	/* XXX: try to do this only once? */
+	mapping = vma->vm_file->f_mapping;
+	if (mapping->a_ops->flush_cmtime)
+		mapping->a_ops->flush_cmtime(mapping);
+
+	/* Give mmap_sem writers a chance. */
+	mm = current->mm;
+	up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	return 0;
+}
+
 SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 {
 	unsigned long end;
@@ -77,18 +131,25 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			goto out_unlock;
 		}
 		file = vma->vm_file;
-		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
-			get_file(file);
-			up_read(&mm->mmap_sem);
-			error = vfs_fsync(file, 0);
-			fput(file);
-			if (error || start >= end)
-				goto out;
-			down_read(&mm->mmap_sem);
+		if (file && vma->vm_flags & VM_SHARED) {
+			if (flags & MS_SYNC) {
+				start = vma->vm_end;
+				get_file(file);
+				up_read(&mm->mmap_sem);
+				error = vfs_fsync(file, 0);
+				fput(file);
+				if (error || start >= end)
+					goto out;
+				down_read(&mm->mmap_sem);
+			} else if ((vma->vm_flags & VM_WRITE) &&
+				   file->f_mapping) {
+				error = msync_async_range(vma, &start, end);
+			} else {
+				start = vma->vm_end;
+			}
 			vma = find_vma(mm, start);
 		} else {
+			start = vma->vm_end;
 			if (start >= end) {
 				error = 0;
 				goto out_unlock;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 4/5] mm: Scan for dirty ptes and update cmtime on MS_ASYNC
@ 2013-08-16 23:22   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

This is probably unimportant but improves POSIX compliance.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 mm/msync.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 72 insertions(+), 11 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 632df45..9e41acd 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,13 +13,16 @@
 #include <linux/file.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * Now all it does is ensure that file timestamps get updated, since POSIX
+ * requires it.  We track dirty pages correct without MS_ASYNC.
  *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
@@ -28,6 +31,57 @@
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
+
+static int msync_async_range(struct vm_area_struct *vma,
+			      unsigned long *start, unsigned long end)
+{
+	struct mm_struct *mm;
+	struct address_space *mapping;
+	int iters = 0;
+
+	while (*start < end && *start < vma->vm_end && iters < 128) {
+		unsigned int page_mask, page_increm;
+
+		/*
+		 * Require that the pte writable (because otherwise it can't
+		 * be dirty, so there's nothing to clean).
+		 *
+		 * In theory we could check the pte dirty bit, but this is
+		 * awkward and barely worth it.
+		 */
+		struct page *page = follow_page_mask(vma, *start,
+						     FOLL_GET | FOLL_WRITE,
+						     &page_mask);
+
+		if (page && !IS_ERR(page)) {
+			if (lock_page_killable(page) == 0) {
+				page_mkclean(page);
+				unlock_page(page);
+			}
+			put_page(page);
+		}
+
+		if (IS_ERR(page))
+			return PTR_ERR(page);
+
+		page_increm = 1 + (~(*start >> PAGE_SHIFT) & page_mask);
+		*start += page_increm * PAGE_SIZE;
+		cond_resched();
+		iters++;
+	}
+
+	/* XXX: try to do this only once? */
+	mapping = vma->vm_file->f_mapping;
+	if (mapping->a_ops->flush_cmtime)
+		mapping->a_ops->flush_cmtime(mapping);
+
+	/* Give mmap_sem writers a chance. */
+	mm = current->mm;
+	up_read(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
+	return 0;
+}
+
 SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 {
 	unsigned long end;
@@ -77,18 +131,25 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 			goto out_unlock;
 		}
 		file = vma->vm_file;
-		start = vma->vm_end;
-		if ((flags & MS_SYNC) && file &&
-				(vma->vm_flags & VM_SHARED)) {
-			get_file(file);
-			up_read(&mm->mmap_sem);
-			error = vfs_fsync(file, 0);
-			fput(file);
-			if (error || start >= end)
-				goto out;
-			down_read(&mm->mmap_sem);
+		if (file && vma->vm_flags & VM_SHARED) {
+			if (flags & MS_SYNC) {
+				start = vma->vm_end;
+				get_file(file);
+				up_read(&mm->mmap_sem);
+				error = vfs_fsync(file, 0);
+				fput(file);
+				if (error || start >= end)
+					goto out;
+				down_read(&mm->mmap_sem);
+			} else if ((vma->vm_flags & VM_WRITE) &&
+				   file->f_mapping) {
+				error = msync_async_range(vma, &start, end);
+			} else {
+				start = vma->vm_end;
+			}
 			vma = find_vma(mm, start);
 		} else {
+			start = vma->vm_end;
 			if (start >= end) {
 				error = 0;
 				goto out_unlock;
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
  2013-08-16 23:22 ` Andy Lutomirski
@ 2013-08-16 23:22   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Dave Chinner, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig, Andy Lutomirski

A fancier implementation could probably avoid an extra journal
transaction by adding a mapping_test_clear_cmtime call in
ext4_writepages, but this should already be a considerable
improvement -- we'll start one transaction per writepages call
instead of one per page.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/ext4/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd32a2e..cee6b45 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3236,6 +3236,7 @@ static const struct address_space_operations ext4_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_write_end,
 	.bmap			= ext4_bmap,
@@ -3252,6 +3253,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_journalled_write_end,
 	.set_page_dirty		= ext4_journalled_set_page_dirty,
@@ -3268,6 +3270,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_da_write_begin,
 	.write_end		= ext4_da_write_end,
 	.bmap			= ext4_bmap,
@@ -5025,7 +5028,6 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int retries = 0;
 
 	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
@ 2013-08-16 23:22   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Theodore Ts'o, Dave Hansen, xfs, Christoph Hellwig,
	Andy Lutomirski, Jan Kara, linux-ext4, Tim Chen

A fancier implementation could probably avoid an extra journal
transaction by adding a mapping_test_clear_cmtime call in
ext4_writepages, but this should already be a considerable
improvement -- we'll start one transaction per writepages call
instead of one per page.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/ext4/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd32a2e..cee6b45 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3236,6 +3236,7 @@ static const struct address_space_operations ext4_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_write_end,
 	.bmap			= ext4_bmap,
@@ -3252,6 +3253,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_write_begin,
 	.write_end		= ext4_journalled_write_end,
 	.set_page_dirty		= ext4_journalled_set_page_dirty,
@@ -3268,6 +3270,7 @@ static const struct address_space_operations ext4_da_aops = {
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
+	.flush_cmtime		= generic_flush_cmtime,
 	.write_begin		= ext4_da_write_begin,
 	.write_end		= ext4_da_write_end,
 	.bmap			= ext4_bmap,
@@ -5025,7 +5028,6 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int retries = 0;
 
 	sb_start_pagefault(inode->i_sb);
-	file_update_time(vma->vm_file);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-16 23:22   ` Andy Lutomirski
@ 2013-08-20  2:28     ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> This is like file_update_time, except that it acts on a struct inode *
> instead of a struct file *.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>  include/linux/fs.h |  1 +
>  2 files changed, 58 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index d6dfb09..bc90c12 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1637,6 +1637,34 @@ int file_remove_suid(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_suid);
>  
> +/*
> + * This does the work that's common to file_update_time and
> + * inode_update_time.
> + */
> +static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
> +{
> +	int sync_it;
> +
> +	/* First try to exhaust all avenues to not sync */
> +	if (IS_NOCMTIME(inode))
> +		return 0;
> +
> +	*now = current_fs_time(inode->i_sb);
> +	if (!timespec_equal(&inode->i_mtime, now))
> +		sync_it = S_MTIME;
> +
> +	if (!timespec_equal(&inode->i_ctime, now))
> +		sync_it |= S_CTIME;
> +
> +	if (IS_I_VERSION(inode))
> +		sync_it |= S_VERSION;
> +
> +	if (!sync_it)
> +		return 0;
> +
> +	return sync_it;
> +}
> +
>  /**
>   *	file_update_time	-	update mtime and ctime time
>   *	@file: file accessed
> @@ -1654,23 +1682,9 @@ int file_update_time(struct file *file)
>  {
>  	struct inode *inode = file_inode(file);
>  	struct timespec now;
> -	int sync_it = 0;
> +	int sync_it = prepare_update_cmtime(inode, &now);
>  	int ret;
>  
> -	/* First try to exhaust all avenues to not sync */
> -	if (IS_NOCMTIME(inode))
> -		return 0;
> -
> -	now = current_fs_time(inode->i_sb);
> -	if (!timespec_equal(&inode->i_mtime, &now))
> -		sync_it = S_MTIME;
> -
> -	if (!timespec_equal(&inode->i_ctime, &now))
> -		sync_it |= S_CTIME;
> -
> -	if (IS_I_VERSION(inode))
> -		sync_it |= S_VERSION;
> -
>  	if (!sync_it)
>  		return 0;
>  
> @@ -1685,6 +1699,34 @@ int file_update_time(struct file *file)
>  }
>  EXPORT_SYMBOL(file_update_time);
>  
> +/**
> + *	inode_update_time_writable	-	update mtime and ctime time
> + *	@inode: inode accessed
> + *
> + *	This is like file_update_time, but it assumes the mnt is writable
> + *	and takes an inode parameter instead.  (We need to assume the mnt
> + *	was writable because inodes aren't associated with any particular
> + *	mnt.
> + */
> +
> +int inode_update_time_writable(struct inode *inode)
> +{
> +	struct timespec now;
> +	int sync_it = prepare_update_cmtime(inode, &now);
> +	int ret;
> +
> +	if (!sync_it)
> +		return 0;
> +
> +	/* sb_start_pagefault and update_time can both sleep. */
> +	sb_start_pagefault(inode->i_sb);
> +	ret = update_time(inode, &now, sync_it);
> +	sb_end_pagefault(inode->i_sb);

This gets called from the writeback path - you can't use
sb_start_pagefault/sb_end_pagefault in that path.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-20  2:28     ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> This is like file_update_time, except that it acts on a struct inode *
> instead of a struct file *.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>  include/linux/fs.h |  1 +
>  2 files changed, 58 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index d6dfb09..bc90c12 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1637,6 +1637,34 @@ int file_remove_suid(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_suid);
>  
> +/*
> + * This does the work that's common to file_update_time and
> + * inode_update_time.
> + */
> +static int prepare_update_cmtime(struct inode *inode, struct timespec *now)
> +{
> +	int sync_it;
> +
> +	/* First try to exhaust all avenues to not sync */
> +	if (IS_NOCMTIME(inode))
> +		return 0;
> +
> +	*now = current_fs_time(inode->i_sb);
> +	if (!timespec_equal(&inode->i_mtime, now))
> +		sync_it = S_MTIME;
> +
> +	if (!timespec_equal(&inode->i_ctime, now))
> +		sync_it |= S_CTIME;
> +
> +	if (IS_I_VERSION(inode))
> +		sync_it |= S_VERSION;
> +
> +	if (!sync_it)
> +		return 0;
> +
> +	return sync_it;
> +}
> +
>  /**
>   *	file_update_time	-	update mtime and ctime time
>   *	@file: file accessed
> @@ -1654,23 +1682,9 @@ int file_update_time(struct file *file)
>  {
>  	struct inode *inode = file_inode(file);
>  	struct timespec now;
> -	int sync_it = 0;
> +	int sync_it = prepare_update_cmtime(inode, &now);
>  	int ret;
>  
> -	/* First try to exhaust all avenues to not sync */
> -	if (IS_NOCMTIME(inode))
> -		return 0;
> -
> -	now = current_fs_time(inode->i_sb);
> -	if (!timespec_equal(&inode->i_mtime, &now))
> -		sync_it = S_MTIME;
> -
> -	if (!timespec_equal(&inode->i_ctime, &now))
> -		sync_it |= S_CTIME;
> -
> -	if (IS_I_VERSION(inode))
> -		sync_it |= S_VERSION;
> -
>  	if (!sync_it)
>  		return 0;
>  
> @@ -1685,6 +1699,34 @@ int file_update_time(struct file *file)
>  }
>  EXPORT_SYMBOL(file_update_time);
>  
> +/**
> + *	inode_update_time_writable	-	update mtime and ctime time
> + *	@inode: inode accessed
> + *
> + *	This is like file_update_time, but it assumes the mnt is writable
> + *	and takes an inode parameter instead.  (We need to assume the mnt
> + *	was writable because inodes aren't associated with any particular
> + *	mnt.
> + */
> +
> +int inode_update_time_writable(struct inode *inode)
> +{
> +	struct timespec now;
> +	int sync_it = prepare_update_cmtime(inode, &now);
> +	int ret;
> +
> +	if (!sync_it)
> +		return 0;
> +
> +	/* sb_start_pagefault and update_time can both sleep. */
> +	sb_start_pagefault(inode->i_sb);
> +	ret = update_time(inode, &now, sync_it);
> +	sb_end_pagefault(inode->i_sb);

This gets called from the writeback path - you can't use
sb_start_pagefault/sb_end_pagefault in that path.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-16 23:22   ` Andy Lutomirski
@ 2013-08-20  2:36     ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> Filesystems that defer cmtime updates should update cmtime when any
> of these events happen after a write via a mapping:
> 
>  - The mapping is written back to disk.  This happens from all kinds
>    of places, all of which eventually call ->writepages.
> 
>  - munmap is called or the mapping is removed when the process exits
> 
>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>    time between an mmaped write and the subsequent msync call.
>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> 
> Filesystmes that defer cmtime updates should flush them on munmap or
> exit.  Finding out that this happened through vm_ops is messy, so
> add a new address space op for this.
> 
> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> but it simplifies the fs code.  As an optional optimization,
> filesystems can call mapping_test_clear_cmtime themselves in
> ->writepages (as long as they're careful to scan all the pages first
> -- the cmtime bit may not be set when ->writepages is entered).

.flush_cmtime is effectively a duplicate method.  We already have
.update_time to notify filesystems that they need to update the
timestamp in the inode transactionally.

Indeed:

> +	/*
> +	 * Userspace expects certain system calls to update cmtime if
> +	 * a file has been recently written using a shared vma.  In
> +	 * cases where cmtime may need to be updated but writepages is
> +	 * not called, this is called instead.  (Implementations
> +	 * should call mapping_test_clear_cmtime.)
> +	 */
> +	void (*flush_cmtime)(struct address_space *);

You say it can be implemented in the ->writepage(s) method, and all
filesystems provide ->writepage(s) in some form. Therefore I would
have thought it be best to simply require filesystems to check that
mapping flag during those methods and update the inode directly when
that is set?

Indeed, the way you've set up the infrastructure, we'll have to
rewrite the cmtime update code to enable writepages to update this
within some other transaction. Perhaps you should just implement it
that way first?

> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  		ret = mapping->a_ops->writepages(mapping, wbc);
>  	else
>  		ret = generic_writepages(mapping, wbc);
> +
> +	/*
> +	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
> +	 * mark the mapping for deferred cmtime update.  As an optimization,
> +	 * a filesystem can flush the update at the end of ->writepages
> +	 * (possibly avoiding a journal transaction, for example), but,
> +	 * for simplicity, let filesystems skip that part and just implement
> +	 * ->flush_cmtime.
> +	 */
> +	if (mapping->a_ops->flush_cmtime)
> +		mapping->a_ops->flush_cmtime(mapping);

And that's where you cannot call sb_pagefault_start/end....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20  2:36     ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> Filesystems that defer cmtime updates should update cmtime when any
> of these events happen after a write via a mapping:
> 
>  - The mapping is written back to disk.  This happens from all kinds
>    of places, all of which eventually call ->writepages.
> 
>  - munmap is called or the mapping is removed when the process exits
> 
>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>    time between an mmaped write and the subsequent msync call.
>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> 
> Filesystmes that defer cmtime updates should flush them on munmap or
> exit.  Finding out that this happened through vm_ops is messy, so
> add a new address space op for this.
> 
> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> but it simplifies the fs code.  As an optional optimization,
> filesystems can call mapping_test_clear_cmtime themselves in
> ->writepages (as long as they're careful to scan all the pages first
> -- the cmtime bit may not be set when ->writepages is entered).

.flush_cmtime is effectively a duplicate method.  We already have
.update_time to notify filesystems that they need to update the
timestamp in the inode transactionally.

Indeed:

> +	/*
> +	 * Userspace expects certain system calls to update cmtime if
> +	 * a file has been recently written using a shared vma.  In
> +	 * cases where cmtime may need to be updated but writepages is
> +	 * not called, this is called instead.  (Implementations
> +	 * should call mapping_test_clear_cmtime.)
> +	 */
> +	void (*flush_cmtime)(struct address_space *);

You say it can be implemented in the ->writepage(s) method, and all
filesystems provide ->writepage(s) in some form. Therefore I would
have thought it be best to simply require filesystems to check that
mapping flag during those methods and update the inode directly when
that is set?

Indeed, the way you've set up the infrastructure, we'll have to
rewrite the cmtime update code to enable writepages to update this
within some other transaction. Perhaps you should just implement it
that way first?

> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>  		ret = mapping->a_ops->writepages(mapping, wbc);
>  	else
>  		ret = generic_writepages(mapping, wbc);
> +
> +	/*
> +	 * ->writepages will call clear_page_dirty_for_io, which may, in turn,
> +	 * mark the mapping for deferred cmtime update.  As an optimization,
> +	 * a filesystem can flush the update at the end of ->writepages
> +	 * (possibly avoiding a journal transaction, for example), but,
> +	 * for simplicity, let filesystems skip that part and just implement
> +	 * ->flush_cmtime.
> +	 */
> +	if (mapping->a_ops->flush_cmtime)
> +		mapping->a_ops->flush_cmtime(mapping);

And that's where you cannot call sb_pagefault_start/end....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
  2013-08-16 23:22   ` Andy Lutomirski
@ 2013-08-20  2:38     ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
> A fancier implementation could probably avoid an extra journal
> transaction by adding a mapping_test_clear_cmtime call in
> ext4_writepages, but this should already be a considerable
> improvement -- we'll start one transaction per writepages call
> instead of one per page.

I'd like to see more than just an ext4 implementation - btrfs and
XFS are the other main filesystems that should behave identically.

Also, it's worthwhile to write a generic xfstest to ensure that they
all update the timestamp appropriately - if its' in xfstests, then
we can basically guarantee that it won't get randomly regressed in
future, and other filesystems can be easily verified as well sa
their implement this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
@ 2013-08-20  2:38     ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  2:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
> A fancier implementation could probably avoid an extra journal
> transaction by adding a mapping_test_clear_cmtime call in
> ext4_writepages, but this should already be a considerable
> improvement -- we'll start one transaction per writepages call
> instead of one per page.

I'd like to see more than just an ext4 implementation - btrfs and
XFS are the other main filesystems that should behave identically.

Also, it's worthwhile to write a generic xfstest to ensure that they
all update the timestamp appropriately - if its' in xfstests, then
we can basically guarantee that it won't get randomly regressed in
future, and other filesystems can be easily verified as well sa
their implement this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-20  2:28     ` Dave Chinner
@ 2013-08-20  3:20       ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
>> This is like file_update_time, except that it acts on a struct inode *
>> instead of a struct file *.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>>  include/linux/fs.h |  1 +
>>  2 files changed, 58 insertions(+), 15 deletions(-)
>>

[...]

>> +
>> +int inode_update_time_writable(struct inode *inode)
>> +{
>> +     struct timespec now;
>> +     int sync_it = prepare_update_cmtime(inode, &now);
>> +     int ret;
>> +
>> +     if (!sync_it)
>> +             return 0;
>> +
>> +     /* sb_start_pagefault and update_time can both sleep. */
>> +     sb_start_pagefault(inode->i_sb);
>> +     ret = update_time(inode, &now, sync_it);
>> +     sb_end_pagefault(inode->i_sb);
>
> This gets called from the writeback path - you can't use
> sb_start_pagefault/sb_end_pagefault in that path.

The race I'm worried about is:

 - mmap
 - write to the mapping
 - remount ro
 - flush_cmtime -> inode_update_time_writable

This may be impossible, in which case I'm okay, but it's nice to have
a sanity check.  I'll see if I can figure out how to do that.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-20  3:20       ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
>> This is like file_update_time, except that it acts on a struct inode *
>> instead of a struct file *.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>>  include/linux/fs.h |  1 +
>>  2 files changed, 58 insertions(+), 15 deletions(-)
>>

[...]

>> +
>> +int inode_update_time_writable(struct inode *inode)
>> +{
>> +     struct timespec now;
>> +     int sync_it = prepare_update_cmtime(inode, &now);
>> +     int ret;
>> +
>> +     if (!sync_it)
>> +             return 0;
>> +
>> +     /* sb_start_pagefault and update_time can both sleep. */
>> +     sb_start_pagefault(inode->i_sb);
>> +     ret = update_time(inode, &now, sync_it);
>> +     sb_end_pagefault(inode->i_sb);
>
> This gets called from the writeback path - you can't use
> sb_start_pagefault/sb_end_pagefault in that path.

The race I'm worried about is:

 - mmap
 - write to the mapping
 - remount ro
 - flush_cmtime -> inode_update_time_writable

This may be impossible, in which case I'm okay, but it's nice to have
a sanity check.  I'll see if I can figure out how to do that.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20  2:36     ` Dave Chinner
@ 2013-08-20  3:28       ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> Filesystems that defer cmtime updates should update cmtime when any
>> of these events happen after a write via a mapping:
>>
>>  - The mapping is written back to disk.  This happens from all kinds
>>    of places, all of which eventually call ->writepages.
>>
>>  - munmap is called or the mapping is removed when the process exits
>>
>>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>>    time between an mmaped write and the subsequent msync call.
>>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>>
>> Filesystmes that defer cmtime updates should flush them on munmap or
>> exit.  Finding out that this happened through vm_ops is messy, so
>> add a new address space op for this.
>>
>> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> but it simplifies the fs code.  As an optional optimization,
>> filesystems can call mapping_test_clear_cmtime themselves in
>> ->writepages (as long as they're careful to scan all the pages first
>> -- the cmtime bit may not be set when ->writepages is entered).
>
> .flush_cmtime is effectively a duplicate method.  We already have
> .update_time to notify filesystems that they need to update the
> timestamp in the inode transactionally.

.update_time is used for the atime update as well, and it relies on
the core code to update the in-memory timestamp first.  I used that
approach in v2, but it was (correctly, I think) pointed out that this
was a layering violation and that core code shouldn't be mucking with
the timestamps directly during writeback.

There was a recent effort to move most of the file_update_calls from
the core into .page_mkwrite, and I don't think anyone wants to undo
that.

>
> Indeed:
>
>> +     /*
>> +      * Userspace expects certain system calls to update cmtime if
>> +      * a file has been recently written using a shared vma.  In
>> +      * cases where cmtime may need to be updated but writepages is
>> +      * not called, this is called instead.  (Implementations
>> +      * should call mapping_test_clear_cmtime.)
>> +      */
>> +     void (*flush_cmtime)(struct address_space *);
>
> You say it can be implemented in the ->writepage(s) method, and all
> filesystems provide ->writepage(s) in some form. Therefore I would
> have thought it be best to simply require filesystems to check that
> mapping flag during those methods and update the inode directly when
> that is set?

The problem with only doing it in ->writepages is that calling
writepages from munmap and exit would probably hurt performance for no
particular gain.  So I need some kind of callback to say "update the
time, but don't write data."  The AS_CMTIME bit will still be set when
the ptes are removed.

I could require ->writepages *and* ->flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.

>
> Indeed, the way you've set up the infrastructure, we'll have to
> rewrite the cmtime update code to enable writepages to update this
> within some other transaction. Perhaps you should just implement it
> that way first?

This is already possible although not IMO necessary for correctness.
All that ext4 would need to do is to add something like:

if (mapping_test_clear_cmtime(mapping)) {
  update times within current transaction
}

somewhere inside the transaction in writepages.  There would probably
be room for some kind of generic helper to do everything in
inode_update_time_writable except for the actual mark_inode_dirty
part, but this still seems nasty from a locking perspective, and I'd
rather leave that optimization to an ext4 developer who wants to do
it.

I could simplify this a bit by moving the mapping_test_clear_cmtime
part from .flush_cmtime to its callers.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20  3:28       ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> Filesystems that defer cmtime updates should update cmtime when any
>> of these events happen after a write via a mapping:
>>
>>  - The mapping is written back to disk.  This happens from all kinds
>>    of places, all of which eventually call ->writepages.
>>
>>  - munmap is called or the mapping is removed when the process exits
>>
>>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>>    time between an mmaped write and the subsequent msync call.
>>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>>
>> Filesystmes that defer cmtime updates should flush them on munmap or
>> exit.  Finding out that this happened through vm_ops is messy, so
>> add a new address space op for this.
>>
>> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> but it simplifies the fs code.  As an optional optimization,
>> filesystems can call mapping_test_clear_cmtime themselves in
>> ->writepages (as long as they're careful to scan all the pages first
>> -- the cmtime bit may not be set when ->writepages is entered).
>
> .flush_cmtime is effectively a duplicate method.  We already have
> .update_time to notify filesystems that they need to update the
> timestamp in the inode transactionally.

.update_time is used for the atime update as well, and it relies on
the core code to update the in-memory timestamp first.  I used that
approach in v2, but it was (correctly, I think) pointed out that this
was a layering violation and that core code shouldn't be mucking with
the timestamps directly during writeback.

There was a recent effort to move most of the file_update_calls from
the core into .page_mkwrite, and I don't think anyone wants to undo
that.

>
> Indeed:
>
>> +     /*
>> +      * Userspace expects certain system calls to update cmtime if
>> +      * a file has been recently written using a shared vma.  In
>> +      * cases where cmtime may need to be updated but writepages is
>> +      * not called, this is called instead.  (Implementations
>> +      * should call mapping_test_clear_cmtime.)
>> +      */
>> +     void (*flush_cmtime)(struct address_space *);
>
> You say it can be implemented in the ->writepage(s) method, and all
> filesystems provide ->writepage(s) in some form. Therefore I would
> have thought it be best to simply require filesystems to check that
> mapping flag during those methods and update the inode directly when
> that is set?

The problem with only doing it in ->writepages is that calling
writepages from munmap and exit would probably hurt performance for no
particular gain.  So I need some kind of callback to say "update the
time, but don't write data."  The AS_CMTIME bit will still be set when
the ptes are removed.

I could require ->writepages *and* ->flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.

>
> Indeed, the way you've set up the infrastructure, we'll have to
> rewrite the cmtime update code to enable writepages to update this
> within some other transaction. Perhaps you should just implement it
> that way first?

This is already possible although not IMO necessary for correctness.
All that ext4 would need to do is to add something like:

if (mapping_test_clear_cmtime(mapping)) {
  update times within current transaction
}

somewhere inside the transaction in writepages.  There would probably
be room for some kind of generic helper to do everything in
inode_update_time_writable except for the actual mark_inode_dirty
part, but this still seems nasty from a locking perspective, and I'd
rather leave that optimization to an ext4 developer who wants to do
it.

I could simplify this a bit by moving the mapping_test_clear_cmtime
part from .flush_cmtime to its callers.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
  2013-08-20  2:38     ` Dave Chinner
@ 2013-08-20  3:30       ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 7:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
>> A fancier implementation could probably avoid an extra journal
>> transaction by adding a mapping_test_clear_cmtime call in
>> ext4_writepages, but this should already be a considerable
>> improvement -- we'll start one transaction per writepages call
>> instead of one per page.
>
> I'd like to see more than just an ext4 implementation - btrfs and
> XFS are the other main filesystems that should behave identically.

Will do.

>
> Also, it's worthwhile to write a generic xfstest to ensure that they
> all update the timestamp appropriately - if its' in xfstests, then
> we can basically guarantee that it won't get randomly regressed in
> future, and other filesystems can be easily verified as well sa
> their implement this.
>

Is there a guide to writing an xfstest?  I've already written it as a
standalone C program, so I assume it's easy.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
@ 2013-08-20  3:30       ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  3:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 7:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
>> A fancier implementation could probably avoid an extra journal
>> transaction by adding a mapping_test_clear_cmtime call in
>> ext4_writepages, but this should already be a considerable
>> improvement -- we'll start one transaction per writepages call
>> instead of one per page.
>
> I'd like to see more than just an ext4 implementation - btrfs and
> XFS are the other main filesystems that should behave identically.

Will do.

>
> Also, it's worthwhile to write a generic xfstest to ensure that they
> all update the timestamp appropriately - if its' in xfstests, then
> we can basically guarantee that it won't get randomly regressed in
> future, and other filesystems can be easily verified as well sa
> their implement this.
>

Is there a guide to writing an xfstest?  I've already written it as a
standalone C program, so I assume it's easy.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-20  3:20       ` Andy Lutomirski
@ 2013-08-20  3:33         ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  3:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> >> This is like file_update_time, except that it acts on a struct inode *
> >> instead of a struct file *.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
> >>  include/linux/fs.h |  1 +
> >>  2 files changed, 58 insertions(+), 15 deletions(-)
> >>
> 
> [...]
> 
> >> +
> >> +int inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     struct timespec now;
> >> +     int sync_it = prepare_update_cmtime(inode, &now);
> >> +     int ret;
> >> +
> >> +     if (!sync_it)
> >> +             return 0;
> >> +
> >> +     /* sb_start_pagefault and update_time can both sleep. */
> >> +     sb_start_pagefault(inode->i_sb);
> >> +     ret = update_time(inode, &now, sync_it);
> >> +     sb_end_pagefault(inode->i_sb);
> >
> > This gets called from the writeback path - you can't use
> > sb_start_pagefault/sb_end_pagefault in that path.
> 
> The race I'm worried about is:
> 
>  - mmap
>  - write to the mapping
>  - remount ro
>  - flush_cmtime -> inode_update_time_writable

sb_start_pagefault() is for filesystem freeze protection, not
remount-ro protection. If you freeze the filesystem, then we stop
writes and pagefaults by making sb_start_pagefault/sb_start_write
block, and then run writeback to clean all the pages.  If writeback
then blocks on sb_start_pagefault(), we've got a deadlock.

> This may be impossible, in which case I'm okay, but it's nice to have
> a sanity check.  I'll see if I can figure out how to do that.

The process of remount-ro should flush the dirty pages - the inode
and page has been marked dirty by page_mkwrite(), after all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-20  3:33         ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  3:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> >> This is like file_update_time, except that it acts on a struct inode *
> >> instead of a struct file *.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> ---
> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
> >>  include/linux/fs.h |  1 +
> >>  2 files changed, 58 insertions(+), 15 deletions(-)
> >>
> 
> [...]
> 
> >> +
> >> +int inode_update_time_writable(struct inode *inode)
> >> +{
> >> +     struct timespec now;
> >> +     int sync_it = prepare_update_cmtime(inode, &now);
> >> +     int ret;
> >> +
> >> +     if (!sync_it)
> >> +             return 0;
> >> +
> >> +     /* sb_start_pagefault and update_time can both sleep. */
> >> +     sb_start_pagefault(inode->i_sb);
> >> +     ret = update_time(inode, &now, sync_it);
> >> +     sb_end_pagefault(inode->i_sb);
> >
> > This gets called from the writeback path - you can't use
> > sb_start_pagefault/sb_end_pagefault in that path.
> 
> The race I'm worried about is:
> 
>  - mmap
>  - write to the mapping
>  - remount ro
>  - flush_cmtime -> inode_update_time_writable

sb_start_pagefault() is for filesystem freeze protection, not
remount-ro protection. If you freeze the filesystem, then we stop
writes and pagefaults by making sb_start_pagefault/sb_start_write
block, and then run writeback to clean all the pages.  If writeback
then blocks on sb_start_pagefault(), we've got a deadlock.

> This may be impossible, in which case I'm okay, but it's nice to have
> a sanity check.  I'll see if I can figure out how to do that.

The process of remount-ro should flush the dirty pages - the inode
and page has been marked dirty by page_mkwrite(), after all.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-20  3:33         ` Dave Chinner
@ 2013-08-20  4:07           ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  4:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 8:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
>> >> This is like file_update_time, except that it acts on a struct inode *
>> >> instead of a struct file *.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> >> ---
>> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>> >>  include/linux/fs.h |  1 +
>> >>  2 files changed, 58 insertions(+), 15 deletions(-)
>> >>
>>
>> [...]
>>
>> >> +
>> >> +int inode_update_time_writable(struct inode *inode)
>> >> +{
>> >> +     struct timespec now;
>> >> +     int sync_it = prepare_update_cmtime(inode, &now);
>> >> +     int ret;
>> >> +
>> >> +     if (!sync_it)
>> >> +             return 0;
>> >> +
>> >> +     /* sb_start_pagefault and update_time can both sleep. */
>> >> +     sb_start_pagefault(inode->i_sb);
>> >> +     ret = update_time(inode, &now, sync_it);
>> >> +     sb_end_pagefault(inode->i_sb);
>> >
>> > This gets called from the writeback path - you can't use
>> > sb_start_pagefault/sb_end_pagefault in that path.
>>
>> The race I'm worried about is:
>>
>>  - mmap
>>  - write to the mapping
>>  - remount ro
>>  - flush_cmtime -> inode_update_time_writable
>
> sb_start_pagefault() is for filesystem freeze protection, not
> remount-ro protection. If you freeze the filesystem, then we stop
> writes and pagefaults by making sb_start_pagefault/sb_start_write
> block, and then run writeback to clean all the pages.  If writeback
> then blocks on sb_start_pagefault(), we've got a deadlock.
>
>> This may be impossible, in which case I'm okay, but it's nice to have
>> a sanity check.  I'll see if I can figure out how to do that.
>
> The process of remount-ro should flush the dirty pages - the inode
> and page has been marked dirty by page_mkwrite(), after all.

Hmm.  We can land in here from writeback, in which case the time
should be updated unconditionally.  We can also land in here from
msync(MS_ASYNC) or munmap.  munmap at least shouldn't block.

The nasty case is if a page is dirtied, then the frozen level is set
to SB_FREEZE_PAGEFAULT, and then userspace calls munmap or msync
*before* writepages gets called.  In this case, blocking until the fs
is unfrozen is probably impolite, and returning without updating the
time is questionable.

Removing the check entirely may add a new race, though: what if
.flush_cmtime has called mapping_test_clear_cmtime but hasn't gotten
to updating the time yet when freezing finishes?  This could be
prevented by changing generic_flush_cmtime to do __sb_start_write(sb,
SB_FREEZE_FS, false) and doing nothing if the fs is already frozen.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-20  4:07           ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  4:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 8:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
>> >> This is like file_update_time, except that it acts on a struct inode *
>> >> instead of a struct file *.
>> >>
>> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> >> ---
>> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
>> >>  include/linux/fs.h |  1 +
>> >>  2 files changed, 58 insertions(+), 15 deletions(-)
>> >>
>>
>> [...]
>>
>> >> +
>> >> +int inode_update_time_writable(struct inode *inode)
>> >> +{
>> >> +     struct timespec now;
>> >> +     int sync_it = prepare_update_cmtime(inode, &now);
>> >> +     int ret;
>> >> +
>> >> +     if (!sync_it)
>> >> +             return 0;
>> >> +
>> >> +     /* sb_start_pagefault and update_time can both sleep. */
>> >> +     sb_start_pagefault(inode->i_sb);
>> >> +     ret = update_time(inode, &now, sync_it);
>> >> +     sb_end_pagefault(inode->i_sb);
>> >
>> > This gets called from the writeback path - you can't use
>> > sb_start_pagefault/sb_end_pagefault in that path.
>>
>> The race I'm worried about is:
>>
>>  - mmap
>>  - write to the mapping
>>  - remount ro
>>  - flush_cmtime -> inode_update_time_writable
>
> sb_start_pagefault() is for filesystem freeze protection, not
> remount-ro protection. If you freeze the filesystem, then we stop
> writes and pagefaults by making sb_start_pagefault/sb_start_write
> block, and then run writeback to clean all the pages.  If writeback
> then blocks on sb_start_pagefault(), we've got a deadlock.
>
>> This may be impossible, in which case I'm okay, but it's nice to have
>> a sanity check.  I'll see if I can figure out how to do that.
>
> The process of remount-ro should flush the dirty pages - the inode
> and page has been marked dirty by page_mkwrite(), after all.

Hmm.  We can land in here from writeback, in which case the time
should be updated unconditionally.  We can also land in here from
msync(MS_ASYNC) or munmap.  munmap at least shouldn't block.

The nasty case is if a page is dirtied, then the frozen level is set
to SB_FREEZE_PAGEFAULT, and then userspace calls munmap or msync
*before* writepages gets called.  In this case, blocking until the fs
is unfrozen is probably impolite, and returning without updating the
time is questionable.

Removing the check entirely may add a new race, though: what if
.flush_cmtime has called mapping_test_clear_cmtime but hasn't gotten
to updating the time yet when freezing finishes?  This could be
prevented by changing generic_flush_cmtime to do __sb_start_write(sb,
SB_FREEZE_FS, false) and doing nothing if the fs is already frozen.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20  3:28       ` Andy Lutomirski
@ 2013-08-20  4:08         ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  4:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> >> Filesystems that defer cmtime updates should update cmtime when any
> >> of these events happen after a write via a mapping:
> >>
> >>  - The mapping is written back to disk.  This happens from all kinds
> >>    of places, all of which eventually call ->writepages.
> >>
> >>  - munmap is called or the mapping is removed when the process exits
> >>
> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
> >>    time between an mmaped write and the subsequent msync call.
> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> >>
> >> Filesystmes that defer cmtime updates should flush them on munmap or
> >> exit.  Finding out that this happened through vm_ops is messy, so
> >> add a new address space op for this.
> >>
> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> >> but it simplifies the fs code.  As an optional optimization,
> >> filesystems can call mapping_test_clear_cmtime themselves in
> >> ->writepages (as long as they're careful to scan all the pages first
> >> -- the cmtime bit may not be set when ->writepages is entered).
> >
> > .flush_cmtime is effectively a duplicate method.  We already have
> > .update_time to notify filesystems that they need to update the
> > timestamp in the inode transactionally.
> 
> .update_time is used for the atime update as well, and it relies on
> the core code to update the in-memory timestamp first.

No, it doesn't. The caller of .update_time provides the timestamp to
that gets written into the relevant fields of the inode.

i.e. this code:

	now = current_fs_time(inode->i_sb);
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);

will update *only* the ctime and mtime of the inode to have a value
of @now. That's precisely what you want to do, yes?

Indeed, your .flush_cmtime function effectively does:

	flags = prepare_cmtime(inode, &now)
		{ *now = current_fs_time()
		  flags = S_CTIME|S_MTIME;
		}
	update_time(inode, now, flags);
		{
			inode->i_op->update_time(inode, now, flags)
			or
			mark_inode_dirty_sync()
		}

IOWs, you're adding a filesystem specific method to update a
timestamp that wraps around a generic method for updating a
timestamp. i.e.  .flush_cmtime is not necessary - you could just
call inode_update_time_writable() from generic_writepages() and it
would simply just work for all filesystems....

> I used that
> approach in v2, but it was (correctly, I think) pointed out that this
> was a layering violation and that core code shouldn't be mucking with
> the timestamps directly during writeback.

If calling a generic method to update the timestamp is a layering
violation, then why is replacing that with a filesystem specific
method of updating a timestamp not a layering violation?

> > Indeed:
> >
> >> +     /*
> >> +      * Userspace expects certain system calls to update cmtime if
> >> +      * a file has been recently written using a shared vma.  In
> >> +      * cases where cmtime may need to be updated but writepages is
> >> +      * not called, this is called instead.  (Implementations
> >> +      * should call mapping_test_clear_cmtime.)
> >> +      */
> >> +     void (*flush_cmtime)(struct address_space *);
> >
> > You say it can be implemented in the ->writepage(s) method, and all
> > filesystems provide ->writepage(s) in some form. Therefore I would
> > have thought it be best to simply require filesystems to check that
> > mapping flag during those methods and update the inode directly when
> > that is set?
> 
> The problem with only doing it in ->writepages is that calling
> writepages from munmap and exit would probably hurt performance for no
> particular gain.  So I need some kind of callback to say "update the
> time, but don't write data."  The AS_CMTIME bit will still be set when
> the ptes are removed.

What's the point of updating the timestamp at unmap when it's going
to be changed again when the writeback occurs?

> I could require ->writepages *and* ->flush_cmtime to handle the time
> update, but that would complicate non-transactional filesystems.
> Those filesystems should just flush cmtime at the end of writepages.

do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.

And some filesystems provide a .writepages method that doesn't call
.writepage, so those filesystems will need to do the timestamp
update in those methods.

> > Indeed, the way you've set up the infrastructure, we'll have to
> > rewrite the cmtime update code to enable writepages to update this
> > within some other transaction. Perhaps you should just implement it
> > that way first?
> 
> This is already possible although not IMO necessary for correctness.
> All that ext4 would need to do is to add something like:
> 
> if (mapping_test_clear_cmtime(mapping)) {
>   update times within current transaction
> }

Where does it get the timestamps from? i.e. the update could call
prepare_update_cmtime() to do this, yes? And so having a wrapper
that does

	if (mapping_test_clear_cmtime(mapping))
		return prepare_update_cmtime(inode);
	return 0;

would work just fine for them, yes?

> somewhere inside the transaction in writepages.  There would probably
> be room for some kind of generic helper to do everything in
> inode_update_time_writable except for the actual mark_inode_dirty
> part, but this still seems nasty from a locking perspective, and I'd
> rather leave that optimization to an ext4 developer who wants to do
> it.

filesystems that implement .update_time don't need to mark the
struct inode dirty on timestamp updates. Similarly, filesystems that
use a writepages transaction to do the update don't either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20  4:08         ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  4:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> >> Filesystems that defer cmtime updates should update cmtime when any
> >> of these events happen after a write via a mapping:
> >>
> >>  - The mapping is written back to disk.  This happens from all kinds
> >>    of places, all of which eventually call ->writepages.
> >>
> >>  - munmap is called or the mapping is removed when the process exits
> >>
> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
> >>    time between an mmaped write and the subsequent msync call.
> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> >>
> >> Filesystmes that defer cmtime updates should flush them on munmap or
> >> exit.  Finding out that this happened through vm_ops is messy, so
> >> add a new address space op for this.
> >>
> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> >> but it simplifies the fs code.  As an optional optimization,
> >> filesystems can call mapping_test_clear_cmtime themselves in
> >> ->writepages (as long as they're careful to scan all the pages first
> >> -- the cmtime bit may not be set when ->writepages is entered).
> >
> > .flush_cmtime is effectively a duplicate method.  We already have
> > .update_time to notify filesystems that they need to update the
> > timestamp in the inode transactionally.
> 
> .update_time is used for the atime update as well, and it relies on
> the core code to update the in-memory timestamp first.

No, it doesn't. The caller of .update_time provides the timestamp to
that gets written into the relevant fields of the inode.

i.e. this code:

	now = current_fs_time(inode->i_sb);
	if (inode->i_op->update_time)
		return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);

will update *only* the ctime and mtime of the inode to have a value
of @now. That's precisely what you want to do, yes?

Indeed, your .flush_cmtime function effectively does:

	flags = prepare_cmtime(inode, &now)
		{ *now = current_fs_time()
		  flags = S_CTIME|S_MTIME;
		}
	update_time(inode, now, flags);
		{
			inode->i_op->update_time(inode, now, flags)
			or
			mark_inode_dirty_sync()
		}

IOWs, you're adding a filesystem specific method to update a
timestamp that wraps around a generic method for updating a
timestamp. i.e.  .flush_cmtime is not necessary - you could just
call inode_update_time_writable() from generic_writepages() and it
would simply just work for all filesystems....

> I used that
> approach in v2, but it was (correctly, I think) pointed out that this
> was a layering violation and that core code shouldn't be mucking with
> the timestamps directly during writeback.

If calling a generic method to update the timestamp is a layering
violation, then why is replacing that with a filesystem specific
method of updating a timestamp not a layering violation?

> > Indeed:
> >
> >> +     /*
> >> +      * Userspace expects certain system calls to update cmtime if
> >> +      * a file has been recently written using a shared vma.  In
> >> +      * cases where cmtime may need to be updated but writepages is
> >> +      * not called, this is called instead.  (Implementations
> >> +      * should call mapping_test_clear_cmtime.)
> >> +      */
> >> +     void (*flush_cmtime)(struct address_space *);
> >
> > You say it can be implemented in the ->writepage(s) method, and all
> > filesystems provide ->writepage(s) in some form. Therefore I would
> > have thought it be best to simply require filesystems to check that
> > mapping flag during those methods and update the inode directly when
> > that is set?
> 
> The problem with only doing it in ->writepages is that calling
> writepages from munmap and exit would probably hurt performance for no
> particular gain.  So I need some kind of callback to say "update the
> time, but don't write data."  The AS_CMTIME bit will still be set when
> the ptes are removed.

What's the point of updating the timestamp at unmap when it's going
to be changed again when the writeback occurs?

> I could require ->writepages *and* ->flush_cmtime to handle the time
> update, but that would complicate non-transactional filesystems.
> Those filesystems should just flush cmtime at the end of writepages.

do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.

And some filesystems provide a .writepages method that doesn't call
.writepage, so those filesystems will need to do the timestamp
update in those methods.

> > Indeed, the way you've set up the infrastructure, we'll have to
> > rewrite the cmtime update code to enable writepages to update this
> > within some other transaction. Perhaps you should just implement it
> > that way first?
> 
> This is already possible although not IMO necessary for correctness.
> All that ext4 would need to do is to add something like:
> 
> if (mapping_test_clear_cmtime(mapping)) {
>   update times within current transaction
> }

Where does it get the timestamps from? i.e. the update could call
prepare_update_cmtime() to do this, yes? And so having a wrapper
that does

	if (mapping_test_clear_cmtime(mapping))
		return prepare_update_cmtime(inode);
	return 0;

would work just fine for them, yes?

> somewhere inside the transaction in writepages.  There would probably
> be room for some kind of generic helper to do everything in
> inode_update_time_writable except for the actual mark_inode_dirty
> part, but this still seems nasty from a locking perspective, and I'd
> rather leave that optimization to an ext4 developer who wants to do
> it.

filesystems that implement .update_time don't need to mark the
struct inode dirty on timestamp updates. Similarly, filesystems that
use a writepages transaction to do the update don't either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
  2013-08-20  3:30       ` Andy Lutomirski
@ 2013-08-20  4:08         ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  4:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 08:30:02PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
> >> A fancier implementation could probably avoid an extra journal
> >> transaction by adding a mapping_test_clear_cmtime call in
> >> ext4_writepages, but this should already be a considerable
> >> improvement -- we'll start one transaction per writepages call
> >> instead of one per page.
> >
> > I'd like to see more than just an ext4 implementation - btrfs and
> > XFS are the other main filesystems that should behave identically.
> 
> Will do.
> 
> >
> > Also, it's worthwhile to write a generic xfstest to ensure that they
> > all update the timestamp appropriately - if its' in xfstests, then
> > we can basically guarantee that it won't get randomly regressed in
> > future, and other filesystems can be easily verified as well sa
> > their implement this.
> >
> 
> Is there a guide to writing an xfstest?

Yes. see the readme file in the root directory of the repository.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback
@ 2013-08-20  4:08         ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20  4:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 08:30:02PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:38 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Aug 16, 2013 at 04:22:12PM -0700, Andy Lutomirski wrote:
> >> A fancier implementation could probably avoid an extra journal
> >> transaction by adding a mapping_test_clear_cmtime call in
> >> ext4_writepages, but this should already be a considerable
> >> improvement -- we'll start one transaction per writepages call
> >> instead of one per page.
> >
> > I'd like to see more than just an ext4 implementation - btrfs and
> > XFS are the other main filesystems that should behave identically.
> 
> Will do.
> 
> >
> > Also, it's worthwhile to write a generic xfstest to ensure that they
> > all update the timestamp appropriately - if its' in xfstests, then
> > we can basically guarantee that it won't get randomly regressed in
> > future, and other filesystems can be easily verified as well sa
> > their implement this.
> >
> 
> Is there a guide to writing an xfstest?

Yes. see the readme file in the root directory of the repository.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20  4:08         ` Dave Chinner
@ 2013-08-20  4:14           ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  4:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-ext4, Theodore Ts'o, Dave Hansen, xfs,
	Jan Kara, Tim Chen, Christoph Hellwig

On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> >> Filesystems that defer cmtime updates should update cmtime when any
>> >> of these events happen after a write via a mapping:
>> >>
>> >>  - The mapping is written back to disk.  This happens from all kinds
>> >>    of places, all of which eventually call ->writepages.
>> >>
>> >>  - munmap is called or the mapping is removed when the process exits
>> >>
>> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>> >>    time between an mmaped write and the subsequent msync call.
>> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>> >>
>> >> Filesystmes that defer cmtime updates should flush them on munmap or
>> >> exit.  Finding out that this happened through vm_ops is messy, so
>> >> add a new address space op for this.
>> >>
>> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> >> but it simplifies the fs code.  As an optional optimization,
>> >> filesystems can call mapping_test_clear_cmtime themselves in
>> >> ->writepages (as long as they're careful to scan all the pages first
>> >> -- the cmtime bit may not be set when ->writepages is entered).
>> >
>> > .flush_cmtime is effectively a duplicate method.  We already have
>> > .update_time to notify filesystems that they need to update the
>> > timestamp in the inode transactionally.
>>
>> .update_time is used for the atime update as well, and it relies on
>> the core code to update the in-memory timestamp first.
>
> No, it doesn't. The caller of .update_time provides the timestamp to
> that gets written into the relevant fields of the inode.
>
> i.e. this code:
>
>         now = current_fs_time(inode->i_sb);
>         if (inode->i_op->update_time)
>                 return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);
>
> will update *only* the ctime and mtime of the inode to have a value
> of @now. That's precisely what you want to do, yes?
>
> Indeed, your .flush_cmtime function effectively does:
>
>         flags = prepare_cmtime(inode, &now)
>                 { *now = current_fs_time()
>                   flags = S_CTIME|S_MTIME;
>                 }
>         update_time(inode, now, flags);
>                 {
>                         inode->i_op->update_time(inode, now, flags)
>                         or
>                         mark_inode_dirty_sync()
>                 }
>
> IOWs, you're adding a filesystem specific method to update a
> timestamp that wraps around a generic method for updating a
> timestamp. i.e.  .flush_cmtime is not necessary - you could just
> call inode_update_time_writable() from generic_writepages() and it
> would simply just work for all filesystems....

OK, I'll try that out.

>
>> I used that
>> approach in v2, but it was (correctly, I think) pointed out that this
>> was a layering violation and that core code shouldn't be mucking with
>> the timestamps directly during writeback.
>
> If calling a generic method to update the timestamp is a layering
> violation, then why is replacing that with a filesystem specific
> method of updating a timestamp not a layering violation?
>
>> > Indeed:
>> >
>> >> +     /*
>> >> +      * Userspace expects certain system calls to update cmtime if
>> >> +      * a file has been recently written using a shared vma.  In
>> >> +      * cases where cmtime may need to be updated but writepages is
>> >> +      * not called, this is called instead.  (Implementations
>> >> +      * should call mapping_test_clear_cmtime.)
>> >> +      */
>> >> +     void (*flush_cmtime)(struct address_space *);
>> >
>> > You say it can be implemented in the ->writepage(s) method, and all
>> > filesystems provide ->writepage(s) in some form. Therefore I would
>> > have thought it be best to simply require filesystems to check that
>> > mapping flag during those methods and update the inode directly when
>> > that is set?
>>
>> The problem with only doing it in ->writepages is that calling
>> writepages from munmap and exit would probably hurt performance for no
>> particular gain.  So I need some kind of callback to say "update the
>> time, but don't write data."  The AS_CMTIME bit will still be set when
>> the ptes are removed.
>
> What's the point of updating the timestamp at unmap when it's going
> to be changed again when the writeback occurs?

To avoid breaking make and similar tools.  Suppose that an output file
already exists but is stale.  Some program gets called to update it,
and it opens it for write, mmaps it, updates it, calls munmap, and
exits.  Its parent will expect the timestamp to have been updated, but
writeback may not have happened.

>
>> I could require ->writepages *and* ->flush_cmtime to handle the time
>> update, but that would complicate non-transactional filesystems.
>> Those filesystems should just flush cmtime at the end of writepages.
>
> do_writepages() is the wrong place to do such updates - we can get
> writeback directly through .writepage, so the time updates need to
> be in .writepage. That first .writepage call will clear the bit on
> the mapping, so it's only done on the first call to .writepage on
> the given mapping.

Last time I checked, all the paths that actually needed the timestamp
update went through .writepages.  I'll double-check.

>
> And some filesystems provide a .writepages method that doesn't call
> .writepage, so those filesystems will need to do the timestamp
> update in those methods.
>
>> > Indeed, the way you've set up the infrastructure, we'll have to
>> > rewrite the cmtime update code to enable writepages to update this
>> > within some other transaction. Perhaps you should just implement it
>> > that way first?
>>
>> This is already possible although not IMO necessary for correctness.
>> All that ext4 would need to do is to add something like:
>>
>> if (mapping_test_clear_cmtime(mapping)) {
>>   update times within current transaction
>> }
>
> Where does it get the timestamps from? i.e. the update could call
> prepare_update_cmtime() to do this, yes? And so having a wrapper
> that does
>
>         if (mapping_test_clear_cmtime(mapping))
>                 return prepare_update_cmtime(inode);
>         return 0;
>
> would work just fine for them, yes?
>
>> somewhere inside the transaction in writepages.  There would probably
>> be room for some kind of generic helper to do everything in
>> inode_update_time_writable except for the actual mark_inode_dirty
>> part, but this still seems nasty from a locking perspective, and I'd
>> rather leave that optimization to an ext4 developer who wants to do
>> it.
>
> filesystems that implement .update_time don't need to mark the
> struct inode dirty on timestamp updates. Similarly, filesystems that
> use a writepages transaction to do the update don't either....

Fair enough.  I'll spin a v4 and see how it looks.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20  4:14           ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20  4:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, linux-kernel, xfs, Christoph Hellwig,
	Theodore Ts'o, linux-ext4, Tim Chen

On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> >> Filesystems that defer cmtime updates should update cmtime when any
>> >> of these events happen after a write via a mapping:
>> >>
>> >>  - The mapping is written back to disk.  This happens from all kinds
>> >>    of places, all of which eventually call ->writepages.
>> >>
>> >>  - munmap is called or the mapping is removed when the process exits
>> >>
>> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>> >>    msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>> >>    time between an mmaped write and the subsequent msync call.
>> >>    MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>> >>
>> >> Filesystmes that defer cmtime updates should flush them on munmap or
>> >> exit.  Finding out that this happened through vm_ops is messy, so
>> >> add a new address space op for this.
>> >>
>> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> >> but it simplifies the fs code.  As an optional optimization,
>> >> filesystems can call mapping_test_clear_cmtime themselves in
>> >> ->writepages (as long as they're careful to scan all the pages first
>> >> -- the cmtime bit may not be set when ->writepages is entered).
>> >
>> > .flush_cmtime is effectively a duplicate method.  We already have
>> > .update_time to notify filesystems that they need to update the
>> > timestamp in the inode transactionally.
>>
>> .update_time is used for the atime update as well, and it relies on
>> the core code to update the in-memory timestamp first.
>
> No, it doesn't. The caller of .update_time provides the timestamp to
> that gets written into the relevant fields of the inode.
>
> i.e. this code:
>
>         now = current_fs_time(inode->i_sb);
>         if (inode->i_op->update_time)
>                 return inode->i_op->update_time(inode, &now, S_CTIME|S_MTIME);
>
> will update *only* the ctime and mtime of the inode to have a value
> of @now. That's precisely what you want to do, yes?
>
> Indeed, your .flush_cmtime function effectively does:
>
>         flags = prepare_cmtime(inode, &now)
>                 { *now = current_fs_time()
>                   flags = S_CTIME|S_MTIME;
>                 }
>         update_time(inode, now, flags);
>                 {
>                         inode->i_op->update_time(inode, now, flags)
>                         or
>                         mark_inode_dirty_sync()
>                 }
>
> IOWs, you're adding a filesystem specific method to update a
> timestamp that wraps around a generic method for updating a
> timestamp. i.e.  .flush_cmtime is not necessary - you could just
> call inode_update_time_writable() from generic_writepages() and it
> would simply just work for all filesystems....

OK, I'll try that out.

>
>> I used that
>> approach in v2, but it was (correctly, I think) pointed out that this
>> was a layering violation and that core code shouldn't be mucking with
>> the timestamps directly during writeback.
>
> If calling a generic method to update the timestamp is a layering
> violation, then why is replacing that with a filesystem specific
> method of updating a timestamp not a layering violation?
>
>> > Indeed:
>> >
>> >> +     /*
>> >> +      * Userspace expects certain system calls to update cmtime if
>> >> +      * a file has been recently written using a shared vma.  In
>> >> +      * cases where cmtime may need to be updated but writepages is
>> >> +      * not called, this is called instead.  (Implementations
>> >> +      * should call mapping_test_clear_cmtime.)
>> >> +      */
>> >> +     void (*flush_cmtime)(struct address_space *);
>> >
>> > You say it can be implemented in the ->writepage(s) method, and all
>> > filesystems provide ->writepage(s) in some form. Therefore I would
>> > have thought it be best to simply require filesystems to check that
>> > mapping flag during those methods and update the inode directly when
>> > that is set?
>>
>> The problem with only doing it in ->writepages is that calling
>> writepages from munmap and exit would probably hurt performance for no
>> particular gain.  So I need some kind of callback to say "update the
>> time, but don't write data."  The AS_CMTIME bit will still be set when
>> the ptes are removed.
>
> What's the point of updating the timestamp at unmap when it's going
> to be changed again when the writeback occurs?

To avoid breaking make and similar tools.  Suppose that an output file
already exists but is stale.  Some program gets called to update it,
and it opens it for write, mmaps it, updates it, calls munmap, and
exits.  Its parent will expect the timestamp to have been updated, but
writeback may not have happened.

>
>> I could require ->writepages *and* ->flush_cmtime to handle the time
>> update, but that would complicate non-transactional filesystems.
>> Those filesystems should just flush cmtime at the end of writepages.
>
> do_writepages() is the wrong place to do such updates - we can get
> writeback directly through .writepage, so the time updates need to
> be in .writepage. That first .writepage call will clear the bit on
> the mapping, so it's only done on the first call to .writepage on
> the given mapping.

Last time I checked, all the paths that actually needed the timestamp
update went through .writepages.  I'll double-check.

>
> And some filesystems provide a .writepages method that doesn't call
> .writepage, so those filesystems will need to do the timestamp
> update in those methods.
>
>> > Indeed, the way you've set up the infrastructure, we'll have to
>> > rewrite the cmtime update code to enable writepages to update this
>> > within some other transaction. Perhaps you should just implement it
>> > that way first?
>>
>> This is already possible although not IMO necessary for correctness.
>> All that ext4 would need to do is to add something like:
>>
>> if (mapping_test_clear_cmtime(mapping)) {
>>   update times within current transaction
>> }
>
> Where does it get the timestamps from? i.e. the update could call
> prepare_update_cmtime() to do this, yes? And so having a wrapper
> that does
>
>         if (mapping_test_clear_cmtime(mapping))
>                 return prepare_update_cmtime(inode);
>         return 0;
>
> would work just fine for them, yes?
>
>> somewhere inside the transaction in writepages.  There would probably
>> be room for some kind of generic helper to do everything in
>> inode_update_time_writable except for the actual mark_inode_dirty
>> part, but this still seems nasty from a locking perspective, and I'd
>> rather leave that optimization to an ext4 developer who wants to do
>> it.
>
> filesystems that implement .update_time don't need to mark the
> struct inode dirty on timestamp updates. Similarly, filesystems that
> use a writepages transaction to do the update don't either....

Fair enough.  I'll spin a v4 and see how it looks.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20  4:14           ` Andy Lutomirski
@ 2013-08-20 16:00             ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2013-08-20 16:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Chinner, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> update, but that would complicate non-transactional filesystems.
> >> Those filesystems should just flush cmtime at the end of writepages.
> >
> > do_writepages() is the wrong place to do such updates - we can get
> > writeback directly through .writepage, so the time updates need to
> > be in .writepage. That first .writepage call will clear the bit on
> > the mapping, so it's only done on the first call to .writepage on
> > the given mapping.
> 
> Last time I checked, all the paths that actually needed the timestamp
> update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp update
should be handled there as well. Otherwise all pages in a mapping can be
cleaned without timestamp being updated.

Which btw made me realize that even your scheme doesn't completely make
sure timestamp is updated after mmap write - if you have pages 0 and 1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
is called. We write the page 0, writeprotect it, update timestamps. But
page 1 is still writeable so writes to it won't set CMTIME flag, neither
update the timestamp... Not that I think this can be reasonably solved but
it is a food for thought.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 16:00             ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2013-08-20 16:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> update, but that would complicate non-transactional filesystems.
> >> Those filesystems should just flush cmtime at the end of writepages.
> >
> > do_writepages() is the wrong place to do such updates - we can get
> > writeback directly through .writepage, so the time updates need to
> > be in .writepage. That first .writepage call will clear the bit on
> > the mapping, so it's only done on the first call to .writepage on
> > the given mapping.
> 
> Last time I checked, all the paths that actually needed the timestamp
> update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp update
should be handled there as well. Otherwise all pages in a mapping can be
cleaned without timestamp being updated.

Which btw made me realize that even your scheme doesn't completely make
sure timestamp is updated after mmap write - if you have pages 0 and 1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
is called. We write the page 0, writeprotect it, update timestamps. But
page 1 is still writeable so writes to it won't set CMTIME flag, neither
update the timestamp... Not that I think this can be reasonably solved but
it is a food for thought.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
  2013-08-20  4:07           ` Andy Lutomirski
@ 2013-08-20 16:10             ` Jan Kara
  -1 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2013-08-20 16:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Chinner, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Jan Kara, Tim Chen, Christoph Hellwig

On Mon 19-08-13 21:07:49, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 8:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
> >> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> >> >> This is like file_update_time, except that it acts on a struct inode *
> >> >> instead of a struct file *.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> >> ---
> >> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
> >> >>  include/linux/fs.h |  1 +
> >> >>  2 files changed, 58 insertions(+), 15 deletions(-)
> >> >>
> >>
> >> [...]
> >>
> >> >> +
> >> >> +int inode_update_time_writable(struct inode *inode)
> >> >> +{
> >> >> +     struct timespec now;
> >> >> +     int sync_it = prepare_update_cmtime(inode, &now);
> >> >> +     int ret;
> >> >> +
> >> >> +     if (!sync_it)
> >> >> +             return 0;
> >> >> +
> >> >> +     /* sb_start_pagefault and update_time can both sleep. */
> >> >> +     sb_start_pagefault(inode->i_sb);
> >> >> +     ret = update_time(inode, &now, sync_it);
> >> >> +     sb_end_pagefault(inode->i_sb);
> >> >
> >> > This gets called from the writeback path - you can't use
> >> > sb_start_pagefault/sb_end_pagefault in that path.
> >>
> >> The race I'm worried about is:
> >>
> >>  - mmap
> >>  - write to the mapping
> >>  - remount ro
> >>  - flush_cmtime -> inode_update_time_writable
> >
> > sb_start_pagefault() is for filesystem freeze protection, not
> > remount-ro protection. If you freeze the filesystem, then we stop
> > writes and pagefaults by making sb_start_pagefault/sb_start_write
> > block, and then run writeback to clean all the pages.  If writeback
> > then blocks on sb_start_pagefault(), we've got a deadlock.
> >
> >> This may be impossible, in which case I'm okay, but it's nice to have
> >> a sanity check.  I'll see if I can figure out how to do that.
> >
> > The process of remount-ro should flush the dirty pages - the inode
> > and page has been marked dirty by page_mkwrite(), after all.
> 
> Hmm.  We can land in here from writeback, in which case the time
> should be updated unconditionally.  We can also land in here from
> msync(MS_ASYNC) or munmap.  munmap at least shouldn't block.
> 
> The nasty case is if a page is dirtied, then the frozen level is set
> to SB_FREEZE_PAGEFAULT, and then userspace calls munmap or msync
> *before* writepages gets called.  In this case, blocking until the fs
> is unfrozen is probably impolite, and returning without updating the
> time is questionable.
> 
> Removing the check entirely may add a new race, though: what if
> .flush_cmtime has called mapping_test_clear_cmtime but hasn't gotten
> to updating the time yet when freezing finishes?  This could be
> prevented by changing generic_flush_cmtime to do __sb_start_write(sb,
> SB_FREEZE_FS, false) and doing nothing if the fs is already frozen.
  I think it should be a responsibility of the caller of .flush_cmtime (as
is the case of update_time()) to handle freeze protection if needed. As
Dave told you, writeback path mustn't actually take it. OTOH things like
munmap() or msync() need to get it because we must avoid changing the
filesystem when it is frozen and these calls can happen when fs is frozen.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 2/5] fs: Add inode_update_time_writable
@ 2013-08-20 16:10             ` Jan Kara
  0 siblings, 0 replies; 52+ messages in thread
From: Jan Kara @ 2013-08-20 16:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Mon 19-08-13 21:07:49, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 8:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Aug 19, 2013 at 08:20:12PM -0700, Andy Lutomirski wrote:
> >> On Mon, Aug 19, 2013 at 7:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Fri, Aug 16, 2013 at 04:22:09PM -0700, Andy Lutomirski wrote:
> >> >> This is like file_update_time, except that it acts on a struct inode *
> >> >> instead of a struct file *.
> >> >>
> >> >> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> >> >> ---
> >> >>  fs/inode.c         | 72 ++++++++++++++++++++++++++++++++++++++++++------------
> >> >>  include/linux/fs.h |  1 +
> >> >>  2 files changed, 58 insertions(+), 15 deletions(-)
> >> >>
> >>
> >> [...]
> >>
> >> >> +
> >> >> +int inode_update_time_writable(struct inode *inode)
> >> >> +{
> >> >> +     struct timespec now;
> >> >> +     int sync_it = prepare_update_cmtime(inode, &now);
> >> >> +     int ret;
> >> >> +
> >> >> +     if (!sync_it)
> >> >> +             return 0;
> >> >> +
> >> >> +     /* sb_start_pagefault and update_time can both sleep. */
> >> >> +     sb_start_pagefault(inode->i_sb);
> >> >> +     ret = update_time(inode, &now, sync_it);
> >> >> +     sb_end_pagefault(inode->i_sb);
> >> >
> >> > This gets called from the writeback path - you can't use
> >> > sb_start_pagefault/sb_end_pagefault in that path.
> >>
> >> The race I'm worried about is:
> >>
> >>  - mmap
> >>  - write to the mapping
> >>  - remount ro
> >>  - flush_cmtime -> inode_update_time_writable
> >
> > sb_start_pagefault() is for filesystem freeze protection, not
> > remount-ro protection. If you freeze the filesystem, then we stop
> > writes and pagefaults by making sb_start_pagefault/sb_start_write
> > block, and then run writeback to clean all the pages.  If writeback
> > then blocks on sb_start_pagefault(), we've got a deadlock.
> >
> >> This may be impossible, in which case I'm okay, but it's nice to have
> >> a sanity check.  I'll see if I can figure out how to do that.
> >
> > The process of remount-ro should flush the dirty pages - the inode
> > and page has been marked dirty by page_mkwrite(), after all.
> 
> Hmm.  We can land in here from writeback, in which case the time
> should be updated unconditionally.  We can also land in here from
> msync(MS_ASYNC) or munmap.  munmap at least shouldn't block.
> 
> The nasty case is if a page is dirtied, then the frozen level is set
> to SB_FREEZE_PAGEFAULT, and then userspace calls munmap or msync
> *before* writepages gets called.  In this case, blocking until the fs
> is unfrozen is probably impolite, and returning without updating the
> time is questionable.
> 
> Removing the check entirely may add a new race, though: what if
> .flush_cmtime has called mapping_test_clear_cmtime but hasn't gotten
> to updating the time yet when freezing finishes?  This could be
> prevented by changing generic_flush_cmtime to do __sb_start_write(sb,
> SB_FREEZE_FS, false) and doing nothing if the fs is already frozen.
  I think it should be a responsibility of the caller of .flush_cmtime (as
is the case of update_time()) to handle freeze protection if needed. As
Dave told you, writeback path mustn't actually take it. OTOH things like
munmap() or msync() need to get it because we must avoid changing the
filesystem when it is frozen and these calls can happen when fs is frozen.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 16:00             ` Jan Kara
@ 2013-08-20 16:42               ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 16:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> update, but that would complicate non-transactional filesystems.
>> >> Those filesystems should just flush cmtime at the end of writepages.
>> >
>> > do_writepages() is the wrong place to do such updates - we can get
>> > writeback directly through .writepage, so the time updates need to
>> > be in .writepage. That first .writepage call will clear the bit on
>> > the mapping, so it's only done on the first call to .writepage on
>> > the given mapping.
>>
>> Last time I checked, all the paths that actually needed the timestamp
>> update went through .writepages.  I'll double-check.
>   kswapd can call just .writepage to do the writeout so timestamp update
> should be handled there as well. Otherwise all pages in a mapping can be
> cleaned without timestamp being updated.

OK, I'll fix that.

>
> Which btw made me realize that even your scheme doesn't completely make
> sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> is called. We write the page 0, writeprotect it, update timestamps. But
> page 1 is still writeable so writes to it won't set CMTIME flag, neither
> update the timestamp... Not that I think this can be reasonably solved but
> it is a food for thought.

This should already work.  AS_CMTIME is set when the pte goes from
dirty to clean, not when the pte goes from wp to writable.  So
whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
be set and a subsequent writepages call will update the timestamp.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 16:42               ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 16:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> update, but that would complicate non-transactional filesystems.
>> >> Those filesystems should just flush cmtime at the end of writepages.
>> >
>> > do_writepages() is the wrong place to do such updates - we can get
>> > writeback directly through .writepage, so the time updates need to
>> > be in .writepage. That first .writepage call will clear the bit on
>> > the mapping, so it's only done on the first call to .writepage on
>> > the given mapping.
>>
>> Last time I checked, all the paths that actually needed the timestamp
>> update went through .writepages.  I'll double-check.
>   kswapd can call just .writepage to do the writeout so timestamp update
> should be handled there as well. Otherwise all pages in a mapping can be
> cleaned without timestamp being updated.

OK, I'll fix that.

>
> Which btw made me realize that even your scheme doesn't completely make
> sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> is called. We write the page 0, writeprotect it, update timestamps. But
> page 1 is still writeable so writes to it won't set CMTIME flag, neither
> update the timestamp... Not that I think this can be reasonably solved but
> it is a food for thought.

This should already work.  AS_CMTIME is set when the pte goes from
dirty to clean, not when the pte goes from wp to writable.  So
whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
be set and a subsequent writepages call will update the timestamp.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 16:42               ` Andy Lutomirski
@ 2013-08-20 19:27                 ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 19:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>>> >> update, but that would complicate non-transactional filesystems.
>>> >> Those filesystems should just flush cmtime at the end of writepages.
>>> >
>>> > do_writepages() is the wrong place to do such updates - we can get
>>> > writeback directly through .writepage, so the time updates need to
>>> > be in .writepage. That first .writepage call will clear the bit on
>>> > the mapping, so it's only done on the first call to .writepage on
>>> > the given mapping.
>>>
>>> Last time I checked, all the paths that actually needed the timestamp
>>> update went through .writepages.  I'll double-check.
>>   kswapd can call just .writepage to do the writeout so timestamp update
>> should be handled there as well. Otherwise all pages in a mapping can be
>> cleaned without timestamp being updated.
>
> OK, I'll fix that.

This is a bit ugly.  mpage_writepages and generic_writepages both call
->writepage.  If writepage starts checking cmtime, then writepages
will do multiple timestamp updates on filesystems that use this code.

I can modify vmscan and migrate to flush cmtime -- they seem to be the
only callers of ->writepage that aren't themselves called from
->writepages.

--Andy

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 19:27                 ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 19:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>>> >> update, but that would complicate non-transactional filesystems.
>>> >> Those filesystems should just flush cmtime at the end of writepages.
>>> >
>>> > do_writepages() is the wrong place to do such updates - we can get
>>> > writeback directly through .writepage, so the time updates need to
>>> > be in .writepage. That first .writepage call will clear the bit on
>>> > the mapping, so it's only done on the first call to .writepage on
>>> > the given mapping.
>>>
>>> Last time I checked, all the paths that actually needed the timestamp
>>> update went through .writepages.  I'll double-check.
>>   kswapd can call just .writepage to do the writeout so timestamp update
>> should be handled there as well. Otherwise all pages in a mapping can be
>> cleaned without timestamp being updated.
>
> OK, I'll fix that.

This is a bit ugly.  mpage_writepages and generic_writepages both call
->writepage.  If writepage starts checking cmtime, then writepages
will do multiple timestamp updates on filesystems that use this code.

I can modify vmscan and migrate to flush cmtime -- they seem to be the
only callers of ->writepage that aren't themselves called from
->writepages.

--Andy

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 16:42               ` Andy Lutomirski
@ 2013-08-20 21:48                 ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> update, but that would complicate non-transactional filesystems.
> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >
> >> > do_writepages() is the wrong place to do such updates - we can get
> >> > writeback directly through .writepage, so the time updates need to
> >> > be in .writepage. That first .writepage call will clear the bit on
> >> > the mapping, so it's only done on the first call to .writepage on
> >> > the given mapping.
> >>
> >> Last time I checked, all the paths that actually needed the timestamp
> >> update went through .writepages.  I'll double-check.
> >   kswapd can call just .writepage to do the writeout so timestamp update
> > should be handled there as well. Otherwise all pages in a mapping can be
> > cleaned without timestamp being updated.
> 
> OK, I'll fix that.
> 
> >
> > Which btw made me realize that even your scheme doesn't completely make
> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> > is called. We write the page 0, writeprotect it, update timestamps. But
> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> > update the timestamp... Not that I think this can be reasonably solved but
> > it is a food for thought.
> 
> This should already work.  AS_CMTIME is set when the pte goes from
> dirty to clean, not when the pte goes from wp to writable.  So
> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> be set and a subsequent writepages call will update the timestamp.

Oh, I missed that - I thought you were setting AS_CMTIME during
.page_mkwrite.

Setting it in clear_page_dirty_for_io() is too late for filesystems
to include it in their existing transactions during .writepage, (at
least for XFs and ext4) because they do their delayed allocation
transactions before changing page state....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 21:48                 ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> update, but that would complicate non-transactional filesystems.
> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >
> >> > do_writepages() is the wrong place to do such updates - we can get
> >> > writeback directly through .writepage, so the time updates need to
> >> > be in .writepage. That first .writepage call will clear the bit on
> >> > the mapping, so it's only done on the first call to .writepage on
> >> > the given mapping.
> >>
> >> Last time I checked, all the paths that actually needed the timestamp
> >> update went through .writepages.  I'll double-check.
> >   kswapd can call just .writepage to do the writeout so timestamp update
> > should be handled there as well. Otherwise all pages in a mapping can be
> > cleaned without timestamp being updated.
> 
> OK, I'll fix that.
> 
> >
> > Which btw made me realize that even your scheme doesn't completely make
> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> > is called. We write the page 0, writeprotect it, update timestamps. But
> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> > update the timestamp... Not that I think this can be reasonably solved but
> > it is a food for thought.
> 
> This should already work.  AS_CMTIME is set when the pte goes from
> dirty to clean, not when the pte goes from wp to writable.  So
> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> be set and a subsequent writepages call will update the timestamp.

Oh, I missed that - I thought you were setting AS_CMTIME during
.page_mkwrite.

Setting it in clear_page_dirty_for_io() is too late for filesystems
to include it in their existing transactions during .writepage, (at
least for XFs and ext4) because they do their delayed allocation
transactions before changing page state....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 21:48                 ` Dave Chinner
@ 2013-08-20 21:54                   ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 21:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >
>> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> > writeback directly through .writepage, so the time updates need to
>> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> > the mapping, so it's only done on the first call to .writepage on
>> >> > the given mapping.
>> >>
>> >> Last time I checked, all the paths that actually needed the timestamp
>> >> update went through .writepages.  I'll double-check.
>> >   kswapd can call just .writepage to do the writeout so timestamp update
>> > should be handled there as well. Otherwise all pages in a mapping can be
>> > cleaned without timestamp being updated.
>>
>> OK, I'll fix that.
>>
>> >
>> > Which btw made me realize that even your scheme doesn't completely make
>> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> > is called. We write the page 0, writeprotect it, update timestamps. But
>> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> > update the timestamp... Not that I think this can be reasonably solved but
>> > it is a food for thought.
>>
>> This should already work.  AS_CMTIME is set when the pte goes from
>> dirty to clean, not when the pte goes from wp to writable.  So
>> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> be set and a subsequent writepages call will update the timestamp.
>
> Oh, I missed that - I thought you were setting AS_CMTIME during
> .page_mkwrite.
>
> Setting it in clear_page_dirty_for_io() is too late for filesystems
> to include it in their existing transactions during .writepage, (at
> least for XFs and ext4) because they do their delayed allocation
> transactions before changing page state....

Couldn't it go between mpage_map_and_submit_extent and
ext4_journal_stop in ext4_writepages?

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 21:54                   ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-20 21:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >
>> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> > writeback directly through .writepage, so the time updates need to
>> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> > the mapping, so it's only done on the first call to .writepage on
>> >> > the given mapping.
>> >>
>> >> Last time I checked, all the paths that actually needed the timestamp
>> >> update went through .writepages.  I'll double-check.
>> >   kswapd can call just .writepage to do the writeout so timestamp update
>> > should be handled there as well. Otherwise all pages in a mapping can be
>> > cleaned without timestamp being updated.
>>
>> OK, I'll fix that.
>>
>> >
>> > Which btw made me realize that even your scheme doesn't completely make
>> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> > is called. We write the page 0, writeprotect it, update timestamps. But
>> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> > update the timestamp... Not that I think this can be reasonably solved but
>> > it is a food for thought.
>>
>> This should already work.  AS_CMTIME is set when the pte goes from
>> dirty to clean, not when the pte goes from wp to writable.  So
>> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> be set and a subsequent writepages call will update the timestamp.
>
> Oh, I missed that - I thought you were setting AS_CMTIME during
> .page_mkwrite.
>
> Setting it in clear_page_dirty_for_io() is too late for filesystems
> to include it in their existing transactions during .writepage, (at
> least for XFs and ext4) because they do their delayed allocation
> transactions before changing page state....

Couldn't it go between mpage_map_and_submit_extent and
ext4_journal_stop in ext4_writepages?

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 21:54                   ` Andy Lutomirski
@ 2013-08-20 22:43                     ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20 22:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >
> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> > the given mapping.
> >> >>
> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> update went through .writepages.  I'll double-check.
> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> > cleaned without timestamp being updated.
> >>
> >> OK, I'll fix that.
> >>
> >> >
> >> > Which btw made me realize that even your scheme doesn't completely make
> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> > it is a food for thought.
> >>
> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> be set and a subsequent writepages call will update the timestamp.
> >
> > Oh, I missed that - I thought you were setting AS_CMTIME during
> > .page_mkwrite.
> >
> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> > to include it in their existing transactions during .writepage, (at
> > least for XFs and ext4) because they do their delayed allocation
> > transactions before changing page state....
> 
> Couldn't it go between mpage_map_and_submit_extent and
> ext4_journal_stop in ext4_writepages?

Maybe - I'm not an ext4 expert - but even if you can make it work
for ext4 in some way, that doesn't mean it is possible for any other
filesystem to use the same method. You're adding code to generic,
non-filesystem specific code paths and so the solutions need to be
generic rather not tied to how a specific filesystem is implemented.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-20 22:43                     ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-20 22:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >
> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> > the given mapping.
> >> >>
> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> update went through .writepages.  I'll double-check.
> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> > cleaned without timestamp being updated.
> >>
> >> OK, I'll fix that.
> >>
> >> >
> >> > Which btw made me realize that even your scheme doesn't completely make
> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> > it is a food for thought.
> >>
> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> be set and a subsequent writepages call will update the timestamp.
> >
> > Oh, I missed that - I thought you were setting AS_CMTIME during
> > .page_mkwrite.
> >
> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> > to include it in their existing transactions during .writepage, (at
> > least for XFs and ext4) because they do their delayed allocation
> > transactions before changing page state....
> 
> Couldn't it go between mpage_map_and_submit_extent and
> ext4_journal_stop in ext4_writepages?

Maybe - I'm not an ext4 expert - but even if you can make it work
for ext4 in some way, that doesn't mean it is possible for any other
filesystem to use the same method. You're adding code to generic,
non-filesystem specific code paths and so the solutions need to be
generic rather not tied to how a specific filesystem is implemented.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-20 22:43                     ` Dave Chinner
@ 2013-08-21  0:47                       ` Andy Lutomirski
  -1 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-21  0:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >> >
>> >> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> >> > writeback directly through .writepage, so the time updates need to
>> >> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> >> > the mapping, so it's only done on the first call to .writepage on
>> >> >> > the given mapping.
>> >> >>
>> >> >> Last time I checked, all the paths that actually needed the timestamp
>> >> >> update went through .writepages.  I'll double-check.
>> >> >   kswapd can call just .writepage to do the writeout so timestamp update
>> >> > should be handled there as well. Otherwise all pages in a mapping can be
>> >> > cleaned without timestamp being updated.
>> >>
>> >> OK, I'll fix that.
>> >>
>> >> >
>> >> > Which btw made me realize that even your scheme doesn't completely make
>> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> >> > is called. We write the page 0, writeprotect it, update timestamps. But
>> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> >> > update the timestamp... Not that I think this can be reasonably solved but
>> >> > it is a food for thought.
>> >>
>> >> This should already work.  AS_CMTIME is set when the pte goes from
>> >> dirty to clean, not when the pte goes from wp to writable.  So
>> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> >> be set and a subsequent writepages call will update the timestamp.
>> >
>> > Oh, I missed that - I thought you were setting AS_CMTIME during
>> > .page_mkwrite.
>> >
>> > Setting it in clear_page_dirty_for_io() is too late for filesystems
>> > to include it in their existing transactions during .writepage, (at
>> > least for XFs and ext4) because they do their delayed allocation
>> > transactions before changing page state....
>>
>> Couldn't it go between mpage_map_and_submit_extent and
>> ext4_journal_stop in ext4_writepages?
>
> Maybe - I'm not an ext4 expert - but even if you can make it work
> for ext4 in some way, that doesn't mean it is possible for any other
> filesystem to use the same method. You're adding code to generic,
> non-filesystem specific code paths and so the solutions need to be
> generic rather not tied to how a specific filesystem is implemented.
>

I don't see the problem for xfs or btrfs either.

xfs uses generic_writepages, which already does the right thing.  (xfs
with my updated patches passes my tests.)  xfs_vm_writepage calls
xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
called.  At that point (I presume), it would still be possible to add
metadata to a transaction (assuming there's a transaction open -- I
don't have a clue here).  Even if this is too late, xfs_vm_writepage
could call page_mkwrite to for AS_CMTIME to be set if needed.
page_mkwrite will be fast if the page isn't mmapped.  What am I
missing?

btrfs seems to do much the same thing.

--Andy


> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-21  0:47                       ` Andy Lutomirski
  0 siblings, 0 replies; 52+ messages in thread
From: Andy Lutomirski @ 2013-08-21  0:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >> >
>> >> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> >> > writeback directly through .writepage, so the time updates need to
>> >> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> >> > the mapping, so it's only done on the first call to .writepage on
>> >> >> > the given mapping.
>> >> >>
>> >> >> Last time I checked, all the paths that actually needed the timestamp
>> >> >> update went through .writepages.  I'll double-check.
>> >> >   kswapd can call just .writepage to do the writeout so timestamp update
>> >> > should be handled there as well. Otherwise all pages in a mapping can be
>> >> > cleaned without timestamp being updated.
>> >>
>> >> OK, I'll fix that.
>> >>
>> >> >
>> >> > Which btw made me realize that even your scheme doesn't completely make
>> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> >> > is called. We write the page 0, writeprotect it, update timestamps. But
>> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> >> > update the timestamp... Not that I think this can be reasonably solved but
>> >> > it is a food for thought.
>> >>
>> >> This should already work.  AS_CMTIME is set when the pte goes from
>> >> dirty to clean, not when the pte goes from wp to writable.  So
>> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> >> be set and a subsequent writepages call will update the timestamp.
>> >
>> > Oh, I missed that - I thought you were setting AS_CMTIME during
>> > .page_mkwrite.
>> >
>> > Setting it in clear_page_dirty_for_io() is too late for filesystems
>> > to include it in their existing transactions during .writepage, (at
>> > least for XFs and ext4) because they do their delayed allocation
>> > transactions before changing page state....
>>
>> Couldn't it go between mpage_map_and_submit_extent and
>> ext4_journal_stop in ext4_writepages?
>
> Maybe - I'm not an ext4 expert - but even if you can make it work
> for ext4 in some way, that doesn't mean it is possible for any other
> filesystem to use the same method. You're adding code to generic,
> non-filesystem specific code paths and so the solutions need to be
> generic rather not tied to how a specific filesystem is implemented.
>

I don't see the problem for xfs or btrfs either.

xfs uses generic_writepages, which already does the right thing.  (xfs
with my updated patches passes my tests.)  xfs_vm_writepage calls
xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
called.  At that point (I presume), it would still be possible to add
metadata to a transaction (assuming there's a transaction open -- I
don't have a clue here).  Even if this is too late, xfs_vm_writepage
could call page_mkwrite to for AS_CMTIME to be set if needed.
page_mkwrite will be fast if the page isn't mmapped.  What am I
missing?

btrfs seems to do much the same thing.

--Andy


> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
  2013-08-21  0:47                       ` Andy Lutomirski
@ 2013-08-21  1:33                         ` Dave Chinner
  -1 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-21  1:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-kernel, linux-ext4, Theodore Ts'o,
	Dave Hansen, xfs, Tim Chen, Christoph Hellwig

On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >> >
> >> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> >> > the given mapping.
> >> >> >>
> >> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> >> update went through .writepages.  I'll double-check.
> >> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> >> > cleaned without timestamp being updated.
> >> >>
> >> >> OK, I'll fix that.
> >> >>
> >> >> >
> >> >> > Which btw made me realize that even your scheme doesn't completely make
> >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> >> > it is a food for thought.
> >> >>
> >> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> >> be set and a subsequent writepages call will update the timestamp.
> >> >
> >> > Oh, I missed that - I thought you were setting AS_CMTIME during
> >> > .page_mkwrite.
> >> >
> >> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> >> > to include it in their existing transactions during .writepage, (at
> >> > least for XFs and ext4) because they do their delayed allocation
> >> > transactions before changing page state....
> >>
> >> Couldn't it go between mpage_map_and_submit_extent and
> >> ext4_journal_stop in ext4_writepages?
> >
> > Maybe - I'm not an ext4 expert - but even if you can make it work
> > for ext4 in some way, that doesn't mean it is possible for any other
> > filesystem to use the same method. You're adding code to generic,
> > non-filesystem specific code paths and so the solutions need to be
> > generic rather not tied to how a specific filesystem is implemented.
> >
> 
> I don't see the problem for xfs or btrfs either.
> 
> xfs uses generic_writepages, which already does the right thing.  (xfs
> with my updated patches passes my tests.)  xfs_vm_writepage calls
> xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
> called.  At that point (I presume), it would still be possible to add
> metadata to a transaction (assuming there's a transaction open -- I
> don't have a clue here). 

That's my point - there isn't a transaction in XFS at this point,
and so if we gather that flag from clear_page_dirty_for_io() we'd
need a second transaction and therefore the optimisation you want
filesystems to use to mitigate the additional overhead can't be done
for all commonly used filesystems.

> Even if this is too late, xfs_vm_writepage
> could call page_mkwrite to for AS_CMTIME to be set if needed.
> page_mkwrite will be fast if the page isn't mmapped.  What am I
> missing?

That it leads to different behaviour for different filesystems.

i.e. page_mkwrite on page A sets the flag. writeback on a range that
doesn't include page A occurs, sees the flag, clears it after
updating the timestamp. Some time later writeback on page A occurs,
no timestamp update occurs.

The behaviour needs to be consistent across filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update
@ 2013-08-21  1:33                         ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2013-08-21  1:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Dave Hansen, linux-kernel, xfs,
	Christoph Hellwig, Jan Kara, linux-ext4, Tim Chen

On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara <jack@suse.cz> wrote:
> >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >> >
> >> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> >> > the given mapping.
> >> >> >>
> >> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> >> update went through .writepages.  I'll double-check.
> >> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> >> > cleaned without timestamp being updated.
> >> >>
> >> >> OK, I'll fix that.
> >> >>
> >> >> >
> >> >> > Which btw made me realize that even your scheme doesn't completely make
> >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> >> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> >> > update the timestamp... Not that I think this can be reasonably solved but
> >> >> > it is a food for thought.
> >> >>
> >> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> >> be set and a subsequent writepages call will update the timestamp.
> >> >
> >> > Oh, I missed that - I thought you were setting AS_CMTIME during
> >> > .page_mkwrite.
> >> >
> >> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> >> > to include it in their existing transactions during .writepage, (at
> >> > least for XFs and ext4) because they do their delayed allocation
> >> > transactions before changing page state....
> >>
> >> Couldn't it go between mpage_map_and_submit_extent and
> >> ext4_journal_stop in ext4_writepages?
> >
> > Maybe - I'm not an ext4 expert - but even if you can make it work
> > for ext4 in some way, that doesn't mean it is possible for any other
> > filesystem to use the same method. You're adding code to generic,
> > non-filesystem specific code paths and so the solutions need to be
> > generic rather not tied to how a specific filesystem is implemented.
> >
> 
> I don't see the problem for xfs or btrfs either.
> 
> xfs uses generic_writepages, which already does the right thing.  (xfs
> with my updated patches passes my tests.)  xfs_vm_writepage calls
> xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
> called.  At that point (I presume), it would still be possible to add
> metadata to a transaction (assuming there's a transaction open -- I
> don't have a clue here). 

That's my point - there isn't a transaction in XFS at this point,
and so if we gather that flag from clear_page_dirty_for_io() we'd
need a second transaction and therefore the optimisation you want
filesystems to use to mitigate the additional overhead can't be done
for all commonly used filesystems.

> Even if this is too late, xfs_vm_writepage
> could call page_mkwrite to for AS_CMTIME to be set if needed.
> page_mkwrite will be fast if the page isn't mmapped.  What am I
> missing?

That it leads to different behaviour for different filesystems.

i.e. page_mkwrite on page A sets the flag. writeback on a range that
doesn't include page A occurs, sees the flag, clears it after
updating the timestamp. Some time later writeback on page A occurs,
no timestamp update occurs.

The behaviour needs to be consistent across filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-08-21  1:33 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-16 23:22 [PATCH v3 0/5] Rework mtime and ctime updates on mmaped Andy Lutomirski
2013-08-16 23:22 ` Andy Lutomirski
2013-08-16 23:22 ` [PATCH v3 1/5] mm: Track mappings that have been written via ptes Andy Lutomirski
2013-08-16 23:22   ` Andy Lutomirski
2013-08-16 23:22 ` [PATCH v3 2/5] fs: Add inode_update_time_writable Andy Lutomirski
2013-08-16 23:22   ` Andy Lutomirski
2013-08-20  2:28   ` Dave Chinner
2013-08-20  2:28     ` Dave Chinner
2013-08-20  3:20     ` Andy Lutomirski
2013-08-20  3:20       ` Andy Lutomirski
2013-08-20  3:33       ` Dave Chinner
2013-08-20  3:33         ` Dave Chinner
2013-08-20  4:07         ` Andy Lutomirski
2013-08-20  4:07           ` Andy Lutomirski
2013-08-20 16:10           ` Jan Kara
2013-08-20 16:10             ` Jan Kara
2013-08-16 23:22 ` [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update Andy Lutomirski
2013-08-16 23:22   ` Andy Lutomirski
2013-08-20  2:36   ` Dave Chinner
2013-08-20  2:36     ` Dave Chinner
2013-08-20  3:28     ` Andy Lutomirski
2013-08-20  3:28       ` Andy Lutomirski
2013-08-20  4:08       ` Dave Chinner
2013-08-20  4:08         ` Dave Chinner
2013-08-20  4:14         ` Andy Lutomirski
2013-08-20  4:14           ` Andy Lutomirski
2013-08-20 16:00           ` Jan Kara
2013-08-20 16:00             ` Jan Kara
2013-08-20 16:42             ` Andy Lutomirski
2013-08-20 16:42               ` Andy Lutomirski
2013-08-20 19:27               ` Andy Lutomirski
2013-08-20 19:27                 ` Andy Lutomirski
2013-08-20 21:48               ` Dave Chinner
2013-08-20 21:48                 ` Dave Chinner
2013-08-20 21:54                 ` Andy Lutomirski
2013-08-20 21:54                   ` Andy Lutomirski
2013-08-20 22:43                   ` Dave Chinner
2013-08-20 22:43                     ` Dave Chinner
2013-08-21  0:47                     ` Andy Lutomirski
2013-08-21  0:47                       ` Andy Lutomirski
2013-08-21  1:33                       ` Dave Chinner
2013-08-21  1:33                         ` Dave Chinner
2013-08-16 23:22 ` [PATCH v3 4/5] mm: Scan for dirty ptes and update cmtime on MS_ASYNC Andy Lutomirski
2013-08-16 23:22   ` Andy Lutomirski
2013-08-16 23:22 ` [PATCH v3 5/5] ext4: Defer mmap cmtime update until writeback Andy Lutomirski
2013-08-16 23:22   ` Andy Lutomirski
2013-08-20  2:38   ` Dave Chinner
2013-08-20  2:38     ` Dave Chinner
2013-08-20  3:30     ` Andy Lutomirski
2013-08-20  3:30       ` Andy Lutomirski
2013-08-20  4:08       ` Dave Chinner
2013-08-20  4:08         ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.