All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-25 21:10 ` Miklos Szeredi, Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-25 21:10 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

From: Miklos Szeredi <mszeredi@suse.cz>

Changes:
v4:
 o rename test_clear_page_modified to page_mkclean_noprot
 o clean up page_mkclean_noprot
 o don't set AS_CMTIME from fault handler, since that also sets the PTE dirty
 o only update c/mtime in munmap, if file is not mapped any more
 o cleanups
v3:
 o rename is_page_modified to test_clear_page_modified
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time dirtyness from the page table is transferred to the page or if
the page is dirtied without the page table being set dirty.

Note, the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

Msync checks this flag and also dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write.  This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page.  Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes.  This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each.  This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

In munmap if there are no more memory mappings of this file, and the
AS_CMTIME flag has been set, the file times are updated.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h	2007-03-25 19:00:35.000000000 +0200
@@ -19,6 +19,7 @@
  */
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
+#define AS_CMTIME	(__GFP_BITS_SHIFT + 2)	/* ctime/mtime update needed */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h	2007-03-25 19:00:36.000000000 +0200
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/memory.c	2007-03-25 19:00:36.000000000 +0200
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
 				anon_rss--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty_mapping(page);
 				if (pte_young(ptent))
 					SetPageReferenced(page);
 				file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		mark_page_accessed(page);
 	}
 unlock:
Index: linux-2.6.21-rc4-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/page-writeback.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/page-writeback.c	2007-03-25 19:00:36.000000000 +0200
@@ -848,17 +848,42 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
+static inline int __set_page_dirty(struct address_space *mapping,
+				   struct page *page)
+{
+	int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+	if (!spd)
+		spd = __set_page_dirty_buffers;
+#endif
+	return (*spd)(page);
+}
+
 int fastcall set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
+	if (likely(mapping))
+		return __set_page_dirty(mapping, page);
+	if (!PageDirty(page)) {
+		if (!TestSetPageDirty(page))
+			return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty);
+
+/*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping.  In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
 	if (likely(mapping)) {
-		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-		if (!spd)
-			spd = __set_page_dirty_buffers;
-#endif
-		return (*spd)(page);
+		set_bit(AS_CMTIME, &mapping->flags);
+		return __set_page_dirty(mapping, page);
 	}
 	if (!PageDirty(page)) {
 		if (!TestSetPageDirty(page))
@@ -866,7 +891,6 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
-EXPORT_SYMBOL(set_page_dirty);
 
 /*
  * set_page_dirty() is racy if the caller has no reference against
@@ -936,7 +960,7 @@ int clear_page_dirty_for_io(struct page 
 		 * threads doing their things.
 		 */
 		if (page_mkclean(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
Index: linux-2.6.21-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/rmap.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/rmap.c	2007-03-25 21:38:18.000000000 +0200
@@ -506,6 +506,49 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+static int page_mkclean_one_noprot(struct page *page,
+				   struct vm_area_struct *vma)
+{
+	int modified = 0;
+	unsigned long address = vma_address(page, vma);
+	if (address != -EFAULT) {
+		struct mm_struct *mm = vma->vm_mm;
+		spinlock_t *ptl;
+		pte_t *pte = page_check_address(page, mm, address, &ptl);
+		if (pte) {
+			if (ptep_clear_flush_dirty(vma, address, pte))
+				modified = 1;
+			pte_unmap_unlock(pte, ptl);
+		}
+	}
+	return modified;
+}
+
+/**
+ * page_mkclean_noprot - check and clear the dirty bit for all mappings of a page
+ * @page:	the page to check
+ */
+int page_mkclean_noprot(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int modified = 0;
+
+	BUG_ON(!page_mapped(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_flags & VM_SHARED)
+			modified |= page_mkclean_one_noprot(page, vma);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	if (page_test_and_clear_dirty(page))
+		modified = 1;
+	return modified;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
@@ -657,7 +700,7 @@ void page_remove_rmap(struct page *page,
 		 * faster for those pages still in swapcache.
 		 */
 		if (page_test_and_clear_dirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
@@ -702,7 +745,7 @@ static int try_to_unmap_one(struct page 
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		set_page_dirty_mapping(page);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -836,7 +879,7 @@ static void try_to_unmap_cluster(unsigne
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 
 		page_remove_rmap(page, vma);
 		page_cache_release(page);
Index: linux-2.6.21-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/mmap.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/mmap.c	2007-03-25 19:00:36.000000000 +0200
@@ -222,12 +222,30 @@ void unlink_file_vma(struct vm_area_stru
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *file = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (file) {
+		struct address_space *mapping = file->f_mapping;
+		int update_cmtime = 0;
+		/*
+		 * Only update c/mtime if there are no more memory maps
+		 * referring to this inode.  Otherwise it would be possible,
+		 * that some modification info remains in page tables of
+		 * other mappings, and the times would be updated again,
+		 * even though the file wasn't modified after this
+		 */
+		spin_lock(&mapping->i_mmap_lock);
+		if (prio_tree_empty(&mapping->i_mmap) &&
+		    test_and_clear_bit(AS_CMTIME, &file->f_mapping->flags))
+			update_cmtime = 1;
+		spin_unlock(&mapping->i_mmap_lock);
+		if (update_cmtime)
+			file_update_time(file);
+		fput(file);
+	}
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
Index: linux-2.6.21-rc4-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/hugetlb.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/hugetlb.c	2007-03-25 19:00:36.000000000 +0200
@@ -407,7 +407,7 @@ void __unmap_hugepage_range(struct vm_ar
 
 		page = pte_page(pte);
 		if (pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		list_add(&page->lru, &page_list);
 	}
 	spin_unlock(&mm->page_table_lock);
Index: linux-2.6.21-rc4-mm1/mm/msync.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/msync.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/msync.c	2007-03-25 19:00:36.000000000 +0200
@@ -12,6 +12,81 @@
 #include <linux/mman.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/pagevec.h>
+
+/*
+ * Update ctime/mtime on msync().
+ *
+ * POSIX requires, that the times are updated between a modification
+ * of the file through a memory mapping and the next msync for a
+ * region containing the modification.  The wording implies that this
+ * must be done even if the modification was through a different
+ * address space.  Ugh.
+ *
+ * Non-linear vmas are too hard to handle and they are non-standard
+ * anyway, so they are ignored for now.
+ *
+ * The "file modified" info is collected from two places:
+ *
+ *  - AS_CMTIME flag of the mapping
+ *  - the dirty bit of the ptes
+ *
+ * For memory backed filesystems all the pages in the range need to be
+ * examined.  In other cases, since dirty pages are accurately
+ * tracked, it is enough to look at the pages with the dirty tag.
+ */
+static void msync_update_file_time(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+	struct pagevec pvec;
+	pgoff_t index;
+	pgoff_t end_index;
+
+	if (!vma->vm_file || !(vma->vm_flags & VM_SHARED) ||
+	    (vma->vm_flags & VM_NONLINEAR))
+		return;
+
+	mapping = vma->vm_file->f_mapping;
+	pagevec_init(&pvec, 0);
+	index = linear_page_index(vma, start);
+	end_index = linear_page_index(vma, end);
+	while (index < end_index) {
+		int i;
+		int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
+
+		if (mapping_cap_account_dirty(mapping))
+			nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+					PAGECACHE_TAG_DIRTY, nr_pages);
+		else
+			nr_pages = pagevec_lookup(&pvec, mapping, index,
+						  nr_pages);
+		if (!nr_pages)
+			break;
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* Skip pages which are just being read */
+			if (!PageUptodate(page))
+				continue;
+
+			lock_page(page);
+			index = page->index + 1;
+			if (page->mapping == mapping &&
+			    page_mkclean_noprot(page))
+				set_page_dirty_mapping(page);
+
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+
+	if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
+		file_update_time(vma->vm_file);
+}
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
@@ -75,6 +150,9 @@ asmlinkage long sys_msync(unsigned long 
 			error = -EBUSY;
 			goto out_unlock;
 		}
+		if (flags & (MS_SYNC | MS_ASYNC))
+			msync_update_file_time(vma, start,
+					       min(end, vma->vm_end));
 		file = vma->vm_file;
 		start = vma->vm_end;
 		if ((flags & MS_SYNC) && file &&
Index: linux-2.6.21-rc4-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/rmap.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/rmap.h	2007-03-25 19:00:36.000000000 +0200
@@ -100,6 +100,11 @@ unsigned long page_address_in_vma(struct
  */
 int page_mkclean(struct page *);
 
+/*
+ * Similar to the above, but doesn't write protect the PTEs
+ */
+int page_mkclean_noprot(struct page *page);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-25 21:10 ` Miklos Szeredi, Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi, Miklos Szeredi @ 2007-03-25 21:10 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

Changes:
v4:
 o rename test_clear_page_modified to page_mkclean_noprot
 o clean up page_mkclean_noprot
 o don't set AS_CMTIME from fault handler, since that also sets the PTE dirty
 o only update c/mtime in munmap, if file is not mapped any more
 o cleanups
v3:
 o rename is_page_modified to test_clear_page_modified
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time dirtyness from the page table is transferred to the page or if
the page is dirtied without the page table being set dirty.

Note, the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

Msync checks this flag and also dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write.  This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page.  Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes.  This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each.  This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

In munmap if there are no more memory mappings of this file, and the
AS_CMTIME flag has been set, the file times are updated.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h	2007-03-25 19:00:35.000000000 +0200
@@ -19,6 +19,7 @@
  */
 #define	AS_EIO		(__GFP_BITS_SHIFT + 0)	/* IO error on async write */
 #define AS_ENOSPC	(__GFP_BITS_SHIFT + 1)	/* ENOSPC on async write */
+#define AS_CMTIME	(__GFP_BITS_SHIFT + 2)	/* ctime/mtime update needed */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h	2007-03-25 19:00:36.000000000 +0200
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
 				struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/memory.c	2007-03-25 19:00:36.000000000 +0200
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
 				anon_rss--;
 			else {
 				if (pte_dirty(ptent))
-					set_page_dirty(page);
+					set_page_dirty_mapping(page);
 				if (pte_young(ptent))
 					SetPageReferenced(page);
 				file_rss--;
@@ -954,7 +954,7 @@ struct page *follow_page(struct vm_area_
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		mark_page_accessed(page);
 	}
 unlock:
Index: linux-2.6.21-rc4-mm1/mm/page-writeback.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/page-writeback.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/page-writeback.c	2007-03-25 19:00:36.000000000 +0200
@@ -848,17 +848,42 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
+static inline int __set_page_dirty(struct address_space *mapping,
+				   struct page *page)
+{
+	int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+#ifdef CONFIG_BLOCK
+	if (!spd)
+		spd = __set_page_dirty_buffers;
+#endif
+	return (*spd)(page);
+}
+
 int fastcall set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
+	if (likely(mapping))
+		return __set_page_dirty(mapping, page);
+	if (!PageDirty(page)) {
+		if (!TestSetPageDirty(page))
+			return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(set_page_dirty);
+
+/*
+ * Special set_page_dirty() variant for dirtiness coming from a memory
+ * mapping.  In this case the ctime/mtime update flag needs to be set.
+ */
+int set_page_dirty_mapping(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
 	if (likely(mapping)) {
-		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-		if (!spd)
-			spd = __set_page_dirty_buffers;
-#endif
-		return (*spd)(page);
+		set_bit(AS_CMTIME, &mapping->flags);
+		return __set_page_dirty(mapping, page);
 	}
 	if (!PageDirty(page)) {
 		if (!TestSetPageDirty(page))
@@ -866,7 +891,6 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
-EXPORT_SYMBOL(set_page_dirty);
 
 /*
  * set_page_dirty() is racy if the caller has no reference against
@@ -936,7 +960,7 @@ int clear_page_dirty_for_io(struct page 
 		 * threads doing their things.
 		 */
 		if (page_mkclean(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		/*
 		 * We carefully synchronise fault handlers against
 		 * installing a dirty pte and marking the page dirty
Index: linux-2.6.21-rc4-mm1/mm/rmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/rmap.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/rmap.c	2007-03-25 21:38:18.000000000 +0200
@@ -506,6 +506,49 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+static int page_mkclean_one_noprot(struct page *page,
+				   struct vm_area_struct *vma)
+{
+	int modified = 0;
+	unsigned long address = vma_address(page, vma);
+	if (address != -EFAULT) {
+		struct mm_struct *mm = vma->vm_mm;
+		spinlock_t *ptl;
+		pte_t *pte = page_check_address(page, mm, address, &ptl);
+		if (pte) {
+			if (ptep_clear_flush_dirty(vma, address, pte))
+				modified = 1;
+			pte_unmap_unlock(pte, ptl);
+		}
+	}
+	return modified;
+}
+
+/**
+ * page_mkclean_noprot - check and clear the dirty bit for all mappings of a page
+ * @page:	the page to check
+ */
+int page_mkclean_noprot(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct vm_area_struct *vma;
+	struct prio_tree_iter iter;
+	int modified = 0;
+
+	BUG_ON(!page_mapped(page));
+
+	spin_lock(&mapping->i_mmap_lock);
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (vma->vm_flags & VM_SHARED)
+			modified |= page_mkclean_one_noprot(page, vma);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	if (page_test_and_clear_dirty(page))
+		modified = 1;
+	return modified;
+}
+
 /**
  * page_set_anon_rmap - setup new anonymous rmap
  * @page:	the page to add the mapping to
@@ -657,7 +700,7 @@ void page_remove_rmap(struct page *page,
 		 * faster for those pages still in swapcache.
 		 */
 		if (page_test_and_clear_dirty(page))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		__dec_zone_page_state(page,
 				PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
 	}
@@ -702,7 +745,7 @@ static int try_to_unmap_one(struct page 
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		set_page_dirty_mapping(page);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -836,7 +879,7 @@ static void try_to_unmap_cluster(unsigne
 
 		/* Move the dirty bit to the physical page now the pte is gone. */
 		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 
 		page_remove_rmap(page, vma);
 		page_cache_release(page);
Index: linux-2.6.21-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/mmap.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/mmap.c	2007-03-25 19:00:36.000000000 +0200
@@ -222,12 +222,30 @@ void unlink_file_vma(struct vm_area_stru
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *file = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (file) {
+		struct address_space *mapping = file->f_mapping;
+		int update_cmtime = 0;
+		/*
+		 * Only update c/mtime if there are no more memory maps
+		 * referring to this inode.  Otherwise it would be possible,
+		 * that some modification info remains in page tables of
+		 * other mappings, and the times would be updated again,
+		 * even though the file wasn't modified after this
+		 */
+		spin_lock(&mapping->i_mmap_lock);
+		if (prio_tree_empty(&mapping->i_mmap) &&
+		    test_and_clear_bit(AS_CMTIME, &file->f_mapping->flags))
+			update_cmtime = 1;
+		spin_unlock(&mapping->i_mmap_lock);
+		if (update_cmtime)
+			file_update_time(file);
+		fput(file);
+	}
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
 	return next;
Index: linux-2.6.21-rc4-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/hugetlb.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/hugetlb.c	2007-03-25 19:00:36.000000000 +0200
@@ -407,7 +407,7 @@ void __unmap_hugepage_range(struct vm_ar
 
 		page = pte_page(pte);
 		if (pte_dirty(pte))
-			set_page_dirty(page);
+			set_page_dirty_mapping(page);
 		list_add(&page->lru, &page_list);
 	}
 	spin_unlock(&mm->page_table_lock);
Index: linux-2.6.21-rc4-mm1/mm/msync.c
===================================================================
--- linux-2.6.21-rc4-mm1.orig/mm/msync.c	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/mm/msync.c	2007-03-25 19:00:36.000000000 +0200
@@ -12,6 +12,81 @@
 #include <linux/mman.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/pagevec.h>
+
+/*
+ * Update ctime/mtime on msync().
+ *
+ * POSIX requires, that the times are updated between a modification
+ * of the file through a memory mapping and the next msync for a
+ * region containing the modification.  The wording implies that this
+ * must be done even if the modification was through a different
+ * address space.  Ugh.
+ *
+ * Non-linear vmas are too hard to handle and they are non-standard
+ * anyway, so they are ignored for now.
+ *
+ * The "file modified" info is collected from two places:
+ *
+ *  - AS_CMTIME flag of the mapping
+ *  - the dirty bit of the ptes
+ *
+ * For memory backed filesystems all the pages in the range need to be
+ * examined.  In other cases, since dirty pages are accurately
+ * tracked, it is enough to look at the pages with the dirty tag.
+ */
+static void msync_update_file_time(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+	struct pagevec pvec;
+	pgoff_t index;
+	pgoff_t end_index;
+
+	if (!vma->vm_file || !(vma->vm_flags & VM_SHARED) ||
+	    (vma->vm_flags & VM_NONLINEAR))
+		return;
+
+	mapping = vma->vm_file->f_mapping;
+	pagevec_init(&pvec, 0);
+	index = linear_page_index(vma, start);
+	end_index = linear_page_index(vma, end);
+	while (index < end_index) {
+		int i;
+		int nr_pages = min(end_index - index, (pgoff_t) PAGEVEC_SIZE);
+
+		if (mapping_cap_account_dirty(mapping))
+			nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+					PAGECACHE_TAG_DIRTY, nr_pages);
+		else
+			nr_pages = pagevec_lookup(&pvec, mapping, index,
+						  nr_pages);
+		if (!nr_pages)
+			break;
+
+		for (i = 0; i < nr_pages; i++) {
+			struct page *page = pvec.pages[i];
+
+			/* Skip pages which are just being read */
+			if (!PageUptodate(page))
+				continue;
+
+			lock_page(page);
+			index = page->index + 1;
+			if (page->mapping == mapping &&
+			    page_mkclean_noprot(page))
+				set_page_dirty_mapping(page);
+
+			unlock_page(page);
+		}
+		pagevec_release(&pvec);
+	}
+
+	if (test_and_clear_bit(AS_CMTIME, &mapping->flags))
+		file_update_time(vma->vm_file);
+}
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
@@ -75,6 +150,9 @@ asmlinkage long sys_msync(unsigned long 
 			error = -EBUSY;
 			goto out_unlock;
 		}
+		if (flags & (MS_SYNC | MS_ASYNC))
+			msync_update_file_time(vma, start,
+					       min(end, vma->vm_end));
 		file = vma->vm_file;
 		start = vma->vm_end;
 		if ((flags & MS_SYNC) && file &&
Index: linux-2.6.21-rc4-mm1/include/linux/rmap.h
===================================================================
--- linux-2.6.21-rc4-mm1.orig/include/linux/rmap.h	2007-03-25 19:00:06.000000000 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/rmap.h	2007-03-25 19:00:36.000000000 +0200
@@ -100,6 +100,11 @@ unsigned long page_address_in_vma(struct
  */
 int page_mkclean(struct page *);
 
+/*
+ * Similar to the above, but doesn't write protect the PTEs
+ */
+int page_mkclean_noprot(struct page *page);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-25 21:10 ` Miklos Szeredi, Miklos Szeredi
@ 2007-03-26 21:00   ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 21:00 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Sun, 25 Mar 2007 23:10:21 +0200
Miklos Szeredi <miklos@szeredi.hu> wrote:

> This patch makes writing to shared memory mappings update st_ctime and
> st_mtime as defined by SUSv3:

Boy this is complicated.

Is there a simpler way of doing all this?  Say, we define a new page flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness.  Then, when performing writeback we look to see if any of
the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
current-time.

Or something like that - I'm just thinking out loud and picking holes in
the above doesn't shut me up ;) We're adding complexity and some overhead
and we're losing our recent msync() simplifications and this all hurts.  Is
there some other way?  I think burning a page flag to avoid this additional
complexity would be worthwhile.  

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-26 21:00   ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 21:00 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Sun, 25 Mar 2007 23:10:21 +0200
Miklos Szeredi <miklos@szeredi.hu> wrote:

> This patch makes writing to shared memory mappings update st_ctime and
> st_mtime as defined by SUSv3:

Boy this is complicated.

Is there a simpler way of doing all this?  Say, we define a new page flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness.  Then, when performing writeback we look to see if any of
the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
current-time.

Or something like that - I'm just thinking out loud and picking holes in
the above doesn't shut me up ;) We're adding complexity and some overhead
and we're losing our recent msync() simplifications and this all hurts.  Is
there some other way?  I think burning a page flag to avoid this additional
complexity would be worthwhile.  

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-26 21:00   ` Andrew Morton
@ 2007-03-26 21:10     ` Matt Mackall
  -1 siblings, 0 replies; 54+ messages in thread
From: Matt Mackall @ 2007-03-26 21:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miklos Szeredi, a.p.zijlstra, linux-kernel, linux-mm

On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> On Sun, 25 Mar 2007 23:10:21 +0200
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.
> 
> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.
> 
> Or something like that - I'm just thinking out loud and picking holes in
> the above doesn't shut me up ;) We're adding complexity and some overhead
> and we're losing our recent msync() simplifications and this all hurts.  Is
> there some other way?  I think burning a page flag to avoid this additional
> complexity would be worthwhile.  

Aren't we basically out of those?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-26 21:10     ` Matt Mackall
  0 siblings, 0 replies; 54+ messages in thread
From: Matt Mackall @ 2007-03-26 21:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miklos Szeredi, a.p.zijlstra, linux-kernel, linux-mm

On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> On Sun, 25 Mar 2007 23:10:21 +0200
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.
> 
> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.
> 
> Or something like that - I'm just thinking out loud and picking holes in
> the above doesn't shut me up ;) We're adding complexity and some overhead
> and we're losing our recent msync() simplifications and this all hurts.  Is
> there some other way?  I think burning a page flag to avoid this additional
> complexity would be worthwhile.  

Aren't we basically out of those?

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-26 21:00   ` Andrew Morton
@ 2007-03-26 21:43     ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-26 21:43 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.

You tell me?

> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.

I don't think a page flag gains anything over the address_space flag
that this patch already has.

The complexity is not about keeping track of the "data modified
through mmap" state, but about msync() guarantees, that POSIX wants.

And these requirements do in fact make some sense: msync() basically
means:

  "I want the data written through mmaps to be visible to the world"

And that obviously includes updating the timestamps.

So how do we know if the data was modified between two msync()
invocations?  The only sane way I can think of is to walk the page
tables in msync() and test/clear the pte dirty bit.

Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
application doesn't want to update the timestamps, it should just omit
this call, since it does nothing else.

There shouldn't be any other side effect.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-26 21:43     ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-26 21:43 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.

You tell me?

> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.

I don't think a page flag gains anything over the address_space flag
that this patch already has.

The complexity is not about keeping track of the "data modified
through mmap" state, but about msync() guarantees, that POSIX wants.

And these requirements do in fact make some sense: msync() basically
means:

  "I want the data written through mmaps to be visible to the world"

And that obviously includes updating the timestamps.

So how do we know if the data was modified between two msync()
invocations?  The only sane way I can think of is to walk the page
tables in msync() and test/clear the pte dirty bit.

Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
application doesn't want to update the timestamps, it should just omit
this call, since it does nothing else.

There shouldn't be any other side effect.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-26 21:10     ` Matt Mackall
@ 2007-03-26 22:25       ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 22:25 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Miklos Szeredi, a.p.zijlstra, linux-kernel, linux-mm

On Mon, 26 Mar 2007 16:10:09 -0500
Matt Mackall <mpm@selenic.com> wrote:

> On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> > On Sun, 25 Mar 2007 23:10:21 +0200
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> > 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> > 
> > Or something like that - I'm just thinking out loud and picking holes in
> > the above doesn't shut me up ;) We're adding complexity and some overhead
> > and we're losing our recent msync() simplifications and this all hurts.  Is
> > there some other way?  I think burning a page flag to avoid this additional
> > complexity would be worthwhile.  
> 
> Aren't we basically out of those?

Rafael has liberated a couple of the flags which swsusp was using.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-26 22:25       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 22:25 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Miklos Szeredi, a.p.zijlstra, linux-kernel, linux-mm

On Mon, 26 Mar 2007 16:10:09 -0500
Matt Mackall <mpm@selenic.com> wrote:

> On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> > On Sun, 25 Mar 2007 23:10:21 +0200
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> > 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> > 
> > Or something like that - I'm just thinking out loud and picking holes in
> > the above doesn't shut me up ;) We're adding complexity and some overhead
> > and we're losing our recent msync() simplifications and this all hurts.  Is
> > there some other way?  I think burning a page flag to avoid this additional
> > complexity would be worthwhile.  
> 
> Aren't we basically out of those?

Rafael has liberated a couple of the flags which swsusp was using.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-26 21:43     ` Miklos Szeredi
@ 2007-03-26 22:31       ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 22:31 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Mon, 26 Mar 2007 23:43:08 +0200
Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> 
> You tell me?
> 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> 
> I don't think a page flag gains anything over the address_space flag
> that this patch already has.
> 
> The complexity is not about keeping track of the "data modified
> through mmap" state, but about msync() guarantees, that POSIX wants.
> 
> And these requirements do in fact make some sense: msync() basically
> means:
> 
>   "I want the data written through mmaps to be visible to the world"
> 
> And that obviously includes updating the timestamps.
> 
> So how do we know if the data was modified between two msync()
> invocations?  The only sane way I can think of is to walk the page
> tables in msync() and test/clear the pte dirty bit.

clear_page_dirty_for_io() already does that.

So we should be able to test PageDirtiedByWrite() after running
clear_page_dirty_for_io() to discover whether this page was dirtied via
MAP_SHARED, and then update the inode times if so.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-26 22:31       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-26 22:31 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Mon, 26 Mar 2007 23:43:08 +0200
Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> 
> You tell me?
> 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> 
> I don't think a page flag gains anything over the address_space flag
> that this patch already has.
> 
> The complexity is not about keeping track of the "data modified
> through mmap" state, but about msync() guarantees, that POSIX wants.
> 
> And these requirements do in fact make some sense: msync() basically
> means:
> 
>   "I want the data written through mmaps to be visible to the world"
> 
> And that obviously includes updating the timestamps.
> 
> So how do we know if the data was modified between two msync()
> invocations?  The only sane way I can think of is to walk the page
> tables in msync() and test/clear the pte dirty bit.

clear_page_dirty_for_io() already does that.

So we should be able to test PageDirtiedByWrite() after running
clear_page_dirty_for_io() to discover whether this page was dirtied via
MAP_SHARED, and then update the inode times if so.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-26 22:31       ` Andrew Morton
@ 2007-03-27  6:55         ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  6:55 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > st_mtime as defined by SUSv3:
> > > 
> > > Boy this is complicated.
> > 
> > You tell me?
> > 
> > > Is there a simpler way of doing all this?  Say, we define a new page flag
> > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > page-dirtiness.  Then, when performing writeback we look to see if any of
> > > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > > current-time.
> > 
> > I don't think a page flag gains anything over the address_space flag
> > that this patch already has.
> > 
> > The complexity is not about keeping track of the "data modified
> > through mmap" state, but about msync() guarantees, that POSIX wants.
> > 
> > And these requirements do in fact make some sense: msync() basically
> > means:
> > 
> >   "I want the data written through mmaps to be visible to the world"
> > 
> > And that obviously includes updating the timestamps.
> > 
> > So how do we know if the data was modified between two msync()
> > invocations?  The only sane way I can think of is to walk the page
> > tables in msync() and test/clear the pte dirty bit.
> 
> clear_page_dirty_for_io() already does that.
> 
> So we should be able to test PageDirtiedByWrite() after running
> clear_page_dirty_for_io() to discover whether this page was dirtied via
> MAP_SHARED, and then update the inode times if so.

What do you need the page flag for?  The "modified through mmap" info
is there in the ptes.  And from the ptes it can be transfered to a
per-address_space flag.  Nobody is interested through which page was
the file modified.

Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
should update the timestamp.  That's the difficult one.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  6:55         ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  6:55 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > st_mtime as defined by SUSv3:
> > > 
> > > Boy this is complicated.
> > 
> > You tell me?
> > 
> > > Is there a simpler way of doing all this?  Say, we define a new page flag
> > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > page-dirtiness.  Then, when performing writeback we look to see if any of
> > > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > > current-time.
> > 
> > I don't think a page flag gains anything over the address_space flag
> > that this patch already has.
> > 
> > The complexity is not about keeping track of the "data modified
> > through mmap" state, but about msync() guarantees, that POSIX wants.
> > 
> > And these requirements do in fact make some sense: msync() basically
> > means:
> > 
> >   "I want the data written through mmaps to be visible to the world"
> > 
> > And that obviously includes updating the timestamps.
> > 
> > So how do we know if the data was modified between two msync()
> > invocations?  The only sane way I can think of is to walk the page
> > tables in msync() and test/clear the pte dirty bit.
> 
> clear_page_dirty_for_io() already does that.
> 
> So we should be able to test PageDirtiedByWrite() after running
> clear_page_dirty_for_io() to discover whether this page was dirtied via
> MAP_SHARED, and then update the inode times if so.

What do you need the page flag for?  The "modified through mmap" info
is there in the ptes.  And from the ptes it can be transfered to a
per-address_space flag.  Nobody is interested through which page was
the file modified.

Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
should update the timestamp.  That's the difficult one.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  6:55         ` Miklos Szeredi
@ 2007-03-27  7:22           ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  7:22 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 08:55:40 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > > st_mtime as defined by SUSv3:
> > > > 
> > > > Boy this is complicated.
> > > 
> > > You tell me?
> > > 
> > > > Is there a simpler way of doing all this?  Say, we define a new page flag
> > > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > > page-dirtiness.  Then, when performing writeback we look to see if any of
> > > > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > > > current-time.
> > > 
> > > I don't think a page flag gains anything over the address_space flag
> > > that this patch already has.
> > > 
> > > The complexity is not about keeping track of the "data modified
> > > through mmap" state, but about msync() guarantees, that POSIX wants.
> > > 
> > > And these requirements do in fact make some sense: msync() basically
> > > means:
> > > 
> > >   "I want the data written through mmaps to be visible to the world"
> > > 
> > > And that obviously includes updating the timestamps.
> > > 
> > > So how do we know if the data was modified between two msync()
> > > invocations?  The only sane way I can think of is to walk the page
> > > tables in msync() and test/clear the pte dirty bit.
> > 
> > clear_page_dirty_for_io() already does that.
> > 
> > So we should be able to test PageDirtiedByWrite() after running
> > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > MAP_SHARED, and then update the inode times if so.
> 
> What do you need the page flag for?

To work out whether the page was dirtied via write (in which case the
timestamps were updated) or via mmap (in which case they were not).

>  The "modified through mmap" info
> is there in the ptes.

It might not be there any more - the ptes could have got taken down by, for
example, reclaim.

I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
with something less costly and less complex than this thing, but you appear
to be resisting.

>  And from the ptes it can be transfered to a
> per-address_space flag.  Nobody is interested through which page was
> the file modified.
> 
> Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> should update the timestamp.  That's the difficult one.
> 

We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
pages.  Is does it in generic_writepages().  It also even walks the ptes
for you, in clear_page_dirty_for_io().

There is surely no need to duplicate all that.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  7:22           ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  7:22 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 08:55:40 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > > st_mtime as defined by SUSv3:
> > > > 
> > > > Boy this is complicated.
> > > 
> > > You tell me?
> > > 
> > > > Is there a simpler way of doing all this?  Say, we define a new page flag
> > > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > > page-dirtiness.  Then, when performing writeback we look to see if any of
> > > > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > > > current-time.
> > > 
> > > I don't think a page flag gains anything over the address_space flag
> > > that this patch already has.
> > > 
> > > The complexity is not about keeping track of the "data modified
> > > through mmap" state, but about msync() guarantees, that POSIX wants.
> > > 
> > > And these requirements do in fact make some sense: msync() basically
> > > means:
> > > 
> > >   "I want the data written through mmaps to be visible to the world"
> > > 
> > > And that obviously includes updating the timestamps.
> > > 
> > > So how do we know if the data was modified between two msync()
> > > invocations?  The only sane way I can think of is to walk the page
> > > tables in msync() and test/clear the pte dirty bit.
> > 
> > clear_page_dirty_for_io() already does that.
> > 
> > So we should be able to test PageDirtiedByWrite() after running
> > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > MAP_SHARED, and then update the inode times if so.
> 
> What do you need the page flag for?

To work out whether the page was dirtied via write (in which case the
timestamps were updated) or via mmap (in which case they were not).

>  The "modified through mmap" info
> is there in the ptes.

It might not be there any more - the ptes could have got taken down by, for
example, reclaim.

I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
with something less costly and less complex than this thing, but you appear
to be resisting.

>  And from the ptes it can be transfered to a
> per-address_space flag.  Nobody is interested through which page was
> the file modified.
> 
> Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> should update the timestamp.  That's the difficult one.
> 

We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
pages.  Is does it in generic_writepages().  It also even walks the ptes
for you, in clear_page_dirty_for_io().

There is surely no need to duplicate all that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  7:22           ` Andrew Morton
@ 2007-03-27  7:36             ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  7:36 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > clear_page_dirty_for_io() already does that.
> > > 
> > > So we should be able to test PageDirtiedByWrite() after running
> > > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > > MAP_SHARED, and then update the inode times if so.
> > 
> > What do you need the page flag for?
> 
> To work out whether the page was dirtied via write (in which case the
> timestamps were updated) or via mmap (in which case they were not).
> 
> >  The "modified through mmap" info
> > is there in the ptes.
> 
> It might not be there any more - the ptes could have got taken down by, for
> example, reclaim.

Yes, but then the "modified through mmap" is transferred to the
per-address_space flag.  All this is already done by this patch.

> I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
> with something less costly and less complex than this thing, but you appear
> to be resisting.

No, I'm just arguing that your suggestion is actually a complication,
not a simplification ;)

> >  And from the ptes it can be transfered to a
> > per-address_space flag.  Nobody is interested through which page was
> > the file modified.
> > 
> > Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> > should update the timestamp.  That's the difficult one.
> > 
> 
> We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
> pages.  Is does it in generic_writepages().  It also even walks the ptes
> for you, in clear_page_dirty_for_io().

Yes.  But that's not very useful semantic for MS_ASYNC vs. file time
update.

It would basically say:

  "if you cann MS_ASYNC, and the file was modified then sometime in
  the future you will get an updated c/mtime".

But this is not what POSIX says, and it's not what applications want.

For example "make" would want to know if a file was modified or not,
and with your suggestion only msync(MS_SYNC) would reliably provide
that info.  But msync(MS_SYNC) is unnecessary in many cases.

> There is surely no need to duplicate all that.

Yeah, we could teach generic_writepages() to conditionally not submit
for io just test/clear pte dirtyness.

Maybe that would be somewhat cleaner, dunno.

Then there are the ram backed filesystems, which don't have dirty
accounting and radix trees, and for which this pte walking is still
needed to provide semantics consistent with normal filesystems.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  7:36             ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  7:36 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > clear_page_dirty_for_io() already does that.
> > > 
> > > So we should be able to test PageDirtiedByWrite() after running
> > > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > > MAP_SHARED, and then update the inode times if so.
> > 
> > What do you need the page flag for?
> 
> To work out whether the page was dirtied via write (in which case the
> timestamps were updated) or via mmap (in which case they were not).
> 
> >  The "modified through mmap" info
> > is there in the ptes.
> 
> It might not be there any more - the ptes could have got taken down by, for
> example, reclaim.

Yes, but then the "modified through mmap" is transferred to the
per-address_space flag.  All this is already done by this patch.

> I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
> with something less costly and less complex than this thing, but you appear
> to be resisting.

No, I'm just arguing that your suggestion is actually a complication,
not a simplification ;)

> >  And from the ptes it can be transfered to a
> > per-address_space flag.  Nobody is interested through which page was
> > the file modified.
> > 
> > Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> > should update the timestamp.  That's the difficult one.
> > 
> 
> We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
> pages.  Is does it in generic_writepages().  It also even walks the ptes
> for you, in clear_page_dirty_for_io().

Yes.  But that's not very useful semantic for MS_ASYNC vs. file time
update.

It would basically say:

  "if you cann MS_ASYNC, and the file was modified then sometime in
  the future you will get an updated c/mtime".

But this is not what POSIX says, and it's not what applications want.

For example "make" would want to know if a file was modified or not,
and with your suggestion only msync(MS_SYNC) would reliably provide
that info.  But msync(MS_SYNC) is unnecessary in many cases.

> There is surely no need to duplicate all that.

Yeah, we could teach generic_writepages() to conditionally not submit
for io just test/clear pte dirtyness.

Maybe that would be somewhat cleaner, dunno.

Then there are the ram backed filesystems, which don't have dirty
accounting and radix trees, and for which this pte walking is still
needed to provide semantics consistent with normal filesystems.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  7:36             ` Miklos Szeredi
@ 2007-03-27  7:49               ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  7:49 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 09:36:50 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > There is surely no need to duplicate all that.
> 
> Yeah, we could teach generic_writepages() to conditionally not submit
> for io just test/clear pte dirtyness.
> 
> Maybe that would be somewhat cleaner, dunno.
> 
> Then there are the ram backed filesystems, which don't have dirty
> accounting and radix trees, and for which this pte walking is still
> needed to provide semantics consistent with normal filesystems.

hm.

I don't know how important all this is, really - we've had this bug for
ever and presumably we've already trained everyone to work around it.

What usage scenarios are people actually hurting from?  Is there anything
interesting in the mysterious Novell Bugzilla #206431?

Perhaps we can get away with doing something half-assed which covers most
requirements...

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  7:49               ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  7:49 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 09:36:50 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > There is surely no need to duplicate all that.
> 
> Yeah, we could teach generic_writepages() to conditionally not submit
> for io just test/clear pte dirtyness.
> 
> Maybe that would be somewhat cleaner, dunno.
> 
> Then there are the ram backed filesystems, which don't have dirty
> accounting and radix trees, and for which this pte walking is still
> needed to provide semantics consistent with normal filesystems.

hm.

I don't know how important all this is, really - we've had this bug for
ever and presumably we've already trained everyone to work around it.

What usage scenarios are people actually hurting from?  Is there anything
interesting in the mysterious Novell Bugzilla #206431?

Perhaps we can get away with doing something half-assed which covers most
requirements...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  7:49               ` Andrew Morton
@ 2007-03-27  8:03                 ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  8:03 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > There is surely no need to duplicate all that.
> > 
> > Yeah, we could teach generic_writepages() to conditionally not submit
> > for io just test/clear pte dirtyness.
> > 
> > Maybe that would be somewhat cleaner, dunno.
> > 
> > Then there are the ram backed filesystems, which don't have dirty
> > accounting and radix trees, and for which this pte walking is still
> > needed to provide semantics consistent with normal filesystems.
> 
> hm.
> 
> I don't know how important all this is, really - we've had this bug for
> ever and presumably we've already trained everyone to work around it.
> 
> What usage scenarios are people actually hurting from?  Is there anything
> interesting in the mysterious Novell Bugzilla #206431?

That's just a failing LTP testcase, not quite real life ;)

But Peter Staubach says a RH custumer has files written thorugh mmap,
which are not being backed up.

> Perhaps we can get away with doing something half-assed which covers most
> requirements...

OK.  At least I can split the patch into two half asses.

The big question is tmpfs and friends.  Those won't get any timestamp
update without additional page table walking.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  8:03                 ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  8:03 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > There is surely no need to duplicate all that.
> > 
> > Yeah, we could teach generic_writepages() to conditionally not submit
> > for io just test/clear pte dirtyness.
> > 
> > Maybe that would be somewhat cleaner, dunno.
> > 
> > Then there are the ram backed filesystems, which don't have dirty
> > accounting and radix trees, and for which this pte walking is still
> > needed to provide semantics consistent with normal filesystems.
> 
> hm.
> 
> I don't know how important all this is, really - we've had this bug for
> ever and presumably we've already trained everyone to work around it.
> 
> What usage scenarios are people actually hurting from?  Is there anything
> interesting in the mysterious Novell Bugzilla #206431?

That's just a failing LTP testcase, not quite real life ;)

But Peter Staubach says a RH custumer has files written thorugh mmap,
which are not being backed up.

> Perhaps we can get away with doing something half-assed which covers most
> requirements...

OK.  At least I can split the patch into two half asses.

The big question is tmpfs and friends.  Those won't get any timestamp
update without additional page table walking.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  8:03                 ` Miklos Szeredi
@ 2007-03-27  8:18                   ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  8:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 10:03:41 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> But Peter Staubach says a RH custumer has files written thorugh mmap,
> which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising from
this bug.

But I expect we could adequately plug that problem at munmap()-time.  Or,
better, do_wp_page().  As I said - half-assed.

It's a question if whether the backup problem is the only thing which is hurting
in the real-world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()?  The timestamp could
be up to 30 seconds too early, but that's heaps better than what we have
now..)



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  8:18                   ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  8:18 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 10:03:41 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> But Peter Staubach says a RH custumer has files written thorugh mmap,
> which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising from
this bug.

But I expect we could adequately plug that problem at munmap()-time.  Or,
better, do_wp_page().  As I said - half-assed.

It's a question if whether the backup problem is the only thing which is hurting
in the real-world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()?  The timestamp could
be up to 30 seconds too early, but that's heaps better than what we have
now..)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  8:18                   ` Andrew Morton
@ 2007-03-27  8:28                     ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  8:28 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > which are not being backed up.
> 
> Yes, I expect the backup problem is the major real-world hurt arising from
> this bug.
> 
> But I expect we could adequately plug that problem at munmap()-time.  Or,
> better, do_wp_page().  As I said - half-assed.
> 
> It's a question if whether the backup problem is the only thing which is hurting
> in the real-world, or if people have other problems.
> 
> (In fact, what's wrong with doing it in do_wp_page()?

It's rather more expensive, than just toggling a bit.

Let me work on it a bit more.  I think I can make the current patch
more palatable.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  8:28                     ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  8:28 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > which are not being backed up.
> 
> Yes, I expect the backup problem is the major real-world hurt arising from
> this bug.
> 
> But I expect we could adequately plug that problem at munmap()-time.  Or,
> better, do_wp_page().  As I said - half-assed.
> 
> It's a question if whether the backup problem is the only thing which is hurting
> in the real-world, or if people have other problems.
> 
> (In fact, what's wrong with doing it in do_wp_page()?

It's rather more expensive, than just toggling a bit.

Let me work on it a bit more.  I think I can make the current patch
more palatable.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  8:28                     ` Miklos Szeredi
@ 2007-03-27  8:51                       ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  8:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 10:28:16 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > which are not being backed up.
> > 
> > Yes, I expect the backup problem is the major real-world hurt arising from
> > this bug.
> > 
> > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > better, do_wp_page().  As I said - half-assed.
> > 
> > It's a question if whether the backup problem is the only thing which is hurting
> > in the real-world, or if people have other problems.
> > 
> > (In fact, what's wrong with doing it in do_wp_page()?
> 
> It's rather more expensive, than just toggling a bit.

It shouldn't be, especially for filesystems which have one-second timestamp
granularity.

Filesystems which have s_time_gran=1 might hurt a bit, but no more than
they will with write().

Actually, no - we'd only update the mctime once per page per writeback
period (30 seconds by default) so the load will be small.  It'll be more
often if the user is doing a lot of pte-cleaning via msync() or fsync(),
but then the m/ctime writes will be the least of their problems. 

I'd have thought there were more substantial problems with something that
crude?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  8:51                       ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27  8:51 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 10:28:16 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > which are not being backed up.
> > 
> > Yes, I expect the backup problem is the major real-world hurt arising from
> > this bug.
> > 
> > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > better, do_wp_page().  As I said - half-assed.
> > 
> > It's a question if whether the backup problem is the only thing which is hurting
> > in the real-world, or if people have other problems.
> > 
> > (In fact, what's wrong with doing it in do_wp_page()?
> 
> It's rather more expensive, than just toggling a bit.

It shouldn't be, especially for filesystems which have one-second timestamp
granularity.

Filesystems which have s_time_gran=1 might hurt a bit, but no more than
they will with write().

Actually, no - we'd only update the mctime once per page per writeback
period (30 seconds by default) so the load will be small.  It'll be more
often if the user is doing a lot of pte-cleaning via msync() or fsync(),
but then the m/ctime writes will be the least of their problems. 

I'd have thought there were more substantial problems with something that
crude?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  8:51                       ` Andrew Morton
@ 2007-03-27  9:23                         ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  9:23 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > which are not being backed up.
> > > 
> > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > this bug.
> > > 
> > > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > > better, do_wp_page().  As I said - half-assed.
> > > 
> > > It's a question if whether the backup problem is the only thing which is hurting
> > > in the real-world, or if people have other problems.
> > > 
> > > (In fact, what's wrong with doing it in do_wp_page()?
> > 
> > It's rather more expensive, than just toggling a bit.
> 
> It shouldn't be, especially for filesystems which have one-second timestamp
> granularity.
> 
> Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> they will with write().
> 
> Actually, no - we'd only update the mctime once per page per writeback
> period (30 seconds by default) so the load will be small.

Why?  For each faulted page the times will be updated, no?

Maybe it's acceptable, I don't really know the cost of
file_update_time().

Tried this patch, and it seems to work.  It will even randomly update
the time for tmpfs files (on initial fault, and on swapins).

Miklos

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2007-03-27 11:04:40.000000000 +0200
+++ linux/mm/memory.c	2007-03-27 11:08:19.000000000 +0200
@@ -1664,6 +1664,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
@@ -2316,6 +2318,8 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27  9:23                         ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27  9:23 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > which are not being backed up.
> > > 
> > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > this bug.
> > > 
> > > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > > better, do_wp_page().  As I said - half-assed.
> > > 
> > > It's a question if whether the backup problem is the only thing which is hurting
> > > in the real-world, or if people have other problems.
> > > 
> > > (In fact, what's wrong with doing it in do_wp_page()?
> > 
> > It's rather more expensive, than just toggling a bit.
> 
> It shouldn't be, especially for filesystems which have one-second timestamp
> granularity.
> 
> Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> they will with write().
> 
> Actually, no - we'd only update the mctime once per page per writeback
> period (30 seconds by default) so the load will be small.

Why?  For each faulted page the times will be updated, no?

Maybe it's acceptable, I don't really know the cost of
file_update_time().

Tried this patch, and it seems to work.  It will even randomly update
the time for tmpfs files (on initial fault, and on swapins).

Miklos

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2007-03-27 11:04:40.000000000 +0200
+++ linux/mm/memory.c	2007-03-27 11:08:19.000000000 +0200
@@ -1664,6 +1664,8 @@ gotten:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}
@@ -2316,6 +2318,8 @@ retry:
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
 		set_page_dirty_balance(dirty_page);
 		put_page(dirty_page);
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27  9:23                         ` Miklos Szeredi
@ 2007-03-27 17:52                           ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 17:52 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 11:23:06 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > > which are not being backed up.
> > > > 
> > > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > > this bug.
> > > > 
> > > > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > > > better, do_wp_page().  As I said - half-assed.
> > > > 
> > > > It's a question if whether the backup problem is the only thing which is hurting
> > > > in the real-world, or if people have other problems.
> > > > 
> > > > (In fact, what's wrong with doing it in do_wp_page()?
> > > 
> > > It's rather more expensive, than just toggling a bit.
> > 
> > It shouldn't be, especially for filesystems which have one-second timestamp
> > granularity.
> > 
> > Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> > they will with write().
> > 
> > Actually, no - we'd only update the mctime once per page per writeback
> > period (30 seconds by default) so the load will be small.
> 
> Why?  For each faulted page the times will be updated, no?

Yes, but only at pagefault-time.  And

- the faults are already "slow": we need to pull the page contents in
  from disk, or memset or cow the page

- we need to take the trap

compared to which, the cost of the timestamp update will (we hope) be
relatively low.

> Maybe it's acceptable, I don't really know the cost of
> file_update_time().
> 
> Tried this patch, and it seems to work.  It will even randomly update
> the time for tmpfs files (on initial fault, and on swapins).
> 
> Miklos
> 
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2007-03-27 11:04:40.000000000 +0200
> +++ linux/mm/memory.c	2007-03-27 11:08:19.000000000 +0200
> @@ -1664,6 +1664,8 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
>  		set_page_dirty_balance(dirty_page);
>  		put_page(dirty_page);
>  	}
> @@ -2316,6 +2318,8 @@ retry:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
>  		set_page_dirty_balance(dirty_page);
>  		put_page(dirty_page);
>  	}

that's simpler ;) Is it correct enough though?  The place where it will
become inaccurate is for repeated modification via an established map.  ie:

	p = mmap(..., MAP_SHARED);
	for ( ; ; )
		*p = 1;

in which case I think the timestamp will only get updated once per
writeback interval (ie: 30 seconds).


tmpfs files have an s_time_gran of 1, so benchmarking some workload on
tmpfs with this patch will tell us the worst-case overhead of the change. 

I guess we should arrange for multiple CPUs to perform write faults against
multiple pages of the same file in parallel.  Of course, that'd be a pretty
darn short benchmark because it'll run out of RAM.  Which reveals why we
probably won't have a performance problem in there.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 17:52                           ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 17:52 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: a.p.zijlstra, linux-kernel, linux-mm

On Tue, 27 Mar 2007 11:23:06 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > > which are not being backed up.
> > > > 
> > > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > > this bug.
> > > > 
> > > > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > > > better, do_wp_page().  As I said - half-assed.
> > > > 
> > > > It's a question if whether the backup problem is the only thing which is hurting
> > > > in the real-world, or if people have other problems.
> > > > 
> > > > (In fact, what's wrong with doing it in do_wp_page()?
> > > 
> > > It's rather more expensive, than just toggling a bit.
> > 
> > It shouldn't be, especially for filesystems which have one-second timestamp
> > granularity.
> > 
> > Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> > they will with write().
> > 
> > Actually, no - we'd only update the mctime once per page per writeback
> > period (30 seconds by default) so the load will be small.
> 
> Why?  For each faulted page the times will be updated, no?

Yes, but only at pagefault-time.  And

- the faults are already "slow": we need to pull the page contents in
  from disk, or memset or cow the page

- we need to take the trap

compared to which, the cost of the timestamp update will (we hope) be
relatively low.

> Maybe it's acceptable, I don't really know the cost of
> file_update_time().
> 
> Tried this patch, and it seems to work.  It will even randomly update
> the time for tmpfs files (on initial fault, and on swapins).
> 
> Miklos
> 
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c	2007-03-27 11:04:40.000000000 +0200
> +++ linux/mm/memory.c	2007-03-27 11:08:19.000000000 +0200
> @@ -1664,6 +1664,8 @@ gotten:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
>  		set_page_dirty_balance(dirty_page);
>  		put_page(dirty_page);
>  	}
> @@ -2316,6 +2318,8 @@ retry:
>  unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	if (dirty_page) {
> +		if (vma->vm_file)
> +			file_update_time(vma->vm_file);
>  		set_page_dirty_balance(dirty_page);
>  		put_page(dirty_page);
>  	}

that's simpler ;) Is it correct enough though?  The place where it will
become inaccurate is for repeated modification via an established map.  ie:

	p = mmap(..., MAP_SHARED);
	for ( ; ; )
		*p = 1;

in which case I think the timestamp will only get updated once per
writeback interval (ie: 30 seconds).


tmpfs files have an s_time_gran of 1, so benchmarking some workload on
tmpfs with this patch will tell us the worst-case overhead of the change. 

I guess we should arrange for multiple CPUs to perform write faults against
multiple pages of the same file in parallel.  Of course, that'd be a pretty
darn short benchmark because it'll run out of RAM.  Which reveals why we
probably won't have a performance problem in there.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 17:52                           ` Andrew Morton
@ 2007-03-27 18:29                             ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 18:29 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> that's simpler ;) Is it correct enough though?  The place where it will
> become inaccurate is for repeated modification via an established map.  ie:
> 
> 	p = mmap(..., MAP_SHARED);
> 	for ( ; ; )
> 		*p = 1;
> 
> in which case I think the timestamp will only get updated once per
> writeback interval (ie: 30 seconds).

Which is perfectly OK, we really can't do any better in any sane way.

My concern is only about MS_ASYNC, it would be really nice to know
what other OSs do.  I've checked on a Solaris 5.7 (not exactly a
modern OS) and that is as good as current Linux.

If someone has access to others pls. find my test program at the end.

The logical way to handle msync is IMO:

   write to memory + msync(MS_ASYNC) == write()
   write to memory + msync(MS_SYNC) == write() + fdatasync()

Yes, it would add some overhead for MS_ASYNC, but that's what the user
wants isn't it?  If the user doesn't want correctly updated
timestamps, it should not call msync().

Miklos

=== msync_time.c ==============================================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

static const char *filename;

static void print_times(const char *msg)
{
    struct stat stbuf;
    stat(filename, &stbuf);
    printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
           stbuf.st_atime);
}

static void usage(const char *progname)
{
    fprintf(stderr, "usage: %s filename [-s]\n", progname);
    exit(1);
}

int main(int argc, char *argv[])
{
    int res;
    char *addr;
    int msync_flag = MS_ASYNC;
    int fd;

    if (argc < 2)
        usage(argv[0]);

    filename = argv[1];
    if (argc > 2) {
        if (argc > 3)
            usage(argv[0]);
        if (strcmp(argv[2], "-s") == 0)
            msync_flag = MS_SYNC;
        else
            usage(argv[0]);
    }

    fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
    if (fd == -1) {
        perror(filename);
        return 1;
    }
    print_times("begin");
    sleep(1);
    write(fd, "aaaa\n", 4);
    print_times("write");
    sleep(1);
    addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == (char *) -1) {
        perror("mmap");
        return 1;
    }
    print_times("mmap");
    sleep(1);

    addr[1] = 'b';
    print_times("b");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync b");
    sleep(1);

    addr[2] = 'c';
    print_times("c");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync c");
    sleep(1);

    addr[3] = 'd';
    print_times("d");
    sleep(1);
    res = munmap(addr, 4);
    if (res == -1) {
        perror("munmap");
        return 1;
    }
    print_times("munmap");
    sleep(1);

    res = close(fd);
    if (res == -1) {
        perror("close");
        return 1;
    }
    print_times("end");
    return 0;
}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 18:29                             ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 18:29 UTC (permalink / raw)
  To: akpm; +Cc: a.p.zijlstra, linux-kernel, linux-mm

> that's simpler ;) Is it correct enough though?  The place where it will
> become inaccurate is for repeated modification via an established map.  ie:
> 
> 	p = mmap(..., MAP_SHARED);
> 	for ( ; ; )
> 		*p = 1;
> 
> in which case I think the timestamp will only get updated once per
> writeback interval (ie: 30 seconds).

Which is perfectly OK, we really can't do any better in any sane way.

My concern is only about MS_ASYNC, it would be really nice to know
what other OSs do.  I've checked on a Solaris 5.7 (not exactly a
modern OS) and that is as good as current Linux.

If someone has access to others pls. find my test program at the end.

The logical way to handle msync is IMO:

   write to memory + msync(MS_ASYNC) == write()
   write to memory + msync(MS_SYNC) == write() + fdatasync()

Yes, it would add some overhead for MS_ASYNC, but that's what the user
wants isn't it?  If the user doesn't want correctly updated
timestamps, it should not call msync().

Miklos

=== msync_time.c ==============================================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

static const char *filename;

static void print_times(const char *msg)
{
    struct stat stbuf;
    stat(filename, &stbuf);
    printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
           stbuf.st_atime);
}

static void usage(const char *progname)
{
    fprintf(stderr, "usage: %s filename [-s]\n", progname);
    exit(1);
}

int main(int argc, char *argv[])
{
    int res;
    char *addr;
    int msync_flag = MS_ASYNC;
    int fd;

    if (argc < 2)
        usage(argv[0]);

    filename = argv[1];
    if (argc > 2) {
        if (argc > 3)
            usage(argv[0]);
        if (strcmp(argv[2], "-s") == 0)
            msync_flag = MS_SYNC;
        else
            usage(argv[0]);
    }

    fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
    if (fd == -1) {
        perror(filename);
        return 1;
    }
    print_times("begin");
    sleep(1);
    write(fd, "aaaa\n", 4);
    print_times("write");
    sleep(1);
    addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == (char *) -1) {
        perror("mmap");
        return 1;
    }
    print_times("mmap");
    sleep(1);

    addr[1] = 'b';
    print_times("b");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync b");
    sleep(1);

    addr[2] = 'c';
    print_times("c");
    sleep(1);
    res = msync(addr, 4, msync_flag);
    if (res == -1) {
        perror("msync");
        return 1;
    }
    print_times("msync c");
    sleep(1);

    addr[3] = 'd';
    print_times("d");
    sleep(1);
    res = munmap(addr, 4);
    if (res == -1) {
        perror("munmap");
        return 1;
    }
    print_times("munmap");
    sleep(1);

    res = close(fd);
    if (res == -1) {
        perror("close");
        return 1;
    }
    print_times("end");
    return 0;
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-28  9:50                 ` linux
@ 2007-03-29  4:59                   ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2007-03-29  4:59 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux-kernel, linux-mm, miklos

linux@horizon.com wrote:
>>But if you didn't notice until now, then the current implementation
>>must be pretty reasonable for you use as well.
> 
> 
> Oh, I definitely noticed.  As soon as I tried to port my application
> to 2.6, it broke - as evidenced by my complaints last year.  The
> current solution is simple - since it's running on dedicated boxes,
> leave them on 2.4.

Well I didn't know that was a change in behaviour vs 2.4 (or maybe I
did and forgot). That was probably a bit silly, unless there was a
good reason for it.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-29  4:59                   ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2007-03-29  4:59 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux-kernel, linux-mm, miklos

linux@horizon.com wrote:
>>But if you didn't notice until now, then the current implementation
>>must be pretty reasonable for you use as well.
> 
> 
> Oh, I definitely noticed.  As soon as I tried to port my application
> to 2.6, it broke - as evidenced by my complaints last year.  The
> current solution is simple - since it's running on dedicated boxes,
> leave them on 2.4.

Well I didn't know that was a change in behaviour vs 2.4 (or maybe I
did and forgot). That was probably a bit silly, unless there was a
good reason for it.

-- 
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-28  7:58               ` Nick Piggin
@ 2007-03-28  9:50                 ` linux
  -1 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-28  9:50 UTC (permalink / raw)
  To: nickpiggin; +Cc: akpm, linux, linux-kernel, linux-mm, miklos

> But if you didn't notice until now, then the current implementation
> must be pretty reasonable for you use as well.

Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.

I've now got the hint on how to make it work on 2.6 (sync_file_range()),
so I can try again.  But the pressure to upgrade is not strong, so it
might be a while.

You may recall, this subthread started when I responding to "the
only reason to use msync(MS_ASYNC) is to update timestamps" with a
counterexample.  I still think the purpose of the call is a hint to the
kernel that writing to the specified page(s) is complete and now would be
a good time to clean them.  Which has very little to do with timestamps.

Now, my application, which leaves less than a second between the MS_ASYNC
and a subsequent MS_SYNC to check whether it's done, broke, but I can
imagine similar cases where MS_ASYNC would remain a useful hint to reduce
the sort of memory hogging generally associated with "dd if=/dev/zero"
type operations.

Reading between the lines of the standard, that seems (to me, at least)
to obviously be the intended purpose of msync(MS_ASYNC).  I wonder if
there's any historical documentation describing the original intent
behind creating the call.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-28  9:50                 ` linux
  0 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-28  9:50 UTC (permalink / raw)
  To: nickpiggin; +Cc: akpm, linux, linux-kernel, linux-mm, miklos

> But if you didn't notice until now, then the current implementation
> must be pretty reasonable for you use as well.

Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.

I've now got the hint on how to make it work on 2.6 (sync_file_range()),
so I can try again.  But the pressure to upgrade is not strong, so it
might be a while.

You may recall, this subthread started when I responding to "the
only reason to use msync(MS_ASYNC) is to update timestamps" with a
counterexample.  I still think the purpose of the call is a hint to the
kernel that writing to the specified page(s) is complete and now would be
a good time to clean them.  Which has very little to do with timestamps.

Now, my application, which leaves less than a second between the MS_ASYNC
and a subsequent MS_SYNC to check whether it's done, broke, but I can
imagine similar cases where MS_ASYNC would remain a useful hint to reduce
the sort of memory hogging generally associated with "dd if=/dev/zero"
type operations.

Reading between the lines of the standard, that seems (to me, at least)
to obviously be the intended purpose of msync(MS_ASYNC).  I wonder if
there's any historical documentation describing the original intent
behind creating the call.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-28  1:48             ` linux
@ 2007-03-28  7:58               ` Nick Piggin
  -1 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2007-03-28  7:58 UTC (permalink / raw)
  To: linux; +Cc: miklos, akpm, linux-kernel, linux-mm

linux@horizon.com wrote:
>>Linux _will_ write all modified data to permanent storage locations.
>>Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
>>do need msync() to enable data to be written back.
>>
>>But it will not start I/O immediately, which is not a requirement in
>>the standard, or at least it's pretty vague about that.
> 
> 
> As I've said before, I disagree, but I'm not going to start a major
> flame war about it.
> 
> The most relevant paragraph is:
> 
> # When MS_ASYNC is specified, msync() returns immediately once all the
> # write operations are initiated or queued for servicing; when MS_SYNC is
> # specified, msync() will not return until all write operations are
> # completed as defined for synchronised I/O data integrity completion.
> # Either MS_ASYNC or MS_SYNC is specified, but not both.
> 
> Note two things:
> 1) In the paragraphs before, what msync does is defined independently
>    of the MS_* flags.  Only the time of the return to user space varies.
>    Thus, whatever the delay between calling msync() and the data being
>    written, it should be the same whether MS_SYNC or MS_ASYNC is used.
> 
>    The implementation intended is:
>    - Start all I/O
>    - If MS_SYNC, wait for I/O to complete
>    - Return to user space

set_page_dirty queues pages for IO, and the writeout daemon will service
that queue when it sees fit. IMO it is sufficient that you cannot say the
implementation does not meet the standard.


> 2) "all the write operations are initiated or queued for servicing".
>    It is a common convention in English (and most languages, I expect)
>    that in the "or" is a preference for the first alternative.  The second
>    is a permitted alternative if the first is not possible.
> 
>    And "queued for servicing", especially "initiated or queued for
>    servicing", to me imples queuing waiting for some resource.  To have
>    the resource being waited for be a timer expiry seems like rather a
>    cheat to me.  It's perhaps doesn't break the letter of the standard,
>    but definitely bends it.  It feels like a fiddle.
> 
> Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
> "I don't expect to write this page any more, so now would be a good time
> to clean it."
> It would just make my life easier if the kernel procrastinated less.

But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-28  7:58               ` Nick Piggin
  0 siblings, 0 replies; 54+ messages in thread
From: Nick Piggin @ 2007-03-28  7:58 UTC (permalink / raw)
  To: linux; +Cc: miklos, akpm, linux-kernel, linux-mm

linux@horizon.com wrote:
>>Linux _will_ write all modified data to permanent storage locations.
>>Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
>>do need msync() to enable data to be written back.
>>
>>But it will not start I/O immediately, which is not a requirement in
>>the standard, or at least it's pretty vague about that.
> 
> 
> As I've said before, I disagree, but I'm not going to start a major
> flame war about it.
> 
> The most relevant paragraph is:
> 
> # When MS_ASYNC is specified, msync() returns immediately once all the
> # write operations are initiated or queued for servicing; when MS_SYNC is
> # specified, msync() will not return until all write operations are
> # completed as defined for synchronised I/O data integrity completion.
> # Either MS_ASYNC or MS_SYNC is specified, but not both.
> 
> Note two things:
> 1) In the paragraphs before, what msync does is defined independently
>    of the MS_* flags.  Only the time of the return to user space varies.
>    Thus, whatever the delay between calling msync() and the data being
>    written, it should be the same whether MS_SYNC or MS_ASYNC is used.
> 
>    The implementation intended is:
>    - Start all I/O
>    - If MS_SYNC, wait for I/O to complete
>    - Return to user space

set_page_dirty queues pages for IO, and the writeout daemon will service
that queue when it sees fit. IMO it is sufficient that you cannot say the
implementation does not meet the standard.


> 2) "all the write operations are initiated or queued for servicing".
>    It is a common convention in English (and most languages, I expect)
>    that in the "or" is a preference for the first alternative.  The second
>    is a permitted alternative if the first is not possible.
> 
>    And "queued for servicing", especially "initiated or queued for
>    servicing", to me imples queuing waiting for some resource.  To have
>    the resource being waited for be a timer expiry seems like rather a
>    cheat to me.  It's perhaps doesn't break the letter of the standard,
>    but definitely bends it.  It feels like a fiddle.
> 
> Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
> "I don't expect to write this page any more, so now would be a good time
> to clean it."
> It would just make my life easier if the kernel procrastinated less.

But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 20:31           ` Miklos Szeredi
@ 2007-03-28  1:48             ` linux
  -1 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-28  1:48 UTC (permalink / raw)
  To: linux, miklos; +Cc: akpm, linux-kernel, linux-mm

> Linux _will_ write all modified data to permanent storage locations.
> Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
> do need msync() to enable data to be written back.
> 
> But it will not start I/O immediately, which is not a requirement in
> the standard, or at least it's pretty vague about that.

As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space

2) "all the write operations are initiated or queued for servicing".
   It is a common convention in English (and most languages, I expect)
   that in the "or" is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And "queued for servicing", especially "initiated or queued for
   servicing", to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
"I don't expect to write this page any more, so now would be a good time
to clean it."
It would just make my life easier if the kernel procrastinated less.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-28  1:48             ` linux
  0 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-28  1:48 UTC (permalink / raw)
  To: linux, miklos; +Cc: akpm, linux-kernel, linux-mm

> Linux _will_ write all modified data to permanent storage locations.
> Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
> do need msync() to enable data to be written back.
> 
> But it will not start I/O immediately, which is not a requirement in
> the standard, or at least it's pretty vague about that.

As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space

2) "all the write operations are initiated or queued for servicing".
   It is a common convention in English (and most languages, I expect)
   that in the "or" is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And "queued for servicing", especially "initiated or queued for
   servicing", to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
"I don't expect to write this page any more, so now would be a good time
to clean it."
It would just make my life easier if the kernel procrastinated less.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 20:09         ` linux
@ 2007-03-27 20:47           ` Andrew Morton
  -1 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 20:47 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel, linux-mm, miklos

On 27 Mar 2007 16:09:33 -0400
linux@horizon.com wrote:

> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?
> 

Because for MS_ASYNC, "msync() shall return immediately once all the write
operations are initiated or queued for servicing".

ie: the writes can complete one millisecond or one week later.  We chose 30
seconds.

And this is not completely fatuous - before 2.6.17, MAP_SHARED pages could
float about in memory in a dirty state for arbitrarily long periods -
potentially for the entire application lifetime.  It was quite reasonable
for our MS_ASYNC implementation to do what it did: tell the VM about the
dirtiness of these pages so they get written back soon.

Post-2.6.17 we preserved that behaviour.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 20:47           ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 20:47 UTC (permalink / raw)
  To: linux; +Cc: linux-kernel, linux-mm, miklos

On 27 Mar 2007 16:09:33 -0400
linux@horizon.com wrote:

> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?
> 

Because for MS_ASYNC, "msync() shall return immediately once all the write
operations are initiated or queued for servicing".

ie: the writes can complete one millisecond or one week later.  We chose 30
seconds.

And this is not completely fatuous - before 2.6.17, MAP_SHARED pages could
float about in memory in a dirty state for arbitrarily long periods -
potentially for the entire application lifetime.  It was quite reasonable
for our MS_ASYNC implementation to do what it did: tell the VM about the
dirtiness of these pages so they get written back soon.

Post-2.6.17 we preserved that behaviour.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 20:09         ` linux
@ 2007-03-27 20:31           ` Miklos Szeredi
  -1 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 20:31 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux, linux-kernel, linux-mm

> > Suggest you use msync(MS_ASYNC), then
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).
> 
> Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
> back to the old behavior.
> 
> > We can fix your application, and we'll break someone else's.
> 
> If you can point to an application that it'll break, I'd be a lot more
> understanding.  Nobody did, last year.
> 
> > I don't think it's solveable, really - the range of applications is so
> > broad, and the "standard" is so vague as to be useless.
> 
> I agree that standards are sometimes vague, but that one seemed about
> as clear as it's possible to be without imposing unreasonably on
> the file system and device driver layers.
> 
> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 20:31           ` Miklos Szeredi
  0 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 20:31 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux-kernel, linux-mm

> > Suggest you use msync(MS_ASYNC), then
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).
> 
> Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
> back to the old behavior.
> 
> > We can fix your application, and we'll break someone else's.
> 
> If you can point to an application that it'll break, I'd be a lot more
> understanding.  Nobody did, last year.
> 
> > I don't think it's solveable, really - the range of applications is so
> > broad, and the "standard" is so vague as to be useless.
> 
> I agree that standards are sometimes vague, but that one seemed about
> as clear as it's possible to be without imposing unreasonably on
> the file system and device driver layers.
> 
> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.

Miklos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 19:34     ` Andrew Morton
@ 2007-03-27 20:09         ` linux
  0 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-27 20:09 UTC (permalink / raw)
  To: akpm, linux; +Cc: linux-kernel, linux-mm, miklos

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1605 bytes --]

> Suggest you use msync(MS_ASYNC), then
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
back to the old behavior.

> We can fix your application, and we'll break someone else's.

If you can point to an application that it'll break, I'd be a lot more
understanding.  Nobody did, last year.

> I don't think it's solveable, really - the range of applications is so
> broad, and the "standard" is so vague as to be useless.

I agree that standards are sometimes vague, but that one seemed about
as clear as it's possible to be without imposing unreasonably on
the file system and device driver layers.

What part of "The msync() function writes all modified data to
permanent storage locations [...] For mappings to files, the msync()
function ensures that all write operations are completed as defined
for synchronised I/O data integrity completion." suggests that it's not
supposed to do disk I/O?  How is that uselessly vague?

It says to me that msync's raison d'être is to write data from RAM to
stable storage.  If an application calls it too often, that's
the application's fault just as if it called sync(2) too often.

> This is why we've
> been extending these things with linux-specific goodies which permit
> applications to actually tell the kernel what they want to be done in a
> more finely-grained fashion.

Well, I still think the current Linux behavior is a bug, but there's a
usable (and run-time compatible) workaround that doesn't unreasonably
complicate the code, and that's good enough.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 20:09         ` linux
  0 siblings, 0 replies; 54+ messages in thread
From: linux @ 2007-03-27 20:09 UTC (permalink / raw)
  To: akpm, linux; +Cc: linux-kernel, linux-mm, miklos

> Suggest you use msync(MS_ASYNC), then
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
back to the old behavior.

> We can fix your application, and we'll break someone else's.

If you can point to an application that it'll break, I'd be a lot more
understanding.  Nobody did, last year.

> I don't think it's solveable, really - the range of applications is so
> broad, and the "standard" is so vague as to be useless.

I agree that standards are sometimes vague, but that one seemed about
as clear as it's possible to be without imposing unreasonably on
the file system and device driver layers.

What part of "The msync() function writes all modified data to
permanent storage locations [...] For mappings to files, the msync()
function ensures that all write operations are completed as defined
for synchronised I/O data integrity completion." suggests that it's not
supposed to do disk I/O?  How is that uselessly vague?

It says to me that msync's raison d'etre is to write data from RAM to
stable storage.  If an application calls it too often, that's
the application's fault just as if it called sync(2) too often.

> This is why we've
> been extending these things with linux-specific goodies which permit
> applications to actually tell the kernel what they want to be done in a
> more finely-grained fashion.

Well, I still think the current Linux behavior is a bug, but there's a
usable (and run-time compatible) workaround that doesn't unreasonably
complicate the code, and that's good enough.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 19:24   ` linux
@ 2007-03-27 19:34     ` Andrew Morton
  2007-03-27 20:09         ` linux
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 19:34 UTC (permalink / raw)
  To: linux; +Cc: miklos, linux-kernel, linux-mm

On 27 Mar 2007 15:24:52 -0400
linux@horizon.com wrote:

> > * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> 
> Yes, I noticed.  See
> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
> for a bug report on the subject February 2006.

Suggest you use msync(MS_ASYNC), then
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

The new (post 2.6.17) MAP_SHARED dirty-page semantics mean that the msync()
isn't actually needed.

> That's why this application is still running on 2.4.
> 
> As I mentioned at the time, the SUS says:
> (http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
> "When MS_ASYNC is specified, msync() returns immediately once all the
> write operations are initiated or queued for servicing."
> 
> You can argue that putting it on the dirty list constitutes "queued for
> servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
> to start the I/O.  Although strict standards-ese parsing says that
> either branch of an or is acceptable, it is a common English language
> convention that the first alternative is preferred and the second
> is a fallback.
> 
> It makes sense in this case: start the write or, if that's not possible
> (the disk is already busy), queue it for service as soon as the disk
> is available.
> 
> They perhaps didn't mandate it this strictly, but that's clearly the
> intent.

We can fix your application, and we'll break someone else's.

I don't think it's solveable, really - the range of applications is so
broad, and the "standard" is so vague as to be useless.  This is why we've
been extending these things with linux-specific goodies which permit
applications to actually tell the kernel what they want to be done in a
more finely-grained fashion.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 18:55 ` Miklos Szeredi
  2007-03-27 19:00   ` Miklos Szeredi
  2007-03-27 19:05   ` Andrew Morton
@ 2007-03-27 19:24   ` linux
  2007-03-27 19:34     ` Andrew Morton
  2 siblings, 1 reply; 54+ messages in thread
From: linux @ 2007-03-27 19:24 UTC (permalink / raw)
  To: linux, miklos; +Cc: akpm, linux-kernel, linux-mm

> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).

Yes, I noticed.  See
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
for a bug report on the subject February 2006.

That's why this application is still running on 2.4.

As I mentioned at the time, the SUS says:
(http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
"When MS_ASYNC is specified, msync() returns immediately once all the
write operations are initiated or queued for servicing."

You can argue that putting it on the dirty list constitutes "queued for
servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
to start the I/O.  Although strict standards-ese parsing says that
either branch of an or is acceptable, it is a common English language
convention that the first alternative is preferred and the second
is a fallback.

It makes sense in this case: start the write or, if that's not possible
(the disk is already busy), queue it for service as soon as the disk
is available.

They perhaps didn't mandate it this strictly, but that's clearly the
intent.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 18:55 ` Miklos Szeredi
  2007-03-27 19:00   ` Miklos Szeredi
@ 2007-03-27 19:05   ` Andrew Morton
  2007-03-27 19:24   ` linux
  2 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2007-03-27 19:05 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux, linux-kernel, linux-mm

On Tue, 27 Mar 2007 20:55:51 +0200
Miklos Szeredi <miklos@szeredi.hu> wrote:

> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either.

        case POSIX_FADV_DONTNEED:
                if (!bdi_write_congested(mapping->backing_dev_info))
                        filemap_flush(mapping);

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 18:55 ` Miklos Szeredi
@ 2007-03-27 19:00   ` Miklos Szeredi
  2007-03-27 19:05   ` Andrew Morton
  2007-03-27 19:24   ` linux
  2 siblings, 0 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 19:00 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux, linux-kernel, linux-mm

> > I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> > But it's a useful way to tell the OS that the page is not going
> > to be dirtied again.
> 
> Just to clarify, here's the header comment for sys_msync():
> 
> /*
>  * MS_SYNC syncs the entire file - including mappings.
>  *
>  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
>  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
>  * Now it doesn't do anything, since dirty pages are properly tracked.
>  *
>  * The application may now run fsync() to
>  * write out the dirty pages and wait on the writeout and check the result.
>  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
>  * async writeout immediately.
>  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
>  * applications.
>  */
> 
> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either.  So there you have it ;)

Sorry, FADV_DONTNEED _does_ seem to start writeout, but it does it
indiscriminately, not just on the specified range.  That should be
easy to fix though.

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
  2007-03-27 18:42 linux
@ 2007-03-27 18:55 ` Miklos Szeredi
  2007-03-27 19:00   ` Miklos Szeredi
                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Miklos Szeredi @ 2007-03-27 18:55 UTC (permalink / raw)
  To: linux; +Cc: akpm, linux, linux-kernel, linux-mm

> > Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
> > application doesn't want to update the timestamps, it should just omit
> > this call, since it does nothing else.
> 
> Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
> and doesn't care about timestamps.  (In fact, sometimes it's configured
> to write to a raw device and there are no timestamps.)
> 
> It's used as a poor man's portable async I/O.  The application logs
> data to disk, and sometimes needs to sync it to disk to ensure it has
> all been written.
> 
> To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
> as soon as a page is filled up to prompt asynchronous writeback.
> "I'm done writing this page and don't intend to write it again.
> Please start committing it to stable storage, but don't block me."
> 
> Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
> is synced to disk.  This caused annoying hiccups before the MS_ASYNC
> calls were added.
> 
> 
> I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> But it's a useful way to tell the OS that the page is not going
> to be dirtied again.

Just to clarify, here's the header comment for sys_msync():

/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */

It's actually wrong about FADV_DONTNEED, which I think doesn't start
writeout either.  So there you have it ;)

Miklos

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [patch resend v4] update ctime and mtime for mmaped write
@ 2007-03-27 18:42 linux
  2007-03-27 18:55 ` Miklos Szeredi
  0 siblings, 1 reply; 54+ messages in thread
From: linux @ 2007-03-27 18:42 UTC (permalink / raw)
  To: miklos; +Cc: akpm, linux, linux-kernel, linux-mm

> Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
> application doesn't want to update the timestamps, it should just omit
> this call, since it does nothing else.

Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
and doesn't care about timestamps.  (In fact, sometimes it's configured
to write to a raw device and there are no timestamps.)

It's used as a poor man's portable async I/O.  The application logs
data to disk, and sometimes needs to sync it to disk to ensure it has
all been written.

To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
as soon as a page is filled up to prompt asynchronous writeback.
"I'm done writing this page and don't intend to write it again.
Please start committing it to stable storage, but don't block me."

Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
is synced to disk.  This caused annoying hiccups before the MS_ASYNC
calls were added.


I agree that msync(MS_ASYNC) has no semantics if time is ignored.
But it's a useful way to tell the OS that the page is not going
to be dirtied again.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2007-03-29  4:59 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-25 21:10 [patch resend v4] update ctime and mtime for mmaped write Miklos Szeredi
2007-03-25 21:10 ` Miklos Szeredi, Miklos Szeredi
2007-03-26 21:00 ` Andrew Morton
2007-03-26 21:00   ` Andrew Morton
2007-03-26 21:10   ` Matt Mackall
2007-03-26 21:10     ` Matt Mackall
2007-03-26 22:25     ` Andrew Morton
2007-03-26 22:25       ` Andrew Morton
2007-03-26 21:43   ` Miklos Szeredi
2007-03-26 21:43     ` Miklos Szeredi
2007-03-26 22:31     ` Andrew Morton
2007-03-26 22:31       ` Andrew Morton
2007-03-27  6:55       ` Miklos Szeredi
2007-03-27  6:55         ` Miklos Szeredi
2007-03-27  7:22         ` Andrew Morton
2007-03-27  7:22           ` Andrew Morton
2007-03-27  7:36           ` Miklos Szeredi
2007-03-27  7:36             ` Miklos Szeredi
2007-03-27  7:49             ` Andrew Morton
2007-03-27  7:49               ` Andrew Morton
2007-03-27  8:03               ` Miklos Szeredi
2007-03-27  8:03                 ` Miklos Szeredi
2007-03-27  8:18                 ` Andrew Morton
2007-03-27  8:18                   ` Andrew Morton
2007-03-27  8:28                   ` Miklos Szeredi
2007-03-27  8:28                     ` Miklos Szeredi
2007-03-27  8:51                     ` Andrew Morton
2007-03-27  8:51                       ` Andrew Morton
2007-03-27  9:23                       ` Miklos Szeredi
2007-03-27  9:23                         ` Miklos Szeredi
2007-03-27 17:52                         ` Andrew Morton
2007-03-27 17:52                           ` Andrew Morton
2007-03-27 18:29                           ` Miklos Szeredi
2007-03-27 18:29                             ` Miklos Szeredi
2007-03-27 18:42 linux
2007-03-27 18:55 ` Miklos Szeredi
2007-03-27 19:00   ` Miklos Szeredi
2007-03-27 19:05   ` Andrew Morton
2007-03-27 19:24   ` linux
2007-03-27 19:34     ` Andrew Morton
2007-03-27 20:09       ` linux
2007-03-27 20:09         ` linux
2007-03-27 20:31         ` Miklos Szeredi
2007-03-27 20:31           ` Miklos Szeredi
2007-03-28  1:48           ` linux
2007-03-28  1:48             ` linux
2007-03-28  7:58             ` Nick Piggin
2007-03-28  7:58               ` Nick Piggin
2007-03-28  9:50               ` linux
2007-03-28  9:50                 ` linux
2007-03-29  4:59                 ` Nick Piggin
2007-03-29  4:59                   ` Nick Piggin
2007-03-27 20:47         ` Andrew Morton
2007-03-27 20:47           ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.