linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)
       [not found]             ` <20051101000509.GA11847@ccure.user-mode-linux.org>
@ 2005-11-02  1:15               ` Badari Pulavarty
  2005-11-02  1:43                 ` Andrea Arcangeli
  0 siblings, 1 reply; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-02  1:15 UTC (permalink / raw)
  To: lkml
  Cc: Hugh Dickins, akpm, andrea, dvhltc, linux-mm, Blaisorblade, Jeff Dike

[-- Attachment #1: Type: text/plain, Size: 1124 bytes --]

Hi All,

Here is the patch to support madvise(MADV_FREE) - which frees 
up the given range of pages and truncates the underlying backing 
store. This basically provides "punch hole into file" functionality.
Currently it supports ONLY shmfs/tmpfs - where we have short term 
need. Other filesystems return -ENOSYS.

Yes. This is a *crazy* interface to do it. But this is what
we exactly need for now. Here is the discussion on linux-mm 
(for all the fun discussion and the naming):

http://marc.theaimsgroup.com/?l=linux-mm&m=113078625426989&w=2

Andrew, could you include this in your next -mm release ?
BTW, for completeness - this patch includes reiser4-truncate-inode-
pages-range patch from  your -mm series. If you want me to re-work 
my patch without that, please let me know.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.14-
rc5/2.6.14-rc5-mm1/broken-out/reiser4-truncate_inode_pages_range.patch

I tested with my test cases and Jeff Dike was kind enough to provide
a test case with UML - which found more bugs. I thank Andrea & Hugh
for helping me out heavily :)

Comments ?

Thanks,
Badari



[-- Attachment #2: madvise-free.patch --]
[-- Type: text/x-patch, Size: 22594 bytes --]

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
diff -Naurp -X dontdiff linux-2.6.14/include/asm-alpha/mman.h linux-2.6.14.madv/include/asm-alpha/mman.h
--- linux-2.6.14/include/asm-alpha/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-alpha/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-arm/mman.h linux-2.6.14.madv/include/asm-arm/mman.h
--- linux-2.6.14/include/asm-arm/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-arm/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-arm26/mman.h linux-2.6.14.madv/include/asm-arm26/mman.h
--- linux-2.6.14/include/asm-arm26/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-arm26/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-cris/mman.h linux-2.6.14.madv/include/asm-cris/mman.h
--- linux-2.6.14/include/asm-cris/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-cris/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-frv/mman.h linux-2.6.14.madv/include/asm-frv/mman.h
--- linux-2.6.14/include/asm-frv/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-frv/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-h8300/mman.h linux-2.6.14.madv/include/asm-h8300/mman.h
--- linux-2.6.14/include/asm-h8300/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-h8300/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-i386/mman.h linux-2.6.14.madv/include/asm-i386/mman.h
--- linux-2.6.14/include/asm-i386/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-i386/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-ia64/mman.h linux-2.6.14.madv/include/asm-ia64/mman.h
--- linux-2.6.14/include/asm-ia64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-ia64/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-m32r/mman.h linux-2.6.14.madv/include/asm-m32r/mman.h
--- linux-2.6.14/include/asm-m32r/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-m32r/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-m68k/mman.h linux-2.6.14.madv/include/asm-m68k/mman.h
--- linux-2.6.14/include/asm-m68k/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-m68k/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-mips/mman.h linux-2.6.14.madv/include/asm-mips/mman.h
--- linux-2.6.14/include/asm-mips/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-mips/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON       MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-parisc/mman.h linux-2.6.14.madv/include/asm-parisc/mman.h
--- linux-2.6.14/include/asm-parisc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-parisc/mman.h	2005-11-01 11:47:06.000000000 -0800
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE       8		/* free up these pages */
 
 /* The range 12-64 is reserved for page size specification. */
 #define MADV_4K_PAGES   12              /* Use 4K pages  */
diff -Naurp -X dontdiff linux-2.6.14/include/asm-powerpc/mman.h linux-2.6.14.madv/include/asm-powerpc/mman.h
--- linux-2.6.14/include/asm-powerpc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-powerpc/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -44,6 +44,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-s390/mman.h linux-2.6.14.madv/include/asm-s390/mman.h
--- linux-2.6.14/include/asm-s390/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-s390/mman.h	2005-11-01 11:46:37.000000000 -0800
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL        0x2             /* read-ahead aggressively */
 #define MADV_WILLNEED  0x3              /* pre-fault pages */
 #define MADV_DONTNEED  0x4              /* discard these pages */
+#define MADV_FREE      0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sh/mman.h linux-2.6.14.madv/include/asm-sh/mman.h
--- linux-2.6.14/include/asm-sh/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sh/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sparc/mman.h linux-2.6.14.madv/include/asm-sparc/mman.h
--- linux-2.6.14/include/asm-sparc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sparc/mman.h	2005-11-01 11:45:42.000000000 -0800
@@ -53,7 +53,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
-#define MADV_FREE	0x5		/* (Solaris) contents can be freed */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sparc64/mman.h linux-2.6.14.madv/include/asm-sparc64/mman.h
--- linux-2.6.14/include/asm-sparc64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sparc64/mman.h	2005-11-01 11:46:08.000000000 -0800
@@ -53,7 +53,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
-#define MADV_FREE	0x5		/* (Solaris) contents can be freed */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-v850/mman.h linux-2.6.14.madv/include/asm-v850/mman.h
--- linux-2.6.14/include/asm-v850/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-v850/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -32,6 +32,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-x86_64/mman.h linux-2.6.14.madv/include/asm-x86_64/mman.h
--- linux-2.6.14/include/asm-x86_64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-x86_64/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -36,6 +36,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-xtensa/mman.h linux-2.6.14.madv/include/asm-xtensa/mman.h
--- linux-2.6.14/include/asm-xtensa/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-xtensa/mman.h	2005-11-01 11:45:09.000000000 -0800
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_FREE	0x5		/* free up these pages */
 
 /* compatibility flags */
 #define MAP_ANON       MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/linux/fs.h linux-2.6.14.madv/include/linux/fs.h
--- linux-2.6.14/include/linux/fs.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/linux/fs.h	2005-11-01 11:45:09.000000000 -0800
@@ -995,6 +995,7 @@ struct inode_operations {
 	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
+	void (*truncate_range)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
diff -Naurp -X dontdiff linux-2.6.14/include/linux/mm.h linux-2.6.14.madv/include/linux/mm.h
--- linux-2.6.14/include/linux/mm.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/linux/mm.h	2005-11-01 11:45:09.000000000 -0800
@@ -704,6 +704,7 @@ static inline void unmap_shared_mapping_
 }
 
 extern int vmtruncate(struct inode * inode, loff_t offset);
+extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 extern pud_t *FASTCALL(__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
 extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
@@ -865,6 +866,7 @@ extern unsigned long do_brk(unsigned lon
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
+extern void truncate_inode_pages_range(struct address_space *, loff_t, loff_t);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
diff -Naurp -X dontdiff linux-2.6.14/mm/madvise.c linux-2.6.14.madv/mm/madvise.c
--- linux-2.6.14/mm/madvise.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/madvise.c	2005-11-01 12:01:27.000000000 -0800
@@ -140,6 +140,39 @@ static long madvise_dontneed(struct vm_a
 	return 0;
 }
 
+/*
+ * Application wants to free up the pages and associated backing store. 
+ * This is effectively punching a hole into the middle of a file.
+ *
+ * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
+ * Other filesystems return -ENOSYS.
+ */
+static long madvise_free(struct vm_area_struct * vma,
+			     unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+        loff_t offset, endoff;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) 
+		return -EINVAL;
+
+	if (!vma->vm_file || !vma->vm_file->f_mapping 
+		|| !vma->vm_file->f_mapping->host) {
+			return -EINVAL;
+	}
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping == &swapper_space) {
+		return -EINVAL;
+	}
+
+	offset = (loff_t)(start - vma->vm_start) 
+			+ (vma->vm_pgoff << PAGE_SHIFT);
+	endoff = (loff_t)(end - vma->vm_start - 1) 
+			+ (vma->vm_pgoff << PAGE_SHIFT);
+	return  vmtruncate_range(mapping->host, offset, endoff);
+}
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -152,6 +185,9 @@ madvise_vma(struct vm_area_struct *vma, 
 	case MADV_RANDOM:
 		error = madvise_behavior(vma, prev, start, end, behavior);
 		break;
+	case MADV_FREE:
+		error = madvise_free(vma, start, end);
+		break;
 
 	case MADV_WILLNEED:
 		error = madvise_willneed(vma, prev, start, end);
@@ -190,6 +226,8 @@ madvise_vma(struct vm_area_struct *vma, 
  *		some pages ahead.
  *  MADV_DONTNEED - the application is finished with the given range,
  *		so the kernel can free resources associated with it.
+ *  MADV_FREE - the application wants to free up the given range of
+ *		pages and associated backing store.
  *
  * return values:
  *  zero    - success
diff -Naurp -X dontdiff linux-2.6.14/mm/memory.c linux-2.6.14.madv/mm/memory.c
--- linux-2.6.14/mm/memory.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/memory.c	2005-11-01 11:45:09.000000000 -0800
@@ -1597,6 +1597,32 @@ out_busy:
 
 EXPORT_SYMBOL(vmtruncate);
 
+int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * If the underlying filesystem is not going to provide 
+	 * a way to truncate a range of blocks (punch a hole) - 
+	 * we should return failure right now.
+	 */
+	if (!inode->i_op || !inode->i_op->truncate_range)
+		return -ENOSYS;
+		
+	/* XXX - Do we need both i_sem and i_allocsem all the way ? */
+	down(&inode->i_sem);
+	down_write(&inode->i_alloc_sem);
+	unmap_mapping_range(mapping, offset, (end - offset), 1);
+	truncate_inode_pages_range(mapping, offset, end);
+	inode->i_op->truncate_range(inode, offset, end);
+	up_write(&inode->i_alloc_sem);
+	up(&inode->i_sem);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(vmtruncate_range);
+
 /* 
  * Primitive swap readahead code. We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
diff -Naurp -X dontdiff linux-2.6.14/mm/shmem.c linux-2.6.14.madv/mm/shmem.c
--- linux-2.6.14/mm/shmem.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/shmem.c	2005-11-01 11:45:09.000000000 -0800
@@ -459,7 +459,7 @@ static void shmem_free_pages(struct list
 	} while (next);
 }
 
-static void shmem_truncate(struct inode *inode)
+static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	unsigned long idx;
@@ -477,18 +477,27 @@ static void shmem_truncate(struct inode 
 	long nr_swaps_freed = 0;
 	int offset;
 	int freed;
+	int punch_hole = 0;
 
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	idx = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (idx >= info->next_index)
 		return;
 
 	spin_lock(&info->lock);
 	info->flags |= SHMEM_TRUNCATE;
-	limit = info->next_index;
-	info->next_index = idx;
+	if (likely(end == (loff_t) -1)) {
+		limit = info->next_index;
+		info->next_index = idx;
+	} else {
+		limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+		if (limit > info->next_index)
+			limit = info->next_index;
+		punch_hole = 1;
+	}
+
 	topdir = info->i_indirect;
-	if (topdir && idx <= SHMEM_NR_DIRECT) {
+	if (topdir && idx <= SHMEM_NR_DIRECT && !punch_hole) {
 		info->i_indirect = NULL;
 		nr_pages_to_free++;
 		list_add(&topdir->lru, &pages_to_free);
@@ -575,11 +584,12 @@ static void shmem_truncate(struct inode 
 			subdir->nr_swapped -= freed;
 			if (offset)
 				spin_unlock(&info->lock);
-			BUG_ON(subdir->nr_swapped > offset);
+			if (!punch_hole)
+				BUG_ON(subdir->nr_swapped > offset);
 		}
 		if (offset)
 			offset = 0;
-		else if (subdir) {
+		else if (subdir && !subdir->nr_swapped) {
 			dir[diroff] = NULL;
 			nr_pages_to_free++;
 			list_add(&subdir->lru, &pages_to_free);
@@ -596,7 +606,7 @@ done2:
 		 * Also, though shmem_getpage checks i_size before adding to
 		 * cache, no recheck after: so fix the narrow window there too.
 		 */
-		truncate_inode_pages(inode->i_mapping, inode->i_size);
+		truncate_inode_pages_range(inode->i_mapping, start, end);
 	}
 
 	spin_lock(&info->lock);
@@ -616,6 +626,11 @@ done2:
 	}
 }
 
+static void shmem_truncate(struct inode *inode)
+{
+	shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
+}
+
 static int shmem_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -2083,6 +2098,7 @@ static struct file_operations shmem_file
 static struct inode_operations shmem_inode_operations = {
 	.truncate	= shmem_truncate,
 	.setattr	= shmem_notify_change,
+	.truncate_range	= shmem_truncate_range,
 };
 
 static struct inode_operations shmem_dir_inode_operations = {
diff -Naurp -X dontdiff linux-2.6.14/mm/truncate.c linux-2.6.14.madv/mm/truncate.c
--- linux-2.6.14/mm/truncate.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/truncate.c	2005-11-01 11:45:09.000000000 -0800
@@ -91,12 +91,15 @@ invalidate_complete_page(struct address_
 }
 
 /**
- * truncate_inode_pages - truncate *all* the pages from an offset
+ * truncate_inode_pages - truncate range of pages specified by start and
+ * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
+ * @lend: offset to which to truncate
  *
- * Truncate the page cache at a set offset, removing the pages that are beyond
- * that offset (and zeroing out partial pages).
+ * Truncate the page cache, removing the pages that are between
+ * specified offsets (and zeroing out partial page
+ * (if lstart is not page aligned)).
  *
  * Truncate takes two passes - the first pass is nonblocking.  It will not
  * block on page locks and it will not block on writeback.  The second pass
@@ -110,12 +113,12 @@ invalidate_complete_page(struct address_
  * We pass down the cache-hot hint to the page freeing code.  Even if the
  * mapping is large, it is probably the case that the final pages are the most
  * recently touched, and freeing happens in ascending file offset order.
- *
- * Called under (and serialised by) inode->i_sem.
  */
-void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+void truncate_inode_pages_range(struct address_space *mapping,
+				loff_t lstart, loff_t lend)
 {
 	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+	pgoff_t end;
 	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
 	struct pagevec pvec;
 	pgoff_t next;
@@ -124,13 +127,22 @@ void truncate_inode_pages(struct address
 	if (mapping->nrpages == 0)
 		return;
 
+	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
+	end = (lend  >> PAGE_CACHE_SHIFT);
+
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next <= end &&
+	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index = page->index;
 
+			if (page_index > end) {
+				next = page_index;
+				break;
+			}
+
 			if (page_index > next)
 				next = page_index;
 			next++;
@@ -166,9 +178,15 @@ void truncate_inode_pages(struct address
 			next = start;
 			continue;
 		}
+		if (pvec.pages[0]->index > end) {
+			pagevec_release(&pvec);
+			break;
+		}
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
+			if (page->index > end)
+				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
 			if (page->index > next)
@@ -180,7 +198,19 @@ void truncate_inode_pages(struct address
 		pagevec_release(&pvec);
 	}
 }
+EXPORT_SYMBOL(truncate_inode_pages_range);
 
+/**
+ * truncate_inode_pages - truncate *all* the pages from an offset
+ * @mapping: mapping to truncate
+ * @lstart: offset from which to truncate
+ *
+ * Called under (and serialised by) inode->i_sem.
+ */
+void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+{
+	truncate_inode_pages_range(mapping, lstart, (loff_t)-1);
+}
 EXPORT_SYMBOL(truncate_inode_pages);
 
 /**

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)
  2005-11-02  1:15               ` [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE) Badari Pulavarty
@ 2005-11-02  1:43                 ` Andrea Arcangeli
  2005-11-02 15:49                   ` Badari Pulavarty
  2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
  0 siblings, 2 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2005-11-02  1:43 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: lkml, Hugh Dickins, akpm, dvhltc, linux-mm, Blaisorblade, Jeff Dike

On Tue, Nov 01, 2005 at 05:15:01PM -0800, Badari Pulavarty wrote:
> Here is the patch to support madvise(MADV_FREE) - which frees 
> up the given range of pages and truncates the underlying backing 
> store. This basically provides "punch hole into file" functionality.
> Currently it supports ONLY shmfs/tmpfs - where we have short term 
> need. Other filesystems return -ENOSYS.

MADV_FREE as a name isn't right if we return -ENOSYS for anonymoys
memory.

MADV_FREE in other OS works _only_ on anonymous memory and returns
-EINVAL if used on filebacked vmas. Infact we probably should rename our
MADV_DONTNEED to MADV_FREE.

http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view

	"This value cannot be used on mappings that have underlying file objects."

Our MADV_DONTNEED exactly matches the MADV_FREE semantics, and it seems
the MADV_DONTNEED of other OS isn't destructive like ours. Except our
MADV_DONTNEED also works on filebacked mappings but it's destructive
only on anonymous memory.


I thought Andrew suggested MADV_REMOVE for the new feature.

This feature didn't exist in other OS yet AFIK, so a new MADV_name for
it makes sense. I'm not completely against extending MADV_FREE but then we
shouldn't return -ENOSYS on anonymous memory and we should do the same
thing MADV_DONTNEED does on anonymous memory. Probably a new name is
safer to avoid confusion (think an application running MADV_FREE and
expecting -EINVAL when used on filebacked mappings).

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE)
  2005-11-02  1:43                 ` Andrea Arcangeli
@ 2005-11-02 15:49                   ` Badari Pulavarty
  2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
  1 sibling, 0 replies; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-02 15:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lkml, Hugh Dickins, akpm, dvhltc, linux-mm, Blaisorblade, Jeff Dike

On Wed, 2005-11-02 at 02:43 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 01, 2005 at 05:15:01PM -0800, Badari Pulavarty wrote:
> > Here is the patch to support madvise(MADV_FREE) - which frees 
> > up the given range of pages and truncates the underlying backing 
> > store. This basically provides "punch hole into file" functionality.
> > Currently it supports ONLY shmfs/tmpfs - where we have short term 
> > need. Other filesystems return -ENOSYS.
> 
> MADV_FREE as a name isn't right if we return -ENOSYS for anonymoys
> memory.
> 
> MADV_FREE in other OS works _only_ on anonymous memory and returns
> -EINVAL if used on filebacked vmas. Infact we probably should rename our
> MADV_DONTNEED to MADV_FREE.
> 
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view
> 
> 	"This value cannot be used on mappings that have underlying file objects."
> 
> Our MADV_DONTNEED exactly matches the MADV_FREE semantics, and it seems
> the MADV_DONTNEED of other OS isn't destructive like ours. Except our
> MADV_DONTNEED also works on filebacked mappings but it's destructive
> only on anonymous memory.
> 
> 
> I thought Andrew suggested MADV_REMOVE for the new feature.

Yep. My bad. Andrew did suggest MADV_REMOVE. Let me rename and
generate patch once again !! Thanks for pointing out.

> 
> This feature didn't exist in other OS yet AFIK, so a new MADV_name for
> it makes sense. I'm not completely against extending MADV_FREE but then we
> shouldn't return -ENOSYS on anonymous memory and we should do the same
> thing MADV_DONTNEED does on anonymous memory. Probably a new name is
> safer to avoid confusion (think an application running MADV_FREE and
> expecting -EINVAL when used on filebacked mappings).
> 
> Thanks!
> 

Thanks,
Badari


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-02  1:43                 ` Andrea Arcangeli
  2005-11-02 15:49                   ` Badari Pulavarty
@ 2005-11-02 16:12                   ` Badari Pulavarty
  2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
                                       ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-02 16:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: lkml, Hugh Dickins, akpm, dvhltc, linux-mm, Blaisorblade, Jeff Dike

[-- Attachment #1: Type: text/plain, Size: 650 bytes --]

Hi Andrew & Andrea,

Here is the updated patch with name change again :(
Hopefully this would be final. (MADV_REMOVE).

BTW, I am not sure if we need to hold i_sem and i_allocsem
all the way ? I wanted to be safe - but this may be overkill ?


+       /* XXX - Do we need both i_sem and i_allocsem all the way ? */
+       down(&inode->i_sem);
+       down_write(&inode->i_alloc_sem);
+       unmap_mapping_range(mapping, offset, (end - offset), 1);
+       truncate_inode_pages_range(mapping, offset, end);
+       inode->i_op->truncate_range(inode, offset, end);
+       up_write(&inode->i_alloc_sem);
+       up(&inode->i_sem);


Thanks,
Badari



[-- Attachment #2: madvise-remove.patch --]
[-- Type: text/x-patch, Size: 22685 bytes --]

diff -Naurp -X dontdiff linux-2.6.14/include/asm-alpha/mman.h linux-2.6.14.madv/include/asm-alpha/mman.h
--- linux-2.6.14/include/asm-alpha/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-alpha/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -42,6 +42,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_REMOVE	7		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-arm/mman.h linux-2.6.14.madv/include/asm-arm/mman.h
--- linux-2.6.14/include/asm-arm/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-arm/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-arm26/mman.h linux-2.6.14.madv/include/asm-arm26/mman.h
--- linux-2.6.14/include/asm-arm26/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-arm26/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-cris/mman.h linux-2.6.14.madv/include/asm-cris/mman.h
--- linux-2.6.14/include/asm-cris/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-cris/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-frv/mman.h linux-2.6.14.madv/include/asm-frv/mman.h
--- linux-2.6.14/include/asm-frv/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-frv/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-h8300/mman.h linux-2.6.14.madv/include/asm-h8300/mman.h
--- linux-2.6.14/include/asm-h8300/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-h8300/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-i386/mman.h linux-2.6.14.madv/include/asm-i386/mman.h
--- linux-2.6.14/include/asm-i386/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-i386/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-ia64/mman.h linux-2.6.14.madv/include/asm-ia64/mman.h
--- linux-2.6.14/include/asm-ia64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-ia64/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-m32r/mman.h linux-2.6.14.madv/include/asm-m32r/mman.h
--- linux-2.6.14/include/asm-m32r/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-m32r/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -37,6 +37,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-m68k/mman.h linux-2.6.14.madv/include/asm-m68k/mman.h
--- linux-2.6.14/include/asm-m68k/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-m68k/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-mips/mman.h linux-2.6.14.madv/include/asm-mips/mman.h
--- linux-2.6.14/include/asm-mips/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-mips/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -65,6 +65,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON       MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-parisc/mman.h linux-2.6.14.madv/include/asm-parisc/mman.h
--- linux-2.6.14/include/asm-parisc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-parisc/mman.h	2005-11-02 03:12:02.000000000 -0800
@@ -38,6 +38,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_REMOVE     8		/* remove these pages & resources */
 
 /* The range 12-64 is reserved for page size specification. */
 #define MADV_4K_PAGES   12              /* Use 4K pages  */
diff -Naurp -X dontdiff linux-2.6.14/include/asm-powerpc/mman.h linux-2.6.14.madv/include/asm-powerpc/mman.h
--- linux-2.6.14/include/asm-powerpc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-powerpc/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -44,6 +44,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-s390/mman.h linux-2.6.14.madv/include/asm-s390/mman.h
--- linux-2.6.14/include/asm-s390/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-s390/mman.h	2005-11-02 03:12:13.000000000 -0800
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL        0x2             /* read-ahead aggressively */
 #define MADV_WILLNEED  0x3              /* pre-fault pages */
 #define MADV_DONTNEED  0x4              /* discard these pages */
+#define MADV_REMOVE    0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sh/mman.h linux-2.6.14.madv/include/asm-sh/mman.h
--- linux-2.6.14/include/asm-sh/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sh/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -35,6 +35,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sparc/mman.h linux-2.6.14.madv/include/asm-sparc/mman.h
--- linux-2.6.14/include/asm-sparc/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sparc/mman.h	2005-11-02 03:04:57.000000000 -0800
@@ -54,6 +54,7 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_FREE	0x5		/* (Solaris) contents can be freed */
+#define MADV_REMOVE	0x6		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-sparc64/mman.h linux-2.6.14.madv/include/asm-sparc64/mman.h
--- linux-2.6.14/include/asm-sparc64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-sparc64/mman.h	2005-11-02 03:04:35.000000000 -0800
@@ -54,6 +54,7 @@
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
 #define MADV_FREE	0x5		/* (Solaris) contents can be freed */
+#define MADV_REMOVE	0x6		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-v850/mman.h linux-2.6.14.madv/include/asm-v850/mman.h
--- linux-2.6.14/include/asm-v850/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-v850/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -32,6 +32,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-x86_64/mman.h linux-2.6.14.madv/include/asm-x86_64/mman.h
--- linux-2.6.14/include/asm-x86_64/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-x86_64/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -36,6 +36,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON	MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/asm-xtensa/mman.h linux-2.6.14.madv/include/asm-xtensa/mman.h
--- linux-2.6.14/include/asm-xtensa/mman.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/asm-xtensa/mman.h	2005-11-02 03:03:55.000000000 -0800
@@ -72,6 +72,7 @@
 #define MADV_SEQUENTIAL	0x2		/* read-ahead aggressively */
 #define MADV_WILLNEED	0x3		/* pre-fault pages */
 #define MADV_DONTNEED	0x4		/* discard these pages */
+#define MADV_REMOVE	0x5		/* remove these pages & resources */
 
 /* compatibility flags */
 #define MAP_ANON       MAP_ANONYMOUS
diff -Naurp -X dontdiff linux-2.6.14/include/linux/fs.h linux-2.6.14.madv/include/linux/fs.h
--- linux-2.6.14/include/linux/fs.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/linux/fs.h	2005-11-02 03:03:55.000000000 -0800
@@ -995,6 +995,7 @@ struct inode_operations {
 	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
+	void (*truncate_range)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
diff -Naurp -X dontdiff linux-2.6.14/include/linux/mm.h linux-2.6.14.madv/include/linux/mm.h
--- linux-2.6.14/include/linux/mm.h	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/include/linux/mm.h	2005-11-02 03:03:55.000000000 -0800
@@ -704,6 +704,7 @@ static inline void unmap_shared_mapping_
 }
 
 extern int vmtruncate(struct inode * inode, loff_t offset);
+extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
 extern pud_t *FASTCALL(__pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address));
 extern pmd_t *FASTCALL(__pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address));
 extern pte_t *FASTCALL(pte_alloc_kernel(struct mm_struct *mm, pmd_t *pmd, unsigned long address));
@@ -865,6 +866,7 @@ extern unsigned long do_brk(unsigned lon
 /* filemap.c */
 extern unsigned long page_unuse(struct page *);
 extern void truncate_inode_pages(struct address_space *, loff_t);
+extern void truncate_inode_pages_range(struct address_space *, loff_t, loff_t);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
diff -Naurp -X dontdiff linux-2.6.14/mm/madvise.c linux-2.6.14.madv/mm/madvise.c
--- linux-2.6.14/mm/madvise.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/madvise.c	2005-11-02 03:03:55.000000000 -0800
@@ -140,6 +140,39 @@ static long madvise_dontneed(struct vm_a
 	return 0;
 }
 
+/*
+ * Application wants to free up the pages and associated backing store. 
+ * This is effectively punching a hole into the middle of a file.
+ *
+ * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
+ * Other filesystems return -ENOSYS.
+ */
+static long madvise_remove(struct vm_area_struct * vma,
+			     unsigned long start, unsigned long end)
+{
+	struct address_space *mapping;
+        loff_t offset, endoff;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) 
+		return -EINVAL;
+
+	if (!vma->vm_file || !vma->vm_file->f_mapping 
+		|| !vma->vm_file->f_mapping->host) {
+			return -EINVAL;
+	}
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping == &swapper_space) {
+		return -EINVAL;
+	}
+
+	offset = (loff_t)(start - vma->vm_start) 
+			+ (vma->vm_pgoff << PAGE_SHIFT);
+	endoff = (loff_t)(end - vma->vm_start - 1) 
+			+ (vma->vm_pgoff << PAGE_SHIFT);
+	return  vmtruncate_range(mapping->host, offset, endoff);
+}
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -152,6 +185,9 @@ madvise_vma(struct vm_area_struct *vma, 
 	case MADV_RANDOM:
 		error = madvise_behavior(vma, prev, start, end, behavior);
 		break;
+	case MADV_REMOVE:
+		error = madvise_remove(vma, start, end);
+		break;
 
 	case MADV_WILLNEED:
 		error = madvise_willneed(vma, prev, start, end);
@@ -190,6 +226,8 @@ madvise_vma(struct vm_area_struct *vma, 
  *		some pages ahead.
  *  MADV_DONTNEED - the application is finished with the given range,
  *		so the kernel can free resources associated with it.
+ *  MADV_REMOVE - the application wants to free up the given range of
+ *		pages and associated backing store.
  *
  * return values:
  *  zero    - success
diff -Naurp -X dontdiff linux-2.6.14/mm/memory.c linux-2.6.14.madv/mm/memory.c
--- linux-2.6.14/mm/memory.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/memory.c	2005-11-02 03:03:55.000000000 -0800
@@ -1597,6 +1597,32 @@ out_busy:
 
 EXPORT_SYMBOL(vmtruncate);
 
+int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * If the underlying filesystem is not going to provide 
+	 * a way to truncate a range of blocks (punch a hole) - 
+	 * we should return failure right now.
+	 */
+	if (!inode->i_op || !inode->i_op->truncate_range)
+		return -ENOSYS;
+		
+	/* XXX - Do we need both i_sem and i_allocsem all the way ? */
+	down(&inode->i_sem);
+	down_write(&inode->i_alloc_sem);
+	unmap_mapping_range(mapping, offset, (end - offset), 1);
+	truncate_inode_pages_range(mapping, offset, end);
+	inode->i_op->truncate_range(inode, offset, end);
+	up_write(&inode->i_alloc_sem);
+	up(&inode->i_sem);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(vmtruncate_range);
+
 /* 
  * Primitive swap readahead code. We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
diff -Naurp -X dontdiff linux-2.6.14/mm/shmem.c linux-2.6.14.madv/mm/shmem.c
--- linux-2.6.14/mm/shmem.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/shmem.c	2005-11-02 03:03:55.000000000 -0800
@@ -459,7 +459,7 @@ static void shmem_free_pages(struct list
 	} while (next);
 }
 
-static void shmem_truncate(struct inode *inode)
+static void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	unsigned long idx;
@@ -477,18 +477,27 @@ static void shmem_truncate(struct inode 
 	long nr_swaps_freed = 0;
 	int offset;
 	int freed;
+	int punch_hole = 0;
 
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	idx = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	idx = (start + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (idx >= info->next_index)
 		return;
 
 	spin_lock(&info->lock);
 	info->flags |= SHMEM_TRUNCATE;
-	limit = info->next_index;
-	info->next_index = idx;
+	if (likely(end == (loff_t) -1)) {
+		limit = info->next_index;
+		info->next_index = idx;
+	} else {
+		limit = (end + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+		if (limit > info->next_index)
+			limit = info->next_index;
+		punch_hole = 1;
+	}
+
 	topdir = info->i_indirect;
-	if (topdir && idx <= SHMEM_NR_DIRECT) {
+	if (topdir && idx <= SHMEM_NR_DIRECT && !punch_hole) {
 		info->i_indirect = NULL;
 		nr_pages_to_free++;
 		list_add(&topdir->lru, &pages_to_free);
@@ -575,11 +584,12 @@ static void shmem_truncate(struct inode 
 			subdir->nr_swapped -= freed;
 			if (offset)
 				spin_unlock(&info->lock);
-			BUG_ON(subdir->nr_swapped > offset);
+			if (!punch_hole)
+				BUG_ON(subdir->nr_swapped > offset);
 		}
 		if (offset)
 			offset = 0;
-		else if (subdir) {
+		else if (subdir && !subdir->nr_swapped) {
 			dir[diroff] = NULL;
 			nr_pages_to_free++;
 			list_add(&subdir->lru, &pages_to_free);
@@ -596,7 +606,7 @@ done2:
 		 * Also, though shmem_getpage checks i_size before adding to
 		 * cache, no recheck after: so fix the narrow window there too.
 		 */
-		truncate_inode_pages(inode->i_mapping, inode->i_size);
+		truncate_inode_pages_range(inode->i_mapping, start, end);
 	}
 
 	spin_lock(&info->lock);
@@ -616,6 +626,11 @@ done2:
 	}
 }
 
+static void shmem_truncate(struct inode *inode)
+{
+	shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
+}
+
 static int shmem_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -2083,6 +2098,7 @@ static struct file_operations shmem_file
 static struct inode_operations shmem_inode_operations = {
 	.truncate	= shmem_truncate,
 	.setattr	= shmem_notify_change,
+	.truncate_range	= shmem_truncate_range,
 };
 
 static struct inode_operations shmem_dir_inode_operations = {
diff -Naurp -X dontdiff linux-2.6.14/mm/truncate.c linux-2.6.14.madv/mm/truncate.c
--- linux-2.6.14/mm/truncate.c	2005-10-27 17:02:08.000000000 -0700
+++ linux-2.6.14.madv/mm/truncate.c	2005-11-02 03:03:55.000000000 -0800
@@ -91,12 +91,15 @@ invalidate_complete_page(struct address_
 }
 
 /**
- * truncate_inode_pages - truncate *all* the pages from an offset
+ * truncate_inode_pages - truncate range of pages specified by start and
+ * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
+ * @lend: offset to which to truncate
  *
- * Truncate the page cache at a set offset, removing the pages that are beyond
- * that offset (and zeroing out partial pages).
+ * Truncate the page cache, removing the pages that are between
+ * specified offsets (and zeroing out partial page
+ * (if lstart is not page aligned)).
  *
  * Truncate takes two passes - the first pass is nonblocking.  It will not
  * block on page locks and it will not block on writeback.  The second pass
@@ -110,12 +113,12 @@ invalidate_complete_page(struct address_
  * We pass down the cache-hot hint to the page freeing code.  Even if the
  * mapping is large, it is probably the case that the final pages are the most
  * recently touched, and freeing happens in ascending file offset order.
- *
- * Called under (and serialised by) inode->i_sem.
  */
-void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+void truncate_inode_pages_range(struct address_space *mapping,
+				loff_t lstart, loff_t lend)
 {
 	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+	pgoff_t end;
 	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
 	struct pagevec pvec;
 	pgoff_t next;
@@ -124,13 +127,22 @@ void truncate_inode_pages(struct address
 	if (mapping->nrpages == 0)
 		return;
 
+	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
+	end = (lend  >> PAGE_CACHE_SHIFT);
+
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	while (next <= end &&
+	       pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			pgoff_t page_index = page->index;
 
+			if (page_index > end) {
+				next = page_index;
+				break;
+			}
+
 			if (page_index > next)
 				next = page_index;
 			next++;
@@ -166,9 +178,15 @@ void truncate_inode_pages(struct address
 			next = start;
 			continue;
 		}
+		if (pvec.pages[0]->index > end) {
+			pagevec_release(&pvec);
+			break;
+		}
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 
+			if (page->index > end)
+				break;
 			lock_page(page);
 			wait_on_page_writeback(page);
 			if (page->index > next)
@@ -180,7 +198,19 @@ void truncate_inode_pages(struct address
 		pagevec_release(&pvec);
 	}
 }
+EXPORT_SYMBOL(truncate_inode_pages_range);
 
+/**
+ * truncate_inode_pages - truncate *all* the pages from an offset
+ * @mapping: mapping to truncate
+ * @lstart: offset from which to truncate
+ *
+ * Called under (and serialised by) inode->i_sem.
+ */
+void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
+{
+	truncate_inode_pages_range(mapping, lstart, (loff_t)-1);
+}
 EXPORT_SYMBOL(truncate_inode_pages);
 
 /**

^ permalink raw reply	[flat|nested] 30+ messages in thread

* New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
@ 2005-11-02 19:54                     ` Blaisorblade
  2005-11-02 20:12                       ` Hugh Dickins
  2005-11-02 21:36                       ` Badari Pulavarty
  2005-11-12  0:25                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
  2005-11-12  0:34                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
  2 siblings, 2 replies; 30+ messages in thread
From: Blaisorblade @ 2005-11-02 19:54 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andrea Arcangeli, lkml, Hugh Dickins, akpm, dvhltc, linux-mm, Jeff Dike

On Wednesday 02 November 2005 17:12, Badari Pulavarty wrote:
> Hi Andrew & Andrea,
>
> Here is the updated patch with name change again :(
> Hopefully this would be final. (MADV_REMOVE).
>
> BTW, I am not sure if we need to hold i_sem and i_allocsem
> all the way ? I wanted to be safe - but this may be overkill ?
While looking into this, I probably found another problem, a race with 
install_page(), which doesn't use the seqlock-style check we use for 
everything else (aka do_no_page) but simply assumes a page is valid if its 
index is below the current file size.

This is clearly "truncate" specific, and is already racy. Suppose I truncate a 
file and reduce its size, and then re-extend it, the page which I previously 
fetched from the cache is invalid. The current install_page code generates 
corruption.

In fact the page is fetched from the caller of install_page and passed to it.

This affects anybody using MAP_POPULATE or using remap_file_pages.

> +       /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> +       down(&inode->i_sem);
> +       down_write(&inode->i_alloc_sem);
> +       unmap_mapping_range(mapping, offset, (end - offset), 1);
In my opinion, as already said, unmap_mapping_range can be called without 
these two locks, as it operates only on mappings for the file.

However currently it's called with these locks held in vmtruncate, but I think 
the locks are held in that case only because we need to truncate the file, 
and are hold in excess also across this call.

Instead, we need to protect against concurrent faults on the mapping (not 
against concurrent mmaps)...and that is done through (struct address_space*) 
mapping->truncate_count.

=====
Finally, there is MAP_POPULATE and other pre-faulting, i.e. install_page (no, 
this is not peculiar to VM_NONLINEAR, even if some code is shared), but 
install_page checks explicitly for truncation; the problem is that the check 
is rather bogus, compared to the rest of checks:

        /*
         * This page may have been truncated. Tell the
         * caller about it.
         */
        err = -EINVAL;
        inode = vma->vm_file->f_mapping->host;
        size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
        if (!page->mapping || page->index >= size)
                goto err_unlock;

I remember there being a BUG_ON and Linus fixing it up.

It should be converted to the normal checks used for the rest 
(->truncate_count based - see do_no_page()).

To do so, the caller (*_populate) needs to call again *_getpage, if 
install_page detects a race, but it must also save and pass the 
truncate_count.

So, we probably want need to cleanup and join {filemap,shmem}_populate 
together, because the only real difference between them is the function 
called to lookup the page from disk ({shmem,filemap}_getpage).

So, we should replace struct vm_operations_struct "populate" method with a 
"getpage" method, by using the shmem_getpage prototype, which is better 
engineered, see my comment:

        page = filemap_getpage(file, pgoff, nonblock);

        /* XXX: This is wrong, a filesystem I/O error may have happened. Fix 
that as
         * done in shmem_populate calling shmem_getpage */
        if (!page && !nonblock)
                return -ENOMEM;

> +       truncate_inode_pages_range(mapping, offset, end);
> +       inode->i_op->truncate_range(inode, offset, end);
> +       up_write(&inode->i_alloc_sem);
> +       up(&inode->i_sem);

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
@ 2005-11-02 20:12                       ` Hugh Dickins
  2005-11-02 20:45                         ` Hugh Dickins
  2005-11-02 21:36                       ` Badari Pulavarty
  1 sibling, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2005-11-02 20:12 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Badari Pulavarty, Andrea Arcangeli, lkml, akpm, dvhltc, linux-mm,
	Jeff Dike

On Wed, 2 Nov 2005, Blaisorblade wrote:
> While looking into this, I probably found another problem, a race with 
> install_page(), which doesn't use the seqlock-style check we use for 
> everything else (aka do_no_page) but simply assumes a page is valid if its 
> index is below the current file size.
> 
> This is clearly "truncate" specific, and is already racy. Suppose I truncate a 
> file and reduce its size, and then re-extend it, the page which I previously 
> fetched from the cache is invalid. The current install_page code generates 
> corruption.

No, it should be fine as is (unless perhaps some barrier is needed).

The check
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (!page->mapping || page->index >= size)
		goto err_unlock;
handles the case that worries you: page->mapping will be NULL.

do_no_page has to do the more complicated truncate_count business because
it deals with all kinds of ->nopage, some of which leave page->mapping NULL:
so it's unable to distinguish one where the driver left it NULL from one
where truncation has suddenly made it NULL.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 20:12                       ` Hugh Dickins
@ 2005-11-02 20:45                         ` Hugh Dickins
  0 siblings, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2005-11-02 20:45 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Badari Pulavarty, Andrea Arcangeli, lkml, akpm, dvhltc, linux-mm,
	Jeff Dike

On Wed, 2 Nov 2005, Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Blaisorblade wrote:
> 
> No, it should be fine as is (unless perhaps some barrier is needed).

We already have the barrier needed: we're holding page_table_lock (pte lock).

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
  2005-11-02 20:12                       ` Hugh Dickins
@ 2005-11-02 21:36                       ` Badari Pulavarty
  2005-11-02 21:55                         ` Hugh Dickins
  1 sibling, 1 reply; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-02 21:36 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Andrea Arcangeli, lkml, Hugh Dickins, akpm, dvhltc, linux-mm, Jeff Dike

On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> On Wednesday 02 November 2005 17:12, Badari Pulavarty wrote:
> > Hi Andrew & Andrea,
> >
> > Here is the updated patch with name change again :(
> > Hopefully this would be final. (MADV_REMOVE).
> >
> > BTW, I am not sure if we need to hold i_sem and i_allocsem
> > all the way ? I wanted to be safe - but this may be overkill ?
> While looking into this, I probably found another problem, a race with 
> install_page(), which doesn't use the seqlock-style check we use for 
> everything else (aka do_no_page) but simply assumes a page is valid if its 
> index is below the current file size.
> 
> This is clearly "truncate" specific, and is already racy. Suppose I truncate a 
> file and reduce its size, and then re-extend it, the page which I previously 
> fetched from the cache is invalid. The current install_page code generates 
> corruption.
> 
> In fact the page is fetched from the caller of install_page and passed to it.
> 
> This affects anybody using MAP_POPULATE or using remap_file_pages.
> 
> > +       /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > +       down(&inode->i_sem);
> > +       down_write(&inode->i_alloc_sem);
> > +       unmap_mapping_range(mapping, offset, (end - offset), 1);
> In my opinion, as already said, unmap_mapping_range can be called without 
> these two locks, as it operates only on mappings for the file.
> 
> However currently it's called with these locks held in vmtruncate, but I think 
> the locks are held in that case only because we need to truncate the file, 
> and are hold in excess also across this call.

I agree, I can push down the locking only for ->truncate_range - if
no one has objections. (But again, it so special case - no one really
cares about the performance of this interface ?).

Thanks,
Badari


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 21:36                       ` Badari Pulavarty
@ 2005-11-02 21:55                         ` Hugh Dickins
  2005-11-02 22:02                           ` Badari Pulavarty
  0 siblings, 1 reply; 30+ messages in thread
From: Hugh Dickins @ 2005-11-02 21:55 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Blaisorblade, Andrea Arcangeli, lkml, akpm, dvhltc, linux-mm, Jeff Dike

On Wed, 2 Nov 2005, Badari Pulavarty wrote:
> On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> > > +       /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > > +       down(&inode->i_sem);
> > > +       down_write(&inode->i_alloc_sem);
> > > +       unmap_mapping_range(mapping, offset, (end - offset), 1);
> > In my opinion, as already said, unmap_mapping_range can be called without 
> > these two locks, as it operates only on mappings for the file.
> > 
> > However currently it's called with these locks held in vmtruncate, but I think 
> > the locks are held in that case only because we need to truncate the file, 
> > and are hold in excess also across this call.
> 
> I agree, I can push down the locking only for ->truncate_range - if
> no one has objections. (But again, it so special case - no one really
> cares about the performance of this interface ?).

I can't remember why i_alloc_sem got introduced, and don't have time to
work it out: something to do with direct I/O races, perhaps?  Someone
else must advise, perhaps you will be able to drop that one.

But I think you'd be very unwise to drop i_sem too.  i_mmap_lock gets
dropped whenever preemption demands here, i_sem is what's preventing
someone else coming along and doing a concurrent truncate or remove.
You don't want that.

Sorry, I've not yet had time to study your patch: I do intend to,
but cannot promise when.  I fear it won't be as easy as making
these occasional responses.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE))
  2005-11-02 21:55                         ` Hugh Dickins
@ 2005-11-02 22:02                           ` Badari Pulavarty
  0 siblings, 0 replies; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-02 22:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Blaisorblade, Andrea Arcangeli, lkml, akpm, dvhltc, linux-mm, Jeff Dike

On Wed, 2005-11-02 at 21:55 +0000, Hugh Dickins wrote:
> On Wed, 2 Nov 2005, Badari Pulavarty wrote:
> > On Wed, 2005-11-02 at 20:54 +0100, Blaisorblade wrote:
> > > > +       /* XXX - Do we need both i_sem and i_allocsem all the way ? */
> > > > +       down(&inode->i_sem);
> > > > +       down_write(&inode->i_alloc_sem);
> > > > +       unmap_mapping_range(mapping, offset, (end - offset), 1);
> > > In my opinion, as already said, unmap_mapping_range can be called without 
> > > these two locks, as it operates only on mappings for the file.
> > > 
> > > However currently it's called with these locks held in vmtruncate, but I think 
> > > the locks are held in that case only because we need to truncate the file, 
> > > and are hold in excess also across this call.
> > 
> > I agree, I can push down the locking only for ->truncate_range - if
> > no one has objections. (But again, it so special case - no one really
> > cares about the performance of this interface ?).
> 
> I can't remember why i_alloc_sem got introduced, and don't have time to
> work it out: something to do with direct I/O races, perhaps?  Someone
> else must advise, perhaps you will be able to drop that one.

Yep. i_alloc_sem is supposed to protect DIO races with truncate.

> But I think you'd be very unwise to drop i_sem too.  i_mmap_lock gets
> dropped whenever preemption demands here, i_sem is what's preventing
> someone else coming along and doing a concurrent truncate or remove.
> You don't want that.
> 
> Sorry, I've not yet had time to study your patch: I do intend to,
> but cannot promise when.  I fear it won't be as easy as making
> these occasional responses.

Thanks Hugh. For now, I will leave those locks alone. We can re-visit
later, if we really care about the performance of this interface.
Better be safe than sorry.

Thanks,
Badari


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
  2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
@ 2005-11-12  0:25                     ` Andrew Morton
  2005-11-12  0:34                       ` Badari Pulavarty
  2005-11-12  0:34                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
  2 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2005-11-12  0:25 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: andrea, linux-kernel, hugh, dvhltc, linux-mm, blaisorblade, jdike

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> +/*
>  + * Application wants to free up the pages and associated backing store. 
>  + * This is effectively punching a hole into the middle of a file.
>  + *
>  + * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
>  + * Other filesystems return -ENOSYS.
>  + */
>  +static long madvise_remove(struct vm_area_struct * vma,
>  +			     unsigned long start, unsigned long end)
>  +{
>  +	struct address_space *mapping;
>  +        loff_t offset, endoff;
>  +
>  +	if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) 
>  +		return -EINVAL;
>  +
>  +	if (!vma->vm_file || !vma->vm_file->f_mapping 
>  +		|| !vma->vm_file->f_mapping->host) {
>  +			return -EINVAL;
>  +	}
>  +
>  +	mapping = vma->vm_file->f_mapping;
>  +	if (mapping == &swapper_space) {
>  +		return -EINVAL;
>  +	}
>  +
>  +	offset = (loff_t)(start - vma->vm_start) 
>  +			+ (vma->vm_pgoff << PAGE_SHIFT);
>  +	endoff = (loff_t)(end - vma->vm_start - 1) 
>  +			+ (vma->vm_pgoff << PAGE_SHIFT);
>  +	return  vmtruncate_range(mapping->host, offset, endoff);
>  +}
>  +

I'm suspecting you tested this on a 64-bit machine, yes?  On 32-bit that
vm_pgoff shift is going to overflow.  

Fixes-thus-far below.   Please rerun all tests on x86?

Why does madvise_remove() have an explicit check for swapper_space?

In your testing, how are you determining that the code is successfully
removing the correct number of pages, from the correct file offset?


diff -puN mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/madvise.c
--- devel/mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy	2005-11-11 16:12:43.000000000 -0800
+++ devel-akpm/mm/madvise.c	2005-11-11 16:16:50.000000000 -0800
@@ -147,8 +147,8 @@ static long madvise_dontneed(struct vm_a
  * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
  * Other filesystems return -ENOSYS.
  */
-static long madvise_remove(struct vm_area_struct * vma,
-			     unsigned long start, unsigned long end)
+static long madvise_remove(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
 {
 	struct address_space *mapping;
         loff_t offset, endoff;
@@ -162,14 +162,13 @@ static long madvise_remove(struct vm_are
 	}
 
 	mapping = vma->vm_file->f_mapping;
-	if (mapping == &swapper_space) {
+	if (mapping == &swapper_space)
 		return -EINVAL;
-	}
 
 	offset = (loff_t)(start - vma->vm_start)
-			+ (vma->vm_pgoff << PAGE_SHIFT);
+			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 	endoff = (loff_t)(end - vma->vm_start - 1)
-			+ (vma->vm_pgoff << PAGE_SHIFT);
+			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 	return  vmtruncate_range(mapping->host, offset, endoff);
 }
 
diff -puN mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/memory.c
--- devel/mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy	2005-11-11 16:16:54.000000000 -0800
+++ devel-akpm/mm/memory.c	2005-11-11 16:17:59.000000000 -0800
@@ -1608,10 +1608,9 @@ out_big:
 out_busy:
 	return -ETXTBSY;
 }
-
 EXPORT_SYMBOL(vmtruncate);
 
-int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
+int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
 {
 	struct address_space *mapping = inode->i_mapping;
 
@@ -1634,7 +1633,6 @@ int vmtruncate_range(struct inode * inod
 
 	return 0;
 }
-
 EXPORT_SYMBOL(vmtruncate_range);
 
 /* 
_


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
  2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
  2005-11-12  0:25                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
@ 2005-11-12  0:34                     ` Andrew Morton
  2 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2005-11-12  0:34 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: andrea, linux-kernel, hugh, dvhltc, linux-mm, blaisorblade, jdike

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> +int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
>  +{
>  +	struct address_space *mapping = inode->i_mapping;
>  +
>  +	/*
>  +	 * If the underlying filesystem is not going to provide 
>  +	 * a way to truncate a range of blocks (punch a hole) - 
>  +	 * we should return failure right now.
>  +	 */
>  +	if (!inode->i_op || !inode->i_op->truncate_range)
>  +		return -ENOSYS;
>  +		
>  +	/* XXX - Do we need both i_sem and i_allocsem all the way ? */
>  +	down(&inode->i_sem);
>  +	down_write(&inode->i_alloc_sem);
>  +	unmap_mapping_range(mapping, offset, (end - offset), 1);
>  +	truncate_inode_pages_range(mapping, offset, end);
>  +	inode->i_op->truncate_range(inode, offset, end);
>  +	up_write(&inode->i_alloc_sem);
>  +	up(&inode->i_sem);
>  +
>  +	return 0;
>  +}

Yes, we need to take i_alloc_sem for writing.  To prevent concurrent
direct-io reads from coming in and instantiated by unwritten blocks.

tmpfs doesn't implements direct-io though.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-12  0:25                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
@ 2005-11-12  0:34                       ` Badari Pulavarty
  2005-11-12  1:43                         ` Andrew Morton
  0 siblings, 1 reply; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-12  0:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: andrea, lkml, hugh, dvhltc, linux-mm, blaisorblade, jdike

On Fri, 2005-11-11 at 16:25 -0800, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > +/*
> >  + * Application wants to free up the pages and associated backing store. 
> >  + * This is effectively punching a hole into the middle of a file.
> >  + *
> >  + * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
> >  + * Other filesystems return -ENOSYS.
> >  + */
> >  +static long madvise_remove(struct vm_area_struct * vma,
> >  +			     unsigned long start, unsigned long end)
> >  +{
> >  +	struct address_space *mapping;
> >  +        loff_t offset, endoff;
> >  +
> >  +	if (vma->vm_flags & (VM_LOCKED|VM_NONLINEAR|VM_HUGETLB)) 
> >  +		return -EINVAL;
> >  +
> >  +	if (!vma->vm_file || !vma->vm_file->f_mapping 
> >  +		|| !vma->vm_file->f_mapping->host) {
> >  +			return -EINVAL;
> >  +	}
> >  +
> >  +	mapping = vma->vm_file->f_mapping;
> >  +	if (mapping == &swapper_space) {
> >  +		return -EINVAL;
> >  +	}
> >  +
> >  +	offset = (loff_t)(start - vma->vm_start) 
> >  +			+ (vma->vm_pgoff << PAGE_SHIFT);
> >  +	endoff = (loff_t)(end - vma->vm_start - 1) 
> >  +			+ (vma->vm_pgoff << PAGE_SHIFT);
> >  +	return  vmtruncate_range(mapping->host, offset, endoff);
> >  +}
> >  +
> 
> I'm suspecting you tested this on a 64-bit machine, yes?  On 32-bit that
> vm_pgoff shift is going to overflow.  

Yes. I have moved to all 64-bit (amd64, em64t, ppc64) machines. My bad.

> 
> Fixes-thus-far below.   Please rerun all tests on x86?
> 

I will verify. Thanks.

> Why does madvise_remove() have an explicit check for swapper_space?

I really don't remember (I yanked code from some other kernel routine
vmtruncate()). If you think its unnecessary, I can take it out.

> In your testing, how are you determining that the code is successfully
> removing the correct number of pages, from the correct file offset?

I verified with test programs, added debug printk + looked through live
"crash" session + verified with UML testcases.

> 
> diff -puN mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/madvise.c
> --- devel/mm/madvise.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy	2005-11-11 16:12:43.000000000 -0800
> +++ devel-akpm/mm/madvise.c	2005-11-11 16:16:50.000000000 -0800
> @@ -147,8 +147,8 @@ static long madvise_dontneed(struct vm_a
>   * NOTE: Currently, only shmfs/tmpfs is supported for this operation.
>   * Other filesystems return -ENOSYS.
>   */
> -static long madvise_remove(struct vm_area_struct * vma,
> -			     unsigned long start, unsigned long end)
> +static long madvise_remove(struct vm_area_struct *vma,
> +				unsigned long start, unsigned long end)
>  {
>  	struct address_space *mapping;
>          loff_t offset, endoff;
> @@ -162,14 +162,13 @@ static long madvise_remove(struct vm_are
>  	}
>  
>  	mapping = vma->vm_file->f_mapping;
> -	if (mapping == &swapper_space) {
> +	if (mapping == &swapper_space)
>  		return -EINVAL;
> -	}
>  
>  	offset = (loff_t)(start - vma->vm_start)
> -			+ (vma->vm_pgoff << PAGE_SHIFT);
> +			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>  	endoff = (loff_t)(end - vma->vm_start - 1)
> -			+ (vma->vm_pgoff << PAGE_SHIFT);
> +			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>  	return  vmtruncate_range(mapping->host, offset, endoff);
>  }
>  
> diff -puN mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy mm/memory.c
> --- devel/mm/memory.c~madvise-remove-remove-pages-from-tmpfs-shm-backing-store-tidy	2005-11-11 16:16:54.000000000 -0800
> +++ devel-akpm/mm/memory.c	2005-11-11 16:17:59.000000000 -0800
> @@ -1608,10 +1608,9 @@ out_big:
>  out_busy:
>  	return -ETXTBSY;
>  }
> -
>  EXPORT_SYMBOL(vmtruncate);
>  
> -int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end)
> +int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  
> @@ -1634,7 +1633,6 @@ int vmtruncate_range(struct inode * inod
>  
>  	return 0;
>  }
> -
>  EXPORT_SYMBOL(vmtruncate_range);
>  
>  /* 
> _
> 
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-12  0:34                       ` Badari Pulavarty
@ 2005-11-12  1:43                         ` Andrew Morton
  2005-11-12  4:41                           ` Badari Pulavarty
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2005-11-12  1:43 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: andrea, linux-kernel, hugh, dvhltc, linux-mm, blaisorblade, jdike

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > Why does madvise_remove() have an explicit check for swapper_space?
> 
> I really don't remember (I yanked code from some other kernel routine
> vmtruncate()).

I don't see such a thing anywhere.  vmtruncate() has the IS_SWAPFILE()
test, which I guess vmtruncate_range() ought to have too, for
future-safety.

Logically, vmtruncate() should just be a special case of vmtruncate_range().
But it's not - ugly, but hard to do anything about (need to implement
->truncate_range in all filesystems, but "know" which ones only support
->truncate_range() at eof).

> 
> > In your testing, how are you determining that the code is successfully
> > removing the correct number of pages, from the correct file offset?
> 
> I verified with test programs, added debug printk + looked through live
> "crash" session + verified with UML testcases.

OK, well please be sure to test it on 32-bit and 64-bit, operating in three
ranges of the file: <2G, 2G-4G amd >4G.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)
  2005-11-12  1:43                         ` Andrew Morton
@ 2005-11-12  4:41                           ` Badari Pulavarty
  2006-01-16 13:06                             ` differences between MADV_FREE and MADV_DONTNEED Andrea Arcangeli
  0 siblings, 1 reply; 30+ messages in thread
From: Badari Pulavarty @ 2005-11-12  4:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andrea, linux-kernel, hugh, dvhltc, linux-mm, blaisorblade, jdike

Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> 
>>>Why does madvise_remove() have an explicit check for swapper_space?
>>
>>I really don't remember (I yanked code from some other kernel routine
>>vmtruncate()).
> 
> 
> I don't see such a thing anywhere.  vmtruncate() has the IS_SWAPFILE()
> test, which I guess vmtruncate_range() ought to have too, for
> future-safety.

Yep. That was the check. Since I don't have inode and have mapping
handy anyway, check was made using that. I could change it, if you wish.

> 
> Logically, vmtruncate() should just be a special case of vmtruncate_range().
> But it's not - ugly, but hard to do anything about (need to implement
> ->truncate_range in all filesystems, but "know" which ones only support
> ->truncate_range() at eof).
> 
> 
>>>In your testing, how are you determining that the code is successfully
>>>removing the correct number of pages, from the correct file offset?
>>
>>I verified with test programs, added debug printk + looked through live
>>"crash" session + verified with UML testcases.
> 
> 
> OK, well please be sure to test it on 32-bit and 64-bit, operating in three
> ranges of the file: <2G, 2G-4G amd >4G.
> 
Will do.

Thanks,
Badari


^ permalink raw reply	[flat|nested] 30+ messages in thread

* differences between MADV_FREE and MADV_DONTNEED
  2005-11-12  4:41                           ` Badari Pulavarty
@ 2006-01-16 13:06                             ` Andrea Arcangeli
  2006-01-16 16:02                               ` Suleiman Souhlal
  2006-01-17  1:06                               ` Blaisorblade
  0 siblings, 2 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2006-01-16 13:06 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andrew Morton, linux-kernel, hugh, dvhltc, linux-mm, blaisorblade, jdike

Now that MADV_REMOVE is in, should we discuss MADV_FREE?

MADV_FREE in Solaris is destructive and only works on anonymous memory,
while MADV_DONTNEED seems to never be destructive (which I assume it
means it's a noop on anonymous memory).

Our MADV_DONTNEED is destructive on anonymous memory, while it's
non-destructive on file mappings.

Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
MADV_FREE?

Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
the same way (that still leaves the question if we risk to break apps
ported from solaris where MADV_DONTNEED is apparently always not
destructive).

I only read the docs, I don't know in practice what MADV_DONTNEED does
on solaris (does it return -EINVAL if run on anonymous memory or not?).

http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view

BTW, I don't know how other specifications define MADV_FREE, but besides
MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
this is why I'm asking. (right now I'm telling them to use #ifdef
__linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
it could break if we make MADV_DONTNEED non-destructive in the future)

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 13:06                             ` differences between MADV_FREE and MADV_DONTNEED Andrea Arcangeli
@ 2006-01-16 16:02                               ` Suleiman Souhlal
  2006-01-16 16:28                                 ` Andrea Arcangeli
  2006-01-17  1:06                               ` Blaisorblade
  1 sibling, 1 reply; 30+ messages in thread
From: Suleiman Souhlal @ 2006-01-16 16:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, blaisorblade, jdike

Andrea Arcangeli wrote:
> Now that MADV_REMOVE is in, should we discuss MADV_FREE?
> 
> MADV_FREE in Solaris is destructive and only works on anonymous memory,
> while MADV_DONTNEED seems to never be destructive (which I assume it
> means it's a noop on anonymous memory).

FWIW, in FreeBSD, MADV_DONTNEED is not destructive, and just makes pages 
(including anonymous ones) more likely to get swapped out.

> Our MADV_DONTNEED is destructive on anonymous memory, while it's
> non-destructive on file mappings.
> 
> Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
> MADV_FREE?

This would seem like the best way to go, since it would bring Linux's 
behavior more in line with what other systems do.

> Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
> the same way (that still leaves the question if we risk to break apps
> ported from solaris where MADV_DONTNEED is apparently always not
> destructive).
> 
> I only read the docs, I don't know in practice what MADV_DONTNEED does
> on solaris (does it return -EINVAL if run on anonymous memory or not?).
> 
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view
> 
> BTW, I don't know how other specifications define MADV_FREE, but besides
> MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
> this is why I'm asking. (right now I'm telling them to use #ifdef
> __linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
> it could break if we make MADV_DONTNEED non-destructive in the future)

FreeBSD's MADV_FREE only works on anonymous memory (it's a noop for 
vnode-backed memory), and marks the pages clean before moving them to 
the inactive queue, so that they can be freed or reused quickly, without 
causing a pagefault.

-- Suleiman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 16:02                               ` Suleiman Souhlal
@ 2006-01-16 16:28                                 ` Andrea Arcangeli
  2006-01-16 17:03                                   ` Suleiman Souhlal
  0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2006-01-16 16:28 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, blaisorblade, jdike

On Mon, Jan 16, 2006 at 08:02:07AM -0800, Suleiman Souhlal wrote:
> FWIW, in FreeBSD, MADV_DONTNEED is not destructive, and just makes pages 
> (including anonymous ones) more likely to get swapped out.

We can also use it for the same purpose, we could add the pages to
swapcache mark them dirty and zap the ptes _after_ that.

> This would seem like the best way to go, since it would bring Linux's 
> behavior more in line with what other systems do.

Agreed.

> FreeBSD's MADV_FREE only works on anonymous memory (it's a noop for 
> vnode-backed memory), and marks the pages clean before moving them to 
> the inactive queue, so that they can be freed or reused quickly, without 
> causing a pagefault.

Well, perhaps solaris is also a noop and not necessairly a -EINVAL, all
I know from the docs is "This value cannot be used on mappings that have
underlying file objects.", so I expected -EINVAL but it may be a noop.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 16:28                                 ` Andrea Arcangeli
@ 2006-01-16 17:03                                   ` Suleiman Souhlal
  2006-01-16 17:24                                     ` Andrea Arcangeli
  0 siblings, 1 reply; 30+ messages in thread
From: Suleiman Souhlal @ 2006-01-16 17:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, blaisorblade, jdike

Andrea Arcangeli wrote:

> We can also use it for the same purpose, we could add the pages to
> swapcache mark them dirty and zap the ptes _after_ that.

Wouldn't that cause the pages to get swapped out immediately?
If so, I don't think this would be the best approach: It would be better 
  to just move the pages to the inactive list, if they aren't there 
already, so that they get swapped out only when they really need to be.

-- Suleiman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 17:03                                   ` Suleiman Souhlal
@ 2006-01-16 17:24                                     ` Andrea Arcangeli
  2006-01-16 21:43                                       ` Eric W. Biederman
  0 siblings, 1 reply; 30+ messages in thread
From: Andrea Arcangeli @ 2006-01-16 17:24 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, blaisorblade, jdike

On Mon, Jan 16, 2006 at 09:03:00AM -0800, Suleiman Souhlal wrote:
> Andrea Arcangeli wrote:
> 
> >We can also use it for the same purpose, we could add the pages to
> >swapcache mark them dirty and zap the ptes _after_ that.
> 
> Wouldn't that cause the pages to get swapped out immediately?

Not really, it would be a non blocking operation. But they could be
swapped out shortly later (that's the whole point of DONTNEED, right?),
once there is more memory pressure. Otherwise if they're used again, a
minor fault will happen and it will find the swapcache uptodate in ram.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 17:24                                     ` Andrea Arcangeli
@ 2006-01-16 21:43                                       ` Eric W. Biederman
  2006-01-17  0:24                                         ` Suleiman Souhlal
  0 siblings, 1 reply; 30+ messages in thread
From: Eric W. Biederman @ 2006-01-16 21:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Suleiman Souhlal, Badari Pulavarty, Andrew Morton, linux-kernel,
	hugh, dvhltc, linux-mm, blaisorblade, jdike

Andrea Arcangeli <andrea@suse.de> writes:

> On Mon, Jan 16, 2006 at 09:03:00AM -0800, Suleiman Souhlal wrote:
>> Andrea Arcangeli wrote:
>> 
>> >We can also use it for the same purpose, we could add the pages to
>> >swapcache mark them dirty and zap the ptes _after_ that.
>> 
>> Wouldn't that cause the pages to get swapped out immediately?
>
> Not really, it would be a non blocking operation. But they could be
> swapped out shortly later (that's the whole point of DONTNEED, right?),
> once there is more memory pressure. Otherwise if they're used again, a
> minor fault will happen and it will find the swapcache uptodate in ram.

As I recall the logic with DONTNEED was to mark the mapping of
the page clean so the page didn't need to be swapped out, it could
just be dropped.

That is why they anonymous and the file backed cases differ.

Part of the point is to avoid the case of swapping the pages out if
the application doesn't care what is on them anymore.

Eric

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 21:43                                       ` Eric W. Biederman
@ 2006-01-17  0:24                                         ` Suleiman Souhlal
  2006-01-17  1:04                                           ` Nicholas Miell
  0 siblings, 1 reply; 30+ messages in thread
From: Suleiman Souhlal @ 2006-01-17  0:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrea Arcangeli, Badari Pulavarty, Andrew Morton, linux-kernel,
	hugh, dvhltc, linux-mm, blaisorblade, jdike

Eric W. Biederman wrote:
> As I recall the logic with DONTNEED was to mark the mapping of
> the page clean so the page didn't need to be swapped out, it could
> just be dropped.
> 
> That is why they anonymous and the file backed cases differ.
> 
> Part of the point is to avoid the case of swapping the pages out if
> the application doesn't care what is on them anymore.

Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
and MADV_FREE "I will never need this again".

-- Suleiman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17  0:24                                         ` Suleiman Souhlal
@ 2006-01-17  1:04                                           ` Nicholas Miell
  2006-01-17 12:43                                             ` Christoph Hellwig
  0 siblings, 1 reply; 30+ messages in thread
From: Nicholas Miell @ 2006-01-17  1:04 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Ulrich Drepper, Eric W. Biederman, Andrea Arcangeli,
	Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, blaisorblade, jdike

On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> Eric W. Biederman wrote:
> > As I recall the logic with DONTNEED was to mark the mapping of
> > the page clean so the page didn't need to be swapped out, it could
> > just be dropped.
> > 
> > That is why they anonymous and the file backed cases differ.
> > 
> > Part of the point is to avoid the case of swapping the pages out if
> > the application doesn't care what is on them anymore.
> 
> Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
> and MADV_FREE "I will never need this again".
> 

POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
flags defined as follows:

POSIX_MADV_NORMAL
   Specifies that the application has no advice to give on its behavior
with respect to the specified range. It is the default characteristic if
no advice is given for a range of memory.
POSIX_MADV_SEQUENTIAL
   Specifies that the application expects to access the specified range
sequentially from lower addresses to higher addresses.
POSIX_MADV_RANDOM
   Specifies that the application expects to access the specified range
in a random order.
POSIX_MADV_WILLNEED
   Specifies that the application expects to access the specified range
in the near future.
POSIX_MADV_DONTNEED
   Specifies that the application expects that it will not access the
specified range in the near future.

Note that glibc forwards posix_madvise() directly to madvise(2), which
means that right now, POSIX conformant apps which use
posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
data on Linux systems.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-16 13:06                             ` differences between MADV_FREE and MADV_DONTNEED Andrea Arcangeli
  2006-01-16 16:02                               ` Suleiman Souhlal
@ 2006-01-17  1:06                               ` Blaisorblade
  2006-01-17  1:33                                 ` Andrea Arcangeli
  1 sibling, 1 reply; 30+ messages in thread
From: Blaisorblade @ 2006-01-17  1:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, jdike

On Monday 16 January 2006 14:06, Andrea Arcangeli wrote:
> Now that MADV_REMOVE is in, should we discuss MADV_FREE?

> MADV_FREE in Solaris is destructive and only works on anonymous memory,

I.e. it's a restriction of MADV_REMOVE. Is there anything conceivable relying 
on errors or no behaviour on file-backed memory? If relying on errors we 
could need an API, but if relying only on the NO-OP thing the correctness 
semantics are already implemented. I.e. data are retained on both Solaris 
MADV_FREE and Linux MADV_REMOVE for file-backed case, they get a different 
semantics for caching.

> while MADV_DONTNEED seems to never be destructive (which I assume it
> means it's a noop on anonymous memory).

It could be a "swap it out", as mentioned in Linux comments on our madvise 
semantics about "other Unices".

> Our MADV_DONTNEED is destructive on anonymous memory, while it's
> non-destructive on file mappings.

Indeed, not even that. See our madvise_dontneed() comment - dirty data are 
discarded in both cases, and the comment suggests msync(MS_INVALIDATE). It 
also speaks of "other implementation", which could also refer to Solaris.

> Perhaps we could move the destructive anonymous part of MADV_DONTNEED to
> MADV_FREE?

Why changing existing apps behaviour? That's nonsense, unless you have a 
standard. Indeed, however, posix_madvise exists, and it's DONTNEED semantics 
are the Solaris ones. Don't know past behaviour about "breaking existing to 
comply to standards" (new syscall slot?).

> Or we could as well go relaxed and define MADV_FREE and MADV_DONTNEED
> the same way (that still leaves the question if we risk to break apps
> ported from solaris where MADV_DONTNEED is apparently always not
> destructive).

Provide our fine-grained semantics with new, not misunderstandable identifiers 
(MADV_FREE_DISCARD, MADV_FREE_CACHE, for instance).

For current names, libc could provide a "let user choose the meaning of 
things", like it does for signals with _BSD_SOURCE, _POSIX_SOURCE and so on.

> I only read the docs, I don't know in practice what MADV_DONTNEED does
> on solaris (does it return -EINVAL if run on anonymous memory or not?).

> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrgk?a=view

> BTW, I don't know how other specifications define MADV_FREE, but besides
> MADV_REMOVE I've also got the request to provide MADV_FREE in linux,
> this is why I'm asking. (right now I'm telling them to use #ifdef
> __linux__ #define MADV_FREE MADV_DONTNEED but that's quite an hack since
> it could break if we make MADV_DONTNEED non-destructive in the future)

Making their apps work by causing the same breakage to Linux apps is a better 
idea?

> Thanks.

-- 
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

	

	
		
___________________________________ 
Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB 
http://mail.yahoo.it

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17  1:06                               ` Blaisorblade
@ 2006-01-17  1:33                                 ` Andrea Arcangeli
  0 siblings, 0 replies; 30+ messages in thread
From: Andrea Arcangeli @ 2006-01-17  1:33 UTC (permalink / raw)
  To: Blaisorblade
  Cc: Badari Pulavarty, Andrew Morton, linux-kernel, hugh, dvhltc,
	linux-mm, jdike

On Tue, Jan 17, 2006 at 02:06:09AM +0100, Blaisorblade wrote:
> I.e. it's a restriction of MADV_REMOVE. Is there anything conceivable
> relying on errors or no behaviour on file-backed memory? If relying on
> errors we could need an API, but if relying only on the NO-OP thing the
> correctness semantics are already implemented. I.e. data are retained on both
> Solaris MADV_FREE and Linux MADV_REMOVE for file-backed case, they get a
> different semantics for caching.

Not sure to understand but merging MADV_REMOVE into MADV_FREE apparently
would break freebsd apps that might expect a noop instead. And it could
break Solaris apps if they execpt a -EINVAL (though the latter is more
dubious, but I doubt making differences is worth it and if freebsd makes
it a noop I'd stick with the noop and leave MADV_REMOVE alone).

> are the Solaris ones. Don't know past behaviour about "breaking existing to 
> comply to standards" (new syscall slot?).

The change I suggested would be backwards compatible because it can only
affect performance.

The only thing that can break right now, is running a non-linux (and
apparently posix too) app on a linux system that will corrupt memory
with potential data loss.

> Provide our fine-grained semantics with new, not misunderstandable identifiers 
> (MADV_FREE_DISCARD, MADV_FREE_CACHE, for instance).

Why should we deviate for the sake of porting pain, when we can comply
at no tangible risk for us?

> Making their apps work by causing the same breakage to Linux apps is a better 
> idea?

Again: if an app breaks it means it's working by pure luck because it's
depending on fragile timings in the first place.

Call it a potential lower performance or less efficient memory
utilization, a breakage not.

If we were to make MADV_DONTNEED more aggressive, then we'd be risking a
breakage, but we're going to relax it instead.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17  1:04                                           ` Nicholas Miell
@ 2006-01-17 12:43                                             ` Christoph Hellwig
  2006-01-17 18:23                                               ` Eric W. Biederman
  2006-01-17 19:06                                               ` Badari Pulavarty
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2006-01-17 12:43 UTC (permalink / raw)
  To: Nicholas Miell
  Cc: Suleiman Souhlal, Ulrich Drepper, Eric W. Biederman,
	Andrea Arcangeli, Badari Pulavarty, Andrew Morton, linux-kernel,
	hugh, dvhltc, linux-mm, blaisorblade, jdike, akpm

On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> > Eric W. Biederman wrote:
> > > As I recall the logic with DONTNEED was to mark the mapping of
> > > the page clean so the page didn't need to be swapped out, it could
> > > just be dropped.
> > > 
> > > That is why they anonymous and the file backed cases differ.
> > > 
> > > Part of the point is to avoid the case of swapping the pages out if
> > > the application doesn't care what is on them anymore.
> > 
> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
> > and MADV_FREE "I will never need this again".
> > 
> 
> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> flags defined as follows:
> 
> POSIX_MADV_NORMAL
>    Specifies that the application has no advice to give on its behavior
> with respect to the specified range. It is the default characteristic if
> no advice is given for a range of memory.
> POSIX_MADV_SEQUENTIAL
>    Specifies that the application expects to access the specified range
> sequentially from lower addresses to higher addresses.
> POSIX_MADV_RANDOM
>    Specifies that the application expects to access the specified range
> in a random order.
> POSIX_MADV_WILLNEED
>    Specifies that the application expects to access the specified range
> in the near future.
> POSIX_MADV_DONTNEED
>    Specifies that the application expects that it will not access the
> specified range in the near future.
> 
> Note that glibc forwards posix_madvise() directly to madvise(2), which
> means that right now, POSIX conformant apps which use
> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> data on Linux systems.

Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?

In either case I'd say we should backout this patch for now.  We should
implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
but that's 2.6.17 material.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17 12:43                                             ` Christoph Hellwig
@ 2006-01-17 18:23                                               ` Eric W. Biederman
  2006-01-17 22:55                                                 ` Nicholas Miell
  2007-03-01 18:11                                                 ` Samuel Thibault
  2006-01-17 19:06                                               ` Badari Pulavarty
  1 sibling, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2006-01-17 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nicholas Miell, Suleiman Souhlal, Ulrich Drepper,
	Andrea Arcangeli, Badari Pulavarty, Andrew Morton, linux-kernel,
	hugh, dvhltc, linux-mm, blaisorblade, jdike

Christoph Hellwig <hch@infradead.org> writes:

> On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
>> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
>> > Eric W. Biederman wrote:
>> > > As I recall the logic with DONTNEED was to mark the mapping of
>> > > the page clean so the page didn't need to be swapped out, it could
>> > > just be dropped.
>> > > 
>> > > That is why they anonymous and the file backed cases differ.
>> > > 
>> > > Part of the point is to avoid the case of swapping the pages out if
>> > > the application doesn't care what is on them anymore.
>> > 
>> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
>> > and MADV_FREE "I will never need this again".
>> > 
>> 
>> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
>> flags defined as follows:
>> 
>> POSIX_MADV_NORMAL
>>    Specifies that the application has no advice to give on its behavior
>> with respect to the specified range. It is the default characteristic if
>> no advice is given for a range of memory.
>> POSIX_MADV_SEQUENTIAL
>>    Specifies that the application expects to access the specified range
>> sequentially from lower addresses to higher addresses.
>> POSIX_MADV_RANDOM
>>    Specifies that the application expects to access the specified range
>> in a random order.
>> POSIX_MADV_WILLNEED
>>    Specifies that the application expects to access the specified range
>> in the near future.
>> POSIX_MADV_DONTNEED
>>    Specifies that the application expects that it will not access the
>> specified range in the near future.
>> 
>> Note that glibc forwards posix_madvise() directly to madvise(2), which
>> means that right now, POSIX conformant apps which use
>> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
>> data on Linux systems.
>
> Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
>
> In either case I'd say we should backout this patch for now.  We should
> implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> but that's 2.6.17 material.

We definitely need to check this.  I am fairly certain  I have seen this conversation
before.

Eric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17 12:43                                             ` Christoph Hellwig
  2006-01-17 18:23                                               ` Eric W. Biederman
@ 2006-01-17 19:06                                               ` Badari Pulavarty
  1 sibling, 0 replies; 30+ messages in thread
From: Badari Pulavarty @ 2006-01-17 19:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nicholas Miell, Suleiman Souhlal, Ulrich Drepper,
	Eric W. Biederman, Andrea Arcangeli, Andrew Morton, lkml, hugh,
	dvhltc, linux-mm, blaisorblade, jdike

On Tue, 2006-01-17 at 12:43 +0000, Christoph Hellwig wrote:
> On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> > On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> > > Eric W. Biederman wrote:
> > > > As I recall the logic with DONTNEED was to mark the mapping of
> > > > the page clean so the page didn't need to be swapped out, it could
> > > > just be dropped.
> > > > 
> > > > That is why they anonymous and the file backed cases differ.
> > > > 
> > > > Part of the point is to avoid the case of swapping the pages out if
> > > > the application doesn't care what is on them anymore.
> > > 
> > > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
> > > and MADV_FREE "I will never need this again".
> > > 
> > 
> > POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> > flags defined as follows:
> > 
> > POSIX_MADV_NORMAL
> >    Specifies that the application has no advice to give on its behavior
> > with respect to the specified range. It is the default characteristic if
> > no advice is given for a range of memory.
> > POSIX_MADV_SEQUENTIAL
> >    Specifies that the application expects to access the specified range
> > sequentially from lower addresses to higher addresses.
> > POSIX_MADV_RANDOM
> >    Specifies that the application expects to access the specified range
> > in a random order.
> > POSIX_MADV_WILLNEED
> >    Specifies that the application expects to access the specified range
> > in the near future.
> > POSIX_MADV_DONTNEED
> >    Specifies that the application expects that it will not access the
> > specified range in the near future.
> > 
> > Note that glibc forwards posix_madvise() directly to madvise(2), which
> > means that right now, POSIX conformant apps which use
> > posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> > data on Linux systems.
> 
> Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
> 
> In either case I'd say we should backout this patch for now.  We should
> implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> but that's 2.6.17 material.

Christoph,

What patch are you recommending backing out ? 

Thanks,
Badari


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17 18:23                                               ` Eric W. Biederman
@ 2006-01-17 22:55                                                 ` Nicholas Miell
  2007-03-01 18:11                                                 ` Samuel Thibault
  1 sibling, 0 replies; 30+ messages in thread
From: Nicholas Miell @ 2006-01-17 22:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christoph Hellwig, Suleiman Souhlal, Ulrich Drepper,
	Andrea Arcangeli, Badari Pulavarty, Andrew Morton, linux-kernel,
	hugh, dvhltc, linux-mm, blaisorblade, jdike

On Tue, 2006-01-17 at 11:23 -0700, Eric W. Biederman wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> > On Mon, Jan 16, 2006 at 05:04:07PM -0800, Nicholas Miell wrote:
> >> On Mon, 2006-01-16 at 16:24 -0800, Suleiman Souhlal wrote:
> >> > Well, imho, MADV_DONTNEED should mean "I won't need this anytime soon", 
> >> > and MADV_FREE "I will never need this again".
> >> > 
> >> 
> >> POSIX doesn't have a madvise(), but it does have a posix_madvise(), with
> >> flags defined as follows:
> >> 
> >> POSIX_MADV_NORMAL
> >>    Specifies that the application has no advice to give on its behavior
> >> with respect to the specified range. It is the default characteristic if
> >> no advice is given for a range of memory.
> >> POSIX_MADV_SEQUENTIAL
> >>    Specifies that the application expects to access the specified range
> >> sequentially from lower addresses to higher addresses.
> >> POSIX_MADV_RANDOM
> >>    Specifies that the application expects to access the specified range
> >> in a random order.
> >> POSIX_MADV_WILLNEED
> >>    Specifies that the application expects to access the specified range
> >> in the near future.
> >> POSIX_MADV_DONTNEED
> >>    Specifies that the application expects that it will not access the
> >> specified range in the near future.
> >> 
> >> Note that glibc forwards posix_madvise() directly to madvise(2), which
> >> means that right now, POSIX conformant apps which use
> >> posix_madvise(addr, len, POSIX_MADV_DONTNEED) are silently corrupting
> >> data on Linux systems.
> >
> > Does our MAD_DONTNEED numerical value match glibc's POSIX_MADV_DONTNEED?
> >
> > In either case I'd say we should backout this patch for now.  We should
> > implement a real MADV_DONTNEED and rename the current one to MADV_FREE,
> > but that's 2.6.17 material.
> 
> We definitely need to check this.  I am fairly certain  I have seen this conversation
> before.

Yes, POSIX_MADV_* have the same values as MADV_*. And if you're trying
to find the actual implementation of posix_madvise() to verify its
behavior, it is generated by script from a line in
libc/sysdeps/unix/sysv/linux/syscalls.list.

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: differences between MADV_FREE and MADV_DONTNEED
  2006-01-17 18:23                                               ` Eric W. Biederman
  2006-01-17 22:55                                                 ` Nicholas Miell
@ 2007-03-01 18:11                                                 ` Samuel Thibault
  1 sibling, 0 replies; 30+ messages in thread
From: Samuel Thibault @ 2007-03-01 18:11 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Hi,

Eric wrote:
> > We should implement a real MADV_DONTNEED and rename the current one
> > to MADV_FREE, but that's 2.6.17 material.
> 
> We definitely need to check this.  I am fairly certain I have seen
> this conversation before.

Yes, it was back in 2005:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111996850004771&w=2

Nobody took the time to fix it, I filed bug #6282 on bugzilla.kernel.org
some time ago.

Samuel

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2007-03-01 18:12 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1130366995.23729.38.camel@localhost.localdomain>
     [not found] ` <20051028034616.GA14511@ccure.user-mode-linux.org>
     [not found]   ` <43624F82.6080003@us.ibm.com>
     [not found]     ` <20051028184235.GC8514@ccure.user-mode-linux.org>
     [not found]       ` <1130544201.23729.167.camel@localhost.localdomain>
     [not found]         ` <20051029025119.GA14998@ccure.user-mode-linux.org>
     [not found]           ` <1130788176.24503.19.camel@localhost.localdomain>
     [not found]             ` <20051101000509.GA11847@ccure.user-mode-linux.org>
2005-11-02  1:15               ` [PATCH] 2.6.14 patch for supporting madvise(MADV_FREE) Badari Pulavarty
2005-11-02  1:43                 ` Andrea Arcangeli
2005-11-02 15:49                   ` Badari Pulavarty
2005-11-02 16:12                   ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Badari Pulavarty
2005-11-02 19:54                     ` New bug in patch and existing Linux code - race with install_page() (was: Re: [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE)) Blaisorblade
2005-11-02 20:12                       ` Hugh Dickins
2005-11-02 20:45                         ` Hugh Dickins
2005-11-02 21:36                       ` Badari Pulavarty
2005-11-02 21:55                         ` Hugh Dickins
2005-11-02 22:02                           ` Badari Pulavarty
2005-11-12  0:25                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton
2005-11-12  0:34                       ` Badari Pulavarty
2005-11-12  1:43                         ` Andrew Morton
2005-11-12  4:41                           ` Badari Pulavarty
2006-01-16 13:06                             ` differences between MADV_FREE and MADV_DONTNEED Andrea Arcangeli
2006-01-16 16:02                               ` Suleiman Souhlal
2006-01-16 16:28                                 ` Andrea Arcangeli
2006-01-16 17:03                                   ` Suleiman Souhlal
2006-01-16 17:24                                     ` Andrea Arcangeli
2006-01-16 21:43                                       ` Eric W. Biederman
2006-01-17  0:24                                         ` Suleiman Souhlal
2006-01-17  1:04                                           ` Nicholas Miell
2006-01-17 12:43                                             ` Christoph Hellwig
2006-01-17 18:23                                               ` Eric W. Biederman
2006-01-17 22:55                                                 ` Nicholas Miell
2007-03-01 18:11                                                 ` Samuel Thibault
2006-01-17 19:06                                               ` Badari Pulavarty
2006-01-17  1:06                               ` Blaisorblade
2006-01-17  1:33                                 ` Andrea Arcangeli
2005-11-12  0:34                     ` [PATCH] 2.6.14 patch for supporting madvise(MADV_REMOVE) Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).