All of lore.kernel.org
* [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
@ 2013-08-05 19:43 ` Andy Lutomirski
  0 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-05 19:43 UTC (permalink / raw)
  To: linux-mm, linux-ext4; +Cc: linux-kernel, Andy Lutomirski

My application fallocates and mmaps (shared, writable) a lot (several
GB) of data at startup.  Those mappings are mlocked, and they live on
ext4.  The first write to any given page is slow because
ext4_da_get_block_prep can block.  This means that, to get decent
performance, I need to write something to all of these pages at
startup.  This, in turn, causes a giant IO storm as several GB of
zeros get pointlessly written to disk.
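
For illustration, the startup workaround described above looks roughly like this (an editorial sketch, not code from the application; the mapping setup is simplified):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of the pre-touch workaround: fault in every page of a shared
 * writable mapping by writing to it.  This forces the expensive
 * first-write path (ext4_da_get_block_prep) to run up front, but it
 * also dirties every page, queuing gigabytes of zeros for writeback. */
static void *map_and_prefault(int fd, size_t len)
{
	char *p;

	if (posix_fallocate(fd, 0, len))	/* reserve the blocks */
		return MAP_FAILED;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return MAP_FAILED;

	mlock(p, len);	/* may fail without privilege; ignored here */

	/* The problematic part: one write per page. */
	for (size_t off = 0; off < len; off += sysconf(_SC_PAGESIZE))
		p[off] = 0;

	return p;
}
```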

This series is an attempt to add madvise(..., MADV_WILLWRITE) to
signal to the kernel that I will eventually write to the referenced
pages.  It should cause any expensive operations that happen on the
first write to happen immediately, but it should not result in
dirtying the pages.

madvise(addr, len, MADV_WILLWRITE) returns the number of bytes that
the operation succeeded on or a negative error code if there was an
actual failure.  A return value of zero signifies that the kernel
doesn't know how to "willwrite" the range and that userspace should
implement a fallback.

For now, it only works on shared writable ext4 mappings.  Eventually
it should support other filesystems as well as private pages (it
should COW the pages but not cause swap IO) and anonymous pages (it
should COW the zero page if applicable).

The implementation leaves much to be desired.  In particular, it
generates dirty buffer heads on a clean page, and this scares me.

Thoughts?

Andy Lutomirski (3):
  mm: Add MADV_WILLWRITE to indicate that a range will be written to
  fs: Add block_willwrite
  ext4: Implement willwrite for the delalloc case

 fs/buffer.c                            | 57 ++++++++++++++++++++++++++++++++++
 fs/ext4/ext4.h                         |  2 ++
 fs/ext4/file.c                         |  1 +
 fs/ext4/inode.c                        | 22 +++++++++++++
 include/linux/buffer_head.h            |  3 ++
 include/linux/mm.h                     | 12 +++++++
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/madvise.c                           | 28 +++++++++++++++--
 8 files changed, 126 insertions(+), 2 deletions(-)

-- 
1.8.3.1



* [RFC 1/3] mm: Add MADV_WILLWRITE to indicate that a range will be written to
  2013-08-05 19:43 ` Andy Lutomirski
@ 2013-08-05 19:43   ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-05 19:43 UTC (permalink / raw)
  To: linux-mm, linux-ext4; +Cc: linux-kernel, Andy Lutomirski

This should not cause data to be written to disk.  It should, however,
do any expensive operations that would otherwise happen on the first
write fault to this range.

Some day this may COW private mappings, allocate real memory instead
of zero pages, etc.  For now it just passes the request down to
filesystems on shared writable mappings.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 include/linux/mm.h                     | 12 ++++++++++++
 include/uapi/asm-generic/mman-common.h |  3 +++
 mm/madvise.c                           | 28 ++++++++++++++++++++++++++--
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..d3c89ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,18 @@ struct vm_operations_struct {
 	 * writable, if an error is returned it will cause a SIGBUS */
 	int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
 
+	/* Request to make future writes to this range fast.  Only called
+	 * on shared writable mappings.  Should not dirty pages or write
+	 * any data to disk.
+	 *
+	 * return the number of bytes for which the operation worked (i.e.
+	 * zero if unsupported) or a negative error if something goes wrong.
+	 *
+	 * called with mmap_sem held for read.
+	 */
+	long (*willwrite)(struct vm_area_struct *vma,
+			  unsigned long start, unsigned long end);
+
 	/* called by access_process_vm when get_user_pages() fails, typically
 	 * for use by special VMAs that can switch between memory and hardware
 	 */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529..e65e97d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -52,6 +52,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_WILLWRITE	18		/* Will write to this page, but maybe
+					   only after a while */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 7055883..7b537fd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -31,6 +31,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_WILLWRITE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -337,6 +338,24 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static long madvise_willwrite(struct vm_area_struct * vma,
+			     struct vm_area_struct ** prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+
+	if (!(vma->vm_flags & VM_WRITE))
+		return -EFAULT;
+
+	if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
+		return 0;  /* Not yet supported */
+
+	if (!vma->vm_file || !vma->vm_ops || !vma->vm_ops->willwrite)
+		return 0;  /* Not supported */
+
+	return vma->vm_ops->willwrite(vma, start, end);
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -380,6 +399,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
+	case MADV_WILLWRITE:
+		return madvise_willwrite(vma, prev, start, end);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -407,6 +428,7 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_WILLWRITE:
 		return 1;
 
 	default:
@@ -465,6 +487,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	int write;
 	size_t len;
 	struct blk_plug plug;
+	long sum_rets = 0;
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
@@ -526,8 +549,9 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
 		error = madvise_vma(vma, &prev, start, tmp, behavior);
-		if (error)
+		if (error < 0)
 			goto out;
+		sum_rets += error;
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
@@ -546,5 +570,5 @@ out:
 	else
 		up_read(&current->mm->mmap_sem);
 
-	return error;
+	return error ? error : sum_rets;
 }
-- 
1.8.3.1



* [RFC 2/3] fs: Add block_willwrite
  2013-08-05 19:43 ` Andy Lutomirski
@ 2013-08-05 19:44   ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-05 19:44 UTC (permalink / raw)
  To: linux-mm, linux-ext4; +Cc: linux-kernel, Andy Lutomirski

This provides generic support for MADV_WILLWRITE.  It creates and maps
buffer heads, but it should not result in anything being marked dirty.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

As described in the 0/3 cover letter, this may have issues.

 fs/buffer.c                 | 57 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |  3 +++
 2 files changed, 60 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d74335..017e822 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2444,6 +2444,63 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL(block_page_mkwrite);
 
+long block_willwrite(struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end,
+		     get_block_t get_block)
+{
+	long ret = 0;
+	loff_t size;
+	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+
+	for (; start < end; start += PAGE_CACHE_SIZE) {
+		struct page *p;
+		int size_in_page;
+		int tmp = get_user_pages_fast(start, 1, 0, &p);
+		if (tmp == 0)
+			tmp = -EFAULT;
+		if (tmp != 1) {
+			ret = tmp;
+			break;
+		}
+
+		sb_start_pagefault(sb);
+
+		lock_page(p);
+		size = i_size_read(inode);
+		if (WARN_ON_ONCE(p->mapping != inode->i_mapping) ||
+		    (page_offset(p) > size)) {
+			ret = -EFAULT;  /* A real write would have failed. */
+			goto pagedone_unlock;
+		}
+
+		/* page is partially inside EOF? */
+		if (((p->index + 1) << PAGE_CACHE_SHIFT) > size)
+			size_in_page = size & ~PAGE_CACHE_MASK;
+		else
+			size_in_page = PAGE_CACHE_SIZE;
+
+		tmp = __block_write_begin(p, 0, size_in_page, get_block);
+		if (tmp) {
+			ret = tmp;
+			goto pagedone_unlock;
+		}
+
+		ret += PAGE_CACHE_SIZE;
+
+		/* No need to commit -- we're not writing anything yet. */
+
+	pagedone_unlock:
+		unlock_page(p);
+		sb_end_pagefault(sb);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(block_willwrite);
+
 /*
  * nobh_write_begin()'s prereads are special: the buffer_heads are freed
  * immediately, while under the page lock.  So it needs a special end_io
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 91fa9a9..c84639d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -230,6 +230,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
+long block_willwrite(struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end,
+		     get_block_t get_block);
 /* Convert errno to return value from ->page_mkwrite() call */
 static inline int block_page_mkwrite_return(int err)
 {
-- 
1.8.3.1



* [RFC 3/3] ext4: Implement willwrite for the delalloc case
  2013-08-05 19:43 ` Andy Lutomirski
@ 2013-08-05 19:44   ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-05 19:44 UTC (permalink / raw)
  To: linux-mm, linux-ext4; +Cc: linux-kernel, Andy Lutomirski

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/file.c  |  1 +
 fs/ext4/inode.c | 22 ++++++++++++++++++++++
 3 files changed, 25 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..be7308a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2103,6 +2103,8 @@ extern int ext4_block_zero_page_range(handle_t *handle,
 extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
 			     loff_t lstart, loff_t lend);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern long ext4_willwrite(struct vm_area_struct *vma,
+			   unsigned long start, unsigned long end);
 extern qsize_t *ext4_get_reserved_space(struct inode *inode);
 extern void ext4_da_update_reserve_space(struct inode *inode,
 					int used, int quota_claim);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6f4cc56..159226f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -201,6 +201,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
 static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= filemap_fault,
 	.page_mkwrite   = ext4_page_mkwrite,
+	.willwrite	= ext4_willwrite,
 	.remap_pages	= generic_file_remap_pages,
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba33c67..c49e36b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5101,3 +5101,25 @@ out:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
+
+long ext4_willwrite(struct vm_area_struct *vma,
+		    unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	int retries = 0;
+
+	/* We only support the delalloc case */
+	if (test_opt(inode->i_sb, DELALLOC) &&
+	    !ext4_should_journal_data(inode) &&
+	    !ext4_nonda_switch(inode->i_sb)) {
+		do {
+			ret = block_willwrite(vma, start, end,
+					      ext4_da_get_block_prep);
+		} while (ret == -ENOSPC &&
+		       ext4_should_retry_alloc(inode->i_sb, &retries));
+	}
+
+	return ret;
+}
-- 
1.8.3.1



* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-05 19:43 ` Andy Lutomirski
@ 2013-08-07 13:40   ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-07 13:40 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-mm, linux-ext4, linux-kernel

On Mon 05-08-13 12:43:58, Andy Lutomirski wrote:
> My application fallocates and mmaps (shared, writable) a lot (several
> GB) of data at startup.  Those mappings are mlocked, and they live on
> ext4.  The first write to any given page is slow because
> ext4_da_get_block_prep can block.  This means that, to get decent
> performance, I need to write something to all of these pages at
> startup.  This, in turn, causes a giant IO storm as several GB of
> zeros get pointlessly written to disk.
> 
> This series is an attempt to add madvise(..., MADV_WILLWRITE) to
> signal to the kernel that I will eventually write to the referenced
> pages.  It should cause any expensive operations that happen on the
> first write to happen immediately, but it should not result in
> dirtying the pages.
> 
> madvise(addr, len, MADV_WILLWRITE) returns the number of bytes that
> the operation succeeded on or a negative error code if there was an
> actual failure.  A return value of zero signifies that the kernel
> doesn't know how to "willwrite" the range and that userspace should
> implement a fallback.
> 
> For now, it only works on shared writable ext4 mappings.  Eventually
> it should support other filesystems as well as private pages (it
> should COW the pages but not cause swap IO) and anonymous pages (it
> should COW the zero page if applicable).
> 
> The implementation leaves much to be desired.  In particular, it
> generates dirty buffer heads on a clean page, and this scares me.
> 
> Thoughts?
  One question before I look at the patches: Why don't you use fallocate()
in your application? The functionality you require seems to be pretty
similar to it - writing to an already allocated block is usually quick.


								Honza

> Andy Lutomirski (3):
>   mm: Add MADV_WILLWRITE to indicate that a range will be written to
>   fs: Add block_willwrite
>   ext4: Implement willwrite for the delalloc case
> 
>  fs/buffer.c                            | 57 ++++++++++++++++++++++++++++++++++
>  fs/ext4/ext4.h                         |  2 ++
>  fs/ext4/file.c                         |  1 +
>  fs/ext4/inode.c                        | 22 +++++++++++++
>  include/linux/buffer_head.h            |  3 ++
>  include/linux/mm.h                     | 12 +++++++
>  include/uapi/asm-generic/mman-common.h |  3 ++
>  mm/madvise.c                           | 28 +++++++++++++++--
>  8 files changed, 126 insertions(+), 2 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
@ 2013-08-07 13:40   ` Jan Kara
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-07 13:40 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-mm, linux-ext4, linux-kernel

On Mon 05-08-13 12:43:58, Andy Lutomirski wrote:
> My application fallocates and mmaps (shared, writable) a lot (several
> GB) of data at startup.  Those mappings are mlocked, and they live on
> ext4.  The first write to any given page is slow because
> ext4_da_get_block_prep can block.  This means that, to get decent
> performance, I need to write something to all of these pages at
> startup.  This, in turn, causes a giant IO storm as several GB of
> zeros get pointlessly written to disk.
> 
> This series is an attempt to add madvise(..., MADV_WILLWRITE) to
> signal to the kernel that I will eventually write to the referenced
> pages.  It should cause any expensive operations that happen on the
> first write to happen immediately, but it should not result in
> dirtying the pages.
> 
> madvice(addr, len, MADV_WILLWRITE) returns the number of bytes that
> the operation succeeded on or a negative error code if there was an
> actual failure.  A return value of zero signifies that the kernel
> doesn't know how to "willwrite" the range and that userspace should
> implement a fallback.
> 
> For now, it only works on shared writable ext4 mappings.  Eventually
> it should support other filesystems as well as private pages (it
> should COW the pages but not cause swap IO) and anonymous pages (it
> should COW the zero page if applicable).
> 
> The implementation leaves much to be desired.  In particular, it
> generates dirty buffer heads on a clean page, and this scares me.
> 
> Thoughts?
  One question before I look at the patches: Why don't you use fallocate()
in your application? The functionality you require seems to be pretty
similar to it - writing to an already allocated block is usually quick.


								Honza

> Andy Lutomirski (3):
>   mm: Add MADV_WILLWRITE to indicate that a range will be written to
>   fs: Add block_willwrite
>   ext4: Implement willwrite for the delalloc case
> 
>  fs/buffer.c                            | 57 ++++++++++++++++++++++++++++++++++
>  fs/ext4/ext4.h                         |  2 ++
>  fs/ext4/file.c                         |  1 +
>  fs/ext4/inode.c                        | 22 +++++++++++++
>  include/linux/buffer_head.h            |  3 ++
>  include/linux/mm.h                     | 12 +++++++
>  include/uapi/asm-generic/mman-common.h |  3 ++
>  mm/madvise.c                           | 28 +++++++++++++++--
>  8 files changed, 126 insertions(+), 2 deletions(-)
> 
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread
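The return-value convention in the quoted proposal (bytes handled on success, zero for "unsupported, use a fallback", negative errno on real failure) implies a retry loop in userspace. A sketch of that loop follows; note this is illustrative only: `willwrite_range` and the injected `willwrite_fn` callbacks are hypothetical names, and the syscall itself is abstracted behind a function pointer because MADV_WILLWRITE was an RFC flag, not something present in released kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for madvise(addr, len, MADV_WILLWRITE): returns bytes
 * handled, 0 if the kernel cannot "willwrite" the range, or a
 * negative errno on a real failure. */
typedef long (*willwrite_fn)(char *addr, size_t len);

/* Returns 0 when the whole range was handled, 1 when the caller must
 * fall back to touching the pages itself, -1 on a hard failure. */
static int willwrite_range(char *addr, size_t len, willwrite_fn willwrite)
{
    while (len > 0) {
        long done = willwrite(addr, len);
        if (done < 0)
            return -1;              /* real failure */
        if (done == 0)
            return 1;               /* unsupported: fall back */
        if ((size_t)done > len)
            done = (long)len;       /* defensive clamp */
        addr += done;
        len -= (size_t)done;
    }
    return 0;
}

/* Test doubles: a "kernel" that handles at most one 4 KiB page per
 * call, and one that does not support the range at all.  Neither
 * dereferences the address. */
static long fake_one_page(char *addr, size_t len)
{
    (void)addr;
    return (long)(len < 4096 ? len : 4096);
}

static long fake_unsupported(char *addr, size_t len)
{
    (void)addr;
    (void)len;
    return 0;
}
```

With a partial-progress callback the loop keeps advancing until the range is covered; a zero return surfaces as "fall back" so the caller can go touch the pages the old way.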

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-07 13:40   ` Jan Kara
@ 2013-08-07 17:02     ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-07 17:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-mm, linux-ext4, linux-kernel

On Wed, Aug 7, 2013 at 6:40 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 05-08-13 12:43:58, Andy Lutomirski wrote:
>> My application fallocates and mmaps (shared, writable) a lot (several
>> GB) of data at startup.  Those mappings are mlocked, and they live on
>> ext4.  The first write to any given page is slow because
>> ext4_da_get_block_prep can block.  This means that, to get decent
>> performance, I need to write something to all of these pages at
>> startup.  This, in turn, causes a giant IO storm as several GB of
>> zeros get pointlessly written to disk.
>>
>> This series is an attempt to add madvise(..., MADV_WILLWRITE) to
>> signal to the kernel that I will eventually write to the referenced
>> pages.  It should cause any expensive operations that happen on the
>> first write to happen immediately, but it should not result in
>> dirtying the pages.
>>
>> madvise(addr, len, MADV_WILLWRITE) returns the number of bytes that
>> the operation succeeded on or a negative error code if there was an
>> actual failure.  A return value of zero signifies that the kernel
>> doesn't know how to "willwrite" the range and that userspace should
>> implement a fallback.
>>
>> For now, it only works on shared writable ext4 mappings.  Eventually
>> it should support other filesystems as well as private pages (it
>> should COW the pages but not cause swap IO) and anonymous pages (it
>> should COW the zero page if applicable).
>>
>> The implementation leaves much to be desired.  In particular, it
>> generates dirty buffer heads on a clean page, and this scares me.
>>
>> Thoughts?
>   One question before I look at the patches: Why don't you use fallocate()
> in your application? The functionality you require seems to be pretty
> similar to it - writing to an already allocated block is usually quick.

I do use fallocate, and, IIRC, the problem was worse before I added
the fallocate call.

This could be argued to be a filesystem problem -- perhaps
page_mkwrite should never block.  I don't expect that to be fixed any
time soon (if ever).
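
For reference, the startup sequence described in the original posting (fallocate, shared writable mmap, mlock, then a write to every page) corresponds roughly to the following. This is a reconstruction of the pattern, not code from the application; the touch loop at the end is exactly the "several GB of zeros" workaround the series tries to make unnecessary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Preallocate blocks, map them shared and writable, mlock, then write
 * once to each page so later write faults need no block allocation.
 * The touch loop dirties every page, which is what causes the IO
 * storm at startup. */
static void *prepare_mapping(int fd, size_t len)
{
    if (posix_fallocate(fd, 0, (off_t)len) != 0)
        return NULL;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    mlock(p, len);                  /* may fail unprivileged; non-fatal */
    for (size_t off = 0; off < len; off += 4096)
        p[off] = 0;                 /* pay the first-write cost up front */
    return p;
}

/* Small self-check against an unlinked temp file. */
static int prepare_mapping_demo(void)
{
    char path[] = "/tmp/willwrite-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    char *p = prepare_mapping(fd, 8 * 4096);
    close(fd);
    if (p == NULL)
        return -1;
    int ok = (p[0] == 0 && p[7 * 4096] == 0);
    munmap(p, 8 * 4096);
    return ok ? 0 : -1;
}
```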

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-07 13:40   ` Jan Kara
@ 2013-08-07 17:40     ` Dave Hansen
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Hansen @ 2013-08-07 17:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andy Lutomirski, linux-mm, linux-ext4, linux-kernel

On 08/07/2013 06:40 AM, Jan Kara wrote:
>   One question before I look at the patches: Why don't you use fallocate()
> in your application? The functionality you require seems to be pretty
> similar to it - writing to an already allocated block is usually quick.

One problem I've seen is that it still costs you a fault per page to get
the PTEs into a state where you can write to the memory.  MADV_WILLNEED
will do readahead to get the page cache filled, but it still leaves the
pages unmapped.  Those faults get expensive when you're trying to do a
couple hundred million of them all at once.
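
This per-page cost is observable from userspace with getrusage(): MADV_WILLNEED fills the page cache, but touching the pages afterwards still takes a fault for each page (or fault-around cluster). A rough Linux-only sketch; the helper name and temp-file setup here are illustrative, not from the patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Map a preallocated temp file, optionally madvise(MADV_WILLNEED) it,
 * then write to every page and return how many page faults the touch
 * loop took.  WILLNEED populates the page cache but leaves the PTEs
 * unmapped, so the faults remain either way. */
static long faults_while_touching(size_t len, int willneed)
{
    char path[] = "/tmp/willneed-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (posix_fallocate(fd, 0, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);                       /* mapping keeps the file alive */
    if ((void *)p == MAP_FAILED)
        return -1;
    if (willneed)
        madvise((void *)p, len, MADV_WILLNEED);

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;                  /* first write fault on each page */
    getrusage(RUSAGE_SELF, &after);

    munmap((void *)p, len);
    return (after.ru_minflt - before.ru_minflt) +
           (after.ru_majflt - before.ru_majflt);
}
```

Even with WILLNEED the touch loop should show a nonzero fault count; only the cost per fault changes, not their existence.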

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-07 17:40     ` Dave Hansen
@ 2013-08-07 18:00       ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-07 18:00 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jan Kara, linux-mm, linux-ext4, linux-kernel

On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 08/07/2013 06:40 AM, Jan Kara wrote:
>>   One question before I look at the patches: Why don't you use fallocate()
>> in your application? The functionality you require seems to be pretty
>> similar to it - writing to an already allocated block is usually quick.
>
> One problem I've seen is that it still costs you a fault per-page to get
> the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
> will do readahead to get the page cache filled, but it still leaves the
> pages unmapped.  Those faults get expensive when you're trying to do a
> couple hundred million of them all at once.

I have grand plans to teach the kernel to use hardware dirty tracking
so that (some?) pages can be left clean and writable for long periods
of time.  This will be hard.

Even so, the second write fault to a page tends to take only a few
microseconds, while the first one often blocks in fs code.

(mmap_sem is a different story, but I see it as a separate issue.)

--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-07 18:00       ` Andy Lutomirski
@ 2013-08-08 10:18         ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-08 10:18 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Dave Hansen, Jan Kara, linux-mm, linux-ext4, linux-kernel

On Wed 07-08-13 11:00:52, Andy Lutomirski wrote:
> On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> > On 08/07/2013 06:40 AM, Jan Kara wrote:
> >>   One question before I look at the patches: Why don't you use fallocate()
> >> in your application? The functionality you require seems to be pretty
> >> similar to it - writing to an already allocated block is usually quick.
> >
> > One problem I've seen is that it still costs you a fault per-page to get
> > the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
> > will do readahead to get the page cache filled, but it still leaves the
> > pages unmapped.  Those faults get expensive when you're trying to do a
> > couple hundred million of them all at once.
> 
> I have grand plans to teach the kernel to use hardware dirty tracking
> so that (some?) pages can be left clean and writable for long periods
> of time.  This will be hard.
  Right, that will be tough... Although with your application you could
require such pages to be mlocked, and then I could imagine we would at
least get away from the problems with dirty page accounting.

> Even so, the second write fault to a page tends to take only a few
> microseconds, while the first one often blocks in fs code.
  So you wrote that blocks are already preallocated with fallocate(). If
you also preload the pages into memory with MADV_WILLNEED, is there still a
big difference between the first and subsequent write faults?

> (mmap_sem is a different story, but I see it as a separate issue.)
  Yeah, agreed.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 10:18         ` Jan Kara
@ 2013-08-08 15:56           ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-08 15:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Thu, Aug 8, 2013 at 3:18 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 07-08-13 11:00:52, Andy Lutomirski wrote:
>> On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> > On 08/07/2013 06:40 AM, Jan Kara wrote:
>> >>   One question before I look at the patches: Why don't you use fallocate()
>> >> in your application? The functionality you require seems to be pretty
>> >> similar to it - writing to an already allocated block is usually quick.
>> >
>> > One problem I've seen is that it still costs you a fault per-page to get
>> > the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
>> > will do readahead to get the page cache filled, but it still leaves the
>> > pages unmapped.  Those faults get expensive when you're trying to do a
>> > couple hundred million of them all at once.
>>
>> I have grand plans to teach the kernel to use hardware dirty tracking
>> so that (some?) pages can be left clean and writable for long periods
>> of time.  This will be hard.
>   Right that will be tough... Although with your application you could
> require such pages to be mlocked and then I could imagine we would get away
> at least from problems with dirty page accounting.

True.  The nasty part will be all the code that assumes that the acts
of un-write-protecting and dirtying are the same thing, for example
__block_write_begin, which is why I don't really believe in my
willwrite patches...

>
>> Even so, the second write fault to a page tends to take only a few
>> microseconds, while the first one often blocks in fs code.
>   So you wrote blocks are already preallocated with fallocate(). If you
> also preload pages in memory with MADV_WILLNEED is there still big
> difference between the first and subsequent write fault?

I haven't measured it yet, because I suspect that my patches are
rather buggy in their current form.  But the idea is that fallocate
will do the heavy lifting and give me a nice contiguous allocation,
and the MADV_WILLNEED call will take about as long as the first write
fault would have taken.  Then the first write fault after
MADV_WILLNEED will take about as long as the second write fault would
have taken without it.


--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 15:56           ` Andy Lutomirski
@ 2013-08-08 18:53             ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-08 18:53 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Jan Kara, Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Thu 08-08-13 08:56:28, Andy Lutomirski wrote:
> On Thu, Aug 8, 2013 at 3:18 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 07-08-13 11:00:52, Andy Lutomirski wrote:
> >> On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> >> > On 08/07/2013 06:40 AM, Jan Kara wrote:
> >> >>   One question before I look at the patches: Why don't you use fallocate()
> >> >> in your application? The functionality you require seems to be pretty
> >> >> similar to it - writing to an already allocated block is usually quick.
> >> >
> >> > One problem I've seen is that it still costs you a fault per-page to get
> >> > the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
> >> > will do readahead to get the page cache filled, but it still leaves the
> >> > pages unmapped.  Those faults get expensive when you're trying to do a
> >> > couple hundred million of them all at once.
> >>
> >> I have grand plans to teach the kernel to use hardware dirty tracking
> >> so that (some?) pages can be left clean and writable for long periods
> >> of time.  This will be hard.
> >   Right that will be tough... Although with your application you could
> > require such pages to be mlocked and then I could imagine we would get away
> > at least from problems with dirty page accounting.
> 
> True.  The nasty part will be all the code that assumes that the acts
> of un-write-protecting and dirtying are the same thing, for example
> __block_write_begin, which is why I don't really believe in my
> willwrite patches...
> 
> >
> >> Even so, the second write fault to a page tends to take only a few
> >> microseconds, while the first one often blocks in fs code.
> >   So you wrote blocks are already preallocated with fallocate(). If you
> > also preload pages in memory with MADV_WILLNEED is there still big
> > difference between the first and subsequent write fault?
> 
> I haven't measured it yet, because I suspect that my patches are
> rather buggy in their current form.
  I think you're mistaking MADV_WILLNEED for your MADV_WILLWRITE.
MADV_WILLNEED will just trigger readahead for the range - thus you should
have all pages with their block mappings set up in memory. Now the first
writeable fault will still have to do some work, namely converting
unwritten extents in the extent tree to written ones. So there is going to
be some difference between the first and subsequent writeable faults. But
I'd like to see whether the difference is really worth the effort of a new
MADV_... call.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 18:53             ` Jan Kara
@ 2013-08-08 19:25               ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-08 19:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Thu, Aug 8, 2013 at 11:53 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 08-08-13 08:56:28, Andy Lutomirski wrote:
>> On Thu, Aug 8, 2013 at 3:18 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 07-08-13 11:00:52, Andy Lutomirski wrote:
>> >> On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> >> > On 08/07/2013 06:40 AM, Jan Kara wrote:
>> >> >>   One question before I look at the patches: Why don't you use fallocate()
>> >> >> in your application? The functionality you require seems to be pretty
>> >> >> similar to it - writing to an already allocated block is usually quick.
>> >> >
>> >> > One problem I've seen is that it still costs you a fault per-page to get
>> >> > the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
>> >> > will do readahead to get the page cache filled, but it still leaves the
>> >> > pages unmapped.  Those faults get expensive when you're trying to do a
>> >> > couple hundred million of them all at once.
>> >>
>> >> I have grand plans to teach the kernel to use hardware dirty tracking
>> >> so that (some?) pages can be left clean and writable for long periods
>> >> of time.  This will be hard.
>> >   Right that will be tough... Although with your application you could
>> > require such pages to be mlocked and then I could imagine we would get away
>> > at least from problems with dirty page accounting.
>>
>> True.  The nasty part will be all the code that assumes that the acts
>> of un-write-protecting and dirtying are the same thing, for example
>> __block_write_begin, which is why I don't really believe in my
>> willwrite patches...
>>
>> >
>> >> Even so, the second write fault to a page tends to take only a few
>> >> microseconds, while the first one often blocks in fs code.
>> >   So you wrote blocks are already preallocated with fallocate(). If you
>> > also preload pages in memory with MADV_WILLNEED is there still big
>> > difference between the first and subsequent write fault?
>>
>> I haven't measured it yet, because I suspect that my patches are
>> rather buggy in their current form.
>   I think you're mistaking MADV_WILLNEED with your MADV_WILLWRITE.
> MADV_WILLNEED will just trigger readahead for the range - thus you should
> have all pages with their block mappings set up in memory. Now the first
> writeable fault will still have to do some work, namely converting
> unwritten extents in extent tree to written ones. So there is going to be
> some difference between the first and subsequent writeable faults. But I'd
> like to see whether the difference is really worth the effort with new
> MADV_... call.
>

Whoops -- I read your email too quickly.  I haven't tried
MADV_WILLNEED, but I think I tried reading each page to fault them in.
Is there any reason to expect MADV_WILLNEED to do any better?  I'll
try to do some new tests to see how well this all works.

(I imagine that freshly fallocated files are somehow different when
read, since there aren't zeros on the disk backing them until they get
written.)
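
The parenthetical above is right as far as observable contents go: on an extent-based filesystem such as ext4, a freshly fallocated range has its extents allocated but flagged unwritten, and it reads back as zeros even though no zeros were ever written to disk. A quick check of that read behavior (the temp-file setup is illustrative):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Fallocate one page of an unlinked temp file and verify that it
 * reads back as all zeros.  Returns 0 on success, 1 if a nonzero byte
 * shows up, -1 on setup failure. */
static int fallocated_reads_zero(void)
{
    char path[] = "/tmp/fzero-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (posix_fallocate(fd, 0, 4096) != 0) {
        close(fd);
        return -1;
    }
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    close(fd);
    if (n != (ssize_t)sizeof(buf))
        return -1;
    for (size_t i = 0; i < sizeof(buf); i++)
        if (buf[i] != 0)
            return 1;
    return 0;
}
```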

--Andy

>                                                                 Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
@ 2013-08-08 19:25               ` Andy Lutomirski
  0 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-08 19:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Thu, Aug 8, 2013 at 11:53 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 08-08-13 08:56:28, Andy Lutomirski wrote:
>> On Thu, Aug 8, 2013 at 3:18 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 07-08-13 11:00:52, Andy Lutomirski wrote:
>> >> On Wed, Aug 7, 2013 at 10:40 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> >> > On 08/07/2013 06:40 AM, Jan Kara wrote:
>> >> >>   One question before I look at the patches: Why don't you use fallocate()
>> >> >> in your application? The functionality you require seems to be pretty
>> >> >> similar to it - writing to an already allocated block is usually quick.
>> >> >
>> >> > One problem I've seen is that it still costs you a fault per-page to get
>> >> > the PTEs in to a state where you can write to the memory.  MADV_WILLNEED
>> >> > will do readahead to get the page cache filled, but it still leaves the
>> >> > pages unmapped.  Those faults get expensive when you're trying to do a
>> >> > couple hundred million of them all at once.
>> >>
>> >> I have grand plans to teach the kernel to use hardware dirty tracking
>> >> so that (some?) pages can be left clean and writable for long periods
>> >> of time.  This will be hard.
>> >   Right that will be tough... Although with your application you could
>> > require such pages to be mlocked and then I could imagine we would get away
>> > at least from problems with dirty page accounting.
>>
>> True.  The nasty part will be all the code that assumes that the acts
>> of un-write-protecting and dirtying are the same thing, for example
>> __block_write_begin, which is why I don't really believe in my
>> willwrite patches...
>>
>> >
>> >> Even so, the second write fault to a page tends to take only a few
>> >> microseconds, while the first one often blocks in fs code.
>> >   So you wrote blocks are already preallocated with fallocate(). If you
>> > also preload pages in memory with MADV_WILLNEED is there still big
>> > difference between the first and subsequent write fault?
>>
>> I haven't measured it yet, because I suspect that my patches are
>> rather buggy in their current form.
>   I think you're mistaking MADV_WILLNEED for your MADV_WILLWRITE.
> MADV_WILLNEED will just trigger readahead for the range - thus you should
> have all pages with their block mappings set up in memory. Now the first
> writeable fault will still have to do some work, namely converting
> unwritten extents in extent tree to written ones. So there is going to be
> some difference between the first and subsequent writeable faults. But I'd
> like to see whether the difference is really worth the effort with new
> MADV_... call.
>

Whoops -- I read your email too quickly.  I haven't tried
MADV_WILLNEED, but I think I tried reading each page to fault them in.
Is there any reason to expect MADV_WILLNEED to do any better?  I'll
try to do some new tests to see how well this all works.

(I imagine that freshly fallocated files are somehow different when
read, since there aren't zeros on the disk backing them until they get
written.)

--Andy

>                                                                 Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR



-- 
Andy Lutomirski
AMA Capital Management, LLC


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 19:25               ` Andy Lutomirski
@ 2013-08-08 22:58                 ` Dave Hansen
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Hansen @ 2013-08-08 22:58 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Jan Kara, linux-mm, linux-ext4, linux-kernel

I was coincidentally tracking down what I thought was a scalability
problem (turned out to be full disks :).  I noticed, though, that ext4
is about 20% slower than ext2/3 at doing write page faults (x-axis is
number of tasks):

http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5

The test case is:

	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c

A 'perf diff' shows some of the same suspects that you've been talking
about, Andy:

	http://www.sr71.net/~dave/intel/page-fault-exts/diffprofile.txt

>      2.39%   +2.34%  [kernel.kallsyms]      [k] __set_page_dirty_buffers               
>              +2.50%  [kernel.kallsyms]      [k] __block_write_begin                    
>              +2.16%  [kernel.kallsyms]      [k] __block_commit_write                   

The same test on ext4 but doing MAP_PRIVATE instead of MAP_SHARED goes
at the same speed as ext2/3:

	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault2.c

This is looking to me more like an ext4-specific problem that needs to
be solved directly rather than worked around through new interfaces
(like MADV_WILLWRITE).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 19:25               ` Andy Lutomirski
@ 2013-08-09  0:11                 ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-09  0:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Thu, Aug 8, 2013 at 12:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Whoops -- I read your email too quickly.  I haven't tried
> MADV_WILLNEED, but I think I tried reading each page to fault them in.
>  Is there any reason to expect MADV_WILLNEED to do any better?  I'll
> try to do some new tests to see how well this all works.
>
> (I imagine that freshly fallocated files are somehow different when
> read, since there aren't zeros on the disk backing them until they get
> written.)

Well, this will teach me to write code based on an old benchmark from
memory.  It seems that prefaulting for read is okay* on Linux 3.9 --
latencytop isn't showing do_wp_page or ext4* entries at all, at least
not for the last couple of minutes on my test box.

I wonder if ext4 changed its handling of fallocated extents somewhere
between 3.5 and 3.9.  In any case, please consider these patches
withdrawn for the time being.

* With file_update_time stubbed out.  I need to dust off my old
patches to fix that part.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-08 22:58                 ` Dave Hansen
@ 2013-08-09  7:55                   ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-09  7:55 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andy Lutomirski, Jan Kara, linux-mm, linux-ext4, linux-kernel

On Thu 08-08-13 15:58:39, Dave Hansen wrote:
> I was coincidentally tracking down what I thought was a scalability
> problem (turned out to be full disks :).  I noticed, though, that ext4
> is about 20% slower than ext2/3 at doing write page faults (x-axis is
> number of tasks):
> 
> http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
> 
> The test case is:
> 
> 	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
  The reason is that ext2/ext3 do almost nothing in their write fault
handler - they are about as fast as it gets. ext4, OTOH, needs to reserve
blocks for delayed allocation, set up buffers under a page, etc. This is
necessary if you want to make sure that data written via mmap also have
space available on disk to be written to (ext2/ext3 do not care and will
just drop the data on the floor if you happen to hit ENOSPC during
writeback).

I'm not saying the ext4 write fault path cannot possibly be optimized (no
one has seriously looked into that AFAIK, so there may well be some low
hanging fruit), but it will always be slower than ext2/3. A more meaningful
comparison would be with filesystems like XFS, which make similar
guarantees regarding data safety.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-09  7:55                   ` Jan Kara
@ 2013-08-09 17:36                     ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-09 17:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Fri, Aug 9, 2013 at 12:55 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 08-08-13 15:58:39, Dave Hansen wrote:
>> I was coincidentally tracking down what I thought was a scalability
>> problem (turned out to be full disks :).  I noticed, though, that ext4
>> is about 20% slower than ext2/3 at doing write page faults (x-axis is
>> number of tasks):
>>
>> http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
>>
>> The test case is:
>>
>>       https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>   The reason is that ext2/ext3 do almost nothing in their write fault
> handler - they are about as fast as it can get. ext4 OTOH needs to reserve
> blocks for delayed allocation, setup buffers under a page etc. This is
> necessary if you want to make sure that if data are written via mmap, they
> also have space available on disk to be written to (ext2 / ext3 do not care
> and will just drop the data on the floor if you happen to hit ENOSPC during
> writeback).

Out of curiosity, why does ext4 need to set up buffers?  That is, as
long as the fs can guarantee that there is reserved space to write out
the page, why isn't it sufficient to just mark the page dirty and let
the writeback code set up the buffers?

>
> I'm not saying ext4 write fault path cannot possibly be optimized (noone
> seriously looked into that AFAIK so there may well be some low hanging
> fruit) but it will always be slower than ext2/3. A more meaningful
> comparison would be with filesystems like XFS which make similar guarantees
> regarding data safety.

FWIW, back when I actually tested this stuff, I had awful performance
on XFS, btrfs, and ext4.  But I'm really only interested in whether
IO (or waiting for contended locks) happens on faults or not -- a
handful of microseconds while the fs allocates something from a slab
doesn't really bother me.


--Andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-09  7:55                   ` Jan Kara
@ 2013-08-09 17:42                     ` Dave Hansen
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Hansen @ 2013-08-09 17:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andy Lutomirski, linux-mm, linux-ext4, linux-kernel

On 08/09/2013 12:55 AM, Jan Kara wrote:
> On Thu 08-08-13 15:58:39, Dave Hansen wrote:
>> > I was coincidentally tracking down what I thought was a scalability
>> > problem (turned out to be full disks :).  I noticed, though, that ext4
>> > is about 20% slower than ext2/3 at doing write page faults (x-axis is
>> > number of tasks):
>> > 
>> > http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
>> > 
>> > The test case is:
>> > 
>> > 	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>   The reason is that ext2/ext3 do almost nothing in their write fault
> handler - they are about as fast as it can get. ext4 OTOH needs to reserve
> blocks for delayed allocation, setup buffers under a page etc. This is
> necessary if you want to make sure that if data are written via mmap, they
> also have space available on disk to be written to (ext2 / ext3 do not care
> and will just drop the data on the floor if you happen to hit ENOSPC during
> writeback).

I did try throwing a fallocate() in there to see if it helped.  It
didn't appear to help.  Should it have?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-09 17:42                     ` Dave Hansen
@ 2013-08-09 17:44                       ` Andy Lutomirski
  -1 siblings, 0 replies; 40+ messages in thread
From: Andy Lutomirski @ 2013-08-09 17:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Jan Kara, linux-mm, linux-ext4, linux-kernel

On Fri, Aug 9, 2013 at 10:42 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 08/09/2013 12:55 AM, Jan Kara wrote:
>> On Thu 08-08-13 15:58:39, Dave Hansen wrote:
>>> > I was coincidentally tracking down what I thought was a scalability
>>> > problem (turned out to be full disks :).  I noticed, though, that ext4
>>> > is about 20% slower than ext2/3 at doing write page faults (x-axis is
>>> > number of tasks):
>>> >
>>> > http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
>>> >
>>> > The test case is:
>>> >
>>> >    https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>   The reason is that ext2/ext3 do almost nothing in their write fault
>> handler - they are about as fast as it can get. ext4 OTOH needs to reserve
>> blocks for delayed allocation, setup buffers under a page etc. This is
>> necessary if you want to make sure that if data are written via mmap, they
>> also have space available on disk to be written to (ext2 / ext3 do not care
>> and will just drop the data on the floor if you happen to hit ENOSPC during
>> writeback).
>
> I did try throwing a fallocate() in there to see if it helped.  It
> didn't appear to help.  Should it have?


Try reading all the pages after mmap (and keep the fallocate).

In theory, MAP_POPULATE should help some, but until Linux 3.9
MAP_POPULATE was a disaster, and I'm still a bit afraid of it.

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-09 17:36                     ` Andy Lutomirski
@ 2013-08-09 20:34                       ` Jan Kara
  -1 siblings, 0 replies; 40+ messages in thread
From: Jan Kara @ 2013-08-09 20:34 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Dave Hansen, linux-mm, linux-ext4, linux-kernel

On Fri 09-08-13 10:36:41, Andy Lutomirski wrote:
> On Fri, Aug 9, 2013 at 12:55 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 08-08-13 15:58:39, Dave Hansen wrote:
> >> I was coincidentally tracking down what I thought was a scalability
> >> problem (turned out to be full disks :).  I noticed, though, that ext4
> >> is about 20% slower than ext2/3 at doing write page faults (x-axis is
> >> number of tasks):
> >>
> >> http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
> >>
> >> The test case is:
> >>
> >>       https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> >   The reason is that ext2/ext3 do almost nothing in their write fault
> > handler - they are about as fast as it can get. ext4 OTOH needs to reserve
> > blocks for delayed allocation, setup buffers under a page etc. This is
> > necessary if you want to make sure that if data are written via mmap, they
> > also have space available on disk to be written to (ext2 / ext3 do not care
> > and will just drop the data on the floor if you happen to hit ENOSPC during
> > writeback).
> 
> Out of curiosity, why does ext4 need to set up buffers?  That is, as
> long as the fs can guarantee that there is reserved space to write out
> the page, why isn't it sufficient to just mark the page dirty and let
> the writeback code set up the buffers?
  Well, because we track the fact that the space is reserved in the buffer
itself.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
  2013-08-09  7:55                   ` Jan Kara
@ 2013-08-12 22:44                     ` Dave Hansen
  -1 siblings, 0 replies; 40+ messages in thread
From: Dave Hansen @ 2013-08-12 22:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andy Lutomirski, linux-mm, linux-ext4, linux-kernel

On 08/09/2013 12:55 AM, Jan Kara wrote:
> On Thu 08-08-13 15:58:39, Dave Hansen wrote:
>> I was coincidentally tracking down what I thought was a scalability
>> problem (turned out to be full disks :).  I noticed, though, that ext4
>> is about 20% slower than ext2/3 at doing write page faults (x-axis is
>> number of tasks):
>>
>> http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&hide=linear,threads,threads_idle,processes_idle&rollPeriod=5
>>
>> The test case is:
>>
>> 	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>   The reason is that ext2/ext3 do almost nothing in their write fault
> handler - they are about as fast as it can get. ext4 OTOH needs to reserve
> blocks for delayed allocation, setup buffers under a page etc. This is
> necessary if you want to make sure that if data are written via mmap, they
> also have space available on disk to be written to (ext2 / ext3 do not care
> and will just drop the data on the floor if you happen to hit ENOSPC during
> writeback).
> 
> I'm not saying ext4 write fault path cannot possibly be optimized (noone
> seriously looked into that AFAIK so there may well be some low hanging
> fruit) but it will always be slower than ext2/3. A more meaningful
> comparison would be with filesystems like XFS which make similar guarantees
> regarding data safety.

ext4 beats xfs from what I can tell.  I ran with fewer steps to make the
testing faster, which is to blame for the stair-stepping, btw...

 http://www.sr71.net/~dave/intel/page-fault-exts/cmp.html?1=ext3&2=ext4&3=xfs&hide=linear,threads,threads_idle,processes_idle




^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2013-08-12 22:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-05 19:43 [RFC 0/3] Add madvise(..., MADV_WILLWRITE) Andy Lutomirski
2013-08-05 19:43 ` [RFC 1/3] mm: Add MADV_WILLWRITE to indicate that a range will be written to Andy Lutomirski
2013-08-05 19:44 ` [RFC 2/3] fs: Add block_willwrite Andy Lutomirski
2013-08-05 19:44 ` [RFC 3/3] ext4: Implement willwrite for the delalloc case Andy Lutomirski
2013-08-07 13:40 ` [RFC 0/3] Add madvise(..., MADV_WILLWRITE) Jan Kara
2013-08-07 17:02   ` Andy Lutomirski
2013-08-07 17:40   ` Dave Hansen
2013-08-07 18:00     ` Andy Lutomirski
2013-08-08 10:18       ` Jan Kara
2013-08-08 15:56         ` Andy Lutomirski
2013-08-08 18:53           ` Jan Kara
2013-08-08 19:25             ` Andy Lutomirski
2013-08-08 22:58               ` Dave Hansen
2013-08-09  7:55                 ` Jan Kara
2013-08-09 17:36                   ` Andy Lutomirski
2013-08-09 20:34                     ` Jan Kara
2013-08-09 17:42                   ` Dave Hansen
2013-08-09 17:44                     ` Andy Lutomirski
2013-08-12 22:44                   ` Dave Hansen
2013-08-09  0:11               ` Andy Lutomirski
