linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [V3 PATCH 1/2] tmpfs: add fallocate support
@ 2011-11-23  8:53 Cong Wang
  2011-11-23  8:53 ` [PATCH 2/2] fs: wire up .truncate_range and .fallocate Cong Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Cong Wang @ 2011-11-23  8:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, Pekka Enberg, Christoph Hellwig, Hugh Dickins, Dave Hansen,
	Lennart Poettering, Kay Sievers, KOSAKI Motohiro, WANG Cong,
	linux-mm

Systemd needs tmpfs to support fallocate [1], to be able
to safely use mmap(), regarding SIGBUS, on files on the
/dev/shm filesystem. The glibc fallback loop for -ENOSYS
on fallocate is just ugly.

This patch adds fallocate support to tmpfs, and as we
already have shmem_truncate_range(), it is also easy to
add FALLOC_FL_PUNCH_HOLE support too.

1. http://lkml.org/lkml/2011/10/20/275

V2->V3:
a) Read i_size directly after holding i_mutex;
b) Call page_cache_release() too after shmem_getpage();
c) Undo previous changes when -ENOSPC.

Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: WANG Cong <amwang@redhat.com>

---
 mm/shmem.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..65f7a27 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
 #include <linux/mm.h>
 #include <linux/export.h>
 #include <linux/swap.h>
+#include <linux/falloc.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -1431,6 +1432,69 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
 	return error;
 }
 
+static void shmem_truncate_page(struct inode *inode, pgoff_t index)
+{
+	loff_t start = index << PAGE_CACHE_SHIFT;
+	loff_t end = ((index + 1) << PAGE_CACHE_SHIFT) - 1;
+	shmem_truncate_range(inode, start, end);
+}
+
+static long shmem_fallocate(struct file *file, int mode,
+				loff_t offset, loff_t len)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	pgoff_t start = offset >> PAGE_CACHE_SHIFT;
+	pgoff_t end = DIV_ROUND_UP((offset + len), PAGE_CACHE_SIZE);
+	pgoff_t index = start;
+	loff_t i_size;
+	struct page *page = NULL;
+	int ret = 0;
+
+	mutex_lock(&inode->i_mutex);
+	i_size = inode->i_size;
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		if (!(offset > i_size || (end << PAGE_CACHE_SHIFT) > i_size))
+			shmem_truncate_range(inode, offset,
+					     (end << PAGE_CACHE_SHIFT) - 1);
+		goto unlock;
+	}
+
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		ret = inode_newsize_ok(inode, (offset + len));
+		if (ret)
+			goto unlock;
+	}
+
+	while (index < end) {
+		ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
+		if (ret) {
+			if (ret == -ENOSPC)
+				goto undo;
+			else
+				goto unlock;
+		}
+		if (page) {
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		index++;
+	}
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && (index << PAGE_CACHE_SHIFT) > i_size)
+		i_size_write(inode, index << PAGE_CACHE_SHIFT);
+
+	goto unlock;
+
+undo:
+	while (index > start) {
+		shmem_truncate_page(inode, index);
+		index--;
+	}
+
+unlock:
+	mutex_unlock(&inode->i_mutex);
+	return ret;
+}
+
 static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2286,6 +2350,7 @@ static const struct file_operations shmem_file_operations = {
 	.fsync		= noop_fsync,
 	.splice_read	= shmem_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.fallocate	= shmem_fallocate,
 #endif
 };
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] fs: wire up .truncate_range and .fallocate
  2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
@ 2011-11-23  8:53 ` Cong Wang
  2011-11-23 10:38   ` Christoph Hellwig
  2011-11-23  9:06 ` [V3 PATCH 1/2] tmpfs: add fallocate support Pekka Enberg
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2011-11-23  8:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, Hugh Dickins, Al Viro, Christoph Hellwig, WANG Cong,
	Matthew Wilcox, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-fsdevel, linux-mm

As Hugh suggested, with FALLOC_FL_PUNCH_HOLE, we can use do_fallocate()
to implement madvise_remove and finally remove .truncate_range call back.

Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: WANG Cong <amwang@redhat.com>

---
 include/linux/fs.h |    1 -
 include/linux/mm.h |    3 ++-
 mm/madvise.c       |    8 +++++---
 mm/shmem.c         |    1 -
 mm/truncate.c      |   21 +++++++++++++--------
 5 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e313022..266df73 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1635,7 +1635,6 @@ struct inode_operations {
 	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
-	void (*truncate_range)(struct inode *, loff_t, loff_t);
 	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
 		      u64 len);
 } ____cacheline_aligned;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3dc3a8c..a47f744 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -951,7 +951,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
 extern int vmtruncate(struct inode *inode, loff_t offset);
-extern int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end);
+extern int vmtruncate_file_range(struct file *file, struct inode *inode,
+			loff_t offset, loff_t end);
 
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..05610d3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -194,7 +194,8 @@ static long madvise_remove(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end)
 {
-	struct address_space *mapping;
+	struct file *file;
+	struct inode *inode;
 	loff_t offset, endoff;
 	int error;
 
@@ -211,7 +212,8 @@ static long madvise_remove(struct vm_area_struct *vma,
 	if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
 		return -EACCES;
 
-	mapping = vma->vm_file->f_mapping;
+	file = vma->vm_file;
+	inode = file->f_mapping->host;
 
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -220,7 +222,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
 	/* vmtruncate_range needs to take i_mutex */
 	up_read(&current->mm->mmap_sem);
-	error = vmtruncate_range(mapping->host, offset, endoff);
+	error = vmtruncate_file_range(file, inode, offset, endoff);
 	down_read(&current->mm->mmap_sem);
 	return error;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 65f7a27..fce5667 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2356,7 +2356,6 @@ static const struct file_operations shmem_file_operations = {
 
 static const struct inode_operations shmem_inode_operations = {
 	.setattr	= shmem_setattr,
-	.truncate_range	= shmem_truncate_range,
 #ifdef CONFIG_TMPFS_XATTR
 	.setxattr	= shmem_setxattr,
 	.getxattr	= shmem_getxattr,
diff --git a/mm/truncate.c b/mm/truncate.c
index 632b15e..7c46539 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/falloc.h>
 #include "internal.h"
 
 
@@ -602,24 +603,28 @@ int vmtruncate(struct inode *inode, loff_t newsize)
 }
 EXPORT_SYMBOL(vmtruncate);
 
-int vmtruncate_range(struct inode *inode, loff_t lstart, loff_t lend)
+int vmtruncate_file_range(struct file *file, struct inode *inode,
+		     loff_t lstart, loff_t lend)
 {
 	struct address_space *mapping = inode->i_mapping;
 	loff_t holebegin = round_up(lstart, PAGE_SIZE);
 	loff_t holelen = 1 + lend - holebegin;
+	int err;
 
-	/*
-	 * If the underlying filesystem is not going to provide
-	 * a way to truncate a range of blocks (punch a hole) -
-	 * we should return failure right now.
-	 */
-	if (!inode->i_op->truncate_range)
+	if (!file->f_op->fallocate)
 		return -ENOSYS;
 
 	mutex_lock(&inode->i_mutex);
 	inode_dio_wait(inode);
 	unmap_mapping_range(mapping, holebegin, holelen, 1);
-	inode->i_op->truncate_range(inode, lstart, lend);
+	mutex_unlock(&inode->i_mutex);
+
+	err = do_fallocate(file, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
+		     holebegin, holelen);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
 	/* unmap again to remove racily COWed private pages */
 	unmap_mapping_range(mapping, holebegin, holelen, 1);
 	mutex_unlock(&inode->i_mutex);
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
  2011-11-23  8:53 ` [PATCH 2/2] fs: wire up .truncate_range and .fallocate Cong Wang
@ 2011-11-23  9:06 ` Pekka Enberg
  2011-11-23 19:07 ` Hugh Dickins
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Pekka Enberg @ 2011-11-23  9:06 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, akpm, Christoph Hellwig, Hugh Dickins, Dave Hansen,
	Lennart Poettering, Kay Sievers, KOSAKI Motohiro, linux-mm

On Wed, Nov 23, 2011 at 10:53 AM, Cong Wang <amwang@redhat.com> wrote:
> Systemd needs tmpfs to support fallocate [1], to be able
> to safely use mmap(), regarding SIGBUS, on files on the
> /dev/shm filesystem. The glibc fallback loop for -ENOSYS
> on fallocate is just ugly.
>
> This patch adds fallocate support to tmpfs, and as we
> already have shmem_truncate_range(), it is also easy to
> add FALLOC_FL_PUNCH_HOLE support too.
>
> 1. http://lkml.org/lkml/2011/10/20/275
>
> V2->V3:
> a) Read i_size directly after holding i_mutex;
> b) Call page_cache_release() too after shmem_getpage();
> c) Undo previous changes when -ENOSPC.
>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Lennart Poettering <lennart@poettering.net>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: WANG Cong <amwang@redhat.com>

Looks reasonable to me.

Acked-by: Pekka Enberg <penberg@kernel.org>

Did someone actually test this with systemd?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] fs: wire up .truncate_range and .fallocate
  2011-11-23  8:53 ` [PATCH 2/2] fs: wire up .truncate_range and .fallocate Cong Wang
@ 2011-11-23 10:38   ` Christoph Hellwig
  2011-11-23 19:16     ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2011-11-23 10:38 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, akpm, Hugh Dickins, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Andrea Arcangeli, Rik van Riel, Mel Gorman,
	Minchan Kim, Johannes Weiner, linux-fsdevel, linux-mm

On Wed, Nov 23, 2011 at 04:53:31PM +0800, Cong Wang wrote:
> +int vmtruncate_file_range(struct file *file, struct inode *inode,
> +		     loff_t lstart, loff_t lend)

We can always get the inode from file->f_path.dentry->d_inode, thus
passing both of them doesn't make much sense.

> +	if (!file->f_op->fallocate)
>  		return -ENOSYS;
>  
>  	mutex_lock(&inode->i_mutex);
>  	inode_dio_wait(inode);
>  	unmap_mapping_range(mapping, holebegin, holelen, 1);
> -	inode->i_op->truncate_range(inode, lstart, lend);
> +	mutex_unlock(&inode->i_mutex);
> +
> +	err = do_fallocate(file, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE,
> +		     holebegin, holelen);
> +	if (err)
> +		return err;
> +
> +	mutex_lock(&inode->i_mutex);
>  	/* unmap again to remove racily COWed private pages */
>  	unmap_mapping_range(mapping, holebegin, holelen, 1);
>  	mutex_unlock(&inode->i_mutex);

I suspect this should be turned inside out, just we do for normal
truncate.  That is instead of keeping vmtruncate(_file)_range we
should have a truncate_pagecache_range do to the actual zapping,
closely mirroring what we do for truncate.  In the best case we'd
even implement the regular truncate ones on top of the range version.

It also seems like all fallocate implementaions for far got away
without the unmap_mapping_range, so either people didn't test them
hard enough, or tmpfs doesn't need it either.  I fear the former
is true.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
  2011-11-23  8:53 ` [PATCH 2/2] fs: wire up .truncate_range and .fallocate Cong Wang
  2011-11-23  9:06 ` [V3 PATCH 1/2] tmpfs: add fallocate support Pekka Enberg
@ 2011-11-23 19:07 ` Hugh Dickins
  2011-11-24  3:18   ` Cong Wang
  2011-11-23 19:59 ` KOSAKI Motohiro
  2011-11-24  1:52 ` KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2011-11-23 19:07 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, akpm, Pekka Enberg, Christoph Hellwig, Dave Hansen,
	Lennart Poettering, Kay Sievers, KOSAKI Motohiro, linux-mm

On Wed, 23 Nov 2011, Cong Wang wrote:
> +
> +	while (index < end) {
> +		ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
> +		if (ret) {
> +			if (ret == -ENOSPC)
> +				goto undo;
...
> +undo:
> +	while (index > start) {
> +		shmem_truncate_page(inode, index);
> +		index--;
> +	}

As I said before, I won't actually be reviewing and testing this for
a week or two; but before this goes any further, must point out how
wrong it is.  Here you'll be deleting any pages in the range that were
already present before the failing fallocate().

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] fs: wire up .truncate_range and .fallocate
  2011-11-23 10:38   ` Christoph Hellwig
@ 2011-11-23 19:16     ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2011-11-23 19:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Cong Wang, linux-kernel, akpm, Al Viro, Matthew Wilcox,
	Andrea Arcangeli, Rik van Riel, Mel Gorman, Minchan Kim,
	Johannes Weiner, linux-fsdevel, linux-mm

On Wed, 23 Nov 2011, Christoph Hellwig wrote:
> 
> It also seems like all fallocate implementaions for far got away
> without the unmap_mapping_range, so either people didn't test them
> hard enough, or tmpfs doesn't need it either.  I fear the former
> is true.

They're saved by the funny little one-by-one unmap_mapping_range()
fallback in truncate_inode_page().  It's inefficient (in those rare
cases when someone is punching a hole somewhere that's mapped) and
we ought to do better, but we don't have an actual bug there.

Hugh

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
	if (page_mapped(page)) {
		unmap_mapping_range(mapping,
				   (loff_t)page->index << PAGE_CACHE_SHIFT,
				   PAGE_CACHE_SIZE, 0);
	}
	return truncate_complete_page(mapping, page);
}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
                   ` (2 preceding siblings ...)
  2011-11-23 19:07 ` Hugh Dickins
@ 2011-11-23 19:59 ` KOSAKI Motohiro
  2011-11-23 21:11   ` Pekka Enberg
  2011-11-24  1:52 ` KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 17+ messages in thread
From: KOSAKI Motohiro @ 2011-11-23 19:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, akpm, Pekka Enberg, Christoph Hellwig,
	Hugh Dickins, Dave Hansen, Lennart Poettering, Kay Sievers,
	linux-mm

> Systemd needs tmpfs to support fallocate [1], to be able
> to safely use mmap(), regarding SIGBUS, on files on the
> /dev/shm filesystem. The glibc fallback loop for -ENOSYS
> on fallocate is just ugly.

for EOPNOTSUPP?

glibc/sysdeps/unix/sysv/linux/i386/posix_fallocate.c
----------------
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
  if (__builtin_expect (__have_fallocate >= 0, 1))
# endif
    {
      int res = __call_fallocate (fd, 0, offset, len);
      if (! res)
        return 0;

# ifndef __ASSUME_FALLOCATE
      if (__builtin_expect (res == ENOSYS, 0))
        __have_fallocate = -1;
      else
# endif
        if (res != EOPNOTSUPP)
          return res;
    }
#endif

  return internal_fallocate (fd, offset, len);
}
--------------------------


But, ok, I'm now convinced this is needed. people strongly dislike to
receive SIGBUS. yes.


> This patch adds fallocate support to tmpfs, and as we
> already have shmem_truncate_range(), it is also easy to
> add FALLOC_FL_PUNCH_HOLE support too.
>
> 1. http://lkml.org/lkml/2011/10/20/275
>
> V2->V3:
> a) Read i_size directly after holding i_mutex;
> b) Call page_cache_release() too after shmem_getpage();
> c) Undo previous changes when -ENOSPC.
>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Lennart Poettering <lennart@poettering.net>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: WANG Cong <amwang@redhat.com>
>
> ---
>  mm/shmem.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 65 insertions(+), 0 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d672250..65f7a27 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -30,6 +30,7 @@
>  #include <linux/mm.h>
>  #include <linux/export.h>
>  #include <linux/swap.h>
> +#include <linux/falloc.h>
>
>  static struct vfsmount *shm_mnt;
>
> @@ -1431,6 +1432,69 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
>        return error;
>  }
>
> +static void shmem_truncate_page(struct inode *inode, pgoff_t index)
> +{
> +       loff_t start = index << PAGE_CACHE_SHIFT;
> +       loff_t end = ((index + 1) << PAGE_CACHE_SHIFT) - 1;
> +       shmem_truncate_range(inode, start, end);
> +}
> +
> +static long shmem_fallocate(struct file *file, int mode,
> +                               loff_t offset, loff_t len)
> +{
> +       struct inode *inode = file->f_path.dentry->d_inode;
> +       pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> +       pgoff_t end = DIV_ROUND_UP((offset + len), PAGE_CACHE_SIZE);
> +       pgoff_t index = start;
> +       loff_t i_size;
> +       struct page *page = NULL;
> +       int ret = 0;

do_fallocate has following file type check.

        if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
                return -ENODEV;

However, this implementation don't support dir allocation and/or punch hole.
ext4's ext4_punch_hole() has following additional check. Maybe we need similar
check.

        if (!S_ISREG(inode->i_mode))
                return -ENOTSUPP;


> +       mutex_lock(&inode->i_mutex);
> +       i_size = inode->i_size;
> +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> +               if (!(offset > i_size || (end << PAGE_CACHE_SHIFT) > i_size))

Seems incorrect.
fallocate(PUNCH, 0, very_big_number) should punch to a range of [0, end).


> +                       shmem_truncate_range(inode, offset,
> +                                            (end << PAGE_CACHE_SHIFT) - 1);
> +               goto unlock;
> +       }
> +
> +       if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> +               ret = inode_newsize_ok(inode, (offset + len));
> +               if (ret)
> +                       goto unlock;
> +       }
> +       while (index < end) {
> +               ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
> +               if (ret) {
> +                       if (ret == -ENOSPC)
> +                               goto undo;
> +                       else
> +                               goto unlock;
> +               }
> +               if (page) {
> +                       unlock_page(page);
> +                       page_cache_release(page);
> +               }
> +               index++;
> +       }
> +       if (!(mode & FALLOC_FL_KEEP_SIZE) && (index << PAGE_CACHE_SHIFT) > i_size)
> +               i_size_write(inode, index << PAGE_CACHE_SHIFT);

Seems incorrect.
new i_size should be offset+len. our round-up is implementation detail
and don't have to expose
to userland.

> +
> +       goto unlock;
> +
> +undo:
> +       while (index > start) {
> +               shmem_truncate_page(inode, index);
> +               index--;

Hmmm...
seems too aggressive truncate if the file has pages before starting fallocate.
but I have no idea to make better undo. ;)


> +       }
> +
> +unlock:
> +       mutex_unlock(&inode->i_mutex);
> +       return ret;
> +}
> +
>  static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
>  {
>        struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
> @@ -2286,6 +2350,7 @@ static const struct file_operations shmem_file_operations = {
>        .fsync          = noop_fsync,
>        .splice_read    = shmem_file_splice_read,
>        .splice_write   = generic_file_splice_write,
> +       .fallocate      = shmem_fallocate,
>  #endif
>  };

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23 19:59 ` KOSAKI Motohiro
@ 2011-11-23 21:11   ` Pekka Enberg
  2011-11-23 22:20     ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2011-11-23 21:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cong Wang, linux-kernel, akpm, Christoph Hellwig, Hugh Dickins,
	Dave Hansen, Lennart Poettering, Kay Sievers, linux-mm

On Wed, Nov 23, 2011 at 9:59 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
>> +
>> +       goto unlock;
>> +
>> +undo:
>> +       while (index > start) {
>> +               shmem_truncate_page(inode, index);
>> +               index--;
>
> Hmmm...
> seems too aggressive truncate if the file has pages before starting fallocate.
> but I have no idea to make better undo. ;)

Why do we need to undo anyway?

                                Pekka

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23 21:11   ` Pekka Enberg
@ 2011-11-23 22:20     ` Hugh Dickins
  2011-11-24  3:15       ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2011-11-23 22:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: KOSAKI Motohiro, Cong Wang, linux-kernel, akpm,
	Christoph Hellwig, Dave Hansen, Lennart Poettering, Kay Sievers,
	linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1513 bytes --]

On Wed, 23 Nov 2011, Pekka Enberg wrote:
> On Wed, Nov 23, 2011 at 9:59 PM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> >> +
> >> +       goto unlock;
> >> +
> >> +undo:
> >> +       while (index > start) {
> >> +               shmem_truncate_page(inode, index);
> >> +               index--;
> >
> > Hmmm...
> > seems too aggressive truncate if the file has pages before starting fallocate.
> > but I have no idea to make better undo. ;)
> 
> Why do we need to undo anyway?

One answer comes earlier in this thread:

On Mon, Nov 21, 2011, Christoph Hellwig write:
> On Sun, Nov 20, 2011 at 01:22:10PM -0800, Hugh Dickins wrote:
> > First question that springs to mind (to which I shall easily find
> > an answer): is it actually acceptable for fallocate() to return
> > -ENOSPC when it has already completed a part of the work?
> 
> No, it must undo all allocations if it returns ENOSPC.

Another answer would be: if fallocate() had been defined to return
the length that has been successfully allocated (as write() returns
the length written), then it would be reasonable to return partial
length instead of failing with ENOSPC, and not undo.  But it was
defined to return -1 on failure or 0 on success, so cannot report
partial success.

Another answer would be: if the disk is near full, it's not good
for a fallocate() to fail with -ENOSPC while nonetheless grabbing
all the remaining blocks; even worse if another fallocate() were
racing with it.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
                   ` (3 preceding siblings ...)
  2011-11-23 19:59 ` KOSAKI Motohiro
@ 2011-11-24  1:52 ` KAMEZAWA Hiroyuki
  2011-11-24  2:46   ` KOSAKI Motohiro
  4 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-24  1:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, akpm, Pekka Enberg, Christoph Hellwig,
	Hugh Dickins, Dave Hansen, Lennart Poettering, Kay Sievers,
	KOSAKI Motohiro, linux-mm


I have a question.

On Wed, 23 Nov 2011 16:53:30 +0800
Cong Wang <amwang@redhat.com> wrote:

> Systemd needs tmpfs to support fallocate [1], to be able
> to safely use mmap(), regarding SIGBUS, on files on the
> /dev/shm filesystem. The glibc fallback loop for -ENOSYS
> on fallocate is just ugly.
> 
> This patch adds fallocate support to tmpfs, and as we
> already have shmem_truncate_range(), it is also easy to
> add FALLOC_FL_PUNCH_HOLE support too.
> 
> 1. http://lkml.org/lkml/2011/10/20/275
> 
> V2->V3:
> a) Read i_size directly after holding i_mutex;
> b) Call page_cache_release() too after shmem_getpage();
> c) Undo previous changes when -ENOSPC.
> 
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Cc: Lennart Poettering <lennart@poettering.net>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: WANG Cong <amwang@redhat.com>
> 
> ---
>  mm/shmem.c |   65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 65 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d672250..65f7a27 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -30,6 +30,7 @@
>  #include <linux/mm.h>
>  #include <linux/export.h>
>  #include <linux/swap.h>
> +#include <linux/falloc.h>
>  
>  static struct vfsmount *shm_mnt;
>  
> @@ -1431,6 +1432,69 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
>  	return error;
>  }
>  
> +static void shmem_truncate_page(struct inode *inode, pgoff_t index)
> +{
> +	loff_t start = index << PAGE_CACHE_SHIFT;
> +	loff_t end = ((index + 1) << PAGE_CACHE_SHIFT) - 1;
> +	shmem_truncate_range(inode, start, end);
> +}
> +
> +static long shmem_fallocate(struct file *file, int mode,
> +				loff_t offset, loff_t len)
> +{
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> +	pgoff_t end = DIV_ROUND_UP((offset + len), PAGE_CACHE_SIZE);
> +	pgoff_t index = start;
> +	loff_t i_size;
> +	struct page *page = NULL;
> +	int ret = 0;
> +
> +	mutex_lock(&inode->i_mutex);
> +	i_size = inode->i_size;
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		if (!(offset > i_size || (end << PAGE_CACHE_SHIFT) > i_size))
> +			shmem_truncate_range(inode, offset,
> +					     (end << PAGE_CACHE_SHIFT) - 1);
> +		goto unlock;
> +	}
> +
> +	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> +		ret = inode_newsize_ok(inode, (offset + len));
> +		if (ret)
> +			goto unlock;
> +	}
> +
> +	while (index < end) {
> +		ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);

If the 'page' for index exists before this call, this will return the page without
allocaton.

Then, the page may not be zero-cleared. I think the page should be zero-cleared.
But I'm not sure when we do zero-clear them.

But I'm not sure how fallocate should work at error.
Assume  some block already exists before fallocate(), possible side-effect will be
 - the contents will be zero-cleared even if fallocate fails.
 - the contents will be deallocated in undo path if fallocate fails.
?
maybe updates to man(2) fallocate will be appreciated...

Anyway, don't you need zero-clear when you find an existing pages here ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-24  1:52 ` KAMEZAWA Hiroyuki
@ 2011-11-24  2:46   ` KOSAKI Motohiro
  2011-11-24  3:01     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 17+ messages in thread
From: KOSAKI Motohiro @ 2011-11-24  2:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Cong Wang, linux-kernel, akpm, Pekka Enberg, Christoph Hellwig,
	Hugh Dickins, Dave Hansen, Lennart Poettering, Kay Sievers,
	linux-mm

>> +     while (index < end) {
>> +             ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
>
> If the 'page' for index exists before this call, this will return the page without
> allocaton.
>
> Then, the page may not be zero-cleared. I think the page should be zero-cleared.

No. fallocate shouldn't destroy existing data. It only ensure
subsequent file access don't make ENOSPC error.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-24  2:46   ` KOSAKI Motohiro
@ 2011-11-24  3:01     ` KAMEZAWA Hiroyuki
  2011-11-24  3:22       ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-24  3:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cong Wang, linux-kernel, akpm, Pekka Enberg, Christoph Hellwig,
	Hugh Dickins, Dave Hansen, Lennart Poettering, Kay Sievers,
	linux-mm

On Wed, 23 Nov 2011 21:46:39 -0500
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> >> +     while (index < end) {
> >> +             ret = shmem_getpage(inode, index, &page, SGP_WRITE, NULL);
> >
> > If the 'page' for index exists before this call, this will return the page without
> > allocaton.
> >
> > Then, the page may not be zero-cleared. I think the page should be zero-cleared.
> 
> No. fallocate shouldn't destroy existing data. It only ensure
> subsequent file access don't make ENOSPC error.
> 
      FALLOC_FL_KEEP_SIZE
              This flag allocates and initializes to zero the disk  space
              within the range specified by offset and len. ....

just manual is unclear ? it seems that the range [offset, offset+len) is
zero cleared after the call.

Thanks,
-kame


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23 22:20     ` Hugh Dickins
@ 2011-11-24  3:15       ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2011-11-24  3:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pekka Enberg, KOSAKI Motohiro, linux-kernel, akpm,
	Christoph Hellwig, Dave Hansen, Lennart Poettering, Kay Sievers,
	linux-mm

于 2011年11月24日 06:20, Hugh Dickins 写道:
> On Wed, 23 Nov 2011, Pekka Enberg wrote:
>>
>> Why do we need to undo anyway?
...
> Another answer would be: if fallocate() had been defined to return
> the length that has been successfully allocated (as write() returns
> the length written), then it would be reasonable to return partial
> length instead of failing with ENOSPC, and not undo.  But it was
> defined to return -1 on failure or 0 on success, so cannot report
> partial success.
>
> Another answer would be: if the disk is near full, it's not good
> for a fallocate() to fail with -ENOSPC while nonetheless grabbing
> all the remaining blocks; even worse if another fallocate() were
> racing with it.

Exactly, fallocate() should not make the bad situation even worse.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-23 19:07 ` Hugh Dickins
@ 2011-11-24  3:18   ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2011-11-24  3:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, akpm, Pekka Enberg, Christoph Hellwig, Dave Hansen,
	Lennart Poettering, Kay Sievers, KOSAKI Motohiro, linux-mm

于 2011年11月24日 03:07, Hugh Dickins 写道:
> On Wed, 23 Nov 2011, Cong Wang wrote:
>> +
>> +	while (index<  end) {
>> +		ret = shmem_getpage(inode, index,&page, SGP_WRITE, NULL);
>> +		if (ret) {
>> +			if (ret == -ENOSPC)
>> +				goto undo;
> ...
>> +undo:
>> +	while (index>  start) {
>> +		shmem_truncate_page(inode, index);
>> +		index--;
>> +	}
>
> As I said before, I won't actually be reviewing and testing this for
> a week or two; but before this goes any further, must point out how
> wrong it is.  Here you'll be deleting any pages in the range that were
> already present before the failing fallocate().

Ah, I totally missed this. So, is there any way to tell if the page
gotten from shmem_getpage() is newly allocated or not?

I will dig the code...

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-24  3:01     ` KAMEZAWA Hiroyuki
@ 2011-11-24  3:22       ` Cong Wang
  2011-11-24  4:23         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2011-11-24  3:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, linux-kernel, akpm, Pekka Enberg,
	Christoph Hellwig, Hugh Dickins, Dave Hansen, Lennart Poettering,
	Kay Sievers, linux-mm

于 2011年11月24日 11:01, KAMEZAWA Hiroyuki 写道:
> On Wed, 23 Nov 2011 21:46:39 -0500
> KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>  wrote:
>
>>>> +     while (index<  end) {
>>>> +             ret = shmem_getpage(inode, index,&page, SGP_WRITE, NULL);
>>>
>>> If the 'page' for index exists before this call, this will return the page without
>>> allocaton.
>>>
>>> Then, the page may not be zero-cleared. I think the page should be zero-cleared.
>>
>> No. fallocate shouldn't destroy existing data. It only ensure
>> subsequent file access don't make ENOSPC error.
>>
>        FALLOC_FL_KEEP_SIZE
>                This flag allocates and initializes to zero the disk  space
>                within the range specified by offset and len. ....
>
> just manual is unclear ? it seems that the range [offset, offset+len) is
> zero cleared after the call.

I think we should fix the man page, because at least ext4 doesn't clear
the original contents,

% echo hi > /tmp/foobar
% fallocate -n -l 1 -o 10 /tmp/foobar
% hexdump -Cv /tmp/foobar
00000000  68 69 0a                                          |hi.|
00000003

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-24  3:22       ` Cong Wang
@ 2011-11-24  4:23         ` KAMEZAWA Hiroyuki
  2011-11-24  5:52           ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-24  4:23 UTC (permalink / raw)
  To: Cong Wang
  Cc: KOSAKI Motohiro, linux-kernel, akpm, Pekka Enberg,
	Christoph Hellwig, Hugh Dickins, Dave Hansen, Lennart Poettering,
	Kay Sievers, linux-mm

On Thu, 24 Nov 2011 11:22:34 +0800
Cong Wang <amwang@redhat.com> wrote:

> 于 2011年11月24日 11:01, KAMEZAWA Hiroyuki 写道:
> > On Wed, 23 Nov 2011 21:46:39 -0500
> > KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>  wrote:
> >
> >>>> +     while (index<  end) {
> >>>> +             ret = shmem_getpage(inode, index,&page, SGP_WRITE, NULL);
> >>>
> >>> If the 'page' for index exists before this call, this will return the page without
> >>> allocaton.
> >>>
> >>> Then, the page may not be zero-cleared. I think the page should be zero-cleared.
> >>
> >> No. fallocate shouldn't destroy existing data. It only ensure
> >> subsequent file access don't make ENOSPC error.
> >>
> >        FALLOC_FL_KEEP_SIZE
> >                This flag allocates and initializes to zero the disk  space
> >                within the range specified by offset and len. ....
> >
> > just manual is unclear ? it seems that the range [offset, offset+len) is
> > zero cleared after the call.
> 
> I think we should fix the man page, because at least ext4 doesn't clear
> the original contents,
> 
> % echo hi > /tmp/foobar
> % fallocate -n -l 1 -o 10 /tmp/foobar
> % hexdump -Cv /tmp/foobar
> 00000000  68 69 0a                                          |hi.|
> 00000003
> 

thank you for checking. So, at failure path, original data should not be
cleared, either.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [V3 PATCH 1/2] tmpfs: add fallocate support
  2011-11-24  4:23         ` KAMEZAWA Hiroyuki
@ 2011-11-24  5:52           ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2011-11-24  5:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, linux-kernel, akpm, Pekka Enberg,
	Christoph Hellwig, Hugh Dickins, Dave Hansen, Lennart Poettering,
	Kay Sievers, linux-mm

于 2011年11月24日 12:23, KAMEZAWA Hiroyuki 写道:
> 
> thank you for checking. So, at failure path, original data should not be
> cleared, either.
> 

Yes, sure, I will fix that.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-11-24  5:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-23  8:53 [V3 PATCH 1/2] tmpfs: add fallocate support Cong Wang
2011-11-23  8:53 ` [PATCH 2/2] fs: wire up .truncate_range and .fallocate Cong Wang
2011-11-23 10:38   ` Christoph Hellwig
2011-11-23 19:16     ` Hugh Dickins
2011-11-23  9:06 ` [V3 PATCH 1/2] tmpfs: add fallocate support Pekka Enberg
2011-11-23 19:07 ` Hugh Dickins
2011-11-24  3:18   ` Cong Wang
2011-11-23 19:59 ` KOSAKI Motohiro
2011-11-23 21:11   ` Pekka Enberg
2011-11-23 22:20     ` Hugh Dickins
2011-11-24  3:15       ` Cong Wang
2011-11-24  1:52 ` KAMEZAWA Hiroyuki
2011-11-24  2:46   ` KOSAKI Motohiro
2011-11-24  3:01     ` KAMEZAWA Hiroyuki
2011-11-24  3:22       ` Cong Wang
2011-11-24  4:23         ` KAMEZAWA Hiroyuki
2011-11-24  5:52           ` Cong Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).