* ext4_fallocate
@ 2012-06-25  6:42 Fredrick
  2012-06-25  7:33 ` ext4_fallocate Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Fredrick @ 2012-06-25  6:42 UTC (permalink / raw)
  To: linux-ext4

Hello Ext4 developers,

When calling fallocate on an ext4 fs, ext4_fallocate does not initialize
the extents. The extents are allocated only when they are actually
written. This is causing a problem for us. Our programs create many
"write only once" files as large as 1G on ext4 very rapidly at times.
We thought fallocate would solve the problem, but it didn't.
If I change the EXT4_GET_BLOCKS_CREATE_UNINIT_EXT to just
EXT4_GET_BLOCKS_CREATE in the ext4_map_blocks call in ext4_fallocate,
the extents get created in the fallocate call itself. This is helping us.
Now the write throughput to the disk was close to 98%. When extents
were not initialized, our disk throughput was only 70%.

Can this change be made to ext4_fallocate?
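
For reference, the pattern described above boils down to something like
this (a minimal sketch; the path, sizes, and fill byte are illustrative,
not from our actual programs):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 1024L * 1024 * 1024;	/* ~1G, written once */
	static char buf[1 << 20];		/* 1M sequential writes */
	int fd = open("/data/output.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || fallocate(fd, 0, 0, size) != 0) {
		perror("open/fallocate");
		exit(1);
	}
	memset(buf, 0xab, sizeof(buf));
	/* "write only once": fill the preallocated range sequentially */
	for (off_t off = 0; off < size; off += sizeof(buf))
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
			perror("pwrite");
			exit(1);
		}
	close(fd);
	return 0;
}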

Thanks,
Fredrick



* Re: ext4_fallocate
  2012-06-25  6:42 ext4_fallocate Fredrick
@ 2012-06-25  7:33 ` Andreas Dilger
  2012-06-28 15:12   ` ext4_fallocate Phillip Susi
  2012-06-25  8:51 ` ext4_fallocate Zheng Liu
  2012-06-26 13:06 ` ext4_fallocate Eric Sandeen
  2 siblings, 1 reply; 42+ messages in thread
From: Andreas Dilger @ 2012-06-25  7:33 UTC (permalink / raw)
  To: Fredrick; +Cc: linux-ext4

On 2012-06-25, at 12:42 AM, Fredrick wrote:
> When calling fallocate on an ext4 fs, ext4_fallocate does not initialize
> the extents.

The whole point of fallocate() is that the extents are uninitialized
(i.e. not zero-filled with data, but read back as if they were), and
only overwritten with user data as needed.
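
For instance (a small sketch; the file name is illustrative), a read from
a freshly fallocated region returns zeros, never old disk contents:

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/tmp/prealloc.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);

	assert(fd >= 0 && fallocate(fd, 0, 0, 1 << 20) == 0);
	assert(pread(fd, buf, sizeof(buf), 0) == (ssize_t)sizeof(buf));
	for (size_t i = 0; i < sizeof(buf); i++)
		assert(buf[i] == 0);	/* unwritten extents read as zeros */
	close(fd);
	return 0;
}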

> The extents are allocated only when they are actually
> written. This is causing a problem for us. Our programs create many
> "write only once" files as large as 1G on ext4 very rapidly at times.
> We thought fallocate would solve the problem, but it didn't.
> If I change the EXT4_GET_BLOCKS_CREATE_UNINIT_EXT to just
> EXT4_GET_BLOCKS_CREATE in the ext4_map_blocks call in ext4_fallocate,
> the extents get created in the fallocate call itself. This is helping us.

This means that if any user reads the fallocate'd file, they will be able
to read any old data that was previously written to disk and deleted.  This
is a big security hole.  Also, if the writer of such a file is interrupted
for some reason, the file will contain garbage.

There was a recent patch series "ext4: add an io-tree to track block
allocation" that may improve the performance for your case of overwrite
of uninitialized files, but it hasn't landed yet.

> Now the write throughput to the disk was close to 98%. When extents
> were not initialized, our disk throughput was only 70%.
> 
> Can this change be made to ext4_fallocate?

Not if you value data security.  That said, it is possible some users value
performance over security on systems that have limited access, or do not
have any private data on them.  If the "io-tree" doesn't fix the performance
problem, it may make sense to add a mount option to allow users to specify
that uninitialized extents can return old data from disk, but before that
happens, I'd really prefer to make the correct behaviour work efficiently.

Cheers, Andreas







* Re: ext4_fallocate
  2012-06-25  6:42 ext4_fallocate Fredrick
  2012-06-25  7:33 ` ext4_fallocate Andreas Dilger
@ 2012-06-25  8:51 ` Zheng Liu
  2012-06-25 19:04   ` ext4_fallocate Fredrick
  2012-06-25 19:17   ` ext4_fallocate Theodore Ts'o
  2012-06-26 13:06 ` ext4_fallocate Eric Sandeen
  2 siblings, 2 replies; 42+ messages in thread
From: Zheng Liu @ 2012-06-25  8:51 UTC (permalink / raw)
  To: Fredrick; +Cc: linux-ext4, Andreas Dilger

On Sun, Jun 24, 2012 at 11:42:55PM -0700, Fredrick wrote:
> Hello Ext4 developers,
> 
> When calling fallocate on an ext4 fs, ext4_fallocate does not initialize
> the extents. The extents are allocated only when they are actually
> written. This is causing a problem for us. Our programs create many
> "write only once" files as large as 1G on ext4 very rapidly at times.
> We thought fallocate would solve the problem, but it didn't.
> If I change the EXT4_GET_BLOCKS_CREATE_UNINIT_EXT to just
> EXT4_GET_BLOCKS_CREATE in the ext4_map_blocks call in ext4_fallocate,
> the extents get created in the fallocate call itself. This is helping us.
> Now the write throughput to the disk was close to 98%. When extents
> were not initialized, our disk throughput was only 70%.
> 
> Can this change be made to ext4_fallocate?

Hi Fredrick,

I think this patch may help you. :-)

Actually, I wanted to send you a URL from the Linux mailing list archive,
but I cannot find it.  After applying this patch, you can call ioctl(2) to
enable the expose_stale_data flag, and then when you call fallocate(2),
ext4 creates initialized extents for you.  This patch cannot be merged
into the upstream kernel because it opens a huge security hole.
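
With the patch below applied, the calling sequence would look roughly like
this (a sketch only; EXT4_IOC_SET_EXPOSE_STALE is defined by the patch and
not by any mainline header, and the helper is illustrative):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>

/* Defined by the patch below; not in any mainline header. */
#define EXT4_IOC_SET_EXPOSE_STALE	_IOW('f', 18, int)

int preallocate_initialized(const char *path, off_t len)
{
	int enable = 1;
	int fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	/* Requires CAP_SYS_RAWIO; see the capable() check in the patch. */
	if (ioctl(fd, EXT4_IOC_SET_EXPOSE_STALE, &enable) < 0)
		return -1;
	/* With the flag enabled, fallocate creates initialized extents. */
	return fallocate(fd, 0, 0, len);
}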

Regards,
Zheng

From: Zheng Liu <wenqing.lz@taobao.com>
Date: Wed, 6 Jun 2012 11:10:57 +0800
Subject: [RFC][PATCH v2] ext4: add expose_stale_data flag in fallocate

Here is v2 of FALLOC_FL_NO_HIDE_STALE in fallocate.  No new flag is added
to the VFS, in order to reduce the impact and avoid a huge security hole.
The application can no longer request this behavior with a new fallocate
flag; it needs to call an ioctl to enable/disable the feature.  In that
ioctl, the filesystem checks CAP_SYS_RAWIO to ensure that the user has
the privilege to switch it on or off.  Currently, I have implemented it
only in ext4.

Even though I have tried to reduce its impact, this feature still opens
a security hole.  The application must therefore ensure that it
initializes an unwritten extent itself before reading it, and the
feature should only be used in a limited environment.

v1 -> v2:
* remove FALLOC_FL_NO_HIDE_STALE flag in vfs
* add 'i_expose_stale_data' in ext4 to enable/disable it

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h    |    5 +++++
 fs/ext4/extents.c |    6 +++++-
 fs/ext4/ioctl.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/super.c   |    1 +
 4 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cfc4e01..61da070 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -606,6 +606,8 @@ enum {
 #define EXT4_IOC_ALLOC_DA_BLKS		_IO('f', 12)
 #define EXT4_IOC_MOVE_EXT		_IOWR('f', 15, struct move_extent)
 #define EXT4_IOC_RESIZE_FS		_IOW('f', 16, __u64)
+#define EXT4_IOC_GET_EXPOSE_STALE	_IOR('f', 17, int)
+#define EXT4_IOC_SET_EXPOSE_STALE	_IOW('f', 18, int)
 
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
@@ -925,6 +927,9 @@ struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+
+	/* expose stale data in creating a new extent */
+	int i_expose_stale_data;
 };
 
 /*
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 91341ec..9ef883c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4336,6 +4336,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
 long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	handle_t *handle;
 	loff_t new_size;
 	unsigned int max_blocks;
@@ -4379,7 +4380,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
 		return ret;
 	}
-	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
+	if (ei->i_expose_stale_data)
+		flags = EXT4_GET_BLOCKS_CREATE;
+	else
+		flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 	/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 8ad112a..fffb3eb 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -445,6 +445,47 @@ resizefs_out:
 		return 0;
 	}
 
+	case EXT4_IOC_GET_EXPOSE_STALE: {
+		int enable;
+
+		/* security check */
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+
+		/*
+		 * currently only extent-based files support (pre)allocate with
+		 * EXPOSE_STALE_DATA flag
+		 */
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+			return -EOPNOTSUPP;
+
+		enable = ei->i_expose_stale_data;
+
+		return put_user(enable, (int __user *) arg);
+	}
+
+	case EXT4_IOC_SET_EXPOSE_STALE: {
+		int enable;
+
+		/* security check */
+		if (!capable(CAP_SYS_RAWIO))
+			return -EPERM;
+
+		/*
+		 * currently only extent-based files support (pre)allocate with
+		 * EXPOSE_STALE_DATA flag
+		 */
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+			return -EOPNOTSUPP;
+
+		if (get_user(enable, (int __user *) arg))
+			return -EFAULT;
+
+		ei->i_expose_stale_data = enable;
+
+		return 0;
+	}
+
 	default:
 		return -ENOTTY;
 	}
@@ -508,6 +549,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case EXT4_IOC_MOVE_EXT:
 	case FITRIM:
 	case EXT4_IOC_RESIZE_FS:
+	case EXT4_IOC_GET_EXPOSE_STALE:
+	case EXT4_IOC_SET_EXPOSE_STALE:
 		break;
 	default:
 		return -ENOIOCTLCMD;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb7aa3e..3654bb8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -988,6 +988,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_aiodio_unwritten, 0);
+	ei->i_expose_stale_data = 0;
 
 	return &ei->vfs_inode;
 }
-- 
1.7.4.1



* Re: ext4_fallocate
  2012-06-25  8:51 ` ext4_fallocate Zheng Liu
@ 2012-06-25 19:04   ` Fredrick
  2012-06-25 19:17   ` ext4_fallocate Theodore Ts'o
  1 sibling, 0 replies; 42+ messages in thread
From: Fredrick @ 2012-06-25 19:04 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, adilger

On 06/25/2012 01:51 AM, Zheng Liu wrote:
> [full quote of Zheng's message and patch snipped]


Thanks, Zheng, for this nice patch.
Thanks, Andreas, for the comments.
I had written the following patch to do the same.

-Fredrick

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1a34c1c..7fe835c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -549,6 +549,7 @@ struct ext4_new_group_data {
   /* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */
  #define EXT4_IOC_ALLOC_DA_BLKS		_IO('f', 12)
  #define EXT4_IOC_MOVE_EXT		_IOWR('f', 15, struct move_extent)
+#define EXT4_IOC_INIT_EXT		_IOWR('f', 20, unsigned long)

  #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
  /*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 4cbe1c2..c66555e 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -335,6 +335,57 @@ mext_out:
  		mnt_drop_write(filp->f_path.mnt);
  		return err;
  	}
+	
+	case EXT4_IOC_INIT_EXT:
+	{
+
+		handle_t *handle;
+		unsigned int max_blocks;
+		unsigned int credits;
+		struct ext4_map_blocks map;
+		unsigned int blkbits = inode->i_blkbits;	
+		unsigned long offset = filp->f_pos;
+		unsigned long len = arg;
+		int ret = 0, ret2 = 0;
+
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+			return -EOPNOTSUPP;
+
+		mutex_lock(&inode->i_mutex);
+		if ((offset + len)> i_size_read(inode)) {
+			mutex_unlock(&inode->i_mutex);
+			return -EINVAL;
+		}
+		map.m_lblk = offset >> blkbits;
+		max_blocks = ((EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) -
+			      map.m_lblk);
+
+		credits = ext4_chunk_trans_blocks(inode, max_blocks);
+		while (ret >= 0 && ret < max_blocks) {
+			map.m_lblk += ret;
+			map.m_len = (max_blocks -= ret);
+			handle = ext4_journal_start(inode, credits);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				break;
+			}
+			ret = ext4_map_blocks(handle, inode, &map,
+					      EXT4_GET_BLOCKS_CREATE);
+			if (ret <= 0) {
+				WARN_ON(ret <= 0);
+				printk(KERN_ERR "%s: ext4_map_blocks "
+					    "returned error inode#%lu, block=%u, "
+					    "max_blocks=%u", __func__,
+					    inode->i_ino, map.m_lblk, map.m_len);
+			}
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			if (ret <= 0 || ret2 )
+				break;
+		}
+		mutex_unlock(&inode->i_mutex);
+		return ret > 0 ? ret2 : ret;
+	}

  	case FITRIM:
  	{
@@ -432,6 +483,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
  		return err;
  	}
  	case EXT4_IOC_MOVE_EXT:
+	case EXT4_IOC_INIT_EXT:
  	case FITRIM:
  		break;
  	default:
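
A minimal sketch of how a caller would drive this ioctl (assuming the
patch above is applied; per its code, the length is the ioctl argument
and the offset is taken from the current file position):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Defined by the patch above; not in any mainline header. */
#define EXT4_IOC_INIT_EXT	_IOWR('f', 20, unsigned long)

int init_extents(int fd, off_t offset, unsigned long len)
{
	/* The ioctl reads the offset from filp->f_pos ... */
	if (lseek(fd, offset, SEEK_SET) < 0)
		return -1;
	/* ... and the length from the argument; the range must lie
	 * within i_size, so fallocate the file first. */
	return ioctl(fd, EXT4_IOC_INIT_EXT, len);
}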







* Re: ext4_fallocate
  2012-06-25  8:51 ` ext4_fallocate Zheng Liu
  2012-06-25 19:04   ` ext4_fallocate Fredrick
@ 2012-06-25 19:17   ` Theodore Ts'o
  2012-06-26  1:23     ` ext4_fallocate Fredrick
  2012-06-26 13:13     ` ext4_fallocate Ric Wheeler
  1 sibling, 2 replies; 42+ messages in thread
From: Theodore Ts'o @ 2012-06-25 19:17 UTC (permalink / raw)
  To: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On Mon, Jun 25, 2012 at 04:51:59PM +0800, Zheng Liu wrote:
> 
> Actually, I wanted to send you a URL from the Linux mailing list archive,
> but I cannot find it.  After applying this patch, you can call ioctl(2) to
> enable the expose_stale_data flag, and then when you call fallocate(2),
> ext4 creates initialized extents for you.  This patch cannot be merged
> into the upstream kernel because it opens a huge security hole.

This is what we're using internally inside Google.... this allows the
security exposure to be restricted to those programs running with a
specific group id (which is better than giving programs access to
CAP_SYS_RAWIO).  We also require the use of a specific fallocate flag
so that programs have to explicitly ask for this feature.

Also note that I restrict the combination of NO_HIDE_STALE &&
KEEP_SIZE since it causes e2fsck to complain --- and if you're trying
to avoid fs metadata I/O, you want to avoid the extra i_size update
anyway, so it's not worth trying to make this work w/o causing e2fsck
complaints.

This patch is versus the v3.3 kernel (as it happens, I was just in the
middle of rebasing this patch from 2.6.34 :-)

					- Ted

P.S.  It just occurred to me that there are some patches being
discussed that assign new fallocate flags for volatile data handling.
So it would probably be a good idea to move the fallocate flag
codepoint assignment up out of the way to avoid future conflicts.
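
For illustration, with this patch applied and the file system mounted
with nohide_stale_gid=<gid>, a writer running in that group would opt in
per call, something like this (a sketch; the flag value comes from the
patch below):

#define _GNU_SOURCE
#include <fcntl.h>

/* From the patch below; not in mainline falloc.h. */
#define FALLOC_FL_NO_HIDE_STALE	0x04

int prealloc_no_hide_stale(int fd, off_t len)
{
	/* NO_HIDE_STALE may not be combined with KEEP_SIZE (see above),
	 * so this extends i_size like a default fallocate would. */
	return fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, len);
}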

commit 5f12f1bc2b0fb0866d52763a611b022780780f05
Author: Theodore Ts'o <tytso@google.com>
Date:   Fri Jun 22 17:19:53 2012 -0400

    ext4: add a fallocate flag to mark newly allocated extents initialized

    This commit adds a new flag to ext4's fallocate that allows new,
    uninitialized extents to be marked as initialized. This flag,
    FALLOC_FL_NO_HIDE_STALE, requires that the nohide_stale_gid=<gid> mount
    option be used when the file system is mounted, and that the user be
    in the group <gid>.

    The benefit is to a program that fallocates a large space but then
    writes to that space in small increments.  This option prevents ext4
    from having to split the unallocated extent and merge the newly
    initialized extent with the extent to its left.  Even though this
    usually happens in memory, this option is useful for tight memory
    situations and for ext4 on flash.  Note: this allows an application in
    the nohide_stale group to see stale data on the filesystem.
    
    Tested: Updated xfstests g002 to test a case where
      fallocate:no-hide-stale is not allowed.  The existing tests now pass
      because I added a remount with a group that user root is in.
    Rebase-Tested-v3.3: same
    
    Effort: fs/nohide-stale
    Origin-2.6.34-SHA1: c3099bf61be1baf94bc91c481995bb0d77f05786
    Origin-2.6.34-SHA1: 004dd33b9ebc5d860781c3435526658cc8aa8ccb
    Change-Id: I0d2a7f2a4cf34443269acbcedb7b7074e0055e69

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index aaaece6..ac7aa42 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1240,6 +1240,9 @@ struct ext4_sb_info {
 	unsigned long s_mb_last_group;
 	unsigned long s_mb_last_start;
 
+	/* gid that's allowed to see stale data via falloc flag. */
+	gid_t no_hide_stale_gid;
+
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
 	atomic_t s_bal_success;	/* we found long enough chunks */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index cb99346..cc57c85 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4375,6 +4375,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	int retries = 0;
 	int flags;
 	struct ext4_map_blocks map;
+	struct ext4_sb_info *sbi;
 	unsigned int credits, blkbits = inode->i_blkbits;
 
 	/*
@@ -4385,12 +4386,28 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	/* Return error if mode is not supported */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
+		     FALLOC_FL_NO_HIDE_STALE))
+		return -EOPNOTSUPP;
+
+	/* The combination of NO_HIDE_STALE and KEEP_SIZE is not supported */
+	if ((mode & FALLOC_FL_NO_HIDE_STALE) &&
+	    (mode & FALLOC_FL_KEEP_SIZE))
 		return -EOPNOTSUPP;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		return ext4_punch_hole(file, offset, len);
 
+	sbi = EXT4_SB(inode->i_sb);
+	/* Must have RAWIO to see stale data. */
+	if ((mode & FALLOC_FL_NO_HIDE_STALE) &&
+	    !in_egroup_p(sbi->no_hide_stale_gid))
+		return -EACCES;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
 	trace_ext4_fallocate_enter(inode, offset, len, mode);
 	map.m_lblk = offset >> blkbits;
 	/*
@@ -4429,6 +4446,8 @@ retry:
 			ret = PTR_ERR(handle);
 			break;
 		}
+		if (mode & FALLOC_FL_NO_HIDE_STALE)
+			flags &= ~EXT4_GET_BLOCKS_UNINIT_EXT;
 		ret = ext4_map_blocks(handle, inode, &map, flags);
 		if (ret <= 0) {
 #ifdef EXT4FS_DEBUG
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5b443a8..d976ec1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1175,6 +1175,8 @@ static int ext4_show_options(struct seq_file *seq, struct dentry *root)
 	if (test_opt2(sb, BIG_EXT))
 		seq_puts(seq, ",big_extent");
 #endif
+	if (sbi->no_hide_stale_gid != -1)
+		seq_printf(seq, ",nohide_stale_gid=%u", sbi->no_hide_stale_gid);
 
 	ext4_show_quota_options(seq, sb);
 
@@ -1353,6 +1355,7 @@ enum {
 #ifdef CONFIG_EXT4_BIG_EXTENT
 	Opt_big_extent, Opt_nobig_extent,
 #endif
+	Opt_nohide_stale_gid,
 };
 
 static const match_table_t tokens = {
@@ -1432,6 +1435,7 @@ static const match_table_t tokens = {
 	{Opt_big_extent, "big_extent"},
 	{Opt_nobig_extent, "nobig_extent"},
 #endif
+	{Opt_nohide_stale_gid, "nohide_stale_gid=%u"},
 	{Opt_err, NULL},
 };
 
@@ -1931,6 +1935,12 @@ set_qf_format:
 				return 0;
 			sbi->s_li_wait_mult = option;
 			break;
+		case Opt_nohide_stale_gid:
+			if (match_int(&args[0], &option))
+				return 0;
+			/* -1 for disabled, otherwise it's valid. */
+			sbi->no_hide_stale_gid = option;
+			break;
 		case Opt_noinit_itable:
 			clear_opt(sb, INIT_INODE_TABLE);
 			break;
@@ -3274,6 +3284,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #ifdef CONFIG_EXT4_BIG_EXTENT
 	sbi->s_min_big_ext_size = EXT4_DEFAULT_MIN_BIG_EXT_SIZE;
 #endif
+	/* Default to having no-hide-stale disabled. */
+	sbi->no_hide_stale_gid = -1;
 
 	if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
 		set_opt(sb, BARRIER);
diff --git a/fs/open.c b/fs/open.c
index 201431a..4edc0cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -224,7 +224,9 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/* Return error if mode is not supported */
-	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+	if (mode & ~(FALLOC_FL_KEEP_SIZE |
+		     FALLOC_FL_PUNCH_HOLE |
+		     FALLOC_FL_NO_HIDE_STALE))
 		return -EOPNOTSUPP;
 
 	/* Punch hole must have keep size set */
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 73e0b62..a2489ac 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -3,6 +3,7 @@
 
 #define FALLOC_FL_KEEP_SIZE	0x01 /* default is extend size */
 #define FALLOC_FL_PUNCH_HOLE	0x02 /* de-allocates range */
+#define FALLOC_FL_NO_HIDE_STALE	0x04 /* default is hide stale data */
 
 #ifdef __KERNEL__
 


* Re: ext4_fallocate
  2012-06-25 19:17   ` ext4_fallocate Theodore Ts'o
@ 2012-06-26  1:23     ` Fredrick
  2012-06-26 13:13     ` ext4_fallocate Ric Wheeler
  1 sibling, 0 replies; 42+ messages in thread
From: Fredrick @ 2012-06-26  1:23 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, Andreas Dilger, wenqing.lz

On 06/25/2012 12:17 PM, Theodore Ts'o wrote:
> [full quote of Ted's message and patch snipped]

Thanks, Ted. This patch is very nice and addresses Andreas's
comment about using a mount option.

-Fredrick




* Re: ext4_fallocate
  2012-06-25  6:42 ext4_fallocate Fredrick
  2012-06-25  7:33 ` ext4_fallocate Andreas Dilger
  2012-06-25  8:51 ` ext4_fallocate Zheng Liu
@ 2012-06-26 13:06 ` Eric Sandeen
  2 siblings, 0 replies; 42+ messages in thread
From: Eric Sandeen @ 2012-06-26 13:06 UTC (permalink / raw)
  To: Fredrick; +Cc: linux-ext4

On 6/25/12 2:42 AM, Fredrick wrote:
> Hello Ext4 developers,
> 
> When calling fallocate on an ext4 fs, ext4_fallocate does not initialize
> the extents. The extents are allocated only when they are actually
> written. This is causing a problem for us. Our programs create many
> "write only once" files as large as 1G on ext4 very rapidly at times.
> We thought fallocate would solve the problem, but it didn't.

What actual problem is this?  Fallocate does indeed allocate the extents -
it picks real, physical blocks on the disk and allocates them to the file
in question.  If you map the file, you'll see those blocks.

The only thing it does not do is write data into them; instead it flags
them as unwritten, so that subsequent reads will return zeros.  So when
you say fallocate did not solve "this problem", which problem do you mean?
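
(One way to see this is via FIEMAP; a small sketch, with most error
handling trimmed: the extents are real, just flagged unwritten.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   32 * sizeof(struct fiemap_extent));
	int fd;

	if (argc < 2)
		return 2;
	fd = open(argv[1], O_RDONLY);
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = 32;
	if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;
	for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
		printf("logical %llu physical %llu len %llu unwritten=%d\n",
		       (unsigned long long)fm->fm_extents[i].fe_logical,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length,
		       !!(fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_UNWRITTEN));
	return 0;
}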

> If I change the EXT4_GET_BLOCKS_CREATE_UNINIT_EXT to just
> EXT4_GET_BLOCKS_CREATE in the ext4_map_blocks call in ext4_fallocate,
> the extents get created in the fallocate call itself. This is helping us.
> Now the write throughput to the disk was close to 98%. When extents
> were not initialized, our disk throughput was only 70%.
> 
> Can this change be made to ext4_fallocate?

Others have addressed the serious problems with allowing fallocate to
expose bits of stale user data, but a couple of things:

1) Can you describe the workload a bit more?  I'd be interested to know what
kind of workload you have which shows a 30% perf regression converting
unwritten extents.  So you fallocate 1G files, and then write to them how?

2) Are you sure zoho.com customers will be ok with the potential for
data exposure?  It's up to you I suppose to determine whether this is really
a risk, but you need to fully understand the implications of the patches
others have provided.  There are good reasons I and others have resisted
merging them upstream.

-Eric
 
> Thanks,
> Fredrick


* Re: ext4_fallocate
  2012-06-25 19:17   ` ext4_fallocate Theodore Ts'o
  2012-06-26  1:23     ` ext4_fallocate Fredrick
@ 2012-06-26 13:13     ` Ric Wheeler
  2012-06-26 17:30       ` ext4_fallocate Theodore Ts'o
  2012-06-26 18:05       ` ext4_fallocate Fredrick
  1 sibling, 2 replies; 42+ messages in thread
From: Ric Wheeler @ 2012-06-26 13:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On 06/25/2012 03:17 PM, Theodore Ts'o wrote:
> On Mon, Jun 25, 2012 at 04:51:59PM +0800, Zheng Liu wrote:
>> Actually, I wanted to send you a URL from the Linux mailing list archive,
>> but I cannot find it.  After applying this patch, you can call ioctl(2) to
>> enable the expose_stale_data flag, and then when you call fallocate(2),
>> ext4 creates initialized extents for you.  This patch cannot be merged
>> into the upstream kernel because it opens a huge security hole.
> This is what we're using internally inside Google.... this allows the
> security exposure to be restricted to those programs running with a
> specific group id (which is better than giving programs access to
> CAP_SYS_RAWIO).  We also require the use of a specific fallocate flag
> so that programs have to explicitly ask for this feature.
>
> Also note that I restrict the combination of NO_HIDE_STALE &&
> KEEP_SIZE since it causes e2fsck to complain --- and if you're trying
> to avoid fs metadata I/O, you want to avoid the extra i_size update
> anyway, so it's not worth trying to make this work w/o causing e2fsck
> complaints.
>
> This patch is versus the v3.3 kernel (as it happens, I was just in the
> middle of rebasing this patch from 2.6.34 :-)
>
> 					- Ted

Hi Ted,

Has anyone made progress digging into the performance impact of running without 
this patch? We should definitely see if there is some low hanging fruit there, 
especially given that XFS does not seem to suffer such a huge hit.

I think that we need to get a good reproducer for the workload that causes the 
pain and start to dig into this.

Opening this security exposure is still something that is clearly a hack and 
best avoided if we can fix the root cause :)

Ric

> [remainder of Ted's quoted message, patch, and list footer snipped]




* Re: ext4_fallocate
  2012-06-26 13:13     ` ext4_fallocate Ric Wheeler
@ 2012-06-26 17:30       ` Theodore Ts'o
  2012-06-26 18:06         ` ext4_fallocate Fredrick
  2012-06-26 18:21         ` ext4_fallocate Ric Wheeler
  2012-06-26 18:05       ` ext4_fallocate Fredrick
  1 sibling, 2 replies; 42+ messages in thread
From: Theodore Ts'o @ 2012-06-26 17:30 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On Tue, Jun 26, 2012 at 09:13:35AM -0400, Ric Wheeler wrote:
> 
> Has anyone made progress digging into the performance impact of
> running without this patch? We should definitely see if there is
> some low hanging fruit there, especially given that XFS does not
> seem to suffer such a huge hit.

I just haven't had time, sorry.  It's so much easier to run with the
patch.  :-)

Part of the problem is certainly caused by the fact that ext4 is using
physical block journaling instead of logical journaling.  But we see
the problem in no-journal mode as well.  I think part of the problem
is simply that in many of the workloads where people are doing this, they
also care about robustness after power failures, and if you are doing
random writes into uninitialized space, with fsyncs in-between, you
are basically guaranteed a 2x expansion in the number of writes you
need to do to the system.

One other thing which we *have* seen is that we need to do a better
job with extent merging; if you run without this patch, and you run
with fio in AIO mode where you are doing tons and tons of random
writes into uninitialized space, you can end up fragmenting the extent
tree very badly.   So fixing this would certainly help.
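
(A sketch of the pathological pattern, for reproduction purposes; the
path, sizes, and iteration count are illustrative:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Random 4k writes plus fsync into a preallocated (unwritten) file:
 * every fsync also has to journal the unwritten->written extent
 * conversion, roughly doubling the writes, and the random order
 * fragments the extent tree. */
int main(void)
{
	const off_t size = 1024L * 1024 * 1024;
	char buf[4096];
	int fd = open("/data/test.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || fallocate(fd, 0, 0, size) != 0)
		return 1;
	memset(buf, 0x5a, sizeof(buf));
	for (int i = 0; i < 100000; i++) {
		off_t off = (random() % (size / 4096)) * 4096;
		pwrite(fd, buf, sizeof(buf), off);
		fsync(fd);
	}
	close(fd);
	return 0;
}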

> Opening this security exposure is still something that is clearly a
> hack and best avoided if we can fix the root cause :)

See Linus's recent rant about how security arguments made by
theoreticians very often end up getting trumped by practical matters.
If you are running a daemon, whether it is a user-mode cluster file
system, or a database server, where it is (a) fundamentally trusted,
and (b) doing its own user-space checksumming and its own guarantees to
never return uninitialized data, even if we fix all potential
problems, we *still* can be reducing the number of random writes ---
and on a fully loaded system, we're guaranteed to be seek-constrained,
so each random write to update fs metadata means that you're burning
0.5% of your 200 seeks/second on your 3TB disk (where previously you
had half a dozen 500gig disks each with 200 seeks/second).

I agree with you that it would be nice to look into this further, and
optimizing our extent merging is definitely on the hot list of
performance improvements to look at.  But people who are using ext4 as
back-end database servers or cluster file system servers and who are
interested in wringing out every last percentage of performance are
going to be interested in this technique, no matter what we do.  If
you have Sagans and Sagans of servers all over the world, even a tenth
of a percentage point performance improvement can easily translate
into big dollars.

						- Ted


* Re: ext4_fallocate
  2012-06-26 13:13     ` ext4_fallocate Ric Wheeler
  2012-06-26 17:30       ` ext4_fallocate Theodore Ts'o
@ 2012-06-26 18:05       ` Fredrick
  2012-06-26 18:59         ` ext4_fallocate Ted Ts'o
  2012-06-26 19:30         ` ext4_fallocate Ric Wheeler
  1 sibling, 2 replies; 42+ messages in thread
From: Fredrick @ 2012-06-26 18:05 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Theodore Ts'o, linux-ext4, Andreas Dilger, wenqing.lz


> Hi Ted,
>
> Has anyone made progress digging into the performance impact of running
> without this patch? We should definitely see if there is some low
> hanging fruit there, especially given that XFS does not seem to suffer
> such a huge hit.
>
> I think that we need to get a good reproducer for the workload that
> causes the pain and start to dig into this.
>
> Opening this security exposure is still something that is clearly a hack
> and best avoided if we can fix the root cause :)
>
> Ric
>
>>

Hi Ric,

I ran perf stat on the ext4 tracepoints across two runs of our program:
one writing data to a file for the first time, and one writing data to
the file a second time (when the extents are already initialized).
The amount of data written is the same in both runs.

Left is the first run; right is the second:


<                 42 ext4:ext4_mb_bitmap_load
<                 42 ext4:ext4_mb_buddy_bitmap_load
<                642 ext4:ext4_mb_new_inode_pa
<                645 ext4:ext4_mballoc_alloc
<              9,596 ext4:ext4_mballoc_prealloc
<             10,240 ext4:ext4_da_update_reserve_space
---
>              7,413 ext4:ext4_mark_inode_dirty
49d52
<             10,241 ext4:ext4_allocate_blocks
51d53
<             10,241 ext4:ext4_request_blocks
55d56
<          1,310,720 ext4:ext4_da_reserve_space
58,60c59,60
<          1,331,288 ext4:ext4_ext_map_blocks_enter
<          1,331,288 ext4:ext4_ext_map_blocks_exit
<          1,341,467 ext4:ext4_mark_inode_dirty
---
>          1,310,806 ext4:ext4_ext_map_blocks_enter
>          1,310,806 ext4:ext4_ext_map_blocks_exit



Maybe the mballoc calls have overhead.

I'll try to compare numbers on XFS this week.

-Fredrick



* Re: ext4_fallocate
  2012-06-26 17:30       ` ext4_fallocate Theodore Ts'o
@ 2012-06-26 18:06         ` Fredrick
  2012-06-26 18:21         ` ext4_fallocate Ric Wheeler
  1 sibling, 0 replies; 42+ messages in thread
From: Fredrick @ 2012-06-26 18:06 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ric Wheeler, linux-ext4, Andreas Dilger, wenqing.lz

On 06/26/2012 10:30 AM, Theodore Ts'o wrote:
> On Tue, Jun 26, 2012 at 09:13:35AM -0400, Ric Wheeler wrote:
>>
>> Has anyone made progress digging into the performance impact of
>> running without this patch? We should definitely see if there is
>> some low hanging fruit there, especially given that XFS does not
>> seem to suffer such a huge hit.
>


> I just haven't had time, sorry.  It's so much easier to run with the
> patch.  :-)
>
> Part of the problem is certainly caused by the fact that ext4 is using
> physical block journaling instead of logical journaling.  But we see
> the problem in no-journal mode as well.  I think part of the problem
> is simply that in many of the workloads where people are doing this, they
> also care about robustness after power failures, and if you are doing
> random writes into uninitialized space, with fsyncs in-between, you
> are basically guaranteed a 2x expansion in the number of writes you
> need to do to the system.
>

Our workload is the same as above: our programs write a chunk and
fsync for robustness. This happens repeatedly on the file as the
program pushes more data to disk.


> One other thing which we *have* seen is that we need to do a better
> job with extent merging; if you run without this patch, and you run
> with fio in AIO mode where you are doing tons and tons of random
> writes into uninitialized space, you can end up fragmenting the extent
> tree very badly.   So fixing this would certainly help.
>
>> Opening this security exposure is still something that is clearly a
>> hack and best avoided if we can fix the root cause :)
>
> See Linus's recent rant about how security arguments made by
> theoreticians very often end up getting trumped by practical matters.
> If you are running a daemon, whether it is a user-mode cluster file
> system, or a database server, where it is (a) fundamentally trusted,
> and (b) doing its own user-space checksumming and its own guarantees to
> never return uninitialized data, even if we fix all potential
> problems, we *still* can be reducing the number of random writes ---
> and on a fully loaded system, we're guaranteed to be seek-constrained,
> so each random write to update fs metadata means that you're burning
> 0.5% of your 200 seeks/second on your 3TB disk (where previously you
> had half a dozen 500gig disks each with 200 seeks/second).
>

I can see the performance degradation on SSDs too, though the percentage
is smaller than on SATA.

> I agree with you that it would be nice to look into this further, and
> optimizing our extent merging is definitely on the hot list of
> perofrmance improvements to look at.  But people who are using ext4 as
> back-end database servers or cluster file system servers and who are
> interested in wringing out every last percentage of performace are
> going to be interested in this technique, no matter what we do.  If
> you have Sagans and Sagans of servers all over the world, even a tenth
> of a percentage point performance improvement can easily translate
> into big dollars.
>

Sailing in the same boat. :)

> 						- Ted
>

-Fredrick



* Re: ext4_fallocate
  2012-06-26 17:30       ` ext4_fallocate Theodore Ts'o
  2012-06-26 18:06         ` ext4_fallocate Fredrick
@ 2012-06-26 18:21         ` Ric Wheeler
  2012-06-26 18:57           ` ext4_fallocate Ted Ts'o
  1 sibling, 1 reply; 42+ messages in thread
From: Ric Wheeler @ 2012-06-26 18:21 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On 06/26/2012 01:30 PM, Theodore Ts'o wrote:
> On Tue, Jun 26, 2012 at 09:13:35AM -0400, Ric Wheeler wrote:
>> Has anyone made progress digging into the performance impact of
>> running without this patch? We should definitely see if there is
>> some low hanging fruit there, especially given that XFS does not
>> seem to suffer such a huge hit.
> I just haven't had time, sorry.  It's so much easier to run with the
> patch.  :-)
>
> Part of the problem is certainly caused by the fact that ext4 is using
> physical block journaling instead of logical journaling.  But we see
> the problem in no-journal mode as well.  I think part of the problem
> is simply that in many of the workloads where people are doing this, they
> also care about robustness after power failures, and if you are doing
> random writes into uninitialized space, with fsyncs in-between, you
> are basically guaranteed a 2x expansion in the number of writes you
> need to do to the system.

It would be interesting to see if simply not doing the preallocation would be 
easier and safer. Better yet, figure out how to leverage trim commands to safely 
allow us to preallocate and not expose other users' data (and not have to mark 
the extents as allocated but not written).

If we did that, we would have the performance boost without the security hole.

>
> One other thing which we *have* seen is that we need to do a better
> job with extent merging; if you run without this patch, and you run
> with fio in AIO mode where you are doing tons and tons of random
> writes into uninitialized space, you can end up fragmenting the extent
> tree very badly.   So fixing this would certainly help.

Definitely sounds like something worth fixing.

>
>> Opening this security exposure is still something that is clearly a
>> hack and best avoided if we can fix the root cause :)
> See Linus's recent rant about how security arguments made by
> theoreticians very often end up getting trumped by practical matters.
> If you are running a daemon, whether it is a user-mode cluster file
> system, or a database server, where it is (a) fundamentally trusted,
> and (b) doing its own user-space checksumming and its own guarantees to
> never return uninitialized data, even if we fix all potential
> problems, we *still* can be reducing the number of random writes ---
> and on a fully loaded system, we're guaranteed to be seek-constrained,
> so each random write to update fs metadata means that you're burning
> 0.5% of your 200 seeks/second on your 3TB disk (where previously you
> had half a dozen 500gig disks each with 200 seeks/second).

This is not just a theorist's worry. I would not use any server/web service that 
knowingly enabled this hack on a multi-user machine, and would not enable it for 
any enterprise customers.

This is a real-world, hard promise that we will let other users see your data in 
a trivial way (with your patch, only if they have the capability set).

>
> I agree with you that it would be nice to look into this further, and
> optimizing our extent merging is definitely on the hot list of
> performance improvements to look at.  But people who are using ext4 as
> back-end database servers or cluster file system servers and who are
> interested in wringing out every last percentage of performance are
> going to be interested in this technique, no matter what we do.  If
> you have Sagans and Sagans of servers all over the world, even a tenth
> of a percentage point performance improvement can easily translate
> into big dollars.
>
> 						- Ted

We should be very, very careful not to strip away the usefulness of the file 
system just to cater to some users. You could deliver the same performance 
safely in a few ways, I think:

* fully allocate the file (by actually writing the data once in large IO's)
* don't preallocate, and write with large enough IO's to populate the file (you 
can tweak this by allocating largish chunks on first write to a region)
* fix our preallocation to use discard (trim) primitives (note you can also use 
"WRITE_SAME" to initialize regions of blocks, but that can be quite slow)

ric



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 18:21         ` ext4_fallocate Ric Wheeler
@ 2012-06-26 18:57           ` Ted Ts'o
  2012-06-26 19:22             ` ext4_fallocate Ric Wheeler
  0 siblings, 1 reply; 42+ messages in thread
From: Ted Ts'o @ 2012-06-26 18:57 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On Tue, Jun 26, 2012 at 02:21:24PM -0400, Ric Wheeler wrote:
> 
> It would be interesting to see if simply not doing the preallocation
> would be easier and safer. Better yet, figure out how to leverage
> trim commands to safely allow us to preallocate and not expose other
> users' data (and not have to mark the extents as allocated but not
> written).

TRIM is applicable to SSD's and enterprise storage arrays.  It's not
applicable to HDD's.

> This is not just a theorist's worry. I would not use any server/web
> service that knowingly enabled this hack on a multi-user machine, and
> would not enable it for any enterprise customers.

Sure, but for a single-user machine or a dedicated cluster file system
server it makes sense; it's not going to be useful for all customers,
but for some customers it will make sense.

> We should be very, very careful not to strip away the usefulness of
> the file system just to cater to some users.

A mount option or a superblock flag does not "strip away the
usefulness of the file system".  It makes it more useful for some use
cases.

Your alternate proposals (preinitializing the space with zeros, using
trim, writing larger chunks) will work for some use cases, certainly.
But it's not going to be sufficient for at least some use cases, or
they will simply not work.

Fundamentally it sounds to me like the biggest problem is that you
don't trust your customers, and so you're afraid people will use the
no-hide-stale feature if it becomes available, even if it's not needed
or if it's needed but you're afraid that they don't understand the
security tradeoffs.

					- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 18:05       ` ext4_fallocate Fredrick
@ 2012-06-26 18:59         ` Ted Ts'o
  2012-06-26 19:30         ` ext4_fallocate Ric Wheeler
  1 sibling, 0 replies; 42+ messages in thread
From: Ted Ts'o @ 2012-06-26 18:59 UTC (permalink / raw)
  To: Fredrick; +Cc: Ric Wheeler, linux-ext4, Andreas Dilger, wenqing.lz

On Tue, Jun 26, 2012 at 11:05:40AM -0700, Fredrick wrote:
> 
> I had run perf stat on ext4 functions between two runs of our program
> writing data to a file for the first time and writing data to the file
> for the second time (where the extents are initialized).

From your mballoc differences, it sounds like you were comparing
fallocate with not using fallocate at all; is that right?

The comparison you need to do is using normal fallocate versus
fallocate with the no-hide-stale feature enabled.  It's obvious that
allocating blocks as you need them will always be more expensive than
using fallocate.

						- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 18:57           ` ext4_fallocate Ted Ts'o
@ 2012-06-26 19:22             ` Ric Wheeler
  0 siblings, 0 replies; 42+ messages in thread
From: Ric Wheeler @ 2012-06-26 19:22 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On 06/26/2012 02:57 PM, Ted Ts'o wrote:
> On Tue, Jun 26, 2012 at 02:21:24PM -0400, Ric Wheeler wrote:
>> It would be interesting to see if simply not doing the preallocation
>> would be easier and safer. Better yet, figure out how to leverage
>> trim commands to safely allow us to preallocate and not expose other
>> users' data (and not have to mark the extents as allocated but not
>> written).
> TRIM is applicable to SSD's and enterprise storage arrays.  It's not
> applicable to HDD's.

And device mapper can also support TRIM these days with thinly provisioned 
LUN targets (although this just moves the multiple-IO issue down to device 
mapper :))

We would only use it when the device supports it of course. Also note that you 
can use WRITE_SAME on pretty much any drive (although again, it can be slow).

>
>> This is not just a theorist's worry. I would not use any server/web
>> service that knowingly enabled this hack on a multi-user machine, and
>> would not enable it for any enterprise customers.
> Sure, but for a single-user machine or a dedicated cluster file system
> server it makes sense; it's not going to be useful for all customers,
> but for some customers it will make sense.

The danger is that people use that google thing and see "ext4 performance 
improvement" and then turn it on without understanding what they bought. Believe 
me, I deal with that *a lot* in my day job :)

>
>> We should be very, very careful not to strip away the usefulness of
>> the file system just to cater to some users.
> A mount option or a superblock flag does not "strip away the
> usefulness of the file system".  It makes it more useful for some use
> cases.
>
> Your alternate proposals (preinitializing the space with zeros, using
> trim, writing larger chunks) will work for some use cases, certainly.
> But it's not going to be sufficient for at least some use cases, or
> they will simply not work.

I think that for the use case discussed in this thread it would probably work 
quite well, but always worth testing of course.

>
> Fundamentally it sounds to me like the biggest problem is that you
> don't trust your customers, and so you're afraid people will use the
> no-hide-stale feature if it becomes available, even if it's not needed
> or if it's needed but you're afraid that they don't understand the
> security tradeoffs.
>
> 					- Ted

No Ted, we don't trust exposing other customers' data in order to cover up a poor 
design that other file systems don't suffer from.

Ric




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 18:05       ` ext4_fallocate Fredrick
  2012-06-26 18:59         ` ext4_fallocate Ted Ts'o
@ 2012-06-26 19:30         ` Ric Wheeler
  2012-06-26 19:57           ` ext4_fallocate Eric Sandeen
  1 sibling, 1 reply; 42+ messages in thread
From: Ric Wheeler @ 2012-06-26 19:30 UTC (permalink / raw)
  To: Fredrick
  Cc: Ric Wheeler, Theodore Ts'o, linux-ext4, Andreas Dilger,
	wenqing.lz, Eric Sandeen

On 06/26/2012 02:05 PM, Fredrick wrote:
>
>> Hi Ted,
>>
>> Has anyone made progress digging into the performance impact of running
>> without this patch? We should definitely see if there is some low
>> hanging fruit there, especially given that XFS does not seem to suffer
>> such a huge hit.
>>
>> I think that we need to get a good reproducer for the workload that
>> causes the pain and start to dig into this.
>>
>> Opening this security exposure is still something that is clearly a hack
>> and best avoided if we can fix the root cause :)
>>
>> Ric
>>
>>>
>
> Hi Ric,
>
> I had run perf stat on ext4 functions between two runs of our program
> writing data to a file for the first time and writing data to the file
> for the second time (where the extents are initialized).
> The amount of data written is the same between the two runs.
>
> left is first time
> right is second time.
>
>
> <                 42 ext4:ext4_mb_bitmap_load
> <                 42 ext4:ext4_mb_buddy_bitmap_load
> <                642 ext4:ext4_mb_new_inode_pa
> <                645 ext4:ext4_mballoc_alloc
> <              9,596 ext4:ext4_mballoc_prealloc
> <             10,240 ext4:ext4_da_update_reserve_space
> ---
> >              7,413 ext4:ext4_mark_inode_dirty
> 49d52
> <             10,241 ext4:ext4_allocate_blocks
> 51d53
> <             10,241 ext4:ext4_request_blocks
> 55d56
> <          1,310,720 ext4:ext4_da_reserve_space
> 58,60c59,60
> <          1,331,288 ext4:ext4_ext_map_blocks_enter
> <          1,331,288 ext4:ext4_ext_map_blocks_exit
> <          1,341,467 ext4:ext4_mark_inode_dirty
> ---
> >          1,310,806 ext4:ext4_ext_map_blocks_enter
> >          1,310,806 ext4:ext4_ext_map_blocks_exit
>
>
> Maybe the mballoc calls have overhead.
> 
> I'll try to compare numbers on XFS this week.
>
> -Fredrick
>

Thanks!  Eric is also running some tests to evaluate the impact of various 
techniques :)

ric


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 19:30         ` ext4_fallocate Ric Wheeler
@ 2012-06-26 19:57           ` Eric Sandeen
  2012-06-26 20:44             ` ext4_fallocate Eric Sandeen
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Sandeen @ 2012-06-26 19:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Fredrick, Ric Wheeler, Theodore Ts'o, linux-ext4,
	Andreas Dilger, wenqing.lz

On 6/26/12 3:30 PM, Ric Wheeler wrote:

> Thanks!  Eric is also running some tests to evaluate the impact of various techniques :)
> 
> ric

Yup forgive me for interjecting actual numbers into the discussion ;)

I tried running this fio recipe on v3.3, which I think does a decent job of
emulating the situation (fallocate 1G, do random 1M writes into it, with
fsyncs after each):

[test]
filename=testfile
rw=randwrite
size=1g
filesize=1g
bs=1024k
ioengine=sync
fallocate=1
fsync=1

Stock ext4 (3 tests w/ file remove & cache drop in between):

  WRITE: io=1024.0MB, aggrb=16322KB/s, minb=16713KB/s, maxb=16713KB/s, mint=64243msec, maxt=64243msec
  WRITE: io=1024.0MB, aggrb=16249KB/s, minb=16639KB/s, maxb=16639KB/s, mint=64528msec, maxt=64528msec
  WRITE: io=1024.0MB, aggrb=16370KB/s, minb=16763KB/s, maxb=16763KB/s, mint=64052msec, maxt=64052msec

With the patch which exposes other users' data:

  WRITE: io=1024.0MB, aggrb=17840KB/s, minb=18268KB/s, maxb=18268KB/s, mint=58776msec, maxt=58776msec
  WRITE: io=1024.0MB, aggrb=17841KB/s, minb=18269KB/s, maxb=18269KB/s, mint=58773msec, maxt=58773msec
  WRITE: io=1024.0MB, aggrb=17828KB/s, minb=18255KB/s, maxb=18255KB/s, mint=58816msec, maxt=58816msec

so about 10% faster than without.

XFS, FWIW:

  WRITE: io=1024.0MB, aggrb=24008KB/s, minb=24584KB/s, maxb=24584KB/s, mint=43675msec, maxt=43675msec
  WRITE: io=1024.0MB, aggrb=24069KB/s, minb=24647KB/s, maxb=24647KB/s, mint=43564msec, maxt=43564msec
  WRITE: io=1024.0MB, aggrb=24054KB/s, minb=24632KB/s, maxb=24632KB/s, mint=43591msec, maxt=43591msec

which is 35% faster than ext4 with the risky patch.

Haven't yet tried overwrites or done any tracing or profiling, but I think the fio recipe is a decent demonstrator.  I'll try the overwrites etc. in a bit when I get a moment.

-Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 19:57           ` ext4_fallocate Eric Sandeen
@ 2012-06-26 20:44             ` Eric Sandeen
  2012-06-27 15:14               ` ext4_fallocate Eric Sandeen
  2012-06-27 19:30               ` ext4_fallocate Theodore Ts'o
  0 siblings, 2 replies; 42+ messages in thread
From: Eric Sandeen @ 2012-06-26 20:44 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Fredrick, Ric Wheeler, Theodore Ts'o, linux-ext4,
	Andreas Dilger, wenqing.lz

On 6/26/12 3:57 PM, Eric Sandeen wrote:
> On 6/26/12 3:30 PM, Ric Wheeler wrote:
> 
>> Thanks!  Eric is also running some tests to evaluate the impact of various techniques :)
>>
>> ric
> 
> Yup forgive me for interjecting actual numbers into the discussion ;)
> 
> I tried running this fio recipe on v3.3, which I think does a decent job of
> emulating the situation (fallocate 1G, do random 1M writes into it, with
> fsyncs after each):
> 
> [test]
> filename=testfile
> rw=randwrite
> size=1g
> filesize=1g
> bs=1024k
> ioengine=sync
> fallocate=1
> fsync=1
> 
> Stock ext4 (3 tests w/ file remove & cache drop in between):
> 
>   WRITE: io=1024.0MB, aggrb=16322KB/s, minb=16713KB/s, maxb=16713KB/s, mint=64243msec, maxt=64243msec
>   WRITE: io=1024.0MB, aggrb=16249KB/s, minb=16639KB/s, maxb=16639KB/s, mint=64528msec, maxt=64528msec
>   WRITE: io=1024.0MB, aggrb=16370KB/s, minb=16763KB/s, maxb=16763KB/s, mint=64052msec, maxt=64052msec

And as a sanity check, here are the rates for overwriting existing, written blocks in the file:

  WRITE: io=1024.0MB, aggrb=17778KB/s, minb=18205KB/s, maxb=18205KB/s, mint=58980msec, maxt=58980msec
  WRITE: io=1024.0MB, aggrb=17825KB/s, minb=18252KB/s, maxb=18252KB/s, mint=58826msec, maxt=58826msec
  WRITE: io=1024.0MB, aggrb=17769KB/s, minb=18195KB/s, maxb=18195KB/s, mint=59010msec, maxt=59010msec

so this does look like roughly 10% overhead for converting the extents.

> With the patch which exposes other users' data:
> 
>   WRITE: io=1024.0MB, aggrb=17840KB/s, minb=18268KB/s, maxb=18268KB/s, mint=58776msec, maxt=58776msec
>   WRITE: io=1024.0MB, aggrb=17841KB/s, minb=18269KB/s, maxb=18269KB/s, mint=58773msec, maxt=58773msec
>   WRITE: io=1024.0MB, aggrb=17828KB/s, minb=18255KB/s, maxb=18255KB/s, mint=58816msec, maxt=58816msec

overwrites:

  WRITE: io=1024.0MB, aggrb=17768KB/s, minb=18194KB/s, maxb=18194KB/s, mint=59014msec, maxt=59014msec
  WRITE: io=1024.0MB, aggrb=17855KB/s, minb=18283KB/s, maxb=18283KB/s, mint=58726msec, maxt=58726msec
  WRITE: io=1024.0MB, aggrb=17838KB/s, minb=18266KB/s, maxb=18266KB/s, mint=58783msec, maxt=58783msec

As expected, overwriting stale data is no slower than overwriting file data.

> so about 10% faster than without.
> 
> XFS, FWIW:
> 
>   WRITE: io=1024.0MB, aggrb=24008KB/s, minb=24584KB/s, maxb=24584KB/s, mint=43675msec, maxt=43675msec
>   WRITE: io=1024.0MB, aggrb=24069KB/s, minb=24647KB/s, maxb=24647KB/s, mint=43564msec, maxt=43564msec
>   WRITE: io=1024.0MB, aggrb=24054KB/s, minb=24632KB/s, maxb=24632KB/s, mint=43591msec, maxt=43591msec

overwrites:

  WRITE: io=1024.0MB, aggrb=24108KB/s, minb=24687KB/s, maxb=24687KB/s, mint=43494msec, maxt=43494msec
  WRITE: io=1024.0MB, aggrb=24129KB/s, minb=24708KB/s, maxb=24708KB/s, mint=43456msec, maxt=43456msec
  WRITE: io=1024.0MB, aggrb=24164KB/s, minb=24744KB/s, maxb=24744KB/s, mint=43393msec, maxt=43393msec

looks like negligible overhead for the conversions here, < 1%.

-Eric

> which is 35% faster than ext4 with the risky patch.
> 
> Haven't yet tried overwrites or done any tracing or profiling, but I think the fio recipe is a decent demonstrator.  I'll try the overwrites etc. in a bit when I get a moment.
> 
> -Eric
> 



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 20:44             ` ext4_fallocate Eric Sandeen
@ 2012-06-27 15:14               ` Eric Sandeen
  2012-06-27 19:30               ` ext4_fallocate Theodore Ts'o
  1 sibling, 0 replies; 42+ messages in thread
From: Eric Sandeen @ 2012-06-27 15:14 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Fredrick, Ric Wheeler, Theodore Ts'o, linux-ext4,
	Andreas Dilger, wenqing.lz

On 6/26/12 4:44 PM, Eric Sandeen wrote:
> On 6/26/12 3:57 PM, Eric Sandeen wrote:
>> On 6/26/12 3:30 PM, Ric Wheeler wrote:

...

>> I tried running this fio recipe on v3.3, which I think does a decent job of
>> emulating the situation (fallocate 1G, do random 1M writes into it, with
>> fsyncs after each):
>>
>> [test]
>> filename=testfile
>> rw=randwrite
>> size=1g
>> filesize=1g
>> bs=1024k
>> ioengine=sync
>> fallocate=1
>> fsync=1


I realized the kernel I tested on had a lot of debugging bells & whistles turned on.  For a more performance-configured v3.3 kernel, I see much less impact, more or less negligible:

unwritten conversion on ext4:

  WRITE: io=1024.0MB, aggrb=23077KB/s, minb=23631KB/s, maxb=23631KB/s, mint=45437msec, maxt=45437msec
  WRITE: io=1024.0MB, aggrb=23040KB/s, minb=23593KB/s, maxb=23593KB/s, mint=45511msec, maxt=45511msec
  WRITE: io=1024.0MB, aggrb=23011KB/s, minb=23564KB/s, maxb=23564KB/s, mint=45567msec, maxt=45567msec

overwrites on ext4:

  WRITE: io=1024.0MB, aggrb=23046KB/s, minb=23599KB/s, maxb=23599KB/s, mint=45499msec, maxt=45499msec
  WRITE: io=1024.0MB, aggrb=23222KB/s, minb=23780KB/s, maxb=23780KB/s, mint=45153msec, maxt=45153msec
  WRITE: io=1024.0MB, aggrb=23031KB/s, minb=23584KB/s, maxb=23584KB/s, mint=45528msec, maxt=45528msec


So I guess it's interesting information; the debug kernel showed a much bigger difference; the performance-config'd kernel showed almost no difference.

FWIW, on this same kernel,

unwritten conversion on xfs:

  WRITE: io=1024.0MB, aggrb=28409KB/s, minb=29091KB/s, maxb=29091KB/s, mint=36909msec, maxt=36909msec
  WRITE: io=1024.0MB, aggrb=28372KB/s, minb=29053KB/s, maxb=29053KB/s, mint=36958msec, maxt=36958msec
  WRITE: io=1024.0MB, aggrb=28406KB/s, minb=29088KB/s, maxb=29088KB/s, mint=36913msec, maxt=36913msec

overwrites on xfs:

  WRITE: io=1024.0MB, aggrb=28268KB/s, minb=28946KB/s, maxb=28946KB/s, mint=37094msec, maxt=37094msec
  WRITE: io=1024.0MB, aggrb=28316KB/s, minb=28995KB/s, maxb=28995KB/s, mint=37031msec, maxt=37031msec
  WRITE: io=1024.0MB, aggrb=28576KB/s, minb=29262KB/s, maxb=29262KB/s, mint=36694msec, maxt=36694msec

-Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-26 20:44             ` ext4_fallocate Eric Sandeen
  2012-06-27 15:14               ` ext4_fallocate Eric Sandeen
@ 2012-06-27 19:30               ` Theodore Ts'o
  2012-06-27 23:02                 ` ext4_fallocate Eric Sandeen
  1 sibling, 1 reply; 42+ messages in thread
From: Theodore Ts'o @ 2012-06-27 19:30 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ric Wheeler, Fredrick, Ric Wheeler, linux-ext4, Andreas Dilger,
	wenqing.lz

On Tue, Jun 26, 2012 at 04:44:08PM -0400, Eric Sandeen wrote:
> > 
> > I tried running this fio recipe on v3.3, which I think does a decent job of
> > emulating the situation (fallocate 1G, do random 1M writes into it, with
> > fsyncs after each):
> > 
> > [test]
> > filename=testfile
> > rw=randwrite
> > size=1g
> > filesize=1g
> > bs=1024k
> > ioengine=sync
> > fallocate=1
> > fsync=1

A better workload would be to use a blocksize of 4k.  By using a
blocksize of 1024k, it's not surprising that the metadata overhead is
in the noise.

Try something like this; this will cause the extent tree overhead to
be roughly equal to the data block I/O.

[global]
rw=randwrite
size=128m
filesize=1g
bs=4k
ioengine=sync
fallocate=1
fsync=1

[thread1]
filename=testfile

					- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-27 19:30               ` ext4_fallocate Theodore Ts'o
@ 2012-06-27 23:02                 ` Eric Sandeen
  2012-06-28 11:27                   ` ext4_fallocate Ric Wheeler
                                     ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Eric Sandeen @ 2012-06-27 23:02 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Ric Wheeler, Fredrick, Ric Wheeler, linux-ext4, Andreas Dilger,
	wenqing.lz

On 6/27/12 3:30 PM, Theodore Ts'o wrote:
> On Tue, Jun 26, 2012 at 04:44:08PM -0400, Eric Sandeen wrote:
>>>
>>> I tried running this fio recipe on v3.3, which I think does a decent job of
>>> emulating the situation (fallocate 1G, do random 1M writes into it, with
>>> fsyncs after each):
>>>
>>> [test]
>>> filename=testfile
>>> rw=randwrite
>>> size=1g
>>> filesize=1g
>>> bs=1024k
>>> ioengine=sync
>>> fallocate=1
>>> fsync=1
> 
> A better workload would be to use a blocksize of 4k.  By using a
> blocksize of 1024k, it's not surprising that the metadata overhead is
> in the noise.
> 
> Try something like this; this will cause the extent tree overhead to
> be roughly equal to the data block I/O.
> 
> [global]
> rw=randwrite
> size=128m
> filesize=1g
> bs=4k
> ioengine=sync
> fallocate=1
> fsync=1
> 
> [thread1]
> filename=testfile

Well, ok ... TBH I changed it to size=16m to finish in under 20m.... so here are the results:

fallocate 1g, do 16m of 4k random IOs, sync after each:

# for I in a b c; do rm -f testfile; echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done

  WRITE: io=16384KB, aggrb=154KB/s, minb=158KB/s, maxb=158KB/s, mint=105989msec, maxt=105989msec
  WRITE: io=16384KB, aggrb=163KB/s, minb=167KB/s, maxb=167KB/s, mint=99906msec, maxt=99906msec
  WRITE: io=16384KB, aggrb=176KB/s, minb=180KB/s, maxb=180KB/s, mint=92791msec, maxt=92791msec

same, but overwrite pre-written 1g file (same as the expose-my-data option ;)

# dd if=/dev/zero of=testfile bs=1M count=1024
# for I in a b c; do echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done

  WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99515msec, maxt=99515msec
  WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99371msec, maxt=99371msec
  WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99677msec, maxt=99677msec

so no great surprise, small synchronous 4k writes have terrible performance, but I'm still not seeing a lot of fallocate overhead.

xfs, FWIW:

# for I in a b c; do rm -f testfile; echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done

  WRITE: io=16384KB, aggrb=202KB/s, minb=207KB/s, maxb=207KB/s, mint=80980msec, maxt=80980msec
  WRITE: io=16384KB, aggrb=203KB/s, minb=208KB/s, maxb=208KB/s, mint=80508msec, maxt=80508msec
  WRITE: io=16384KB, aggrb=204KB/s, minb=208KB/s, maxb=208KB/s, mint=80291msec, maxt=80291msec

# dd if=/dev/zero of=testfile bs=1M count=1024
# for I in a b c; do echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done

  WRITE: io=16384KB, aggrb=197KB/s, minb=202KB/s, maxb=202KB/s, mint=82869msec, maxt=82869msec
  WRITE: io=16384KB, aggrb=203KB/s, minb=208KB/s, maxb=208KB/s, mint=80348msec, maxt=80348msec
  WRITE: io=16384KB, aggrb=202KB/s, minb=207KB/s, maxb=207KB/s, mint=80827msec, maxt=80827msec

Again, I think this is just a diabolical workload ;)

-Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-27 23:02                 ` ext4_fallocate Eric Sandeen
@ 2012-06-28 11:27                   ` Ric Wheeler
  2012-06-29 19:02                     ` ext4_fallocate Andreas Dilger
  2012-06-28 12:48                   ` ext4_fallocate Theodore Ts'o
  2012-07-02  3:16                   ` ext4_fallocate Zheng Liu
  2 siblings, 1 reply; 42+ messages in thread
From: Ric Wheeler @ 2012-06-28 11:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Ric Wheeler, Fredrick, linux-ext4,
	Andreas Dilger, wenqing.lz

On 06/27/2012 07:02 PM, Eric Sandeen wrote:
> On 6/27/12 3:30 PM, Theodore Ts'o wrote:
>> On Tue, Jun 26, 2012 at 04:44:08PM -0400, Eric Sandeen wrote:
>>>> I tried running this fio recipe on v3.3, which I think does a decent job of
>>>> emulating the situation (fallocate 1G, do random 1M writes into it, with
>>>> fsyncs after each):
>>>>
>>>> [test]
>>>> filename=testfile
>>>> rw=randwrite
>>>> size=1g
>>>> filesize=1g
>>>> bs=1024k
>>>> ioengine=sync
>>>> fallocate=1
>>>> fsync=1
>> A better workload would be to use a blocksize of 4k.  By using a
>> blocksize of 1024k, it's not surprising that the metadata overhead is
>> in the noise.
>>
>> Try something like this; this will cause the extent tree overhead to
>> be roughly equal to the data block I/O.
>>
>> [global]
>> rw=randwrite
>> size=128m
>> filesize=1g
>> bs=4k
>> ioengine=sync
>> fallocate=1
>> fsync=1
>>
>> [thread1]
>> filename=testfile
> Well, ok ... TBH I changed it to size=16m to finish in under 20m.... so here are the results:
>
> fallocate 1g, do 16m of 4k random IOs, sync after each:
>
> # for I in a b c; do rm -f testfile; echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
>
>    WRITE: io=16384KB, aggrb=154KB/s, minb=158KB/s, maxb=158KB/s, mint=105989msec, maxt=105989msec
>    WRITE: io=16384KB, aggrb=163KB/s, minb=167KB/s, maxb=167KB/s, mint=99906msec, maxt=99906msec
>    WRITE: io=16384KB, aggrb=176KB/s, minb=180KB/s, maxb=180KB/s, mint=92791msec, maxt=92791msec
>
> same, but overwrite pre-written 1g file (same as the expose-my-data option ;)
>
> # dd if=/dev/zero of=testfile bs=1M count=1024
> # for I in a b c; do echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
>
>    WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99515msec, maxt=99515msec
>    WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99371msec, maxt=99371msec
>    WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99677msec, maxt=99677msec
>
> so no great surprise, small synchronous 4k writes have terrible performance, but I'm still not seeing a lot of fallocate overhead.
>
> xfs, FWIW:
>
> # for I in a b c; do rm -f testfile; echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
>
>    WRITE: io=16384KB, aggrb=202KB/s, minb=207KB/s, maxb=207KB/s, mint=80980msec, maxt=80980msec
>    WRITE: io=16384KB, aggrb=203KB/s, minb=208KB/s, maxb=208KB/s, mint=80508msec, maxt=80508msec
>    WRITE: io=16384KB, aggrb=204KB/s, minb=208KB/s, maxb=208KB/s, mint=80291msec, maxt=80291msec
>
> # dd if=/dev/zero of=testfile bs=1M count=1024
> # for I in a b c; do echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
>
>    WRITE: io=16384KB, aggrb=197KB/s, minb=202KB/s, maxb=202KB/s, mint=82869msec, maxt=82869msec
>    WRITE: io=16384KB, aggrb=203KB/s, minb=208KB/s, maxb=208KB/s, mint=80348msec, maxt=80348msec
>    WRITE: io=16384KB, aggrb=202KB/s, minb=207KB/s, maxb=207KB/s, mint=80827msec, maxt=80827msec
>
> Again, I think this is just a diabolical workload ;)
>
> -Eric

We need to keep in mind what the goal of pre-allocation is (should be?) - spend 
a bit of extra time doing the allocation call so we get really good, contiguous 
layout on disk, which ultimately will help in streaming read/write workloads.

If you have a reasonably small file, pre-allocation is probably simply a waste 
of time - you would be better off overwriting the maximum file size with all 
zeros (even a 1GB file would take only a few seconds).

If the file is large enough to be interesting, I think that we might want to 
think about a scheme that would bring small random IO's more into line with the 
1MB results Eric saw.

One way to do that might be to have a minimum "chunk" that we would zero out for 
any IO to an allocated but unwritten extent: if you write 4KB to the middle of 
said region, we pad up and zero out to the nearest MB with zeros.
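
As a rough sketch of the arithmetic (the function name and the 1MB
constant here are made up for illustration; this is not actual ext4
code):

/*
 * Hypothetical sketch: widen a small write that lands inside an
 * unwritten extent so that we zero out one whole aligned chunk
 * instead of splitting the extent into three pieces.
 */
#define ZEROOUT_CHUNK_BLOCKS	256	/* 1MB in 4KB blocks, for example */

static void widen_to_chunk(unsigned int *lblk, unsigned int *len,
			   unsigned int ee_block, unsigned int ee_len)
{
	unsigned int start = *lblk & ~(ZEROOUT_CHUNK_BLOCKS - 1);
	unsigned int end = (*lblk + *len + ZEROOUT_CHUNK_BLOCKS - 1) &
			   ~(ZEROOUT_CHUNK_BLOCKS - 1);

	/* Never reach outside the unwritten extent being written into. */
	if (start < ee_block)
		start = ee_block;
	if (end > ee_block + ee_len)
		end = ee_block + ee_len;

	*lblk = start;
	*len = end - start;	/* caller zero-fills around the new data */
}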

Note that for the target class of drives (SATA) that Ted mentioned earlier, a 
random 1MB write is not that much slower than a 4KB one (you need to pay the 
head movement costs already).  Of course, the sweet spot might turn out to be a 
bit smaller or larger.

Ric



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-27 23:02                 ` ext4_fallocate Eric Sandeen
  2012-06-28 11:27                   ` ext4_fallocate Ric Wheeler
@ 2012-06-28 12:48                   ` Theodore Ts'o
  2012-07-02  3:16                   ` ext4_fallocate Zheng Liu
  2 siblings, 0 replies; 42+ messages in thread
From: Theodore Ts'o @ 2012-06-28 12:48 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ric Wheeler, Fredrick, Ric Wheeler, linux-ext4, Andreas Dilger,
	wenqing.lz

On Wed, Jun 27, 2012 at 07:02:45PM -0400, Eric Sandeen wrote:
> 
> fallocate 1g, do 16m of 4k random IOs, sync after each:
> 
> # for I in a b c; do rm -f testfile; echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
> 
>   WRITE: io=16384KB, aggrb=154KB/s, minb=158KB/s, maxb=158KB/s, mint=105989msec, maxt=105989msec
>   WRITE: io=16384KB, aggrb=163KB/s, minb=167KB/s, maxb=167KB/s, mint=99906msec, maxt=99906msec
>   WRITE: io=16384KB, aggrb=176KB/s, minb=180KB/s, maxb=180KB/s, mint=92791msec, maxt=92791msec
> 
> same, but overwrite pre-written 1g file (same as the expose-my-data option ;)
> 
> # dd if=/dev/zero of=testfile bs=1M count=1024
> # for I in a b c; do echo 3 > /proc/sys/vm/drop_caches; fio tytso.fio | grep 2>&1 WRITE; done
> 
>   WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99515msec, maxt=99515msec
>   WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99371msec, maxt=99371msec
>   WRITE: io=16384KB, aggrb=164KB/s, minb=168KB/s, maxb=168KB/s, mint=99677msec, maxt=99677msec
> 

There's a pretty large spread in the first set of numbers compared to
the second set.  With the second set of numbers, the times are much
more stable.

I wonder if one of the things that's going on is file placement; if
you're constantly deleting and recreating the file, are we getting the
same block numbers or not?  (And if we're not, is that something we
need to look at vis-a-vis the block allocator --- especially with
SSD's?)
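
(One quick way to check: compare "filefrag -v" output across runs, or
use a small FIEMAP program, roughly like the sketch below --- error
handling mostly omitted.)

/*
 * Dump the extent map of a file so physical placement can be compared
 * across runs.  Sketch only: first 64 extents, minimal error handling.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fm = calloc(1, sizeof(*fm) + 64 * sizeof(struct fiemap_extent));
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 64;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror(argv[1]);
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("logical %llu  physical %llu  len %llu  flags 0x%x\n",
		       (unsigned long long)fm->fm_extents[i].fe_logical,
		       (unsigned long long)fm->fm_extents[i].fe_physical,
		       (unsigned long long)fm->fm_extents[i].fe_length,
		       fm->fm_extents[i].fe_flags);
	return 0;
}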

						- Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-25  7:33 ` ext4_fallocate Andreas Dilger
@ 2012-06-28 15:12   ` Phillip Susi
  2012-06-28 15:23     ` ext4_fallocate Eric Sandeen
  0 siblings, 1 reply; 42+ messages in thread
From: Phillip Susi @ 2012-06-28 15:12 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Fredrick, linux-ext4


On 6/25/2012 3:33 AM, Andreas Dilger wrote:
> There was a recent patch series "ext4: add an io-tree to track
> block allocation" that may improve the performance for your case of
> overwrite of uninitialized files, but it hasn't landed yet.

I'm confused.  Why is writing to uninitialized extents slow, and why
would this help?  If you have an uninitialized extent, then the blocks
are already allocated, just flagged as containing uninitialized data.
 Writing to them should be no different than writing to initialized
extents, save for the step of clearing the uninitialized flag.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-28 15:12   ` ext4_fallocate Phillip Susi
@ 2012-06-28 15:23     ` Eric Sandeen
  0 siblings, 0 replies; 42+ messages in thread
From: Eric Sandeen @ 2012-06-28 15:23 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Andreas Dilger, Fredrick, linux-ext4


On 6/28/12 11:12 AM, Phillip Susi wrote:
> On 6/25/2012 3:33 AM, Andreas Dilger wrote:
>> There was a recent patch series "ext4: add an io-tree to track
>> block allocation" that may improve the performance for your case of
>> overwrite of uninitialized files, but it hasn't landed yet.
> 
> I'm confused.  Why is writing to uninitialized extents slow, and why
> would this help?  If you have an uninitialized extent, then the blocks
> are already allocated, just flagged as containing uninitialized data.
>  Writing to them should be no different than writing to initialized
> extents, save for the step of clearing the uninitialized flag.

The other piece is that large uninitialized extents get split into
up to 3 extents if you write into the middle of them, so there are
potentially a lot more metadata updates flying around.
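
For example, a single 4k write into the middle of a 1G unwritten
extent turns

  [0 .. 1G)            unwritten

into three extents:

  [0 .. X)             unwritten
  [X .. X+4k)          written
  [X+4k .. 1G)         unwritten

and each such split is another extent tree update that has to be
journaled.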

-Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-28 11:27                   ` ext4_fallocate Ric Wheeler
@ 2012-06-29 19:02                     ` Andreas Dilger
  2012-07-02  3:03                       ` ext4_fallocate Zheng Liu
  0 siblings, 1 reply; 42+ messages in thread
From: Andreas Dilger @ 2012-06-29 19:02 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Eric Sandeen, Theodore Ts'o, Ric Wheeler, Fredrick,
	linux-ext4, wenqing.lz

On 2012-06-28, at 5:27 AM, Ric Wheeler wrote:
> We need to keep in mind what the goal of pre-allocation is (should be?) - spend a bit of extra time doing the allocation call so we get really good, contiguous layout on disk, which ultimately will help in streaming read/write workloads.
> 
> If you have a reasonably small file, pre-allocation is probably simply a waste of time - you would be better off overwriting the maximum file size with all zeros (even a 1GB file would take only a few seconds).
> 
> If the file is large enough to be interesting, I think that we might want to think about a scheme that would bring small random IO's more into line with the 1MB results Eric saw.
> 
> One way to do that might be to have a minimum "chunk" that we would zero out for any IO to an allocated but unwritten extent: if you write 4KB to the middle of said region, we pad up and zero out to the nearest MB with zeros.

There is already code for this in the ext4 uninit extent handling.
Currently the limit is 15(?) blocks, to inflate the write size up
to zero-fill a 64kB chunk on 4kB blocksize filesystems.  I wouldn't
object to increasing this to zero out a full 1MB (aligned) chunk
under random IO cases.

Yes, it would hurt the latency a small amount for the first writes
(though not very much for disks, given the seek overhead), but it
would avoid clobbering the extent tree so drastically, which also
has long-term bad performance effects.

> Note that for the target class of drives (SATA) that Ted mentioned earlier, a random 1MB write is not that much slower than a 4KB one (you need to pay the head movement costs already).  Of course, the sweet spot might turn out to be a bit smaller or larger.

Right, at ~100-150 seeks/sec, it is pretty close to the 150 MB/s write
bandwidth limit, so the cost of doing 4kB vs. 1MB writes is a toss-up.
I expect the bandwidth of drives to continue increasing, while the seek
rate will stay the same.  The tradeoff for SSD devices is not so clear,
since inflated writes will hurt the cache, but in that case, it makes
more sense to use trim-with-zero (if supported) instead of unallocated
blocks anyway.

Cheers, Andreas






^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-29 19:02                     ` ext4_fallocate Andreas Dilger
@ 2012-07-02  3:03                       ` Zheng Liu
  0 siblings, 0 replies; 42+ messages in thread
From: Zheng Liu @ 2012-07-02  3:03 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ric Wheeler, Eric Sandeen, Theodore Ts'o, Fredrick,
	linux-ext4, wenqing.lz

On Fri, Jun 29, 2012 at 01:02:35PM -0600, Andreas Dilger wrote:
> There is already code for this in the ext4 uninit extent handling.
> Currently the limit is 15(?) blocks, to inflate the write size up
> to zero-fill a 64kB chunk on 4kB blocksize filesystems.  I wouldn't
> object to increasing this to zero out a full 1MB (aligned) chunk
> under random IO cases.

I agree with you that we can increase it to a 1MB chunk, and I will
run some tests to see the result.  In addition, I am wondering whether
we can provide a parameter in sysfs or somewhere else to adjust this
value dynamically.
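
(Something like "echo 512 > /sys/fs/ext4/<disk>/extent_zeroout_len",
assuming we expose it as a per-device ext4 sysfs attribute; the name
here is just illustrative.)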

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-06-27 23:02                 ` ext4_fallocate Eric Sandeen
  2012-06-28 11:27                   ` ext4_fallocate Ric Wheeler
  2012-06-28 12:48                   ` ext4_fallocate Theodore Ts'o
@ 2012-07-02  3:16                   ` Zheng Liu
  2012-07-02 16:33                     ` ext4_fallocate Eric Sandeen
  2 siblings, 1 reply; 42+ messages in thread
From: Zheng Liu @ 2012-07-02  3:16 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Fredrick, Ric Wheeler, linux-ext4,
	Andreas Dilger, wenqing.lz

On Wed, Jun 27, 2012 at 07:02:45PM -0400, Eric Sandeen wrote:
> On 6/27/12 3:30 PM, Theodore Ts'o wrote:
> > A better workload would be to use a blocksize of 4k.  By using a
> > blocksize of 1024k, it's not surprising that the metadata overhead is
> > in the noise.
> > 
> > Try something like this; this will cause the extent tree overhead to
> > be roughly equal to the data block I/O.
> > 
> > [global]
> > rw=randwrite
> > size=128m
> > filesize=1g
> > bs=4k
> > ioengine=sync
> > fallocate=1
> > fsync=1
> > 
> > [thread1]
> > filename=testfile

Hi Eric,

Could you please run this test with the 'journal_async_commit' flag?
In my previous tests, this feature dramatically improved the
performance of uninitialized extent conversion.

I previously sent an email about a similar test [1], where the
journal_async_commit flag clearly improved the performance.

1. http://comments.gmane.org/gmane.linux.file-systems/63569

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02  3:16                   ` ext4_fallocate Zheng Liu
@ 2012-07-02 16:33                     ` Eric Sandeen
  2012-07-02 17:44                       ` ext4_fallocate Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Eric Sandeen @ 2012-07-02 16:33 UTC (permalink / raw)
  To: Theodore Ts'o, Fredrick, Ric Wheeler, linux-ext4,
	Andreas Dilger, wenqing.lz

On 07/01/2012 10:16 PM, Zheng Liu wrote:
> On Wed, Jun 27, 2012 at 07:02:45PM -0400, Eric Sandeen wrote:
>> On 6/27/12 3:30 PM, Theodore Ts'o wrote:
>>> A better workload would be to use a blocksize of 4k.  By using a
>>> blocksize of 1024k, it's not surprising that the metadata overhead is
>>> in the noise.
>>>
>>> Try something like this; this will cause the extent tree overhead to
>>> be roughly equal to the data block I/O.
>>>
>>> [global]
>>> rw=randwrite
>>> size=128m
>>> filesize=1g
>>> bs=4k
>>> ioengine=sync
>>> fallocate=1
>>> fsync=1
>>>
>>> [thread1]
>>> filename=testfile
> 
> Hi Eric,
> 
> Could you please run this test with the 'journal_async_commit' flag?
> In my previous tests, this feature dramatically improved the
> performance of uninitialized extent conversion.
> 
> I previously sent an email about a similar test [1], where the
> journal_async_commit flag clearly improved the performance.

I can try to find time for that, but so far I haven't actually seen any
severe impact of conversion on a non-debug kernel.  And didn't Jan think
that journal_async_commit was fatally flawed?

At this point I think it is up to the proponents of the
expose-stale-data patch to document & demonstrate the workload which
justifies such a change...

-Eric

> 1. http://comments.gmane.org/gmane.linux.file-systems/63569
> 
> Regards,
> Zheng


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 16:33                     ` ext4_fallocate Eric Sandeen
@ 2012-07-02 17:44                       ` Jan Kara
  2012-07-02 17:48                         ` ext4_fallocate Ric Wheeler
                                           ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Jan Kara @ 2012-07-02 17:44 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Theodore Ts'o, Fredrick, Ric Wheeler, linux-ext4,
	Andreas Dilger, wenqing.lz

On Mon 02-07-12 11:33:33, Eric Sandeen wrote:
> On 07/01/2012 10:16 PM, Zheng Liu wrote:
> > Hi Eric,
> > 
> > Could you please run this test with the 'journal_async_commit' flag?
> > In my previous tests, this feature dramatically improved the
> > performance of uninitialized extent conversion.
> > 
> > I previously sent an email about a similar test [1], where the
> > journal_async_commit flag clearly improved the performance.
> 
> I can try to find time for that, but so far I haven't actually seen any
> severe impact of conversion on a non-debug kernel.  And didn't Jan think
> that journal_async_commit was fatally flawed?
  Yes, that option is broken and basically unfixable for data=ordered mode
(see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
data=writeback it works fine AFAICT.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 17:44                       ` ext4_fallocate Jan Kara
@ 2012-07-02 17:48                         ` Ric Wheeler
  2012-07-03 17:41                           ` ext4_fallocate Zheng Liu
  2012-07-02 18:01                         ` ext4_fallocate Theodore Ts'o
  2012-07-04  1:15                         ` ext4_fallocate Phillip Susi
  2 siblings, 1 reply; 42+ messages in thread
From: Ric Wheeler @ 2012-07-02 17:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Theodore Ts'o, Fredrick, linux-ext4,
	Andreas Dilger, wenqing.lz

On 07/02/2012 01:44 PM, Jan Kara wrote:
> On Mon 02-07-12 11:33:33, Eric Sandeen wrote:
>> On 07/01/2012 10:16 PM, Zheng Liu wrote:
>>> Hi Eric,
>>>
>>> Could you please run this test with the 'journal_async_commit' flag?
>>> In my previous tests, this feature dramatically improved the
>>> performance of uninitialized extent conversion.
>>>
>>> I previously sent an email about a similar test [1], where the
>>> journal_async_commit flag clearly improved the performance.
>> I can try to find time for that, but so far I haven't actually seen any
>> severe impact of conversion on a non-debug kernel.  And didn't Jan think
>> that journal_async_commit was fatally flawed?
>    Yes, that option is broken and basically unfixable for data=ordered mode
> (see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
> data=writeback it works fine AFAICT.
>
> 								Honza

I think that we need to start with the basics - let's find a specific workload 
that we can use to validate the performance. In Eric's testing, nothing really 
jumped out.

Knowing the type of disk, the specific worst-case IO size, etc. would be great 
(and even better, if someone has a real-world, non-synthetic app that shows this).

I think it is definitely more interesting to try the MB-size extent conversion; 
that should generally be a good technique to minimize the overhead.

thanks!

Ric



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 17:44                       ` ext4_fallocate Jan Kara
  2012-07-02 17:48                         ` ext4_fallocate Ric Wheeler
@ 2012-07-02 18:01                         ` Theodore Ts'o
  2012-07-03  9:30                           ` ext4_fallocate Jan Kara
  2012-07-04  1:15                         ` ext4_fallocate Phillip Susi
  2 siblings, 1 reply; 42+ messages in thread
From: Theodore Ts'o @ 2012-07-02 18:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Fredrick, Ric Wheeler, linux-ext4, Andreas Dilger,
	wenqing.lz

On Mon, Jul 02, 2012 at 07:44:21PM +0200, Jan Kara wrote:
>   Yes, that option is broken and basically unfixable for data=ordered mode
> (see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
> data=writeback it works fine AFAICT.

The journal_async_commit option can be saved, but it requires changing
how we handle stale data.  What we need to do is to change things so
that we update the metadata *after* the data has been written back.
We do this already if the experimental dioread_nolock code is used,
but currently it only works for 4k blocks. 

The I/O tree work will give us the infrastructure we need so we can
easily update the metadata after the data blocks have been written out
when we are extending or filling in a sparse hole, even when the block
size != page size.  (This is why we can't currently make the
dioread_nolock code path the default; it would cause things to break
on 1k/2k file systems, as well as 4k file systems on Power.)  But once
this is done, it will allow us to subsume and rip out the
dioread_nolock codepath, and the distinction between ordered and
writeback mode.

Also, the metadata checksum patches will fix the other potential
problem with using journal_async_commit: they add fine-grained
checksums in the journal, so we can recover more easily from a
corrupted journal.

So once all of this is stable, we'll be able to significantly simplify
the ext4 code and our testing matrix, and get all of the benefits of
data=writeback, dioread_nolock, and journal_async_commit, without any
of their current drawbacks.  Which is why I've kept on pestering Zheng
about how the I/O tree work has been coming along on the ext4 calls;
it's going to enable some really cool things.  :-)

    	       	     	      		    - Ted

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 18:01                         ` ext4_fallocate Theodore Ts'o
@ 2012-07-03  9:30                           ` Jan Kara
  0 siblings, 0 replies; 42+ messages in thread
From: Jan Kara @ 2012-07-03  9:30 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Eric Sandeen, Fredrick, Ric Wheeler, linux-ext4,
	Andreas Dilger, wenqing.lz

On Mon 02-07-12 14:01:50, Ted Tso wrote:
> On Mon, Jul 02, 2012 at 07:44:21PM +0200, Jan Kara wrote:
> >   Yes, that option is broken and basically unfixable for data=ordered mode
> > (see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
> > data=writeback it works fine AFAICT.
> 
> The journal_async_commit option can be saved, but it requires changing
> how we handle stale data.  What we need to do is to change things so
> that we update the metadata *after* the data has been written back.
> We do this already if the experimental dioread_nolock code is used,
> but currently it only works for 4k blocks. 
  This won't save us. The problem with async commit is that in the
sequence:
write data
wait for data write completion
change metadata
write transaction with changed metadata
write commit block
CACHE_FLUSH

if you flip the power switch just before the CACHE_FLUSH, the disk could
have cached things such that the whole transaction made it to permanent
storage but the data didn't. So recovery will happily replay the
transaction and make unwritten disk blocks accessible. There is no simple
way around this problem except for issuing a CACHE_FLUSH sometime after
the data writes have completed and before you write the commit block...
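
Spelled out, the ordering that would be needed is (added flush marked):

write data
wait for data write completion
CACHE_FLUSH          <- added, so the data is durable first
change metadata
write transaction with changed metadata
write commit block
CACHE_FLUSH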

And if you ask about the complicated way ;-): you can compute a data
checksum, add it to the transaction, and check it on replay.

> The I/O tree work will give us the infrastructure we need so we can
> easily update the metadata after the data blocks have been written out
> when we are extending or filling in a sparse hole, even when the block
> size != page size.  (This is why we can't currently make the
> dioread_nolock code path the default; it would cause things to break
> on 1k/2k file systems, as well as 4k file systems on Power.)  But once
> this is done, it will allow us to subsume and rip out the
> dioread_nolock codepath, and the distinction between ordered and
> writeback mode.
> 
> Also, the metadata checksum patches will fix the other potential
> problem with using journal_async_commit: they add fine-grained
> checksums in the journal, so we can recover more easily from a
> corrupted journal.
> 
> So once all of this is stable, we'll be able to significantly simplify
> the ext4 code and our testing matrix, and get all of the benefits of
> data=writeback, dioread_nolock, and journal_async_commit, without any
> of their current drawbacks.  Which is why I've kept on pestering Zheng
> about how the I/O tree work has been coming along on the ext4 calls;
> it's going to enable some really cool things.  :-)
  Yeah, this would be good stuff. It just won't be enough for
journal_async_commit in data=ordered mode...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 17:48                         ` ext4_fallocate Ric Wheeler
@ 2012-07-03 17:41                           ` Zheng Liu
  2012-07-03 17:57                             ` ext4_fallocate Zach Brown
  0 siblings, 1 reply; 42+ messages in thread
From: Zheng Liu @ 2012-07-03 17:41 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick, linux-ext4,
	Andreas Dilger, wenqing.lz

On Mon, Jul 02, 2012 at 01:48:15PM -0400, Ric Wheeler wrote:
> I think it is definitely more interesting to try the MB-size extent
> conversion; that should generally be a good technique to minimize
> the overhead.

Hi Ric and other developers,

I generated a patch to adjust the length of the zero-out chunk (I attach
it at the bottom of this email), and ran the following test.  The result
shows that the number of extent fragments is reduced as the chunk
length increases.  Meanwhile, the iops in non-journal mode is slightly
increased, but in journal mode (data=ordered) the iops is almost
unchanged.  I will now describe my test in detail; sorry for leaving
so many boring numbers in here.

[environment]
I ran the test on my own desktop, which has an Intel(R) Core(TM)2 Duo
CPU E8400, 4GB of memory, and a SATA disk (SAMSUNG HD161GJ).  I only
used one partition of this disk for the test.

[workload]
I used the fio job provided by Ted to do the test; I paste it in here
again.

  [global]
  rw=randwrite
  size=128m
  filesize=1g
  bs=4k
  ioengine=sync
  fallocate=1
  fsync=1

  [thread1]
  filename=testfile

[ext4 parameters]
I ran the same test in both journal (data=ordered) and non-journal
mode.  The zero-out chunk lengths tested were: 7 (the old default
value), 16, 32, 64, 128, and 256.  In addition, I used dd to create a
preallocated file to simulate fallocate with the NO_HIDE_STALE_DATA
flag.  After every test, I used the following command to count the
number of extent fragments.

$ debugfs -R 'ex ${TESTFILE}' /dev/${DEVICE} | wc -l

[results]
Non-journal mode
len|extents|fio-result
7  |22656  |write: io=131072KB, bw=852170 B/s, iops=208 , runt=157501msec
16 |4273   |write: io=131072KB, bw=820552 B/s, iops=200 , runt=163570msec
32 |228    |write: io=131072KB, bw=828059 B/s, iops=202 , runt=162087msec
64 |31     |write: io=131072KB, bw=869201 B/s, iops=212 , runt=154415msec
128|23     |write: io=131072KB, bw=893706 B/s, iops=218 , runt=150181msec
256|17     |write: io=131072KB, bw=907281 B/s, iops=221 , runt=147934msec
flg|10     |write: io=131072KB, bw=1033.9KB/s, iops=258 , runt=126874msec

*flg: fallocate with NO_HIDE_STALE_DATA flag*

Journal mode
len|extents|fio-result
7  |22653  |write: io=131072KB, bw=124818 B/s, iops=30 , runt=1075302msec
16 |4260   |write: io=131072KB, bw=122595 B/s, iops=29 , runt=1094801msec
32 |228    |write: io=131072KB, bw=123968 B/s, iops=30 , runt=1082677msec
64 |32     |write: io=131072KB, bw=122272 B/s, iops=29 , runt=1097691msec
128|22     |write: io=131072KB, bw=123328 B/s, iops=30 , runt=1088291msec
256|19     |write: io=131072KB, bw=122040 B/s, iops=29 , runt=1099781msec
flg|10     |write: io=131072KB, bw=122266 B/s, iops=29 , runt=1097743msec

Obviously, after increasing the length of the zero-out chunk, the
number of fragments is reduced, and in non-journal mode the iops is
slightly increased.  In journal mode, the performance is not impacted.
So this patch can reduce the number of extent fragments when we do a
lot of uninitialized extent conversions, but it doesn't solve the key
issue.

The result of fallocate with the NO_HIDE_STALE_DATA flag puzzles me:
it does not improve the performance, but I haven't dug into this
problem yet.  So I went back and re-ran my old test, as described in
the previous email.  The result is 88s vs. 17s in journal mode.  The
command is as follows:

time for((i=0;i<2000;i++)); do \
dd if=/dev/zero of=/mnt/sda1/testfile conv=notrunc bs=4k \
      count=1 seek=`expr $i \* 16` oflag=sync,direct 2>/dev/null; \
done

So IMHO increasing the chunk length can help us reduce the number of
extent fragments as much as possible, but it doesn't solve the *root
cause*.

In addition, I ran the fio test comparing xfs and ext4 (ext4 with the
journal_async_commit option, and a zero-out chunk length of 7).  The
resulting iops are 39 (xfs) vs. 44 (ext4).  I did this test because I
recall that xfs's delayed logging is an async journal (I am not sure,
because I am not familiar with xfs's code).  Thus, it suggests that an
async journal is useful for us.

Regards,
Zheng

Subject: [PATCH] ext4: dynamically adjust the length of zero-out chunk

From: Zheng Liu <wenqing.lz@taobao.com>

Currently in ext4 the length of the zero-out chunk is set to 7.  This
is too short, and it causes a lot of extent fragmentation when we use
fallocate to preallocate uninitialized extents and the workload
frequently does uninitialized extent conversions.  Thus, we now set it
to 256 (a 1MB chunk), and keep it in the in-memory superblock info so
that it can be adjusted dynamically via sysfs.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/ext4.h    |    3 +++
 fs/ext4/extents.c |   11 ++++++-----
 fs/ext4/super.c   |    3 +++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cfc4e01..0f44577 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1265,6 +1265,9 @@ struct ext4_sb_info {
 	/* locality groups */
 	struct ext4_locality_group __percpu *s_locality_groups;
 
+	/* the length of zero-out chunk */
+	unsigned int s_extent_zeroout_len;
+
 	/* for write statistics */
 	unsigned long s_sectors_written_start;
 	u64 s_kbytes_written;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 91341ec..e921c02 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3029,7 +3029,6 @@ out:
 	return err ? err : map->m_len;
 }
 
-#define EXT4_EXT_ZERO_LEN 7
 /*
  * This function is called by ext4_ext_map_blocks() if someone tries to write
  * to an uninitialized extent. It may result in splitting the uninitialized
@@ -3055,6 +3054,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 					   struct ext4_map_blocks *map,
 					   struct ext4_ext_path *path)
 {
+	struct ext4_sb_info *sbi;
 	struct ext4_extent_header *eh;
 	struct ext4_map_blocks split_map;
 	struct ext4_extent zero_ex;
@@ -3069,6 +3069,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 		"block %llu, max_blocks %u\n", inode->i_ino,
 		(unsigned long long)map->m_lblk, map->m_len);
 
+	sbi = EXT4_SB(inode->i_sb);
 	eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >>
 		inode->i_sb->s_blocksize_bits;
 	if (eof_block < map->m_lblk + map->m_len)
@@ -3168,8 +3169,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 	 */
 	split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0;
 
-	/* If extent has less than 2*EXT4_EXT_ZERO_LEN zerout directly */
-	if (ee_len <= 2*EXT4_EXT_ZERO_LEN &&
+	/* If extent has less than 2*s_extent_zeroout_len zerout directly */
+	if (ee_len <= 2*sbi->s_extent_zeroout_len &&
 	    (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
 		err = ext4_ext_zeroout(inode, ex);
 		if (err)
@@ -3195,7 +3196,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 	split_map.m_len = map->m_len;
 
 	if (allocated > map->m_len) {
-		if (allocated <= EXT4_EXT_ZERO_LEN &&
+		if (allocated <= sbi->s_extent_zeroout_len &&
 		    (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
 			/* case 3 */
 			zero_ex.ee_block =
@@ -3209,7 +3210,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 			split_map.m_lblk = map->m_lblk;
 			split_map.m_len = allocated;
 		} else if ((map->m_lblk - ee_block + map->m_len <
-			   EXT4_EXT_ZERO_LEN) &&
+			   sbi->s_extent_zeroout_len) &&
 			   (EXT4_EXT_MAY_ZEROOUT & split_flag)) {
 			/* case 2 */
 			if (map->m_lblk != ee_block) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb7aa3e..ad6cf73 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2535,6 +2535,7 @@ EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
 EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
 EXT4_RW_ATTR_SBI_UI(max_writeback_mb_bump, s_max_writeback_mb_bump);
+EXT4_RW_ATTR_SBI_UI(extent_zeroout_len, s_extent_zeroout_len);
 EXT4_ATTR(trigger_fs_error, 0200, NULL, trigger_test_error);
 
 static struct attribute *ext4_attrs[] = {
@@ -2550,6 +2551,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(mb_stream_req),
 	ATTR_LIST(mb_group_prealloc),
 	ATTR_LIST(max_writeback_mb_bump),
+	ATTR_LIST(extent_zeroout_len),
 	ATTR_LIST(trigger_fs_error),
 	NULL,
 };
@@ -3626,6 +3628,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	sbi->s_stripe = ext4_get_stripe_size(sbi);
 	sbi->s_max_writeback_mb_bump = 128;
+	sbi->s_extent_zeroout_len = 256;
 
 	/*
 	 * set up enough so that it can read an inode
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-03 17:41                           ` ext4_fallocate Zheng Liu
@ 2012-07-03 17:57                             ` Zach Brown
  2012-07-04  2:23                               ` ext4_fallocate Zheng Liu
  0 siblings, 1 reply; 42+ messages in thread
From: Zach Brown @ 2012-07-03 17:57 UTC (permalink / raw)
  To: Ric Wheeler, Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick,
	linux-ext4, Andreas Dilger, wenqing.lz


> workload frequently does a uninitialized extent conversion.  Thus, now
> we set it to 256 (1MB chunk), and put it into super block in order to
> adjust it dynamically in sysfs.

It's a bit of a tangent, but this caught my eye.

> +	/* If extent has less than 2*s_extent_zeroout_len zerout directly */
> +	if (ee_len<= 2*sbi->s_extent_zeroout_len&&
>   	(EXT4_EXT_MAY_ZEROOUT&  split_flag)) {

> -		if (allocated<= EXT4_EXT_ZERO_LEN&&
> +		if (allocated<= sbi->s_extent_zeroout_len&&
>   		(EXT4_EXT_MAY_ZEROOUT&  split_flag)) {

>   		} else if ((map->m_lblk - ee_block + map->m_len<
> -			   EXT4_EXT_ZERO_LEN)&&
> +			   sbi->s_extent_zeroout_len)&&

I'd be worried about having to verify that nothing bad happened if these
sbi s_extent_zeroout_len references could see different values if they
raced with a sysfs update.  Can they do that?

Maybe avoid the risk all together and have an on-stack copy that's only
referenced once at the start?

- z

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-02 17:44                       ` ext4_fallocate Jan Kara
  2012-07-02 17:48                         ` ext4_fallocate Ric Wheeler
  2012-07-02 18:01                         ` ext4_fallocate Theodore Ts'o
@ 2012-07-04  1:15                         ` Phillip Susi
  2012-07-04  2:36                           ` ext4_fallocate Zheng Liu
  2 siblings, 1 reply; 42+ messages in thread
From: Phillip Susi @ 2012-07-04  1:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Theodore Ts'o, Fredrick, Ric Wheeler,
	linux-ext4, Andreas Dilger, wenqing.lz

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/02/2012 01:44 PM, Jan Kara wrote:
>   Yes, that option is broken and basically unfixable for data=ordered mode
> (see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
> data=writeback it works fine AFAICT.

Does data=writeback with barriers fix this fallocate problem?

I can see that writing small random bits into a very large uninitialized file would cause a lot of updates to the extent tree, but with a sufficiently large cache, shouldn't many of the small, random writes become coalesced into fewer, larger extent updates that are done at delayed flush time?  Are you sure that the extent tree updates are not being done right away and blocking the writing application, instead of being delayed?


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJP85kgAAoJEJrBOlT6nu75pnEIAM1qX1PeoZOiWJumgv9yA0te
C61813Vx2cVz5UpRZLO/YpAPTtei1ry2uLdN/TlmiH8TBc/fexipJwZGucYH7b9w
iwE4u19eixYCDV8iVJYVZ6NI6dp9McKNxqrvhcrBUCRiG+/q0Qp5dsD0Gu4PAnOy
rZzMvDU6Ln12nBH0M+UqE7VVZR6FR89tgpklSCDliPbDc0xWNUZQW0CRUkMVaAXu
jFJlNMiMBg7Cu0QzGVj7EzY+GtFGcEEVE5Us2c04bdm6nbvS567Pq6LfwyOM5IVe
q+/dwIN0Sdj3nneyPvrIfaFvQAKmtRnd6vkPxUnZgYSniq+hD0uf6J4nU1IKhw8=
=i9g7
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-03 17:57                             ` ext4_fallocate Zach Brown
@ 2012-07-04  2:23                               ` Zheng Liu
  0 siblings, 0 replies; 42+ messages in thread
From: Zheng Liu @ 2012-07-04  2:23 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ric Wheeler, Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick,
	linux-ext4, Andreas Dilger, wenqing.lz

Hi Zach,

Thanks for reviewing this patch.

On Tue, Jul 03, 2012 at 10:57:35AM -0700, Zach Brown wrote:
> 
> >workload frequently does a uninitialized extent conversion.  Thus, now
> >we set it to 256 (1MB chunk), and put it into super block in order to
> >adjust it dynamically in sysfs.
> 
> It's a bit of a tangent, but this caught my eye.

Oh, actually now the default value is set to 1MB in this patch.  But I
think maybe other users want to change this value.  So I add an interface
in sysfs to adjust dynaimically.  Certainly it is convenient for me to do
the above tests. :-)

> 
> >+	/* If extent has less than 2*s_extent_zeroout_len zerout directly */
> >+	if (ee_len<= 2*sbi->s_extent_zeroout_len&&
> >  	(EXT4_EXT_MAY_ZEROOUT&  split_flag)) {
> 
> >-		if (allocated<= EXT4_EXT_ZERO_LEN&&
> >+		if (allocated<= sbi->s_extent_zeroout_len&&
> >  		(EXT4_EXT_MAY_ZEROOUT&  split_flag)) {
> 
> >  		} else if ((map->m_lblk - ee_block + map->m_len<
> >-			   EXT4_EXT_ZERO_LEN)&&
> >+			   sbi->s_extent_zeroout_len)&&
> 
> I'd be worried about having to verify that nothing bad happened if these
> sbi s_extent_zeroout_len references could see different values if they
> raced with a sysfs update.  Can they do that?
> 
> Maybe avoid the risk all together and have an on-stack copy that's only
> referenced once at the start?

I don't think 's_extent_zeroout_len' is updated frequently by user, but
using an on-stack copy quite can avoid the risk.  I will fix it if most
of all think that this patch is useful and can be applied.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-04  1:15                         ` ext4_fallocate Phillip Susi
@ 2012-07-04  2:36                           ` Zheng Liu
  2012-07-04  3:06                             ` ext4_fallocate Phillip Susi
  0 siblings, 1 reply; 42+ messages in thread
From: Zheng Liu @ 2012-07-04  2:36 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick, Ric Wheeler,
	linux-ext4, Andreas Dilger, wenqing.lz

Hi Phillip,

On Tue, Jul 03, 2012 at 09:15:16PM -0400, Phillip Susi wrote:
> On 07/02/2012 01:44 PM, Jan Kara wrote:
> >   Yes, that option is broken and basically unfixable for data=ordered mode
> > (see http://comments.gmane.org/gmane.comp.file-systems.ext4/30727). For
> > data=writeback it works fine AFAICT.
> 
> Does data=writeback with barriers fix this fallocate problem?

If we only use data=writeback without 'journal_async_commit', it won't
fix this problem according to my test. :-(

> 
> I can see that writing small random bits into a very large uninitialized file would cause a lot of updates to the extent tree, but with a sufficiently large cache, shouldn't many of the small, random writes become coalesced into fewer, larger extent updates that are done at delayed flush time?  Are you sure that the extent tree updates are not being done right away and blocking the writing application, instead of being delayed?

Actually the workload needs to flush the data after writting a small
random bits.  This workload is met in our product system at Taobao.
Thus, the application has to wait this write to be done.  Certainly if
we don't flush the data, the problem won't happen but there is a risk
that we could loss our data.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-04  2:36                           ` ext4_fallocate Zheng Liu
@ 2012-07-04  3:06                             ` Phillip Susi
  2012-07-04  3:48                               ` ext4_fallocate Zheng Liu
  2012-07-04 12:20                               ` ext4_fallocate Ric Wheeler
  0 siblings, 2 replies; 42+ messages in thread
From: Phillip Susi @ 2012-07-04  3:06 UTC (permalink / raw)
  To: Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick, Ric Wheeler,
	linux-ext4, Andreas Dilger, wenqing.lz

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/03/2012 10:36 PM, Zheng Liu wrote:
> Actually the workload needs to flush the data after writting a small
> random bits.  This workload is met in our product system at Taobao.
> Thus, the application has to wait this write to be done.  Certainly if
> we don't flush the data, the problem won't happen but there is a risk
> that we could loss our data.

Ohh, I see now... you want lots of small, random, synchronous writes.  Then I think the only way to avoid the metadata overhead is with the unsafe stale data patch.  More importantly, this workload is going to have terrible performance no matter what the fs does, because even with all of the blocks initialized, you're still doing lots of seeking and not allowing the IO elevator to help.  Maybe the application can be redesigned so that it does not generate such pathological IO?

Or is this another case of userspace really needing access to barriers rather than using the big hammer of fsync?

Is there any technical reason why a barrier flag can't be added to the aio interface?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJP87NSAAoJEJrBOlT6nu75F7wH/0NrhBZyp+KwYuJpqM+hUs8o
1CSRXpw/OBOljH61kAT1zqRofMNA/6gF5EYnAB3usUiwjJk4eQu4kFCVastZrLRQ
Zm3H2yuSXbeoRvuSnNBbpjCPFQYIKWHcKjt2+pov853Kb2auYSyN2T6I7qUmLu7U
tMJtJekMi4HPz3snBrLGsvYRyy7rMLWwf3wd+G2SBrVNS/tKuRSRaw0FwXqDaoN9
Dc4FqWmVeicG9qQbszkal84IZo1YwSWMzeRvqjn+LER1XTJJaSzvttppBKKFsS2M
FVeMwTtRHJ3wl3jWYTyN/AwDUg3Pr/PX9/xmdn8DSGQG3eo7sBlqyPBzN6QYz5w=
=C2YA
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-04  3:06                             ` ext4_fallocate Phillip Susi
@ 2012-07-04  3:48                               ` Zheng Liu
  2012-07-04 12:20                               ` ext4_fallocate Ric Wheeler
  1 sibling, 0 replies; 42+ messages in thread
From: Zheng Liu @ 2012-07-04  3:48 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick, Ric Wheeler,
	linux-ext4, Andreas Dilger, wenqing.lz

On Tue, Jul 03, 2012 at 11:06:58PM -0400, Phillip Susi wrote:
> Ohh, I see now... you want lots of small, random, synchronous writes.  Then I think the only way to avoid the metadata overhead is with the unsafe stale data patch.  More importantly, this workload is going to have terrible performance no matter what the fs does, because even with all of the blocks initialized, you're still doing lots of seeking and not allowing the IO elevator to help.  Maybe the application can be redesigned so that it does not generate such pathological IO?

Yes, I agree with you that the application should be re-designed but we
cannot control it.  So we have to use this big hammer. :-(

> 
> Or is this another case of userspace really needing access to barriers rather than using the big hammer of fsync?

Maybe Google meets the same problem.

> 
> Is there any technical reason why a barrier flag can't be added to the aio interface?

Sorry, I am not very familiar with it.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-04  3:06                             ` ext4_fallocate Phillip Susi
  2012-07-04  3:48                               ` ext4_fallocate Zheng Liu
@ 2012-07-04 12:20                               ` Ric Wheeler
  2012-07-04 13:25                                 ` ext4_fallocate Zheng Liu
  1 sibling, 1 reply; 42+ messages in thread
From: Ric Wheeler @ 2012-07-04 12:20 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Kara, Eric Sandeen, Theodore Ts'o, Fredrick, linux-ext4,
	Andreas Dilger, wenqing.lz

On 07/03/2012 11:06 PM, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 07/03/2012 10:36 PM, Zheng Liu wrote:
>> Actually the workload needs to flush the data after writting a small
>> random bits.  This workload is met in our product system at Taobao.
>> Thus, the application has to wait this write to be done.  Certainly if
>> we don't flush the data, the problem won't happen but there is a risk
>> that we could loss our data.
> Ohh, I see now... you want lots of small, random, synchronous writes.  Then I think the only way to avoid the metadata overhead is with the unsafe stale data patch.  More importantly, this workload is going to have terrible performance no matter what the fs does, because even with all of the blocks initialized, you're still doing lots of seeking and not allowing the IO elevator to help.  Maybe the application can be redesigned so that it does not generate such pathological IO?
>
> Or is this another case of userspace really needing access to barriers rather than using the big hammer of fsync?
>
> Is there any technical reason why a barrier flag can't be added to the aio interface?

For reasonably sized files, you might just as well really write out the full 
file with "write" (do pre-allocation the old fashioned way).

Performance of course depends on the size of the file, but for a 1GB file you 
can do this in a few seconds and prevent fragmentation and totally eliminate the 
performance of flipping extents.

How large is the file you need to pre-allocate? How long does the job typically 
run (minutes? hours? days?) :)

ric



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: ext4_fallocate
  2012-07-04 12:20                               ` ext4_fallocate Ric Wheeler
@ 2012-07-04 13:25                                 ` Zheng Liu
  0 siblings, 0 replies; 42+ messages in thread
From: Zheng Liu @ 2012-07-04 13:25 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Phillip Susi, Jan Kara, Eric Sandeen, Theodore Ts'o,
	Fredrick, linux-ext4, Andreas Dilger, wenqing.lz

On Wed, Jul 04, 2012 at 08:20:07AM -0400, Ric Wheeler wrote:
> On 07/03/2012 11:06 PM, Phillip Susi wrote:
> For reasonably sized files, you might just as well really write out
> the full file with "write" (do pre-allocation the old fashioned
> way).
> 
> Performance of course depends on the size of the file, but for a 1GB
> file you can do this in a few seconds and prevent fragmentation and
> totally eliminate the performance of flipping extents.
> 
> How large is the file you need to pre-allocate? How long does the
> job typically run (minutes? hours? days?) :)

We have over one thousand servers with ten or more 2T SATA disks, and we
need to pre-allocate a lot of files with fixed size until the space of
every disks is occupied when the application starts up on first time.
Meanwhile this number is increasing every months.  It will be absolutely
a nightmare for SA if we use the old fashioned way to do pre-allocation.

Regards,
Zheng

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2012-07-04 13:17 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-25  6:42 ext4_fallocate Fredrick
2012-06-25  7:33 ` ext4_fallocate Andreas Dilger
2012-06-28 15:12   ` ext4_fallocate Phillip Susi
2012-06-28 15:23     ` ext4_fallocate Eric Sandeen
2012-06-25  8:51 ` ext4_fallocate Zheng Liu
2012-06-25 19:04   ` ext4_fallocate Fredrick
2012-06-25 19:17   ` ext4_fallocate Theodore Ts'o
2012-06-26  1:23     ` ext4_fallocate Fredrick
2012-06-26 13:13     ` ext4_fallocate Ric Wheeler
2012-06-26 17:30       ` ext4_fallocate Theodore Ts'o
2012-06-26 18:06         ` ext4_fallocate Fredrick
2012-06-26 18:21         ` ext4_fallocate Ric Wheeler
2012-06-26 18:57           ` ext4_fallocate Ted Ts'o
2012-06-26 19:22             ` ext4_fallocate Ric Wheeler
2012-06-26 18:05       ` ext4_fallocate Fredrick
2012-06-26 18:59         ` ext4_fallocate Ted Ts'o
2012-06-26 19:30         ` ext4_fallocate Ric Wheeler
2012-06-26 19:57           ` ext4_fallocate Eric Sandeen
2012-06-26 20:44             ` ext4_fallocate Eric Sandeen
2012-06-27 15:14               ` ext4_fallocate Eric Sandeen
2012-06-27 19:30               ` ext4_fallocate Theodore Ts'o
2012-06-27 23:02                 ` ext4_fallocate Eric Sandeen
2012-06-28 11:27                   ` ext4_fallocate Ric Wheeler
2012-06-29 19:02                     ` ext4_fallocate Andreas Dilger
2012-07-02  3:03                       ` ext4_fallocate Zheng Liu
2012-06-28 12:48                   ` ext4_fallocate Theodore Ts'o
2012-07-02  3:16                   ` ext4_fallocate Zheng Liu
2012-07-02 16:33                     ` ext4_fallocate Eric Sandeen
2012-07-02 17:44                       ` ext4_fallocate Jan Kara
2012-07-02 17:48                         ` ext4_fallocate Ric Wheeler
2012-07-03 17:41                           ` ext4_fallocate Zheng Liu
2012-07-03 17:57                             ` ext4_fallocate Zach Brown
2012-07-04  2:23                               ` ext4_fallocate Zheng Liu
2012-07-02 18:01                         ` ext4_fallocate Theodore Ts'o
2012-07-03  9:30                           ` ext4_fallocate Jan Kara
2012-07-04  1:15                         ` ext4_fallocate Phillip Susi
2012-07-04  2:36                           ` ext4_fallocate Zheng Liu
2012-07-04  3:06                             ` ext4_fallocate Phillip Susi
2012-07-04  3:48                               ` ext4_fallocate Zheng Liu
2012-07-04 12:20                               ` ext4_fallocate Ric Wheeler
2012-07-04 13:25                                 ` ext4_fallocate Zheng Liu
2012-06-26 13:06 ` ext4_fallocate Eric Sandeen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.