* [Resubmit][Patch 0/2] Persistent preallocation in ext4
@ 2007-01-17  9:46 Amit K. Arora
  2007-01-17 10:13 ` [Patch 1/2] ioctl and uninitialized extents Amit K. Arora
                   ` (7 more replies)
  0 siblings, 8 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-17  9:46 UTC
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

Please Note (especially <c> below):
----------------------------------
<a> This is being resubmitted as part of the recall for ext4 patches.
<b> The patches are based on 2.6.20-rc5 kernel version.
<c> These patches require the "EXTENT OVERLAP BUGFIX" patch submitted by
me earlier (on Jan 16th).

Description:
-----------
Persistent preallocation is a proposed new feature in ext4, which will
allow user applications to preallocate blocks for a file. It is
similar to the posix_fallocate() call but, unlike posix_fallocate(), it
does not initialize (write to) the allocated blocks.
This patch uses an ioctl interface and returns "0" if the call succeeds,
else returns the error number. Other approaches are discussed under the
"Outstanding Issues" section below.

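For illustration, a user-space caller might invoke the proposed ioctl
roughly as sketched below. The structure layout and the request number
are copied from patch 1/2; no stock header provides them, so they are
redefined locally. The wrapper name ext4_preallocate() is hypothetical,
and the sketch assumes a kernel with these patches applied.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Mirrors struct ext4_falloc_input as defined in patch 1/2. */
struct ext4_falloc_input {
	uint64_t offset;	/* byte offset where preallocation starts */
	uint64_t len;		/* number of bytes to preallocate */
};

/* Request number as defined in patch 1/2 (not in stock headers). */
#define EXT4_IOC_PREALLOCATE	_IOW('f', 9, struct ext4_falloc_input)

/*
 * Hypothetical wrapper: returns 0 on success, otherwise the errno
 * value reported by the ioctl.
 */
static int ext4_preallocate(int fd, uint64_t offset, uint64_t len)
{
	struct ext4_falloc_input in = { .offset = offset, .len = len };

	if (ioctl(fd, EXT4_IOC_PREALLOCATE, &in) < 0)
		return errno;
	return 0;
}
```

On a kernel without the patches the call is expected to fail (for
example with ENOTTY), so a caller would probably want to fall back to
posix_fallocate() in that case.
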
There are two patches being submitted as part of this:

(1) The first patch implements the ioctl interface, which does the
preallocation. The preallocated blocks are part of a new extent,
which is marked "uninitialized". The MSB in ee_len (of ext4_extent
data structure) is used to mark an extent "uninitialized". It also takes
care of preallocating through a hole and updating the file size
accordingly.

(2) The second patch implements the support for writing to the
uninitialized extent(s). This write may result in breaking down the
uninitialized extent into one initialized extent and up to two
uninitialized extents, depending on which part of the uninitialized
extent is being written to. If all the blocks in the uninitialized
extent are being written to, the extent is marked initialized and no
split is required. This patch also takes care of merging the initialized
extent with neighbouring ones, if possible.
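
The flag encoding from (1) and the split rule from (2) can be sketched
in plain user-space C as below. This is only a simplified model, not the
kernel code: struct sketch_extent and split_for_write() are hypothetical
names, lengths are kept in host byte order (the on-disk ee_len is
little-endian), and physical block numbers, journalling and merging are
left out.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_FLAG	0x8000u	/* MSB of the 16-bit length field */
#define LEN_MASK	0x7FFFu	/* remaining 15 bits hold the block count */

static uint16_t mark_uninitialized(uint16_t ee_len)
{
	return (uint16_t)(ee_len | UNINIT_FLAG);
}

static int is_uninitialized(uint16_t ee_len)
{
	return (ee_len & UNINIT_FLAG) != 0;
}

static uint16_t actual_len(uint16_t ee_len)
{
	return (uint16_t)(ee_len & LEN_MASK);
}

/* Hypothetical simplified extent: logical start block and flagged length. */
struct sketch_extent {
	uint32_t block;
	uint16_t len;
};

/*
 * Split an uninitialized extent for a write of "count" blocks starting
 * at logical block "iblock". Stores 1 to 3 pieces in out[] and returns
 * how many pieces were produced.
 */
static int split_for_write(struct sketch_extent ex, uint32_t iblock,
			   uint32_t count, struct sketch_extent out[3])
{
	uint32_t ee_len = actual_len(ex.len);
	uint32_t remaining;
	int n = 0;

	assert(is_uninitialized(ex.len));
	assert(count > 0 && iblock >= ex.block && iblock < ex.block + ee_len);

	/* blocks from iblock up to the end of the extent */
	remaining = ee_len - (iblock - ex.block);

	if (iblock > ex.block) {	/* left part stays uninitialized */
		out[n].block = ex.block;
		out[n].len = mark_uninitialized((uint16_t)(iblock - ex.block));
		n++;
	}

	out[n].block = iblock;		/* written part becomes initialized */
	out[n].len = (uint16_t)(count < remaining ? count : remaining);
	n++;

	if (remaining > count) {	/* right part stays uninitialized */
		out[n].block = iblock + count;
		out[n].len = mark_uninitialized((uint16_t)(remaining - count));
		n++;
	}
	return n;
}
```

Writing to the middle of an extent yields three pieces, writing at
either end yields two, and writing the whole extent yields a single
initialized piece, which matches the three cases described above.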

Outstanding Issues:
------------------
(1) The final interface is yet to be decided. We have the option of
choosing one of these:
	a> modifying posix_fallocate() in glibc
	b> using fcntl
	c> using ftruncate, or
	d> using the ioctl interface.

  If we go with the ioctl interface, we need to choose the return
value from the ioctl. We should either return "0" for success and
errno for failure, or we should be returning number of bytes
preallocated.

(2) Also, we need to decide what should happen in a partial success
scenario, i.e. when we hit an error (say ENOSPC) after a few blocks have
already been preallocated. Should the call just return the number of
bytes preallocated, or should it "undo" the partial preallocation and
then exit with an error code?

(3) Currently we only allow persistent preallocation on files that have
extents enabled. It was considered a rare case for a user to want
preallocation on non-extent based file(s). And even if someone really
wants to do this, it will be recommended to convert the file to the
extent-based format first, and then do persistent preallocation on it.
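
As background for issues (1) and (2): the ioctl in patch 1/2 works on
whole filesystem blocks, so any byte count returned to the caller would
be derived from a block range. A minimal sketch of that rounding, based
on the EXT4_BLOCK_ALIGN arithmetic in the patch, follows. The names
struct block_range and bytes_to_blocks() are hypothetical and exist only
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: first logical block and number of blocks. */
struct block_range {
	uint64_t first;
	uint64_t count;
};

/*
 * Convert the byte range [offset, offset + len) into the block range
 * that fully covers it, for a block size of (1 << blkbits) bytes.
 */
static struct block_range bytes_to_blocks(uint64_t offset, uint64_t len,
					  unsigned int blkbits)
{
	struct block_range r;
	uint64_t blksize, end;

	assert(len > 0 && blkbits < 32);

	blksize = (uint64_t)1 << blkbits;
	/* round the end of the range up to the next block boundary */
	end = (offset + len + blksize - 1) & ~(blksize - 1);

	r.first = offset >> blkbits;
	r.count = (end >> blkbits) - r.first;
	return r;
}
```

With 4096-byte blocks, a request for 200 bytes at offset 4000 touches
two blocks, so the space reserved (8192 bytes) can be larger than the
length the caller asked for.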

Testing done:
------------
(1) Unit testing included preallocating blocks and writing to them.
Preallocation through holes was also tested. Creation, splitting and
merging of extents were observed through a modified (patched) version of
debugfs (part of e2fsprogs). This modified version recognises and
flags uninitialized extent(s) in the output/display.
(2) For stress testing, fsx-linux (from LTP) was patched and used. It was
modified to call the preallocation ioctl instead of ftruncate operations.
It uncovered a couple of bugs (extent overlap being one of them). These
bugs have already been fixed here.

I have the patches for e2fsprogs and fsx-linux and can post them if
anyone is interested in trying/testing the preallocation patches.
I have also written a small test program/tool which can be used for
unit testing.

Thanks!
--
Regards,
Amit Arora


* [Patch 1/2] ioctl and uninitialized extents
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
@ 2007-01-17 10:13 ` Amit K. Arora
  2007-01-17 10:18 ` [Patch 2/2] support for writing to uninitialized extent Amit K. Arora
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-17 10:13 UTC
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

This patch implements the ioctl which may be used for persistent
preallocation of blocks to an extent enabled file in ext4.

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  125 ++++++++++++++++++++++++++--------------
 fs/ext4/ioctl.c                 |   69 ++++++++++++++++++++++
 include/linux/ext4_fs.h         |   13 ++++
 include/linux/ext4_fs_extents.h |   13 ++++
 4 files changed, 177 insertions(+), 43 deletions(-)

Index: linux-2.6.20-rc5/fs/ext4/extents.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/ext4/extents.c
+++ linux-2.6.20-rc5/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  	ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
-			        le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int depth, len1;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *ex, *fex;
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
-	int depth, len, err, next;
+	int depth, len, err, next, uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/* ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1254,7 +1275,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
-			        le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 		           > le32_to_cpu(nearex->ee_block)) {
@@ -1267,7 +1288,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
-				        le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1280,7 +1301,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1299,8 +1320,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1370,8 +1396,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1383,7 +1409,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1399,7 +1426,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
-		        cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1472,15 +1499,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len));
+			        (unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-		            + le16_to_cpu(ex->ee_len)) {
+		            + ext4_ext_get_actual_len(ex)) {
 	        lblock = le32_to_cpu(ex->ee_block)
-		         + le16_to_cpu(ex->ee_len);
+		         + ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len),
+			        (unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1610,12 +1637,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1629,12 +1656,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1642,12 +1669,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1661,7 +1688,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	struct ext4_extent_header *eh;
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
-	unsigned short ex_ee_len;
+	unsigned short ex_ee_len, uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1676,7 +1703,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1744,6 +1773,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1753,7 +1784,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2029,7 +2060,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2037,8 +2068,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2046,8 +2078,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2089,6 +2124,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
 		goto out2;
@@ -2100,8 +2137,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create!=EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
Index: linux-2.6.20-rc5/fs/ext4/ioctl.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/ext4/ioctl.c
+++ linux-2.6.20-rc5/fs/ext4/ioctl.c
@@ -11,6 +11,7 @@
 #include <linux/jbd2.h>
 #include <linux/capability.h>
 #include <linux/ext4_fs.h>
+#include <linux/ext4_fs_extents.h>
 #include <linux/ext4_jbd2.h>
 #include <linux/time.h>
 #include <linux/compat.h>
@@ -248,6 +249,74 @@ flags_err:
 		return err;
 	}
 
+	case EXT4_IOC_PREALLOCATE: {
+		struct ext4_falloc_input input;
+		handle_t *handle;
+		ext4_fsblk_t block, max_blocks;
+		int ret, ret2, nblocks = 0, retries = 0;
+		struct buffer_head map_bh;
+		unsigned int credits, blkbits = inode->i_blkbits;
+
+		if (IS_RDONLY(inode))
+			return -EROFS;
+
+		if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+			return -ENOTTY;
+
+		if (copy_from_user(&input,
+			(struct ext4_falloc_input __user *) arg, sizeof(input)))
+			return -EFAULT;
+
+		if (input.len == 0)
+			return -EINVAL;
+
+		block = input.offset >> blkbits;
+		max_blocks = (EXT4_BLOCK_ALIGN(input.len + input.offset,
+						blkbits) >> blkbits) - block;
+		mutex_lock(&EXT4_I(inode)->truncate_mutex);
+		credits = ext4_ext_calc_credits_for_insert(inode, NULL);
+		mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+		handle=ext4_journal_start(inode, credits +
+					EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
+		if (IS_ERR(handle))
+			return PTR_ERR(handle);
+retry:
+		ret = 0;
+		while (ret >= 0 && ret < max_blocks) {
+			block = block + ret;
+			max_blocks = max_blocks - ret;
+	  		ret = ext4_ext_get_blocks(handle, inode, block,
+						max_blocks, &map_bh,
+						EXT4_CREATE_UNINITIALIZED_EXT,
+						0);
+			BUG_ON(!ret);
+			if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
+			    && ((block + ret) << blkbits) > i_size_read(inode))
+				nblocks = nblocks + ret;
+		}
+		if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb,
+								&retries))
+			goto retry;
+
+		if (nblocks) {
+			mutex_lock(&inode->i_mutex);
+			if (!i_size_read(inode) && ret > 0)
+				i_size_write(inode, (block + ret) << blkbits);
+			else
+				i_size_write(inode, i_size_read(inode) +
+							(nblocks << blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret > 0)
+			ret = ret2;
+
+		return ret > 0 ? 0 : ret;
+	}
+
 	default:
 		return -ENOTTY;
 	}
Index: linux-2.6.20-rc5/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/ext4_fs.h
+++ linux-2.6.20-rc5/include/linux/ext4_fs.h
@@ -102,6 +102,8 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits) 	(((size)+(1 << blkbits)-1) & \
+							(~((1 << blkbits)-1)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +227,16 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/* The struct ext4_falloc_input in kernel, for EXT4_IOC_PREALLOCATE */
+struct ext4_falloc_input {
+	__u64 offset;
+	__u64 len;
+};
+
+/* Following is used by preallocation logic to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -242,6 +254,7 @@ struct ext4_new_group_data {
 #endif
 #define EXT4_IOC_GETRSVSZ		_IOR('f', 5, long)
 #define EXT4_IOC_SETRSVSZ		_IOW('f', 6, long)
+#define EXT4_IOC_PREALLOCATE		_IOW('f', 9, struct ext4_falloc_input)
 
 /*
  * ioctl commands in 32 bit emulation
Index: linux-2.6.20-rc5/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.20-rc5/include/linux/ext4_fs_extents.h
@@ -125,6 +125,19 @@ struct ext4_ext_path {
 #define EXT4_EXT_CACHE_EXTENT	2
 
 /*
+ * Macro-instructions to handle (mark/unmark/check/create) uninitialized
+ * extents. Applications can issue an IOCTL for preallocation, which results
+ * in assigning uninitialized extents to the file.
+ */
+#define ext4_ext_mark_uninitialized(ext)	((ext)->ee_len |= \
+							cpu_to_le16(0x8000))
+#define ext4_ext_is_uninitialized(ext)  	((le16_to_cpu((ext)->ee_len))& \
+									0x8000)
+#define ext4_ext_get_actual_len(ext)		((le16_to_cpu((ext)->ee_len))& \
+									0x7FFF)
+
+
+/*
  * to be called by ext4_ext_walk_space()
  * negative retcode - error
  * positive retcode - signal for ext4_ext_walk_space(), see below


* [Patch 2/2] support for writing to uninitialized extent
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
  2007-01-17 10:13 ` [Patch 1/2] ioctl and uninitialized extents Amit K. Arora
@ 2007-01-17 10:18 ` Amit K. Arora
  2007-01-17 22:20 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Mingming Cao
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-17 10:18 UTC
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

This patch adds the support for writing to an uninitialized extent (the
extent which was created as a result of persistent preallocation).

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  228 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 202 insertions(+), 27 deletions(-)

Index: linux-2.6.20-rc5/fs/ext4/extents.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/ext4/extents.c
+++ linux-2.6.20-rc5/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+				struct ext4_ext_path *path,
+				struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done=0, uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+					* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+		merge_done = 1;
+		BUG_ON(eh->eh_entries == 0);
+	}
+
+	return merge_done;
+}
+
+
+/*
  * ext4_ext_check_overlap:
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
@@ -1316,25 +1361,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * ext4_ext_convert_to_initialized:
+ * this function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three). At least one initialized extent
+ * and at most two uninitialized extents can result.
+ * There are three possibilities:
+ *   a> No split required: Entire extent should be initialized.
+ *   b> Split into two extents: Only one end of the extent is being written to.
+ *   c> Split into three extents: Someone is writing in middle of the extent.
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0, ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth)
+		{
+			depth=newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = newblock;
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	if ((err = ext4_ext_get_access(handle, inode, path + depth)))
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2==ex and ex3==NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2055,6 +2216,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2063,13 +2225,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2078,12 +2236,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2135,6 +2308,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.20-rc5/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.20-rc5/include/linux/ext4_fs_extents.h
@@ -203,6 +203,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
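
For readers following the ext4_ext_get_blocks() hunk above, the new
handling of a block that falls inside an existing extent can be
summarized by the sketch below. It is an illustration only, not code from
the patch: the enum and the function name are made up here, and the value
2 for EXT4_CREATE_UNINITIALIZED_EXT is the one used in the e2fsprogs
header change later in this thread. The "out2" case is described the way
the patch comment puts it (the extent is treated as a hole).

```c
#include <assert.h>

/* Value as defined in the e2fsprogs header change in this thread. */
#define EXT4_CREATE_UNINITIALIZED_EXT 2

/* Hypothetical labels for the three exits taken in the hunk. */
enum ext_action {
	ACT_RETURN_MAPPED,	/* "goto out": hand back the existing mapping */
	ACT_TREAT_AS_HOLE,	/* "goto out2": nothing is mapped for the caller */
	ACT_CONVERT		/* call ext4_ext_convert_to_initialized() */
};

/*
 * Decide what to do when the requested block is covered by an extent.
 * `uninitialized` is nonzero for a preallocated (uninitialized) extent,
 * `create` is the get_blocks create argument (0, 1 or
 * EXT4_CREATE_UNINITIALIZED_EXT).
 */
enum ext_action action_for_covering_extent(int uninitialized, int create)
{
	assert(create >= 0 && create <= EXT4_CREATE_UNINITIALIZED_EXT);

	if (!uninitialized)
		return ACT_RETURN_MAPPED;	/* ordinary extent, may be cached */
	if (create == EXT4_CREATE_UNINITIALIZED_EXT)
		return ACT_RETURN_MAPPED;	/* preallocating over preallocated blocks */
	if (!create)
		return ACT_TREAT_AS_HOLE;	/* read: preallocated blocks look like a hole */
	return ACT_CONVERT;			/* write: split out an initialized part */
}
```

A write (create == 1) into an uninitialized extent is therefore the only
case that reaches the conversion routine.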

* Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
  2007-01-17 10:13 ` [Patch 1/2] ioctl and uninitialized extents Amit K. Arora
  2007-01-17 10:18 ` [Patch 2/2] support for writing to uninitialized extent Amit K. Arora
@ 2007-01-17 22:20 ` Mingming Cao
  2007-01-17 22:33   ` Eric Sandeen
  2007-01-18  6:08   ` Amit K. Arora
  2007-01-19  9:11 ` Amit K. Arora
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 340+ messages in thread
From: Mingming Cao @ 2007-01-17 22:20 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-ext4, suparna, alex, suzuki

Amit K. Arora wrote:

> Outstanding Issues:
> ------------------
> (1) The final interface is yet to be decided. We have the option of
> choosing one of these:
> 	a> modifying posix_fallocate() in glibc
> 	b> using fcntl
> 	c> using ftruncate, or
> 	d> using the ioctl interface.
> 
>   If we go with the ioctl interface, we need to choose the return
> value from the ioctl. We should either return "0" for success and
> errno for failure, or return the number of bytes
> preallocated.
> 

I now lean toward simply returning 0 for success. Returning the number of
bytes preallocated to userspace might help when the specified window
contains blocks that are already allocated, but that should not be a
common case.

> (2) Also, we need to decide what should happen in a partial success
> scenario, i.e. when we hit some error (say ENOSPC) after a few blocks
> have been preallocated. Should the call just return the number of bytes
> preallocated, or should it "undo" the partial preallocation and then
> exit with an error code?
> 
I think we should try to avoid partial preallocation in the first
place, probably by checking the number of free blocks before calling
ext4_ext_get_blocks() and returning -ENOSPC if there aren't enough free
blocks to allocate. If the call still hits ENOSPC after that check, I
think it doesn't hurt to leave the partially preallocated blocks in place.

Cheers,
Mingming

* Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4
  2007-01-17 22:20 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Mingming Cao
@ 2007-01-17 22:33   ` Eric Sandeen
  2007-01-18  6:08   ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Eric Sandeen @ 2007-01-17 22:33 UTC (permalink / raw)
  To: Mingming Cao; +Cc: Amit K. Arora, linux-ext4, suparna, alex, suzuki

Mingming Cao wrote:
> Amit K. Arora wrote:
> 
>> Outstanding Issues:
>> ------------------
>> (1) The final interface is yet to be decided. We have the option of
>> choosing one of these:
>> 	a> modifying posix_fallocate() in glibc
>> 	b> using fcntl
>> 	c> using ftruncate, or
>> 	d> using the ioctl interface.
>>
>>   If we go with the ioctl interface, we need to choose the return
>> value from the ioctl. We should either return "0" for success and
>> errno for failure, or return the number of bytes
>> preallocated.
>>
> 
> I now lean toward simply returning 0 for success. Returning the number of
> bytes preallocated to userspace might help when the specified window
> contains blocks that are already allocated, but that should not be a
> common case.

I tend to agree; I'm not exactly sure what the "bytes/blocks allocated"
return would be useful for?  What would the caller do with this information?

Also, is ftruncate really even an option for this interface?  Wouldn't
that mean the end of sparse files?
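
To make the concern concrete, here is a small sketch (an illustration,
not part of any patch in this thread) of what ftruncate() does today:
it grows the file size without reserving storage. The helper name and
the /tmp path are assumptions made for the example, and st_blocks is
taken to be in 512-byte units as on Linux.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Extend a fresh temporary file to `size` bytes with ftruncate() and
 * report how many bytes of storage the filesystem charged to it.
 * Returns -1 on any error.
 */
long long bytes_backing_truncated_file(off_t size)
{
	char name[] = "/tmp/sparse-demo-XXXXXX";
	struct stat st;
	long long used = -1;
	int fd = mkstemp(name);

	if (fd < 0)
		return -1;
	unlink(name);		/* the file goes away once fd is closed */

	if (ftruncate(fd, size) == 0 && fstat(fd, &st) == 0 &&
	    st.st_size == size)
		used = (long long)st.st_blocks * 512;

	close(fd);
	return used;
}
```

On a filesystem that supports sparse files the returned value is far
smaller than the requested size. If ftruncate() were changed to
preallocate, the two numbers would match and extending a file would no
longer create a hole.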

Thanks,
-Eric

* Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4
  2007-01-17 22:20 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Mingming Cao
  2007-01-17 22:33   ` Eric Sandeen
@ 2007-01-18  6:08   ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-18  6:08 UTC (permalink / raw)
  To: Mingming Cao; +Cc: linux-ext4, suparna, alex, suzuki

On Wed, Jan 17, 2007 at 02:20:43PM -0800, Mingming Cao wrote:
> Amit K. Arora wrote:
> >  If we go with the ioctl interface, we need to choose the return
> >value from the ioctl. We should either return "0" for success and
> >errno for failure, or return the number of bytes
> >preallocated.
> 
> I now lean toward simply returning 0 for success. Returning the number of
> bytes preallocated to userspace might help when the specified window
> contains blocks that are already allocated, but that should not be a
> common case.

Agreed. Even xfs preallocation ioctl and posix_fallocate return 0 on
success.
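
As a reminder of the posix_fallocate() convention referred to here, a
minimal sketch follows. It is only an illustration: the helper name and
the temporary file path are made up for the example. Note that
posix_fallocate() reports failure by returning the error number itself
rather than returning -1 and setting errno.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate `len` bytes at the start of a fresh temporary file and
 * report the resulting file size through `size_after`.
 * Returns 0 on success, otherwise a positive error number.
 */
int prealloc_temp_file(off_t len, off_t *size_after)
{
	char name[] = "/tmp/prealloc-demo-XXXXXX";
	struct stat st;
	int err, fd = mkstemp(name);

	if (fd < 0)
		return errno;
	unlink(name);		/* the file goes away once fd is closed */

	/* 0 on success, an error number (e.g. ENOSPC) on failure */
	err = posix_fallocate(fd, 0, len);
	if (err == 0) {
		if (fstat(fd, &st) == 0)
			*size_after = st.st_size;
		else
			err = errno;
	}

	close(fd);
	return err;
}
```

An ioctl that returns 0 for success would give callers the same simple
"zero or error" check.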

> >(2) Also, we need to decide what should happen in a partial success
> >scenario, i.e. when we hit some error (say ENOSPC) after a few blocks
> >have been preallocated. Should the call just return the number of bytes
> >preallocated, or should it "undo" the partial preallocation and then
> >exit with an error code?
> >
> I think we should try to avoid partial preallocation in the first
> place, probably by checking the number of free blocks before calling
> ext4_ext_get_blocks() and returning -ENOSPC if there aren't enough free
> blocks to allocate. If the call still hits ENOSPC after that check, I
> think it doesn't hurt to leave the partially preallocated blocks in place.

True. But, there might be a small issue with this. Before calling
ext4_ext_get_blocks(), we really don't know how many of the requested
blocks are already allocated. Consider a scenario where out of 100
blocks requested for preallocation, say, only 10 need to be physically
allocated (90 being already allocated to the file). Now, we will be
checking for 100 free blocks in the filesystem, whereas ideally we
should be checking for only 10.
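
The arithmetic in this scenario can be sketched as follows. This is a
toy model only: a flat array stands in for the extent tree lookup that
the real code would need, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * allocated[i] is nonzero if logical block i already has storage.
 * Returns how many blocks in [start, start + count) still need a
 * physical block, which is the number a free-space check should use.
 */
size_t blocks_still_needed(const unsigned char *allocated,
			   size_t start, size_t count)
{
	size_t i, needed = 0;

	assert(allocated != NULL);
	for (i = start; i < start + count; i++)
		if (!allocated[i])
			needed++;
	return needed;
}
```

With 90 of the 100 requested blocks already allocated, the function
returns 10, whereas a check against the raw request size would demand
100 free blocks.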

--
Regards,
Amit Arora

* Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
                   ` (2 preceding siblings ...)
  2007-01-17 22:20 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Mingming Cao
@ 2007-01-19  9:11 ` Amit K. Arora
  2007-01-19  9:17 ` patch for fsx-linux Amit K. Arora
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-19  9:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

On Wed, Jan 17, 2007 at 03:16:58PM +0530, Amit K. Arora wrote:
> The patches for e2fsprogs and fsx-linux are available with me. I can
> post them if anyone is interested to try/test the preallocation patches.
> Also, I have a small test program/tool written which can be used for
> unit testing.

Here is the patch against e2fsprogs-1.39 with the ext4 patches already
applied. It makes the e2fsprogs tools recognize uninitialized extents. It
is for testing purposes only as of now and might need some fine-tuning
before it can really be submitted.
This patch also enables the "EXT_DEBUG" flag to display debug information
(e.g. extent details).

---
 lib/ext2fs/bmap.c         |    3 +-
 lib/ext2fs/ext4_extents.h |   12 +++++++++-
 lib/ext2fs/extents.c      |   55 ++++++++++++++++++++++++++++------------------
 3 files changed, 47 insertions(+), 23 deletions(-)

Index: e2fsprogs-1.39/lib/ext2fs/bmap.c
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/bmap.c	2006-12-19 11:53:48.000000000 +0530
+++ e2fsprogs-1.39/lib/ext2fs/bmap.c	2006-12-19 11:53:52.000000000 +0530
@@ -45,7 +45,8 @@
 		ex = EXT_FIRST_EXTENT(eh);
 		for (i = 0; i < eh->eh_entries; i++, ex++) {
 			if ((ex->ee_block <= block) &&
-			    (block < ex->ee_block + ex->ee_len)) {
+			    (block < ex->ee_block + 
+			     ext4_ext_get_actual_len(ex))) {
 				*phys_blk = EXT4_EE_START(ex) +
 					(block - ex->ee_block);
 				return 0;
Index: e2fsprogs-1.39/lib/ext2fs/ext4_extents.h
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/ext4_extents.h	2006-12-19 11:53:48.000000000 +0530
+++ e2fsprogs-1.39/lib/ext2fs/ext4_extents.h	2006-12-19 15:55:32.000000000 +0530
@@ -37,7 +37,7 @@
  * if EXT_DEBUG is defined you can use 'extdebug' mount option
  * to get lots of info what's going on
  */
-//#define EXT_DEBUG
+#define EXT_DEBUG
 #ifdef EXT_DEBUG
 #define ext_debug(tree,fmt,a...) 			\
 do {							\
@@ -170,6 +170,16 @@
 
 #define EXT_ASSERT(__x__) if (!(__x__)) BUG();
 
+/*
+ * Macro-instructions used to handle (mark/unmark/check/create) uninitialized
+ * extents. Applications can issue an IOCTL for preallocation, which results
+ * in assigning uninitialized extents to the file
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
+#define ext4_ext_mark_uninitialized(ext)	((ext)->ee_len |= 0x8000)
+#define ext4_ext_is_uninitialized(ext)  	((ext)->ee_len & 0x8000)
+#define ext4_ext_get_actual_len(ext)		((ext)->ee_len & 0x7FFF)
+
 
 /*
  * this structure is used to gather extents from the tree via ioctl
Index: e2fsprogs-1.39/lib/ext2fs/extents.c
===================================================================
--- e2fsprogs-1.39.orig/lib/ext2fs/extents.c	2006-12-19 11:53:48.000000000 +0530
+++ e2fsprogs-1.39/lib/ext2fs/extents.c	2006-12-19 11:55:03.000000000 +0530
@@ -36,9 +36,11 @@
 
 void show_extent(struct ext4_extent *ex)
 {
-	printf("extent: block=%u-%u len=%u start=%u start_hi=%u\n",
-	       ex->ee_block, ex->ee_block + ex->ee_len - 1,
-	       ex->ee_len, ex->ee_start, ex->ee_start_hi);
+	unsigned short ee_len = ext4_ext_get_actual_len(ex);
+	printf("extent[%c]: block=%u-%u len=%u start=%u start_hi=%u\n",
+		ext4_ext_is_uninitialized(ex) ? 'u' : 'i',
+		ex->ee_block, ex->ee_block + ee_len - 1, ee_len,
+		ex->ee_start, ex->ee_start_hi);
 }
 #else
 #define show_header(eh) do { } while (0)
@@ -75,7 +77,7 @@
 	if (EXT4_EE_START(ex) > EXT2_BLOCKS_COUNT(fs->super))
 		return EXT2_ET_EXTENT_LEAF_BAD;
 
-	if (ex->ee_len == 0)
+	if (ext4_ext_get_actual_len(ex) == 0)
 		return EXT2_ET_EXTENT_LEAF_BAD;
 
 	if (ex_prev) {
@@ -84,13 +86,14 @@
 			return EXT2_ET_EXTENT_LEAF_BAD;
 
 		/* extents must be in logical offset order */
-		if (ex->ee_block < ex_prev->ee_block + ex_prev->ee_len)
+		if (ex->ee_block < ex_prev->ee_block + 
+				ext4_ext_get_actual_len(ex_prev))
 			return EXT2_ET_EXTENT_LEAF_BAD;
 
 		/* extents must not overlap physical blocks */
-		if ((EXT4_EE_START(ex) < 
-		     EXT4_EE_START(ex_prev) + ex_prev->ee_len) &&
-		    (EXT4_EE_START(ex) + ex->ee_len > EXT4_EE_START(ex_prev)))
+		if ((EXT4_EE_START(ex) < EXT4_EE_START(ex_prev) + 
+			ext4_ext_get_actual_len(ex_prev)) && (EXT4_EE_START(ex)
+			+ ext4_ext_get_actual_len(ex) > EXT4_EE_START(ex_prev)))
 			return EXT2_ET_EXTENT_LEAF_BAD;
 	}
 
@@ -98,7 +101,8 @@
 		if (ex->ee_block < ix->ei_block)
 			return EXT2_ET_EXTENT_LEAF_BAD;
 
-		if (ix_len && ex->ee_block + ex->ee_len > ix->ei_block + ix_len)
+		if (ix_len && ex->ee_block + ext4_ext_get_actual_len(ex) > 
+							ix->ei_block + ix_len)
 			return EXT2_ET_EXTENT_LEAF_BAD;
 	}
 
@@ -144,6 +148,7 @@
 {
 	int entry = ex - EXT_FIRST_EXTENT(eh);
 	struct ext4_extent *ex_new = ex + 1;
+	unsigned uninitialized=0;
 
 	if (entry < 0 || entry > eh->eh_entries)
 		return EXT2_ET_EXTENT_LEAF_BAD;
@@ -151,18 +156,25 @@
 	if (eh->eh_entries >= eh->eh_max)
 		return EXT2_ET_EXTENT_NO_SPACE;
 
-	if (count > ex->ee_len)
+	if (count > ext4_ext_get_actual_len(ex))
 		return EXT2_ET_EXTENT_LEAF_BAD;
 
-	if (count > ex->ee_len)
+	if (count > ext4_ext_get_actual_len(ex))
 		return EXT2_ET_EXTENT_LEAF_BAD;
 
+	if(ext4_ext_is_uninitialized(ex))
+		uninitialized=1;
+
 	memmove(ex_new, ex, (eh->eh_entries - entry) * sizeof(*ex));
 	++eh->eh_entries;
 
 	ex->ee_len = count;
 	ex_new->ee_len -= count;
 	ex_new->ee_block += count;
+	if(uninitialized) {
+		ext4_ext_mark_uninitialized(ex);
+		ext4_ext_mark_uninitialized(ex_new);
+	}
 	EXT4_EE_START_SET(ex_new, EXT4_EE_START(ex_new) + count);
 
 	return 0;
@@ -195,7 +207,7 @@
 		ex = EXT_FIRST_EXTENT(eh);
 		for (i = 0; i < eh->eh_entries; i++, ex++) {
 			show_extent(ex);
-			for (j = 0; j < ex->ee_len; j++) {
+			for (j = 0; j < ext4_ext_get_actual_len(ex); j++) {
 				block_address = EXT4_EE_START(ex) + j;
 				flags = (*ctx->func)(ctx->fs, &block_address,
 						     (ex->ee_block + j),
@@ -216,15 +228,15 @@
 #endif
 
 				if (ex_prev &&
-				    block_address ==
-				    EXT4_EE_START(ex_prev) + ex_prev->ee_len &&
-				    ex->ee_block + j ==
-				    ex_prev->ee_block + ex_prev->ee_len) {
+				    block_address == EXT4_EE_START(ex_prev) +
+				    ext4_ext_get_actual_len(ex_prev) &&
+				    ex->ee_block + j == ex_prev->ee_block + 
+				    ext4_ext_get_actual_len(ex_prev)) {
 					/* can merge block with prev extent */
 					ex_prev->ee_len++;
 
 					ex->ee_len--;
-					if (ex->ee_len == 0) {
+					if (ext4_ext_get_actual_len(ex) == 0) {
 						/* no blocks left in this one */
 						ext2fs_extent_remove(eh, ex);
 						i--; ex--;
@@ -238,7 +250,7 @@
 					}
 					ret |= BLOCK_CHANGED;
 
-				} else if (ex->ee_len == 1) {
+				} else if (ext4_ext_get_actual_len(ex) == 1) {
 					/* single-block extent is easy -
 					 * change extent directly */
 					EXT4_EE_START_SET(ex, block_address);
@@ -250,7 +262,8 @@
 					ret |= BLOCK_ABORT | BLOCK_ERROR;
 					return ret;
 
-				} else if (j > 0 && (ex + 1)->ee_len > 1 &&
+				} else if (j > 0 && 
+					   ext4_ext_get_actual_len(ex + 1) > 1 &&
 					   ext2fs_extent_split(eh, ex + 1, 1)) {
 					/* split after new block failed */
 					/* No multi-level split yet */
@@ -258,7 +271,7 @@
 					return ret;
 
 				} else if (j == 0) {
-					if (ex->ee_len != 1) {
+					if (ext4_ext_get_actual_len(ex) != 1) {
 						/* this is an internal error */
 						ret |= BLOCK_ABORT |BLOCK_ERROR;
 						return ret;
@@ -269,7 +282,7 @@
 				} else {
 					ex++;
 					i++;
-					if (ex->ee_len != 1) {
+					if (ext4_ext_get_actual_len(ex) != 1) {
 						/* this is an internal error */
 						ret |= BLOCK_ABORT |BLOCK_ERROR;
 						return ret;

 
--
Regards,
Amit Arora

* Re: patch for fsx-linux
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
                   ` (3 preceding siblings ...)
  2007-01-19  9:11 ` Amit K. Arora
@ 2007-01-19  9:17 ` Amit K. Arora
  2007-01-19  9:22 ` small tool for unit testing Amit K. Arora
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-19  9:17 UTC (permalink / raw)
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

On Wed, Jan 17, 2007 at 03:16:58PM +0530, Amit K. Arora wrote:
> The patches for e2fsprogs and fsx-linux are available with me. I can
> post them if anyone is interested to try/test the preallocation patches.
> Also, I have a small test program/tool written which can be used for
> unit testing.

Here is the test patch for fsx-linux in the LTP test suite
(version ltp-full-20061121). When the new -x option is given, it makes fsx
call doprealloc (which issues the preallocation ioctl) instead of
dotruncate. Again, this is only meant for testing the persistent
preallocation patches.
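
One detail worth spelling out: after a successful preallocation beyond
the current end of file, the patch updates the file size that fsx tracks
to the requested size rounded up to a whole block, since preallocation
works in units of blocks. A sketch of that rounding, with a made-up
function name and assuming the block size is a power of two:

```c
#include <assert.h>

/*
 * Round `size` up to the next multiple of `block_size`.
 * Only valid when block_size is a nonzero power of two.
 */
unsigned long round_up_to_block(unsigned long size, unsigned long block_size)
{
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
	return (size + block_size - 1) & ~(block_size - 1);
}
```

For example, a 5000-byte request with 4096-byte blocks yields an
expected size of 8192. The patch takes the block size from st_blksize,
which is assumed here to match the filesystem block size.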

---
 testcases/kernel/fs/fsx-linux/fsx-linux.c |   90 ++++++++++++++++++++++++++++--
 1 files changed, 84 insertions(+), 6 deletions(-)

Index: ltp-full-20061121/testcases/kernel/fs/fsx-linux/fsx-linux.c
===================================================================
--- ltp-full-20061121.orig/testcases/kernel/fs/fsx-linux/fsx-linux.c
+++ ltp-full-20061121/testcases/kernel/fs/fsx-linux/fsx-linux.c
@@ -94,6 +94,7 @@ char	*good_buf;			/* a pointer to the co
 char	*temp_buf;			/* a pointer to the current data */
 char	*fname;				/* name of our test file */
 int	fd;				/* fd for our test file */
+int	block_size;			/* block size*/
 
 off_t		file_size = 0;
 off_t		biggest = 0;
@@ -115,6 +116,8 @@ int	truncbdy = 1;			/* -t flag */
 int	writebdy = 1;			/* -w flag */
 long	monitorstart = -1;		/* -m flag */
 long	monitorend = -1;		/* -m flag */
+int	prealloc = 0;			/* -x flag */
+int	ext4 = 0;			/* -x flag */
 int	lite = 0;			/* -L flag */
 long	numops = -1;			/* -N flag */
 int	randomoplen = 1;		/* -O flag disables it */
@@ -355,6 +358,19 @@ check_buffers(unsigned offset, unsigned 
 	}
 }
 
+int
+get_block_size(void)
+{
+	struct stat 	statbuf;
+
+	if (fstat(fd, &statbuf)) {
+		prterr("get_block_size: fstat");
+		report_failure(115);
+	}
+
+	return statbuf.st_blksize;
+}
+	
 
 void
 check_size(void)
@@ -628,6 +644,47 @@ domapwrite(unsigned offset, unsigned siz
 	}
 }
 
+#define EXT4_IOC_FALLOCATE    0x40106609
+void
+doprealloc(unsigned size)
+{
+	int ret;
+	struct ext4_falloc_input {
+		unsigned long long offset;
+		unsigned long long len;
+	} input;
+
+	if (!size) {
+		prt("skipping zero size preallocation\n");
+		return;
+	}
+
+	if (!ext4) {
+		prt("doprealloc: Preallocation currently supported "
+			"on ext4 _only_\n");
+		return;
+	}
+
+	input.offset = 0;
+	input.len = (unsigned long long)size;
+
+	if (testcalls <= simulatedopcount)
+		return;
+	
+	if (progressinterval && testcalls % progressinterval == 0 ||
+	    debug && (monitorstart == -1 || monitorend == -1 ||
+		      size <= monitorend))
+		prt("%lu prealloc\tfrom 0x0 to 0x%x\n", testcalls, size);
+
+	ret = ioctl(fd, EXT4_IOC_FALLOCATE, &input);
+	if (ret < 0) {
+		prt("ioctl: %x\n", size);
+		prterr("doprealloc: ioctl");
+		report_failure(160);
+	} else
+		if (size > file_size)
+			file_size = (size + block_size - 1) & (~(block_size - 1));
+}
 
 void
 dotruncate(unsigned size)
@@ -679,7 +736,7 @@ writefileimage()
 			prt("short write: 0x%x bytes instead of 0x%qx\n",
 			    iret, (unsigned long long)file_size);
 		report_failure(172);
-	}
+	}	
 	if (lite ? 0 : ftruncate(fd, file_size) == -1) {
 	        prt("ftruncate2: %qx\n", (unsigned long long)file_size);
 		prterr("writefileimage: ftruncate");
@@ -742,13 +799,15 @@ test(void)
 	 * TRUNCATE:	op = 3
 	 * MAPWRITE:    op = 3 or 4
 	 */
-	if (lite ? 0 : op == 3 && (style & 1) == 0) /* vanilla truncate? */
-		dotruncate(random() % maxfilelen);
+	if (lite ? 0 : op == 3 && (style & 1) == 0) { /* vanilla truncate? */
+		size = random() % maxfilelen;
+		(prealloc) ? doprealloc(size) : dotruncate(size);
+	}
 	else {
 		if (randomoplen)
 			size = random() % (maxoplen+1);
 		if (lite ? 0 : op == 3)
-			dotruncate(size);
+			(prealloc) ? doprealloc(size) : dotruncate(size);
 		else {
 			offset = random();
 			if (op == 1 || op == (lite ? 3 : 4)) {
@@ -777,6 +836,7 @@ test(void)
 		check_size();
 	if (closeopen)
 		docloseopen();
+
 }
 
 
@@ -795,7 +855,7 @@ void
 usage(void)
 {
 	fprintf(stdout, "usage: %s",
-		"fsx [-dnqLOW] [-b opnum] [-c Prob] [-l flen] [-m start:end] [-o oplen] [-p progressinterval] [-r readbdy] [-s style] [-t truncbdy] [-w writebdy] [-D startingop] [-N numops] [-P dirpath] [-S seed] fname\n\
+		"fsx [-dnqLOW] [-b opnum] [-c Prob] [-l flen] [-m start:end] [-o oplen] [-p progressinterval] [-r readbdy] [-s style] [-t truncbdy] [-w writebdy] [-x xfs/ext4] [-D startingop] [-N numops] [-P dirpath] [-S seed] fname\n\
 	-b opnum: beginning operation number (default 1)\n\
 	-c P: 1 in P chance of file close+open at each op (default infinity)\n\
 	-d: debug output for all operations\n\
@@ -809,6 +869,7 @@ usage(void)
 	-s style: 1 gives smaller truncates (default 0)\n\
 	-t truncbdy: 4096 would make truncates page aligned (default 1)\n\
 	-w writebdy: 4096 would make writes page aligned (default 1)\n\
+	-x ext4/xfs: Do preallocate. filesystem can be either xfs, or ext4\n\
 	-D startingop: debug output starting at specified operation\n\
 	-L: fsxLite - no file creations & no file size changes\n\
 	-N numops: total # operations to do (default infinity)\n\
@@ -869,7 +930,7 @@ main(int argc, char **argv)
 
 	setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
 
-	while ((ch = getopt(argc, argv, "b:c:dl:m:no:p:qr:s:t:w:D:LN:OP:RS:W"))
+	while ((ch = getopt(argc, argv, "b:c:dl:m:no:p:qr:s:t:w:x:D:LN:OP:RS:W"))
 	       != EOF)
 		switch (ch) {
 		case 'b':
@@ -946,6 +1007,17 @@ main(int argc, char **argv)
 			if (writebdy <= 0)
 				usage();
 			break;
+		case 'x':
+			prealloc = 1;
+			if (!strcmp(optarg, "ext4"))
+				ext4 = 1;
+			printf("ext4 = %u\n", ext4);
+			if (!ext4 && strcmp(optarg, "xfs"))
+				usage();
+			/* If we are here, ext4==1 signifies ext4 filesystem,
+			   else it signifies xfs filesystem */
+			
+			break;
 		case 'D':
 			debugstart = getnum(optarg, &endp);
 			if (debugstart < 1)
@@ -1015,6 +1087,12 @@ main(int argc, char **argv)
 		prterr(fname);
 		exit(91);
 	}
+
+	block_size = get_block_size();
+
+	if (prealloc)
+		doprealloc(maxfilelen);
+
 	strncat(goodfile, fname, 256);
 	strcat (goodfile, ".fsxgood");
 	fsxgoodfd = open(goodfile, O_RDWR|O_CREAT|O_TRUNC, 0666);
 
--
Regards,
Amit Arora

* Re: small tool for unit testing
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
                   ` (4 preceding siblings ...)
  2007-01-19  9:17 ` patch for fsx-linux Amit K. Arora
@ 2007-01-19  9:22 ` Amit K. Arora
  2007-02-07  7:48 ` Testing ext4 persistent preallocation patches for 64 bit features Amit K. Arora
  2007-02-25 10:23 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Andrew Morton
  7 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-01-19  9:22 UTC (permalink / raw)
  To: linux-ext4; +Cc: suparna, cmm, alex, suzuki

On Wed, Jan 17, 2007 at 03:16:58PM +0530, Amit K. Arora wrote:
> The patches for e2fsprogs and fsx-linux are available with me. I can
> post them if anyone is interested to try/test the preallocation patches.
> Also, I have a small test program/tool written which can be used for
> unit testing.

Here is a small test program/tool which can be used to preallocate
blocks and write to these preallocated blocks. 


#include<stdio.h>
#include<stdlib.h>
#include<string.h>	/* strcmp() */
#include<unistd.h>	/* lseek(), write(), close() */
#include<fcntl.h>
#include<errno.h>
#include<sys/ioctl.h>	/* ioctl(), and normally _IOW() as well */

/*
 * Fallback copies of the kernel ioctl encoding macros, used only when
 * the system headers did not already provide _IOW().
 */
#ifndef _IOW
#define _IOC_NRBITS     8
#define _IOC_TYPEBITS   8
#define _IOC_SIZEBITS   14
#define _IOC_DIRBITS    2

#define _IOC_NRMASK     ((1 << _IOC_NRBITS)-1)
#define _IOC_TYPEMASK   ((1 << _IOC_TYPEBITS)-1)
#define _IOC_SIZEMASK   ((1 << _IOC_SIZEBITS)-1)
#define _IOC_DIRMASK    ((1 << _IOC_DIRBITS)-1)

#define _IOC_NRSHIFT    0
#define _IOC_TYPESHIFT  (_IOC_NRSHIFT+_IOC_NRBITS)
#define _IOC_SIZESHIFT  (_IOC_TYPESHIFT+_IOC_TYPEBITS)
#define _IOC_DIRSHIFT   (_IOC_SIZESHIFT+_IOC_SIZEBITS)

/*
 * Direction bits.
 */
#define _IOC_WRITE      1U

#define _IOC(dir,type,nr,size) \
        (((dir)  << _IOC_DIRSHIFT) | \
         ((type) << _IOC_TYPESHIFT) | \
         ((nr)   << _IOC_NRSHIFT) | \
         ((size) << _IOC_SIZESHIFT))

/* provoke compile error for invalid uses of size argument */
extern unsigned int __invalid_size_argument_for_IOC;
#define _IOC_TYPECHECK(t) \
        ((sizeof(t) == sizeof(t[1]) && \
          sizeof(t) < (1 << _IOC_SIZEBITS)) ? \
          sizeof(t) : __invalid_size_argument_for_IOC)

/* used to create numbers */
#define _IOW(type,nr,size)      _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
#endif /* _IOW */



#define EXT4_IOC_PREALLOCATE            _IOW('f', 9, struct ext4_falloc_input)

struct ext4_falloc_input {
        unsigned long long offset;
        unsigned long long len;
        };

int do_prealloc(char* fname, struct ext4_falloc_input input)
{
  int ret, fd = open(fname, O_CREAT|O_RDWR, 0666);

  if(fd<0) {
        printf("Error opening file %s\n", fname);
        return 1;
  }

  printf("%s : Trying to preallocate blocks (offset=%llu, len=%llu)\n", 
					fname, input.offset, input.len);
  ret = ioctl(fd, EXT4_IOC_PREALLOCATE, &input);

  if(ret <0) {
        /* errno carries the reason for the failure */
        printf("IOCTL: received error %d, ret=%d\n", errno, ret);
  	close(fd); 
        return 1;
  }
  printf("IOCTL succeeded !  ret=%d\n", ret);
  close(fd); 
  return 0;
}

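int do_write(char* fname, struct ext4_falloc_input input)
{
  int fd;
  off_t pos;
  ssize_t written;
  char *buf;

  /* zero-filled buffer, so that no stale heap contents reach the file */
  buf = (char *)calloc(1, input.len);
  if(!buf) {
        printf("Error allocating %llu bytes\n", input.len);
        return 1;
  }

  fd = open(fname, O_CREAT|O_RDWR, 0666);
  if(fd<0) {
        printf("Error opening file %s\n", fname);
        free(buf);
        return 1;
  }

  printf("%s : Trying to write to file (offset=%llu, len=%llu)\n", 
					fname, input.offset, input.len);

  /* off_t/ssize_t avoid truncating large offsets and lengths to int */
  pos = lseek(fd, input.offset, SEEK_SET);
  if(pos < 0 || (unsigned long long)pos != input.offset) {
     	printf("lseek() failed error=%d, ret=%lld\n", errno, (long long)pos);
  	close(fd); 
  	free(buf);
       	return(1);
  }

  written = write(fd, buf, input.len);
  if(written < 0 || (unsigned long long)written != input.len) {
       	printf("write() failed error=%d, ret=%lld\n", errno, (long long)written);
  	close(fd); 
  	free(buf);
       	return(1);
  }

  printf("Write succeeded ! Written %llu bytes ret=%lld\n", input.len, (long long)written);
  close(fd); 
  free(buf);
  return 0;
}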


int main(int argc, char **argv)
{
  struct ext4_falloc_input input;
  long long offset, len;
  int ret = 1;
  char *fname; 

  if(argc<5) {
	printf("%s <CMD: prealloc/write> <filename-with-path> <offset> <length>\n", argv[0]);
	exit(1);
  }

  fname = argv[2];
  /* parse as signed values so that negative arguments can be rejected */
  offset = atoll(argv[3]);
  len = atoll(argv[4]);

  if(offset<0 || len<= 0) {
        printf("%s: Invalid arguments.\n", argv[0]);
        exit(1);
  }

  input.offset=(unsigned long long)offset;
  input.len=(unsigned long long)len;

  if(!strcmp(argv[1], "prealloc"))
  	ret = do_prealloc(fname, input);
  else if(!strcmp(argv[1], "write"))
	ret = do_write(fname, input);
  else
        printf("%s: Invalid arguments.\n", argv[0]);

  return ret;
}
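
As a cross-check, the command number that _IOW('f', 9, struct
ext4_falloc_input) expands to can be recomputed by hand. The sketch
below hard-codes the field layout from the macros in the program above
(8 bits nr, 8 bits type, 14 bits size, 2 bits direction); the function
name is made up, and architectures whose headers use a different layout
would produce a different number.

```c
#include <assert.h>
#include <stdint.h>

struct ext4_falloc_input {
	unsigned long long offset;
	unsigned long long len;
};

/*
 * Build a "write" direction ioctl number from its three fields, using
 * the shifts from the macros above: nr at bit 0, type at bit 8, size at
 * bit 16 and direction at bit 30.
 */
uint32_t ioc_write_number(unsigned int type, unsigned int nr,
			  unsigned int size)
{
	const uint32_t dir_write = 1;

	assert(size < (1u << 14));	/* size must fit in 14 bits */
	return (dir_write << 30) |
	       ((uint32_t)size << 16) |
	       ((uint32_t)type << 8) |
	       (uint32_t)nr;
}
```

With a 16-byte argument structure this gives 0x40106609, which matches
the EXT4_IOC_FALLOCATE constant hard-coded in the fsx-linux patch
earlier in this thread.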

--
Regards,
Amit Arora

* Testing ext4 persistent preallocation patches for 64 bit features
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
                   ` (5 preceding siblings ...)
  2007-01-19  9:22 ` small tool for unit testing Amit K. Arora
@ 2007-02-07  7:48 ` Amit K. Arora
  2007-02-07  8:25   ` Mingming Cao
  2007-02-25 10:23 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Andrew Morton
  7 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-02-07  7:48 UTC (permalink / raw)
  To: linux-ext4, linux-fsdevel; +Cc: suparna, cmm, alex, suzuki

I plan to test the persistent preallocation patches on a huge sparse
device, to check whether >32 bit physical block numbers (up to 48 bit)
behave as expected. I have the following questions about this and would
appreciate suggestions:

a) What sparse device size should I use for testing?
Would a size of > 8TB (say, 100 TB) be enough?
The physical (backing store) device I can have is at most 70GB.

b) How do I test allocation of >32 bit physical block numbers? I cannot
fill > 8TB, since the physical storage available to me is just
70GB.

c) Do I need to put some hack in the filesystem code for the above (to
allocate >32 bit physical block numbers)?

Any further ideas on how to test this will help. Thanks!

--
Regards,
Amit Arora

* Re: Testing ext4 persistent preallocation patches for 64 bit features
  2007-02-07  7:48 ` Testing ext4 persistent preallocation patches for 64 bit features Amit K. Arora
@ 2007-02-07  8:25   ` Mingming Cao
  2007-02-07 10:36     ` Suparna Bhattacharya
  2007-02-08 10:51     ` Amit K. Arora
  0 siblings, 2 replies; 340+ messages in thread
From: Mingming Cao @ 2007-02-07  8:25 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-ext4, linux-fsdevel, suparna, alex, suzuki

On Wed, 2007-02-07 at 13:18 +0530, Amit K. Arora wrote:
> I plan to test the persistent preallocation patches on a huge sparse
> device, to check whether >32 bit physical block numbers (up to 48 bit) behave as
> expected.
Thanks!


>  I have the following questions about this and would appreciate
> suggestions here:

> c) Do I need to put some hack in the filesystem code for above (to
> allocate >32 bit physical block numbers) ?
> 

I had an ext3 hack patch earlier that lets an application specify, via
an ioctl command, which block group block allocation should target. To
allocate >32 bit physical block numbers, just set the target block
group beyond 2**(32-15) = 2**17. The patch is below.
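
The 2**17 figure follows from simple arithmetic, sketched here for
illustration. The function name is made up, and the numbers assume
4 KiB blocks, which give 2**15 blocks per group and a first data block
of 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * First block number of block group `group`, assuming every group holds
 * `blocks_per_group` blocks and numbering starts at `first_data_block`.
 */
uint64_t group_first_block(uint64_t group, uint64_t blocks_per_group,
			   uint64_t first_data_block)
{
	assert(blocks_per_group != 0);
	return first_data_block + group * blocks_per_group;
}
```

Group 2**17 starts at block 2**32, so any allocation directed at that
group or beyond needs more than 32 bits for its physical block number.
With 4 KiB blocks that block sits at byte offset 2**44 (16 TiB), so the
sparse device has to be larger than that for such groups to exist.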

BTW, have you considered the following?
- Moving the preallocation code in the ioctl into a separate function, and
calling that function from the ioctl? That way we could easily switch to
posix_falloc later.
- Testing preallocation with mapped IO?
- Disabling preallocation if the number of free blocks in the filesystem
falls below some low watermark, to save space for real block allocation
in the near future?
- Is de-preallocation something worth doing?

Mingming

---

 linux-2.6.16-ming/fs/ext3/balloc.c          |   24 ++++++++++++++---------
 linux-2.6.16-ming/fs/ext3/ioctl.c           |   29 ++++++++++++++++++++++++++++
 linux-2.6.16-ming/include/linux/ext3_fs.h   |    1 
 linux-2.6.16-ming/include/linux/ext3_fs_i.h |    1 
 4 files changed, 46 insertions(+), 9 deletions(-)

diff -puN fs/ext3/ioctl.c~ext3_set_alloc_blk_group_hack fs/ext3/ioctl.c
--- linux-2.6.16/fs/ext3/ioctl.c~ext3_set_alloc_blk_group_hack	2006-03-28 15:19:58.000000000 -0800
+++ linux-2.6.16-ming/fs/ext3/ioctl.c	2006-03-28 15:54:14.507288400 -0800
@@ -22,6 +22,7 @@ int ext3_ioctl (struct inode * inode, st
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	unsigned int flags;
 	unsigned short rsv_window_size;
+	unsigned int blk_group;
 
 	ext3_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -193,6 +194,34 @@ flags_err:
 		mutex_unlock(&ei->truncate_mutex);
 		return 0;
 	}
+	case EXT3_IOC_SETALLOCBLKGRP: {
+
+		if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
+			return -ENOTTY;
+
+		if (IS_RDONLY(inode))
+			return -EROFS;
+
+		if ((current->fsuid != inode->i_uid) && !capable(CAP_FOWNER))
+			return -EACCES;
+
+		if (get_user(blk_group, (int __user *)arg))
+			return -EFAULT;
+
+		/*
+		 * need to allocate reservation structure for this inode
+		 * before setting the goal block group
+		 */
+		mutex_lock(&ei->truncate_mutex);
+		if (!ei->i_block_alloc_info)
+			ext3_init_block_alloc_info(inode);
+
+		if (ei->i_block_alloc_info){
+			ei->i_block_alloc_info->goal_block_group = blk_group;
+		}
+		mutex_unlock(&ei->truncate_mutex);
+		return 0;
+	}
 	case EXT3_IOC_GROUP_EXTEND: {
 		unsigned long n_blocks_count;
 		struct super_block *sb = inode->i_sb;
diff -puN include/linux/ext3_fs.h~ext3_set_alloc_blk_group_hack include/linux/ext3_fs.h
--- linux-2.6.16/include/linux/ext3_fs.h~ext3_set_alloc_blk_group_hack	2006-03-28 15:42:51.000000000 -0800
+++ linux-2.6.16-ming/include/linux/ext3_fs.h	2006-03-28 15:51:48.321237417 -0800
@@ -238,6 +238,7 @@ struct ext3_new_group_data {
 #endif
 #define EXT3_IOC_GETRSVSZ		_IOR('f', 5, long)
 #define EXT3_IOC_SETRSVSZ		_IOW('f', 6, long)
+#define EXT3_IOC_SETALLOCBLKGRP		_IOW('f', 9, long)
 
 /*
  *  Mount options
diff -puN include/linux/ext3_fs_i.h~ext3_set_alloc_blk_group_hack include/linux/ext3_fs_i.h
--- linux-2.6.16/include/linux/ext3_fs_i.h~ext3_set_alloc_blk_group_hack	2006-03-28 15:43:59.000000000 -0800
+++ linux-2.6.16-ming/include/linux/ext3_fs_i.h	2006-03-28 15:47:54.274367219 -0800
@@ -51,6 +51,7 @@ struct ext3_block_alloc_info {
 	 * allocation when we detect linearly ascending requests.
 	 */
 	__u32                   last_alloc_physical_block;
+	__u32			goal_block_group;
 };
 
 #define rsv_start rsv_window._rsv_start
diff -puN fs/ext3/balloc.c~ext3_set_alloc_blk_group_hack fs/ext3/balloc.c
--- linux-2.6.16/fs/ext3/balloc.c~ext3_set_alloc_blk_group_hack	2006-03-28 15:45:30.000000000 -0800
+++ linux-2.6.16-ming/fs/ext3/balloc.c	2006-03-28 16:03:55.770850040 -0800
@@ -285,6 +285,7 @@ void ext3_init_block_alloc_info(struct i
 		rsv->rsv_alloc_hit = 0;
 		block_i->last_alloc_logical_block = 0;
 		block_i->last_alloc_physical_block = 0;
+		block_i->goal_block_group = 0;
 	}
 	ei->i_block_alloc_info = block_i;
 }
@@ -1263,15 +1264,20 @@ unsigned long ext3_new_blocks(handle_t *
 		*errp = -ENOSPC;
 		goto out;
 	}
-
-	/*
-	 * First, test whether the goal block is free.
-	 */
-	if (goal < le32_to_cpu(es->s_first_data_block) ||
-	    goal >= le32_to_cpu(es->s_blocks_count))
-		goal = le32_to_cpu(es->s_first_data_block);
-	group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
-			EXT3_BLOCKS_PER_GROUP(sb);
+	if (block_i->goal_block_group) {
+		group_no = block_i->goal_block_group;
+		goal = le32_to_cpu(EXT3_SB(sb)->s_es->s_first_data_block) + group_no * EXT3_BLOCKS_PER_GROUP(sb);
+		block_i->goal_block_group = 0;
+	} else {
+		/*
+		 * First, test whether the goal block is free.
+		 */
+		if (goal < le32_to_cpu(es->s_first_data_block) ||
+		    goal >= le32_to_cpu(es->s_blocks_count))
+			goal = le32_to_cpu(es->s_first_data_block);
+		group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
+				EXT3_BLOCKS_PER_GROUP(sb);
+	}
 	gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
 	if (!gdp)
 		goto io_error;

_
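
For anyone wanting to drive the hack above from a test program, a minimal
user-space sketch follows. It assumes a kernel carrying this patch; the ioctl
number is copied from the patch and is not in any mainline header, and
set_goal_block_group() is a hypothetical helper name. On a kernel without the
patch the call is expected to fail.

```c
#include <errno.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Copied from the hack patch above; not part of any mainline kernel header. */
#define EXT3_IOC_SETALLOCBLKGRP _IOW('f', 9, long)

/*
 * Ask the (patched) ext3 to start the next block allocation for the file
 * behind fd in block group `group`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_goal_block_group(int fd, unsigned int group)
{
	/* The patched ioctl handler reads an int-sized value with get_user(). */
	if (ioctl(fd, EXT3_IOC_SETALLOCBLKGRP, &group) < 0)
		return -errno;
	return 0;
}
```

Since the patch clears goal_block_group after one allocation, the helper would
need to be called again before each allocation that should be steered.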


* Re: Testing ext4 persistent preallocation patches for 64 bit features
  2007-02-07  8:25   ` Mingming Cao
@ 2007-02-07 10:36     ` Suparna Bhattacharya
  2007-02-07 21:11       ` Andreas Dilger
  2007-02-08 10:51     ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: Suparna Bhattacharya @ 2007-02-07 10:36 UTC (permalink / raw)
  To: Mingming Cao; +Cc: Amit K. Arora, linux-ext4, linux-fsdevel, alex, suzuki

On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote:
> On Wed, 2007-02-07 at 13:18 +0530, Amit K. Arora wrote:
> > I plan to test the persistent preallocation patches on a huge sparse
> > device, to know if >32 bit physical block numbers (up to 48 bits) behave as
> > expected.
> Thanks!
> 
> 
> >  I have following questions for this and will appreciate
> > suggestions here:
> 
> > c) Do I need to put some hack in the filesystem code for above (to
> > allocate >32 bit physical block numbers) ?
> > 
> 
> I had an ext3 hack patch before that lets an application specify, via an
> ioctl command, which block group is the target for block allocation. So to
> allocate >32 bit physical block numbers, just set the target block
> group beyond 2**(32-15) = 2**17. The patch is below.
> 
> BTW, have you considered
> - move the preallocation code in ioctl to a separate function, and call
> that function from ioctl? That way we could easily switch to
> posix_falloc later.

Good suggestion.

> - Test preallocation with mapped IO?
> - disable preallocation if the filesystem's free block count is under some low
> watermark, to save space for near-future real block allocation?

A policy decision like this is probably worth a discussion during today's call.

> - is de-preallocation something worth doing?

Wouldn't truncate do that?
Or are you thinking of something like hole punching?

Regards
Suparna



-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: Testing ext4 persistent preallocation patches for 64 bit features
  2007-02-07 10:36     ` Suparna Bhattacharya
@ 2007-02-07 21:11       ` Andreas Dilger
  2007-02-08  8:52         ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-02-07 21:11 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Mingming Cao, Amit K. Arora, linux-ext4, linux-fsdevel, alex, suzuki

On Feb 07, 2007  16:06 +0530, Suparna Bhattacharya wrote:
> On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote:
> > - disable preallocation if the filesystem's free block count is under some low
> > watermark, to save space for near-future real block allocation?
> 
> A policy decision like this is probably worth a discussion during today's call.
> 
> > - is de-preallocation something worth doing?

As discussed in the call - I don't think we can remove preallocations.
The whole point of database preallocation is to guarantee that this space
is available in the filesystem when writing into a file at random offsets
(which would otherwise be sparse).

Similarly, persistent preallocation shouldn't be considered anything other
than an efficient way of zero-filling blocks.  At least that is
my understanding...  Is this code implementing the "uninitialized extents"
for databases (via explicit preallocation via fallocate/ioctl) so that
they don't have to zero-fill large files, or is there also automatic
preallocation of space to files (e.g. for O_APPEND files)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


* Re: Testing ext4 persistent preallocation patches for 64 bit features
  2007-02-07 21:11       ` Andreas Dilger
@ 2007-02-08  8:52         ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-02-08  8:52 UTC (permalink / raw)
  To: adilger; +Cc: suparna, cmm, linux-ext4, linux-fsdevel, alex, suzuki

On Wed, Feb 07, 2007 at 02:11:17PM -0700, Andreas Dilger wrote:
> On Feb 07, 2007  16:06 +0530, Suparna Bhattacharya wrote:
> > On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote:
> > > - disable preallocation if the filesystem's free block count is under some low
> > > watermark, to save space for near-future real block allocation?
> > 
> > A policy decision like this is probably worth a discussion during today's call.
> > 
> > > - is de-preallocation something worth doing?
> 
> As discussed in the call - I don't think we can remove preallocations.
> The whole point of database preallocation is to guarantee that this space
> is available in the filesystem when writing into a file at random offsets
> (which would otherwise be sparse).
> 
> Similarly, persistent preallocation shouldn't be considered anything other
> than an efficient way of zero-filling blocks.  At least that is
> my understanding...  Is this code implementing the "uninitialized extents"
> for databases (via explicit preallocation via fallocate/ioctl) so that
> they don't have to zero-fill large files, or is there also automatic
> preallocation of space to files (e.g. for O_APPEND files)?

You are right. There is no automatic preallocation of space being done
here. This code just implements the explicit (persistent) preallocation
of blocks via ioctl.

--
Regards,
Amit Arora


* Re: Testing ext4 persistent preallocation patches for 64 bit features
  2007-02-07  8:25   ` Mingming Cao
  2007-02-07 10:36     ` Suparna Bhattacharya
@ 2007-02-08 10:51     ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-02-08 10:51 UTC (permalink / raw)
  To: Mingming Cao; +Cc: linux-ext4, linux-fsdevel, suparna, alex, suzuki

On Wed, Feb 07, 2007 at 12:25:50AM -0800, Mingming Cao wrote:
> On Wed, 2007-02-07 at 13:18 +0530, Amit K. Arora wrote:
> > c) Do I need to put some hack in the filesystem code for above (to
> > allocate >32 bit physical block numbers) ?
> I had an ext3 hack patch before that lets an application specify, via an
> ioctl command, which block group is the target for block allocation. So to
> allocate >32 bit physical block numbers, just set the target block
> group beyond 2**(32-15) = 2**17. The patch is below.

Thanks for the patch! 
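
For reference, the 2**17 figure quoted above can be checked with a small
sketch. It assumes 4 KiB blocks (so one block-sized bitmap covers
8 * 4096 = 2**15 blocks per group) and ignores s_first_data_block; the helper
names are made up for illustration only.

```c
#include <stdint.h>

/*
 * Number of blocks covered by one block group: a group is described by a
 * single block-sized bitmap, one bit per block, so 8 * block_size blocks.
 */
static uint64_t blocks_per_group(uint32_t block_size)
{
	return 8ULL * block_size;
}

/* First block group whose blocks all have numbers >= 2^32. */
static uint64_t first_group_above_32bit(uint32_t block_size)
{
	uint64_t per_group = blocks_per_group(block_size);

	return ((1ULL << 32) + per_group - 1) / per_group;
}
```

With block_size = 4096 this gives 2**15 blocks per group and group number
2**17 as the first one that starts at or above block 2**32.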

> BTW, have you considered
> - move the preallocation code in ioctl to a separate function, and call
> that function from ioctl? That way we could easily switch to
> posix_falloc later.

OK.

> - Test preallocation with mapped IO?

I haven't done that yet. Will test it out too. Thanks!

--
Regards,
Amit Arora


* Re: [Resubmit][Patch 0/2] Persistent preallocation in ext4
  2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
                   ` (6 preceding siblings ...)
  2007-02-07  7:48 ` Testing ext4 persistent preallocation patches for 64 bit features Amit K. Arora
@ 2007-02-25 10:23 ` Andrew Morton
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
  7 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-02-25 10:23 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-ext4, suparna, cmm, alex, suzuki

> On Wed, 17 Jan 2007 15:16:58 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> (1) The final interface is yet to be decided. We have the option of
> choosing one of these:
> 	a> modifying posix_fallocate() in glibc
> 	b> using fcntl
> 	c> using ftruncate, or
> 	d> using the ioctl interface.

I'd suggest we add a new syscall, which implements the posix_fallocate()
API as per the relevant specs.  We can work the final details out with Ulrich
and I'm sure that glibc's posix_fallocate() will start using the syscall if it
is available.

Actually, we should implement

 asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

note: loff_t, not off_t.
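
To illustrate the distinction, here is a small user-space sketch, assuming
Linux with glibc (where loff_t is visible in the default feature set). The
helper names are hypothetical. loff_t is 64 bits wide regardless of ABI, while
the width of off_t depends on the architecture and on _FILE_OFFSET_BITS.

```c
#define _GNU_SOURCE		/* exposes loff_t in glibc headers */
#include <fcntl.h>
#include <limits.h>
#include <sys/types.h>

/* Width of loff_t in bits; expected to be 64 on every Linux architecture. */
static int loff_t_bits(void)
{
	return (int)(sizeof(loff_t) * CHAR_BIT);
}

/* Width of off_t in bits; depends on the ABI and on _FILE_OFFSET_BITS. */
static int off_t_bits(void)
{
	return (int)(sizeof(off_t) * CHAR_BIT);
}
```

On a 32-bit build without _FILE_OFFSET_BITS=64, off_t_bits() would be expected
to report 32, which is why a single syscall taking loff_t can serve both the
32-bit and 64-bit library entry points.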

This probably means that we'll need to implement file_operations.fallocate().

It wouldn't surprise me if XFS was able to implement fallocate() too.


* [RFC] Heads up on sys_fallocate()
  2007-02-25 10:23 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Andrew Morton
@ 2007-03-01 18:34   ` Amit K. Arora
  2007-03-01 19:15     ` Eric Sandeen
                       ` (7 more replies)
  0 siblings, 8 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-03-01 18:34 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: Andrew Morton, suparna, cmm, alex, suzuki

This is to give a heads-up on a few patches that we will soon be posting.
These patches implement a new system call sys_fallocate() and a
new inode operation "fallocate", for persistent preallocation. The new
system call, as Andrew suggested, will look like:

  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are still developing and testing the required patches, we decided to
post a preliminary patch and get input from the community to give it
the right direction and shape. First, a short description of the feature.
 
Persistent preallocation is a file system feature that lets an
application (say, a relational database server) explicitly
preallocate blocks to a particular file. This feature can be used to
reserve space for a file, mainly to get the following benefits:
1> contiguity - less fragmentation and thus faster access, and
2> a guarantee of minimum space availability (depending on how many
blocks were preallocated) for the file, even if the filesystem becomes
full.

XFS already implements this, using an ioctl interface, and ext4 is now
adding the feature. In time we may see a few more file systems
implementing it. Thus, it makes sense to have a more standard interface
for this, like this new system call.
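
From the application side, the standard interface in question is
posix_fallocate(). A minimal sketch of how a caller would use it follows; the
helper names reserve_space() and file_size() are only for illustration, and
whether the library call ends up backed by the new syscall is an assumption.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Reserve `len` bytes starting at `offset` for the file behind fd.
 * posix_fallocate() returns 0 on success or an error number directly;
 * it does not set errno.
 */
static int reserve_space(int fd, off_t offset, off_t len)
{
	return posix_fallocate(fd, offset, len);
}

/* Size of the file in bytes, or -1 if fstat() fails. */
static off_t file_size(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return st.st_size;
}
```

After a successful reserve_space(fd, 0, len) on an empty file, file_size(fd)
reports len, since posix_fallocate() extends the file when offset + len lies
beyond the current size.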

Here is the initial and incomplete version of the patch, which can be
used for the discussion, till we come up with a set of more complete
patches.

---
 arch/i386/kernel/syscall_table.S |    1 +
 fs/ext4/file.c                   |    1 +
 fs/open.c                        |   18 ++++++++++++++++++
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/fs.h               |    1 +
 include/linux/syscalls.h         |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/ext4/file.c
===================================================================
--- linux-2.6.20.1.orig/fs/ext4/file.c
+++ linux-2.6.20.1/fs/ext4/file.c
@@ -135,5 +135,6 @@ struct inode_operations ext4_file_inode_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,24 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	inode = file->f_path.dentry->d_inode;
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, offset, len);
+	else
+		ret = -ENOTTY;
+	fput(file);
+out:
+        return ret;
+}
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -1124,6 +1124,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

--
Regards,
Amit Arora 


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
@ 2007-03-01 19:15     ` Eric Sandeen
  2007-03-02 10:45       ` Andreas Dilger
  2007-03-01 20:23     ` Jeff Garzik
                       ` (6 subsequent siblings)
  7 siblings, 1 reply; 340+ messages in thread
From: Eric Sandeen @ 2007-03-01 19:15 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

Amit K. Arora wrote:
> This is to give a heads-up on a few patches that we will soon be posting.
> These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> 

One thing I'd like to see is a cmd argument as well, to allow, for 
example, allocation vs. reservation (i.e. allocating blocks vs. simply 
reserving a number of them), as well as the inverse of those functions 
(un-reservation, de-allocation).

If the allocation interface allows allocation/reservation within 
arbitrary ranges but the only way to un-allocate is via a truncate, 
that's pretty asymmetric.
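
As a point of comparison only, and not something this thread settles: the
fallocate() call found in present-day Linux and glibc does take a mode
argument placed right after fd. The sketch below assumes such a system and a
filesystem that supports the flags; the two wrapper names are hypothetical.

```c
#define _GNU_SOURCE		/* for fallocate() and the FALLOC_FL_* flags */
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Reserve blocks for [offset, offset + len) without changing the file size.
 * Returns 0 on success or a negative errno value.
 */
static int reserve_keep_size(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) < 0)
		return -errno;
	return 0;
}

/*
 * Give the blocks in [offset, offset + len) back to the filesystem, leaving
 * a hole; the file size is unchanged.
 */
static int punch_hole(int fd, off_t offset, off_t len)
{
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0)
		return -errno;
	return 0;
}
```

A filesystem that does not implement a given mode is expected to fail the call
with EOPNOTSUPP, so callers should be ready to fall back.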

-Eric


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
  2007-03-01 19:15     ` Eric Sandeen
@ 2007-03-01 20:23     ` Jeff Garzik
  2007-03-01 20:31       ` Jeremy Allison
  2007-03-01 21:14     ` Jeremy Fitzhardinge
                       ` (5 subsequent siblings)
  7 siblings, 1 reply; 340+ messages in thread
From: Jeff Garzik @ 2007-03-01 20:23 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

Amit K. Arora wrote:
> This is to give a heads-up on a few patches that we will soon be posting.
> These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> 
> As we are developing and testing the required patches, we decided to
> post a preliminary patch and get inputs from the community to give it
> a right direction and shape. First, a little description on the feature.
>  
> Persistent preallocation is a file system feature that lets an
> application (say, a relational database server) explicitly
> preallocate blocks to a particular file. This feature can be used to
> reserve space for a file, mainly to get the following benefits:
> 1> contiguity - less fragmentation and thus faster access, and
> 2> a guarantee of minimum space availability (depending on how many
> blocks were preallocated) for the file, even if the filesystem becomes
> full.
> 
> XFS already has an implementation for this, using an ioctl interface. And,
> ext4 is now coming up with this feature. In coming time we may see a few
> more file systems implementing this. Thus, it makes sense to have a more
> standard interface for this, like this new system call.
> 
> Here is the initial and incomplete version of the patch, which can be
> used for the discussion, till we come up with a set of more complete
> patches.
> 
> ---
>  arch/i386/kernel/syscall_table.S |    1 +
>  fs/ext4/file.c                   |    1 +
>  fs/open.c                        |   18 ++++++++++++++++++
>  include/asm-i386/unistd.h        |    3 ++-
>  include/linux/fs.h               |    1 +
>  include/linux/syscalls.h         |    1 +
>  6 files changed, 24 insertions(+), 1 deletion(-)

I certainly agree that we want something like this.

posix_fallocate() is the glibc interface we want to be compatible with 
(which your definition is, AFAICS).

	Jeff





* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 20:23     ` Jeff Garzik
@ 2007-03-01 20:31       ` Jeremy Allison
  0 siblings, 0 replies; 340+ messages in thread
From: Jeremy Allison @ 2007-03-01 20:31 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

On Thu, Mar 01, 2007 at 03:23:19PM -0500, Jeff Garzik wrote:
> I certainly agree that we want something like this.
> 
> posix_fallocate() is the glibc interface we want to be compatible with 
> (which your definition is, AFAICS).

This would be great for Samba. Windows clients do this a lot....

Jeremy.


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
  2007-03-01 19:15     ` Eric Sandeen
  2007-03-01 20:23     ` Jeff Garzik
@ 2007-03-01 21:14     ` Jeremy Fitzhardinge
  2007-03-01 22:58       ` Alan
  2007-03-01 22:25     ` Andrew Morton
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 340+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 21:14 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

Amit K. Arora wrote:
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;

You can only allocate space on typewriters? ;)

    J


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:58       ` Alan
@ 2007-03-01 22:05         ` Jeremy Fitzhardinge
  2007-03-01 23:11           ` Alan
  0 siblings, 1 reply; 340+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 22:05 UTC (permalink / raw)
  To: Alan
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

Alan wrote:
> A lot of people get confused about -ENOTTY, but it is the return for
> attempting to use an ioctl on the wrong type of object, so this appears
> to be quite correct.

This is a syscall though; ENOSYS is probably a better match.

    J


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 23:11           ` Alan
@ 2007-03-01 22:15             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 340+ messages in thread
From: Jeremy Fitzhardinge @ 2007-03-01 22:15 UTC (permalink / raw)
  To: Alan
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

Alan wrote:
> ENOSYS indicates quite different things and ENOTTY is also used for
> syscalls. I still think ENOTTY is correct.
>   
Yes, ENOSYS tends to mean "operation flat-out not supported" rather than
"not on this object".  I think we can do better than ENOTTY though -
ENOTSUP for example (modulo the confusion over EOPNOTSUPP).

(You can tell the patch has very little real substance if we're arguing
over errnos at this point :)
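
On the ENOTSUP/EOPNOTSUPP confusion: on Linux with glibc the two names share
one value, which a user-space program can check as sketched below. The helper
names are made up for illustration, and other platforms may define the two
constants differently.

```c
#include <errno.h>
#include <string.h>

/* Nonzero when the C library gives ENOTSUP and EOPNOTSUPP the same value. */
static int enotsup_is_eopnotsupp(void)
{
	return ENOTSUP == EOPNOTSUPP;
}

/* Human-readable text for an errno value, e.g. for comparing candidates. */
static const char *errno_text(int err)
{
	return strerror(err);
}
```

Printing errno_text(ENOTTY), errno_text(ENOSYS) and errno_text(EOPNOTSUPP)
side by side shows what a user would actually see for each candidate.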

    J


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
                       ` (2 preceding siblings ...)
  2007-03-01 21:14     ` Jeremy Fitzhardinge
@ 2007-03-01 22:25     ` Andrew Morton
  2007-03-01 22:40       ` Nathan Scott
                         ` (2 more replies)
  2007-03-01 23:29     ` Eric Sandeen
                       ` (3 subsequent siblings)
  7 siblings, 3 replies; 340+ messages in thread
From: Andrew Morton @ 2007-03-01 22:25 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki, Ulrich Drepper

On Fri, 2 Mar 2007 00:04:45 +0530
"Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> This is to give a heads-up on a few patches that we will soon be posting.
> These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

It is intended that glibc use this same syscall for both posix_fallocate()
and posix_fallocate64().

I'd agree with Eric on the "command" flag extension.

That new argument might need to come after "fd" - ARM has funny requirements on
syscall arg padding and layout.

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +        return ret;
> +}

Please always put a blank line between the variable definitions and the
first statement.

Please always use hard tabs, not bunch-of-spaces.  This seems to be happening
rather a lot in the ext4 patches.  It's a trivial thing, but also trivial
to fix.  A grep across the diffs is needed.

ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...



* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:40       ` Nathan Scott
@ 2007-03-01 22:39         ` Eric Sandeen
  2007-03-01 22:52         ` Andrew Morton
  1 sibling, 0 replies; 340+ messages in thread
From: Eric Sandeen @ 2007-03-01 22:39 UTC (permalink / raw)
  To: nscott
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki, Ulrich Drepper

Nathan Scott wrote:
> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>> On Fri, 2 Mar 2007 00:04:45 +0530
>> "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>>
>>> This is to give a heads-up on a few patches that we will soon be posting.
>>> These patches implement a new system call sys_fallocate() and a
>>> new inode operation "fallocate", for persistent preallocation. The new
>>> system call, as Andrew suggested, will look like:
>>>
>>>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>> ...
>>
>> I'd agree with Eric on the "command" flag extension.
> 
> Seems like a separate syscall would be better, "command" sounds
> a bit ioctl like, especially if that command is passed into the
> filesystems..
> 
> cheers.
> 

I'm fine with that too, I'd just like the functionality :)

-Eric


* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:25     ` Andrew Morton
@ 2007-03-01 22:40       ` Nathan Scott
  2007-03-01 22:39         ` Eric Sandeen
  2007-03-01 22:52         ` Andrew Morton
  2007-03-01 22:41       ` Anton Blanchard
  2007-03-01 22:44       ` Dave Kleikamp
  2 siblings, 2 replies; 340+ messages in thread
From: Nathan Scott @ 2007-03-01 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 00:04:45 +0530
> "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > This is to give a heads up on few patches that we will be soon coming up
> > with. These patches implement a new system call sys_fallocate() and a
> > new inode operation "fallocate", for persistent preallocation. The new
> > system call, as Andrew suggested, will look like:
> > 
> >   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> ...
> 
> I'd agree with Eric on the "command" flag extension.

Seems like a separate syscall would be better, "command" sounds
a bit ioctl like, especially if that command is passed into the
filesystems..

cheers.

-- 
Nathan


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:25     ` Andrew Morton
  2007-03-01 22:40       ` Nathan Scott
@ 2007-03-01 22:41       ` Anton Blanchard
  2007-03-01 22:44       ` Dave Kleikamp
  2 siblings, 0 replies; 340+ messages in thread
From: Anton Blanchard @ 2007-03-01 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper


> That new argument might need to come after "fd" - ARM has funny
> requirements on syscall arg padding and layout.

FYI the 32bit ppc ABI does too, from arch/powerpc/kernel/sys_ppc32.c:

/*
 * long long munging:
 * The 32 bit ABI passes long longs in an odd even register pair.
 */

and the first argument in a function call is in r3.
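
To make the padding concern concrete, here is a minimal sketch (not kernel
code) that counts the argument-register slots skipped under the odd/even
pairing rule quoted above. It assumes arguments are assigned in order
starting at r3 and that a 64-bit argument must begin in an odd-numbered
register; the helper pad_slots() and the two size tables are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the register slots skipped as padding when
 * arguments of the given sizes (in bytes) are assigned in order starting
 * at r3, assuming every 8-byte argument must start in an odd-numbered
 * register.
 */
static int pad_slots(const int *arg_bytes, size_t nargs)
{
	int reg = 3;	/* first argument goes in r3 */
	int pads = 0;
	size_t i;

	assert(arg_bytes != NULL || nargs == 0);

	for (i = 0; i < nargs; i++) {
		if (arg_bytes[i] == 8) {
			if (reg % 2 == 0) {	/* pair must start odd */
				reg++;
				pads++;
			}
			reg += 2;
		} else {
			reg += 1;
		}
	}
	return pads;
}

/* (int fd, loff_t offset, loff_t len) */
static const int args_plain[] = { 4, 8, 8 };

/* (int fd, int mode, loff_t offset, loff_t len) */
static const int args_mode[] = { 4, 4, 8, 8 };
```

Under these assumptions the three-argument form skips one slot (r4), while
a 32-bit argument placed right after "fd" fills that slot and nothing is
skipped.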

Anton

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:25     ` Andrew Morton
  2007-03-01 22:40       ` Nathan Scott
  2007-03-01 22:41       ` Anton Blanchard
@ 2007-03-01 22:44       ` Dave Kleikamp
  2007-03-01 22:59         ` Andrew Morton
  2007-03-01 23:38         ` Christoph Hellwig
  2 siblings, 2 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-03-01 22:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2007 00:04:45 +0530
> "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > +{
> > +	struct file *file;
> > +	struct inode *inode;
> > +	long ret = -EINVAL;
> > +	file = fget(fd);
> > +	if (!file)
> > +		goto out;
> > +	inode = file->f_path.dentry->d_inode;
> > +	if (inode->i_op && inode->i_op->fallocate)
> > +		ret = inode->i_op->fallocate(inode, offset, len);
> > +	else
> > +		ret = -ENOTTY;
> > +	fput(file);
> > +out:
> > +        return ret;
> > +}
> 

> ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> really appropriate here anyway.  So I'm not sure what would be better...

Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range?  I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code.  Maybe we don't care, since the major
file systems can probably do something better in their own code.
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:40       ` Nathan Scott
  2007-03-01 22:39         ` Eric Sandeen
@ 2007-03-01 22:52         ` Andrew Morton
  2007-03-02 18:28           ` Mingming Cao
  1 sibling, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-03-01 22:52 UTC (permalink / raw)
  To: nscott
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott <nscott@aconex.com> wrote:

> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530
> > "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > 
> > > This is to give a heads up on few patches that we will be soon coming up
> > > with. These patches implement a new system call sys_fallocate() and a
> > > new inode operation "fallocate", for persistent preallocation. The new
> > > system call, as Andrew suggested, will look like:
> > > 
> > >   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> > ...
> > 
> > I'd agree with Eric on the "command" flag extension.
> 
> Seems like a separate syscall would be better, "command" sounds
> a bit ioctl like, especially if that command is passed into the
> filesystems..
> 

madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to "mode"? ;)

I'm inclined to merge this patch nice and early, so the syscall number is
stabilised.  Otherwise the people who are working on out-of-tree code (ie:
ext4) will have to keep playing catchup.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 21:14     ` Jeremy Fitzhardinge
@ 2007-03-01 22:58       ` Alan
  2007-03-01 22:05         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 340+ messages in thread
From: Alan @ 2007-03-01 22:58 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

On Thu, 01 Mar 2007 13:14:32 -0800
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Amit K. Arora wrote:
> > +	if (inode->i_op && inode->i_op->fallocate)
> > +		ret = inode->i_op->fallocate(inode, offset, len);
> > +	else
> > +		ret = -ENOTTY;
> 
> You can only allocate space on typewriters? ;)

A lot of people get confused about -ENOTTY, but it is the return for
attempting to use an ioctl on the wrong type of object, so this appears
to be quite correct.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:44       ` Dave Kleikamp
@ 2007-03-01 22:59         ` Andrew Morton
  2007-03-01 23:09           ` Dave Kleikamp
  2007-03-02  7:09           ` Ulrich Drepper
  2007-03-01 23:38         ` Christoph Hellwig
  1 sibling, 2 replies; 340+ messages in thread
From: Andrew Morton @ 2007-03-01 22:59 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Thu, 01 Mar 2007 22:44:16 +0000
Dave Kleikamp <shaggy@linux.vnet.ibm.com> wrote:

> On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2007 00:04:45 +0530
> > "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > +{
> > > +	struct file *file;
> > > +	struct inode *inode;
> > > +	long ret = -EINVAL;
> > > +	file = fget(fd);
> > > +	if (!file)
> > > +		goto out;
> > > +	inode = file->f_path.dentry->d_inode;
> > > +	if (inode->i_op && inode->i_op->fallocate)
> > > +		ret = inode->i_op->fallocate(inode, offset, len);
> > > +	else
> > > +		ret = -ENOTTY;
> > > +	fput(file);
> > > +out:
> > > +        return ret;
> > > +}
> > 
> 
> > ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> > thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> > really appropriate here anyway.  So I'm not sure what would be better...
> 
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
> 
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.

However with this kernel change, glibc will need to look at the errno,
so that it can correctly propagate EIO, ENOSPC and whatever.  So we will
need to return a reliable and stable and sensible value so that glibc knows
when it should emulate and when it should propagate.

Perhaps Ulrich can comment.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:59         ` Andrew Morton
@ 2007-03-01 23:09           ` Dave Kleikamp
  2007-03-02 13:41             ` Jan Engelhardt
  2007-03-02 18:09             ` Mingming Cao
  2007-03-02  7:09           ` Ulrich Drepper
  1 sibling, 2 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-03-01 23:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:44:16 +0000
> Dave Kleikamp <shaggy@linux.vnet.ibm.com> wrote:
> 
> > On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> > > On Fri, 2 Mar 2007 00:04:45 +0530
> > > "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > 
> > > > +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> > > > +{
> > > > +	struct file *file;
> > > > +	struct inode *inode;
> > > > +	long ret = -EINVAL;
> > > > +	file = fget(fd);
> > > > +	if (!file)
> > > > +		goto out;
> > > > +	inode = file->f_path.dentry->d_inode;
> > > > +	if (inode->i_op && inode->i_op->fallocate)
> > > > +		ret = inode->i_op->fallocate(inode, offset, len);
> > > > +	else
> > > > +		ret = -ENOTTY;
> > > > +	fput(file);
> > > > +out:
> > > > +        return ret;
> > > > +}
> > > 
> > 
> > > ENOTTY is a bit unconventional - we often use EINVAL for this sort of
> > > thing.  But EINVAL has other meanings for posix_fallocate() and isn't
> > > really appropriate here anyway.  So I'm not sure what would be better...
> > 
> > Would EINVAL (or whatever) make it back to the caller of
> > posix_fallocate(), or would glibc fall back to its current
> > implementation?
> > 
> > Forgive me if I haven't put enough thought into it, but would it be
> > useful to create a generic_fallocate() that writes zeroed pages for any
> > non-existent pages in the range?  I don't know how glibc currently
> > implements posix_fallocate(), but maybe the kernel could do it more
> > efficiently, even in generic code.  Maybe we don't care, since the major
> > file systems can probably do something better in their own code.
> 
> Given that glibc already implements fallocate for all filesystems, it will
> need to continue to do so for filesystems which don't implement this
> syscall - otherwise applications would start breaking.

I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op->fallocate().

if (inode->i_op && inode->i_op->fallocate)
	ret = inode->i_op->fallocate(inode, offset, len);
else
	ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:05         ` Jeremy Fitzhardinge
@ 2007-03-01 23:11           ` Alan
  2007-03-01 22:15             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 340+ messages in thread
From: Alan @ 2007-03-01 23:11 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

On Thu, 01 Mar 2007 14:05:36 -0800
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Alan wrote:
> > A lot of people get confused about -ENOTTY, but it is the return for
> > attempting to use an ioctl on the wrong type of object, so this appears
> > to be quite correct.
> 
> This is a syscall though; ENOSYS is probably a better match.

ENOSYS indicates quite different things and ENOTTY is also used for
syscalls. I still think ENOTTY is correct.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
                       ` (3 preceding siblings ...)
  2007-03-01 22:25     ` Andrew Morton
@ 2007-03-01 23:29     ` Eric Sandeen
  2007-03-01 23:51       ` Christoph Hellwig
  2007-03-01 23:36     ` Christoph Hellwig
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 340+ messages in thread
From: Eric Sandeen @ 2007-03-01 23:29 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

Amit K. Arora wrote:

Might want more error checking in there, something like (rough cut)...
(or is some of this glibc's job?)

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret;
> +
> +	ret = -EINVAL;
> +	if (len == 0 || offset < 0)
> +		goto out;
> +	ret = -EBADF;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		goto out_fput;
> +	inode = file->f_path.dentry->d_inode;
> +	ret = -ESPIPE;
> +	if (S_ISFIFO(inode->i_mode))
> +		goto out_fput;
> +	ret = -ENODEV;
> +	if (!S_ISREG(inode->i_mode))
> +		goto out_fput;
> +	ret = -EFBIG;
> +	if (offset + len > inode->i_sb->s_maxbytes)
> +		goto out_fput;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +out_fput:
> +	fput(file);
> +out:
> +	return ret;
> +}

which would keep things in line with posix_fallocate's specified errors, 
too?
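
For illustration only, the range checks could be pulled into a small pure
function that can be exercised outside the kernel. This is a sketch, not
part of the patch: check_range() is a made-up name, int64_t stands in for
loff_t, rejecting a negative len is an assumption, and the comparison is
arranged so that offset + len is never computed and cannot overflow.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate an (offset, len) pair against the largest
 * file size the filesystem supports.  Returns 0 or a negative errno.
 */
static int check_range(int64_t offset, int64_t len, int64_t maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* Same test as "offset + len > maxbytes", without the addition. */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

A request ending exactly at maxbytes passes, and a huge len combined with
a small positive offset is reported as -EFBIG rather than wrapping around.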

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
                       ` (4 preceding siblings ...)
  2007-03-01 23:29     ` Eric Sandeen
@ 2007-03-01 23:36     ` Christoph Hellwig
  2007-03-02  6:03     ` Badari Pulavarty
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
  7 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-01 23:36 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex

On Fri, Mar 02, 2007 at 12:04:45AM +0530, Amit K. Arora wrote:
> This is to give a heads up on few patches that we will be soon coming up
> with. These patches implement a new system call sys_fallocate() and a
> new inode operation "fallocate", for persistent preallocation. The new
> system call, as Andrew suggested, will look like:
> 
>   asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> 
> As we are developing and testing the required patches, we decided to
> post a preliminary patch and get inputs from the community to give it
> a right direction and shape. First, a little description on the feature.

Thanks a lot, this has been long overdue.

Please don't forget to Cc the XFS list, to keep in the loop the developers
of what has long been the only Linux filesystem supporting persistent
allocations :)

Various people will beat you up for the above syscall as lots of
architectures really want 64bit arguments aligned in a proper way,
e.g. you at least need a pad after 'int fd'.  Then again I already
have suggestions for filling up that slot with useful information:

 - you really want a whence argument as in lseek, as it makes a lot
   of sense for applications to allocate from the end of the file
   or the current file position.  The existing XFS ioctl already
   has this, and it's trivial to support this in any preallocation
   implementation I could imagine.
 - we should think about having a flag value for which kind of preallocation
   we want.  XFS currently has two:

	ALLOCSP which updates the inode size and physically zeroes blocks
	RESVSP which does not update inode size but creates an unwritten
	       extent

   the current posix_fallocate semantics are somewhere in the middle, as
   it requires an update to the inode size, but does not specify at
   all what happens if you read from the newly allocated space.
   And yes, as a heads up to developers implementing this feature
   on new filesystems: don't just return new blocks, that's a gaping
   security hole :)
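
As a rough illustration of the whence idea, the starting offset could be
resolved the same way lseek resolves its target. This is only a sketch:
resolve_start() is a made-up name, int64_t stands in for loff_t, and the
current position and file size are passed in as plain numbers (assumed
non-negative) rather than read from a struct file.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Hypothetical helper: turn (whence, offset) into an absolute start
 * offset.  Returns the offset, or a negative errno on failure.
 * "pos" and "size" are assumed to be non-negative.
 */
static int64_t resolve_start(int whence, int64_t offset,
			     int64_t pos, int64_t size)
{
	int64_t base;

	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = pos;
		break;
	case SEEK_END:
		base = size;
		break;
	default:
		return -EINVAL;
	}

	if (offset > 0 && base > INT64_MAX - offset)
		return -EOVERFLOW;	/* base + offset would wrap */
	if (base + offset < 0)
		return -EINVAL;		/* start before beginning of file */

	return base + offset;
}
```

With this, "allocate from the end of the file" is just SEEK_END with an
offset of zero, and the filesystem only ever sees an absolute range.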

> +asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	inode = file->f_path.dentry->d_inode;
> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, offset, len);
> +	else
> +		ret = -ENOTTY;
> +	fput(file);
> +out:
> +        return ret;
> +}

This should use fget_light, and I'm sure the code could be written
in a slightly more readable way:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
	struct file *file = fget(fd);
	long ret = -EINVAL;

	if (file) {
		struct inode *inode = file->f_path.dentry->d_inode;

		if (inode->i_op && inode->i_op->fallocate)
			ret = inode->i_op->fallocate(inode, offset, len);
		else
			ret = -ENOTTY;
		fput(file);
	}

	return ret;
}

p.s. you reference ext4_fallocate in the patch but don't actually
introduce it, it definitely won't compile as-is :)

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:44       ` Dave Kleikamp
  2007-03-01 22:59         ` Andrew Morton
@ 2007-03-01 23:38         ` Christoph Hellwig
  2007-03-03 22:45           ` Arnd Bergmann
  1 sibling, 1 reply; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-01 23:38 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki, Ulrich Drepper

On Thu, Mar 01, 2007 at 10:44:16PM +0000, Dave Kleikamp wrote:
> Would EINVAL (or whatever) make it back to the caller of
> posix_fallocate(), or would glibc fall back to its current
> implementation?
> 
> Forgive me if I haven't put enough thought into it, but would it be
> useful to create a generic_fallocate() that writes zeroed pages for any
> non-existent pages in the range?  I don't know how glibc currently
> implements posix_fallocate(), but maybe the kernel could do it more
> efficiently, even in generic code.  Maybe we don't care, since the major
> file systems can probably do something better in their own code.

I'd be more happy to have the write out zeroes loop in glibc.  And
glibc needs to have it anyway, for older kernels.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 23:29     ` Eric Sandeen
@ 2007-03-01 23:51       ` Christoph Hellwig
  0 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-01 23:51 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

On Thu, Mar 01, 2007 at 05:29:15PM -0600, Eric Sandeen wrote:
> Amit K. Arora wrote:
> 
> Might want more error checking in there, something like (rough cut)...
> (or is some of this glibc's job?)

Yeah, we need to have these checks.  We can't rely on userspace not
passing arguments that might corrupt your filesystem or let you
escalate privileges.

> which would keep things in line with posix_fallocate's specified errors, 
> too?

Yes, very good idea.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
                       ` (5 preceding siblings ...)
  2007-03-01 23:36     ` Christoph Hellwig
@ 2007-03-02  6:03     ` Badari Pulavarty
  2007-03-02  6:16       ` Andrew Morton
  2007-03-02 15:16       ` Eric Sandeen
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
  7 siblings, 2 replies; 340+ messages in thread
From: Badari Pulavarty @ 2007-03-02  6:03 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, Andrew Morton, suparna,
	cmm, alex, suzuki


Amit K. Arora wrote:

>This is to give a heads up on few patches that we will be soon coming up
>with. These patches implement a new system call sys_fallocate() and a
>new inode operation "fallocate", for persistent preallocation. The new
>system call, as Andrew suggested, will look like:
>
>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>
I am wondering about return values from this syscall ? Is it supposed to
return the number of bytes allocated ? What about partial allocations ?
What about if the blocks already exist ? What would be return values in
those cases ?

Just curious .. What does posix_fallocate() return ?

Thanks,
Badari

>




^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02  6:03     ` Badari Pulavarty
@ 2007-03-02  6:16       ` Andrew Morton
  2007-03-02 13:23         ` Dave Kleikamp
  2007-03-02 15:16       ` Eric Sandeen
  1 sibling, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-03-02  6:16 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <pbadari@us.ibm.com> wrote:

> Just curious .. What does posix_fallocate() return ?

bookmark this:

http://www.opengroup.org/onlinepubs/009695399/nfindex.html

    Upon successful completion, posix_fallocate() shall return zero;
    otherwise, an error number shall be returned to indicate the error.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:59         ` Andrew Morton
  2007-03-01 23:09           ` Dave Kleikamp
@ 2007-03-02  7:09           ` Ulrich Drepper
  1 sibling, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-02  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Kleikamp, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

[-- Attachment #1: Type: text/plain, Size: 450 bytes --]

Andrew Morton wrote:
> Perhaps Ulrich can comment.

I was out of town, hence the delay.

I think that if there is no support for the syscall the correct answer
is to return ENOSYS.  In this case the current userlevel code would be
used and ENOSYS is also used to trigger the use of the compat code in
glibc in case the syscall does not exist at all.
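
A minimal sketch of that dispatch, assuming a user-level wrapper that gets
back 0 or a positive error number from each path. This is not glibc's
code: the function-pointer type, fallocate_with_fallback() and the three
stubs are made up for illustration, with the stubs standing in for the
kernel call and for the existing emulation.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical signature shared by the kernel path and the emulation. */
typedef int (*falloc_fn)(int fd, off_t offset, off_t len);

/*
 * Try the kernel first.  ENOSYS means "no support here", so fall back to
 * the emulation; every other result (0, EIO, ENOSPC, ...) is passed
 * through to the caller unchanged.
 */
static int fallocate_with_fallback(int fd, off_t offset, off_t len,
				   falloc_fn try_kernel, falloc_fn emulate)
{
	int err = try_kernel(fd, offset, len);

	if (err == ENOSYS)
		return emulate(fd, offset, len);
	return err;
}

/* Stubs for illustration only. */
static int stub_enosys(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSYS;
}

static int stub_enospc(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSPC;
}

static int stub_ok(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return 0;
}
```

The point of a single distinguished value is visible here: the wrapper
needs exactly one comparison to decide between emulating and propagating.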

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 19:15     ` Eric Sandeen
@ 2007-03-02 10:45       ` Andreas Dilger
  2007-03-02 13:17         ` Dave Kleikamp
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-03-02 10:45 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> One thing I'd like to see is a cmd argument as well, to allow for 
> example allocation vs. reservation (i.e. allocating blocks vs. simply 
> reserving a number), as well as the inverse of those functions 
> (un-reservation, de-allocation)?
> 
> If the allocation interface allows allocation/reservation within 
> arbitrary ranges, if the only way to un-allocate is via a truncate, 
> that's pretty asymmetric.

I'd rather we just get the oft-discussed punch() syscall instead.
This is really what "unallocate" would do for persistent allocations
and it would be useful for files that were not preallocated.

For filesystems that don't implement punch(), glibc would do zero-filling
of the punched area I guess (to make it equivalent to reading from a
hole in the file).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 10:45       ` Andreas Dilger
@ 2007-03-02 13:17         ` Dave Kleikamp
  0 siblings, 0 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-03-02 13:17 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Sandeen, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, Andrew Morton, suparna, cmm, alex, suzuki

On Fri, 2007-03-02 at 18:45 +0800, Andreas Dilger wrote:
> On Mar 01, 2007  13:15 -0600, Eric Sandeen wrote:
> > One thing I'd like to see is a cmd argument as well, to allow for 
> > example allocation vs. reservation (i.e. allocating blocks vs. simply 
> > reserving a number), as well as the inverse of those functions 
> > (un-reservation, de-allocation)?
> > 
> > If the allocation interface allows allocation/reservation within 
> > arbitrary ranges, if the only way to un-allocate is via a truncate, 
> > that's pretty asymmetric.
> 
> I'd rather we just get the oft-discussed punch() syscall instead.
> This is really what "unallocate" would do for persistent allocations
> and it would be useful for files that were not preallocated.

I can see a difference though.  punch() would throw away written data as
well as pre-allocated-but-never-written-to data.  I can see where a user
might preallocate a large file and do a lot of random writes.  At some
point, he decides the file isn't going to grow much more, so let's free
up the remaining pre-allocated blocks.  This makes even more sense with
reservation.

The alternative would be to have punch() take a flag to specify if only
preallocated or reserved blocks should be freed.

> 
> For filesystems that don't implement punch(), glibc would do zero-filling
> of the punched area I guess (to make it equivalent to reading from a
> hole in the file).

Or it could just fail.  Writing zeroes may be really slow and not give
the caller any benefit.  (The intention was to free blocks back to the
file system.)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02  6:16       ` Andrew Morton
@ 2007-03-02 13:23         ` Dave Kleikamp
  2007-03-02 15:29           ` Ulrich Drepper
  0 siblings, 1 reply; 340+ messages in thread
From: Dave Kleikamp @ 2007-03-02 13:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Badari Pulavarty, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

Amit wrote:

>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

On Thu, 2007-03-01 at 22:16 -0800, Andrew Morton wrote:
> On Thu, 01 Mar 2007 22:03:55 -0800 Badari Pulavarty <pbadari@us.ibm.com> wrote:
> 
> > Just curious .. What does posix_fallocate() return ?
> 
> bookmark this:
> 
> http://www.opengroup.org/onlinepubs/009695399/nfindex.html
> 
>     Upon successful completion, posix_fallocate() shall return zero;
>     otherwise, an error number shall be returned to indicate the error.

Then there's no need for sys_fallocate to return a long.
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 23:09           ` Dave Kleikamp
@ 2007-03-02 13:41             ` Jan Engelhardt
  2007-03-02 18:09             ` Mingming Cao
  1 sibling, 0 replies; 340+ messages in thread
From: Jan Engelhardt @ 2007-03-02 13:41 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki, Ulrich Drepper


On Mar 1 2007 23:09, Dave Kleikamp wrote:
>> 
>> Given that glibc already implements fallocate for all filesystems, it will
>> need to continue to do so for filesystems which don't implement this
>> syscall - otherwise applications would start breaking.
>
>I didn't make it clear, but my point was to call generic_fallocate if
>the file system did not define i_op->fallocate().
>
>if (inode->i_op && inode->i_op->fallocate)
>	ret = inode->i_op->fallocate(inode, offset, len);
>else
>	ret = generic_fallocate(inode, offset, len);
>
>I'm not sure it's worth the effort, but I thought I'd throw the idea out
>there.

Writing zeroes using glibc emu most likely means write() --
so generic_fallocate should be preferable (think splice).
Or does glibc use mmap() and it's all different?


Jan
-- 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02  6:03     ` Badari Pulavarty
  2007-03-02  6:16       ` Andrew Morton
@ 2007-03-02 15:16       ` Eric Sandeen
  2007-03-02 16:13         ` Badari Pulavarty
  1 sibling, 1 reply; 340+ messages in thread
From: Eric Sandeen @ 2007-03-02 15:16 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	Andrew Morton, suparna, cmm, alex, suzuki

Badari Pulavarty wrote:
> 
> Amit K. Arora wrote:
> 
>> This is to give a heads up on few patches that we will be soon coming up
>> with. These patches implement a new system call sys_fallocate() and a
>> new inode operation "fallocate", for persistent preallocation. The new
>> system call, as Andrew suggested, will look like:
>>
>>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>
> I am wondering about return values from this syscall ? Is it supposed to 
> return the
> number of bytes allocated ? What about partial allocations ? 

If you don't have enough blocks to cover the request, you should 
probably just return -ENOSPC, not a partial allocation.

> What about 
> if the
> blocks already exists ? What would be return values in those cases ?

0 on success, other normal errors otherwise..

If asked for a range that includes already-allocated blocks, you just 
allocate any non-allocated blocks in the range, I think.
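
Roughly, in toy form (a sketch over an in-memory block map, not filesystem
code; struct toy_file and toy_fallocate() are made up for illustration,
and it is single-threaded, so it ignores concurrent allocators entirely):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: one flag per block plus a free-space counter. */
struct toy_file {
	bool *allocated;	/* allocated[i] is true if block i is mapped */
	size_t nblocks;		/* number of entries in allocated[] */
	size_t free_blocks;	/* free blocks left in the toy filesystem */
};

/*
 * Allocate every not-yet-allocated block in [first, first + count).
 * All-or-nothing: if there is not enough free space for the missing
 * blocks, return -ENOSPC and leave the map untouched.
 */
static int toy_fallocate(struct toy_file *f, size_t first, size_t count)
{
	size_t i, needed = 0;

	if (count == 0 || first >= f->nblocks || count > f->nblocks - first)
		return -EINVAL;

	/* First pass: count the blocks that are still missing. */
	for (i = first; i < first + count; i++)
		if (!f->allocated[i])
			needed++;

	if (needed > f->free_blocks)
		return -ENOSPC;

	/* Second pass: fill the gaps, skipping blocks already present. */
	for (i = first; i < first + count; i++) {
		if (!f->allocated[i]) {
			f->allocated[i] = true;
			f->free_blocks--;
		}
	}
	return 0;
}
```

Already-allocated blocks in the range cost nothing, and a request that
cannot be satisfied in full changes nothing.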

-Eric


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 13:23         ` Dave Kleikamp
@ 2007-03-02 15:29           ` Ulrich Drepper
  0 siblings, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-02 15:29 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, Badari Pulavarty, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

On 3/2/07, Dave Kleikamp <shaggy@linux.vnet.ibm.com> wrote:
> Then there's no need for sys_fallocate to return a long.

Every syscall must return a long.  Otherwise you can have problems on
64-bit archs.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 15:16       ` Eric Sandeen
@ 2007-03-02 16:13         ` Badari Pulavarty
  2007-03-02 17:01           ` Andrew Morton
  2007-03-02 17:19           ` Eric Sandeen
  0 siblings, 2 replies; 340+ messages in thread
From: Badari Pulavarty @ 2007-03-02 16:13 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Amit K. Arora, linux-fsdevel, lkml, ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

On Fri, 2007-03-02 at 09:16 -0600, Eric Sandeen wrote:
> Badari Pulavarty wrote:
> > 
> > Amit K. Arora wrote:
> > 
> >> This is to give a heads up on few patches that we will be soon coming up
> >> with. These patches implement a new system call sys_fallocate() and a
> >> new inode operation "fallocate", for persistent preallocation. The new
> >> system call, as Andrew suggested, will look like:
> >>
> >>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>
> > I am wondering about return values from this syscall ? Is it supposed to 
> > return the
> > number of bytes allocated ? What about partial allocations ? 
> 
> If you don't have enough blocks to cover the request, you should 
> probably just return -ENOSPC, not a partial allocation.

That could be challenging when multiple writers are working in
parallel. You may not be able to return -ENOSPC until an allocation
actually fails (for filesystems which allocate a block at a time).
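
One way to keep the "0 or -ENOSPC" contract on top of a block-at-a-time
allocator is to undo the blocks already obtained when a later one fails.
The following is only a toy sketch of that idea; struct toy_volume and
both function names are invented here and do not correspond to any
filesystem's code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_BLOCKS 8

/* Toy volume: in_use[i] is nonzero when block i is taken. */
struct toy_volume {
	unsigned char in_use[TOY_BLOCKS];
};

/* Allocate one free block; returns its index or -ENOSPC. */
int toy_alloc_block(struct toy_volume *vol)
{
	for (int i = 0; i < TOY_BLOCKS; i++) {
		if (!vol->in_use[i]) {
			vol->in_use[i] = 1;
			return i;
		}
	}
	return -ENOSPC;
}

/*
 * Allocate `count` blocks one at a time, storing indices in `out`.
 * If any single allocation fails, release everything obtained so far
 * so the caller sees either 0 or -ENOSPC, never a partial result.
 */
int toy_alloc_all_or_nothing(struct toy_volume *vol, int count, int *out)
{
	assert(vol != NULL && out != NULL && count >= 0);

	for (int got = 0; got < count; got++) {
		int blk = toy_alloc_block(vol);

		if (blk < 0) {
			while (got-- > 0)
				vol->in_use[out[got]] = 0;
			return blk;
		}
		out[got] = blk;
	}
	return 0;
}
```

The sketch ignores locking; with parallel writers the rollback itself
would have to be serialized against other allocations.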

> 
> > What about 
> > if the
> > blocks already exists ? What would be return values in those cases ?
> 
> 0 on success, other normal errors otherwise.
> 
> If asked for a range that includes already-allocated blocks, you just 
> allocate any non-allocated blocks in the range, I think.

Yes. What I was trying to figure out is whether there is a requirement
that the interface return the exact number of bytes it *really*
allocated (like write() or read()). I can't think of any, but just
wanted to throw it out.

BTW, what is the interface for finding out what is the size of the
pre-allocated file ? 

Thanks,
Badari


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 16:13         ` Badari Pulavarty
@ 2007-03-02 17:01           ` Andrew Morton
  2007-03-02 17:19           ` Eric Sandeen
  1 sibling, 0 replies; 340+ messages in thread
From: Andrew Morton @ 2007-03-02 17:01 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Eric Sandeen, Amit K. Arora, linux-fsdevel, lkml, ext4, suparna,
	cmm, alex, suzuki

On Fri, 02 Mar 2007 08:13:00 -0800 Badari Pulavarty <pbadari@us.ibm.com> wrote:

> > 
> > > What about 
> > > if the
> > > blocks already exists ? What would be return values in those cases ?
> > 
> > 0 on success, other normal errors otherwise.
> > 
> > If asked for a range that includes already-allocated blocks, you just 
> > allocate any non-allocated blocks in the range, I think.
> 
> Yes. What I was trying to figure out is whether there is a requirement
> that the interface return the exact number of bytes it *really*
> allocated (like write() or read()). I can't think of any, but just
> wanted to throw it out.

Hopefully not, because posix didn't anticipate that.

We could of course return a positive number on success, but it'd get
tricky on 32-bit machines.

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ? 

stat.st_blocks?

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 16:13         ` Badari Pulavarty
  2007-03-02 17:01           ` Andrew Morton
@ 2007-03-02 17:19           ` Eric Sandeen
  1 sibling, 0 replies; 340+ messages in thread
From: Eric Sandeen @ 2007-03-02 17:19 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Amit K. Arora, linux-fsdevel, lkml, ext4, Andrew Morton, suparna,
	cmm, alex, suzuki

Badari Pulavarty wrote:

> BTW, what is the interface for finding out what is the size of the
> pre-allocated file ? 

With XFS at least, "du," "stat," etc tell you a little:

[root@magnesium test]# touch resvsp
[root@magnesium test]# xfs_io resvsp
xfs_io> resvsp 0 10g

The file is 0 length, but is using 10g of blocks:
(with posix_fallocate this would move the size out to 10g as well)

[root@magnesium test]# ls -lh resvsp
-rw-r--r--  1 root root 0 Nov 28 14:11 resvsp
[root@magnesium test]# du -hc resvsp
10G     resvsp
10G     total
[root@magnesium test]# stat resvsp
   File: `resvsp'
   Size: 0               Blocks: 20971520   IO Block: 4096   regular empty file
Device: 81eh/2078d      Inode: 186         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
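
The "Blocks" figure can be turned into bytes directly. A small sketch,
assuming the Linux convention that st_blocks counts 512-byte units; the
helper name allocated_bytes() is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Bytes actually allocated to a file, from a filled-in struct stat.
 * On Linux st_blocks is counted in 512-byte units regardless of
 * st_blksize. Comparing the result with st_size shows space that is
 * allocated beyond the file size (block rounding also contributes).
 */
uint64_t allocated_bytes(const struct stat *st)
{
	assert(st != NULL);
	return (uint64_t)st->st_blocks * 512u;
}
```

For the file above, 20971520 blocks * 512 bytes = 10737418240 bytes,
i.e. the 10G that du reports, while st_size stays at 0.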

xfs also has an interface to find out what allocations are where:

if you reserve some ranges not starting at 0...

[root@magnesium test]# xfs_io resvsp
xfs_io> resvsp 1g 1g
xfs_io> resvsp 3g 1g
xfs_io> resvsp 5g 1g
xfs_io> quit

[root@magnesium test]# xfs_bmap -v resvsp
resvsp:
  EXT: FILE-OFFSET           BLOCK-RANGE       AG AG-OFFSET            TOTAL   FLAGS
    0: [0..2097151]:         hole                                      2097152
    1: [2097152..4194303]:   42392..2139543     0 (42392..2139543)     2097152 10000
    2: [4194304..6291455]:   hole                                      2097152
    3: [6291456..8388607]:   4236696..6333847   0 (4236696..6333847)   2097152 10000
    4: [8388608..10485759]:  hole                                      2097152
    5: [10485760..12582911]: 8431000..10528151  0 (8431000..10528151)  2097152 10000

The flags of 10000 mean that these extents are preallocated/unwritten.

I suppose outside of XFS, FIBMAP is your best bet, but that won't tell 
you what is preallocated vs. allocated/written....

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 23:09           ` Dave Kleikamp
  2007-03-02 13:41             ` Jan Engelhardt
@ 2007-03-02 18:09             ` Mingming Cao
  1 sibling, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-03-02 18:09 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, alex, suzuki, Ulrich Drepper

Dave Kleikamp wrote:
> On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:
> 
>>On Thu, 01 Mar 2007 22:44:16 +0000
>>Dave Kleikamp <shaggy@linux.vnet.ibm.com> wrote:
>>
>>
>>>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>>
>>>>On Fri, 2 Mar 2007 00:04:45 +0530
>>>>"Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>>>
>>>>>+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
>>>>>+{
>>>>>+	struct file *file;
>>>>>+	struct inode *inode;
>>>>>+	long ret = -EINVAL;
>>>>>+	file = fget(fd);
>>>>>+	if (!file)
>>>>>+		goto out;
>>>>>+	inode = file->f_path.dentry->d_inode;
>>>>>+	if (inode->i_op && inode->i_op->fallocate)
>>>>>+		ret = inode->i_op->fallocate(inode, offset, len);
>>>>>+	else
>>>>>+		ret = -ENOTTY;
>>>>>+	fput(file);
>>>>>+out:
>>>>>+        return ret;
>>>>>+}
>>>>
>>>>ENOTTY is a bit unconventional - we often use EINVAL for this sort of
>>>>thing.  But EINVAL has other meanings for posix_fallocate() and isn't
>>>>really appropriate here anyway.  So I'm not sure what would be better...
>>>
>>>Would EINVAL (or whatever) make it back to the caller of
>>>posix_fallocate(), or would glibc fall back to its current
>>>implementation?
>>>
>>>Forgive me if I haven't put enough thought into it, but would it be
>>>useful to create a generic_fallocate() that writes zeroed pages for any
>>>non-existent pages in the range?  I don't know how glibc currently
>>>implements posix_fallocate(), but maybe the kernel could do it more
>>>efficiently, even in generic code.  Maybe we don't care, since the major
>>>file systems can probably do something better in their own code.
>>
>>Given that glibc already implements fallocate for all filesystems, it will
>>need to continue to do so for filesystems which don't implement this
>>syscall - otherwise applications would start breaking.
> 
> 
> I didn't make it clear, but my point was to call generic_fallocate if
> the file system did not define i_op->fallocate().
> 
> if (inode->i_op && inode->i_op->fallocate)
> 	ret = inode->i_op->fallocate(inode, offset, len);
> else
> 	ret = generic_fallocate(inode, offset, len);
> 
> I'm not sure it's worth the effort, but I thought I'd throw the idea out
> there.
> 
I think this is useful.
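
A toy user-space model of the dispatch quoted above, only to show the
shape of the fallback. Every toy_* name is invented for this sketch, the
"file" is a fixed in-memory buffer, and the generic path simply zero
fills past the old size; it is not kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_FILE_MAX 64

struct toy_inode;

/* Toy stand-in for inode_operations; fallocate may be NULL. */
struct toy_inode_ops {
	long (*fallocate)(struct toy_inode *inode, long offset, long len);
};

struct toy_inode {
	const struct toy_inode_ops *i_op;
	unsigned char data[TOY_FILE_MAX];
	long size;	/* bytes of data[] that count as file contents */
};

/* Generic fallback: extend the file by writing zeroes past the old size. */
long toy_generic_fallocate(struct toy_inode *inode, long offset, long len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	if (len > TOY_FILE_MAX - offset)
		return -ENOSPC;
	if (offset + len > inode->size) {
		memset(inode->data + inode->size, 0,
		       (size_t)(offset + len - inode->size));
		inode->size = offset + len;
	}
	return 0;
}

/* Prefer the filesystem's own method, else use the generic one. */
long toy_do_fallocate(struct toy_inode *inode, long offset, long len)
{
	assert(inode != NULL);
	if (inode->i_op && inode->i_op->fallocate)
		return inode->i_op->fallocate(inode, offset, len);
	return toy_generic_fallocate(inode, offset, len);
}
```

Existing bytes below the old size are left untouched, which mirrors the
requirement that preallocation must not clobber data already in the
range.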

Mingming


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 22:52         ` Andrew Morton
@ 2007-03-02 18:28           ` Mingming Cao
  2007-03-05 12:27             ` Jan Kara
  0 siblings, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-03-02 18:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nscott, Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	suparna, alex, suzuki, Ulrich Drepper

Andrew Morton wrote:

> On Fri, 02 Mar 2007 09:40:54 +1100
> Nathan Scott <nscott@aconex.com> wrote:
> 
> 
>>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>
>>>On Fri, 2 Mar 2007 00:04:45 +0530
>>>"Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>>>
>>>
>>>>This is to give a heads up on few patches that we will be soon coming up
>>>>with. These patches implement a new system call sys_fallocate() and a
>>>>new inode operation "fallocate", for persistent preallocation. The new
>>>>system call, as Andrew suggested, will look like:
>>>>
>>>>  asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>>
>>>...
>>>
>>>I'd agree with Eric on the "command" flag extension.
>>
>>Seems like a separate syscall would be better, "command" sounds
>>a bit ioctl like, especially if that command is passed into the
>>filesystems..
>>
> 
> 
> madvise, fadvise, lseek, etc seem to work OK.
> 
> I get repeatedly traumatised by patch rejects whenever a new syscall gets
> added, so I'm biased.
> 
> The advantage of a command flag is that we can add new modes in the future
> without causing lots of churn, waiting for arch maintainers to catch up,
> potentially adding new compat code, etc.
> 
> Rename it to "mode"? ;)
> 
I am wondering whether it would be useful to add another mode to advise
block allocation policy? Something like indicating which physical
block/block group to allocate from (goal), and whether to ask for
strictly contiguous blocks. This would help preallocation or reservation
choose the right blocks for the file.

Right now neither the ext4 preallocation implementation nor reservation
is guaranteed to allocate/reserve contiguous extents. If the application
asked for that, the allocator could do more searching to satisfy the
requirement.

Or is fadvise the right interface?

Mingming
> I'm inclined to merge this patch nice and early, so the syscall number is
> stabilised.  Otherwise the people who are working on out-of-tree code (ie:
> ext4) will have to keep playing catchup.
> 



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-01 23:38         ` Christoph Hellwig
@ 2007-03-03 22:45           ` Arnd Bergmann
  2007-03-04 20:11             ` Anton Altaparmakov
  2007-03-05 13:18             ` Christoph Hellwig
  0 siblings, 2 replies; 340+ messages in thread
From: Arnd Bergmann @ 2007-03-03 22:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki,
	Ulrich Drepper

On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote:
> > Forgive me if I haven't put enough thought into it, but would it be
> > useful to create a generic_fallocate() that writes zeroed pages for any
> > non-existent pages in the range?  I don't know how glibc currently
> > implements posix_fallocate(), but maybe the kernel could do it more
> > efficiently, even in generic code.  Maybe we don't care, since the major
> > file systems can probably do something better in their own code.
>
> I'd be more happy to have the write out zeroes loop in glibc.  And
> glibc needs to have it anyway, for older kernels.

A generic_fallocate makes sense to me iff we can do it in the kernel
significantly more efficiently than in glibc, e.g. by using only
a single page in page cache instead of one for each page to be preallocated.

If  glibc is smart enough to do an optimal implementation, I fully agree
with you.

	Arnd <><

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-03 22:45           ` Arnd Bergmann
@ 2007-03-04 20:11             ` Anton Altaparmakov
  2007-03-04 20:53                 ` Arnd Bergmann
                                 ` (2 more replies)
  2007-03-05 13:18             ` Christoph Hellwig
  1 sibling, 3 replies; 340+ messages in thread
From: Anton Altaparmakov @ 2007-03-04 20:11 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Christoph Hellwig, Dave Kleikamp, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki, Ulrich Drepper

On 3 Mar 2007, at 22:45, Arnd Bergmann wrote:
> On Friday 02 March 2007 00:38:19 Christoph Hellwig wrote:
>>> Forgive me if I haven't put enough thought into it, but would it be
>>> useful to create a generic_fallocate() that writes zeroed pages  
>>> for any
>>> non-existent pages in the range?  I don't know how glibc currently
>>> implements posix_fallocate(), but maybe the kernel could do it more
>>> efficiently, even in generic code.  Maybe we don't care, since  
>>> the major
>>> file systems can probably do something better in their own code.
>>
>> I'd be more happy to have the write out zeroes loop in glibc.  And
>> glibc needs to have it anyway, for older kernels.
>
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be  
> preallocated.
>
> If  glibc is smart enough to do an optimal implementation, I fully  
> agree
> with you.

glibc cannot ever be smart enough because a file system driver will  
always know better and be able to do things in a much more optimized  
way.

For example on NTFS fallocate() only needs to involve the setting of  
a few bits in the volume block allocation bitmap (one bit for each  
logical block being allocated) and update the extent map in the on- 
disk inode to reflect that those blocks are now allocated to the  
inode.  Then it just needs to update the allocated size and  
optionally the data size (if fallocate wants to increase the file  
size rather than just the allocated size).  And that is it.  No  
zeroing needs to happen at all because we have not updated the  
initialized size of the inode!
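
A toy model of the three sizes may make this concrete. It is a sketch
under assumptions, not the NTFS driver: struct toy_ntfs_inode and both
functions are invented for illustration, and the "disk" is a small
array that may contain stale bytes:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAPACITY 32

/* Toy inode with the three sizes named in the message. */
struct toy_ntfs_inode {
	unsigned char blocks[TOY_CAPACITY];	/* storage, may hold stale bytes */
	size_t allocated_size;		/* bytes reserved for the file */
	size_t data_size;		/* file size as seen by stat() */
	size_t initialized_size;	/* bytes that hold valid data */
};

/* Preallocate: grow allocated (and optionally data) size; touch no data. */
int toy_fallocate(struct toy_ntfs_inode *ino, size_t new_size,
		  int grow_data_size)
{
	assert(ino != NULL);
	if (new_size > TOY_CAPACITY)
		return -1;	/* out of space in this toy model */
	if (new_size > ino->allocated_size)
		ino->allocated_size = new_size;
	if (grow_data_size && new_size > ino->data_size)
		ino->data_size = new_size;
	/* initialized_size is deliberately left alone: nothing is zeroed. */
	return 0;
}

/* Read one byte: anything at or past initialized_size reads back as 0. */
int toy_read_byte(const struct toy_ntfs_inode *ino, size_t pos)
{
	assert(ino != NULL);
	if (pos >= ino->data_size)
		return -1;	/* past end of file */
	if (pos >= ino->initialized_size)
		return 0;
	return ino->blocks[pos];
}
```

Stale contents of the newly reserved blocks are never exposed, because
reads are clamped by initialized_size rather than by data_size.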

glibc can only dream of an implementation like this.  (-;

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 20:11             ` Anton Altaparmakov
@ 2007-03-04 20:53                 ` Arnd Bergmann
  2007-03-04 22:38               ` Ulrich Drepper
  2007-03-05  4:23               ` Christoph Hellwig
  2 siblings, 0 replies; 340+ messages in thread
From: Arnd Bergmann @ 2007-03-04 20:53 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Christoph Hellwig, Dave Kleikamp, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki, Ulrich Drepper

On Sunday 04 March 2007, Anton Altaparmakov wrote:
> > A generic_fallocate makes sense to me iff we can do it in the kernel
> > significantly more efficiently than in glibc, e.g. by using only
> > a single page in page cache instead of one for each page to be  
> > preallocated.
> >
> > If  glibc is smart enough to do an optimal implementation, I fully  
> > agree
> > with you.
> 
> glibc cannot ever be smart enough because a file system driver will  
> always know better and be able to do things in a much more optimized  
> way.

Ok, that's not what I meant. It's obvious that the file system itself
can do better than both VFS and glibc. The question is whether VFS can
be better than glibc on file systems that don't offer their own
implementation of the fallocate operation.

	Arnd <><

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 20:11             ` Anton Altaparmakov
  2007-03-04 20:53                 ` Arnd Bergmann
@ 2007-03-04 22:38               ` Ulrich Drepper
  2007-03-04 23:22                 ` Anton Altaparmakov
  2007-03-05  0:16                   ` Jörn Engel
  2007-03-05  4:23               ` Christoph Hellwig
  2 siblings, 2 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-04 22:38 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

Anton Altaparmakov wrote:
> And that is it.  No zeroing needs to happen at all because we
> have not updated the initialized size of the inode!

When you do it like this, how can the kernel/filesystem *guarantee* that
when the data is written there actually is room on the harddrive?

What you described seems like using truncate/ftruncate to increase the
file's size.  That is not at all what posix_fallocate is for.
posix_fallocate must make sure that the requested blocks on the disk are
reserved (allocated) for the file's use and that at no point in the
future will, say, a msync() fail because a mmap(MAP_SHARED) page has
been written to.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 22:38               ` Ulrich Drepper
@ 2007-03-04 23:22                 ` Anton Altaparmakov
  2007-03-05 14:37                   ` Theodore Tso
  2007-03-05  0:16                   ` Jörn Engel
  1 sibling, 1 reply; 340+ messages in thread
From: Anton Altaparmakov @ 2007-03-04 23:22 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

Hi,

On 4 Mar 2007, at 22:38, Ulrich Drepper wrote:
> Anton Altaparmakov wrote:
>> And that is it.  No zeroing needs to happen at all because we
>> have not updated the initialized size of the inode!
>
> When you do it like this, how can the kernel/filesystem *guarantee*
> that
> when the data is written there actually is room on the harddrive?

The blocks are allocated so of course it is guaranteed.  Subsequent  
writes to this file will not generate any allocations thus  
allocations cannot fail.  (-:

> What you described seems like using truncate/ftruncate to increase the
> file's size.  That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the  
> disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.

No that is different.  I described performing the allocations in the  
volume bitmap, i.e. for each allocated block the corresponding "in  
use" bit is set in the bitmap (NTFS uses a linear bitmap where byte  
0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical  
block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...).
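
The bit layout described here (byte 0, bit 0 is block 0; byte 1, bit 0
is block 8) amounts to the usual divide-by-8 indexing. A minimal sketch,
with helper names chosen for this example rather than taken from the
NTFS driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mark `block` as in use: byte index is block / 8, bit index is block % 8. */
void bitmap_set(uint8_t *bitmap, size_t block)
{
	assert(bitmap != NULL);
	bitmap[block >> 3] |= (uint8_t)(1u << (block & 7));
}

/* Nonzero when `block` is marked as in use. */
int bitmap_test(const uint8_t *bitmap, size_t block)
{
	assert(bitmap != NULL);
	return (bitmap[block >> 3] >> (block & 7)) & 1;
}
```

The caller is responsible for sizing the bitmap so that block / 8 stays
in bounds.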

Also I described updating the extent map of the inode such that it  
describes the physical blocks as belonging to the file, thus you  
would have "logical file block X corresponds to physical block Y on  
volume" entries entered into the extent map of the inode and they  
would describe the just allocated blocks.

Finally I described updating the allocated size in the inode which  
basically says "there are that many bytes worth of blocks allocated  
to this inode".

And optionally I described updating the data size in the inode which  
basically says "this file has size Z bytes".

And I specifically did NOT update the initialized size in the inode  
thus it will remain at its old value thus all new allocated blocks  
will be considered as present but not initialized thus a read will  
always return zero whilst a write will do the right thing and pad  
with zeroes as necessary (if the write is smaller than the block  
size, etc).

Note that you are right that this is like truncate in NTFS for non- 
sparse enabled inodes/volumes.

But for sparse ones, instead of doing any allocations in the bitmap  
and entering them in the extent map, you would simply add a single  
entry to the extent map that says "X blocks allocated starting at  
logical block Y corresponding to no physical blocks, i.e. they are  
sparse".  You would then also update the allocated size and data size  
as above and now you can even (but do not have to) update the  
initialized size to be equal to the data size as the file can be  
considered fully initialized because it is sparse.  As an  
implementation detail this truncate operation would not modify the  
compressed size of the inode (i.e. the really used on-disk space,  
i.e. what you get from running "du" as that does not change when you  
add sparse blocks) whilst the fallocate described above would update  
the compressed size (if the file is sparse or compressed - there is  
no compressed size in the inode if the inode is not sparse/ 
compressed) because the file now occupies more blocks on disk even if  
they are actually not initialized.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 22:38               ` Ulrich Drepper
@ 2007-03-05  0:16                   ` Jörn Engel
  2007-03-05  0:16                   ` Jörn Engel
  1 sibling, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-05  0:16 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Anton Altaparmakov, Arnd Bergmann, Christoph Hellwig,
	Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
> 
> When you do it like this, how can the kernel/filesystem *guarantee* that
> when the data is written there actually is room on the harddrive?
> 
> What you described seems like using truncate/ftruncate to increase the
> file's size.  That is not at all what posix_fallocate is for.
> posix_fallocate must make sure that the requested blocks on the disk are
> reserved (allocated) for the file's use and that at no point in the
> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
> been written to.

That actually causes an interesting problem for compressing filesystems.
The space consumed by blocks depends on their contents and how well it
compresses.  At the moment, the only option I see to support
posix_fallocate for LogFS is to set an inode flag disabling compression,
then allocate the blocks.

But if the file already contains large amounts of compressed data, I
have a problem.  Disabling compression for a range within a file is not
supported, so I can only return an error.  But which one?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:16                   ` Jörn Engel
  (?)
@ 2007-03-05  0:32                   ` Anton Altaparmakov
  2007-03-05  0:35                     ` Anton Altaparmakov
                                       ` (2 more replies)
  -1 siblings, 3 replies; 340+ messages in thread
From: Anton Altaparmakov @ 2007-03-05  0:32 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ulrich Drepper, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki


On 5 Mar 2007, at 00:16, Jörn Engel wrote:

> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>>
>> When you do it like this, how can the kernel/filesystem
>> *guarantee* that
>> when the data is written there actually is room on the harddrive?
>>
>> What you described seems like using truncate/ftruncate to increase  
>> the
>> file's size.  That is not at all what posix_fallocate is for.
>> posix_fallocate must make sure that the requested blocks on the  
>> disk are
>> reserved (allocated) for the file's use and that at no point in the
>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>> been written to.
>
> That actually causes an interesting problem for compressing  
> filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling  
> compression,
> then allocate the blocks.
>
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is  
> not
> supported, so I can only return an error.  But which one?

I don't know how your compression algorithm works but at least on  
NTFS that bit is easy: you allocate the blocks and mark them as  
allocated then the compression engine will write non-compressed data  
to those blocks.  Basically it works like this "does compression  
block X have any sparse blocks?". If the answer is "yes" the block is  
treated as compressed data and if the answer is "no" the block is  
treated as uncompressed data.  This means that if the data cannot be  
compressed (and in some cases if the data compressed is bigger than  
the data uncompressed) the data is stored non-compressed.  That is  
the most space efficient method to do things.
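
The decision rule as stated can be written down in a few lines. This is
a sketch of the rule only, assuming a per-cluster array of physical
block numbers; TOY_SPARSE and toy_cb_is_compressed() are names invented
for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SPARSE (-1L)	/* marks a cluster with no physical block */

/*
 * `phys` holds the physical block number for each cluster of one
 * compression block, or TOY_SPARSE where no block is allocated.
 * Per the rule in the message: any sparse cluster means the block
 * is treated as compressed; a fully allocated block is treated as
 * uncompressed data.
 */
bool toy_cb_is_compressed(const long *phys, size_t nclusters)
{
	assert(phys != NULL);
	for (size_t i = 0; i < nclusters; i++) {
		if (phys[i] == TOY_SPARSE)
			return true;
	}
	return false;
}
```

A compression block that fallocate has fully backed with real blocks
therefore falls into the "uncompressed" case automatically.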

An alternative would be to allocate blocks and then when the data is  
written perform the compression and free any blocks you do not need  
any more because the data has shrunk sufficiently.  Depending on the  
implementation details this could potentially create horrible  
fragmentation as you would allocate a large consecutive region and  
then go and drop random blocks from that region thus making the file  
fragmented.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:32                   ` Anton Altaparmakov
@ 2007-03-05  0:35                     ` Anton Altaparmakov
  2007-03-05  0:44                     ` Arnd Bergmann
  2007-03-05 11:49                       ` Jörn Engel
  2 siblings, 0 replies; 340+ messages in thread
From: Anton Altaparmakov @ 2007-03-05  0:35 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Jörn Engel, Ulrich Drepper, Arnd Bergmann,
	Christoph Hellwig, Dave Kleikamp, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki

On 5 Mar 2007, at 00:32, Anton Altaparmakov wrote:
> On 5 Mar 2007, at 00:16, Jörn Engel wrote:
>> On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
>>>
>>> When you do it like this, how can the kernel/filesystem
>>> *guarantee* that
>>> when the data is written there actually is room on the harddrive?
>>>
>>> What you described seems like using truncate/ftruncate to  
>>> increase the
>>> file's size.  That is not at all what posix_fallocate is for.
>>> posix_fallocate must make sure that the requested blocks on the  
>>> disk are
>>> reserved (allocated) for the file's use and that at no point in the
>>> future will, say, a msync() fail because a mmap(MAP_SHARED) page has
>>> been written to.
>>
>> That actually causes an interesting problem for compressing  
>> filesystems.
>> The space consumed by blocks depends on their contents and how  
>> well it
>> compresses.  At the moment, the only option I see to support
>> posix_fallocate for LogFS is to set an inode flag disabling  
>> compression,
>> then allocate the blocks.
>>
>> But if the file already contains large amounts of compressed data, I
>> have a problem.  Disabling compression for a range within a file  
>> is not
>> supported, so I can only return an error.  But which one?
>
> I don't know how your compression algorithm works but at least on  
> NTFS that bit is easy: you allocate the blocks and mark them as  
> allocated then the compression engine will write non-compressed  
> data to those blocks.  Basically it works like this "does  
> compression block X have any sparse blocks?". If the answer is  
> "yes" the block is treated as compressed data and if the answer is  
> "no" the block is treated as uncompressed data.  This means that if  
> the data cannot be compressed (and in some cases if the data  
> compressed is bigger than the data uncompressed) the data is stored  
> non-compressed.  That is the most space efficient method to do things.
>
> An alternative would be to allocate blocks and then when the data  
> is written perform the compression and free any blocks you do not  
> need any more because the data has shrunk sufficiently.  Depending  
> on the implementation details this could potentially create  
> horrible fragmentation as you would allocate a large consecutive  
> region and then go and drop random blocks from that region thus  
> making the file fragmented.

And another thing you could do (best if you support journalling)  
would be to do the allocation and hang the details off the inode on a  
"preallocation list" of some kind and then as the data gets written  
use blocks from the preallocation list as you go along.  This would  
avoid the fragmentation issue for example.  You could then free the  
surplus blocks when the whole range of the file being covered by the  
preallocation list has been written to and/or when the file is closed  
for the last time (drop_inode/delete_inode).

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:16                   ` Jörn Engel
  (?)
  (?)
@ 2007-03-05  0:36                   ` Arnd Bergmann
  2007-03-05 11:41                     ` Jörn Engel
  -1 siblings, 1 reply; 340+ messages in thread
From: Arnd Bergmann @ 2007-03-05  0:36 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ulrich Drepper, Anton Altaparmakov, Christoph Hellwig,
	Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

On Monday 05 March 2007, Jörn Engel wrote:
> That actually causes an interesting problem for compressing filesystems.
> The space consumed by blocks depends on their contents and how well it
> compresses.  At the moment, the only option I see to support
> posix_fallocate for LogFS is to set an inode flag disabling compression,
> then allocate the blocks.
> 
> But if the file already contains large amounts of compressed data, I
> have a problem.  Disabling compression for a range within a file is not
> supported, so I can only return an error.  But which one?

Using the current glibc implementation on a compressed file system ideally
should be a very expensive no-op because you won't actually allocate much
space for a file when writing zeroes to it. You also don't benefit from a
contiguous allocation in logfs, since flash has uniform seek times over
all the medium.

I'd suggest you implement posix_fallocate as a real nop and just return
success without doing anything. You could also return ENOSPC in case
the blocks requested by posix_fallocate don't fit on the medium without
compression, but that is more or less just guesswork (like statfs is).

	Arnd <><

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:32                   ` Anton Altaparmakov
  2007-03-05  0:35                     ` Anton Altaparmakov
@ 2007-03-05  0:44                     ` Arnd Bergmann
  2007-03-05 11:49                       ` Jörn Engel
  2 siblings, 0 replies; 340+ messages in thread
From: Arnd Bergmann @ 2007-03-05  0:44 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Jörn Engel, Ulrich Drepper, Christoph Hellwig,
	Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

On Monday 05 March 2007, Anton Altaparmakov wrote:
> An alternative would be to allocate blocks and then when the data is  
> written perform the compression and free any blocks you do not need  
> any more because the data has shrunk sufficiently.  Depending on the  
> implementation details this could potentially create horrible  
> fragmentation as you would allocate a large consecutive region and  
> then go and drop random blocks from that region thus making the file  
> fragmented.

Unfortunately, this is not as easy on logfs, because there is no point
in allocating a block when there is no data to write into it. Fragmentation
on flash media is free, but you can never modify a block in place without
erasing it first. This means it will always be written to a new location
on the next write access.

One option that might work (similar to what you describe in your other mail)
is to have a per-inode count of reserved blocks, without allocating specific
blocks for them. The journal then needs to maintain the number of total
reserved blocks for all files and keep that in sync with blocks that were
reserved for specific inodes.
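
A minimal sketch of that reservation-count idea, assuming a purely
in-memory model: the struct and function names below are hypothetical and
are not taken from logfs, and journalling, garbage collection and rewrites
of already-written blocks are ignored. It only shows how a per-inode count
and a filesystem-wide total could be kept in sync.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical filesystem-wide accounting (not a logfs structure). */
struct fs_space {
	uint64_t free_blocks;    /* blocks still writable on the medium */
	uint64_t reserved_total; /* sum of all per-inode reservations */
};

/* Hypothetical per-inode accounting. */
struct inode_resv {
	uint64_t reserved;       /* blocks promised to this inode */
};

/* Reserve n blocks for an inode without choosing specific blocks. */
int resv_claim(struct fs_space *fs, struct inode_resv *ino, uint64_t n)
{
	/* Invariant: reservations never exceed the remaining space. */
	assert(fs->reserved_total <= fs->free_blocks);

	if (n > fs->free_blocks - fs->reserved_total)
		return -ENOSPC;
	ino->reserved += n;
	fs->reserved_total += n;
	return 0;
}

/* Account for one block appended to the log on behalf of an inode. */
int resv_write_block(struct fs_space *fs, struct inode_resv *ino)
{
	if (ino->reserved > 0) {
		/* Consume the inode's reservation and the global total together. */
		ino->reserved--;
		fs->reserved_total--;
	} else if (fs->free_blocks == fs->reserved_total) {
		/* Unreserved write, and every free block is promised elsewhere. */
		return -ENOSPC;
	}
	fs->free_blocks--;
	return 0;
}
```

In this model an unreserved writer gets ENOSPC as soon as the only
remaining space is space promised to other inodes, which is the guarantee
the reservation is meant to provide.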

	Arnd <><

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 20:11             ` Anton Altaparmakov
  2007-03-04 20:53                 ` Arnd Bergmann
  2007-03-04 22:38               ` Ulrich Drepper
@ 2007-03-05  4:23               ` Christoph Hellwig
  2 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-05  4:23 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki, Ulrich Drepper

On Sun, Mar 04, 2007 at 08:11:17PM +0000, Anton Altaparmakov wrote:
> glibc cannot ever be smart enough because a file system driver will  
> always know better and be able to do things in a much more optimized  
> way.

Please read the thread again.  That is not what anyone proposed.
The issue we're discussing is whether fallback for a filesystem that
does not support preallocation natively should be done in kernelspace
or in userspace.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:36                   ` Arnd Bergmann
@ 2007-03-05 11:41                     ` Jörn Engel
  2007-03-05 15:08                       ` Ulrich Drepper
  2007-03-05 22:00                       ` Eric Sandeen
  0 siblings, 2 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-05 11:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ulrich Drepper, Anton Altaparmakov, Christoph Hellwig,
	Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
> 
> Using the current glibc implementation on a compressed file system ideally
> should be a very expensive no-op because you won't actually allocate much
> space for a file when writing zeroes to it. You also don't benefit from a
> contiguous allocation in logfs, since flash has uniform seek times over
> all the medium.
> 
> I'd suggest you implement posix_fallocate as a real nop and just return
> success without doing anything. You could also return ENOSPC in case
> the blocks requested by posix_fallocate don't fit on the medium without
> compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):
       The function posix_fallocate() ensures that disk space is allocated for
       the file referred to by the descriptor fd for the bytes  in  the range
       starting  at  offset  and continuing for len bytes.  After a successful
       call to posix_fallocate(), subsequent writes to bytes in the specified
       range are guaranteed not to fail because of lack of disk space.

       If  the  size  of  the  file  is less than offset+len, then the file is
       increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease
fragmentation but to ensure mmap() won't cause any problems because the
medium fills up.  That problem exists for LogFS as well, once rw mmap()
is supported.
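
For reference, a small sketch of how an application would use the call
described in the man page excerpt above. The helper name and the idea of
returning the resulting file size are assumptions made for illustration;
posix_fallocate() itself returns an error number directly instead of
setting errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file, preallocate len bytes,
 * and report the file size afterwards. Returns -1 on any failure.
 * The scratch file is removed again before returning.
 */
off_t prealloc_and_size(const char *path, off_t len)
{
	struct stat st;
	off_t size = -1;
	int fd, err;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* On success, later writes into [0, len) must not fail with ENOSPC. */
	err = posix_fallocate(fd, 0, len);
	if (err == 0 && fstat(fd, &st) == 0)
		size = st.st_size; /* the file was extended to offset+len */

	close(fd);
	unlink(path);
	return size;
}
```

The size check only demonstrates the second paragraph of the excerpt (the
file grows to offset+len); the no-ENOSPC guarantee itself cannot be
observed from a test this small.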

Simply returning success without doing anything would be a bug.  -ENOSPC
is a better choice, but still a lame implementation.  And falling back
on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond lifetime of the file
descriptor?  It would be fairly simple to support the write guarantee
while the file is open (or rather the inode remains cached) and drop it
afterwards.

Jörn

-- 
"[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05  0:32                   ` Anton Altaparmakov
@ 2007-03-05 11:49                       ` Jörn Engel
  2007-03-05  0:44                     ` Arnd Bergmann
  2007-03-05 11:49                       ` Jörn Engel
  2 siblings, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-05 11:49 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Ulrich Drepper, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

On Mon, 5 March 2007 00:32:14 +0000, Anton Altaparmakov wrote:
> 
> I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much
about reducing disk seeks.  It is log-structured, which simplifies
compression further.

When writing a block, it basically compresses it and appends it to the
log.  Writes only have to be byte-aligned, so no space is lost for
padding.

The bad news for posix_fallocate() is that even if libc is smart enough
to write random data, mmap() can still cause problems.  If the VM
decides to write a given page twice, the second write compresses better
and the medium has filled up between the two writes, the users will have
fun.

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-02 18:28           ` Mingming Cao
@ 2007-03-05 12:27             ` Jan Kara
  2007-03-05 20:02               ` Mingming Cao
  2007-03-05 21:41               ` Eric Sandeen
  0 siblings, 2 replies; 340+ messages in thread
From: Jan Kara @ 2007-03-05 12:27 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Andrew Morton, nscott, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, alex, suzuki, Ulrich Drepper

> >On Fri, 02 Mar 2007 09:40:54 +1100
> >Nathan Scott <nscott@aconex.com> wrote:
> >
> >
> >>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
> >>
> >>>On Fri, 2 Mar 2007 00:04:45 +0530
> >>>"Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> >>>
> >>>
> >>>>This is to give a heads up on few patches that we will be soon coming up
> >>>>with. These patches implement a new system call sys_fallocate() and a
> >>>>new inode operation "fallocate", for persistent preallocation. The new
> >>>>system call, as Andrew suggested, will look like:
> >>>>
> >>>> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
> >>>
> >>>...
> >>>
> >>>I'd agree with Eric on the "command" flag extension.
> >>
> >>Seems like a separate syscall would be better, "command" sounds
> >>a bit ioctl like, especially if that command is passed into the
> >>filesystems..
> >>
> >
> >
> >madvise, fadvise, lseek, etc seem to work OK.
> >
> >I get repeatedly traumatised by patch rejects whenever a new syscall gets
> >added, so I'm biased.
> >
> >The advantage of a command flag is that we can add new modes in the future
> >without causing lots of churn, waiting for arch maintainers to catch up,
> >potentially adding new compat code, etc.
> >
> >Rename it to "mode"? ;)
> >
> I am wondering if it is useful to add another mode to advise block 
> allocation policy? Something like indicating which physical block/block 
> group to allocate from (goal), and whether to ask for strictly contiguous 
> blocks. This will help preallocation or reservation to choose the right 
> blocks for the file.
  Yes, I also think this would be useful so you can "guide"
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-03 22:45           ` Arnd Bergmann
  2007-03-04 20:11             ` Anton Altaparmakov
@ 2007-03-05 13:18             ` Christoph Hellwig
  1 sibling, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-05 13:18 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Christoph Hellwig, Dave Kleikamp, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki, Ulrich Drepper

On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
> > I'd be more happy to have the write out zeroes loop in glibc.  And
> > glibc needs to have it anyway, for older kernels.
> 
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be preallocated.

We can't do that with the current page cache interfaces.  But what
might make sense is to have a block_dump_prealloc that takes a get_block
callback to do what you propose.  It still wouldn't be entirely generic,
but would allow block based filesystems to do a not entirely dumb
implementation.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-04 23:22                 ` Anton Altaparmakov
@ 2007-03-05 14:37                   ` Theodore Tso
  2007-03-05 15:07                     ` Anton Altaparmakov
  2007-03-05 15:15                     ` Ulrich Drepper
  0 siblings, 2 replies; 340+ messages in thread
From: Theodore Tso @ 2007-03-05 14:37 UTC (permalink / raw)
  To: Anton Altaparmakov
  Cc: Ulrich Drepper, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

On Sun, Mar 04, 2007 at 11:22:06PM +0000, Anton Altaparmakov wrote:
> And I specifically did NOT update the initialized size in the inode  
> thus it will remain at its old value thus all new allocated blocks  
> will be considered as present but not initialized thus a read will  
> always return zero whilst a write will do the right thing and pad  
> with zeroes as necessary (if the write is smaller than the block  
> size, etc).

Anton,

	You're describing a method of doing in-advance preallocation
where the filesystem format explicitly has support for this kind of
feature in a way that doesn't require pre-zeroing the data blocks in
question.

	The question which this subthread was concerned about was
whether the kernel should get involved in initializing datablocks in
the case where the filesystem format does not have this support, or
whether this functionality should continue to be done in userspace.
Given that glibc already has to support this for older kernels, I
would argue that there's no point putting in generic support for
filesystems that can't support a more advanced way of doing things.

	Regards,

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 14:37                   ` Theodore Tso
@ 2007-03-05 15:07                     ` Anton Altaparmakov
  2007-03-05 15:15                     ` Ulrich Drepper
  1 sibling, 0 replies; 340+ messages in thread
From: Anton Altaparmakov @ 2007-03-05 15:07 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ulrich Drepper, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

On 5 Mar 2007, at 14:37, Theodore Tso wrote:
> On Sun, Mar 04, 2007 at 11:22:06PM +0000, Anton Altaparmakov wrote:
>> And I specifically did NOT update the initialized size in the inode
>> thus it will remain at its old value thus all new allocated blocks
>> will be considered as present but not initialized thus a read will
>> always return zero whilst a write will do the right thing and pad
>> with zeroes as necessary (if the write is smaller than the block
>> size, etc).
>
> 	You're describing a method of doing in-advance preallocation
> where the filesystem format explicitly has support for this kind of
> feature in a way that doesn't require pre-zeroing the data blocks in
> question.

Indeed.

> 	The question which this subthread was concerned about was
> whether the kernel should get involved in initializing datablocks in
> the case where the filesystem format does not have this support, or
> whether this functionality should continue to be done in userspace.
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystems that can't support a more advanced way of doing things.

Yes, I understood that after I had sent my post...  And yes, I would  
agree.  If glibc already does this there does not appear to be any  
value in just moving existing functionality into the kernel.  Simply  
let "dumb" file systems return ENOSYS and let glibc do it...  And any  
FS which can do it better can implement the function and then glibc  
should not go anywhere near it.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 11:41                     ` Jörn Engel
@ 2007-03-05 15:08                       ` Ulrich Drepper
  2007-03-05 15:33                           ` Jörn Engel
  2007-03-05 22:00                       ` Eric Sandeen
  1 sibling, 1 reply; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-05 15:08 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

[-- Attachment #1: Type: text/plain, Size: 431 bytes --]

Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?

Of course.  You call posix_fallocate once for the lifetime of the file
when it is created to ensure that all future uses will work.

It seems your filesystem will not be able to support this unless
compression is turned off.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 11:49                       ` Jörn Engel
  (?)
@ 2007-03-05 15:09                       ` Ulrich Drepper
  -1 siblings, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-05 15:09 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

Jörn Engel wrote:
> The bad news for posix_fallocate() is that even if libc is smart enough
> to write random data, mmap() can still cause problems.

This is not smart, quite to the contrary.  The standard guarantees that
all not-yet-written-to places in the file are zero.  And if a block has
already been written posix_fallocate cannot change it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 14:37                   ` Theodore Tso
  2007-03-05 15:07                     ` Anton Altaparmakov
@ 2007-03-05 15:15                     ` Ulrich Drepper
  2007-03-05 15:35                       ` Christoph Hellwig
  2007-03-05 16:01                       ` Theodore Tso
  1 sibling, 2 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-05 15:15 UTC (permalink / raw)
  To: Theodore Tso, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

[-- Attachment #1: Type: text/plain, Size: 966 bytes --]

Theodore Tso wrote:
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystems that can't support a more advanced way of doing things.

Well, I'm sure the kernel can do better than the code we have in libc
now.  The kernel has access to the bitmasks which say which blocks have
already been allocated.  The libc code does not and we have to be very
simple-minded and simply touch every block.  And this means reading it
and then writing it back.  The kernel would know when the reading part
is not necessary.  Add to that the block granularity (we use f_bsize as
returned from fstatfs but that's not the best value in some cases) and
you have compelling data to have generic code in the kernel.  The libc
implementation can then go away completely which is a good thing.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 15:08                       ` Ulrich Drepper
@ 2007-03-05 15:33                           ` Jörn Engel
  0 siblings, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-05 15:33 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

On Mon, 5 March 2007 07:08:03 -0800, Ulrich Drepper wrote:
> Jörn Engel wrote:
> > Does the allocation have to be persistent beyond lifetime of the file
> > descriptor?
> 
> Of course.  You call posix_fallocate once for the lifetime of the file
> when it is created to ensure that all future uses will work.

That part is not quite clear from the manpage but I trust most people
would assume the same.

> It seems your filesystem will not be able to support this unless
> compression is turned off.

Correct.  Compression needs to be turned off for a file, if
posix_fallocate(3) is to succeed.  What I could do is disable
compression (meaning that no data written in the future will be
compressed) and rewrite all blocks within the given range.

Still, it is quite obvious that no one designing this interface has given
much thought to compressing filesystems.  Whatever I can come up with
will either be incompatible or some sort of hack.  :(

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 15:15                     ` Ulrich Drepper
@ 2007-03-05 15:35                       ` Christoph Hellwig
  2007-03-05 16:01                       ` Theodore Tso
  1 sibling, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-05 15:35 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Theodore Tso, Arnd Bergmann, Christoph Hellwig, Dave Kleikamp,
	Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, alex, suzuki

On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Theodore Tso wrote:
> > Given that glibc already has to support this for older kernels, I
> > would argue that there's no point putting in generic support for
> > filesystems that can't support a more advanced way of doing things.
> 
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.

The layer of the kernel where a totally generic fallback would be
implemented does not have access to this information.  We could do
a mostly generic helper for block filesystems that allows them to
implement fallocate this way without a lot of their own code.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 15:33                           ` Jörn Engel
  (?)
@ 2007-03-05 15:48                           ` Ulrich Drepper
  -1 siblings, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-05 15:48 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

[-- Attachment #1: Type: text/plain, Size: 1148 bytes --]

Jörn Engel wrote:
>> Of course.  You call posix_fallocate once for the lifetime of the file
>> when it is created to ensure that all future uses will work.
> 
> That part is not quite clear from the manpage but I trust most people
> would assume the same.

Not only that, it is what this function is for.  In the POSIX committee
we've looked at the functions in detail before adding them, even if some
information is not in the man page but instead in the Rationale.


> Still, it is quite obvious that no one designing this interface has given
> much thought to compressing filesystems.

You already have problems with supporting the functionality
posix_fallocate is supporting.  You cannot reliably support MAP_SHARED
files if all of a sudden the compression causes an expansion of a block
and that causes an ENOSPC error.  So, don't expect pity.  This is a
function in support of a real and reliable implementation of memory
mapped files.  You don't use MAP_SHARED on such filesystems, it'll eat
your kittens sooner or later anyway.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 15:15                     ` Ulrich Drepper
  2007-03-05 15:35                       ` Christoph Hellwig
@ 2007-03-05 16:01                       ` Theodore Tso
  2007-03-05 16:07                         ` Ulrich Drepper
  1 sibling, 1 reply; 340+ messages in thread
From: Theodore Tso @ 2007-03-05 16:01 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Arnd Bergmann, Christoph Hellwig, Dave Kleikamp, Andrew Morton,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, alex, suzuki

On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.  The libc code does not and we have to be very
> simple-minded and simply touch every block.  And this means reading it
> and then writing it back.  The kernel would know when the reading part
> is not necessary.  Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel.  The libc
> implementation can then go away completely which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache.  So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.
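
To make the cost concrete, here is a rough sketch of the kind of userspace
emulation being discussed: with no way to ask which blocks are already
allocated, it has to read and rewrite one byte in every block of the range.
This is not glibc's actual code; the function names, the blksize parameter
and the error handling are assumptions made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Rewrite one byte at pos: existing data is preserved, and a zero is
 * written where pos lies beyond the old end of file. */
static int touch_byte(int fd, off_t pos, off_t old_size)
{
	char c = 0;
	ssize_t n;

	if (pos < old_size && pread(fd, &c, 1, pos) < 0)
		return errno;
	n = pwrite(fd, &c, 1, pos);
	if (n < 0)
		return errno;
	return n == 1 ? 0 : EIO;
}

/* Hypothetical emulation: touch every block in [offset, offset+len).
 * Returns 0 or an error number, like posix_fallocate(). */
int emulate_fallocate(int fd, off_t offset, off_t len, off_t blksize)
{
	struct stat st;
	off_t end = offset + len;
	int err;

	assert(fd >= 0);
	if (offset < 0 || len <= 0 || blksize <= 0)
		return EINVAL;
	if (fstat(fd, &st) != 0)
		return errno;

	for (off_t pos = offset; pos < end; pos += blksize) {
		err = touch_byte(fd, pos, st.st_size);
		if (err)
			return err;
	}
	/* Cover the final block and extend the file to offset+len. */
	return touch_byte(fd, end - 1, st.st_size);
}

/* Example helper: create (or truncate) path and preallocate len bytes.
 * Returns an open descriptor, or -1 with errno set. */
int create_preallocated(const char *path, off_t len, off_t blksize)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	int err;

	if (fd < 0)
		return -1;
	err = emulate_fallocate(fd, 0, len, blksize);
	if (err) {
		close(fd);
		errno = err;
		return -1;
	}
	return fd;
}
```

Every touched block passes through the page cache even when it was already
allocated, which is the churn described above; code with access to the
allocation state could skip those blocks entirely.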

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 16:01                       ` Theodore Tso
@ 2007-03-05 16:07                         ` Ulrich Drepper
  0 siblings, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-05 16:07 UTC (permalink / raw)
  To: Theodore Tso, Ulrich Drepper, Arnd Bergmann, Christoph Hellwig,
	Dave Kleikamp, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, alex, suzuki

Theodore Tso wrote:
> [...] although the libc
> implementation still wouldn't be able to go away for long time due to
> the need to be backwards compatible with older kernels that didn't
> have this support.

It's better than that.  If somebody compiles glibc to not run on older
kernels at all (tested at runtime) then the code is dropped.  E.g., the
current Fedora glibc does not support 2.6.8 or earlier.

So, don't let the compat code be a factor in the decision making.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 12:27             ` Jan Kara
@ 2007-03-05 20:02               ` Mingming Cao
  2007-03-06  7:28                 ` Christoph Hellwig
  2007-03-05 21:41               ` Eric Sandeen
  1 sibling, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-03-05 20:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, nscott, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, alex, suzuki, Ulrich Drepper

Jan Kara wrote:
>>>On Fri, 02 Mar 2007 09:40:54 +1100
>>>Nathan Scott <nscott@aconex.com> wrote:
>>>
>>>
>>>
>>>>On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:
>>>>
>>>>
>>>>>On Fri, 2 Mar 2007 00:04:45 +0530
>>>>>"Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>This is to give a heads up on few patches that we will be soon coming up
>>>>>>with. These patches implement a new system call sys_fallocate() and a
>>>>>>new inode operation "fallocate", for persistent preallocation. The new
>>>>>>system call, as Andrew suggested, will look like:
>>>>>>
>>>>>>asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);
>>>>>
>>>>>...
>>>>>
>>>>>I'd agree with Eric on the "command" flag extension.
>>>>
>>>>Seems like a separate syscall would be better, "command" sounds
>>>>a bit ioctl like, especially if that command is passed into the
>>>>filesystems..
>>>>
>>>
>>>
>>>madvise, fadvise, lseek, etc seem to work OK.
>>>
>>>I get repeatedly traumatised by patch rejects whenever a new syscall gets
>>>added, so I'm biased.
>>>
>>>The advantage of a command flag is that we can add new modes in the future
>>>without causing lots of churn, waiting for arch maintainers to catch up,
>>>potentially adding new compat code, etc.
>>>
>>>Rename it to "mode"? ;)
>>>
>>
>>I am wondering if it is useful to add another mode to advise block 
>>allocation policy? Something like indicating which physical block/block 
>>group to allocate from (goal), and whether ask for strict contigous 
>>blocks. This will help preallocation or reservation to choose the right 
>>blocks for the file.
> 
>   Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).
> 
> 									Honza
Yep, I think it makes sense to use preallocation for defragmentation.
After all, both preallocation and defragmentation call the underlying
filesystem's multiple-block allocator to try to allocate a chunk of
contiguous blocks on disk. The ext4 online defrag implementation by
Takashi already supports choosing a "goal" allocation block to guide the
ext4 block allocator to place the defragmented file in a specific location.

Passing a few more hints to sys_fallocate() (i.e., the goal block,
and/or whether the goal block is more important than the size of the
prealloc extent) might make it more useful both for the original goal
(getting contiguous and guaranteed blocks) and for defragmentation.

Regards,
Mingming

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 12:27             ` Jan Kara
  2007-03-05 20:02               ` Mingming Cao
@ 2007-03-05 21:41               ` Eric Sandeen
  1 sibling, 0 replies; 340+ messages in thread
From: Eric Sandeen @ 2007-03-05 21:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mingming Cao, Andrew Morton, nscott, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, alex, suzuki,
	Ulrich Drepper

Jan Kara wrote:

>> I am wondering if it is useful to add another mode to advise block 
>> allocation policy? Something like indicating which physical block/block 
>> group to allocate from (goal), and whether ask for strict contigous 
>> blocks. This will help preallocation or reservation to choose the right 
>> blocks for the file.
>   Yes, I also think this would be useful so you can "guide"
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).

Hints & policies for allocation would certainly be useful, but I think
they belong outside this interface.  i.e. you could flag an inode for
whatever allocation you choose, and -then- call posix_fallocate so that
the allocator will take the hints you've given it.

See also this blurb from the posix_fallocate definition:

"It is implementation-defined whether a previous posix_fadvise() call
influences allocation strategy."
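
A minimal sketch of that ordering, using only the existing POSIX calls:
give the advice first, then preallocate. The helper name
advise_then_preallocate() and the choice of POSIX_FADV_SEQUENTIAL are
assumptions for illustration; as the blurb says, whether the advice
changes where blocks end up is implementation-defined, so this only
shows the calling convention, not a guaranteed effect.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: hint first, allocate second. Both calls return
 * 0 on success or an error number (they do not set errno).
 */
static int advise_then_preallocate(int fd, off_t offset, off_t len)
{
	/* The advice may or may not influence the allocator. */
	int err = posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);

	if (err != 0)
		return err;

	/* The allocation itself carries no placement arguments. */
	return posix_fallocate(fd, offset, len);
}
```

The point of the split is that the allocation call stays simple and any
policy travels through a separate, earlier call on the same fd.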

FWIW I don't see a lot of point in asking for "strict contiguous blocks"
- the allocator will presumably try to do this in any case, and I'm not
sure when you would want to fail if you get more than one extent...?

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 11:41                     ` Jörn Engel
  2007-03-05 15:08                       ` Ulrich Drepper
@ 2007-03-05 22:00                       ` Eric Sandeen
  1 sibling, 0 replies; 340+ messages in thread
From: Eric Sandeen @ 2007-03-05 22:00 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Arnd Bergmann, Ulrich Drepper, Anton Altaparmakov,
	Christoph Hellwig, Dave Kleikamp, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, alex,
	suzuki

Jörn Engel wrote:
> Does the allocation have to be persistent beyond lifetime of the file
> descriptor?  It would be fairly simple to support the write guarantee
> while the file is open (or rather the inode remains cached) and drop it
> afterwards.

"The posix_fallocate() function shall ensure that any required storage
for regular file data starting at offset and continuing for len bytes is
allocated on the file system storage media."

I interpret "on the storage media" to mean that it is persistent.

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-05 20:02               ` Mingming Cao
@ 2007-03-06  7:28                 ` Christoph Hellwig
  2007-03-06 14:36                   ` Ulrich Drepper
  0 siblings, 1 reply; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-06  7:28 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Jan Kara, Andrew Morton, nscott, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, alex, suzuki, Ulrich Drepper

On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all both preallocation and defragmentation shall call underlying 
> filesystem multiple block allocator to try to allocate a chunk of 
> contiguous blocks on disk. ext4 online defrag implementation by Takashi 
> already support to choose a "goal" allocation block to guide the ext4 
> block allocator to place the defraged file is a specific location.
> 
> Passing a little bit more hint to sys_fallocate() (i.e, goal block, 
> and/or whether the goal block is important over the size of prealloc 
> extent), might make it more useful for the orginial goal (get contigous 
> and guranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated.
I'd rather have another call for placement decisions, one that would
be called on an fd to set placement decisions for any further allocations
(prealloc, write, etc.).

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06  7:28                 ` Christoph Hellwig
@ 2007-03-06 14:36                   ` Ulrich Drepper
  2007-03-06 14:47                     ` Christoph Hellwig
                                       ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-03-06 14:36 UTC (permalink / raw)
  To: Christoph Hellwig, Mingming Cao, Jan Kara, Andrew Morton, nscott,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	alex, suzuki, Ulrich Drepper

Christoph Hellwig wrote:
> fallocate with the whence argument and flags is already quite complicated,
> I'd rather have another call for placement decisions, that would
> be called on an fd to do placement decissions for any further allocations
> (prealloc, write, etc)

Yes, posix_fallocate shouldn't be made more complicated.  But I don't
understand why requesting linear layout of the blocks should be an
option.  It's always an advantage if the blocks requested this way are
linear on disk.  So, the kernel should always do its best to make this
happen, without needing an additional option.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 14:36                   ` Ulrich Drepper
@ 2007-03-06 14:47                     ` Christoph Hellwig
  2007-03-06 14:50                     ` Jan Kara
  2007-03-06 16:46                     ` Eric Sandeen
  2 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-03-06 14:47 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Mingming Cao, Jan Kara, Andrew Morton, nscott,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	alex, suzuki

On Tue, Mar 06, 2007 at 06:36:09AM -0800, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decissions for any further allocations
> > (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.

There are HPC workloads where you have multi writers on multiple machines
that write to different parts of a file.  You preferably want each
of those regions in separate allocation groups.  (Or tell the customers
to use separate files for the regions..)

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 14:36                   ` Ulrich Drepper
  2007-03-06 14:47                     ` Christoph Hellwig
@ 2007-03-06 14:50                     ` Jan Kara
  2007-03-06 18:23                       ` Eric Sandeen
  2007-03-06 16:46                     ` Eric Sandeen
  2 siblings, 1 reply; 340+ messages in thread
From: Jan Kara @ 2007-03-06 14:50 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Mingming Cao, Andrew Morton, nscott,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	alex, suzuki

On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> Christoph Hellwig wrote:
> > fallocate with the whence argument and flags is already quite complicated,
> > I'd rather have another call for placement decisions, that would
> > be called on an fd to do placement decissions for any further allocations
> > (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.
  Actually, it's not that simple. You want a linear layout of the blocks you
are going to read. That is not necessarily a linear layout of the blocks in a
single file - sometime, trace the startup of a complicated app like KDE. You
will find it's seeking like hell because it needs a few blocks from a ton of
distinct files (shared libs, config files, etc). As these files are mostly
read only, it's advantageous to interleave them on disk or at least keep
them close.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 14:36                   ` Ulrich Drepper
  2007-03-06 14:47                     ` Christoph Hellwig
  2007-03-06 14:50                     ` Jan Kara
@ 2007-03-06 16:46                     ` Eric Sandeen
  2007-03-13 23:46                       ` David Chinner
  2 siblings, 1 reply; 340+ messages in thread
From: Eric Sandeen @ 2007-03-06 16:46 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Mingming Cao, Jan Kara, Andrew Morton, nscott,
	Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	alex, suzuki

Ulrich Drepper wrote:
> Christoph Hellwig wrote:
>> fallocate with the whence argument and flags is already quite complicated,
>> I'd rather have another call for placement decisions, that would
>> be called on an fd to do placement decissions for any further allocations
>> (prealloc, write, etc)
> 
> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> understand why requesting linear layout of the blocks should be an
> option.  It's always an advantage if the blocks requested this way are
> linear on disk.  So, the kernel should always do its best to make this
> happen, without needing an additional option.
> 

Agreed on both points.  The hints would be for things like start block,
or speculative EOF preallocation, not contiguity, which I think should
always be the goal.

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 14:50                     ` Jan Kara
@ 2007-03-06 18:23                       ` Eric Sandeen
  2007-03-07  8:51                         ` Jan Kara
  0 siblings, 1 reply; 340+ messages in thread
From: Eric Sandeen @ 2007-03-06 18:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ulrich Drepper, Christoph Hellwig, Mingming Cao, Andrew Morton,
	nscott, Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	suparna, alex, suzuki

Jan Kara wrote:
> On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
>> Christoph Hellwig wrote:
>>> fallocate with the whence argument and flags is already quite complicated,
>>> I'd rather have another call for placement decisions, that would
>>> be called on an fd to do placement decissions for any further allocations
>>> (prealloc, write, etc)
>> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
>> understand why requesting linear layout of the blocks should be an
>> option.  It's always an advantage if the blocks requested this way are
>> linear on disk.  So, the kernel should always do its best to make this
>> happen, without needing an additional option.
>   Actually, it's not that simple. You want linear layout of blocks you are
> going to read. That is not necessary a linear layout of blocks in a single
> file - trace sometime a start of some complicated app like KDE. You find
> it's seeking like a hell because it needs a few blocks from a ton of
> distinct files (shared libs, config files, etc). As these files are mostly
> read only, it's advantageous to interleave them on disk or at least keep
> them close.

At some point shouldn't the apps be fixed, rather than do crazy things
with the filesystem?  :)

-Eric

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 18:23                       ` Eric Sandeen
@ 2007-03-07  8:51                         ` Jan Kara
  2007-03-07 11:30                           ` Jörn Engel
  0 siblings, 1 reply; 340+ messages in thread
From: Jan Kara @ 2007-03-07  8:51 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ulrich Drepper, Christoph Hellwig, Mingming Cao, Andrew Morton,
	nscott, Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	suparna, alex, suzuki

On Tue 06-03-07 12:23:22, Eric Sandeen wrote:
> Jan Kara wrote:
> > On Tue 06-03-07 06:36:09, Ulrich Drepper wrote:
> >> Christoph Hellwig wrote:
> >>> fallocate with the whence argument and flags is already quite complicated,
> >>> I'd rather have another call for placement decisions, that would
> >>> be called on an fd to do placement decissions for any further allocations
> >>> (prealloc, write, etc)
> >> Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> >> understand why requesting linear layout of the blocks should be an
> >> option.  It's always an advantage if the blocks requested this way are
> >> linear on disk.  So, the kernel should always do its best to make this
> >> happen, without needing an additional option.
> >   Actually, it's not that simple. You want linear layout of blocks you are
> > going to read. That is not necessary a linear layout of blocks in a single
> > file - trace sometime a start of some complicated app like KDE. You find
> > it's seeking like a hell because it needs a few blocks from a ton of
> > distinct files (shared libs, config files, etc). As these files are mostly
> > read only, it's advantageous to interleave them on disk or at least keep
> > them close.
> 
> At some point shouldn't the apps be fixed, rather than do crazy things
> with the filesystem?  :)
  Yes :) That's basically what we told the KDE developers when they were
complaining ;) But it's hard for them to fix it too (because of some
desktop specs requiring lots of different text config files which can
change anytime - don't ask me who designed it). Moreover, for example, when
loading shared libraries from which you need just a few blocks scattered
all over the place, the problem is in ELF itself.
  I'll probably first write some userspace fs-reorganizer to find out how
much performance these layout changes can give you (i.e.
whether it's worth the effort of a more complicated online defragmenter
in the kernel).

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-07  8:51                         ` Jan Kara
@ 2007-03-07 11:30                           ` Jörn Engel
  0 siblings, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-07 11:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Sandeen, Ulrich Drepper, Christoph Hellwig, Mingming Cao,
	Andrew Morton, nscott, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, alex, suzuki

On Wed, 7 March 2007 09:51:35 +0100, Jan Kara wrote:
>
>   I'll probably first write some userspace fs-reorganizer to find out how
> much these changes in layout are able to give you in performance (i.e.
> whether it's worth the effort of more complicated kernel online
> defragmenter).

Have you tried profiling the read accesses and prereading them
asynchronously on startup?  That appears to have improved E17 a lot.
See http://lca2007.linux.org.au/talk/101 (and watch the video).

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC] Heads up on sys_fallocate()
  2007-03-06 16:46                     ` Eric Sandeen
@ 2007-03-13 23:46                       ` David Chinner
  0 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-03-13 23:46 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Ulrich Drepper, Christoph Hellwig, Mingming Cao, Jan Kara,
	Andrew Morton, nscott, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, alex, suzuki

On Tue, Mar 06, 2007 at 10:46:56AM -0600, Eric Sandeen wrote:
> Ulrich Drepper wrote:
> > Christoph Hellwig wrote:
> >> fallocate with the whence argument and flags is already quite complicated,
> >> I'd rather have another call for placement decisions, that would
> >> be called on an fd to do placement decissions for any further allocations
> >> (prealloc, write, etc)
> > 
> > Yes, posix_fallocate shouldn't be made more complicated.  But I don't
> > understand why requesting linear layout of the blocks should be an
> > option.  It's always an advantage if the blocks requested this way are
> > linear on disk.  So, the kernel should always do its best to make this
> > happen, without needing an additional option.
> > 
> 
> Agreed on both points.  The hints would be for things like start block,
> or speculative EOF preallocation, not contiguity, which I think should
> always be the goal.

ISTR having had this discussion before ;)

About guided preallocation for defrag:

http://marc.info/?t=116247859500001&r=1&w=2

e.g.: The sorts of policies we need for effective use of
preallocation:

http://marc.info/?l=linux-fsdevel&m=116184475308164&w=2
http://marc.info/?l=linux-fsdevel&m=116278169519095&w=2

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [RFC][PATCH] sys_fallocate() system call
  2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
                       ` (6 preceding siblings ...)
  2007-03-02  6:03     ` Badari Pulavarty
@ 2007-03-16 14:31     ` Amit K. Arora
  2007-03-16 15:21       ` Heiko Carstens
                         ` (3 more replies)
  7 siblings, 4 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-03-16 14:31 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4, xfs
  Cc: Andrew Morton, suparna, cmm, alex, suzuki

First of all, thanks for the overwhelming response!

Based on the suggestions received, I have added a new parameter to the
sys_fallocate() system call - an integer called "mode", just after the
"fd". Now the system call looks like this:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Currently we have two modes, FA_ALLOCATE and FA_DEALLOCATE, for
preallocation and deallocation of preallocated blocks respectively. More
modes can be added when required. These modes can also be renamed, since
I am sure these names are in no way the best ones! :)

Attached below is the patch which implements this system call. It has
so far been implemented and tested on the i386, ppc64 and x86_64
architectures. I am facing some problems while trying to implement this
on s390, hence the delay. While I try to get it right on s390(x), we
thought of posting this patch so that we can save some time. In parallel,
we will work on getting the patch to work on s390, and it will probably
come as a separate patch.

ToDos:
=====
The following items are pending:
1>   Implementation on other architectures (other than i386, x86_64 and
ppc64) like s390(x)
2>   A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3>   ext4 patches that support the fallocate inode operation are ready. I
plan to submit those separately to just the ext4 mailing list.
4>   Changes to glibc, so that posix_fallocate() and posix_fallocate64()
call the fallocate() system call
5>   Changes to XFS to implement the fallocate inode operation


Signed-off-by: Amit K Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/x86_64/kernel/functionlist  |    1 
 fs/open.c                        |   41 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 ++++++
 include/linux/syscalls.h         |    1 
 9 files changed, 59 insertions(+), 3 deletions(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (len == 0 || offset < 0)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	if (!S_ISREG(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	if (offset + len > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -263,6 +263,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1124,6 +1130,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*fallocate)(struct inode *, int, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.20.1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.20.1/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate		280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.20.1/include/asm-powerpc/unistd.h
@@ -324,10 +324,11 @@
 #define __NR_get_robust_list	299
 #define __NR_set_robust_list	300
 #define __NR_move_pages		301
+#define __NR_fallocate		302
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		302
+#define __NR_syscalls		303
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.20.1/arch/x86_64/kernel/functionlist
===================================================================
--- linux-2.6.20.1.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.20.1/arch/x86_64/kernel/functionlist
@@ -932,6 +932,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.20.1/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.20.1/include/asm-powerpc/systbl.h
@@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
 COMPAT_SYS_SPU(get_robust_list)
 COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS(move_pages)
+SYSCALL(fallocate)
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
@ 2007-03-16 15:21       ` Heiko Carstens
  2007-03-19  9:24         ` Amit K. Arora
  2007-03-16 16:17       ` Heiko Carstens
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 340+ messages in thread
From: Heiko Carstens @ 2007-03-16 15:21 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

On Fri, Mar 16, 2007 at 08:01:01PM +0530, Amit K. Arora wrote:
> First of all, thanks for the overwhelming response!
> 
> Based on the suggestions received, I have added a new parameter to the
> sys_fallocate() system call - an interger called "mode", just after the
> "fd". Now the system call looks like this:
> 
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> preallocation and deallocation of preallocated blocks respectively. More
> modes can be added, when required. And these modes can be renamed, since
> I am sure these are no way the best ones ! :)
> 
> Attached below is the patch which implements this system call. It has
> been currently implemented and tested on i386, ppc64 and x86_64
> architectures. I am facing some problems while trying to implement this
> on s390, and thus the delay. While I try to get it right on s390(x), we
> thought of posting this patch, so that we can save some time. Parallely
> we will work on getting the patch work on s390, and probably it will
> come as a separate patch.

What's the problem you face on s390? If it's just the compat wrapper, you
may look at sys_sync_file_range_wrapper. Or I will send a patch if needed.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
  2007-03-16 15:21       ` Heiko Carstens
@ 2007-03-16 16:17       ` Heiko Carstens
  2007-03-17  9:59         ` Paul Mackerras
  2007-03-17 11:10         ` Matthew Wilcox
  2007-03-17  5:33       ` [RFC][PATCH] sys_fallocate() " Stephen Rothwell
  2007-03-17 14:53       ` Russell King
  3 siblings, 2 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-16 16:17 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

> on s390, and thus the delay. While I try to get it right on s390(x), we
> thought of posting this patch, so that we can save some time. In parallel,
> we will work on getting the patch to work on s390, and it will probably
> come as a separate patch.
> 
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> +{

There is something here that will not work on s390 (31bit): the arguments
would end up in:
fd -> r2
mode -> r3
offset -> r4 + r5
len -> r6 + second half on stack

But the s390 ABI says that a long long will be put into two consecutive
registers if the first register is smaller than 6, or it will be put
completely on the stack. So both 32 bit parts of len will end up on the
stack. That would make it a syscall with seven arguments which we currently
don't support on s390. There is no way to access the second half of len
from kernel space and that is why it is not working for you.
So you either rearrange the parameters or convert the loff_t's to pointers.

e.g.

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode)

would work even on s390 ;)

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
  2007-03-16 15:21       ` Heiko Carstens
  2007-03-16 16:17       ` Heiko Carstens
@ 2007-03-17  5:33       ` Stephen Rothwell
  2007-03-19  9:30         ` Amit K. Arora
  2007-03-17 14:53       ` Russell King
  3 siblings, 1 reply; 340+ messages in thread
From: Stephen Rothwell @ 2007-03-17  5:33 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

On Fri, 16 Mar 2007 20:01:01 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>

> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>
> --- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
> +++ linux-2.6.20.1/include/asm-powerpc/systbl.h
> @@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
>  COMPAT_SYS_SPU(get_robust_list)
>  COMPAT_SYS_SPU(set_robust_list)
>  COMPAT_SYS(move_pages)
> +SYSCALL(fallocate)

It is going to need to be a COMPAT_SYS call in powerpc because 32 bit
powerpc will pass the two loff_t's in pairs of registers while
64bit passes them in one register each.

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 16:17       ` Heiko Carstens
@ 2007-03-17  9:59         ` Paul Mackerras
  2007-03-17 11:07           ` Matthew Wilcox
  2007-03-17 11:10         ` Matthew Wilcox
  1 sibling, 1 reply; 340+ messages in thread
From: Paul Mackerras @ 2007-03-17  9:59 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	Andrew Morton, suparna, cmm, alex, suzuki

Heiko Carstens writes:

> So you either rearrange the parameters or convert the loff_t's to pointers.
> 
> e.g.
> 
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode)
> 
> would work even on s390 ;)

... but wouldn't work on 32-bit powerpc. :(  We would end up with a
pad argument between fd and offset, giving 7 arguments in all
(counting the loff_t's as 2), but we only support 6.

Paul.

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17  9:59         ` Paul Mackerras
@ 2007-03-17 11:07           ` Matthew Wilcox
  2007-03-17 14:30             ` Heiko Carstens
  0 siblings, 1 reply; 340+ messages in thread
From: Matthew Wilcox @ 2007-03-17 11:07 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Heiko Carstens, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, Andrew Morton, suparna, cmm, alex, suzuki

On Sat, Mar 17, 2007 at 08:59:05PM +1100, Paul Mackerras wrote:
> ... but wouldn't work on 32-bit powerpc. :(  We would end up with a
> pad argument between fd and offset, giving 7 arguments in all
> (counting the loff_t's as 2), but we only support 6.

Ditto mips and parisc.

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 16:17       ` Heiko Carstens
  2007-03-17  9:59         ` Paul Mackerras
@ 2007-03-17 11:10         ` Matthew Wilcox
  2007-03-21 12:04           ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: Matthew Wilcox @ 2007-03-17 11:10 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	Andrew Morton, suparna, cmm, alex, suzuki

On Fri, Mar 16, 2007 at 05:17:04PM +0100, Heiko Carstens wrote:
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> e.g.
> 
> asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode)
> 
> would work even on s390 ;)

How about:

asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
				u32 len_low, u32 len_high);

That way we all suffer equally ...

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17 11:07           ` Matthew Wilcox
@ 2007-03-17 14:30             ` Heiko Carstens
  2007-03-17 14:38               ` Stephen Rothwell
  0 siblings, 1 reply; 340+ messages in thread
From: Heiko Carstens @ 2007-03-17 14:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Paul Mackerras, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, Andrew Morton, suparna, cmm, alex, suzuki

On Sat, Mar 17, 2007 at 05:07:06AM -0600, Matthew Wilcox wrote:
> On Sat, Mar 17, 2007 at 08:59:05PM +1100, Paul Mackerras wrote:
> > ... but wouldn't work on 32-bit powerpc. :(  We would end up with a
> > pad argument between fd and offset, giving 7 arguments in all
> > (counting the loff_t's as 2), but we only support 6.
> 
> Ditto mips and parisc.

Can't be. Or: mips supports 7 arguments and parisc doesn't pad.
Otherwise they couldn't have wired up

sys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags)

But from what I read, it's currently not possible for 32-bit powerpc to
wire up the already present sync_file_range system call.

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17 14:30             ` Heiko Carstens
@ 2007-03-17 14:38               ` Stephen Rothwell
  2007-03-17 14:42                 ` Stephen Rothwell
  0 siblings, 1 reply; 340+ messages in thread
From: Stephen Rothwell @ 2007-03-17 14:38 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Matthew Wilcox, Paul Mackerras, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, Andrew Morton, suparna, cmm, alex,
	suzuki

On Sat, 17 Mar 2007 15:30:43 +0100 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
>
> sys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags)
>
> But from what I read, it's currently not possible for 32-bit powerpc to
> wire up the already present sync_file_range system call.

32bit native is fine (as the ABI in user mode is the same as that in the
kernel).  For 32bit on a 64bit kernel you need the arch specific compat
routine that I used in the patch I posted a little while ago.

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17 14:38               ` Stephen Rothwell
@ 2007-03-17 14:42                 ` Stephen Rothwell
  0 siblings, 0 replies; 340+ messages in thread
From: Stephen Rothwell @ 2007-03-17 14:42 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Matthew Wilcox, Paul Mackerras, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, Andrew Morton, suparna, cmm, alex,
	suzuki

On Sun, 18 Mar 2007 01:38:38 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Sat, 17 Mar 2007 15:30:43 +0100 Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> >
> > sys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags)
> >
> > But from what I read, it's currently not possible for 32-bit powerpc to
> > wire up the already present sync_file_range system call.
>
> 32bit native is fine (as the ABI in user mode is the same as that in the

Sorry, I take that back ...

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
                         ` (2 preceding siblings ...)
  2007-03-17  5:33       ` [RFC][PATCH] sys_fallocate() " Stephen Rothwell
@ 2007-03-17 14:53       ` Russell King
  3 siblings, 0 replies; 340+ messages in thread
From: Russell King @ 2007-03-17 14:53 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

On Fri, Mar 16, 2007 at 08:01:01PM +0530, Amit K. Arora wrote:
> Attached below is the patch which implements this system call. It has
> been currently implemented and tested on i386, ppc64 and x86_64
> architectures. I am facing some problems while trying to implement this
> on s390, and thus the delay. While I try to get it right on s390(x), we
> thought of posting this patch, so that we can save some time. In parallel,
> we will work on getting the patch to work on s390, and it will probably
> come as a separate patch.

I suggest reading the very end of arch/arm/kernel/sys_arm.c; I'd rather
avoid adding more and more hacks like that to the kernel if at all
possible.

One solution (already mentioned elsewhere) is that we stop passing
64-bit arguments and pass two 32-bit arguments instead.  This
nicely avoids the alignment restrictions for 64-bit args in ABIs.

(The issue for ARM is that with anything other than the "fd, mode,
offset, len" layout we will have to deal with different ABI argument
layouts, or implement our own wrapper function as done for
sys_arm_sync_file_range.)

I think the problem comes down to "what is the argument layout which
causes the least amount of problems for the complete set of architectures."
For ARM, that's the "fd, mode, offset, len" layout.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-16 15:21       ` Heiko Carstens
@ 2007-03-19  9:24         ` Amit K. Arora
  2007-03-19 11:23           ` Heiko Carstens
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-03-19  9:24 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

On Fri, Mar 16, 2007 at 04:21:03PM +0100, Heiko Carstens wrote:
> On Fri, Mar 16, 2007 at 08:01:01PM +0530, Amit K. Arora wrote:
> > First of all, thanks for the overwhelming response!
> > 
> > Based on the suggestions received, I have added a new parameter to the
> > sys_fallocate() system call - an integer called "mode", just after the
> > "fd". Now the system call looks like this:
> > 
> >  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> > preallocation and deallocation of preallocated blocks respectively. More
> > modes can be added, when required. And these modes can be renamed, since
> > I am sure these are in no way the best ones! :)
> > 
> > Attached below is the patch which implements this system call. It has
> > been currently implemented and tested on i386, ppc64 and x86_64
> > architectures. I am facing some problems while trying to implement this
> > on s390, and thus the delay. While I try to get it right on s390(x), we
> > thought of posting this patch, so that we can save some time. In parallel,
> > we will work on getting the patch to work on s390, and it will probably
> > come as a separate patch.
> 
> What's the problem you face on s390? If it's just the compat wrapper, you
> may look at sys_sync_file_range_wrapper. Or I will send a patch if needed.

Hi Heiko,

Yes, the problem was adding the compat wrapper for this. I would appreciate
your help in writing it. The only thing is that we might have to wait till
the order of the arguments is decided upon. Thanks!

--
Regards,
Amit Arora

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17  5:33       ` [RFC][PATCH] sys_fallocate() " Stephen Rothwell
@ 2007-03-19  9:30         ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-03-19  9:30 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki

On Sat, Mar 17, 2007 at 04:33:50PM +1100, Stephen Rothwell wrote:
> On Fri, 16 Mar 2007 20:01:01 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> >
> 
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> >
> > --- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
> > +++ linux-2.6.20.1/include/asm-powerpc/systbl.h
> > @@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
> >  COMPAT_SYS_SPU(get_robust_list)
> >  COMPAT_SYS_SPU(set_robust_list)
> >  COMPAT_SYS(move_pages)
> > +SYSCALL(fallocate)
> 
> It is going to need to be a COMPAT_SYS call in powerpc because 32 bit
> powerpc will pass the two loff_t's in pairs of registers while
> 64bit passes them in one register each.

Ok. Will make that change, unless it is decided to pass each loff_t
argument as two "u32"s. Thanks!

--
Regards,
Amit Arora

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-19  9:24         ` Amit K. Arora
@ 2007-03-19 11:23           ` Heiko Carstens
  0 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-19 11:23 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, Andrew Morton,
	suparna, cmm, alex, suzuki, Matthew Wilcox, Paul Mackerras,
	Stephen Rothwell, Russell King

On Mon, Mar 19, 2007 at 02:54:04PM +0530, Amit K. Arora wrote:
> On Fri, Mar 16, 2007 at 04:21:03PM +0100, Heiko Carstens wrote:
> > On Fri, Mar 16, 2007 at 08:01:01PM +0530, Amit K. Arora wrote:
> > >  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > > 
> > > Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> > > preallocation and deallocation of preallocated blocks respectively. More
> > > modes can be added, when required. And these modes can be renamed, since
> > > > I am sure these are in no way the best ones! :)
> > > 
> Yes, the problem was adding the compat wrapper for this. I would appreciate
> your help in writing it. The only thing is that we might have to wait till
> the order of the arguments is decided upon. Thanks!

There is probably not much choice. If you want to stay with the loff_t
arguments it won't work on 31-bit s390 or 32-bit powerpc, depending on the
order of the arguments.
So you should go for what Matthew Wilcox suggested:

asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
			      u32 len_low, u32 len_high);

That way it will work on all architectures and in addition no architecture
has to do some magic to combine the split 64 bit arguments in compat
mode.

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-17 11:10         ` Matthew Wilcox
@ 2007-03-21 12:04           ` Amit K. Arora
  2007-03-21 21:35             ` Chris Wedgwood
  2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
  0 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-03-21 12:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Heiko Carstens, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	Andrew Morton, suparna, cmm, alex, suzuki

On Sat, Mar 17, 2007 at 05:10:37AM -0600, Matthew Wilcox wrote:
> How about:
> 
> asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
> 				u32 len_low, u32 len_high);
> 
> That way we all suffer equally ...

As suggested by you and Russell, I have made this change to the patch.
Here is how it looks now. Please let me know if anyone has concerns
about passing arguments this way (breaking each "loff_t" into two "u32"s).

Signed-off-by: Amit K Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/x86_64/kernel/functionlist  |    1 
 fs/open.c                        |   46 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 +++++
 include/linux/syscalls.h         |    2 +
 9 files changed, 65 insertions(+), 3 deletions(-)

Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,52 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+				u32 len_low, u32 len_high)
+{
+	struct file *file;
+	struct inode *inode;
+	loff_t offset, len;
+	long ret = -EINVAL;
+
+	offset = ((loff_t)off_high << 32) | off_low;
+	len = ((loff_t)len_high << 32) | len_low;
+
+	if (len == 0 || offset < 0)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	if (!S_ISREG(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	if (offset + len > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.20.1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-i386/unistd.h
+++ linux-2.6.20.1/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/linux/fs.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/fs.h
+++ linux-2.6.20.1/include/linux/fs.h
@@ -263,6 +263,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1124,6 +1130,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	int (*fallocate)(struct inode *, int, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.20.1/include/linux/syscalls.h
===================================================================
--- linux-2.6.20.1.orig/include/linux/syscalls.h
+++ linux-2.6.20.1/include/linux/syscalls.h
@@ -602,6 +602,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+				u32 len_low, u32 len_high);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.20.1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.20.1/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate		280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.1/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.20.1/include/asm-powerpc/unistd.h
@@ -324,10 +324,11 @@
 #define __NR_get_robust_list	299
 #define __NR_set_robust_list	300
 #define __NR_move_pages		301
+#define __NR_fallocate		302
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		302
+#define __NR_syscalls		303
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.20.1/arch/x86_64/kernel/functionlist
===================================================================
--- linux-2.6.20.1.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.20.1/arch/x86_64/kernel/functionlist
@@ -932,6 +932,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.20.1/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.20.1.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.20.1/include/asm-powerpc/systbl.h
@@ -305,3 +305,4 @@ SYSCALL_SPU(faccessat)
 COMPAT_SYS_SPU(get_robust_list)
 COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS(move_pages)
+SYSCALL(fallocate)
--
Regards,
Amit Arora

* Re: [RFC][PATCH] sys_fallocate() system call
  2007-03-21 12:04           ` Amit K. Arora
@ 2007-03-21 21:35             ` Chris Wedgwood
  2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Chris Wedgwood @ 2007-03-21 21:35 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Matthew Wilcox, Heiko Carstens, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, Andrew Morton, suparna, cmm, alex, suzuki

I hate to comment at this late stage, especially on something that I
think is really a great idea. (I did something similar but more complex
some time ago, sys_blkalloc, with even more arguments. Given how complex
this thread has become, I'm glad I didn't post it.)

In the past there wasn't that much incentive to get this functionality
exposed because of various other issues (mmap + page dirty didn't
flush reliably) which are close to being resolved, so I think the
timing of this is really great....


On Wed, Mar 21, 2007 at 05:34:25PM +0530, Amit K. Arora wrote:

> As suggested by you and Russell, I have made this change to the
> patch.  Here is how it looks now. Please let me know if anyone
> has concerns about passing arguments this way (breaking each
> "loff_t" into two "u32"s).

I really dislike breaking 64-bit args up unless it's necessary.  I
guess it doesn't really hurt, but it feels needlessly ugly.

> +	.long sys_fallocate		/* 320 */

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE	0x1
> +#define FA_DEALLOCATE	0x2
> +

Given there are only TWO modes right now, why not keep the
arguments as sane 64-bit values and simply have two syscalls, one for each?

* Interface for the new fallocate() system call
  2007-03-21 12:04           ` Amit K. Arora
  2007-03-21 21:35             ` Chris Wedgwood
@ 2007-03-29 11:51             ` Amit K. Arora
  2007-03-29 16:35               ` Chris Wedgwood
                                 ` (2 more replies)
  1 sibling, 3 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-03-29 11:51 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Hello,

We need to come up with the best possible layout of arguments for the
fallocate() system call. Various architectures have different
requirements for how the arguments should look. Since the mail
chain has become huge, here is the summary of various inputs received
so far.

Platform: s390
--------------
s390 prefers the following layout:

   int fallocate(int fd, loff_t offset, loff_t len, int mode)

For details on why and how "int, int, loff_t, loff_t" is a problem on
s390, please see Heiko's mail on 16th March. Here is the link:
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html

Platform: ppc, arm
------------------
ppc (32 bit) has a problem with the "int, loff_t, loff_t, int" layout,
since this will result in a pad between fd and offset, making seven
arguments in total, which ppc32 does not support; it supports only
6 arguments. Thus the layout desired by ppc32 is:

   int fallocate(int fd, int mode, loff_t offset, loff_t len)

ARM also prefers this kind of layout. For details please see the
definition of sys_arm_sync_file_range().

Option of loff_t => high u32 + low u32
--------------------------------------
Matthew and Russell have suggested another option of breaking each
"loff_t" into two "u32"s. This will result in 6 arguments in total.

The following people think that this is a good alternative:
Matthew Wilcox, Russell King, Heiko Carstens

The following people do not like this idea:
Chris Wedgwood


What are your thoughts on this? What layout should we finalize on?
Perhaps, since the sync_file_range() system call has similar arguments, we
can take a hint from the challenges faced in implementing it on various
architectures, and decide.

Please suggest. Thanks!

--
Regards,
Amit Arora

* Re: Interface for the new fallocate() system call
  2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
@ 2007-03-29 16:35               ` Chris Wedgwood
  2007-03-29 17:01               ` Jan Engelhardt
  2007-03-29 17:10               ` Andrew Morton
  2 siblings, 0 replies; 340+ messages in thread
From: Chris Wedgwood @ 2007-03-29 16:35 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Thu, Mar 29, 2007 at 05:21:26PM +0530, Amit K. Arora wrote:

>    int fallocate(int fd, loff_t offset, loff_t len, int mode)

Right now there are only two possible values for mode --- it's not
clear what additional values there will be in the future.

How about two syscalls?  If we decide later on we need something more
complicated we can revisit this and *THEN* add another system call
which may end up being a superset of the other two.

I know that sounds somewhat icky but:

  * it's fairly simple

  * we get nice argument handling on all arches by dropping u32 mode
    (don't we?)

  * syscalls don't really cost a lot to keep around, they do cost in
    terms of maintenance though, but in this case I don't see it being
    all that much of a problem

  * IMO badly/over designed syscalls are going to be a bigger problem
    long term

Given that *NO* single fs in mainline right now can *reliably* use
this functionality for a while maybe whatever solution people come up
with next should sit in -mm for a while?  At least that gives people
exposure to it and a chance to make some changes as once it's merged
to mainline it's pretty hard to change.

* Re: Interface for the new fallocate() system call
  2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
  2007-03-29 16:35               ` Chris Wedgwood
@ 2007-03-29 17:01               ` Jan Engelhardt
  2007-03-29 17:18                   ` linux-os (Dick Johnson)
  2007-03-30  7:00                 ` Heiko Carstens
  2007-03-29 17:10               ` Andrew Morton
  2 siblings, 2 replies; 340+ messages in thread
From: Jan Engelhardt @ 2007-03-29 17:01 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

Hi,

On Mar 29 2007 17:21, Amit K. Arora wrote:
>
>We need to come up with the best possible layout of arguments for the
>fallocate() system call. Various architectures have different
>requirements for how the arguments should look. Since the mail
>chain has become huge, here is the summary of various inputs received
>so far.

>s390 prefers the following layout:
>   int fallocate(int fd, loff_t offset, loff_t len, int mode)
>For details on why and how "int, int, loff_t, loff_t" is a problem on
>s390, please see Heiko's mail on 16th March. Here is the link:
>http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html

Quoting that...
	|len -> r6 + second half on stack

Then, isn't this a gcc glitch? (IMO, it should put all of "len" on the
stack)

>Platform: ppc, arm
>------------------
>6 arguments. Thus the layout desired by ppc32 is:
>   int fallocate(int fd, int mode, loff_t offset, loff_t len)
>
>Option of loff_t => high u32 + low u32
>--------------------------------------
>Matthew and Russell have suggested another option of breaking each
>"loff_t" into two "u32"s. This will result in 6 arguments in total.
>
>What are your thoughts on this? What layout should we finalize on?
>Perhaps, since the sync_file_range() system call has similar arguments, we
>can take a hint from the challenges faced in implementing it on various
>architectures, and decide.
>
>Please suggest. Thanks!

Does it actually matter? Glibc can have its own argument ordering
different from the syscalls, so at least it would be possible to lay out
the syscall arguments in the most portable way while retaining nice
userspace C code. Hey, glibc might even wrap it up in a struct! (Using a 
pointer, as suggested in one of the proposals.)

int fallocate(int fd, loff_t offset, loff_t len, int mode)
{
	struct fallocate_foobar d = {fd, offset, len, mode};
	return _syscall(..., &d);
}




Jan
-- 

* Re: Interface for the new fallocate() system call
  2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
  2007-03-29 16:35               ` Chris Wedgwood
  2007-03-29 17:01               ` Jan Engelhardt
@ 2007-03-29 17:10               ` Andrew Morton
  2007-03-30  7:14                 ` Jakub Jelinek
  2007-03-30  7:19                 ` Interface for the new fallocate() " Heiko Carstens
  2 siblings, 2 replies; 340+ messages in thread
From: Andrew Morton @ 2007-03-29 17:10 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 29 Mar 2007 17:21:26 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> Hello,
> 
> We need to come up with the best possible layout of arguments for the
> fallocate() system call. Various architectures have different
> requirements for how the arguments should look like. Since the mail
> chain has become huge, here is the summary of various inputs received
> so far.
> 
> Platform: s390
> --------------
> s390 prefers following layout:
> 
>    int fallocate(int fd, loff_t offset, loff_t len, int mode)
> 
> For details on why and how "int, int, loff_t, loff_t" is a problem on
> s390, please see Heiko's mail on 16th March. Here is the link:
> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
> 
> Platform: ppc, arm
> ------------------
> ppc (32 bit) has a problem with "int, loff_t, loff_t, int" layout,
> since this will result in a pad between fd and offset, making seven
> arguments total - which is not supported by ppc32. It supports only
> 6 arguments. Thus the desired layout by ppc32 is:
> 
>    int fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Even ARM prefers above kind of layout. For details please see the
> definition of sys_arm_sync_file_range().

This is a clean-looking option.  Can s390 be changed to support seven-arg
syscalls?

> Option of loff_t => high u32 + low u32
> --------------------------------------
> Matthew and Russell have suggested another option of breaking each
> "loff_t" into two "u32"s. This will result in 6 arguments in total.
> 
> Following think that this is a good alternative:
> Matthew Wilcox, Russell King, Heiko Carstens
> 
> Following do not like this idea:
> Chris Wedgwood

It's a bit weird-looking, but the six-32-bit-args approach is simple
enough to understand and implement.  Presumably the glibc wrapper
would hide that detail from everyone.
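
A minimal sketch of the six-32-bit-args idea, for illustration only. The
names fallocate_split(), fallocate_wrapper(), hi32(), lo32() and join64()
are hypothetical and do not come from any patch in this thread; the
receiving function is a userspace stub that merely records what it was
given. It shows the wrapper splitting each 64-bit value into two halves
and the receiving side joining them again.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: split a 64-bit offset into two 32-bit halves. */
static uint32_t hi32(int64_t v) { return (uint32_t)((uint64_t)v >> 32); }
static uint32_t lo32(int64_t v) { return (uint32_t)((uint64_t)v & 0xffffffffu); }

/* Rebuild the 64-bit value on the receiving side. */
static int64_t join64(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* What the stub entry point last received, kept for inspection. */
struct fallocate_req {
	int fd;
	int mode;
	int64_t offset;
	int64_t len;
};
static struct fallocate_req last_req;

/*
 * Stand-in for the syscall entry point: six 32-bit arguments, so no
 * 64-bit argument needs an aligned register pair.
 */
static long fallocate_split(int fd, int mode,
			    uint32_t off_hi, uint32_t off_lo,
			    uint32_t len_hi, uint32_t len_lo)
{
	last_req.fd = fd;
	last_req.mode = mode;
	last_req.offset = join64(off_hi, off_lo);
	last_req.len = join64(len_hi, len_lo);
	return 0;
}

/* What a libc wrapper might do to hide the split from applications. */
static long fallocate_wrapper(int fd, int mode, int64_t offset, int64_t len)
{
	return fallocate_split(fd, mode,
			       hi32(offset), lo32(offset),
			       hi32(len), lo32(len));
}
```

On a 64-bit platform the split and join are pure overhead, since the
value would have fit in one register anyway.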
 
> 
> What are your thoughts on this ? What layout should we finalize on ?
> Perhaps, since sync_file_range() system call has similar arguments, we
> can take hint from the challenges faced on implementing it on various
> architectures, and decide.
> 



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 17:01               ` Jan Engelhardt
@ 2007-03-29 17:18                   ` linux-os (Dick Johnson)
  2007-03-30  7:00                 ` Heiko Carstens
  1 sibling, 0 replies; 340+ messages in thread
From: linux-os (Dick Johnson) @ 2007-03-29 17:18 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm


On Thu, 29 Mar 2007, Jan Engelhardt wrote:

> Hi,
>
> On Mar 29 2007 17:21, Amit K. Arora wrote:
>>
>> We need to come up with the best possible layout of arguments for the
>> fallocate() system call. Various architectures have different
>> requirements for how the arguments should look like. Since the mail
>> chain has become huge, here is the summary of various inputs received
>> so far.
>
>> s390 prefers following layout:
>>   int fallocate(int fd, loff_t offset, loff_t len, int mode)
>> For details on why and how "int, int, loff_t, loff_t" is a problem on
>> s390, please see Heiko's mail on 16th March. Here is the link:
>> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
>
> Quoting that...
> 	|len -> r6 + second halve on stack
>
> Then, is not this a gcc glitch? (IMO, it should put all of "len" on the
> stack)
>
>> Platform: ppc, arm
>> ------------------
>> 6 arguments. Thus the desired layout by ppc32 is:
>>   int fallocate(int fd, int mode, loff_t offset, loff_t len)
>>
>> Option of loff_t => high u32 + low u32
>> --------------------------------------
>> Matthew and Russell have suggested another option of breaking each
>> "loff_t" into two "u32"s. This will result in 6 arguments in total.
>>
>> What are your thoughts on this ? What layout should we finalize on ?
>> Perhaps, since sync_file_range() system call has similar arguments, we
>> can take hint from the challenges faced on implementing it on various
>> architectures, and decide.
>>
>> Please suggest. Thanks!
>
> Does it actually matter? Glibc can have its own argument ordering
> different from the syscalls, so at least it would be possible to lay out
> the syscall arguments in the most portable way while retaining nice
> userspace C code. Hey, glibc might even wrap it up in a struct! (Using a
> pointer, as suggested in one of the proposals.)
>
> int fallocate(int fd, loff_t offset, loff_t len, int mode)
> {
> 	struct fallocate_foobar d = {fd, offset, len, mode};
> 	return _syscall(..., &d);
> }
>
> Jan
> --

I think it's always better to put only a pointer on the stack as
above.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.62 BogoMips).
New book: http://www.AbominableFirebug.com/

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 17:18                   ` linux-os (Dick Johnson)
  (?)
@ 2007-03-29 18:05                   ` Jan Engelhardt
  2007-03-29 18:37                     ` Linus Torvalds
  -1 siblings, 1 reply; 340+ messages in thread
From: Jan Engelhardt @ 2007-03-29 18:05 UTC (permalink / raw)
  To: linux-os (Dick Johnson)
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

 
On Mar 29 2007 13:18, linux-os (Dick Johnson) wrote:
>
>I think it's always better to put only a pointer on the stack as
>above.

I have to disagree, since wrapping it into a struct and copying the struct
in kernelspace from userspace requires more code. Pointers only become
useful at 3 (rarely) or 4 (yeah, more likely) and 5+ (definitely)
arguments, (3) see above about copying, (4) middle thing and (5) tons of
arguments like mmap() should be wrapped up... for simplicity of dealing
with it later.


Jan
-- 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 18:05                   ` Jan Engelhardt
@ 2007-03-29 18:37                     ` Linus Torvalds
  0 siblings, 0 replies; 340+ messages in thread
From: Linus Torvalds @ 2007-03-29 18:37 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: linux-os (Dick Johnson),
	Amit K. Arora, akpm, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm



On Thu, 29 Mar 2007, Jan Engelhardt wrote:
>
> I have to disagree, since wrapping it into a struct and copying the struct
> in kernelspace from userspace requires more code.

Not just more code, but more security issues too.

Passing system call arguments by value means that there are no subtle 
security issues - the value you use is the value you got. But once you 
pass-by-reference, you have to make damn sure that you do the proper user 
space accesses and verify the pointer correctly.

User-space (aka "user-supplied") pointers are just more dangerous. We 
obviously can't avoid them, but they need much more care than just a 
random value directly passed in a register.
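
A rough userspace sketch of that difference, under loud assumptions: the
"user memory" here is just a static array, copy_in() is a hypothetical
stand-in for the range check plus copy that a kernel would do with its
own primitives, and the argument checks are illustrative only. The
by-value variant has nothing to verify except the values themselves; the
by-pointer variant must first prove the pointer is usable and take a
private copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument block, as in the struct-wrapping proposal. */
struct fallocate_args_sketch {
	int fd;
	int mode;
	int64_t offset;
	int64_t len;
};

/* Simulated user address space: only pointers into this array are valid. */
static unsigned char user_mem[256];

/* Stand-in for a checked copy from user memory (not a kernel API). */
static int copy_in(void *dst, const void *uptr, size_t n)
{
	uintptr_t base = (uintptr_t)user_mem;
	uintptr_t p = (uintptr_t)uptr;

	if (n > sizeof(user_mem) || p < base || p > base + sizeof(user_mem) - n)
		return -EFAULT;
	memcpy(dst, uptr, n);
	return 0;
}

/* By reference: verify the pointer, copy once, then check the copy. */
static long fallocate_by_pointer(const void *uptr)
{
	struct fallocate_args_sketch a;
	int err = copy_in(&a, uptr, sizeof(a));

	if (err)
		return err;
	if (a.offset < 0 || a.len <= 0)
		return -EINVAL;
	return 0;
}

/* By value: the values received are the values used. */
static long fallocate_by_value(int fd, int mode, int64_t offset, int64_t len)
{
	(void)fd;
	(void)mode;
	if (offset < 0 || len <= 0)
		return -EINVAL;
	return 0;
}
```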

		Linus

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 17:01               ` Jan Engelhardt
  2007-03-29 17:18                   ` linux-os (Dick Johnson)
@ 2007-03-30  7:00                 ` Heiko Carstens
  1 sibling, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-30  7:00 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Thu, Mar 29, 2007 at 07:01:54PM +0200, Jan Engelhardt wrote:
> Hi,
> 
> On Mar 29 2007 17:21, Amit K. Arora wrote:
> >
> >We need to come up with the best possible layout of arguments for the
> >fallocate() system call. Various architectures have different
> >requirements for how the arguments should look like. Since the mail
> >chain has become huge, here is the summary of various inputs received
> >so far.
> 
> >s390 prefers following layout:
> >   int fallocate(int fd, loff_t offset, loff_t len, int mode)
> >For details on why and how "int, int, loff_t, loff_t" is a problem on
> >s390, please see Heiko's mail on 16th March. Here is the link:
> >http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
> 
> Quoting that...
> 	|len -> r6 + second halve on stack
> 
> Then, is not this a gcc glitch? (IMO, it should put all of "len" on the 
> stack)

It _does_ put all of "len" on the stack. That is what I tried to explain
in the section that follows what you quoted.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 17:10               ` Andrew Morton
@ 2007-03-30  7:14                 ` Jakub Jelinek
  2007-03-30  8:39                   ` Heiko Carstens
                                     ` (5 more replies)
  2007-03-30  7:19                 ` Interface for the new fallocate() " Heiko Carstens
  1 sibling, 6 replies; 340+ messages in thread
From: Jakub Jelinek @ 2007-03-30  7:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Thu, Mar 29, 2007 at 10:10:10AM -0700, Andrew Morton wrote:
> > Platform: s390
> > --------------
> > s390 prefers following layout:
> > 
> >    int fallocate(int fd, loff_t offset, loff_t len, int mode)
> > 
> > For details on why and how "int, int, loff_t, loff_t" is a problem on
> > s390, please see Heiko's mail on 16th March. Here is the link:
> > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
> > 
> > Platform: ppc, arm
> > ------------------
> > ppc (32 bit) has a problem with "int, loff_t, loff_t, int" layout,
> > since this will result in a pad between fd and offset, making seven
> > arguments total - which is not supported by ppc32. It supports only
> > 6 arguments. Thus the desired layout by ppc32 is:
> > 
> >    int fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > Even ARM prefers above kind of layout. For details please see the
> > definition of sys_arm_sync_file_range().
> 
> This is a clean-looking option.  Can s390 be changed to support seven-arg
> syscalls?

Wouldn't
int fallocate(loff_t offset, loff_t len, int fd, int mode)
work on both s390 and ppc/arm?  glibc will certainly wrap it and
reorder the arguments as needed, so there is no need to keep fd first.
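
A small sketch of the slot counting behind the ppc32/ARM concern from
the summary. It assumes a simplified rule: every argument is passed in
32-bit register slots, and a 64-bit argument must start on an even slot
(an aligned register pair). The names count_slots() and the three layout
arrays are made up for illustration, and the model deliberately says
nothing about s390, whose convention differs.

```c
#include <assert.h>
#include <stddef.h>

enum arg_kind { ARG_32, ARG_64 };

/*
 * Count the 32-bit register slots a layout consumes, assuming that a
 * 64-bit argument has to begin on an even slot.
 */
static int count_slots(const enum arg_kind *args, size_t n)
{
	int slot = 0;

	for (size_t i = 0; i < n; i++) {
		if (args[i] == ARG_64) {
			slot += slot & 1;	/* one padding slot if misaligned */
			slot += 2;
		} else {
			slot += 1;
		}
	}
	return slot;
}

/* (fd, offset, len, mode): the pad after fd pushes this to seven slots. */
static const enum arg_kind fd_off_len_mode[] = { ARG_32, ARG_64, ARG_64, ARG_32 };

/* (fd, mode, offset, len): the layout ppc32 asked for, six slots. */
static const enum arg_kind fd_mode_off_len[] = { ARG_32, ARG_32, ARG_64, ARG_64 };

/* (offset, len, fd, mode): the ordering proposed here, also six slots. */
static const enum arg_kind off_len_fd_mode[] = { ARG_64, ARG_64, ARG_32, ARG_32 };
```

Under this model the proposed ordering needs no padding, because both
64-bit arguments come first and are aligned by construction.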

	Jakub

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-29 17:10               ` Andrew Morton
  2007-03-30  7:14                 ` Jakub Jelinek
@ 2007-03-30  7:19                 ` Heiko Carstens
  2007-03-30  9:15                   ` Paul Mackerras
  2007-03-30  9:15                   ` Paul Mackerras
  1 sibling, 2 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-30  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

> > Even ARM prefers above kind of layout. For details please see the
> > definition of sys_arm_sync_file_range().
> 
> This is a clean-looking option.  Can s390 be changed to support seven-arg
> syscalls?
> 
> > Option of loff_t => high u32 + low u32
> > --------------------------------------
> > Matthew and Russell have suggested another option of breaking each
> > "loff_t" into two "u32"s. This will result in 6 arguments in total.
> > 
> > Following think that this is a good alternative:
> > Matthew Wilcox, Russell King, Heiko Carstens
> > 
> > Following do not like this idea:
> > Chris Wedgwood
> 
> It's a bit weird-looking, but the six-32-bit-args approach is simple
> > enough to understand and implement.  Presumably the glibc wrapper
> would hide that detail from everyone.

s390 can be changed to support seven-arg syscalls. But that would require
creating an additional stackframe in *libc to save original register
contents and in addition it would make our syscall hotpath slower.
That is because we have to take care of an additional register that might
contain user space passed contents and needs to be put on the kernel stack.
If possible I'd prefer the six-32-bit-args approach.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  7:14                 ` Jakub Jelinek
  2007-03-30  8:39                   ` Heiko Carstens
@ 2007-03-30  8:39                   ` Heiko Carstens
  2007-03-30  9:15                   ` Paul Mackerras
                                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-30  8:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> On Thu, Mar 29, 2007 at 10:10:10AM -0700, Andrew Morton wrote:
> > > Platform: s390
> > > --------------
> > > s390 prefers following layout:
> > > 
> > >    int fallocate(int fd, loff_t offset, loff_t len, int mode)
> > > 
> > > For details on why and how "int, int, loff_t, loff_t" is a problem on
> > > s390, please see Heiko's mail on 16th March. Here is the link:
> > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html
> > > 
> > > Platform: ppc, arm
> > > ------------------
> > > ppc (32 bit) has a problem with "int, loff_t, loff_t, int" layout,
> > > since this will result in a pad between fd and offset, making seven
> > > arguments total - which is not supported by ppc32. It supports only
> > > 6 arguments. Thus the desired layout by ppc32 is:
> > > 
> > >    int fallocate(int fd, int mode, loff_t offset, loff_t len)
> > > 
> > > Even ARM prefers above kind of layout. For details please see the
> > > definition of sys_arm_sync_file_range().
> > 
> > This is a clean-looking option.  Can s390 be changed to support seven-arg
> > syscalls?
> 
> Wouldn't
> int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm?  glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.

That would be fine for s390.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  7:14                 ` Jakub Jelinek
                                     ` (2 preceding siblings ...)
  2007-03-30  9:15                   ` Paul Mackerras
@ 2007-03-30  9:15                   ` Paul Mackerras
  2007-04-05 11:26                   ` Amit K. Arora
  2007-04-17 12:55                   ` Amit K. Arora
  5 siblings, 0 replies; 340+ messages in thread
From: Paul Mackerras @ 2007-03-30  9:15 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

Jakub Jelinek writes:

> Wouldn't
> int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm?  glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.

That looks fine to me.

Paul.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  7:19                 ` Interface for the new fallocate() " Heiko Carstens
@ 2007-03-30  9:15                   ` Paul Mackerras
  2007-03-30 10:44                       ` Jörn Engel
  2007-03-30 10:44                     ` Jörn Engel
  2007-03-30  9:15                   ` Paul Mackerras
  1 sibling, 2 replies; 340+ messages in thread
From: Paul Mackerras @ 2007-03-30  9:15 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

Heiko Carstens writes:

> If possible I'd prefer the six-32-bit-args approach.

It does mean extra unnecessary work for 64-bit platforms, though...

Paul.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  9:15                   ` Paul Mackerras
@ 2007-03-30 10:44                       ` Jörn Engel
  2007-03-30 10:44                     ` Jörn Engel
  1 sibling, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-03-30 10:44 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Heiko Carstens, Andrew Morton, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Fri, 30 March 2007 19:15:58 +1000, Paul Mackerras wrote:
> Heiko Carstens writes:
> 
> > If possible I'd prefer the six-32-bit-args approach.
> 
> It does mean extra unnecessary work for 64-bit platforms, though...

Wouldn't that work be confined to fallocate()?  If I understand Heiko
correctly, the alternative would slow s390 down for every syscall,
including more performance-critical ones.

Jörn

-- 
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30 10:44                       ` Jörn Engel
  (?)
@ 2007-03-30 12:55                       ` Heiko Carstens
  -1 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-03-30 12:55 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Paul Mackerras, Andrew Morton, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Fri, Mar 30, 2007 at 12:44:49PM +0200, Jörn Engel wrote:
> On Fri, 30 March 2007 19:15:58 +1000, Paul Mackerras wrote:
> > It does mean extra unnecessary work for 64-bit platforms, though...
> 
> Wouldn't that work be confined to fallocate()?  If I understand Heiko
> correctly, the alternative would slow s390 down for every syscall,
> including more performance-critical ones.

That is correct.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  7:14                 ` Jakub Jelinek
                                     ` (3 preceding siblings ...)
  2007-03-30  9:15                   ` Paul Mackerras
@ 2007-04-05 11:26                   ` Amit K. Arora
  2007-04-05 11:44                     ` Amit K. Arora
                                       ` (2 more replies)
  2007-04-17 12:55                   ` Amit K. Arora
  5 siblings, 3 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-05 11:26 UTC (permalink / raw)
  To: Jakub Jelinek, Andrew Morton, torvalds
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> Wouldn't
> int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm?  glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.
 
This should work on all the platforms. The only concern I can think of
here is the convention followed until now, where the entity on which the
kernel performs the action (say an fd, a file/device name, a pid, etc.)
is the first argument of the system call. If we can live with this small
exception, fine.

Or else, we may have to implement the 

  int fd, int mode, loff_t offset, loff_t len

as the layout of arguments here. I think only s390 will have a problem
with this, and we can think of a workaround for it (maybe similar to
what ARM did to implement the sync_file_range() system call):

asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
{
        return sys_fallocate(fd, offset, len, mode);
}


To me both the approaches look slightly unconventional. But, we need to
compromise somewhere to make things work on all the platforms.

Any thoughts on which one of the above we should finalize?

Thanks!
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-05 11:26                   ` Amit K. Arora
@ 2007-04-05 11:44                     ` Amit K. Arora
  2007-04-05 15:50                     ` Randy Dunlap
  2007-04-06  9:58                     ` Andreas Dilger
  2 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-05 11:44 UTC (permalink / raw)
  To: Jakub Jelinek, Andrew Morton, torvalds
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, Apr 05, 2007 at 04:56:19PM +0530, Amit K. Arora wrote:

Correction below:

> asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
> {
>         return sys_fallocate(fd, offset, len, mode);
          return sys_fallocate(fd, mode, offset, len);
> }

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-05 11:26                   ` Amit K. Arora
  2007-04-05 11:44                     ` Amit K. Arora
@ 2007-04-05 15:50                     ` Randy Dunlap
  2007-04-06  9:58                     ` Andreas Dilger
  2 siblings, 0 replies; 340+ messages in thread
From: Randy Dunlap @ 2007-04-05 15:50 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Jakub Jelinek, Andrew Morton, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 5 Apr 2007 16:56:19 +0530 Amit K. Arora wrote:

> On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > Wouldn't
> > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > reorder the arguments as needed, so there is no need to keep fd first.
>  
> This should work on all the platforms. The only concern I can think of
> here is the convention being followed till now, where all the entities on
> which the action has to be performed by the kernel (say fd, file/device
> name, pid etc.) is the first argument of the system call. If we can live
> with the small exception here, fine.
> 
> Or else, we may have to implement the 
> 
>   int fd, int mode, loff_t offset, loff_t len
> 
> as the layout of arguments here. I think only s390 will have a problem
> with this, and we can think of a workaround for it (may be similar to
> what ARM did to implement sync_file_range() system call)   :
> 
> asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
> {
>         return sys_fallocate(fd, offset, len, mode);
> }
> 
> 
> To me both the approaches look slightly unconventional. But, we need to
> compromise somewhere to make things work on all the platforms.
> 
> Any thoughts on which one of the above should we finalize on ?
> 
> Thanks!

If s390 can work around the calling order that easily, I certainly
prefer the more conventional ordering of:

>   int fd, int mode, loff_t offset, loff_t len

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-05 11:26                   ` Amit K. Arora
  2007-04-05 11:44                     ` Amit K. Arora
  2007-04-05 15:50                     ` Randy Dunlap
@ 2007-04-06  9:58                     ` Andreas Dilger
  2 siblings, 0 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-04-06  9:58 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Jakub Jelinek, Andrew Morton, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Apr 05, 2007  16:56 +0530, Amit K. Arora wrote:
> This should work on all the platforms. The only concern I can think of
> here is the convention being followed till now, where all the entities on
> which the action has to be performed by the kernel (say fd, file/device
> name, pid etc.) is the first argument of the system call. If we can live
> with the small exception here, fine.

Yes, it is much cleaner to have fd first, like every other such syscall.

> Or else, we may have to implement the 
> 
>   int fd, int mode, loff_t offset, loff_t len
> 
> as the layout of arguments here. I think only s390 will have a problem
> with this, and we can think of a workaround for it (may be similar to
> what ARM did to implement sync_file_range() system call)   :
> 
> asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
> {
>         return sys_fallocate(fd, offset, len, mode);
> }

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30 10:44                       ` Jörn Engel
@ 2007-04-09 13:01                         ` Paul Mackerras
  -1 siblings, 0 replies; 340+ messages in thread
From: Paul Mackerras @ 2007-04-09 13:01 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Heiko Carstens, Andrew Morton, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Jörn Engel writes:

> Wouldn't that work be confined to fallocate()?  If I understand Heiko
> correctly, the alternative would slow s390 down for every syscall,
> including more performance-critical ones.

The alternative that Jakub suggested wouldn't slow s390 down.

Paul.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-09 13:01                         ` Paul Mackerras
@ 2007-04-09 16:34                           ` Jörn Engel
  -1 siblings, 0 replies; 340+ messages in thread
From: Jörn Engel @ 2007-04-09 16:34 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Heiko Carstens, Andrew Morton, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Mon, 9 April 2007 23:01:42 +1000, Paul Mackerras wrote:
> Jörn Engel writes:
> 
> > Wouldn't that work be confined to fallocate()?  If I understand Heiko
> > correctly, the alternative would slow s390 down for every syscall,
> > including more performance-critical ones.
> 
> The alternative that Jakub suggested wouldn't slow s390 down.

True.  And it appears to be one of the least offensive options we have.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-03-30  7:14                 ` Jakub Jelinek
                                     ` (4 preceding siblings ...)
  2007-04-05 11:26                   ` Amit K. Arora
@ 2007-04-17 12:55                   ` Amit K. Arora
  2007-04-18 13:06                     ` Andreas Dilger
  5 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-04-17 12:55 UTC (permalink / raw)
  To: Andrew Morton, Jakub Jelinek, torvalds
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm, suparna

On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> Wouldn't
> int fallocate(loff_t offset, loff_t len, int fd, int mode)
> work on both s390 and ppc/arm?  glibc will certainly wrap it and
> reorder the arguments as needed, so there is no need to keep fd first.
>

I think more people are comfortable with this approach. Since glibc
will wrap the system call and export the "conventional" interface
(with fd first) to applications, we need not worry about keeping fd first
in kernel code. I am personally fine with this approach.

Still, if people have major concerns, we can think of getting rid of the
"mode" argument itself. Anyhow we may, in future, need to have a policy
based system call (say, for providing the goal block by applications for
performance reasons). "mode" can then be made part of it.

Comments ?
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-17 12:55                   ` Amit K. Arora
@ 2007-04-18 13:06                     ` Andreas Dilger
  2007-04-20 13:51                       ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-04-18 13:06 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andrew Morton, Jakub Jelinek, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm, suparna

On Apr 17, 2007  18:25 +0530, Amit K. Arora wrote:
> On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > Wouldn't
> > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > reorder the arguments as needed, so there is no need to keep fd first.
> 
> I think more people are comfortable with this approach.

Really?  I thought from the last postings that "fd first, wrap on s390"
was better.

> Since glibc
> will wrap the system call and export the "conventional" interface
> (with fd first) to applications, we may not worry about keeping fd first
> in kernel code. I am personally fine with this approach.

It would seem to make more sense to wrap the syscall on those architectures
that can't handle the "conventional" interface (fd first).

> Still, if people have major concerns, we can think of getting rid of the
> "mode" argument itself. Anyhow we may, in future, need to have a policy
> based system call (say, for providing the goal block by applications for
> performance reasons). "mode" can then be made part of it.

We need at least mode="unallocate" or a separate funallocate() call to
allow allocated-but-unwritten blocks to be unallocated without actually
punching out written data.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-18 13:06                     ` Andreas Dilger
@ 2007-04-20 13:51                       ` Amit K. Arora
  2007-04-20 14:59                         ` Jakub Jelinek
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-04-20 13:51 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm, suparna,
	Andrew Morton, Jakub Jelinek, torvalds

On Wed, Apr 18, 2007 at 07:06:00AM -0600, Andreas Dilger wrote:
> On Apr 17, 2007  18:25 +0530, Amit K. Arora wrote:
> > On Fri, Mar 30, 2007 at 02:14:17AM -0500, Jakub Jelinek wrote:
> > > Wouldn't
> > > int fallocate(loff_t offset, loff_t len, int fd, int mode)
> > > work on both s390 and ppc/arm?  glibc will certainly wrap it and
> > > reorder the arguments as needed, so there is no need to keep fd first.
> > 
> > > I think more people are comfortable with this approach.
> 
> Really?  I thought from the last postings that "fd first, wrap on s390"
> was better.
> 
> > Since glibc
> > will wrap the system call and export the "conventional" interface
> > (with fd first) to applications, we may not worry about keeping fd first
> > in kernel code. I am personally fine with this approach.
> 
> It would seem to make more sense to wrap the syscall on those architectures
> that can't handle the "conventional" interface (fd first).

Ok.
In this case we may have to consider the following things:

1) Obviously, for this glibc will have to call the fallocate() syscall
with a different argument order on s390 than on other archs. I think this
should be doable and should not be an issue with glibc folks (right?).

2) We also need to see how strace behaves in this case. With the little
knowledge that I have of strace, I don't think it should depend on the
argument ordering of a system call on different archs (since it uses
ptrace internally and that should take care of it). But it would be
nice if someone could confirm this.

Thanks!
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-20 13:51                       ` Amit K. Arora
@ 2007-04-20 14:59                         ` Jakub Jelinek
  2007-04-24 12:16                           ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Jakub Jelinek @ 2007-04-20 14:59 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andreas Dilger, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	cmm, suparna, Andrew Morton, torvalds

On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
> Ok.
> In this case we may have to consider following things:
> 
> 1) Obviously, for this glibc will have to call fallocate() syscall with
> different arguments on s390, than other archs. I think this should be
> doable and should not be an issue with glibc folks (right?).

glibc can cope with this easily, will just add
sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
the generic Linux implementation.
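
A minimal userspace sketch of what such a per-arch override could look
like, assuming the s390 entry point takes (fd, offset, len, mode) as
proposed earlier in this thread. Everything here is hypothetical:
raw_fallocate_s390() is a stub standing in for the real syscall, and
none of these names come from glibc.

```c
/* Records what reached the (stubbed) raw syscall entry point. */
struct raw_call {
	int fd;
	int mode;
	long long offset;
	long long len;
};

static struct raw_call last_call;

/*
 * Stub for the s390-specific entry point, which would take its
 * arguments as (fd, offset, len, mode). It only records them.
 */
static long raw_fallocate_s390(int fd, long long offset,
			       long long len, int mode)
{
	last_call.fd = fd;
	last_call.mode = mode;
	last_call.offset = offset;
	last_call.len = len;
	return 0;
}

/*
 * What applications would call on every architecture: fd first, then
 * mode. Only this arch-specific wrapper knows about the reordering.
 */
static long fallocate_wrapper_s390(int fd, int mode,
				   long long offset, long long len)
{
	return raw_fallocate_s390(fd, offset, len, mode);
}
```

The point of the sketch is that the reordering stays inside one small
arch-specific file, so callers see the same prototype everywhere.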

> 2) we also need to see how strace behaves in this case. With little
> knowledge that I have of strace, I don't think it should depend on
> argument ordering of a system call on different archs (since it uses
> ptrace internally and that should take care of it). But, it will be
> nice if someone can confirm this.

strace would solve this with #ifdef mess, it already does that in many
places so guess another few lines don't make it significantly worse.

	Jakub

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: Interface for the new fallocate() system call
  2007-04-20 14:59                         ` Jakub Jelinek
@ 2007-04-24 12:16                           ` Amit K. Arora
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-04-24 12:16 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andreas Dilger, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	cmm, suparna, Andrew Morton, torvalds

On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
> > Ok.
> > In this case we may have to consider following things:
> > 
> > 1) Obviously, for this glibc will have to call fallocate() syscall with
> > different arguments on s390, than other archs. I think this should be
> > doable and should not be an issue with glibc folks (right?).
> 
> glibc can cope with this easily, will just add
> sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
> the generic Linux implementation.
> 
> > 2) we also need to see how strace behaves in this case. With little
> > knowledge that I have of strace, I don't think it should depend on
> > argument ordering of a system call on different archs (since it uses
> > ptrace internally and that should take care of it). But, it will be
> > nice if someone can confirm this.
> 
> strace would solve this with #ifdef mess, it already does that in many
> places so guess another few lines don't make it significantly worse.

I will work on the revised fallocate patchset and will post it soon.

Thanks!
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 0/5] fallocate system call
  2007-04-24 12:16                           ` Amit K. Arora
@ 2007-04-26 17:50                             ` Amit K. Arora
  2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
                                                 ` (9 more replies)
  0 siblings, 10 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 17:50 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Based on the discussion, this new patchset uses the following as the
interface for the fallocate() system call:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

It seems that only the s390 architecture has a problem with such a layout
of arguments in fallocate(). Thus for s390, we plan to have a wrapper
(say, sys_s390_fallocate()) for sys_fallocate(), which will get
called by glibc when an application issues a fallocate() system call
on s390. The s390 arch-specific changes will be part of a separate
patch (PATCH 2/5). It would be great if some s390 expert could verify the
patch, since I have not been able to test it on s390 so far.

It was also noted that minor changes might be required to the strace code
to take care of the "different arguments on s390" issue.

Currently we have two modes, FA_ALLOCATE and FA_DEALLOCATE, for
preallocation and deallocation of preallocated blocks respectively. More
modes can be added when required.

ToDos:
=====
1>   Implementation on other architectures (other than i386, x86_64, 
ppc64 and s390(x)) 
2>   A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3>   Changes to glibc,
	a) to support fallocate() system call
	b) so that posix_fallocate() and posix_fallocate64() call
	   fallocate() system call
4>   Changes to XFS to implement the fallocate inode operation


The patches are as follows:

Patch 1/5 : fallocate() implementation in i386, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
@ 2007-04-26 18:03                               ` Amit K. Arora
  2007-05-04  4:29                                 ` Andrew Morton
  2007-05-09 16:01                                 ` Amit K. Arora
  2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
                                                 ` (8 subsequent siblings)
  9 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 18:03 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

NOTE: It is based on 2.6.21 kernel version.

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/powerpc/kernel/sys_ppc32.c  |    7 ++++++
 arch/x86_64/kernel/functionlist  |    1 
 fs/open.c                        |   41 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 ++++++
 include/linux/syscalls.h         |    1 
 10 files changed, 66 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===================================================================
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===================================================================
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (len == 0 || offset < 0)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	if (!S_ISREG(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	if (offset + len > inode->i_sb->s_maxbytes)
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.21/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages		301
 #define __NR_getcpu		302
 #define __NR_epoll_pwait	303
+#define __NR_fallocate		304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		304
+#define __NR_syscalls		305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate		280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/linux/fs.h
===================================================================
--- linux-2.6.21.orig/include/linux/fs.h
+++ linux-2.6.21/include/linux/fs.h
@@ -264,6 +264,12 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * fallocate() modes
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1125,6 +1131,7 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *, int, loff_t, loff_t);
 };
 
 struct seq_file;
Index: linux-2.6.21/include/linux/syscalls.h
===================================================================
--- linux-2.6.21.orig/include/linux/syscalls.h
+++ linux-2.6.21/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.21.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c
@@ -777,6 +777,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
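
The argument checks at the top of sys_fallocate() in the fs/open.c hunk
above can be modelled in userspace as follows. This is a sketch, not
the posted code, and it deviates in two ways that are my own
assumptions: it also rejects a negative len (the patch as posted only
rejects len == 0), and it avoids computing offset + len, which as far
as I can tell could overflow for very large inputs. The function name
and the maxbytes parameter (standing in for s_maxbytes) are
hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate an (offset, len) range against a filesystem size limit.
 * Returns 0 if the range is acceptable, or a negative errno value.
 * maxbytes is assumed to be non-negative.
 */
static long check_fallocate_range(int64_t offset, int64_t len,
				  int64_t maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Compare len against the room left after offset instead of
	 * forming offset + len. With offset >= 0 and maxbytes >= 0 the
	 * subtraction cannot overflow; it goes negative when offset is
	 * already past the limit, which also fails the check.
	 */
	if (len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

A range ending exactly at maxbytes is accepted, matching the ">"
comparison in the patch.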

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 2/5] fallocate() on s390
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
  2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
@ 2007-04-26 18:07                               ` Amit K. Arora
  2007-04-26 18:11                               ` [PATCH 3/5] ext4: Extent overlap bugfix Amit K. Arora
                                                 ` (7 subsequent siblings)
  9 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 18:07 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds support for the fallocate system call on the s390(x)
platform. A wrapper is added to address the issue which the s390 ABI has
with the "preferred" ordering of arguments in this system call (i.e. int,
int, loff_t, loff_t).

I request s390 experts to please review this code and verify that
this patch is correct. Thanks!

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
 arch/s390/kernel/sys_s390.c       |   10 ++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
 	llgtr	%r2,%r2			# char *
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
+
+	.globl	s390_fallocate_wrapper
+s390_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	sllg	%r3,%r3,32		# get high word of 64bit loff_t
+	or	%r3,%r4			# get low word of 64bit loff_t
+	sllg	%r4,%r5,32		# get high word of 64bit loff_t
+	or	%r4,%r6			# get low word of 64bit loff_t
+	llgf	%r5,164(%r15)		# unsigned int
+	jg	s390_fallocate
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -268,6 +268,16 @@ s390_fadvise64_64(struct fadvise64_64_ar
 }
 
 /*
+ * This is a wrapper to call sys_fallocate(). Since s390 ABI has a problem
+ * with the int, int, loff_t, loff_t ordering of arguments, this wrapper
+ * is required.
+ */
+asmlinkage long s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
+{
+	return sys_fallocate(fd, mode, offset, len);
+}
+
+/*
  * Do a system call from kernel instead of calling sys_execve so we
  * end up with proper pt_regs.
  */
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL							/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,s390_fallocate,s390_fallocate_wrapper)
Index: linux-2.6.21/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu		311
 #define __NR_epoll_pwait	312
 #define __NR_utimes		313
+#define __NR_fallocate		314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
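
The sllg/or pairs in s390_fallocate_wrapper above, like the shifts in
compat_sys_fallocate() in PATCH 1/5, rebuild a 64-bit loff_t from two
32-bit halves passed by a 32-bit caller. A small C sketch of that
recombination, with a hypothetical helper name and int64_t standing in
for loff_t:

```c
#include <stdint.h>

/*
 * Join the high and low 32-bit words of a 64-bit file offset.
 * Both halves are unsigned so that the low word is zero-extended,
 * not sign-extended, before it is OR-ed in. The final conversion to
 * a signed type relies on the usual two's-complement behaviour.
 */
static int64_t combine_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Taking the low word as an unsigned 32-bit value matters: if it were a
signed int with its top bit set, sign extension would corrupt the high
word of the result.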

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 3/5] ext4: Extent overlap bugfix
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
  2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
  2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
@ 2007-04-26 18:11                               ` Amit K. Arora
  2007-05-04  4:30                                 ` Andrew Morton
  2007-04-26 18:13                               ` [PATCH 4/5] ext4: fallocate support in ext4 Amit K. Arora
                                                 ` (6 subsequent siblings)
  9 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 18:11 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is a fix for an extent-overlap bug. The fallocate() implementation
on ext4 depends on this bugfix. Although this fix was posted earlier,
it is still not part of mainline code, so I have attached it here too.

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |   50 ++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 49 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,45 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_check_overlap:
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+					struct ext4_extent *newext,
+					struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/* get the next allocated block if the extent in the path
+	 * is before the requested block(s) */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		return 1;
+	}
+out:
+	return 0;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2071,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2040,7 +2087,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);
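
The trimming rule in ext4_ext_check_overlap() can be tried out in user
space. The following is a minimal sketch, not the kernel code: the names
trim_overlap and NO_NEXT_BLOCK are hypothetical, and the caller is
assumed to have already looked up the next allocated logical block
(the job of ext4_ext_next_allocated_block() in the patch).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for "no allocated block after this point". */
#define NO_NEXT_BLOCK UINT32_MAX

/*
 * Trim a proposed extent [start, start + *len) so that it stops before
 * next_allocated, the first already-mapped logical block after start.
 * Returns 1 if *len was shortened, 0 if there was no overlap.
 */
static int trim_overlap(uint32_t start, uint16_t *len, uint32_t next_allocated)
{
	if (next_allocated == NO_NEXT_BLOCK)
		return 0;

	/* Assumption of this sketch: the next mapped block lies after start. */
	assert(next_allocated > start);

	if ((uint64_t)start + *len > next_allocated) {
		*len = (uint16_t)(next_allocated - start);
		return 1;
	}
	return 0;
}
```

For example, a request for 10 blocks at logical block 100 with the next
mapped block at 105 is shortened to 5 blocks, which mirrors how
ext4_ext_get_blocks() then passes the reduced count to the allocator.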

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 4/5] ext4: fallocate support in ext4
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
                                                 ` (2 preceding siblings ...)
  2007-04-26 18:11                               ` [PATCH 3/5] ext4: Extent overlap bugfix Amit K. Arora
@ 2007-04-26 18:13                               ` Amit K. Arora
  2007-05-04  4:31                                 ` Andrew Morton
  2007-04-26 18:16                               ` [PATCH 5/5] ext4: write support for preallocated blocks/extents Amit K. Arora
                                                 ` (5 subsequent siblings)
  9 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 18:13 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch has the ext4 implementation of the fallocate system call.

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  201 +++++++++++++++++++++++++++++++---------
 fs/ext4/file.c                  |    1 
 include/linux/ext4_fs.h         |    7 +
 include/linux/ext4_fs_extents.h |   13 ++
 4 files changed, 179 insertions(+), 43 deletions(-)
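
Most hunks in this patch replace raw reads of ee_len with helpers that
separate the "uninitialized" flag (the top bit) from the block count
(the low 15 bits). A minimal user-space sketch of that encoding follows.
It works on host-order values and ignores the little-endian conversion
the kernel macros perform; the function names are hypothetical, and the
15-bit merge limit is an assumption of the sketch standing in for
EXT_MAX_LEN.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_FLAG 0x8000u	/* top bit of the 16-bit length field */
#define LEN_MASK    0x7FFFu	/* low 15 bits hold the block count */

static uint16_t mark_uninitialized(uint16_t ee_len)
{
	return (uint16_t)(ee_len | UNINIT_FLAG);
}

static int is_uninitialized(uint16_t ee_len)
{
	return (ee_len & UNINIT_FLAG) != 0;
}

static uint16_t actual_len(uint16_t ee_len)
{
	return (uint16_t)(ee_len & LEN_MASK);
}

/*
 * Combine two adjacent length fields. Returns 1 and stores the merged
 * field in *out, or returns 0 if the two states differ or the summed
 * block count does not fit in 15 bits.
 */
static int merge_lengths(uint16_t a, uint16_t b, uint16_t *out)
{
	uint32_t sum = (uint32_t)actual_len(a) + actual_len(b);

	assert(out);
	if (is_uninitialized(a) != is_uninitialized(b))
		return 0;
	if (sum > LEN_MASK)
		return 0;
	/* Writing the plain sum clears the flag, so restore it if needed. */
	*out = is_uninitialized(a) ? mark_uninitialized((uint16_t)sum)
				   : (uint16_t)sum;
	return 1;
}
```

The re-marking step in merge_lengths() corresponds to the pattern used
in the merge paths of the patch: compute the new length from the actual
lengths, then set the flag again if the extent was uninitialized.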

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
-			        le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int depth, len1;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *ex, *fex;
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
-	int depth, len, err, next;
+	int depth, len, err, next, uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/* ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1254,7 +1275,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
-			        le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 		           > le32_to_cpu(nearex->ee_block)) {
@@ -1267,7 +1288,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
-				        le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1280,7 +1301,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1299,8 +1320,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1370,8 +1396,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1383,7 +1409,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1399,7 +1426,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
-		        cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1472,15 +1499,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len));
+			        (unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-		            + le16_to_cpu(ex->ee_len)) {
+		            + ext4_ext_get_actual_len(ex)) {
 	        lblock = le32_to_cpu(ex->ee_block)
-		         + le16_to_cpu(ex->ee_len);
+		         + ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len),
+			        (unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1610,12 +1637,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1629,12 +1656,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1642,12 +1669,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1661,7 +1688,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	struct ext4_extent_header *eh;
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
-	unsigned short ex_ee_len;
+	unsigned short ex_ee_len, uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1676,7 +1703,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1744,6 +1773,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1753,7 +1784,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2029,7 +2060,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2037,8 +2068,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2046,8 +2078,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2089,6 +2124,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
 		goto out2;
@@ -2100,8 +2137,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create!=EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2205,10 +2244,86 @@ int ext4_ext_writepage_trans_blocks(stru
 	return needed;
 }
 
+/*
+ * ext4_fallocate:
+ * preallocate space for a file
+ * mode is for future use, e.g. for unallocating preallocated blocks etc.
+ */
+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	int ret, ret2, nblocks = 0, retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/* Currently supporting (pre)allocate mode _only_ */
+	if (mode != FA_ALLOCATE)
+		return -EOPNOTSUPP;
+
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return -ENOTTY;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+			 - block;
+	mutex_lock(&EXT4_I(inode)->truncate_mutex);
+	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
+	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+	handle=ext4_journal_start(inode, credits +
+					EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+retry:
+	ret = 0;
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		BUG_ON(!ret);
+		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
+			&& ((block + ret) > (i_size_read(inode) >> blkbits)))
+			nblocks = nblocks + ret;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/* Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if ((offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+		/* if no error, we assume preallocation succeeded completely */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+		/* Handle partial allocation scenario */
+			loff_t newsize;
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+	ext4_mark_inode_dirty(handle, inode);
+	ret2 = ext4_journal_stop(handle);
+	if (ret > 0)
+		ret = ret2;
+
+	return ret > 0 ? 0 : ret;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
 EXPORT_SYMBOL(ext4_ext_walk_space);
 EXPORT_SYMBOL(ext4_ext_find_goal);
 EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert);
+EXPORT_SYMBOL(ext4_fallocate);
 
Index: linux-2.6.21/fs/ext4/file.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/file.c
+++ linux-2.6.21/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.21/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs.h
+++ linux-2.6.21/include/linux/ext4_fs.h
@@ -102,6 +102,8 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits) 	(((size)+(1 << blkbits)-1) & \
+							(~((1 << blkbits)-1)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +227,10 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/* Following is used by preallocation logic to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern int ext4_fallocate(struct inode *, int, loff_t, loff_t);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -125,6 +125,19 @@ struct ext4_ext_path {
 #define EXT4_EXT_CACHE_EXTENT	2
 
 /*
+ * Macro-instructions to handle (mark/unmark/check/create) uninitialized
+ * extents. Applications can issue an IOCTL for preallocation, which results
+ * in assigning uninitialized extents to the file.
+ */
+#define ext4_ext_mark_uninitialized(ext)	((ext)->ee_len |= \
+							cpu_to_le16(0x8000))
+#define ext4_ext_is_uninitialized(ext)  	((le16_to_cpu((ext)->ee_len))& \
+									0x8000)
+#define ext4_ext_get_actual_len(ext)		((le16_to_cpu((ext)->ee_len))& \
+									0x7FFF)
+
+
+/*
  * to be called by ext4_ext_walk_space()
  * negative retcode - error
  * positive retcode - signal for ext4_ext_walk_space(), see below
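
ext4_fallocate() above turns the (offset, len) byte range into a
starting logical block and a block count by shifting the offset down
and rounding the end of the range up to a block boundary. The sketch
below reproduces that arithmetic in user space. The struct and function
names are hypothetical, and 64-bit unsigned arithmetic is used
throughout as a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of the block size (1 << blkbits). */
static uint64_t block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t blksize = (uint64_t)1 << blkbits;

	return (size + blksize - 1) & ~(blksize - 1);
}

struct block_range {
	uint64_t first;	/* first logical block touched by the byte range */
	uint64_t count;	/* number of blocks needed to cover the range */
};

static struct block_range byte_range_to_blocks(uint64_t offset, uint64_t len,
					       unsigned int blkbits)
{
	struct block_range r;

	r.first = offset >> blkbits;
	r.count = (block_align(offset + len, blkbits) >> blkbits) - r.first;
	return r;
}
```

With 4096-byte blocks (blkbits = 12), a request at offset 5000 for
10000 bytes starts in block 1 and ends in block 3, so three blocks are
needed even though the byte length is less than three full blocks.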

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 5/5] ext4: write support for preallocated blocks/extents
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
                                                 ` (3 preceding siblings ...)
  2007-04-26 18:13                               ` [PATCH 4/5] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-04-26 18:16                               ` Amit K. Arora
  2007-05-04  4:32                                 ` Andrew Morton
  2007-05-07 12:40                                 ` Pekka Enberg
  2007-04-27 12:10                               ` [PATCH 0/5] fallocate system call Heiko Carstens
                                                 ` (4 subsequent siblings)
  9 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-04-26 18:16 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds write support for blocks/extents preallocated with the
fallocate system call. Preallocated extents in ext4 are marked
"uninitialized", so they need special handling, especially when they
are written to. This patch takes care of that.

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  228 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 202 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+				struct ext4_ext_path *path,
+				struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done=0, uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+					* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+		merge_done = 1;
+		BUG_ON(eh->eh_entries == 0);
+	}
+
+	return merge_done;
+}
+
+
+/*
  * ext4_ext_check_overlap:
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
@@ -1316,25 +1361,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * ext4_ext_convert_to_initialized:
+ * this function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three). At least one initialized extent
+ * and at most two uninitialized extents can result.
+ * There are three possibilities:
+ *   a> No split required: Entire extent should be initialized.
+ *   b> Split into two extents: Only one end of the extent is being written to.
+ *   c> Split into three extents: Someone is writing in middle of the extent.
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0, ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth)
+		{
+			depth=newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	if ((err = ext4_ext_get_access(handle, inode, path + depth)))
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2==ex and ex3==NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2055,6 +2216,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2063,13 +2225,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2078,12 +2236,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2135,6 +2308,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -203,6 +203,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
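
The split that a write into an uninitialized extent can cause (one initialized extent plus up to two uninitialized ones) can be sketched in userspace as below. This is a simplified model, not the patch's code: the struct and function names are hypothetical, the labels ex1/ex2/ex3 are borrowed from the comments in the patch, and no journalling, merging or on-disk encoding is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* One piece of a split extent: logical start, length in blocks, state. */
struct piece {
	uint32_t start;
	uint32_t len;
	int      uninit;	/* 1 = still uninitialized, 0 = initialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) around a
 * write of max_blocks blocks starting at iblock.  Fills out[] (room for
 * three pieces) in logical order and returns the number of pieces.
 * The caller must ensure iblock lies inside the extent.
 */
static int split_uninit_extent(uint32_t ee_block, uint32_t ee_len,
			       uint32_t iblock, uint32_t max_blocks,
			       struct piece out[3])
{
	uint32_t end = ee_block + ee_len;
	uint32_t avail, wlen;
	int n = 0;

	assert(iblock >= ee_block && iblock < end && max_blocks > 0);

	/* The write cannot convert more than what is left in the extent. */
	avail = end - iblock;
	wlen = max_blocks < avail ? max_blocks : avail;

	if (iblock > ee_block)			/* ex1: untouched head */
		out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };

	out[n++] = (struct piece){ iblock, wlen, 0 };	/* ex2: written part */

	if (iblock + wlen < end)		/* ex3: untouched tail */
		out[n++] = (struct piece){ iblock + wlen,
					   end - (iblock + wlen), 1 };

	return n;
}
```

When the write covers the whole extent only the middle piece remains, which corresponds to the "ex2==ex and ex3==NULL" case in which the patch tries a merge towards the right.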

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
                                                 ` (4 preceding siblings ...)
  2007-04-26 18:16                               ` [PATCH 5/5] ext4: write support for preallocated blocks/extents Amit K. Arora
@ 2007-04-27 12:10                               ` Heiko Carstens
  2007-04-27 14:43                                 ` Jörn Engel
  2007-04-30  0:47                               ` David Chinner
                                                 ` (3 subsequent siblings)
  9 siblings, 1 reply; 340+ messages in thread
From: Heiko Carstens @ 2007-04-27 12:10 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote:
> Based on the discussion, this new patchset uses following as the
> interface for fallocate() system call:
> 
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> It seems that only s390 architecture has a problem with such a layout of
> arguments in fallocate(). Thus for s390, we plan to have a wrapper
> (say, sys_s390_fallocate()) for the sys_fallocate(), which will get
> called by glibc when an application issues a fallocate() system call
> on s390. The s390 arch specific changes will be part of a separate
> patch (PATCH 2/5). It will be great if some s390 expert can verify the
> patch, since I have not been able to test it on s390 so far.

After long discussions where at least two possible implementations
were suggested that would work on _all_ architectures you chose one
which doesn't and causes extra effort.

> It was also noted that minor changes might be required to strace code
> to take care of "different arguments on s390" issue.

This is not limited to strace...

Besides that the s390 backend looks ok.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-27 12:10                               ` [PATCH 0/5] fallocate system call Heiko Carstens
@ 2007-04-27 14:43                                 ` Jörn Engel
  2007-04-27 17:46                                     ` Heiko Carstens
  0 siblings, 1 reply; 340+ messages in thread
From: Jörn Engel @ 2007-04-27 14:43 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote:
> 
> After long discussions where at least two possible implementations
> were suggested that would work on _all_ architectures you chose one
> which doesn't and causes extra effort.

I believe the long discussion also showed that every possible
implementation has drawbacks.  To me this one appeared to be the best of
many bad choices.

Is this implementation worse than we thought?

Jörn

-- 
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-27 14:43                                 ` Jörn Engel
@ 2007-04-27 17:46                                     ` Heiko Carstens
  0 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-04-27 17:46 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Fri, Apr 27, 2007 at 04:43:28PM +0200, Jörn Engel wrote:
> On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote:
> > 
> > After long discussions where at least two possible implementations
> > were suggested that would work on _all_ architectures you chose one
> > which doesn't and causes extra effort.
> 
> I believe the long discussion also showed that every possible
> implementation has drawbacks.  To me this one appeared to be the best of
> many bad choices.

If one insists on having fd as the first argument, what is wrong with having
u32 arguments only? It's not that this syscall comes even close to
what can be considered performance critical...

> Is this implementation worse than we thought?

It adds userspace overhead for one architecture. Every *trace and
*libc needs special handling on s390 for this syscall. I would
prefer to avoid this.
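
For illustration, a u32-only interface would have the caller split each 64-bit value into two 32-bit arguments and the kernel side join them again. A minimal sketch of that split and join follows; the helper names are hypothetical and nothing here is taken from the posted patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rebuild a 64-bit file offset from two 32-bit syscall arguments. */
static int64_t join_u32(uint32_t high, uint32_t low)
{
	return (int64_t)(((uint64_t)high << 32) | low);
}

/* What the caller (e.g. libc) would do to produce the two halves. */
static void split_s64(int64_t value, uint32_t *high, uint32_t *low)
{
	assert(high != NULL && low != NULL);

	*high = (uint32_t)((uint64_t)value >> 32);
	*low = (uint32_t)((uint64_t)value & 0xffffffffu);
}
```

The cost of this approach is that offset and len each become two arguments, which is the aesthetic objection raised in the replies.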

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-27 17:46                                     ` Heiko Carstens
  (?)
@ 2007-04-27 20:42                                     ` Chris Wedgwood
  -1 siblings, 0 replies; 340+ messages in thread
From: Chris Wedgwood @ 2007-04-27 20:42 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Jörn Engel, Amit K. Arora, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Fri, Apr 27, 2007 at 07:46:13PM +0200, Heiko Carstens wrote:

> If one insists on having fd as the first argument, what is wrong with
> having u32 arguments only?

Well, I was one of those who objected as it seems *UGLY* to me.

> It's not that this syscall comes even close to what can be
> considered performance critical...

Right.

> It adds userspace overhead for one architecture. Every *trace and
> *libc needs special handling on s390 for this syscall. I would
> prefer to avoid this.

I'm not that bothered about it.  I would prefer it did use clean
64-bit arguments, but given it's a non-critical syscall I don't
think the aesthetics are worth imposing crud on s390 for.



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
                                                 ` (5 preceding siblings ...)
  2007-04-27 12:10                               ` [PATCH 0/5] fallocate system call Heiko Carstens
@ 2007-04-30  0:47                               ` David Chinner
  2007-04-30  3:09                                 ` [PATCH] ia64 fallocate syscall David Chinner
                                                   ` (3 more replies)
  2007-05-14 13:29                                 ` Amit K. Arora
                                                 ` (2 subsequent siblings)
  9 siblings, 4 replies; 340+ messages in thread
From: David Chinner @ 2007-04-30  0:47 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote:
> Based on the discussion, this new patchset uses following as the
> interface for fallocate() system call:
> 
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Ok, so now for the hard questions - what are the semantics of
FA_ALLOCATE and FA_DEALLOCATE?

For FA_ALLOCATE, it's supposed to change the file size if we
allocate past EOF, right? What's the return value supposed to
be? Zero for success, error otherwise? Does this update a/m/ctime
at all? How persistent is this preallocation? Should it be
there "forever" or for the lifetime of the currently open fd
that it was preallocated on?

For FA_DEALLOCATE, does it change the filesize at all? Or does
it just punch a hole in the file? If it does change file size,
what happens when you punch out preallocation beyond EOF?
What's the return value supposed to be?

> Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
> preallocation and deallocation of preallocated blocks respectively. More
> modes can be added, when required.

FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
not change file size) so we can preallocate beyond EOF for apps which
use O_APPEND (i.e. changing file size would cause problems for them).

> ToDos:
> =====
> 1>   Implementation on other architectures (other than i386, x86_64, 
> ppc64 and s390(x)) 

I'll have ia64 soon.

> 2>   A generic file system operation to handle fallocate
> (generic_fallocate), for filesystems that do _not_ have the fallocate
> inode operation implemented.
> 3>   Changes to glibc,
> 	a) to support fallocate() system call
> 	b) so that posix_fallocate() and posix_fallocate64() call
> 	   fallocate() system call
> 4>   Changes to XFS to implement the fallocate inode operation

And that's what I'm doing now, hence all the questions ;)

BTW, do you have a test program for this, or will I need to write
one myself?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
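
No test program appears in this part of the thread. A minimal userspace check of the "does the file size change" question might look like the sketch below. It is an assumption-laden stand-in: it uses posix_fallocate(), which already exists in libc, rather than the proposed fallocate() call, and the scratch file location is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch file, preallocate [offset, offset + len) with
 * posix_fallocate(), and return the resulting st_size (or -1 on error).
 */
static long long size_after_prealloc(off_t offset, off_t len)
{
	char path[] = "/tmp/falloc-test-XXXXXX";	/* hypothetical location */
	struct stat st;
	long long size = -1;
	int fd;

	assert(offset >= 0 && len > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* file vanishes once fd is closed */

	/* posix_fallocate() returns an error number, not -1 with errno */
	if (posix_fallocate(fd, offset, len) == 0 && fstat(fd, &st) == 0)
		size = (long long)st.st_size;

	close(fd);
	return size;
}
```

With posix_fallocate() the reported size is expected to be offset + len when the range ends past EOF. A test of a size-preserving mode would instead expect st_size to stay unchanged and would have to look at st_blocks.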

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH] ia64 fallocate syscall
  2007-04-30  0:47                               ` David Chinner
@ 2007-04-30  3:09                                 ` David Chinner
  2007-04-30  3:11                                 ` [PATCH] XFS ->fallocate() support David Chinner
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-04-30  3:09 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S
===================================================================
--- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S	2007-03-29 19:01:41.000000000 +1000
+++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S	2007-04-27 19:12:56.829396661 +1000
@@ -1612,5 +1612,6 @@ sys_call_table:
 	data8 sys_vmsplice
 	data8 sys_ni_syscall			// reserved for move_pages
 	data8 sys_getcpu
+	data8 sys_fallocate
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h
===================================================================
--- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h	2007-03-29 19:03:37.000000000 +1000
+++ 2.6.x-xfs-new/include/asm-ia64/unistd.h	2007-04-27 19:18:18.215568425 +1000
@@ -293,11 +293,12 @@
 #define __NR_vmsplice			1302
 /* 1303 reserved for move_pages */
 #define __NR_getcpu			1304
+#define __NR_fallocate			1305
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls			281 /* length of syscall table */
+#define NR_syscalls			282 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
 
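
The NR_syscalls bump from 281 to 282 follows from the syscall numbering. The sketch below works through the arithmetic, assuming the ia64 table starts at syscall number 1024. That base is an assumption here, since it is not shown in the hunk, and the macro and function names are hypothetical.

```c
#include <assert.h>

#define IA64_FIRST_SYSCALL	1024	/* assumed base of the ia64 table */
#define NR_FALLOCATE		1305	/* __NR_fallocate from the patch */

/* Table length needed so that syscall number nr has a slot. */
static int table_len_for(int nr)
{
	assert(nr >= IA64_FIRST_SYSCALL);
	return nr - IA64_FIRST_SYSCALL + 1;
}
```

Under that assumption the previous last entry, __NR_getcpu at 1304, needs a table of 281 slots and __NR_fallocate at 1305 needs 282, matching the change above.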

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH] XFS ->fallocate() support
  2007-04-30  0:47                               ` David Chinner
  2007-04-30  3:09                                 ` [PATCH] ia64 fallocate syscall David Chinner
@ 2007-04-30  3:11                                 ` David Chinner
  2007-04-30  3:14                                 ` [PATCH] Add preallocation beyond EOF to fallocate David Chinner
  2007-04-30  5:25                                 ` [PATCH 0/5] fallocate system call Chris Wedgwood
  3 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-04-30  3:11 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

Add XFS support for ->fallocate() vector.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/xfs/linux-2.6/xfs_iops.c |   48 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-02-07 13:24:32.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.225095992 +1000
@@ -812,6 +812,53 @@ xfs_vn_removexattr(
 	return namesp->attr_remove(vp, attr, xflags);
 }
 
+STATIC long
+xfs_vn_fallocate(
+	struct inode	*inode,
+	int		mode,
+	loff_t		offset,
+	loff_t		len)
+{
+	long		error = -EOPNOTSUPP;
+	bhv_vnode_t	*vp = vn_from_inode(inode);
+	bhv_desc_t	*bdp;
+	loff_t		do_setattr = 0;
+	xfs_flock64_t	bf;
+
+	bf.l_whence = 0;
+	bf.l_start = offset;
+	bf.l_len = len;
+
+	bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS,
+						VNODE_POSITION_XFS);
+
+	switch (mode) {
+	case FA_ALLOCATE: /* changes file size */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
+		if (offset + len > i_size_read(inode))
+			do_setattr = offset + len;
+		break;
+	case FA_DEALLOCATE:
+		/* XXX: changes file size?  this just punches a hole */
+		error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP,
+						&bf, 0, NULL, 0);
+		break;
+	default:
+		break;
+	}
+
+	/* Change file size if needed */
+	if (!error && do_setattr) {
+		bhv_vattr_t	va;
+
+		va.va_mask = XFS_AT_SIZE;
+		va.va_size = do_setattr;
+		error = bhv_vop_setattr(vp, &va, 0, NULL);
+	}
+
+	return error;
+}
 
 struct inode_operations xfs_inode_operations = {
 	.permission		= xfs_vn_permission,
@@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat
 	.getxattr		= xfs_vn_getxattr,
 	.listxattr		= xfs_vn_listxattr,
 	.removexattr		= xfs_vn_removexattr,
+	.fallocate		= xfs_vn_fallocate,
 };
 
 struct inode_operations xfs_dir_inode_operations = {
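
The new size handed to setattr is offset + len, a 64-bit loff_t quantity, so whatever variable carries it has to be as wide as loff_t. The check below is a small sketch of the range concern, with a hypothetical helper name. It only shows when a value would no longer fit in a plain int.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if the end offset can be stored in an int without loss. */
static int end_fits_in_int(long long offset, long long len)
{
	long long end;

	assert(offset >= 0 && len > 0);

	end = offset + len;	/* caller guarantees this does not overflow */
	return end <= INT_MAX;
}
```

Any preallocation ending at or beyond 2 GiB fails this check where int is 32 bits.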

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH] Add preallocation beyond EOF to fallocate
  2007-04-30  0:47                               ` David Chinner
  2007-04-30  3:09                                 ` [PATCH] ia64 fallocate syscall David Chinner
  2007-04-30  3:11                                 ` [PATCH] XFS ->fallocate() support David Chinner
@ 2007-04-30  3:14                                 ` David Chinner
  2007-04-30  5:25                                 ` [PATCH 0/5] fallocate system call Chris Wedgwood
  3 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-04-30  3:14 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

Add new mode to ->fallocate() to allow allocation to occur
beyond the current EOF without changing the file size. Implement
in XFS ->fallocate() vector.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/xfs/linux-2.6/xfs_iops.c |    8 +++++---
 include/linux/fs.h          |    1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:02:16.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c	2007-04-30 11:09:48.233375382 +1000
@@ -833,11 +833,13 @@ xfs_vn_fallocate(
 						VNODE_POSITION_XFS);
 
 	switch (mode) {
-	case FA_ALLOCATE: /* changes file size */
-		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
-						&bf, 0, NULL, 0);
+	case FA_ALLOCATE:	 /* changes file size */
 		if (offset + len > i_size_read(inode))
 			do_setattr = offset + len;
+		/* FALL THROUGH */
+	case FA_PREALLOCATE:	/* no filesize change */
+		error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+						&bf, 0, NULL, 0);
 		break;
 	case FA_DEALLOCATE:
 		/* XXX: changes file size?  this just punches a hole */
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h	2007-04-27 18:48:01.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/fs.h	2007-04-30 11:08:05.790903661 +1000
@@ -269,6 +269,7 @@ extern int dir_notify_enable;
  */
 #define FA_ALLOCATE	0x1
 #define FA_DEALLOCATE	0x2
+#define FA_PREALLOCATE	0x3
 
 #ifdef __KERNEL__
 
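
The size bookkeeping of the three modes, as the switch above handles them, can be modelled in a few lines. This is only a sketch of the file-size rule with a hypothetical function name. Block reservation and hole punching themselves are not modelled.

```c
#include <assert.h>

/* Mode values as defined in the patch above. */
#define FA_ALLOCATE	0x1
#define FA_DEALLOCATE	0x2
#define FA_PREALLOCATE	0x3

/*
 * Model of the size bookkeeping only: given the current file size and
 * a request, return the file size after a successful call, or -1 for
 * a mode the switch does not handle.
 */
static long long size_after(int mode, long long size,
			    long long offset, long long len)
{
	assert(size >= 0 && offset >= 0 && len > 0);

	switch (mode) {
	case FA_ALLOCATE:	/* grows the file when the range ends past EOF */
		return offset + len > size ? offset + len : size;
	case FA_PREALLOCATE:	/* blocks reserved, size untouched */
	case FA_DEALLOCATE:	/* hole punched, size untouched */
		return size;
	default:
		return -1;
	}
}
```

Note that 0x3 equals FA_ALLOCATE | FA_DEALLOCATE, so these values only work as an enumeration tested with switch, not as flags that can be OR-ed together.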

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-30  0:47                               ` David Chinner
                                                   ` (2 preceding siblings ...)
  2007-04-30  3:14                                 ` [PATCH] Add preallocation beyond EOF to fallocate David Chinner
@ 2007-04-30  5:25                                 ` Chris Wedgwood
  2007-04-30  5:56                                   ` David Chinner
  2007-05-02 12:53                                   ` Amit K. Arora
  3 siblings, 2 replies; 340+ messages in thread
From: Chris Wedgwood @ 2007-04-30  5:25 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:

> For FA_ALLOCATE, it's supposed to change the file size if we
> allocate past EOF, right?

I would argue no.  Use truncate for that.

> For FA_DEALLOCATE, does it change the filesize at all?

Same as above.

> Or does
> it just punch a hole in the file?

Yes.

> FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does
> not change file size) so we can preallocate beyond EOF for apps
> which use O_APPEND (i.e. changing file size would cause problems for
> them).

FA_ALLOCATE should be able to allocate past-EOF I would argue.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-30  5:25                                 ` [PATCH 0/5] fallocate system call Chris Wedgwood
@ 2007-04-30  5:56                                   ` David Chinner
  2007-04-30  6:01                                     ` Chris Wedgwood
  2007-05-02 12:53                                   ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-04-30  5:56 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: David Chinner, Amit K. Arora, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
> 
> > For FA_ALLOCATE, it's supposed to change the file size if we
> > allocate past EOF, right?
> 
> I would argue no.  Use truncate for that.

I'm going from the ext4 implementation because the semantics
have not been documented yet.

IIRC, the argument for FA_ALLOCATE changing file size is that
posix_fallocate() is supposed to change the file size. I think
that having a mode for real preallocation and another for
posix_fallocate is a valid thing to do...

Note that the way XFS implements growing the file size after the
allocation is via a truncate....

> > For FA_DEALLOCATE, does it change the filesize at all?
> 
> Same as above.
> 
> > Or does
> > it just punch a hole in the file?
> 
> Yes.

That's what I did because otherwise you'd use ftruncate64().
Without documented behaviour or an ext4 implementation, I have to
ask what it's supposed to do, though ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-30  5:56                                   ` David Chinner
@ 2007-04-30  6:01                                     ` Chris Wedgwood
  0 siblings, 0 replies; 340+ messages in thread
From: Chris Wedgwood @ 2007-04-30  6:01 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Mon, Apr 30, 2007 at 03:56:32PM +1000, David Chinner wrote:
> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:

> IIRC, the argument for FA_ALLOCATE changing file size is that
> posix_fallocate() is supposed to change the file size.

But it's not posix_fallocate; it's something more generic. glibc can
do posix_fallocate using truncate + fallocate.

> Note that the way XFS implements growing the file size after the
> allocation is via a truncate....

What's wrong with that?  That seems very reasonable.

> That's what I did because otherwise you'd use ftruncate64().
> Without documented behaviour or an ext4 implementation, I have to
> ask what it's supposed to do, though ;)

How many *real* users are there for ext4?  Why does 'what ext4 does'
define 'the semantics'?

Surely semantics should be decided either by precedent (if there is an
existing relevant userbase) or sensible thought and some debate?
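
The "truncate + fallocate" composition suggested above can be sketched in userspace as follows. The sketch assumes the fallocate() wrapper and the FALLOC_FL_KEEP_SIZE flag that glibc provides on current Linux systems, which is not the FA_* interface under discussion in this thread, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Emulate posix_fallocate(): reserve blocks without touching the file
 * size, then grow the file with ftruncate() only if the range ends
 * past EOF.  Returns 0 or an error number, like posix_fallocate().
 */
static int fallocate_then_truncate(int fd, off_t offset, off_t len)
{
	struct stat st;

	assert(offset >= 0 && len > 0);

	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) != 0)
		return errno;
	if (fstat(fd, &st) != 0)
		return errno;
	if (offset + len > st.st_size && ftruncate(fd, offset + len) != 0)
		return errno;
	return 0;
}
```

The two steps are not atomic, so a concurrent writer could change the size between fstat() and ftruncate(). That is one of the trade-offs of leaving the size update to userspace.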

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-04-30  5:25                                 ` [PATCH 0/5] fallocate system call Chris Wedgwood
  2007-04-30  5:56                                   ` David Chinner
@ 2007-05-02 12:53                                   ` Amit K. Arora
  2007-05-03 10:34                                     ` Andreas Dilger
  1 sibling, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-02 12:53 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: David Chinner, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
> 
> > For FA_ALLOCATE, it's supposed to change the file size if we
> > allocate past EOF, right?
> 
> I would argue no.  Use truncate for that.

The patch I posted for ext4 *does* change the filesize after
preallocation, if required (i.e. when preallocation is after EOF).
I may have to change that, if we decide on not doing this.

--
Regards,
Amit Arora 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-05-02 12:53                                   ` Amit K. Arora
@ 2007-05-03 10:34                                     ` Andreas Dilger
  2007-05-03 11:22                                       ` Miquel van Smoorenburg
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-05-03 10:34 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Chris Wedgwood, David Chinner, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On May 02, 2007  18:23 +0530, Amit K. Arora wrote:
> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
> > 
> > > For FA_ALLOCATE, it's supposed to change the file size if we
> > > allocate past EOF, right?
> > 
> > I would argue no.  Use truncate for that.
> 
> The patch I posted for ext4 *does* change the filesize after
> preallocation, if required (i.e. when preallocation is after EOF).
> I may have to change that, if we decide on not doing this.

I think I'd agree - it may be useful to allow preallocation beyond EOF
for some kinds of applications (e.g. PVR preallocating live TV in 10
minute segments or something, but not knowing in advance how long the
show will actually be recorded or the final encoded size).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-05-03 10:34                                     ` Andreas Dilger
@ 2007-05-03 11:22                                       ` Miquel van Smoorenburg
  2007-05-08  2:26                                         ` David Chinner
  0 siblings, 1 reply; 340+ messages in thread
From: Miquel van Smoorenburg @ 2007-05-03 11:22 UTC (permalink / raw)
  To: adilger; +Cc: linux-kernel

In article <20070503103425.GE6220@schatzie.adilger.int> you write:
>On May 02, 2007  18:23 +0530, Amit K. Arora wrote:
>> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
>> > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
>> > 
>> > > For FA_ALLOCATE, it's supposed to change the file size if we
>> > > allocate past EOF, right?
>> > 
>> > I would argue no.  Use truncate for that.
>> 
>> The patch I posted for ext4 *does* change the filesize after
>> preallocation, if required (i.e. when preallocation is after EOF).
>> I may have to change that, if we decide on not doing this.
>
>I think I'd agree - it may be useful to allow preallocation beyond EOF
>for some kinds of applications (e.g. PVR preallocating live TV in 10
>minute segments or something, but not knowing in advance how long the
>show will actually be recorded or the final encoded size).

I have an application (diablo dreader) where the header-info database
basically consists of ~40.000 files, one for each group (it's more
complicated than that, but never mind that now).

If you grow those files randomly by a few hundred bytes every update,
the filesystem gets hopelessly fragmented.

I'm using XFS with preallocation turned on, and biosize=18 (which
makes it preallocate in blocks of 256KB), and a homebrew patch that
leaves the preallocated space on disk preallocated even if the
file is closed .. and it helps enormously.

Mike.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-04  4:29                                 ` Andrew Morton
  2007-05-04  4:41                                   ` Paul Mackerras
                                                     ` (3 more replies)
  2007-05-09 16:01                                 ` Amit K. Arora
  1 sibling, 4 replies; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  4:29 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> This patch implements the fallocate() system call and adds support for
> i386, x86_64 and powerpc.
> 
> ...
>
> +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

Please add a comment over this function which specifies its behaviour. 
Really it should be enough material from which a full manpage can be
written.

If that's all too much, this material should at least be spelled out in the
changelog.  Because there's no way in which this change can be fully
reviewed unless someone (ie: you) tells us what it is setting out to
achieve.

If we 100% implement some standard then a URL for what we claim to
implement would suffice.  Given that we're at least using different types from
posix I doubt if such a thing would be sufficient.

And given the complexity and potential variability within the filesystem
implementations of this, I'd expect that _something_ additional needs to be
said?

> +{
> +	struct file *file;
> +	struct inode *inode;
> +	long ret = -EINVAL;
> +
> +	if (len == 0 || offset < 0)
> +		goto out;

The posix spec implies that negative `len' is permitted - presumably "allocate
ahead of `offset'".  How peculiar.

> +	ret = -EBADF;
> +	file = fget(fd);
> +	if (!file)
> +		goto out;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		goto out_fput;
> +
> +	inode = file->f_path.dentry->d_inode;
> +
> +	ret = -ESPIPE;
> +	if (S_ISFIFO(inode->i_mode))
> +		goto out_fput;
> +
> +	ret = -ENODEV;
> +	if (!S_ISREG(inode->i_mode))
> +		goto out_fput;

So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
seems a bit silly of them.

> +	ret = -EFBIG;
> +	if (offset + len > inode->i_sb->s_maxbytes)
> +		goto out_fput;

This code does handle offset+len going negative, but only by accident, I
suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
here would settle the reader's mind.

> +	if (inode->i_op && inode->i_op->fallocate)
> +		ret = inode->i_op->fallocate(inode, mode, offset, len);
> +	else
> +		ret = -ENOSYS;

If we _are_ going to support negative `len', as posix suggests, I think we
should perform the appropriate sanity conversions to `offset' and `len'
right here, rather than expecting each filesystem to do it.

If we're not going to handle negative `len' then we should check for it.

> +out_fput:
> +	fput(file);
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(sys_fallocate);

I don't believe this needs to be exported to modules?

> +/*
> + * fallocate() modes
> + */
> +#define FA_ALLOCATE	0x1
> +#define FA_DEALLOCATE	0x2

Now those aren't in posix.  They should be documented, along with their
expected semantics.

>  #ifdef __KERNEL__
>  
>  #include <linux/linkage.h>
> @@ -1125,6 +1131,7 @@ struct inode_operations {
>  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
>  	int (*removexattr) (struct dentry *, const char *);
>  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	long (*fallocate)(struct inode *, int, loff_t, loff_t);

I really do think it's better to put the variable names in definitions such
as this.  Especially when we have two identically-typed variables next to
each other like that.  Quick: which one is the offset and which is the
length?
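
The review points about negative `len' and about offset+len wrapping could be addressed with explicit checks along the following lines. This is a userspace sketch under one particular choice, namely rejecting negative `len' rather than supporting it. The function name is hypothetical and the error codes simply mirror those in the quoted code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Validate an (offset, len) pair against a filesystem size limit.
 * Returns 0 if the range is acceptable, otherwise a negative errno.
 * Negative or zero len is rejected outright in this sketch.
 */
static int check_falloc_range(int64_t offset, int64_t len, uint64_t maxbytes)
{
	assert(maxbytes <= (uint64_t)INT64_MAX);

	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* offset + len cannot overflow once len <= INT64_MAX - offset */
	if (len > INT64_MAX - offset)
		return -EFBIG;
	if ((uint64_t)(offset + len) > maxbytes)
		return -EFBIG;

	return 0;
}
```

Doing the overflow test before the addition means the result no longer depends on s_maxbytes happening to have an unsigned type.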




* Re: [PATCH 3/5] ext4: Extent overlap bugfix
  2007-04-26 18:11                               ` [PATCH 3/5] ext4: Extent overlap bugfix Amit K. Arora
@ 2007-05-04  4:30                                 ` Andrew Morton
  2007-05-07 11:46                                   ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  4:30 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> +unsigned int ext4_ext_check_overlap(struct inode *inode,
> +					struct ext4_extent *newext,
> +					struct ext4_ext_path *path)
> +{
> +	unsigned long b1, b2;
> +	unsigned int depth, len1;
> +
> +	b1 = le32_to_cpu(newext->ee_block);
> +	len1 = le16_to_cpu(newext->ee_len);
> +	depth = ext_depth(inode);
> +	if (!path[depth].p_ext)
> +		goto out;
> +	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> +
> +	/* get the next allocated block if the extent in the path
> +	 * is before the requested block(s) */
> +	if (b2 < b1) {
> +		b2 = ext4_ext_next_allocated_block(path);
> +		if (b2 == EXT_MAX_BLOCK)
> +			goto out;
> +	}
> +
> +	if (b1 + len1 > b2) {

Are we sure that b1+len1 cannot wrap through zero here?
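
A hedged sketch of a wrap-free form of this comparison, assuming 32-bit
logical block numbers (on a 32-bit build `unsigned long' is 32 bits wide).
The helper demo_extent_overlaps() is hypothetical and not part of the patch;
whether such block numbers can occur in ext4 is exactly the open question:

```c
#include <stdint.h>

/*
 * Hypothetical helper: does an extent of len1 blocks starting at logical
 * block b1 run into the next allocated block b2?
 */
static int demo_extent_overlaps(uint32_t b1, uint16_t len1, uint32_t b2)
{
	/* Next allocated block lies before the new extent: treat as overlap. */
	if (b2 < b1)
		return 1;

	/*
	 * b2 >= b1 here, so b2 - b1 cannot wrap.  This is the same test as
	 * b1 + len1 > b2, but without a sum that could overflow 32 bits.
	 */
	return len1 > b2 - b1;
}
```

For b1 = 0xFFFFFFF0, len1 = 0x20 and b2 = 0xFFFFFFF8 the 32-bit sum
b1 + len1 wraps to 0x10 and the naive comparison reports no overlap, while
the subtraction form reports one.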

> +		newext->ee_len = cpu_to_le16(b2 - b1);
> +		return 1;
> +	}


* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-04-26 18:13                               ` [PATCH 4/5] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-05-04  4:31                                 ` Andrew Morton
  2007-05-07 11:37                                   ` Andreas Dilger
  2007-05-07 12:07                                   ` Amit K. Arora
  0 siblings, 2 replies; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  4:31 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> This patch has the ext4 implemtation of fallocate system call.
> 
> ...
>
> +		/* ext4_can_extents_be_merged should have checked that either
> +		 * both extents are uninitialized, or both aren't. Thus we
> +		 * need to check only one of them here.
> +		 */

Please always format multiline comments like this:

		/*
		 * ext4_can_extents_be_merged should have checked that either
		 * both extents are uninitialized, or both aren't. Thus we
		 * need to check only one of them here.
		 */

> ...
>
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */

This description is rather thin.  What is the filesystem's actual behaviour
here?  If the file is using extents then the implementation will do
<something>.  If the file is using bitmaps then we will do <something else>.

But what?   Here is where it should be described.

> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
> +	handle_t *handle;
> +	ext4_fsblk_t block, max_blocks;
> +	int ret, ret2, nblocks = 0, retries = 0;
> +	struct buffer_head map_bh;
> +	unsigned int credits, blkbits = inode->i_blkbits;
> +
> +	/* Currently supporting (pre)allocate mode _only_ */
> +	if (mode != FA_ALLOCATE)
> +		return -EOPNOTSUPP;
> +
> +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> +		return -ENOTTY;

So we don't implement fallocate on bitmap-based files!  Well that's huge
news.  The changelog would be an appropriate place to communicate this,
along with reasons why, or a description of the plan to fix it.

Also, posix says nothing about fallocate() returning ENOTTY.

> +	block = offset >> blkbits;
> +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> +			 - block;
> +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);

Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
space, and that this disk space will require an arbitrary amount of
metadata, how can we work out how much journal space we'll be needing
without at least looking at `len'?

> +	handle=ext4_journal_start(inode, credits +

Please always put spaces around "="

> +					EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);

And around "+"

> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +retry:
> +	ret = 0;
> +	while (ret >= 0 && ret < max_blocks) {
> +		block = block + ret;
> +		max_blocks = max_blocks - ret;
> +		ret = ext4_ext_get_blocks(handle, inode, block,
> +					  max_blocks, &map_bh,
> +					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +		BUG_ON(!ret);

BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
ext4_error() would be safer and more useful here.

> +		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)

Use buffer_new() here.   A separate patch which fixes the three existing
instances of open-coded BH_foo usage would be appreciated.

> +			&& ((block + ret) > (i_size_read(inode) << blkbits)))

Check for wrap through the sign bit and through zero please.

> +			nblocks = nblocks + ret;
> +	}
> +
> +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +		goto retry;
> +
> +	/* Time to update the file size.
> +	 * Update only when preallocation was requested beyond the file size.
> +	 */

Fix comment layout.

> +	if ((offset + len) > i_size_read(inode)) {

Both the lhs and the rhs here are signed.  Please review for possible
overflows through the sign bit and through zero.  Perhaps a comment
explaining why it's correct would be appropriate.


> +		if (ret > 0) {
> +		/* if no error, we assume preallocation succeeded completely */
> +			mutex_lock(&inode->i_mutex);
> +			i_size_write(inode, offset + len);
> +			EXT4_I(inode)->i_disksize = i_size_read(inode);
> +			mutex_unlock(&inode->i_mutex);
> +		} else if (ret < 0 && nblocks) {
> +		/* Handle partial allocation scenario */

The above two comments should be indented one additional tabstop.

> +			loff_t newsize;
> +			mutex_lock(&inode->i_mutex);
> +			newsize  = (nblocks << blkbits) + i_size_read(inode);
> +			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
> +			EXT4_I(inode)->i_disksize = i_size_read(inode);
> +			mutex_unlock(&inode->i_mutex);
> +		}
> +	}
> +	ext4_mark_inode_dirty(handle, inode);
> +	ret2 = ext4_journal_stop(handle);
> +	if (ret > 0)
> +		ret = ret2;
> +
> +	return ret > 0 ? 0 : ret;
> +}
> +
>  EXPORT_SYMBOL(ext4_mark_inode_dirty);
>  EXPORT_SYMBOL(ext4_ext_invalidate_cache);
>  EXPORT_SYMBOL(ext4_ext_insert_extent);
>  EXPORT_SYMBOL(ext4_ext_walk_space);
>  EXPORT_SYMBOL(ext4_ext_find_goal);
>  EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert);
> +EXPORT_SYMBOL(ext4_fallocate);
>  
> Index: linux-2.6.21/fs/ext4/file.c
> ===================================================================
> --- linux-2.6.21.orig/fs/ext4/file.c
> +++ linux-2.6.21/fs/ext4/file.c
> @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
>  	.removexattr	= generic_removexattr,
>  #endif
>  	.permission	= ext4_permission,
> +	.fallocate	= ext4_fallocate,
>  };
>  
> Index: linux-2.6.21/include/linux/ext4_fs.h
> ===================================================================
> --- linux-2.6.21.orig/include/linux/ext4_fs.h
> +++ linux-2.6.21/include/linux/ext4_fs.h
> @@ -102,6 +102,8 @@
>  				 EXT4_GOOD_OLD_FIRST_INO : \
>  				 (s)->s_first_ino)
>  #endif
> +#define EXT4_BLOCK_ALIGN(size, blkbits) 	(((size)+(1 << blkbits)-1) & \
> +							(~((1 << blkbits)-1)))

Maybe a comment describing what this does?  Probably it's obvious enough.

I think it could use the standard ALIGN macro.

Is blkbits sufficiently parenthesised here?  Even if it is, adding the
parens would be better practice.
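
A standalone sketch of a fully parenthesised variant, next to an inline
function doing the same rounding.  The DEMO_/demo_ names are hypothetical,
and using a 64-bit constant for the shift is a choice made for this sketch,
not something taken from the patch:

```c
#include <stdint.h>

/* Round size up to the next multiple of the block size (1 << blkbits). */
#define DEMO_BLOCK_ALIGN(size, blkbits) \
	(((size) + (1ULL << (blkbits)) - 1) & ~((1ULL << (blkbits)) - 1))

/* The same rounding as a typed inline function, free of macro pitfalls. */
static inline uint64_t demo_block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t mask = (UINT64_C(1) << blkbits) - 1;

	return (size + mask) & ~mask;
}
```

The parentheses matter for arguments such as `flags & 0xf': without them,
`1 << flags & 0xf' parses as `(1 << flags) & 0xf' because `<<' binds more
tightly than `&'.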

>  /*
>   * Macro-instructions used to manage fragments
> @@ -225,6 +227,10 @@ struct ext4_new_group_data {
>  	__u32 free_blocks_count;
>  };
>  
> +/* Following is used by preallocation logic to tell get_blocks() that we
> + * want uninitialzed extents.
> + */

Please convert all newly-added multiline comments to the preferred layout.

> +#define EXT4_CREATE_UNINITIALIZED_EXT		2
>  
>  /*
>   * ioctl commands
> @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t 
>  extern void ext4_ext_truncate(struct inode *, struct page *);
>  extern void ext4_ext_init(struct super_block *);
>  extern void ext4_ext_release(struct super_block *);
> +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t);

argh.  And feel free to give these args some useful names.

>  static inline int
>  ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
>  			unsigned long max_blocks, struct buffer_head *bh,
> Index: linux-2.6.21/include/linux/ext4_fs_extents.h
> ===================================================================
> --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
> +++ linux-2.6.21/include/linux/ext4_fs_extents.h
> @@ -125,6 +125,19 @@ struct ext4_ext_path {
>  #define EXT4_EXT_CACHE_EXTENT	2
>  
>  /*
> + * Macro-instructions to handle (mark/unmark/check/create) unitialized
> + * extents. Applications can issue an IOCTL for preallocation, which results
> + * in assigning unitialized extents to the file.
> + */
> +#define ext4_ext_mark_uninitialized(ext)	((ext)->ee_len |= \
> +							cpu_to_le16(0x8000))
> +#define ext4_ext_is_uninitialized(ext)  	((le16_to_cpu((ext)->ee_len))& \
> +									0x8000)
> +#define ext4_ext_get_actual_len(ext)		((le16_to_cpu((ext)->ee_len))& \
> +									0x7FFF)

inlined C functions are preferred, and I think these could be implemented
that way.
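
A rough userspace sketch of what the inline-function form could look like.
It assumes a simplified extent struct whose ee_len is kept in host byte
order, so the le16 conversions of the real code are deliberately left out;
all demo_* names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_UNINIT_FLAG	0x8000u	/* MSB of the 16-bit length field */
#define DEMO_LEN_MASK		0x7FFFu	/* remaining 15 bits: real length */

/* Hypothetical stand-in for struct ext4_extent (host-endian ee_len). */
struct demo_extent {
	uint32_t ee_block;
	uint16_t ee_len;
};

static inline void demo_ext_mark_uninitialized(struct demo_extent *ext)
{
	ext->ee_len = (uint16_t)(ext->ee_len | DEMO_UNINIT_FLAG);
}

static inline bool demo_ext_is_uninitialized(const struct demo_extent *ext)
{
	return (ext->ee_len & DEMO_UNINIT_FLAG) != 0;
}

static inline uint16_t demo_ext_get_actual_len(const struct demo_extent *ext)
{
	return (uint16_t)(ext->ee_len & DEMO_LEN_MASK);
}
```

Compared with the macros, the compiler now checks the argument type and the
argument is evaluated exactly once.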



* Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
  2007-04-26 18:16                               ` [PATCH 5/5] ext4: write support for preallocated blocks/extents Amit K. Arora
@ 2007-05-04  4:32                                 ` Andrew Morton
  2007-05-07 12:11                                   ` Amit K. Arora
  2007-05-07 12:40                                 ` Pekka Enberg
  1 sibling, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  4:32 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> This patch adds write support for preallocated (using fallocate system
> call) blocks/extents. The preallocated extents in ext4 are marked
> "uninitialized", hence they need special handling especially while
> writing to them. This patch takes care of that.
> 
> ...
>
>  /*
> + * ext4_ext_try_to_merge:
> + * tries to merge the "ex" extent to the next extent in the tree.
> + * It always tries to merge towards right. If you want to merge towards
> + * left, pass "ex - 1" as argument instead of "ex".
> + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> + * 1 if they got merged.

OK.

> + */
> +int ext4_ext_try_to_merge(struct inode *inode,
> +				struct ext4_ext_path *path,
> +				struct ext4_extent *ex)
> +{
> +	struct ext4_extent_header *eh;
> +	unsigned int depth, len;
> +	int merge_done=0, uninitialized = 0;

space around "=", please.

Many people prefer not to do the multiple-definitions-per-line, btw:

	int merge_done = 0;
	int uninitialized = 0;

reasons:

- It gives you some space for a nice comment

- It makes patches much more readable, and it makes rejects easier to fix

- standardisation.

> +	depth = ext_depth(inode);
> +	BUG_ON(path[depth].p_hdr == NULL);
> +	eh = path[depth].p_hdr;
> +
> +	while (ex < EXT_LAST_EXTENT(eh)) {
> +		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> +			break;
> +		/* merge with next extent! */
> +		if (ext4_ext_is_uninitialized(ex))
> +			uninitialized = 1;
> +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> +					+ ext4_ext_get_actual_len(ex + 1));
> +		if (uninitialized)
> +			ext4_ext_mark_uninitialized(ex);
> +
> +		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> +			len = (EXT_LAST_EXTENT(eh) - ex - 1)
> +					* sizeof(struct ext4_extent);
> +			memmove(ex + 1, ex + 2, len);
> +		}
> +		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);

Kernel convention is to put spaces around "-"

> +		merge_done = 1;
> +		BUG_ON(eh->eh_entries == 0);

eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
better to warn and run ext4_error() here?

> +	}
> +
> +	return merge_done;
> +}
> +
> +
>
> ...
>
> +/*
> + * ext4_ext_convert_to_initialized:
> + * this function is called by ext4_ext_get_blocks() if someone tries to write
> + * to an uninitialized extent. It may result in splitting the uninitialized
> + * extent into multiple extents (upto three). Atleast one initialized extent
> + * and atmost two uninitialized extents can result.

There are some typos here

> + * There are three possibilities:
> + *   a> No split required: Entire extent should be initialized.
> + *   b> Split into two extents: Only one end of the extent is being written to.
> + *   c> Split into three extents: Somone is writing in middle of the extent.

and here

> + */
> +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> +					struct ext4_ext_path *path,
> +					ext4_fsblk_t iblock,
> +					unsigned long max_blocks)
> +{
> +	struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> +	struct ext4_extent_header *eh;
> +	unsigned int allocated, ee_block, ee_len, depth;
> +	ext4_fsblk_t newblock;
> +	int err = 0, ret = 0;
> +
> +	depth = ext_depth(inode);
> +	eh = path[depth].p_hdr;
> +	ex = path[depth].p_ext;
> +	ee_block = le32_to_cpu(ex->ee_block);
> +	ee_len = ext4_ext_get_actual_len(ex);
> +	allocated = ee_len - (iblock - ee_block);
> +	newblock = iblock - ee_block + ext_pblock(ex);
> +	ex2 = ex;
> +
> +	/* ex1: ee_block to iblock - 1 : uninitialized */
> +	if (iblock > ee_block) {
> +		ex1 = ex;
> +		ex1->ee_len = cpu_to_le16(iblock - ee_block);
> +		ext4_ext_mark_uninitialized(ex1);
> +		ex2 = &newex;
> +	}
> +	/* for sanity, update the length of the ex2 extent before
> +	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
> +	 * overlap of blocks.
> +	 */
> +	if (!ex1 && allocated > max_blocks)
> +		ex2->ee_len = cpu_to_le16(max_blocks);
> +	/* ex3: to ee_block + ee_len : uninitialised */
> +	if (allocated > max_blocks) {
> +		unsigned int newdepth;
> +		ex3 = &newex;
> +		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
> +		ext4_ext_store_pblock(ex3, newblock + max_blocks);
> +		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
> +		ext4_ext_mark_uninitialized(ex3);
> +		err = ext4_ext_insert_extent(handle, inode, path, ex3);
> +		if (err)
> +			goto out;
> +		/* The depth, and hence eh & ex might change
> +		 * as part of the insert above.
> +		 */
> +		newdepth = ext_depth(inode);
> +		if (newdepth != depth)
> +		{

Use

		if (newdepth != depth) {

> +			depth=newdepth;

spaces

> +			path = ext4_ext_find_extent(inode, iblock, NULL);
> +			if (IS_ERR(path)) {
> +				err = PTR_ERR(path);
> +				path = NULL;
> +				goto out;
> +			}
> +			eh = path[depth].p_hdr;
> +			ex = path[depth].p_ext;
> +			if (ex2 != &newex)
> +				ex2 = ex;
> +		}
> +		allocated = max_blocks;
> +	}
> +	/* If there was a change of depth as part of the
> +	 * insertion of ex3 above, we need to update the length
> +	 * of the ex1 extent again here
> +	 */
> +	if (ex1 && ex1 != ex) {
> +		ex1 = ex;
> +		ex1->ee_len = cpu_to_le16(iblock - ee_block);
> +		ext4_ext_mark_uninitialized(ex1);
> +		ex2 = &newex;
> +	}
> +	/* ex2: iblock to iblock + maxblocks-1 : initialised */
> +	ex2->ee_block = cpu_to_le32(iblock);
> +	ex2->ee_start = cpu_to_le32(newblock);
> +	ext4_ext_store_pblock(ex2, newblock);
> +	ex2->ee_len = cpu_to_le16(allocated);
> +	if (ex2 != ex)
> +		goto insert;
> +	if ((err = ext4_ext_get_access(handle, inode, path + depth)))
> +		goto out;

The preferred style is

	err = ext4_ext_get_access(handle, inode, path + depth);
	if (err)
		goto out;

> +	/* New (initialized) extent starts from the first block
> +	 * in the current extent. i.e., ex2 == ex
> +	 * We have to see if it can be merged with the extent
> +	 * on the left.
> +	 */
> +	if (ex2 > EXT_FIRST_EXTENT(eh)) {
> +		/* To merge left, pass "ex2 - 1" to try_to_merge(),
> +		 * since it merges towards right _only_.
> +		 */
> +		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
> +		if (ret) {
> +			err = ext4_ext_correct_indexes(handle, inode, path);
> +			if (err)
> +				goto out;
> +			depth = ext_depth(inode);
> +			ex2--;
> +		}
> +	}
> +	/* Try to Merge towards right. This might be required
> +	 * only when the whole extent is being written to.
> +	 * i.e. ex2==ex and ex3==NULL.
> +	 */
> +	if (!ex3) {
> +		ret = ext4_ext_try_to_merge(inode, path, ex2);
> +		if (ret) {
> +			err = ext4_ext_correct_indexes(handle, inode, path);
> +			if (err)
> +				goto out;
> +		}
> +	}
> +	/* Mark modified extent as dirty */
> +	err = ext4_ext_dirty(handle, inode, path + depth);
> +	goto out;
> +insert:
> +	err = ext4_ext_insert_extent(handle, inode, path, &newex);
> +out:
> +	return err ? err : allocated;
> +}

Sigh.  I hope you guys know how all this works, because the extent code is
a mystery to me.  Is the on-disk layout and the allocation strategy
described anywhere?
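
As a reading aid, the block arithmetic behind the three cases (a, b, c) in
the function's comment can be sketched in standalone C.  This models only the
logical block ranges, not the tree insertion, journalling or merging, and
the demo_* types are hypothetical:

```c
#include <stdint.h>

struct demo_range {
	uint32_t start;	/* first logical block */
	uint32_t len;	/* number of blocks; 0 means the piece is absent */
};

struct demo_split {
	struct demo_range left;		/* ex1: stays uninitialized */
	struct demo_range written;	/* ex2: becomes initialized */
	struct demo_range right;	/* ex3: stays uninitialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) around a
 * write of up to max_blocks blocks starting at iblock.  Assumes iblock
 * lies inside the extent.
 */
static struct demo_split demo_split_extent(uint32_t ee_block, uint32_t ee_len,
					   uint32_t iblock, uint32_t max_blocks)
{
	struct demo_split s = { {0, 0}, {0, 0}, {0, 0} };
	/* Blocks of the extent from iblock to its end. */
	uint32_t allocated = ee_len - (iblock - ee_block);

	if (iblock > ee_block) {		/* write starts past the head */
		s.left.start = ee_block;
		s.left.len = iblock - ee_block;
	}
	if (allocated > max_blocks) {		/* write stops before the tail */
		s.right.start = iblock + max_blocks;
		s.right.len = allocated - max_blocks;
		allocated = max_blocks;
	}
	s.written.start = iblock;
	s.written.len = allocated;
	return s;
}
```

For an extent covering blocks 100..109, a write of 4 blocks at block 103
yields 100..102 (uninitialized), 103..106 (initialized) and 107..109
(uninitialized); a write covering the whole extent yields a single
initialized piece.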

> +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *);

Again, I do think that sticking the identifiers in there helps
readability.  Although it is not as important in a boring old declaration
as it is in, say, inode_operations, etc.

Please try to keep the code looking nice in an 80-column display.



* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  4:29                                 ` Andrew Morton
@ 2007-05-04  4:41                                   ` Paul Mackerras
  2007-05-09 10:15                                     ` Suparna Bhattacharya
  2007-05-04  4:55                                   ` Andrew Morton
                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 340+ messages in thread
From: Paul Mackerras @ 2007-05-04  4:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

Andrew Morton writes:

> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.

This looks like it will have the same problem on s390 as
sys_sync_file_range.  Maybe the prototype should be:

asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Paul.


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  4:29                                 ` Andrew Morton
  2007-05-04  4:41                                   ` Paul Mackerras
@ 2007-05-04  4:55                                   ` Andrew Morton
  2007-05-04  6:07                                   ` David Chinner
  2007-05-07 11:03                                   ` Amit K. Arora
  3 siblings, 0 replies; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  4:55 UTC (permalink / raw)
  To: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Thu, 3 May 2007 21:29:55 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> > +	ret = -EFBIG;
> > +	if (offset + len > inode->i_sb->s_maxbytes)
> > +		goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.

But it doesn't handle offset+len wrapping through zero.
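
A small userspace model of the two cases, assuming the comparison is carried
out in unsigned 64-bit arithmetic as the remark about the type of s_maxbytes
suggests.  The function naive_too_big() is a hypothetical stand-in for the
quoted test, with the addition done in unsigned arithmetic so that the wrap
is well defined in portable C:

```c
#include <stdint.h>

/* Models "offset + len > s_maxbytes" with an unsigned right-hand side. */
static int naive_too_big(int64_t offset, int64_t len, uint64_t maxbytes)
{
	uint64_t end = (uint64_t)offset + (uint64_t)len;

	return end > maxbytes;
}
```

With a limit of 1 << 40, offset = 0 and len = -1 give a huge unsigned `end'
and the request is rejected, which is the accidental handling of a negative
sum.  With offset = 8192 and len = -4096 the unsigned sum wraps through zero
to 4096 and the request passes the check.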


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  4:29                                 ` Andrew Morton
  2007-05-04  4:41                                   ` Paul Mackerras
  2007-05-04  4:55                                   ` Andrew Morton
@ 2007-05-04  6:07                                   ` David Chinner
  2007-05-04  6:28                                     ` Andrew Morton
  2007-05-07 11:03                                   ` Amit K. Arora
  3 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-05-04  6:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> > +{
> > +	struct file *file;
> > +	struct inode *inode;
> > +	long ret = -EINVAL;
> > +
> > +	if (len == 0 || offset < 0)
> > +		goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I just checked the man page for posix_fallocate() and it says:

      EINVAL  offset or len was less than zero.

We should probably follow this lead.

> > +
> > +	ret = -ENODEV;
> > +	if (!S_ISREG(inode->i_mode))
> > +		goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

Hmmmm - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  6:07                                   ` David Chinner
@ 2007-05-04  6:28                                     ` Andrew Morton
  2007-05-04  6:56                                       ` Jakub Jelinek
                                                         ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Andrew Morton @ 2007-05-04  6:28 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Fri, 4 May 2007 16:07:31 +1000 David Chinner <dgc@sgi.com> wrote:

> On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > > +{
> > > +	struct file *file;
> > > +	struct inode *inode;
> > > +	long ret = -EINVAL;
> > > +
> > > +	if (len == 0 || offset < 0)
> > > +		goto out;
> > 
> > The posix spec implies that negative `len' is permitted - presumably "allocate
> > ahead of `offset'".  How peculiar.
> 
> I just checked the man page for posix_fallocate() and it says:
> 
>       EINVAL  offset or len was less than zero.
> 
> We should probably follow this lead.

Yes, I think so.  I'm suspecting that
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
is just buggy.  Or I can't read.

I mean, if we're going to support negative `len' then is the byte at
`offset' inside or outside the segment?  Head spins.

However it would be neat if someone could test $OTHER_OS and, perhaps more
importantly, the present glibc emulation (which I assume your manpage is
referring to, so this would be a manpage test ;)).

> > > +
> > > +	ret = -ENODEV;
> > > +	if (!S_ISREG(inode->i_mode))
> > > +		goto out_fput;
> > 
> > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > seems a bit silly of them.
> 
> Hmmmm - I thought that the intention of sys_fallocate() was to
> be generic enough to eventually allow preallocation on directories.
> If that is the case, then this check will prevent that....

The above opengroup page only permits S_ISREG.  Preallocating directories
sounds quite useful to me, although it's something which would be pretty
hard to emulate if the FS doesn't support it.  And there's a decent case to
be made for emulating it - run-anywhere reasons.  Does glibc emulation support
directories?  Quite unlikely.

But yes, sounds like a desirable thing.  Would XFS support it easily if the above
check was relaxed?


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  6:28                                     ` Andrew Morton
@ 2007-05-04  6:56                                       ` Jakub Jelinek
  2007-05-07 13:08                                         ` Ulrich Drepper
  2007-05-04  7:27                                       ` David Chinner
  2007-05-07 11:10                                       ` Amit K. Arora
  2 siblings, 1 reply; 340+ messages in thread
From: Jakub Jelinek @ 2007-05-04  6:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ulrich Drepper, David Chinner, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> > > The posix spec implies that negative `len' is permitted - presumably "allocate
> > > ahead of `offset'".  How peculiar.
> > 
> > I just checked the man page for posix_fallocate() and it says:
> > 
> >       EINVAL  offset or len was less than zero.

That describes the current glibc implementation.


> > We should probably follow this lead.
> 
> Yes, I think so.  I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy.  Or I can't read.
> 
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment?  Head spins.
> 
> However it would be neat if someone could test $OTHER_OS and, perhaps more
> importantly, the present glibc emulation (which I assume your manpage is
> referring to, so this would be a manpage test ;)).

int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
  struct stat64 st;
  struct statfs f;

  /* `off_t' is a signed type.  Therefore we can determine whether
     OFFSET + LEN is too large if it is a negative value.  */
  if (offset < 0 || len < 0)
    return EINVAL;
  if (offset + len < 0)
    return EFBIG;

  /* First thing we have to make sure is that this is really a regular
     file.  */
  if (__fxstat64 (_STAT_VER, fd, &st) != 0)
    return EBADF;
  if (S_ISFIFO (st.st_mode))
    return ESPIPE;
  if (! S_ISREG (st.st_mode))
    return ENODEV;

  if (len == 0)
    {
      if (st.st_size < offset)
        {
          int ret = __ftruncate (fd, offset);

          if (ret != 0)
            ret = errno;
          return ret;
        }
      return 0;
    }
...

is what glibc does ATM.  Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail".  But reading the standard to imply
negative len is ok is too much guessing: there is no word on what it means
when len is negative, and
"required storage for regular file data starting at offset and continuing for len bytes"
doesn't make sense for negative size.
And given the general
"Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring."
I believe returning EINVAL for len < 0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
len has undefined or implementation defined behavior.

> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.

No, see above.

	Jakub


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  6:28                                     ` Andrew Morton
  2007-05-04  6:56                                       ` Jakub Jelinek
@ 2007-05-04  7:27                                       ` David Chinner
  2007-05-07 11:10                                       ` Amit K. Arora
  2 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-05-04  7:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Chinner, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> On Fri, 4 May 2007 16:07:31 +1000 David Chinner <dgc@sgi.com> wrote:
> > On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > > 
> > > > This patch implements the fallocate() system call and adds support for
> > > > i386, x86_64 and powerpc.
> > > > 
> > > > ...
> > > > +{
> > > > +	struct file *file;
> > > > +	struct inode *inode;
> > > > +	long ret = -EINVAL;
> > > > +
> > > > +	if (len == 0 || offset < 0)
> > > > +		goto out;
> > > 
> > > The posix spec implies that negative `len' is permitted - presumably "allocate
> > > ahead of `offset'".  How peculiar.
> > 
> > I just checked the man page for posix_fallocate() and it says:
> > 
> >       EINVAL  offset or len was less than zero.
> > 
> > We should probably follow this lead.
> 
> Yes, I think so.  I'm suspecting that
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> is just buggy.  Or I can't read.
> 
> I mean, if we're going to support negative `len' then is the byte at
> `offset' inside or outside the segment?  Head spins.

I don't think we should care. If we provide a syscall with the
semantics of "allocate from offset to offset+len" then glibc's
implementation can turn negative length into two separate
fallocate syscalls....

> > > > +	ret = -ENODEV;
> > > > +	if (!S_ISREG(inode->i_mode))
> > > > +		goto out_fput;
> > > 
> > > So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> > > seems a bit silly of them.
> > 
> > Hmmmm - I thought that the intention of sys_fallocate() was to
> > be generic enough to eventually allow preallocation on directories.
> > If that is the case, then this check will prevent that....
> 
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the above
> check was relaxed?

No - right now empty blocks are pruned from the directory immediately so I
don't think we really have a concept of empty blocks in the btree structure.
dir2 is bloody complex, so adding preallocation is probably not going to
be simple to do.

It's not high on my list to add, either, because we can typically avoid the
worst case directory fragmentation by using larger directory block sizes
(e.g. 16k instead of the default 4k on a 4k block size fs).

IIRC directory preallocation has been talked about more for ext3/4....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  4:29                                 ` Andrew Morton
                                                     ` (2 preceding siblings ...)
  2007-05-04  6:07                                   ` David Chinner
@ 2007-05-07 11:03                                   ` Amit K. Arora
  3 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 11:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Andrew,

Thanks for the review comments!

On Thu, May 03, 2007 at 09:29:55PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > This patch implements the fallocate() system call and adds support for
> > i386, x86_64 and powerpc.
> > 
> > ...
> >
> > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> Please add a comment over this function which specifies its behaviour. 
> Really it should be enough material from which a full manpage can be
> written.
> 
> If that's all too much, this material should at least be spelled out in the
> changelog.  Because there's no way in which this change can be fully
> reviewed unless someone (ie: you) tells us what it is setting out to
> achieve.
> 
> If we 100% implement some standard then a URL for what we claim to
> implement would suffice.  Given that we're at least using different types from
> posix I doubt if such a thing would be sufficient.
> 
> And given the complexity and potential variability within the filesystem
> implementations of this, I'd expect that _something_ additional needs to be
> said?

Ok. I will add a detailed comment here.

> 
> > +{
> > +	struct file *file;
> > +	struct inode *inode;
> > +	long ret = -EINVAL;
> > +
> > +	if (len == 0 || offset < 0)
> > +		goto out;
> 
> The posix spec implies that negative `len' is permitted - presumably "allocate
> ahead of `offset'".  How peculiar.

I think we should go ahead with the current glibc implementation (which
Jakub pointed at) of not allowing a negative 'len', since posix also
doesn't explicitly say anything about allowing negative 'len'.

> 
> > +	ret = -EBADF;
> > +	file = fget(fd);
> > +	if (!file)
> > +		goto out;
> > +	if (!(file->f_mode & FMODE_WRITE))
> > +		goto out_fput;
> > +
> > +	inode = file->f_path.dentry->d_inode;
> > +
> > +	ret = -ESPIPE;
> > +	if (S_ISFIFO(inode->i_mode))
> > +		goto out_fput;
> > +
> > +	ret = -ENODEV;
> > +	if (!S_ISREG(inode->i_mode))
> > +		goto out_fput;
> 
> So we return ENODEV against an S_ISBLK fd, as per the posix spec.  That
> seems a bit silly of them.

True. 
 
> > +	ret = -EFBIG;
> > +	if (offset + len > inode->i_sb->s_maxbytes)
> > +		goto out_fput;
> 
> This code does handle offset+len going negative, but only by accident, I
> suspect.  It happens that s_maxbytes has unsigned type.  Perhaps a comment
> here would settle the reader's mind.

Ok. I will add a check here for wrap through zero.
 
> > +	if (inode->i_op && inode->i_op->fallocate)
> > +		ret = inode->i_op->fallocate(inode, mode, offset, len);
> > +	else
> > +		ret = -ENOSYS;
> 
> If we _are_ going to support negative `len', as posix suggests, I think we
> should perform the appropriate sanity conversions to `offset' and `len'
> right here, rather than expecting each filesystem to do it.
> 
> If we're not going to handle negative `len' then we should check for it.

Will add a check for negative 'len' and return -EINVAL. This will be
done where currently we check for negative offset (i.e. at the start of
the function).
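
For illustration, the argument checks agreed on above (reject negative
offset, reject zero or negative len, guard offset + len against wrap)
could look roughly like the sketch below. This is a userspace model, not
the patch itself: check_fallocate_range() and its maxbytes parameter
(standing in for inode->i_sb->s_maxbytes) are hypothetical names, and
returning -EFBIG for the wrap case is an assumption, since the thread
does not settle on an error code for it.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of the sys_fallocate() argument checks discussed
 * in this thread.  maxbytes stands in for inode->i_sb->s_maxbytes.
 */
static long check_fallocate_range(int64_t offset, int64_t len,
				  int64_t maxbytes)
{
	/* Negative offset, and zero or negative len, are rejected early. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Both values are non-negative here, so offset + len overflows
	 * exactly when len > INT64_MAX - offset.  Comparing this way
	 * avoids performing the signed addition that could wrap.
	 * (-EFBIG for this case is an assumption of the sketch.)
	 */
	if (len > INT64_MAX - offset)
		return -EFBIG;

	if (offset + len > maxbytes)
		return -EFBIG;

	return 0;
}
```

With the checks done once at the syscall level, each filesystem's
->fallocate() could rely on offset >= 0, len > 0 and a non-wrapping sum.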
 
> > +out_fput:
> > +	fput(file);
> > +out:
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(sys_fallocate);
> 
> I don't believe this needs to be exported to modules?

Ok. Will remove it.
 
> > +/*
> > + * fallocate() modes
> > + */
> > +#define FA_ALLOCATE	0x1
> > +#define FA_DEALLOCATE	0x2
> 
> Now those aren't in posix.  They should be documented, along with their
> expected semantics.

Will add a comment describing the role of these modes.
 
> >  #ifdef __KERNEL__
> >  
> >  #include <linux/linkage.h>
> > @@ -1125,6 +1131,7 @@ struct inode_operations {
> >  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
> >  	int (*removexattr) (struct dentry *, const char *);
> >  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +	long (*fallocate)(struct inode *, int, loff_t, loff_t);
> 
> I really do think it's better to put the variable names in definitions such
> as this.  Especially when we have two identically-typed variables next to
> each other like that.  Quick: which one is the offset and which is the
> length?

Ok. Will add the variable names here.

--
Regards,
Amit Arora


* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  6:28                                     ` Andrew Morton
  2007-05-04  6:56                                       ` Jakub Jelinek
  2007-05-04  7:27                                       ` David Chinner
@ 2007-05-07 11:10                                       ` Amit K. Arora
  2 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Chinner, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Thu, May 03, 2007 at 11:28:15PM -0700, Andrew Morton wrote:
> The above opengroup page only permits S_ISREG.  Preallocating directories
> sounds quite useful to me, although it's something which would be pretty
> hard to emulate if the FS doesn't support it.  And there's a decent case to
> be made for emulating it - run-anywhere reasons.  Does glibc emulation support
> directories?  Quite unlikely.
> 
> But yes, sounds like a desirable thing.  Would XFS support it easily if the above
> check was relaxed?

I think we may relax the check here and let the individual file systems
decide whether they support preallocation for directories or not. What do
you think?

One thing to consider in this case is the error code which should be
returned by the file system implementation, in case it doesn't support
preallocation for directories. Should it be -ENODEV (to match what
posix says), or something else (which might make more sense in this
case)?

--
Regards,
Amit Arora


* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-04  4:31                                 ` Andrew Morton
@ 2007-05-07 11:37                                   ` Andreas Dilger
  2007-05-07 20:58                                     ` Andrew Morton
  2007-05-07 12:07                                   ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-05-07 11:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On May 03, 2007  21:31 -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> <something>.  If the file is using bitmaps then we will do <something else>.
> 
> But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > +{
> > +	handle_t *handle;
> > +	ext4_fsblk_t block, max_blocks;
> > +	int ret, ret2, nblocks = 0, retries = 0;
> > +	struct buffer_head map_bh;
> > +	unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +	/* Currently supporting (pre)allocate mode _only_ */
> > +	if (mode != FA_ALLOCATE)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +		return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
> 
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > +	block = offset >> blkbits;
> > +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +			 - block;
> > +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.
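
The 43GB estimate above can be reproduced with a little arithmetic. The
sketch below assumes a 4k block size and 12-byte on-disk extent header
and extent entry sizes; those sizes and all the names are assumptions of
the sketch, not taken from the patch. It also rounds a full extent to
32768 blocks (128MB); the 0x7FFF length mask in the patch would cap a
single uninitialized extent slightly below that.

```c
#include <stdint.h>

/* Assumed on-disk sizes: 12-byte extent header, 12-byte extent entry. */
#define DEMO_EXT_HEADER_BYTES	12u
#define DEMO_EXT_ENTRY_BYTES	12u

/* Number of extent entries that fit in one leaf block. */
static uint32_t demo_extents_per_block(uint32_t block_size)
{
	return (block_size - DEMO_EXT_HEADER_BYTES) / DEMO_EXT_ENTRY_BYTES;
}

/* Bytes covered by one leaf block full of maximum-length extents. */
static uint64_t demo_bytes_per_leaf(uint32_t block_size,
				    uint32_t max_extent_blocks)
{
	return (uint64_t)demo_extents_per_block(block_size) *
	       max_extent_blocks * block_size;
}
```

For block_size = 4096 and max_extent_blocks = 32768 this gives 340
extents per leaf and 340 * 128MB = 43520MB, i.e. the "43GB" above.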

> > +		ret = ext4_ext_get_blocks(handle, inode, block,
> > +					  max_blocks, &map_bh,
> > +					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +		BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 3/5] ext4: Extent overlap bugfix
  2007-05-04  4:30                                 ` Andrew Morton
@ 2007-05-07 11:46                                   ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 11:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, May 03, 2007 at 09:30:02PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:41:01 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > +unsigned int ext4_ext_check_overlap(struct inode *inode,
> > +					struct ext4_extent *newext,
> > +					struct ext4_ext_path *path)
> > +{
> > +	unsigned long b1, b2;
> > +	unsigned int depth, len1;
> > +
> > +	b1 = le32_to_cpu(newext->ee_block);
> > +	len1 = le16_to_cpu(newext->ee_len);
> > +	depth = ext_depth(inode);
> > +	if (!path[depth].p_ext)
> > +		goto out;
> > +	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
> > +
> > +	/* get the next allocated block if the extent in the path
> > +	 * is before the requested block(s) */
> > +	if (b2 < b1) {
> > +		b2 = ext4_ext_next_allocated_block(path);
> > +		if (b2 == EXT_MAX_BLOCK)
> > +			goto out;
> > +	}
> > +
> > +	if (b1 + len1 > b2) {
> 
> Are we sure that b1+len cannot wrap through zero here?

No. Will add a check here for this. Thanks!
 
> > +		newext->ee_len = cpu_to_le16(b2 - b1);
> > +		return 1;
> > +	}


--
Regards,
Amit Arora


* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-04  4:31                                 ` Andrew Morton
  2007-05-07 11:37                                   ` Andreas Dilger
@ 2007-05-07 12:07                                   ` Amit K. Arora
  2007-05-07 15:24                                     ` Dave Kleikamp
  1 sibling, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 12:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > This patch has the ext4 implementation of fallocate system call.
> > 
> > ...
> >
> > +		/* ext4_can_extents_be_merged should have checked that either
> > +		 * both extents are uninitialized, or both aren't. Thus we
> > +		 * need to check only one of them here.
> > +		 */
> 
> Please always format multiline comments like this:
> 
> 		/*
> 		 * ext4_can_extents_be_merged should have checked that either
> 		 * both extents are uninitialized, or both aren't. Thus we
> 		 * need to check only one of them here.
> 		 */

Ok.
 
> > ...
> >
> > +/*
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> <something>.  If the file is using bitmaps then we will do <something else>.
> 
> But what?   Here is where it should be described.

Ok. Will expand the description.
 
> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > +{
> > +	handle_t *handle;
> > +	ext4_fsblk_t block, max_blocks;
> > +	int ret, ret2, nblocks = 0, retries = 0;
> > +	struct buffer_head map_bh;
> > +	unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +	/* Currently supporting (pre)allocate mode _only_ */
> > +	if (mode != FA_ALLOCATE)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +		return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.

Ok. Will add this in the function description as well.
 
> Also, posix says nothing about fallocate() returning ENOTTY.

Right. I can't find a suitable error code in the posix description.
Can you please suggest an error code which might make more sense here?
Would -ENOTSUPP be ok, since what we want to say here is that we don't
support non-extent files?
 
> > +	block = offset >> blkbits;
> > +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +			 - block;
> > +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

You are right that the credits cannot be fixed here. But 'len'
will not directly tell us how many extents might need to be inserted and
how many block groups (if any - think about the case where the "segment
range" is already allocated) the allocation request might touch.
One solution I have thought of is to check the buffer credits after each
call to ext4_ext_get_blocks (in the while loop) and do a journal_extend if
the credits are falling short. In case journal_extend fails, we call
journal_restart. This will automatically take care of however much journal
space we might need for any value of "len".
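
To make the idea concrete, here is a small userspace model of the
"extend, else restart" pattern. Everything in it is hypothetical
(demo_handle, demo_journal, demo_extend(), demo_restart(),
demo_ensure_credits()); it only mimics the behaviour assumed of
journal_extend() (succeeds only while the running transaction has room)
and journal_restart() (starts a fresh transaction with the requested
credits), and is not the jbd2 code.

```c
/* Hypothetical stand-ins for a journal handle and running transaction. */
struct demo_handle {
	int credits;		/* buffer credits held by this handle */
};

struct demo_journal {
	int free_in_txn;	/* credits still free in the running txn */
};

/* Model of journal_extend(): 0 on success, 1 if the txn has no room. */
static int demo_extend(struct demo_journal *j, struct demo_handle *h, int n)
{
	if (j->free_in_txn < n)
		return 1;
	j->free_in_txn -= n;
	h->credits += n;
	return 0;
}

/* Model of journal_restart(): new transaction, handle gets n credits. */
static void demo_restart(struct demo_journal *j, struct demo_handle *h,
			 int n, int txn_capacity)
{
	j->free_in_txn = txn_capacity - n;
	h->credits = n;
}

/*
 * Make sure the handle holds at least `need` credits before the next
 * allocation.  Returns 0 if the current transaction could be used,
 * 1 if a restart was required.
 */
static int demo_ensure_credits(struct demo_journal *j, struct demo_handle *h,
			       int need, int txn_capacity)
{
	if (h->credits >= need)
		return 0;
	if (demo_extend(j, h, need - h->credits) == 0)
		return 0;
	demo_restart(j, h, need, txn_capacity);
	return 1;
}
```

In the loop around ext4_ext_get_blocks() a call like this would run once
per iteration, so the total journal space used no longer depends on a
single up-front estimate.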
 
> > +	handle=ext4_journal_start(inode, credits +
> 
> Please always put spaces around "="
Ok.
> 
> > +					EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> 
> And around "+"
Ok.
> 
> > +	if (IS_ERR(handle))
> > +		return PTR_ERR(handle);
> > +retry:
> > +	ret = 0;
> > +	while (ret >= 0 && ret < max_blocks) {
> > +		block = block + ret;
> > +		max_blocks = max_blocks - ret;
> > +		ret = ext4_ext_get_blocks(handle, inode, block,
> > +					  max_blocks, &map_bh,
> > +					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +		BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ok. Will do that.
> 
> > +		if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> 
> Use buffer_new() here.   A separate patch which fixes the three existing
> instances of open-coded BH_foo usage would be appreciated.

Ok.
> 
> > +			&& ((block + ret) > (i_size_read(inode) << blkbits)))
> 
> Check for wrap through the sign bit and through zero please.
Ok.
> 
> > +			nblocks = nblocks + ret;
> > +	}
> > +
> > +	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > +		goto retry;
> > +
> > +	/* Time to update the file size.
> > +	 * Update only when preallocation was requested beyond the file size.
> > +	 */
> 
> Fix comment layout.
Ok.
> 
> > +	if ((offset + len) > i_size_read(inode)) {
> 
> Both the lhs and the rhs here are signed.  Please review for possible
> overflows through the sign bit and through zero.  Perhaps a comment
> explaining why it's correct would be appropriate.
Ok.
> 
> 
> > +		if (ret > 0) {
> > +		/* if no error, we assume preallocation succeeded completely */
> > +			mutex_lock(&inode->i_mutex);
> > +			i_size_write(inode, offset + len);
> > +			EXT4_I(inode)->i_disksize = i_size_read(inode);
> > +			mutex_unlock(&inode->i_mutex);
> > +		} else if (ret < 0 && nblocks) {
> > +		/* Handle partial allocation scenario */
> 
> The above two comments should be indented one additional tabstop.
Ok.
> 
> > +			loff_t newsize;
> > +			mutex_lock(&inode->i_mutex);
> > +			newsize  = (nblocks << blkbits) + i_size_read(inode);
> > +			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
> > +			EXT4_I(inode)->i_disksize = i_size_read(inode);
> > +			mutex_unlock(&inode->i_mutex);
> > +		}
> > +	}
> > +	ext4_mark_inode_dirty(handle, inode);
> > +	ret2 = ext4_journal_stop(handle);
> > +	if (ret > 0)
> > +		ret = ret2;
> > +
> > +	return ret > 0 ? 0 : ret;
> > +}
> > +
> >  EXPORT_SYMBOL(ext4_mark_inode_dirty);
> >  EXPORT_SYMBOL(ext4_ext_invalidate_cache);
> >  EXPORT_SYMBOL(ext4_ext_insert_extent);
> >  EXPORT_SYMBOL(ext4_ext_walk_space);
> >  EXPORT_SYMBOL(ext4_ext_find_goal);
> >  EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert);
> > +EXPORT_SYMBOL(ext4_fallocate);
> >  
> > Index: linux-2.6.21/fs/ext4/file.c
> > ===================================================================
> > --- linux-2.6.21.orig/fs/ext4/file.c
> > +++ linux-2.6.21/fs/ext4/file.c
> > @@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
> >  	.removexattr	= generic_removexattr,
> >  #endif
> >  	.permission	= ext4_permission,
> > +	.fallocate	= ext4_fallocate,
> >  };
> >  
> > Index: linux-2.6.21/include/linux/ext4_fs.h
> > ===================================================================
> > --- linux-2.6.21.orig/include/linux/ext4_fs.h
> > +++ linux-2.6.21/include/linux/ext4_fs.h
> > @@ -102,6 +102,8 @@
> >  				 EXT4_GOOD_OLD_FIRST_INO : \
> >  				 (s)->s_first_ino)
> >  #endif
> > +#define EXT4_BLOCK_ALIGN(size, blkbits) 	(((size)+(1 << blkbits)-1) & \
> > +							(~((1 << blkbits)-1)))
> 
> Maybe a comment describing what this does?  Probably it's obvious enough.
> 
> I think it could use the standard ALIGN macro.
> 
> Is blkbits sufficiently parenthesised here?  Even if it is, adding the
> parens would be better practice.

I agree. Will change it.
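
As a sketch of what the reworked helper might look like, here is the
same round-up written as an inline function, which sidesteps the macro
parenthesisation question entirely. demo_block_align() is a hypothetical
name, and the use of 64-bit unsigned arithmetic is an assumption of the
sketch rather than something the patch specifies.

```c
#include <stdint.h>

/*
 * Round size up to the next multiple of the block size (1 << blkbits).
 * The shift is done in 64 bits so the mask covers the whole value.
 */
static inline uint64_t demo_block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t blksize = (uint64_t)1 << blkbits;

	return (size + blksize - 1) & ~(blksize - 1);
}
```

For blkbits = 12, sizes 1 and 4096 both round to 4096, and 4097 rounds
to 8192.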
> 
> >  /*
> >   * Macro-instructions used to manage fragments
> > @@ -225,6 +227,10 @@ struct ext4_new_group_data {
> >  	__u32 free_blocks_count;
> >  };
> >  
> > +/* Following is used by preallocation logic to tell get_blocks() that we
> > + * want uninitialzed extents.
> > + */
> 
> Please convert all newly-added multiline comments to the preferred layout.
Ok.
> 
> > +#define EXT4_CREATE_UNINITIALIZED_EXT		2
> >  
> >  /*
> >   * ioctl commands
> > @@ -976,6 +982,7 @@ extern int ext4_ext_get_blocks(handle_t 
> >  extern void ext4_ext_truncate(struct inode *, struct page *);
> >  extern void ext4_ext_init(struct super_block *);
> >  extern void ext4_ext_release(struct super_block *);
> > +extern int ext4_fallocate(struct inode *, int, loff_t, loff_t);
> 
> argh.  And feel free to give these args some useful names.
Ok.
> 
> >  static inline int
> >  ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> >  			unsigned long max_blocks, struct buffer_head *bh,
> > Index: linux-2.6.21/include/linux/ext4_fs_extents.h
> > ===================================================================
> > --- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
> > +++ linux-2.6.21/include/linux/ext4_fs_extents.h
> > @@ -125,6 +125,19 @@ struct ext4_ext_path {
> >  #define EXT4_EXT_CACHE_EXTENT	2
> >  
> >  /*
> > + * Macro-instructions to handle (mark/unmark/check/create) unitialized
> > + * extents. Applications can issue an IOCTL for preallocation, which results
> > + * in assigning unitialized extents to the file.
> > + */
> > +#define ext4_ext_mark_uninitialized(ext)	((ext)->ee_len |= \
> > +							cpu_to_le16(0x8000))
> > +#define ext4_ext_is_uninitialized(ext)  	((le16_to_cpu((ext)->ee_len))& \
> > +									0x8000)
> > +#define ext4_ext_get_actual_len(ext)		((le16_to_cpu((ext)->ee_len))& \
> > +									0x7FFF)
> 
> inlined C functions are preferred, and I think these could be implemented
> that way.

Ok. Will convert them to inline functions.
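
Roughly, the inline versions could look like the sketch below. It is a
userspace model with hypothetical names (demo_extent, demo_ext_*), and
it keeps ee_len in host byte order for simplicity; the real helpers
would still need the le16_to_cpu()/cpu_to_le16() conversions that the
macros have.

```c
#include <stdint.h>

#define DEMO_EXT_UNINIT_FLAG	0x8000u	/* MSB of the 16-bit length */
#define DEMO_EXT_LEN_MASK	0x7FFFu	/* remaining 15 bits: the length */

struct demo_extent {
	uint16_t ee_len;	/* host byte order in this sketch */
};

/* Set the "uninitialized" marker without touching the length bits. */
static inline void demo_ext_mark_uninitialized(struct demo_extent *ext)
{
	ext->ee_len = (uint16_t)(ext->ee_len | DEMO_EXT_UNINIT_FLAG);
}

/* Non-zero if the extent carries the "uninitialized" marker. */
static inline int demo_ext_is_uninitialized(const struct demo_extent *ext)
{
	return (ext->ee_len & DEMO_EXT_UNINIT_FLAG) != 0;
}

/* Length in blocks with the marker bit masked off. */
static inline uint16_t demo_ext_get_actual_len(const struct demo_extent *ext)
{
	return (uint16_t)(ext->ee_len & DEMO_EXT_LEN_MASK);
}
```

Compared with the macros, the functions give type checking on the
argument and evaluate it exactly once.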

Thanks!

--
Regards,
Amit Arora


* Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
  2007-05-04  4:32                                 ` Andrew Morton
@ 2007-05-07 12:11                                   ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 12:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, May 03, 2007 at 09:32:38PM -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:46:23 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > + */
> > +int ext4_ext_try_to_merge(struct inode *inode,
> > +				struct ext4_ext_path *path,
> > +				struct ext4_extent *ex)
> > +{
> > +	struct ext4_extent_header *eh;
> > +	unsigned int depth, len;
> > +	int merge_done=0, uninitialized = 0;
> 
> space around "=", please.
> 
> Many people prefer not to do the multiple-definitions-per-line, btw:
> 
> 	int merge_done = 0;
> 	int uninitialized = 0;

Ok. Will make the change.

> 
> reasons:
> 
> - It gives you some space for a nice comment
> 
> - It makes patches much more readable, and it makes rejects easier to fix
> 
> - standardisation.
> 
> > +	depth = ext_depth(inode);
> > +	BUG_ON(path[depth].p_hdr == NULL);
> > +	eh = path[depth].p_hdr;
> > +
> > +	while (ex < EXT_LAST_EXTENT(eh)) {
> > +		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
> > +			break;
> > +		/* merge with next extent! */
> > +		if (ext4_ext_is_uninitialized(ex))
> > +			uninitialized = 1;
> > +		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
> > +					+ ext4_ext_get_actual_len(ex + 1));
> > +		if (uninitialized)
> > +			ext4_ext_mark_uninitialized(ex);
> > +
> > +		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
> > +			len = (EXT_LAST_EXTENT(eh) - ex - 1)
> > +					* sizeof(struct ext4_extent);
> > +			memmove(ex + 1, ex + 2, len);
> > +		}
> > +		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
> 
> Kernel convention is to put spaces around "-"

Will fix this.

> 
> > +		merge_done = 1;
> > +		BUG_ON(eh->eh_entries == 0);
> 
> eek, scary BUG_ON.  Do we really need to be that severe?  Would it be
> better to warn and run ext4_error() here?
Ok.
> 
> > +	}
> > +
> > +	return merge_done;
> > +}
> > +
> > +
> >
> > ...
> >
> > +/*
> > + * ext4_ext_convert_to_initialized:
> > + * this function is called by ext4_ext_get_blocks() if someone tries to write
> > + * to an uninitialized extent. It may result in splitting the uninitialized
> > + * extent into multiple extents (upto three). Atleast one initialized extent
> > + * and atmost two uninitialized extents can result.
> 
> There are some typos here
> 
> > + * There are three possibilities:
> > + *   a> No split required: Entire extent should be initialized.
> > + *   b> Split into two extents: Only one end of the extent is being written to.
> > + *   c> Split into three extents: Somone is writing in middle of the extent.
> 
> and here
> 
Ok. Will fix them.
> > + */
> > +int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
> > +					struct ext4_ext_path *path,
> > +					ext4_fsblk_t iblock,
> > +					unsigned long max_blocks)
> > +{
> > +	struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
> > +	struct ext4_extent_header *eh;
> > +	unsigned int allocated, ee_block, ee_len, depth;
> > +	ext4_fsblk_t newblock;
> > +	int err = 0, ret = 0;
> > +
> > +	depth = ext_depth(inode);
> > +	eh = path[depth].p_hdr;
> > +	ex = path[depth].p_ext;
> > +	ee_block = le32_to_cpu(ex->ee_block);
> > +	ee_len = ext4_ext_get_actual_len(ex);
> > +	allocated = ee_len - (iblock - ee_block);
> > +	newblock = iblock - ee_block + ext_pblock(ex);
> > +	ex2 = ex;
> > +
> > +	/* ex1: ee_block to iblock - 1 : uninitialized */
> > +	if (iblock > ee_block) {
> > +		ex1 = ex;
> > +		ex1->ee_len = cpu_to_le16(iblock - ee_block);
> > +		ext4_ext_mark_uninitialized(ex1);
> > +		ex2 = &newex;
> > +	}
> > +	/* for sanity, update the length of the ex2 extent before
> > +	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
> > +	 * overlap of blocks.
> > +	 */
> > +	if (!ex1 && allocated > max_blocks)
> > +		ex2->ee_len = cpu_to_le16(max_blocks);
> > +	/* ex3: to ee_block + ee_len : uninitialised */
> > +	if (allocated > max_blocks) {
> > +		unsigned int newdepth;
> > +		ex3 = &newex;
> > +		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
> > +		ext4_ext_store_pblock(ex3, newblock + max_blocks);
> > +		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
> > +		ext4_ext_mark_uninitialized(ex3);
> > +		err = ext4_ext_insert_extent(handle, inode, path, ex3);
> > +		if (err)
> > +			goto out;
> > +		/* The depth, and hence eh & ex might change
> > +		 * as part of the insert above.
> > +		 */
> > +		newdepth = ext_depth(inode);
> > +		if (newdepth != depth)
> > +		{
> 
> Use
> 
> 		if (newdepth != depth) {

Ok.
> 
> > +			depth=newdepth;
> 
> spaces
Ok.
> 
> > +			path = ext4_ext_find_extent(inode, iblock, NULL);
> > +			if (IS_ERR(path)) {
> > +				err = PTR_ERR(path);
> > +				path = NULL;
> > +				goto out;
> > +			}
> > +			eh = path[depth].p_hdr;
> > +			ex = path[depth].p_ext;
> > +			if (ex2 != &newex)
> > +				ex2 = ex;
> > +		}
> > +		allocated = max_blocks;
> > +	}
> > +	/* If there was a change of depth as part of the
> > +	 * insertion of ex3 above, we need to update the length
> > +	 * of the ex1 extent again here
> > +	 */
> > +	if (ex1 && ex1 != ex) {
> > +		ex1 = ex;
> > +		ex1->ee_len = cpu_to_le16(iblock - ee_block);
> > +		ext4_ext_mark_uninitialized(ex1);
> > +		ex2 = &newex;
> > +	}
> > +	/* ex2: iblock to iblock + maxblocks-1 : initialised */
> > +	ex2->ee_block = cpu_to_le32(iblock);
> > +	ex2->ee_start = cpu_to_le32(newblock);
> > +	ext4_ext_store_pblock(ex2, newblock);
> > +	ex2->ee_len = cpu_to_le16(allocated);
> > +	if (ex2 != ex)
> > +		goto insert;
> > +	if ((err = ext4_ext_get_access(handle, inode, path + depth)))
> > +		goto out;
> 
> The preferred style is
> 
> 	err = ext4_ext_get_access(handle, inode, path + depth);
> 	if (err)
> 		goto out;

Right. Will change it.
 
> > +	/* New (initialized) extent starts from the first block
> > +	 * in the current extent. i.e., ex2 == ex
> > +	 * We have to see if it can be merged with the extent
> > +	 * on the left.
> > +	 */
> > +	if (ex2 > EXT_FIRST_EXTENT(eh)) {
> > +		/* To merge left, pass "ex2 - 1" to try_to_merge(),
> > +		 * since it merges towards right _only_.
> > +		 */
> > +		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
> > +		if (ret) {
> > +			err = ext4_ext_correct_indexes(handle, inode, path);
> > +			if (err)
> > +				goto out;
> > +			depth = ext_depth(inode);
> > +			ex2--;
> > +		}
> > +	}
> > +	/* Try to Merge towards right. This might be required
> > +	 * only when the whole extent is being written to.
> > +	 * i.e. ex2==ex and ex3==NULL.
> > +	 */
> > +	if (!ex3) {
> > +		ret = ext4_ext_try_to_merge(inode, path, ex2);
> > +		if (ret) {
> > +			err = ext4_ext_correct_indexes(handle, inode, path);
> > +			if (err)
> > +				goto out;
> > +		}
> > +	}
> > +	/* Mark modified extent as dirty */
> > +	err = ext4_ext_dirty(handle, inode, path + depth);
> > +	goto out;
> > +insert:
> > +	err = ext4_ext_insert_extent(handle, inode, path, &newex);
> > +out:
> > +	return err ? err : allocated;
> > +}
> 
> Sigh.  I hope you guys know how all this works, because the extent code is
> a mystery to me.  Is the on-disk layout and the allocation strategy
> described anywhere?
> 
> > +extern int ext4_ext_try_to_merge(struct inode *, struct ext4_ext_path *, struct ext4_extent *);
> 
> Again, I do think that sticking the identifiers in there helps
> readability.  Although it is not as important in a boring old declaration
> as it is in, say, inode_operations, etc.
> 
> Please try to keep the code looking nice in an 80-column display.

Ok. Will make the required changes.
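
Since the extent handling was called a mystery above, here is a small
model of the three split cases from the ext4_ext_convert_to_initialized()
comment. It only computes the resulting logical ranges; demo_piece and
demo_split() are hypothetical names, there is no tree insertion or
merging, and it assumes ee_block <= iblock < ee_block + ee_len.

```c
#include <stdint.h>

struct demo_piece {
	uint32_t start;		/* first logical block */
	uint32_t len;		/* length in blocks */
	int uninit;		/* 1 = still uninitialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a
 * write of up to max_blocks starting at iblock.  Fills out[] in logical
 * order and returns the number of resulting extents (1, 2 or 3).
 */
static int demo_split(uint32_t ee_block, uint32_t ee_len,
		      uint32_t iblock, uint32_t max_blocks,
		      struct demo_piece out[3])
{
	/* Blocks from iblock to the end of the original extent. */
	uint32_t allocated = ee_len - (iblock - ee_block);
	uint32_t written = allocated > max_blocks ? max_blocks : allocated;
	int n = 0;

	/* ex1: untouched head, stays uninitialized. */
	if (iblock > ee_block)
		out[n++] = (struct demo_piece){ ee_block, iblock - ee_block, 1 };

	/* ex2: the part being written, becomes initialized. */
	out[n++] = (struct demo_piece){ iblock, written, 0 };

	/* ex3: untouched tail, stays uninitialized. */
	if (allocated > max_blocks)
		out[n++] = (struct demo_piece){ iblock + max_blocks,
						allocated - max_blocks, 1 };
	return n;
}
```

Writing the whole extent yields one piece (case a), writing at either
end yields two (case b), and writing in the middle yields three (case c).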

Thanks again for your comments!

--
Regards,
Amit Arora


* Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
  2007-04-26 18:16                               ` [PATCH 5/5] ext4: write support for preallocated blocks/extents Amit K. Arora
  2007-05-04  4:32                                 ` Andrew Morton
@ 2007-05-07 12:40                                 ` Pekka Enberg
  2007-05-07 13:04                                   ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: Pekka Enberg @ 2007-05-07 12:40 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On 4/26/07, Amit K. Arora <aarora@linux.vnet.ibm.com> wrote:
>  /*
> + * ext4_ext_try_to_merge:
> + * tries to merge the "ex" extent to the next extent in the tree.
> + * It always tries to merge towards right. If you want to merge towards
> + * left, pass "ex - 1" as argument instead of "ex".
> + * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> + * 1 if they got merged.
> + */
> +int ext4_ext_try_to_merge(struct inode *inode,
> +                               struct ext4_ext_path *path,
> +                               struct ext4_extent *ex)
> +{

Please either use proper kerneldoc format or drop
"ext4_ext_try_to_merge" from the comment.


* Re: [PATCH 5/5] ext4: write support for preallocated blocks/extents
  2007-05-07 12:40                                 ` Pekka Enberg
@ 2007-05-07 13:04                                   ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-07 13:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Mon, May 07, 2007 at 03:40:26PM +0300, Pekka Enberg wrote:
> On 4/26/07, Amit K. Arora <aarora@linux.vnet.ibm.com> wrote:
> > /*
> >+ * ext4_ext_try_to_merge:
> >+ * tries to merge the "ex" extent to the next extent in the tree.
> >+ * It always tries to merge towards right. If you want to merge towards
> >+ * left, pass "ex - 1" as argument instead of "ex".
> >+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
> >+ * 1 if they got merged.
> >+ */
> >+int ext4_ext_try_to_merge(struct inode *inode,
> >+                               struct ext4_ext_path *path,
> >+                               struct ext4_extent *ex)
> >+{
> 
> Please either use proper kerneldoc format or drop
> "ext4_ext_try_to_merge" from the comment.

Ok, Thanks.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  6:56                                       ` Jakub Jelinek
@ 2007-05-07 13:08                                         ` Ulrich Drepper
  0 siblings, 0 replies; 340+ messages in thread
From: Ulrich Drepper @ 2007-05-07 13:08 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andrew Morton, David Chinner, Amit K. Arora, torvalds,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Jakub Jelinek wrote:
> is what glibc does ATM.  Seems we violate the case where len == 0, as
> EINVAL in that case is "shall fail".  But reading the standard to imply
> negative len is ok is too much guessing, there is no word what it means
> when len is negative and
> "required storage for regular file data starting at offset and continuing for len bytes"
> doesn't make sense for negative size.  

This wording has already been cleaned up.  The current draft for the 
next revision reads:


[EINVAL] The len argument is less than or equal to zero, or the offset
  argument is less than zero, or the underlying file system does not
  support this operation.


I still don't like it since len==0 shouldn't create an error (it's 
inconsistent) but len<0 is already outlawed.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 12:07                                   ` Amit K. Arora
@ 2007-05-07 15:24                                     ` Dave Kleikamp
  2007-05-08 10:52                                       ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Dave Kleikamp @ 2007-05-07 15:24 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andrew Morton, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > > +{
> > > +	handle_t *handle;
> > > +	ext4_fsblk_t block, max_blocks;
> > > +	int ret, ret2, nblocks = 0, retries = 0;
> > > +	struct buffer_head map_bh;
> > > +	unsigned int credits, blkbits = inode->i_blkbits;
> > > +
> > > +	/* Currently supporting (pre)allocate mode _only_ */
> > > +	if (mode != FA_ALLOCATE)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > +		return -ENOTTY;
> > 
> > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > news.  The changelog would be an appropriate place to communicate this,
> > along with reasons why, or a description of the plan to fix it.
> 
> Ok. Will add this in the function description as well.
> 
> > Also, posix says nothing about fallocate() returning ENOTTY.
> 
> Right. I don't seem to find any suitable error from posix description.
> Can you please suggest an error code which might make more sense here ?
> Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> non-extent files.

Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
here, and fall back to the current library code to do preallocation?
This way, the caller of fallocate() will never see this return code, so
it won't violate posix.

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 11:37                                   ` Andreas Dilger
@ 2007-05-07 20:58                                     ` Andrew Morton
  2007-05-07 22:21                                       ` Andreas Dilger
                                                         ` (2 more replies)
  0 siblings, 3 replies; 340+ messages in thread
From: Andrew Morton @ 2007-05-07 20:58 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Mon, 7 May 2007 05:37:54 -0600
Andreas Dilger <adilger@clusterfs.com> wrote:

> > > +	block = offset >> blkbits;
> > > +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > +			 - block;
> > > +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > 
> > Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> > space, and that this disk space will require an arbitrary amount of
> > metadata, how can we work out how much journal space we'll be needing
> > without at least looking at `len'?
> 
> Good question.
> 
> The uninitialized extent can cover up to 128MB with a single entry.
> If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> function returns the maximum number of extents needed to insert a leaf,
> including splitting all of the index blocks.  That would allow up to 43GB
> (340 extents/block * 128MB) to be preallocated, but it still needs to take
> the size of the preallocation into account (adding 3 blocks per 43GB - a
> leaf block, a bitmap block and a group descriptor).
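
The 128MB and 43GB figures quoted above can be reproduced with a few
lines of arithmetic. This is only a sketch: the 12-byte header and
12-byte entry sizes, the 16-bit ee_len width and all function names are
assumptions made for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk sizes: 12-byte extent header, 12-byte extent entry. */
enum { EXT_HEADER_BYTES = 12, EXT_ENTRY_BYTES = 12 };

/* Number of extent entries that fit in one tree block. */
static uint64_t extents_per_block(uint64_t block_size)
{
	assert(block_size > EXT_HEADER_BYTES);
	return (block_size - EXT_HEADER_BYTES) / EXT_ENTRY_BYTES;
}

/*
 * Bytes covered by one uninitialized extent, assuming ee_len is 16 bits
 * wide and its MSB is the "uninitialized" flag, which leaves 2^15 blocks.
 */
static uint64_t max_uninit_extent_bytes(uint64_t block_size)
{
	return (UINT64_C(1) << 15) * block_size;
}

/* Bytes that a single full leaf block of such extents can map. */
static uint64_t max_prealloc_bytes_per_leaf(uint64_t block_size)
{
	return extents_per_block(block_size) *
	       max_uninit_extent_bytes(block_size);
}
```

With 4KB blocks this gives 340 entries per block, 128MB per extent and
roughly 42.5GB per leaf, which matches the "43GB" above.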

I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.

Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space.  If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.

But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons.  If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.

So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop: 

	while (len) {
		journal_start();
		len -= do_fallocate(len, ...);
		journal_stop();
	}
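
Spelled out a little further, the loop could look like the sketch below.
Everything in it is a stand-in: struct fake_fs, the fake_* helpers and
CHUNK_BLOCKS are hypothetical and only model the control flow, they are
not ext4 or JBD interfaces. It also shows the partial-failure case: the
caller gets an error, but part of the range has already been allocated.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a filesystem; not a real ext4 structure. */
struct fake_fs {
	unsigned long free_blocks;   /* blocks still available */
	unsigned long transactions;  /* journal handles started so far */
};

#define CHUNK_BLOCKS 32768UL  /* assumed per-transaction allocation limit */

static void fake_journal_start(struct fake_fs *fs) { fs->transactions++; }
static void fake_journal_stop(struct fake_fs *fs) { (void)fs; }

/* Allocate up to CHUNK_BLOCKS; returns blocks allocated or -ENOSPC. */
static long fake_do_fallocate(struct fake_fs *fs, unsigned long want)
{
	unsigned long n = want < CHUNK_BLOCKS ? want : CHUNK_BLOCKS;

	if (fs->free_blocks == 0)
		return -ENOSPC;
	if (n > fs->free_blocks)
		n = fs->free_blocks;
	fs->free_blocks -= n;
	return (long)n;
}

/*
 * One transaction per chunk. Returns 0 or a negative errno; *done
 * reports how many blocks were allocated before any failure.
 */
static int fallocate_loop(struct fake_fs *fs, unsigned long len,
			  unsigned long *done)
{
	*done = 0;
	while (len) {
		long ret;

		fake_journal_start(fs);
		ret = fake_do_fallocate(fs, len);
		fake_journal_stop(fs);
		if (ret < 0)
			return (int)ret;
		assert((unsigned long)ret <= len);
		len -= (unsigned long)ret;
		*done += (unsigned long)ret;
	}
	return 0;
}
```

Each pass asks for journal credits for one bounded chunk only, so the
credit estimate no longer has to depend on the total length.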


Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space?  We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.



Does the proposed implementation handle quotas correctly, btw?  Has that
been tested?


Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only.  I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 20:58                                     ` Andrew Morton
@ 2007-05-07 22:21                                       ` Andreas Dilger
  2007-05-07 22:38                                         ` Andrew Morton
  2007-05-07 23:02                                         ` Jeff Garzik
  2007-05-08  0:00                                       ` Mingming Cao
  2007-05-14 13:34                                       ` Jan Kara
  2 siblings, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-05-07 22:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> Final point: it's fairly disappointing that the present implementation is
> ext4-only, and extent-only.  I do think we should be aiming at an ext4
> bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 22:21                                       ` Andreas Dilger
@ 2007-05-07 22:38                                         ` Andrew Morton
  2007-05-07 23:14                                           ` Theodore Tso
  2007-05-07 23:02                                         ` Jeff Garzik
  1 sibling, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-07 22:38 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Mon, 7 May 2007 15:21:04 -0700
Andreas Dilger <adilger@clusterfs.com> wrote:

> On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> > Final point: it's fairly disappointing that the present implementation is
> > ext4-only, and extent-only.  I do think we should be aiming at an ext4
> > bitmap-based implementation and an ext3 implementation.
> 
> Actually, this is a non-issue.  The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing.  For other filesystems (including ext3 and
> ext4 with block-mapped files) the filesystem should return an error (e.g.
> -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

hrm, spose so.

It can be a bit suboptimal from the layout POV.  The reservations code will
largely save us here, but kernel support might make it a bit better.

Totally blowing pagecache could be a problem.  Fixable in userspace by
using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't.
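
For illustration, a userspace zero-fill that avoids filling the
pagecache might look roughly like this. It is a sketch, not what glibc
does: zero_fill_nocache() and ZERO_CHUNK are made-up names, and the code
overwrites whatever is already in the range, so it is only suitable for
a freshly created region.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK (1 << 20)  /* 1 MiB per step; an arbitrary choice */

/*
 * Write zeros over [offset, offset + len), pushing each chunk to disk
 * and then dropping it from the pagecache. Returns 0 or a negative errno.
 * Assumption: the range holds no live data (it is overwritten).
 */
static int zero_fill_nocache(int fd, off_t offset, off_t len)
{
	static char zeros[ZERO_CHUNK];  /* static storage starts zeroed */

	assert(offset >= 0 && len >= 0);
	while (len > 0) {
		size_t want = len < ZERO_CHUNK ? (size_t)len : ZERO_CHUNK;
		ssize_t n = pwrite(fd, zeros, want, offset);
		int err;

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return -EIO;

		/* Start writeback for this range and wait for it. */
		if (sync_file_range(fd, offset, n,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			return -errno;

		/* The pages are clean now, so they can be discarded. */
		err = posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
		if (err)
			return -err;

		offset += n;
		len -= n;
	}
	return 0;
}
```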


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 22:21                                       ` Andreas Dilger
  2007-05-07 22:38                                         ` Andrew Morton
@ 2007-05-07 23:02                                         ` Jeff Garzik
  2007-05-07 23:36                                           ` Theodore Tso
  2007-05-08  1:07                                           ` Andreas Dilger
  1 sibling, 2 replies; 340+ messages in thread
From: Jeff Garzik @ 2007-05-07 23:02 UTC (permalink / raw)
  To: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

Andreas Dilger wrote:
> On May 07, 2007  13:58 -0700, Andrew Morton wrote:
>> Final point: it's fairly disappointing that the present implementation is
>> ext4-only, and extent-only.  I do think we should be aiming at an ext4
>> bitmap-based implementation and an ext3 implementation.
> 
> Actually, this is a non-issue.  The reason that it is handled for extent-only
> is that this is the only way to allocate space in the filesystem without
> doing the explicit zeroing.  For other filesystems (including ext3 and

Precisely /how/ do you avoid the zeroing issue, for extents?

If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
otherwise the implementation is broken.

	Jeff



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 22:38                                         ` Andrew Morton
@ 2007-05-07 23:14                                           ` Theodore Tso
  2007-05-07 23:31                                             ` Andrew Morton
  0 siblings, 1 reply; 340+ messages in thread
From: Theodore Tso @ 2007-05-07 23:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > Actually, this is a non-issue.  The reason that it is handled for extent-only
> > is that this is the only way to allocate space in the filesystem without
> > doing the explicit zeroing.  For other filesystems (including ext3 and
> > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> 
> It can be a bit suboptimal from the layout POV.  The reservations code will
> largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This will result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced....  what do folks think?

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 23:14                                           ` Theodore Tso
@ 2007-05-07 23:31                                             ` Andrew Morton
  2007-05-08  0:30                                               ` Mingming Cao
  0 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-07 23:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Mon, 7 May 2007 19:14:42 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > Actually, this is a non-issue.  The reason that it is handled for extent-only
> > > is that this is the only way to allocate space in the filesystem without
> > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> > 
> > It can be a bit suboptimal from the layout POV.  The reservations code will
> > largely save us here, but kernel support might make it a bit better.
> 
> Actually, the reservations code won't matter, since glibc will fall
> back to its current behavior, which is it will do the preallocation by
> explicitly writing zeros to the file.

No!  Reservations code is *critical* here.  Without reservations, we get
disastrously-bad layout if two processes were running a large fallocate()
at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
lengths save us).

My point is that even though reservations save us, we could do even-better
in-kernel.

But then, a smart application would bypass the glibc fallocate()
implementation and would tune the reservation window size and would use
direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).

> This will result in the same
> layout as if we had done the persistent preallocation, but of course
> it will mean the posix_fallocate() could potentially take a long time
> if you're a PVR and you're reserving a gig or two for a two hour movie
> at high quality.  That seems suboptimal, granted, and ideally the
> application should be warned about this before it calls
> posix_fallocate().  On the other hand, it's what happens today, all
> the time, so applications won't be too badly surprised.

A PVR implementor would take all this over and would do it themselves, for
sure.

> If we think applications programmers badly need to know in advance if
> posix_fallocate() will be fast or slow, probably the right thing is to
> define a new fpathconf() configuration option so they can query to see
> whether a particular file will support a fast posix_fallocate().  I'm
> not 100% convinced such complexity is really needed, but I'm willing
> to be convinced....  what do folks think?
> 

An application could do sys_fallocate(one-byte) to work out whether it's
supported in-kernel, I guess.
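
Such a probe could be sketched as below. Note the assumptions: it uses
the fallocate() wrapper and FALLOC_FL_KEEP_SIZE flag found in current
glibc rather than the interface as proposed in this thread, the function
name is made up, and on success the probe really does allocate a block
at offset 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Probe for in-kernel preallocation by asking for a single byte.
 * Returns 1 if supported, 0 if not, or a negative errno for any
 * other failure. KEEP_SIZE avoids changing the file size.
 */
static int kernel_fallocate_supported(int fd)
{
	assert(fd >= 0);
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1) == 0)
		return 1;
	if (errno == EOPNOTSUPP || errno == ENOSYS)
		return 0;
	return -errno;
}
```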


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 23:02                                         ` Jeff Garzik
@ 2007-05-07 23:36                                           ` Theodore Tso
  2007-05-08  1:07                                           ` Andreas Dilger
  1 sibling, 0 replies; 340+ messages in thread
From: Theodore Tso @ 2007-05-07 23:36 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> >>Final point: it's fairly disappointing that the present implementation is
> >>ext4-only, and extent-only.  I do think we should be aiming at an ext4
> >>bitmap-based implementation and an ext3 implementation.
> >
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only
> >is that this is the only way to allocate space in the filesystem without
> >doing the explicit zeroing.  For other filesystems (including ext3 and
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as uninitialized, ext4 returns zeros, to avoid returning the
uninitialized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitialized, we may potentially
have to split the extent into three extents in the worst case.
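
The worst-case split can be sketched as follows. This is a simplified
model, not the patch's code: struct sketch_extent and split_for_write()
are hypothetical, and the flag is kept in a separate field rather than
in a bit of the length.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified extent: logical blocks only, no physical mapping. */
struct sketch_extent {
	uint32_t start;   /* first logical block */
	uint32_t len;     /* number of blocks */
	int uninit;       /* 1 if allocated but never written */
};

/*
 * Split the uninitialized extent ex for a write of wlen blocks at
 * wstart. The write must lie entirely inside ex. Fills out[] in block
 * order and returns the number of resulting extents (1 to 3).
 */
static int split_for_write(struct sketch_extent ex, uint32_t wstart,
			   uint32_t wlen, struct sketch_extent out[3])
{
	int n = 0;
	uint32_t ex_end = ex.start + ex.len;
	uint32_t w_end = wstart + wlen;

	assert(ex.uninit && wlen > 0);
	assert(wstart >= ex.start && w_end <= ex_end);

	/* Untouched head stays uninitialized. */
	if (wstart > ex.start)
		out[n++] = (struct sketch_extent){ ex.start, wstart - ex.start, 1 };

	/* The written range becomes an initialized extent. */
	out[n++] = (struct sketch_extent){ wstart, wlen, 0 };

	/* Untouched tail stays uninitialized. */
	if (w_end < ex_end)
		out[n++] = (struct sketch_extent){ w_end, ex_end - w_end, 1 };

	return n;
}
```

A write in the middle yields three extents, a write at either end yields
two, and a write covering the whole extent just clears the flag.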

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
a factor of two (i.e., to 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 20:58                                     ` Andrew Morton
  2007-05-07 22:21                                       ` Andreas Dilger
@ 2007-05-08  0:00                                       ` Mingming Cao
  2007-05-08  0:15                                         ` Andrew Morton
  2007-05-14 13:34                                       ` Jan Kara
  2 siblings, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-05-08  0:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna

On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 05:37:54 -0600
> Andreas Dilger <adilger@clusterfs.com> wrote:
> 
> > > > +	block = offset >> blkbits;
> > > > +	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > > > +			 - block;
> > > > +	mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > > > +	credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > > > +	mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> > > 
> > > Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> > > space, and that this disk space will require an arbitrary amount of
> > > metadata, how can we work out how much journal space we'll be needing
> > > without at least looking at `len'?
> > 
> > Good question.
> > 
> > The uninitialized extent can cover up to 128MB with a single entry.
> > If @path isn't specified, then ext4_ext_calc_credits_for_insert()
> > function returns the maximum number of extents needed to insert a leaf,
> > including splitting all of the index blocks.  That would allow up to 43GB
> > (340 extents/block * 128MB) to be preallocated, but it still needs to take
> > the size of the preallocation into account (adding 3 blocks per 43GB - a
> > leaf block, a bitmap block and a group descriptor).
> 
> I think the use of ext4_journal_extend() (as Amit has proposed) will help
> here, but it is not sufficient.
> 
> Because under some circumstances, a journal_extend() failure could mean
> that we fail to allocate all the required disk space.  If it is infrequent
> enough, that is acceptable when the caller is using fallocate() for
> performance reasons.
> 
> But it is very much not acceptable if the caller is using fallocate() for
> space-reservation reasons.  If you used fallocate to reserve 1GB of disk
> and fallocate() "succeeded" and you later get ENOSPC then you'd have a
> right to get a bit upset.
> 
> So I think the ext3/4 fallocate() implementation will need to be
> implemented as a loop: 
> 
> 	while (len) {
> 		journal_start();
> 		len -= do_fallocate(len, ...);
> 		journal_stop();
> 	}
> 
> 

I agree.  There is already a loop in Amit's current patch that calls
ext4_ext_get_blocks(), though. The question is how many credits ext4
should ask for in each journal_start().
> +/*
> + * ext4_fallocate:
> + * preallocate space for a file
> + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> + */
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> +{
....

> +       mutex_lock(&EXT4_I(inode)->truncate_mutex);
> +       credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> +       mutex_unlock(&EXT4_I(inode)->truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunks of blocks (extents) for this
single preallocation request when the fs is fragmented (or part of the
preallocation request is already fulfilled).

I think we should move this calculation inside the loop as well, and we
really do not need to grab the lock to calculate the credits if @path
is always NULL; all the function does is mathematics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. I looked at ext4_get_block(), which
the DIO code uses to deal with large block allocations. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32).

> +       handle=ext4_journal_start(inode, credits +
> +                                       EXT4_DATA_TRANS_BLOCKS(inode->i_sb)+1);
> +       if (IS_ERR(handle))
> +               return PTR_ERR(handle);
> +retry:
> +       ret = 0;
> +       while (ret >= 0 && ret < max_blocks) {
> +               block = block + ret;
> +               max_blocks = max_blocks - ret;
> +               ret = ext4_ext_get_blocks(handle, inode, block,
> +                                         max_blocks, &map_bh,
> +                                         EXT4_CREATE_UNINITIALIZED_EXT, 0);
> +               BUG_ON(!ret);
> +               if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> +                       && ((block + ret) > (i_size_read(inode) << blkbits)))
> +                       nblocks = nblocks + ret;
> +       }
> +
> +       if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> +               goto retry;
> +
> Now the interesting question is: what do we do if we get halfway through
> this loop and then run out of space?  We could leave the disk all filled up
> and then return failure to the caller, but that's pretty poor behaviour,
> IMO.
> 
The current code handles an earlier ENOSPC by retrying up to three times.
After that, if we still run out of space, then it's probably right to
notify the caller that there isn't much space left.

We could extend the block reservation window size before the while loop
to lower the chance of the allocation getting fragmented.

> 
> Does the proposed implementation handle quotas correctly, btw?  Has that
> been tested?
> 
I think so. ext4_ext_get_blocks() will end up calling
ext4_new_blocks() to do the real block allocation; quota is
handled there, and is therefore tested already.


Mingming



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  0:00                                       ` Mingming Cao
@ 2007-05-08  0:15                                         ` Andrew Morton
  2007-05-08  0:41                                           ` Mingming Cao
  0 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-08  0:15 UTC (permalink / raw)
  To: cmm
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna

On Mon, 07 May 2007 17:00:24 -0700
Mingming Cao <cmm@us.ibm.com> wrote:

> > +       while (ret >= 0 && ret < max_blocks) {
> > +               block = block + ret;
> > +               max_blocks = max_blocks - ret;
> > +               ret = ext4_ext_get_blocks(handle, inode, block,
> > +                                         max_blocks, &map_bh,
> > +                                         EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +               BUG_ON(!ret);
> > +               if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > +                       && ((block + ret) > (i_size_read(inode) << blkbits)))
> > +                       nblocks = nblocks + ret;
> > +       }
> > +
> > +       if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > +               goto retry;
> > +
> > Now the interesting question is: what do we do if we get halfway through
> > this loop and then run out of space?  We could leave the disk all filled up
> > and then return failure to the caller, but that's pretty poor behaviour,
> > IMO.
> > 
> The current code handles earlier ENOSPC by three times retries. After
> that if we still run out of space, then it's probably right to notify
> the caller there isn't much space left.
> 
> We could extend the block reservation window size before the while loop
> so we could get a lower chance to get more fragmented.

yes, but my point is that the proposed behaviour is really quite bad.

We will attempt to allocate the disk space and then we will return failure,
having consumed all the disk space and having partially and uselessly
populated an unknown amount of the file.

Userspace could presumably repair the mess in most situations by truncating
the file back again.  The kernel cannot do that because there might be live
data in amongst there.

So we'd need to either keep track of which blocks were newly-allocated and
then free them all again on the error path (doesn't work right across
commit+crash+recovery) or we could later use the space-reservation scheme which
delayed allocation will need to introduce.

Or we could decide to live with the above IMO-crappy behaviour.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 23:31                                             ` Andrew Morton
@ 2007-05-08  0:30                                               ` Mingming Cao
  0 siblings, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-05-08  0:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, Andreas Dilger, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna

On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
> On Mon, 7 May 2007 19:14:42 -0400
> Theodore Tso <tytso@mit.edu> wrote:
> 
> > On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
> > > > Actually, this is a non-issue.  The reason that it is handled for extent-only
> > > > is that this is the only way to allocate space in the filesystem without
> > > > doing the explicit zeroing.  For other filesystems (including ext3 and
> > > > ext4 with block-mapped files) the filesystem should return an error (e.g.
> > > > -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
> > > 
> > > It can be a bit suboptimal from the layout POV.  The reservations code will
> > > largely save us here, but kernel support might make it a bit better.
> > 
> > Actually, the reservations code won't matter, since glibc will fall
> > back to its current behavior, which is it will do the preallocation by
> > explicitly writing zeros to the file.
> 
> No!  Reservations code is *critical* here.  Without reservations, we get
> disastrously-bad layout if two processes were running a large fallocate()
> at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
> lengths save us).
> 
> My point is that even though reservations save us, we could do even-better
> in-kernel.
> 

In this case, since the number of blocks to preallocate (e.g. N=10GB) is
clear, we could improve the current reservation code to allow callers to
explicitly ask for a new window that has at least N free blocks for
the blocks to be preallocated (rather than just at least 1 free
block).

Before ext4_fallocate() is called, the right reservation window size
is set, with a flag indicating "please spend time if needed to find
a window that covers at least N free blocks".

So for ext4 block-mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will know to reserve the right
amount of free blocks before allocating and zeroing the 10GB of space.

I am not sure whether this is worth the effort though.

> But then, a smart application would bypass the glibc() fallocate()
> implementation and would tune the reservation window size and would use
> direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
> 
> > This will result in the same
> > layout as if we had done the persistent preallocation, but of course
> > it will mean the posix_fallocate() could potentially take a long time
> > if you're a PVR and you're reserving a gig or two for a two hour movie
> > at high quality.  That seems suboptimal, granted, and ideally the
> > application should be warned about this before it calls
> > posix_fallocate().  On the other hand, it's what happens today, all
> > the time, so applications won't be too badly surprised.
> 
> A PVR implementor would take all this over and would do it themselves, for
> sure.
> 
> > If we think applications programmers badly need to know in advance if
> > posix_fallocate() will be fast or slow, probably the right thing is to
> > define a new fpathconf() configuration option so they can query to see
> > whether a particular file will support a fast posix_fallocate().  I'm
> > not 100% convinced such complexity is really needed, but I'm willing
> > to be convinced....  what do folks think?
> > 
> 
> An application could do sys_fallocate(one-byte) to work out whether it's
> supported in-kernel, I guess.
> 


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  0:15                                         ` Andrew Morton
@ 2007-05-08  0:41                                           ` Mingming Cao
  2007-05-08  1:43                                             ` Theodore Tso
  0 siblings, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-05-08  0:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna

On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
> On Mon, 07 May 2007 17:00:24 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > > +       while (ret >= 0 && ret < max_blocks) {
> > > +               block = block + ret;
> > > +               max_blocks = max_blocks - ret;
> > > +               ret = ext4_ext_get_blocks(handle, inode, block,
> > > +                                         max_blocks, &map_bh,
> > > +                                         EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > > +               BUG_ON(!ret);
> > > +               if (ret > 0 && test_bit(BH_New, &map_bh.b_state)
> > > +                       && ((block + ret) > (i_size_read(inode) << blkbits)))
> > > +                       nblocks = nblocks + ret;
> > > +       }
> > > +
> > > +       if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > +               goto retry;
> > > +
> > > Now the interesting question is: what do we do if we get halfway through
> > > this loop and then run out of space?  We could leave the disk all filled up
> > > and then return failure to the caller, but that's pretty poor behaviour,
> > > IMO.
> > > 
> > The current code handles earlier ENOSPC by three times retries. After
> > that if we still run out of space, then it's probably right to notify
> > the caller there isn't much space left.
> > 
> > We could extend the block reservation window size before the while loop
> > so we could get a lower chance to get more fragmented.
> 
> yes, but my point is that the proposed behaviour is really quite bad.
> 
I agree with your point; that's why I mentioned that it only helps with
the fragmentation issue, not the ENOSPC case.


> We will attempt to allocate the disk space and then we will return failure,
> having consumed all the disk space and having partially and uselessly
> populated an unknown amount of the file.
> 

Not totally useless, I think. If only half of the space is preallocated
because we ran out of space, the application can decide whether that is
good enough to start using the preallocated space, or whether to wait for
the fs to have more free space.

> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.
> 
> So we'd need to either keep track of which blocks were newly-allocated and
> then free them all again on the error path (doesn't work right across
> commit+crash+recovery) or we could later use the space-reservation scheme which
> delayed allocation will need to introduce.
> 
> Or we could decide to live with the above IMO-crappy behaviour.

In fact, Amit and I raised this issue before: whether it's okay to
allow partial preallocation. At that time the feedback was that it's not
much different from the current zero-out preallocation behavior: people
might preallocate half-way and then have to deal with ENOSPC later.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
has at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just an estimate though.
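
A minimal userspace sketch of the pre-check described above, assuming the
free-block figure comes from statvfs() rather than the in-kernel counter
this thread is about. The names request_fits() and enough_free_space() are
hypothetical, made up for illustration. The figure is only an estimate and
the check is racy against other processes allocating at the same time, so
it can only catch the obvious cases.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: does a request of `len` bytes fit into
 * `free_blocks` blocks of `block_size` bytes each?  The request is
 * rounded up to whole blocks.
 */
int request_fits(uint64_t len, uint64_t free_blocks, uint64_t block_size)
{
	uint64_t needed;

	if (block_size == 0)
		return 0;
	needed = len / block_size + (len % block_size != 0);
	return needed <= free_blocks;
}

/*
 * Hypothetical pre-check before preallocating `len` bytes on the
 * filesystem holding `path`.  Returns 1 if the request appears to fit,
 * 0 if it clearly does not, and -1 if statvfs() itself failed.
 */
int enough_free_space(const char *path, uint64_t len)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return -1;
	return request_fits(len, (uint64_t)sv.f_bavail, (uint64_t)sv.f_frsize);
}
```

A caller would run enough_free_space() first and skip the preallocation
when it returns 0. A return of 1 still does not guarantee that the
allocation itself will succeed.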


Mingming


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 23:02                                         ` Jeff Garzik
  2007-05-07 23:36                                           ` Theodore Tso
@ 2007-05-08  1:07                                           ` Andreas Dilger
  2007-05-08  1:25                                             ` Jeff Garzik
  1 sibling, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-05-08  1:07 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only is that this is the only way to allocate space in the
> >filesystem without doing the explicit zeroing.
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.
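
The split-on-write behaviour can be sketched with a toy model. This is an
illustration only, not the ext4 code: struct toy_extent and
toy_split_on_write() are hypothetical names, the struct is not the on-disk
ext4_extent layout, and the write is assumed to fall entirely inside one
uninitialized extent.

```c
#include <stdint.h>

/* Toy extent: `len` logical blocks starting at block `start`. */
struct toy_extent {
	uint32_t start;
	uint32_t len;
	int uninit;	/* 1 = preallocated, reads return zeros */
};

/*
 * Split the uninitialized extent `ex` around a write that covers blocks
 * [wstart, wstart + wlen).  Fills `out` (room for 3 entries) in logical
 * order and returns the number of resulting extents (1, 2 or 3), or 0
 * if the write is empty or not contained in `ex`.
 */
int toy_split_on_write(struct toy_extent ex, uint32_t wstart, uint32_t wlen,
		       struct toy_extent out[3])
{
	uint32_t ex_end = ex.start + ex.len;
	uint32_t wend = wstart + wlen;
	int n = 0;

	if (wlen == 0 || wstart < ex.start || wend > ex_end)
		return 0;

	/* Untouched head stays uninitialized. */
	if (wstart > ex.start)
		out[n++] = (struct toy_extent){ ex.start, wstart - ex.start, 1 };

	/* The written blocks become a regular (initialized) extent. */
	out[n++] = (struct toy_extent){ wstart, wlen, 0 };

	/* Untouched tail stays uninitialized. */
	if (wend < ex_end)
		out[n++] = (struct toy_extent){ wend, ex_end - wend, 1 };

	return n;
}
```

A write in the middle yields three extents, a write touching one end
yields two, and a write covering the whole extent just clears the flag.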

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  1:07                                           ` Andreas Dilger
@ 2007-05-08  1:25                                             ` Jeff Garzik
  0 siblings, 0 replies; 340+ messages in thread
From: Jeff Garzik @ 2007-05-08  1:25 UTC (permalink / raw)
  To: Jeff Garzik, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

Andreas Dilger wrote:
> My comment was just that the extent doesn't have to be explicitly zero
> filled on the disk, by virtue of the fact that the uninitialized flag
> will cause reads to return zero.


Agreed, thanks for the clarification.

	Jeff



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  0:41                                           ` Mingming Cao
@ 2007-05-08  1:43                                             ` Theodore Tso
  2007-05-08 16:52                                               ` Andreas Dilger
  2007-05-08 17:46                                               ` Mingming Cao
  0 siblings, 2 replies; 340+ messages in thread
From: Theodore Tso @ 2007-05-08  1:43 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Andrew Morton, Andreas Dilger, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> We could check the total number of fs free blocks account before
> preallocation happens, if there isn't enough space left, there is no
> need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will
prevent the obvious error case where someone tries to preallocate 10GB
when there is only 2GB left.  But it won't help if there are multiple
processes trying to allocate blocks the same time.  On the other hand,
that case is probably relatively rare, and in that case, the
filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> Userspace could presumably repair the mess in most situations by truncating
> the file back again.  The kernel cannot do that because there might be live
> data in amongst there.

Actually, the kernel could do it, in that it could simply release all
uninitialized extents back to the system.  The problem is distinguishing
between the uninitialized extents that had just been newly added, versus
the ones that had been there from before.  (On the other hand, if the
filesystem was completely full, releasing uninitialized blocks wouldn't
be the worst thing in the world to do, although releasing previously
fallocated blocks probably does violate the principle of least
surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> If there is enough free space, we could make a reservation window that
> have at least N free blocks and mark it not stealable by other files. So
> later we will not run into the ENOSPC error.

Could you really use a single reservation window?  When the filesystem
is almost full, the free extents are likely going to be scattered all
over the disk.  The general principle of grabbing all of the extents
and keeping them in an in-memory data structure, and only adding them
to the extent tree would work, though; I'm just not sure we could do
it using the existing reservation window code, since it only supports
a single reservation window per file, yes?

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5] fallocate system call
  2007-05-03 11:22                                       ` Miquel van Smoorenburg
@ 2007-05-08  2:26                                         ` David Chinner
  0 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-05-08  2:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: adilger, linux-kernel

On Thu, May 03, 2007 at 01:22:48PM +0200, Miquel van Smoorenburg wrote:
> In article <20070503103425.GE6220@schatzie.adilger.int> you write:
> >On May 02, 2007  18:23 +0530, Amit K. Arora wrote:
> >> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> >> > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
> >> > 
> >> > > For FA_ALLOCATE, it's supposed to change the file size if we
> >> > > allocate past EOF, right?
> >> > 
> >> > I would argue no.  Use truncate for that.
> >> 
> >> The patch I posted for ext4 *does* change the filesize after
> >> preallocation, if required (i.e. when preallocation is after EOF).
> >> I may have to change that, if we decide on not doing this.
> >
> >I think I'd agree - it may be useful to allow preallocation beyond EOF
> >for some kinds of applications (e.g. PVR preallocating live TV in 10
> >minute segments or something, but not knowing in advance how long the
> >show will actually be recorded or the final encoded size).
> 
> I have an application (diablo dreader) where the header-info database
> basically consists of ~40.000 files, one for each group (it's more
> complicated than that, but never mind that now).
> 
> If you grow those files randomly by a few hundred bytes every update,
> the filesystem gets hopelessly fragmented.
> 
> I'm using XFS with preallocation turned on, and biosize=18 (which
> makes it preallocate in blocks of 256KB), and a homebrew patch that
> leaves the preallocated space on disk preallocated even if the
> file is closed .. and it helps enormously.

XFS always has speculative preallocation turned on - this is
different to explicit preallocation which we are talking about
here ;)

FWIW, the reason you need your homebrew patch is that speculative
allocation does not set the PREALLOC bit on the inode, and so when
you close the file the speculative prealloc gets truncated away.
If you use a real preallocation (XFS_IOC_RESVSP64) or the upcoming
fallocate() syscall, XFS also sets the PREALLOC bit in the inode so
it doesn't get truncated away on file close.

If you don't want to use XFS_IOC_RESVSP64, you could just use
XFS_IOC_FSSETXATTR to set the prealloc bit on the files you care
about so you don't need a hack in XFS to prevent truncation of
speculative allocation on file close.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 15:24                                     ` Dave Kleikamp
@ 2007-05-08 10:52                                       ` Amit K. Arora
  2007-05-08 14:47                                         ` Dave Kleikamp
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-08 10:52 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
> On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:
> > > On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > > > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> > > > +{
> > > > +	handle_t *handle;
> > > > +	ext4_fsblk_t block, max_blocks;
> > > > +	int ret, ret2, nblocks = 0, retries = 0;
> > > > +	struct buffer_head map_bh;
> > > > +	unsigned int credits, blkbits = inode->i_blkbits;
> > > > +
> > > > +	/* Currently supporting (pre)allocate mode _only_ */
> > > > +	if (mode != FA_ALLOCATE)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > > > +		return -ENOTTY;
> > > 
> > > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > > news.  The changelog would be an appropriate place to communicate this,
> > > along with reasons why, or a description of the plan to fix it.
> > 
> > Ok. Will add this in the function description as well.
> > 
> > > Also, posix says nothing about fallocate() returning ENOTTY.
> > 
> > Right. I don't seem to find any suitable error from posix description.
> > Can you please suggest an error code which might make more sense here ?
> > Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> > non-extent files.
> 
> Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
> here, and fall back to the current library code to do preallocation?
> This way, the caller of fallocate() will never see this return code, so
> it won't violate posix.

You are right.

But, we still need to "standardize" (and limit) the error codes
which we should return from kernel when we want to fall back on the
library implementation. The posix_fallocate() library function will have
to look for a set of errors from fallocate() system call, upon receiving
which it will do preallocation from user level; or else, it will return
success/error-code returned by the system call to the user.

I think we can make it fall back to library implementation of fallocate,
whenever posix_fallocate() receives any of the following errors from
fallocate() system call:

1. ENOSYS
2. EOPNOTSUPP
3. ENOTTY        (?)

Now the question is - should we limit the set of errors for this purpose
to just 1 & 2 above ? In that case I will need to change the error being
returned here to -EOPNOTSUPP (from current -ENOTTY).
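
To make the proposal concrete, here is a minimal sketch of the decision the
library side would make, assuming the fallback set is limited to errors 1
and 2 above. The name should_fall_back() is hypothetical; this is not
glibc code.

```c
#include <errno.h>

/*
 * Hypothetical helper for a posix_fallocate() wrapper: given the
 * (positive) error number reported by the fallocate() system call,
 * should the library emulate the preallocation from user level?
 */
int should_fall_back(int err)
{
	switch (err) {
	case ENOSYS:		/* kernel has no fallocate() system call */
	case EOPNOTSUPP:	/* this filesystem or file cannot do it */
		return 1;
	default:		/* e.g. ENOSPC, EBADF: hand back to the caller */
		return 0;
	}
}
```

With this set, an -ENOTTY from the kernel would be passed straight to the
application, which is why the ext4 return value would have to change to
-EOPNOTSUPP.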

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08 10:52                                       ` Amit K. Arora
@ 2007-05-08 14:47                                         ` Dave Kleikamp
  0 siblings, 0 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-05-08 14:47 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andrew Morton, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna, cmm

On Tue, 2007-05-08 at 16:22 +0530, Amit K. Arora wrote:
> On Mon, May 07, 2007 at 10:24:37AM -0500, Dave Kleikamp wrote:
> > On Mon, 2007-05-07 at 17:37 +0530, Amit K. Arora wrote:
> > > On Thu, May 03, 2007 at 09:31:33PM -0700, Andrew Morton wrote:

> > > > So we don't implement fallocate on bitmap-based files!  Well that's huge
> > > > news.  The changelog would be an appropriate place to communicate this,
> > > > along with reasons why, or a description of the plan to fix it.
> > > 
> > > Ok. Will add this in the function description as well.
> > > 
> > > > Also, posix says nothing about fallocate() returning ENOTTY.
> > > 
> > > Right. I don't seem to find any suitable error from posix description.
> > > Can you please suggest an error code which might make more sense here ?
> > > Will -ENOTSUPP be ok ? Since we want to say here that we don't support
> > > non-extent files.
> > 
> > Isn't the idea that libc will interpret -ENOTTY, or whatever is returned
> > here, and fall back to the current library code to do preallocation?
> > This way, the caller of fallocate() will never see this return code, so
> > it won't violate posix.
> 
> You are right.
> 
> But, we still need to "standardize" (and limit) the error codes
> which we should return from kernel when we want to fall back on the
> library implementation. The posix_fallocate() library function will have
> to look for a set of errors from fallocate() system call, upon receiving
> which it will do preallocation from user level; or else, it will return
> success/error-code returned by the system call to the user.
> 
> I think we can make it fall back to library implementation of fallocate,
> whenever posix_fallocate() receives any of the following errors from
> fallocate() system call:
> 
> 1. ENOSYS
> 2. EOPNOTSUPP
> 3. ENOTTY        (?)
> 
> Now the question is - should we limit the set of errors for this purpose
> to just 1 & 2 above ? In that case I will need to change the error being
> returned here to -EOPNOTSUPP (from current -ENOTTY).

If you want my opinion, -EOPNOTSUPP is better than -ENOTTY.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  1:43                                             ` Theodore Tso
@ 2007-05-08 16:52                                               ` Andreas Dilger
  2007-05-08 17:46                                               ` Mingming Cao
  1 sibling, 0 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-05-08 16:52 UTC (permalink / raw)
  To: Theodore Tso, Mingming Cao, Andrew Morton, Amit K. Arora,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna

On May 07, 2007  21:43 -0400, Theodore Tso wrote:
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> > Userspace could presumably repair the mess in most situations by truncating
> > the file back again.  The kernel cannot do that because there might be live
> > data in amongst there.
> 
> Actually, the kernel could do it, in that it could simply release all
> uninitialized extents back to the system.  The problem is distinguishing
> between the uninitialized extents that had just been newly added, versus
> the ones that had been there from before.  (On the other hand, if the
> filesystem was completely full, releasing uninitialized blocks wouldn't
> be the worst thing in the world to do, although releasing previously
> fallocated blocks probably does violate the principle of least
> surprise, even if it's what the user would have wanted.)

I tend to agree with this.  Having fallocate() fill up the filesystem
is exactly what the caller asked for.  A write() that hits ENOSPC doesn't
truncate off the whole write either, nor does "dd" delete the whole file
when the filesystem is full.

Even checking the statfs() space before doing the fallocate() may be
counterintuitive, since it will return ENOSPC but the filesystem will
not actually be full.  Some applications (e.g. databases) may WANT to
fill the filesystem and then get the actual file size back to avoid
trusting statfs() because of metadata overhead (e.g. indirect blocks).

One of the design goals for sys_fallocate() was to allow FA_DELALLOC
to deallocate unwritten extents in a safe manner.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-08  1:43                                             ` Theodore Tso
  2007-05-08 16:52                                               ` Andreas Dilger
@ 2007-05-08 17:46                                               ` Mingming Cao
  1 sibling, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-05-08 17:46 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Andreas Dilger, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna

On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote:
> On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> > We could check the total number of fs free blocks account before
> > preallocation happens, if there isn't enough space left, there is no
> > need to bother preallocating.
> 
> Checking against the fs free blocks is a good idea, since it will
> prevent the obvious error case where someone tries to preallocate 10GB
> when there is only 2GB left.
Thinking about it again, this check is useful when preallocating blocks at
EOF. It's not very useful when preallocating a range with holes. In that
case 2GB of space might be enough even if the application tries to
preallocate 10GB.

>   But it won't help if there are multiple
> processes trying to allocate blocks the same time.  On the other hand,
> that case is probably relatively rare, and in that case, the
> filesystem was probably going to be left completely full in any case.

> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> > Userspace could presumably repair the mess in most situations by truncating
> > the file back again.  The kernel cannot do that because there might be live
> > data in amongst there.
> 
> Actually, the kernel could do it, in that it could simply release all
> uninitialized extents back to the system.  The problem is distinguishing
> between the uninitialized extents that had just been newly added, versus
> the ones that had been there from before.

True, the new uninitialized extents can be merged with the nearby old
uninitialized extents, so there is no way to distinguish the just-added
uninitialized extents from the merged one.

>   (On the other hand, if the
> filesystem was completely full, releasing uninitialized blocks wouldn't
> be the worst thing in the world to do, although releasing previously
> fallocated blocks probably does violate the principle of least
> surprise, even if it's what the user would have wanted.)
> 
> On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
> > If there is enough free space, we could make a reservation window that
> > have at least N free blocks and mark it not stealable by other files. So
> > later we will not run into the ENOSPC error.
> 
> Could you really use a single reservation window?  When the filesystem
> is almost full, the free extents are likely going to be scattered all
> over the disk.  The general principle of grabbing all of the extents
> and keeping them in an in-memory data structure, and only adding them
> to the extent tree would work, though; I'm just not sure we could do
> it using the existing reservation window code, since it only supports
> a single reservation window per file, yes?
> 
You are right.  There is one reservation window per file (and there is a
limit to the maximum window size). So yeah, this way is not going to
prevent ENOSPC for sure :(

Mingming


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-04  4:41                                   ` Paul Mackerras
@ 2007-05-09 10:15                                     ` Suparna Bhattacharya
  2007-05-09 10:50                                       ` Paul Mackerras
  0 siblings, 1 reply; 340+ messages in thread
From: Suparna Bhattacharya @ 2007-05-09 10:15 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

On Fri, May 04, 2007 at 02:41:50PM +1000, Paul Mackerras wrote:
> Andrew Morton writes:
> 
> > On Thu, 26 Apr 2007 23:33:32 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> > 
> > > This patch implements the fallocate() system call and adds support for
> > > i386, x86_64 and powerpc.
> > > 
> > > ...
> > >
> > > +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > Please add a comment over this function which specifies its behaviour. 
> > Really it should be enough material from which a full manpage can be
> > written.
> 
> This looks like it will have the same problem on s390 as
> sys_sync_file_range.  Maybe the prototype should be:
> 
> asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)

Yes, but the trouble is that there was a contrary viewpoint preferring that fd
first be maintained as a convention like other syscalls (see the following
posts)

http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas)
http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2  (Andreas)

http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy)

So we are kind of deadlocked, aren't we ?

The debates on the proposed solution for s390

http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2  
http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 
http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2

Are there any better ideas ?

Regards
Suparna

> 
> Paul.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 10:15                                     ` Suparna Bhattacharya
@ 2007-05-09 10:50                                       ` Paul Mackerras
  2007-05-09 11:10                                         ` Suparna Bhattacharya
  0 siblings, 1 reply; 340+ messages in thread
From: Paul Mackerras @ 2007-05-09 10:50 UTC (permalink / raw)
  To: suparna
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

Suparna Bhattacharya writes:

> > This looks like it will have the same problem on s390 as
> > sys_sync_file_range.  Maybe the prototype should be:
> > 
> > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> 
> Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> first be maintained as a convention like other syscalls (see the following
> posts)

Of course the interface used by an application program would have the
fd first.  Glibc can do the translation.

Paul.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 10:50                                       ` Paul Mackerras
@ 2007-05-09 11:10                                         ` Suparna Bhattacharya
  2007-05-09 11:37                                           ` Paul Mackerras
  0 siblings, 1 reply; 340+ messages in thread
From: Suparna Bhattacharya @ 2007-05-09 11:10 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

On Wed, May 09, 2007 at 08:50:44PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> 
> > > This looks like it will have the same problem on s390 as
> > > sys_sync_file_range.  Maybe the prototype should be:
> > > 
> > > asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode)
> > 
> > Yes, but the trouble is that there was a contrary viewpoint preferring that fd
> > first be maintained as a convention like other syscalls (see the following
> > posts)
> 
> Of course the interface used by an application program would have the
> fd first.  Glibc can do the translation.

I think that was understood.

Regards
Suparna

> 
> Paul.

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 11:10                                         ` Suparna Bhattacharya
@ 2007-05-09 11:37                                           ` Paul Mackerras
  2007-05-09 12:00                                             ` Martin Schwidefsky
  2007-05-09 12:05                                             ` Amit K. Arora
  0 siblings, 2 replies; 340+ messages in thread
From: Paul Mackerras @ 2007-05-09 11:37 UTC (permalink / raw)
  To: suparna
  Cc: Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

Suparna Bhattacharya writes:

> > Of course the interface used by an application program would have the
> > fd first.  Glibc can do the translation.
> 
> I think that was understood.

OK, then what does it matter what the glibc/kernel interface is, as
long as it works?

It's only a minor point; the order of arguments can vary between
architectures if necessary, but it's nicer if they don't have to.
32-bit powerpc will need to have the two int arguments adjacent in
order to avoid using more than 6 argument registers at the user/kernel
boundary, and s390 will need to avoid having a 64-bit argument last
(if I understand it correctly).

Paul.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 11:37                                           ` Paul Mackerras
@ 2007-05-09 12:00                                             ` Martin Schwidefsky
  2007-05-09 12:05                                             ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Martin Schwidefsky @ 2007-05-09 12:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: suparna, Andrew Morton, Amit K. Arora, torvalds, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

On 5/9/07, Paul Mackerras <paulus@samba.org> wrote:
> Suparna Bhattacharya writes:
>
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> >
> > I think that was understood.
>
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
>
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

Ah, almost but not quite the point. But I admit it is hard to understand..
The trouble started with the futex call which has been the first
system call with 6 arguments. s390 supported only 5 arguments up to
that point (%r2 - %r6). For futex we added a wrapper to the glibc that
loaded the 6th argument to %r7. In entry.S we set up things so that
%r7 gets stored to the kernel stack where normal C code expects the
first overflow argument. This enabled us to use the standard futex
system call with 6 arguments.
fallocate now has an additional problem: the last argument is a 64 bit
integer AND registers %r2-%r5 are already used. In this case the 64
bit number would have to be split into the high part in %r6 and the
low part on the stack so that the glibc wrapper can load the low part
to %r7. But the C compiler will skip %r6 and store the 64 bit number
on the stack.
If the order of the arguments is modified so that %r6 is assigned to a
32-bit argument, then the entry.S magic with %r7 would work.
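
A toy model of the register assignment may help. It is built only from the
description in this mail and is not a statement of the s390 ABI: the name
toy_assign_regs() is hypothetical, and the rule that a 64-bit argument
which does not fit goes to the stack with the leftover register skipped is
taken from the paragraph above.

```c
#include <stddef.h>

enum { FIRST_REG = 2, LAST_REG = 6 };	/* %r2..%r6 carry arguments */

/*
 * bits[i] is the width (32 or 64) of argument i.  where[i] receives the
 * first register used by argument i, or 0 if it is passed on the stack.
 * Returns the number of arguments that ended up on the stack.
 */
int toy_assign_regs(const int *bits, size_t nargs, int *where)
{
	int next = FIRST_REG;
	int on_stack = 0;
	size_t i;

	for (i = 0; i < nargs; i++) {
		int need = bits[i] == 64 ? 2 : 1;	/* 64 bit = register pair */

		if (next + need - 1 <= LAST_REG) {
			where[i] = next;
			next += need;
		} else {
			where[i] = 0;		/* goes to the stack */
			next = LAST_REG + 1;	/* assumption: leftover reg is skipped */
			on_stack++;
		}
	}
	return on_stack;
}
```

For (fd, mode, offset, len) the model puts len, a 64-bit value, on the
stack and leaves %r6 unused. For (offset, len, fd, mode) it assigns fd to
%r6 and only the 32-bit mode overflows, which is the case the %r7 wrapper
can handle.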

-- 
blue skies,
  Martin

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 11:37                                           ` Paul Mackerras
  2007-05-09 12:00                                             ` Martin Schwidefsky
@ 2007-05-09 12:05                                             ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-09 12:05 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: suparna, Andrew Morton, torvalds, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, cmm

On Wed, May 09, 2007 at 09:37:22PM +1000, Paul Mackerras wrote:
> Suparna Bhattacharya writes:
> 
> > > Of course the interface used by an application program would have the
> > > fd first.  Glibc can do the translation.
> > 
> > I think that was understood.
> 
> OK, then what does it matter what the glibc/kernel interface is, as
> long as it works?
> 
> It's only a minor point; the order of arguments can vary between
> architectures if necessary, but it's nicer if they don't have to.
> 32-bit powerpc will need to have the two int arguments adjacent in
> order to avoid using more than 6 argument registers at the user/kernel
> boundary, and s390 will need to avoid having a 64-bit argument last
> (if I understand it correctly).

You are right to say that. But it may not be _that_ minor a point,
especially for the arch that is affected. It has other implications,
such as those Heiko noted in his post:
http://lkml.org/lkml/2007/4/27/377
 - for example, having to modify glibc and the *trace utilities for a
particular arch.
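
The register-count concern Paul raises for 32-bit powerpc can be modelled roughly as below. This is a sketch under one assumption that is not stated in the thread: that a 64-bit argument occupies an aligned pair of argument registers, so one register may be skipped before it. The function name is hypothetical.

```c
#include <stddef.h>

/*
 * Count how many argument-register slots an argument list consumes.
 * widths[i] is 32 or 64. Assumption: a 64-bit argument must start on an
 * even slot index, so an odd slot is skipped (left unused) before it.
 */
int arg_slots_used(const int *widths, size_t n)
{
    int slot = 0;
    for (size_t i = 0; i < n; i++) {
        if (widths[i] == 64) {
            if (slot % 2 != 0)
                slot++;         /* skip one register to align the pair */
            slot += 2;          /* a 64-bit value takes two registers */
        } else {
            slot += 1;          /* a 32-bit value takes one register */
        }
    }
    return slot;
}
```

Under that assumption, the order (fd, mode, offset, len) uses 6 slots, while (fd, offset, mode, len) uses 8 because a register is skipped before each 64-bit argument.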

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
  2007-05-04  4:29                                 ` Andrew Morton
@ 2007-05-09 16:01                                 ` Amit K. Arora
  2007-05-09 16:54                                   ` Andreas Dilger
                                                     ` (2 more replies)
  1 sibling, 3 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-09 16:01 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.

But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:

1) Should the file size change when preallocation is done beyond EOF ?
   - Andreas and Chris Wedgwood are in favor of not changing the
     file size in this case. I also tend to agree with them. Does anyone
     have an argument in favor of changing the filesize ?
     If not, I will remove the code which changes the filesize, before I
     resubmit the concerned ext4 patch.

2) For FA_UNALLOCATE mode, should the file system allow unallocation
   of normal (non-preallocated) blocks (blocks allocated via
   regular write/truncate operations) also (i.e. work as punch()) ?
   - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
     we need to finalize on the convention here as a general guideline
     to all the filesystems that implement fallocate.

3) If above is true, the file size will need to be changed
   for "unallocation" when block holding the EOF gets unallocated.
   - If we do not "unallocate" normal (non-preallocated) blocks and we
     do not change the file size on preallocation, then this is a
     non-issue.

4) Should we update mtime & ctime on a successful allocation/
   unallocation ?
   - David Chinner raised this question in the following post:
     http://lkml.org/lkml/2007/4/29/407
     I think it makes sense to update the [mc]time for a successful
     preallocation/unallocation. Does anyone feel otherwise ?
     It will be interesting to know how XFS behaves currently. Does XFS
     update [mc]time for preallocation ?


--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 16:01                                 ` Amit K. Arora
@ 2007-05-09 16:54                                   ` Andreas Dilger
  2007-05-09 17:07                                   ` Mingming Cao
  2007-05-10  0:59                                   ` David Chinner
  2 siblings, 0 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-05-09 16:54 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>    of normal (non-preallocated) blocks (blocks allocated via
>    regular write/truncate operations) also (i.e. work as punch()) ?
>    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>      we need to finalize on the convention here as a general guideline
>      to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

> 3) If above is true, the file size will need to be changed
>    for "unallocation" when block holding the EOF gets unallocated.
>    - If we do not "unallocate" normal (non-preallocated) blocks and we
>      do not change the file size on preallocation, then this is a
>      non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE?

> 4) Should we update mtime & ctime on a successfull allocation/
>    unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the "this file was modified"
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 16:01                                 ` Amit K. Arora
  2007-05-09 16:54                                   ` Andreas Dilger
@ 2007-05-09 17:07                                   ` Mingming Cao
  2007-05-10  0:59                                   ` David Chinner
  2 siblings, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-05-09 17:07 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna

On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
> 
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
> 
> 1) Should the file size change when preallocation is done beyond EOF ?
>    - Andreas and Chris Wedgwood are in favor of not changing the
>      file size in this case. I also tend to agree with them. Does anyone
>      has an argument in favor of changing the filesize ?
>      If not, I will remove the code which changes the filesize, before I
>      resubmit the concerned ext4 patch.
> 

If we choose not to update the file size beyond EOF, then on filesystems
without fallocate() support (ext2/3 currently), posix_fallocate() will
take the hard way (zero-out) to do the preallocation, which does extend
the file. We would then get different behavior on filesystems with and
without fallocate() support. It makes sense to be consistent, IMO.

From my point of view, preallocation is just an efficient way of allocating
blocks for files without zeroing them out. Other than this, the new behavior
should be consistent with the old way: file size update, mtime/ctime,
ENOSPC etc.

Mingming



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-09 16:01                                 ` Amit K. Arora
  2007-05-09 16:54                                   ` Andreas Dilger
  2007-05-09 17:07                                   ` Mingming Cao
@ 2007-05-10  0:59                                   ` David Chinner
  2007-05-10 11:56                                     ` Amit K. Arora
  2 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-05-10  0:59 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> I have the updated patches ready which take care of Andrew's comments.
> Will run some tests and post them soon.
> 
> But, before submitting these patches, I think it will be better to finalize
> on certain things which might be worth some discussion here:
> 
> 1) Should the file size change when preallocation is done beyond EOF ?
>    - Andreas and Chris Wedgwood are in favor of not changing the
>      file size in this case. I also tend to agree with them. Does anyone
>      has an argument in favor of changing the filesize ?
>      If not, I will remove the code which changes the filesize, before I
>      resubmit the concerned ext4 patch.

I think there needs to be both. If we don't have a mechanism to
atomically change the file size with the preallocation, then
applications that use stat() to work out if they need to preallocate
more space will end up racing.
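
A minimal sketch of the stat()-based check mentioned above, assuming an application that sizes its next preallocation from st_size. The function name is hypothetical. Note the check and any later preallocation are two separate steps, and if preallocation leaves st_size untouched this check cannot see space that was already reserved.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return how many more bytes are needed for the file to reach 'wanted'
 * bytes according to st_size, 0 if it is already large enough, or -1 if
 * fstat() fails.
 */
off_t prealloc_shortfall(int fd, off_t wanted)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size >= wanted ? 0 : wanted - st.st_size;
}
```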

> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>    of normal (non-preallocated) blocks (blocks allocated via
>    regular write/truncate operations) also (i.e. work as punch()) ?

Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
what I did for FA_UNALLOCATE as well.

>    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>      we need to finalize on the convention here as a general guideline
>      to all the filesystems that implement fallocate.
> 
> 3) If above is true, the file size will need to be changed
>    for "unallocation" when block holding the EOF gets unallocated.

No - we punch a hole. If you want the filesize to change, then
you use ftruncate() to remove the blocks at EOF and change the
file size atomically.

> 4) Should we update mtime & ctime on a successfull allocation/
>    unallocation ?
>    - David Chinner raised this question in following post:
>      http://lkml.org/lkml/2007/4/29/407
>      I think it makes sense to update the [mc]time for a successfull
>      preallocation/unallocation. Does anyone feel otherwise ?
>      It will be interesting to know how XFS behaves currently. Does XFS
>      update [mc]time for preallocation ?

No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
changes. If the filesize changes, it behaves exactly the same way that
ftruncate() behaves.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-10  0:59                                   ` David Chinner
@ 2007-05-10 11:56                                     ` Amit K. Arora
  2007-05-10 22:39                                       ` David Chinner
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-10 11:56 UTC (permalink / raw)
  To: David Chinner
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > I have the updated patches ready which take care of Andrew's comments.
> > Will run some tests and post them soon.
> > 
> > But, before submitting these patches, I think it will be better to finalize
> > on certain things which might be worth some discussion here:
> > 
> > 1) Should the file size change when preallocation is done beyond EOF ?
> >    - Andreas and Chris Wedgwood are in favor of not changing the
> >      file size in this case. I also tend to agree with them. Does anyone
> >      has an argument in favor of changing the filesize ?
> >      If not, I will remove the code which changes the filesize, before I
> >      resubmit the concerned ext4 patch.
> 
> I think there needs to be both. If we don't have a mechanism to
> atomically change the file size with the preallocation, then
> applications that use stat() to work out if they need to preallocate
> more space will end up racing.

By "both" above, do you mean we should give the user the flexibility to
choose whether the filesize is changed or not ? It can be done by having
*two* modes for preallocation in the system call - say FA_PREALLOCATE and
FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate
blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE
mode is used, fallocate() will change the filesize if required (i.e.
when allocation is beyond EOF) and also update [cm]time.
This way, the application can decide what it wants.
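
The two-mode proposal above can be modelled as a small pure function. This is only a sketch of the semantics as described in this mail; the numeric values of the mode constants and the function name are hypothetical.

```c
#include <stdint.h>

/* Mode names come from the discussion; the values are placeholders. */
enum fa_mode {
    FA_ALLOCATE = 1,    /* preallocate and extend the file size if needed */
    FA_PREALLOCATE = 2  /* preallocate but leave the file size alone */
};

/* File size after a fully successful preallocation of [offset, offset+len). */
int64_t size_after_prealloc(enum fa_mode mode, int64_t old_size,
                            int64_t offset, int64_t len)
{
    int64_t end = offset + len;

    if (mode == FA_ALLOCATE && end > old_size)
        return end;     /* allocation went beyond EOF: size grows */
    return old_size;    /* FA_PREALLOCATE, or allocation within EOF */
}
```

For a 100-byte file and a request for 100 bytes at offset 50, FA_ALLOCATE would yield a size of 150 while FA_PREALLOCATE would leave it at 100.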

This will be helpful for the partial allocation scenario also. Think of
the case when we do not change the filesize in fallocate() and expect
applications/posix_fallocate() to do ftruncate() after fallocate() for
this. Now if fallocate() results in a partial allocation with -ENOSPC
error returned, applications/posix_fallocate() will not know for what
length ftruncate() has to be called.  :(

Hence it may be a good idea to give the user the flexibility to choose
whether the file size changes atomically with preallocation or not. But
with more flexibility comes inconsistency in behavior, which is worth
considering.

> 
> > 2) For FA_UNALLOCATE mode, should the file system allow unallocation
> >    of normal (non-preallocated) blocks (blocks allocated via
> >    regular write/truncate operations) also (i.e. work as punch()) ?
> 
> Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and
> what i did for FA_UNALLOCATE as well.

Ok. But, some people may not expect/like this. I think, we can keep it
on the backburner for a while, till other issues are sorted out.
 
> >    - Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
> >      we need to finalize on the convention here as a general guideline
> >      to all the filesystems that implement fallocate.
> > 
> > 3) If above is true, the file size will need to be changed
> >    for "unallocation" when block holding the EOF gets unallocated.
> 
> No - we punch a hole. If you want the filesize to change, then
> you use ftruncate() to remove the blocks at EOF and change the
> file size atomically.

Ok.
> 
> > 4) Should we update mtime & ctime on a successfull allocation/
> >    unallocation ?
> >    - David Chinner raised this question in following post:
> >      http://lkml.org/lkml/2007/4/29/407
> >      I think it makes sense to update the [mc]time for a successfull
> >      preallocation/unallocation. Does anyone feel otherwise ?
> >      It will be interesting to know how XFS behaves currently. Does XFS
> >      update [mc]time for preallocation ?
> 
> No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size
> changes. If the filesize changes, it behaves exactly the same way that
> ftruncate() behaves.

Having additional mode (of FA_PREALLOCATE) might help here too. Please
see above.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-10 11:56                                     ` Amit K. Arora
@ 2007-05-10 22:39                                       ` David Chinner
  2007-05-11 11:03                                         ` Suparna Bhattacharya
  0 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-05-10 22:39 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: David Chinner, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > I have the updated patches ready which take care of Andrew's comments.
> > > Will run some tests and post them soon.
> > > 
> > > But, before submitting these patches, I think it will be better to
> > > finalize on certain things which might be worth some discussion here:
> > > 
> > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > in this case. I also tend to agree with them. Does anyone has an
> > > argument in favor of changing the filesize ?  If not, I will remove the
> > > code which changes the filesize, before I resubmit the concerned ext4
> > > patch.
> > 
> > I think there needs to be both. If we don't have a mechanism to atomically
> > change the file size with the preallocation, then applications that use
> > stat() to work out if they need to preallocate more space will end up
> > racing.
> 
> By "both" above, do you mean we should give user the flexibility if it wants
> the filesize changed or not ? It can be done by having *two* modes for
> preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> will change the filesize if required (i.e.  when allocation is beyond EOF)
> and also update [cm]time.  This way, the application can decide what it
> wants.

Yes, that's right.

> This will be helpfull for the partial allocation scenario also. Think of the
> case when we do not change the filesize in fallocate() and expect
> applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> Now if fallocate() results in a partial allocation with -ENOSPC error
> returned, applications/posix_fallocate() will not know for what length
> ftruncate() has to be called.  :(

Well, posix_fallocate() either gets all the space or it fails. If
you truncate to extend the file size after an ENOSPC, then that is
a buggy implementation.

The same could be said for any application, or even the fallocate()
call itself if it changes the filesize without having completely
preallocated the space asked....

> Hence it may be a good idea to give user the flexibility if it wants to
> atomically change the file size with preallocation or not. But, with more
> flexibility there comes inconsistency in behavior, which is worth
> considering.

We've got different modes to specify different behaviour. That's
what the mode field was put there for in the first place - the
interface is *designed* to support different preallocation
behaviours....

> > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > normal (non-preallocated) blocks (blocks allocated via regular
> > > write/truncate operations) also (i.e. work as punch()) ?
> > 
> > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > i did for FA_UNALLOCATE as well.
> 
> Ok. But, some people may not expect/like this. I think, we can keep it on
> the backburner for a while, till other issues are sorted out.

How can it be a "backburner" issue when it defines the
implementation?  I've already implemented something in XFS that
sort of does what I think that the interface is supposed to do, but
I need that interface to be nailed down before proceeding any
further.

All I'm really interested in right now is that the fallocate
_interface_ can be used as a *complete replacement* for the
pre-existing XFS-specific ioctls that are already used by
applications.  What ext4 can or can't do right now is irrelevant to
this discussion - the interface definition needs to take priority
over implementation....

Cheers,

Dave,
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-10 22:39                                       ` David Chinner
@ 2007-05-11 11:03                                         ` Suparna Bhattacharya
  2007-05-12  8:01                                           ` David Chinner
  0 siblings, 1 reply; 340+ messages in thread
From: Suparna Bhattacharya @ 2007-05-11 11:03 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, cmm

On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> On Thu, May 10, 2007 at 05:26:20PM +0530, Amit K. Arora wrote:
> > On Thu, May 10, 2007 at 10:59:26AM +1000, David Chinner wrote:
> > > On Wed, May 09, 2007 at 09:31:02PM +0530, Amit K. Arora wrote:
> > > > I have the updated patches ready which take care of Andrew's comments.
> > > > Will run some tests and post them soon.
> > > > 
> > > > But, before submitting these patches, I think it will be better to
> > > > finalize on certain things which might be worth some discussion here:
> > > > 
> > > > 1) Should the file size change when preallocation is done beyond EOF ?
> > > > - Andreas and Chris Wedgwood are in favor of not changing the file size
> > > > in this case. I also tend to agree with them. Does anyone has an
> > > > argument in favor of changing the filesize ?  If not, I will remove the
> > > > code which changes the filesize, before I resubmit the concerned ext4
> > > > patch.
> > > 
> > > I think there needs to be both. If we don't have a mechanism to atomically
> > > change the file size with the preallocation, then applications that use
> > > stat() to work out if they need to preallocate more space will end up
> > > racing.
> > 
> > By "both" above, do you mean we should give user the flexibility if it wants
> > the filesize changed or not ? It can be done by having *two* modes for
> > preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we
> > use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not
> > change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate()
> > will change the filesize if required (i.e.  when allocation is beyond EOF)
> > and also update [cm]time.  This way, the application can decide what it
> > wants.
> 
> Yes, that's right.
> 
> > This will be helpfull for the partial allocation scenario also. Think of the
> > case when we do not change the filesize in fallocate() and expect
> > applications/posix_fallocate() to do ftruncate() after fallocate() for this.
> > Now if fallocate() results in a partial allocation with -ENOSPC error
> > returned, applications/posix_fallocate() will not know for what length
> > ftruncate() has to be called.  :(
> 
> Well, posix_fallocate() either gets all the space or it fails. If
> you truncate to extend the file size after an ENOSPC, then that is
> a buggy implementation.
> 
> The same could be said for any application, or even the fallocate()
> call itself if it changes the filesize without having completely
> preallocated the space asked....
> 
> > Hence it may be a good idea to give user the flexibility if it wants to
> > atomically change the file size with preallocation or not. But, with more
> > flexibility there comes inconsistency in behavior, which is worth
> > considering.
> 
> We've got different modes to specify different behaviour. That's
> what the mode field was put there for in the first place - the
> interface is *designed* to support different preallocation
> behaviours....
> 
> > > > 2) For FA_UNALLOCATE mode, should the file system allow unallocation of
> > > > normal (non-preallocated) blocks (blocks allocated via regular
> > > > write/truncate operations) also (i.e. work as punch()) ?
> > > 
> > > Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and what
> > > i did for FA_UNALLOCATE as well.
> > 
> > Ok. But, some people may not expect/like this. I think, we can keep it on
> > the backburner for a while, till other issues are sorted out.
> 
> How can it be a "backburner" issue when it defines the
> implementation?  I've already implemented some thing in XFS that
> sort of does what I think that the interface is supposed to do, but
> I need that interface to be nailed down before proceeding any
> further.
> 
> All I'm really interested in right now is that the fallocate
> _interface_ can be used as a *complete replacement* for the
> pre-existing XFS-specific ioctls that are already used by
> applications.  What ext4 can or can't do right now is irrelevant to
> this discussion - the interface definition needs to take priority
> over implementation....

Would you like to write up an interface definition description (likely
man page) and post it for review, possibly with a mention of apps using
it today ?

One reason for introducing the mode parameter was to allow the interface to
evolve incrementally as more options / semantic questions are proposed, so
that we don't have to make all the decisions right now. 
So it would be good to start with a *minimal* definition, even just one mode.
The rest could follow as subsequent patches, each being reviewed and debated
separately. Otherwise this discussion can drag on for a long time.

Regards
Suparna

> 
> Cheers,
> 
> Dave,
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-11 11:03                                         ` Suparna Bhattacharya
@ 2007-05-12  8:01                                           ` David Chinner
  2007-06-12  6:16                                             ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-05-12  8:01 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: David Chinner, Amit K. Arora, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > All I'm really interested in right now is that the fallocate
> > _interface_ can be used as a *complete replacement* for the
> > pre-existing XFS-specific ioctls that are already used by
> > applications.  What ext4 can or can't do right now is irrelevant to
> > this discussion - the interface definition needs to take priority
> > over implementation....
> 
> Would you like to write up an interface definition description (likely
> man page) and post it for review, possibly with a mention of apps using
> it today ?

Yeah, I started doing that yesterday as I figured it was the only way
to cut the discussion short....

> One reason for introducing the mode parameter was to allow the interface to
> evolve incrementally as more options / semantic questions are proposed, so
> that we don't have to make all the decisions right now. 
> So it would be good to start with a *minimal* definition, even just one mode.
> The rest could follow as subsequent patches, each being reviewed and debated
> separately. Otherwise this discussion can drag on for a long time.

The minimal definition to replace what applications use on XFS and to
support posix_fallocate is the three modes that have been mentioned so
far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
all in a man page...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 0/5][TAKE2] fallocate system call
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
@ 2007-05-14 13:29                                 ` Amit K. Arora
  2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
                                                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 13:29 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
extent and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of working
on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems
can do this more efficiently within the kernel by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
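
To illustrate why the zero-writing fallback is slow, here is a simplified sketch of such a fallback using only POSIX calls. It is not glibc's code: the function name is hypothetical, and as a simplifying assumption it only writes beyond the current EOF so that existing data is never overwritten (holes inside the existing file are left alone).

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve [offset, offset+len) by writing zeroes past the current EOF.
 * Returns 0 on success or an errno value on failure.
 */
int prealloc_by_zeroing(int fd, off_t offset, off_t len)
{
    static const char zeros[4096];  /* zero-filled buffer */
    struct stat st;
    off_t pos, end;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    end = offset + len;
    /* Start at EOF or at offset, whichever is later. */
    pos = st.st_size > offset ? st.st_size : offset;

    while (pos < end) {
        size_t chunk = sizeof zeros;
        if ((off_t)chunk > end - pos)
            chunk = (size_t)(end - pos);

        ssize_t n = pwrite(fd, zeros, chunk, pos);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the same chunk */
            return errno;       /* e.g. ENOSPC when the disk is full */
        }
        if (n == 0)
            return EIO;         /* no progress; avoid looping forever */
        pos += n;
    }
    return 0;
}
```

Every byte of the range goes through the write path here, which is the cost an in-kernel fallocate() is meant to avoid. It also always grows the file size, which is relevant to the file-size question discussed earlier in the thread.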

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode is not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
  
    
sys_fallocate() on s390:
-----------------------
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-------------
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and
a generic block_page_mkwrite() implementation has already been posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
---------
Each post will have an individual changelog for the particular patch.
The posts with the patches follow:

Patch 1/5 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora


* [PATCH 0/5][TAKE2] fallocate system call
@ 2007-05-14 13:29                                 ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 13:29 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although it has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, to fall
back to the current implementation of writing zeroes to the new blocks.
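
The fallback order described above can be sketched in user space as
follows. This is only an illustration, not glibc's actual code:
fast_prealloc_fn, fast_path_unsupported(), zero_fill_fallback(),
prealloc_with_fallback() and prealloc_new_file() are hypothetical names,
errors are reported as positive errno values in the style of
posix_fallocate(), and the slow path is assumed to zero-fill only beyond
the current EOF so that existing data is left alone.

```c
#define _XOPEN_SOURCE 700	/* for pwrite() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fast path: 0 on success, positive errno value on failure. */
typedef int (*fast_prealloc_fn)(int fd, off_t offset, off_t len);

/* Stand-in for a kernel or file system without fallocate support. */
static int fast_path_unsupported(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSYS;
}

/* Slow path: write zeroes from the current EOF up to offset + len. */
static int zero_fill_fallback(int fd, off_t offset, off_t len)
{
	static const char zeroes[4096];
	struct stat st;
	off_t pos, end = offset + len;

	if (fstat(fd, &st) != 0)
		return errno;
	pos = st.st_size;
	while (pos < end) {
		size_t chunk = sizeof(zeroes);
		ssize_t n;

		if ((off_t)chunk > end - pos)
			chunk = (size_t)(end - pos);
		n = pwrite(fd, zeroes, chunk, pos);
		if (n < 0 && errno == EINTR)
			continue;	/* interrupted, retry the same chunk */
		if (n <= 0)
			return n < 0 ? errno : EIO;
		pos += n;
	}
	return 0;
}

/* Try the fast path first; fall back only if it is not implemented. */
static int prealloc_with_fallback(int fd, off_t offset, off_t len,
				  fast_prealloc_fn fast)
{
	int err;

	assert(fast != NULL);
	if (offset < 0 || len <= 0)
		return EINVAL;
	err = fast(fd, offset, len);
	if (err == ENOSYS || err == EOPNOTSUPP)
		err = zero_fill_fallback(fd, offset, len);
	return err;
}

/* Usage helper: create path, preallocate len bytes, return the file size. */
static off_t prealloc_new_file(const char *path, off_t len)
{
	struct stat st;
	off_t size = -1;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (prealloc_with_fallback(fd, 0, len, fast_path_unsupported) == 0 &&
	    fstat(fd, &st) == 0)
		size = st.st_size;
	close(fd);
	return size;
}
```

Any error other than ENOSYS or EOPNOTSUPP (for example ENOSPC) is
returned to the caller unchanged, so a real failure of the fast path is
not hidden behind the slow one.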

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode are not yet clear or agreed upon,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
  
    
sys_fallocate() on s390:
-----------------------
The s390 ABI makes it difficult to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch
that solves this problem with a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well,
but it seems to be the best solution so far.

Known Problem:
-------------
Mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this some time later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
---------
Each post will have an individual changelog for the particular patch.
The posts with the patches follow:

Patch 1/5 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora


* Re: [PATCH 4/5] ext4: fallocate support in ext4
  2007-05-07 20:58                                     ` Andrew Morton
  2007-05-07 22:21                                       ` Andreas Dilger
  2007-05-08  0:00                                       ` Mingming Cao
@ 2007-05-14 13:34                                       ` Jan Kara
  2 siblings, 0 replies; 340+ messages in thread
From: Jan Kara @ 2007-05-14 13:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Dilger, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

> On Mon, 7 May 2007 05:37:54 -0600
> 
> Does the proposed implementation handle quotas correctly, btw?  Has that
> been tested?
  It seems to handle quotas fine - the block allocation itself does not
differ from the usual case, just the extents in the tree are marked as
uninitialized...
  The only question is whether DQUOT_PREALLOC_BLOCK() shouldn't be
called instead of DQUOT_ALLOC_BLOCK(). Then fallocate() won't be able to
allocate anything after the softlimit has been reached, which makes some
sense, but the current behavior is probably less surprising.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs


* [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
       [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
@ 2007-05-14 14:45                                   ` Amit K. Arora
  2007-05-14 23:44                                       ` Stephen Rothwell
  2007-05-14 14:48                                   ` [PATCH 2/5][TAKE2] fallocate() on s390 Amit K. Arora
                                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 14:45 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
---------
Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
    posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE.
 4) Do not return ENODEV for dirs (let individual file systems decide if
    they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)
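
Items 2, 3 and 5 above can be modelled in user space roughly as follows.
This is a sketch only: check_fallocate_args() and its maxbytes parameter
(standing in for the superblock's s_maxbytes) are hypothetical, and the
wrap check is written without computing offset + len first, on the
assumption that one wants to avoid signed overflow, which is undefined
in ISO C. The mode values are the ones defined by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FA_ALLOCATE	0x1
#define FA_DEALLOCATE	0x2

/* Returns 0 if the arguments are acceptable, else a negative errno. */
static long check_fallocate_args(int mode, int64_t offset, int64_t len,
				 int64_t maxbytes)
{
	assert(maxbytes >= 0);

	/* Negative offsets and empty or negative lengths are invalid. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/* Only the two modes defined so far are accepted. */
	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
		return -EOPNOTSUPP;

	/* Wrap through zero: offset + len would not fit in 64 bits. */
	if (len > INT64_MAX - offset)
		return -EFBIG;

	/* The end of the range must not exceed the file size limit. */
	if (offset + len > maxbytes)
		return -EFBIG;

	return 0;
}
```

The order of the checks mirrors the order of the error codes in the
changelog: EINVAL first, then EOPNOTSUPP, then EFBIG.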

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/powerpc/kernel/sys_ppc32.c  |    7 +++
 arch/x86_64/kernel/functionlist  |    1 
 fs/open.c                        |   89 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 -
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 -
 include/asm-x86_64/unistd.h      |    4 +
 include/linux/fs.h               |   13 +++++
 include/linux/syscalls.h         |    1 
 10 files changed, 120 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_fallocate		/* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===================================================================
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===================================================================
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -351,6 +351,95 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *	  FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ *	 support fallocate.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+
+	/*
+	 * Update [cm]time.
+	 * Partial allocation will not result in the time stamp changes,
+	 * since ->fallocate will return error (say, -ENOSPC) in this case.
+	 */
+	if (!ret)
+		file_update_time(file);
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
+/*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.21/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_fallocate		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages		301
 #define __NR_getcpu		302
 #define __NR_epoll_pwait	303
+#define __NR_fallocate		304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		304
+#define __NR_syscalls		305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate		280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/linux/fs.h
===================================================================
--- linux-2.6.21.orig/include/linux/fs.h
+++ linux-2.6.21/include/linux/fs.h
@@ -264,6 +264,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
+ *		  may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ *		  the preallocated blocks.
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1125,6 +1136,8 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 };
 
 struct seq_file;
Index: linux-2.6.21/include/linux/syscalls.h
===================================================================
--- linux-2.6.21.orig/include/linux/syscalls.h
+++ linux-2.6.21/include/linux/syscalls.h
@@ -602,6 +602,7 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.21.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.21/arch/powerpc/kernel/sys_ppc32.c
@@ -777,6 +777,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {


* [PATCH 2/5][TAKE2] fallocate() on s390
       [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
  2007-05-14 14:45                                   ` [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-14 14:48                                   ` Amit K. Arora
  2007-05-14 15:33                                     ` [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper Amit K. Arora
  2007-05-14 14:50                                   ` [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix Amit K. Arora
                                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 14:48 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is the patch suggested by Martin Schwidefsky. Here are the comments
and patch from him.

-------------
From: Martin Schwidefsky <schwidefsky@de.ibm.com>

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---

 arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
 arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
 	llgtr	%r2,%r2			# char *
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
+
+	.globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg    %r4,%r4,32		# get high word of 64bit loff_t
+	lr      %r4,%r5			# get low word of 64bit loff_t
+	sllg    %r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL							/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -286,3 +286,32 @@ int kernel_execve(const char *filename, 
 		  "d" (__arg3) : "memory");
 	return __svcres;
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.21/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu		311
 #define __NR_epoll_pwait	312
 #define __NR_utimes		313
+#define __NR_fallocate		314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some


* [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix
       [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
  2007-05-14 14:45                                   ` [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
  2007-05-14 14:48                                   ` [PATCH 2/5][TAKE2] fallocate() on s390 Amit K. Arora
@ 2007-05-14 14:50                                   ` Amit K. Arora
  2007-05-14 14:52                                   ` [PATCH 4/5][TAKE2] ext4: fallocate support in ext4 Amit K. Arora
  2007-05-14 14:54                                   ` [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 14:50 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.

Changelog:
---------
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |   60 ++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s) 
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2081,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2040,7 +2097,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);


* [PATCH 4/5][TAKE2] ext4: fallocate support in ext4
       [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
                                                     ` (2 preceding siblings ...)
  2007-05-14 14:50                                   ` [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix Amit K. Arora
@ 2007-05-14 14:52                                   ` Amit K. Arora
  2007-05-14 14:54                                   ` [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 14:52 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.

The current implementation only supports preallocation for regular files
(directories are not supported as of this date) with extent maps. This
patch does not currently support block-mapped files.

Only FA_ALLOCATE mode is supported as of now. Supporting
FA_DEALLOCATE mode is a "To Do" item.

Changelog:
---------
Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
    functions.
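
As a rough illustration of item 7 and of the merge rule used in this
patch, the helpers can be modelled in user space as below. This is a
sketch, not the code from ext4_fs_extents.h: struct extent and every
function name here are hypothetical, a plain flag in the most
significant bit of the 16-bit length is assumed (as the patch 0 text
describes it), and on-disk endianness is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_FLAG	0x8000u	/* assumed: MSB of the 16-bit length */
#define MAX_LEN		0x7fffu	/* largest length that leaves the MSB clear */

struct extent {
	uint32_t block;		/* first logical block */
	uint16_t len;		/* flag bit plus number of blocks */
	uint64_t pblock;	/* first physical block */
};

static inline int ext_is_uninit(const struct extent *ex)
{
	return (ex->len & UNINIT_FLAG) != 0;
}

static inline uint16_t ext_actual_len(const struct extent *ex)
{
	return (uint16_t)(ex->len & MAX_LEN);
}

static inline void ext_mark_uninit(struct extent *ex)
{
	ex->len = (uint16_t)(ex->len | UNINIT_FLAG);
}

/* Mergeable only if same state, adjacent, and the sum fits in MAX_LEN. */
static int can_merge(const struct extent *a, const struct extent *b)
{
	if (ext_is_uninit(a) != ext_is_uninit(b))
		return 0;
	if (a->block + ext_actual_len(a) != b->block)
		return 0;
	if (ext_actual_len(a) + ext_actual_len(b) > MAX_LEN)
		return 0;
	return a->pblock + ext_actual_len(a) == b->pblock;
}

/* Add b's blocks to a, then restore the flag that the sum cleared. */
static void merge_into(struct extent *a, const struct extent *b)
{
	int uninit = ext_is_uninit(a);

	assert(can_merge(a, b));
	a->len = (uint16_t)(ext_actual_len(a) + ext_actual_len(b));
	if (uninit)
		ext_mark_uninit(a);
}
```

The point of going through ext_actual_len() everywhere is that the raw
length field of an uninitialized extent is no longer a usable block
count, so adding two raw values would corrupt both the length and the
flag.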

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  241 +++++++++++++++++++++++++++++++++-------
 fs/ext4/file.c                  |    1 
 include/linux/ext4_fs.h         |    8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 221 insertions(+), 41 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
-			        le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1192,8 +1204,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1201,14 +1214,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/*
+		 * ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1264,7 +1287,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
-			        le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 		           > le32_to_cpu(nearex->ee_block)) {
@@ -1277,7 +1300,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
-				        le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1290,7 +1313,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1309,8 +1332,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1380,8 +1408,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1393,7 +1421,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1409,7 +1438,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
-		        cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1482,15 +1511,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len));
+			        (unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-		            + le16_to_cpu(ex->ee_len)) {
+		            + ext4_ext_get_actual_len(ex)) {
 	        lblock = le32_to_cpu(ex->ee_block)
-		         + le16_to_cpu(ex->ee_len);
+		         + ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len),
+			        (unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1620,12 +1649,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1639,12 +1668,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1652,12 +1681,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1672,6 +1701,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
 	unsigned short ex_ee_len;
+	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1686,7 +1716,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1754,6 +1786,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1763,7 +1797,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2039,7 +2073,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2047,8 +2081,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2056,8 +2091,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2099,6 +2137,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
 		goto out2;
@@ -2110,8 +2150,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2215,10 +2257,127 @@ int ext4_ext_writepage_trans_blocks(stru
 	return needed;
 }
 
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	ext4_fsblk_t nblocks = 0;
+	int ret = 0;
+	int ret2 = 0;
+	int retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/*
+	 * currently supporting (pre)allocate mode for extent-based
+	 * files _only_
+	 */
+	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return -EOPNOTSUPP;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+		 	- block;
+
+	/*
+	 * credits to insert 1 extent into extent tree + buffers to be able to
+	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 */
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		handle = ext4_journal_start(inode, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			break;
+		}
+
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		WARN_ON(!ret);
+		if (!ret) {
+			ext4_error(inode->i_sb, "ext4_fallocate",
+				   "ext4_ext_get_blocks returned 0! inode#%lu"
+				   ", block=%llu, max_blocks=%llu",
+				   inode->i_ino, block, max_blocks);
+			ret = -EIO;
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			break;
+		}
+		if (ret > 0) {
+			/* check wrap through sign-bit/zero here */
+			if ((block + ret) < 0 || (block + ret) < block) {
+				ret = -EIO;
+				ext4_mark_inode_dirty(handle, inode);
+				ret2 = ext4_journal_stop(handle);
+				break;
+			}
+			if (buffer_new(&map_bh) && ((block + ret) >
+			    (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+			    >> blkbits)))
+					nblocks = nblocks + ret;
+		}
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret2)
+			break;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/*
+	 * Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if ((offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+			/*
+			 * if no error, we assume preallocation succeeded
+			 * completely
+			 */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+			/* Handle partial allocation scenario */
+			loff_t newsize;
+
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+
+	return ret > 0 ? ret2 : ret;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
 EXPORT_SYMBOL(ext4_ext_walk_space);
 EXPORT_SYMBOL(ext4_ext_find_goal);
 EXPORT_SYMBOL(ext4_ext_calc_credits_for_insert);
+EXPORT_SYMBOL(ext4_fallocate);
 
Index: linux-2.6.21/fs/ext4/file.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/file.c
+++ linux-2.6.21/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.21/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs.h
+++ linux-2.6.21/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits)		ALIGN((size),(1 << (blkbits)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern int ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) {
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);


* [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks
       [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
                                                     ` (3 preceding siblings ...)
  2007-05-14 14:52                                   ` [PATCH 4/5][TAKE2] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-05-14 14:54                                   ` Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 14:54 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting an uninitialized extent into multiple (up to three) extents and
merging the newly split extents with neighbouring ones, if possible.

Changelog:
---------
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
    ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  234 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh))
+	{
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1328,25 +1376,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2012,15 +2042,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialized */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + max_blocks - 1 : initialized */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2068,6 +2235,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2076,13 +2244,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2091,12 +2255,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2148,6 +2327,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+				 struct ext4_ext_path *path,
+				 struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);


* Re: [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper
  2007-05-14 14:48                                   ` [PATCH 2/5][TAKE2] fallocate() on s390 Amit K. Arora
@ 2007-05-14 15:33                                     ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-14 15:33 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Mon, May 14, 2007 at 08:18:34PM +0530, Amit K. Arora wrote:
> This is the patch suggested by Martin Schwidefsky. Here are the comments
> and patch from him.

Martin also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.
Here it is:

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)

--
Regards,
Amit Arora
 
> -------------
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> This patch implements support of fallocate system call on s390(x)
> platform. A wrapper is added to address the issue which s390 ABI has
> with the arguments of this system call.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
> 
>  arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
>  arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
>  arch/s390/kernel/syscalls.S       |    1 +
>  include/asm-s390/unistd.h         |    3 ++-
>  4 files changed, 42 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
> +++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
> @@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
>  	llgtr	%r2,%r2			# char *
>  	llgtr	%r3,%r3			# struct compat_timeval *
>  	jg	compat_sys_utimes
> +
> +	.globl  sys_fallocate_wrapper
> +sys_fallocate_wrapper:
> +	lgfr	%r2,%r2			# int
> +	lgfr	%r3,%r3			# int
> +	sllg    %r4,%r4,32		# get high word of 64bit loff_t
> +	lr      %r4,%r5			# get low word of 64bit loff_t
> +	sllg    %r5,%r6,32		# get high word of 64bit loff_t
> +	l	%r5,164(%r15)		# get low word of 64bit loff_t
> +	jg	sys_fallocate
> Index: linux-2.6.21/arch/s390/kernel/syscalls.S
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
> +++ linux-2.6.21/arch/s390/kernel/syscalls.S
> @@ -322,3 +322,4 @@ NI_SYSCALL							/* 310 sys_move_pages *
>  SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
>  SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
>  SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
> +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
> Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
> ===================================================================
> --- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
> +++ linux-2.6.21/arch/s390/kernel/sys_s390.c
> @@ -286,3 +286,32 @@ int kernel_execve(const char *filename, 
>  		  "d" (__arg3) : "memory");
>  	return __svcres;
>  }
> +
> +#ifndef CONFIG_64BIT
> +/*
> + * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
> + * 64 bit argument "len" is split into the upper and lower 32 bits. The
> + * system call wrapper in the user space loads the value to %r6/%r7.
> + * The code in entry.S keeps the values in %r2 - %r6 where they are and
> + * stores %r7 to 96(%r15). But the standard C linkage requires that
> + * the whole 64 bit value for len is stored on the stack and doesn't
> + * use %r6 at all. So s390_fallocate has to convert the arguments from
> + *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
> + * to
> + *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
> + */
> +asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
> +			       u32 len_high, u32 len_low)
> +{
> +	union {
> +		u64 len;
> +		struct {
> +			u32 high;
> +			u32 low;
> +		};
> +	} cv;
> +	cv.high = len_high;
> +	cv.low = len_low;
> +	return sys_fallocate(fd, mode, offset, cv.len);
> +}
> +#endif
> Index: linux-2.6.21/include/asm-s390/unistd.h
> ===================================================================
> --- linux-2.6.21.orig/include/asm-s390/unistd.h
> +++ linux-2.6.21/include/asm-s390/unistd.h
> @@ -251,8 +251,9 @@
>  #define __NR_getcpu		311
>  #define __NR_epoll_pwait	312
>  #define __NR_utimes		313
> +#define __NR_fallocate		314
> 
> -#define NR_syscalls 314
> +#define NR_syscalls 315
> 
>  /* 
>   * There are some system calls that are not present on 64 bit, some

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-14 14:45                                   ` [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-14 23:44                                       ` Stephen Rothwell
  0 siblings, 0 replies; 340+ messages in thread
From: Stephen Rothwell @ 2007-05-14 23:44 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
>
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

This patch no longer applies to Linus' tree - for a start there is no file
arch/x86_64/kernel/functionlist any more.

Can you rebase it, please?

--
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5][TAKE2] fallocate system call
  2007-05-14 13:29                                 ` Amit K. Arora
  (?)
  (?)
@ 2007-05-15  6:31                                 ` Andreas Dilger
  2007-05-15 12:40                                     ` Amit K. Arora
  -1 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-05-15  6:31 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On May 14, 2007  18:59 +0530, Amit K. Arora wrote:
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> fd: The descriptor of the open file.
> 
> mode*: This specifies the behavior of the system call. Currently the
>   system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
>   FA_ALLOCATE: Applications can use this mode to preallocate blocks to
>     a given file (specified by fd). This mode changes the file size if
>     the preallocation is done beyond the EOF. It also updates the
>     ctime/mtime in the inode of the corresponding file, marking a
>     successfull allocation.
>   FA_DEALLOCATE: This mode can be used by applications to deallocate the
>     previously preallocated blocks. This also may change the file size
>     and the ctime/mtime.
> * New modes might get added in future. One such new mode which is
>   already under discussion is FA_PREALLOCATE, which when used will
>   preallocate space but will not change the filesize and [cm]time.
>   Since the semantics of this new mode is not clear and agreed upon yet,
>   this patchset does not implement it currently.
> 
> offset: This is the offset in bytes, from where the preallocation should
>   start.
> 
> len: This is the number of bytes requested for preallocation (from
>   offset).

What is the return value?  I'd hope it is the number of bytes preallocated,
in case of interrupted preallocation for whatever reason (interrupt, out of
space, etc) like a regular write(2) call.  In this case the return type needs
to also be an loff_t to match @len.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/5][TAKE2] fallocate system call
  2007-05-15  6:31                                 ` [PATCH 0/5][TAKE2] fallocate system call Andreas Dilger
@ 2007-05-15 12:40                                     ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 12:40 UTC (permalink / raw)
  To: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Tue, May 15, 2007 at 12:31:21AM -0600, Andreas Dilger wrote:
> On May 14, 2007  18:59 +0530, Amit K. Arora wrote:
> >  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > fd: The descriptor of the open file.
> > 
> > mode*: This specifies the behavior of the system call. Currently the
> >   system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
> >   FA_ALLOCATE: Applications can use this mode to preallocate blocks to
> >     a given file (specified by fd). This mode changes the file size if
> >     the preallocation is done beyond the EOF. It also updates the
> >     ctime/mtime in the inode of the corresponding file, marking a
> >     successfull allocation.
> >   FA_DEALLOCATE: This mode can be used by applications to deallocate the
> >     previously preallocated blocks. This also may change the file size
> >     and the ctime/mtime.
> > * New modes might get added in future. One such new mode which is
> >   already under discussion is FA_PREALLOCATE, which when used will
> >   preallocate space but will not change the filesize and [cm]time.
> >   Since the semantics of this new mode is not clear and agreed upon yet,
> >   this patchset does not implement it currently.
> > 
> > offset: This is the offset in bytes, from where the preallocation should
> >   start.
> > 
> > len: This is the number of bytes requested for preallocation (from
> >   offset).
> 
> What is the return value?  I'd hope it is the number of bytes preallocated,
> in case of interrupted preallocation for whatever reason (interrupt, out of
> space, etc) like a regular write(2) call.  In this case the return type needs
> to also be an loff_t to match @len.

The return type in the current implementation is "long": zero is
returned on success and an error code on failure. This keeps it in line
with posix_fallocate() behavior.

This point was brought up some time back by Badari. At that time it was
decided to keep it the way posix_fallocate() is designed. Here are the
posts related to this:
http://lkml.org/lkml/2007/3/2/18
http://lkml.org/lkml/2007/3/2/162
http://lkml.org/lkml/2007/3/2/208

Still, if you feel that we should return the number of bytes
preallocated, we can ask for opinions here again.
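
To make the trade-off concrete, here is a rough sketch of what a caller
would need if the call returned a byte count like write(2) instead of
0/error. Everything in it is hypothetical: prealloc_step_fn,
fake_step() and prealloc_all() are illustration-only names, and
fake_step() merely simulates a call that makes partial progress.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical callback: preallocates up to len bytes at offset and
 * returns the number of bytes done (> 0) or a negative errno, in the
 * style of write(2). Not part of the posted patches. */
typedef int64_t (*prealloc_step_fn)(int fd, int64_t offset, int64_t len);

/* Stand-in that simulates partial progress: at most 4096 bytes per call. */
int64_t fake_step(int fd, int64_t offset, int64_t len)
{
	(void)fd;
	(void)offset;
	return len < 4096 ? len : 4096;
}

/* Retry until the whole range is covered or a hard error occurs.
 * Returns 0 on success or a negative errno; *done receives the number
 * of bytes completed so far in either case. */
int prealloc_all(int fd, int64_t offset, int64_t len,
		 prealloc_step_fn step, int64_t *done)
{
	*done = 0;
	while (*done < len) {
		int64_t n = step(fd, offset + *done, len - *done);

		if (n == -EINTR)
			continue;	/* interrupted before any progress */
		if (n < 0)
			return (int)n;	/* hard error, e.g. -ENOSPC */
		if (n == 0)
			return -EIO;	/* no progress: do not spin forever */
		*done += n;
	}
	return 0;
}
```

With the 0/error convention kept by this patchset, the caller needs no
such loop, but it also cannot learn how far a failed request got.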

Thanks!
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-14 23:44                                       ` Stephen Rothwell
  (?)
@ 2007-05-15 13:23                                       ` Amit K. Arora
  2007-05-18 21:36                                         ` Theodore Tso
  -1 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 13:23 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Tue, May 15, 2007 at 09:44:36AM +1000, Stephen Rothwell wrote:
> On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> >
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
> 
> This patch no longer applies to Linus' tree - for a start there is no file
> arch/x86_64/kernel/functionlist any more.
> 
> Can you rebase it, please?

I will rebase it to 2.6.22-rc1 and repost the patches soon.
Thanks!

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 0/5][TAKE3] fallocate system call
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
@ 2007-05-15 19:37                                 ` Amit K. Arora
  2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
                                                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 19:37 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
P L E A S E    N O T E :
***********************
1. The patches have now been rebased to the 2.6.22-rc1 kernel. Earlier
they were based on 2.6.21.
2. An unnecessary symbol export has been removed from the ext4
preallocate patch. Details in the corresponding post (PATCH 4/5).
3. The return value is now described in the interface description below.
4. Besides the above points, everything is exactly the same as TAKE2.
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space for any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to reduce fragmentation to some
extent and thus get faster access. With preallocation, applications
also get a guarantee of space for particular file(s), even if the
file system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. It has the advantage of working on
all file systems, but it is quite slow (since it writes zeroes to each
block that has to be preallocated). File systems can do this more
efficiently within the kernel by implementing the proposed fallocate()
system call. It is expected that posix_fallocate() will be modified to
call this new system call first and, in case the kernel/filesystem does
not implement it, fall back to the current implementation of writing
zeroes to the new blocks.
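
The expected fallback can be sketched roughly as follows. This is only
an illustration under stated assumptions: native_prealloc_fn,
native_unsupported() and prealloc_with_fallback() are hypothetical
names, the hook stands in for the proposed system call, and the
existing posix_fallocate() stands in for the library path.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical hook standing in for the proposed system call. Returns 0
 * on success or a positive error number, mirroring posix_fallocate(). */
typedef int (*native_prealloc_fn)(int fd, off_t offset, off_t len);

/* Stand-in for a kernel that does not provide the new call. */
int native_unsupported(int fd, off_t offset, off_t len)
{
	(void)fd;
	(void)offset;
	(void)len;
	return ENOSYS;
}

/* Try the native path first; only "not implemented" style errors
 * trigger the library fallback. Any other error is passed through. */
int prealloc_with_fallback(int fd, off_t offset, off_t len,
			   native_prealloc_fn native)
{
	int err = native(fd, offset, len);

	if (err == ENOSYS || err == EOPNOTSUPP)
		return posix_fallocate(fd, offset, len);
	return err;
}
```

Errors such as ENOSPC are deliberately not retried through the
fallback, since writing zeroes would fail for the same reason.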

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime/mtime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This may also change the file size
    and the ctime/mtime.
* New modes might get added in the future. One such mode, already
  under discussion, is FA_PREALLOCATE, which will preallocate space
  without changing the file size and [cm]time. Since the semantics of
  this new mode are not yet clear and agreed upon, this patchset does
  not implement it.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error code on
failure. This keeps the semantics the same as those of
posix_fallocate().
    
sys_fallocate() on s390:
-----------------------
The s390 ABI makes it problematic to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch
that solves this with a wrapper in the kernel. It will require special
handling of this system call on s390 in glibc as well, but this seems
to be the best solution so far.
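
As a rough illustration of what such a wrapper has to do with the split
"len" argument, the two 32-bit halves can be rejoined with a shift,
which does not depend on byte order. The helper names join_len() and
split_len() are hypothetical and are not taken from the s390 patch.

```c
#include <stdint.h>

/* Rebuild a 64-bit length from its upper and lower 32 bits. */
uint64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | len_low;
}

/* Inverse operation, as a user-space wrapper on a 32-bit ABI would
 * need before loading the halves into a register pair. */
void split_len(uint64_t len, uint32_t *len_high, uint32_t *len_low)
{
	*len_high = (uint32_t)(len >> 32);
	*len_low = (uint32_t)(len & 0xffffffffu);
}
```

A union of a u64 with a { high, low } struct gives the same result only
on big-endian machines such as s390, so the shift form is the safer
choice for code that might be reused elsewhere.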

Known Problem:
-------------
Mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this later. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
---------
Each post will have an individual changelog for a particular patch.


The following patches follow:
Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks


--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
       [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
@ 2007-05-15 20:03                                   ` Amit K. Arora
  2007-05-16  0:42                                     ` Mingming Cao
  2007-05-16  3:16                                     ` David Chinner
  2007-05-15 20:10                                   ` [PATCH 2/5][TAKE3] fallocate() on s390 Amit K. Arora
                                                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 20:03 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
---------
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.

Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len <= 0 (with the new draft that Ulrich pointed to,
    posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE.
 4) Do not return ENODEV for dirs (let individual file systems decide if
    they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/powerpc/kernel/sys_ppc32.c  |    7 +++
 arch/x86_64/ia32/ia32entry.S     |    1 
 fs/open.c                        |   89 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 -
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 -
 include/asm-x86_64/unistd.h      |    2 
 include/linux/fs.h               |   13 +++++
 include/linux/syscalls.h         |    1 
 10 files changed, 119 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22-rc1/fs/open.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/open.c
+++ linux-2.6.22-rc1/fs/open.c
@@ -353,6 +353,95 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *	  FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ *	 support it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+
+	/*
+	 * Update [cm]time.
+	 * Partial allocation will not result in the time stamp changes,
+	 * since ->fallocate will return error (say, -ENOSPC) in this case.
+	 */
+	if (!ret)
+		file_update_time(file);
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
+/*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22-rc1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-i386/unistd.h
+++ linux-2.6.22-rc1/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
 #define __NR_signalfd		321
 #define __NR_timerfd		322
 #define __NR_eventfd		323
+#define __NR_fallocate		324
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 325
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc1/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22-rc1/include/asm-powerpc/systbl.h
@@ -308,3 +308,4 @@ COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
 COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
Index: linux-2.6.22-rc1/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22-rc1/include/asm-powerpc/unistd.h
@@ -327,10 +327,11 @@
 #define __NR_getcpu		302
 #define __NR_epoll_pwait	303
 #define __NR_utimensat		304
+#define __NR_fallocate		305
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		305
+#define __NR_syscalls		306
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.22-rc1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22-rc1/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
 __SYSCALL(__NR_timerfd, sys_timerfd)
 #define __NR_eventfd		283
 __SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate		284
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc1/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/fs.h
+++ linux-2.6.22-rc1/include/linux/fs.h
@@ -266,6 +266,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
+ *		  may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ *		  the preallocated blocks.
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1137,6 +1148,8 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 };
 
 struct seq_file;
Index: linux-2.6.22-rc1/include/linux/syscalls.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/syscalls.h
+++ linux-2.6.22-rc1/include/linux/syscalls.h
@@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys_fallocate
 ia32_syscall_end:


* [PATCH 2/5][TAKE3] fallocate() on s390
       [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
  2007-05-15 20:03                                   ` [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-15 20:10                                   ` Amit K. Arora
  2007-05-15 20:13                                   ` [PATCH 3/5][TAKE3] ext4: Extent overlap bugfix Amit K. Arora
                                                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 20:10 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on the s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)


Here are the comments and the patch to linux kernel from him.

-------------
From: Martin Schwidefsky <schwidefsky@de.ibm.com>

This patch implements support for the fallocate system call on the
s390(x) platform. A wrapper is added to work around the problem the
s390 ABI has with the arguments of this system call.
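
For readers following along, here is a minimal user-space sketch of the
argument conversion the wrapper has to perform: the 64 bit "len" arrives
as two 32 bit halves and must be joined again. The names join_len() and
split_len() are hypothetical and are not part of the patch; shifts are
used here instead of a union so that the sketch does not depend on host
byte order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): rebuild a 64 bit length
 * from the high and low 32 bit words passed separately.
 */
static uint64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return ((uint64_t)len_high << 32) | (uint64_t)len_low;
}

/*
 * Hypothetical inverse: what a 31 bit caller conceptually does before
 * placing the two words in a register pair.
 */
static void split_len(uint64_t len, uint32_t *len_high, uint32_t *len_low)
{
	*len_high = (uint32_t)(len >> 32);
	*len_low = (uint32_t)(len & 0xffffffffu);
	/* the two halves must always round-trip to the original value */
	assert(join_len(*len_high, *len_low) == len);
}
```

Splitting a value and joining the halves again should always give back
the original length, whatever the byte order of the machine.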

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
 arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
 	llgtr	%r2,%r2			# char *
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
+
+	.globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg    %r4,%r4,32		# get high word of 64bit loff_t
+	lr      %r4,%r5			# get low word of 64bit loff_t
+	sllg    %r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
 		return -EFAULT;
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL							/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu		311
 #define __NR_epoll_pwait	312
 #define __NR_utimes		313
+#define __NR_fallocate		314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some


* [PATCH 3/5][TAKE3] ext4: Extent overlap bugfix
       [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
  2007-05-15 20:03                                   ` [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
  2007-05-15 20:10                                   ` [PATCH 2/5][TAKE3] fallocate() on s390 Amit K. Arora
@ 2007-05-15 20:13                                   ` Amit K. Arora
  2007-05-15 20:16                                   ` [PATCH 4/5][TAKE3] ext4: fallocate support in ext4 Amit K. Arora
  2007-05-15 20:18                                   ` [PATCH 5/5][TAKE3] ext4: write support for preallocated blocks Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 20:13 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.
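
To make the idea easier to follow outside the kernel, here is a
simplified stand-alone sketch of the clamping logic. It is not the
kernel function: clamp_new_extent() and NO_NEXT_BLOCK are hypothetical
names, NO_NEXT_BLOCK stands in for EXT_MAX_BLOCK with an assumed value,
and the caller is assumed to pass in the start of the nearest extent
that lies after the new one.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for EXT_MAX_BLOCK; the value is an assumption of this sketch. */
#define NO_NEXT_BLOCK 0xffffffffu

/*
 * Shorten *len so that the new extent [b1, b1 + *len) neither wraps
 * through zero nor runs into the extent that starts at b2.
 * Returns 1 if the length was cut short, 0 otherwise.
 */
static int clamp_new_extent(uint32_t b1, uint32_t *len, uint32_t b2)
{
	int clamped = 0;

	assert(*len > 0);
	if (b2 == NO_NEXT_BLOCK)	/* no extent after b1, nothing to hit */
		return 0;
	assert(b2 > b1);

	/* wrap through zero: b1 + len overflowed 32 bits */
	if ((uint32_t)(b1 + *len) < b1) {
		*len = NO_NEXT_BLOCK - b1;
		clamped = 1;
	}

	/* overlap: the new extent would reach into the one at b2 */
	if (b1 + *len > b2) {
		*len = b2 - b1;
		clamped = 1;
	}
	return clamped;
}
```

For example, a request for 50 blocks at logical block 100 with an
existing extent starting at block 120 is cut down to 20 blocks.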

Changelog:
---------
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.
As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |   60 ++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s) 
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);


* [PATCH 4/5][TAKE3] ext4: fallocate support in ext4
       [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
                                                     ` (2 preceding siblings ...)
  2007-05-15 20:13                                   ` [PATCH 3/5][TAKE3] ext4: Extent overlap bugfix Amit K. Arora
@ 2007-05-15 20:16                                   ` Amit K. Arora
  2007-05-15 20:18                                   ` [PATCH 5/5][TAKE3] ext4: write support for preallocated blocks Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 20:16 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements the ->fallocate() inode operation in ext4. With
this patch, users of ext4 file systems will be able to use the
fallocate() system call for persistent preallocation.

The current implementation only supports preallocation for regular files
(directories are not supported as of now) with extent maps. This patch
does not currently support block-mapped files.

Only the FA_ALLOCATE mode is supported as of now. Supporting the
FA_DEALLOCATE mode is a <ToDo> item.

Changelog:
---------
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based and point "8)" below.
TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1.

Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
    functions.
 8) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
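
As background for item 7), here is a stand-alone sketch of the encoding
that the inline functions rely on: the top bit of the 16 bit length
marks an extent as uninitialized, and the low 15 bits carry the real
length. The struct and function names below are hypothetical, and the
sketch keeps ee_len in host byte order, whereas the on-disk field is
little-endian.

```c
#include <assert.h>
#include <stdint.h>

#define UNINIT_FLAG	0x8000u	/* top bit of ee_len: extent is uninitialized */
#define LEN_MASK	0x7fffu	/* low 15 bits: actual length in blocks */

/* Hypothetical, reduced extent: only the length field matters here. */
struct extent_sketch {
	uint16_t ee_len;
};

static void mark_uninitialized(struct extent_sketch *ex)
{
	ex->ee_len |= UNINIT_FLAG;
}

static int is_uninitialized(const struct extent_sketch *ex)
{
	return (ex->ee_len & UNINIT_FLAG) != 0;
}

static unsigned int actual_len(const struct extent_sketch *ex)
{
	return ex->ee_len & LEN_MASK;
}

/*
 * Merge "next" into "ex": add the actual lengths, then restore the
 * flag, since a plain addition of the raw fields would corrupt it.
 */
static void merge_len(struct extent_sketch *ex,
		      const struct extent_sketch *next)
{
	int uninit = is_uninitialized(ex);
	unsigned int sum = actual_len(ex) + actual_len(next);

	/* only extents in the same state may be merged */
	assert(uninit == is_uninitialized(next));
	assert(sum <= LEN_MASK);
	ex->ee_len = (uint16_t)sum;
	if (uninit)
		mark_uninitialized(ex);
}
```

This is also why every place that used to read ee_len directly has to
go through the "actual length" accessor once the flag bit is in use.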

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  240 +++++++++++++++++++++++++++++++++-------
 fs/ext4/file.c                  |    1 
 include/linux/ext4_fs.h         |    8 +
 include/linux/ext4_fs_extents.h |   12 ++
 4 files changed, 220 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
-			        le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/*
+		 * ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1263,7 +1286,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
-			        le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 		           > le32_to_cpu(nearex->ee_block)) {
@@ -1276,7 +1299,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
-				        le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1289,7 +1312,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1308,8 +1331,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
-		        cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len));
+			        (unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-		            + le16_to_cpu(ex->ee_len)) {
+		            + ext4_ext_get_actual_len(ex)) {
 	        lblock = le32_to_cpu(ex->ee_block)
-		         + le16_to_cpu(ex->ee_len);
+		         + ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len),
+			        (unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
 	unsigned short ex_ee_len;
+	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
 		goto out2;
@@ -2109,8 +2149,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create!=EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2214,6 +2256,122 @@ int ext4_ext_writepage_trans_blocks(stru
 	return needed;
 }
 
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	ext4_fsblk_t nblocks = 0;
+	int ret = 0;
+	int ret2 = 0;
+	int retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/*
+	 * currently supporting (pre)allocate mode for extent-based
+	 * files _only_
+	 */
+	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return -EOPNOTSUPP;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+		 	- block;
+
+	/*
+	 * credits to insert 1 extent into extent tree + buffers to be able to
+	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 */
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		handle = ext4_journal_start(inode, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			break;
+		}
+
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		WARN_ON(!ret);
+		if (!ret) {
+			ext4_error(inode->i_sb, "ext4_fallocate",
+				   "ext4_ext_get_blocks returned 0! inode#%lu"
+				   ", block=%llu, max_blocks=%llu",
+				   inode->i_ino, block, max_blocks);
+			ret = -EIO;
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			break;
+		}
+		if (ret > 0) {
+			/* check wrap through sign-bit/zero here */
+			if ((block + ret) < 0 || (block + ret) < block) {
+				ret = -EIO;
+				ext4_mark_inode_dirty(handle, inode);
+				ret2 = ext4_journal_stop(handle);
+				break;
+			}
+			if (buffer_new(&map_bh) && ((block + ret) >
+			    (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+			    >> blkbits)))
+					nblocks = nblocks + ret;
+		}
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret2)
+			break;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/*
+	 * Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if ((offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+			/*
+			 * if no error, we assume preallocation succeeded
+			 * completely
+			 */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+			/* Handle partial allocation scenario */
+			loff_t newsize;
+
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+
+	return ret > 0 ? ret2 : ret;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
Index: linux-2.6.22-rc1/fs/ext4/file.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/file.c
+++ linux-2.6.22-rc1/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.22-rc1/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits)		ALIGN((size),(1 << (blkbits)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialzed extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern int ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) {
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
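
The three helpers added above keep the "uninitialized" flag in the top bit of
the 16-bit ee_len field and the real length in the low 15 bits. A minimal
user-space sketch of that encoding follows. It assumes a host-endian field, so
the cpu_to_le16()/le16_to_cpu() conversions are left out, and every demo_* name
is hypothetical rather than part of the patch:

```c
#include <stdint.h>

#define DEMO_UNINIT_FLAG 0x8000u	/* top bit of the 16-bit length field */
#define DEMO_LEN_MASK    0x7FFFu	/* low 15 bits hold the real length */

/* Simplified stand-in for struct ext4_extent (host-endian, no pblock). */
struct demo_extent {
	uint32_t block;
	uint16_t len;
};

static inline void demo_mark_uninitialized(struct demo_extent *ex)
{
	ex->len = (uint16_t)(ex->len | DEMO_UNINIT_FLAG);
}

static inline int demo_is_uninitialized(const struct demo_extent *ex)
{
	return (ex->len & DEMO_UNINIT_FLAG) != 0;
}

static inline unsigned int demo_actual_len(const struct demo_extent *ex)
{
	return ex->len & DEMO_LEN_MASK;
}
```

With this layout a marked extent can describe at most 0x7FFF blocks, and any
code that reads the length has to go through the masking helper rather than
the raw field.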


* [PATCH 5/5][TAKE3] ext4: write support for preallocated blocks
       [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
                                                     ` (3 preceding siblings ...)
  2007-05-15 20:16                                   ` [PATCH 4/5][TAKE3] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-05-15 20:18                                   ` Amit K. Arora
  4 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-15 20:18 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting an extent into multiple (up to three) extents and merging the
newly split extents with neighbouring ones, if possible.
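
As a rough illustration of the split described above, the following user-space
sketch computes which pieces result when a write of max_blocks blocks starts at
logical block iblock inside an uninitialized extent. It only models the
arithmetic; journalling, tree insertion and merging are omitted, and the
demo_* names are hypothetical:

```c
#include <assert.h>

struct demo_piece {
	unsigned int start;	/* first logical block of the piece */
	unsigned int len;	/* number of blocks in the piece */
	int uninit;		/* 1 if the piece stays uninitialized */
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) around a
 * write that starts at iblock and covers at most max_blocks blocks.
 * Fills out[] in logical order and returns the number of pieces (1..3).
 */
static inline int demo_split_extent(unsigned int ee_block, unsigned int ee_len,
				    unsigned int iblock, unsigned int max_blocks,
				    struct demo_piece out[3])
{
	unsigned int allocated, written;
	int n = 0;

	assert(max_blocks > 0);
	assert(iblock >= ee_block && iblock - ee_block < ee_len);

	/* blocks available from iblock to the end of the extent */
	allocated = ee_len - (iblock - ee_block);
	written = allocated > max_blocks ? max_blocks : allocated;

	if (iblock > ee_block) {		/* left part stays uninitialized */
		out[n].start = ee_block;
		out[n].len = iblock - ee_block;
		out[n].uninit = 1;
		n++;
	}

	out[n].start = iblock;			/* the part being written */
	out[n].len = written;
	out[n].uninit = 0;
	n++;

	if (allocated > max_blocks) {		/* right part stays uninitialized */
		out[n].start = iblock + max_blocks;
		out[n].len = allocated - max_blocks;
		out[n].uninit = 1;
		n++;
	}
	return n;
}
```

For an extent covering blocks 100..149, a write of 10 blocks at block 120
yields three pieces (100..119 uninitialized, 120..129 initialized, 130..149
uninitialized), while a write covering the whole extent yields a single
initialized piece.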

Changelog:
---------
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.

 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
    ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  234 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh))
+	{
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in the middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2067,6 +2234,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2075,13 +2243,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2090,12 +2254,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2147,6 +2326,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+				 struct ext4_ext_path *path,
+				 struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
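
The merge-towards-right loop that this patch factors out into
ext4_ext_try_to_merge() can be sketched on a plain array as below. This is an
illustration only: the merge test used here (same initialized state, logically
and physically contiguous) is an assumption about what
ext4_can_extents_be_merged() checks, the maximum-length limit is ignored, and
the demo_* names are hypothetical:

```c
#include <string.h>

/* Simplified stand-in for one extent entry in a leaf block. */
struct demo_ext {
	unsigned int lblock;	/* first logical block */
	unsigned int pblock;	/* first physical block */
	unsigned int len;	/* length in blocks */
	int uninit;		/* 1 if uninitialized */
};

/* Assumed merge rule: same state, logically and physically adjacent. */
static inline int demo_can_merge(const struct demo_ext *a,
				 const struct demo_ext *b)
{
	return a->uninit == b->uninit &&
	       a->lblock + a->len == b->lblock &&
	       a->pblock + a->len == b->pblock;
}

/*
 * Merge arr[i] with its right-hand neighbours for as long as possible.
 * Returns the new number of entries in the array.
 */
static inline size_t demo_merge_right(struct demo_ext *arr, size_t count,
				      size_t i)
{
	while (i + 1 < count && demo_can_merge(&arr[i], &arr[i + 1])) {
		arr[i].len += arr[i + 1].len;
		/* close the gap left by the absorbed entry */
		memmove(&arr[i + 1], &arr[i + 2],
			(count - i - 2) * sizeof(arr[0]));
		count--;
	}
	return count;
}
```

As in the patch, merging towards the left is done by calling the same routine
on the previous entry (index i - 1) instead of writing a second loop.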


* Re: [PATCH 0/5][TAKE3] fallocate system call
  2007-05-15 19:37                                 ` Amit K. Arora
  (?)
  (?)
@ 2007-05-15 23:52                                 ` Mingming Cao
  -1 siblings, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-05-15 23:52 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna

On Wed, 2007-05-16 at 01:07 +0530, Amit K. Arora wrote:

> ToDos:
> -----
> 1> Implementation on other architectures (other than i386, x86_64,
> ppc64 and s390(x)). David Chinner has already posted a patch for ia64.

Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner <dgc@sgi.com>
Subject: [PATCH] ia64 fallocate syscall
Cc: "Amit K. Arora" <aarora@linux.vnet.ibm.com>, 
        akpm@linux-foundation.org, linux-ext4@vger.kernel.org,
        suparna@in.ibm.com, cmm@us.ibm.com

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S	2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S	2007-05-15 15:36:48.000000000 -0700
@@ -1585,5 +1585,6 @@
 	data8 sys_getcpu
 	data8 sys_epoll_pwait			// 1305
 	data8 sys_utimensat
+	data8 sys_fallocate
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h	2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h	2007-05-15 15:37:51.000000000 -0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu			1304
 #define __NR_epoll_pwait		1305
 #define __NR_utimensat			1306
+#define __NR_fallocate			1307
 
 #ifdef __KERNEL__
 




* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-15 20:03                                   ` [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-16  0:42                                     ` Mingming Cao
  2007-05-16 12:31                                       ` Amit K. Arora
  2007-05-16  3:16                                     ` David Chinner
  1 sibling, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-05-16  0:42 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna

On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

> @@ -1137,6 +1148,8 @@ struct inode_operations {
>  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
>  	int (*removexattr) (struct dentry *, const char *);
>  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> +	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
> +			  loff_t len);
>  };

Does the return value of the fallocate inode operation have to be *long*?
It is not consistent with the ext4_fallocate() definition in patch 4/5:

+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)

and thus causes compile warnings.



Mingming



* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-15 20:03                                   ` [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
  2007-05-16  0:42                                     ` Mingming Cao
@ 2007-05-16  3:16                                     ` David Chinner
  2007-05-16 12:21                                       ` Dave Kleikamp
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-05-16  3:16 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	suparna, cmm

On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

Can you please pick up the ia64 support patch I posted as well?

> Changelog:
> ---------
> Note: The changes below are from the initial post (dated 26th April,
> 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
> version on which this patch is based. TAKE2 was based on 2.6.21 and this
> is based on 2.6.22-rc1.
> 
> Following changes were made to the previous version:
>  1) Added description before sys_fallocate() definition.
>  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
>     posix_fallocate should return EINVAL for len <= 0).
>  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
>  4) Do not return ENODEV for dirs (let individual file systems decide if
>     they want to support preallocation to directories or not).
>  5) Check for wrap through zero.
>  6) Update c/mtime if fallocate() succeeds.

Please don't make this always happen. c/mtime updates should depend
on the mode being used and whether there is a visible change to the file. If no
userspace-visible changes to the file occurred, then timestamps should not
be changed.

e.g. FA_ALLOCATE that changes the file size requires the same semantics as
ftruncate() extending the file; otherwise no change in timestamps should occur.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16  3:16                                     ` David Chinner
@ 2007-05-16 12:21                                       ` Dave Kleikamp
  2007-05-16 12:37                                         ` Amit K. Arora
  2007-05-16 23:40                                         ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-05-16 12:21 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:

> > Following changes were made to the previous version:
> >  1) Added description before sys_fallocate() definition.
> >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> >     posix_fallocate should return EINVAL for len <= 0).
> >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> >  4) Do not return ENODEV for dirs (let individual file systems decide if
> >     they want to support preallocation to directories or not).
> >  5) Check for wrap through zero.
> >  6) Update c/mtime if fallocate() succeeds.
> 
> Please don't make this always happen. c/mtime updates should be dependent
> on the mode being used and whether there is visible change to the file. If no
> userspace visible changes to the file occurred, then timestamps should not
> be changed.

i_blocks will be updated, so it seems reasonable to update ctime.  mtime
shouldn't be changed, though, since the contents of the file will be
unchanged.

> e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
> extending the file, otherwise no change in timestamps should occur.
> 
> Cheers,
> 
> Dave.
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16  0:42                                     ` Mingming Cao
@ 2007-05-16 12:31                                       ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-16 12:31 UTC (permalink / raw)
  To: Mingming Cao
  Cc: torvalds, akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna

On Tue, May 15, 2007 at 05:42:46PM -0700, Mingming Cao wrote:
> On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
> 
> > @@ -1137,6 +1148,8 @@ struct inode_operations {
> >  	ssize_t (*listxattr) (struct dentry *, char *, size_t);
> >  	int (*removexattr) (struct dentry *, const char *);
> >  	void (*truncate_range)(struct inode *, loff_t, loff_t);
> > +	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
> > +			  loff_t len);
> >  };
> 
> Does the return value of the fallocate inode operation have to be *long*?
> It is not consistent with the ext4_fallocate() definition in patch 4/5:

I think ->fallocate() should return a "long", since sys_fallocate() has
to return what ->fallocate() returns and hence their return type should
ideally match.
 
> +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)

I will change the ext4_fallocate() to return a "long" (in patch 4/5)
in the next post.

Agree?

Thanks!
--
Regards,
Amit Arora

> 
> and thus causes compile warnings.
> 
> 
> 
> Mingming


* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16 12:21                                       ` Dave Kleikamp
@ 2007-05-16 12:37                                         ` Amit K. Arora
  2007-05-16 23:40                                         ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-16 12:37 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: David Chinner, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> 
> > > Following changes were made to the previous version:
> > >  1) Added description before sys_fallocate() definition.
> > >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > >     posix_fallocate should return EINVAL for len <= 0).
> > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > >     they want to support preallocation to directories or not).
> > >  5) Check for wrap through zero.
> > >  6) Update c/mtime if fallocate() succeeds.
> > 
> > Please don't make this always happen. c/mtime updates should be dependent
> > on the mode being used and whether there is visible change to the file. If no
> > userspace visible changes to the file occurred, then timestamps should not
> > be changed.
> 
> i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> shouldn't be changed, though, since the contents of the file will be
> unchanged.

I agree. Thus the ctime should also change for the FA_PREALLOCATE mode
(which does not change the file size) - if we end up having this
additional mode in the near future.

--
Regards,
Amit Arora
 
> > e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
> > extending the file, otherwise no change in timestamps should occur.
> > 
> > Cheers,
> > 
> > Dave.
> -- 
> David Kleikamp
> IBM Linux Technology Center


* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16 12:21                                       ` Dave Kleikamp
  2007-05-16 12:37                                         ` Amit K. Arora
@ 2007-05-16 23:40                                         ` David Chinner
  2007-05-17 12:10                                           ` Dave Kleikamp
  2007-05-17 12:28                                           ` Amit K. Arora
  1 sibling, 2 replies; 340+ messages in thread
From: David Chinner @ 2007-05-16 23:40 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: David Chinner, Amit K. Arora, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, suparna, cmm

On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> 
> > > Following changes were made to the previous version:
> > >  1) Added description before sys_fallocate() definition.
> > >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > >     posix_fallocate should return EINVAL for len <= 0).
> > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > >     they want to support preallocation to directories or not).
> > >  5) Check for wrap through zero.
> > >  6) Update c/mtime if fallocate() succeeds.
> > 
> > Please don't make this always happen. c/mtime updates should be dependent
> > on the mode being used and whether there is visible change to the file. If no
> > userspace visible changes to the file occurred, then timestamps should not
> > be changed.
> 
> i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> shouldn't be changed, though, since the contents of the file will be
> unchanged.

That's assuming blocks were actually allocated - if the prealloc range already
has underlying blocks there is no change and so we should not be changing
mtime either. Only the filesystem will know if it has changed the file, so I
think that timestamp updates need to be driven down to that level, not done
blindly at the highest layer....
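
One possible reading of the timestamp policy discussed in this sub-thread, as
a small sketch: the filesystem reports what actually changed, and the
timestamps follow from that. The rule encoded here (size change behaves like
an extending ftruncate(), block allocation alone touches only ctime, no change
touches nothing) is an interpretation of the discussion, not agreed-upon
kernel behaviour, and the demo_* names are hypothetical:

```c
#include <stdbool.h>

/* Which inode timestamps a preallocation request should update. */
struct demo_ts_update {
	bool ctime;
	bool mtime;
};

static inline struct demo_ts_update
demo_timestamps_after_prealloc(bool blocks_allocated, bool size_changed)
{
	struct demo_ts_update u = { false, false };

	if (size_changed) {
		/* same semantics as ftruncate() extending the file */
		u.ctime = true;
		u.mtime = true;
	} else if (blocks_allocated) {
		/* i_blocks changed, but the file contents did not */
		u.ctime = true;
	}
	/* nothing allocated and no size change: leave timestamps alone */
	return u;
}
```

Only the filesystem can supply the two inputs, which is the argument for
making this decision inside the ->fallocate() implementation rather than in
sys_fallocate().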

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16 23:40                                         ` David Chinner
@ 2007-05-17 12:10                                           ` Dave Kleikamp
  2007-05-17 12:28                                           ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-05-17 12:10 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Thu, 2007-05-17 at 09:40 +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:

> > > Please don't make this always happen. c/mtime updates should be dependent
> > > on the mode being used and whether there is visible change to the file. If no
> > > userspace visible changes to the file occurred, then timestamps should not
> > > be changed.
> > 
> > i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> > shouldn't be changed, though, since the contents of the file will be
> > unchanged.
> 
> That's assuming blocks were actually allocated - if the prealloc range already
> has underlying blocks there is no change and so we should not be changing
> mtime either. Only the filesystem will know if it has changed the file, so I
> think that timestamp updates need to be driven down to that level, not done
> blindly at the highest layer....

Yes, I agree.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-16 23:40                                         ` David Chinner
  2007-05-17 12:10                                           ` Dave Kleikamp
@ 2007-05-17 12:28                                           ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 12:28 UTC (permalink / raw)
  To: David Chinner
  Cc: Dave Kleikamp, torvalds, akpm, linux-fsdevel, linux-kernel,
	linux-ext4, xfs, suparna, cmm

On Thu, May 17, 2007 at 09:40:36AM +1000, David Chinner wrote:
> On Wed, May 16, 2007 at 07:21:16AM -0500, Dave Kleikamp wrote:
> > On Wed, 2007-05-16 at 13:16 +1000, David Chinner wrote:
> > > On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> > 
> > > > Following changes were made to the previous version:
> > > >  1) Added description before sys_fallocate() definition.
> > > >  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> > > >     posix_fallocate should return EINVAL for len <= 0).
> > > >  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
> > > >  4) Do not return ENODEV for dirs (let individual file systems decide if
> > > >     they want to support preallocation to directories or not).
> > > >  5) Check for wrap through zero.
> > > >  6) Update c/mtime if fallocate() succeeds.
> > > 
> > > Please don't make this always happen. c/mtime updates should be dependent
> > > on the mode being used and whether there is visible change to the file. If no
> > > userspace visible changes to the file occurred, then timestamps should not
> > > be changed.
> > 
> > i_blocks will be updated, so it seems reasonable to update ctime.  mtime
> > shouldn't be changed, though, since the contents of the file will be
> > unchanged.
> 
> That's assuming blocks were actually allocated - if the prealloc range already
> has underlying blocks there is no change and so we should not be changing
> mtime either. Only the filesystem will know if it has changed the file, so I
> think that timestamp updates need to be driven down to that level, not done
> blindly at the highest layer....

Ok. Will make this change in the next post.

--
Regards,
Amit Arora
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


* [PATCH 0/6][TAKE4] fallocate system call
  2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
@ 2007-05-17 14:11                                 ` Amit K. Arora
  2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
                                                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:11 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although it has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems can do this more
efficiently within the kernel by implementing the proposed fallocate()
system call. It is expected that posix_fallocate() will be modified to
call this new system call first and, in case the kernel/filesystem does
not implement it, fall back to the current implementation of writing
zeroes to the new blocks.
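
The expected control flow in posix_fallocate() can be sketched roughly as
below. This is only an illustration, not glibc code: the callback type and
all function names are hypothetical, and the two paths are passed in as
function pointers so the fallback decision can be shown on its own. The
choice of ENOSYS and EOPNOTSUPP as the "fall back" errors follows the
comment on sys_fallocate() in patch 1/6.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical callback type. Both the fast path (the proposed system
 * call) and the slow path (writing zeroes) return 0 on success or a
 * positive errno value, following the posix_fallocate() convention.
 */
typedef int (*prealloc_fn)(int fd, off_t offset, off_t len);

/* Only "not implemented" style errors justify taking the slow path. */
int should_fall_back(int err)
{
	return err == ENOSYS || err == EOPNOTSUPP;
}

/* Try the fast path first; fall back only if it is unavailable. */
int prealloc_with_fallback(int fd, off_t offset, off_t len,
			   prealloc_fn fast, prealloc_fn slow)
{
	int err;

	assert(fast && slow);
	err = fast(fd, offset, len);
	if (should_fall_back(err))
		err = slow(fd, offset, len);
	return err;
}

/* Stand-ins that only exercise the control flow; they do no I/O. */
int stub_fast_ok(int fd, off_t o, off_t l)
{ (void)fd; (void)o; (void)l; return 0; }
int stub_fast_enosys(int fd, off_t o, off_t l)
{ (void)fd; (void)o; (void)l; return ENOSYS; }
int stub_fast_enospc(int fd, off_t o, off_t l)
{ (void)fd; (void)o; (void)l; return ENOSPC; }
int stub_slow_eio(int fd, off_t o, off_t l)
{ (void)fd; (void)o; (void)l; return EIO; }
```

Note that a real error from the fast path (for example ENOSPC) is returned
as-is; retrying by writing zeroes would only fail more slowly.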

Interface:
---------
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime in the inode of the corresponding file, marking a
    successful allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime.
* New modes might get added in the future. One such mode, which is
  already under discussion, is FA_PREALLOCATE, which will preallocate
  space but will not change the file size and [cm]time. Since the
  semantics of this new mode are not yet clear and agreed upon, this
  patchset does not implement it.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error code on
failure. This keeps the semantics the same as those of
posix_fallocate().
    
sys_fallocate() on s390:
-----------------------
The s390 ABI makes it difficult to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch
that solves this with a wrapper in the kernel. This will also require
special handling of this system call in glibc on s390, but it seems to
be the best solution so far.
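
For readers less familiar with the issue: on 31/32-bit ABIs a 64-bit
loff_t argument travels as two 32-bit halves, and the kernel-side wrapper
has to join them again (compat_sys_fallocate() for ppc32 in patch 1/6 and
s390_fallocate() in patch 2/6 both do this). A minimal userspace sketch of
the joining step, with hypothetical helper names, is below. The s390 patch
uses a union for this, which relies on the big-endian layout of that
platform; the shift form here does not depend on byte order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Join two 32-bit halves into one 64-bit offset/length. The shift is
 * done on an unsigned type so that a set top bit in "hi" is well defined.
 */
int64_t join_loff(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Split a 64-bit value into the halves a 32-bit caller would pass. */
void split_loff(int64_t v, uint32_t *hi, uint32_t *lo)
{
	assert(hi && lo);
	*hi = (uint32_t)((uint64_t)v >> 32);
	*lo = (uint32_t)((uint64_t)v & 0xffffffffu);
}
```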

Known Problem:
-------------
mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this at a later time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
---------
Changes from Take2 to Take3:
	1) Return type is now described in the interface description
	   above.
	2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


The following patches are part of this series:
Patch 1/6 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/6 : fallocate() on s390
Patch 3/6 : fallocate() on ia64
Patch 4/6 : ext4: Extent overlap bugfix
Patch 5/6 : ext4: fallocate support in ext4
Patch 6/6 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
@ 2007-05-17 14:23                                   ` Amit K. Arora
  2007-05-17 14:25                                   ` [PATCH 2/6][TAKE4] fallocate() on s390 Amit K. Arora
                                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:23 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
---------
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
    mtime will not be required for allocation since we touch only
    metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
    posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
    they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)
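
The argument checks listed above (EINVAL, EOPNOTSUPP and the wrap/size
check) can be sketched in isolation as follows. This is a userspace
illustration under some assumptions, not the kernel code: the function name
and the "maxbytes" parameter (standing in for the superblock's s_maxbytes)
are hypothetical, the file-descriptor checks are left out, and the size
check is rearranged so that offset + len is never computed and cannot
overflow. The mode values are the ones proposed in the fs.h hunk of this
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FA_ALLOCATE	0x1	/* values as proposed in this patch */
#define FA_DEALLOCATE	0x2

/*
 * Returns 0 if the arguments are acceptable, or a negative errno value,
 * in the same order in which the checks are made in sys_fallocate().
 */
long check_fallocate_args(int mode, int64_t offset, int64_t len,
			  int64_t maxbytes)
{
	assert(maxbytes >= 0);

	if (offset < 0 || len <= 0)
		return -EINVAL;

	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
		return -EOPNOTSUPP;

	/* Same meaning as offset + len > maxbytes, without the overflow. */
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```

A range that ends exactly at maxbytes is accepted, which matches the
"(offset + len) > s_maxbytes" comparison in the patch.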

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 arch/i386/kernel/syscall_table.S |    1 
 arch/powerpc/kernel/sys_ppc32.c  |    7 +++
 arch/x86_64/ia32/ia32entry.S     |    1 
 fs/open.c                        |   86 +++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h        |    3 -
 include/asm-powerpc/systbl.h     |    1 
 include/asm-powerpc/unistd.h     |    3 -
 include/asm-x86_64/unistd.h      |    2 
 include/linux/fs.h               |   13 +++++
 include/linux/syscalls.h         |    1 
 10 files changed, 116 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22-rc1/fs/open.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/open.c
+++ linux-2.6.22-rc1/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *	  FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ *	 support it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
+/*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22-rc1/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-i386/unistd.h
+++ linux-2.6.22-rc1/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
 #define __NR_signalfd		321
 #define __NR_timerfd		322
 #define __NR_eventfd		323
+#define __NR_fallocate		324
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 325
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc1/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22-rc1/include/asm-powerpc/systbl.h
@@ -308,3 +308,4 @@ COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
 COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
Index: linux-2.6.22-rc1/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22-rc1/include/asm-powerpc/unistd.h
@@ -327,10 +327,11 @@
 #define __NR_getcpu		302
 #define __NR_epoll_pwait	303
 #define __NR_utimensat		304
+#define __NR_fallocate		305
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		305
+#define __NR_syscalls		306
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.22-rc1/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22-rc1/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
 __SYSCALL(__NR_timerfd, sys_timerfd)
 #define __NR_eventfd		283
 __SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate		284
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc1/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/fs.h
+++ linux-2.6.22-rc1/include/linux/fs.h
@@ -266,6 +266,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
+ *		  may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ *		  the preallocated blocks.
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1137,6 +1148,8 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 };
 
 struct seq_file;
Index: linux-2.6.22-rc1/include/linux/syscalls.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/syscalls.h
+++ linux-2.6.22-rc1/include/linux/syscalls.h
@@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc1/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys_fallocate
 ia32_syscall_end:

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 2/6][TAKE4] fallocate() on s390
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
  2007-05-17 14:23                                   ` [PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
@ 2007-05-17 14:25                                   ` Amit K. Arora
  2007-05-17 14:25                                   ` [PATCH 3/6][TAKE4] fallocate() on ia64 Amit K. Arora
                                                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:25 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on the s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
	stm	%r6,%r7,28(%r15)	/* save %r6/%r7 on stack */
	cfi_offset (%r7, -68)
	cfi_offset (%r6, -72)
	lm	%r6,%r7,96(%r15)	/* load loff_t len from stack */
	svc	SYS_ify(fallocate)
	lm	%r6,%r7,28(%r15)	/* restore %r6/%r7 from stack */
	br	%r14
PSEUDO_END(__fallocate)


Here are his comments and his patch to the Linux kernel.

-------------
From: Martin Schwidefsky <schwidefsky@de.ibm.com>

This patch implements support for the fallocate system call on the s390(x)
platform. A wrapper is added to work around the problem the s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++++++++++
 arch/s390/kernel/sys_s390.c       |   29 +++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 include/asm-s390/unistd.h         |    3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
 	llgtr	%r2,%r2			# char *
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
+
+	.globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg    %r4,%r4,32		# get high word of 64bit loff_t
+	lr      %r4,%r5			# get low word of 64bit loff_t
+	sllg    %r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
 		return -EFAULT;
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL							/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu		311
 #define __NR_epoll_pwait	312
 #define __NR_utimes		313
+#define __NR_fallocate		314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 3/6][TAKE4] fallocate() on ia64
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
  2007-05-17 14:23                                   ` [PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
  2007-05-17 14:25                                   ` [PATCH 2/6][TAKE4] fallocate() on s390 Amit K. Arora
@ 2007-05-17 14:25                                   ` Amit K. Arora
  2007-05-17 14:26                                   ` [PATCH 4/6][TAKE4] ext4: Extent overlap bugfix Amit K. Arora
                                                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:25 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner <dgc@sgi.com>
Subject: [PATCH] ia64 fallocate syscall
Cc: "Amit K. Arora" <aarora@linux.vnet.ibm.com>, 
        akpm@linux-foundation.org, linux-ext4@vger.kernel.org,
        suparna@in.ibm.com, cmm@us.ibm.com

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S	2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S	2007-05-15 15:36:48.000000000 -0700
@@ -1585,5 +1585,6 @@
 	data8 sys_getcpu
 	data8 sys_epoll_pwait			// 1305
 	data8 sys_utimensat
+	data8 sys_fallocate

 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h	2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h	2007-05-15 15:37:51.000000000 -0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu			1304
 #define __NR_epoll_pwait		1305
 #define __NR_utimensat			1306
+#define __NR_fallocate			1307

 #ifdef __KERNEL__



^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 4/6][TAKE4] ext4: Extent overlap bugfix
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
                                                     ` (2 preceding siblings ...)
  2007-05-17 14:25                                   ` [PATCH 3/6][TAKE4] fallocate() on ia64 Amit K. Arora
@ 2007-05-17 14:26                                   ` Amit K. Arora
  2007-05-17 14:29                                   ` [PATCH 5/6][TAKE4] ext4: fallocate support in ext4 Amit K. Arora
  2007-05-17 14:30                                   ` [PATCH 6/6][TAKE4] ext4: write support for preallocated blocks Amit K. Arora
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:26 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds a check for overlap of extents and shortens the new
extent to be inserted if it would otherwise overlap an existing one.
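
The trimming rule can be sketched in isolation as below. This is a
simplified userspace illustration, not the ext4 code: the function name and
MAX_BLOCK (standing in for EXT_MAX_BLOCK) are hypothetical, the caller is
assumed to have already looked up the next allocated logical block, and the
arithmetic is done in 64 bits so that the wrap-through-zero case and the
overlap case reduce to a single comparison.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCK 0xffffffffu	/* hypothetical stand-in for EXT_MAX_BLOCK */

/*
 * Trim the proposed extent [start, start + *len) so that it stops before
 * next_alloc, the first logical block after start that is already
 * allocated (MAX_BLOCK if there is none).
 * Returns 1 if *len was shortened, 0 if the extent fits as requested.
 */
int trim_new_extent(uint32_t start, uint16_t *len, uint32_t next_alloc)
{
	uint64_t end;

	assert(len && next_alloc > start);

	/* One past the last requested block; cannot wrap in 64 bits. */
	end = (uint64_t)start + *len;
	if (end > next_alloc) {
		/* next_alloc - start < *len here, so it fits in 16 bits. */
		*len = (uint16_t)(next_alloc - start);
		return 1;
	}
	return 0;
}
```

An extent that ends exactly where the next allocated block begins is left
unchanged, as in the "b1 + len1 > b2" test in the patch.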

Changelog:
---------
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) As suggested by Andrew, a check for wrap through zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |   60 ++++++++++++++++++++++++++++++++++++++--
 include/linux/ext4_fs_extents.h |    1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+				    struct ext4_extent *newext,
+				    struct ext4_ext_path *path)
+{
+	unsigned long b1, b2;
+	unsigned int depth, len1;
+	unsigned int ret = 0;
+
+	b1 = le32_to_cpu(newext->ee_block);
+	len1 = le16_to_cpu(newext->ee_len);
+	depth = ext_depth(inode);
+	if (!path[depth].p_ext)
+		goto out;
+	b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+	/*
+	 * get the next allocated block if the extent in the path
+	 * is before the requested block(s) 
+	 */
+	if (b2 < b1) {
+		b2 = ext4_ext_next_allocated_block(path);
+		if (b2 == EXT_MAX_BLOCK)
+			goto out;
+	}
+
+	/* check for wrap through zero */
+	if (b1 + len1 < b1) {
+		len1 = EXT_MAX_BLOCK - b1;
+		newext->ee_len = cpu_to_le16(len1);
+		ret = 1;
+	}
+
+	/* check for overlap */
+	if (b1 + len1 > b2) {
+		newext->ee_len = cpu_to_le16(b2 - b1);
+		ret = 1;
+	}
+out:
+	return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
-	allocated = max_blocks;
+
+	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+	newex.ee_block = cpu_to_le32(iblock);
+	newex.ee_len = cpu_to_le16(max_blocks);
+	err = ext4_ext_check_overlap(inode, &newex, path);
+	if (err)
+		allocated = le16_to_cpu(newex.ee_len);
+	else
+		allocated = max_blocks;
 	newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
 	if (!newblock)
 		goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
 			goal, newblock, allocated);
 
 	/* try to insert new extent into found leaf and return */
-	newex.ee_block = cpu_to_le32(iblock);
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 5/6][TAKE4] ext4: fallocate support in ext4
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
                                                     ` (3 preceding siblings ...)
  2007-05-17 14:26                                   ` [PATCH 4/6][TAKE4] ext4: Extent overlap bugfix Amit K. Arora
@ 2007-05-17 14:29                                   ` Amit K. Arora
  2007-05-17 14:30                                   ` [PATCH 6/6][TAKE4] ext4: write support for preallocated blocks Amit K. Arora
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:29 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation.

The current implementation only supports preallocation for regular files
with extent maps (directories are not supported as of now). Block-mapped
files are not supported by this patch.

Only FA_ALLOCATE mode is supported as of now. Supporting
FA_DEALLOCATE mode is a <ToDo> item.

Changelog:
---------
Changes from Take3 to Take4:
 1) Changed ext4_fallocate() declaration and definition to return a "long"
    and not an "int", to match with ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
    functions.
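
As an illustration of the uninitialized-extent helpers mentioned in the
changelog, here is a small userspace sketch of the encoding: the top bit
of the 16-bit length field flags an extent as uninitialized, and the low
15 bits hold the real length. It also shows the merge rule used in the
diff below (same state, logically and physically contiguous, combined
length within the limit). The struct layout and all sketch_* names are
hypothetical, the fields are host-endian rather than little-endian
on-disk values, and SKETCH_MAX_LEN is an assumed stand-in for
EXT_MAX_LEN.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNINIT_BIT 0x8000u /* MSB of the 16-bit length field */
#define SKETCH_MAX_LEN    0x7FFFu /* assumed largest real length */

/* Host-endian stand-in for an extent (hypothetical layout). */
struct sketch_extent {
	uint32_t lblock;	/* first logical block */
	uint16_t len_raw;	/* length, MSB set means "uninitialized" */
	uint64_t pblock;	/* first physical block */
};

void sketch_mark_uninitialized(struct sketch_extent *ex)
{
	ex->len_raw = (uint16_t)(ex->len_raw | SKETCH_UNINIT_BIT);
}

int sketch_is_uninitialized(const struct sketch_extent *ex)
{
	return (ex->len_raw & SKETCH_UNINIT_BIT) != 0;
}

unsigned int sketch_actual_len(const struct sketch_extent *ex)
{
	return ex->len_raw & SKETCH_MAX_LEN;
}

/* Store a length, re-applying the flag afterwards as the patch does. */
void sketch_set_len(struct sketch_extent *ex, unsigned int len, int uninit)
{
	assert(len > 0 && len <= SKETCH_MAX_LEN);
	ex->len_raw = (uint16_t)len;
	if (uninit)
		sketch_mark_uninitialized(ex);
}

/* Returns 1 if b can be appended to a, 0 otherwise. */
int sketch_can_merge(const struct sketch_extent *a,
		     const struct sketch_extent *b)
{
	unsigned int la = sketch_actual_len(a);
	unsigned int lb = sketch_actual_len(b);

	if (sketch_is_uninitialized(a) != sketch_is_uninitialized(b))
		return 0;	/* never mix initialized and uninitialized */
	if (a->lblock + la != b->lblock)
		return 0;	/* not logically contiguous */
	if (la + lb > SKETCH_MAX_LEN)
		return 0;	/* sum would spill into the flag bit */
	return a->pblock + la == b->pblock;
}
```

The point of the sketch is that every consumer of the length has to mask
off the flag bit first, which is why the patch replaces the raw
le16_to_cpu(ee_len) reads with ext4_ext_get_actual_len().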

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  249 +++++++++++++++++++++++++++++++++-------
 fs/ext4/file.c                  |    1 
 include/linux/ext4_fs.h         |    8 +
 include/linux/ext4_fs_extents.h |   12 +
 4 files changed, 229 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 		        le32_to_cpu(path->p_ext->ee_block),
 		        ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 			        le32_to_cpu(path[depth].p_ext->ee_block),
 			        ext_pblock(path[depth].p_ext),
-			        le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/*
+		 * ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1263,7 +1286,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 			        le32_to_cpu(newext->ee_block),
 			        ext_pblock(newext),
-			        le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 		           > le32_to_cpu(nearex->ee_block)) {
@@ -1276,7 +1299,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 				        le32_to_cpu(newext->ee_block),
 				        ext_pblock(newext),
-				        le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1289,7 +1312,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1308,8 +1331,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 		        cbex.ec_block = le32_to_cpu(ex->ee_block);
-		        cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 		        cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len));
+			        (unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-		            + le16_to_cpu(ex->ee_len)) {
+		            + ext4_ext_get_actual_len(ex)) {
 	        lblock = le32_to_cpu(ex->ee_block)
-		         + le16_to_cpu(ex->ee_len);
+		         + ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 			        (unsigned long) le32_to_cpu(ex->ee_block),
-			        (unsigned long) le16_to_cpu(ex->ee_len),
+			        (unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
 	unsigned short ex_ee_len;
+	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 	        unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err)
 		goto out2;
@@ -2109,8 +2149,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2214,6 +2256,131 @@ int ext4_ext_writepage_trans_blocks(stru
 	return needed;
 }
 
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	ext4_fsblk_t nblocks = 0;
+	int ret = 0;
+	int ret2 = 0;
+	int retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/*
+	 * currently supporting (pre)allocate mode for extent-based
+	 * files _only_
+	 */
+	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return -EOPNOTSUPP;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+		 	- block;
+
+	/*
+	 * credits to insert 1 extent into extent tree + buffers to be able to
+	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 */
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		handle = ext4_journal_start(inode, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			break;
+		}
+
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		WARN_ON(!ret);
+		if (!ret) {
+			ext4_error(inode->i_sb, "ext4_fallocate",
+				   "ext4_ext_get_blocks returned 0! inode#%lu"
+				   ", block=%llu, max_blocks=%llu",
+				   inode->i_ino, block, max_blocks);
+			ret = -EIO;
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			break;
+		}
+		if (ret > 0) {
+			/* check wrap through sign-bit/zero here */
+			if ((block + ret) < 0 || (block + ret) < block) {
+				ret = -EIO;
+				ext4_mark_inode_dirty(handle, inode);
+				ret2 = ext4_journal_stop(handle);
+				break;
+			}
+			if (buffer_new(&map_bh) && ((block + ret) >
+			    (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+			    >> blkbits)))
+					nblocks = nblocks + ret;
+		}
+
+		/* Update ctime if new blocks get allocated */
+		if (nblocks) {
+			struct timespec now;
+			now = current_fs_time(inode->i_sb);
+			if (!timespec_equal(&inode->i_ctime, &now))
+				inode->i_ctime = now;
+		}
+
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret2)
+			break;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/*
+	 * Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if ((offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+			/*
+			 * if no error, we assume preallocation succeeded
+			 * completely
+			 */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+			/* Handle partial allocation scenario */
+			loff_t newsize;
+
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+
+	return ret > 0 ? ret2 : ret;
+}
+
 EXPORT_SYMBOL(ext4_mark_inode_dirty);
 EXPORT_SYMBOL(ext4_ext_invalidate_cache);
 EXPORT_SYMBOL(ext4_ext_insert_extent);
Index: linux-2.6.22-rc1/fs/ext4/file.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/file.c
+++ linux-2.6.22-rc1/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.22-rc1/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits)		ALIGN((size),(1 << (blkbits)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -976,6 +982,8 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -188,6 +188,18 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext) {
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext) {
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);


* [PATCH 6/6][TAKE4] ext4: write support for preallocated blocks
       [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
                                                     ` (4 preceding siblings ...)
  2007-05-17 14:29                                   ` [PATCH 5/6][TAKE4] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-05-17 14:30                                   ` Amit K. Arora
  5 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-05-17 14:30 UTC (permalink / raw)
  To: torvalds, akpm; +Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting an extent into multiple (up to three) extents and merging the
newly split extents with neighbouring ones, if possible.
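
To make the split concrete, here is a minimal userspace sketch of how a
write into an uninitialized extent divides it into one, two or three
pieces. It follows the arithmetic of ext4_ext_convert_to_initialized()
in the diff below, but only computes the resulting ranges; it does not
touch an extent tree, and it leaves out the merging step. The names
sketch_split(), sketch_emit() and struct sketch_piece are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One resulting piece of the original extent (hypothetical type). */
struct sketch_piece {
	uint32_t start;		/* first logical block of the piece */
	uint32_t len;		/* number of blocks in the piece */
	int uninit;		/* 1 = still uninitialized, 0 = initialized */
};

static void sketch_emit(struct sketch_piece *p, uint32_t start,
			uint32_t len, int uninit)
{
	p->start = start;
	p->len = len;
	p->uninit = uninit;
}

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a
 * write of up to max_blocks blocks starting at iblock.
 * Fills out[] in logical order and returns the number of pieces (1..3).
 */
int sketch_split(uint32_t ee_block, uint32_t ee_len,
		 uint32_t iblock, uint32_t max_blocks,
		 struct sketch_piece out[3])
{
	uint32_t remaining, written;
	int n = 0;

	assert(max_blocks > 0);
	assert(iblock >= ee_block);
	assert((uint64_t)iblock < (uint64_t)ee_block + ee_len);

	/* blocks from iblock to the end of the original extent */
	remaining = ee_len - (iblock - ee_block);

	/* leading part stays uninitialized */
	if (iblock > ee_block)
		sketch_emit(&out[n++], ee_block, iblock - ee_block, 1);

	/* the part being written becomes initialized */
	written = remaining < max_blocks ? remaining : max_blocks;
	sketch_emit(&out[n++], iblock, written, 0);

	/* trailing part stays uninitialized */
	if (remaining > max_blocks)
		sketch_emit(&out[n++], iblock + max_blocks,
			    remaining - max_blocks, 1);

	return n;
}
```

For an uninitialized extent covering blocks 100-119, a 5-block write at
block 105 gives three pieces (100-104 uninitialized, 105-109
initialized, 110-119 uninitialized), while a 20-block write at block 100
gives a single initialized piece and needs no split.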

Changelog:
---------
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
    ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
 fs/ext4/extents.c               |  234 +++++++++++++++++++++++++++++++++++-----
 include/linux/ext4_fs_extents.h |    3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh))
+	{
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in the middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/* The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/* If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/* New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/* To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/* Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2067,6 +2234,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2075,13 +2243,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 	        if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2090,12 +2254,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2147,6 +2326,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -202,6 +202,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+				 struct ext4_ext_path *path,
+				 struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
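
The handling of a block that falls inside an already-found extent, as
changed by the ext4_ext_get_blocks() hunk above, can be sketched as a small
decision function. This is only an illustrative model, not kernel code: the
enum, the decide_lookup_action() helper and the numeric value used for
EXT4_CREATE_UNINITIALIZED_EXT are assumptions made for the example.

```c
#include <stdbool.h>

/* Assumed value, for illustration only; the real definition is in the patch. */
#define EXT4_CREATE_UNINITIALIZED_EXT 2

/* Hypothetical names for the exits taken in ext4_ext_get_blocks(). */
enum lookup_action {
	RETURN_MAPPED,        /* "goto out": hand back the existing mapping */
	RETURN_UNMAPPED,      /* "goto out2": behave as if it were a hole */
	CONVERT_AND_MARK_NEW  /* convert to initialized, then "goto outnew" */
};

/*
 * Decide what happens when iblock lies inside an extent that was found.
 * 'uninitialized' is the state of that extent; 'create' is the caller's
 * create argument.
 */
enum lookup_action decide_lookup_action(bool uninitialized, int create)
{
	if (!uninitialized)
		return RETURN_MAPPED;    /* ordinary extent (also cached) */
	if (create == EXT4_CREATE_UNINITIALIZED_EXT)
		return RETURN_MAPPED;    /* preallocating over preallocated space */
	if (!create)
		return RETURN_UNMAPPED;  /* a plain lookup sees a hole */
	return CONVERT_AND_MARK_NEW;     /* a write initializes the range */
}
```

Note that in the patch the conversion itself can still fail: when
ext4_ext_convert_to_initialized() returns a value <= 0, the code takes the
out2 exit instead of outnew.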

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-15 13:23                                       ` Amit K. Arora
@ 2007-05-18 21:36                                         ` Theodore Tso
  2007-05-18 23:10                                           ` Mingming Cao
  0 siblings, 1 reply; 340+ messages in thread
From: Theodore Tso @ 2007-05-18 21:36 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-ext4

On Tue, May 15, 2007 at 06:53:53PM +0530, Amit K. Arora wrote:
> I will rebase it to 2.6.22-rc1 and repost the patches soon.
> Thanks!

I've rebased to 2.6.22-rc1 and put it in the ext4-patch-queue.

Mingming had rebased your previous (take3) set to 2.6.22-rc1, but
apparently the series file was corrupted, so it referenced an
incorrect patch filename, and the patch series didn't apply cleanly.
I've fixed it and confirmed that it builds and boots under UML.  Will
do more testing, but please take a look and confirm that it looks good.

Amit, we should probably get you access to repo.or.cz so you can
update the patches yourself.  My normal process is to transfer the
patches into git using the 'guilt' tool, and then start doing test
builds from there.  After I fix up the patches and do whatever is
necessary so they build, I copy them back into the ext4-patch-queue
directly, and then do a git-diff to see what has changed, and make the
changes to the patches look sane.  Can you send me and/or mingming
your ssh key, and we can give you push access to repo.or.cz?

We've missed the -rc1 merge window, so the goal should be to make sure
that everything in the series file before the "unstable patches" is
ready for merging.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-18 21:36                                         ` Theodore Tso
@ 2007-05-18 23:10                                           ` Mingming Cao
  2007-05-20 12:39                                             ` Dave Kleikamp
  0 siblings, 1 reply; 340+ messages in thread
From: Mingming Cao @ 2007-05-18 23:10 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Amit K. Arora, linux-ext4

On Fri, 2007-05-18 at 17:36 -0400, Theodore Tso wrote:
> On Tue, May 15, 2007 at 06:53:53PM +0530, Amit K. Arora wrote:
> > I will rebase it to 2.6.22-rc1 and repost the patches soon.
> > Thanks!
> 
> I've rebased to 2.6.22-rc1 and put it in the ext4-patch-queue.
>
> Mingming had rebased your previous (take3) set to 2.6.22-rc1, but
> apparently the series file was corrupted, so it referenced an
> incorrect patch filename, and the patch series didn't apply cleanly.
> I've fixed it and confirmed that it builds and boots under UML.  Will
> do more testing, but please take a look and confirm that it looks good.
> 

Thanks Ted. I am not sure how the series file got corrupted, but I am glad
that you caught that and updated it with the fallocate patches. :-)

We don't need the ext4-fallocate-1b-fallocate_inode_op_fix.patch, as Amit
fixed the ext4_fallocate() return value type to match VFS fallocate() in
take 4, patch 5/6. I will update the series to reflect this and run the
tests.

I think Kalpak's patch to remove the 32000 subdirectory limit can be added
to the ext4 patch queue as well. Agreed?

> Amit, we should probably get you access to repo.or.cz so you can
> update the patches yourself.  My normal process is to transfer the
> patches into git using the 'guilt' tool, and then start doing test
> builds from there.  After I fix up the patches and do whatever is
> necessary so they build, I copy them back into the ext4-patch-queue
> directly, and then do a git-diff to see what has changed, and make the
> changes to the patches look sane.  Can you send me and/or mingming
> your ssh key, and we can give you push access to repo.or.cz?
> 

I am not sure Amit can respond to this before he leaves for vacation (from
May 19, for 10 days).

I will check the fallocate patches and run the automated tests.

> We've missed the -rc1 merge window, so the goal should be to make sure
> that everything in the series file before the "unstable patches" is
> ready for merging.
> 
I tend to agree.  But there are some bug-fix or mount-option patches
that we could try to target for rc2. What do you think?


# New patch to fix whitespace before applying new patches
whitespace.patch
 
#New patch to remove unnecessary exported symbols
ext4_remove_exported_symbles.patch
 
# New patch to add mount option to turn off extents
ext4_noextent_mount_opt.patch
# Now Turn on extents feature by default
ext4_extents_on_by_default.patch
 
#New patch to propagate inode flags
ext4-propagate_flags.patch
 
#New patch to add extent sanity checks
ext4-extent-sanity-checks.patch
 
#New patch to free blocks when failed to insert an extent
ext4-free-blocks-on-insert-extent-failure.patch

Cheers,
Mingming

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE4] fallocate system call
  2007-05-17 14:11                                 ` Amit K. Arora
  (?)
  (?)
@ 2007-05-19  6:44                                 ` Andrew Morton
  2007-05-21  5:24                                     ` Mingming Cao
  -1 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-05-19  6:44 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: torvalds, linux-fsdevel, linux-kernel, linux-ext4, xfs, suparna, cmm

On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> fallocate() is a new system call being proposed here which will allow
> applications to preallocate space to any file(s) in a file system.

I merged the first three patches into -mm, thanks.

All the system call numbers got changed due to recent additions.  They
may change in the future, too - nothing is stable until the code lands
in mainline.

I didn't merge any of the ext4 changes as they appear to be in Ted's
devel tree.  Although I didn't check that they are 100% the same in 
that tree.

What's the plan to get some ext4 updates into mainline, btw?  Things
seem to be rather gradual.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-18 23:10                                           ` Mingming Cao
@ 2007-05-20 12:39                                             ` Dave Kleikamp
  2007-05-21  5:38                                               ` Theodore Tso
  0 siblings, 1 reply; 340+ messages in thread
From: Dave Kleikamp @ 2007-05-20 12:39 UTC (permalink / raw)
  To: cmm; +Cc: Theodore Tso, Amit K. Arora, linux-ext4

On Fri, 2007-05-18 at 16:10 -0700, Mingming Cao wrote:
> On Fri, 2007-05-18 at 17:36 -0400, Theodore Tso wrote:

> > We've missed the -rc1 merge window, so the goal should be to make sure
> > that everything in the series file before the "unstable patches" is
> > ready for merging.
> > 
> I tend to agree.  But there are some bug-fix type or mount option
> patches that can try to target for rc2, what do you think?

I agree with Mingming.  There's no reason for these patches not to be in
mainline.  I am curious why the fallocate patches were put at the top of
the series file in the first place.  The older patches shouldn't be held
up by fallocate (which should wait until the next merge window).

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE4] fallocate system call
  2007-05-19  6:44                                 ` [PATCH 0/6][TAKE4] fallocate system call Andrew Morton
@ 2007-05-21  5:24                                     ` Mingming Cao
  0 siblings, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-05-21  5:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, torvalds, linux-fsdevel, linux-kernel, linux-ext4,
	xfs, suparna

On Fri, 2007-05-18 at 23:44 -0700, Andrew Morton wrote:
> On Thu, 17 May 2007 19:41:15 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > fallocate() is a new system call being proposed here which will allow
> > applications to preallocate space to any file(s) in a file system.
> 
> I merged the first three patches into -mm, thanks.
> 
> All the system call numbers got changed due to recent additions.  They
> may change in the future, too - nothing is stable until the code lands
> in mainline.
> 
In case you haven't noticed, the ia64 fallocate() patch that comes with
Amit's take 4 fallocate patch series (3/6) is missing a one-line change,
and thus fails to compile on ia64.

Here is the updated one. Patch tested on ia64 (compile and fsx).

fallocate() on ia64

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 arch/ia64/kernel/entry.S  |    1 +
 include/asm-ia64/unistd.h |    3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S	2007-05-18 16:30:16.000000000 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S	2007-05-18 16:32:45.000000000 -0700
@@ -1585,5 +1585,6 @@
 	data8 sys_getcpu
 	data8 sys_epoll_pwait			// 1305
 	data8 sys_utimensat
+	data8 sys_fallocate
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h	2007-05-18 16:30:16.000000000 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h	2007-05-18 17:34:58.000000000 -0700
@@ -296,11 +296,12 @@
 #define __NR_getcpu			1304
 #define __NR_epoll_pwait		1305
 #define __NR_utimensat			1306
+#define __NR_fallocate			1307
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls			283 /* length of syscall table */
+#define NR_syscalls			285 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
 #define __ARCH_WANT_SYS_RT_SIGSUSPEND


> I didn't merge any of the ext4 changes as they appear to be in Ted's
> devel tree.  Although I didn't check that they are 100% the same in 
> that tree.
> 
Since both Amit and Ted are traveling, I will jump in...

Most likely they are not the same. What is in Ted's devel tree is the
"take 2" patches.

I have incorporated the take 4 patches into the backing ext4 patch git tree
here:
http://repo.or.cz/w/ext4-patch-queue.git

I have tested this patch series on ia64, ppc64, x86 and x86_64. I am not
sure if Ted has had a chance to update his ext4 git tree from this patch
queue git tree yet.

> What's the plan to get some ext4 updates into mainline, btw?  Things
> seem to be rather gradual.


The last time Ted and I discussed this, we agreed that the fallocate patches
should go into mainline. Actually, most patches listed before the "unstable
patches" marker can go into mainline, especially the following patches
(which include a few bug fixes):

# New patch to fix whitespace before applying new patches
whitespace.patch
 
#New patch to remove unnecessary exported symbols
ext4_remove_exported_symbles.patch
 
# New patch to add mount option to turn off extents
ext4_noextent_mount_opt.patch
# Now Turn on extents feature by default
ext4_extents_on_by_default.patch
 
#New patch to propagate inode flags
ext4-propagate_flags.patch
 
#New patch to add extent sanity checks
ext4-extent-sanity-checks.patch
 
#New patch to free blocks when failed to insert an extent
ext4-free-blocks-on-insert-extent-failure.patch

We have already missed the rc1 window, but if possible, I would like to see
the ext4 fallocate patches and the patches above in mainline 2.6.22. The
nanosecond timestamp patch is probably good to go as well.

Regards,
Mingming


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc
  2007-05-20 12:39                                             ` Dave Kleikamp
@ 2007-05-21  5:38                                               ` Theodore Tso
  0 siblings, 0 replies; 340+ messages in thread
From: Theodore Tso @ 2007-05-21  5:38 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: cmm, Amit K. Arora, linux-ext4

On Sun, May 20, 2007 at 07:39:32AM -0500, Dave Kleikamp wrote:
> On Fri, 2007-05-18 at 16:10 -0700, Mingming Cao wrote:
> > On Fri, 2007-05-18 at 17:36 -0400, Theodore Tso wrote:
> 
> > > We've missed the -rc1 merge window, so the goal should be to make sure
> > > that everything in the series file before the "unstable patches" is
> > > ready for merging.
> > > 
> > I tend to agree.  But there are some bug-fix type or mount option
> > patches that can try to target for rc2, what do you think?
> 
> I agree with Mingming.  There's no reason for these patches not to be in
> mainline.  I am curious why the fallocate patches were put at the top of
> the series file in the first place.  The older patches shouldn't be held
> up by fallocate (which should wait until the next merge window).

I've rebased the ext4 patch queue for 2.6.22-rc2, and moved the
obvious bug fixes to the top of the queue.  There's one patch which I
missed (ext4-free-blocks-on-insert-extent-failure.patch) which is also
a bug fix and should be moved up.

It's true that some of the older patches are below fallocate in the
queue, but they are still new features that probably shouldn't be
pushed at this point.  But yes, I agree that the bug fixes should be
pushed to Linus before 2.6.22 ships.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-05-12  8:01                                           ` David Chinner
@ 2007-06-12  6:16                                             ` Amit K. Arora
  2007-06-12  8:11                                               ` David Chinner
  2007-06-13 23:52                                               ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-12  6:16 UTC (permalink / raw)
  To: David Chinner
  Cc: Suparna Bhattacharya, torvalds, akpm, linux-fsdevel,
	linux-kernel, linux-ext4, xfs, cmm

On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> On Fri, May 11, 2007 at 04:33:01PM +0530, Suparna Bhattacharya wrote:
> > On Fri, May 11, 2007 at 08:39:50AM +1000, David Chinner wrote:
> > > All I'm really interested in right now is that the fallocate
> > > _interface_ can be used as a *complete replacement* for the
> > > pre-existing XFS-specific ioctls that are already used by
> > > applications.  What ext4 can or can't do right now is irrelevant to
> > > this discussion - the interface definition needs to take priority
> > > over implementation....
> > 
> > Would you like to write up an interface definition description (likely
> > man page) and post it for review, possibly with a mention of apps using
> > it today ?
> 
> Yeah, I started doing that yesterday as i figured it was the only way
> to cut the discussion short....
> 
> > One reason for introducing the mode parameter was to allow the interface to
> > evolve incrementally as more options / semantic questions are proposed, so
> > that we don't have to make all the decisions right now. 
> > So it would be good to start with a *minimal* definition, even just one mode.
> > The rest could follow as subsequent patches, each being reviewed and debated
> > separately. Otherwise this discussion can drag on for a long time.
> 
> Minimal definition to replace what applications use on XFS and to
> support posix_fallocate are the three that have been mentioned so
> far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> all in a man page...

Hi Dave,

Did you get time to write the above man page ? It will help to push
further patches in time (eg. for FA_PREALLOCATE mode).

The idea I had was to push the patch with the bare minimum functionality
(FA_ALLOCATE and FA_DEALLOCATE modes) and, in parallel, finalize the other
new mode(s) based on the man page you planned to provide.

Thanks!
--
Regards,
Amit Arora

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-12  6:16                                             ` Amit K. Arora
@ 2007-06-12  8:11                                               ` David Chinner
  2007-06-13 23:52                                               ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-06-12  8:11 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: David Chinner, Suparna Bhattacharya, torvalds, akpm,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> On Sat, May 12, 2007 at 06:01:57PM +1000, David Chinner wrote:
> > Minimal definition to replace what applications use on XFS and to
> > support posix_fallocate are the three that have been mentioned so
> > far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them
> > all in a man page...
> 
> Hi Dave,
> 
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).

No, I didn't. Instead of working on new preallocation stuff, I've
been spending all my time fixing bugs found by new and interesting
(ab)uses of preallocation and hole punching.

> The idea I had was to push the patch with bare minimum functionality
> (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other
> new mode(s) based on the man page you planned to provide.

Push them. I'll just make XFS work with whatever is provided.
Is there a test harness for the syscall yet?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-12  6:16                                             ` Amit K. Arora
  2007-06-12  8:11                                               ` David Chinner
@ 2007-06-13 23:52                                               ` David Chinner
  2007-06-14  9:14                                                 ` Andreas Dilger
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-13 23:52 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: David Chinner, Suparna Bhattacharya, torvalds, akpm,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

[-- Attachment #1: Type: text/plain, Size: 339 bytes --]

On Tue, Jun 12, 2007 at 11:46:52AM +0530, Amit K. Arora wrote:
> Did you get time to write the above man page ? It will help to push
> further patches in time (eg. for FA_PREALLOCATE mode).

First pass is attached.

`nroff -man fallocate.2 | less` to view.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

[-- Attachment #2: fallocate.2 --]
[-- Type: text/plain, Size: 2552 bytes --]

.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include <sys/syscall.h>
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);"
.fi
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are three modes:
.TP
.B FA_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FA_ALLOCATE
closely resembles
.B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
.B FA_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FA_PREALLOCATE
provides the same functionality as
.B FA_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.TP
.B FA_DEALLOCATE
removes the underlying disk space within the given range. The disk space
shall be removed regardless of its contents so both allocated space
from
.B FA_ALLOCATE
and
.B FA_PREALLOCATE
as well as from
.B write(3)
will be removed.
.B FA_DEALLOCATE
shall never remove disk blocks outside the range specified.
.B FA_DEALLOCATE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file),
.B ftruncate64(3)
should be used.

.SH "RETURN VALUE"
.BR fallocate()
returns zero on success, or an error number on failure.
Note that
.IR errno
is not set.
.SH "ERRORS"
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.I offset+len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
or
.I len
was less than 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd.
.TP
.B ESPIPE
.I fd
refers to a pipe of file descriptor.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.SH AVAILABILITY
The
.BR fallocate ()
system call is available since 2.6.XX
.SH "SEE ALSO"
.BR syscall (2),
.BR posix_fadvise (3)
.BR ftruncate (3)
^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-13 23:52                                               ` David Chinner
@ 2007-06-14  9:14                                                 ` Andreas Dilger
  2007-06-14 12:04                                                   ` David Chinner
  2007-06-30 10:14                                                   ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Christoph Hellwig
  0 siblings, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-14  9:14 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, Suparna Bhattacharya, torvalds, akpm,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

On Jun 14, 2007  09:52 +1000, David Chinner wrote:
> B FA_PREALLOCATE
> provides the same functionality as
> B FA_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising
> append workloads.
> TP
> B FA_DEALLOCATE
> removes the underlying disk space with the given range. The disk space
> shall be removed regardless of it's contents so both allocated space
> from
> B FA_ALLOCATE
> and
> B FA_PREALLOCATE
> as well as from
> B write(3)
> will be removed.
> B FA_DEALLOCATE
> shall never remove disk blocks outside the range specified.

So this is essentially the same as "punch".  There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
end.

> B FA_DEALLOCATE
> shall never change the file size. If changing the file size
> is required when deallocating blocks from an offset to end
> of file (or beyond end of file) is required,
> B ftuncate64(3)
> should be used.

This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions.  How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE?  Say, FA_PUNCH (which
would be as you describe here: it deletes all data in the specified
range, changing the file size if it overlaps EOF), and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?

We might also consider making @mode be a mask instead of an enumeration:

FA_FL_DEALLOC	0x01 (default allocate)
FA_FL_KEEP_SIZE	0x02 (default extend/shrink size)
FA_FL_DEL_DATA	0x04 (default keep written data on DEALLOC)

We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.
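
As a rough sketch of what treating @mode as a mask could look like, the
flag values below are the ones listed above, while the fa_request
structure, the fa_decode_mode() helper and the rule that DEL_DATA requires
DEALLOC are hypothetical and only serve the illustration:

```c
#include <stdbool.h>

/* Flag values as listed in the proposal above. */
#define FA_FL_DEALLOC   0x01 /* default: allocate */
#define FA_FL_KEEP_SIZE 0x02 /* default: extend/shrink size */
#define FA_FL_DEL_DATA  0x04 /* default: keep written data on DEALLOC */

/* Hypothetical decoded form of a mode mask. */
struct fa_request {
	bool deallocate;
	bool keep_size;
	bool delete_data;
};

/* Returns false for unknown bits or a combination this sketch rejects. */
bool fa_decode_mode(unsigned int mode, struct fa_request *req)
{
	const unsigned int known =
		FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA;

	if (mode & ~known)
		return false;
	/* Assumption: DEL_DATA is only meaningful together with DEALLOC. */
	if ((mode & FA_FL_DEL_DATA) && !(mode & FA_FL_DEALLOC))
		return false;

	req->deallocate  = (mode & FA_FL_DEALLOC) != 0;
	req->keep_size   = (mode & FA_FL_KEEP_SIZE) != 0;
	req->delete_data = (mode & FA_FL_DEL_DATA) != 0;
	return true;
}
```

A mask lets a filesystem reject combinations it does not support by
testing individual bits, rather than enumerating every named mode.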

I suppose it might be a bit late in the game to add a "goal"
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS?  The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP).  I guess the one potential limitation here is the
number of function parameters on some architectures.

> B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> IR fd.

Should probably say whether space is removed on failure or not.  In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-14  9:14                                                 ` Andreas Dilger
@ 2007-06-14 12:04                                                   ` David Chinner
  2007-06-14 19:33                                                     ` Andreas Dilger
  2007-06-30 10:14                                                   ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Christoph Hellwig
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-14 12:04 UTC (permalink / raw)
  To: David Chinner, Amit K. Arora, Suparna Bhattacharya, torvalds,
	akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> On Jun 14, 2007  09:52 +1000, David Chinner wrote:
> > B FA_PREALLOCATE
> > provides the same functionality as
> > B FA_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> > append workloads.
> > TP
> > B FA_DEALLOCATE
> > removes the underlying disk space with the given range. The disk space
> > shall be removed regardless of it's contents so both allocated space
> > from
> > B FA_ALLOCATE
> > and
> > B FA_PREALLOCATE
> > as well as from
> > B write(3)
> > will be removed.
> > B FA_DEALLOCATE
> > shall never remove disk blocks outside the range specified.
> 
> So this is essentially the same as "punch".

Depends on your definition of "punch".

> There doesn't seem to be
> a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> end.

ftruncate()

> > B FA_DEALLOCATE
> > shall never change the file size. If changing the file size
> > is required when deallocating blocks from an offset to end
> > of file (or beyond end of file),
> > B ftruncate64(3)
> > should be used.
> 
> This also seems to be a bit of a wart, since it isn't a natural converse
> of either of the above functions.  How about having two modes,
> similar to FA_ALLOCATE and FA_PREALLOCATE?

<shrug>

whatever.

> Say, FA_PUNCH (which
> would be as you describe here - deletes all data in the specified
> range changing the file size if it overlaps EOF,

Punch means different things to different people. To me (and probably
most XFS aware ppl) punch implies no change to the file size.

i.e. anyone currently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.

> and FA_DEALLOCATE,
> which only deallocates unused FA_{PRE,}ALLOCATE space?

That's an "unwritten-to-hole" extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.

Anyway, because we can't agree on a single pair of flags:

	FA_ALLOCATE        == posix_fallocate()
	FA_DEALLOCATE      == unwritten-to-hole ???
	FA_RESV_SPACE      == XFS_IOC_RESVSP64
	FA_UNRESV_SPACE    == XFS_IOC_UNRESVSP64

> We might also consider making @mode be a mask instead of an enumeration:
> 
> FA_FL_DEALLOC	0x01 (default allocate)
> FA_FL_KEEP_SIZE	0x02 (default extend/shrink size)
> FA_FL_DEL_DATA	0x04 (default keep written data on DEALLOC)

i.e:

#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.....

> The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP).  I guess the one potential limitation here is the
> number of function parameters on some architectures.

To be useful it needs to be a __u64.

> > B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > IR fd.
> 
> Should probably say whether space is removed on failure or not.  In

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-14 12:04                                                   ` David Chinner
@ 2007-06-14 19:33                                                     ` Andreas Dilger
  2007-06-25 13:28                                                         ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-06-14 19:33 UTC (permalink / raw)
  To: David Chinner
  Cc: Amit K. Arora, Suparna Bhattacharya, torvalds, akpm,
	linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

On Jun 14, 2007  22:04 +1000, David Chinner wrote:
> On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> > > B FA_DEALLOCATE
> > > removes the underlying disk space with the given range. The disk space
> > > shall be removed regardless of its contents so both allocated space
> > > from
> > > B FA_ALLOCATE
> > > and
> > > B FA_PREALLOCATE
> > > as well as from
> > > B write(3)
> > > will be removed.
> > > B FA_DEALLOCATE
> > > shall never remove disk blocks outside the range specified.
> > 
> > So this is essentially the same as "punch".
> 
> Depends on your definition of "punch".
> 
> > There doesn't seem to be
> > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> > end.
> 
> ftruncate()

No, that will delete written data also.  What I'm thinking is in cases
where an application does fallocate() to reserve a lot of space, and
when the application is finished it wants to unreserve any unused space.

> > > B FA_DEALLOCATE
> > > shall never change the file size. If changing the file size
> > > is required when deallocating blocks from an offset to end
> > > of file (or beyond end of file),
> > > B ftruncate64(3)
> > > should be used.
> > 
> > This also seems to be a bit of a wart, since it isn't a natural converse
> > of either of the above functions.  How about having two modes,
> > similar to FA_ALLOCATE and FA_PREALLOCATE?
> 
> <shrug>
> 
> whatever.
> 
> > Say, FA_PUNCH (which
> > would be as you describe here - deletes all data in the specified
> > range changing the file size if it overlaps EOF,
> 
> Punch means different things to different people. To me (and probably
> most XFS aware ppl) punch implies no change to the file size.

If "punch" does not change the file size, how is it possible to determine
the end of the actual written data?  Say you have a file with records
in it, and these records are cancelled as they are processed (e.g. a
journal of sorts).  One usage model for punch() that we had in the past
is to punch out each record after it finishes processing, so that it will
not be re-processed after a crash.  If the file size doesn't change with
punch then there is no way to know when the last record is hit and the
rest of the file needs to be scanned.

> i.e. anyone currently using XFS_IOC_UNRESVSP will expect punching
> holes to leave the file size unchanged. This is the behaviour I
> described for FA_DEALLOCATE.
> 
> > and FA_DEALLOCATE,
> > which only deallocates unused FA_{PRE,}ALLOCATE space?
> 
> That's an "unwritten-to-hole" extent conversion. Is that really
> useful for anything? That's easily implemented with FIEMAP
> and FA_DEALLOCATE.

But why force the application to do this instead of making the
fallocate API sensible and allowing it to be done directly?

> Anyway, because we can't agree on a single pair of flags:
> 
> 	FA_ALLOCATE        == posix_fallocate()
> 	FA_DEALLOCATE      == unwritten-to-hole ???

I'd think this makes sense, being natural opposites of each other.
FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE
shouldn't erase existing data.  If FA_ALLOCATE extends the file size,
then FA_DEALLOCATE should shrink it if there is no data at the end.

> 	FA_RESV_SPACE      == XFS_IOC_RESVSP64
> 	FA_UNRESV_SPACE    == XFS_IOC_UNRESVSP64

> > We might also consider making @mode be a mask instead of an enumeration:
> > 
> > FA_FL_DEALLOC	0x01 (default allocate)
> > FA_FL_KEEP_SIZE	0x02 (default extend/shrink size)
> > FA_FL_DEL_DATA	0x04 (default keep written data on DEALLOC)
> 
> #define FA_ALLOCATE	0
> #define FA_DEALLOCATE	FA_FL_DEALLOC
> #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64
clear at least.  The benefit is that it would also be possible (I'm
not necessarily advocating this as a flag, just an example) to have
semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while
preallocating) with:

#define FA_ZERO_SPACE    FA_FL_DEL_DATA

or whatever semantics the caller actually wants, instead of restricting
them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE
(whatever it is we decide on in the end).
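
To make the mask idea concrete, here is a minimal sketch in C. The flag
values are the ones proposed in this thread (not a merged kernel ABI), and
the fa_*() helper names are hypothetical, added only for illustration.

```c
/* Flag bits as proposed in this thread; values are not a merged ABI. */
#define FA_FL_DEALLOC   0x01 /* default: allocate */
#define FA_FL_KEEP_SIZE 0x02 /* default: extend/shrink size */
#define FA_FL_DEL_DATA  0x04 /* default: keep written data on DEALLOC */

/* Named modes composed from the bits (parenthesised so they nest safely). */
#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Hypothetical helpers: decode what a given mode value requests. */
static int fa_is_dealloc(int mode)
{
	return (mode & FA_FL_DEALLOC) != 0;
}

static int fa_keeps_size(int mode)
{
	return (mode & FA_FL_KEEP_SIZE) != 0;
}

static int fa_deletes_data(int mode)
{
	return (mode & FA_FL_DEL_DATA) != 0;
}
```

With a mask, a caller can ask for any combination of the three behaviours
instead of being limited to the four named modes.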

> > > B ENOSPC
> > > There is not enough space left on the device containing the file
> > > referred to by
> > > IR fd.
> > 
> > Should probably say whether space is removed on failure or not.  In
> 
> Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
> allocated was freed back up. That way the error handling in the allocate
> functions is much simpler (i.e. no need to undo there).

Hmm, another flag?  FA_FL_FREE_ENOSPC?  I can imagine applications like
PVRs to want to preallocate, say, an estimated 30 min of space for a show
but if they only get 25 min of space returned they know some cleanup is
in order (which can be done asynchronously while the show is filling the
first 25 min of preallocated space).  Otherwise, they have to loop in
userspace trying decreasing preallocations until they fit, or starting
small and incrementally preallocating space until they get an error.
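
For reference, the userspace retry loop described above might look roughly
like this. It is only a sketch: it uses the existing posix_fallocate(3), the
helper name prealloc_best_effort() is hypothetical, and halving the request
on each ENOSPC is an assumed policy, not something proposed in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try to preallocate `want` bytes from offset 0,
 * halving the request on ENOSPC until it fits or drops below `min`.
 * Returns the number of bytes reserved, or -1 with errno set.
 */
static off_t prealloc_best_effort(int fd, off_t want, off_t min)
{
	off_t len = want;

	while (len > 0 && len >= min) {
		/* posix_fallocate() returns an errno value, not -1. */
		int err = posix_fallocate(fd, 0, len);

		if (err == 0)
			return len;
		if (err != ENOSPC) {
			errno = err;
			return -1;
		}
		len /= 2; /* shrink the request and retry */
	}
	errno = ENOSPC;
	return -1;
}
```

Each failed attempt costs a full allocation pass, which is the overhead a
partial-allocation return would avoid.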

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 0/6][TAKE5] fallocate system call
  2007-06-14 19:33                                                     ` Andreas Dilger
@ 2007-06-25 13:28                                                         ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:28 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

N O T E: 
-------
1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
   of ext4 patch queue git tree hosted by Ted.
2) The above new patches (4/7 and 7/7) are based on the discussion
   between Andreas Dilger and David Chinner on the mode argument,
   after the latter posted a man page on fallocate.
3) All of these patches are based on 2.6.22-rc4 kernel and apply to
   2.6.22-rc5 too (with some successful hunks, though - since the
   ext4 patch queue git tree has some other patches as well before
   fallocate patches in the patch series).

Changelog:
---------
Changes from Take4 to Take5:
	1) New Patch 4/7 implements new flags and values for mode
	   argument of fallocate system call.
	2) New Patch 7/7 implements 2 (out of 4) modes in ext4.
	   Implementation of rest of the (two) modes is yet to be done.
	3) Updated the interface description below to mention new modes
	   being supported.
	4) Removed "extent overlap check" bugfix (patch 4/6 in TAKE4),
	   since it is now part of mainline.
	5) Corrected format of couple of multi-line comments, which got
	   missed in earlier take.

Changes from Take2 to Take3:
        1) Return type is now described in the interface description
           above.
        2) Patches rebased to 2.6.22-rc1 kernel.

** Each post will have an individual changelog for a particular patch.


Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
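
The expected dispatch can be sketched as below. This is not the glibc
implementation: the alloc_fn type, fallocate_with_fallback() and the stub_*()
functions are hypothetical stand-ins, and both strategies are assumed to
return 0 or a positive errno value, as posix_fallocate() does.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical signature shared by the fast and slow allocation paths. */
typedef int (*alloc_fn)(int fd, off_t offset, off_t len);

/*
 * Try the fast (in-kernel) path first. Only "not implemented" style
 * errors select the slow zero-writing path; real failures such as
 * ENOSPC are passed straight back to the caller.
 */
static int fallocate_with_fallback(int fd, off_t offset, off_t len,
				   alloc_fn fast, alloc_fn slow)
{
	int err = fast(fd, offset, len);

	if (err == ENOSYS || err == EOPNOTSUPP)
		return slow(fd, offset, len);
	return err;
}

/* Stand-in strategies, used only to exercise the dispatch logic. */
static int stub_ok(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return 0;
}

static int stub_unsupported(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSYS;
}

static int stub_nospace(int fd, off_t offset, off_t len)
{
	(void)fd; (void)offset; (void)len;
	return ENOSPC;
}
```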

Interface:
---------
The system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports four modes - FA_ALLOCATE, FA_DEALLOCATE, 
  FA_RESV_SPACE and FA_UNRESV_SPACE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
    a given file (specified by fd). This mode changes the file size if
    the preallocation is done beyond the EOF. It also updates the
    ctime in the inode of the corresponding file, marking a
    successful allocation.
  FA_RESV_SPACE: This mode is the same as FA_ALLOCATE, except that the
    file size will not be changed.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
    previously preallocated blocks. This also may change the file size
    and the ctime/mtime. This is the reverse of FA_ALLOCATE mode.
  FA_UNRESV_SPACE: This mode is similar to FA_DEALLOCATE. The
    difference is that the file size is not changed and the data is
    also deleted.
* New modes might get added in future.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).

RETURN VALUE: The system call returns 0 on success and an error code on
failure. This keeps the semantics the same as those of
posix_fallocate().

sys_fallocate() on s390:
-----------------------
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.
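
The difficulty comes from 32-bit ABIs passing each 64-bit loff_t argument as
two 32-bit halves. A minimal sketch of recombining and splitting such a
value, in the same spirit as compat_sys_fallocate() in patch 1/7; the helper
names join_hi_lo() and split_hi_lo() are hypothetical:

```c
#include <stdint.h>

/* Rebuild a 64-bit offset from the two 32-bit halves a 32-bit ABI passes. */
static int64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
	/* Widen before shifting; the low half must not be sign-extended. */
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Split a 64-bit offset back into its high and low 32-bit halves. */
static void split_hi_lo(int64_t value, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)((uint64_t)value >> 32);
	*lo = (uint32_t)((uint64_t)value & 0xffffffffu);
}
```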

Known Problem:
-------------
Mmapped writes into uninitialized extents are a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is talk of ->fault() replacing ->page_mkwrite(), and a
generic block_page_mkwrite() implementation has already been posted, we
can implement this at a later time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-----
1> Implementation on other architectures (other than i386, x86_64,
ia64, ppc64 and s390(x)).
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


The following patches follow:
Patch 1/7 : fallocate() implementation on i386, x86_64 and powerpc
Patch 2/7 : fallocate() on s390(x)
Patch 3/7 : fallocate() on ia64
Patch 4/7 : support new modes in fallocate
Patch 5/7 : ext4: fallocate support in ext4
Patch 6/7 : ext4: write support for preallocated blocks
Patch 7/7 : ext4: support new modes

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc
  2007-06-25 13:28                                                         ` Amit K. Arora
  (?)
@ 2007-06-25 13:40                                                         ` Amit K. Arora
  2007-06-26 19:38                                                           ` Heiko Carstens
  -1 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:40 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
---------
Changes from Take3 to Take4:
 1) Do not update c/mtime. Let each filesystem update ctime (update of
    mtime will not be required for allocation since we touch only
    metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
 1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
    posix_fallocate should return EINVAL for len <= 0).
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
    they want to support preallocation to directories or not).
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)
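
The range checks listed above (EINVAL for len <= 0, wrap through zero) can
be sketched in userspace as follows. This is an illustration rather than the
kernel code: check_fallocate_range() is a hypothetical name, maxbytes stands
in for the file system's s_maxbytes and is assumed to be non-negative, and
the comparison is arranged so that offset + len is never computed.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the argument checks.
 * Returns 0 if the range is acceptable, or a negative errno value.
 */
static int check_fallocate_range(int64_t offset, int64_t len,
				 int64_t maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Equivalent to offset + len > maxbytes, written without the
	 * addition so the sum cannot wrap past zero.
	 */
	if (len > maxbytes - offset)
		return -EFBIG;

	return 0;
}
```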


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys_fallocate
 ia32_syscall_end:
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *	  FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ *	 support it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
+/*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22-rc4/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-i386/unistd.h
+++ linux-2.6.22-rc4/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
 #define __NR_signalfd		321
 #define __NR_timerfd		322
 #define __NR_eventfd		323
+#define __NR_fallocate		324
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 325
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc4/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22-rc4/include/asm-powerpc/systbl.h
@@ -308,6 +308,7 @@ COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
 COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
 COMPAT_SYS_SPU(signalfd)
 COMPAT_SYS_SPU(timerfd)
 SYSCALL_SPU(eventfd)
Index: linux-2.6.22-rc4/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22-rc4/include/asm-powerpc/unistd.h
@@ -330,10 +330,11 @@
 #define __NR_signalfd		305
 #define __NR_timerfd		306
 #define __NR_eventfd		307
+#define __NR_fallocate		308
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		308
+#define __NR_syscalls		309
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.22-rc4/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22-rc4/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
 __SYSCALL(__NR_timerfd, sys_timerfd)
 #define __NR_eventfd		283
 __SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate		284
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -266,6 +266,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
+ *		  may request (pre)allocation of blocks.
+ * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
+ *		  the preallocated blocks.
+ */
+#define FA_ALLOCATE	0x1
+#define FA_DEALLOCATE	0x2
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1138,6 +1149,8 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 };
 
 struct seq_file;
Index: linux-2.6.22-rc4/include/linux/syscalls.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/syscalls.h
+++ linux-2.6.22-rc4/include/linux/syscalls.h
@@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, si
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 

^ permalink raw reply	[flat|nested] 340+ messages in thread

* [PATCH 2/7][TAKE5] fallocate() on s390(x)
  2007-06-25 13:28                                                         ` Amit K. Arora
  (?)
  (?)
@ 2007-06-25 13:42                                                         ` Amit K. Arora
  2007-06-26 15:15                                                           ` Heiko Carstens
  -1 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:42 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
        stm     %r6,%r7,28(%r15)        /* save %r6/%r7 on stack */
        cfi_offset (%r7, -68)
        cfi_offset (%r6, -72)
        lm      %r6,%r7,96(%r15)        /* load loff_t len from stack */
        svc     SYS_ify(fallocate)
        lm      %r6,%r7,28(%r15)        /* restore %r6/%r7 from stack */
        br      %r14
PSEUDO_END(__fallocate)


Here are his comments and his patch to the Linux kernel.

-------------
From: Martin Schwidefsky <schwidefsky@de.ibm.com>

This patch implements support for the fallocate system call on the s390(x)
platform. A wrapper is added to address the issue that the s390 ABI has
with the arguments of this system call.
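
As an illustration of the argument handling described above, here is a
minimal sketch of joining the two 32-bit halves of "len" into one 64-bit
value. It is not part of the patch: join_len() is a hypothetical helper
name, and it uses explicit shifts instead of the union in s390_fallocate(),
which relies on s390 being big-endian (the "high" member comes first in
memory and is therefore the most significant half).

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: combine the upper and
 * lower 32 bits of "len" into a signed 64-bit value (loff_t is 64 bits
 * wide). Shifts make the result independent of host byte order.
 */
static inline int64_t join_len(uint32_t len_high, uint32_t len_low)
{
	return (int64_t)(((uint64_t)len_high << 32) | (uint64_t)len_low);
}
```

On a big-endian machine this gives the same result as the union; on a
little-endian machine the union would swap the two halves, which is why
the shift form is the safer pattern to copy to other architectures.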


Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

Index: linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/compat_wrapper.S	2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S	2007-06-11 16:27:29.000000000 -0700
@@ -1683,6 +1683,16 @@
 	llgtr	%r3,%r3			# struct compat_timeval *
 	jg	compat_sys_utimes
 
+	.globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+	lgfr	%r2,%r2			# int
+	lgfr	%r3,%r3			# int
+	sllg    %r4,%r4,32		# get high word of 64bit loff_t
+	lr      %r4,%r5			# get low word of 64bit loff_t
+	sllg    %r5,%r6,32		# get high word of 64bit loff_t
+	l	%r5,164(%r15)		# get low word of 64bit loff_t
+	jg	sys_fallocate
+
 	.globl	compat_sys_utimensat_wrapper
 compat_sys_utimensat_wrapper:
 	llgfr	%r2,%r2			# unsigned int
Index: linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/sys_s390.c	2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c	2007-06-11 16:27:29.000000000 -0700
@@ -265,3 +265,32 @@
 		return -EFAULT;
 	return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+			       u32 len_high, u32 len_low)
+{
+	union {
+		u64 len;
+		struct {
+			u32 high;
+			u32 low;
+		};
+	} cv;
+	cv.high = len_high;
+	cv.low = len_low;
+	return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc4/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/syscalls.S	2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/syscalls.S	2007-06-11 16:27:29.000000000 -0700
@@ -322,6 +322,7 @@
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
 NI_SYSCALL							/* 314 sys_fallocate */
 SYSCALL(sys_utimensat,sys_utimensat,compat_sys_utimensat_wrapper)	/* 315 */
 SYSCALL(sys_signalfd,sys_signalfd,compat_sys_signalfd_wrapper)
Index: linux-2.6.22-rc4/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-s390/unistd.h	2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/include/asm-s390/unistd.h	2007-06-11 16:27:29.000000000 -0700
@@ -256,7 +256,8 @@
 #define __NR_signalfd		316
 #define __NR_timerfd		317
 #define __NR_eventfd		318
-#define NR_syscalls 319
+#define __NR_fallocate		319
+#define NR_syscalls 320
 
 /* 
  * There are some system calls that are not present on 64 bit, some


* [PATCH 3/7][TAKE5] fallocate() on ia64
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (2 preceding siblings ...)
  (?)
@ 2007-06-25 13:43                                                         ` Amit K. Arora
  -1 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:43 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

fallocate() on ia64

ia64 fallocate syscall support.

Signed-off-by: Dave Chinner <dgc@sgi.com>

Index: linux-2.6.22-rc4/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/ia64/kernel/entry.S	2007-06-11 17:22:15.000000000 -0700
+++ linux-2.6.22-rc4/arch/ia64/kernel/entry.S	2007-06-11 17:30:37.000000000 -0700
@@ -1588,5 +1588,6 @@
 	data8 sys_signalfd
 	data8 sys_timerfd
 	data8 sys_eventfd
+	data8 sys_fallocate			// 1310
 
 	.org sys_call_table + 8*NR_syscalls	// guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc4/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc4.orig/include/asm-ia64/unistd.h	2007-06-11 17:22:15.000000000 -0700
+++ linux-2.6.22-rc4/include/asm-ia64/unistd.h	2007-06-11 17:30:37.000000000 -0700
@@ -299,11 +299,12 @@
 #define __NR_signalfd			1307
 #define __NR_timerfd			1308
 #define __NR_eventfd			1309
+#define __NR_fallocate			1310
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls			286 /* length of syscall table */
+#define NR_syscalls			287 /* length of syscall table */
 
 /*
  * The following defines stop scripts/checksyscalls.sh from complaining about


* [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (3 preceding siblings ...)
  (?)
@ 2007-06-25 13:45                                                         ` Amit K. Arora
  2007-06-25 15:03                                                           ` Amit K. Arora
  2007-06-25 21:52                                                           ` Andreas Dilger
  -1 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

Implement new flags and values for mode argument.

This patch implements the new flags and values for the "mode" argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the fallocate man page proposed by
the latter.
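
To illustrate how the flag bits compose into the accepted mode values,
here is a minimal standalone sketch. The macro values are copied from this
patch; fa_mode_is_valid() is a hypothetical helper name that mirrors the
check added to sys_fallocate(), not a function that exists in the kernel.

```c
#include <stdbool.h>

/* Flag bits, as defined in this patch */
#define FA_FL_DEALLOC	0x01 /* default is allocate */
#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */

/* Named modes built from the flag bits, as defined in this patch */
#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/*
 * Hypothetical helper: accept exactly the four named modes. Other
 * combinations of the flag bits are rejected (sys_fallocate() returns
 * -EOPNOTSUPP for them).
 */
static bool fa_mode_is_valid(int mode)
{
	switch (mode) {
	case FA_ALLOCATE:
	case FA_DEALLOCATE:
	case FA_RESV_SPACE:
	case FA_UNRESV_SPACE:
		return true;
	default:
		return false;
	}
}
```

Note that only four of the eight possible flag combinations pass the
check. For example, FA_FL_DEALLOC | FA_FL_KEEP_SIZE without
FA_FL_DEL_DATA (value 0x03) is not a named mode and is refused.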

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22-rc4/include/linux/fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/fs.h
+++ linux-2.6.22-rc4/include/linux/fs.h
@@ -267,15 +267,16 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
 /*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
- *		  may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- *		  the preallocated blocks.
+ * sys_fallocate mode flags and values
  */
-#define FA_ALLOCATE	0x1
-#define FA_DEALLOCATE	0x2
+#define FA_FL_DEALLOC	0x01 /* default is allocate */
+#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
+#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
+
+#define FA_ALLOCATE	0
+#define FA_DEALLOCATE	FA_FL_DEALLOC
+#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
+#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
 
 #ifdef __KERNEL__
 
Index: linux-2.6.22-rc4/fs/open.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/open.c
+++ linux-2.6.22-rc4/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
  * sys_fallocate - preallocate blocks or free preallocated blocks
  * @fd: the file descriptor
  * @mode: mode specifies if fallocate should preallocate blocks OR free
- *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- *	  FA_DEALLOCATE modes are supported.
+ *	  (unallocate) preallocated blocks.
  * @offset: The offset within file, from where (un)allocation is being
  *	    requested. It should not have a negative value.
  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  *
  * This system call, depending on the mode, preallocates or unallocates blocks
  * for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FA_ALLOCATE and
+ * FA_RESV_SPACE modes, if the sys_fallocate()
  * system call succeeds, subsequent writes to the file in the given range
  * (specified by offset & len) should not fail - even if the file system
  * later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
  *
  * It is expected that the ->fallocate() inode operation implemented by the
  * individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
  *
  * Note: Incase the file system does not support preallocation,
  * posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
 
 	/* Return error if mode is not supported */
 	ret = -EOPNOTSUPP;
-	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+	if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
+		mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
 		goto out;
 
 	ret = -EBADF;


* [PATCH 5/7][TAKE5] ext4: fallocate support in ext4
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (4 preceding siblings ...)
  (?)
@ 2007-06-25 13:48                                                         ` Amit K. Arora
  -1 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.

The current implementation only supports preallocation for regular files
with extent maps (directories are not supported yet). Block-mapped files
are not supported by this patch.

Only FA_ALLOCATE mode is supported as of now. Supporting
FA_DEALLOCATE mode is a <ToDo> item.
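
For readers following the arithmetic, here is a minimal user-space sketch
of how a byte range (offset, len) maps to the block range that
ext4_fallocate() asks the allocator for. It follows the "block" and
"max_blocks" computation in the patch, but the names block_align(),
struct blk_range and fallocate_range() are hypothetical and exist only
for this illustration.

```c
#include <stdint.h>

/* Round "size" up to a multiple of the block size (1 << blkbits). */
static uint64_t block_align(uint64_t size, unsigned int blkbits)
{
	uint64_t bs = (uint64_t)1 << blkbits;

	return (size + bs - 1) & ~(bs - 1);
}

struct blk_range {
	uint64_t first;	/* first logical block touched by the request */
	uint64_t count;	/* number of blocks needed to cover it */
};

/*
 * Hypothetical helper: the first block is the one containing "offset";
 * the count runs up to the block containing the last requested byte.
 */
static struct blk_range fallocate_range(uint64_t offset, uint64_t len,
					unsigned int blkbits)
{
	struct blk_range r;

	r.first = offset >> blkbits;
	r.count = (block_align(offset + len, blkbits) >> blkbits) - r.first;
	return r;
}
```

With 4096-byte blocks (blkbits = 12), a request for 10000 bytes at offset
5000 starts in block 1 and ends in block 3, so three blocks are requested
even though the byte range is shorter than three full blocks.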

Changelog:
---------
Changes from Take3 to Take4:
 1) Changed ext4_fallocate() declaration and definition to return a
    "long" and not an "int", to match with ->fallocate() inode op.
 2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
 1) Patch rebased to 2.6.22-rc1 kernel version.
 2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
    functions.
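
As a standalone illustration of the encoding those inline functions
handle, here is a minimal sketch of keeping the "uninitialized" marker in
the top bit of a 16-bit length field. It works on host-order values and
leaves out the le16 conversions the kernel helpers perform; the function
names are hypothetical, chosen for this example only.

```c
#include <stdint.h>

#define UNINIT_BIT	0x8000u	/* MSB of ee_len marks "uninitialized" */
#define LEN_MASK	0x7FFFu	/* remaining 15 bits hold the real length */

/* Set the marker bit without disturbing the length bits. */
static uint16_t mark_uninitialized(uint16_t ee_len)
{
	return (uint16_t)(ee_len | UNINIT_BIT);
}

/* Non-zero if the marker bit is set. */
static int is_uninitialized(uint16_t ee_len)
{
	return (ee_len & UNINIT_BIT) != 0;
}

/* Length in blocks with the marker bit masked off. */
static unsigned int actual_len(uint16_t ee_len)
{
	return ee_len & LEN_MASK;
}
```

This is why the patch replaces direct reads of ee_len with
ext4_ext_get_actual_len(): reading the raw field of an uninitialized
extent would yield a length that is too large by 0x8000.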


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -316,7 +316,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -339,7 +339,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -455,7 +455,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -713,7 +713,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1133,7 +1133,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1142,14 +1154,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1171,7 +1183,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1218,8 +1230,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1227,14 +1240,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/*
+		 * ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1290,7 +1313,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 			   > le32_to_cpu(nearex->ee_block)) {
@@ -1303,7 +1326,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 					le32_to_cpu(newext->ee_block),
 					ext_pblock(newext),
-					le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1316,7 +1339,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1335,8 +1358,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1406,8 +1434,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1419,7 +1447,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1435,7 +1464,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 			cbex.ec_block = le32_to_cpu(ex->ee_block);
-			cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 			cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1508,15 +1537,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 				(unsigned long) le32_to_cpu(ex->ee_block),
-				(unsigned long) le16_to_cpu(ex->ee_len));
+				(unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-			    + le16_to_cpu(ex->ee_len)) {
+			+ ext4_ext_get_actual_len(ex)) {
 		lblock = le32_to_cpu(ex->ee_block)
-			 + le16_to_cpu(ex->ee_len);
+			+ ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 				(unsigned long) le32_to_cpu(ex->ee_block),
-				(unsigned long) le16_to_cpu(ex->ee_len),
+				(unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1646,12 +1675,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1665,12 +1694,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1678,12 +1707,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1698,6 +1727,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
 	unsigned short ex_ee_len;
+	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
 
 	/* the header must be checked already in ext4_ext_remove_space() */
@@ -1711,7 +1741,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1779,6 +1811,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1788,7 +1822,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2062,7 +2096,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 		unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2070,8 +2104,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2079,8 +2114,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err) {
 		/* free data blocks we just allocated */
@@ -2137,8 +2177,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2241,3 +2283,129 @@ int ext4_ext_writepage_trans_blocks(stru
 
 	return needed;
 }
+
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * tells fallocate to unallocate previously (pre)allocated blocks.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	ext4_fsblk_t nblocks = 0;
+	int ret = 0;
+	int ret2 = 0;
+	int retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/*
+	 * currently supporting (pre)allocate mode for extent-based
+	 * files _only_
+	 */
+	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return -EOPNOTSUPP;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+			- block;
+
+	/*
+	 * credits to insert 1 extent into extent tree + buffers to be able to
+	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 */
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		handle = ext4_journal_start(inode, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			break;
+		}
+
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		WARN_ON(!ret);
+		if (!ret) {
+			ext4_error(inode->i_sb, "ext4_fallocate",
+				   "ext4_ext_get_blocks returned 0! inode#%lu"
+				   ", block=%llu, max_blocks=%llu",
+				   inode->i_ino, block, max_blocks);
+			ret = -EIO;
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			break;
+		}
+		if (ret > 0) {
+			/* check wrap through sign-bit/zero here */
+			if ((block + ret) < 0 || (block + ret) < block) {
+				ret = -EIO;
+				ext4_mark_inode_dirty(handle, inode);
+				ret2 = ext4_journal_stop(handle);
+				break;
+			}
+			if (buffer_new(&map_bh) && ((block + ret) >
+			    (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+			    >> blkbits)))
+					nblocks = nblocks + ret;
+		}
+
+		/* Update ctime if new blocks get allocated */
+		if (nblocks) {
+			struct timespec now;
+			now = current_fs_time(inode->i_sb);
+			if (!timespec_equal(&inode->i_ctime, &now))
+				inode->i_ctime = now;
+		}
+
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret2)
+			break;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/*
+	 * Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if ((offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+			/*
+			 * if no error, we assume preallocation succeeded
+			 * completely
+			 */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+			/* Handle partial allocation scenario */
+			loff_t newsize;
+
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+
+	return ret > 0 ? ret2 : ret;
+}
+
Index: linux-2.6.22-rc4/fs/ext4/file.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/file.c
+++ linux-2.6.22-rc4/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.22-rc4/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits)		ALIGN((size), (1 << (blkbits)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -984,6 +990,8 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
@@ -188,6 +188,21 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
+{
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
+{
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
+{
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);


* [PATCH 6/7][TAKE5] ext4: write support for preallocated blocks
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (5 preceding siblings ...)
  (?)
@ 2007-06-25 13:49                                                         ` Amit K. Arora
  -1 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:49 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting such an extent into multiple (up to three) extents and merging
the new split extents with neighbouring ones, if possible.
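
As a rough illustration of the split described above, here is a minimal
userspace sketch. It is not the kernel code: struct piece and
split_uninit() are hypothetical names, and it assumes that iblock lies
inside the extent. It only models which logical ranges end up
initialized or uninitialized after a write of max_blocks blocks starting
at iblock.

```c
/* Hypothetical userspace model of one extent piece (not a kernel type). */
struct piece {
	unsigned int start;	/* first logical block */
	unsigned int len;	/* number of blocks */
	int uninit;		/* 1 = still uninitialized, 0 = initialized */
};

/*
 * Model the split of the uninitialized extent [ee_block, ee_block + ee_len)
 * for a write of max_blocks blocks starting at iblock.
 * Assumes ee_block <= iblock < ee_block + ee_len.
 * Fills out[] in logical order and returns the number of pieces (1, 2 or 3).
 */
static int split_uninit(unsigned int ee_block, unsigned int ee_len,
			unsigned int iblock, unsigned int max_blocks,
			struct piece out[3])
{
	/* blocks from iblock up to the end of the extent */
	unsigned int allocated = ee_len - (iblock - ee_block);
	int n = 0;

	/* left part, before the write, stays uninitialized */
	if (iblock > ee_block)
		out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };

	if (allocated > max_blocks) {
		/* write stops short of the extent end: a right part remains */
		out[n++] = (struct piece){ iblock, max_blocks, 0 };
		out[n++] = (struct piece){ iblock + max_blocks,
					   allocated - max_blocks, 1 };
	} else {
		/* write reaches (or passes) the end of the extent */
		out[n++] = (struct piece){ iblock, allocated, 0 };
	}
	return n;
}
```

For an extent covering blocks 100..149, a 10-block write at block 110
gives three pieces in this model, a write covering the whole extent
gives one, and a write at either end gives two.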

Changelog:
---------
Changes from Take3 to Take4:
 - no change -
Changes from Take2 to Take3:
 1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
    ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -1167,6 +1167,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1354,25 +1401,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2035,15 +2064,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in the middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/*
+	 * for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/*
+		 * The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/*
+	 * If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/*
+	 * New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/*
+		 * To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/*
+	 * Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2056,8 +2228,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (goal) {
 		if (goal == EXT4_EXT_CACHE_GAP) {
 			if (!create) {
-				/* block isn't allocated yet and
-				 * user doesn't want to allocate it */
+				/*
+				 * block isn't allocated yet and
+				 * user doesn't want to allocate it
+				 */
 				goto out2;
 			}
 			/* we should allocate requested block */
@@ -2091,6 +2265,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2099,13 +2274,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2114,12 +2285,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2128,8 +2314,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * we couldn't try to create block if create flag is zero
 	 */
 	if (!create) {
-		/* put just found gap into cache to speed up
-		 * subsequent requests */
+		/*
+		 * put just found gap into cache to speed up
+		 * subsequent requests
+		 */
 		ext4_ext_put_gap_in_cache(inode, path, iblock);
 		goto out2;
 	}
@@ -2175,6 +2363,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
@@ -2244,7 +2433,8 @@ void ext4_ext_truncate(struct inode * in
 	err = ext4_ext_remove_space(inode, last_block);
 
 	/* In a multi-transaction truncate, we only make the final
-	 * transaction synchronous. */
+	 * transaction synchronous.
+	 */
 	if (IS_SYNC(inode))
 		handle->h_sync = 1;
 
Index: linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc4/include/linux/ext4_fs_extents.h
@@ -205,6 +205,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+				 struct ext4_ext_path *path,
+				 struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);


* [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (6 preceding siblings ...)
  (?)
@ 2007-06-25 13:50                                                         ` Amit K. Arora
  2007-06-25 21:56                                                           ` Andreas Dilger
  -1 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 13:50 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support the FA_ALLOCATE and FA_RESV_SPACE modes. Support
for the FA_DEALLOCATE and FA_UNRESV_SPACE fallocate modes in ext4 is left
as future work.
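
The effect of FA_RESV_SPACE on the file size can be sketched in plain
userspace C as below. This is only an illustration: the helper name
size_after_prealloc() is hypothetical, the flag value is the one from
patch 4/7, and the sketch assumes the whole requested range was
preallocated successfully.

```c
#include <stdint.h>

#define FA_FL_KEEP_SIZE	0x02	/* value as defined in patch 4/7 */

/*
 * Hypothetical helper: file size after a fully successful preallocation
 * of [offset, offset + len). The size grows only when the range ends
 * beyond the current size and FA_FL_KEEP_SIZE is not set.
 */
static int64_t size_after_prealloc(int mode, int64_t i_size,
				   int64_t offset, int64_t len)
{
	if (!(mode & FA_FL_KEEP_SIZE) && offset + len > i_size)
		return offset + len;
	return i_size;
}
```

With FA_ALLOCATE (mode 0) a preallocation past EOF extends the size,
while with FA_RESV_SPACE (FA_FL_KEEP_SIZE set) the size stays as it was.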

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -2477,7 +2477,8 @@ int ext4_ext_writepage_trans_blocks(stru
 /*
  * preallocate space for a file. This implements ext4's fallocate inode
  * operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * Currently only FA_ALLOCATE and FA_RESV_SPACE modes are supported on
+ * extent based files.
  * We may have more modes supported in future - like FA_DEALLOCATE, which
  * tells fallocate to unallocate previously (pre)allocated blocks.
  * For block-mapped files, posix_fallocate should fall back to the method
@@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
 	 * currently supporting (pre)allocate mode for extent-based
 	 * files _only_
 	 */
-	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+		!(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
 		return -EOPNOTSUPP;
 
 	/* preallocation to directories is currently not supported */
@@ -2572,9 +2574,10 @@ retry:
 
 	/*
 	 * Time to update the file size.
-	 * Update only when preallocation was requested beyond the file size.
+	 * Update only when preallocation was requested beyond the file size
+	 * and when FA_FL_KEEP_SIZE mode is not specified!
 	 */
-	if ((offset + len) > i_size_read(inode)) {
+	if (!(mode & FA_FL_KEEP_SIZE) && (offset + len) > i_size_read(inode)) {
 		if (ret > 0) {
 			/*
 			 * if no error, we assume preallocation succeeded


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 13:45                                                         ` [PATCH 4/7][TAKE5] support new modes in fallocate Amit K. Arora
@ 2007-06-25 15:03                                                           ` Amit K. Arora
  2007-06-25 21:46                                                             ` Andreas Dilger
  2007-06-25 21:52                                                           ` Andreas Dilger
  1 sibling, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-25 15:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: David Chinner, Andreas Dilger, suparna, cmm, xfs

I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
*suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
If it is decided that these flags are also needed, I will update this
patch. Thanks!

On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
> Implement new flags and values for mode argument.
> 
> This patch implements the new flags and values for the "mode" argument
> of the fallocate system call. It is based on the discussion between
> Andreas Dilger and David Chinner on the man page proposed (by the latter)
> on fallocate.
> 
> Signed-off-by: Amit Arora <aarora@in.ibm.com>
> 
> Index: linux-2.6.22-rc4/include/linux/fs.h
> ===================================================================
> --- linux-2.6.22-rc4.orig/include/linux/fs.h
> +++ linux-2.6.22-rc4/include/linux/fs.h
> @@ -267,15 +267,16 @@ extern int dir_notify_enable;
>  #define SYNC_FILE_RANGE_WAIT_AFTER	4
> 
>  /*
> - * sys_fallocate modes
> - * Currently sys_fallocate supports two modes:
> - * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
> - *		  may request (pre)allocation of blocks.
> - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
> - *		  the preallocated blocks.
> + * sys_fallocate mode flags and values
>   */
> -#define FA_ALLOCATE	0x1
> -#define FA_DEALLOCATE	0x2
> +#define FA_FL_DEALLOC	0x01 /* default is allocate */
> +#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
> +#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
> +
> +#define FA_ALLOCATE	0
> +#define FA_DEALLOCATE	FA_FL_DEALLOC
> +#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> +#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> 
>  #ifdef __KERNEL__
> 
> Index: linux-2.6.22-rc4/fs/open.c
> ===================================================================
> --- linux-2.6.22-rc4.orig/fs/open.c
> +++ linux-2.6.22-rc4/fs/open.c
> @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
>   * sys_fallocate - preallocate blocks or free preallocated blocks
>   * @fd: the file descriptor
>   * @mode: mode specifies if fallocate should preallocate blocks OR free
> - *	  (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
> - *	  FA_DEALLOCATE modes are supported.
> + *	  (unallocate) preallocated blocks.
>   * @offset: The offset within file, from where (un)allocation is being
>   *	    requested. It should not have a negative value.
>   * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
>   *
>   * This system call, depending on the mode, preallocates or unallocates blocks
>   * for a file. The range of blocks depends on the value of offset and len
> - * arguments provided by the user/application. For FA_ALLOCATE mode, if this
> + * arguments provided by the user/application. For FA_ALLOCATE and
> + * FA_RESV_SPACE modes, if the sys_fallocate()
>   * system call succeeds, subsequent writes to the file in the given range
>   * (specified by offset & len) should not fail - even if the file system
>   * later becomes full. Hence the preallocation done is persistent (valid
> - * even after reopen of the file and remount/reboot).
> + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
> + * is passed, the file size will not be changed even if the preallocation
> + * is beyond EOF.
>   *
>   * It is expected that the ->fallocate() inode operation implemented by the
>   * individual file systems will update the file size and/or ctime/mtime
> - * depending on the mode and also on the success of the operation.
> + * depending on the mode (change is visible to user or not - say file size)
> + * and obviously, on the success of the operation.
>   *
>   * Note: Incase the file system does not support preallocation,
>   * posix_fallocate() should fall back to the library implementation (i.e.
> @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
> 
>  	/* Return error if mode is not supported */
>  	ret = -EOPNOTSUPP;
> -	if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
> +	if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
> +		mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
>  		goto out;
> 
>  	ret = -EBADF;


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 15:03                                                           ` Amit K. Arora
@ 2007-06-25 21:46                                                             ` Andreas Dilger
  2007-06-26 10:32                                                               ` Amit K. Arora
  2007-06-26 23:14                                                               ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-25 21:46 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
> I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
> If it is decided that these flags are also needed, I will update this
> patch. Thanks!

Can you clarify - what is the current behaviour when ENOSPC (or some other
error) is hit?  Does it keep the current fallocate() or does it free it?

For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
don't want to expose uninitialized disk blocks to userspace.  I'm not
sure if this makes sense at all.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 13:45                                                         ` [PATCH 4/7][TAKE5] support new modes in fallocate Amit K. Arora
  2007-06-25 15:03                                                           ` Amit K. Arora
@ 2007-06-25 21:52                                                           ` Andreas Dilger
  2007-06-26 10:45                                                             ` Amit K. Arora
  2007-06-26 23:26                                                             ` David Chinner
  1 sibling, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-25 21:52 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
> +#define FA_FL_DEALLOC	0x01 /* default is allocate */
> +#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
> +#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */

In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
each extent.  For some workloads this would be much faster than truncate
and reallocate of all the blocks in a file.

In that light, please change the comment to /* default is keep existing data */
so that it doesn't imply this is only for DEALLOC.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-25 13:50                                                         ` [PATCH 7/7][TAKE5] ext4: support new modes Amit K. Arora
@ 2007-06-25 21:56                                                           ` Andreas Dilger
  2007-06-26 12:07                                                             ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-06-25 21:56 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 25, 2007  19:20 +0530, Amit K. Arora wrote:
> @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
>  	 * currently supporting (pre)allocate mode for extent-based
>  	 * files _only_
>  	 */
> -	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> +		!(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
>  		return -EOPNOTSUPP;

This should probably just check for the individual flags it can support
(e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).
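
A flag-based check along these lines might look like the sketch below.
It is only an illustration, not proposed patch text: the flag values are
the ones from patch 4/7, while EXT4_FA_SUPPORTED_FLAGS and check_mode()
are hypothetical names.

```c
#include <errno.h>

/* Flag values as defined in patch 4/7 */
#define FA_FL_DEALLOC	0x01
#define FA_FL_KEEP_SIZE	0x02
#define FA_FL_DEL_DATA	0x04

/* Hypothetical mask of the flags this implementation can honour */
#define EXT4_FA_SUPPORTED_FLAGS	(FA_FL_KEEP_SIZE)

/* Reject any mode that carries a flag outside the supported set. */
static int check_mode(int mode)
{
	if (mode & ~EXT4_FA_SUPPORTED_FLAGS)
		return -EOPNOTSUPP;
	return 0;
}
```

This accepts FA_ALLOCATE (0) and FA_RESV_SPACE (FA_FL_KEEP_SIZE), and
rejects anything with FA_FL_DEALLOC or FA_FL_DEL_DATA set, without
having to enumerate each composite mode value.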

I also thought another proposed flag was to determine whether mtime (and
maybe ctime) is changed when doing prealloc/dealloc space?  Default should
probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
should decide if we want to allow changing the file w/o changing ctime, if
that is required even though the file is not visibly changing.  Maybe the
ctime update should be implicit if the size or mtime are changing?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 21:46                                                             ` Andreas Dilger
@ 2007-06-26 10:32                                                               ` Amit K. Arora
  2007-06-26 15:34                                                                 ` Andreas Dilger
  2007-06-30 10:21                                                                 ` Christoph Hellwig
  2007-06-26 23:14                                                               ` David Chinner
  1 sibling, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 10:32 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
> > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
> > If it is decided that these flags are also needed, I will update this
> > patch. Thanks!
> 
> Can you clarify - what is the current behaviour when ENOSPC (or some other
> error) is hit?  Does it keep the current fallocate() or does it free it?

Currently it is left to the file system implementation. In ext4, we do
not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
end up with partial (pre)allocation. This is in line with dd and
posix_fallocate, which also do not free the partially allocated space.
 
> For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> don't want to expose uninitialized disk blocks to userspace.  I'm not
> sure if this makes sense at all.

I don't think we need to make it the default - at least for filesystems which
have a mechanism to distinguish preallocated blocks from "regular" ones.
In ext4, for example, we will have a way to mark uninitialized extents.
All the preallocated blocks will be part of these uninitialized extents.
And any read on these extents will treat them as a hole, returning
zeroes to user land. Thus any existing data on uninitialized blocks will
not be exposed to the userspace.
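
The marking itself uses the MSB of the 16-bit ee_len field, as in the
helpers added to ext4_fs_extents.h. A minimal userspace model is given
below. It is a sketch only: it works on a host-endian uint16_t and
leaves out the le16 conversions the kernel helpers do, and the function
names are shortened, hypothetical stand-ins for the ext4_ext_* helpers.

```c
#include <stdint.h>

#define EXT_UNINIT_BIT	0x8000u	/* MSB of the 16-bit length field */

/* Set the "uninitialized" bit on a length value. */
static inline uint16_t mark_uninitialized(uint16_t ee_len)
{
	return (uint16_t)(ee_len | EXT_UNINIT_BIT);
}

/* Non-zero if the length value carries the "uninitialized" bit. */
static inline int is_uninitialized(uint16_t ee_len)
{
	return (ee_len & EXT_UNINIT_BIT) != 0;
}

/* Length in blocks with the "uninitialized" bit masked off. */
static inline unsigned int actual_len(uint16_t ee_len)
{
	return ee_len & 0x7FFFu;
}
```

A reader of the extent can therefore test the bit first and, if it is
set, return zeroes for the range instead of the on-disk contents.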

--
Regards,
Amit Arora 


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 21:52                                                           ` Andreas Dilger
@ 2007-06-26 10:45                                                             ` Amit K. Arora
  2007-06-26 15:42                                                               ` Andreas Dilger
  2007-06-26 23:26                                                             ` David Chinner
  1 sibling, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 10:45 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
> > +#define FA_FL_DEALLOC	0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
> 
> In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> each extent.  For some workloads this would be much faster than truncate
> and reallocate of all the blocks in a file.

In ext4, we already mark each extent having preallocated blocks as
uninitialized. This is done by the following code (which is part of
patch 5/7) in ext4_ext_get_blocks():

@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
        /* try to insert new extent into found leaf and return */
        ext4_ext_store_pblock(&newex, newblock);
        newex.ee_len = cpu_to_le16(allocated);
+       if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+               ext4_ext_mark_uninitialized(&newex);
        err = ext4_ext_insert_extent(handle, inode, path, &newex);
        if (err) {
                /* free data blocks we just allocated */


> In that light, please change the comment to /* default is keep existing data */
> so that it doesn't imply this is only for DEALLOC.

Ok. Will update the comment.

Thanks!
--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-25 21:56                                                           ` Andreas Dilger
@ 2007-06-26 12:07                                                             ` Amit K. Arora
  2007-06-26 16:14                                                               ` Andreas Dilger
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 12:07 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Mon, Jun 25, 2007 at 03:56:25PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007  19:20 +0530, Amit K. Arora wrote:
> > @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
> >  	 * currently supporting (pre)allocate mode for extent-based
> >  	 * files _only_
> >  	 */
> > -	if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> > +		!(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
> >  		return -EOPNOTSUPP;
> 
> This should probably just check for the individual flags it can support
> (e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).

Hmm.. I am thinking of a scenario where the file system supports some
individual flags, but does not support a particular combination of them.
For the sake of example, assume we also have an FA_ZERO_SPACE mode, and
that a file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
FA_RESV_SPACE, but no other mode (i.e. FA_UNRESV_SPACE is not supported
for some reason). Then, although we support the FA_FL_DEALLOC,
FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags individually, we do not support
the combination of all of them (which is nothing but FA_UNRESV_SPACE).
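
The difference between checking flags one by one and checking whole modes can
be sketched as below. The three FA_FL_* values are the ones quoted from patch
4/7, and FA_UNRESV_SPACE is the combination of all three as stated above. The
encodings of FA_ALLOCATE, FA_DEALLOCATE and FA_RESV_SPACE are assumptions made
for this example, and both sketch_* helpers are hypothetical.

```c
#include <assert.h>

/* Flag values as quoted from patch 4/7 in this thread. */
#define FA_FL_DEALLOC	0x01
#define FA_FL_KEEP_SIZE	0x02
#define FA_FL_DEL_DATA	0x04

/* Stated in this thread: the combination of all three flags. */
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Assumed encodings, not confirmed by the hunks quoted here. */
#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE

/* Per-flag check: accept any mode built only from known flags. */
static int sketch_flags_ok(int mode)
{
	const int known = FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA;

	assert(mode >= 0);
	return (mode & ~known) == 0;
}

/* Per-mode check: accept only the combinations this filesystem handles. */
static int sketch_mode_ok(int mode)
{
	assert(mode >= 0);
	switch (mode) {
	case FA_ALLOCATE:
	case FA_DEALLOCATE:
	case FA_RESV_SPACE:
		return 1;
	default:
		return 0;
	}
}
```

With these definitions FA_UNRESV_SPACE passes the per-flag check but is
rejected by the per-mode check, which is the case the scenario above is about.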
 
> I also thought another proposed flag was to determine whether mtime (and
> maybe ctime) is changed when doing prealloc/dealloc space?  Default should
> probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
> should decide if we want to allow changing the file w/o changing ctime, if
> that is required even though the file is not visibly changing.  Maybe the
> ctime update should be implicit if the size or mtime are changing?

Is it really required? I mean, why should we allow users not to update
ctime/mtime even if the file metadata/data gets updated? It sounds
a bit "unnatural" to me.
Is there any application scenario you have in mind when you suggest
giving this flexibility to userspace?

I think modifying ctime/mtime should depend on the other flags.
E.g., if we do not zero out data blocks on allocation/deallocation,
update only ctime. Otherwise, update both ctime and mtime.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 2/7][TAKE5] fallocate() on s390(x)
  2007-06-25 13:42                                                         ` [PATCH 2/7][TAKE5] fallocate() on s390(x) Amit K. Arora
@ 2007-06-26 15:15                                                           ` Heiko Carstens
  0 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-06-26 15:15 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

> Index: linux-2.6.22-rc4/arch/s390/kernel/syscalls.S
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/s390/kernel/syscalls.S	2007-06-11 16:16:01.000000000 -0700
> +++ linux-2.6.22-rc4/arch/s390/kernel/syscalls.S	2007-06-11 16:27:29.000000000 -0700
> @@ -322,6 +322,7 @@
>  SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
>  SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
>  SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
> +SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
>  NI_SYSCALL							/* 314 sys_fallocate */

You need to remove the NI_SYSCALL line. Otherwise all following entries
will be wrong.

>  SYSCALL(sys_utimensat,sys_utimensat,compat_sys_utimensat_wrapper)	/* 315 */
>  SYSCALL(sys_signalfd,sys_signalfd,compat_sys_signalfd_wrapper)
> Index: linux-2.6.22-rc4/include/asm-s390/unistd.h
> ===================================================================
> --- linux-2.6.22-rc4.orig/include/asm-s390/unistd.h	2007-06-11 16:16:01.000000000 -0700
> +++ linux-2.6.22-rc4/include/asm-s390/unistd.h	2007-06-11 16:27:29.000000000 -0700
> @@ -256,7 +256,8 @@
>  #define __NR_signalfd		316
>  #define __NR_timerfd		317
>  #define __NR_eventfd		318
> -#define NR_syscalls 319
> +#define __NR_fallocate		319
> +#define NR_syscalls 320

Erm... no. You use slot 314 in the syscall table but assign number 319.
That won't work. Please use 314 for both.
I assume this got broken when updating to newer kernel versions.
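
The constraint can be sketched with a toy dispatch table. This is only an
illustration of why the table slot and the __NR_ number have to agree. The
numbers 313 to 315 follow the comments in the quoted syscalls.S hunk, and all
sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*sketch_syscall_fn)(void);

/* Each toy handler returns its own number so dispatch can be checked. */
static long sketch_utimes(void)    { return 313; }
static long sketch_fallocate(void) { return 314; }
static long sketch_utimensat(void) { return 315; }

#define SKETCH_FIRST_NR		313
#define SKETCH_NR_fallocate	314	/* must match the slot used below */

/*
 * The table is indexed by syscall number. The fallocate entry replaces
 * the placeholder in slot 314. Adding it in front of the placeholder
 * would shift every later entry by one.
 */
static const sketch_syscall_fn sketch_table[] = {
	sketch_utimes,		/* 313 */
	sketch_fallocate,	/* 314 */
	sketch_utimensat,	/* 315 */
};

static long sketch_dispatch(int nr)
{
	size_t idx;

	assert(nr >= SKETCH_FIRST_NR);
	idx = (size_t)(nr - SKETCH_FIRST_NR);
	assert(idx < sizeof(sketch_table) / sizeof(sketch_table[0]));
	return sketch_table[idx]();
}
```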

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 10:32                                                               ` Amit K. Arora
@ 2007-06-26 15:34                                                                 ` Andreas Dilger
  2007-06-26 19:09                                                                   ` Amit K. Arora
  2007-06-26 23:18                                                                   ` David Chinner
  2007-06-30 10:21                                                                 ` Christoph Hellwig
  1 sibling, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-26 15:34 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit?  Does it keep the current fallocate() or does it free it?
> 
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.

Since I believe the XFS allocation ioctls do it the opposite way (free
preallocated space on error) this should be encoded into the flags.
Having it "filesystem dependent" just means that nobody will be happy.

> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > sure if this makes sense at all.
> 
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.

What I mean is that any data read from the file should have the "appearance"
of being zeroed (whether zeroes are actually written to disk or not).  What
I _think_ David is proposing is to allow fallocate() to return without
marking the blocks even "uninitialized" and subsequent reads would return
the old data from the disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 10:45                                                             ` Amit K. Arora
@ 2007-06-26 15:42                                                               ` Andreas Dilger
  2007-06-26 19:12                                                                 ` Amit K. Arora
  2007-06-26 23:32                                                                 ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-26 15:42 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > each extent.  For some workloads this would be much faster than truncate
> > and reallocate of all the blocks in a file.
> 
> In ext4, we already mark each extent having preallocated blocks as
> uninitialized. This is done as part of following code (which is part of
> patch 5/7) in ext4_ext_get_blocks() :  

What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.

So, if you had a sparse file with some data in it:

     AAAAA         BBBBBB

fallocate() would allocate the holes:

00000AAAAA000000000BBBBBB00000000

XFS_IOC_ALLOCSP would overwrite everything:

000000000000000000000000000000000

In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation).  This is fairly easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-26 12:07                                                             ` Amit K. Arora
@ 2007-06-26 16:14                                                               ` Andreas Dilger
  2007-06-26 19:29                                                                 ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-06-26 16:14 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
> Hmm.. I am thinking of a scenario when the file system supports some
> individual flags, but does not support a particular combination of them.
> Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> for some reason). This means that although we support FA_FL_DEALLOC,
> FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> combination of all these flags (which is nothing but FA_UNRESV_SPACE).

That is up to the filesystem to determine then.  I just thought it should
be clear to return an error for flags (or as you say combinations thereof)
that the filesystem doesn't understand.

That said, I'd think in most cases the flags are orthogonal, so if you
support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
then you will also support other combinations of those flags just from
the way it is coded.

> > I also thought another proposed flag was to determine whether mtime (and
> > maybe ctime) is changed when doing prealloc/dealloc space?  Default should
> > probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
> > should decide if we want to allow changing the file w/o changing ctime, if
> > that is required even though the file is not visibly changing.  Maybe the
> > ctime update should be implicit if the size or mtime are changing?
> 
> Is it really required ? I mean, why should we allow users not to update
> ctime/mtime even if the file metadata/data gets updated ? It sounds
> a bit "unnatural" to me.
> Is there any application scenario in your mind, when you suggest of
> giving this flexibility to userspace ?

One reason is that XFS does NOT update the mtime/ctime when doing the
XFS_IOC_* allocation ioctls.

> I think, modifying ctime/mtime should be dependent on the other flags.
> E.g., if we do not zero out data blocks on allocation/deallocation,
> update only ctime. Otherwise, update ctime and mtime both.

I'm only being the advocate for requirements David Chinner has put
forward due to existing behaviour in XFS.  This is one of the reasons
why I think the "flags" mechanism we now have is useful - we can encode
the various different behaviours in any way we want and leave it to the
caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 15:34                                                                 ` Andreas Dilger
@ 2007-06-26 19:09                                                                   ` Amit K. Arora
  2007-06-26 23:18                                                                   ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 19:09 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > 
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
> 
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.

Ok, got your point. Maybe we can have a flag for this, as you suggested.
But the default behavior IMHO should be _not_ to undo partial allocation
(thus the file system will have the option of supporting this flag or
not, and it will be in line with posix_fallocate; XFS will obviously
want to support this flag, in line with its existing behavior).

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > > sure if this makes sense at all.
> > 
> > I don't think we need to make it default - atleast for filesystems which
> > have a mechanism to distinguish preallocated blocks from "regular" ones.
> 
> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not).  What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.

I can't think of a good reason for this (i.e. returning stale data from
preallocated blocks). In fact, it looks like a security issue to me.
That said, it may be beneficial for file systems which have
noticeable overhead in marking the blocks "uninitialized/preallocated".
Can you or David please shed some light on how this option might really
be helpful? Thanks!

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 15:42                                                               ` Andreas Dilger
@ 2007-06-26 19:12                                                                 ` Amit K. Arora
  2007-06-26 23:32                                                                 ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 19:12 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > > each extent.  For some workloads this would be much faster than truncate
> > > and reallocate of all the blocks in a file.
> > 
> > In ext4, we already mark each extent having preallocated blocks as
> > uninitialized. This is done as part of following code (which is part of
> > patch 5/7) in ext4_ext_get_blocks() :  
> 
> What I meant is that with XFS_IOC_ALLOCSP the previously-written data
> is ZEROED OUT, unlike with fallocate() which leaves previously-written
> data alone and only allocates in holes.
> 
> In order to specify this for allocation, FA_FL_DEL_DATA would need to make
> sense for allocations (as well as the deallocation).  This is farily easy
> to do - just mark all of the existing extents as unallocated, and their
> data disappears.

Ok, agreed. Will add the FA_ZERO_SPACE mode too.
Thanks!

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-26 16:14                                                               ` Andreas Dilger
@ 2007-06-26 19:29                                                                 ` Amit K. Arora
  2007-06-27  0:04                                                                   ` David Chinner
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-06-26 19:29 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
> > Hmm.. I am thinking of a scenario when the file system supports some
> > individual flags, but does not support a particular combination of them.
> > Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> > file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> > FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> > for some reason). This means that although we support FA_FL_DEALLOC,
> > FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> > combination of all these flags (which is nothing but FA_UNRESV_SPACE).
> 
> That is up to the filesystem to determine then.  I just thought it should
> be clear to return an error for flags (or as you say combinations thereof)
> that the filesystem doesn't understand.
> 
> That said, I'd think in most cases the flags are orthogonal, so if you
> support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
> then you will also support other combinations of those flags just from
> the way it is coded.

Ok. 
 
> > > I also thought another proposed flag was to determine whether mtime (and
> > > maybe ctime) is changed when doing prealloc/dealloc space?  Default should
> > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
> > > should decide if we want to allow changing the file w/o changing ctime, if
> > > that is required even though the file is not visibly changing.  Maybe the
> > > ctime update should be implicit if the size or mtime are changing?
> > 
> > Is it really required ? I mean, why should we allow users not to update
> > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > a bit "unnatural" to me.
> > Is there any application scenario in your mind, when you suggest of
> > giving this flexibility to userspace ?
> 
> One reason is that XFS does NOT update the mtime/ctime when doing the
> XFS_IOC_* allocation ioctls.

Hmm.. I personally will call it a bug in XFS code then. :)

> > I think, modifying ctime/mtime should be dependent on the other flags.
> > E.g., if we do not zero out data blocks on allocation/deallocation,
> > update only ctime. Otherwise, update ctime and mtime both.
> 
> I'm only being the advocate for requirements David Chinner has put
> forward due to existing behaviour in XFS.  This is one of the reasons
> why I think the "flags" mechanism we now have - we can encode the
> various different behaviours in any way we want and leave it to the
> caller.

I understand. Maybe we can confirm once more with David Chinner whether
this is really required. Will it really be a compatibility issue if new XFS
preallocations (i.e. via fallocate) update mtime/ctime? Will old
applications really be affected? If yes, then it might be worth
implementing - even though I personally don't like it.

David, can you please confirm? Thanks!

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc
  2007-06-25 13:40                                                         ` [PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc Amit K. Arora
@ 2007-06-26 19:38                                                           ` Heiko Carstens
  0 siblings, 0 replies; 340+ messages in thread
From: Heiko Carstens @ 2007-06-26 19:38 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

> Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
> +++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
> @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
>  	return sys_truncate(path, (high << 32) | low);
>  }
> 
> +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
> +				     u32 lenhi, u32 lenlo)
> +{
> +	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
> +			     ((loff_t)lenhi << 32) | lenlo);
> +}
> +
>  asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
>  				 unsigned long low)
>  {
> Index: linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
> ===================================================================
> --- linux-2.6.22-rc4.orig/arch/x86_64/ia32/ia32entry.S
> +++ linux-2.6.22-rc4/arch/x86_64/ia32/ia32entry.S
> @@ -719,4 +719,5 @@ ia32_sys_call_table:
>  	.quad compat_sys_signalfd
>  	.quad compat_sys_timerfd
>  	.quad sys_eventfd
> +	.quad sys_fallocate
>  ia32_syscall_end:

Btw. this is also (still?) broken. x86_64 needs a compat syscall here.
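
For context, a 32-bit caller passes each 64-bit argument as two 32-bit halves,
so the 64-bit kernel needs a wrapper that reassembles them before calling the
native entry point. The following is a minimal sketch of that reassembly. It
uses the hi/lo argument order of the quoted powerpc wrapper, and the order on
other ABIs may differ. sketch_native_fallocate() is a hypothetical stand-in,
not the kernel's sys_fallocate().

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild one 64-bit value from the two halves a 32-bit caller passes. */
static int64_t sketch_join_halves(uint32_t hi, uint32_t lo)
{
	return (int64_t)(((uint64_t)hi << 32) | lo);
}

/*
 * Hypothetical stand-in for the native 64-bit entry point. It returns
 * the end of the requested range so the arguments can be observed.
 */
static int64_t sketch_native_fallocate(int fd, int mode,
				       int64_t offset, int64_t len)
{
	(void)fd;
	(void)mode;
	assert(offset >= 0 && len >= 0);
	return offset + len;
}

/* Compat wrapper: same shape as the quoted compat_sys_fallocate(). */
static int64_t sketch_compat_fallocate(int fd, int mode,
				       uint32_t offhi, uint32_t offlo,
				       uint32_t lenhi, uint32_t lenlo)
{
	return sketch_native_fallocate(fd, mode,
				       sketch_join_halves(offhi, offlo),
				       sketch_join_halves(lenhi, lenlo));
}
```

Pointing the 32-bit table directly at the native function skips this step, so
the native code would interpret the split halves as separate arguments.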

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 21:46                                                             ` Andreas Dilger
  2007-06-26 10:32                                                               ` Amit K. Arora
@ 2007-06-26 23:14                                                               ` David Chinner
  2007-06-27  3:49                                                                 ` Andreas Dilger
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-26 23:14 UTC (permalink / raw)
  To: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
> > I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> > *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
> > If it is decided that these flags are also needed, I will update this
> > patch. Thanks!
> 
> Can you clarify - what is the current behaviour when ENOSPC (or some other
> error) is hit?  Does it keep the current fallocate() or does it free it?
> 
> For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> don't want to expose uninitialized disk blocks to userspace.  I'm not
> sure if this makes sense at all.

Someone on the XFS list had an interesting request - preallocated
swap files. You can't use unwritten extents for this because
of sys_swapon()'s use of bmap() (XFS returns holes for reading
unwritten extents), so we need a method of preallocating that does
not zero or mark the extent unread. i.e. FA_MKSWAP.

I think this would be:

#define FA_FL_NO_ZERO_SPACE	0x08	/* default is to zero space */

#define FA_MKSWAP 	(FA_ALLOCATE | FA_FL_NO_ZERO_SPACE)

That way we can allocate large swap files that don't need zeroing
in a single, fast operation, and hence potentially bring new
swap space online without needing very much memory at all (i.e.
should succeed in most near-OOM conditions).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (7 preceding siblings ...)
  (?)
@ 2007-06-26 23:15                                                         ` David Chinner
  -1 siblings, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-06-26 23:15 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

On Mon, Jun 25, 2007 at 06:58:10PM +0530, Amit K. Arora wrote:
> 2) The above new patches (4/7 and 7/7) are based on the dicussion
>    between Andreas Dilger and David Chinner on the mode argument,
>    when later posted a man page on fallocate.

Can you include the man page in this patch set, please? That
way it can be kept up to date with the rest of the patch set.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 15:34                                                                 ` Andreas Dilger
  2007-06-26 19:09                                                                   ` Amit K. Arora
@ 2007-06-26 23:18                                                                   ` David Chinner
  2007-06-28 18:19                                                                     ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-26 23:18 UTC (permalink / raw)
  To: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > 
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
> 
> Since I believe the XFS allocation ioctls do it the opposite way (free
> preallocated space on error) this should be encoded into the flags.
> Having it "filesystem dependent" just means that nobody will be happy.

No, XFS does not free preallocated space on error. It is up to the
application to clean up.
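
What such application-side cleanup might look like is sketched below using the
existing posix_fallocate() interface, since the fallocate() system call under
discussion is not final. The helper name is hypothetical. The rollback is best
effort: it only drops space beyond the old end of file, and blocks allocated
into holes inside the old file size stay allocated.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate [offset, offset + len) on fd. Returns 0 on success or an
 * error number. On failure, try to restore the previous file size so a
 * partial allocation past the old EOF is not left behind.
 */
static int sketch_prealloc_or_rollback(int fd, off_t offset, off_t len)
{
	struct stat st;
	int err;

	assert(offset >= 0 && len > 0);

	if (fstat(fd, &st) != 0)
		return errno;

	/* posix_fallocate() returns the error number, it does not set errno. */
	err = posix_fallocate(fd, offset, len);
	if (err != 0) {
		/* Best effort: ignore a failure of the rollback itself. */
		if (ftruncate(fd, st.st_size) != 0)
			return err;
	}
	return err;
}
```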

> What I mean is that any data read from the file should have the "appearance"
> of being zeroed (whether zeroes are actually written to disk or not).  What
> I _think_ David is proposing is to allow fallocate() to return without
> marking the blocks even "uninitialized" and subsequent reads would return
> the old data from the disk.

Correct, but for swap files that's not an issue - no user should be able
to read them, and FA_MKSWAP would really need root privileges to execute.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-25 21:52                                                           ` Andreas Dilger
  2007-06-26 10:45                                                             ` Amit K. Arora
@ 2007-06-26 23:26                                                             ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-06-26 23:26 UTC (permalink / raw)
  To: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
> > +#define FA_FL_DEALLOC	0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE	0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA	0x04 /* default is keep written data on DEALLOC */
> 
> In XFS one of the (many) ALLOC modes is to zero existing data on allocate.

No, none of the XFS allocation modes do that.

XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and
writes zeros in the range between the old file size and the new file size.
XFS_IOC_RESVSP, which allocates unwritten extents, only allocates
where extents do not currently exist. It does not zero existing
extents.

IOWs, you can't overwrite existing data with XFS preallocation.
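
The two behaviours described above can be modelled with a toy block map. This
is a sketch of the semantics as stated in this message, not XFS code. One
character stands for one block: a space is a hole, a letter is written data,
'0' is a zeroed block and 'u' is an unwritten (reserved) block. All names are
hypothetical.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_HOLE		' '
#define SKETCH_ZERO		'0'
#define SKETCH_UNWRITTEN	'u'

/*
 * Toy ALLOCSP: allocate and zero only the range between the old file
 * size and the new file size. Blocks below old_size are left untouched.
 * The map buffer must have room for new_size + 1 characters.
 */
static void sketch_allocsp(char *map, size_t old_size, size_t new_size)
{
	assert(new_size >= old_size);
	memset(map + old_size, SKETCH_ZERO, new_size - old_size);
	map[new_size] = '\0';
}

/*
 * Toy RESVSP: reserve unwritten blocks only where nothing is allocated
 * yet. Existing data is never modified and the size does not change.
 */
static void sketch_resvsp(char *map, size_t start, size_t len)
{
	size_t i;

	for (i = start; i < start + len; i++) {
		assert(map[i] != '\0');	/* stay inside the map */
		if (map[i] == SKETCH_HOLE)
			map[i] = SKETCH_UNWRITTEN;
	}
}
```

In neither case does a block that already holds data change, which matches the
statement that existing data cannot be overwritten by preallocation.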

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 15:42                                                               ` Andreas Dilger
  2007-06-26 19:12                                                                 ` Amit K. Arora
@ 2007-06-26 23:32                                                                 ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-06-26 23:32 UTC (permalink / raw)
  To: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Tue, Jun 26, 2007 at 11:42:50AM -0400, Andreas Dilger wrote:
> On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
> > On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > > each extent.  For some workloads this would be much faster than truncate
> > > and reallocate of all the blocks in a file.
> > 
> > In ext4, we already mark each extent having preallocated blocks as
> > uninitialized. This is done as part of following code (which is part of
> > patch 5/7) in ext4_ext_get_blocks() :  
> 
> What I meant is that with XFS_IOC_ALLOCSP the previously-written data
> is ZEROED OUT, unlike with fallocate() which leaves previously-written
> data alone and only allocates in holes.
> 
> So, if you had a sparse file with some data in it:
> 
>      AAAAA         BBBBBB
> 
> fallocate() would allocate the holes:
> 
> 00000AAAAA000000000BBBBBB00000000
> 
> XFS_IOC_ALLOCSP would overwrite everything:
> 
> 000000000000000000000000000000000

No, it wouldn't. XFS_IOC_ALLOCSP would give you:


      AAAAA         BBBBBB00000000

because it only allocates the space between the old EOF and the new
EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k:

budgie:~ # xfs_io -f \
> -c "pwrite 4096 4096" \
> -c "pwrite 16384 4096" \
> -c "bmap -vvp" \
> -c "allocsp 32768 0" \
> -c "bmap -vvp" \
> /mnt/test/alfred
wrote 4096/4096 bytes at offset 4096
4 KiB, 1 ops; 0.0000 sec (108.507 MiB/sec and 27777.7778 ops/sec)
wrote 4096/4096 bytes at offset 16384
4 KiB, 1 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
/mnt/test/alfred:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL
   0: [0..7]:          hole                                       8
   1: [8..15]:         5226864..5226871  4 (1022160..1022167)     8
   2: [16..31]:        hole                                      16
   3: [32..39]:        5226888..5226895  4 (1022184..1022191)     8
/mnt/test/alfred:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL
   0: [0..7]:          hole                                       8
   1: [8..15]:         5226864..5226871  4 (1022160..1022167)     8
   2: [16..31]:        hole                                      16
   3: [32..63]:        5226888..5226919  4 (1022184..1022215)    32
budgie:~ #

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-26 19:29                                                                 ` Amit K. Arora
@ 2007-06-27  0:04                                                                   ` David Chinner
  2007-06-28 18:07                                                                     ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-27  0:04 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner, suparna,
	cmm, xfs

On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
> > > > I also thought another proposed flag was to determine whether mtime (and
> > > > maybe ctime) is changed when doing prealloc/dealloc space?  Default should
> > > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
> > > > should decide if we want to allow changing the file w/o changing ctime, if
> > > > that is required even though the file is not visibly changing.  Maybe the
> > > > ctime update should be implicit if the size or mtime are changing?
> > > 
> > > Is it really required ? I mean, why should we allow users not to update
> > > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > > a bit "unnatural" to me.
> > > Is there any application scenario in your mind, when you suggest of
> > > giving this flexibility to userspace ?
> > 
> > One reason is that XFS does NOT update the mtime/ctime when doing the
> > XFS_IOC_* allocation ioctls.

Not totally correct.

XFS_IOC_ALLOCSP/FREESP change timestamps if they change
the file size (via the truncate call made to change the file size).
If they don't change the file size, then they are a no-op and should
not change the timestamps.

XFS_IOC_RESVSP/UNRESVSP don't change timestamps just like they don't
change file size. That is by design AFAICT so these calls can be
used by HSM-type applications that don't want to change timestamps
when punching out data blocks or preallocating new ones.
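
The timestamp rules described above can be summarised as a small predicate.
This is a sketch only: the enum and function names are hypothetical, and it
encodes the behaviour as stated in this message rather than anything taken
from the XFS source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the four XFS space ioctls discussed above. */
enum toy_space_op { TOY_ALLOCSP, TOY_FREESP, TOY_RESVSP, TOY_UNRESVSP };

/*
 * Returns true if the operation is expected to update timestamps:
 * ALLOCSP/FREESP do so only when the file size actually changes,
 * RESVSP/UNRESVSP never do (they do not change the file size either).
 */
bool toy_updates_timestamps(enum toy_space_op op,
			    uint64_t old_size, uint64_t new_size)
{
	switch (op) {
	case TOY_ALLOCSP:
	case TOY_FREESP:
		return new_size != old_size;
	case TOY_RESVSP:
	case TOY_UNRESVSP:
		return false;
	}
	assert(!"unknown space operation");
	return false;
}
```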

> Hmm.. I personally will call it a bug in XFS code then. :)

No, I'd call it useful. :)

> > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > update only ctime. Otherwise, update ctime and mtime both.
> > 
> > I'm only being the advocate for requirements David Chinner has put
> > forward due to existing behaviour in XFS.  This is one of the reasons
> > why I think the "flags" mechanism we now have - we can encode the
> > various different behaviours in any way we want and leave it to the
> > caller.
> 
> I understand. May be we can confirm once more with David Chinner if this
> is really required. Will it really be a compatibility issue if new XFS
> preallocations (ie. via fallocate) update mtime/ctime?

It should be left up to the filesystem to decide. Only the
filesystem knows whether something changed and the timestamp should
or should not be updated.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 23:14                                                               ` David Chinner
@ 2007-06-27  3:49                                                                 ` Andreas Dilger
  2007-06-27 13:36                                                                   ` David Chinner
  0 siblings, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-06-27  3:49 UTC (permalink / raw)
  To: David Chinner; +Cc: Amit K. Arora, linux-fsdevel, linux-ext4, suparna, cmm, xfs

On Jun 27, 2007  09:14 +1000, David Chinner wrote:
> Someone on the XFs list had an interesting request - preallocated
> swap files. You can't use unwritten extents for this because
> of sys_swapon()s use of bmap() (XFS returns holes for reading
> unwritten extents), so we need a method of preallocating that does
> not zero or mark the extent unread. i.e. FA_MKSWAP.

Is there a reason why unwritten extents return 0 to bmap()?  This
would seem to be the only impediment from using fallocated files
for swap files.  Maybe if FIEMAP was used by mkswap to get an
"UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem.

> That way we can allocate large swap files that don't need zeroing
> in a single, fast operation, and hence potentially bring new
> swap space online without needed very much memory at all (i.e.
> should succeed in most near-OOM conditions).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-27  3:49                                                                 ` Andreas Dilger
@ 2007-06-27 13:36                                                                   ` David Chinner
  2007-06-27 23:28                                                                     ` Nathan Scott
  2007-06-30 10:26                                                                     ` Christoph Hellwig
  0 siblings, 2 replies; 340+ messages in thread
From: David Chinner @ 2007-06-27 13:36 UTC (permalink / raw)
  To: xfs-oss; +Cc: Amit K. Arora, linux-fsdevel, linux-ext4, suparna, cmm

On Tue, Jun 26, 2007 at 11:49:15PM -0400, Andreas Dilger wrote:
> On Jun 27, 2007  09:14 +1000, David Chinner wrote:
> > Someone on the XFs list had an interesting request - preallocated
> > swap files. You can't use unwritten extents for this because
> > of sys_swapon()s use of bmap() (XFS returns holes for reading
> > unwritten extents), so we need a method of preallocating that does
> > not zero or mark the extent unread. i.e. FA_MKSWAP.
> 
> Is there a reason why unwritten extents return 0 to bmap()?

It's a fallout of xfs_get_blocks not mapping unwritten extents
on read because we want do_mpage_readpage() to treat them
as a hole. i.e. zero fill them instead of doing I/O. This is
the way XFS was shoehorned into the generic read path :/

> This
> would seem to be the only impediment from using fallocated files
> for swap files.  Maybe if FIEMAP was used by mkswap to get an
> "UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem.

Probably. If we taught do_mpage_readpage() about unwritten mappings,
then we could map them on read and then sys_swapon can remain
blissfully unaware of unwritten extents.

I think pretty much all I need to do to achieve that is
(untested):

---

Teach do_mpage_readpage() about unwritten extents so we can
always map them in get_blocks rather than pretending they are holes on
read. Allows setup_swap_extents() to use preallocated files on XFS
filesystems for swap files without ever needing to convert them.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/mpage.c                  |    5 +++--
 fs/xfs/linux-2.6/xfs_aops.c |   13 +++----------
 2 files changed, 6 insertions(+), 12 deletions(-)

Index: 2.6.x-xfs-new/fs/mpage.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/mpage.c	2007-05-29 16:17:59.000000000 +1000
+++ 2.6.x-xfs-new/fs/mpage.c	2007-06-27 22:39:35.568852270 +1000
@@ -207,7 +207,8 @@ do_mpage_readpage(struct bio *bio, struc
 	 * Map blocks using the result from the previous get_blocks call first.
 	 */
 	nblocks = map_bh->b_size >> blkbits;
-	if (buffer_mapped(map_bh) && block_in_file > *first_logical_block &&
+	if (buffer_mapped(map_bh) && !buffer_unwritten(map_bh) &&
+			block_in_file > *first_logical_block &&
 			block_in_file < (*first_logical_block + nblocks)) {
 		unsigned map_offset = block_in_file - *first_logical_block;
 		unsigned last = nblocks - map_offset;
@@ -242,7 +243,7 @@ do_mpage_readpage(struct bio *bio, struc
 			*first_logical_block = block_in_file;
 		}
 
-		if (!buffer_mapped(map_bh)) {
+		if (!buffer_mapped(map_bh) || buffer_unwritten(map_bh)) {
 			fully_mapped = 0;
 			if (first_hole == blocks_per_page)
 				first_hole = page_block;
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c	2007-06-05 22:14:39.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c	2007-06-27 22:39:29.545636749 +1000
@@ -1340,16 +1340,9 @@ __xfs_get_blocks(
 		return 0;
 
 	if (iomap.iomap_bn != IOMAP_DADDR_NULL) {
-		/*
-		 * For unwritten extents do not report a disk address on
-		 * the read case (treat as if we're reading into a hole).
-		 */
-		if (create || !(iomap.iomap_flags & IOMAP_UNWRITTEN)) {
-			xfs_map_buffer(bh_result, &iomap, offset,
-				       inode->i_blkbits);
-		}
-		if (create && (iomap.iomap_flags & IOMAP_UNWRITTEN)) {
-			if (direct)
+		xfs_map_buffer(bh_result, &iomap, offset, inode->i_blkbits);
+		if (iomap.iomap_flags & IOMAP_UNWRITTEN) {
+			if (create && direct)
 				bh_result->b_private = inode;
 			set_buffer_unwritten(bh_result);
 		}


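The intent of the patch can be illustrated with a userspace model of the two
decisions involved. This is a sketch under stated assumptions: struct toy_map
and the toy_* functions are hypothetical stand-ins for the buffer_head state,
get_blocks and the read path, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the buffer_head bits the patch looks at. */
struct toy_map {
	bool mapped;		/* a disk address is present */
	bool unwritten;		/* preallocated but never written */
	uint64_t blocknr;	/* disk block, valid only if mapped */
};

enum toy_read_action { TOY_READ_FROM_DISK, TOY_ZERO_FILL };

/*
 * Patched get_blocks behaviour: always report the disk address of an
 * allocated extent, and flag it if it is unwritten.
 */
struct toy_map toy_get_blocks(bool allocated, bool unwritten, uint64_t blocknr)
{
	/* An unwritten extent is by definition allocated. */
	assert(!unwritten || allocated);
	struct toy_map m = {
		.mapped = allocated,
		.unwritten = unwritten,
		.blocknr = allocated ? blocknr : 0,
	};
	return m;
}

/* Read path: unwritten mappings are zero-filled exactly like holes. */
enum toy_read_action toy_readpage_action(const struct toy_map *m)
{
	if (!m->mapped || m->unwritten)
		return TOY_ZERO_FILL;
	return TOY_READ_FROM_DISK;
}

/* bmap()-style query: 0 means "hole", anything else is a disk block. */
uint64_t toy_bmap(const struct toy_map *m)
{
	return m->mapped ? m->blocknr : 0;
}
```

With this split, a reader of an unwritten extent still sees zeros, while a
bmap()-style caller gets a real disk address instead of a hole.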
Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-27 13:36                                                                   ` David Chinner
@ 2007-06-27 23:28                                                                     ` Nathan Scott
  2007-06-28  0:39                                                                       ` David Chinner
  2007-06-30 10:26                                                                     ` Christoph Hellwig
  1 sibling, 1 reply; 340+ messages in thread
From: Nathan Scott @ 2007-06-27 23:28 UTC (permalink / raw)
  To: David Chinner, Andreas Dilger
  Cc: xfs-oss, Amit K. Arora, linux-fsdevel, linux-ext4, suparna, cmm

On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote:
> .... Allows setup_swap_extents() to use preallocated files on XFS
> filesystems for swap files without ever needing to convert them.

Using unwritten extents (as opposed to the MKSWAP flag mentioned
earlier) has the unfortunate down side of requiring transactions,
possibly additional IO, and memory allocation during swap.  (but,
this patch should probably go in regardless, as teaching generic
code about unwritten extents is not a bad idea).

cheers.

--
Nathan


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-27 23:28                                                                     ` Nathan Scott
@ 2007-06-28  0:39                                                                       ` David Chinner
  2007-06-28  0:53                                                                         ` Nathan Scott
  0 siblings, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-06-28  0:39 UTC (permalink / raw)
  To: Nathan Scott
  Cc: David Chinner, Andreas Dilger, xfs-oss, Amit K. Arora,
	linux-fsdevel, linux-ext4, suparna, cmm

On Thu, Jun 28, 2007 at 09:28:36AM +1000, Nathan Scott wrote:
> On Wed, 2007-06-27 at 23:36 +1000, David Chinner wrote:
> > .... Allows setup_swap_extents() to use preallocated files on XFS
> > filesystems for swap files without ever needing to convert them.
> 
> Using unwritten extents (as opposed to the MKSWAP flag mentioned
> earlier) has the unfortunate down side of requiring transactions,
> possibly additional IO, and memory allocation during swap.  (but,
> this patch should probably go in regardless, as teaching generic
> code about unwritten extents is not a bad idea).

I don't think it does - swapfile I/O looks like it goes direct to
bio without passing through the filesystem.  When the swapfile is
mapped, it scans and records the extent map of the entire swapfile
in a separate structure and AFAICT the swap code uses that built map
without touching the filesystem at all.

If that is true then the written/unwritten state of the extents is
irrelevant; all we need is allocated disk space for the file and
swapping should work. And it's not like anyone should be reading
the contents of that swapfile through the filesystem, either. ;)
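
The "scan once, then use the recorded map" idea can be sketched in userspace
as follows. This is a simplified model, not the kernel's swap code: the
toy_* names are hypothetical, one disk block per page is assumed, and a disk
block number of 0 is taken to mean a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of swap pages backed by contiguous disk blocks. */
struct toy_swap_extent {
	uint64_t start_page;
	uint64_t nr_pages;
	uint64_t start_block;
};

/*
 * Scan the file's block map once (phys[i] is the disk block of page i)
 * and coalesce contiguous runs into extents. Returns the number of
 * extents, or -1 if the file has a hole or ext[] is too small.
 */
int toy_build_swap_map(const uint64_t *phys, size_t nr_pages,
		       struct toy_swap_extent *ext, size_t max_ext)
{
	size_t n = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (phys[i] == 0)
			return -1;	/* swap files must not have holes */
		if (n > 0 &&
		    ext[n - 1].start_block + ext[n - 1].nr_pages == phys[i]) {
			ext[n - 1].nr_pages++;	/* extends the previous run */
			continue;
		}
		if (n == max_ext)
			return -1;
		ext[n].start_page = i;
		ext[n].nr_pages = 1;
		ext[n].start_block = phys[i];
		n++;
	}
	return (int)n;
}

/* Swap I/O lookup: consults only the recorded map, never the filesystem. */
uint64_t toy_swap_page_to_block(const struct toy_swap_extent *ext, size_t n,
				uint64_t page)
{
	for (size_t i = 0; i < n; i++)
		if (page >= ext[i].start_page &&
		    page < ext[i].start_page + ext[i].nr_pages)
			return ext[i].start_block + (page - ext[i].start_page);
	assert(!"page outside swap map");
	return 0;
}
```

Nothing in the lookup depends on whether an extent was written or unwritten,
only on it having disk blocks, which is the argument made above.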

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-28  0:39                                                                       ` David Chinner
@ 2007-06-28  0:53                                                                         ` Nathan Scott
  0 siblings, 0 replies; 340+ messages in thread
From: Nathan Scott @ 2007-06-28  0:53 UTC (permalink / raw)
  To: David Chinner
  Cc: Andreas Dilger, xfs-oss, Amit K. Arora, linux-fsdevel,
	linux-ext4, suparna, cmm

On Thu, 2007-06-28 at 10:39 +1000, David Chinner wrote:
> 
> 
> I don't think it does - swapfile I/O looks like it goes direct to
> bio without passing through the filesystem.  When the swapfile is
> mapped, it scans and records the extent map of the entire swapfile
> in a separate structure and AFAICT the swap code uses that built map
> without touching the filesystem at all.
> 
> If that is true then the written/unwritten state of the extents is
> irrelevant; all we need is allocated disk space for the file and
> swapping should work. And it's not like anyone should be reading
> the contents of that swapfile through the filesystem, either. ;)

Ah, yes, good point - that's true.  Unwritten extents are ideal for
this then, as attempts to read swap via the regular interfaces will
return zeros instead of random swapped out memory contents.

cheers.


* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-25 13:28                                                         ` Amit K. Arora
                                                                           ` (8 preceding siblings ...)
  (?)
@ 2007-06-28  9:55                                                         ` Andrew Morton
  2007-06-28 17:36                                                           ` Mingming Cao
  2007-06-28 17:57                                                           ` Amit K. Arora
  -1 siblings, 2 replies; 340+ messages in thread
From: Andrew Morton @ 2007-06-28  9:55 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> N O T E: 
> -------
> 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
>    of ext4 patch queue git tree hosted by Ted.

Why the heck are replacements for these things being sent out again when
they're already in -mm and they're already in Ted's queue (from which I
need to diligently drop them each time I remerge)?

Are we all supposed to re-review the entire patchset (or at least #4 and
#7) again?

The core kernel changes are not appropriate to the ext4 tree.

For a start, the syscall numbers in Ted's queue are wrong (other new
syscalls are pending).

Patches which add syscalls are an utter PITA to carry due to all the patch
conflicts and to the relatively frequent syscall renumbering (they don't
get numbered in time-of-arrival order due to differing rates at which patches
mature).

Please drop the non-ext4 patches from the ext4 tree and send incremental
patches against the (non-ext4) fallocate patches in -mm.

And try to get the code finished?  Time is pressing.

Thanks.


* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28  9:55                                                         ` Andrew Morton
@ 2007-06-28 17:36                                                           ` Mingming Cao
  2007-06-28 17:57                                                           ` Amit K. Arora
  1 sibling, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-06-28 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, Andreas Dilger, suparna, xfs

On Thu, 2007-06-28 at 02:55 -0700, Andrew Morton wrote:

> Please drop the non-ext4 patches from the ext4 tree and send incremental
> patches against the (non-ext4) fallocate patches in -mm.
> 
The ext4 fallocate() patches are dependent on the core fallocate()
patches, so ext4 patch-queue and git tree won't compile (it's not based
on mm tree) without the core changes.

We can send ext4 fallocate patches (incremental patches against mm tree)
and drop the full fallocate patches (ext4 and non-ext4 parts) from the ext4
patch queue if you prefer this way.

> And try to get the code finished?  Time is pressing.
> 
I looked at the mm tree; there are other ext4 features/changes that are
currently in the ext4 patch queue (not the ext4 git tree) that are not part
of the ext4 series yet. Ted, can you merge those patches to your git tree?
Thanks!


Thanks for your patience.

Mingming.



* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28  9:55                                                         ` Andrew Morton
  2007-06-28 17:36                                                           ` Mingming Cao
@ 2007-06-28 17:57                                                           ` Amit K. Arora
  2007-06-28 18:33                                                             ` Andrew Morton
  2007-06-28 20:34                                                             ` [PATCH 0/6][TAKE5] fallocate system call Andreas Dilger
  1 sibling, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-28 17:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> On Mon, 25 Jun 2007 18:58:10 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > N O T E: 
> > -------
> > 1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
> >    of ext4 patch queue git tree hosted by Ted.
> 
> Why the heck are replacements for these things being sent out again when
> they're already in -mm and they're already in Ted's queue (from which I
> need to diligently drop them each time I remerge)?
> 
> Are we all supposed to re-review the entire patchset (or at least #4 and
> #7) again?

As I mentioned in the note above, only patches #4 and #7 were new and
thus these needed to be reviewed. Other patches are _not_ replacements
of any of the patches which are already part of -mm and/or in Ted's
patch queue. They were posted again as just "placeholders" so that the
two new patches (#4 & #7) could be reviewed. Sorry for any confusion.
 
> Please drop the non-ext4 patches from the ext4 tree and send incremental
> patches against the (non-ext4) fallocate patches in -mm.

Please let us know what you think of Mingming's suggestion of posting
all the fallocate patches including the ext4 ones as incremental ones
against the -mm.

--
Regards,
Amit Arora


* Re: [PATCH 7/7][TAKE5] ext4: support new modes
  2007-06-27  0:04                                                                   ` David Chinner
@ 2007-06-28 18:07                                                                     ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-28 18:07 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, xfs

On Wed, Jun 27, 2007 at 10:04:56AM +1000, David Chinner wrote:
> On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> > On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > > On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
> > > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > > update only ctime. Otherwise, update ctime and mtime both.
> > > 
> > > I'm only being the advocate for requirements David Chinner has put
> > > forward due to existing behaviour in XFS.  This is one of the reasons
> > > why I think the "flags" mechanism we now have - we can encode the
> > > various different behaviours in any way we want and leave it to the
> > > caller.
> > 
> > I understand. May be we can confirm once more with David Chinner if this
> > is really required. Will it really be a compatibility issue if new XFS
> > preallocations (ie. via fallocate) update mtime/ctime?
> 
> It should be left up to the filesystem to decide. Only the
> filesystem knows whether something changed and the timestamp should
> or should not be updated.

Since Andreas had suggested the FA_FL_NO_MTIME flag thinking it was a
requirement from XFS (whereas XFS does not need this flag), I don't think
we need to add this new flag.

Please let me know if someone still feels the FA_FL_NO_MTIME flag can be
useful.

--
Regards,
Amit Arora



* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 23:18                                                                   ` David Chinner
@ 2007-06-28 18:19                                                                     ` Amit K. Arora
  2007-06-28 23:39                                                                       ` Nathan Scott
  2007-06-29  1:03                                                                       ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-06-28 18:19 UTC (permalink / raw)
  To: David Chinner; +Cc: linux-fsdevel, linux-kernel, linux-ext4, suparna, cmm, xfs

On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
> On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > > 
> > > Currently it is left on the file system implementation. In ext4, we do
> > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > end up with partial (pre)allocation. This is inline with dd and
> > > posix_fallocate, which also do not free the partially allocated space.
> > 
> > Since I believe the XFS allocation ioctls do it the opposite way (free
> > preallocated space on error) this should be encoded into the flags.
> > Having it "filesystem dependent" just means that nobody will be happy.
> 
> No, XFs does not free preallocated space on error. it is up to the
> application to clean up.

Since XFS also does not free preallocated space on error and this
behavior is in line with dd, posix_fallocate() and the current ext4
implementation, do we still need FA_FL_FREE_ENOSPC flag ?
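
The "partial preallocation is left in place on error" behaviour can be
sketched as a toy allocator. This is an illustration only: struct toy_fs,
struct toy_inode and the toy_* functions are hypothetical and do not
correspond to ext4 or XFS code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_fs    { size_t free_blocks; };
struct toy_inode { size_t prealloc_blocks; };

/*
 * Preallocate 'want' blocks one at a time. On ENOSPC the blocks already
 * allocated stay with the inode; nothing is rolled back.
 */
int toy_preallocate(struct toy_fs *fs, struct toy_inode *ino, size_t want)
{
	assert(fs && ino);
	while (want > 0) {
		if (fs->free_blocks == 0)
			return -ENOSPC;	/* partial allocation is kept */
		fs->free_blocks--;
		ino->prealloc_blocks++;
		want--;
	}
	return 0;
}

/* Cleanup is the caller's job: give the preallocated blocks back. */
void toy_release(struct toy_fs *fs, struct toy_inode *ino)
{
	assert(fs && ino);
	fs->free_blocks += ino->prealloc_blocks;
	ino->prealloc_blocks = 0;
}
```

An application that cannot use a partial allocation would call the release
step itself after seeing the error, which is the clean-up responsibility
David describes.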
 
> > What I mean is that any data read from the file should have the "appearance"
> > of being zeroed (whether zeroes are actually written to disk or not).  What
> > I _think_ David is proposing is to allow fallocate() to return without
> > marking the blocks even "uninitialized" and subsequent reads would return
> > the old data from the disk.
> 
> Correct, but for swap files that's not an issue - no user should be able
> too read them, and FA_MKSWAP would really need root privileges to execute.

Will the FA_MKSWAP mode still be required once your suggested change of
teaching do_mpage_readpage() about unwritten extents is in place?
Or would you still like to have the FA_MKSWAP mode?

--
Regards,
Amit Arora


* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 17:57                                                           ` Amit K. Arora
@ 2007-06-28 18:33                                                             ` Andrew Morton
  2007-06-28 18:45                                                               ` Dave Kleikamp
                                                                                 ` (3 more replies)
  2007-06-28 20:34                                                             ` [PATCH 0/6][TAKE5] fallocate system call Andreas Dilger
  1 sibling, 4 replies; 340+ messages in thread
From: Andrew Morton @ 2007-06-28 18:33 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, David Chinner,
	Andreas Dilger, suparna, cmm, xfs

On Thu, 28 Jun 2007 23:27:57 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:

> > Please drop the non-ext4 patches from the ext4 tree and send incremental
> > patches against the (non-ext4) fallocate patches in -mm.
> 
> Please let us know what you think of Mingming's suggestion of posting
> all the fallocate patches including the ext4 ones as incremental ones
> against the -mm.

I think Mingming was asking that Ted move the current quilt tree into git,
presumably because she's working off git.

I'm not sure what to do, really.  The core kernel patches need to be in
Ted's tree for testing but that'll create a mess for me.

ug.

Options might be

a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
   any needed build fixes, then just wait for it all to mature and then
   think about it again

b) We do what we normally don't do and reserve the syscall slots in mainline.



* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 18:33                                                             ` Andrew Morton
@ 2007-06-28 18:45                                                               ` Dave Kleikamp
  2007-06-28 18:57                                                               ` Jeff Garzik
                                                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 340+ messages in thread
From: Dave Kleikamp @ 2007-06-28 18:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, Andreas Dilger, suparna, cmm, xfs

On Thu, 2007-06-28 at 11:33 -0700, Andrew Morton wrote:
> On Thu, 28 Jun 2007 23:27:57 +0530 "Amit K. Arora" <aarora@linux.vnet.ibm.com> wrote:
> 
> > > Please drop the non-ext4 patches from the ext4 tree and send incremental
> > > patches against the (non-ext4) fallocate patches in -mm.
> > 
> > Please let us know what you think of Mingming's suggestion of posting
> > all the fallocate patches including the ext4 ones as incremental ones
> > against the -mm.
> 
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.

I moved the fallocate patches to the very end of the series in the quilt
tree.  This way the patches will be in the quilt tree for testing, but
Ted can easily leave them out of the git tree so you and Linus won't
pull them with the ext4 patches.

Fortunately, the ext4-specific fallocate patches don't conflict with the
other patches in the queue, so they can (at least for now) be handled
independently in the -mm tree.

> I'm not sure what to do, really.  The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.
> 
> ug.
> 
> Options might be
> 
> a) I drop the fallocate patches from -mm and from the ext4 tree, hack up
>    any needed build fixes, then just wait for it all to mature and then
>    think about it again
> 
> b) We do what we normally don't do and reserve the syscall slots in mainline.

-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 18:33                                                             ` Andrew Morton
  2007-06-28 18:45                                                               ` Dave Kleikamp
@ 2007-06-28 18:57                                                               ` Jeff Garzik
  2007-06-29  7:20                                                               ` Christoph Hellwig
  2007-06-29 13:56                                                               ` Theodore Tso
  3 siblings, 0 replies; 340+ messages in thread
From: Jeff Garzik @ 2007-06-28 18:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, Andreas Dilger, suparna, cmm, xfs

Andrew Morton wrote:
> b) We do what we normally don't do and reserve the syscall slots in mainline.

If everyone agrees it's going to happen... why not?

	Jeff




* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 17:57                                                           ` Amit K. Arora
  2007-06-28 18:33                                                             ` Andrew Morton
@ 2007-06-28 20:34                                                             ` Andreas Dilger
  1 sibling, 0 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-06-28 20:34 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Jun 28, 2007  23:27 +0530, Amit K. Arora wrote:
> On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> > Are we all supposed to re-review the entire patchset (or at least #4 and
> > #7) again?
> 
> As I mentioned in the note above, only patches #4 and #7 were new and
> thus these needed to be reviewed. Other patches are _not_ replacements
> of any of the patches which are already part of -mm and/or in Ted's
> patch queue. They were posted again as just "placeholders" so that the
> two new patches (#4 & #7) could be reviewed. Sorry for any confusion.

The new patches are definitely a big improvement over the previous API,
and need to go in before fallocate() goes into mainline.  This last set
of changes allows the behaviour of these syscalls to accommodate the various
different semantics desired by XFS in a sensible manner instead of tying
all of the individual behaviours (time update, size update, alloc/free, etc)
into monolithic modes that will never make everyone happy.

My understanding is that you only need to grab #4 and #7 to get fallocate
in your tree in sync with the ext4 patch queue (i.e. they are
incremental over the previous set).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-28 18:19                                                                     ` Amit K. Arora
@ 2007-06-28 23:39                                                                       ` Nathan Scott
  2007-06-29  1:03                                                                       ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: Nathan Scott @ 2007-06-28 23:39 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: David Chinner, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, xfs

On Thu, 2007-06-28 at 23:49 +0530, Amit K. Arora wrote:
> 
> > Correct, but for swap files that's not an issue - no user should be able
> > to read them, and FA_MKSWAP would really need root privileges to execute.
> 
> Will the FA_MKSWAP mode still be required with your suggested change of
> teaching do_mpage_readpage() about unwritten extents being in place ?
> Or, will you still like to have FA_MKSWAP mode ?

There's no need for a MKSWAP flag.

cheers.

--
Nathan


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-28 18:19                                                                     ` Amit K. Arora
  2007-06-28 23:39                                                                       ` Nathan Scott
@ 2007-06-29  1:03                                                                       ` David Chinner
  1 sibling, 0 replies; 340+ messages in thread
From: David Chinner @ 2007-06-29  1:03 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: David Chinner, linux-fsdevel, linux-kernel, linux-ext4, suparna,
	cmm, xfs

On Thu, Jun 28, 2007 at 11:49:13PM +0530, Amit K. Arora wrote:
> On Wed, Jun 27, 2007 at 09:18:04AM +1000, David Chinner wrote:
> > On Tue, Jun 26, 2007 at 11:34:13AM -0400, Andreas Dilger wrote:
> > > On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> > > > On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > > > 
> > > > Currently it is left on the file system implementation. In ext4, we do
> > > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > > end up with partial (pre)allocation. This is inline with dd and
> > > > posix_fallocate, which also do not free the partially allocated space.
> > > 
> > > Since I believe the XFS allocation ioctls do it the opposite way (free
> > > preallocated space on error) this should be encoded into the flags.
> > > Having it "filesystem dependent" just means that nobody will be happy.
> > 
> > No, XFS does not free preallocated space on error. It is up to the
> > application to clean up.
> 
> Since XFS also does not free preallocated space on error and this
> behavior is inline with dd, posix_fallocate() and the current ext4
> implementation, do we still need FA_FL_FREE_ENOSPC flag ?

Not at the moment.

> > > What I mean is that any data read from the file should have the "appearance"
> > > of being zeroed (whether zeroes are actually written to disk or not).  What
> > > I _think_ David is proposing is to allow fallocate() to return without
> > > marking the blocks even "uninitialized" and subsequent reads would return
> > > the old data from the disk.
> > 
> > Correct, but for swap files that's not an issue - no user should be able
> > to read them, and FA_MKSWAP would really need root privileges to execute.
> 
> Will the FA_MKSWAP mode still be required with your suggested change of
> teaching do_mpage_readpage() about unwritten extents being in place ?
> Or, will you still like to have FA_MKSWAP mode ?

budgie:/mnt/test # xfs_io -f -c "resvsp 0 1048576" -c "truncate 1048576" swap_file
budgie:/mnt/test # mkswap swap_file
Setting up swapspace version 1, size = 1032 kB
budgie:/mnt/test # swapon -v swap_file
swapon on swap_file
budgie:/mnt/test # swapon -s
Filename                                Type            Size    Used    Priority
/dev/sda2                               partition       9437152 0       -1
/mnt/test/swap_file                     file            992     0       -2
budgie:/mnt/test # xfs_bmap -vvp swap_file
swap_file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..31]:         96..127           0 (96..127)           32
   1: [32..2047]:      128..2143         0 (128..2143)       2016 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

Looks like the changes work, so FA_MKSWAP is not necessary for XFS.
We can drop that for the moment unless anyone else sees a need for it.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 18:33                                                             ` Andrew Morton
  2007-06-28 18:45                                                               ` Dave Kleikamp
  2007-06-28 18:57                                                               ` Jeff Garzik
@ 2007-06-29  7:20                                                               ` Christoph Hellwig
  2007-06-29 13:56                                                               ` Theodore Tso
  3 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-06-29  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, Andreas Dilger, suparna, cmm, xfs

On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.
> 
> I'm not sure what to do, really.  The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.

Could we please stop this stupid ext4-centrism?  XFS is ready so we can
put in the syscalls backed by XFS.  We have already done this with the xattr
syscalls in 2.4, btw.

Then again I don't think we should put it in quite yet, because this thread
has degraded into creeping featurism, please give me some more time to
prepare a semi-coherent rant about this.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-28 18:33                                                             ` Andrew Morton
                                                                                 ` (2 preceding siblings ...)
  2007-06-29  7:20                                                               ` Christoph Hellwig
@ 2007-06-29 13:56                                                               ` Theodore Tso
  2007-06-29 14:29                                                                 ` Jeff Garzik
  2007-06-29 15:50                                                                 ` Mingming Caoc
  3 siblings, 2 replies; 340+ messages in thread
From: Theodore Tso @ 2007-06-29 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amit K. Arora, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, Andreas Dilger, suparna, cmm, xfs

On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
> > Please let us know what you think of Mingming's suggestion of posting
> > all the fallocate patches including the ext4 ones as incremental ones
> > against the -mm.
> 
> I think Mingming was asking that Ted move the current quilt tree into git,
> presumably because she's working off git.

No, mingming and I both work off of the patch queue (which is also
stored in git).  So what mingming was asking for exactly was just
posting the incremental patches and tagging them appropriately to
avoid confusion.

I tried building the patch queue earlier in the week and there were
multiple oops/panics as I ran things through various regression tests,
but that may have been fixed since (the tree was broken over the
weekend and I may have grabbed a broken patch series) or it may have
been a screw up on my part feeding them into our testing grid.  I
haven't had time to try again this week, but I'll try to put together
a new tested ext4 patchset over the weekend.

> I'm not sure what to do, really.  The core kernel patches need to be in
> Ted's tree for testing but that'll create a mess for me.

I don't think we have a problem here.  What we have now is fine, and
it was just people kvetching that Amit reposted patches that were
already in -mm and ext4.

In any case, the plan is to push all of the core bits into Linus tree
for 2.6.22 once it opens up, which should be Real Soon Now, it looks
like.

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-29 13:56                                                               ` Theodore Tso
@ 2007-06-29 14:29                                                                 ` Jeff Garzik
  2007-06-29 17:42                                                                   ` Theodore Tso
  2007-06-29 15:50                                                                 ` Mingming Caoc
  1 sibling, 1 reply; 340+ messages in thread
From: Jeff Garzik @ 2007-06-29 14:29 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, David Chinner, Andreas Dilger, suparna,
	cmm, xfs

Theodore Tso wrote:
> I don't think we have a problem here.  What we have now is fine, and

It's fine for ext4, but not the wider world.  This is a common problem 
created by parallel development when code dependencies exist.


> In any case, the plan is to push all of the core bits into Linus tree
> for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> like.

Presumably you mean 2.6.23.

	Jeff




^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-29 13:56                                                               ` Theodore Tso
  2007-06-29 14:29                                                                 ` Jeff Garzik
@ 2007-06-29 15:50                                                                 ` Mingming Caoc
  2007-06-29 20:57                                                                   ` Andrew Morton
  1 sibling, 1 reply; 340+ messages in thread
From: Mingming Caoc @ 2007-06-29 15:50 UTC (permalink / raw)
  To: Theodore Tso, Andrew Morton, Amit K. Arora, linux-fsdevel,
	linux-kernel, linux-ext4, David Chinner, Andreas Dilger, suparna,
	cmm, xfs

Theodore Tso wrote:
> On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
>   
>>> Please let us know what you think of Mingming's suggestion of posting
>>> all the fallocate patches including the ext4 ones as incremental ones
>>> against the -mm.
>>>       
>> I think Mingming was asking that Ted move the current quilt tree into git,
>> presumably because she's working off git.
>>     
>
> No, mingming and I both work off of the patch queue (which is also
> stored in git).  So what mingming was asking for exactly was just
> posting the incremental patches and tagging them appropriately to
> avoid confusion.
>
> I tried building the patch queue earlier in the week and there were
> multiple oops/panics as I ran things through various regression tests,
> but that may have been fixed since (the tree was broken over the
> weekend and I may have grabbed a broken patch series) or it may have
> been a screw up on my part feeding them into our testing grid.  I
> haven't had time to try again this week, but I'll try to put together
> a new tested ext4 patchset over the weekend.
>
>   
I think the ext4 patch queue is in good shape now.  Shaggy has tested
it with dbench, fsx, and tiobench, and the tests run fine.  The BULL team
has benchmarked the latest ext4 patch queue with iozone and FFSB.

Regards,
Mingming
>> I'm not sure what to do, really.  The core kernel patches need to be in
>> Ted's tree for testing but that'll create a mess for me.
>>     
>
> I don't think we have a problem here.  What we have now is fine, and
> it was just people kvetching that Amit reposted patches that were
> already in -mm and ext4.
>
> In any case, the plan is to push all of the core bits into Linus tree
> for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> like.
>
> 						- Ted



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-29 14:29                                                                 ` Jeff Garzik
@ 2007-06-29 17:42                                                                   ` Theodore Tso
  0 siblings, 0 replies; 340+ messages in thread
From: Theodore Tso @ 2007-06-29 17:42 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Andrew Morton, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, David Chinner, Andreas Dilger, suparna, cmm, xfs

On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote:
> >In any case, the plan is to push all of the core bits into Linus tree
> >for 2.6.22 once it opens up, which should be Real Soon Now, it looks
> >like.
> 
> Presumably you mean 2.6.23.

Yes, sorry.  I meant once Linus releases 2.6.22, and we would be
aiming to merge before the 2.6.23-rc1 window.

						- Ted

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 0/6][TAKE5] fallocate system call
  2007-06-29 15:50                                                                 ` Mingming Caoc
@ 2007-06-29 20:57                                                                   ` Andrew Morton
  2007-07-01  7:35                                                                     ` Ext4 patches for 2.6.22-rc6 Mingming Cao
  0 siblings, 1 reply; 340+ messages in thread
From: Andrew Morton @ 2007-06-29 20:57 UTC (permalink / raw)
  To: Mingming Caoc
  Cc: Theodore Tso, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, David Chinner, Andreas Dilger, suparna, xfs

On Fri, 29 Jun 2007 11:50:04 -0400
Mingming Caoc <cmm@us.ibm.com> wrote:

> I think the ext4 patch queue is in good shape now.

Which ext4 patches are you intending to merge into 2.6.23?

Please send all those out to lkml for review?

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc
  2007-06-14  9:14                                                 ` Andreas Dilger
  2007-06-14 12:04                                                   ` David Chinner
@ 2007-06-30 10:14                                                   ` Christoph Hellwig
  1 sibling, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-06-30 10:14 UTC (permalink / raw)
  To: David Chinner, Amit K. Arora, Suparna Bhattacharya, torvalds,
	akpm, linux-fsdevel, linux-kernel, linux-ext4, xfs, cmm

On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> I suppose it might be a bit late in the game to add a "goal"
> parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
> the API more suitable for XFS?  The goal could be a single __u64, or
> a struct with e.g. __u64 byte offset (possibly also __u32 lun like
> in FIEMAP).  I guess the one potential limitation here is the
> number of function parameters on some architectures.

This isn't really about "more suitable for XFS" but more about more
suitable for sophisticated layout decisions.

But I'm still not confident this should be shoehorned into this
syscall.  In fact I'm already rather unhappy about the feature churn in
the current patch series.

The more I think about it the more I'd prefer we would just put a simple
syscall in that implements nothing but the posix_fallocate(3) semantics
as defined in SuS, and then go on to brainstorm about advanced
preallocation / layout hint semantics.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-26 10:32                                                               ` Amit K. Arora
  2007-06-26 15:34                                                                 ` Andreas Dilger
@ 2007-06-30 10:21                                                                 ` Christoph Hellwig
  2007-06-30 16:52                                                                   ` Andreas Dilger
  2007-07-01 22:55                                                                   ` David Chinner
  1 sibling, 2 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-06-30 10:21 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Andreas Dilger, linux-fsdevel, linux-kernel, linux-ext4,
	David Chinner, suparna, cmm, xfs

On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit?  Does it keep the current fallocate() or does it free it?
> 
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.

I can't find anything in the specification of posix_fallocate
(http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
that tells what should happen to allocated blocks on error.

But common sense would be to not leak disk space on failure of this
syscall, and this definitively should not be left up to the filesystem,
either we always leak it or always free it, and I'd strongly favour
the latter variant.

> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > sure if this makes sense at all.
> 
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.
> In ext4, for example, we will have a way to mark uninitialized extents.
> All the preallocated blocks will be part of these uninitialized extents.
> And any read on these extents will treat them as a hole, returning
> zeroes to user land. Thus any existing data on uninitialized blocks will
> not be exposed to the userspace.

This is the xfs unwritten extent behaviour.  But anyway, the important bit
is uninitialized blocks should never ever leak to userspace, so there is
no need for the flag.
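
The requirement above (preallocated but unwritten blocks must read back as
zeroes) can be checked from userspace. The following is only a minimal
sketch using posix_fallocate(3) and pread(2), not the proposed syscall; the
helper name prealloc_reads_zero() is made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Preallocate len bytes with posix_fallocate(3), then read the range back.
 * Returns 1 if every byte reads as zero, 0 if non-zero data shows up,
 * and -1 on any allocation or I/O error.
 * (Hypothetical helper, not part of any patch in this thread.)
 */
static int prealloc_reads_zero(int fd, off_t len)
{
	char buf[4096];
	off_t off = 0;

	assert(len > 0);
	if (posix_fallocate(fd, 0, len) != 0)
		return -1;

	while (off < len) {
		size_t want = sizeof(buf);
		ssize_t got, i;

		if ((off_t)want > len - off)
			want = (size_t)(len - off);
		got = pread(fd, buf, want, off);
		if (got <= 0)
			return -1;
		for (i = 0; i < got; i++)
			if (buf[i] != 0)
				return 0;
		off += got;
	}
	return 1;
}
```

On a filesystem with unwritten extents the zeroes come from the extent state;
elsewhere glibc may fall back to actually writing the blocks, so the check
should pass either way.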

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-27 13:36                                                                   ` David Chinner
  2007-06-27 23:28                                                                     ` Nathan Scott
@ 2007-06-30 10:26                                                                     ` Christoph Hellwig
  1 sibling, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-06-30 10:26 UTC (permalink / raw)
  To: David Chinner
  Cc: xfs-oss, Amit K. Arora, linux-fsdevel, linux-ext4, suparna, cmm

On Wed, Jun 27, 2007 at 11:36:57PM +1000, David Chinner wrote:
> > This
> > would seem to be the only impediment from using fallocated files
> > for swap files.  Maybe if FIEMAP was used by mkswap to get an
> > "UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem.
> 
> Probably. If we taught do_mpage_readpage() about unwritten mappings,
> then we could map them on read and then sys_swapon can remain
> blissfully unaware of unwritten extents.

Except for reading the swap header in the first page sys_swapon will
never end up in do_mpage_readpage.  It rather uses ->bmap to build
its own extent list and issues bios directly.

Now this is everything but nice and we should rather refactor the direct
I/O code to work on kernel pages without looking at their fields so this
can be done properly.  Alternatively ->bmap would grow a BMAP_SWAP flag
so the filesystem could do the right thing.

But despite not being useful for swap the patch below looks very nice
to me.  Doing things correctly in core code is always better than hacking
around it in the filesystem, especially as XFS won't stay the only filesystem
using unwritten extents.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-30 10:21                                                                 ` Christoph Hellwig
@ 2007-06-30 16:52                                                                   ` Andreas Dilger
  2007-07-03 10:08                                                                     ` Amit K. Arora
  2007-07-01 22:55                                                                   ` David Chinner
  1 sibling, 1 reply; 340+ messages in thread
From: Andreas Dilger @ 2007-06-30 16:52 UTC (permalink / raw)
  To: Christoph Hellwig, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, David Chinner, suparna, cmm, xfs

On Jun 30, 2007  11:21 +0100, Christoph Hellwig wrote:
> On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
> 
> I can't find anything in the specification of posix_fallocate
> (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> that tells what should happen to allocated blocks on error.
> 
> But common sense would be to not leak disk space on failure of this
> syscall, and this definitively should not be left up to the filesystem,
> either we always leak it or always free it, and I'd strongly favour
> the latter variant.

I definitely agree that the behaviour should be specified as part of
the interface.  The current behaviour of both ext4 and XFS is that the
successful part of the unallocated extent is left in place when returning
ENOSPC so we considered this the "consistent" behaviour.  This is the same
as e.g. sys_write() which does not remove the part of the write that was
successful if ENOSPC is hit.  I think this also makes sense for some use
cases, because an application like a PVR may want to preallocate approximately
30min of space, but if it gets only 25min worth then it can at least start
using this while it also begins looking for and/or freeing old files.

If the space is always freed on ENOSPC, then there may be a significant
amount of work done and undone while the application is iterating over
possible sizes until one works.   It is easy for the application to
use fstat() to see the blocks/size actually preallocated on failure, and
explicitly request unallocation of this space if the outcome is undesirable.
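
As a rough illustration of that userspace pattern, here is a sketch built on
posix_fallocate(3) and fstat(2) rather than the proposed syscall. The helper
name prealloc_report() is hypothetical, and the 512-byte unit for st_blocks
is the Linux convention.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Try to preallocate len bytes starting at offset 0.
 * Returns the posix_fallocate(3) error code (0 on success, e.g. ENOSPC
 * on failure) and stores in *allocated how much disk space the file
 * occupies afterwards, so the caller can decide what to do with a
 * partial allocation.
 * (Hypothetical helper, not part of any patch in this thread.)
 */
static int prealloc_report(int fd, off_t len, off_t *allocated)
{
	struct stat st;
	int err;

	assert(len > 0 && allocated != NULL);
	err = posix_fallocate(fd, 0, len);	/* returns an errno value, not -1 */
	if (fstat(fd, &st) != 0)
		return errno;
	*allocated = (off_t)st.st_blocks * 512;	/* 512-byte units on Linux */
	return err;
}
```

A PVR-style caller that gets ENOSPC could start recording into the space
reported in *allocated, while a caller that needs all or nothing could
ftruncate() the file back to its previous size to give the space up.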

If you think that applications have a strong preference for both kinds
of behaviour (e.g. database which requires the full allocation to succeed,
unlike PVR application above) then this could be encoded into a @mode flag.

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > > sure if this makes sense at all.
> 
> This is the xfs unwritten extent behaviour.  But anyway, the important bit
> is uninitialized blocks should never ever leak to userspace, so there is
> no need for the flag.

I agree that we shouldn't need FA_ZERO_SPACE.  If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
or similar.

> The more I think about it the more I'd prefer we would just put a simple
> syscall in that implements nothing but the posix_fallocate(3) semantics
> as defined in SuS, and then go on to brainstorm about advanced
> preallocation / layout hint semantics.

I don't think the current @mode flags introduce any significant complexity
in the implementation, and in fact one of the reasons these came up in the
first place was because David pointed out the XFS behaviour did NOT match
with posix_fallocate() and we started getting strange semantics enforced
by monolithic modes.  IMHO, coding for and understanding the semantics of
the monolithic modes is much more complex and less useful than the explicit
flags.

The @mode flags that are currently under consideration are (AFAIK):

FA_FL_DEALLOC	0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range (default keep) */

Your concern about leaking space would imply:

FA_FL_ERR_FREE	0x08 /* free preallocation on error (default keep prealloc) */

The other possible flags that were proposed, to avoid confusing backup and
HSM applications when preallocated space is added or removed from a file
(you don't want a backup app to re-backup a file that was migrated via HSM):

FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
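
To show how these would compose as a bitmask, here is a small sketch. The
flag names and values are the ones listed above and are only a proposal, not
a released ABI; the FA_FL_ALL mask and both helpers are assumptions added
for illustration.

```c
#include <assert.h>

/* Flag values as listed in this mail (proposal only). */
#define FA_FL_DEALLOC	0x01 /* deallocate unwritten extent */
#define FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc */
#define FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range */
#define FA_FL_ERR_FREE	0x08 /* free preallocation on error */
#define FA_FL_NO_MTIME	0x10 /* keep same mtime */
#define FA_FL_NO_CTIME	0x20 /* keep same ctime */

/* Hypothetical mask of every flag known to this sketch. */
#define FA_FL_ALL	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA | \
			 FA_FL_ERR_FREE | FA_FL_NO_MTIME | FA_FL_NO_CTIME)

/* Hypothetical check: reject any bit that is not a known flag. */
static int fa_mode_valid(unsigned int mode)
{
	return (mode & ~FA_FL_ALL) == 0;
}

/*
 * mode == 0 selects every default named in the comments above:
 * allocate, change size, keep data, keep prealloc on error,
 * update mtime/ctime.
 */
static int fa_mode_is_default(unsigned int mode)
{
	return mode == 0;
}
```

With independent bits, a caller picks each behaviour separately, e.g.
FA_FL_KEEP_SIZE | FA_FL_NO_MTIME, instead of choosing one monolithic mode.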

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Ext4 patches for 2.6.22-rc6
  2007-06-29 20:57                                                                   ` Andrew Morton
@ 2007-07-01  7:35                                                                     ` Mingming Cao
  0 siblings, 0 replies; 340+ messages in thread
From: Mingming Cao @ 2007-07-01  7:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Theodore Tso, linux-fsdevel, linux-kernel, linux-ext4

On Fri, 2007-06-29 at 13:57 -0700, Andrew Morton wrote:
> On Fri, 29 Jun 2007 11:50:04 -0400
> Mingming Caoc <cmm@us.ibm.com> wrote:
> 
> > I think the ext4 patch queue is in good shape now.
> 
> Which ext4 patches are you intending to merge into 2.6.23?
> 
> Please send all those out to lkml for review?

Hi Andrew, 

Here are the patches in ext4-patch-queue that I think can be considered
to be merged to upstream. Please review.

All of the patches have been posted on the ext4 mailing list before. Some are
bug fixes, some are features. To summarize:
- make extents on by default in ext4dev
- nanosecond timestamp
- 64 bit inode versioning support
- remove 32k subdir limits
- journal  checksumming
- journal stats via procfs
- delayed allocation for ext4 writeback mode
- fallocate()

All the patches can be found at http://repo.or.cz/w/ext4-patch-queue.git
and have been tested (with fsx, dbench, FFSB, iozone) on
x86, x86_64, and ppc64, with extents and delayed allocation enabled.

And the full series can be found at
http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=series;h=2f43431db28778ce8d2149bce7a51566a2d2517c;hb=56e27e20cf228b32f5162a76b3bad154d1d3b730

I will post the patches-in-good-shape (in 9 sets of patches) to lkml in
the following emails, except for the bottom two features:

*the fallocate() patches, which Amit just posted a few days ago and are
under review (hopefully we can reach an agreement on the interface and
the "modes" before the 2.6.23-rc1 window closes).

*Another one is the delayed allocation patches in ext4 patch queue. Alex
mentioned in another email that he is working on another version of
delalloc that can handle block size < page size, and move some work to
vfs. So it's probably not very useful to post this version for people to
review.


So, here is the series file.

# Rebased the patches to 2.6.22-rc6

# Add mount option to turn off extents
ext4_noextent_mount_opt.patch

# Mounted ext4dev fs with extents by default for testing purpose,
# for Ext4 product release, extents mount option
# will be turn on only if the fs has EXTENTS feature on
ext4_extents_on_by_default.patch

# Propagate inode flags
ext4-propagate_flags.patch

# Add extent sanity checks
ext4-extent-sanity-checks.patch

# Bug fix:set 64bit JBD2 feature on >32bit ext4 fs
ext4_set_jbd2_64bit_feature.patch

# Fix: Rename CONFIG_JBD_DEBUG to CONFIG_JBD2_DEBUG
jbd2_config_jbd2_debug_fix.patch

# Export jbd2-debug via debugfs
ext4_CONFIG_JBD2_DEBUG.patch
jbd2_move_jbd2_debug_to_debugfs.patch

# Nanosecond timestamp support
ext4-nanosecond-patch

# inode verion patch series
# inode versioning is needed for NFSv4

# vfs changes, 64 bit inode->i_version
64-bit-i_version.patch
# reserve hi 32 bit inode version on ext4 on-disk inode
i_version_hi.patch
# ext4 inode version read/store
ext4_i_version_hi_2.patch
# ext4 inode version update
i_version_update_ext4.patch
# add a noversion mount option to disable inode version updates
ext4_no_version.patch

# New patch to expand inode i_extra_isize to support features
# in high part of inode (>128 bytes)
ext4_expand_inode_extra_isize.patch

# Export jbd stats through procfs
# Shall this move to debugfs?
jbd-stats-through-procfs

# Remove 32000 subdirs limit. 
ext4_remove_subdirs_limit.patch

# Add journal checksums
ext4-journal_chksum-2.6.20.patch

# Various Cleanups
ext4-zero_user_page.patch
is_power_of_2-ext4-superc.patch
ext4-remove-extra-is_rdonly-check.patch
ext4_extent_compilation_fixes.patch
ext4_extent_macros_cleanup.patch



^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-30 10:21                                                                 ` Christoph Hellwig
  2007-06-30 16:52                                                                   ` Andreas Dilger
@ 2007-07-01 22:55                                                                   ` David Chinner
  2007-07-02 11:47                                                                     ` Amit K. Arora
  2007-07-11  9:05                                                                     ` Christoph Hellwig
  1 sibling, 2 replies; 340+ messages in thread
From: David Chinner @ 2007-07-01 22:55 UTC (permalink / raw)
  To: Christoph Hellwig, Amit K. Arora, Andreas Dilger, linux-fsdevel,
	linux-kernel, linux-ext4, David Chinner, suparna, cmm, xfs

On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
> On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > 
> > Currently it is left to the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is in line with dd and
> > posix_fallocate, which also do not free the partially allocated space.
> 
> I can't find anything in the specification of posix_fallocate
> (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> that tells what should happen to allocated blocks on error.

Yeah, and AFAICT glibc leaves them behind ATM.

> But common sense would be to not leak disk space on failure of this
> syscall, and this definitively should not be left up to the filesystem,
> either we always leak it or always free it, and I'd strongly favour
> the latter variant.

We can't simply walk the range and remove unwritten extents, as some
of them may have been present before the fallocate() call. That
makes it extremely difficult to undo a failed call and not remove
more pre-existing pre-allocations.

Given the current behaviour for posix_fallocate() in glibc, I think
that retaining the same error semantic and punting the cleanup to
userspace (where the app will fail with ENOSPC anyway) is the only
sane thing we can do here. Trying to undo this in the kernel leads
to lots of extra rarely used code in error handling paths...
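
A minimal userspace sketch of that cleanup, assuming the application only
cares about space past the file's original size. The helper name
prealloc_or_cleanup() is hypothetical; posix_fallocate(), fstat() and
ftruncate() are the standard POSIX calls. Truncating back to the old size
cannot tell old preallocation past EOF from new, and it does nothing for
blocks that were allocated inside holes below the old EOF.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from any patch in this thread): try to
 * preallocate [offset, offset + len) and, if that fails, truncate the
 * file back to the size it had before the call.
 * Returns 0 on success or a positive errno value, like posix_fallocate().
 */
static int prealloc_or_cleanup(int fd, off_t offset, off_t len)
{
	struct stat st;
	int err;

	/* A negative offset or empty range is a caller bug in this sketch. */
	assert(offset >= 0 && len > 0);

	/* Remember the original size so we know where to cut on failure. */
	if (fstat(fd, &st) != 0)
		return errno;

	/* posix_fallocate() returns the error number directly. */
	err = posix_fallocate(fd, offset, len);
	if (err == 0)
		return 0;

	/* Partial allocation may remain; drop whatever lies past old EOF. */
	if (ftruncate(fd, st.st_size) != 0)
		return errno;

	return err;
}
```

An application that would rather keep the partial allocation simply skips
the ftruncate() step, which is the choice the kernel leaves open here.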

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-01 22:55                                                                   ` David Chinner
@ 2007-07-02 11:47                                                                     ` Amit K. Arora
  2007-07-11  9:05                                                                     ` Christoph Hellwig
  1 sibling, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-07-02 11:47 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Hellwig, Andreas Dilger, linux-fsdevel, linux-kernel,
	linux-ext4, suparna, cmm, xfs

On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
> On Sat, Jun 30, 2007 at 11:21:11AM +0100, Christoph Hellwig wrote:
> > On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > > > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > > > error) is hit?  Does it keep the current fallocate() or does it free it?
> > > 
> > > Currently it is left to the file system implementation. In ext4, we do
> > > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > > end up with partial (pre)allocation. This is in line with dd and
> > > posix_fallocate, which also do not free the partially allocated space.
> > 
> > I can't find anything in the specification of posix_fallocate
> > (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> > that tells what should happen to allocated blocks on error.
> 
> Yeah, and AFAICT glibc leaves them behind ATM.

Yes, it does.
 
> > But common sense would be to not leak disk space on failure of this
> > syscall, and this definitively should not be left up to the filesystem,
> > either we always leak it or always free it, and I'd strongly favour
> > the latter variant.

I would not call it a "leak", since the blocks which got allocated as
part of the partial success of the fallocate syscall can be strictly
accounted for (i.e. they are assigned to a particular inode). And these
can be freed by the application, using a suitable @mode of fallocate.
 
> We can't simply walk the range and remove unwritten extents, as some
> of them may have been present before the fallocate() call. That
> makes it extremely difficult to undo a failed call and not remove
> more pre-existing pre-allocations.

Same is true for ext4 too. It is very difficult to keep track of which
uninitialized (unwritten) extents got allocated as part of the current
syscall. This is because, as David mentions, some of them might be
already present; and also because some of the older ones may have got
merged with the *new* uninitialized/unwritten extents as part of the
current syscall. 
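
A toy model of why the merge loses that information, assuming an extent
is nothing more than a start block and a length. The struct and function
names are hypothetical and this is not the ext4 extent code.

```c
#include <assert.h>

/* Toy unwritten extent: a run of blocks with no record of who made it. */
struct toy_extent {
	unsigned long start;	/* first logical block */
	unsigned long len;	/* number of blocks */
};

/*
 * Merge "next" into "prev" when the two are logically contiguous.
 * Returns 1 if merged, 0 otherwise.
 */
static int toy_try_merge(struct toy_extent *prev, const struct toy_extent *next)
{
	assert(prev && next);

	if (prev->start + prev->len != next->start)
		return 0;

	/* After this, only the combined length survives. */
	prev->len += next->len;
	return 1;
}
```

If an older extent covering blocks 0-7 is merged with a new one covering
blocks 8-11, the result is a single extent of 12 blocks, and nothing in it
says which 4 blocks the failed call added.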
 
> Given the current behaviour for posix_fallocate() in glibc, I think
> that retaining the same error semantic and punting the cleanup to
> userspace (where the app will fail with ENOSPC anyway) is the only
> sane thing we can do here. Trying to undo this in the kernel leads
> to lots of extra rarely used code in error handling paths...

Right. This gives applications a free hand to either use the partially
preallocated space or free it, without introducing additional complexity
in the kernel.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-06-30 16:52                                                                   ` Andreas Dilger
@ 2007-07-03 10:08                                                                     ` Amit K. Arora
  2007-07-03 10:31                                                                       ` Christoph Hellwig
  0 siblings, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-07-03 10:08 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: adilger, David Chinner, suparna, cmm, xfs, Christoph Hellwig

On Sat, Jun 30, 2007 at 12:52:46PM -0400, Andreas Dilger wrote:
> The @mode flags that are currently under consideration are (AFAIK):
> 
> FA_FL_DEALLOC		0x01 /* deallocate unwritten extent (default allocate) */
> FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range (default keep) */

We now have two sets of flags - 
1) the above three, with which I think no one has any issues, and
2) the ones below, for which we need some discussions before finalizing
on them.

I would prefer fallocate going into mainline with the above three modes;
the rest of the modes can be debated and discussed in parallel, and each
new mode/flag can be pushed as a separate patch. That way the fallocate
feature will not be held up indefinitely...

Please confirm if you find this approach ok. Otherwise, please object.
Thanks!

> FA_FL_ERR_FREE	0x08 /* free preallocation on error (default keep prealloc) */
> FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-03 10:08                                                                     ` Amit K. Arora
@ 2007-07-03 10:31                                                                       ` Christoph Hellwig
  2007-07-03 11:46                                                                         ` Amit K. Arora
  0 siblings, 1 reply; 340+ messages in thread
From: Christoph Hellwig @ 2007-07-03 10:31 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, adilger, David Chinner,
	suparna, cmm, xfs, Christoph Hellwig

On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > FA_FL_DEALLOC		0x01 /* deallocate unwritten extent (default allocate) */
> > FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range (default keep) */
> 
> We now have two sets of flags - 
> 1) the above three, with which I think no one has any issues, and

Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
never delete data.  FA_FL_DEALLOC should probably be a separate syscall
because it's very different functionality.

While we're at it I also dislike the FA_ prefix because it doesn't say
anything and is far too generic.  FALLOC_ is much better.

> > FA_FL_ERR_FREE	0x08 /* free preallocation on error (default keep prealloc) */

NACK on this one.  We should have just one behaviour, and from the thread
that is not freeing the allocation on error.

> > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */

NACK to these as well.  If i_size changes c/mtime need updates, if the size
doesn't change they don't.  No need to add more flags for this.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-03 10:31                                                                       ` Christoph Hellwig
@ 2007-07-03 11:46                                                                         ` Amit K. Arora
  2007-07-04  5:37                                                                           ` Timothy Shimmin
  2007-07-11  9:03                                                                           ` Christoph Hellwig
  0 siblings, 2 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-07-03 11:46 UTC (permalink / raw)
  To: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4,
	adilger, David Chinner, suparna, cmm, xfs

On Tue, Jul 03, 2007 at 11:31:07AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 03:38:48PM +0530, Amit K. Arora wrote:
> > > FA_FL_DEALLOC		0x01 /* deallocate unwritten extent (default allocate) */
> > > FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc (default change size) */
> > > FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range (default keep) */
> > 
> > We now have two sets of flags - 
> > 1) the above three, with which I think no one has any issues, and
> 
> Yes, I do.  FA_FL_DEL_DATA is plain stupid, a preallocation call should
> never delete data.  FA_FL_DEALLOC should probably be a separate syscall
> because it's very different functionality.

Well, if you see the modes proposed using above flags :

#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
flag. Hence preallocation will never delete data.
This mode is required only for FA_UNRESV_SPACE, which is a deallocation
mode, to support any existing XFS aware applications/usage-scenarios.
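
The claim can be checked mechanically. Below is a small sketch using the
flag and mode values exactly as proposed in this thread; the helper
functions fa_mode_is_prealloc(), fa_mode_deletes_data() and
fa_check_modes() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits as proposed in this thread; not a merged kernel ABI. */
#define FA_FL_DEALLOC	0x01	/* deallocate unwritten extent */
#define FA_FL_KEEP_SIZE	0x02	/* keep size for EOF {pre,de}alloc */
#define FA_FL_DEL_DATA	0x04	/* delete existing data in range */

/* Modes built from those flags, as listed above. */
#define FA_ALLOCATE	0
#define FA_DEALLOCATE	FA_FL_DEALLOC
#define FA_RESV_SPACE	FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

/* Hypothetical helper: a mode preallocates unless the dealloc bit is set. */
static int fa_mode_is_prealloc(int mode)
{
	return !(mode & FA_FL_DEALLOC);
}

/* Hypothetical helper: does this mode ask for existing data to go away? */
static int fa_mode_deletes_data(int mode)
{
	return (mode & FA_FL_DEL_DATA) != 0;
}

/* No preallocation mode in the proposed set may carry FA_FL_DEL_DATA. */
static void fa_check_modes(void)
{
	static const int modes[] = {
		FA_ALLOCATE, FA_DEALLOCATE, FA_RESV_SPACE, FA_UNRESV_SPACE,
	};
	size_t i;

	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		assert(!(fa_mode_is_prealloc(modes[i]) &&
			 fa_mode_deletes_data(modes[i])));
}
```

Only FA_UNRESV_SPACE sets the delete bit, and it also sets FA_FL_DEALLOC,
so the check passes for all four proposed modes.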

And, regarding FA_FL_DEALLOC being a separate syscall - I think then the
very purpose of @mode argument is not justified. We have this mode so
that we can provide more features like this. That said, I don't say that
we should make things very complicated; but, at least we should provide
some basic features which we expect most of the applications wanting
preallocation to use. To start with, we need to cater to already
existing applications/user base who use XFS preallocation feature.

And further advanced features, like goal based preallocation, can be
implemented as a separate syscall.

> While we're at it I also dislike the FA_ prefix because it doesn't say
> anything and is far too generic.  FALLOC_ is much better.

Ok. This can be changed in the next take.
 
> > > FA_FL_ERR_FREE	0x08 /* free preallocation on error (default keep prealloc) */
> 
> NACK on this one.  We should have just one behaviour, and from the thread
> that is not freeing the allocation on error.

I agree on this one. 
 
> > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> 
> NACK to these as well.  If i_size changes c/mtime need updates, if the size
> doesn't change they don't.  No need to add more flags for this.

This requirement was from the point of view of HSM applications. Hope
you saw Andreas' previous post and are keeping that in mind.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-03 11:46                                                                         ` Amit K. Arora
@ 2007-07-04  5:37                                                                           ` Timothy Shimmin
  2007-07-11  9:04                                                                             ` Christoph Hellwig
  2007-07-11  9:03                                                                           ` Christoph Hellwig
  1 sibling, 1 reply; 340+ messages in thread
From: Timothy Shimmin @ 2007-07-04  5:37 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4,
	adilger, David Chinner, suparna, cmm, xfs

Amit K. Arora wrote:
>>>> FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
>>>> FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
>> NACK to these as well.  If i_size changes c/mtime need updates, if the size
>> doesn't change they don't.  No need to add more flags for this.
> 
> This requirement was from the point of view of HSM applications. Hope
> you saw Andreas' previous post and are keeping that in mind.
> 
We use this capability in XFS at the moment.
I think this is mainly for DMF (HSM) but is done via the xfs handle interface
(xfs_open_by_handle) AFAICT.

This sets up a set of invisible operations (xfs_invis_file_operations).
xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI
which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP & friends)
for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG)
is called or not.

--Tim

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-03 11:46                                                                         ` Amit K. Arora
  2007-07-04  5:37                                                                           ` Timothy Shimmin
@ 2007-07-11  9:03                                                                           ` Christoph Hellwig
  2007-07-12  7:28                                                                             ` Suparna Bhattacharya
  1 sibling, 1 reply; 340+ messages in thread
From: Christoph Hellwig @ 2007-07-11  9:03 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4,
	adilger, David Chinner, suparna, cmm, xfs

On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> Well, if you see the modes proposed using above flags :
> 
> #define FA_ALLOCATE	0
> #define FA_DEALLOCATE	FA_FL_DEALLOC
> #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> 
> FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> flag. Hence preallocation will never delete data.
> This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> mode, to support any existing XFS aware applications/usage-scenarios.

Sorry, but this doesn't make any sense.  There is no need to put every
feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
be supported forever anyway - as I suggested before they really should
be moved to generic code.

What needs to be supported is what makes sense as an interface.
A punch a hole interface does make sense, but trying to hack this into
a preallocation system call is just madness.  We're not IRIX or windows
that fit things into random subcall just because there was some space
left to squeeze them in.

> > > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> > 
> > NACK to these as well.  If i_size changes c/mtime need updates, if the size
> > doesn't change they don't.  No need to add more flags for this.
> 
> This requirement was from the point of view of HSM applications. Hope
> you saw Andreas' previous post and are keeping that in mind.

HSMs need this basically for every system call, which screams for an
open flag like O_INVISIBLE anyway.  Adding this in a generic way is
a good idea, but hacking bits and pieces that won't fit into the global
design is completely wrong.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-04  5:37                                                                           ` Timothy Shimmin
@ 2007-07-11  9:04                                                                             ` Christoph Hellwig
  0 siblings, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-07-11  9:04 UTC (permalink / raw)
  To: Timothy Shimmin
  Cc: Amit K. Arora, Christoph Hellwig, linux-fsdevel, linux-kernel,
	linux-ext4, adilger, David Chinner, suparna, cmm, xfs

On Wed, Jul 04, 2007 at 03:37:01PM +1000, Timothy Shimmin wrote:
> We use this capability in XFS at the moment.
> I think this is mainly for DMF (HSM) but is done via the xfs handle 
> interface
> (xfs_open_by_handle) AFAICT.
> 

You're not :)  You're using an O_INVISIBLE equivalent (as described below),
which would be a useful thing to have at the VFS level, but adding hacks
to some system calls only wouldn't help any HSM system.  It's just useless
API clutter.


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-01 22:55                                                                   ` David Chinner
  2007-07-02 11:47                                                                     ` Amit K. Arora
@ 2007-07-11  9:05                                                                     ` Christoph Hellwig
  1 sibling, 0 replies; 340+ messages in thread
From: Christoph Hellwig @ 2007-07-11  9:05 UTC (permalink / raw)
  To: David Chinner
  Cc: Christoph Hellwig, Amit K. Arora, Andreas Dilger, linux-fsdevel,
	linux-kernel, linux-ext4, suparna, cmm, xfs

On Mon, Jul 02, 2007 at 08:55:43AM +1000, David Chinner wrote:
> Given the current behaviour for posix_fallocate() in glibc, I think
> that retaining the same error semantic and punting the cleanup to
> userspace (where the app will fail with ENOSPC anyway) is the only
> sane thing we can do here. Trying to undo this in the kernel leads
> to lots of extra rarely used code in error handling paths...

Agreed, looks like we should stay with the user-has-to-clean-up behaviour.

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-11  9:03                                                                           ` Christoph Hellwig
@ 2007-07-12  7:28                                                                             ` Suparna Bhattacharya
  2007-07-12  8:26                                                                               ` Amit K. Arora
  2007-07-12 13:13                                                                               ` David Chinner
  0 siblings, 2 replies; 340+ messages in thread
From: Suparna Bhattacharya @ 2007-07-12  7:28 UTC (permalink / raw)
  To: Christoph Hellwig, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, adilger, David Chinner, cmm, xfs

On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > Well, if you see the modes proposed using above flags :
> > 
> > #define FA_ALLOCATE	0
> > #define FA_DEALLOCATE	FA_FL_DEALLOC
> > #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> > #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> > 
> > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > flag. Hence preallocation will never delete data.
> > This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> > mode, to support any existing XFS aware applications/usage-scenarios.
> 
> Sorry, but this doesn't make any sense.  There is no need to put every
> feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
> be supported forever anyway - as I suggested before they really should
> be moved to generic code.
> 
> What needs to be supported is what makes sense as an interface.
> A punch a hole interface does make sense, but trying to hack this into
> a preallocation system call is just madness.  We're not IRIX or windows
> that fit things into random subcall just because there was some space
> left to squeeze them in.
> 
> > > > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> > > 
> > > NACK to these as well.  If i_size changes c/mtime need updates, if the size
> > > doesn't change they don't.  No need to add more flags for this.
> > 
> > This requirement was from the point of view of HSM applications. Hope
> > you saw Andreas' previous post and are keeping that in mind.
> 
> HSMs need this basically for every system call, which screams for an
> open flag like O_INVISIBLE anyway.  Adding this in a generic way is
> a good idea, but hacking bits and pieces that won't fit into the global
> design is completely wrong.


Why don't we just merge the interface for preallocation (essentially
enough to satisfy posix_fallocate() and the simple XFS requirement for 
space reservation without changing file size), which there is clear agreement
on (I hope :)).  After all, this was all that we set out to do when we
started.

And leave all the dealloc/punch/hsm type features for separate future patches/
debates, those really shouldn't hold up the basic fallocate interface.
I agree with Christoph that we are just diverging too much in trying to
club those decisions here.

Dave, Andreas, Ted ?

Regards
Suparna


-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-12  7:28                                                                             ` Suparna Bhattacharya
@ 2007-07-12  8:26                                                                               ` Amit K. Arora
  2007-07-12 14:40                                                                                 ` Andreas Dilger
  2007-07-12 13:13                                                                               ` David Chinner
  1 sibling, 1 reply; 340+ messages in thread
From: Amit K. Arora @ 2007-07-12  8:26 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4,
	adilger, David Chinner, cmm, xfs

On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
> On Wed, Jul 11, 2007 at 10:03:12AM +0100, Christoph Hellwig wrote:
> > On Tue, Jul 03, 2007 at 05:16:50PM +0530, Amit K. Arora wrote:
> > > Well, if you see the modes proposed using above flags :
> > > 
> > > #define FA_ALLOCATE	0
> > > #define FA_DEALLOCATE	FA_FL_DEALLOC
> > > #define FA_RESV_SPACE	FA_FL_KEEP_SIZE
> > > #define FA_UNRESV_SPACE	(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
> > > 
> > > FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes
> > > for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this
> > > flag. Hence preallocation will never delete data.
> > > This mode is required only for FA_UNRESV_SPACE, which is a deallocation
> > > mode, to support any existing XFS aware applications/usage-scenarios.
> > 
> > Sorry, but this doesn't make any sense.  There is no need to put every
> > feature in the XFS ioctls in the syscalls.  The XFS ioctls will need to
> > be supported forever anyway - as I suggested before they really should
> > be moved to generic code.
> > 
> > What needs to be supported is what makes sense as an interface.
> > A punch a hole interface does make sense, but trying to hack this into
> > a preallocation system call is just madness.  We're not IRIX or windows
> > that fit things into random subcall just because there was some space
> > left to squeeze them in.
> > 
> > > > > > FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
> > > > > > FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */
> > > > 
> > > > NACK to these as well.  If i_size changes c/mtime need updates, if the size
> > > > doesn't change they don't.  No need to add more flags for this.
> > > 
> > > This requirement was from the point of view of HSM applications. Hope
> > > you saw Andreas' previous post and are keeping that in mind.
> > 
> > HSMs need this basically for every system call, which screams for an
> > open flag like O_INVISIBLE anyway.  Adding this in a generic way is
> > a good idea, but hacking bits and pieces that won't fit into the global
> > design is completely wrong.
> 
> Why don't we just merge the interface for preallocation (essentially
> enough to satisfy posix_fallocate() and the simple XFS requirement for 
> space reservation without changing file size), which there is clear agreement
> on (I hope :)).  After all, this was all that we set out to do when we
> started.

As you suggest, let us just have two modes for the time being:

#define FALLOC_ALLOCATE			0x1
#define FALLOC_ALLOCATE_KEEP_SIZE	0x2

As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
will result in file size not being changed even if the preallocation is
beyond EOF.
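
A minimal model of the size rule just described, using the two mode
values proposed above. The function model_size_after_falloc() is
hypothetical and only computes the resulting file size; it does not
allocate anything and is not taken from the patches.

```c
#include <assert.h>

/* Mode values as proposed in this message; not a merged kernel ABI. */
#define FALLOC_ALLOCATE			0x1
#define FALLOC_ALLOCATE_KEEP_SIZE	0x2

/*
 * Hypothetical model: file size after a successful preallocation of
 * [offset, offset + len) on a file that was old_size bytes long.
 */
static long long model_size_after_falloc(long long old_size, long long offset,
					 long long len, int mode)
{
	long long end = offset + len;

	assert(offset >= 0 && len > 0);

	/* Only plain FALLOC_ALLOCATE may move EOF, and only forwards. */
	if (mode == FALLOC_ALLOCATE && end > old_size)
		return end;

	/* KEEP_SIZE, or a range entirely below EOF: size is unchanged. */
	return old_size;
}
```

For a 4096-byte file and a request covering bytes 0-8191, the model gives
8192 under FALLOC_ALLOCATE and 4096 under FALLOC_ALLOCATE_KEEP_SIZE.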

> And leave all the dealloc/punch/hsm type features for separate future patches/
> debates, those really shouldn't hold up the basic fallocate interface.

I agree.

> I agree with Christoph that we are just diverging too much in trying to
> club those decisions here.
> 
> Dave, Andreas, Ted ?
> 
> Regards
> Suparna

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-12  7:28                                                                             ` Suparna Bhattacharya
  2007-07-12  8:26                                                                               ` Amit K. Arora
@ 2007-07-12 13:13                                                                               ` David Chinner
  2007-07-12 14:15                                                                                 ` Amit K. Arora
  1 sibling, 1 reply; 340+ messages in thread
From: David Chinner @ 2007-07-12 13:13 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Christoph Hellwig, Amit K. Arora, linux-fsdevel, linux-kernel,
	linux-ext4, adilger, David Chinner, cmm, xfs

On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
> 
> Why don't we just merge the interface for preallocation (essentially
> enough to satisfy posix_fallocate() and the simple XFS requirement for 
> space reservation without changing file size), which there is clear agreement
> on (I hope :)).  After all, this was all that we set out to do when we
> started.
> 
> And leave all the dealloc/punch/hsm type features for separate future patches/
> debates, those really shouldn't hold up the basic fallocate interface.
> I agree with Christoph that we are just diverging too much in trying to
> club those decisions here.
> 
> Dave, Andreas, Ted ?

Sure. I'll just make XFS work with whatever it is that gets merged.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-12 13:13                                                                               ` David Chinner
@ 2007-07-12 14:15                                                                                 ` Amit K. Arora
  0 siblings, 0 replies; 340+ messages in thread
From: Amit K. Arora @ 2007-07-12 14:15 UTC (permalink / raw)
  To: David Chinner
  Cc: Suparna Bhattacharya, Christoph Hellwig, linux-fsdevel,
	linux-kernel, linux-ext4, adilger, cmm, xfs

On Thu, Jul 12, 2007 at 11:13:34PM +1000, David Chinner wrote:
> On Thu, Jul 12, 2007 at 12:58:13PM +0530, Suparna Bhattacharya wrote:
> > 
> > Why don't we just merge the interface for preallocation (essentially
> > enough to satisfy posix_fallocate() and the simple XFS requirement for 
> > space reservation without changing file size), which there is clear agreement
> > on (I hope :)).  After all, this was all that we set out to do when we
> > started.
> > 
> > And leave all the dealloc/punch/hsm type features for separate future patches/
> > debates, those really shouldn't hold up the basic fallocate interface.
> > I agree with Christoph that we are just diverging too much in trying to
> > club those decisions here.
> > 
> > Dave, Andreas, Ted ?
> 
> Sure. I'll just make XFS work with whatever it is that gets merged.

Great. I will post the new patches soon.

--
Regards,
Amit Arora

^ permalink raw reply	[flat|nested] 340+ messages in thread

* Re: [PATCH 4/7][TAKE5] support new modes in fallocate
  2007-07-12  8:26                                                                               ` Amit K. Arora
@ 2007-07-12 14:40                                                                                 ` Andreas Dilger
  0 siblings, 0 replies; 340+ messages in thread
From: Andreas Dilger @ 2007-07-12 14:40 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Suparna Bhattacharya, Christoph Hellwig, linux-fsdevel,
	linux-kernel, linux-ext4, David Chinner, cmm, xfs

On Jul 12, 2007  13:56 +0530, Amit K. Arora wrote:
> As you suggest, let us just have two modes for the time being:
> 
> #define FALLOC_ALLOCATE			0x1
> #define FALLOC_ALLOCATE_KEEP_SIZE	0x2
> 
> As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
> will result in file size not being changed even if the preallocation is
> beyond EOF.

What does FALLOC_ALLOCATE mean vs. not passing this flag?  I have no
objection to this as long as the code remains with these as "flags"
instead of "modes"...  Essentially just dropping the FALLOC_FL_DEALLOCATE
and FALLOC_FL_DEL_DATA from the interface.
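
A sketch of the "flags" reading, assuming that passing 0 means plain
allocation and that keeping the size is a single optional bit. The names
SKETCH_FL_KEEP_SIZE, SKETCH_FL_ALL, sketch_check_flags() and
sketch_changes_size() are hypothetical, as is the choice of -EOPNOTSUPP
for unknown bits; none of them come from the patches.

```c
#include <assert.h>	/* only needed by callers that assert on the results */
#include <errno.h>

/* Hypothetical flag name and value, for illustration only. */
#define SKETCH_FL_KEEP_SIZE	0x01
#define SKETCH_FL_ALL		(SKETCH_FL_KEEP_SIZE)

/* Reject any bit this sketch does not know about. */
static int sketch_check_flags(int flags)
{
	if (flags & ~SKETCH_FL_ALL)
		return -EOPNOTSUPP;
	return 0;
}

/* With flags, "allocate" is the default and needs no bit of its own. */
static int sketch_changes_size(int flags)
{
	return !(flags & SKETCH_FL_KEEP_SIZE);
}
```

With this shape a later flag can be added as another bit without
redefining what the existing values mean, whereas two mode values of 0x1
and 0x2 leave the meaning of 0 and of 0x3 open.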

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 340+ messages in thread

end of thread, other threads:[~2007-07-12 14:41 UTC | newest]

Thread overview: 340+ messages
2007-01-17  9:46 [Resubmit][Patch 0/2] Persistent preallocation in ext4 Amit K. Arora
2007-01-17 10:13 ` [Patch 1/2] ioctl and uninitialized extents Amit K. Arora
2007-01-17 10:18 ` [Patch 2/2] support for writing to uninitialized extent Amit K. Arora
2007-01-17 22:20 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Mingming Cao
2007-01-17 22:33   ` Eric Sandeen
2007-01-18  6:08   ` Amit K. Arora
2007-01-19  9:11 ` Amit K. Arora
2007-01-19  9:17 ` patch for fsx-linux Amit K. Arora
2007-01-19  9:22 ` small tool for unit testing Amit K. Arora
2007-02-07  7:48 ` Testing ext4 persistent preallocation patches for 64 bit features Amit K. Arora
2007-02-07  8:25   ` Mingming Cao
2007-02-07 10:36     ` Suparna Bhattacharya
2007-02-07 21:11       ` Andreas Dilger
2007-02-08  8:52         ` Amit K. Arora
2007-02-08 10:51     ` Amit K. Arora
2007-02-25 10:23 ` [Resubmit][Patch 0/2] Persistent preallocation in ext4 Andrew Morton
2007-03-01 18:34   ` [RFC] Heads up on sys_fallocate() Amit K. Arora
2007-03-01 19:15     ` Eric Sandeen
2007-03-02 10:45       ` Andreas Dilger
2007-03-02 13:17         ` Dave Kleikamp
2007-03-01 20:23     ` Jeff Garzik
2007-03-01 20:31       ` Jeremy Allison
2007-03-01 21:14     ` Jeremy Fitzhardinge
2007-03-01 22:58       ` Alan
2007-03-01 22:05         ` Jeremy Fitzhardinge
2007-03-01 23:11           ` Alan
2007-03-01 22:15             ` Jeremy Fitzhardinge
2007-03-01 22:25     ` Andrew Morton
2007-03-01 22:40       ` Nathan Scott
2007-03-01 22:39         ` Eric Sandeen
2007-03-01 22:52         ` Andrew Morton
2007-03-02 18:28           ` Mingming Cao
2007-03-05 12:27             ` Jan Kara
2007-03-05 20:02               ` Mingming Cao
2007-03-06  7:28                 ` Christoph Hellwig
2007-03-06 14:36                   ` Ulrich Drepper
2007-03-06 14:47                     ` Christoph Hellwig
2007-03-06 14:50                     ` Jan Kara
2007-03-06 18:23                       ` Eric Sandeen
2007-03-07  8:51                         ` Jan Kara
2007-03-07 11:30                           ` Jörn Engel
2007-03-06 16:46                     ` Eric Sandeen
2007-03-13 23:46                       ` David Chinner
2007-03-05 21:41               ` Eric Sandeen
2007-03-01 22:41       ` Anton Blanchard
2007-03-01 22:44       ` Dave Kleikamp
2007-03-01 22:59         ` Andrew Morton
2007-03-01 23:09           ` Dave Kleikamp
2007-03-02 13:41             ` Jan Engelhardt
2007-03-02 18:09             ` Mingming Cao
2007-03-02  7:09           ` Ulrich Drepper
2007-03-01 23:38         ` Christoph Hellwig
2007-03-03 22:45           ` Arnd Bergmann
2007-03-04 20:11             ` Anton Altaparmakov
2007-03-04 20:53               ` Arnd Bergmann
2007-03-04 20:53                 ` Arnd Bergmann
2007-03-04 22:38               ` Ulrich Drepper
2007-03-04 23:22                 ` Anton Altaparmakov
2007-03-05 14:37                   ` Theodore Tso
2007-03-05 15:07                     ` Anton Altaparmakov
2007-03-05 15:15                     ` Ulrich Drepper
2007-03-05 15:35                       ` Christoph Hellwig
2007-03-05 16:01                       ` Theodore Tso
2007-03-05 16:07                         ` Ulrich Drepper
2007-03-05  0:16                 ` Jörn Engel
2007-03-05  0:16                   ` Jörn Engel
2007-03-05  0:32                   ` Anton Altaparmakov
2007-03-05  0:35                     ` Anton Altaparmakov
2007-03-05  0:44                     ` Arnd Bergmann
2007-03-05 11:49                     ` Jörn Engel
2007-03-05 11:49                       ` Jörn Engel
2007-03-05 15:09                       ` Ulrich Drepper
2007-03-05  0:36                   ` Arnd Bergmann
2007-03-05 11:41                     ` Jörn Engel
2007-03-05 15:08                       ` Ulrich Drepper
2007-03-05 15:33                         ` Jörn Engel
2007-03-05 15:33                           ` Jörn Engel
2007-03-05 15:48                           ` Ulrich Drepper
2007-03-05 22:00                       ` Eric Sandeen
2007-03-05  4:23               ` Christoph Hellwig
2007-03-05 13:18             ` Christoph Hellwig
2007-03-01 23:29     ` Eric Sandeen
2007-03-01 23:51       ` Christoph Hellwig
2007-03-01 23:36     ` Christoph Hellwig
2007-03-02  6:03     ` Badari Pulavarty
2007-03-02  6:16       ` Andrew Morton
2007-03-02 13:23         ` Dave Kleikamp
2007-03-02 15:29           ` Ulrich Drepper
2007-03-02 15:16       ` Eric Sandeen
2007-03-02 16:13         ` Badari Pulavarty
2007-03-02 17:01           ` Andrew Morton
2007-03-02 17:19           ` Eric Sandeen
2007-03-16 14:31     ` [RFC][PATCH] sys_fallocate() system call Amit K. Arora
2007-03-16 15:21       ` Heiko Carstens
2007-03-19  9:24         ` Amit K. Arora
2007-03-19 11:23           ` Heiko Carstens
2007-03-16 16:17       ` Heiko Carstens
2007-03-17  9:59         ` Paul Mackerras
2007-03-17 11:07           ` Matthew Wilcox
2007-03-17 14:30             ` Heiko Carstens
2007-03-17 14:38               ` Stephen Rothwell
2007-03-17 14:42                 ` Stephen Rothwell
2007-03-17 11:10         ` Matthew Wilcox
2007-03-21 12:04           ` Amit K. Arora
2007-03-21 21:35             ` Chris Wedgwood
2007-03-29 11:51             ` Interface for the new fallocate() " Amit K. Arora
2007-03-29 16:35               ` Chris Wedgwood
2007-03-29 17:01               ` Jan Engelhardt
2007-03-29 17:18                 ` linux-os (Dick Johnson)
2007-03-29 17:18                   ` linux-os (Dick Johnson)
2007-03-29 18:05                   ` Jan Engelhardt
2007-03-29 18:37                     ` Linus Torvalds
2007-03-30  7:00                 ` Heiko Carstens
2007-03-29 17:10               ` Andrew Morton
2007-03-30  7:14                 ` Jakub Jelinek
2007-03-30  8:39                   ` Heiko Carstens
2007-03-30  8:39                   ` Heiko Carstens
2007-03-30  9:15                   ` Paul Mackerras
2007-03-30  9:15                   ` Paul Mackerras
2007-04-05 11:26                   ` Amit K. Arora
2007-04-05 11:44                     ` Amit K. Arora
2007-04-05 15:50                     ` Randy Dunlap
2007-04-06  9:58                     ` Andreas Dilger
2007-04-17 12:55                   ` Amit K. Arora
2007-04-18 13:06                     ` Andreas Dilger
2007-04-20 13:51                       ` Amit K. Arora
2007-04-20 14:59                         ` Jakub Jelinek
2007-04-24 12:16                           ` Amit K. Arora
2007-04-26 17:50                             ` [PATCH 0/5] fallocate " Amit K. Arora
2007-04-26 18:03                               ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Amit K. Arora
2007-05-04  4:29                                 ` Andrew Morton
2007-05-04  4:41                                   ` Paul Mackerras
2007-05-09 10:15                                     ` Suparna Bhattacharya
2007-05-09 10:50                                       ` Paul Mackerras
2007-05-09 11:10                                         ` Suparna Bhattacharya
2007-05-09 11:37                                           ` Paul Mackerras
2007-05-09 12:00                                             ` Martin Schwidefsky
2007-05-09 12:05                                             ` Amit K. Arora
2007-05-04  4:55                                   ` Andrew Morton
2007-05-04  6:07                                   ` David Chinner
2007-05-04  6:28                                     ` Andrew Morton
2007-05-04  6:56                                       ` Jakub Jelinek
2007-05-07 13:08                                         ` Ulrich Drepper
2007-05-04  7:27                                       ` David Chinner
2007-05-07 11:10                                       ` Amit K. Arora
2007-05-07 11:03                                   ` Amit K. Arora
2007-05-09 16:01                                 ` Amit K. Arora
2007-05-09 16:54                                   ` Andreas Dilger
2007-05-09 17:07                                   ` Mingming Cao
2007-05-10  0:59                                   ` David Chinner
2007-05-10 11:56                                     ` Amit K. Arora
2007-05-10 22:39                                       ` David Chinner
2007-05-11 11:03                                         ` Suparna Bhattacharya
2007-05-12  8:01                                           ` David Chinner
2007-06-12  6:16                                             ` Amit K. Arora
2007-06-12  8:11                                               ` David Chinner
2007-06-13 23:52                                               ` David Chinner
2007-06-14  9:14                                                 ` Andreas Dilger
2007-06-14 12:04                                                   ` David Chinner
2007-06-14 19:33                                                     ` Andreas Dilger
2007-06-25 13:28                                                       ` [PATCH 0/6][TAKE5] fallocate system call Amit K. Arora
2007-06-25 13:28                                                         ` Amit K. Arora
2007-06-25 13:40                                                         ` [PATCH 1/7][TAKE5] fallocate() implementation on i386, x86_64 and powerpc Amit K. Arora
2007-06-26 19:38                                                           ` Heiko Carstens
2007-06-25 13:42                                                         ` [PATCH 2/7][TAKE5] fallocate() on s390(x) Amit K. Arora
2007-06-26 15:15                                                           ` Heiko Carstens
2007-06-25 13:43                                                         ` [PATCH 3/7][TAKE5] fallocate() on ia64 Amit K. Arora
2007-06-25 13:45                                                         ` [PATCH 4/7][TAKE5] support new modes in fallocate Amit K. Arora
2007-06-25 15:03                                                           ` Amit K. Arora
2007-06-25 21:46                                                             ` Andreas Dilger
2007-06-26 10:32                                                               ` Amit K. Arora
2007-06-26 15:34                                                                 ` Andreas Dilger
2007-06-26 19:09                                                                   ` Amit K. Arora
2007-06-26 23:18                                                                   ` David Chinner
2007-06-28 18:19                                                                     ` Amit K. Arora
2007-06-28 23:39                                                                       ` Nathan Scott
2007-06-29  1:03                                                                       ` David Chinner
2007-06-30 10:21                                                                 ` Christoph Hellwig
2007-06-30 16:52                                                                   ` Andreas Dilger
2007-07-03 10:08                                                                     ` Amit K. Arora
2007-07-03 10:31                                                                       ` Christoph Hellwig
2007-07-03 11:46                                                                         ` Amit K. Arora
2007-07-04  5:37                                                                           ` Timothy Shimmin
2007-07-11  9:04                                                                             ` Christoph Hellwig
2007-07-11  9:03                                                                           ` Christoph Hellwig
2007-07-12  7:28                                                                             ` Suparna Bhattacharya
2007-07-12  8:26                                                                               ` Amit K. Arora
2007-07-12 14:40                                                                                 ` Andreas Dilger
2007-07-12 13:13                                                                               ` David Chinner
2007-07-12 14:15                                                                                 ` Amit K. Arora
2007-07-01 22:55                                                                   ` David Chinner
2007-07-02 11:47                                                                     ` Amit K. Arora
2007-07-11  9:05                                                                     ` Christoph Hellwig
2007-06-26 23:14                                                               ` David Chinner
2007-06-27  3:49                                                                 ` Andreas Dilger
2007-06-27 13:36                                                                   ` David Chinner
2007-06-27 23:28                                                                     ` Nathan Scott
2007-06-28  0:39                                                                       ` David Chinner
2007-06-28  0:53                                                                         ` Nathan Scott
2007-06-30 10:26                                                                     ` Christoph Hellwig
2007-06-25 21:52                                                           ` Andreas Dilger
2007-06-26 10:45                                                             ` Amit K. Arora
2007-06-26 15:42                                                               ` Andreas Dilger
2007-06-26 19:12                                                                 ` Amit K. Arora
2007-06-26 23:32                                                                 ` David Chinner
2007-06-26 23:26                                                             ` David Chinner
2007-06-25 13:48                                                         ` [PATCH 5/7][TAKE5] ext4: fallocate support in ext4 Amit K. Arora
2007-06-25 13:49                                                         ` [PATCH 6/7][TAKE5] ext4: write support for preallocated blocks Amit K. Arora
2007-06-25 13:50                                                         ` [PATCH 7/7][TAKE5] ext4: support new modes Amit K. Arora
2007-06-25 21:56                                                           ` Andreas Dilger
2007-06-26 12:07                                                             ` Amit K. Arora
2007-06-26 16:14                                                               ` Andreas Dilger
2007-06-26 19:29                                                                 ` Amit K. Arora
2007-06-27  0:04                                                                   ` David Chinner
2007-06-28 18:07                                                                     ` Amit K. Arora
2007-06-26 23:15                                                         ` [PATCH 0/6][TAKE5] fallocate system call David Chinner
2007-06-28  9:55                                                         ` Andrew Morton
2007-06-28 17:36                                                           ` Mingming Cao
2007-06-28 17:57                                                           ` Amit K. Arora
2007-06-28 18:33                                                             ` Andrew Morton
2007-06-28 18:45                                                               ` Dave Kleikamp
2007-06-28 18:57                                                               ` Jeff Garzik
2007-06-29  7:20                                                               ` Christoph Hellwig
2007-06-29 13:56                                                               ` Theodore Tso
2007-06-29 14:29                                                                 ` Jeff Garzik
2007-06-29 17:42                                                                   ` Theodore Tso
2007-06-29 15:50                                                                 ` Mingming Caoc
2007-06-29 20:57                                                                   ` Andrew Morton
2007-07-01  7:35                                                                     ` Ext4 patches for 2.6.22-rc6 Mingming Cao
2007-06-28 20:34                                                             ` [PATCH 0/6][TAKE5] fallocate system call Andreas Dilger
2007-06-30 10:14                                                   ` [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc Christoph Hellwig
2007-04-26 18:07                               ` [PATCH 2/5] fallocate() on s390 Amit K. Arora
2007-04-26 18:11                               ` [PATCH 3/5] ext4: Extent overlap bugfix Amit K. Arora
2007-05-04  4:30                                 ` Andrew Morton
2007-05-07 11:46                                   ` Amit K. Arora
2007-04-26 18:13                               ` [PATCH 4/5] ext4: fallocate support in ext4 Amit K. Arora
2007-05-04  4:31                                 ` Andrew Morton
2007-05-07 11:37                                   ` Andreas Dilger
2007-05-07 20:58                                     ` Andrew Morton
2007-05-07 22:21                                       ` Andreas Dilger
2007-05-07 22:38                                         ` Andrew Morton
2007-05-07 23:14                                           ` Theodore Tso
2007-05-07 23:31                                             ` Andrew Morton
2007-05-08  0:30                                               ` Mingming Cao
2007-05-07 23:02                                         ` Jeff Garzik
2007-05-07 23:36                                           ` Theodore Tso
2007-05-08  1:07                                           ` Andreas Dilger
2007-05-08  1:25                                             ` Jeff Garzik
2007-05-08  0:00                                       ` Mingming Cao
2007-05-08  0:15                                         ` Andrew Morton
2007-05-08  0:41                                           ` Mingming Cao
2007-05-08  1:43                                             ` Theodore Tso
2007-05-08 16:52                                               ` Andreas Dilger
2007-05-08 17:46                                               ` Mingming Cao
2007-05-14 13:34                                       ` Jan Kara
2007-05-07 12:07                                   ` Amit K. Arora
2007-05-07 15:24                                     ` Dave Kleikamp
2007-05-08 10:52                                       ` Amit K. Arora
2007-05-08 14:47                                         ` Dave Kleikamp
2007-04-26 18:16                               ` [PATCH 5/5] ext4: write support for preallocated blocks/extents Amit K. Arora
2007-05-04  4:32                                 ` Andrew Morton
2007-05-07 12:11                                   ` Amit K. Arora
2007-05-07 12:40                                 ` Pekka Enberg
2007-05-07 13:04                                   ` Amit K. Arora
2007-04-27 12:10                               ` [PATCH 0/5] fallocate system call Heiko Carstens
2007-04-27 14:43                                 ` Jörn Engel
2007-04-27 17:46                                   ` Heiko Carstens
2007-04-27 17:46                                     ` Heiko Carstens
2007-04-27 20:42                                     ` Chris Wedgwood
2007-04-30  0:47                               ` David Chinner
2007-04-30  3:09                                 ` [PATCH] ia64 fallocate syscall David Chinner
2007-04-30  3:11                                 ` [PATCH] XFS ->fallocate() support David Chinner
2007-04-30  3:14                                 ` [PATCH] Add preallocation beyond EOF to fallocate David Chinner
2007-04-30  5:25                                 ` [PATCH 0/5] fallocate system call Chris Wedgwood
2007-04-30  5:56                                   ` David Chinner
2007-04-30  6:01                                     ` Chris Wedgwood
2007-05-02 12:53                                   ` Amit K. Arora
2007-05-03 10:34                                     ` Andreas Dilger
2007-05-03 11:22                                       ` Miquel van Smoorenburg
2007-05-08  2:26                                         ` David Chinner
2007-05-14 13:29                               ` [PATCH 0/5][TAKE2] " Amit K. Arora
2007-05-14 13:29                                 ` Amit K. Arora
     [not found]                                 ` <20070514142820.GA31468@amitarora.in.ibm.com>
2007-05-14 14:45                                   ` [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
2007-05-14 23:44                                     ` Stephen Rothwell
2007-05-14 23:44                                       ` Stephen Rothwell
2007-05-15 13:23                                       ` Amit K. Arora
2007-05-18 21:36                                         ` Theodore Tso
2007-05-18 23:10                                           ` Mingming Cao
2007-05-20 12:39                                             ` Dave Kleikamp
2007-05-21  5:38                                               ` Theodore Tso
2007-05-14 14:48                                   ` [PATCH 2/5][TAKE2] fallocate() on s390 Amit K. Arora
2007-05-14 15:33                                     ` [PATCH 2/5][TAKE2] fallocate() on s390 - glibc wrapper Amit K. Arora
2007-05-14 14:50                                   ` [PATCH 3/5][TAKE2] ext4: Extent overlap bugfix Amit K. Arora
2007-05-14 14:52                                   ` [PATCH 4/5][TAKE2] ext4: fallocate support in ext4 Amit K. Arora
2007-05-14 14:54                                   ` [PATCH 5/5][TAKE2] ext4: write support for preallocated blocks Amit K. Arora
2007-05-15  6:31                                 ` [PATCH 0/5][TAKE2] fallocate system call Andreas Dilger
2007-05-15 12:40                                   ` Amit K. Arora
2007-05-15 12:40                                     ` Amit K. Arora
2007-05-15 19:37                               ` [PATCH 0/5][TAKE3] " Amit K. Arora
2007-05-15 19:37                                 ` Amit K. Arora
     [not found]                                 ` <20070515195421.GA2948@amitarora.in.ibm.com>
2007-05-15 20:03                                   ` [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
2007-05-16  0:42                                     ` Mingming Cao
2007-05-16 12:31                                       ` Amit K. Arora
2007-05-16  3:16                                     ` David Chinner
2007-05-16 12:21                                       ` Dave Kleikamp
2007-05-16 12:37                                         ` Amit K. Arora
2007-05-16 23:40                                         ` David Chinner
2007-05-17 12:10                                           ` Dave Kleikamp
2007-05-17 12:28                                           ` Amit K. Arora
2007-05-15 20:10                                   ` [PATCH 2/5][TAKE3] fallocate() on s390 Amit K. Arora
2007-05-15 20:13                                   ` [PATCH 3/5][TAKE3] ext4: Extent overlap bugfix Amit K. Arora
2007-05-15 20:16                                   ` [PATCH 4/5][TAKE3] ext4: fallocate support in ext4 Amit K. Arora
2007-05-15 20:18                                   ` [PATCH 5/5][TAKE3] ext4: write support for preallocated blocks Amit K. Arora
2007-05-15 23:52                                 ` [PATCH 0/5][TAKE3] fallocate system call Mingming Cao
2007-05-17 14:11                               ` [PATCH 0/6][TAKE4] " Amit K. Arora
2007-05-17 14:11                                 ` Amit K. Arora
     [not found]                                 ` <20070517141458.GA26641@amitarora.in.ibm.com>
2007-05-17 14:23                                   ` [PATCH 1/6][TAKE4] fallocate() implementation on i86, x86_64 and powerpc Amit K. Arora
2007-05-17 14:25                                   ` [PATCH 2/6][TAKE4] fallocate() on s390 Amit K. Arora
2007-05-17 14:25                                   ` [PATCH 3/6][TAKE4] fallocate() on ia64 Amit K. Arora
2007-05-17 14:26                                   ` [PATCH 4/6][TAKE4] ext4: Extent overlap bugfix Amit K. Arora
2007-05-17 14:29                                   ` [PATCH 5/6][TAKE4] ext4: fallocate support in ext4 Amit K. Arora
2007-05-17 14:30                                   ` [PATCH 6/6][TAKE4] ext4: write support for preallocated blocks Amit K. Arora
2007-05-19  6:44                                 ` [PATCH 0/6][TAKE4] fallocate system call Andrew Morton
2007-05-21  5:24                                   ` Mingming Cao
2007-05-21  5:24                                     ` Mingming Cao
2007-03-30  7:19                 ` Interface for the new fallocate() " Heiko Carstens
2007-03-30  9:15                   ` Paul Mackerras
2007-03-30 10:44                     ` Jörn Engel
2007-03-30 10:44                       ` Jörn Engel
2007-03-30 12:55                       ` Heiko Carstens
2007-03-30 12:55                       ` Heiko Carstens
2007-04-09 13:01                       ` Paul Mackerras
2007-04-09 13:01                         ` Paul Mackerras
2007-04-09 16:34                         ` Jörn Engel
2007-04-09 16:34                           ` Jörn Engel
2007-03-30 10:44                     ` Jörn Engel
2007-03-30  9:15                   ` Paul Mackerras
2007-03-17  5:33       ` [RFC][PATCH] sys_fallocate() " Stephen Rothwell
2007-03-19  9:30         ` Amit K. Arora
2007-03-17 14:53       ` Russell King
