linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6][TAKE7] fallocate system call
@ 2007-07-13 12:38 Amit K. Arora
  2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:38 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, michael.kerrisk, tytso, cmm, suparna, adilger, dgc

This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included at the end of this mail is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
-----
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers once fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.


The following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The __NR_fallocate
will need to be changed depending on the system call number in the kernel (it
may get changed due to merge) and also depending on the architecture.
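
As a rough illustration of that note, the number could be selected per architecture at compile time instead of being hard-coded. The values below are the ones added to the unistd.h files in Patch 2/6 of this series; TEST_NR_FALLOCATE and fallocate_syscall_nr() are hypothetical names used only for this sketch, and the numbers may still change at merge time.

```c
/*
 * Syscall numbers as assigned by the unistd.h hunks in Patch 2/6.
 * They are specific to this patchset and may change when it is merged.
 */
#if defined(__i386__)
#define TEST_NR_FALLOCATE	324
#elif defined(__x86_64__)
#define TEST_NR_FALLOCATE	285
#elif defined(__powerpc__) || defined(__powerpc64__)
#define TEST_NR_FALLOCATE	309
#else
#define TEST_NR_FALLOCATE	(-1)	/* no number assigned in this series */
#endif

/* Hypothetical helper: number the testcase should use, or -1 if unknown. */
int fallocate_syscall_nr(void)
{
	return TEST_NR_FALLOCATE;
}
```

The testcase below hard-codes 324, which is the i386 value.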

--
Regards,
Amit Arora



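#define _GNU_SOURCE	/* for loff_t and syscall() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* strcmp() */
#include <unistd.h>	/* syscall(), close(), lseek(), write() */
#include <fcntl.h>
#include <errno.h>

#include <linux/unistd.h>
#include <sys/vfs.h>
#include <sys/stat.h>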

#define VERBOSE 0

#define __NR_fallocate                324

#define FALLOC_FL_KEEP_SIZE	0x01
#define FALLOC_ALLOCATE		0x0
#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE


int do_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret;

  if (VERBOSE)
  	printf("Trying to preallocate blocks (offset=%lld, len=%lld)\n",
		(long long)offset, (long long)len);
  ret = syscall(__NR_fallocate, fd, mode, offset, len);

  if (ret < 0) {
        /* fd is left open here; main() closes it exactly once. */
        printf("SYSCALL: received error %d, ret=%d\n", errno, ret);
        return(1);
  }

  if (VERBOSE)
  	printf("fallocate system call succeeded! ret=%d\n", ret);

  return ret;
}

int test_fallocate(int fd, int mode, loff_t offset, loff_t len)
{
  int ret, blocks;
  struct stat statbuf1, statbuf2;

  fstat(fd, &statbuf1);

  ret = do_fallocate(fd, mode, offset, len);

  fstat(fd, &statbuf2);

  /* check file size after preallocation */
  if (mode == FALLOC_ALLOCATE) {
  	if (!ret && statbuf1.st_size < (offset + len) &&
	    statbuf2.st_size != (offset + len)) {
		printf("Error: fallocate succeeded, but the file size did not "
			"change, where it should have!\n");
		ret = 1;
	}
  } else if (statbuf1.st_size != statbuf2.st_size) {
	printf("Error : File size changed, when it should not have!\n");
	ret = 1;
  }

  blocks = ((statbuf2.st_blocks - statbuf1.st_blocks) * 512)/ statbuf2.st_blksize;

  /* Print report; the st_* fields are wider than int, so cast before printing */
  printf("# FALLOCATE TEST REPORT #\n");
  printf("\tNew blocks preallocated = %d.\n", blocks);
  printf("\tNumber of bytes preallocated = %lld\n",
	  (long long)blocks * statbuf2.st_blksize);
  printf("\tOld file size = %lld, New file size %lld.\n",
	  (long long)statbuf1.st_size, (long long)statbuf2.st_size);
  printf("\tOld num blocks = %lld, New num blocks %lld.\n",
	  (long long)(statbuf1.st_blocks * 512)/1024,
	  (long long)(statbuf2.st_blocks * 512)/1024);

  return ret;
}


int do_write(int fd, loff_t offset, loff_t len)
{
  int ret;
  char *buf;

  /* calloc() so that we do not write uninitialized heap contents */
  buf = (char *)calloc(1, len);
  if (!buf) {
	printf("error: memory allocation failed.\n");
	return(-1);
  }

  if (VERBOSE)
  	printf("Trying to write to file (offset=%lld, len=%lld)\n",
		(long long)offset, (long long)len);

  /* fd is left open on failure; main() closes it exactly once. */
  if (lseek(fd, offset, SEEK_SET) != offset) {
     	printf("lseek() failed error=%d\n", errno);
	free(buf);
       	return(-1);
  }

  ret = write(fd, buf, len);
  if (ret != len) {
       	printf("write() failed error=%d, ret=%d\n", errno, ret);
	free(buf);
       	return(-1);
  }
  free(buf);

  if (VERBOSE)
  	printf("Write succeeded! Written %lld bytes ret=%d\n",
		(long long)len, ret);

  return ret;
}


int test_write(int fd, loff_t offset, loff_t len)
{
  int ret;

  ret = do_write(fd, offset, len);
  printf("# WRITE TEST REPORT #\n");
  if (ret > 0) printf("\t written %d bytes.\n", ret);
  else printf("\t write operation failed!\n");

  if (ret > 0) return 0;
  else return 1;
}

void usage(char **argv)
{
  printf("\n%s <option> <filename-with-path> <offset> <length>\n", argv[0]);
  printf("option can be one of the following :\n");
  printf("\t-f\t: preallocate. This maps to FALLOC_ALLOCATE mode.\n");
  printf("\t-F\t: preallocate, but do not change the file size.\n");
  printf("\t\t    This maps to FALLOC_RESV_SPACE mode.\n");
  printf("\t-w\t: write some data to the range.\n");
  printf("\t-W\t: preallocate and write some data to the range.\n");
}

/*
 * Arguments:
 * argv[1] = option (-f/-F/-w/-W)
 * argv[2] = fname	: the file name with path
 * argv[3] = offset	: in bytes
 * argv[4] = len	: in bytes
 */
int main(int argc, char **argv)
{
  int ret = 1, fd, mode;
  char *fname; 
  loff_t offset, len;

  if (argc!=5 || argv[1][0] != '-') {
	usage(argv);
	exit(1);
  }

  fname = argv[2];
  offset = atoll(argv[3]);	/* atoll(): atol() truncates on 32-bit */
  len = atoll(argv[4]);

  if (offset < 0 || len <= 0) {
        printf("%s: Invalid arguments.\n", argv[0]);
        exit(1);
  }

  fd = open(fname, O_CREAT|O_RDWR, 0666);
  if (fd < 0) {
        printf("Error opening file %s, error = %d.\n", fname, errno);
        exit(1);
  }

  /* -f */
  if (!strcmp(argv[1], "-f")) {
	mode = FALLOC_ALLOCATE;
	ret = test_fallocate(fd, mode, offset, len);
  	if (ret) printf("test_fallocate: ERROR ! ret=%d\n", ret);
  /* -F */
  } else if (!strcmp(argv[1], "-F")) {
	mode = FALLOC_RESV_SPACE;
	ret = test_fallocate(fd, mode, offset, len);
  	if (ret) printf("test_fallocate: ERROR ! ret=%d\n", ret);
  /* -w */
  } else if (!strcmp(argv[1], "-w")) {
	ret = test_write(fd, offset, len);
  /* -W */
  } else if (!strcmp(argv[1], "-W")) {
	mode = FALLOC_ALLOCATE;
	ret = test_fallocate(fd, mode, offset, len);
  	if (ret) {
		printf("test_fallocate: ERROR ! ret=%d\n", ret);
		goto out;
	}
	ret = test_write(fd, offset, len);
  	if (ret) printf("test_write: ERROR ! ret=%d\n", ret);
  } else {
        printf("%s: Invalid arguments.\n", argv[0]);
	usage(argv);
  }

out:

  if (!ret) printf("\n\n### TESTS PASSED ###\n");
  else printf("\n\n#!# TESTS FAILED #!#\n");

  close(fd);
  return ret;
}


* [PATCH 1/6][TAKE7] manpage for fallocate
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
@ 2007-07-13 12:46 ` Amit K. Arora
  2007-07-13 14:06   ` David Chinner
  2007-07-14  8:23   ` Michael Kerrisk
  2007-07-13 12:47 ` [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc Amit K. Arora
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, michael.kerrisk, tytso, cmm, suparna, adilger, dgc

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

This includes changes suggested by Heikki Orsila and Barry Naujok.


.TH fallocate 2
.SH NAME
fallocate \- preallocate file space
.SH SYNOPSIS
.nf
.B #include <fcntl.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len);
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are two modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except that it never changes the file size. This allows allocation
of zeroed blocks beyond the end of file and is useful for optimising
append workloads.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The mode is not supported on the file descriptor.
.SH AVAILABILITY
The
.B fallocate
system call is available since 2.6.XX
.SH SEE ALSO
.BR syscall (2),
.BR posix_fadvise (2),
.BR ftruncate (2).
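
The file-size rules in the DESCRIPTION above can be summarised in a few lines of C. This is only a sketch of what a caller should expect, not kernel code: expected_size_after_fallocate() is a hypothetical helper, the mode values are the ones defined in Patch 2/6, and the negative return values simply mirror the EINVAL and EOPNOTSUPP cases listed under ERRORS.

```c
#include <errno.h>

/* Mode values as defined in this patchset (Patch 2/6, include/linux/fs.h). */
#define FALLOC_FL_KEEP_SIZE	0x01
#define FALLOC_ALLOCATE		0x0
#define FALLOC_RESV_SPACE	FALLOC_FL_KEEP_SIZE

/*
 * Hypothetical helper: file size to expect after a successful
 * fallocate(fd, mode, offset, len) on a file of old_size bytes.
 * Returns -EINVAL or -EOPNOTSUPP for arguments the call would reject.
 */
long long expected_size_after_fallocate(int mode, long long old_size,
					long long offset, long long len)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	if (mode != FALLOC_ALLOCATE && mode != FALLOC_RESV_SPACE)
		return -EOPNOTSUPP;
	if (mode == FALLOC_RESV_SPACE)
		return old_size;	/* size never changes in this mode */
	/* FALLOC_ALLOCATE: grow only if the range ends past the old EOF */
	return (offset + len > old_size) ? offset + len : old_size;
}
```

For example, preallocating 4096 bytes at offset 0 of a 100-byte file is expected to leave a 4096-byte file with FALLOC_ALLOCATE and a 100-byte file with FALLOC_RESV_SPACE.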


* [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
  2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
@ 2007-07-13 12:47 ` Amit K. Arora
  2007-07-13 13:21   ` Christoph Hellwig
  2007-07-13 12:48 ` [PATCH 3/6][TAKE7] revalidate write permissions for fallocate Amit K. Arora
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:47 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, tytso, cmm, suparna, adilger, dgc

From: Amit Arora <aarora@in.ibm.com>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to reduce fragmentation and thus get
faster access. With preallocation, applications also get a guarantee of
space for particular file(s), even if the file system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Although this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). File systems can do this more
efficiently within the kernel, by implementing the proposed fallocate()
system call. It is expected that posix_fallocate() will be modified to
call this new system call first and, in case the kernel/filesystem does
not implement it, fall back to the current implementation of writing
zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
 	return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+				     u32 lenhi, u32 lenlo)
+{
+	return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+			     ((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high,
 				 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===================================================================
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies the behavior of allocation.
+ * @offset: The offset within file, from where allocation is being
+ *	    requested. It should not have a negative value.
+ * @len: The amount of space in bytes to be allocated, from the offset.
+ *	 This can not be zero or a negative value.
+ *
+ * This system call preallocates space for a file. The range of blocks
+ * allocated depends on the value of offset and len arguments provided
+ * by the user/application. With FALLOC_ALLOCATE or FALLOC_RESV_SPACE
+ * modes, if the system call succeeds, subsequent writes to the file in
+ * the given range (specified by offset & len) should not fail - even if
+ * the file system later becomes full. Hence the preallocation done is
+ * persistent (valid even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by
+ * the individual file systems will update the file size and/or
+ * ctime/mtime depending on the mode and also on the success of the
+ * operation.
+ *
+ * Note: In case the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ *	0	: On SUCCESS a value of zero is returned.
+ *	error	: On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ * <TBD> Generic fallocate to be added for file systems that do not
+ *	 support fallocate.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	struct file *file;
+	struct inode *inode;
+	long ret = -EINVAL;
+
+	if (offset < 0 || len <= 0)
+		goto out;
+
+	/* Return error if mode is not supported */
+	ret = -EOPNOTSUPP;
+	if (mode != FALLOC_ALLOCATE && mode != FALLOC_RESV_SPACE)
+		goto out;
+
+	ret = -EBADF;
+	file = fget(fd);
+	if (!file)
+		goto out;
+	if (!(file->f_mode & FMODE_WRITE))
+		goto out_fput;
+
+	inode = file->f_path.dentry->d_inode;
+
+	ret = -ESPIPE;
+	if (S_ISFIFO(inode->i_mode))
+		goto out_fput;
+
+	ret = -ENODEV;
+	/*
+	 * Let individual file system decide if it supports preallocation
+	 * for directories or not.
+	 */
+	if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EFBIG;
+	/* Check for wrap through zero too */
+	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
+		goto out_fput;
+
+	if (inode->i_op && inode->i_op->fallocate)
+		ret = inode->i_op->fallocate(inode, mode, offset, len);
+	else
+		ret = -ENOSYS;
+
+out_fput:
+	fput(file);
+out:
+	return ret;
+}
+
+/*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
Index: linux-2.6.22/include/asm-i386/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-i386/unistd.h
+++ linux-2.6.22/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
 #define __NR_signalfd		321
 #define __NR_timerfd		322
 #define __NR_eventfd		323
+#define __NR_fallocate		324
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 325
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22/include/asm-powerpc/systbl.h
===================================================================
--- linux-2.6.22.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.22/include/asm-powerpc/systbl.h
@@ -308,6 +308,7 @@ COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
 COMPAT_SYS_SPU(utimensat)
+COMPAT_SYS(fallocate)
 COMPAT_SYS_SPU(signalfd)
 COMPAT_SYS_SPU(timerfd)
 SYSCALL_SPU(eventfd)
Index: linux-2.6.22/include/asm-powerpc/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.22/include/asm-powerpc/unistd.h
@@ -331,10 +331,11 @@
 #define __NR_timerfd		306
 #define __NR_eventfd		307
 #define __NR_sync_file_range2	308
+#define __NR_fallocate		309
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		309
+#define __NR_syscalls		310
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
Index: linux-2.6.22/include/asm-x86_64/unistd.h
===================================================================
--- linux-2.6.22.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.22/include/asm-x86_64/unistd.h
@@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd)
 __SYSCALL(__NR_timerfd, sys_timerfd)
 #define __NR_eventfd		284
 __SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_fallocate		285
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.22/include/linux/fs.h
===================================================================
--- linux-2.6.22.orig/include/linux/fs.h
+++ linux-2.6.22/include/linux/fs.h
@@ -266,6 +266,21 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
 
+/*
+ * sys_fallocate modes
+ * Currently sys_fallocate supports two modes:
+ * FALLOC_ALLOCATE :	This is the preallocate mode, using which an application
+ *			may request reservation of space for a particular file.
+ *			The file size will be changed if the allocation is
+ *			beyond EOF.
+ * FALLOC_RESV_SPACE :	This is same as the above mode, with only one difference
+ *			that the file size will not be modified.
+ */
+#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
+
+#define FALLOC_ALLOCATE        0
+#define FALLOC_RESV_SPACE      FALLOC_FL_KEEP_SIZE
+
 #ifdef __KERNEL__
 
 #include <linux/linkage.h>
@@ -1138,6 +1153,8 @@ struct inode_operations {
 	ssize_t (*listxattr) (struct dentry *, char *, size_t);
 	int (*removexattr) (struct dentry *, const char *);
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
+	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 };
 
 struct seq_file;
Index: linux-2.6.22/include/linux/syscalls.h
===================================================================
--- linux-2.6.22.orig/include/linux/syscalls.h
+++ linux-2.6.22/include/linux/syscalls.h
@@ -610,6 +610,7 @@ asmlinkage long sys_signalfd(int ufd, si
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
Index: linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
===================================================================
--- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
+++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
@@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
 	return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
 				len, advice);
 }
+
+asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
+				unsigned offset_hi, unsigned len_lo,
+				unsigned len_hi)
+{
+	return sys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+			     ((u64)len_hi << 32) | len_lo);
+}


* [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
  2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
  2007-07-13 12:47 ` [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc Amit K. Arora
@ 2007-07-13 12:48 ` Amit K. Arora
  2007-07-13 13:21   ` Christoph Hellwig
  2007-07-13 12:50 ` [PATCH 4/6][TAKE7] ext4: fallocate support in ext4 Amit K. Arora
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:48 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, tytso, cmm, suparna, adilger, dgc

From: David P. Quigley <dpquigl@tycho.nsa.gov>

Revalidate the write permissions for fallocate(2), in case security policy has
changed since the files were opened.

Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>

---
 fs/open.c |    3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===================================================================
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
 		goto out;
 	if (!(file->f_mode & FMODE_WRITE))
 		goto out_fput;
+	ret = security_file_permission(file, MAY_WRITE);
+	if (ret)
+		goto out_fput;
 
 	inode = file->f_path.dentry->d_inode;
 


* [PATCH 4/6][TAKE7] ext4: fallocate support in ext4
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
                   ` (2 preceding siblings ...)
  2007-07-13 12:48 ` [PATCH 3/6][TAKE7] revalidate write permissions for fallocate Amit K. Arora
@ 2007-07-13 12:50 ` Amit K. Arora
  2007-07-13 12:52 ` [PATCH 5/6][TAKE7] ext4: write support for preallocated blocks Amit K. Arora
  2007-07-13 12:52 ` [PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment Amit K. Arora
  5 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:50 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, tytso, cmm, suparna, adilger, dgc

From: Amit Arora <aarora@in.ibm.com>

fallocate support in ext4

This patch implements the ->fallocate() inode operation in ext4. With this
patch, users of ext4 file systems will be able to use the fallocate() system
call for persistent preallocation. The current implementation only supports
preallocation for regular files with extent maps (directories are not
supported yet). Block-mapped files are not supported either.
Only the FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported as of
now.
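
To make the extent changes below easier to follow, here is a small user-space model of how an uninitialized (preallocated) extent can be told apart from an initialized one through its length field, and of the merge rule used in ext4_can_extents_be_merged(). It is a sketch under assumptions: the demo_* names and struct demo_extent are hypothetical, the length encoding (values above EXT_INIT_MAX_LEN mean "uninitialized") is assumed from the 0x8000 value mentioned in the cover letter, and the EXT_MAX_LEN limit check and on-disk endianness are left out.

```c
#include <stdint.h>

/* Assumed value: the cover letter says EXT_INIT_MAX_LEN replaces 0x8000. */
#define EXT_INIT_MAX_LEN	0x8000u

/* Simplified, CPU-endian stand-in for struct ext4_extent (hypothetical). */
struct demo_extent {
	uint32_t block;		/* first logical block covered */
	uint16_t len;		/* > EXT_INIT_MAX_LEN means uninitialized */
	uint64_t pblock;	/* first physical block */
};

static int demo_is_uninitialized(const struct demo_extent *ex)
{
	return ex->len > EXT_INIT_MAX_LEN;
}

/* Number of blocks covered, with the "uninitialized" marker stripped. */
static unsigned demo_actual_len(const struct demo_extent *ex)
{
	return ex->len <= EXT_INIT_MAX_LEN ? ex->len
					   : ex->len - EXT_INIT_MAX_LEN;
}

static void demo_mark_uninitialized(struct demo_extent *ex)
{
	ex->len = (uint16_t)(ex->len | EXT_INIT_MAX_LEN);
}

/* Merge rule: same state, logically adjacent and physically adjacent. */
int demo_can_merge(const struct demo_extent *a, const struct demo_extent *b)
{
	if (demo_is_uninitialized(a) ^ demo_is_uninitialized(b))
		return 0;
	if (a->block + demo_actual_len(a) != b->block)
		return 0;
	return a->pblock + demo_actual_len(a) == b->pblock;
}

/* Append b to a, re-applying the marker after summing the real lengths. */
void demo_merge(struct demo_extent *a, const struct demo_extent *b)
{
	int uninit = demo_is_uninitialized(a);

	a->len = (uint16_t)(demo_actual_len(a) + demo_actual_len(b));
	if (uninit)
		demo_mark_uninitialized(a);
}
```

This is why the patch replaces direct le16_to_cpu(ex->ee_len) reads with ext4_ext_get_actual_len(ex) wherever a real block count is needed: the raw field can no longer be used as a length once it may carry the marker.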


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 				sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+	unsigned short ext1_ee_len, ext2_ee_len;
+
+	/*
+	 * Make sure that either both extents are uninitialized, or
+	 * both are _not_.
+	 */
+	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+		return 0;
+
+	ext1_ee_len = ext4_ext_get_actual_len(ex1);
+	ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+	if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
 			le32_to_cpu(ex2->ee_block))
 		return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
 		return 0;
 #endif
 
-	if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+	if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
 		return 1;
 	return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
 	unsigned int ret = 0;
 
 	b1 = le32_to_cpu(newext->ee_block);
-	len1 = le16_to_cpu(newext->ee_len);
+	len1 = ext4_ext_get_actual_len(newext);
 	depth = ext_depth(inode);
 	if (!path[depth].p_ext)
 		goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
 	struct ext4_extent *nearex; /* nearest extent */
 	struct ext4_ext_path *npath = NULL;
 	int depth, len, err, next;
+	unsigned uninitialized = 0;
 
-	BUG_ON(newext->ee_len == 0);
+	BUG_ON(ext4_ext_get_actual_len(newext) == 0);
 	depth = ext_depth(inode);
 	ex = path[depth].p_ext;
 	BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
 	/* try to insert block into found extent and return */
 	if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
 		ext_debug("append %d block to %d:%d (from %llu)\n",
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				le32_to_cpu(ex->ee_block),
-				le16_to_cpu(ex->ee_len), ext_pblock(ex));
+				ext4_ext_get_actual_len(ex), ext_pblock(ex));
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
 			return err;
-		ex->ee_len = cpu_to_le16(le16_to_cpu(ex->ee_len)
-					 + le16_to_cpu(newext->ee_len));
+
+		/*
+		 * ext4_can_extents_be_merged should have checked that either
+		 * both extents are uninitialized, or both aren't. Thus we
+		 * need to check only one of them here.
+		 */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+					+ ext4_ext_get_actual_len(newext));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 		eh = path[depth].p_hdr;
 		nearex = ex;
 		goto merge;
@@ -1263,7 +1286,7 @@ has_space:
 		ext_debug("first extent in the leaf: %d:%llu:%d\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len));
+				ext4_ext_get_actual_len(newext));
 		path[depth].p_ext = EXT_FIRST_EXTENT(eh);
 	} else if (le32_to_cpu(newext->ee_block)
 			   > le32_to_cpu(nearex->ee_block)) {
@@ -1276,7 +1299,7 @@ has_space:
 					"move %d from 0x%p to 0x%p\n",
 					le32_to_cpu(newext->ee_block),
 					ext_pblock(newext),
-					le16_to_cpu(newext->ee_len),
+					ext4_ext_get_actual_len(newext),
 					nearex, len, nearex + 1, nearex + 2);
 			memmove(nearex + 2, nearex + 1, len);
 		}
@@ -1289,7 +1312,7 @@ has_space:
 				"move %d from 0x%p to 0x%p\n",
 				le32_to_cpu(newext->ee_block),
 				ext_pblock(newext),
-				le16_to_cpu(newext->ee_len),
+				ext4_ext_get_actual_len(newext),
 				nearex, len, nearex + 1, nearex + 2);
 		memmove(nearex + 1, nearex, len);
 		path[depth].p_ext = nearex;
@@ -1308,8 +1331,13 @@ merge:
 		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
 			break;
 		/* merge with next extent! */
-		nearex->ee_len = cpu_to_le16(le16_to_cpu(nearex->ee_len)
-					     + le16_to_cpu(nearex[1].ee_len));
+		if (ext4_ext_is_uninitialized(nearex))
+			uninitialized = 1;
+		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
+					+ ext4_ext_get_actual_len(nearex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(nearex);
+
 		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
 			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
 					* sizeof(struct ext4_extent);
@@ -1379,8 +1407,8 @@ int ext4_ext_walk_space(struct inode *in
 			end = le32_to_cpu(ex->ee_block);
 			if (block + num < end)
 				end = block + num;
-		} else if (block >=
-			     le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len)) {
+		} else if (block >= le32_to_cpu(ex->ee_block)
+					+ ext4_ext_get_actual_len(ex)) {
 			/* need to allocate space after found extent */
 			start = block;
 			end = block + num;
@@ -1392,7 +1420,8 @@ int ext4_ext_walk_space(struct inode *in
 			 * by found extent
 			 */
 			start = block;
-			end = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len);
+			end = le32_to_cpu(ex->ee_block)
+				+ ext4_ext_get_actual_len(ex);
 			if (block + num < end)
 				end = block + num;
 			exists = 1;
@@ -1408,7 +1437,7 @@ int ext4_ext_walk_space(struct inode *in
 			cbex.ec_type = EXT4_EXT_CACHE_GAP;
 		} else {
 			cbex.ec_block = le32_to_cpu(ex->ee_block);
-			cbex.ec_len = le16_to_cpu(ex->ee_len);
+			cbex.ec_len = ext4_ext_get_actual_len(ex);
 			cbex.ec_start = ext_pblock(ex);
 			cbex.ec_type = EXT4_EXT_CACHE_EXTENT;
 		}
@@ -1481,15 +1510,15 @@ ext4_ext_put_gap_in_cache(struct inode *
 		ext_debug("cache gap(before): %lu [%lu:%lu]",
 				(unsigned long) block,
 				(unsigned long) le32_to_cpu(ex->ee_block),
-				(unsigned long) le16_to_cpu(ex->ee_len));
+				(unsigned long) ext4_ext_get_actual_len(ex));
 	} else if (block >= le32_to_cpu(ex->ee_block)
-			    + le16_to_cpu(ex->ee_len)) {
+			+ ext4_ext_get_actual_len(ex)) {
 		lblock = le32_to_cpu(ex->ee_block)
-			 + le16_to_cpu(ex->ee_len);
+			+ ext4_ext_get_actual_len(ex);
 		len = ext4_ext_next_allocated_block(path);
 		ext_debug("cache gap(after): [%lu:%lu] %lu",
 				(unsigned long) le32_to_cpu(ex->ee_block),
-				(unsigned long) le16_to_cpu(ex->ee_len),
+				(unsigned long) ext4_ext_get_actual_len(ex),
 				(unsigned long) block);
 		BUG_ON(len == lblock);
 		len = len - lblock;
@@ -1619,12 +1648,12 @@ static int ext4_remove_blocks(handle_t *
 				unsigned long from, unsigned long to)
 {
 	struct buffer_head *bh;
+	unsigned short ee_len =  ext4_ext_get_actual_len(ex);
 	int i;
 
 #ifdef EXTENTS_STATS
 	{
 		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
-		unsigned short ee_len =  le16_to_cpu(ex->ee_len);
 		spin_lock(&sbi->s_ext_stats_lock);
 		sbi->s_ext_blocks += ee_len;
 		sbi->s_ext_extents++;
@@ -1638,12 +1667,12 @@ static int ext4_remove_blocks(handle_t *
 	}
 #endif
 	if (from >= le32_to_cpu(ex->ee_block)
-	    && to == le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+	    && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		/* tail removal */
 		unsigned long num;
 		ext4_fsblk_t start;
-		num = le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - from;
-		start = ext_pblock(ex) + le16_to_cpu(ex->ee_len) - num;
+		num = le32_to_cpu(ex->ee_block) + ee_len - from;
+		start = ext_pblock(ex) + ee_len - num;
 		ext_debug("free last %lu blocks starting %llu\n", num, start);
 		for (i = 0; i < num; i++) {
 			bh = sb_find_get_block(inode->i_sb, start + i);
@@ -1651,12 +1680,12 @@ static int ext4_remove_blocks(handle_t *
 		}
 		ext4_free_blocks(handle, inode, start, num);
 	} else if (from == le32_to_cpu(ex->ee_block)
-		   && to <= le32_to_cpu(ex->ee_block) + le16_to_cpu(ex->ee_len) - 1) {
+		   && to <= le32_to_cpu(ex->ee_block) + ee_len - 1) {
 		printk("strange request: removal %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	} else {
 		printk("strange request: removal(2) %lu-%lu from %u:%u\n",
-		       from, to, le32_to_cpu(ex->ee_block), le16_to_cpu(ex->ee_len));
+			from, to, le32_to_cpu(ex->ee_block), ee_len);
 	}
 	return 0;
 }
@@ -1671,6 +1700,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	unsigned a, b, block, num;
 	unsigned long ex_ee_block;
 	unsigned short ex_ee_len;
+	unsigned uninitialized = 0;
 	struct ext4_extent *ex;
 
 	ext_debug("truncate since %lu in leaf\n", start);
@@ -1685,7 +1715,9 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 	ex = EXT_LAST_EXTENT(eh);
 
 	ex_ee_block = le32_to_cpu(ex->ee_block);
-	ex_ee_len = le16_to_cpu(ex->ee_len);
+	if (ext4_ext_is_uninitialized(ex))
+		uninitialized = 1;
+	ex_ee_len = ext4_ext_get_actual_len(ex);
 
 	while (ex >= EXT_FIRST_EXTENT(eh) &&
 			ex_ee_block + ex_ee_len > start) {
@@ -1753,6 +1785,8 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
 		if (err)
@@ -1762,7 +1796,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 				ext_pblock(ex));
 		ex--;
 		ex_ee_block = le32_to_cpu(ex->ee_block);
-		ex_ee_len = le16_to_cpu(ex->ee_len);
+		ex_ee_len = ext4_ext_get_actual_len(ex);
 	}
 
 	if (correct_index && eh->eh_entries)
@@ -2038,7 +2072,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (ex) {
 		unsigned long ee_block = le32_to_cpu(ex->ee_block);
 		ext4_fsblk_t ee_start = ext_pblock(ex);
-		unsigned short ee_len  = le16_to_cpu(ex->ee_len);
+		unsigned short ee_len;
 
 		/*
 		 * Allow future support for preallocated extents to be added
@@ -2046,8 +2080,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		 * Uninitialized extents are treated as holes, except that
 		 * we avoid (fail) allocating new blocks during a write.
 		 */
-		if (ee_len > EXT_MAX_LEN)
+		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
 			goto out2;
+		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
 			newblock = iblock - ee_block + ee_start;
@@ -2055,8 +2090,11 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
-			ext4_ext_put_in_cache(inode, ee_block, ee_len,
-						ee_start, EXT4_EXT_CACHE_EXTENT);
+			/* Do not put uninitialized extent in the cache */
+			if (!ext4_ext_is_uninitialized(ex))
+				ext4_ext_put_in_cache(inode, ee_block,
+							ee_len, ee_start,
+							EXT4_EXT_CACHE_EXTENT);
 			goto out;
 		}
 	}
@@ -2098,6 +2136,8 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* try to insert new extent into found leaf and return */
 	ext4_ext_store_pblock(&newex, newblock);
 	newex.ee_len = cpu_to_le16(allocated);
+	if (create == EXT4_CREATE_UNINITIALIZED_EXT)  /* Mark uninitialized */
+		ext4_ext_mark_uninitialized(&newex);
 	err = ext4_ext_insert_extent(handle, inode, path, &newex);
 	if (err) {
 		/* free data blocks we just allocated */
@@ -2113,8 +2153,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	newblock = ext_pblock(&newex);
 	__set_bit(BH_New, &bh_result->b_state);
 
-	ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
-				EXT4_EXT_CACHE_EXTENT);
+	/* Cache only when it is _not_ an uninitialized extent */
+	if (create != EXT4_CREATE_UNINITIALIZED_EXT)
+		ext4_ext_put_in_cache(inode, iblock, allocated, newblock,
+						EXT4_EXT_CACHE_EXTENT);
 out:
 	if (allocated > max_blocks)
 		allocated = max_blocks;
@@ -2217,3 +2259,130 @@ int ext4_ext_writepage_trans_blocks(stru
 
 	return needed;
 }
+
+/*
+ * preallocate space for a file. This implements ext4's fallocate inode
+ * operation, which gets called from sys_fallocate system call.
+ * Currently only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are
+ * supported on extent based files.
+ * For block-mapped files, posix_fallocate should fall back to the method
+ * of writing zeroes to the required new blocks (the same behavior which is
+ * expected for file systems which do not support fallocate() system call).
+ */
+long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
+{
+	handle_t *handle;
+	ext4_fsblk_t block, max_blocks;
+	ext4_fsblk_t nblocks = 0;
+	int ret = 0;
+	int ret2 = 0;
+	int retries = 0;
+	struct buffer_head map_bh;
+	unsigned int credits, blkbits = inode->i_blkbits;
+
+	/*
+	 * currently supporting (pre)allocate mode for extent-based
+	 * files _only_
+	 */
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+	    (mode != FALLOC_ALLOCATE && mode != FALLOC_RESV_SPACE))
+		return -EOPNOTSUPP;
+
+	/* preallocation to directories is currently not supported */
+	if (S_ISDIR(inode->i_mode))
+		return -ENODEV;
+
+	block = offset >> blkbits;
+	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
+			- block;
+
+	/*
+	 * credits to insert 1 extent into extent tree + buffers to be able to
+	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 */
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+retry:
+	while (ret >= 0 && ret < max_blocks) {
+		block = block + ret;
+		max_blocks = max_blocks - ret;
+		handle = ext4_journal_start(inode, credits);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			break;
+		}
+
+		ret = ext4_ext_get_blocks(handle, inode, block,
+					  max_blocks, &map_bh,
+					  EXT4_CREATE_UNINITIALIZED_EXT, 0);
+		WARN_ON(!ret);
+		if (!ret) {
+			ext4_error(inode->i_sb, "ext4_fallocate",
+				   "ext4_ext_get_blocks returned 0! inode#%lu"
+				   ", block=%llu, max_blocks=%llu",
+				   inode->i_ino, block, max_blocks);
+			ret = -EIO;
+			ext4_mark_inode_dirty(handle, inode);
+			ret2 = ext4_journal_stop(handle);
+			break;
+		}
+		if (ret > 0) {
+			/* check wrap through sign-bit/zero here */
+			if ((block + ret) < 0 || (block + ret) < block) {
+				ret = -EIO;
+				ext4_mark_inode_dirty(handle, inode);
+				ret2 = ext4_journal_stop(handle);
+				break;
+			}
+			if (buffer_new(&map_bh) && ((block + ret) >
+			    (EXT4_BLOCK_ALIGN(i_size_read(inode), blkbits)
+			    >> blkbits)))
+					nblocks = nblocks + ret;
+		}
+
+		/* Update ctime if new blocks get allocated */
+		if (nblocks) {
+			struct timespec now;
+
+			now = current_fs_time(inode->i_sb);
+			if (!timespec_equal(&inode->i_ctime, &now))
+				inode->i_ctime = now;
+		}
+
+		ext4_mark_inode_dirty(handle, inode);
+		ret2 = ext4_journal_stop(handle);
+		if (ret2)
+			break;
+	}
+
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
+
+	/*
+	 * Time to update the file size.
+	 * Update only when preallocation was requested beyond the file size.
+	 */
+	if (mode != FALLOC_RESV_SPACE &&
+	    (offset + len) > i_size_read(inode)) {
+		if (ret > 0) {
+			/*
+			 * if no error, we assume preallocation succeeded
+			 * completely
+			 */
+			mutex_lock(&inode->i_mutex);
+			i_size_write(inode, offset + len);
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		} else if (ret < 0 && nblocks) {
+			/* Handle partial allocation scenario */
+			loff_t newsize;
+
+			mutex_lock(&inode->i_mutex);
+			newsize  = (nblocks << blkbits) + i_size_read(inode);
+			i_size_write(inode, EXT4_BLOCK_ALIGN(newsize, blkbits));
+			EXT4_I(inode)->i_disksize = i_size_read(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
+	}
+
+	return ret > 0 ? ret2 : ret;
+}
Index: linux-2.6.22/fs/ext4/file.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/file.c
+++ linux-2.6.22/fs/ext4/file.c
@@ -135,5 +135,6 @@ const struct inode_operations ext4_file_
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext4_permission,
+	.fallocate	= ext4_fallocate,
 };
 
Index: linux-2.6.22/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs.h
+++ linux-2.6.22/include/linux/ext4_fs.h
@@ -102,6 +102,7 @@
 				 EXT4_GOOD_OLD_FIRST_INO : \
 				 (s)->s_first_ino)
 #endif
+#define EXT4_BLOCK_ALIGN(size, blkbits)		ALIGN((size), (1 << (blkbits)))
 
 /*
  * Macro-instructions used to manage fragments
@@ -225,6 +226,11 @@ struct ext4_new_group_data {
 	__u32 free_blocks_count;
 };
 
+/*
+ * Following is used by preallocation code to tell get_blocks() that we
+ * want uninitialized extents.
+ */
+#define EXT4_CREATE_UNINITIALIZED_EXT		2
 
 /*
  * ioctl commands
@@ -983,6 +989,8 @@ extern int ext4_ext_get_blocks(handle_t 
 extern void ext4_ext_truncate(struct inode *, struct page *);
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
+extern long ext4_fallocate(struct inode *inode, int mode, loff_t offset,
+			  loff_t len);
 static inline int
 ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
 			unsigned long max_blocks, struct buffer_head *bh,
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -188,6 +188,21 @@ ext4_ext_invalidate_cache(struct inode *
 	EXT4_I(inode)->i_cached_extent.ec_type = EXT4_EXT_CACHE_NO;
 }
 
+static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
+{
+	ext->ee_len |= cpu_to_le16(0x8000);
+}
+
+static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
+{
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+}
+
+static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
+{
+	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+}
+
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
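
The byte-to-block conversion at the top of ext4_fallocate() above can be sketched in isolation. This is a minimal userspace sketch, not the kernel code: block_align(), struct block_range and byte_range_to_blocks() are hypothetical names introduced here, and block_align() only mirrors what EXT4_BLOCK_ALIGN() does for a power-of-two block size.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of the block size (1 << blkbits). */
static uint64_t block_align(uint64_t size, unsigned blkbits)
{
	uint64_t mask = ((uint64_t)1 << blkbits) - 1;

	return (size + mask) & ~mask;
}

/* Hypothetical helper type: first logical block and number of blocks. */
struct block_range {
	uint64_t first;
	uint64_t count;
};

/*
 * Convert the byte range [offset, offset + len) into the logical blocks
 * that cover it, the same way the patch derives "block" and "max_blocks".
 */
static struct block_range byte_range_to_blocks(uint64_t offset, uint64_t len,
					       unsigned blkbits)
{
	struct block_range r;

	assert(len > 0);	/* a zero-length request covers no blocks */
	r.first = offset >> blkbits;
	r.count = (block_align(offset + len, blkbits) >> blkbits) - r.first;
	return r;
}
```

With 4 KiB blocks (blkbits = 12), a 2-byte request starting at offset 4095 straddles a block boundary and therefore covers two blocks, even though it is far smaller than one block.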

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 5/6][TAKE7] ext4: write support for preallocated blocks
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
                   ` (3 preceding siblings ...)
  2007-07-13 12:50 ` [PATCH 4/6][TAKE7] ext4: fallocate support in ext4 Amit K. Arora
@ 2007-07-13 12:52 ` Amit K. Arora
  2007-07-13 12:52 ` [PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment Amit K. Arora
  5 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, tytso, cmm, suparna, adilger, dgc

From:  Amit Arora <aarora@in.ibm.com>

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			   "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+	}
+
+	return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1374,7 @@ has_space:
 
 merge:
 	/* try to merge extents to the right */
-	while (nearex < EXT_LAST_EXTENT(eh)) {
-		if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-			break;
-		/* merge with next extent! */
-		if (ext4_ext_is_uninitialized(nearex))
-			uninitialized = 1;
-		nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-					+ ext4_ext_get_actual_len(nearex + 1));
-		if (uninitialized)
-			ext4_ext_mark_uninitialized(nearex);
-
-		if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-			len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-					* sizeof(struct ext4_extent);
-			memmove(nearex + 1, nearex + 2, len);
-		}
-		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-		BUG_ON(eh->eh_entries == 0);
-	}
+	ext4_ext_try_to_merge(inode, path, nearex);
 
 	/* try to merge extents to the left */
 
@@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Someone is writing in the middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+					struct ext4_ext_path *path,
+					ext4_fsblk_t iblock,
+					unsigned long max_blocks)
+{
+	struct ext4_extent *ex, newex;
+	struct ext4_extent *ex1 = NULL;
+	struct ext4_extent *ex2 = NULL;
+	struct ext4_extent *ex3 = NULL;
+	struct ext4_extent_header *eh;
+	unsigned int allocated, ee_block, ee_len, depth;
+	ext4_fsblk_t newblock;
+	int err = 0;
+	int ret = 0;
+
+	depth = ext_depth(inode);
+	eh = path[depth].p_hdr;
+	ex = path[depth].p_ext;
+	ee_block = le32_to_cpu(ex->ee_block);
+	ee_len = ext4_ext_get_actual_len(ex);
+	allocated = ee_len - (iblock - ee_block);
+	newblock = iblock - ee_block + ext_pblock(ex);
+	ex2 = ex;
+
+	/* ex1: ee_block to iblock - 1 : uninitialized */
+	if (iblock > ee_block) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/*
+	 * for sanity, update the length of the ex2 extent before
+	 * we insert ex3, if ex1 is NULL. This is to avoid temporary
+	 * overlap of blocks.
+	 */
+	if (!ex1 && allocated > max_blocks)
+		ex2->ee_len = cpu_to_le16(max_blocks);
+	/* ex3: to ee_block + ee_len : uninitialised */
+	if (allocated > max_blocks) {
+		unsigned int newdepth;
+		ex3 = &newex;
+		ex3->ee_block = cpu_to_le32(iblock + max_blocks);
+		ext4_ext_store_pblock(ex3, newblock + max_blocks);
+		ex3->ee_len = cpu_to_le16(allocated - max_blocks);
+		ext4_ext_mark_uninitialized(ex3);
+		err = ext4_ext_insert_extent(handle, inode, path, ex3);
+		if (err)
+			goto out;
+		/*
+		 * The depth, and hence eh & ex might change
+		 * as part of the insert above.
+		 */
+		newdepth = ext_depth(inode);
+		if (newdepth != depth) {
+			depth = newdepth;
+			path = ext4_ext_find_extent(inode, iblock, NULL);
+			if (IS_ERR(path)) {
+				err = PTR_ERR(path);
+				path = NULL;
+				goto out;
+			}
+			eh = path[depth].p_hdr;
+			ex = path[depth].p_ext;
+			if (ex2 != &newex)
+				ex2 = ex;
+		}
+		allocated = max_blocks;
+	}
+	/*
+	 * If there was a change of depth as part of the
+	 * insertion of ex3 above, we need to update the length
+	 * of the ex1 extent again here
+	 */
+	if (ex1 && ex1 != ex) {
+		ex1 = ex;
+		ex1->ee_len = cpu_to_le16(iblock - ee_block);
+		ext4_ext_mark_uninitialized(ex1);
+		ex2 = &newex;
+	}
+	/* ex2: iblock to iblock + maxblocks-1 : initialised */
+	ex2->ee_block = cpu_to_le32(iblock);
+	ex2->ee_start = cpu_to_le32(newblock);
+	ext4_ext_store_pblock(ex2, newblock);
+	ex2->ee_len = cpu_to_le16(allocated);
+	if (ex2 != ex)
+		goto insert;
+	err = ext4_ext_get_access(handle, inode, path + depth);
+	if (err)
+		goto out;
+	/*
+	 * New (initialized) extent starts from the first block
+	 * in the current extent. i.e., ex2 == ex
+	 * We have to see if it can be merged with the extent
+	 * on the left.
+	 */
+	if (ex2 > EXT_FIRST_EXTENT(eh)) {
+		/*
+		 * To merge left, pass "ex2 - 1" to try_to_merge(),
+		 * since it merges towards right _only_.
+		 */
+		ret = ext4_ext_try_to_merge(inode, path, ex2 - 1);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+			depth = ext_depth(inode);
+			ex2--;
+		}
+	}
+	/*
+	 * Try to Merge towards right. This might be required
+	 * only when the whole extent is being written to.
+	 * i.e. ex2 == ex and ex3 == NULL.
+	 */
+	if (!ex3) {
+		ret = ext4_ext_try_to_merge(inode, path, ex2);
+		if (ret) {
+			err = ext4_ext_correct_indexes(handle, inode, path);
+			if (err)
+				goto out;
+		}
+	}
+	/* Mark modified extent as dirty */
+	err = ext4_ext_dirty(handle, inode, path + depth);
+	goto out;
+insert:
+	err = ext4_ext_insert_extent(handle, inode, path, &newex);
+out:
+	return err ? err : allocated;
+}
+
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_fsblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
 			int create, int extend_disksize)
 {
 	struct ext4_ext_path *path = NULL;
+	struct ext4_extent_header *eh;
 	struct ext4_extent newex, *ex;
 	ext4_fsblk_t goal, newblock;
-	int err = 0, depth;
+	int err = 0, depth, ret;
 	unsigned long allocated = 0;
 
 	__clear_bit(BH_New, &bh_result->b_state);
@@ -2032,8 +2204,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	if (goal) {
 		if (goal == EXT4_EXT_CACHE_GAP) {
 			if (!create) {
-				/* block isn't allocated yet and
-				 * user doesn't want to allocate it */
+				/*
+				 * block isn't allocated yet and
+				 * user doesn't want to allocate it
+				 */
 				goto out2;
 			}
 			/* we should allocate requested block */
@@ -2067,6 +2241,7 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * this is why assert can't be put in ext4_ext_find_extent()
 	 */
 	BUG_ON(path[depth].p_ext == NULL && depth != 0);
+	eh = path[depth].p_hdr;
 
 	ex = path[depth].p_ext;
 	if (ex) {
@@ -2075,13 +2250,9 @@ int ext4_ext_get_blocks(handle_t *handle
 		unsigned short ee_len;
 
 		/*
-		 * Allow future support for preallocated extents to be added
-		 * as an RO_COMPAT feature:
 		 * Uninitialized extents are treated as holes, except that
-		 * we avoid (fail) allocating new blocks during a write.
+		 * we split out initialized portions during a write.
 		 */
-		if (le16_to_cpu(ex->ee_len) > EXT_MAX_LEN)
-			goto out2;
 		ee_len = ext4_ext_get_actual_len(ex);
 		/* if found extent covers block, simply return it */
 		if (iblock >= ee_block && iblock < ee_block + ee_len) {
@@ -2090,12 +2261,27 @@ int ext4_ext_get_blocks(handle_t *handle
 			allocated = ee_len - (iblock - ee_block);
 			ext_debug("%d fit into %lu:%d -> %llu\n", (int) iblock,
 					ee_block, ee_len, newblock);
+
 			/* Do not put uninitialized extent in the cache */
-			if (!ext4_ext_is_uninitialized(ex))
+			if (!ext4_ext_is_uninitialized(ex)) {
 				ext4_ext_put_in_cache(inode, ee_block,
 							ee_len, ee_start,
 							EXT4_EXT_CACHE_EXTENT);
-			goto out;
+				goto out;
+			}
+			if (create == EXT4_CREATE_UNINITIALIZED_EXT)
+				goto out;
+			if (!create)
+				goto out2;
+
+			ret = ext4_ext_convert_to_initialized(handle, inode,
+								path, iblock,
+								max_blocks);
+			if (ret <= 0)
+				goto out2;
+			else
+				allocated = ret;
+			goto outnew;
 		}
 	}
 
@@ -2104,8 +2290,10 @@ int ext4_ext_get_blocks(handle_t *handle
 	 * we couldn't try to create block if create flag is zero
 	 */
 	if (!create) {
-		/* put just found gap into cache to speed up
-		 * subsequent requests */
+		/*
+		 * put just found gap into cache to speed up
+		 * subsequent requests
+		 */
 		ext4_ext_put_gap_in_cache(inode, path, iblock);
 		goto out2;
 	}
@@ -2151,6 +2339,7 @@ int ext4_ext_get_blocks(handle_t *handle
 
 	/* previous routine could use block we allocated */
 	newblock = ext_pblock(&newex);
+outnew:
 	__set_bit(BH_New, &bh_result->b_state);
 
 	/* Cache only when it is _not_ an uninitialized extent */
@@ -2220,7 +2409,8 @@ void ext4_ext_truncate(struct inode * in
 	err = ext4_ext_remove_space(inode, last_block);
 
 	/* In a multi-transaction truncate, we only make the final
-	 * transaction synchronous. */
+	 * transaction synchronous.
+	 */
 	if (IS_SYNC(inode))
 		handle->h_sync = 1;
 
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -205,6 +205,9 @@ static inline int ext4_ext_get_actual_le
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_try_to_merge(struct inode *inode,
+				 struct ext4_ext_path *path,
+				 struct ext4_extent *);
 extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
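
The three split cases handled by ext4_ext_convert_to_initialized() in the patch above can be sketched with plain integers. This is a simplified model under stated assumptions: it ignores the extent tree, journaling and merging entirely, and struct piece and split_uninit_extent() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>

/* One resulting extent: logical start, length, and whether it stays uninitialized. */
struct piece {
	unsigned start;
	unsigned len;
	int uninit;
};

/*
 * Model a write of up to max_blocks blocks at iblock into the uninitialized
 * extent [ee_block, ee_block + ee_len). Fills out[] with one to three pieces
 * in logical order and returns how many were produced.
 */
static int split_uninit_extent(unsigned ee_block, unsigned ee_len,
			       unsigned iblock, unsigned max_blocks,
			       struct piece out[3])
{
	unsigned allocated, written;
	int n = 0;

	assert(max_blocks > 0);
	assert(iblock >= ee_block && iblock < ee_block + ee_len);

	/* blocks from iblock to the end of the original extent */
	allocated = ee_len - (iblock - ee_block);
	written = allocated > max_blocks ? max_blocks : allocated;

	/* ex1: ee_block to iblock - 1, stays uninitialized */
	if (iblock > ee_block)
		out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };

	/* ex2: the blocks actually being written, becomes initialized */
	out[n++] = (struct piece){ iblock, written, 0 };

	/* ex3: whatever is left after the write, stays uninitialized */
	if (allocated > max_blocks)
		out[n++] = (struct piece){ iblock + max_blocks,
					   allocated - max_blocks, 1 };

	return n;
}
```

For a 10-block extent starting at logical block 100, writing 4 blocks at block 103 yields three pieces (100..102 uninitialized, 103..106 initialized, 107..109 uninitialized), while writing all 10 blocks from block 100 yields a single initialized piece.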

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment
  2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
                   ` (4 preceding siblings ...)
  2007-07-13 12:52 ` [PATCH 5/6][TAKE7] ext4: write support for preallocated blocks Amit K. Arora
@ 2007-07-13 12:52 ` Amit K. Arora
  5 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 12:52 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-ext4
  Cc: xfs, tytso, cmm, suparna, adilger, dgc

From: Amit Arora <aarora@in.ibm.com>

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger. 
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-group alignment.
Now, the maximum number of blocks we can have in an initialized extent is
2^15 and in an uninitialized extent is 2^15 - 1.

This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	unsigned short ext1_ee_len, ext2_ee_len;
+	unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
 	/*
 	 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
 	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
 		return 0;
 
+	if (ext4_ext_is_uninitialized(ex1))
+		max_len = EXT_UNINIT_MAX_LEN;
+	else
+		max_len = EXT_INIT_MAX_LEN;
+
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
 	ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > max_len)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 		ex->ee_block = cpu_to_le32(block);
 		ex->ee_len = cpu_to_le16(num);
-		if (uninitialized)
+		/*
+		 * Do not mark uninitialized if all the blocks in the
+		 * extent have been removed.
+		 */
+		if (uninitialized && num)
 			ext4_ext_mark_uninitialized(ex);
 
 		err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,19 @@ int ext4_ext_get_blocks(handle_t *handle
 	/* allocate new block */
 	goal = ext4_ext_find_goal(inode, path, iblock);
 
+	/*
+	 * See if request is beyond maximum number of blocks we can have in
+	 * a single extent. For an initialized extent this limit is
+	 * EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+	 * EXT_UNINIT_MAX_LEN.
+	 */
+	if (max_blocks > EXT_INIT_MAX_LEN &&
+	    create != EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_INIT_MAX_LEN;
+	else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+		 create == EXT4_CREATE_UNINITIALIZED_EXT)
+		max_blocks = EXT_UNINIT_MAX_LEN;
+
 	/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
 	newex.ee_block = cpu_to_le32(iblock);
 	newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===================================================================
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK	0xffffffff
 
-#define EXT_MAX_LEN	((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN	(1UL << 15)
+#define EXT_UNINIT_MAX_LEN	(EXT_INIT_MAX_LEN - 1)
 
 
 #define EXT_FIRST_EXTENT(__hdr__) \
@@ -190,17 +208,22 @@ ext4_ext_invalidate_cache(struct inode *
 
 static inline void ext4_ext_mark_uninitialized(struct ext4_extent *ext)
 {
-	ext->ee_len |= cpu_to_le16(0x8000);
+	/* We can not have an uninitialized extent of zero length! */
+	BUG_ON((le16_to_cpu(ext->ee_len) & ~EXT_INIT_MAX_LEN) == 0);
+	ext->ee_len |= cpu_to_le16(EXT_INIT_MAX_LEN);
 }
 
 static inline int ext4_ext_is_uninitialized(struct ext4_extent *ext)
 {
-	return (int)(le16_to_cpu((ext)->ee_len) & 0x8000);
+	/* Extent with ee_len of 0x8000 is treated as an initialized extent */
+	return (le16_to_cpu(ext->ee_len) > EXT_INIT_MAX_LEN);
 }
 
 static inline int ext4_ext_get_actual_len(struct ext4_extent *ext)
 {
-	return (int)(le16_to_cpu((ext)->ee_len) & 0x7FFF);
+	return (le16_to_cpu(ext->ee_len) <= EXT_INIT_MAX_LEN ?
+		le16_to_cpu(ext->ee_len) :
+		(le16_to_cpu(ext->ee_len) - EXT_INIT_MAX_LEN));
 }
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
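
The ee_len encoding introduced by the patch above can be sketched on host-endian 16-bit values. This is a minimal sketch that assumes the on-disk little-endian conversion (le16_to_cpu/cpu_to_le16) has already been applied; the macro and function names below are shortened, hypothetical stand-ins for the ones in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define INIT_MAX_LEN	(1U << 15)		/* 32768 blocks, initialized */
#define UNINIT_MAX_LEN	(INIT_MAX_LEN - 1)	/* 32767 blocks, uninitialized */

/* A raw value of exactly 0x8000 is an initialized extent of 32768 blocks. */
static int is_uninitialized(uint16_t ee_len)
{
	return ee_len > INIT_MAX_LEN;
}

/* Number of blocks the extent really covers, whichever state it is in. */
static unsigned actual_len(uint16_t ee_len)
{
	return ee_len <= INIT_MAX_LEN ? ee_len : ee_len - INIT_MAX_LEN;
}

/* Set the MSB; a zero-length uninitialized extent cannot be represented. */
static uint16_t mark_uninitialized(uint16_t ee_len)
{
	assert(ee_len >= 1 && ee_len <= UNINIT_MAX_LEN);
	return (uint16_t)(ee_len | INIT_MAX_LEN);
}
```

The raw value 0x8000 would otherwise mean "uninitialized, zero length", which is meaningless, so the patch reuses it for an initialized extent of 32768 blocks. That is why the two maximum lengths differ by one.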

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
  2007-07-13 12:47 ` [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc Amit K. Arora
@ 2007-07-13 13:21   ` Christoph Hellwig
  2007-07-13 14:18     ` Amit K. Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2007-07-13 13:21 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, tytso, cmm,
	suparna, adilger, dgc

On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
>  /*
> + * sys_fallocate - preallocate blocks or free preallocated blocks
> + * @fd: the file descriptor
> + * @mode: mode specifies the behavior of allocation.
> + * @offset: The offset within file, from where allocation is being
> + *	    requested. It should not have a negative value.
> + * @len: The amount of space in bytes to be allocated, from the offset.
> + *	 This can not be zero or a negative value.

kerneldoc comments are for in-kernel APIs, which syscalls aren't.  I'd say
just remove this comment; the manpage is much better documentation anyway.

> + * <TBD> Generic fallocate to be added for file systems that do not
> + *	 support fallocate.

Please remove the comment, adding a generic fallback in kernelspace is a
very dumb idea as we already discussed a long time ago.

> --- linux-2.6.22.orig/include/linux/fs.h
> +++ linux-2.6.22/include/linux/fs.h
> @@ -266,6 +266,21 @@ extern int dir_notify_enable;
>  #define SYNC_FILE_RANGE_WRITE		2
>  #define SYNC_FILE_RANGE_WAIT_AFTER	4
>  
> +/*
> + * sys_fallocate modes
> + * Currently sys_fallocate supports two modes:
> + * FALLOC_ALLOCATE :	This is the preallocate mode, using which an application
> + *			may request reservation of space for a particular file.
> + *			The file size will be changed if the allocation is
> + *			beyond EOF.
> + * FALLOC_RESV_SPACE :	This is same as the above mode, with only one difference
> + *			that the file size will not be modified.
> + */
> +#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
> +
> +#define FALLOC_ALLOCATE        0
> +#define FALLOC_RESV_SPACE      FALLOC_FL_KEEP_SIZE

Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
not sure there is any point in having two namespaces now that we have a
flags-based ABI.

Also please don't add this to fs.h.  fs.h is a complete mess and the
falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
be added to headers-y so the ABI constant can be exported to userspace.



* Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
  2007-07-13 12:48 ` [PATCH 3/6][TAKE7] revalidate write permissions for fallocate Amit K. Arora
@ 2007-07-13 13:21   ` Christoph Hellwig
  2007-07-13 14:28     ` Amit K. Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2007-07-13 13:21 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, tytso, cmm,
	suparna, adilger, dgc

On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> From: David P. Quigley <dpquigl@tycho.nsa.gov>
> 
> Revalidate the write permissions for fallocate(2), in case security policy has
> changed since the files were opened.
> 
> Acked-by: James Morris <jmorris@namei.org>
> Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>

This should be merged into the main falloc patch.



* Re: [PATCH 1/6][TAKE7] manpage for fallocate
  2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
@ 2007-07-13 14:06   ` David Chinner
  2007-07-13 14:27     ` Amit K. Arora
  2007-07-14  8:23   ` Michael Kerrisk
  1 sibling, 1 reply; 26+ messages in thread
From: David Chinner @ 2007-07-13 14:06 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, michael.kerrisk,
	tytso, cmm, suparna, adilger, dgc

On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote:
> Following is the modified version of the manpage originally submitted by
> David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> 
> This includes changes suggested by Heikki Orsila and Barry Naujok.

Can we get itemised change logs for all these patches from now on?

> .TH fallocate 2
> .SH NAME
> fallocate \- allocate or remove file space

If fallocate is just being used for allocating space this is wrong.
maybe - "manipulate file space" instead?

> .TP
> .B FALLOC_RESV_SPACE
> provides the same functionality as
> .B FALLOC_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising

"of zeroed blocks"

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
  2007-07-13 13:21   ` Christoph Hellwig
@ 2007-07-13 14:18     ` Amit K. Arora
  2007-07-13 14:46       ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 14:18 UTC (permalink / raw)
  To: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	tytso, cmm, suparna, adilger, dgc

On Fri, Jul 13, 2007 at 02:21:19PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:17:55PM +0530, Amit K. Arora wrote:
> >  /*
> > + * sys_fallocate - preallocate blocks or free preallocated blocks
> > + * @fd: the file descriptor
> > + * @mode: mode specifies the behavior of allocation.
> > + * @offset: The offset within file, from where allocation is being
> > + *	    requested. It should not have a negative value.
> > + * @len: The amount of space in bytes to be allocated, from the offset.
> > + *	 This can not be zero or a negative value.
> 
> kerneldoc comments are for in-kernel APIs, which syscalls aren't.  I'd say
> just remove this comment; the manpage is much better documentation anyway.

Ok. I will remove this entire comment.
 
> > + * <TBD> Generic fallocate to be added for file systems that do not
> > + *	 support fallocate.
> 
> Please remove the comment, adding a generic fallback in kernelspace is a
> very dumb idea as we already discussed a long time ago.
>
> > --- linux-2.6.22.orig/include/linux/fs.h
> > +++ linux-2.6.22/include/linux/fs.h
> > @@ -266,6 +266,21 @@ extern int dir_notify_enable;
> >  #define SYNC_FILE_RANGE_WRITE		2
> >  #define SYNC_FILE_RANGE_WAIT_AFTER	4
> >  
> > +/*
> > + * sys_fallocate modes
> > + * Currently sys_fallocate supports two modes:
> > + * FALLOC_ALLOCATE :	This is the preallocate mode, using which an application
> > + *			may request reservation of space for a particular file.
> > + *			The file size will be changed if the allocation is
> > + *			beyond EOF.
> > + * FALLOC_RESV_SPACE :	This is same as the above mode, with only one difference
> > + *			that the file size will not be modified.
> > + */
> > +#define FALLOC_FL_KEEP_SIZE    0x01 /* default is extend/shrink size */
> > +
> > +#define FALLOC_ALLOCATE        0
> > +#define FALLOC_RESV_SPACE      FALLOC_FL_KEEP_SIZE
> 
> Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
> not sure there is any point in having two namespaces now that we have a
> flags-based ABI.

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
flag and remove the other mode too (FALLOC_RESV_SPACE).
Is this what you are suggesting ?

> Also please don't add this to fs.h.  fs.h is a complete mess and the
> falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
> be added to headers-y so the ABI constant can be exported to userspace.

Do we need a header file just to declare one flag - i.e.
FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
modes) ? If "linux/fs.h" is not a good place, will "asm-generic/fcntl.h"
be a sane place for this flag ?

Thanks!
--
Regards,
Amit Arora


* Re: [PATCH 1/6][TAKE7] manpage for fallocate
  2007-07-13 14:06   ` David Chinner
@ 2007-07-13 14:27     ` Amit K. Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 14:27 UTC (permalink / raw)
  To: David Chinner
  Cc: linux-fsdevel, linux-kernel, linux-ext4, xfs, michael.kerrisk,
	tytso, cmm, suparna, adilger

On Sat, Jul 14, 2007 at 12:06:51AM +1000, David Chinner wrote:
> On Fri, Jul 13, 2007 at 06:16:01PM +0530, Amit K. Arora wrote:
> > Following is the modified version of the manpage originally submitted by
> > David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> > 
> > This includes changes suggested by Heikki Orsila and Barry Naujok.
> 
> Can we get itemised change logs for all these patches from now on?

Sure.
 
> > .TH fallocate 2
> > .SH NAME
> > fallocate \- allocate or remove file space
> 
> If fallocate is just being used for allocating space this is wrong.
> maybe - "manipulate file space" instead?

Yes, it needs to be changed.
 
> > .TP
> > .B FALLOC_RESV_SPACE
> > provides the same functionality as
> > .B FALLOC_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> 
> "of zeroed blocks"

Ok.

--
Regards,
Amit Arora

> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group


* Re: [PATCH 3/6][TAKE7] revalidate write permissions for fallocate
  2007-07-13 13:21   ` Christoph Hellwig
@ 2007-07-13 14:28     ` Amit K. Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-13 14:28 UTC (permalink / raw)
  To: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	tytso, cmm, suparna, adilger, dgc

On Fri, Jul 13, 2007 at 02:21:37PM +0100, Christoph Hellwig wrote:
> On Fri, Jul 13, 2007 at 06:18:47PM +0530, Amit K. Arora wrote:
> > From: David P. Quigley <dpquigl@tycho.nsa.gov>
> > 
> > Revalidate the write permissions for fallocate(2), in case security policy has
> > changed since the files were opened.
> > 
> > Acked-by: James Morris <jmorris@namei.org>
> > Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
> 
> This should be merged into the main falloc patch.

Ok. Will merge it...

--
Regards,
Amit Arora


* Re: [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc
  2007-07-13 14:18     ` Amit K. Arora
@ 2007-07-13 14:46       ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2007-07-13 14:46 UTC (permalink / raw)
  To: Amit K. Arora
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, linux-ext4, xfs,
	tytso, cmm, suparna, adilger, dgc

On Fri, Jul 13, 2007 at 07:48:58PM +0530, Amit K. Arora wrote:
> Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
> to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
> flag and remove the other mode too (FALLOC_RESV_SPACE).
> Is this what you are suggesting ?

Yes.

> Do we need a header file just to declare one flag - i.e.
> FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
> modes) ? If "linux/fs.h" is not a good place, will "asm-generic/fcntl.h"
> be a sane place for this flag ?

It might sound a little silly but it is the cleanest thing we could do by
far.  And I suspect there will be more flags soon..
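
With the two mode names gone, a caller passes either 0 or
FALLOC_FL_KEEP_SIZE.  A rough sketch of how that could look from userspace,
assuming a C library that provides a fallocate() wrapper and a
<linux/falloc.h> that exports the flag (glibc support was still on the todo
list when this thread was written).  reserve_keep_size() is a made-up helper
and the /tmp path is an arbitrary choice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Reserve len bytes in a fresh temporary file without changing its size.
 * Returns 0 and fills *st on success, or a positive errno value.
 */
static inline int reserve_keep_size(off_t len, struct stat *st)
{
	char path[] = "/tmp/falloc-demo-XXXXXX";
	int fd, err = 0;

	assert(len > 0);		/* a zero or negative length is invalid */
	fd = mkstemp(path);
	if (fd < 0)
		return errno;
	unlink(path);			/* the file lives until close() */

	/* Passing mode 0 instead would also extend the size to offset + len. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) != 0 ||
	    fstat(fd, st) != 0)
		err = errno;

	close(fd);
	return err;
}
```

On a filesystem that supports preallocation, st_size should still be 0
after the call while st_blocks is expected to have grown; other filesystems
may return EOPNOTSUPP.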



* Re: [PATCH 1/6][TAKE7] manpage for fallocate
  2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
  2007-07-13 14:06   ` David Chinner
@ 2007-07-14  8:23   ` Michael Kerrisk
  2007-07-16  5:32     ` Amit K. Arora
  1 sibling, 1 reply; 26+ messages in thread
From: Michael Kerrisk @ 2007-07-14  8:23 UTC (permalink / raw)
  To: Amit K. Arora, linux-ext4, linux-kernel, linux-fsdevel
  Cc: mtk-manpages, dgc, adilger, suparna, cmm, tytso, xfs

[CC += mtk-manpages@gmx.net]

Amit,

Thanks for this page.  I will endeavour to review it in 
the coming days.  In the meantime, the better address to CC
me on for man pages stuff is mtk-manpages@gmx.net.

Cheers,

Michael

> Following is the modified version of the manpage originally submitted by
> David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> 
> This includes changes suggested by Heikki Orsila and Barry Naujok.
> 
> 
> .TH fallocate 2
> .SH NAME
> fallocate \- allocate or remove file space
> .SH SYNOPSIS
> .nf
> .B #include <fcntl.h>
> .PP
> .BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t "
> len);
> .SH DESCRIPTION
> The
> .B fallocate
> syscall allows a user to directly manipulate the allocated disk space
> for the file referred to by
> .I fd
> for the byte range starting at
> .I offset
> and continuing for
> .I len
> bytes.
> The
> .I mode
> parameter determines the operation to be performed on the given range.
> Currently there are two modes:
> .TP
> .B FALLOC_ALLOCATE
> allocates and initialises to zero the disk space within the given range.
> After a successful call, subsequent writes are guaranteed not to fail
> because
> of lack of disk space.  If the size of the file is less than
> .IR offset + len ,
> then the file is increased to this size; otherwise the file size is left
> unchanged.
> .B FALLOC_ALLOCATE
> closely resembles
> .BR posix_fallocate (3)
> and is intended as a method of optimally implementing this function.
> .B FALLOC_ALLOCATE
> may allocate a larger range than that was specified.
> .TP
> .B FALLOC_RESV_SPACE
> provides the same functionality as
> .B FALLOC_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising
> append workloads.
> .SH RETURN VALUE
> .B fallocate
> returns zero on success, or an error number on failure.
> Note that
> .I errno
> is not set.
> .SH ERRORS
> .TP
> .B EBADF
> .I fd
> is not a valid file descriptor, or is not opened for writing.
> .TP
> .B EFBIG
> .IR offset + len
> exceeds the maximum file size.
> .TP
> .B EINVAL
> .I offset
> was less than 0, or
> .I len
> was less than or equal to 0.
> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd .
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe of file descriptor.
> .TP
> .B ENOSYS
> The filesystem underlying the file descriptor does not support this
> operation.
> .TP
> .B EINTR
> A signal was caught during execution
> .TP
> .B EIO
> An I/O error occurred while reading from or writing to a file system.
> .TP
> .B EOPNOTSUPP
> The mode is not supported on the file descriptor.
> .SH AVAILABILITY
> The
> .B fallocate
> system call is available since 2.6.XX
> .SH SEE ALSO
> .BR syscall (2),
> .BR posix_fadvise (3),
> .BR ftruncate (3).



* Re: [PATCH 1/6][TAKE7] manpage for fallocate
  2007-07-14  8:23   ` Michael Kerrisk
@ 2007-07-16  5:32     ` Amit K. Arora
  2007-07-23  6:09       ` fallocate() man page Michael Kerrisk
  0 siblings, 1 reply; 26+ messages in thread
From: Amit K. Arora @ 2007-07-16  5:32 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-ext4, linux-kernel, linux-fsdevel, mtk-manpages, dgc,
	adilger, suparna, cmm, tytso, xfs

On Sat, Jul 14, 2007 at 10:23:42AM +0200, Michael Kerrisk wrote:
> [CC += mtk-manpages@gmx.net]
> 
> Amit,
 
Hi Michael,

> Thanks for this page.  I will endeavour to review it in 
> the coming days.  In the meantime, the better address to CC
> me on for man pages stuff is mtk-manpages@gmx.net.

Sure.

BTW, this man page has changed a bit and the one in TAKE8 of fallocate
patches is the latest one. You are copied on that too.
I will forward that mail to "mtk-manpages@gmx.net" id also, so that you
do not miss it. Thanks!

--
Regards,
Amit Arora

> 
> Cheers,
> 
> Michael
> 
> > Following is the modified version of the manpage originally submitted by
> > David Chinner. Please use `nroff -man fallocate.2 | less` to view.
> > 
> > This includes changes suggested by Heikki Orsila and Barry Naujok.
> > 
> > 
> > .TH fallocate 2
> > .SH NAME
> > fallocate \- allocate or remove file space
> > .SH SYNOPSIS
> > .nf
> > .B #include <fcntl.h>
> > .PP
> > .BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t "
> > len);
> > .SH DESCRIPTION
> > The
> > .B fallocate
> > syscall allows a user to directly manipulate the allocated disk space
> > for the file referred to by
> > .I fd
> > for the byte range starting at
> > .I offset
> > and continuing for
> > .I len
> > bytes.
> > The
> > .I mode
> > parameter determines the operation to be performed on the given range.
> > Currently there are two modes:
> > .TP
> > .B FALLOC_ALLOCATE
> > allocates and initialises to zero the disk space within the given range.
> > After a successful call, subsequent writes are guaranteed not to fail
> > because
> > of lack of disk space.  If the size of the file is less than
> > .IR offset + len ,
> > then the file is increased to this size; otherwise the file size is left
> > unchanged.
> > .B FALLOC_ALLOCATE
> > closely resembles
> > .BR posix_fallocate (3)
> > and is intended as a method of optimally implementing this function.
> > .B FALLOC_ALLOCATE
> > may allocate a larger range than that was specified.
> > .TP
> > .B FALLOC_RESV_SPACE
> > provides the same functionality as
> > .B FALLOC_ALLOCATE
> > except it does not ever change the file size. This allows allocation
> > of zero blocks beyond the end of file and is useful for optimising
> > append workloads.
> > .SH RETURN VALUE
> > .B fallocate
> > returns zero on success, or an error number on failure.
> > Note that
> > .I errno
> > is not set.
> > .SH ERRORS
> > .TP
> > .B EBADF
> > .I fd
> > is not a valid file descriptor, or is not opened for writing.
> > .TP
> > .B EFBIG
> > .IR offset + len
> > exceeds the maximum file size.
> > .TP
> > .B EINVAL
> > .I offset
> > was less than 0, or
> > .I len
> > was less than or equal to 0.
> > .TP
> > .B ENODEV
> > .I fd
> > does not refer to a regular file or a directory.
> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd .
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .TP
> > .B ENOSYS
> > The filesystem underlying the file descriptor does not support this
> > operation.
> > .TP
> > .B EINTR
> > A signal was caught during execution
> > .TP
> > .B EIO
> > An I/O error occurred while reading from or writing to a file system.
> > .TP
> > .B EOPNOTSUPP
> > The mode is not supported on the file descriptor.
> > .SH AVAILABILITY
> > The
> > .B fallocate
> > system call is available since 2.6.XX
> > .SH SEE ALSO
> > .BR syscall (2),
> > .BR posix_fadvise (3),
> > .BR ftruncate (3).
> 


* fallocate() man page
  2007-07-16  5:32     ` Amit K. Arora
@ 2007-07-23  6:09       ` Michael Kerrisk
  2007-07-23 13:10         ` Amit K. Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Kerrisk @ 2007-07-23  6:09 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-kernel, linux-fsdevel, dgc

Amit,

I've taken the page that you sent and made various minor formatting and
wording fixes.  I've also added various FIXMEs to the page.  Some of these
("FIXME .") are things that I need to check up later.  Some others are
questions for which I need input from you, David, or someone else with the
relevant info (I've marked these "FIXME Amit:").  Could you please review,
and send a new draft of the page back to me.

Cheers,

Michael


.\" FIXME Amit: I need author and license information for this page.
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature text macros are  required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initializes to zero the disk space within the given range.
.\" FIXME Amit: The next two sentences seem to contradict
.\" each other somewhat.  On the one hand, later writes
.\" are guaranteed not to fail for lack of space; on the other
.\" hand, the file size is not changed even if it is currently
.\" smaller than offset+len bytes.
.\" Could you explain this a little further.  (E.g., how does
.\" the kernel guarantee space without changing the size
.\" of the file?)
After a successful call,
subsequent writes are guaranteed not to fail because
of lack of disk space.
Even if the size of the file is less than
.IR offset + len ,
the file size is not changed.
This allows allocation of zeroed blocks beyond
the end of file and is useful for optimizing append workloads.
.\" FIXME Amit: Which other flags are likely to appear
.\" for mode, and in which kernel version are they likely?
.PP
If the
.B FALLOC_FL_KEEP_SIZE
flag is not specified in
.IR mode ,
the default behavior is almost the same as when this flag is specified.
The only difference is that on success,
the file size will be changed if the
.IR offset + len
is greater than the file size.
This default behavior closely resembles the behavior of the
.BR posix_fallocate (3)
library function,
and is intended as a method of optimally implementing that function.
.\" FIXME Amit: is it worth adding a few words to the following
.\" sentence to say why fallocate() may allocate a larger range
.\" than specified?
.PP
.BR fallocate ()
may allocate a larger range than was specified.
.SH RETURN VALUE
.BR fallocate ()
returns zero on success, or an error number on failure.
Note that
.\" FIXME . the library wrapper function will do the right
.\" thing, returning -1 on error and setting errno.
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe.
.\" FIXME Amit: ENODEV says "fd is not a file or a directory";
.\" ESPIPE says (I had to fix the text a little) "refers to a pipe".
.\" This doesn't make sense: if fd is a pipe, then either one
.\" of these errors could occur.  Which is it supposed to be?
.TP
.B ENOSYS
The filesystem containing the file referred to by
.I fd
does not support this operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
.\" FIXME Amit: can you say a little more about the following error
The
.I mode
is not supported on the file descriptor.
.SH VERSIONS
.BR fallocate ()
.\" FIXME . To confirm that this syscall does actually get released
.\" with 2.6.23.
is available on Linux since kernel 2.6.23.
.SH CONFORMING
.BR fallocate ()
is Linux specific.
.SH SEE ALSO
.BR ftruncate (2),
.BR posix_fallocate (3),
.BR posix_fadvise (3)
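
The file-size rule in the DESCRIPTION above reduces to a few lines.  A
minimal sketch, assuming the flag value 0x01 from the patch under review;
size_after_fallocate() is a hypothetical helper for illustration, not a
kernel or libc function.

```c
#include <assert.h>
#include <stdint.h>

#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01	/* value taken from the patch */
#endif

/* File size after a successful call, following the draft text. */
static inline int64_t size_after_fallocate(int64_t old_size, int mode,
					   int64_t offset, int64_t len)
{
	int64_t end;

	assert(offset >= 0 && len > 0);	/* otherwise the call gives EINVAL */
	end = offset + len;

	if (mode & FALLOC_FL_KEEP_SIZE)
		return old_size;	/* space reserved, size untouched */
	return end > old_size ? end : old_size;
}
```

For a 100-byte file and a request covering bytes 0..4095, the default mode
yields a size of 4096, while FALLOC_FL_KEEP_SIZE leaves it at 100.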




* Re: fallocate() man page
  2007-07-23  6:09       ` fallocate() man page Michael Kerrisk
@ 2007-07-23 13:10         ` Amit K. Arora
  2007-07-24  7:06           ` David Chinner
                             ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-23 13:10 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: linux-kernel, linux-fsdevel, dgc

Hi Michael,

On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
> Amit,
> 
> I've taken the page that you sent and made various minor formatting and
> wording fixes.  I've also added various FIXMEs to the page.  Some of these
> ("FIXME .") are things that I need to check up later.  Some others are
> questions for which I need input from you, David, or someone else with the
> relevant info (I've marked these "FIXME Amit:").  Could you please review,
> and send a new draft of the page back to me.

Thanks for going through the manpage and improving it!

My comments are below in between <Amit> ... </Amit> tags.

Thanks!
--
Regards,
Amit Arora



.\" FIXME Amit: I need author and license information for this page.
.\" <Amit>
.\"    David Chinner is the original author, hence he can help with this.
.\" </Amit>
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature text macros are  required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initializes to zero the disk space within the given range.
.\" FIXME Amit: The next two sentences seem to contradict
.\" each other somewhat.  On the one hand, later writes
.\" are guaranteed not to fail for lack of space; on the other
.\" hand, the file size is not changed even if it is currently
.\" smaller than offset+len bytes.
.\" Could you explain this a little further.  (E.g., how does
.\" the kernel guarantee space without changing the size
.\" of the file?)
.\" <Amit>
.\"     Well, this is a feature where you can allocate/reserve space for
.\" a file without changing the file size. This is done by allocating blocks
.\" to the file, but still not changing the size. As mentioned below, this
.\" helps applications that use append mode a lot. These can open
.\" a file in append mode and start writing to "preallocated" space.
.\" So, if someone does a stat on a file after fallocate() with this mode (where
.\" file size is not changed), he/she will see that the st_blocks
.\" increased, but st_size did not change.
.\" </Amit>
After a successful call,
subsequent writes are guaranteed not to fail because
of lack of disk space.
Even if the size of the file is less than
.IR offset + len ,
the file size is not changed.
This allows allocation of zeroed blocks beyond
the end of file and is useful for optimizing append workloads.
.\" FIXME Amit: Which other flags are likely to appear
.\" for mode, and in which kernel version are they likely?
.\" <Amit>
.\"    There were a few more flags which were discussed, but none of
.\" them has been finalized. Here are these flags:
.\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME, FA_FL_NO_CTIME
.\" All of the above flags were debated and we can not say which, if any,
.\" of these flags will make it into later kernels.
.\" </Amit>
.PP
If the
.B FALLOC_FL_KEEP_SIZE
flag is not specified in
.IR mode ,
the default behavior is almost the same as when this flag is specified.
The only difference is that on success,
the file size will be changed if the
.IR offset + len
is greater than the file size.
This default behavior closely resembles the behavior of the
.BR posix_fallocate (3)
library function,
and is intended as a method of optimally implementing that function.
.\" FIXME Amit: is it worth adding a few words to the following
.\" sentence to say why fallocate() may allocate a larger range
.\" than specified?
.\" <Amit>
.\"     The preallocation is done in block size chunks. Thus, if the last
.\" few bytes in the range fall in a new block, this entire block gets
.\" allocated to the file. Hence we may have a slightly larger range allocated.
.\" I have tried to add one line to explain this below. Please see if it
.\" makes sense and is understandable. Thanks!
.\" </Amit>
.PP
.BR fallocate ()
may allocate a larger range than was specified.
.\" <Amit>
.\" This is because allocation is done in block size chunks and hence
.\" the allocation will automatically get block aligned.
.\" </Amit>
.SH RETURN VALUE
.BR fallocate ()
returns zero on success, or an error number on failure.
Note that
.\" FIXME . the library wrapper function will do the right
.\" thing, returning -1 on error and setting errno.
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe.
.\" FIXME Amit: ENODEV says "fd is not a file or a directory";
.\" ESPIPE says (I had to fix the text a little) "refers to a pipe".
.\" This doesn't make sense: if fd is a pipe, then either one
.\" of these errors could occur.  Which is it supposed to be?
.\" <Amit>
.\"    This is in line with the posix_fallocate manpage. If it is a pipe,
.\" the user will get ESPIPE.
.\" </Amit>
.TP
.B ENOSYS
The filesystem containing the file referred to by
.I fd
does not support this operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
.\" FIXME Amit: can you say a little more about the following error
.\" <Amit>
.\" How does the following sound?
.\" 'The specified mode is not supported on the object by the file system.'
.\" </Amit>
The
.I mode
is not supported on the file descriptor.
.SH VERSIONS
.BR fallocate ()
.\" FIXME . To confirm that this syscall does actually get released
.\" with 2.6.23.
is available on Linux since kernel 2.6.23.
.SH CONFORMING
.BR fallocate ()
is Linux specific.
.SH SEE ALSO
.BR ftruncate (2),
.BR posix_fallocate (3),
.BR posix_fadvise (3)
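
The block-size rounding described in the comments above can be sketched as
plain arithmetic.  This is an illustration only: block_align() and struct
byte_range are made-up names, the block size is passed in rather than read
from a filesystem, and a real filesystem is free to round differently.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Expand [offset, offset + len) outwards to whole blocks of blksz bytes. */
static inline struct byte_range block_align(uint64_t offset, uint64_t len,
					    uint64_t blksz)
{
	struct byte_range r;

	assert(blksz != 0 && len != 0);
	r.start = offset - offset % blksz;			/* round down */
	r.end = (offset + len + blksz - 1) / blksz * blksz;	/* round up */
	return r;
}
```

With 4096-byte blocks, a request for offset 1000 and length 5000 covers
bytes 1000..5999 and so touches blocks 0 and 1; the sketch reports the
range 0..8192, which is larger than what was asked for.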



* Re: fallocate() man page
  2007-07-23 13:10         ` Amit K. Arora
@ 2007-07-24  7:06           ` David Chinner
  2007-07-30  6:21             ` Michael Kerrisk
  2007-07-30 19:43           ` Michael Kerrisk
  2007-07-30 19:44           ` fallocate() man page - darft 2 Michael Kerrisk
  2 siblings, 1 reply; 26+ messages in thread
From: David Chinner @ 2007-07-24  7:06 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: Michael Kerrisk, linux-kernel, linux-fsdevel, dgc

On Mon, Jul 23, 2007 at 06:40:39PM +0530, Amit K. Arora wrote:
> On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
> > I've taken the page that you sent and made various minor formatting and
> > wording fixes.  I've also added various FIXMEs to the page.  Some of these
> > ("FIXME .") are things that I need to check up later.  Some others are
> > questions for which I need input from you, David, or someone else with the
> > relevant info (I've marked these "FIXME Amit:").  Could you please review,
> > and send a new draft of the page back to me.
> 
> Thanks for going through the manpage and improving it!
> 
> My comments are below in between <Amit> ... </Amit> tags.

Does this Q&A really need to be encoded in nroff comments? ;)

> .\" FIXME Amit: I need author and license information for this page.
> .\" <Amit>
> .\"    David Chinner is the original author, hence he can help with this.
> .\" </Amit>

Patch below.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

diff -u orig/fallocate.3 new/fallocate.3
--- orig/fallocate.3	Tue Jul 24 17:00:42 2007
+++ new/fallocate.3	Tue Jul 24 17:02:44 2007
@@ -1,7 +1,6 @@
-.\" FIXME Amit: I need author and license information for this page.
-.\" <Amit>
-.\"    David Chinner is the original author, hence he can help with this.
-.\" </Amit>
+.\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
+.\" Written by Dave Chinner <dgc@sgi.com>
+.\" May be distributed as per GPLv2
 .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
 fallocate \- manipulate file space


* Re: fallocate() man page
  2007-07-24  7:06           ` David Chinner
@ 2007-07-30  6:21             ` Michael Kerrisk
  0 siblings, 0 replies; 26+ messages in thread
From: Michael Kerrisk @ 2007-07-30  6:21 UTC (permalink / raw)
  To: David Chinner; +Cc: Amit K. Arora, linux-kernel, linux-fsdevel



David Chinner wrote:
> On Mon, Jul 23, 2007 at 06:40:39PM +0530, Amit K. Arora wrote:
>> On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
>>> I've taken the page that you sent and made various minor formatting and
>>> wording fixes.  I've also added various FIXMEs to the page.  Some of these
>>> ("FIXME .") are things that I need to check up later.  Some others are
>>> questions for which I need input from you, David, or someone else with the
>>> relevant info (I've marked these "FIXME Amit:").  Could you please review,
>>> and send a new draft of the page back to me.
>> Thanks for going through the manpage and improving it!
>>
>> My comments are below in between <Amit> ... </Amit> tags.
> 
> Does this Q&A really need to be encoded in nroff comments? ;)
> 
>> .\" FIXME Amit: I need author and license information for this page.
>> .\" <Amit>
>> .\"    David Chinner is the original author, hence he can help with this.
>> .\" </Amit>
> 
> Patch below.

Thanks David.  Applied, but I wrote "GNU General Public License version 2".

Cheers,

Michael



* Re: fallocate() man page
  2007-07-23 13:10         ` Amit K. Arora
  2007-07-24  7:06           ` David Chinner
@ 2007-07-30 19:43           ` Michael Kerrisk
  2007-07-31 13:56             ` Amit K. Arora
  2007-07-30 19:44           ` fallocate() man page - darft 2 Michael Kerrisk
  2 siblings, 1 reply; 26+ messages in thread
From: Michael Kerrisk @ 2007-07-30 19:43 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: linux-kernel, linux-fsdevel, dgc

Hello Amit.

> On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
>> Amit,
>>
>> I've taken the page that you sent and made various minor formatting and
>> wording fixes.  I've also added various FIXMEs to the page.  Some of these
>> ("FIXME .") are things that I need to check up later.  Some others are
>> questions for which I need input from you, David, or someone else with the
>> relevant info (I've marked these "FIXME Amit:").  Could you please review,
>> and send a new draft of the page back to me.
> 
> Thanks for going through the manpage and improving it!
> 
> My comments are below in between <Amit> ... </Amit> tags.
> 
> Thanks!
[...]

> The
> .I mode
> argument determines the operation to be performed on the given range.
> Currently only one flag is supported for
> .IR mode :
> .TP
> .B FALLOC_FL_KEEP_SIZE
> allocates and initializes to zero the disk space within the given range.
> .\" FIXME Amit: The next two sentences seem to contradict
> .\" each other somewhat.  On the one hand, later writes
> .\" are guaranteed not to fail for lack of space; on the other
> .\" hand, the file size is not changed even if it is currently
> .\" smaller than offset+len bytes.
> .\" Could you explain this a little further.  (E.g., how does
> .\" the kernel guarantee space without changing the size
> .\" of the file?)
> .\" <Amit>
> .\"     Well, this is a feature where you can allocate/reserve space for
> .\" a file without changing the file size. This is done by allocating blocks
> .\" to the file, but still not changing the size. As mentioned below, this
> .\" helps applications that use append mode a lot. These can open
> .\" a file in append mode and start writing to "preallocated" space.
> .\" So, if someone does a stat on a file after fallocate() with this mode (where
> .\" file size is not changed), he/she will see that the st_blocks
> .\" increased, but st_size did not change.
> .\" </Amit>

Okay -- I tried rewording the text here a little to make this clearer.  Can
you review the new version to see that it's okay.

[...]

> .\" FIXME Amit: Which other flags are likely to appear
> .\" for mode, and in which kernel version are they likely?
> .\" <Amit>
> .\"    There were few more flags which were discussed, but none of
> .\" them have been finalized upon. Here are these flags:
> .\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME, FA_FL_NO_CTIME
> .\" All of the above flags were debated upon and we can not say if any/which one
> .\" of these flags will make it to the later kernels.
> .\" </Amit>

Thanks for the info.

[...]

> .\" FIXME Amit: is it worth adding a few words to the following
> .\" sentence to say why fallocate() may allocate a larger range
> .\" than specified?
> .\" <Amit>
> .\"     The preallocation is done in block size chunks. Thus, if the last
> .\" few bytes in the range falls in a new block, this entire block gets
> .\" allocated to the file. Hence we may have slightly larger range allocated.
> .\" I have tried to add one line to explain this below. Please see if it
> .\" makes sense and is understandable. Thanks!
> .\" </Amit>

Thanks.

> .PP
> .BR fallocate ()
> may allocate a larger range than that was specified.
> .\" <Amit>
> .\" This is because allocation is done in block size chunks and hence
> .\" the allocation will automatically get block aligned.
> .\" </Amit>

I made the sentence:

    Because allocation is done in block size chunks, fallocate()
    may allocate a larger range than that which was specified.

okay?
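
To make the block-size rounding concrete, here is a minimal C sketch. It assumes the file system rounds the start of the range down and the end of the range up to a block boundary; the struct and function names are hypothetical, and a real file system may use a different allocation unit (extents, clusters) or round differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not kernel code: the block-aligned byte range
 * [start, end) that covers the requested range [offset, offset + len). */
struct byte_range {
    int64_t start; /* first byte of the first block touched */
    int64_t end;   /* one past the last byte of the last block touched */
};

static struct byte_range block_aligned_range(int64_t offset, int64_t len,
                                             int64_t block_size)
{
    struct byte_range r;

    /* Same preconditions as the EINVAL description in the draft page. */
    assert(block_size > 0 && offset >= 0 && len > 0);

    /* Round the start down to a block boundary. */
    r.start = (offset / block_size) * block_size;
    /* Round the end up to a block boundary. */
    r.end = ((offset + len + block_size - 1) / block_size) * block_size;
    return r;
}
```

With a block size of 4096, a request for 100 bytes at offset 1000 would cover the whole first block under these assumptions, so more than the requested 100 bytes end up allocated.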

[...]

> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd .
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe of file descriptor.
> .\" FIXME Amit: ENODEV says "fd is not a file or a directory";
> .\" ESPIPE says (I had to fix the text a little) "refers to a pipe".
> .\" This doesn't make sense: if fd is a pipe, then either one
> .\" of these errors could occur.  Which is it supposed to be?
> .\" <Amit>
> .\"    This is inline with posix_fallocate manpage. If it is a pipe,
> .\" user will get ESPIPE.
> .\" </Amit>

Okay -- thanks.  I reworded the text for the ENODEV error to make this
clearer.  (Please check the wording in the next draft.)

By the way in fs/open.c I see the comment:

        /*
         * Let individual file system decide if it supports preallocation
         * for directories or not.
         */
        if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
                goto out_fput;

But that comment doesn't seem to accord with the line of code immediately
below it (the S_ISDIR() check is done regardless of file system type).  Do I
misunderstand something -- or is the comment wrong?

[...]

> .TP
> .B EOPNOTSUPP
> .\" FIXME Amit: can you say a little more about the following error
> .\" <Amit>
> .\" How does following sound ?
> .\" 'The specified mode is not supported on the object by the file system.'
> .\" </Amit>

I made it:

    The mode is not supported by the file system containing the file
    referred to by fd.

Okay?

[...]

New version of the page on its way soon.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.


* fallocate() man page - darft 2
  2007-07-23 13:10         ` Amit K. Arora
  2007-07-24  7:06           ` David Chinner
  2007-07-30 19:43           ` Michael Kerrisk
@ 2007-07-30 19:44           ` Michael Kerrisk
  2007-08-02 17:36             ` Amit K. Arora
  2 siblings, 1 reply; 26+ messages in thread
From: Michael Kerrisk @ 2007-07-30 19:44 UTC (permalink / raw)
  To: Amit K. Arora, dgc; +Cc: linux-kernel, linux-fsdevel

Amit, David,

I've edited the previous version of the page, adding David's license, and
integrating Amit's comments.  I've also added a few new FIXMES.  ("FIXME
Amit" again.)

Could you please review the changes, and the FIXMEs.

Cheers,

Michael



.\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
.\" Written by Dave Chinner <dgc@sgi.com>
.\" May be distributed as per GNU General Public License version 2.
.\"
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature test macros are required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
.\" FIXME Amit: in other words the affected byte range
.\" is the bytes from (offset) to (offset + len - 1), right?

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
This flag allocates and initializes to zero the disk space
within the range specified by
.I offset
and
.IR len .
After a successful call, subsequent writes into this range
are guaranteed not to fail because of lack of disk space.
Preallocating zeroed blocks beyond the end of the file
is useful for optimizing append workloads.
Preallocating blocks does not change
the file size (as reported by
.BR stat (2))
even if it is less than
.\" FIXME Amit: "offset + len" is written here.  But should it be
.\" "offset + len - 1" ?
.IR offset + len .
.\"
.\" Note from Amit Arora:
.\" There were few more flags which were discussed, but none of
.\" them have been finalized upon. Here are these flags:
.\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME,
.\" FA_FL_NO_CTIME
.\" All of the above flags were debated upon and we can not say
.\" if any/which one of these flags will make it to the later kernels.
.PP
If the
.B FALLOC_FL_KEEP_SIZE
flag is not specified in
.IR mode ,
the default behavior is almost the same as when this flag is specified.
The only difference is that on success,
the file size will be changed if
.\" FIXME Amit: "offset + len" is written here.  But should it be
.\" "offset + len - 1" ?
.IR offset + len
is greater than the file size.
This default behavior closely resembles the behavior of the
.BR posix_fallocate (3)
library function,
and is intended as a method of optimally implementing that function.
.PP
Because allocation is done in block size chunks,
.BR fallocate ()
may allocate a larger range than that which was specified.
.SH RETURN VALUE
.BR fallocate ()
returns zero on success, or an error number on failure.
Note that
.\" FIXME . the library wrapper function will do the right
.\" thing, returning -1 on error and setting errno.
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
(If
.I fd
is a pipe or FIFO, a different error results.)
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd .
.TP
.B ESPIPE
.I fd
refers to a pipe or FIFO.
.TP
.B ENOSYS
The file system containing the file system referred to by
.I fd
does not support this operation.
.TP
.B EINTR
A signal was caught during execution.
.TP
.B EIO
An I/O error occurred while reading from or writing to a file system.
.TP
.B EOPNOTSUPP
The
.I mode
is not supported by the file system containing the file referred to by
.IR fd .
.SH VERSIONS
.BR fallocate ()
.\" FIXME . To confirm that this syscall does actually get released
.\" with 2.6.23.
is available on Linux since kernel 2.6.23.
.SH CONFORMING
.BR fallocate ()
is Linux specific.
.SH SEE ALSO
.BR ftruncate (2),
.BR posix_fallocate (3),
.BR posix_fadvise (3)
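
The FALLOC_FL_KEEP_SIZE behavior described in this draft can be observed from user space with a short C sketch. Note the assumptions: it uses the fallocate() wrapper that later glibc versions provide through <fcntl.h> (with _GNU_SOURCE), which returns -1 and sets errno, unlike the raw system call documented above, which returns the error number. The helper name and the temporary file path are hypothetical, and the call may fail with EOPNOTSUPP or ENOSYS on file systems or kernels without support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for glibc's fallocate() declaration */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: preallocate len bytes at offset 0 of a fresh
 * temporary file with FALLOC_FL_KEEP_SIZE and report what fstat() sees.
 * Returns 0 on success or an errno value on failure. */
static int keep_size_probe(off_t len, off_t *size_after,
                           blkcnt_t *blocks_after)
{
    char path[] = "/tmp/falloc-probe-XXXXXX";
    struct stat st;
    int err = 0;
    int fd;

    assert(size_after != NULL && blocks_after != NULL);

    fd = mkstemp(path);
    if (fd < 0)
        return errno;
    unlink(path); /* the file goes away once fd is closed */

    /* Assumption: glibc wrapper semantics (-1 and errno on failure). */
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) != 0) {
        err = errno;
    } else if (fstat(fd, &st) != 0) {
        err = errno;
    } else {
        *size_after = st.st_size;     /* expected to stay at 0 */
        *blocks_after = st.st_blocks; /* expected to reflect the allocation */
    }
    close(fd);
    return err;
}
```

On a file system that supports the call, st_size should be unchanged after the call while st_blocks reflects the preallocated space, which is the st_size versus st_blocks distinction Amit described earlier in the thread.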


* Re: fallocate() man page
  2007-07-30 19:43           ` Michael Kerrisk
@ 2007-07-31 13:56             ` Amit K. Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-07-31 13:56 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: linux-kernel, linux-fsdevel, dgc

Hi Michael,

On Mon, Jul 30, 2007 at 09:43:08PM +0200, Michael Kerrisk wrote:
> Hello Amit.
> 
> > On Mon, Jul 23, 2007 at 08:09:45AM +0200, Michael Kerrisk wrote:
> >> Amit,
> >>
> >> I've taken the page that you sent and made various minor formatting and
> >> wording fixes.  I've also added various FIXMEs to the page.  Some of these
> >> ("FIXME .") are things that I need to check up later.  Some others are
> >> questions for which I need input from you, David, or someone else with the
> >> relevant info (I've marked these "FIXME Amit:").  Could you please review,
> >> and send a new draft of the page back to me.
> > 
> > Thanks for going through the manpage and improving it!
> > 
> > My comments are below in between <Amit> ... </Amit> tags.
> > 
> > Thanks!
> [...]
> 
> > The
> > .I mode
> > argument determines the operation to be performed on the given range.
> > Currently only one flag is supported for
> > .IR mode :
> > .TP
> > .B FALLOC_FL_KEEP_SIZE
> > allocates and initializes to zero the disk space within the given range.
> > .\" FIXME Amit: The next two sentences seem to contradict
> > .\" each other somewhat.  On the one hand, later writes
> > .\" are guaranteed not to fail for lack of space; on the other
> > .\" hand, the file size is not changed even if it is currently
> > .\" smaller than offset+len bytes.
> > .\" Could you explain this a little further.  (E.g., how does
> > .\" the kernel guarantee space without changing the size
> > .\" of the file?)
> > .\" <Amit>
> > .\"     Well, this is a feature where you can allocate/reserve space for
> > .\" a file without changing the file size. This is done by allocating blocks
> > .\" to the file, but still not changing the size. As mentioned below, this
> > .\" helps applications that use append mode a lot. These can open
> > .\" a file in append mode and start writing to "preallocated" space.
> > .\" So, if someone does a stat on a file after fallocate() with this mode (where
> > .\" file size is not changed), he/she will see that the st_blocks
> > .\" increased, but st_size did not change.
> > .\" </Amit>
> 
> Okay -- I tried rewording the text here a little to make this clearer.  Can
> you review the new version to see that it's okay.
> 
> [...]
<Amit>
Ok. Will review the draft version soon and will get back to you.
</Amit>
 
> > .\" FIXME Amit: Which other flags are likely to appear
> > .\" for mode, and in which kernel version are they likely?
> > .\" <Amit>
> > .\"    There were few more flags which were discussed, but none of
> > .\" them have been finalized upon. Here are these flags:
> > .\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME, FA_FL_NO_CTIME
> > .\" All of the above flags were debated upon and we can not say if any/which one
> > .\" of these flags will make it to the later kernels.
> > .\" </Amit>
> 
> Thanks for the info.
> 
> [...]
> 
> > .\" FIXME Amit: is it worth adding a few words to the following
> > .\" sentence to say why fallocate() may allocate a larger range
> > .\" than specified?
> > .\" <Amit>
> > .\"     The preallocation is done in block size chunks. Thus, if the last
> > .\" few bytes in the range falls in a new block, this entire block gets
> > .\" allocated to the file. Hence we may have slightly larger range allocated.
> > .\" I have tried to add one line to explain this below. Please see if it
> > .\" makes sense and is understandable. Thanks!
> > .\" </Amit>
> 
> Thanks.
> 
> > .PP
> > .BR fallocate ()
> > may allocate a larger range than that was specified.
> > .\" <Amit>
> > .\" This is because allocation is done in block size chunks and hence
> > .\" the allocation will automatically get block aligned.
> > .\" </Amit>
> 
> I made the sentence:
> 
>     Because allocation is done in block size chunks, fallocate()
>     may allocate a larger range than that which was specified.
> 
> okay?
> 
> [...]

<Amit>
Ok.
</Amit>
 
> > .TP
> > .B ENODEV
> > .I fd
> > does not refer to a regular file or a directory.
> > .TP
> > .B ENOSPC
> > There is not enough space left on the device containing the file
> > referred to by
> > .IR fd .
> > .TP
> > .B ESPIPE
> > .I fd
> > refers to a pipe of file descriptor.
> > .\" FIXME Amit: ENODEV says "fd is not a file or a directory";
> > .\" ESPIPE says (I had to fix the text a little) "refers to a pipe".
> > .\" This doesn't make sense: if fd is a pipe, then either one
> > .\" of these errors could occur.  Which is it supposed to be?
> > .\" <Amit>
> > .\"    This is inline with posix_fallocate manpage. If it is a pipe,
> > .\" user will get ESPIPE.
> > .\" </Amit>
> 
> Okay -- thanks.  I reworded the text for the ENODEV error to make this
> clearer.  (Please check the wording in the next draft.)

<Amit>
Sure.
</Amit>
 
> By the way in fs/open.c I see the comment:
> 
>         /*
>          * Let individual file system decide if it supports preallocation
>          * for directories or not.
>          */
>         if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
>                 goto out_fput;
> 
> But that comment doesn't seem to accord with the line of code immediately
> below it (the S_ISDIR() check is done regardless of file system type).  Do I
> misunderstand something -- or is the comment wrong?
> 
> [...]

<Amit>
I think it is correct. We are failing ("goto out_fput;") _only_ if it is
not a regular file AND also not a directory. In the case when the
concerned object is a directory, the above "if" condition won't be true
and thus the "goto" won't get called. Hence, the individual file
system's ->fallocate() inode op will be called, which will decide if it
wants to support directories or not.
</Amit>
 
> > .TP
> > .B EOPNOTSUPP
> > .\" FIXME Amit: can you say a little more about the following error
> > .\" <Amit>
> > .\" How does following sound ?
> > .\" 'The specified mode is not supported on the object by the file system.'
> > .\" </Amit>
> 
> I made it:
> 
>     The mode is not supported by the file system containing the file
>     referred to by fd.
> 
> Okay?
> 
> [...]

<Amit>
Ok.
</Amit>
 
> New version of the page on its way soon.

<Amit>
I have received it. Will review it soon (maybe by tomorrow) and get
back. Thanks!
</Amit>

--
Regards,
Amit Arora
 
> Cheers,
> 
> Michael


* Re: fallocate() man page - darft 2
  2007-07-30 19:44           ` fallocate() man page - darft 2 Michael Kerrisk
@ 2007-08-02 17:36             ` Amit K. Arora
  2007-08-03 11:59               ` Michael Kerrisk
  0 siblings, 1 reply; 26+ messages in thread
From: Amit K. Arora @ 2007-08-02 17:36 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: dgc, linux-kernel, linux-fsdevel

Hi Michael,

On Mon, Jul 30, 2007 at 09:44:10PM +0200, Michael Kerrisk wrote:
> Amit, David,
> 
> I've edited the previous version of the page, adding David's license, and
> integrating Amit's comments.  I've also added a few new FIXMES.  ("FIXME
> Amit" again.)

Ok, Thanks!
 
> Could you please review the changes, and the FIXMEs.

Please find my comments below..
 
> Cheers,
> 
> Michael

--
Regards,
Amit Arora
 
> .\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
> .\" Written by Dave Chinner <dgc@sgi.com>
> .\" May be distributed as per GNU General Public License version 2.
> .\"
> .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
> .SH NAME
> fallocate \- manipulate file space
> .SH SYNOPSIS
> .nf
> .\" FIXME . eventually this #include will probably be something
> .\" different when support is added in glibc.
> .B #include <linux/falloc.h>
> .PP
> .BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
> ", loff_t " len ");
> .\" FIXME . check later what feature test macros are required in
> .\" glibc
> .SH DESCRIPTION
> .BR fallocate ()
> allows the caller to directly manipulate the allocated disk space
> for the file referred to by
> .I fd
> for the byte range starting at
> .I offset
> and continuing for
> .I len
> bytes.
> .\" FIXME Amit: in other words the affected byte range
> .\" is the bytes from (offset) to (offset + len - 1), right?

<Amit>
Yes, you are right.
</Amit> 

> The
> .I mode
> argument determines the operation to be performed on the given range.
> Currently only one flag is supported for
> .IR mode :
> .TP
> .B FALLOC_FL_KEEP_SIZE
> This flag allocates and initializes to zero the disk space
> within the range specified by
> .I offset
> and
> .IR len .
> After a successful call, subsequent writes into this range
> are guaranteed not to fail because of lack of disk space.
> Preallocating zeroed blocks beyond the end of the file
> is useful for optimizing append workloads.
> Preallocating blocks does not change
> the file size (as reported by
> .BR stat (2))
> even if it is less than
> .\" FIXME Amit: "offset + len" is written here.  But should it be
> .\" "offset + len - 1" ?

<Amit>
Good point. This text was directly taken from the man page of
posix_fallocate and is also there on the posix specifications at:
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html

The current posix_fallocate() implementation and also the fallocate()
implementation in ext4 are based on above documentation, wherein EOF is
compared with "offset + len" and not with "offset + len - 1".

I am not sure if this is right or wrong. But, this is as per posix
specifications. ;)
</Amit>

> .IR offset + len .
> .\"
> .\" Note from Amit Arora:
> .\" There were few more flags which were discussed, but none of
> .\" them have been finalized upon. Here are these flags:
> .\" FA_FL_DEALLOC, FA_FL_DEL_DATA, FA_FL_ERR_FREE, FA_FL_NO_MTIME,
> .\" FA_FL_NO_CTIME
> .\" All of the above flags were debated upon and we can not say
> .\" if any/which one of these flags will make it to the later kernels.
> .PP
> If the
> .B FALLOC_FL_KEEP_SIZE
> flag is not specified in
> .IR mode ,
> the default behavior is almost the same as when this flag is specified.
> The only difference is that on success,
> the file size will be changed if
> .\" FIXME Amit: "offset + len" is written here.  But should it be
> .\" "offset + len - 1" ?

<Amit>
Please see my previous comment.
</Amit>

> .IR offset + len
> is greater than the file size.
> This default behavior closely resembles the behavior of the
> .BR posix_fallocate (3)
> library function,
> and is intended as a method of optimally implementing that function.
> .PP
> Because allocation is done in block size chunks,
> .BR fallocate ()
> may allocate a larger range than that which was specified.
> .SH RETURN VALUE
> .BR fallocate ()
> returns zero on success, or an error number on failure.
> Note that
> .\" FIXME . the library wrapper function will do the right
> .\" thing, returning -1 on error and setting errno.
> .I errno
> is not set.
> .SH ERRORS
> .TP
> .B EBADF
> .I fd
> is not a valid file descriptor, or is not opened for writing.
> .TP
> .B EFBIG
> .IR offset + len
> exceeds the maximum file size.
> .TP
> .B EINVAL
> .I offset
> was less than 0, or
> .I len
> was less than or equal to 0.
> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> (If
> .I fd
> is a pipe or FIFO, a different error results.)
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd .
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe or FIFO.
> .TP
> .B ENOSYS
> The file system containing the file system referred to by

<Amit>
There is a typo above. We have "file system" repeated twice in above
sentence. Second one should be "file".
</Amit>

> .I fd
> does not support this operation.
> .TP
> .B EINTR
> A signal was caught during execution.
> .TP
> .B EIO
> An I/O error occurred while reading from or writing to a file system.
> .TP
> .B EOPNOTSUPP
> The
> .I mode
> is not supported by the file system containing the file referred to by
> .IR fd .
> .SH VERSIONS
> .BR fallocate ()
> .\" FIXME . To confirm that this syscall does actually get released
> .\" with 2.6.23.
> is available on Linux since kernel 2.6.23.
> .SH CONFORMING
> .BR fallocate ()
> is Linux specific.
> .SH SEE ALSO
> .BR ftruncate (2),
> .BR posix_fallocate (3),
> .BR posix_fadvise (3)


* Re: fallocate() man page - darft 2
  2007-08-02 17:36             ` Amit K. Arora
@ 2007-08-03 11:59               ` Michael Kerrisk
  2007-08-06  6:10                 ` Amit K. Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Michael Kerrisk @ 2007-08-03 11:59 UTC (permalink / raw)
  To: Amit K. Arora; +Cc: dgc, linux-kernel, linux-fsdevel

Hi Amit,

>> Could you please review the changes, and the FIXMEs.
> 
> Please find my comments below..

Thanks.

[...]

>> .SH DESCRIPTION
>> .BR fallocate ()
>> allows the caller to directly manipulate the allocated disk space
>> for the file referred to by
>> .I fd
>> for the byte range starting at
>> .I offset
>> and continuing for
>> .I len
>> bytes.
>> .\" FIXME Amit: in other words the affected byte range
>> .\" is the bytes from (offset) to (offset + len - 1), right?
> 
> <Amit>
> Yes, you are right.
> </Amit> 

[...]

>> Preallocating blocks does not change
>> the file size (as reported by
>> .BR stat (2))
>> even if it is less than
>> .\" FIXME Amit: "offset + len" is written here.  But should it be
>> .\" "offset + len - 1" ?
> 
> <Amit>
> Good point. This text was directly taken from the man page of
> posix_fallocate and is also there on the posix specifications at:
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html
> 
> The current posix_fallocate() implementation and also the fallocate()
> implementation in ext4 are based on above documentation, wherein EOF is
> compared with "offset + len" and not with "offset + len - 1".
> 
> I am not sure if this is right or wrong. But, this is as per posix
> specifications. ;)
> </Amit>

Ahhh -- the off by one error was inside my head!  Obviously if we allocate
bytes for offset 1000, len 100, then the affected byte range would run to
offset 1099, giving a file size of 1100 bytes -- that is (offset + len) --
not (offset + len - 1), which is of course the offset of the last byte.
Sorry for the confusion.
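
The arithmetic above can be written out as a small C sketch. The function names are hypothetical; the second helper assumes the default mode described in the draft page (no FALLOC_FL_KEEP_SIZE), where the file size changes only if offset + len is greater than the current size.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the last byte in the range [offset, offset + len). */
static int64_t last_byte_offset(int64_t offset, int64_t len)
{
    assert(offset >= 0 && len > 0);
    return offset + len - 1;
}

/* File size after a successful call without FALLOC_FL_KEEP_SIZE:
 * the size only grows, and only when offset + len exceeds it. */
static int64_t size_after_default_mode(int64_t old_size, int64_t offset,
                                       int64_t len)
{
    int64_t end;

    assert(old_size >= 0 && offset >= 0 && len > 0);
    end = offset + len;
    return end > old_size ? end : old_size;
}
```

For offset 1000 and len 100 on an empty file, the last affected byte is at offset 1099 and the resulting size is 1100, which is why the page compares the file size with offset + len rather than offset + len - 1.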

[...]

>> .B ENOSYS
>> The file system containing the file system referred to by
> 
> <Amit>
> There is a typo above. We have "file system" repeated twice in above
> sentence. Second one should be "file".
> </Amit>

Thanks for catching that.

Okay -- it seems that this page is pretty much ready for publication,
right?  I'll hold off for a bit, until nearer the end of the 2.6.23 cycle.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.


* Re: fallocate() man page - darft 2
  2007-08-03 11:59               ` Michael Kerrisk
@ 2007-08-06  6:10                 ` Amit K. Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Amit K. Arora @ 2007-08-06  6:10 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: dgc, linux-kernel, linux-fsdevel

On Fri, Aug 03, 2007 at 01:59:53PM +0200, Michael Kerrisk wrote:
> > <Amit>
> > There is a typo above. We have "file system" repeated twice in above
> > sentence. Second one should be "file".
> > </Amit>
> 
> Thanks for catching that.
> 
> Okay -- it seems that this page is pretty much ready for publication,
> right?  I'll hold off for a bit, until nearer the end of the 2.6.23 cycle.

I agree. Thanks!

--
Regards,
Amit Arora


end of thread, other threads:[~2007-08-06  6:10 UTC | newest]

Thread overview: 26+ messages
2007-07-13 12:38 [PATCH 0/6][TAKE7] fallocate system call Amit K. Arora
2007-07-13 12:46 ` [PATCH 1/6][TAKE7] manpage for fallocate Amit K. Arora
2007-07-13 14:06   ` David Chinner
2007-07-13 14:27     ` Amit K. Arora
2007-07-14  8:23   ` Michael Kerrisk
2007-07-16  5:32     ` Amit K. Arora
2007-07-23  6:09       ` fallocate() man page Michael Kerrisk
2007-07-23 13:10         ` Amit K. Arora
2007-07-24  7:06           ` David Chinner
2007-07-30  6:21             ` Michael Kerrisk
2007-07-30 19:43           ` Michael Kerrisk
2007-07-31 13:56             ` Amit K. Arora
2007-07-30 19:44           ` fallocate() man page - darft 2 Michael Kerrisk
2007-08-02 17:36             ` Amit K. Arora
2007-08-03 11:59               ` Michael Kerrisk
2007-08-06  6:10                 ` Amit K. Arora
2007-07-13 12:47 ` [PATCH 2/6][TAKE7] fallocate() implementation in i386, x86_64 and powerpc Amit K. Arora
2007-07-13 13:21   ` Christoph Hellwig
2007-07-13 14:18     ` Amit K. Arora
2007-07-13 14:46       ` Christoph Hellwig
2007-07-13 12:48 ` [PATCH 3/6][TAKE7] revalidate write permissions for fallocate Amit K. Arora
2007-07-13 13:21   ` Christoph Hellwig
2007-07-13 14:28     ` Amit K. Arora
2007-07-13 12:50 ` [PATCH 4/6][TAKE7] ext4: fallocate support in ext4 Amit K. Arora
2007-07-13 12:52 ` [PATCH 5/6][TAKE7] ext4: write support for preallocated blocks Amit K. Arora
2007-07-13 12:52 ` [PATCH 6/6][TAKE7] ext4: change for better extent-to-group alignment Amit K. Arora
