Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
@ 2020-02-26 13:40 Kirill Tkhai
  2020-02-26 13:40 ` [PATCH RFC 1/5] fs: Add new argument to file_operations::fallocate() Kirill Tkhai
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:40 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

When discard granuality of a block device is bigger than filesystem block size,
fstrim does not effectively release device blocks. During the filesystem life,
some files become deleted, some remain alive, and this results in that many
device blocks are used incomletely (of course, the reason is not only in this,
but since this is not a problem of a filesystem, this is not a subject
of the patchset). This results in space lose for thin provisioning devices.

Say, a filesystem on a block device, which is provided by another filesystem
(say, distributed network filesystem). Semi-used blocks of the block device
result in bad performance and worse space usage of underlining filesystem.
Another example is ext4 with 4k block on loop on ext4 with 1m block. This
case also results in bad disk space usage.

Choosing a bigger block size is not a solution here, since small files become
taking much more disk space, than they used before, and the result excess
disk usage is the same.

The proposed solution is defragmentation of files based on block device
discard granuality knowledge. Files, which were not modified for a long time,
and read-only files, small files, etc, may be placed in the same block device
block together. I.e., compaction of some device blocks, which results
in releasing another device blocks.

The problem is current fallocate() does not allow to implement effective
way for such the defragmentation. The below describes the situation for ext4,
but this should touch all filesystems.

fallocate() goes thru standard blocks allocator, which try to behave very
good for life allocation cases: block placement and future file size
prediction, delayed blocks allocation, etc. But it almost impossible
to allocate blocks from specified place for our specific case. The only
ext4 block allocator option possible to use is that the allocator firstly
tries to allocate blocks from the same block group, that inode is related to.
But this is not enough for effective files compaction.

This patchset implements an extension of fallocate():

	fallocate2(int fd, int mode, loff_t offset, loff_t len,
		   unsigned long long physical)

The new argument is @physical offset from start of device, which is must
for block allocation. In case of [@physical, @physical + len] block range
is available for allocation, the syscall assigns the corresponding extent/
extents to inode. In case of the range or its part is occupied, the syscall
returns with error (maybe, smaller range will be allocated. The behavior
is the same as when fallocate() meets no space in the middle).

This interface allows to solve the formulated problem. Also, note, this
interface may allow to improve existing e4defrag algorithm: decrease
number of file extents more effective.

[1-2/5] are refactoring.
[3/5] adds fallocate2() body.
[4/5] prepares ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
[5/5] adds fallocate2() support for ext4

Any comments are welcomed!

---

Kirill Tkhai (5):
      fs: Add new argument to file_operations::fallocate()
      fs: Add new argument to vfs_fallocate()
      fs: Add fallocate2() syscall
      ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
      ext4: Add fallocate2() support


 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/ia32/sys_ia32.c               |   10 +++++++
 drivers/block/loop.c                   |    2 +
 drivers/nvme/target/io-cmd-file.c      |    4 +--
 drivers/staging/android/ashmem.c       |    2 +
 drivers/target/target_core_file.c      |    2 +
 fs/block_dev.c                         |    4 +--
 fs/btrfs/file.c                        |    4 ++-
 fs/ceph/file.c                         |    5 +++-
 fs/cifs/cifsfs.c                       |    7 +++--
 fs/cifs/smb2ops.c                      |    5 +++-
 fs/ext4/ext4.h                         |    5 +++-
 fs/ext4/extents.c                      |   35 ++++++++++++++++++++-----
 fs/ext4/inode.c                        |   14 ++++++++++
 fs/ext4/mballoc.c                      |   45 +++++++++++++++++++++++++-------
 fs/f2fs/file.c                         |    4 ++-
 fs/fat/file.c                          |    7 ++++-
 fs/fuse/file.c                         |    5 +++-
 fs/gfs2/file.c                         |    5 +++-
 fs/hugetlbfs/inode.c                   |    5 +++-
 fs/io_uring.c                          |    2 +
 fs/ioctl.c                             |    5 ++--
 fs/nfs/nfs4file.c                      |    6 ++++
 fs/nfsd/vfs.c                          |    2 +
 fs/ocfs2/file.c                        |    4 ++-
 fs/open.c                              |   21 +++++++++++----
 fs/overlayfs/file.c                    |    8 ++++--
 fs/xfs/xfs_file.c                      |    5 +++-
 include/linux/fs.h                     |    4 +--
 include/linux/syscalls.h               |    8 +++++-
 ipc/shm.c                              |    6 ++--
 mm/madvise.c                           |    2 +
 mm/shmem.c                             |    4 ++-
 34 files changed, 190 insertions(+), 59 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 1/5] fs: Add new argument to file_operations::fallocate()
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
@ 2020-02-26 13:40 ` Kirill Tkhai
  2020-02-26 13:41 ` [PATCH RFC 2/5] fs: Add new argument to vfs_fallocate() Kirill Tkhai
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:40 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

After the patch the prototype will look in the following way:

long (*fallocate)(struct file *file, int mode, loff_t offset,
		  loff_t len, u64 physical);

@physical is the new argument. This patch does not contain
functional changes, and it will be used in further patches.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/block/loop.c              |    2 +-
 drivers/staging/android/ashmem.c  |    2 +-
 drivers/target/target_core_file.c |    2 +-
 fs/block_dev.c                    |    4 ++--
 fs/btrfs/file.c                   |    4 +++-
 fs/ceph/file.c                    |    5 ++++-
 fs/cifs/cifsfs.c                  |    7 ++++---
 fs/cifs/smb2ops.c                 |    5 ++++-
 fs/ext4/ext4.h                    |    2 +-
 fs/ext4/extents.c                 |    6 +++++-
 fs/f2fs/file.c                    |    4 +++-
 fs/fat/file.c                     |    7 +++++--
 fs/fuse/file.c                    |    5 ++++-
 fs/gfs2/file.c                    |    5 ++++-
 fs/hugetlbfs/inode.c              |    5 ++++-
 fs/nfs/nfs4file.c                 |    6 +++++-
 fs/ocfs2/file.c                   |    4 +++-
 fs/open.c                         |    2 +-
 fs/overlayfs/file.c               |    6 +++++-
 fs/xfs/xfs_file.c                 |    5 ++++-
 include/linux/fs.h                |    2 +-
 ipc/shm.c                         |    6 +++---
 mm/shmem.c                        |    4 +++-
 23 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da8ec0b9d909..6416111a2ae1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -438,7 +438,7 @@ static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos,
 		goto out;
 	}
 
-	ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq));
+	ret = file->f_op->fallocate(file, mode, pos, blk_rq_bytes(rq), (u64)-1);
 	if (unlikely(ret && ret != -EINVAL && ret != -EOPNOTSUPP))
 		ret = -EIO;
  out:
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 8044510d8ec6..ea05ff484ebe 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -489,7 +489,7 @@ ashmem_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		mutex_unlock(&ashmem_mutex);
 		f->f_op->fallocate(f,
 				   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				   start, end - start);
+				   start, end - start, (u64)-1);
 		fput(f);
 		if (atomic_dec_and_test(&ashmem_shrink_inflight))
 			wake_up_all(&ashmem_shrink_wait);
diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7143d03f0e02..feafb731bbd9 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -581,7 +581,7 @@ fd_execute_unmap(struct se_cmd *cmd, sector_t lba, sector_t nolb)
 		if (!file->f_op->fallocate)
 			return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
 
-		ret = file->f_op->fallocate(file, mode, pos, len);
+		ret = file->f_op->fallocate(file, mode, pos, len, (u64)-1);
 		if (ret < 0) {
 			pr_warn("FILEIO: fallocate() failed: %d\n", ret);
 			return TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..d356f7d7f666 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2078,7 +2078,7 @@ static const struct address_space_operations def_blk_aops = {
 		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
-			     loff_t len)
+			     loff_t len, u64 physical)
 {
 	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
 	struct address_space *mapping;
@@ -2087,7 +2087,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	int error;
 
 	/* Fail if we don't recognize the flags. */
-	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+	if ((mode & ~BLKDEV_FALLOC_FL_SUPPORTED) || physical != (u64)-1)
 		return -EOPNOTSUPP;
 
 	/* Don't go off the end of the device. */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6f6f1805e6fd..5d80da6d14eb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3174,7 +3174,7 @@ static int btrfs_zero_range(struct inode *inode,
 }
 
 static long btrfs_fallocate(struct file *file, int mode,
-			    loff_t offset, loff_t len)
+			    loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct extent_state *cached_state = NULL;
@@ -3201,6 +3201,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		return btrfs_punch_hole(inode, offset, len);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7e0190b1f821..948694b478a4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1775,7 +1775,7 @@ static int ceph_zero_objects(struct inode *inode, loff_t offset, loff_t length)
 }
 
 static long ceph_fallocate(struct file *file, int mode,
-				loff_t offset, loff_t length)
+				loff_t offset, loff_t length, u64 physical)
 {
 	struct ceph_file_info *fi = file->private_data;
 	struct inode *inode = file_inode(file);
@@ -1790,6 +1790,9 @@ static long ceph_fallocate(struct file *file, int mode,
 	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	if (!S_ISREG(inode->i_mode))
 		return -EOPNOTSUPP;
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index fa77fe5258b0..ddf7888798af 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -281,14 +281,15 @@ cifs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
-static long cifs_fallocate(struct file *file, int mode, loff_t off, loff_t len)
+static long cifs_fallocate(struct file *file, int mode,
+			   loff_t off, loff_t len, u64 physical)
 {
 	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file);
 	struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
 	struct TCP_Server_Info *server = tcon->ses->server;
 
-	if (server->ops->fallocate)
-		return server->ops->fallocate(file, tcon, mode, off, len);
+	if (server->ops->fallocate && physical != (u64)-1)
+		return server->ops->fallocate(file, tcon, mode, off, len, (u64)-1);
 
 	return -EOPNOTSUPP;
 }
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 5fa34225a99b..30cb1b911ebf 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -3460,8 +3460,11 @@ static int smb3_fiemap(struct cifs_tcon *tcon,
 }
 
 static long smb3_fallocate(struct file *file, struct cifs_tcon *tcon, int mode,
-			   loff_t off, loff_t len)
+			   loff_t off, loff_t len, u64 physical)
 {
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	/* KEEP_SIZE already checked for by do_fallocate */
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		return smb3_punch_hole(file, tcon, off, len);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 61b37a052052..5a98081c5369 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3347,7 +3347,7 @@ extern int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 extern void ext4_ext_init(struct super_block *);
 extern void ext4_ext_release(struct super_block *);
 extern long ext4_fallocate(struct file *file, int mode, loff_t offset,
-			  loff_t len);
+			  loff_t len, u64 physical);
 extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
 					  loff_t offset, ssize_t len);
 extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 954013d6076b..10d0188a712d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4835,7 +4835,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
  * of writing zeroes to the required new blocks (the same behavior which is
  * expected for file systems which do not support fallocate() system call).
  */
-long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+long ext4_fallocate(struct file *file, int mode,
+		    loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	loff_t new_size = 0;
@@ -4861,6 +4862,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		     FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		return ext4_punch_hole(inode, offset, len);
 
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0d4da644df3b..2dfd886a2e75 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1685,7 +1685,7 @@ static int expand_inode_data(struct inode *inode, loff_t offset,
 }
 
 static long f2fs_fallocate(struct file *file, int mode,
-				loff_t offset, loff_t len)
+			   loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	long ret = 0;
@@ -1696,6 +1696,8 @@ static long f2fs_fallocate(struct file *file, int mode,
 		return -ENOSPC;
 	if (!f2fs_is_compress_backend_ready(inode))
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 
 	/* f2fs only support ->fallocate for regular file */
 	if (!S_ISREG(inode->i_mode))
diff --git a/fs/fat/file.c b/fs/fat/file.c
index bdc4503c00a3..4febd1e4f5af 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -19,7 +19,7 @@
 #include "fat.h"
 
 static long fat_fallocate(struct file *file, int mode,
-			  loff_t offset, loff_t len);
+			  loff_t offset, loff_t len, u64 physical);
 
 static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr)
 {
@@ -257,7 +257,7 @@ static int fat_cont_expand(struct inode *inode, loff_t size)
  * allocate and zero out clusters via an expanding truncate.
  */
 static long fat_fallocate(struct file *file, int mode,
-			  loff_t offset, loff_t len)
+			  loff_t offset, loff_t len, u64 physical)
 {
 	int nr_cluster; /* Number of clusters to be allocated */
 	loff_t mm_bytes; /* Number of bytes to be allocated for file */
@@ -271,6 +271,9 @@ static long fat_fallocate(struct file *file, int mode,
 	if (mode & ~FALLOC_FL_KEEP_SIZE)
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	/* No support for dir */
 	if (!S_ISREG(inode->i_mode))
 		return -EOPNOTSUPP;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..5981ad057b7c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3166,7 +3166,7 @@ static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
 }
 
 static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
-				loff_t length)
+				loff_t length, u64 physical)
 {
 	struct fuse_file *ff = file->private_data;
 	struct inode *inode = file_inode(file);
@@ -3186,6 +3186,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	if (fc->no_fallocate)
 		return -EOPNOTSUPP;
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index cb26be6f4351..40f958ea0fde 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1114,7 +1114,8 @@ static long __gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t
 	return error;
 }
 
-static long gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+static long gfs2_fallocate(struct file *file, int mode,
+			   loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -1127,6 +1128,8 @@ static long gfs2_fallocate(struct file *file, int mode, loff_t offset, loff_t le
 	/* fallocate is needed by gfs2_grow to reserve space in the rindex */
 	if (gfs2_is_jdata(ip) && inode != sdp->sd_rindex)
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 
 	inode_lock(inode);
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index aff8642f0c2e..98d9af6529fa 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -563,7 +563,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 }
 
 static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
-				loff_t len)
+				loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
@@ -580,6 +580,9 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		return hugetlbfs_punch_hole(inode, offset, len);
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 1297919e0fce..51061872e9fc 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -214,7 +214,8 @@ static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
 	}
 }
 
-static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t len)
+static long nfs42_fallocate(struct file *filep, int mode,
+			    loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(filep);
 	long ret;
@@ -225,6 +226,9 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
 	if ((mode != 0) && (mode != (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)))
 		return -EOPNOTSUPP;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	ret = inode_newsize_ok(inode, offset + len);
 	if (ret < 0)
 		return ret;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6cd5e4924e4d..a749ff71b8e4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2022,7 +2022,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 }
 
 static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
-			    loff_t len)
+			    loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
@@ -2032,6 +2032,8 @@ static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 	if (!ocfs2_writes_unwritten_extents(osb))
 		return -EOPNOTSUPP;
 
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..73f27c9b518c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -306,7 +306,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
-	ret = file->f_op->fallocate(file, mode, offset, len);
+	ret = file->f_op->fallocate(file, mode, offset, len, (u64)-1);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index a5317216de73..abe34162d9d4 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -460,13 +460,17 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 	return ret;
 }
 
-static long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+static long ovl_fallocate(struct file *file, int mode,
+			  loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct fd real;
 	const struct cred *old_cred;
 	int ret;
 
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
+
 	ret = ovl_real_fdget(file, &real);
 	if (ret)
 		return ret;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8a4a3f29b36..61ca96469fa0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -802,7 +802,8 @@ xfs_file_fallocate(
 	struct file		*file,
 	int			mode,
 	loff_t			offset,
-	loff_t			len)
+	loff_t			len,
+	u64			physical)
 {
 	struct inode		*inode = file_inode(file);
 	struct xfs_inode	*ip = XFS_I(inode);
@@ -816,6 +817,8 @@ xfs_file_fallocate(
 		return -EINVAL;
 	if (mode & ~XFS_FALLOC_FL_SUPPORTED)
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 
 	xfs_ilock(ip, iolock);
 	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f814ccd8d929..17c111e164d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1846,7 +1846,7 @@ struct file_operations {
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **, void **);
 	long (*fallocate)(struct file *file, int mode, loff_t offset,
-			  loff_t len);
+			  loff_t len, u64 physical);
 	void (*show_fdinfo)(struct seq_file *m, struct file *f);
 #ifndef CONFIG_MMU
 	unsigned (*mmap_capabilities)(struct file *);
diff --git a/ipc/shm.c b/ipc/shm.c
index ce1ca9f7c6e9..3ab15a1c2d91 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -532,13 +532,13 @@ static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 
 static long shm_fallocate(struct file *file, int mode, loff_t offset,
-			  loff_t len)
+			  loff_t len, u64 physical)
 {
 	struct shm_file_data *sfd = shm_file_data(file);
 
-	if (!sfd->file->f_op->fallocate)
+	if (!sfd->file->f_op->fallocate || physical != (u64)-1)
 		return -EOPNOTSUPP;
-	return sfd->file->f_op->fallocate(file, mode, offset, len);
+	return sfd->file->f_op->fallocate(file, mode, offset, len, (u64)-1);
 }
 
 static unsigned long shm_get_unmapped_area(struct file *file,
diff --git a/mm/shmem.c b/mm/shmem.c
index 31b4bcc95f17..a07afc5b06d0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2724,7 +2724,7 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 }
 
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
-							 loff_t len)
+			    loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
@@ -2735,6 +2735,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
 		return -EOPNOTSUPP;
+	if (physical != (u64)-1)
+		return -EOPNOTSUPP;
 
 	inode_lock(inode);
 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 2/5] fs: Add new argument to vfs_fallocate()
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
  2020-02-26 13:40 ` [PATCH RFC 1/5] fs: Add new argument to file_operations::fallocate() Kirill Tkhai
@ 2020-02-26 13:41 ` Kirill Tkhai
  2020-02-26 13:41 ` [PATCH RFC 3/5] fs: Add fallocate2() syscall Kirill Tkhai
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:41 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

This patch propagates @physical into vfs_fallocate().
No functional changes.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 drivers/nvme/target/io-cmd-file.c |    4 ++--
 fs/io_uring.c                     |    2 +-
 fs/ioctl.c                        |    5 +++--
 fs/nfsd/vfs.c                     |    2 +-
 fs/open.c                         |    7 ++++---
 fs/overlayfs/file.c               |    2 +-
 include/linux/fs.h                |    2 +-
 mm/madvise.c                      |    2 +-
 8 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index cd5670b83118..f86ea0dc4638 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -306,7 +306,7 @@ static void nvmet_file_execute_discard(struct nvmet_req *req)
 			break;
 		}
 
-		ret = vfs_fallocate(req->ns->file, mode, offset, len);
+		ret = vfs_fallocate(req->ns->file, mode, offset, len, (u64)-1);
 		if (ret && ret != -EOPNOTSUPP) {
 			req->error_slba = le64_to_cpu(range.slba);
 			status = errno_to_nvme_status(req, ret);
@@ -360,7 +360,7 @@ static void nvmet_file_write_zeroes_work(struct work_struct *w)
 		return;
 	}
 
-	ret = vfs_fallocate(req->ns->file, mode, offset, len);
+	ret = vfs_fallocate(req->ns->file, mode, offset, len, (u64)-1);
 	nvmet_req_complete(req, ret < 0 ? errno_to_nvme_status(req, ret) : 0);
 }
 
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8866bd60783f..03be497747da 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2512,7 +2512,7 @@ static void io_fallocate_finish(struct io_wq_work **workptr)
 		return;
 
 	ret = vfs_fallocate(req->file, req->sync.mode, req->sync.off,
-				req->sync.len);
+				req->sync.len, (u64)-1);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_cqring_add_event(req, ret);
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 282d45be6f45..5f3222434e05 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -502,7 +502,7 @@ static int ioctl_preallocate(struct file *filp, int mode, void __user *argp)
 	}
 
 	return vfs_fallocate(filp, mode | FALLOC_FL_KEEP_SIZE, sr.l_start,
-			sr.l_len);
+			sr.l_len, (u64)-1);
 }
 
 /* on ia32 l_start is on a 32-bit boundary */
@@ -530,7 +530,8 @@ static int compat_ioctl_preallocate(struct file *file, int mode,
 		return -EINVAL;
 	}
 
-	return vfs_fallocate(file, mode | FALLOC_FL_KEEP_SIZE, sr.l_start, sr.l_len);
+	return vfs_fallocate(file, mode | FALLOC_FL_KEEP_SIZE,
+			     sr.l_start, sr.l_len, (u64)-1);
 }
 #endif
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..a6b0acb795f5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -590,7 +590,7 @@ __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (!S_ISREG(file_inode(file)->i_mode))
 		return nfserr_inval;
 
-	error = vfs_fallocate(file, flags, offset, len);
+	error = vfs_fallocate(file, flags, offset, len, (u64)-1);
 	if (!error)
 		error = commit_metadata(fhp);
 
diff --git a/fs/open.c b/fs/open.c
index 73f27c9b518c..596fd3dc3988 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -226,7 +226,8 @@ SYSCALL_DEFINE2(ftruncate64, unsigned int, fd, loff_t, length)
 #endif /* BITS_PER_LONG == 32 */
 
 
-int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+int vfs_fallocate(struct file *file, int mode,
+		  loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
 	long ret;
@@ -306,7 +307,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
-	ret = file->f_op->fallocate(file, mode, offset, len, (u64)-1);
+	ret = file->f_op->fallocate(file, mode, offset, len, physical);
 
 	/*
 	 * Create inotify and fanotify events.
@@ -329,7 +330,7 @@ int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 	int error = -EBADF;
 
 	if (f.file) {
-		error = vfs_fallocate(f.file, mode, offset, len);
+		error = vfs_fallocate(f.file, mode, offset, len, (u64)-1);
 		fdput(f);
 	}
 	return error;
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index abe34162d9d4..e1857861c7ba 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -476,7 +476,7 @@ static long ovl_fallocate(struct file *file, int mode,
 		return ret;
 
 	old_cred = ovl_override_creds(file_inode(file)->i_sb);
-	ret = vfs_fallocate(real.file, mode, offset, len);
+	ret = vfs_fallocate(real.file, mode, offset, len, (u64)-1);
 	revert_creds(old_cred);
 
 	/* Update size */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17c111e164d2..0222599a4b9b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2538,7 +2538,7 @@ extern long vfs_truncate(const struct path *, loff_t);
 extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
 		       struct file *filp);
 extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
-			loff_t len);
+			loff_t len, u64 physical);
 extern long do_sys_open(int dfd, const char __user *filename, int flags,
 			umode_t mode);
 extern struct file *file_open_name(struct filename *, int, umode_t);
diff --git a/mm/madvise.c b/mm/madvise.c
index 43b47d3fae02..89c2e8bab44a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -849,7 +849,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				offset, end - start);
+				offset, end - start, (u64)-1);
 	fput(f);
 	down_read(&current->mm->mmap_sem);
 	return error;



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 3/5] fs: Add fallocate2() syscall
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
  2020-02-26 13:40 ` [PATCH RFC 1/5] fs: Add new argument to file_operations::fallocate() Kirill Tkhai
  2020-02-26 13:41 ` [PATCH RFC 2/5] fs: Add new argument to vfs_fallocate() Kirill Tkhai
@ 2020-02-26 13:41 ` Kirill Tkhai
  2020-02-26 13:41 ` [PATCH RFC 4/5] ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY Kirill Tkhai
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:41 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

This introduces a new syscall and propagates @physical there.
Also, architecture-dependent definitions for x86 are added.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/ia32/sys_ia32.c               |   10 ++++++++++
 fs/open.c                              |   16 +++++++++++++---
 include/linux/syscalls.h               |    8 +++++++-
 5 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c17cb77eb150..62b3692df584 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -442,3 +442,4 @@
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
 437	i386	openat2			sys_openat2			__ia32_sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
+486	i386	fallocate2		sys_fallocate2			__ia32_compat_sys_x86_fallocate2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 44d510bc9b78..b106a39509ee 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -359,6 +359,7 @@
 435	common	clone3			__x64_sys_clone3/ptregs
 437	common	openat2			__x64_sys_openat2
 438	common	pidfd_getfd		__x64_sys_pidfd_getfd
+486	common	fallocate2		__x64_sys_fallocate2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index 21790307121e..1757bfe1a19c 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -230,6 +230,16 @@ COMPAT_SYSCALL_DEFINE6(x86_fallocate, int, fd, int, mode,
 			      ((u64)len_hi << 32) | len_lo);
 }
 
+COMPAT_SYSCALL_DEFINE6(x86_fallocate2, int, fd, int, mode,
+		       unsigned int, offset_lo, unsigned int, offset_hi,
+		       unsigned int, len_lo, unsigned int, len_hi,
+		       unsigned int physical_lo, unsigned int physical_hi)
+{
+	return ksys_fallocate2(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+			      ((u64)len_hi << 32) | len_lo,
+			      ((u64)physical_hi << 32) | physical_lo);
+}
+
 /*
  * The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS
  */
diff --git a/fs/open.c b/fs/open.c
index 596fd3dc3988..1b964a37ecc2 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -290,6 +290,10 @@ int vfs_fallocate(struct file *file, int mode,
 	if (ret)
 		return ret;
 
+	if (physical != (u64)-1 &&
+	    !ns_capable(inode->i_sb->s_user_ns, CAP_FOWNER))
+		return -EPERM;
+
 	if (S_ISFIFO(inode->i_mode))
 		return -ESPIPE;
 
@@ -324,13 +328,13 @@ int vfs_fallocate(struct file *file, int mode,
 }
 EXPORT_SYMBOL_GPL(vfs_fallocate);
 
-int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+int ksys_fallocate2(int fd, int mode, loff_t offset, loff_t len, u64 physical)
 {
 	struct fd f = fdget(fd);
 	int error = -EBADF;
 
 	if (f.file) {
-		error = vfs_fallocate(f.file, mode, offset, len, (u64)-1);
+		error = vfs_fallocate(f.file, mode, offset, len, physical);
 		fdput(f);
 	}
 	return error;
@@ -338,7 +342,13 @@ int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 SYSCALL_DEFINE4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
 {
-	return ksys_fallocate(fd, mode, offset, len);
+	return ksys_fallocate2(fd, mode, offset, len, (u64)-1);
+}
+
+SYSCALL_DEFINE5(fallocate2, int, fd, int, mode, loff_t, offset, loff_t, len,
+		unsigned long long, physical)
+{
+	return ksys_fallocate2(fd, mode, offset, len, physical);
 }
 
 /*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..1999493b03e9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -427,6 +427,8 @@ asmlinkage long sys_truncate64(const char __user *path, loff_t length);
 asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length);
 #endif
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_fallocate2(int fd, int mode, loff_t offset, loff_t len,
+			       unsigned long long physical);
 asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode);
 asmlinkage long sys_chdir(const char __user *filename);
 asmlinkage long sys_fchdir(unsigned int fd);
@@ -1255,7 +1257,11 @@ ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
 		     loff_t pos);
 ssize_t ksys_pwrite64(unsigned int fd, const char __user *buf,
 		      size_t count, loff_t pos);
-int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+int ksys_fallocate2(int fd, int mode, loff_t offset, loff_t len, u64 physical);
+static inline int ksys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+	return ksys_fallocate2(fd, mode, offset, len, (u64)-1);
+}
 #ifdef CONFIG_ADVISE_SYSCALLS
 int ksys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
 #else



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 4/5] ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
                   ` (2 preceding siblings ...)
  2020-02-26 13:41 ` [PATCH RFC 3/5] fs: Add fallocate2() syscall Kirill Tkhai
@ 2020-02-26 13:41 ` Kirill Tkhai
  2020-02-26 13:41 ` [PATCH RFC 5/5] ext4: Add fallocate2() support Kirill Tkhai
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:41 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

EXT4_MB_HINT_GOAL_ONLY is currently unused. This patch teaches
ext4_mb_discard_preallocations() to discard only that preallocated
range, which contains a specified block, in case of the flag is set.
Otherwise, a preallocated range is not discarded.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/mballoc.c |   28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 51a78eb65f3c..b1b3c5526d1a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3894,8 +3894,8 @@ ext4_mb_release_group_pa(struct ext4_buddy *e4b,
  *   1) how many requested
  */
 static noinline_for_stack int
-ext4_mb_discard_group_preallocations(struct super_block *sb,
-					ext4_group_t group, int needed)
+ext4_mb_discard_group_preallocations(struct super_block *sb, ext4_group_t group,
+				     int needed, ext4_fsblk_t goal)
 {
 	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
 	struct buffer_head *bitmap_bh = NULL;
@@ -3947,6 +3947,12 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
 			continue;
 		}
 
+		if (goal != (ext4_fsblk_t)-1 &&
+		    (goal < pa->pa_pstart || goal >= pa->pa_pstart + pa->pa_len)) {
+			spin_unlock(&pa->pa_lock);
+			continue;
+		}
+
 		/* seems this one can be freed ... */
 		pa->pa_deleted = 1;
 
@@ -4462,15 +4468,23 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
 	return 0;
 }
 
-static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
+static int ext4_mb_discard_preallocations(struct super_block *sb,
+					  struct ext4_allocation_context *ac)
 {
-	ext4_group_t i, ngroups = ext4_get_groups_count(sb);
+	ext4_group_t i = 0, ngroups = ext4_get_groups_count(sb);
+	int needed = ac->ac_o_ex.fe_len;
+	ext4_fsblk_t goal = (ext4_fsblk_t)-1;
 	int ret;
 	int freed = 0;
 
+	if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) {
+		i = ac->ac_o_ex.fe_group;
+		ngroups = i + 1;
+		goal = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
+	}
 	trace_ext4_mb_discard_preallocations(sb, needed);
-	for (i = 0; i < ngroups && needed > 0; i++) {
-		ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+	for (; i < ngroups && needed > 0; i++) {
+		ret = ext4_mb_discard_group_preallocations(sb, i, needed, goal);
 		freed += ret;
 		needed -= ret;
 	}
@@ -4585,7 +4599,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 			ar->len = ac->ac_b_ex.fe_len;
 		}
 	} else {
-		freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+		freed  = ext4_mb_discard_preallocations(sb, ac);
 		if (freed)
 			goto repeat;
 		*errp = -ENOSPC;



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
                   ` (3 preceding siblings ...)
  2020-02-26 13:41 ` [PATCH RFC 4/5] ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY Kirill Tkhai
@ 2020-02-26 13:41 ` Kirill Tkhai
  2020-02-26 15:55   ` Christoph Hellwig
  2020-02-27 10:39 ` [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Ritesh Harjani
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 13:41 UTC (permalink / raw)
  To: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, ktkhai, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel

This adds a support of physical hint for fallocate2() syscall.
In case of @physical argument is set for ext4_fallocate(),
we try to allocate blocks only from [@phisical, @physical + len]
range, while other blocks are not used.

ext4_fallocate(struct file *file, int mode,
                    loff_t offset, loff_t len, u64 physical)

In case of some of blocks from the range are occupied, the syscall
returns with error. This is the only difference from fallocate().
The same as fallocate(), less then @len blocks may be allocated
with error as a return value.

We try to find hint blocks both in preallocated and ordinary blocks.
Note, that ext4_mb_use_preallocated() looks for the hint only in
inode's preallocations. In case of there are no desired block,
further ext4_mb_discard_preallocations() tries to release group
preallocations.

Note, that this patch makes EXT4_MB_HINT_GOAL_ONLY flag be used,
it used to be unused before for years.
New EXT4_GET_BLOCKS_FROM_GOAL flag of ext4_map_blocks() is added.
It indicates, that struct ext4_map_blocks::m_goal_pblk is valid.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/ext4/ext4.h    |    3 +++
 fs/ext4/extents.c |   31 ++++++++++++++++++++++++-------
 fs/ext4/inode.c   |   14 ++++++++++++++
 fs/ext4/mballoc.c |   17 ++++++++++++++---
 4 files changed, 55 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5a98081c5369..299fbb8350ac 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -181,6 +181,7 @@ struct ext4_allocation_request {
 struct ext4_map_blocks {
 	ext4_fsblk_t m_pblk;
 	ext4_lblk_t m_lblk;
+	ext4_fsblk_t m_goal_pblk;
 	unsigned int m_len;
 	unsigned int m_flags;
 };
@@ -621,6 +622,8 @@ enum {
 	/* Caller will submit data before dropping transaction handle. This
 	 * allows jbd2 to avoid submitting data before commit. */
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
+	/* Caller wants blocks from provided physical offset */
+#define EXT4_GET_BLOCKS_FROM_GOAL		0x0800
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10d0188a712d..5f2790c1c4fb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4412,7 +4412,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 
 	/* allocate new block */
 	ar.inode = inode;
-	ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
 	ar.logical = map->m_lblk;
 	/*
 	 * We calculate the offset from the beginning of the cluster
@@ -4437,6 +4436,13 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.flags |= EXT4_MB_DELALLOC_RESERVED;
 	if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
 		ar.flags |= EXT4_MB_USE_RESERVED;
+	if (flags & EXT4_GET_BLOCKS_FROM_GOAL) {
+		ar.flags |= EXT4_MB_HINT_TRY_GOAL|EXT4_MB_HINT_GOAL_ONLY;
+		ar.goal = map->m_goal_pblk;
+	} else {
+		ar.goal = ext4_ext_find_goal(inode, path, map->m_lblk);
+	}
+
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out2;
@@ -4580,8 +4586,8 @@ int ext4_ext_truncate(handle_t *handle, struct inode *inode)
 }
 
 static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
-				  ext4_lblk_t len, loff_t new_size,
-				  int flags)
+				  ext4_lblk_t len, ext4_fsblk_t goal_pblk,
+				  loff_t new_size, int flags)
 {
 	struct inode *inode = file_inode(file);
 	handle_t *handle;
@@ -4603,6 +4609,10 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	 */
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+	if (goal_pblk != (ext4_fsblk_t)-1) {
+		map.m_goal_pblk = goal_pblk;
+		flags |= EXT4_GET_BLOCKS_FROM_GOAL;
+	}
 
 	/*
 	 * credits to insert 1 extent into extent tree
@@ -4637,6 +4647,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 			break;
 		}
 		map.m_lblk += ret;
+		map.m_goal_pblk += ret;
 		map.m_len = len = len - ret;
 		epos = (loff_t)map.m_lblk << inode->i_blkbits;
 		inode->i_ctime = current_time(inode);
@@ -4746,6 +4757,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 				round_down(offset, 1 << blkbits) >> blkbits,
 				(round_up((offset + len), 1 << blkbits) -
 				 round_down(offset, 1 << blkbits)) >> blkbits,
+				(ext4_fsblk_t)-1,
 				new_size, flags);
 		if (ret)
 			goto out_mutex;
@@ -4778,8 +4790,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		truncate_pagecache_range(inode, start, end - 1);
 		inode->i_mtime = inode->i_ctime = current_time(inode);
 
-		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
-					     flags);
+		ret = ext4_alloc_file_blocks(file, lblk, max_blocks,
+					     (ext4_fsblk_t)-1, new_size, flags);
 		up_write(&EXT4_I(inode)->i_mmap_sem);
 		if (ret)
 			goto out_mutex;
@@ -4839,10 +4851,12 @@ long ext4_fallocate(struct file *file, int mode,
 		    loff_t offset, loff_t len, u64 physical)
 {
 	struct inode *inode = file_inode(file);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	loff_t new_size = 0;
 	unsigned int max_blocks;
 	int ret = 0;
 	int flags;
+	ext4_fsblk_t pblk;
 	ext4_lblk_t lblk;
 	unsigned int blkbits = inode->i_blkbits;
 
@@ -4862,7 +4876,8 @@ long ext4_fallocate(struct file *file, int mode,
 		     FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
 
-	if (physical != (u64)-1)
+	if (((mode & ~FALLOC_FL_KEEP_SIZE) || sbi->s_cluster_ratio > 1) &&
+	    physical != (u64)-1)
 		return -EOPNOTSUPP;
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
@@ -4883,6 +4898,7 @@ long ext4_fallocate(struct file *file, int mode,
 
 	trace_ext4_fallocate_enter(inode, offset, len, mode);
 	lblk = offset >> blkbits;
+	pblk = physical == (u64)-1 ? (ext4_fsblk_t)-1 : physical >> blkbits;
 
 	max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
 	flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT;
@@ -4911,7 +4927,8 @@ long ext4_fallocate(struct file *file, int mode,
 	/* Wait all existing dio workers, newcomers will block on i_mutex */
 	inode_dio_wait(inode);
 
-	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
+	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, pblk,
+				     new_size, flags);
 	if (ret)
 		goto out;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa0ff78dc033..1054ba65cc1b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -580,6 +580,10 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			return ret;
 	}
 
+	if (retval > 0 && flags & EXT4_GET_BLOCKS_FROM_GOAL &&
+	    map->m_pblk != map->m_goal_pblk)
+		return -EEXIST;
+
 	/* If it is only a block(s) look up */
 	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
 		return retval;
@@ -672,6 +676,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			}
 		}
 
+		/*
+		 * Concurrent thread could allocate extent with other m_pblk,
+		 * and we got it during second call of ext4_ext_map_blocks().
+		 */
+		if (retval > 0 && flags & EXT4_GET_BLOCKS_FROM_GOAL &&
+		    map->m_pblk != map->m_goal_pblk) {
+			retval = -EEXIST;
+			goto out_sem;
+		}
+
 		/*
 		 * If the extent has been zeroed out, we don't need to update
 		 * extent status tree.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b1b3c5526d1a..ed25f47748a0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3426,6 +3426,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	struct ext4_prealloc_space *pa, *cpa = NULL;
 	ext4_fsblk_t goal_block;
 
+	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
+
 	/* only data can be preallocated */
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return 0;
@@ -3436,7 +3438,11 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 
 		/* all fields in this condition don't change,
 		 * so we can skip locking for them */
-		if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
+		if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) &&
+		    (goal_block < pa->pa_pstart ||
+		     goal_block >= pa->pa_pstart + pa->pa_len))
+			continue;
+		else if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
 		    ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
 					       EXT4_C2B(sbi, pa->pa_len)))
 			continue;
@@ -3465,6 +3471,9 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
 		return 0;
 
+	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
+		return 0;
+
 	/* inode may have no locality group for some reason */
 	lg = ac->ac_lg;
 	if (lg == NULL)
@@ -3474,7 +3483,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		/* The max size of hash table is PREALLOC_TB_SIZE */
 		order = PREALLOC_TB_SIZE - 1;
 
-	goal_block = ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex);
 	/*
 	 * search for the prealloc space that is having
 	 * minimal distance from the goal block.
@@ -4261,8 +4269,11 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	/* start searching from the goal */
 	goal = ar->goal;
 	if (goal < le32_to_cpu(es->s_first_data_block) ||
-			goal >= ext4_blocks_count(es))
+			goal >= ext4_blocks_count(es)) {
+		if (ar->flags & EXT4_MB_HINT_GOAL_ONLY)
+			return -EINVAL;
 		goal = le32_to_cpu(es->s_first_data_block);
+	}
 	ext4_get_group_no_and_offset(sb, goal, &group, &block);
 
 	/* set up allocation goals */



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 13:41 ` [PATCH RFC 5/5] ext4: Add fallocate2() support Kirill Tkhai
@ 2020-02-26 15:55   ` Christoph Hellwig
  2020-02-26 20:05     ` Kirill Tkhai
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2020-02-26 15:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, enwlinux, sblbir,
	khazhy, linux-ext4, linux-kernel, linux-fsdevel

On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
> This adds a support of physical hint for fallocate2() syscall.
> In case of @physical argument is set for ext4_fallocate(),
> we try to allocate blocks only from [@phisical, @physical + len]
> range, while other blocks are not used.

Sorry, but this is a complete bullshit interface.  Userspace has
absolutely no business even thinking of physical placement.  If you
want to align allocations to physical block granularity boundaries
that is the file systems job, not the applications job.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 15:55   ` Christoph Hellwig
@ 2020-02-26 20:05     ` Kirill Tkhai
  2020-02-26 21:51       ` Andreas Dilger
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-26 20:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, enwlinux, sblbir,
	khazhy, linux-ext4, linux-kernel, linux-fsdevel

On 26.02.2020 18:55, Christoph Hellwig wrote:
> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>> This adds a support of physical hint for fallocate2() syscall.
>> In case of @physical argument is set for ext4_fallocate(),
>> we try to allocate blocks only from [@phisical, @physical + len]
>> range, while other blocks are not used.
> 
> Sorry, but this is a complete bullshit interface.  Userspace has
> absolutely no business even thinking of physical placement.  If you
> want to align allocations to physical block granularity boundaries
> that is the file systems job, not the applications job.

Why? There are two contradictory actions that filesystem can't do at the same time:

1)place files on a distance from each other to minimize number of extents
  on possible future growth;
2)place small files in the same big block of block device.

At initial allocation time you never know, which file will stop grow in some future,
i.e. which file is suitable for compaction. This knowledge becomes available some time later.
Say, if a file has not been changed for a month, it is suitable for compaction with
another files like it.

If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
and just share your algorithm here.

In Virtuozzo we tried to compact ext4 with existing kernel interface:

https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c

But it does not work well in many situations, and the main problem is blocks allocation
in desired place is not possible. Block allocator can't behave excellent for everything.

If this interface bad, can you suggest another interface to make block allocator to know
the behavior expected from him in this specific case?

Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 20:05     ` Kirill Tkhai
@ 2020-02-26 21:51       ` Andreas Dilger
  2020-02-27 12:24         ` Kirill Tkhai
  2020-02-27  6:59       ` Konstantin Khlebnikov
  2020-02-27  7:33       ` Dave Chinner
  2 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2020-02-26 21:51 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov,
	linux-ext4, Linux Kernel Mailing List, Linux FS Devel

[-- Attachment #1: Type: text/plain, Size: 4071 bytes --]

On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> On 26.02.2020 18:55, Christoph Hellwig wrote:
>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>> This adds a support of physical hint for fallocate2() syscall.
>>> In case of @physical argument is set for ext4_fallocate(),
>>> we try to allocate blocks only from [@phisical, @physical + len]
>>> range, while other blocks are not used.
>> 
>> Sorry, but this is a complete bullshit interface.  Userspace has
>> absolutely no business even thinking of physical placement.  If you
>> want to align allocations to physical block granularity boundaries
>> that is the file systems job, not the applications job.
> 
> Why? There are two contradictory actions that filesystem can't do at the same time:
> 
> 1)place files on a distance from each other to minimize number of extents
>  on possible future growth;
> 2)place small files in the same big block of block device.
> 
> At initial allocation time you never know, which file will stop grow in some
> future, i.e. which file is suitable for compaction. This knowledge becomes
> available some time later.  Say, if a file has not been changed for a month,
> it is suitable for compaction with another files like it.
> 
> If at allocation time you can determine a file, which won't grow in the future,
> don't be afraid, and just share your algorithm here.

Very few files grow after they are initially written/closed.  Those that
do are almost always opened with O_APPEND (e.g. log files).  It would be
reasonable to have O_APPEND cause the filesystem to reserve blocks (in
memory at least, maybe some small amount on disk like 1/4 of the current
file size) for the file to grow after it is closed.  We might use the
same heuristic for directories that grow long after initial creation.

The main exception there is VM images, because they are not really "files"
in the normal sense, but containers aggregating a lot of different files,
each created with patterns that are not visible to the VM host.  In that
case, it would be better to have the VM host tell the filesystem that the
IO pattern is "random" and not try to optimize until the VM is cold.

> In Virtuozzo we tried to compact ext4 with existing kernel interface:
> 
> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
> 
> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave
> excellent for everything.
> 
> If this interface bad, can you suggest another interface to make block
> allocator to know the behavior expected from him in this specific case?

In ext4 there is already the "group" allocator, which combines multiple
small files together into a single preallocation group, so that the IO
to disk is large/contiguous.  The theory is that files written at the
same time will have similar lifespans, but that isn't always true.

If the files are large and still being written, the allocator will reserve
additional blocks (default 8MB I think) on the expectation that it will
continue to write until it is closed.

I think (correct me if I'm wrong) that your issue is with defragmenting
small files to free up contiguous space in the filesystem?  I think once
the free space is freed of small files that defragmenting large files is
easily done.  Anything with more than 8-16MB extents will max out most
storage anyway (seek rate * IO size).

In that case, an interesting userspace interface would be an array of
inode numbers (64-bit please) that should be packed together densely in
the order they are provided (maybe a flag for that).  That allows the
filesystem the freedom to find the physical blocks for the allocation,
while userspace can tell which files are related to each other.

Tools like "readahead" could also leverage this to "perfectly" allocate
the files used during boot into a single stream of reads from the disk.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 20:05     ` Kirill Tkhai
  2020-02-26 21:51       ` Andreas Dilger
@ 2020-02-27  6:59       ` Konstantin Khlebnikov
  2020-02-27 10:42         ` Kirill Tkhai
  2020-02-27  7:33       ` Dave Chinner
  2 siblings, 1 reply; 35+ messages in thread
From: Konstantin Khlebnikov @ 2020-02-27  6:59 UTC (permalink / raw)
  To: Kirill Tkhai, Christoph Hellwig
  Cc: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, enwlinux, sblbir,
	khazhy, linux-ext4, linux-kernel, linux-fsdevel

On 26/02/2020 23.05, Kirill Tkhai wrote:
> On 26.02.2020 18:55, Christoph Hellwig wrote:
>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>> This adds a support of physical hint for fallocate2() syscall.
>>> In case of @physical argument is set for ext4_fallocate(),
>>> we try to allocate blocks only from [@phisical, @physical + len]
>>> range, while other blocks are not used.
>>
>> Sorry, but this is a complete bullshit interface.  Userspace has
>> absolutely no business even thinking of physical placement.  If you
>> want to align allocations to physical block granularity boundaries
>> that is the file systems job, not the applications job.
> 
> Why? There are two contradictory actions that filesystem can't do at the same time:
> 
> 1)place files on a distance from each other to minimize number of extents
>    on possible future growth;
> 2)place small files in the same big block of block device.
> 
> At initial allocation time you never know, which file will stop grow in some future,
> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
> Say, if a file has not been changed for a month, it is suitable for compaction with
> another files like it.
> 
> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
> and just share your algorithm here.
> 
> In Virtuozzo we tried to compact ext4 with existing kernel interface:
> 
> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
> 
> But it does not work well in many situations, and the main problem is blocks allocation
> in desired place is not possible. Block allocator can't behave excellent for everything.
> 
> If this interface bad, can you suggest another interface to make block allocator to know
> the behavior expected from him in this specific case?

Controlling exact place is odd. I suppose main reason for this that defragmentation
process wants to control fragmentation during allocating new space.

Maybe flag FALLOC_FL_DONT_FRAGMENT (allocate exactly one extent or fail) could solve that problem?

Defragmentator could try allocate different sizes and automatically balance fragmentation factor
without controlling exact disk offsets. Also it could reserve space for expected file growth.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 20:05     ` Kirill Tkhai
  2020-02-26 21:51       ` Andreas Dilger
  2020-02-27  6:59       ` Konstantin Khlebnikov
@ 2020-02-27  7:33       ` Dave Chinner
  2020-02-27 11:12         ` Kirill Tkhai
  2 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2020-02-27  7:33 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
> On 26.02.2020 18:55, Christoph Hellwig wrote:
> > On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
> >> This adds a support of physical hint for fallocate2() syscall.
> >> In case of @physical argument is set for ext4_fallocate(),
> >> we try to allocate blocks only from [@phisical, @physical + len]
> >> range, while other blocks are not used.
> > 
> > Sorry, but this is a complete bullshit interface.  Userspace has
> > absolutely no business even thinking of physical placement.  If you
> > want to align allocations to physical block granularity boundaries
> > that is the file systems job, not the applications job.
> 
> Why? There are two contradictory actions that filesystem can't do at the same time:
> 
> 1)place files on a distance from each other to minimize number of extents
>   on possible future growth;

Speculative EOF preallocation at delayed allocation reservation time
provides this.

> 2)place small files in the same big block of block device.

Delayed allocation during writeback packs files smaller than the
stripe unit of the filesystem tightly.

So, yes, we do both of these things at the same time in XFS, and
have for the past 10 years.

> At initial allocation time you never know, which file will stop grow in some future,
> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
> Say, if a file has not been changed for a month, it is suitable for compaction with
> another files like it.
> 
> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
> and just share your algorithm here.
> 
> In Virtuozzo we tried to compact ext4 with existing kernel interface:
> 
> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
> 
> But it does not work well in many situations, and the main problem is blocks allocation
> in desired place is not possible. Block allocator can't behave excellent for everything.
> 
> If this interface bad, can you suggest another interface to make block allocator to know
> the behavior expected from him in this specific case?

Write once, long term data:

	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);

That will allow the the storage stack to group all data with the
same hint together, both in software and in hardware.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
                   ` (4 preceding siblings ...)
  2020-02-26 13:41 ` [PATCH RFC 5/5] ext4: Add fallocate2() support Kirill Tkhai
@ 2020-02-27 10:39 ` Ritesh Harjani
  2020-02-28  7:07 ` xiaohui li
  2020-03-02 16:56 ` Theodore Y. Ts'o
  7 siblings, 0 replies; 35+ messages in thread
From: Ritesh Harjani @ 2020-02-27 10:39 UTC (permalink / raw)
  To: Kirill Tkhai, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, krisman, surajjs, dmonakhov, mbobrowski, enwlinux,
	sblbir, khazhy, linux-ext4, linux-kernel, linux-fsdevel


> fallocate() goes thru standard blocks allocator, which try to behave very
> good for life allocation cases: block placement and future file size
> prediction, delayed blocks allocation, etc. But it almost impossible
> to allocate blocks from specified place for our specific case. The only
> ext4 block allocator option possible to use is that the allocator firstly
> tries to allocate blocks from the same block group, that inode is related to.
> But this is not enough for effective files compaction.
> 
> This patchset implements an extension of fallocate():
> 
> 	fallocate2(int fd, int mode, loff_t offset, loff_t len,
> 		   unsigned long long physical)
> 
> The new argument is @physical offset from start of device, which is must
> for block allocation. In case of [@physical, @physical + len] block range
> is available for allocation, the syscall assigns the corresponding extent/
> extents to inode. In case of the range or its part is occupied, the syscall
> returns with error (maybe, smaller range will be allocated. The behavior
> is the same as when fallocate() meets no space in the middle).

Doesn't this interface kills the whole philosophy of letting filesystems
to decide which block it sees as most fit for allocation. IMHO user
passing over actual physical location from where the FS should allocate,
does not sound like a good interface.


-ritesh


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-27  6:59       ` Konstantin Khlebnikov
@ 2020-02-27 10:42         ` Kirill Tkhai
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-27 10:42 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Christoph Hellwig
  Cc: tytso, viro, adilger.kernel, snitzer, jack, ebiggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, enwlinux, sblbir,
	khazhy, linux-ext4, linux-kernel, linux-fsdevel

On 27.02.2020 09:59, Konstantin Khlebnikov wrote:
> On 26/02/2020 23.05, Kirill Tkhai wrote:
>> On 26.02.2020 18:55, Christoph Hellwig wrote:
>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>>> This adds a support of physical hint for fallocate2() syscall.
>>>> In case of @physical argument is set for ext4_fallocate(),
>>>> we try to allocate blocks only from [@phisical, @physical + len]
>>>> range, while other blocks are not used.
>>>
>>> Sorry, but this is a complete bullshit interface.  Userspace has
>>> absolutely no business even thinking of physical placement.  If you
>>> want to align allocations to physical block granularity boundaries
>>> that is the file systems job, not the applications job.
>>
>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>
>> 1)place files on a distance from each other to minimize number of extents
>>    on possible future growth;
>> 2)place small files in the same big block of block device.
>>
>> At initial allocation time you never know, which file will stop grow in some future,
>> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
>> Say, if a file has not been changed for a month, it is suitable for compaction with
>> another files like it.
>>
>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
>> and just share your algorithm here.
>>
>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>
>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>
>> But it does not work well in many situations, and the main problem is blocks allocation
>> in desired place is not possible. Block allocator can't behave excellent for everything.
>>
>> If this interface bad, can you suggest another interface to make block allocator to know
>> the behavior expected from him in this specific case?
> 
> Controlling exact place is odd. I suppose main reason for this that defragmentation
> process wants to control fragmentation during allocating new space.
> 
> Maybe flag FALLOC_FL_DONT_FRAGMENT (allocate exactly one extent or fail) could solve that problem?
>
> Defragmentator could try allocate different sizes and automatically balance fragmentation factor
> without controlling exact disk offsets. Also it could reserve space for expected file growth.

I don't think this will helps. The problem is not in allocation a single extent (fallocate() already
tries to allocate as small number of extents as possible), but in that it's impossible to allocate it
in desired bounds. Say, you have 1Mb discard granuality on block device and two files in different
block device clusters: one is 4Kb of length, another's size is 1Mb-4Kb. The biggest file is situated
on the start of block device cluster:

[      1Mb cluster0       ][      1Mb cluster1     ]
[****BIG_FILE****|free 4Kb][small_file|free 1Mb-4Kb]

The best defragmentation will be to move small_file into free 4Kb of cluster0. Allocation of single
extent does not help here: you have to allocate very big bunch of such extents in cycle before
allocator returns you desired block, and then it's need to return the rest of extents back. This
has very bad performance.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-27  7:33       ` Dave Chinner
@ 2020-02-27 11:12         ` Kirill Tkhai
  2020-02-27 21:56           ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-27 11:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On 27.02.2020 10:33, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
>> On 26.02.2020 18:55, Christoph Hellwig wrote:
>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>>> This adds a support of physical hint for fallocate2() syscall.
>>>> In case of @physical argument is set for ext4_fallocate(),
>>>> we try to allocate blocks only from [@phisical, @physical + len]
>>>> range, while other blocks are not used.
>>>
>>> Sorry, but this is a complete bullshit interface.  Userspace has
>>> absolutely no business even thinking of physical placement.  If you
>>> want to align allocations to physical block granularity boundaries
>>> that is the file systems job, not the applications job.
>>
>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>
>> 1)place files on a distance from each other to minimize number of extents
>>   on possible future growth;
> 
> Speculative EOF preallocation at delayed allocation reservation time
> provides this.
> 
>> 2)place small files in the same big block of block device.
> 
> Delayed allocation during writeback packs files smaller than the
> stripe unit of the filesystem tightly.
> 
> So, yes, we do both of these things at the same time in XFS, and
> have for the past 10 years.
> 
>> At initial allocation time you never know, which file will stop grow in some future,
>> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
>> Say, if a file has not been changed for a month, it is suitable for compaction with
>> another files like it.
>>
>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
>> and just share your algorithm here.
>>
>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>
>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>
>> But it does not work well in many situations, and the main problem is blocks allocation
>> in desired place is not possible. Block allocator can't behave excellent for everything.
>>
>> If this interface bad, can you suggest another interface to make block allocator to know
>> the behavior expected from him in this specific case?
> 
> Write once, long term data:
> 
> 	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);
> 
> That will allow the the storage stack to group all data with the
> same hint together, both in software and in hardware.

This is interesting option, but it only applicable before write is made. And it's only
applicable on your own applications. My usecase is defragmentation of containers, where
any applications may run. Most of applications never care whether long or short-term
data they write.

Maybe, we can make fallocate() care about F_SET_RW_HINT? Say, if RWH_WRITE_LIFE_EXTREME
is set, fallocate() tries to allocate space around another inodes with the same hint.

E.g., we have 1Mb discard granuality on block device and two files in different block
device clusters: one is 4Kb of length, another's size is 1Mb-4Kb. The biggest file is
situated on the start of block device cluster:

[      1Mb cluster0       ][      1Mb cluster1     ]
[****BIG_FILE****|free 4Kb][small_file|free 1Mb-4Kb]

defrag util wants to move small file into free space in cluster0. To do that it opens
BIG_FILE and sets F_SET_RW_HINT for its inode. Then it opens tmp file, sets the hint
and calls fallocate():

fd1 = open("BIG_FILE", O_RDONLY);
ioctl(fd1, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);

fd2 = open("donor", O_WRONLY|O_TMPFILE|O_CREAT);
ioctl(fd2, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);

fallocate(fd2, 0, 0, 4Kb); // firstly seeks a space around files with RWH_WRITE_LIFE_EXTREME hint

How about this?

Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-26 21:51       ` Andreas Dilger
@ 2020-02-27 12:24         ` Kirill Tkhai
  2020-02-28 15:35           ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-27 12:24 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov,
	linux-ext4, Linux Kernel Mailing List, Linux FS Devel

On 27.02.2020 00:51, Andreas Dilger wrote:
> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 26.02.2020 18:55, Christoph Hellwig wrote:
>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>>> This adds a support of physical hint for fallocate2() syscall.
>>>> In case of @physical argument is set for ext4_fallocate(),
>>>> we try to allocate blocks only from [@phisical, @physical + len]
>>>> range, while other blocks are not used.
>>>
>>> Sorry, but this is a complete bullshit interface.  Userspace has
>>> absolutely no business even thinking of physical placement.  If you
>>> want to align allocations to physical block granularity boundaries
>>> that is the file systems job, not the applications job.
>>
>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>
>> 1)place files on a distance from each other to minimize number of extents
>>  on possible future growth;
>> 2)place small files in the same big block of block device.
>>
>> At initial allocation time you never know, which file will stop grow in some
>> future, i.e. which file is suitable for compaction. This knowledge becomes
>> available some time later.  Say, if a file has not been changed for a month,
>> it is suitable for compaction with another files like it.
>>
>> If at allocation time you can determine a file, which won't grow in the future,
>> don't be afraid, and just share your algorithm here.
> 
> Very few files grow after they are initially written/closed.  Those that
> do are almost always opened with O_APPEND (e.g. log files).  It would be
> reasonable to have O_APPEND cause the filesystem to reserve blocks (in
> memory at least, maybe some small amount on disk like 1/4 of the current
> file size) for the file to grow after it is closed.  We might use the
> same heuristic for directories that grow long after initial creation.

1)Lets see on a real example. I created a new ext4 and started the test below:
https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99

Here are two files written. One file is 4Kb. One file is 1Mb-4Kb.

$filefrag -e test1.tmp test2.tmp 
Filesystem type is: ef53
File size of test1.tmp is 4096 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:      33793..     33793:      1:             last,eof
test1.tmp: 1 extent found
File size of test2.tmp is 1044480 (255 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     254:      33536..     33790:    255:             last,eof
test2.tmp: 1 extent found

$debugfs:  testb 33791
Block 33791 not in use

test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp
placement prevents us from discarding next 1Mb block.

2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb:

# filefrag -e test3.tmp test4.tmp 
Filesystem type is: ef53
File size of test3.tmp is 1052672 (257 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     256:      35840..     36096:    257:             last,eof
test3.tmp: 1 extent found
File size of test4.tmp is 1044480 (255 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     254:      35072..     35326:    255:             last,eof
test4.tmp: 1 extent found

They don't go sequentially, and here is fragmentation starts.

After both the tests:
$df -h
/dev/loop0      2.0G   11M  1.8G   1% /root/mnt

Filesystem is free, all last block groups are free. E.g.,

Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED]
  Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66
  Inode bitmap at 288 (bg #0 + 288), csum 0x00000000
  Inode table at 7969-8480 (bg #0 + 7969)
  32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
  Free blocks: 491520-524287
  Free inodes: 122881-131072

but two files are not packed together.

So, ext4 block allocator does not work good for my workload. It even does not
know anything about discard granuality of underlining block device. Does it?
I assume no fs knows. Should I tell it?

> The main exception there is VM images, because they are not really "files"
> in the normal sense, but containers aggregating a lot of different files,
> each created with patterns that are not visible to the VM host.  In that
> case, it would be better to have the VM host tell the filesystem that the
> IO pattern is "random" and not try to optimize until the VM is cold.
> 
>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>
>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>
>> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave
>> excellent for everything.
>>
>> If this interface bad, can you suggest another interface to make block
>> allocator to know the behavior expected from him in this specific case?
> 
> In ext4 there is already the "group" allocator, which combines multiple
> small files together into a single preallocation group, so that the IO
> to disk is large/contiguous.  The theory is that files written at the
> same time will have similar lifespans, but that isn't always true.
> 
> If the files are large and still being written, the allocator will reserve
> additional blocks (default 8MB I think) on the expectation that it will
> continue to write until it is closed.
> 
> I think (correct me if I'm wrong) that your issue is with defragmenting
> small files to free up contiguous space in the filesystem?  I think once
> the free space is freed of small files that defragmenting large files is
> easily done.  Anything with more than 8-16MB extents will max out most
> storage anyway (seek rate * IO size).

My issue is mostly with files < 1Mb, because underlining device discard
granuality is 1Mb. The result of fragmentation is that size of occupied
1Mb blocks of device is 1.5 times bigger, than size of really written
data (say, df -h). And this is the problem.

> In that case, an interesting userspace interface would be an array of
> inode numbers (64-bit please) that should be packed together densely in
> the order they are provided (maybe a flag for that).  That allows the
> filesystem the freedom to find the physical blocks for the allocation,
> while userspace can tell which files are related to each other.

So, this interface is 3-in-1:

1)finds a placement for inodes extents;
2)assigns this space to some temporary donor inode;
3)calls ext4_move_extents() for each of them.

Do I understand you right?

If so, then IMO it's good to start from two inodes, because here may code
a very difficult algorithm of placement of many inodes, which may require
much memory. Is this OK?

Can we introduce a flag, that some of inode is unmovable?

Can this interface use a knowledge about underlining device discard granuality?

In the answer to Dave, I wrote a proposition to make fallocate() care about
i_write_hint. Could you please comment what you think about that too?

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-27 11:12         ` Kirill Tkhai
@ 2020-02-27 21:56           ` Dave Chinner
  2020-02-28 12:41             ` Kirill Tkhai
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2020-02-27 21:56 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote:
> On 27.02.2020 10:33, Dave Chinner wrote:
> > On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
> >> On 26.02.2020 18:55, Christoph Hellwig wrote:
> >>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
> >>>> This adds a support of physical hint for fallocate2() syscall.
> >>>> In case of @physical argument is set for ext4_fallocate(),
> >>>> we try to allocate blocks only from [@phisical, @physical + len]
> >>>> range, while other blocks are not used.
> >>>
> >>> Sorry, but this is a complete bullshit interface.  Userspace has
> >>> absolutely no business even thinking of physical placement.  If you
> >>> want to align allocations to physical block granularity boundaries
> >>> that is the file systems job, not the applications job.
> >>
> >> Why? There are two contradictory actions that filesystem can't do at the same time:
> >>
> >> 1)place files on a distance from each other to minimize number of extents
> >>   on possible future growth;
> > 
> > Speculative EOF preallocation at delayed allocation reservation time
> > provides this.
> > 
> >> 2)place small files in the same big block of block device.
> > 
> > Delayed allocation during writeback packs files smaller than the
> > stripe unit of the filesystem tightly.
> > 
> > So, yes, we do both of these things at the same time in XFS, and
> > have for the past 10 years.
> > 
> >> At initial allocation time you never know, which file will stop grow in some future,
> >> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
> >> Say, if a file has not been changed for a month, it is suitable for compaction with
> >> another files like it.
> >>
> >> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
> >> and just share your algorithm here.
> >>
> >> In Virtuozzo we tried to compact ext4 with existing kernel interface:
> >>
> >> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
> >>
> >> But it does not work well in many situations, and the main problem is blocks allocation
> >> in desired place is not possible. Block allocator can't behave excellent for everything.
> >>
> >> If this interface bad, can you suggest another interface to make block allocator to know
> >> the behavior expected from him in this specific case?
> > 
> > Write once, long term data:
> > 
> > 	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);
> > 
> > That will allow the the storage stack to group all data with the
> > same hint together, both in software and in hardware.
> 
> This is interesting option, but it only applicable before write is made. And it's only
> applicable on your own applications. My usecase is defragmentation of containers, where
> any applications may run. Most of applications never care whether long or short-term
> data they write.

Why is that a problem? They'll be using the default write hint (i.e.
NONE) and so a hint aware allocation policy will be separating that
data from all the other data written with specific hints...

And you've mentioned that your application has specific *never write
again* selection criteria for data it is repacking. And that
involves rewriting that data.  IOWs, you know exactly what policy
you want to apply before you rewrite the data, and so what other
applications do is completely irrelevant for your repacker...

> Maybe, we can make fallocate() care about F_SET_RW_HINT? Say, if RWH_WRITE_LIFE_EXTREME
> is set, fallocate() tries to allocate space around another inodes with the same hint.

That's exactly what I said:

> > That will allow the the storage stack to group all data with the
> > same hint together, both in software and in hardware.

What the filesystem does with the hint is up to the filesystem
and the policies that it's developers decide are appropriate. If
your filesystem doesn't do what you need, talk to the filesystem
developers about implementing the policy you require.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
                   ` (5 preceding siblings ...)
  2020-02-27 10:39 ` [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Ritesh Harjani
@ 2020-02-28  7:07 ` xiaohui li
  2020-02-28 12:46   ` Kirill Tkhai
  2020-03-02 16:56 ` Theodore Y. Ts'o
  7 siblings, 1 reply; 35+ messages in thread
From: xiaohui li @ 2020-02-28  7:07 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Theodore Y. Ts'o, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, Ext4 Developers List, linux-kernel,
	linux-fsdevel

hi Kirill Tkhai:

I agree with your idea very much.
I had also implemented a similar fallocate interface with the unique
flag which call tell ext4 filesystems to allocate the special free
physical extents.
I had done the same work as your fallocate2 work last year.

but i think it has modified the core ext4 physical blocks allocator a
lot, so the ext4 community may not accept it.
so I didn't share it openly.

but i think this fallocate2 interface just same as my work is also
very useful in android mobile phone world.

today android phone has large capacity memory and storage, just the
same as a computer.
and current days, customer treat phone and make full use of it by just
the same way as they treat computer in past days.
so after install many software and unstall them for many times in
android phone,  the ext4 physical layout on disk has become more
fragmented
(the number of extents per group is large).
at this moment, when run a sequential write workload, you will find
the sequential write performance is very bad.

but after do defragment work on some fragmented ext4 groups, and then
run the same workload again, the sequential write performance has
improved greatly.
and the fallocate2 interface is a necessary component of this
defragment work shown above.

so i also think this fallocate2 interface is an important and useful
tools for ext4 filesystems.









On Wed, Feb 26, 2020 at 9:41 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> When discard granuality of a block device is bigger than filesystem block size,
> fstrim does not effectively release device blocks. During the filesystem life,
> some files become deleted, some remain alive, and this results in that many
> device blocks are used incomletely (of course, the reason is not only in this,
> but since this is not a problem of a filesystem, this is not a subject
> of the patchset). This results in space lose for thin provisioning devices.
>
> Say, a filesystem on a block device, which is provided by another filesystem
> (say, distributed network filesystem). Semi-used blocks of the block device
> result in bad performance and worse space usage of underlining filesystem.
> Another example is ext4 with 4k block on loop on ext4 with 1m block. This
> case also results in bad disk space usage.
>
> Choosing a bigger block size is not a solution here, since small files become
> taking much more disk space, than they used before, and the result excess
> disk usage is the same.
>
> The proposed solution is defragmentation of files based on block device
> discard granuality knowledge. Files, which were not modified for a long time,
> and read-only files, small files, etc, may be placed in the same block device
> block together. I.e., compaction of some device blocks, which results
> in releasing another device blocks.
>
> The problem is current fallocate() does not allow to implement effective
> way for such the defragmentation. The below describes the situation for ext4,
> but this should touch all filesystems.
>
> fallocate() goes thru standard blocks allocator, which try to behave very
> good for life allocation cases: block placement and future file size
> prediction, delayed blocks allocation, etc. But it almost impossible
> to allocate blocks from specified place for our specific case. The only
> ext4 block allocator option possible to use is that the allocator firstly
> tries to allocate blocks from the same block group, that inode is related to.
> But this is not enough for effective files compaction.
>
> This patchset implements an extension of fallocate():
>
>         fallocate2(int fd, int mode, loff_t offset, loff_t len,
>                    unsigned long long physical)
>
> The new argument is @physical offset from start of device, which is must
> for block allocation. In case of [@physical, @physical + len] block range
> is available for allocation, the syscall assigns the corresponding extent/
> extents to inode. In case of the range or its part is occupied, the syscall
> returns with error (maybe, smaller range will be allocated. The behavior
> is the same as when fallocate() meets no space in the middle).
>
> This interface allows to solve the formulated problem. Also, note, this
> interface may allow to improve existing e4defrag algorithm: decrease
> number of file extents more effective.
>
> [1-2/5] are refactoring.
> [3/5] adds fallocate2() body.
> [4/5] prepares ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
> [5/5] adds fallocate2() support for ext4
>
> Any comments are welcomed!
>
> ---
>
> Kirill Tkhai (5):
>       fs: Add new argument to file_operations::fallocate()
>       fs: Add new argument to vfs_fallocate()
>       fs: Add fallocate2() syscall
>       ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY
>       ext4: Add fallocate2() support
>
>
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 +
>  arch/x86/ia32/sys_ia32.c               |   10 +++++++
>  drivers/block/loop.c                   |    2 +
>  drivers/nvme/target/io-cmd-file.c      |    4 +--
>  drivers/staging/android/ashmem.c       |    2 +
>  drivers/target/target_core_file.c      |    2 +
>  fs/block_dev.c                         |    4 +--
>  fs/btrfs/file.c                        |    4 ++-
>  fs/ceph/file.c                         |    5 +++-
>  fs/cifs/cifsfs.c                       |    7 +++--
>  fs/cifs/smb2ops.c                      |    5 +++-
>  fs/ext4/ext4.h                         |    5 +++-
>  fs/ext4/extents.c                      |   35 ++++++++++++++++++++-----
>  fs/ext4/inode.c                        |   14 ++++++++++
>  fs/ext4/mballoc.c                      |   45 +++++++++++++++++++++++++-------
>  fs/f2fs/file.c                         |    4 ++-
>  fs/fat/file.c                          |    7 ++++-
>  fs/fuse/file.c                         |    5 +++-
>  fs/gfs2/file.c                         |    5 +++-
>  fs/hugetlbfs/inode.c                   |    5 +++-
>  fs/io_uring.c                          |    2 +
>  fs/ioctl.c                             |    5 ++--
>  fs/nfs/nfs4file.c                      |    6 ++++
>  fs/nfsd/vfs.c                          |    2 +
>  fs/ocfs2/file.c                        |    4 ++-
>  fs/open.c                              |   21 +++++++++++----
>  fs/overlayfs/file.c                    |    8 ++++--
>  fs/xfs/xfs_file.c                      |    5 +++-
>  include/linux/fs.h                     |    4 +--
>  include/linux/syscalls.h               |    8 +++++-
>  ipc/shm.c                              |    6 ++--
>  mm/madvise.c                           |    2 +
>  mm/shmem.c                             |    4 ++-
>  34 files changed, 190 insertions(+), 59 deletions(-)
>
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-27 21:56           ` Dave Chinner
@ 2020-02-28 12:41             ` Kirill Tkhai
  2020-02-29 22:41               ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-28 12:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On 28.02.2020 00:56, Dave Chinner wrote:
> On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote:
>> On 27.02.2020 10:33, Dave Chinner wrote:
>>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
>>>> On 26.02.2020 18:55, Christoph Hellwig wrote:
>>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>>>>> This adds a support of physical hint for fallocate2() syscall.
>>>>>> In case of @physical argument is set for ext4_fallocate(),
>>>>>> we try to allocate blocks only from [@phisical, @physical + len]
>>>>>> range, while other blocks are not used.
>>>>>
>>>>> Sorry, but this is a complete bullshit interface.  Userspace has
>>>>> absolutely no business even thinking of physical placement.  If you
>>>>> want to align allocations to physical block granularity boundaries
>>>>> that is the file systems job, not the applications job.
>>>>
>>>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>>>
>>>> 1)place files on a distance from each other to minimize number of extents
>>>>   on possible future growth;
>>>
>>> Speculative EOF preallocation at delayed allocation reservation time
>>> provides this.
>>>
>>>> 2)place small files in the same big block of block device.
>>>
>>> Delayed allocation during writeback packs files smaller than the
>>> stripe unit of the filesystem tightly.
>>>
>>> So, yes, we do both of these things at the same time in XFS, and
>>> have for the past 10 years.
>>>
>>>> At initial allocation time you never know, which file will stop grow in some future,
>>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
>>>> Say, if a file has not been changed for a month, it is suitable for compaction with
>>>> another files like it.
>>>>
>>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
>>>> and just share your algorithm here.
>>>>
>>>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>>>
>>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>>>
>>>> But it does not work well in many situations, and the main problem is blocks allocation
>>>> in desired place is not possible. Block allocator can't behave excellent for everything.
>>>>
>>>> If this interface bad, can you suggest another interface to make block allocator to know
>>>> the behavior expected from him in this specific case?
>>>
>>> Write once, long term data:
>>>
>>> 	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);
>>>
>>> That will allow the the storage stack to group all data with the
>>> same hint together, both in software and in hardware.
>>
>> This is interesting option, but it only applicable before write is made. And it's only
>> applicable on your own applications. My usecase is defragmentation of containers, where
>> any applications may run. Most of applications never care whether long or short-term
>> data they write.
> 
> Why is that a problem? They'll be using the default write hint (i.e.
> NONE) and so a hint aware allocation policy will be separating that
> data from all the other data written with specific hints...
> 
> And you've mentioned that your application has specific *never write
> again* selection criteria for data it is repacking. And that
> involves rewriting that data.  IOWs, you know exactly what policy
> you want to apply before you rewrite the data, and so what other
> applications do is completely irrelevant for your repacker...

It is not a rewriting data, there is moving data to new place with EXT4_IOC_MOVE_EXT.
This time extent is already allocated and its place is known. But if

>> Maybe, we can make fallocate() care about F_SET_RW_HINT? Say, if RWH_WRITE_LIFE_EXTREME
>> is set, fallocate() tries to allocate space around another inodes with the same hint.
> 
> That's exactly what I said:
>
>>> That will allow the the storage stack to group all data with the
>>> same hint together, both in software and in hardware.

... and fallocate() cares about the hint, this should work.
 
> What the filesystem does with the hint is up to the filesystem
> and the policies that it's developers decide are appropriate. If
> your filesystem doesn't do what you need, talk to the filesystem
> developers about implementing the policy you require.

Do XFS kernel defrag interfaces allow to pack some randomly chosen
small files in 1Mb blocks? Do they allow to pack small 4Kb file into
free space after a big file like in example:

before:
       BIG file                         Small file
[single 16 Mb extent - 4Kb][unused 4Kb][4Kb extent]

after:
       BIG file             Small file     
[single 16 Mb extent - 4Kb][4Kb extent][unused 4Kb]

?

Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-02-28  7:07 ` xiaohui li
@ 2020-02-28 12:46   ` Kirill Tkhai
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-02-28 12:46 UTC (permalink / raw)
  To: xiaohui li
  Cc: Theodore Y. Ts'o, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, Ext4 Developers List, linux-kernel,
	linux-fsdevel

Hi, Xiaohui,

On 28.02.2020 10:07, xiaohui li wrote:
> hi Kirill Tkhai:
> 
> I agree with your idea very much.
> I had also implemented a similar fallocate interface with the unique
> flag which call tell ext4 filesystems to allocate the special free
> physical extents.
> I had done the same work as your fallocate2 work last year.
> 
> but i think it has modified the core ext4 physical blocks allocator a
> lot, so the ext4 community may not accept it.
> so I didn't share it openly.
> 
> but i think this fallocate2 interface just same as my work is also
> very useful in android mobile phone world.
> 
> today android phone has large capacity memory and storage, just the
> same as a computer.
> and current days, customer treat phone and make full use of it by just
> the same way as they treat computer in past days.
> so after install many software and unstall them for many times in
> android phone,  the ext4 physical layout on disk has become more
> fragmented
> (the number of extents per group is large).
> at this moment, when run a sequential write workload, you will find
> the sequential write performance is very bad.
> 
> but after do defragment work on some fragmented ext4 groups, and then
> run the same workload again, the sequential write performance has
> improved greatly.
> and the fallocate2 interface is a necessary component of this
> defragment work shown above.
> 
> so i also think this fallocate2 interface is an important and useful
> tools for ext4 filesystems.

I hope fs people in their comments will help to formulate an interface,
which is suitable fs philosophy and for our necessities.

Do you have your defrag util hosted on some of github repositories?

Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-27 12:24         ` Kirill Tkhai
@ 2020-02-28 15:35           ` Andreas Dilger
  2020-02-28 21:16             ` Dave Chinner
  2020-03-02 11:07             ` Kirill Tkhai
  0 siblings, 2 replies; 35+ messages in thread
From: Andreas Dilger @ 2020-02-28 15:35 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov,
	linux-ext4, Linux Kernel Mailing List, Linux FS Devel

[-- Attachment #1: Type: text/plain, Size: 9713 bytes --]

On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> On 27.02.2020 00:51, Andreas Dilger wrote:
>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>> 
>>> 1)place files on a distance from each other to minimize number of extents
>>> on possible future growth;
>>> 2)place small files in the same big block of block device.
>>> 
>>> At initial allocation time you never know, which file will stop grow in some
>>> future, i.e. which file is suitable for compaction. This knowledge becomes
>>> available some time later.  Say, if a file has not been changed for a month,
>>> it is suitable for compaction with another files like it.
>>> 
>>> If at allocation time you can determine a file, which won't grow in the future,
>>> don't be afraid, and just share your algorithm here.
>> 
>> Very few files grow after they are initially written/closed.  Those that
>> do are almost always opened with O_APPEND (e.g. log files).  It would be
>> reasonable to have O_APPEND cause the filesystem to reserve blocks (in
>> memory at least, maybe some small amount on disk like 1/4 of the current
>> file size) for the file to grow after it is closed.  We might use the
>> same heuristic for directories that grow long after initial creation.
> 
> 1)Lets see on a real example. I created a new ext4 and started the test below:
> https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99
> 
> Here are two files written. One file is 4Kb. One file is 1Mb-4Kb.
> 
> $filefrag -e test1.tmp test2.tmp
> Filesystem type is: ef53
> File size of test1.tmp is 4096 (1 block of 4096 bytes)
> ext:     logical_offset:        physical_offset: length:   expected: flags:
>   0:        0..       0:      33793..     33793:      1:             last,eof
> test1.tmp: 1 extent found
> File size of test2.tmp is 1044480 (255 blocks of 4096 bytes)
> ext:     logical_offset:        physical_offset: length:   expected: flags:
>   0:        0..     254:      33536..     33790:    255:             last,eof
> test2.tmp: 1 extent found

The alignment of blocks in the filesystem is much easier to see if you use
"filefrag -e -x ..." to print the values in hex.  In this case, 33536 = 0x8300
so it is properly aligned on disk IMHO.

> $debugfs:  testb 33791
> Block 33791 not in use
> 
> test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp
> placement prevents us from discarding next 1Mb block.

For most filesystem uses, aligning the almost 1MB file on a 1MB boundary
is good.  That allows a full-stripe read/write for RAID, and is more
likely to align with the erase block for flash.  If it were to be allocated
after the 4KB block, then it may be that each 1MB-aligned read/write of a
large file would need to read/write two unaligned chunks per syscall.

> 2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb:
> 
> # filefrag -e test3.tmp test4.tmp
> Filesystem type is: ef53
> File size of test3.tmp is 1052672 (257 blocks of 4096 bytes)
> ext:     logical_offset:        physical_offset: length:   expected: flags:
>   0:        0..     256:      35840..     36096:    257:             last,eof
> test3.tmp: 1 extent found
> File size of test4.tmp is 1044480 (255 blocks of 4096 bytes)
> ext:     logical_offset:        physical_offset: length:   expected: flags:
>   0:        0..     254:      35072..     35326:    255:             last,eof
> test4.tmp: 1 extent found

Here again, "filefrag -e -x" would be helpful.  35840 = 0x8c00, and
35072 = 0x8900, so IMHO they are allocated properly for most uses.
Packing all files together sequentially on disk is what FAT did and
it always got very fragmented in the end.

> They don't go sequentially, and here is fragmentation starts.
> 
> After both the tests:
> $df -h
> /dev/loop0      2.0G   11M  1.8G   1% /root/mnt
> 
> Filesystem is free, all last block groups are free. E.g.,
> 
> Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED]
>  Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66
>  Inode bitmap at 288 (bg #0 + 288), csum 0x00000000
>  Inode table at 7969-8480 (bg #0 + 7969)
>  32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
>  Free blocks: 491520-524287
>  Free inodes: 122881-131072
> 
> but two files are not packed together.
> 
> So, ext4 block allocator does not work good for my workload. It even does not
> know anything about discard granuality of underlining block device. Does it?
> I assume no fs knows. Should I tell it?

You can tune the alignment of allocations via s_raid_stripe and s_raid_stride
in the ext4 superblock.  I believe these are also set by mke2fs by libdisk,
but I don't know if it takes flash erase block geometry into account.

>> The main exception there is VM images, because they are not really "files"
>> in the normal sense, but containers aggregating a lot of different files,
>> each created with patterns that are not visible to the VM host.  In that
>> case, it would be better to have the VM host tell the filesystem that the
>> IO pattern is "random" and not try to optimize until the VM is cold.
>> 
>>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>> 
>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>> 
>>> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave
>>> excellent for everything.
>>> 
>>> If this interface bad, can you suggest another interface to make block
>>> allocator to know the behavior expected from him in this specific case?
>> 
>> In ext4 there is already the "group" allocator, which combines multiple
>> small files together into a single preallocation group, so that the IO
>> to disk is large/contiguous.  The theory is that files written at the
>> same time will have similar lifespans, but that isn't always true.
>> 
>> If the files are large and still being written, the allocator will reserve
>> additional blocks (default 8MB I think) on the expectation that it will
>> continue to write until it is closed.
>> 
>> I think (correct me if I'm wrong) that your issue is with defragmenting
>> small files to free up contiguous space in the filesystem?  I think once
>> the free space is freed of small files that defragmenting large files is
>> easily done.  Anything with more than 8-16MB extents will max out most
>> storage anyway (seek rate * IO size).
> 
> My issue is mostly with files < 1Mb, because underlining device discard
> granuality is 1Mb. The result of fragmentation is that size of occupied
> 1Mb blocks of device is 1.5 times bigger, than size of really written
> data (say, df -h). And this is the problem.


Sure, and the group allocator will aggregate writes << prealloc size of
8MB by default.  If it is 1MB that doesn't qualify for group prealloc.
I think under 64KB does qualify for aggregation and unaligned writes.

>> In that case, an interesting userspace interface would be an array of
>> inode numbers (64-bit please) that should be packed together densely in
>> the order they are provided (maybe a flag for that).  That allows the
>> filesystem the freedom to find the physical blocks for the allocation,
>> while userspace can tell which files are related to each other.
> 
> So, this interface is 3-in-1:
> 
> 1)finds a placement for inodes extents;

The target allocation size would be sum(size of inodes), which should
be relatively small in your case).

> 2)assigns this space to some temporary donor inode;

Maybe yes, or just reserves that space from being allocated by anyone.

> 3)calls ext4_move_extents() for each of them.

... using the target space that was reserved earlier

> Do I understand you right?

Correct.  That is my "5 minutes thinking about an interface for grouping
small files together without exposing kernel internals" proposal for this.

> If so, then IMO it's good to start from two inodes, because here may code
> a very difficult algorithm of placement of many inodes, which may require
> much memory. Is this OK?

Well, if the files are small then it won't be a lot of memory.  Even so,
the kernel would only need to copy a few MB at a time in order to get
any decent performance, so I don't think that is a huge problem to have
several MB of dirty data in flight.

> Can we introduce a flag, that some of inode is unmovable?

There are very few flags left in the ext4_inode->i_flags for use.
You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
so we might consider it a generic "do not merge" flag if set?

> Can this interface use a knowledge about underlining device discard granuality?

As I wrote above, ext4+mballoc has a very good appreciation for alignment.
That was written for RAID storage devices, but it doesn't matter what
the reason is.  It isn't clear if flash discard alignment is easily
used (it may not be a power-of-two value or similar), but wouldn't be
harmful to try.

> In the answer to Dave, I wrote a proposition to make fallocate() care about
> i_write_hint. Could you please comment what you think about that too?

I'm not against that.  How the two interact would need to be documented
first and discussed to see if that makes sene, and then implemented.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-28 15:35           ` Andreas Dilger
@ 2020-02-28 21:16             ` Dave Chinner
  2020-02-29 20:12               ` Andreas Dilger
  2020-03-02 10:33               ` Kirill Tkhai
  2020-03-02 11:07             ` Kirill Tkhai
  1 sibling, 2 replies; 35+ messages in thread
From: Dave Chinner @ 2020-02-28 21:16 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Kirill Tkhai, Christoph Hellwig, Theodore Ts'o,
	Alexander Viro, Mike Snitzer, Jan Kara, Eric Biggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, Eric Whitney, sblbir,
	Khazhismel Kumykov, linux-ext4, Linux Kernel Mailing List,
	Linux FS Devel

On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> > On 27.02.2020 00:51, Andreas Dilger wrote:
> >> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> In that case, an interesting userspace interface would be an array of
> >> inode numbers (64-bit please) that should be packed together densely in
> >> the order they are provided (maybe a flag for that).  That allows the
> >> filesystem the freedom to find the physical blocks for the allocation,
> >> while userspace can tell which files are related to each other.
> > 
> > So, this interface is 3-in-1:
> > 
> > 1)finds a placement for inodes extents;
> 
> The target allocation size would be sum(size of inodes), which should
> be relatively small in your case).
> 
> > 2)assigns this space to some temporary donor inode;
> 
> Maybe yes, or just reserves that space from being allocated by anyone.
> 
> > 3)calls ext4_move_extents() for each of them.
> 
> ... using the target space that was reserved earlier
> 
> > Do I understand you right?
> 
> Correct.  That is my "5 minutes thinking about an interface for grouping
> small files together without exposing kernel internals" proposal for this.

You don't need any special kernel interface with XFS for this. It is
simply:

	mkdir tmpdir
	create O_TMPFILEs in tmpdir

Now all the tmpfiles you create and their data will be co-located
around the location of the tmpdir inode. This is the natural
placement policy of the filesystem. i..e the filesystem assumes that
files in the same directory are all related, so will be accessed
together and so should be located in relatively close proximity to
each other.

This is a locality optimisation technique that is older than XFS. It
works remarkably well when the filesystem can spread directories
effectively across it's address space.  It also allows userspace to
use simple techniques to group (or separate) data files as desired.
Indeed, this is how xfs_fsr directs locality for it's tmpfiles when
relocating/defragmenting data....

> > If so, then IMO it's good to start from two inodes, because here may code
> > a very difficult algorithm of placement of many inodes, which may require
> > much memory. Is this OK?
> 
> Well, if the files are small then it won't be a lot of memory.  Even so,
> the kernel would only need to copy a few MB at a time in order to get
> any decent performance, so I don't think that is a huge problem to have
> several MB of dirty data in flight.
> 
> > Can we introduce a flag, that some of inode is unmovable?
> 
> There are very few flags left in the ext4_inode->i_flags for use.
> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> so we might consider it a generic "do not merge" flag if set?

We've had that in XFS for as long as I can remember. Many
applications were sensitive to the exact layout of the files they
created themselves, so having xfs_fsr defrag/move them about would
cause performance SLAs to be broken.

Indeed, thanks to XFS, ext4 already has an interface that can be
used to set/clear a "no defrag" flag such as you are asking for.
It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
and it has special meaning for directories. From the 'man 3 xfsctl'
man page where this interface came from:

      Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
	No defragment file bit - the file should be skipped during a
	defragmentation operation. When applied to  a directory,
	new files and directories created will inherit the no-defrag
	bit.

> > Can this interface use a knowledge about underlining device discard granuality?
> 
> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
> That was written for RAID storage devices, but it doesn't matter what
> the reason is.  It isn't clear if flash discard alignment is easily
> used (it may not be a power-of-two value or similar), but wouldn't be
> harmful to try.

Yup, XFS has the similar (but more complex) alignment controls for
directing allocation to match the underlying storage
characteristics. e.g. stripe unit is also the "small file size
threshold" where the allocation policy changes from packing to
aligning and separating.

> > In the answer to Dave, I wrote a proposition to make fallocate() care about
> > i_write_hint. Could you please comment what you think about that too?
> 
> I'm not against that.  How the two interact would need to be documented
> first and discussed to see if that makes sene, and then implemented.

Individual filesystems can make their own choices as to what they do
with write hints, including ignoring them and leaving it for the
storage device to decide where to physically place the data. Which,
in many cases, ignoring the hint is the right thing for the
filesystem to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-28 21:16             ` Dave Chinner
@ 2020-02-29 20:12               ` Andreas Dilger
  2020-03-01  0:06                 ` Dave Chinner
  2020-03-02 10:33               ` Kirill Tkhai
  1 sibling, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2020-02-29 20:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill Tkhai, Christoph Hellwig, Theodore Ts'o,
	Alexander Viro, Mike Snitzer, Jan Kara, Eric Biggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, Eric Whitney, sblbir,
	Khazhismel Kumykov, linux-ext4, Linux Kernel Mailing List,
	Linux FS Devel

[-- Attachment #1: Type: text/plain, Size: 3719 bytes --]

On Feb 28, 2020, at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
>> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> 
>>> So, this interface is 3-in-1:
>>> 
>>> 1)finds a placement for inodes extents;
>> 
>> The target allocation size would be sum(size of inodes), which should
>> be relatively small in your case).
>> 
>>> 2)assigns this space to some temporary donor inode;
>> 
>> Maybe yes, or just reserves that space from being allocated by anyone.
>> 
>>> 3)calls ext4_move_extents() for each of them.
>> 
>> ... using the target space that was reserved earlier
>> 
>>> Do I understand you right?
>> 
>> Correct.  That is my "5 minutes thinking about an interface for grouping
>> small files together without exposing kernel internals" proposal for this.
> 
> You don't need any special kernel interface with XFS for this. It is
> simply:
> 
> 	mkdir tmpdir
> 	create O_TMPFILEs in tmpdir
> 
> Now all the tmpfiles you create and their data will be co-located
> around the location of the tmpdir inode. This is the natural
> placement policy of the filesystem. i..e the filesystem assumes that
> files in the same directory are all related, so will be accessed
> together and so should be located in relatively close proximity to
> each other.

Sure, this will likely get inodes allocate _close_ to each other on
ext4 as well (the new directory will preferentially be located in a
group that has free space), but it doesn't necessarily result in
all of the files being packed densely.  For 1MB+4KB and 1MB-4KB files
they will still prefer to be aligned on 1MB boundaries rather than
packed together.

>>> Can we introduce a flag, that some of inode is unmovable?
>> 
>> There are very few flags left in the ext4_inode->i_flags for use.
>> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
>> also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
>> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
>> so we might consider it a generic "do not merge" flag if set?
> 
> Indeed, thanks to XFS, ext4 already has an interface that can be
> used to set/clear a "no defrag" flag such as you are asking for.
> It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> and it has special meaning for directories. From the 'man 3 xfsctl'
> man page where this interface came from:
> 
>      Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> 	No defragment file bit - the file should be skipped during a
> 	defragmentation operation. When applied to  a directory,
> 	new files and directories created will inherit the no-defrag
> 	bit.

The interface is not the limiting factor here, but rather the number
of flags available in the inode.  Since chattr/lsattr from e2fsprogs
was used as "common ground" for a few years, there are a number of
flags in the namespace that don't actually have any meaning for ext4.

One of those flags is:

#define EXT4_NOTAIL_FL    0x00008000 /* file tail should not be merged */

This was added for Reiserfs, but it is not used by any other filesystem,
so generalizing it slightly to mean "no migrate" is reasonable.  That
doesn't affect Reiserfs in any way, and it would still be possible to
also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag.

It wouldn't be any issue at all to chose an arbitrary unused flag to
store this in ext4 inode internally, except that chattr/lsattr are used
by a variety of different filesystems, so whatever flag is chosen will
immediately also apply to any other filesystem that users use those
tools on.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-28 12:41             ` Kirill Tkhai
@ 2020-02-29 22:41               ` Dave Chinner
  2020-03-02 10:17                 ` Kirill Tkhai
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2020-02-29 22:41 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On Fri, Feb 28, 2020 at 03:41:51PM +0300, Kirill Tkhai wrote:
> On 28.02.2020 00:56, Dave Chinner wrote:
> > On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote:
> >> On 27.02.2020 10:33, Dave Chinner wrote:
> >>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
> >>>> On 26.02.2020 18:55, Christoph Hellwig wrote:
> >>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
> >>>>>> This adds a support of physical hint for fallocate2() syscall.
> >>>>>> In case of @physical argument is set for ext4_fallocate(),
> >>>>>> we try to allocate blocks only from [@phisical, @physical + len]
> >>>>>> range, while other blocks are not used.
> >>>>>
> >>>>> Sorry, but this is a complete bullshit interface.  Userspace has
> >>>>> absolutely no business even thinking of physical placement.  If you
> >>>>> want to align allocations to physical block granularity boundaries
> >>>>> that is the file systems job, not the applications job.
> >>>>
> >>>> Why? There are two contradictory actions that filesystem can't do at the same time:
> >>>>
> >>>> 1)place files on a distance from each other to minimize number of extents
> >>>>   on possible future growth;
> >>>
> >>> Speculative EOF preallocation at delayed allocation reservation time
> >>> provides this.
> >>>
> >>>> 2)place small files in the same big block of block device.
> >>>
> >>> Delayed allocation during writeback packs files smaller than the
> >>> stripe unit of the filesystem tightly.
> >>>
> >>> So, yes, we do both of these things at the same time in XFS, and
> >>> have for the past 10 years.
> >>>
> >>>> At initial allocation time you never know, which file will stop grow in some future,
> >>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
> >>>> Say, if a file has not been changed for a month, it is suitable for compaction with
> >>>> another files like it.
> >>>>
> >>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
> >>>> and just share your algorithm here.
> >>>>
> >>>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
> >>>>
> >>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
> >>>>
> >>>> But it does not work well in many situations, and the main problem is blocks allocation
> >>>> in desired place is not possible. Block allocator can't behave excellent for everything.
> >>>>
> >>>> If this interface bad, can you suggest another interface to make block allocator to know
> >>>> the behavior expected from him in this specific case?
> >>>
> >>> Write once, long term data:
> >>>
> >>> 	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);
> >>>
> >>> That will allow the the storage stack to group all data with the
> >>> same hint together, both in software and in hardware.
> >>
> >> This is interesting option, but it only applicable before write is made. And it's only
> >> applicable on your own applications. My usecase is defragmentation of containers, where
> >> any applications may run. Most of applications never care whether long or short-term
> >> data they write.
> > 
> > Why is that a problem? They'll be using the default write hint (i.e.
> > NONE) and so a hint aware allocation policy will be separating that
> > data from all the other data written with specific hints...
> > 
> > And you've mentioned that your application has specific *never write
> > again* selection criteria for data it is repacking. And that
> > involves rewriting that data.  IOWs, you know exactly what policy
> > you want to apply before you rewrite the data, and so what other
> > applications do is completely irrelevant for your repacker...
> 
> It is not a rewriting data, there is moving data to new place with EXT4_IOC_MOVE_EXT.

"rewriting" is a technical term for reading data at rest and writing
it again, whether it be to the same location or to some other
location. Changing physical location of data, by definition,
requires rewriting data.

EXT4_IOC_MOVE_EXT = data rewrite + extent swap to update the
metadata in the original file to point at the new data. Hence it
appears to "move" from userspace perspective (hence the name) but
under the covers it is rewriting data and fiddling pointers...

> > What the filesystem does with the hint is up to the filesystem
> > and the policies that it's developers decide are appropriate. If
> > your filesystem doesn't do what you need, talk to the filesystem
> > developers about implementing the policy you require.
> 
> Do XFS kernel defrag interfaces allow to pack some randomly chosen
> small files in 1Mb blocks? Do they allow to pack small 4Kb file into
> free space after a big file like in example:

No. Randomly selecting small holes for small file writes is a
terrible idea from a performance perspective. Hence filling tiny
holes (not randomly!) is often only done for metadata allocation
(e.g. extent map blocks, which are largely random access anyway) or
if there is no other choice for data (e.g. at ENOSPC).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-29 20:12               ` Andreas Dilger
@ 2020-03-01  0:06                 ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2020-03-01  0:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Kirill Tkhai, Christoph Hellwig, Theodore Ts'o,
	Alexander Viro, Mike Snitzer, Jan Kara, Eric Biggers, riteshh,
	krisman, surajjs, dmonakhov, mbobrowski, Eric Whitney, sblbir,
	Khazhismel Kumykov, linux-ext4, Linux Kernel Mailing List,
	Linux FS Devel

On Sat, Feb 29, 2020 at 01:12:52PM -0700, Andreas Dilger wrote:
> On Feb 28, 2020, at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
> >> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>> 
> >>> So, this interface is 3-in-1:
> >>> 
> >>> 1)finds a placement for inodes extents;
> >> 
> >> The target allocation size would be sum(size of inodes), which should
> >> be relatively small in your case).
> >> 
> >>> 2)assigns this space to some temporary donor inode;
> >> 
> >> Maybe yes, or just reserves that space from being allocated by anyone.
> >> 
> >>> 3)calls ext4_move_extents() for each of them.
> >> 
> >> ... using the target space that was reserved earlier
> >> 
> >>> Do I understand you right?
> >> 
> >> Correct.  That is my "5 minutes thinking about an interface for grouping
> >> small files together without exposing kernel internals" proposal for this.
> > 
> > You don't need any special kernel interface with XFS for this. It is
> > simply:
> > 
> > 	mkdir tmpdir
> > 	create O_TMPFILEs in tmpdir
> > 
> > Now all the tmpfiles you create and their data will be co-located
> > around the location of the tmpdir inode. This is the natural
> > placement policy of the filesystem. i..e the filesystem assumes that
> > files in the same directory are all related, so will be accessed
> > together and so should be located in relatively close proximity to
> > each other.
> 
> Sure, this will likely get inodes allocate _close_ to each other on
> ext4 as well (the new directory will preferentially be located in a
> group that has free space), but it doesn't necessarily result in
> all of the files being packed densely.  For 1MB+4KB and 1MB-4KB files
> they will still prefer to be aligned on 1MB boundaries rather than
> packed together.

Userspace can control that, too, simply by choosing to relocate only
small files into a single directory, then relocating the large files
in a separate set of operations after flushing the small files and
having the packed tightly.

Seriously, userspace has a *lot* of control of how data is located
and packed simply by grouping the IO it wants to be written together
into the same logical groups (i.e. directories) in the same temporal
IO domain...

> >>> Can we introduce a flag, that some of inode is unmovable?
> >> 
> >> There are very few flags left in the ext4_inode->i_flags for use.
> >> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> >> also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
> >> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> >> so we might consider it a generic "do not merge" flag if set?
> > 
> > Indeed, thanks to XFS, ext4 already has an interface that can be
> > used to set/clear a "no defrag" flag such as you are asking for.
> > It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> > In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> > and it has special meaning for directories. From the 'man 3 xfsctl'
> > man page where this interface came from:
> > 
> >      Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> > 	No defragment file bit - the file should be skipped during a
> > 	defragmentation operation. When applied to  a directory,
> > 	new files and directories created will inherit the no-defrag
> > 	bit.
> 
> The interface is not the limiting factor here, but rather the number
> of flags available in the inode.

Yes, that's an internal ext4 issue, not a userspace API problem.

> Since chattr/lsattr from e2fsprogs
> was used as "common ground" for a few years, there are a number of
> flags in the namespace that don't actually have any meaning for ext4.

Yes, that's a shitty API bed that extN made for itself, isn't it?
We've sucked at API design for a long, long time. :/

But the chattr userspace application is also irrelevant to the
problem at hand: it already uses the FS_IOC_FS[GS]ETXATTR ioctl
interface for changing project quota IDs and the per-inode
inheritance flag. Hence how it manages the new flag is irrelevant,
but it also means we can't change the definition or behaviour of
existing flags it controls regardless of what filesystem those flags
act on.

> One of those flags is:
> 
> #define EXT4_NOTAIL_FL    0x00008000 /* file tail should not be merged */
> 
> This was added for Reiserfs, but it is not used by any other filesystem,
> so generalizing it slightly to mean "no migrate" is reasonable.  That
> doesn't affect Reiserfs in any way, and it would still be possible to
> also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag.

Yes, ext4 could do that, but we are not allowed to redefine long
standing userspace API flags to mean something completely different.
That's effectively what you are proposing here if you allow ext4 to
manipulate the same on-disk flag via both FS_NOTAIL_FL and
FS_XFLAG_NODEFRAG. ie. the  FS_NOTAIL_FL flag is manipulated by
FS_IOC_[GS]ETFLAGS and is marked both as user visible and modifiable
by ext4 even though ti does nothing.

IOWs, to redefine this on-disk flag we would also need to have
EXT4_IOC_GETFLAGS / EXT4_IOC_SETFLAGS reject attempts to set/clear
FS_NOTAIL_FL with EOPNOTSUPP or EINVAL. Which we then risk breaking
applications that use this flag even though ext4 does not implement
anything other than setting/clearing the flag on demand.

IOWs, we cannot change the meaning of the EXT4_NOTAIL_FL on disk
flag, because that either changes the user visible behaviour of the
on-disk flag or it changes the definition of a userspace API flag to
mean something it was never meant to mean. Neither of those things
are acceptible changes to make to a generic userspace API.

> It wouldn't be any issue at all to chose an arbitrary unused flag to
> store this in ext4 inode internally, except that chattr/lsattr are used
> by a variety of different filesystems, so whatever flag is chosen will
> immediately also apply to any other filesystem that users use those
> tools on.

The impact on userspace is only a problem if you re-use a flag ext4
already exposes to userspace. And that is not allowed if it causes
the userspace API to be globally redefined for everyone. Which,
clearly, it would.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-29 22:41               ` Dave Chinner
@ 2020-03-02 10:17                 ` Kirill Tkhai
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-02 10:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, tytso, viro, adilger.kernel, snitzer, jack,
	ebiggers, riteshh, krisman, surajjs, dmonakhov, mbobrowski,
	enwlinux, sblbir, khazhy, linux-ext4, linux-kernel,
	linux-fsdevel

On 01.03.2020 01:41, Dave Chinner wrote:
> On Fri, Feb 28, 2020 at 03:41:51PM +0300, Kirill Tkhai wrote:
>> On 28.02.2020 00:56, Dave Chinner wrote:
>>> On Thu, Feb 27, 2020 at 02:12:53PM +0300, Kirill Tkhai wrote:
>>>> On 27.02.2020 10:33, Dave Chinner wrote:
>>>>> On Wed, Feb 26, 2020 at 11:05:23PM +0300, Kirill Tkhai wrote:
>>>>>> On 26.02.2020 18:55, Christoph Hellwig wrote:
>>>>>>> On Wed, Feb 26, 2020 at 04:41:16PM +0300, Kirill Tkhai wrote:
>>>>>>>> This adds a support of physical hint for fallocate2() syscall.
>>>>>>>> In case of @physical argument is set for ext4_fallocate(),
>>>>>>>> we try to allocate blocks only from [@phisical, @physical + len]
>>>>>>>> range, while other blocks are not used.
>>>>>>>
>>>>>>> Sorry, but this is a complete bullshit interface.  Userspace has
>>>>>>> absolutely no business even thinking of physical placement.  If you
>>>>>>> want to align allocations to physical block granularity boundaries
>>>>>>> that is the file systems job, not the applications job.
>>>>>>
>>>>>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>>>>>
>>>>>> 1)place files on a distance from each other to minimize number of extents
>>>>>>   on possible future growth;
>>>>>
>>>>> Speculative EOF preallocation at delayed allocation reservation time
>>>>> provides this.
>>>>>
>>>>>> 2)place small files in the same big block of block device.
>>>>>
>>>>> Delayed allocation during writeback packs files smaller than the
>>>>> stripe unit of the filesystem tightly.
>>>>>
>>>>> So, yes, we do both of these things at the same time in XFS, and
>>>>> have for the past 10 years.
>>>>>
>>>>>> At initial allocation time you never know, which file will stop grow in some future,
>>>>>> i.e. which file is suitable for compaction. This knowledge becomes available some time later.
>>>>>> Say, if a file has not been changed for a month, it is suitable for compaction with
>>>>>> another files like it.
>>>>>>
>>>>>> If at allocation time you can determine a file, which won't grow in the future, don't be afraid,
>>>>>> and just share your algorithm here.
>>>>>>
>>>>>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>>>>>
>>>>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>>>>>
>>>>>> But it does not work well in many situations, and the main problem is blocks allocation
>>>>>> in desired place is not possible. Block allocator can't behave excellent for everything.
>>>>>>
>>>>>> If this interface bad, can you suggest another interface to make block allocator to know
>>>>>> the behavior expected from him in this specific case?
>>>>>
>>>>> Write once, long term data:
>>>>>
>>>>> 	fcntl(fd, F_SET_RW_HINT, RWH_WRITE_LIFE_EXTREME);
>>>>>
>>>>> That will allow the the storage stack to group all data with the
>>>>> same hint together, both in software and in hardware.
>>>>
>>>> This is interesting option, but it only applicable before write is made. And it's only
>>>> applicable on your own applications. My usecase is defragmentation of containers, where
>>>> any applications may run. Most of applications never care whether long or short-term
>>>> data they write.
>>>
>>> Why is that a problem? They'll be using the default write hint (i.e.
>>> NONE) and so a hint aware allocation policy will be separating that
>>> data from all the other data written with specific hints...
>>>
>>> And you've mentioned that your application has specific *never write
>>> again* selection criteria for data it is repacking. And that
>>> involves rewriting that data.  IOWs, you know exactly what policy
>>> you want to apply before you rewrite the data, and so what other
>>> applications do is completely irrelevant for your repacker...
>>
>> It is not a rewriting data, there is moving data to new place with EXT4_IOC_MOVE_EXT.
> 
> "rewriting" is a technical term for reading data at rest and writing
> it again, whether it be to the same location or to some other
> location. Changing physical location of data, by definition,
> requires rewriting data.
> 
> EXT4_IOC_MOVE_EXT = data rewrite + extent swap to update the
> metadata in the original file to point at the new data. Hence it
> appears to "move" from userspace perspective (hence the name) but
> under the covers it is rewriting data and fiddling pointers...

Yeah, I understand this. I mean that file remains accessible for external
users, and external reads/writes are handled properly, and state of file
remains consistent.

>>> What the filesystem does with the hint is up to the filesystem
>>> and the policies that it's developers decide are appropriate. If
>>> your filesystem doesn't do what you need, talk to the filesystem
>>> developers about implementing the policy you require.
>>
>> Do XFS kernel defrag interfaces allow to pack some randomly chosen
>> small files in 1Mb blocks? Do they allow to pack small 4Kb file into
>> free space after a big file like in example:
> 
> No. Randomly selecting small holes for small file writes is a
> terrible idea from a performance perspective. Hence filling tiny
> holes (not randomly!) is often only done for metadata allocation
> (e.g. extent map blocks, which are largely random access anyway) or
> if there is no other choice for data (e.g. at ENOSPC).

I'm speaking more about the possibility. "Random" is from block allocator
view. But from user view they are not random, these are unmodifiable files.
Say, static content of website never changes, and these files may be packed
together to decrease number of occupied 1Mb disc blocks.

To pack all files on a disc together is terrible idea, I'm 100% agree with you.

Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-28 21:16             ` Dave Chinner
  2020-02-29 20:12               ` Andreas Dilger
@ 2020-03-02 10:33               ` Kirill Tkhai
  1 sibling, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-02 10:33 UTC (permalink / raw)
  To: Dave Chinner, Andreas Dilger
  Cc: Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov,
	linux-ext4, Linux Kernel Mailing List, Linux FS Devel

On 29.02.2020 00:16, Dave Chinner wrote:
> On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
>> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> On 27.02.2020 00:51, Andreas Dilger wrote:
>>>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> In that case, an interesting userspace interface would be an array of
>>>> inode numbers (64-bit please) that should be packed together densely in
>>>> the order they are provided (maybe a flag for that).  That allows the
>>>> filesystem the freedom to find the physical blocks for the allocation,
>>>> while userspace can tell which files are related to each other.
>>>
>>> So, this interface is 3-in-1:
>>>
>>> 1)finds a placement for inodes extents;
>>
>> The target allocation size would be sum(size of inodes), which should
>> be relatively small in your case).
>>
>>> 2)assigns this space to some temporary donor inode;
>>
>> Maybe yes, or just reserves that space from being allocated by anyone.
>>
>>> 3)calls ext4_move_extents() for each of them.
>>
>> ... using the target space that was reserved earlier
>>
>>> Do I understand you right?
>>
>> Correct.  That is my "5 minutes thinking about an interface for grouping
>> small files together without exposing kernel internals" proposal for this.
> 
> You don't need any special kernel interface with XFS for this. It is
> simply:
> 
> 	mkdir tmpdir
> 	create O_TMPFILEs in tmpdir
> 
> Now all the tmpfiles you create and their data will be co-located
> around the location of the tmpdir inode. This is the natural
> placement policy of the filesystem. i..e the filesystem assumes that
> files in the same directory are all related, so will be accessed
> together and so should be located in relatively close proximity to
> each other.

Hm, but does this help for my problem? 1)allocate two files in the same directory
and then 2)move source files there?

In case of I have two 512K files ext4 allows the same:

1)fallocate() 1M continuous space (this should ends with success in case of disc
is not almost full);
2)move both files into newly allocated space.

But this doubles IO, since both of files have to be moved. The ideal solution
would be to allocate space around one of them and to move the second file there.

> This is a locality optimisation technique that is older than XFS. It
> works remarkably well when the filesystem can spread directories
> effectively across it's address space.  It also allows userspace to
> use simple techniques to group (or separate) data files as desired.
> Indeed, this is how xfs_fsr directs locality for it's tmpfiles when
> relocating/defragmenting data....
> 
>>> If so, then IMO it's good to start from two inodes, because here may code
>>> a very difficult algorithm of placement of many inodes, which may require
>>> much memory. Is this OK?
>>
>> Well, if the files are small then it won't be a lot of memory.  Even so,
>> the kernel would only need to copy a few MB at a time in order to get
>> any decent performance, so I don't think that is a huge problem to have
>> several MB of dirty data in flight.
>>
>>> Can we introduce a flag, that some of inode is unmovable?
>>
>> There are very few flags left in the ext4_inode->i_flags for use.
>> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
>> also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
>> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
>> so we might consider it a generic "do not merge" flag if set?
> 
> We've had that in XFS for as long as I can remember. Many
> applications were sensitive to the exact layout of the files they
> created themselves, so having xfs_fsr defrag/move them about would
> cause performance SLAs to be broken.
> 
> Indeed, thanks to XFS, ext4 already has an interface that can be
> used to set/clear a "no defrag" flag such as you are asking for.
> It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> and it has special meaning for directories. From the 'man 3 xfsctl'
> man page where this interface came from:
> 
>       Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> 	No defragment file bit - the file should be skipped during a
> 	defragmentation operation. When applied to  a directory,
> 	new files and directories created will inherit the no-defrag
> 	bit.
> 
>>> Can this interface use a knowledge about underlining device discard granuality?
>>
>> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
>> That was written for RAID storage devices, but it doesn't matter what
>> the reason is.  It isn't clear if flash discard alignment is easily
>> used (it may not be a power-of-two value or similar), but wouldn't be
>> harmful to try.
> 
> Yup, XFS has the similar (but more complex) alignment controls for
> directing allocation to match the underlying storage
> characteristics. e.g. stripe unit is also the "small file size
> threshold" where the allocation policy changes from packing to
> aligning and separating.
> 
>>> In the answer to Dave, I wrote a proposition to make fallocate() care about
>>> i_write_hint. Could you please comment what you think about that too?
>>
>> I'm not against that.  How the two interact would need to be documented
>> first and discussed to see if that makes sene, and then implemented.
> 
> Individual filesystems can make their own choices as to what they do
> with write hints, including ignoring them and leaving it for the
> storage device to decide where to physically place the data. Which,
> in many cases, ignoring the hint is the right thing for the
> filesystem to do...
> 
> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
  2020-02-28 15:35           ` Andreas Dilger
  2020-02-28 21:16             ` Dave Chinner
@ 2020-03-02 11:07             ` Kirill Tkhai
  1 sibling, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-02 11:07 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov,
	linux-ext4, Linux Kernel Mailing List, Linux FS Devel

On 28.02.2020 18:35, Andreas Dilger wrote:
> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 27.02.2020 00:51, Andreas Dilger wrote:
>>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> Why? There are two contradictory actions that filesystem can't do at the same time:
>>>>
>>>> 1)place files on a distance from each other to minimize number of extents
>>>> on possible future growth;
>>>> 2)place small files in the same big block of block device.
>>>>
>>>> At initial allocation time you never know, which file will stop grow in some
>>>> future, i.e. which file is suitable for compaction. This knowledge becomes
>>>> available some time later.  Say, if a file has not been changed for a month,
>>>> it is suitable for compaction with another files like it.
>>>>
>>>> If at allocation time you can determine a file, which won't grow in the future,
>>>> don't be afraid, and just share your algorithm here.
>>>
>>> Very few files grow after they are initially written/closed.  Those that
>>> do are almost always opened with O_APPEND (e.g. log files).  It would be
>>> reasonable to have O_APPEND cause the filesystem to reserve blocks (in
>>> memory at least, maybe some small amount on disk like 1/4 of the current
>>> file size) for the file to grow after it is closed.  We might use the
>>> same heuristic for directories that grow long after initial creation.
>>
>> 1)Lets see on a real example. I created a new ext4 and started the test below:
>> https://gist.github.com/tkhai/afd8458c0a3cc082a1230370c7d89c99
>>
>> Here are two files written. One file is 4Kb. One file is 1Mb-4Kb.
>>
>> $filefrag -e test1.tmp test2.tmp
>> Filesystem type is: ef53
>> File size of test1.tmp is 4096 (1 block of 4096 bytes)
>> ext:     logical_offset:        physical_offset: length:   expected: flags:
>>   0:        0..       0:      33793..     33793:      1:             last,eof
>> test1.tmp: 1 extent found
>> File size of test2.tmp is 1044480 (255 blocks of 4096 bytes)
>> ext:     logical_offset:        physical_offset: length:   expected: flags:
>>   0:        0..     254:      33536..     33790:    255:             last,eof
>> test2.tmp: 1 extent found
> 
> The alignment of blocks in the filesystem is much easier to see if you use
> "filefrag -e -x ..." to print the values in hex.  In this case, 33536 = 0x8300
> so it is properly aligned on disk IMHO.
> 
>> $debugfs:  testb 33791
>> Block 33791 not in use
>>
>> test2.tmp started from 131Mb. In case of discard granuality is 1Mb, test1.tmp
>> placement prevents us from discarding next 1Mb block.
> 
> For most filesystem uses, aligning the almost 1MB file on a 1MB boundary
> is good.  That allows a full-stripe read/write for RAID, and is more
> likely to align with the erase block for flash.  If it were to be allocated
> after the 4KB block, then it may be that each 1MB-aligned read/write of a
> large file would need to read/write two unaligned chunks per syscall.
> 
>> 2)Another example. Let write two files: 1Mb-4Kb and 1Mb+4Kb:
>>
>> # filefrag -e test3.tmp test4.tmp
>> Filesystem type is: ef53
>> File size of test3.tmp is 1052672 (257 blocks of 4096 bytes)
>> ext:     logical_offset:        physical_offset: length:   expected: flags:
>>   0:        0..     256:      35840..     36096:    257:             last,eof
>> test3.tmp: 1 extent found
>> File size of test4.tmp is 1044480 (255 blocks of 4096 bytes)
>> ext:     logical_offset:        physical_offset: length:   expected: flags:
>>   0:        0..     254:      35072..     35326:    255:             last,eof
>> test4.tmp: 1 extent found
> 
> Here again, "filefrag -e -x" would be helpful.  35840 = 0x8c00, and
> 35072 = 0x8900, so IMHO they are allocated properly for most uses.
> Packing all files together sequentially on disk is what FAT did and
> it always got very fragmented in the end.
> 
>> They don't go sequentially, and here is fragmentation starts.
>>
>> After both the tests:
>> $df -h
>> /dev/loop0      2.0G   11M  1.8G   1% /root/mnt
>>
>> Filesystem is free, all last block groups are free. E.g.,
>>
>> Group 15: (Blocks 491520-524287) csum 0x3ef5 [INODE_UNINIT, ITABLE_ZEROED]
>>  Block bitmap at 272 (bg #0 + 272), csum 0xd52c1f66
>>  Inode bitmap at 288 (bg #0 + 288), csum 0x00000000
>>  Inode table at 7969-8480 (bg #0 + 7969)
>>  32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
>>  Free blocks: 491520-524287
>>  Free inodes: 122881-131072
>>
>> but two files are not packed together.
>>
>> So, ext4 block allocator does not work good for my workload. It even does not
>> know anything about discard granuality of underlining block device. Does it?
>> I assume no fs knows. Should I tell it?
> 
> You can tune the alignment of allocations via s_raid_stripe and s_raid_stride
> in the ext4 superblock.  I believe these are also set by mke2fs by libdisk,
> but I don't know if it takes flash erase block geometry into account.
> 
>>> The main exception there is VM images, because they are not really "files"
>>> in the normal sense, but containers aggregating a lot of different files,
>>> each created with patterns that are not visible to the VM host.  In that
>>> case, it would be better to have the VM host tell the filesystem that the
>>> IO pattern is "random" and not try to optimize until the VM is cold.
>>>
>>>> In Virtuozzo we tried to compact ext4 with existing kernel interface:
>>>>
>>>> https://github.com/dmonakhov/e2fsprogs/blob/e4defrag2/misc/e4defrag2.c
>>>>
>>>> But it does not work well in many situations, and the main problem is blocks allocation in desired place is not possible. Block allocator can't behave
>>>> excellent for everything.
>>>>
>>>> If this interface bad, can you suggest another interface to make block
>>>> allocator to know the behavior expected from him in this specific case?
>>>
>>> In ext4 there is already the "group" allocator, which combines multiple
>>> small files together into a single preallocation group, so that the IO
>>> to disk is large/contiguous.  The theory is that files written at the
>>> same time will have similar lifespans, but that isn't always true.
>>>
>>> If the files are large and still being written, the allocator will reserve
>>> additional blocks (default 8MB I think) on the expectation that it will
>>> continue to write until it is closed.
>>>
>>> I think (correct me if I'm wrong) that your issue is with defragmenting
>>> small files to free up contiguous space in the filesystem?  I think once
>>> the free space is freed of small files that defragmenting large files is
>>> easily done.  Anything with more than 8-16MB extents will max out most
>>> storage anyway (seek rate * IO size).
>>
>> My issue is mostly with files < 1Mb, because underlining device discard
>> granuality is 1Mb. The result of fragmentation is that size of occupied
>> 1Mb blocks of device is 1.5 times bigger, than size of really written
>> data (say, df -h). And this is the problem.
> 
> 
> Sure, and the group allocator will aggregate writes << prealloc size of
> 8MB by default.  If it is 1MB that doesn't qualify for group prealloc.
> I think under 64KB does qualify for aggregation and unaligned writes.
> 
>>> In that case, an interesting userspace interface would be an array of
>>> inode numbers (64-bit please) that should be packed together densely in
>>> the order they are provided (maybe a flag for that).  That allows the
>>> filesystem the freedom to find the physical blocks for the allocation,
>>> while userspace can tell which files are related to each other.
>>
>> So, this interface is 3-in-1:
>>
>> 1)finds a placement for inodes extents;
> 
> The target allocation size would be sum(size of inodes), which should
> be relatively small in your case).
> 
>> 2)assigns this space to some temporary donor inode;
> 
> Maybe yes, or just reserves that space from being allocated by anyone.
> 
>> 3)calls ext4_move_extents() for each of them.
> 
> ... using the target space that was reserved earlier
> 
>> Do I understand you right?
> 
> Correct.  That is my "5 minutes thinking about an interface for grouping
> small files together without exposing kernel internals" proposal for this.

Ok. I'll think about the prototype and then public to the mailing list.
 
>> If so, then IMO it's good to start from two inodes, because here may code
>> a very difficult algorithm of placement of many inodes, which may require
>> much memory. Is this OK?
> 
> Well, if the files are small then it won't be a lot of memory.  Even so,
> the kernel would only need to copy a few MB at a time in order to get
> any decent performance, so I don't think that is a huge problem to have
> several MB of dirty data in flight.

I mean not in-flight IO, but memory for all logic of files placement.
Userspace may build multi-step algoritm, which is hidden for kernel:
pack two files together, then decrease number of extents of some third
file, then pack something else.

Also, files related to different directories should be packed together,
but it does not look good for kernel to look for files directories
by inodes (our interface is about 64-bit inodes numbers, sure?).

For me it does not look good, kernel iterates over all files and looks for
a placement for a specific file, since this is just excess work for kernel.
Usually, both the files are chosen by userspace, and the userspace does not
want to move more then one of them at time.

>> Can we introduce a flag, that some of inode is unmovable?
> 
> There are very few flags left in the ext4_inode->i_flags for use.
> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> also have other semantics.  The EXT4_NOTAIL_FL is for not merging the
> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> so we might consider it a generic "do not merge" flag if set?
> 
>> Can this interface use a knowledge about underlining device discard granuality?
> 
> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
> That was written for RAID storage devices, but it doesn't matter what
> the reason is.  It isn't clear if flash discard alignment is easily
> used (it may not be a power-of-two value or similar), but wouldn't be
> harmful to try.
> 
>> In the answer to Dave, I wrote a proposition to make fallocate() care about
>> i_write_hint. Could you please comment what you think about that too?
> 
> I'm not against that.  How the two interact would need to be documented
> first and discussed to see if that makes sene, and then implemented.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
                   ` (6 preceding siblings ...)
  2020-02-28  7:07 ` xiaohui li
@ 2020-03-02 16:56 ` Theodore Y. Ts'o
  2020-03-03  9:57   ` Kirill Tkhai
  7 siblings, 1 reply; 35+ messages in thread
From: Theodore Y. Ts'o @ 2020-03-02 16:56 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: viro, adilger.kernel, snitzer, jack, ebiggers, riteshh, krisman,
	surajjs, dmonakhov, mbobrowski, enwlinux, sblbir, khazhy,
	linux-ext4, linux-kernel, linux-fsdevel

Kirill,

In a couple of your comments on this patch series, you mentioned
"defragmentation".  Is that because you're trying to use this as part
of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?

If that's the case, you should note that input parameter for that
ioctl is:

struct move_extent {
	__u32 reserved;		/* should be zero */
	__u32 donor_fd;		/* donor file descriptor */
	__u64 orig_start;	/* logical start offset in block for orig */
	__u64 donor_start;	/* logical start offset in block for donor */
	__u64 len;		/* block length to be moved */
	__u64 moved_len;	/* moved block length */
};

Note that the donor_start is separate from the start of the file that
is being defragged.  So you could have the userspace application
fallocate a large chunk of space for that donor file, and then use
that donor file to defrag multiple files if you want to close pack
them.

Many years ago, back when LSF/MM colocated with a larger
storage-focused conference so we could manage to origanize an ext4
developer's workshop, we had talked about ways we create kernel
support for a more powerful userspace defragger, which could also
defragment the free space, so that future block allocations were more
likely to be successful.

The discussions surrounded interfaces where userspace could block (or
at least strongly dissuade unless the only other alternative was
returning ENOSPC) the kernel from allocating out of a certain number
of block groups.  And then also to have an interface where for a
particular process (namely, the defragger), to make the kernel
strongly prefer that allocations come out of an ordered list of block
groups.

(Of course these days, now that the cool kids are all embracing eBPF,
one could imagine a privileged interface where the defragger could
install some kind of eBPF program which provided enhanced policy to
ext4's block allocator.)

No one ever really followed through with this, in part because the
details of allowing userspace (and it would have to be privileged
userspace) to dictate policy to the block allocator has all sorts of
potential pitfalls, and in part because no company was really
interested in funding the engineering work.  In addition, I'll note
that the windows world, the need and interest for defragging has gone
done significantly with the advent more sophisticated file systems
like NTFSv5, which doesn't need defragging nearly as often as say, the
FAT file system.  And I think if anything, the interst in doing work
with e4defrag has decreased even more over the years.

That being said, there has been some interest in making changes to
both the block allocator and some kind of on-line defrag which is
optimized for low-end flash (such as the kind found in android
handsets).  There, the need to be careful that we don't end up
increasing the write wearout becomes even more critical, although the
GC work which f2fs does involve extra moving around of data blocks,
and phones have seemed to do fine.  Of course, the typical phone only
has to last 2-3 years before the battery dies, the screen gets
cracked, and/or the owner decides they want the latest cool toy from
the phone manufacturers.  :-)

In any case, if your goal is really some interface to support on-line
defragmentation for ext4, you want to consider whether the
EXT4_IOC_MOVE_EXTENT interface is sufficiently powerful such that you
don't really need to mess around with new block allocation hints.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-02 16:56 ` Theodore Y. Ts'o
@ 2020-03-03  9:57   ` Kirill Tkhai
  2020-03-03 16:55     ` Theodore Y. Ts'o
  2020-03-11 19:26     ` Andreas Dilger
  0 siblings, 2 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-03  9:57 UTC (permalink / raw)
  To: Theodore Y. Ts'o, adilger.kernel
  Cc: viro, snitzer, jack, ebiggers, riteshh, krisman, surajjs,
	dmonakhov, mbobrowski, enwlinux, sblbir, khazhy, linux-ext4,
	linux-kernel, linux-fsdevel

Hi, Ted,

On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
> Kirill,
> 
> In a couple of your comments on this patch series, you mentioned
> "defragmentation".  Is that because you're trying to use this as part
> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
> 
> If that's the case, you should note that input parameter for that
> ioctl is:
> 
> struct move_extent {
> 	__u32 reserved;		/* should be zero */
> 	__u32 donor_fd;		/* donor file descriptor */
> 	__u64 orig_start;	/* logical start offset in block for orig */
> 	__u64 donor_start;	/* logical start offset in block for donor */
> 	__u64 len;		/* block length to be moved */
> 	__u64 moved_len;	/* moved block length */
> };
> 
> Note that the donor_start is separate from the start of the file that
> is being defragged.  So you could have the userspace application
> fallocate a large chunk of space for that donor file, and then use
> that donor file to defrag multiple files if you want to close pack
> them.

The practice shows it's not so. Your suggestion was the first thing we tried,
but it works bad and just doubles/triples IO.

Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:

[[512Kb file][512Kb free]][[512Kb file][512Kb free]]

We want to pack both of files in the same 1Mb cluster. Packed together on block device,
they will be in the same server of underlining distributed storage file system.
This gives a big performance improvement, and this is the price I aimed.

In case of I fallocate a large hunk for both of them, I have to move them
both to this new hunk. So, instead of moving 512Kb of data, we will have to move
1Mb of data, i.e. double size, which is counterproductive.

Imaging another situation, when we have 
[[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]

Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
And this is the thing I try prevent with finding a suitable new interface.

> Many years ago, back when LSF/MM colocated with a larger
> storage-focused conference so we could manage to origanize an ext4
> developer's workshop, we had talked about ways we create kernel
> support for a more powerful userspace defragger, which could also
> defragment the free space, so that future block allocations were more
> likely to be successful.
> 
> The discussions surrounded interfaces where userspace could block (or
> at least strongly dissuade unless the only other alternative was
> returning ENOSPC) the kernel from allocating out of a certain number
> of block groups.  And then also to have an interface where for a
> particular process (namely, the defragger), to make the kernel
> strongly prefer that allocations come out of an ordered list of block
> groups.
> 
> (Of course these days, now that the cool kids are all embracing eBPF,
> one could imagine a privileged interface where the defragger could
> install some kind of eBPF program which provided enhanced policy to
> ext4's block allocator.)
> 
> No one ever really followed through with this, in part because the
> details of allowing userspace (and it would have to be privileged
> userspace) to dictate policy to the block allocator has all sorts of
> potential pitfalls, and in part because no company was really
> interested in funding the engineering work.  In addition, I'll note
> that the windows world, the need and interest for defragging has gone
> done significantly with the advent more sophisticated file systems
> like NTFSv5, which doesn't need defragging nearly as often as say, the
> FAT file system.  And I think if anything, the interst in doing work
> with e4defrag has decreased even more over the years.
> 
> That being said, there has been some interest in making changes to
> both the block allocator and some kind of on-line defrag which is
> optimized for low-end flash (such as the kind found in android
> handsets).  There, the need to be careful that we don't end up
> increasing the write wearout becomes even more critical, although the
> GC work which f2fs does involve extra moving around of data blocks,
> and phones have seemed to do fine.  Of course, the typical phone only
> has to last 2-3 years before the battery dies, the screen gets
> cracked, and/or the owner decides they want the latest cool toy from
> the phone manufacturers.  :-)
> 
> In any case, if your goal is really some interface to support on-line
> defragmentation for ext4, you want to consider whether the
> EXT4_IOC_MOVE_EXTENT interface is sufficiently powerful such that you
> don't really need to mess around with new block allocation hints.

It's powerful, but it does not allow to create an effective defragmentation
tool for my usecase. See the examples above. I do not want to replace
EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
a space close to some existing file and reduce IO at defragmentation time.
This is just only thing I need in this patchset.

I can't climb into maintainers heads and find a thing, which will be suitable
for you. I did my try and suggested the interface. In case of it's not OK
for you, could you, please, suggest another one, which will work for my usecase?
The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
Are you OK with interface suggested by Andreas?

Thanks,
Kirill







^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-03  9:57   ` Kirill Tkhai
@ 2020-03-03 16:55     ` Theodore Y. Ts'o
  2020-03-03 17:36       ` Kirill Tkhai
  2020-03-11 19:26     ` Andreas Dilger
  1 sibling, 1 reply; 35+ messages in thread
From: Theodore Y. Ts'o @ 2020-03-03 16:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: adilger.kernel, viro, snitzer, jack, ebiggers, riteshh, krisman,
	surajjs, dmonakhov, mbobrowski, enwlinux, sblbir, khazhy,
	linux-ext4, linux-kernel, linux-fsdevel

On Tue, Mar 03, 2020 at 12:57:15PM +0300, Kirill Tkhai wrote:
> The practice shows it's not so. Your suggestion was the first thing we tried,
> but it works bad and just doubles/triples IO.
> 
> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
> 
> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
> 
> We want to pack both of files in the same 1Mb cluster. Packed together on block device,
> they will be in the same server of underlining distributed storage file system.
> This gives a big performance improvement, and this is the price I aimed.
> 
> In case of I fallocate a large hunk for both of them, I have to move them
> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
> 1Mb of data, i.e. double size, which is counterproductive.
> 
> Imaging another situation, when we have 
> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
> 
> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
> us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
> And this is the thing I try prevent with finding a suitable new interface.

OK, so you aren't trying to *defragment*.  You want to have files
placed "properly" ab initio.

It sounds like what you *think* is the best way to go is to simply
have files backed tightly together.  So effectively what you want as a
block allocation strategy is something which just finds the next free
space big enough for the requested fallocate space, and just plop it
down right there.

OK, so what happens once you've allocated all of the free space, and
the pattern of deletes leaves the file system with a lot of holes?

I could imagine trying to implement this as a mount option which uses
an alternate block allocation strategy, but it's not clear what your
end game is after all of the "easy" spaces have been taken.  It's much
like proposals I've seen for a log-structured file system, where the
garbage collector is left as a "we'll get to it later" TODO item.  (If
I had a dollar each time I've read a paper proposing a log structured
file system which leaves out the garbage collector as an
implementation detail....)

> It's powerful, but it does not allow to create an effective defragmentation
> tool for my usecase. See the examples above. I do not want to replace
> EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
> a space close to some existing file and reduce IO at defragmentation time.
> This is just only thing I need in this patchset.

"At defragmentation time"?   So you do want to run a defragger?

It might be helpful to see the full design of what you have in mind,
and not just a request for interfaces....

> I can't climb into maintainers heads and find a thing, which will be suitable
> for you. I did my try and suggested the interface. In case of it's not OK
> for you, could you, please, suggest another one, which will work for my usecase?
> The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
> Are you OK with interface suggested by Andreas?

Like you, I can't climb into your head and figure out exactly how your
entire system design is going to work.  And I'd really rather not
proposal or bless an interface until I do, since it may be that it's
better to make some minor changes to your system design, instead of
trying to twist ext4 for your particular use case....

       	  	     	      		     - Ted

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-03 16:55     ` Theodore Y. Ts'o
@ 2020-03-03 17:36       ` Kirill Tkhai
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-03 17:36 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: adilger.kernel, viro, snitzer, jack, ebiggers, riteshh, krisman,
	surajjs, dmonakhov, mbobrowski, enwlinux, sblbir, khazhy,
	linux-ext4, linux-kernel, linux-fsdevel

On 03.03.2020 19:55, Theodore Y. Ts'o wrote:
> On Tue, Mar 03, 2020 at 12:57:15PM +0300, Kirill Tkhai wrote:
>> The practice shows it's not so. Your suggestion was the first thing we tried,
>> but it works bad and just doubles/triples IO.
>>
>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>
>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>
>> We want to pack both of files in the same 1Mb cluster. Packed together on block device,
>> they will be in the same server of underlining distributed storage file system.
>> This gives a big performance improvement, and this is the price I aimed.
>>
>> In case of I fallocate a large hunk for both of them, I have to move them
>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>> 1Mb of data, i.e. double size, which is counterproductive.
>>
>> Imaging another situation, when we have 
>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>
>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again forces
>> us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is terrible!
>> And this is the thing I try prevent with finding a suitable new interface.
> 
> OK, so you aren't trying to *defragment*.  You want to have files
> placed "properly" ab initio.
> 
> It sounds like what you *think* is the best way to go is to simply
> have files backed tightly together.  So effectively what you want as a
> block allocation strategy is something which just finds the next free
> space big enough for the requested fallocate space, and just plop it
> down right there.
> 
> OK, so what happens once you've allocated all of the free space, and
> the pattern of deletes leaves the file system with a lot of holes?

We defrag not all files, but a specific subset of files. Say, webserver
may have a lot of static content, a lot of small files, which are never
changed. So, we found files, which were not modified for months or years,
and pack them together. Also we pack RO files, and the most cases they
never changed in the future.

So, it's not for all files, it's for specific files, which are chosen
by defragger algorithm.

> I could imagine trying to implement this as a mount option which uses
> an alternate block allocation strategy, but it's not clear what your
> end game is after all of the "easy" spaces have been taken.  It's much
> like proposals I've seen for a log-structured file system, where the
> garbage collector is left as a "we'll get to it later" TODO item.  (If
> I had a dollar each time I've read a paper proposing a log structured
> file system which leaves out the garbage collector as an
> implementation detail....)

Mount option acts at runtime. I'm OK with block placement at runtime. We
defrag files with old modification time, when we know they are unmodifiable,
some time later. So, I'm not sure this will help.
If I understood you wrong, please, explain, whether you mean something else.

>> It's powerful, but it does not allow to create an effective defragmentation
>> tool for my usecase. See the examples above. I do not want to replace
>> EXT4_IOC_MOVE_EXTENT I just want an interface to be able to allocate
>> a space close to some existing file and reduce IO at defragmentation time.
>> This is just only thing I need in this patchset.
> 
> "At defragmentation time"?   So you do want to run a defragger?
> 
> It might be helpful to see the full design of what you have in mind,
> and not just a request for interfaces....

Yes, I run defragger. And it detects, which files may be packed together,
and then it tries to pack them.
>> I can't climb into maintainers heads and find a thing, which will be suitable
>> for you. I did my try and suggested the interface. In case of it's not OK
>> for you, could you, please, suggest another one, which will work for my usecase?
>> The thesis "EXT4_IOC_MOVE_EXTENT is enough for everything" does not work for me :(
>> Are you OK with interface suggested by Andreas?
> 
> Like you, I can't climb into your head and figure out exactly how your
> entire system design is going to work.  And I'd really rather not
> proposal or bless an interface until I do, since it may be that it's
> better to make some minor changes to your system design, instead of
> trying to twist ext4 for your particular use case....

Let I try to give a description:

1)defragger scans whole filesystem and it divides fs in 1Mb cluster.
There is populated a statistics of every scanned cluster: filling percent,
percent of long-time-unmodifiable blocks, percent of RO blocks etc.

Also relation of some directories to block groups are cached.

2)then it marks clusters suitable for compaction and for relocation.

3)then some data becomes compacted or relocated. For compaction we try
to allocate blocks nearly existing RO data to decrease IO (as I wrote
in previous email). The only way we can do it is to use ext4 block
allocation option: usually it tries to allocate blocks for a new inode
in the same block group, where directory is placed. Here we use information
about relationship between directories and block groups. We use specific
directory to create donor file. In case of success, fallocate(donor)
returns blocks from correct 1Mb cluster. Otherwise, cluster is fully
relocated (1Mb block is allocated and we write everything there).

3.1)We have:

[     Cluster 1      ][      Cluster 2    ][    Cluster 3       ]
[1020Kb file][4K hole][4Kfile][1020Kb hole][1020Kb file][4K hole]

3.2)Create a donor file in the directory, which is related to the same
block group, where cluster 1 is placed.

3.3)Several calls of fallocate(donor, 4K)-> are we in Cluster 1 (or 3, or similar)?
                                                /              \
                                             (OK)move 4K file   (Fail)fallocate(1M) and move both 1020Kb file and 4K file.


The practice shows that probability of successful fallocate() results in the specific cluster
is small. So, it the most case we have to move both 1020Kb and 4K files.

My experience is "less full filesystem is, less probability to receive suitable extent from fallocate()".
For me it looks like ext4 tries to spread all files over the disk, so this helps to build files
with less number of extents. This is very nice at allocation time, and it really works good.
But I need some opposite at defragmentation time...

What do you think about all of this?

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-03  9:57   ` Kirill Tkhai
  2020-03-03 16:55     ` Theodore Y. Ts'o
@ 2020-03-11 19:26     ` Andreas Dilger
  2020-03-11 20:29       ` Kirill Tkhai
  1 sibling, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2020-03-11 19:26 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Theodore Y. Ts'o, Alexander Viro, Mike Snitzer, Jan Kara,
	Eric Biggers, riteshh, krisman, surajjs, Dmitry Monakhov,
	mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov, linux-ext4,
	Linux Kernel Mailing List, Linux FS Devel

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>> Kirill,
>> 
>> In a couple of your comments on this patch series, you mentioned
>> "defragmentation".  Is that because you're trying to use this as part
>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>> 
>> If that's the case, you should note that input parameter for that
>> ioctl is:
>> 
>> struct move_extent {
>> 	__u32 reserved;		/* should be zero */
>> 	__u32 donor_fd;		/* donor file descriptor */
>> 	__u64 orig_start;	/* logical start offset in block for orig */
>> 	__u64 donor_start;	/* logical start offset in block for donor */
>> 	__u64 len;		/* block length to be moved */
>> 	__u64 moved_len;	/* moved block length */
>> };
>> 
>> Note that the donor_start is separate from the start of the file that
>> is being defragged.  So you could have the userspace application
>> fallocate a large chunk of space for that donor file, and then use
>> that donor file to defrag multiple files if you want to close pack
>> them.
> 
> The practice shows it's not so. Your suggestion was the first thing we tried,
> but it works bad and just doubles/triples IO.
> 
> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
> 
> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
> 
> We want to pack both of files in the same 1Mb cluster. Packed together on block
> device, they will be in the same server of underlining distributed storage file
> system. This gives a big performance improvement, and this is the price I aimed.
> 
> In case of I fallocate a large hunk for both of them, I have to move them
> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
> 1Mb of data, i.e. double size, which is counterproductive.
> 
> Imaging another situation, when we have
> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
> 
> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
> terrible! And this is the thing I try prevent with finding a new interface.

One idea I had, which may work for your use case, is to run fallocate() on
the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
always give you that 4KB chunk if it is free, and that avoids the need to try
and force the allocator to select that block through some other method.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-11 19:26     ` Andreas Dilger
@ 2020-03-11 20:29       ` Kirill Tkhai
  2020-03-12  0:31         ` Andreas Dilger
  0 siblings, 1 reply; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-11 20:29 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, Alexander Viro, Mike Snitzer, Jan Kara,
	Eric Biggers, riteshh, krisman, surajjs, Dmitry Monakhov,
	mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov, linux-ext4,
	Linux Kernel Mailing List, Linux FS Devel

On 11.03.2020 22:26, Andreas Dilger wrote:
> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>> Kirill,
>>>
>>> In a couple of your comments on this patch series, you mentioned
>>> "defragmentation".  Is that because you're trying to use this as part
>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>
>>> If that's the case, you should note that input parameter for that
>>> ioctl is:
>>>
>>> struct move_extent {
>>> 	__u32 reserved;		/* should be zero */
>>> 	__u32 donor_fd;		/* donor file descriptor */
>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>> 	__u64 len;		/* block length to be moved */
>>> 	__u64 moved_len;	/* moved block length */
>>> };
>>>
>>> Note that the donor_start is separate from the start of the file that
>>> is being defragged.  So you could have the userspace application
>>> fallocate a large chunk of space for that donor file, and then use
>>> that donor file to defrag multiple files if you want to close pack
>>> them.
>>
>> The practice shows it's not so. Your suggestion was the first thing we tried,
>> but it works bad and just doubles/triples IO.
>>
>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>
>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>
>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>> device, they will be in the same server of underlining distributed storage file
>> system. This gives a big performance improvement, and this is the price I aimed.
>>
>> In case of I fallocate a large hunk for both of them, I have to move them
>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>> 1Mb of data, i.e. double size, which is counterproductive.
>>
>> Imaging another situation, when we have
>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>
>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>> terrible! And this is the thing I try prevent with finding a new interface.
> 
> One idea I had, which may work for your use case, is to run fallocate() on
> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
> always give you that 4KB chunk if it is free, and that avoids the need to try
> and force the allocator to select that block through some other method.

Do you mean the following:

1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).
2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
3)truncate 4KB at the end of the first file.

?

If so, this can't be an online defrag, since some process may want to increase *1MB-4KB*
file in between. This will just bring to data corruption.
Another problem is that power lose between 1 and 3 will result in that file length remain
*1MB* instead of *1MB-4KB*.

So, we still need some kernel support to implement this.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-11 20:29       ` Kirill Tkhai
@ 2020-03-12  0:31         ` Andreas Dilger
  2020-03-12  9:23           ` Kirill Tkhai
  0 siblings, 1 reply; 35+ messages in thread
From: Andreas Dilger @ 2020-03-12  0:31 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Theodore Y. Ts'o, Alexander Viro, Mike Snitzer, Jan Kara,
	Eric Biggers, riteshh, krisman, surajjs, Dmitry Monakhov,
	mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov, linux-ext4,
	Linux Kernel Mailing List, Linux FS Devel

[-- Attachment #1: Type: text/plain, Size: 3902 bytes --]

On Mar 11, 2020, at 2:29 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 11.03.2020 22:26, Andreas Dilger wrote:
>> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>> 
>>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>>> Kirill,
>>>> 
>>>> In a couple of your comments on this patch series, you mentioned
>>>> "defragmentation".  Is that because you're trying to use this as part
>>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>> 
>>>> If that's the case, you should note that input parameter for that
>>>> ioctl is:
>>>> 
>>>> struct move_extent {
>>>> 	__u32 reserved;		/* should be zero */
>>>> 	__u32 donor_fd;		/* donor file descriptor */
>>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>>> 	__u64 len;		/* block length to be moved */
>>>> 	__u64 moved_len;	/* moved block length */
>>>> };
>>>> 
>>>> Note that the donor_start is separate from the start of the file that
>>>> is being defragged.  So you could have the userspace application
>>>> fallocate a large chunk of space for that donor file, and then use
>>>> that donor file to defrag multiple files if you want to close pack
>>>> them.
>>> 
>>> The practice shows it's not so. Your suggestion was the first thing we tried,
>>> but it works bad and just doubles/triples IO.
>>> 
>>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>> 
>>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>> 
>>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>>> device, they will be in the same server of underlining distributed storage file
>>> system. This gives a big performance improvement, and this is the price I aimed.
>>> 
>>> In case of I fallocate a large hunk for both of them, I have to move them
>>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>>> 1Mb of data, i.e. double size, which is counterproductive.
>>> 
>>> Imaging another situation, when we have
>>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>> 
>>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>>> terrible! And this is the thing I try prevent with finding a new interface.
>> 
>> One idea I had, which may work for your use case, is to run fallocate() on
>> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
>> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
>> always give you that 4KB chunk if it is free, and that avoids the need to try
>> and force the allocator to select that block through some other method.
> 
> Do you mean the following:
> 
> 1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).

You can use FALLOCATE_KEEP_SIZE to avoid changing the size of the file.

> 2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
> 3)truncate 4KB at the end of the first file.
> 
> If so, this can't be an online defrag, since some process may want to increase
> *1MB-4KB* file in between. This will just bring to data corruption.

You previously stated that one of the main reasons to do the defrag is because
the files are not being modified.  It would be possible to detect the case of
the file being modified by the file version and/or size and/or time change
before removing the fallocated block.

> Another problem is that power lose between 1 and 3 will result in that file
> length remain *1MB* instead of *1MB-4KB*.

With FALLOCATE_KEEP_SIZE you can just use this file as the donor file to
allocate the blocks, then migrate it to another file without having written
anything into it.

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag.
  2020-03-12  0:31         ` Andreas Dilger
@ 2020-03-12  9:23           ` Kirill Tkhai
  0 siblings, 0 replies; 35+ messages in thread
From: Kirill Tkhai @ 2020-03-12  9:23 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Y. Ts'o, Alexander Viro, Mike Snitzer, Jan Kara,
	Eric Biggers, riteshh, krisman, surajjs, Dmitry Monakhov,
	mbobrowski, Eric Whitney, sblbir, Khazhismel Kumykov, linux-ext4,
	Linux Kernel Mailing List, Linux FS Devel

On 12.03.2020 03:31, Andreas Dilger wrote:
> On Mar 11, 2020, at 2:29 PM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 11.03.2020 22:26, Andreas Dilger wrote:
>>> On Mar 3, 2020, at 2:57 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 02.03.2020 19:56, Theodore Y. Ts'o wrote:
>>>>> Kirill,
>>>>>
>>>>> In a couple of your comments on this patch series, you mentioned
>>>>> "defragmentation".  Is that because you're trying to use this as part
>>>>> of e4defrag, or at least, using EXT4_IOC_MOVE_EXT?
>>>>>
>>>>> If that's the case, you should note that input parameter for that
>>>>> ioctl is:
>>>>>
>>>>> struct move_extent {
>>>>> 	__u32 reserved;		/* should be zero */
>>>>> 	__u32 donor_fd;		/* donor file descriptor */
>>>>> 	__u64 orig_start;	/* logical start offset in block for orig */
>>>>> 	__u64 donor_start;	/* logical start offset in block for donor */
>>>>> 	__u64 len;		/* block length to be moved */
>>>>> 	__u64 moved_len;	/* moved block length */
>>>>> };
>>>>>
>>>>> Note that the donor_start is separate from the start of the file that
>>>>> is being defragged.  So you could have the userspace application
>>>>> fallocate a large chunk of space for that donor file, and then use
>>>>> that donor file to defrag multiple files if you want to close pack
>>>>> them.
>>>>
>>>> The practice shows it's not so. Your suggestion was the first thing we tried,
>>>> but it works bad and just doubles/triples IO.
>>>>
>>>> Let we have two files of 512Kb, and they are placed in separate 1Mb clusters:
>>>>
>>>> [[512Kb file][512Kb free]][[512Kb file][512Kb free]]
>>>>
>>>> We want to pack both of files in the same 1Mb cluster. Packed together on block
>>>> device, they will be in the same server of underlining distributed storage file
>>>> system. This gives a big performance improvement, and this is the price I aimed.
>>>>
>>>> In case of I fallocate a large hunk for both of them, I have to move them
>>>> both to this new hunk. So, instead of moving 512Kb of data, we will have to move
>>>> 1Mb of data, i.e. double size, which is counterproductive.
>>>>
>>>> Imaging another situation, when we have
>>>> [[1020Kb file]][4Kb free]][[4Kb file][1020Kb free]]
>>>>
>>>> Here we may just move [4Kb file] into [4Kb free]. But your suggestion again
>>>> forces us to move 1Mb instead of 4Kb, which makes IO 256 times worse! This is
>>>> terrible! And this is the thing I try prevent with finding a new interface.
>>>
>>> One idea I had, which may work for your use case, is to run fallocate() on
>>> the *1MB-4KB file* to allocate the last 4KB in that hunk, then use that block
>>> as the donor file for the 1MB+4KB file.  The ext4 allocation algorithms should
>>> always give you that 4KB chunk if it is free, and that avoids the need to try
>>> and force the allocator to select that block through some other method.
>>
>> Do you mean the following:
>>
>> 1)fallocate() 4K at the end of *1MB-4KB* the first file (==> this increases the file length).
> 
> You can use FALLOCATE_KEEP_SIZE to avoid changing the size of the file.

Ok, but there still remains the problem with fallocation of a block in front of target file:

[4KB hole][1MB-4KB file][hole][4KB file]

Is there a high-probably way that ext4 allocator returns a block before target file?

>> 2)EXT4_IOC_MOVE_EXT *4KB* the second file in that new hunk.
>> 3)truncate 4KB at the end of the first file.
>>
>> If so, this can't be an online defrag, since some process may want to increase
>> *1MB-4KB* file in between. This will just bring to data corruption.
> 
> You previously stated that one of the main reasons to do the defrag is because
> the files are not being modified.  It would be possible to detect the case of
> the file being modified by the file version and/or size and/or time change
> before removing the fallocated block.

Yes, files should not be modified in parallel, but there is no 100% guarantee.
It's almost 100% by statistics, but since there is multi-user system (I defrag
fs on VPS), there are possible exceptions. And the whole architecture must be
safe in any cases.

File version does not look acceptable for online defrag, since there is no a way,
which allows to check the version and to remove fallocated block *atomically*.

>> Another problem is that power lose between 1 and 3 will result in that file
>> length remain *1MB* instead of *1MB-4KB*.
> 
> With FALLOCATE_KEEP_SIZE you can just use this file as the donor file to
> allocate the blocks, then migrate it to another file without having written
> anything into it.

In case of some task decides to write into fallocated block, there will be data
corruption.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, back to index

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-26 13:40 [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Kirill Tkhai
2020-02-26 13:40 ` [PATCH RFC 1/5] fs: Add new argument to file_operations::fallocate() Kirill Tkhai
2020-02-26 13:41 ` [PATCH RFC 2/5] fs: Add new argument to vfs_fallocate() Kirill Tkhai
2020-02-26 13:41 ` [PATCH RFC 3/5] fs: Add fallocate2() syscall Kirill Tkhai
2020-02-26 13:41 ` [PATCH RFC 4/5] ext4: Prepare ext4_mb_discard_preallocations() for handling EXT4_MB_HINT_GOAL_ONLY Kirill Tkhai
2020-02-26 13:41 ` [PATCH RFC 5/5] ext4: Add fallocate2() support Kirill Tkhai
2020-02-26 15:55   ` Christoph Hellwig
2020-02-26 20:05     ` Kirill Tkhai
2020-02-26 21:51       ` Andreas Dilger
2020-02-27 12:24         ` Kirill Tkhai
2020-02-28 15:35           ` Andreas Dilger
2020-02-28 21:16             ` Dave Chinner
2020-02-29 20:12               ` Andreas Dilger
2020-03-01  0:06                 ` Dave Chinner
2020-03-02 10:33               ` Kirill Tkhai
2020-03-02 11:07             ` Kirill Tkhai
2020-02-27  6:59       ` Konstantin Khlebnikov
2020-02-27 10:42         ` Kirill Tkhai
2020-02-27  7:33       ` Dave Chinner
2020-02-27 11:12         ` Kirill Tkhai
2020-02-27 21:56           ` Dave Chinner
2020-02-28 12:41             ` Kirill Tkhai
2020-02-29 22:41               ` Dave Chinner
2020-03-02 10:17                 ` Kirill Tkhai
2020-02-27 10:39 ` [PATCH RFC 0/5] fs, ext4: Physical blocks placement hint for fallocate(0): fallocate2(). TP defrag Ritesh Harjani
2020-02-28  7:07 ` xiaohui li
2020-02-28 12:46   ` Kirill Tkhai
2020-03-02 16:56 ` Theodore Y. Ts'o
2020-03-03  9:57   ` Kirill Tkhai
2020-03-03 16:55     ` Theodore Y. Ts'o
2020-03-03 17:36       ` Kirill Tkhai
2020-03-11 19:26     ` Andreas Dilger
2020-03-11 20:29       ` Kirill Tkhai
2020-03-12  0:31         ` Andreas Dilger
2020-03-12  9:23           ` Kirill Tkhai

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git