Hi all, Jane Chu has taken an interest in trying to fix the pmem poison recovery story on Linux. Since I sort of had a half-baked patchset that seems to contain some elements of what the reviewers of her patchset wanted, I'm releasing this reworked version to see if it has any traction. Our current "advice" to people using persistent memory and FSDAX who wish to recover upon receipt of a media error (aka 'hwpoison') event from ACPI is to punch-hole that part of the file and then pwrite it, which will magically cause the pmem to be reinitialized and the poison to be cleared. Punching doesn't make any sense at all -- the (re)allocation on pwrite does not permit the caller to specify where to find blocks, which means that we might not get the same pmem back. This pushes the user farther away from the goal of reinitializing poisoned memory and leads to complaints about unnecessary file fragmentation. AFAICT, the only reason why the "punch and write" dance works at all is that the XFS and ext4 currently call blkdev_issue_zeroout when allocating pmem ahead of a write call. Even a regular overwrite won't clear the poison, because dax_direct_access is smart enough to bail out on poisoned pmem, but not smart enough to clear it. To be fair, that function maps pages and has no idea what kinds of reads and writes the caller might want to perform. Therefore, clean up this whole mess by creating a dax_zeroinit_range function that callers can use on poisoned persistent memory to reset the contents of the persistent memory to a known state (all zeroes) and clear any lingering poison state that might be lingering in the memory controllers. Create a new fallocate mode to trigger this functionality, then wire up XFS and ext4 to use it. For good measure, wire it up to traditional storage if the storage has a fast way to zero LBA contents, since we assume that those LBAs won't hit old media errors. v2: change the name to zeroinit, add an explicit fallocate mode, and support regular block devices for non-dax files If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=zero-initialize-pmem-5.16 --- fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++ fs/ext4/extents.c | 93 +++++++++++++++++++++++++++++++++++++++++++ fs/iomap/direct-io.c | 75 +++++++++++++++++++++++++++++++++++ fs/open.c | 5 ++ fs/xfs/xfs_bmap_util.c | 22 ++++++++++ fs/xfs/xfs_bmap_util.h | 2 + fs/xfs/xfs_file.c | 11 ++++- fs/xfs/xfs_trace.h | 1 include/linux/dax.h | 7 +++ include/linux/falloc.h | 1 include/linux/iomap.h | 3 + include/trace/events/ext4.h | 7 +++ include/uapi/linux/falloc.h | 9 ++++ 13 files changed, 325 insertions(+), 4 deletions(-)
From: Darrick J. Wong <djwong@kernel.org> Our current "advice" to people using persistent memory and FSDAX who wish to recover upon receipt of a media error (aka 'hwpoison') event from ACPI is to punch-hole that part of the file and then pwrite it, which will magically cause the pmem to be reinitialized and the poison to be cleared. Punching doesn't make any sense at all -- the (re)allocation on pwrite does not permit the caller to specify where to find blocks, which means that we might not get the same pmem back. This pushes the user farther away from the goal of reinitializing poisoned memory and leads to complaints about unnecessary file fragmentation. AFAICT, the only reason why the "punch and write" dance works at all is that the XFS and ext4 currently call blkdev_issue_zeroout when allocating pmem ahead of a write call. Even a regular overwrite won't clear the poison, because dax_direct_access is smart enough to bail out on poisoned pmem, but not smart enough to clear it. To be fair, that function maps pages and has no idea what kinds of reads and writes the caller might want to perform. Therefore, create a dax_zeroinit_range function that filesystems can to reset the pmem contents to zero and clear hardware media error flags. This uses the dax page zeroing helper function, which should ensure that subsequent accesses will not trip over any pre-existing media errors. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/dax.h | 7 ++++ 2 files changed, 100 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index 4e3e5a283a91..765b80d08605 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, return dax_insert_pfn_mkwrite(vmf, pfn, order); } EXPORT_SYMBOL_GPL(dax_finish_sync_fault); + +static loff_t +dax_zeroinit_iter(struct iomap_iter *iter) +{ + struct iomap *iomap = &iter->iomap; + const struct iomap *srcmap = iomap_iter_srcmap(iter); + const u64 start = iomap->addr + iter->pos - iomap->offset; + const u64 nr_bytes = iomap_length(iter); + u64 start_page = start >> PAGE_SHIFT; + u64 nr_pages = nr_bytes >> PAGE_SHIFT; + int ret; + + if (!iomap->dax_dev) + return -ECANCELED; + + /* + * The physical extent must be page aligned because that's what the dax + * function requires. + */ + if (!PAGE_ALIGNED(start | nr_bytes)) + return -ECANCELED; + + /* + * The dax function, by using pgoff_t, is stuck with unsigned long, so + * we must check for overflows. + */ + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) + return -ECANCELED; + + /* Must be able to zero storage directly without fs intervention. */ + if (iomap->flags & IOMAP_F_SHARED) + return -ECANCELED; + if (srcmap != iomap) + return -ECANCELED; + + switch (iomap->type) { + case IOMAP_MAPPED: + while (nr_pages > 0) { + /* XXX function only supports one page at a time?! */ + ret = dax_zero_page_range(iomap->dax_dev, start_page, + 1); + if (ret) + return ret; + start_page++; + nr_pages--; + } + + fallthrough; + case IOMAP_UNWRITTEN: + return nr_bytes; + } + + /* Reject holes, inline data, or delalloc extents. */ + return -ECANCELED; +} + +/* + * Initialize storage mapped to a DAX-mode file to a known value and ensure the + * media are ready to accept read and write commands. This requires the use of + * the dax layer's zero page range function to write zeroes to a pmem region + * and to reset any hardware media error state. + * + * The physical extents must be aligned to page size. The file must be backed + * by a pmem device. The extents returned must not require copy on write (or + * any other mapping interventions from the filesystem) and must be contiguous. + * @done will be set to true if the reset succeeded. + * + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage + * mappings do not support zero initialization, -EOPNOTSUPP if the device does + * not support it, or the usual negative errno. + */ +int +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, + const struct iomap_ops *ops) +{ + struct iomap_iter iter = { + .inode = inode, + .pos = pos, + .len = len, + .flags = IOMAP_REPORT, + }; + int ret; + + if (!IS_DAX(inode)) + return -EINVAL; + if (pos + len > i_size_read(inode)) + return -EINVAL; + + while ((ret = iomap_iter(&iter, ops)) > 0) + iter.processed = dax_zeroinit_iter(&iter); + return ret; +} +EXPORT_SYMBOL_GPL(dax_zeroinit_range); diff --git a/include/linux/dax.h b/include/linux/dax.h index 2619d94c308d..3c873f7c35ba 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); dax_entry_t dax_lock_page(struct page *page); void dax_unlock_page(struct page *page, dax_entry_t cookie); +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, + const struct iomap_ops *ops); #else #define generic_fsdax_supported NULL @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) { } +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, + const struct iomap_ops *ops) +{ + return -EOPNOTSUPP; +} #endif #if IS_ENABLED(CONFIG_DAX)
From: Darrick J. Wong <djwong@kernel.org> Create a function that ensures that the storage backing part of a file contains zeroes and will not trip over old media errors if the contents are re-read. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/iomap/direct-io.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/iomap.h | 3 ++ 2 files changed, 78 insertions(+) diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index 4ecd255e0511..48826a49f976 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -652,3 +652,78 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, return iomap_dio_complete(dio); } EXPORT_SYMBOL_GPL(iomap_dio_rw); + +static loff_t +iomap_zeroinit_iter(struct iomap_iter *iter) +{ + struct iomap *iomap = &iter->iomap; + const struct iomap *srcmap = iomap_iter_srcmap(iter); + const u64 start = iomap->addr + iter->pos - iomap->offset; + const u64 nr_bytes = iomap_length(iter); + sector_t sector = start >> SECTOR_SHIFT; + sector_t nr_sectors = nr_bytes >> SECTOR_SHIFT; + int ret; + + if (!iomap->bdev) + return -ECANCELED; + + /* The physical extent must be sector-aligned for block layer APIs. */ + if ((start | nr_bytes) & (SECTOR_SIZE - 1)) + return -EINVAL; + + /* Must be able to zero storage directly without fs intervention. */ + if (iomap->flags & IOMAP_F_SHARED) + return -ECANCELED; + if (srcmap != iomap) + return -ECANCELED; + + switch (iomap->type) { + case IOMAP_MAPPED: + ret = blkdev_issue_zeroout(iomap->bdev, sector, nr_sectors, + GFP_KERNEL, 0); + if (ret) + return ret; + fallthrough; + case IOMAP_UNWRITTEN: + return nr_bytes; + } + + /* Reject holes, inline data, or delalloc extents. */ + return -ECANCELED; +} + +/* + * Use a storage device's accelerated zero-writing command to ensure the media + * are ready to accept read and write commands. FSDAX is not supported. + * + * The range arguments must be aligned to sector size. The file must be backed + * by a block device. The extents returned must not require copy on write (or + * any other mapping interventions from the filesystem) and must be contiguous. + * @done will be set to true if the reset succeeded. + * + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage + * mappings do not support zero initialization, -EOPNOTSUPP if the device does + * not support it, or the usual negative errno. + */ +int +iomap_zeroout_range(struct inode *inode, loff_t pos, u64 len, + const struct iomap_ops *ops) +{ + struct iomap_iter iter = { + .inode = inode, + .pos = pos, + .len = len, + .flags = IOMAP_REPORT, + }; + int ret; + + if (IS_DAX(inode)) + return -EINVAL; + if (pos + len > i_size_read(inode)) + return -EINVAL; + + while ((ret = iomap_iter(&iter, ops)) > 0) + iter.processed = iomap_zeroinit_iter(&iter); + return ret; +} +EXPORT_SYMBOL_GPL(iomap_zeroout_range); diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 24f8489583ca..f4b9c6698388 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -339,6 +339,9 @@ struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, ssize_t iomap_dio_complete(struct iomap_dio *dio); int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); +int iomap_zeroout_range(struct inode *inode, loff_t pos, u64 len, + const struct iomap_ops *ops); + #ifdef CONFIG_SWAP struct file; struct swap_info_struct;
From: Darrick J. Wong <djwong@kernel.org> Add a new mode to fallocate to zero-initialize all the storage backing a file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/open.c | 5 +++++ include/linux/falloc.h | 1 + include/uapi/linux/falloc.h | 9 +++++++++ 3 files changed, 15 insertions(+) diff --git a/fs/open.c b/fs/open.c index daa324606a41..230220b8f67a 100644 --- a/fs/open.c +++ b/fs/open.c @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) (mode & ~FALLOC_FL_INSERT_RANGE)) return -EINVAL; + /* Zeroinit should only be used by itself and keep size must be set. */ + if ((mode & FALLOC_FL_ZEROINIT_RANGE) && + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE))) + return -EINVAL; + /* Unshare range should only be used with allocate mode. */ if ((mode & FALLOC_FL_UNSHARE_RANGE) && (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) diff --git a/include/linux/falloc.h b/include/linux/falloc.h index f3f0b97b1675..4597b416667b 100644 --- a/include/linux/falloc.h +++ b/include/linux/falloc.h @@ -29,6 +29,7 @@ struct space_resv { FALLOC_FL_PUNCH_HOLE | \ FALLOC_FL_COLLAPSE_RANGE | \ FALLOC_FL_ZERO_RANGE | \ + FALLOC_FL_ZEROINIT_RANGE | \ FALLOC_FL_INSERT_RANGE | \ FALLOC_FL_UNSHARE_RANGE) diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h index 51398fa57f6c..8144403b6102 100644 --- a/include/uapi/linux/falloc.h +++ b/include/uapi/linux/falloc.h @@ -77,4 +77,13 @@ */ #define FALLOC_FL_UNSHARE_RANGE 0x40 +/* + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by + * writing zeros to it. Subsequent read and writes should not fail due to any + * previous media errors. Blocks must be not be shared or require copy on + * write. Holes and unwritten extents are left untouched. This mode must be + * used with FALLOC_FL_KEEP_SIZE. + */ +#define FALLOC_FL_ZEROINIT_RANGE 0x80 + #endif /* _UAPI_FALLOC_H_ */
From: Darrick J. Wong <djwong@kernel.org> Implement this new fallocate mode so that persistent memory users can, upon receipt of a pmem poison notification, cause the pmem to be reinitialized to a known value (zero) and clear any hardware poison state that might be lurking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/xfs_bmap_util.c | 22 ++++++++++++++++++++++ fs/xfs/xfs_bmap_util.h | 2 ++ fs/xfs/xfs_file.c | 11 ++++++++--- fs/xfs/xfs_trace.h | 1 + 4 files changed, 33 insertions(+), 3 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 73a36b7be3bd..319e79bb7fd8 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -956,6 +956,28 @@ xfs_flush_unmap_range( return 0; } +int +xfs_file_zeroinit_space( + struct xfs_inode *ip, + xfs_off_t offset, + xfs_off_t len) +{ + struct inode *inode = VFS_I(ip); + int error; + + trace_xfs_zeroinit_file_space(ip, offset, len); + + if (IS_DAX(inode)) + error = dax_zeroinit_range(inode, offset, len, + &xfs_read_iomap_ops); + else + error = iomap_zeroout_range(inode, offset, len, + &xfs_read_iomap_ops); + if (error == -ECANCELED) + return -EOPNOTSUPP; + return error; +} + int xfs_free_file_space( struct xfs_inode *ip, diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index 9f993168b55b..7bb425d0876c 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -61,6 +61,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset, xfs_off_t len); int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset, xfs_off_t len); +int xfs_file_zeroinit_space(struct xfs_inode *ip, xfs_off_t offset, + xfs_off_t length); /* EOF block manipulation functions */ bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 7aa943edfc02..886e819efa3b 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -899,7 +899,8 @@ xfs_break_layouts( #define XFS_FALLOC_FL_SUPPORTED \ (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | \ FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | \ - FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE) + FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE | \ + FALLOC_FL_ZEROINIT_RANGE) STATIC long xfs_file_fallocate( @@ -950,13 +951,17 @@ xfs_file_fallocate( * handled at the right time by xfs_prepare_shift(). */ if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE | - FALLOC_FL_COLLAPSE_RANGE)) { + FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZEROINIT_RANGE)) { error = xfs_flush_unmap_range(ip, offset, len); if (error) goto out_unlock; } - if (mode & FALLOC_FL_PUNCH_HOLE) { + if (mode & FALLOC_FL_ZEROINIT_RANGE) { + error = xfs_file_zeroinit_space(ip, offset, len); + if (error) + goto out_unlock; + } else if (mode & FALLOC_FL_PUNCH_HOLE) { error = xfs_free_file_space(ip, offset, len); if (error) goto out_unlock; diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 4a8076ef8cb4..eccad0a5c40f 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1534,6 +1534,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof); DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write); DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten); DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append); +DEFINE_SIMPLE_IO_EVENT(xfs_zeroinit_file_space); DECLARE_EVENT_CLASS(xfs_itrunc_class, TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
From: Darrick J. Wong <djwong@kernel.org> Implement this new fallocate mode so that persistent memory users can, upon receipt of a pmem poison notification, cause the pmem to be reinitialized to a known value (zero) and clear any hardware poison state that might be lurking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> --- fs/ext4/extents.c | 93 +++++++++++++++++++++++++++++++++++++++++++ include/trace/events/ext4.h | 7 +++ 2 files changed, 99 insertions(+), 1 deletion(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c0de30f25185..c345002e2da6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -29,6 +29,7 @@ #include <linux/fiemap.h> #include <linux/backing-dev.h> #include <linux/iomap.h> +#include <linux/dax.h> #include "ext4_jbd2.h" #include "ext4_extents.h" #include "xattr.h" @@ -4475,6 +4476,90 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len); static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len); +static long ext4_zeroinit_range(struct file *file, loff_t offset, loff_t len) +{ + struct inode *inode = file_inode(file); + struct address_space *mapping = inode->i_mapping; + handle_t *handle = NULL; + loff_t end = offset + len; + long ret; + + trace_ext4_zeroinit_range(inode, offset, len, + FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE); + + /* We don't support data=journal mode */ + if (ext4_should_journal_data(inode)) + return -EOPNOTSUPP; + + inode_lock(inode); + + /* + * Indirect files do not support unwritten extents + */ + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { + ret = -EOPNOTSUPP; + goto out_mutex; + } + + /* Wait all existing dio workers, newcomers will block on i_mutex */ + inode_dio_wait(inode); + + /* + * Prevent page faults from reinstantiating pages we have released from + * page cache. + */ + filemap_invalidate_lock(mapping); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_mmap; + + /* Now release the pages and zero block aligned part of pages */ + truncate_pagecache_range(inode, offset, end - 1); + inode->i_mtime = inode->i_ctime = current_time(inode); + + if (IS_DAX(inode)) + ret = dax_zeroinit_range(inode, offset, len, + &ext4_iomap_report_ops); + else + ret = iomap_zeroout_range(inode, offset, len, + &ext4_iomap_report_ops); + if (ret == -ECANCELED) + ret = -EOPNOTSUPP; + if (ret) + goto out_mmap; + + /* + * In worst case we have to writeout two nonadjacent unwritten + * blocks and update the inode + */ + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + ext4_std_error(inode->i_sb, ret); + goto out_mmap; + } + + inode->i_mtime = inode->i_ctime = current_time(inode); + ret = ext4_mark_inode_dirty(handle, inode); + if (unlikely(ret)) + goto out_handle; + ext4_fc_track_range(handle, inode, offset >> inode->i_sb->s_blocksize_bits, + (offset + len - 1) >> inode->i_sb->s_blocksize_bits); + ext4_update_inode_fsync_trans(handle, inode, 1); + + if (file->f_flags & O_SYNC) + ext4_handle_sync(handle); + +out_handle: + ext4_journal_stop(handle); +out_mmap: + filemap_invalidate_unlock(mapping); +out_mutex: + inode_unlock(inode); + return ret; +} + static long ext4_zero_range(struct file *file, loff_t offset, loff_t len, int mode) { @@ -4659,7 +4744,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) /* Return error if mode is not supported */ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | - FALLOC_FL_INSERT_RANGE)) + FALLOC_FL_INSERT_RANGE | FALLOC_FL_ZEROINIT_RANGE)) return -EOPNOTSUPP; ext4_fc_start_update(inode); @@ -4687,6 +4772,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) ret = ext4_zero_range(file, offset, len, mode); goto exit; } + + if (mode & FALLOC_FL_ZEROINIT_RANGE) { + ret = ext4_zeroinit_range(file, offset, len); + goto exit; + } + trace_ext4_fallocate_enter(inode, offset, len, mode); lblk = offset >> blkbits; diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index 0ea36b2b0662..282f1208067f 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -1407,6 +1407,13 @@ DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range, TP_ARGS(inode, offset, len, mode) ); +DEFINE_EVENT(ext4__fallocate_mode, ext4_zeroinit_range, + + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), + + TP_ARGS(inode, offset, len, mode) +); + TRACE_EVENT(ext4_fallocate_exit, TP_PROTO(struct inode *inode, loff_t offset, unsigned int max_blocks, int ret),
On 21/09/17 06:30PM, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Our current "advice" to people using persistent memory and FSDAX who > wish to recover upon receipt of a media error (aka 'hwpoison') event > from ACPI is to punch-hole that part of the file and then pwrite it, > which will magically cause the pmem to be reinitialized and the poison > to be cleared. > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > does not permit the caller to specify where to find blocks, which means > that we might not get the same pmem back. This pushes the user farther > away from the goal of reinitializing poisoned memory and leads to > complaints about unnecessary file fragmentation. > > AFAICT, the only reason why the "punch and write" dance works at all is > that the XFS and ext4 currently call blkdev_issue_zeroout when > allocating pmem ahead of a write call. Even a regular overwrite won't > clear the poison, because dax_direct_access is smart enough to bail out > on poisoned pmem, but not smart enough to clear it. To be fair, that > function maps pages and has no idea what kinds of reads and writes the > caller might want to perform. > > Therefore, create a dax_zeroinit_range function that filesystems can to > reset the pmem contents to zero and clear hardware media error flags. > This uses the dax page zeroing helper function, which should ensure that > subsequent accesses will not trip over any pre-existing media errors. Thanks Darrick for such clear explaination of the problem and your solution. As I see from this thread [1], it looks like we are heading in this direction, so I thought of why not review this RFC patch series :) [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/ > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > --- > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ > include/linux/dax.h | 7 ++++ > 2 files changed, 100 insertions(+) > > > diff --git a/fs/dax.c b/fs/dax.c > index 4e3e5a283a91..765b80d08605 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > return dax_insert_pfn_mkwrite(vmf, pfn, order); > } > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > + > +static loff_t > +dax_zeroinit_iter(struct iomap_iter *iter) > +{ > + struct iomap *iomap = &iter->iomap; > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > + const u64 start = iomap->addr + iter->pos - iomap->offset; > + const u64 nr_bytes = iomap_length(iter); > + u64 start_page = start >> PAGE_SHIFT; > + u64 nr_pages = nr_bytes >> PAGE_SHIFT; > + int ret; > + > + if (!iomap->dax_dev) > + return -ECANCELED; > + > + /* > + * The physical extent must be page aligned because that's what the dax > + * function requires. > + */ > + if (!PAGE_ALIGNED(start | nr_bytes)) > + return -ECANCELED; > + > + /* > + * The dax function, by using pgoff_t, is stuck with unsigned long, so > + * we must check for overflows. > + */ > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) > + return -ECANCELED; > + > + /* Must be able to zero storage directly without fs intervention. */ > + if (iomap->flags & IOMAP_F_SHARED) > + return -ECANCELED; > + if (srcmap != iomap) > + return -ECANCELED; > + > + switch (iomap->type) { > + case IOMAP_MAPPED: > + while (nr_pages > 0) { > + /* XXX function only supports one page at a time?! */ > + ret = dax_zero_page_range(iomap->dax_dev, start_page, > + 1); > + if (ret) > + return ret; > + start_page++; > + nr_pages--; > + } > + > + fallthrough; > + case IOMAP_UNWRITTEN: > + return nr_bytes; > + } > + > + /* Reject holes, inline data, or delalloc extents. */ > + return -ECANCELED; We reject holes here, but the other vfs plumbing patch [2] mentions "Holes and unwritten extents are left untouched.". Shouldn't we just return nr_bytes for IOMAP_HOLE case as well? [2]: "vfs: add a zero-initialization mode to fallocate" Although I am not an expert in this area, but the rest of the patch looks very well crafted to me. Thanks again for such details :) -ritesh > > +} > + > +/* > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the > + * media are ready to accept read and write commands. This requires the use of > + * the dax layer's zero page range function to write zeroes to a pmem region > + * and to reset any hardware media error state. > + * > + * The physical extents must be aligned to page size. The file must be backed > + * by a pmem device. The extents returned must not require copy on write (or > + * any other mapping interventions from the filesystem) and must be contiguous. > + * @done will be set to true if the reset succeeded. > + * > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > + * not support it, or the usual negative errno. > + */ > +int > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > + const struct iomap_ops *ops) > +{ > + struct iomap_iter iter = { > + .inode = inode, > + .pos = pos, > + .len = len, > + .flags = IOMAP_REPORT, > + }; > + int ret; > + > + if (!IS_DAX(inode)) > + return -EINVAL; > + if (pos + len > i_size_read(inode)) > + return -EINVAL; > + > + while ((ret = iomap_iter(&iter, ops)) > 0) > + iter.processed = dax_zeroinit_iter(&iter); > + return ret; > +} > +EXPORT_SYMBOL_GPL(dax_zeroinit_range); > diff --git a/include/linux/dax.h b/include/linux/dax.h > index 2619d94c308d..3c873f7c35ba 100644 > --- a/include/linux/dax.h > +++ b/include/linux/dax.h > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); > struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); > dax_entry_t dax_lock_page(struct page *page); > void dax_unlock_page(struct page *page, dax_entry_t cookie); > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > + const struct iomap_ops *ops); > #else > #define generic_fsdax_supported NULL > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) > static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) > { > } > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > + const struct iomap_ops *ops) > +{ > + return -EOPNOTSUPP; > +} > #endif > > #if IS_ENABLED(CONFIG_DAX) >
On 21/09/17 06:30PM, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Create a function that ensures that the storage backing part of a file > contains zeroes and will not trip over old media errors if the contents > are re-read. > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > --- > fs/iomap/direct-io.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++ > include/linux/iomap.h | 3 ++ > 2 files changed, 78 insertions(+) > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 4ecd255e0511..48826a49f976 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -652,3 +652,78 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > return iomap_dio_complete(dio); > } > EXPORT_SYMBOL_GPL(iomap_dio_rw); > + > +static loff_t > +iomap_zeroinit_iter(struct iomap_iter *iter) > +{ > + struct iomap *iomap = &iter->iomap; > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > + const u64 start = iomap->addr + iter->pos - iomap->offset; > + const u64 nr_bytes = iomap_length(iter); > + sector_t sector = start >> SECTOR_SHIFT; > + sector_t nr_sectors = nr_bytes >> SECTOR_SHIFT; > + int ret; > + > + if (!iomap->bdev) > + return -ECANCELED; > + > + /* The physical extent must be sector-aligned for block layer APIs. */ > + if ((start | nr_bytes) & (SECTOR_SIZE - 1)) > + return -EINVAL; > + > + /* Must be able to zero storage directly without fs intervention. */ > + if (iomap->flags & IOMAP_F_SHARED) > + return -ECANCELED; > + if (srcmap != iomap) > + return -ECANCELED; > + > + switch (iomap->type) { > + case IOMAP_MAPPED: > + ret = blkdev_issue_zeroout(iomap->bdev, sector, nr_sectors, > + GFP_KERNEL, 0); > + if (ret) > + return ret; > + fallthrough; > + case IOMAP_UNWRITTEN: > + return nr_bytes; > + } > + > + /* Reject holes, inline data, or delalloc extents. */ > + return -ECANCELED; Same comment here as in patch-1 which implements dax_zeroinit_iter(). -ritesh > +} > + > +/* > + * Use a storage device's accelerated zero-writing command to ensure the media > + * are ready to accept read and write commands. FSDAX is not supported. > + * > + * The range arguments must be aligned to sector size. The file must be backed > + * by a block device. The extents returned must not require copy on write (or > + * any other mapping interventions from the filesystem) and must be contiguous. > + * @done will be set to true if the reset succeeded. > + * > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > + * not support it, or the usual negative errno. > + */ > +int > +iomap_zeroout_range(struct inode *inode, loff_t pos, u64 len, > + const struct iomap_ops *ops) > +{ > + struct iomap_iter iter = { > + .inode = inode, > + .pos = pos, > + .len = len, > + .flags = IOMAP_REPORT, > + }; > + int ret; > + > + if (IS_DAX(inode)) > + return -EINVAL; > + if (pos + len > i_size_read(inode)) > + return -EINVAL; > + > + while ((ret = iomap_iter(&iter, ops)) > 0) > + iter.processed = iomap_zeroinit_iter(&iter); > + return ret; > +} > +EXPORT_SYMBOL_GPL(iomap_zeroout_range); > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 24f8489583ca..f4b9c6698388 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -339,6 +339,9 @@ struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > ssize_t iomap_dio_complete(struct iomap_dio *dio); > int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); > > +int iomap_zeroout_range(struct inode *inode, loff_t pos, u64 len, > + const struct iomap_ops *ops); > + > #ifdef CONFIG_SWAP > struct file; > struct swap_info_struct; >
On 21/09/17 06:31PM, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Add a new mode to fallocate to zero-initialize all the storage backing a > file. This patch looks pretty straight forward to me. Thanks -ritesh > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > --- > fs/open.c | 5 +++++ > include/linux/falloc.h | 1 + > include/uapi/linux/falloc.h | 9 +++++++++ > 3 files changed, 15 insertions(+) > > > diff --git a/fs/open.c b/fs/open.c > index daa324606a41..230220b8f67a 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > (mode & ~FALLOC_FL_INSERT_RANGE)) > return -EINVAL; > > + /* Zeroinit should only be used by itself and keep size must be set. */ > + if ((mode & FALLOC_FL_ZEROINIT_RANGE) && > + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE))) > + return -EINVAL; > + > /* Unshare range should only be used with allocate mode. */ > if ((mode & FALLOC_FL_UNSHARE_RANGE) && > (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) > diff --git a/include/linux/falloc.h b/include/linux/falloc.h > index f3f0b97b1675..4597b416667b 100644 > --- a/include/linux/falloc.h > +++ b/include/linux/falloc.h > @@ -29,6 +29,7 @@ struct space_resv { > FALLOC_FL_PUNCH_HOLE | \ > FALLOC_FL_COLLAPSE_RANGE | \ > FALLOC_FL_ZERO_RANGE | \ > + FALLOC_FL_ZEROINIT_RANGE | \ > FALLOC_FL_INSERT_RANGE | \ > FALLOC_FL_UNSHARE_RANGE) > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h > index 51398fa57f6c..8144403b6102 100644 > --- a/include/uapi/linux/falloc.h > +++ b/include/uapi/linux/falloc.h > @@ -77,4 +77,13 @@ > */ > #define FALLOC_FL_UNSHARE_RANGE 0x40 > > +/* > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by > + * writing zeros to it. Subsequent read and writes should not fail due to any > + * previous media errors. Blocks must be not be shared or require copy on > + * write. Holes and unwritten extents are left untouched. This mode must be > + * used with FALLOC_FL_KEEP_SIZE. > + */ > +#define FALLOC_FL_ZEROINIT_RANGE 0x80 > + > #endif /* _UAPI_FALLOC_H_ */ >
+cc linux-ext4 [Thread]: https://lore.kernel.org/linux-xfs/163192864476.417973.143014658064006895.stgit@magnolia/T/#t On 21/09/17 06:31PM, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Implement this new fallocate mode so that persistent memory users can, > upon receipt of a pmem poison notification, cause the pmem to be > reinitialized to a known value (zero) and clear any hardware poison > state that might be lurking. > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > --- > fs/ext4/extents.c | 93 +++++++++++++++++++++++++++++++++++++++++++ > include/trace/events/ext4.h | 7 +++ > 2 files changed, 99 insertions(+), 1 deletion(-) > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index c0de30f25185..c345002e2da6 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -29,6 +29,7 @@ > #include <linux/fiemap.h> > #include <linux/backing-dev.h> > #include <linux/iomap.h> > +#include <linux/dax.h> > #include "ext4_jbd2.h" > #include "ext4_extents.h" > #include "xattr.h" > @@ -4475,6 +4476,90 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len); > > static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len); > > +static long ext4_zeroinit_range(struct file *file, loff_t offset, loff_t len) > +{ > + struct inode *inode = file_inode(file); > + struct address_space *mapping = inode->i_mapping; > + handle_t *handle = NULL; > + loff_t end = offset + len; > + long ret; > + > + trace_ext4_zeroinit_range(inode, offset, len, > + FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE); > + > + /* We don't support data=journal mode */ > + if (ext4_should_journal_data(inode)) > + return -EOPNOTSUPP; > + > + inode_lock(inode); > + > + /* > + * Indirect files do not support unwritten extents > + */ > + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > + ret = -EOPNOTSUPP; > + goto out_mutex; > + } > + > + /* Wait all existing dio workers, newcomers will block on i_mutex */ > + inode_dio_wait(inode); > + > + /* > + * Prevent page faults from reinstantiating pages we have released from > + * page cache. > + */ > + filemap_invalidate_lock(mapping); > + > + ret = ext4_break_layouts(inode); > + if (ret) > + goto out_mmap; > + > + /* Now release the pages and zero block aligned part of pages */ > + truncate_pagecache_range(inode, offset, end - 1); > + inode->i_mtime = inode->i_ctime = current_time(inode); > + > + if (IS_DAX(inode)) > + ret = dax_zeroinit_range(inode, offset, len, > + &ext4_iomap_report_ops); > + else > + ret = iomap_zeroout_range(inode, offset, len, > + &ext4_iomap_report_ops); > + if (ret == -ECANCELED) > + ret = -EOPNOTSUPP; > + if (ret) > + goto out_mmap; > + > + /* > + * In worst case we have to writeout two nonadjacent unwritten > + * blocks and update the inode > + */ Is this comment true? We are actually not touching IOMAP_UNWRITTEN blocks no? So is there any need for journal transaction for this? We are essentially only writing to blocks which are already allocated on disk and zeroing it out in both dax_zeroinit_range() and iomap_zeroinit_range(). > + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); I guess credits is 1 here since only inode is getting modified. > + if (IS_ERR(handle)) { > + ret = PTR_ERR(handle); > + ext4_std_error(inode->i_sb, ret); > + goto out_mmap; > + } > + > + inode->i_mtime = inode->i_ctime = current_time(inode); > + ret = ext4_mark_inode_dirty(handle, inode); > + if (unlikely(ret)) > + goto out_handle; > + ext4_fc_track_range(handle, inode, offset >> inode->i_sb->s_blocksize_bits, > + (offset + len - 1) >> inode->i_sb->s_blocksize_bits); I am not sure whether we need ext4_fc_track_range() here? We are not doing any metadata operation except maybe updating inode timestamp right? -ritesh > + ext4_update_inode_fsync_trans(handle, inode, 1); > + > + if (file->f_flags & O_SYNC) > + ext4_handle_sync(handle); > + > +out_handle: > + ext4_journal_stop(handle); > +out_mmap: > + filemap_invalidate_unlock(mapping); > +out_mutex: > + inode_unlock(inode); > + return ret; > +} > + > static long ext4_zero_range(struct file *file, loff_t offset, > loff_t len, int mode) > { > @@ -4659,7 +4744,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > /* Return error if mode is not supported */ > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | > FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | > - FALLOC_FL_INSERT_RANGE)) > + FALLOC_FL_INSERT_RANGE | FALLOC_FL_ZEROINIT_RANGE)) > return -EOPNOTSUPP; > > ext4_fc_start_update(inode); > @@ -4687,6 +4772,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > ret = ext4_zero_range(file, offset, len, mode); > goto exit; > } > + > + if (mode & FALLOC_FL_ZEROINIT_RANGE) { > + ret = ext4_zeroinit_range(file, offset, len); > + goto exit; > + } > + > trace_ext4_fallocate_enter(inode, offset, len, mode); > lblk = offset >> blkbits; > > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h > index 0ea36b2b0662..282f1208067f 100644 > --- a/include/trace/events/ext4.h > +++ b/include/trace/events/ext4.h > @@ -1407,6 +1407,13 @@ DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range, > TP_ARGS(inode, offset, len, mode) > ); > > +DEFINE_EVENT(ext4__fallocate_mode, ext4_zeroinit_range, > + > + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), > + > + TP_ARGS(inode, offset, len, mode) > +); > + > TRACE_EVENT(ext4_fallocate_exit, > TP_PROTO(struct inode *inode, loff_t offset, > unsigned int max_blocks, int ret), >
On Fri, Sep 17, 2021 at 6:30 PM Darrick J. Wong <djwong@kernel.org> wrote: > > Hi all, > > Jane Chu has taken an interest in trying to fix the pmem poison recovery > story on Linux. Since I sort of had a half-baked patchset that seems to > contain some elements of what the reviewers of her patchset wanted, I'm > releasing this reworked version to see if it has any traction. > > Our current "advice" to people using persistent memory and FSDAX who > wish to recover upon receipt of a media error (aka 'hwpoison') event > from ACPI is to punch-hole that part of the file and then pwrite it, > which will magically cause the pmem to be reinitialized and the poison > to be cleared. > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > does not permit the caller to specify where to find blocks, which means > that we might not get the same pmem back. Not sure this is a driving concern. If you get the same pmem back it will have gone through a poison clearing cycle when it gets reallocated. Also, once the filesystem gets notified of error locations through Ruan's series the FS can avoid allocating blocks where poison clearing failed. > This pushes the user farther > away from the goal of reinitializing poisoned memory and leads to > complaints about unnecessary file fragmentation. Fragmentation though is a valid concern. > > AFAICT, the only reason why the "punch and write" dance works at all is > that the XFS and ext4 currently call blkdev_issue_zeroout when > allocating pmem ahead of a write call. Even a regular overwrite won't > clear the poison, because dax_direct_access is smart enough to bail out > on poisoned pmem, but not smart enough to clear it. Alignment constraints were the entanglement that kept DAX from poison clearing. This is similar to the dance you need to do to get a disk to remap a bad block, which needs an O_DIRECT write. It was also deemed messy to keep overloading writes this way. > To be fair, that > function maps pages and has no idea what kinds of reads and writes the > caller might want to perform. > > Therefore, clean up this whole mess by creating a dax_zeroinit_range > function that callers can use on poisoned persistent memory to reset the > contents of the persistent memory to a known state (all zeroes) and > clear any lingering poison state that might be lingering in the memory > controllers. Create a new fallocate mode to trigger this functionality, > then wire up XFS and ext4 to use it. For good measure, wire it up to > traditional storage if the storage has a fast way to zero LBA contents, > since we assume that those LBAs won't hit old media errors. Sounds good, I'll take a look at the rest.
On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote: > On 21/09/17 06:30PM, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Our current "advice" to people using persistent memory and FSDAX who > > wish to recover upon receipt of a media error (aka 'hwpoison') event > > from ACPI is to punch-hole that part of the file and then pwrite it, > > which will magically cause the pmem to be reinitialized and the poison > > to be cleared. > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > > does not permit the caller to specify where to find blocks, which means > > that we might not get the same pmem back. This pushes the user farther > > away from the goal of reinitializing poisoned memory and leads to > > complaints about unnecessary file fragmentation. > > > > AFAICT, the only reason why the "punch and write" dance works at all is > > that the XFS and ext4 currently call blkdev_issue_zeroout when > > allocating pmem ahead of a write call. Even a regular overwrite won't > > clear the poison, because dax_direct_access is smart enough to bail out > > on poisoned pmem, but not smart enough to clear it. To be fair, that > > function maps pages and has no idea what kinds of reads and writes the > > caller might want to perform. > > > > Therefore, create a dax_zeroinit_range function that filesystems can to > > reset the pmem contents to zero and clear hardware media error flags. > > This uses the dax page zeroing helper function, which should ensure that > > subsequent accesses will not trip over any pre-existing media errors. > > Thanks Darrick for such clear explaination of the problem and your solution. > As I see from this thread [1], it looks like we are heading in this direction, > so I thought of why not review this RFC patch series :) > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/ > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > --- > > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > include/linux/dax.h | 7 ++++ > > 2 files changed, 100 insertions(+) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 4e3e5a283a91..765b80d08605 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > > return dax_insert_pfn_mkwrite(vmf, pfn, order); > > } > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > > + > > +static loff_t > > +dax_zeroinit_iter(struct iomap_iter *iter) > > +{ > > + struct iomap *iomap = &iter->iomap; > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > + const u64 start = iomap->addr + iter->pos - iomap->offset; > > + const u64 nr_bytes = iomap_length(iter); > > + u64 start_page = start >> PAGE_SHIFT; > > + u64 nr_pages = nr_bytes >> PAGE_SHIFT; > > + int ret; > > + > > + if (!iomap->dax_dev) > > + return -ECANCELED; > > + > > + /* > > + * The physical extent must be page aligned because that's what the dax > > + * function requires. > > + */ > > + if (!PAGE_ALIGNED(start | nr_bytes)) > > + return -ECANCELED; > > + > > + /* > > + * The dax function, by using pgoff_t, is stuck with unsigned long, so > > + * we must check for overflows. > > + */ > > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) > > + return -ECANCELED; > > + > > + /* Must be able to zero storage directly without fs intervention. */ > > + if (iomap->flags & IOMAP_F_SHARED) > > + return -ECANCELED; > > + if (srcmap != iomap) > > + return -ECANCELED; > > + > > + switch (iomap->type) { > > + case IOMAP_MAPPED: > > + while (nr_pages > 0) { > > + /* XXX function only supports one page at a time?! */ > > + ret = dax_zero_page_range(iomap->dax_dev, start_page, > > + 1); > > + if (ret) > > + return ret; > > + start_page++; > > + nr_pages--; > > + } > > + > > + fallthrough; > > + case IOMAP_UNWRITTEN: > > + return nr_bytes; > > + } > > + > > + /* Reject holes, inline data, or delalloc extents. */ > > + return -ECANCELED; > > We reject holes here, but the other vfs plumbing patch [2] mentions > "Holes and unwritten extents are left untouched.". > Shouldn't we just return nr_bytes for IOMAP_HOLE case as well? I'm not entirely sure what we should do for holes and unwritten extents, as you can tell from the gross inconsistency between the comment and the code. :/ On block devices, I think we rely on the behavior that writing to disk will clear the device's error state (via LBA remapping or some other strategy). I think this means iomap_zeroinit can skip unwritten extents because reads and read faults will be satisfied from the zero page and writeback (or direct writes) will trigger the drive firmware. On FSDAX devices, reads are fulfilled by zeroing the user buffer, and read faults with the (dax) zero page. Writes and write faults won't clear the poison (unlike disks). So I guess this means that dax_zeroinit *does* have to act on unwritten areas. Ok. I'll make those changes. As for holes -- on the one hand, one could argue that zero-initializing a hole makes no sense and should be an error. OTOH one could make an equally compelling argument that it's merely a nop. Thoughts? --D > [2]: "vfs: add a zero-initialization mode to fallocate" > > Although I am not an expert in this area, but the rest of the patch looks > very well crafted to me. Thanks again for such details :) > > -ritesh > > > > > +} > > + > > +/* > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the > > + * media are ready to accept read and write commands. This requires the use of > > + * the dax layer's zero page range function to write zeroes to a pmem region > > + * and to reset any hardware media error state. > > + * > > + * The physical extents must be aligned to page size. The file must be backed > > + * by a pmem device. The extents returned must not require copy on write (or > > + * any other mapping interventions from the filesystem) and must be contiguous. > > + * @done will be set to true if the reset succeeded. > > + * > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > > + * not support it, or the usual negative errno. > > + */ > > +int > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > + const struct iomap_ops *ops) > > +{ > > + struct iomap_iter iter = { > > + .inode = inode, > > + .pos = pos, > > + .len = len, > > + .flags = IOMAP_REPORT, > > + }; > > + int ret; > > + > > + if (!IS_DAX(inode)) > > + return -EINVAL; > > + if (pos + len > i_size_read(inode)) > > + return -EINVAL; > > + > > + while ((ret = iomap_iter(&iter, ops)) > 0) > > + iter.processed = dax_zeroinit_iter(&iter); > > + return ret; > > +} > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range); > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > index 2619d94c308d..3c873f7c35ba 100644 > > --- a/include/linux/dax.h > > +++ b/include/linux/dax.h > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); > > struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); > > dax_entry_t dax_lock_page(struct page *page); > > void dax_unlock_page(struct page *page, dax_entry_t cookie); > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > + const struct iomap_ops *ops); > > #else > > #define generic_fsdax_supported NULL > > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) > > static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) > > { > > } > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > + const struct iomap_ops *ops) > > +{ > > + return -EOPNOTSUPP; > > +} > > #endif > > > > #if IS_ENABLED(CONFIG_DAX) > >
On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/open.c | 5 +++++
> include/linux/falloc.h | 1 +
> include/uapi/linux/falloc.h | 9 +++++++++
> 3 files changed, 15 insertions(+)
>
>
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> (mode & ~FALLOC_FL_INSERT_RANGE))
> return -EINVAL;
>
> + /* Zeroinit should only be used by itself and keep size must be set. */
> + if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> + return -EINVAL;
> +
> /* Unshare range should only be used with allocate mode. */
> if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
> FALLOC_FL_PUNCH_HOLE | \
> FALLOC_FL_COLLAPSE_RANGE | \
> FALLOC_FL_ZERO_RANGE | \
> + FALLOC_FL_ZEROINIT_RANGE | \
> FALLOC_FL_INSERT_RANGE | \
> FALLOC_FL_UNSHARE_RANGE)
>
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
> */
> #define FALLOC_FL_UNSHARE_RANGE 0x40
>
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it. Subsequent read and writes should not fail due to any
> + * previous media errors. Blocks must be not be shared or require copy on
> + * write. Holes and unwritten extents are left untouched. This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE 0x80
> +
How does this differ from ZERO_RANGE? Especially when the above says that
ZEROINIT_RANGE leaves unwritten extents untouched. That implies that unwritten
extents are still "allowed". If that's the case, why not just use ZERO_RANGE?
- Eric
On Mon, Sep 20, 2021 at 10:52:09AM -0700, Eric Biggers wrote: > On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Add a new mode to fallocate to zero-initialize all the storage backing a > > file. > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > --- > > fs/open.c | 5 +++++ > > include/linux/falloc.h | 1 + > > include/uapi/linux/falloc.h | 9 +++++++++ > > 3 files changed, 15 insertions(+) > > > > > > diff --git a/fs/open.c b/fs/open.c > > index daa324606a41..230220b8f67a 100644 > > --- a/fs/open.c > > +++ b/fs/open.c > > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > (mode & ~FALLOC_FL_INSERT_RANGE)) > > return -EINVAL; > > > > + /* Zeroinit should only be used by itself and keep size must be set. */ > > + if ((mode & FALLOC_FL_ZEROINIT_RANGE) && > > + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE))) > > + return -EINVAL; > > + > > /* Unshare range should only be used with allocate mode. */ > > if ((mode & FALLOC_FL_UNSHARE_RANGE) && > > (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE))) > > diff --git a/include/linux/falloc.h b/include/linux/falloc.h > > index f3f0b97b1675..4597b416667b 100644 > > --- a/include/linux/falloc.h > > +++ b/include/linux/falloc.h > > @@ -29,6 +29,7 @@ struct space_resv { > > FALLOC_FL_PUNCH_HOLE | \ > > FALLOC_FL_COLLAPSE_RANGE | \ > > FALLOC_FL_ZERO_RANGE | \ > > + FALLOC_FL_ZEROINIT_RANGE | \ > > FALLOC_FL_INSERT_RANGE | \ > > FALLOC_FL_UNSHARE_RANGE) > > > > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h > > index 51398fa57f6c..8144403b6102 100644 > > --- a/include/uapi/linux/falloc.h > > +++ b/include/uapi/linux/falloc.h > > @@ -77,4 +77,13 @@ > > */ > > #define FALLOC_FL_UNSHARE_RANGE 0x40 > > > > +/* > > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by > > + * writing zeros to it. Subsequent read and writes should not fail due to any > > + * previous media errors. Blocks must be not be shared or require copy on > > + * write. Holes and unwritten extents are left untouched. This mode must be > > + * used with FALLOC_FL_KEEP_SIZE. > > + */ > > +#define FALLOC_FL_ZEROINIT_RANGE 0x80 > > + > > How does this differ from ZERO_RANGE? Especially when the above says that > ZEROINIT_RANGE leaves unwritten extents untouched. That implies that unwritten > extents are still "allowed". If that's the case, why not just use ZERO_RANGE? (Note: I changed the second to last sentence to read "Holes are ignored, and inline data regions are not supported.") ZERO_RANGE only guarantees that a subsequent read returns all zeroes, which means that implementations are allowed to play games with the file mappings to make that happen with as little work as possible. ZEROINIT_RANGE implies that the filesystem doesn't change the mappings at all and actually writes/resets the mapped storage, though I didn't want to exclude the possibility that the filesystem could change the mapping as a last resort to guarantee that "subsequent read and writes should not fail" if the media write fails. I /could/ encode all that in the definition, but that feels like overspecifying the implementation. --D > - Eric
On Sat, Sep 18, 2021 at 10:37:57PM +0530, riteshh wrote: > +cc linux-ext4 > > [Thread]: https://lore.kernel.org/linux-xfs/163192864476.417973.143014658064006895.stgit@magnolia/T/#t > > On 21/09/17 06:31PM, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Implement this new fallocate mode so that persistent memory users can, > > upon receipt of a pmem poison notification, cause the pmem to be > > reinitialized to a known value (zero) and clear any hardware poison > > state that might be lurking. > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > --- > > fs/ext4/extents.c | 93 +++++++++++++++++++++++++++++++++++++++++++ > > include/trace/events/ext4.h | 7 +++ > > 2 files changed, 99 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > index c0de30f25185..c345002e2da6 100644 > > --- a/fs/ext4/extents.c > > +++ b/fs/ext4/extents.c > > @@ -29,6 +29,7 @@ > > #include <linux/fiemap.h> > > #include <linux/backing-dev.h> > > #include <linux/iomap.h> > > +#include <linux/dax.h> > > #include "ext4_jbd2.h" > > #include "ext4_extents.h" > > #include "xattr.h" > > @@ -4475,6 +4476,90 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len); > > > > static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len); > > > > +static long ext4_zeroinit_range(struct file *file, loff_t offset, loff_t len) > > +{ > > + struct inode *inode = file_inode(file); > > + struct address_space *mapping = inode->i_mapping; > > + handle_t *handle = NULL; > > + loff_t end = offset + len; > > + long ret; > > + > > + trace_ext4_zeroinit_range(inode, offset, len, > > + FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE); > > + > > + /* We don't support data=journal mode */ > > + if (ext4_should_journal_data(inode)) > > + return -EOPNOTSUPP; > > + > > + inode_lock(inode); > > + > > + /* > > + * Indirect files do not support unwritten extents > > + */ > > + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > > + ret = -EOPNOTSUPP; > > + goto out_mutex; > > + } > > + > > + /* Wait all existing dio workers, newcomers will block on i_mutex */ > > + inode_dio_wait(inode); > > + > > + /* > > + * Prevent page faults from reinstantiating pages we have released from > > + * page cache. > > + */ > > + filemap_invalidate_lock(mapping); > > + > > + ret = ext4_break_layouts(inode); > > + if (ret) > > + goto out_mmap; > > + > > + /* Now release the pages and zero block aligned part of pages */ > > + truncate_pagecache_range(inode, offset, end - 1); > > + inode->i_mtime = inode->i_ctime = current_time(inode); > > + > > + if (IS_DAX(inode)) > > + ret = dax_zeroinit_range(inode, offset, len, > > + &ext4_iomap_report_ops); > > + else > > + ret = iomap_zeroout_range(inode, offset, len, > > + &ext4_iomap_report_ops); > > + if (ret == -ECANCELED) > > + ret = -EOPNOTSUPP; > > + if (ret) > > + goto out_mmap; > > + > > + /* > > + * In worst case we have to writeout two nonadjacent unwritten > > + * blocks and update the inode > > + */ > > Is this comment true? We are actually not touching IOMAP_UNWRITTEN blocks no? > So is there any need for journal transaction for this? > We are essentially only writing to blocks which are already allocated on disk > and zeroing it out in both dax_zeroinit_range() and iomap_zeroinit_range(). Oops. Yeah, the comment is wrong. Deleted. > > + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); > > I guess credits is 1 here since only inode is getting modified. Yep. > > > + if (IS_ERR(handle)) { > > + ret = PTR_ERR(handle); > > + ext4_std_error(inode->i_sb, ret); > > + goto out_mmap; > > + } > > + > > + inode->i_mtime = inode->i_ctime = current_time(inode); > > + ret = ext4_mark_inode_dirty(handle, inode); > > + if (unlikely(ret)) > > + goto out_handle; > > + ext4_fc_track_range(handle, inode, offset >> inode->i_sb->s_blocksize_bits, > > + (offset + len - 1) >> inode->i_sb->s_blocksize_bits); > > I am not sure whether we need ext4_fc_track_range() here? > We are not doing any metadata operation except maybe updating inode timestamp > right? I wasn't sure what fastcommit needs to track about the range. Is it /only/ tracking changes to the file mapping? /me is sadly falling further and further behind on where ext4 is these days... :/ --D > > -ritesh > > > + ext4_update_inode_fsync_trans(handle, inode, 1); > > + > > + if (file->f_flags & O_SYNC) > > + ext4_handle_sync(handle); > > + > > +out_handle: > > + ext4_journal_stop(handle); > > +out_mmap: > > + filemap_invalidate_unlock(mapping); > > +out_mutex: > > + inode_unlock(inode); > > + return ret; > > +} > > + > > static long ext4_zero_range(struct file *file, loff_t offset, > > loff_t len, int mode) > > { > > @@ -4659,7 +4744,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > /* Return error if mode is not supported */ > > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | > > FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | > > - FALLOC_FL_INSERT_RANGE)) > > + FALLOC_FL_INSERT_RANGE | FALLOC_FL_ZEROINIT_RANGE)) > > return -EOPNOTSUPP; > > > > ext4_fc_start_update(inode); > > @@ -4687,6 +4772,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > ret = ext4_zero_range(file, offset, len, mode); > > goto exit; > > } > > + > > + if (mode & FALLOC_FL_ZEROINIT_RANGE) { > > + ret = ext4_zeroinit_range(file, offset, len); > > + goto exit; > > + } > > + > > trace_ext4_fallocate_enter(inode, offset, len, mode); > > lblk = offset >> blkbits; > > > > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h > > index 0ea36b2b0662..282f1208067f 100644 > > --- a/include/trace/events/ext4.h > > +++ b/include/trace/events/ext4.h > > @@ -1407,6 +1407,13 @@ DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range, > > TP_ARGS(inode, offset, len, mode) > > ); > > > > +DEFINE_EVENT(ext4__fallocate_mode, ext4_zeroinit_range, > > + > > + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), > > + > > + TP_ARGS(inode, offset, len, mode) > > +); > > + > > TRACE_EVENT(ext4_fallocate_exit, > > TP_PROTO(struct inode *inode, loff_t offset, > > unsigned int max_blocks, int ret), > >
On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a new mode to fallocate to zero-initialize all the storage backing a
> file.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/open.c | 5 +++++
> include/linux/falloc.h | 1 +
> include/uapi/linux/falloc.h | 9 +++++++++
> 3 files changed, 15 insertions(+)
>
>
> diff --git a/fs/open.c b/fs/open.c
> index daa324606a41..230220b8f67a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> (mode & ~FALLOC_FL_INSERT_RANGE))
> return -EINVAL;
>
> + /* Zeroinit should only be used by itself and keep size must be set. */
> + if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> + return -EINVAL;
> +
> /* Unshare range should only be used with allocate mode. */
> if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index f3f0b97b1675..4597b416667b 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -29,6 +29,7 @@ struct space_resv {
> FALLOC_FL_PUNCH_HOLE | \
> FALLOC_FL_COLLAPSE_RANGE | \
> FALLOC_FL_ZERO_RANGE | \
> + FALLOC_FL_ZEROINIT_RANGE | \
> FALLOC_FL_INSERT_RANGE | \
> FALLOC_FL_UNSHARE_RANGE)
>
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 51398fa57f6c..8144403b6102 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -77,4 +77,13 @@
> */
> #define FALLOC_FL_UNSHARE_RANGE 0x40
>
> +/*
> + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> + * writing zeros to it. Subsequent read and writes should not fail due to any
> + * previous media errors. Blocks must be not be shared or require copy on
> + * write. Holes and unwritten extents are left untouched. This mode must be
> + * used with FALLOC_FL_KEEP_SIZE.
> + */
> +#define FALLOC_FL_ZEROINIT_RANGE 0x80
Hmmmm.
I think this wants to be a behavioural modifier for existing
operations rather than an operation unto itself. i.e. similar to how
KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
the guarantees ALLOC provides userspace.
In this case, the change of behaviour over ZERO_RANGE is that we
want physical zeros to be written instead of the filesystem
optimising away the physical zeros by manipulating the layout
of the file.
There's been requests in the past for a way to make ALLOC also
behave like this - in the case that users want fully allocated space
to be preallocated so their applications don't take unwritten extent
conversion penalties on first writes. Databases are an example here,
where setup of a new WAL file isn't performance critical, but writes
to the WAL are and the WAL files are write-once. Hence they always
take unwritten conversion penalties and the only way around that is
to physically zero the files before use...
So it seems to me what we actually need here is a "write zeroes"
modifier to fallocate() operations to tell the filesystem that the
application really wants it to write zeroes over that range, not
just guarantee space has been physically allocated....
Then we have and API that looks like:
ALLOC - allocate space efficiently
ALLOC | INIT - allocate space by writing zeros to it
ZERO - zero data and preallocate space efficiently
ZERO | INIT - zero range by writing zeros to it
Which seems to cater for all the cases I know of where physically
writing zeros instead of allocating unwritten extents is the
preferred behaviour of fallocate()....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On 21/09/20 10:22AM, Darrick J. Wong wrote: > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote: > > On 21/09/17 06:30PM, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > Our current "advice" to people using persistent memory and FSDAX who > > > wish to recover upon receipt of a media error (aka 'hwpoison') event > > > from ACPI is to punch-hole that part of the file and then pwrite it, > > > which will magically cause the pmem to be reinitialized and the poison > > > to be cleared. > > > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > > > does not permit the caller to specify where to find blocks, which means > > > that we might not get the same pmem back. This pushes the user farther > > > away from the goal of reinitializing poisoned memory and leads to > > > complaints about unnecessary file fragmentation. > > > > > > AFAICT, the only reason why the "punch and write" dance works at all is > > > that the XFS and ext4 currently call blkdev_issue_zeroout when > > > allocating pmem ahead of a write call. Even a regular overwrite won't > > > clear the poison, because dax_direct_access is smart enough to bail out > > > on poisoned pmem, but not smart enough to clear it. To be fair, that > > > function maps pages and has no idea what kinds of reads and writes the > > > caller might want to perform. > > > > > > Therefore, create a dax_zeroinit_range function that filesystems can to > > > reset the pmem contents to zero and clear hardware media error flags. > > > This uses the dax page zeroing helper function, which should ensure that > > > subsequent accesses will not trip over any pre-existing media errors. > > > > Thanks Darrick for such clear explaination of the problem and your solution. > > As I see from this thread [1], it looks like we are heading in this direction, > > so I thought of why not review this RFC patch series :) > > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/ > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > > --- > > > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > include/linux/dax.h | 7 ++++ > > > 2 files changed, 100 insertions(+) > > > > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index 4e3e5a283a91..765b80d08605 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > > > return dax_insert_pfn_mkwrite(vmf, pfn, order); > > > } > > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > > > + > > > +static loff_t > > > +dax_zeroinit_iter(struct iomap_iter *iter) > > > +{ > > > + struct iomap *iomap = &iter->iomap; > > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > > + const u64 start = iomap->addr + iter->pos - iomap->offset; > > > + const u64 nr_bytes = iomap_length(iter); > > > + u64 start_page = start >> PAGE_SHIFT; > > > + u64 nr_pages = nr_bytes >> PAGE_SHIFT; > > > + int ret; > > > + > > > + if (!iomap->dax_dev) > > > + return -ECANCELED; > > > + > > > + /* > > > + * The physical extent must be page aligned because that's what the dax > > > + * function requires. > > > + */ > > > + if (!PAGE_ALIGNED(start | nr_bytes)) > > > + return -ECANCELED; > > > + > > > + /* > > > + * The dax function, by using pgoff_t, is stuck with unsigned long, so > > > + * we must check for overflows. > > > + */ > > > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) > > > + return -ECANCELED; > > > + > > > + /* Must be able to zero storage directly without fs intervention. */ > > > + if (iomap->flags & IOMAP_F_SHARED) > > > + return -ECANCELED; > > > + if (srcmap != iomap) > > > + return -ECANCELED; > > > + > > > + switch (iomap->type) { > > > + case IOMAP_MAPPED: > > > + while (nr_pages > 0) { > > > + /* XXX function only supports one page at a time?! */ > > > + ret = dax_zero_page_range(iomap->dax_dev, start_page, > > > + 1); > > > + if (ret) > > > + return ret; > > > + start_page++; > > > + nr_pages--; > > > + } > > > + > > > + fallthrough; > > > + case IOMAP_UNWRITTEN: > > > + return nr_bytes; > > > + } > > > + > > > + /* Reject holes, inline data, or delalloc extents. */ > > > + return -ECANCELED; > > > > We reject holes here, but the other vfs plumbing patch [2] mentions > > "Holes and unwritten extents are left untouched.". > > Shouldn't we just return nr_bytes for IOMAP_HOLE case as well? > > I'm not entirely sure what we should do for holes and unwritten extents, > as you can tell from the gross inconsistency between the comment and the > code. :/ > > On block devices, I think we rely on the behavior that writing to disk > will clear the device's error state (via LBA remapping or some other > strategy). I think this means iomap_zeroinit can skip unwritten extents > because reads and read faults will be satisfied from the zero page and > writeback (or direct writes) will trigger the drive firmware. > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and > read faults with the (dax) zero page. Writes and write faults won't > clear the poison (unlike disks). So I guess this means that > dax_zeroinit *does* have to act on unwritten areas. > > Ok. I'll make those changes. Yes, I guess unwritten extents still have extents blocks allocated with generally a bit marked (to mark it as unwritten). So there could still be a need to clear the poison for this in case of DAX. > > As for holes -- on the one hand, one could argue that zero-initializing > a hole makes no sense and should be an error. OTOH one could make an > equally compelling argument that it's merely a nop. Thoughts? So in case of holes consider this case (please correct if any of my understanding below is wrong/incomplete). If we have a large hole and if someone tries to do write to that area. 1. Then from what I understood from the code FS will try and allocate some disk blocks (could these blocks be marked with HWpoison as FS has no way of knowing it?). 2. If yes, then after allocating those blocks dax_direct_access will fail (as you had mentioned above). But it won't clear the HWposion. 3. Then the user again will have to clear using this API. But that is only for a given extent which is some part of the hole which FS allocated. Now above could be repeated until the entire hole range is covered. Is that above understanding correct? If yes, then maybe it all depends on what sort of gurantee the API is providing. If using the API on the given range guarantees that the entire file range will not have any blocks with HWpoison then I guess we may have to cover the IOMAP_HOLE case as well? If not, then maybe we could explicitly mentioned this in the API documentation. Please help correct if any of above does not make any sense. It will help me understand this use case better. -ritesh > > --D > > > [2]: "vfs: add a zero-initialization mode to fallocate" > > > > Although I am not an expert in this area, but the rest of the patch looks > > very well crafted to me. Thanks again for such details :) > > > > -ritesh > > > > > > > > +} > > > + > > > +/* > > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the > > > + * media are ready to accept read and write commands. This requires the use of > > > + * the dax layer's zero page range function to write zeroes to a pmem region > > > + * and to reset any hardware media error state. > > > + * > > > + * The physical extents must be aligned to page size. The file must be backed > > > + * by a pmem device. The extents returned must not require copy on write (or > > > + * any other mapping interventions from the filesystem) and must be contiguous. > > > + * @done will be set to true if the reset succeeded. > > > + * > > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > > > + * not support it, or the usual negative errno. > > > + */ > > > +int > > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > + const struct iomap_ops *ops) > > > +{ > > > + struct iomap_iter iter = { > > > + .inode = inode, > > > + .pos = pos, > > > + .len = len, > > > + .flags = IOMAP_REPORT, > > > + }; > > > + int ret; > > > + > > > + if (!IS_DAX(inode)) > > > + return -EINVAL; > > > + if (pos + len > i_size_read(inode)) > > > + return -EINVAL; > > > + > > > + while ((ret = iomap_iter(&iter, ops)) > 0) > > > + iter.processed = dax_zeroinit_iter(&iter); > > > + return ret; > > > +} > > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range); > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > > index 2619d94c308d..3c873f7c35ba 100644 > > > --- a/include/linux/dax.h > > > +++ b/include/linux/dax.h > > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); > > > struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); > > > dax_entry_t dax_lock_page(struct page *page); > > > void dax_unlock_page(struct page *page, dax_entry_t cookie); > > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > + const struct iomap_ops *ops); > > > #else > > > #define generic_fsdax_supported NULL > > > > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) > > > static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) > > > { > > > } > > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > + const struct iomap_ops *ops) > > > +{ > > > + return -EOPNOTSUPP; > > > +} > > > #endif > > > > > > #if IS_ENABLED(CONFIG_DAX) > > >
On 21/09/20 11:11AM, Darrick J. Wong wrote: > On Sat, Sep 18, 2021 at 10:37:57PM +0530, riteshh wrote: > > +cc linux-ext4 > > > > [Thread]: https://lore.kernel.org/linux-xfs/163192864476.417973.143014658064006895.stgit@magnolia/T/#t > > > > On 21/09/17 06:31PM, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > Implement this new fallocate mode so that persistent memory users can, > > > upon receipt of a pmem poison notification, cause the pmem to be > > > reinitialized to a known value (zero) and clear any hardware poison > > > state that might be lurking. > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > > --- > > > fs/ext4/extents.c | 93 +++++++++++++++++++++++++++++++++++++++++++ > > > include/trace/events/ext4.h | 7 +++ > > > 2 files changed, 99 insertions(+), 1 deletion(-) > > > > > > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > > index c0de30f25185..c345002e2da6 100644 > > > --- a/fs/ext4/extents.c > > > +++ b/fs/ext4/extents.c > > > @@ -29,6 +29,7 @@ > > > #include <linux/fiemap.h> > > > #include <linux/backing-dev.h> > > > #include <linux/iomap.h> > > > +#include <linux/dax.h> > > > #include "ext4_jbd2.h" > > > #include "ext4_extents.h" > > > #include "xattr.h" > > > @@ -4475,6 +4476,90 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len); > > > > > > static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len); > > > > > > +static long ext4_zeroinit_range(struct file *file, loff_t offset, loff_t len) > > > +{ > > > + struct inode *inode = file_inode(file); > > > + struct address_space *mapping = inode->i_mapping; > > > + handle_t *handle = NULL; > > > + loff_t end = offset + len; > > > + long ret; > > > + > > > + trace_ext4_zeroinit_range(inode, offset, len, > > > + FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE); > > > + > > > + /* We don't support data=journal mode */ > > > + if (ext4_should_journal_data(inode)) > > > + return -EOPNOTSUPP; > > > + > > > + inode_lock(inode); > > > + > > > + /* > > > + * Indirect files do not support unwritten extents > > > + */ > > > + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > > > + ret = -EOPNOTSUPP; > > > + goto out_mutex; > > > + } > > > + > > > + /* Wait all existing dio workers, newcomers will block on i_mutex */ > > > + inode_dio_wait(inode); > > > + > > > + /* > > > + * Prevent page faults from reinstantiating pages we have released from > > > + * page cache. > > > + */ > > > + filemap_invalidate_lock(mapping); > > > + > > > + ret = ext4_break_layouts(inode); > > > + if (ret) > > > + goto out_mmap; > > > + > > > + /* Now release the pages and zero block aligned part of pages */ > > > + truncate_pagecache_range(inode, offset, end - 1); > > > + inode->i_mtime = inode->i_ctime = current_time(inode); > > > + > > > + if (IS_DAX(inode)) > > > + ret = dax_zeroinit_range(inode, offset, len, > > > + &ext4_iomap_report_ops); > > > + else > > > + ret = iomap_zeroout_range(inode, offset, len, > > > + &ext4_iomap_report_ops); > > > + if (ret == -ECANCELED) > > > + ret = -EOPNOTSUPP; > > > + if (ret) > > > + goto out_mmap; > > > + > > > + /* > > > + * In worst case we have to writeout two nonadjacent unwritten > > > + * blocks and update the inode > > > + */ > > > > Is this comment true? We are actually not touching IOMAP_UNWRITTEN blocks no? > > So is there any need for journal transaction for this? > > We are essentially only writing to blocks which are already allocated on disk > > and zeroing it out in both dax_zeroinit_range() and iomap_zeroinit_range(). > > Oops. Yeah, the comment is wrong. Deleted. > > > > + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1); > > > > I guess credits is 1 here since only inode is getting modified. > > Yep. > > > > > > + if (IS_ERR(handle)) { > > > + ret = PTR_ERR(handle); > > > + ext4_std_error(inode->i_sb, ret); > > > + goto out_mmap; > > > + } > > > + > > > + inode->i_mtime = inode->i_ctime = current_time(inode); > > > + ret = ext4_mark_inode_dirty(handle, inode); > > > + if (unlikely(ret)) > > > + goto out_handle; > > > + ext4_fc_track_range(handle, inode, offset >> inode->i_sb->s_blocksize_bits, > > > + (offset + len - 1) >> inode->i_sb->s_blocksize_bits); > > > > I am not sure whether we need ext4_fc_track_range() here? > > We are not doing any metadata operation except maybe updating inode timestamp > > right? > > I wasn't sure what fastcommit needs to track about the range. Is it > /only/ tracking changes to the file mapping? cc'ing Harshad as well (to correct if any of below is incorrect/incomplete). I guess so yes, from reading this [1]. ext4_fc_track_range() - Track _changed_ logical block offsets inodes For only inode update, I guess fastcommit uses ext4_fc_track_inode(), which is implicit when we call ext4_mark_inode_dirty(). [1]: https://lore.kernel.org/all/20201015203802.3597742-6-harshadshirwadkar@gmail.com/ > > /me is sadly falling further and further behind on where ext4 is these > days... :/ > > > > > > + ext4_update_inode_fsync_trans(handle, inode, 1); Then do we even need above? -ritesh > > > + > > > + if (file->f_flags & O_SYNC) > > > + ext4_handle_sync(handle); > > > + > > > +out_handle: > > > + ext4_journal_stop(handle); > > > +out_mmap: > > > + filemap_invalidate_unlock(mapping); > > > +out_mutex: > > > + inode_unlock(inode); > > > + return ret; > > > +} > > > + > > > static long ext4_zero_range(struct file *file, loff_t offset, > > > loff_t len, int mode) > > > { > > > @@ -4659,7 +4744,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > > /* Return error if mode is not supported */ > > > if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | > > > FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE | > > > - FALLOC_FL_INSERT_RANGE)) > > > + FALLOC_FL_INSERT_RANGE | FALLOC_FL_ZEROINIT_RANGE)) > > > return -EOPNOTSUPP; > > > > > > ext4_fc_start_update(inode); > > > @@ -4687,6 +4772,12 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > > > ret = ext4_zero_range(file, offset, len, mode); > > > goto exit; > > > } > > > + > > > + if (mode & FALLOC_FL_ZEROINIT_RANGE) { > > > + ret = ext4_zeroinit_range(file, offset, len); > > > + goto exit; > > > + } > > > + > > > trace_ext4_fallocate_enter(inode, offset, len, mode); > > > lblk = offset >> blkbits; > > > > > > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h > > > index 0ea36b2b0662..282f1208067f 100644 > > > --- a/include/trace/events/ext4.h > > > +++ b/include/trace/events/ext4.h > > > @@ -1407,6 +1407,13 @@ DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range, > > > TP_ARGS(inode, offset, len, mode) > > > ); > > > > > > +DEFINE_EVENT(ext4__fallocate_mode, ext4_zeroinit_range, > > > + > > > + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), > > > + > > > + TP_ARGS(inode, offset, len, mode) > > > +); > > > + > > > TRACE_EVENT(ext4_fallocate_exit, > > > TP_PROTO(struct inode *inode, loff_t offset, > > > unsigned int max_blocks, int ret), > > >
On Fri, Sep 17, 2021 at 06:30:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create a function that ensures that the storage backing part of a file
> contains zeroes and will not trip over old media errors if the contents
> are re-read.
I don't think this has anything to do with direct I/O, so I'd rather
not have it clutter direct-io.c. Also do we really want to wait
synchronously for every bio instead of batching them up? Especially
as a simple bio_chain is probably all that is needed.
On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > I think this wants to be a behavioural modifier for existing > operations rather than an operation unto itself. i.e. similar to how > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > the guarantees ALLOC provides userspace. > > In this case, the change of behaviour over ZERO_RANGE is that we > want physical zeros to be written instead of the filesystem > optimising away the physical zeros by manipulating the layout > of the file. Yes. > Then we have and API that looks like: > > ALLOC - allocate space efficiently > ALLOC | INIT - allocate space by writing zeros to it > ZERO - zero data and preallocate space efficiently > ZERO | INIT - zero range by writing zeros to it > > Which seems to cater for all the cases I know of where physically > writing zeros instead of allocating unwritten extents is the > preferred behaviour of fallocate().... Agreed. I'm not sure INIT is really the right name, but I can't come up with a better idea offhand.
On Fri, Sep 17, 2021 at 06:30:50PM -0700, Darrick J. Wong wrote:
> + case IOMAP_MAPPED:
> + while (nr_pages > 0) {
> + /* XXX function only supports one page at a time?! */
> + ret = dax_zero_page_range(iomap->dax_dev, start_page,
> + 1);
Yes. Given that it will have to kmap every page that kinda makes sense.
Unlike a nr_pages argument which needs to be 1, which is just silly.
That being said it would make sense to move the trivial loop over the
pages into the methods to reduce the indirect function call overhead
On Fri, Sep 17, 2021 at 06:30:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create a function that ensures that the storage backing part of a file
> contains zeroes and will not trip over old media errors if the contents
> are re-read.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/iomap/direct-io.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/iomap.h | 3 ++
> 2 files changed, 78 insertions(+)
>
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 4ecd255e0511..48826a49f976 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -652,3 +652,78 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> return iomap_dio_complete(dio);
> }
> EXPORT_SYMBOL_GPL(iomap_dio_rw);
> +
> +static loff_t
> +iomap_zeroinit_iter(struct iomap_iter *iter)
> +{
> + struct iomap *iomap = &iter->iomap;
> + const struct iomap *srcmap = iomap_iter_srcmap(iter);
> + const u64 start = iomap->addr + iter->pos - iomap->offset;
> + const u64 nr_bytes = iomap_length(iter);
> + sector_t sector = start >> SECTOR_SHIFT;
> + sector_t nr_sectors = nr_bytes >> SECTOR_SHIFT;
> + int ret;
> +
> + if (!iomap->bdev)
> + return -ECANCELED;
> +
> + /* The physical extent must be sector-aligned for block layer APIs. */
> + if ((start | nr_bytes) & (SECTOR_SIZE - 1))
> + return -EINVAL;
> +
> + /* Must be able to zero storage directly without fs intervention. */
> + if (iomap->flags & IOMAP_F_SHARED)
> + return -ECANCELED;
> + if (srcmap != iomap)
> + return -ECANCELED;
> +
> + switch (iomap->type) {
> + case IOMAP_MAPPED:
> + ret = blkdev_issue_zeroout(iomap->bdev, sector, nr_sectors,
> + GFP_KERNEL, 0);
Pretty sure this needs to use BLKDEV_ZERO_NOUNMAP.
The whole point of this is having zeroed space allocated ready for
write on return, so having the hardware optimise away the physical
storage zeroing by punching a hole in it's backing store and then
potentially getting ENOSPC on the next write to this range would be
.... suboptimal.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > I think this wants to be a behavioural modifier for existing
> > operations rather than an operation unto itself. i.e. similar to how
> > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > the guarantees ALLOC provides userspace.
> >
> > In this case, the change of behaviour over ZERO_RANGE is that we
> > want physical zeros to be written instead of the filesystem
> > optimising away the physical zeros by manipulating the layout
> > of the file.
>
> Yes.
>
> > Then we have and API that looks like:
> >
> > ALLOC - allocate space efficiently
> > ALLOC | INIT - allocate space by writing zeros to it
> > ZERO - zero data and preallocate space efficiently
> > ZERO | INIT - zero range by writing zeros to it
> >
> > Which seems to cater for all the cases I know of where physically
> > writing zeros instead of allocating unwritten extents is the
> > preferred behaviour of fallocate()....
>
> Agreed. I'm not sure INIT is really the right name, but I can't come
> up with a better idea offhand.
FUA? As in, this is a forced-unit-access zeroing all the way to media
bypassing any mechanisms to emulate zero-filled payloads on future
reads.
On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > I think this wants to be a behavioural modifier for existing
> > > operations rather than an operation unto itself. i.e. similar to how
> > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > the guarantees ALLOC provides userspace.
> > >
> > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > want physical zeros to be written instead of the filesystem
> > > optimising away the physical zeros by manipulating the layout
> > > of the file.
> >
> > Yes.
> >
> > > Then we have and API that looks like:
> > >
> > > ALLOC - allocate space efficiently
> > > ALLOC | INIT - allocate space by writing zeros to it
> > > ZERO - zero data and preallocate space efficiently
> > > ZERO | INIT - zero range by writing zeros to it
> > >
> > > Which seems to cater for all the cases I know of where physically
> > > writing zeros instead of allocating unwritten extents is the
> > > preferred behaviour of fallocate()....
> >
> > Agreed. I'm not sure INIT is really the right name, but I can't come
> > up with a better idea offhand.
>
> FUA? As in, this is a forced-unit-access zeroing all the way to media
> bypassing any mechanisms to emulate zero-filled payloads on future
> reads.
FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
already exists at that file range?
--D
On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > I think this wants to be a behavioural modifier for existing > > > > operations rather than an operation unto itself. i.e. similar to how > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > the guarantees ALLOC provides userspace. > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > want physical zeros to be written instead of the filesystem > > > > optimising away the physical zeros by manipulating the layout > > > > of the file. > > > > > > Yes. > > > > > > > Then we have and API that looks like: > > > > > > > > ALLOC - allocate space efficiently > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > ZERO - zero data and preallocate space efficiently > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > writing zeros instead of allocating unwritten extents is the > > > > preferred behaviour of fallocate().... > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > up with a better idea offhand. > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > bypassing any mechanisms to emulate zero-filled payloads on future > > reads. Yes, that's the semantic we want, but FUA already defines specific data integrity behaviour in the storage stack w.r.t. volatile caches. Also, FUA is associated with devices - it's low level storage jargon and so is not really appropriate to call a user interface operation FUA where users have no idea what a "unit" or "access" actually means. Hence we should not overload this name with some other operation that does not have (and should not have) explicit data integrity requirements. That will just cause confusion for everyone. > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > already exists at that file range? IMO that doesn't work as a behavioural modifier for ALLOC because the ALLOC semantics are explicitly "don't touch existing user data"... Cheers, Dave. -- Dave Chinner david@fromorbit.com
On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote: > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > > I think this wants to be a behavioural modifier for existing > > > > > operations rather than an operation unto itself. i.e. similar to how > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > > the guarantees ALLOC provides userspace. > > > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > > want physical zeros to be written instead of the filesystem > > > > > optimising away the physical zeros by manipulating the layout > > > > > of the file. > > > > > > > > Yes. > > > > > > > > > Then we have and API that looks like: > > > > > > > > > > ALLOC - allocate space efficiently > > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > > ZERO - zero data and preallocate space efficiently > > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > > writing zeros instead of allocating unwritten extents is the > > > > > preferred behaviour of fallocate().... > > > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > > up with a better idea offhand. > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > > bypassing any mechanisms to emulate zero-filled payloads on future > > > reads. > > Yes, that's the semantic we want, but FUA already defines specific > data integrity behaviour in the storage stack w.r.t. volatile > caches. > > Also, FUA is associated with devices - it's low level storage jargon > and so is not really appropriate to call a user interface operation > FUA where users have no idea what a "unit" or "access" actually > means. > > Hence we should not overload this name with some other operation > that does not have (and should not have) explicit data integrity > requirements. That will just cause confusion for everyone. > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > > already exists at that file range? > > IMO that doesn't work as a behavioural modifier for ALLOC because > the ALLOC semantics are explicitly "don't touch existing user > data"... Well since you can't preallocate /and/ zerorange at the same time... /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */ #define FALLOC_FL_ZERO_EXISTING (0x80) /* For preallocation, allocate written extents and set the contents to * zeroes. */ #define FALLOC_FL_ALLOC_WRITE_ZEROES (0x80) --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On 21/09/21 10:44AM, Dave Chinner wrote:
> On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add a new mode to fallocate to zero-initialize all the storage backing a
> > file.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/open.c | 5 +++++
> > include/linux/falloc.h | 1 +
> > include/uapi/linux/falloc.h | 9 +++++++++
> > 3 files changed, 15 insertions(+)
> >
> >
> > diff --git a/fs/open.c b/fs/open.c
> > index daa324606a41..230220b8f67a 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> > (mode & ~FALLOC_FL_INSERT_RANGE))
> > return -EINVAL;
> >
> > + /* Zeroinit should only be used by itself and keep size must be set. */
> > + if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> > + (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> > + return -EINVAL;
> > +
> > /* Unshare range should only be used with allocate mode. */
> > if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> > (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..4597b416667b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -29,6 +29,7 @@ struct space_resv {
> > FALLOC_FL_PUNCH_HOLE | \
> > FALLOC_FL_COLLAPSE_RANGE | \
> > FALLOC_FL_ZERO_RANGE | \
> > + FALLOC_FL_ZEROINIT_RANGE | \
> > FALLOC_FL_INSERT_RANGE | \
> > FALLOC_FL_UNSHARE_RANGE)
> >
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..8144403b6102 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,13 @@
> > */
> > #define FALLOC_FL_UNSHARE_RANGE 0x40
> >
> > +/*
> > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> > + * writing zeros to it. Subsequent read and writes should not fail due to any
> > + * previous media errors. Blocks must be not be shared or require copy on
> > + * write. Holes and unwritten extents are left untouched. This mode must be
> > + * used with FALLOC_FL_KEEP_SIZE.
> > + */
> > +#define FALLOC_FL_ZEROINIT_RANGE 0x80
>
> Hmmmm.
>
> I think this wants to be a behavioural modifier for existing
> operations rather than an operation unto itself. i.e. similar to how
> KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> the guarantees ALLOC provides userspace.
>
> In this case, the change of behaviour over ZERO_RANGE is that we
> want physical zeros to be written instead of the filesystem
> optimising away the physical zeros by manipulating the layout
> of the file.
>
> There's been requests in the past for a way to make ALLOC also
> behave like this - in the case that users want fully allocated space
> to be preallocated so their applications don't take unwritten extent
> conversion penalties on first writes. Databases are an example here,
> where setup of a new WAL file isn't performance critical, but writes
> to the WAL are and the WAL files are write-once. Hence they always
> take unwritten conversion penalties and the only way around that is
> to physically zero the files before use...
>
> So it seems to me what we actually need here is a "write zeroes"
> modifier to fallocate() operations to tell the filesystem that the
> application really wants it to write zeroes over that range, not
> just guarantee space has been physically allocated....
>
> Then we have and API that looks like:
>
> ALLOC - allocate space efficiently
> ALLOC | INIT - allocate space by writing zeros to it
> ZERO - zero data and preallocate space efficiently
> ZERO | INIT - zero range by writing zeros to it
>
> Which seems to cater for all the cases I know of where physically
> writing zeros instead of allocating unwritten extents is the
> preferred behaviour of fallocate()....
>
If that's the case we can just have FALLOC_FL_ZEROWRITE_RANGE?
Where FALLOC_FL_ZERO_RANGE & FALLOC_FL_ZEROWRITE_RANGE are mutually exclusive.
AFAIU,
/* FALLOC_FL_ZERO_RANGE may optimize the underlying blocks with unwritten
* extents if the filesystem allows so, but with FALLOC_FL_ZEROWRITE_RANGE,
* the underlying blocks are guranteed to be written with zeros.
* In case of hole it will be preallocated with written extents and will be
* initialized with zeroes. If FALLOC_FL_KEEP_SIZE is specified then the
* inode size will remain the same.
*
* Essentially similar to FALLOC_FL_ZERO_RANGE but with gurantees that
* underlying storage has written extents initialized with zeroes.
*/
#define FALLOC_FL_ZEROWRITE_RANGE 0x80
Does that make sense?
-ritesh
On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote:
> > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote:
> > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote:
> > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote:
> > > > > > I think this wants to be a behavioural modifier for existing
> > > > > > operations rather than an operation unto itself. i.e. similar to how
> > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> > > > > > the guarantees ALLOC provides userspace.
> > > > > >
> > > > > > In this case, the change of behaviour over ZERO_RANGE is that we
> > > > > > want physical zeros to be written instead of the filesystem
> > > > > > optimising away the physical zeros by manipulating the layout
> > > > > > of the file.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Then we have and API that looks like:
> > > > > >
> > > > > > ALLOC - allocate space efficiently
> > > > > > ALLOC | INIT - allocate space by writing zeros to it
> > > > > > ZERO - zero data and preallocate space efficiently
> > > > > > ZERO | INIT - zero range by writing zeros to it
> > > > > >
> > > > > > Which seems to cater for all the cases I know of where physically
> > > > > > writing zeros instead of allocating unwritten extents is the
> > > > > > preferred behaviour of fallocate()....
> > > > >
> > > > > Agreed. I'm not sure INIT is really the right name, but I can't come
> > > > > up with a better idea offhand.
> > > >
> > > > FUA? As in, this is a forced-unit-access zeroing all the way to media
> > > > bypassing any mechanisms to emulate zero-filled payloads on future
> > > > reads.
> >
> > Yes, that's the semantic we want, but FUA already defines specific
> > data integrity behaviour in the storage stack w.r.t. volatile
> > caches.
> >
> > Also, FUA is associated with devices - it's low level storage jargon
> > and so is not really appropriate to call a user interface operation
> > FUA where users have no idea what a "unit" or "access" actually
> > means.
> >
> > Hence we should not overload this name with some other operation
> > that does not have (and should not have) explicit data integrity
> > requirements. That will just cause confusion for everyone.
> >
> > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that
> > > already exists at that file range?
> >
> > IMO that doesn't work as a behavioural modifier for ALLOC because
> > the ALLOC semantics are explicitly "don't touch existing user
> > data"...
>
> Well since you can't preallocate /and/ zerorange at the same time...
>
> /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */
> #define FALLOC_FL_ZERO_EXISTING (0x80)
Except we also want the newly allocated regions (i.e. where holes
were) in that range being zeroed to have zeroes written to them as
well, yes? Otherwise we end up with a combination of unwritten
extents and physical zeroes, and you can't use
ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT
/*
* For preallocation and zeroing operations, force the filesystem to
* write zeroes rather than use unwritten extents to indicate the
* range contains zeroes.
*
* For filesystems that support unwritten extents, this trades off
* slow fallocate performance for faster first write performance as
* unwritten extent conversion on the first write to each block in
* the range is not needed.
*
* Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will
* be much slower overall for large ranges and/or slow storage
* compared to using unwritten extents.
*/
#define FALLOC_FL_ALLOC_INIT_DATA (1 << 7)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On Tue, Sep 21, 2021 at 09:34:55AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 06:30:50PM -0700, Darrick J. Wong wrote:
> > + case IOMAP_MAPPED:
> > + while (nr_pages > 0) {
> > + /* XXX function only supports one page at a time?! */
> > + ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > + 1);
>
> Yes. Given that it will have to kmap every page that kinda makes sense.
> Unlike a nr_pages argument which needs to be 1, which is just silly.
> That being said it would make sense to move the trivial loop over the
> pages into the methods to reduce the indirect function call overhead
Done. AFAICT all the implementations *except* nvdimm can handle more
than one page.
--D
On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote: > On 21/09/20 10:22AM, Darrick J. Wong wrote: > > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote: > > > On 21/09/17 06:30PM, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > > > Our current "advice" to people using persistent memory and FSDAX who > > > > wish to recover upon receipt of a media error (aka 'hwpoison') event > > > > from ACPI is to punch-hole that part of the file and then pwrite it, > > > > which will magically cause the pmem to be reinitialized and the poison > > > > to be cleared. > > > > > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > > > > does not permit the caller to specify where to find blocks, which means > > > > that we might not get the same pmem back. This pushes the user farther > > > > away from the goal of reinitializing poisoned memory and leads to > > > > complaints about unnecessary file fragmentation. > > > > > > > > AFAICT, the only reason why the "punch and write" dance works at all is > > > > that the XFS and ext4 currently call blkdev_issue_zeroout when > > > > allocating pmem ahead of a write call. Even a regular overwrite won't > > > > clear the poison, because dax_direct_access is smart enough to bail out > > > > on poisoned pmem, but not smart enough to clear it. To be fair, that > > > > function maps pages and has no idea what kinds of reads and writes the > > > > caller might want to perform. > > > > > > > > Therefore, create a dax_zeroinit_range function that filesystems can to > > > > reset the pmem contents to zero and clear hardware media error flags. > > > > This uses the dax page zeroing helper function, which should ensure that > > > > subsequent accesses will not trip over any pre-existing media errors. > > > > > > Thanks Darrick for such clear explaination of the problem and your solution. > > > As I see from this thread [1], it looks like we are heading in this direction, > > > so I thought of why not review this RFC patch series :) > > > > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/ > > > > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > > > --- > > > > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > include/linux/dax.h | 7 ++++ > > > > 2 files changed, 100 insertions(+) > > > > > > > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > > index 4e3e5a283a91..765b80d08605 100644 > > > > --- a/fs/dax.c > > > > +++ b/fs/dax.c > > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > > > > return dax_insert_pfn_mkwrite(vmf, pfn, order); > > > > } > > > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > > > > + > > > > +static loff_t > > > > +dax_zeroinit_iter(struct iomap_iter *iter) > > > > +{ > > > > + struct iomap *iomap = &iter->iomap; > > > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > > > + const u64 start = iomap->addr + iter->pos - iomap->offset; > > > > + const u64 nr_bytes = iomap_length(iter); > > > > + u64 start_page = start >> PAGE_SHIFT; > > > > + u64 nr_pages = nr_bytes >> PAGE_SHIFT; > > > > + int ret; > > > > + > > > > + if (!iomap->dax_dev) > > > > + return -ECANCELED; > > > > + > > > > + /* > > > > + * The physical extent must be page aligned because that's what the dax > > > > + * function requires. > > > > + */ > > > > + if (!PAGE_ALIGNED(start | nr_bytes)) > > > > + return -ECANCELED; > > > > + > > > > + /* > > > > + * The dax function, by using pgoff_t, is stuck with unsigned long, so > > > > + * we must check for overflows. > > > > + */ > > > > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) > > > > + return -ECANCELED; > > > > + > > > > + /* Must be able to zero storage directly without fs intervention. */ > > > > + if (iomap->flags & IOMAP_F_SHARED) > > > > + return -ECANCELED; > > > > + if (srcmap != iomap) > > > > + return -ECANCELED; > > > > + > > > > + switch (iomap->type) { > > > > + case IOMAP_MAPPED: > > > > + while (nr_pages > 0) { > > > > + /* XXX function only supports one page at a time?! */ > > > > + ret = dax_zero_page_range(iomap->dax_dev, start_page, > > > > + 1); > > > > + if (ret) > > > > + return ret; > > > > + start_page++; > > > > + nr_pages--; > > > > + } > > > > + > > > > + fallthrough; > > > > + case IOMAP_UNWRITTEN: > > > > + return nr_bytes; > > > > + } > > > > + > > > > + /* Reject holes, inline data, or delalloc extents. */ > > > > + return -ECANCELED; > > > > > > We reject holes here, but the other vfs plumbing patch [2] mentions > > > "Holes and unwritten extents are left untouched.". > > > Shouldn't we just return nr_bytes for IOMAP_HOLE case as well? > > > > I'm not entirely sure what we should do for holes and unwritten extents, > > as you can tell from the gross inconsistency between the comment and the > > code. :/ > > > > On block devices, I think we rely on the behavior that writing to disk > > will clear the device's error state (via LBA remapping or some other > > strategy). I think this means iomap_zeroinit can skip unwritten extents > > because reads and read faults will be satisfied from the zero page and > > writeback (or direct writes) will trigger the drive firmware. > > > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and > > read faults with the (dax) zero page. Writes and write faults won't > > clear the poison (unlike disks). So I guess this means that > > dax_zeroinit *does* have to act on unwritten areas. I was confused when I wrote this -- before writing, dax filesystems are required to allocate written zeroed extents and/or zero unwritten extents and mark them written. So we don't actually need to zero unwritten extents. > > > > Ok. I'll make those changes. > > Yes, I guess unwritten extents still have extents blocks allocated with > generally a bit marked (to mark it as unwritten). So there could still be > a need to clear the poison for this in case of DAX. > > > > > As for holes -- on the one hand, one could argue that zero-initializing > > a hole makes no sense and should be an error. OTOH one could make an > > equally compelling argument that it's merely a nop. Thoughts? > > So in case of holes consider this case (please correct if any of my > understanding below is wrong/incomplete). > If we have a large hole and if someone tries to do write to that area. > 1. Then from what I understood from the code FS will try and allocate some disk > blocks (could these blocks be marked with HWpoison as FS has no way of > knowing it?). Freshly allocated extents are zeroed via blkdev_issue_zeroout before being mapped into the file, which will clear the poison. That last bit is only the reason why the punch-and-rewrite dance ever worked at all. We'll have to make sure this poison clearing still happens even after the dax/block layer divorce. > 2. If yes, then after allocating those blocks dax_direct_access will fail (as > you had mentioned above). But it won't clear the HWposion. > 3. Then the user again will have to clear using this API. But that is only for > a given extent which is some part of the hole which FS allocated. > Now above could be repeated until the entire hole range is covered. > Is that above understanding correct? > > If yes, then maybe it all depends on what sort of gurantee the API is providing. > If using the API on the given range guarantees that the entire file range will > not have any blocks with HWpoison then I guess we may have to cover the > IOMAP_HOLE case as well? > If not, then maybe we could explicitly mentioned this in the API documentation. Ok. The short version is that zeroinit will ensure that subsequent reads, writes, or faults to allocated file ranges won't have problems with pre-existing poison flags. If the user wants to fill sparse holes, they can do that with a separate fallocate call. --D > Please help correct if any of above does not make any sense. It will help me > understand this use case better. > > -ritesh > > > > > --D > > > > > [2]: "vfs: add a zero-initialization mode to fallocate" > > > > > > Although I am not an expert in this area, but the rest of the patch looks > > > very well crafted to me. Thanks again for such details :) > > > > > > -ritesh > > > > > > > > > > > +} > > > > + > > > > +/* > > > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the > > > > + * media are ready to accept read and write commands. This requires the use of > > > > + * the dax layer's zero page range function to write zeroes to a pmem region > > > > + * and to reset any hardware media error state. > > > > + * > > > > + * The physical extents must be aligned to page size. The file must be backed > > > > + * by a pmem device. The extents returned must not require copy on write (or > > > > + * any other mapping interventions from the filesystem) and must be contiguous. > > > > + * @done will be set to true if the reset succeeded. > > > > + * > > > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > > > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > > > > + * not support it, or the usual negative errno. > > > > + */ > > > > +int > > > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > + const struct iomap_ops *ops) > > > > +{ > > > > + struct iomap_iter iter = { > > > > + .inode = inode, > > > > + .pos = pos, > > > > + .len = len, > > > > + .flags = IOMAP_REPORT, > > > > + }; > > > > + int ret; > > > > + > > > > + if (!IS_DAX(inode)) > > > > + return -EINVAL; > > > > + if (pos + len > i_size_read(inode)) > > > > + return -EINVAL; > > > > + > > > > + while ((ret = iomap_iter(&iter, ops)) > 0) > > > > + iter.processed = dax_zeroinit_iter(&iter); > > > > + return ret; > > > > +} > > > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range); > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > > > index 2619d94c308d..3c873f7c35ba 100644 > > > > --- a/include/linux/dax.h > > > > +++ b/include/linux/dax.h > > > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); > > > > struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); > > > > dax_entry_t dax_lock_page(struct page *page); > > > > void dax_unlock_page(struct page *page, dax_entry_t cookie); > > > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > + const struct iomap_ops *ops); > > > > #else > > > > #define generic_fsdax_supported NULL > > > > > > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) > > > > static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) > > > > { > > > > } > > > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > + const struct iomap_ops *ops) > > > > +{ > > > > + return -EOPNOTSUPP; > > > > +} > > > > #endif > > > > > > > > #if IS_ENABLED(CONFIG_DAX) > > > >
On Tue, Sep 21, 2021 at 09:29:25AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 06:30:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create a function that ensures that the storage backing part of a file
> > contains zeroes and will not trip over old media errors if the contents
> > are re-read.
>
> I don't think this has anything to do with direct I/O, so I'd rather
> not have it clutter direct-io.c. Also do we really want to wait
> synchronously for every bio instead of batching them up? Especially
> as a simple bio_chain is probably all that is needed.
__blkdev_issue_zeroout looks appropriate for chaining. I'll move the
zeroout routine into a new lowlevel.c file, since this isn't buffered io
either.
--D
On Wed, Sep 22, 2021 at 08:33:43AM +1000, Dave Chinner wrote: > On Fri, Sep 17, 2021 at 06:30:55PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Create a function that ensures that the storage backing part of a file > > contains zeroes and will not trip over old media errors if the contents > > are re-read. > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > --- > > fs/iomap/direct-io.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++ > > include/linux/iomap.h | 3 ++ > > 2 files changed, 78 insertions(+) > > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > index 4ecd255e0511..48826a49f976 100644 > > --- a/fs/iomap/direct-io.c > > +++ b/fs/iomap/direct-io.c > > @@ -652,3 +652,78 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > return iomap_dio_complete(dio); > > } > > EXPORT_SYMBOL_GPL(iomap_dio_rw); > > + > > +static loff_t > > +iomap_zeroinit_iter(struct iomap_iter *iter) > > +{ > > + struct iomap *iomap = &iter->iomap; > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > + const u64 start = iomap->addr + iter->pos - iomap->offset; > > + const u64 nr_bytes = iomap_length(iter); > > + sector_t sector = start >> SECTOR_SHIFT; > > + sector_t nr_sectors = nr_bytes >> SECTOR_SHIFT; > > + int ret; > > + > > + if (!iomap->bdev) > > + return -ECANCELED; > > + > > + /* The physical extent must be sector-aligned for block layer APIs. */ > > + if ((start | nr_bytes) & (SECTOR_SIZE - 1)) > > + return -EINVAL; > > + > > + /* Must be able to zero storage directly without fs intervention. */ > > + if (iomap->flags & IOMAP_F_SHARED) > > + return -ECANCELED; > > + if (srcmap != iomap) > > + return -ECANCELED; > > + > > + switch (iomap->type) { > > + case IOMAP_MAPPED: > > + ret = blkdev_issue_zeroout(iomap->bdev, sector, nr_sectors, > > + GFP_KERNEL, 0); > > Pretty sure this needs to use BLKDEV_ZERO_NOUNMAP. Oops, good catch! --D > The whole point of this is having zeroed space allocated ready for > write on return, so having the hardware optimise away the physical > storage zeroing by punching a hole in it's backing store and then > potentially getting ENOSPC on the next write to this range would be > .... suboptimal. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On 21/09/22 11:26AM, Darrick J. Wong wrote: > On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote: > > On 21/09/20 10:22AM, Darrick J. Wong wrote: > > > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote: > > > > On 21/09/17 06:30PM, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > > > > > Our current "advice" to people using persistent memory and FSDAX who > > > > > wish to recover upon receipt of a media error (aka 'hwpoison') event > > > > > from ACPI is to punch-hole that part of the file and then pwrite it, > > > > > which will magically cause the pmem to be reinitialized and the poison > > > > > to be cleared. > > > > > > > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite > > > > > does not permit the caller to specify where to find blocks, which means > > > > > that we might not get the same pmem back. This pushes the user farther > > > > > away from the goal of reinitializing poisoned memory and leads to > > > > > complaints about unnecessary file fragmentation. > > > > > > > > > > AFAICT, the only reason why the "punch and write" dance works at all is > > > > > that the XFS and ext4 currently call blkdev_issue_zeroout when > > > > > allocating pmem ahead of a write call. Even a regular overwrite won't > > > > > clear the poison, because dax_direct_access is smart enough to bail out > > > > > on poisoned pmem, but not smart enough to clear it. To be fair, that > > > > > > > > > > > > > > > function maps pages and has no idea what kinds of reads and writes the > > > > > caller might want to perform. > > > > > > > > > > Therefore, create a dax_zeroinit_range function that filesystems can to > > > > > reset the pmem contents to zero and clear hardware media error flags. > > > > > This uses the dax page zeroing helper function, which should ensure that > > > > > subsequent accesses will not trip over any pre-existing media errors. > > > > > > > > Thanks Darrick for such clear explaination of the problem and your solution. > > > > As I see from this thread [1], it looks like we are heading in this direction, > > > > so I thought of why not review this RFC patch series :) > > > > > > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/ > > > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > > > > --- > > > > > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > > include/linux/dax.h | 7 ++++ > > > > > 2 files changed, 100 insertions(+) > > > > > > > > > > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > > > index 4e3e5a283a91..765b80d08605 100644 > > > > > --- a/fs/dax.c > > > > > +++ b/fs/dax.c > > > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, > > > > > return dax_insert_pfn_mkwrite(vmf, pfn, order); > > > > > } > > > > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault); > > > > > + > > > > > +static loff_t > > > > > +dax_zeroinit_iter(struct iomap_iter *iter) > > > > > +{ > > > > > + struct iomap *iomap = &iter->iomap; > > > > > + const struct iomap *srcmap = iomap_iter_srcmap(iter); > > > > > + const u64 start = iomap->addr + iter->pos - iomap->offset; > > > > > + const u64 nr_bytes = iomap_length(iter); > > > > > + u64 start_page = start >> PAGE_SHIFT; > > > > > + u64 nr_pages = nr_bytes >> PAGE_SHIFT; > > > > > + int ret; > > > > > + > > > > > + if (!iomap->dax_dev) > > > > > + return -ECANCELED; > > > > > + > > > > > + /* > > > > > + * The physical extent must be page aligned because that's what the dax > > > > > + * function requires. > > > > > + */ > > > > > + if (!PAGE_ALIGNED(start | nr_bytes)) > > > > > + return -ECANCELED; > > > > > + > > > > > + /* > > > > > + * The dax function, by using pgoff_t, is stuck with unsigned long, so > > > > > + * we must check for overflows. > > > > > + */ > > > > > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX) > > > > > + return -ECANCELED; > > > > > + > > > > > + /* Must be able to zero storage directly without fs intervention. */ > > > > > + if (iomap->flags & IOMAP_F_SHARED) > > > > > + return -ECANCELED; > > > > > + if (srcmap != iomap) > > > > > + return -ECANCELED; > > > > > + > > > > > + switch (iomap->type) { > > > > > + case IOMAP_MAPPED: > > > > > + while (nr_pages > 0) { > > > > > + /* XXX function only supports one page at a time?! */ > > > > > + ret = dax_zero_page_range(iomap->dax_dev, start_page, > > > > > + 1); > > > > > + if (ret) > > > > > + return ret; > > > > > + start_page++; > > > > > + nr_pages--; > > > > > + } > > > > > + > > > > > + fallthrough; > > > > > + case IOMAP_UNWRITTEN: > > > > > + return nr_bytes; > > > > > + } > > > > > + > > > > > + /* Reject holes, inline data, or delalloc extents. */ > > > > > + return -ECANCELED; > > > > > > > > We reject holes here, but the other vfs plumbing patch [2] mentions > > > > "Holes and unwritten extents are left untouched.". > > > > Shouldn't we just return nr_bytes for IOMAP_HOLE case as well? > > > > > > I'm not entirely sure what we should do for holes and unwritten extents, > > > as you can tell from the gross inconsistency between the comment and the > > > code. :/ > > > > > > On block devices, I think we rely on the behavior that writing to disk > > > will clear the device's error state (via LBA remapping or some other > > > strategy). I think this means iomap_zeroinit can skip unwritten extents > > > because reads and read faults will be satisfied from the zero page and > > > writeback (or direct writes) will trigger the drive firmware. > > > > > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and > > > read faults with the (dax) zero page. Writes and write faults won't > > > clear the poison (unlike disks). So I guess this means that > > > dax_zeroinit *does* have to act on unwritten areas. > > I was confused when I wrote this -- before writing, dax filesystems are > required to allocate written zeroed extents and/or zero unwritten > extents and mark them written. So we don't actually need to zero > unwritten extents. Oh yes, thanks for catching that. ext4 has this flag set for DAX inode in ext4_iomap_alloc() ext4_iomap_begin() -> ext4_iomap_alloc() -> if (IS_DAX(inode)) m_flags = EXT4_GET_BLOCKS_CREATE_ZERO; Also, #define EXT4_GET_BLOCKS_CREATE_ZERO (EXT4_GET_BLOCKS_CREATE | EXT4_GET_BLOCKS_ZERO) if EXT4_GET_BLOCKS_ZERO is set then we call ext4_map_blocks -> ext4_issue_zeroout() -> blkdev_issue_zeroout() > > > > > > > Ok. I'll make those changes. > > > > Yes, I guess unwritten extents still have extents blocks allocated with > > generally a bit marked (to mark it as unwritten). So there could still be > > a need to clear the poison for this in case of DAX. > > > > > > > > As for holes -- on the one hand, one could argue that zero-initializing > > > a hole makes no sense and should be an error. OTOH one could make an > > > equally compelling argument that it's merely a nop. Thoughts? > > > > So in case of holes consider this case (please correct if any of my > > understanding below is wrong/incomplete). > > If we have a large hole and if someone tries to do write to that area. > > 1. Then from what I understood from the code FS will try and allocate some disk > > blocks (could these blocks be marked with HWpoison as FS has no way of > > knowing it?). > > Freshly allocated extents are zeroed via blkdev_issue_zeroout before > being mapped into the file, which will clear the poison. That last bit > is only the reason why the punch-and-rewrite dance ever worked at all. Frankly speaking, this discussion actually got me thinking about what is special about punch and write (w/o making any assumptions) that it clears the poison? What about just punch hole alone? So I guess this too can clear the HWpoison on the next write (after fault -> allocating blocks for DAX -> this will call blkdev_issue_zeroout()) but I guess the problem with this approach, as you mentioned, it may cause file fragmentation. Thisis since we may not get the same block back on next write after punch hole. Looking at the code of pmem driver, it is actually pmem_do_write() which clears the pmem HWpoison. And I guess the offset and size should be page_aligned because that's what calls dax_zero_page_range() which can clear the HWpoison in dax_iomap_zero(). > > We'll have to make sure this poison clearing still happens even after > the dax/block layer divorce. > > > 2. If yes, then after allocating those blocks dax_direct_access will fail (as > > you had mentioned above). But it won't clear the HWposion. > > 3. Then the user again will have to clear using this API. But that is only for > > a given extent which is some part of the hole which FS allocated. > > Now above could be repeated until the entire hole range is covered. > > Is that above understanding correct? > > > > If yes, then maybe it all depends on what sort of gurantee the API is providing. > > If using the API on the given range guarantees that the entire file range will > > not have any blocks with HWpoison then I guess we may have to cover the > > IOMAP_HOLE case as well? > > If not, then maybe we could explicitly mentioned this in the API documentation. > > Ok. The short version is that zeroinit will ensure that subsequent > reads, writes, or faults to allocated file ranges won't have problems > with pre-existing poison flags. If the user wants to fill sparse holes, > they can do that with a separate fallocate call. Yes, it is now clear to me. Thanks for taking time and replying. -ritesh > > --D > > > Please help correct if any of above does not make any sense. It will help me > > understand this use case better. > > > > -ritesh > > > > > > > > --D > > > > > > > [2]: "vfs: add a zero-initialization mode to fallocate" > > > > > > > > Although I am not an expert in this area, but the rest of the patch looks > > > > very well crafted to me. Thanks again for such details :) > > > > > > > > -ritesh > > > > > > > > > > > > > > +} > > > > > + > > > > > +/* > > > > > + * Initialize storage mapped to a DAX-mode file to a known value and ensure the > > > > > + * media are ready to accept read and write commands. This requires the use of > > > > > + * the dax layer's zero page range function to write zeroes to a pmem region > > > > > + * and to reset any hardware media error state. > > > > > + * > > > > > + * The physical extents must be aligned to page size. The file must be backed > > > > > + * by a pmem device. The extents returned must not require copy on write (or > > > > > + * any other mapping interventions from the filesystem) and must be contiguous. > > > > > + * @done will be set to true if the reset succeeded. > > > > > + * > > > > > + * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage > > > > > + * mappings do not support zero initialization, -EOPNOTSUPP if the device does > > > > > + * not support it, or the usual negative errno. > > > > > + */ > > > > > +int > > > > > +dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > > + const struct iomap_ops *ops) > > > > > +{ > > > > > + struct iomap_iter iter = { > > > > > + .inode = inode, > > > > > + .pos = pos, > > > > > + .len = len, > > > > > + .flags = IOMAP_REPORT, > > > > > + }; > > > > > + int ret; > > > > > + > > > > > + if (!IS_DAX(inode)) > > > > > + return -EINVAL; > > > > > + if (pos + len > i_size_read(inode)) > > > > > + return -EINVAL; > > > > > + > > > > > + while ((ret = iomap_iter(&iter, ops)) > 0) > > > > > + iter.processed = dax_zeroinit_iter(&iter); > > > > > + return ret; > > > > > +} > > > > > +EXPORT_SYMBOL_GPL(dax_zeroinit_range); > > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h > > > > > index 2619d94c308d..3c873f7c35ba 100644 > > > > > --- a/include/linux/dax.h > > > > > +++ b/include/linux/dax.h > > > > > @@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping); > > > > > struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); > > > > > dax_entry_t dax_lock_page(struct page *page); > > > > > void dax_unlock_page(struct page *page, dax_entry_t cookie); > > > > > +int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > > + const struct iomap_ops *ops); > > > > > #else > > > > > #define generic_fsdax_supported NULL > > > > > > > > > > @@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page) > > > > > static inline void dax_unlock_page(struct page *page, dax_entry_t cookie) > > > > > { > > > > > } > > > > > +static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len, > > > > > + const struct iomap_ops *ops) > > > > > +{ > > > > > + return -EOPNOTSUPP; > > > > > +} > > > > > #endif > > > > > > > > > > #if IS_ENABLED(CONFIG_DAX) > > > > >
On Wed, Sep 22, 2021 at 11:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Sep 21, 2021 at 09:37:08AM +0530, riteshh wrote:
> > On 21/09/20 10:22AM, Darrick J. Wong wrote:
> > > On Sat, Sep 18, 2021 at 10:24:08PM +0530, riteshh wrote:
> > > > On 21/09/17 06:30PM, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Our current "advice" to people using persistent memory and FSDAX who
> > > > > wish to recover upon receipt of a media error (aka 'hwpoison') event
> > > > > from ACPI is to punch-hole that part of the file and then pwrite it,
> > > > > which will magically cause the pmem to be reinitialized and the poison
> > > > > to be cleared.
> > > > >
> > > > > Punching doesn't make any sense at all -- the (re)allocation on pwrite
> > > > > does not permit the caller to specify where to find blocks, which means
> > > > > that we might not get the same pmem back. This pushes the user farther
> > > > > away from the goal of reinitializing poisoned memory and leads to
> > > > > complaints about unnecessary file fragmentation.
> > > > >
> > > > > AFAICT, the only reason why the "punch and write" dance works at all is
> > > > > that the XFS and ext4 currently call blkdev_issue_zeroout when
> > > > > allocating pmem ahead of a write call. Even a regular overwrite won't
> > > > > clear the poison, because dax_direct_access is smart enough to bail out
> > > > > on poisoned pmem, but not smart enough to clear it. To be fair, that
> > > > > function maps pages and has no idea what kinds of reads and writes the
> > > > > caller might want to perform.
> > > > >
> > > > > Therefore, create a dax_zeroinit_range function that filesystems can to
> > > > > reset the pmem contents to zero and clear hardware media error flags.
> > > > > This uses the dax page zeroing helper function, which should ensure that
> > > > > subsequent accesses will not trip over any pre-existing media errors.
> > > >
> > > > Thanks Darrick for such clear explaination of the problem and your solution.
> > > > As I see from this thread [1], it looks like we are heading in this direction,
> > > > so I thought of why not review this RFC patch series :)
> > > >
> > > > [1]: https://lore.kernel.org/all/CAPcyv4iAr_Vwwgqw+4wz0RQUXhUUJGGz7_T+p+W6tC4T+k+zNw@mail.gmail.com/
> > > >
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/dax.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > include/linux/dax.h | 7 ++++
> > > > > 2 files changed, 100 insertions(+)
> > > > >
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index 4e3e5a283a91..765b80d08605 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > > > > return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > > > +
> > > > > +static loff_t
> > > > > +dax_zeroinit_iter(struct iomap_iter *iter)
> > > > > +{
> > > > > + struct iomap *iomap = &iter->iomap;
> > > > > + const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > > > + const u64 start = iomap->addr + iter->pos - iomap->offset;
> > > > > + const u64 nr_bytes = iomap_length(iter);
> > > > > + u64 start_page = start >> PAGE_SHIFT;
> > > > > + u64 nr_pages = nr_bytes >> PAGE_SHIFT;
> > > > > + int ret;
> > > > > +
> > > > > + if (!iomap->dax_dev)
> > > > > + return -ECANCELED;
> > > > > +
> > > > > + /*
> > > > > + * The physical extent must be page aligned because that's what the dax
> > > > > + * function requires.
> > > > > + */
> > > > > + if (!PAGE_ALIGNED(start | nr_bytes))
> > > > > + return -ECANCELED;
> > > > > +
> > > > > + /*
> > > > > + * The dax function, by using pgoff_t, is stuck with unsigned long, so
> > > > > + * we must check for overflows.
> > > > > + */
> > > > > + if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
> > > > > + return -ECANCELED;
> > > > > +
> > > > > + /* Must be able to zero storage directly without fs intervention. */
> > > > > + if (iomap->flags & IOMAP_F_SHARED)
> > > > > + return -ECANCELED;
> > > > > + if (srcmap != iomap)
> > > > > + return -ECANCELED;
> > > > > +
> > > > > + switch (iomap->type) {
> > > > > + case IOMAP_MAPPED:
> > > > > + while (nr_pages > 0) {
> > > > > + /* XXX function only supports one page at a time?! */
> > > > > + ret = dax_zero_page_range(iomap->dax_dev, start_page,
> > > > > + 1);
> > > > > + if (ret)
> > > > > + return ret;
> > > > > + start_page++;
> > > > > + nr_pages--;
> > > > > + }
> > > > > +
> > > > > + fallthrough;
> > > > > + case IOMAP_UNWRITTEN:
> > > > > + return nr_bytes;
> > > > > + }
> > > > > +
> > > > > + /* Reject holes, inline data, or delalloc extents. */
> > > > > + return -ECANCELED;
> > > >
> > > > We reject holes here, but the other vfs plumbing patch [2] mentions
> > > > "Holes and unwritten extents are left untouched.".
> > > > Shouldn't we just return nr_bytes for IOMAP_HOLE case as well?
> > >
> > > I'm not entirely sure what we should do for holes and unwritten extents,
> > > as you can tell from the gross inconsistency between the comment and the
> > > code. :/
> > >
> > > On block devices, I think we rely on the behavior that writing to disk
> > > will clear the device's error state (via LBA remapping or some other
> > > strategy). I think this means iomap_zeroinit can skip unwritten extents
> > > because reads and read faults will be satisfied from the zero page and
> > > writeback (or direct writes) will trigger the drive firmware.
> > >
> > > On FSDAX devices, reads are fulfilled by zeroing the user buffer, and
> > > read faults with the (dax) zero page. Writes and write faults won't
> > > clear the poison (unlike disks). So I guess this means that
> > > dax_zeroinit *does* have to act on unwritten areas.
>
> I was confused when I wrote this -- before writing, dax filesystems are
> required to allocate written zeroed extents and/or zero unwritten
> extents and mark them written. So we don't actually need to zero
> unwritten extents.
>
> > >
> > > Ok. I'll make those changes.
> >
> > Yes, I guess unwritten extents still have extents blocks allocated with
> > generally a bit marked (to mark it as unwritten). So there could still be
> > a need to clear the poison for this in case of DAX.
> >
> > >
> > > As for holes -- on the one hand, one could argue that zero-initializing
> > > a hole makes no sense and should be an error. OTOH one could make an
> > > equally compelling argument that it's merely a nop. Thoughts?
> >
> > So in case of holes consider this case (please correct if any of my
> > understanding below is wrong/incomplete).
> > If we have a large hole and if someone tries to do write to that area.
> > 1. Then from what I understood from the code FS will try and allocate some disk
> > blocks (could these blocks be marked with HWpoison as FS has no way of
> > knowing it?).
>
> Freshly allocated extents are zeroed via blkdev_issue_zeroout before
> being mapped into the file, which will clear the poison. That last bit
> is only the reason why the punch-and-rewrite dance ever worked at all.
>
> We'll have to make sure this poison clearing still happens even after
> the dax/block layer divorce.
>
> > 2. If yes, then after allocating those blocks dax_direct_access will fail (as
> > you had mentioned above). But it won't clear the HWposion.
> > 3. Then the user again will have to clear using this API. But that is only for
> > a given extent which is some part of the hole which FS allocated.
> > Now above could be repeated until the entire hole range is covered.
> > Is that above understanding correct?
> >
> > If yes, then maybe it all depends on what sort of gurantee the API is providing.
> > If using the API on the given range guarantees that the entire file range will
> > not have any blocks with HWpoison then I guess we may have to cover the
> > IOMAP_HOLE case as well?
> > If not, then maybe we could explicitly mentioned this in the API documentation.
>
> Ok. The short version is that zeroinit will ensure that subsequent
> reads, writes, or faults to allocated file ranges won't have problems
> with pre-existing poison flags.
s/won't/likely won't/
s/pre-existing/known pre-existing/
i.e. the guarantees of this interface is that it will have tried its
best to clean up media errors, and that may only be possible for pmem
if latent poison has been previously notified to the driver, and if
the driver supports error clearing. For example, if you're using
memmap=ss!nn to emulate PMEM with a DRAM range with poison I expect
that this routine will succeed even though no poison is corrected.
On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote: > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote: > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > > > > I think this wants to be a behavioural modifier for existing > > > > > > > operations rather than an operation unto itself. i.e. similar to how > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > > > > the guarantees ALLOC provides userspace. > > > > > > > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > > > > want physical zeros to be written instead of the filesystem > > > > > > > optimising away the physical zeros by manipulating the layout > > > > > > > of the file. > > > > > > > > > > > > Yes. > > > > > > > > > > > > > Then we have and API that looks like: > > > > > > > > > > > > > > ALLOC - allocate space efficiently > > > > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > > > > ZERO - zero data and preallocate space efficiently > > > > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > > > > writing zeros instead of allocating unwritten extents is the > > > > > > > preferred behaviour of fallocate().... > > > > > > > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > > > > up with a better idea offhand. > > > > > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > > > > bypassing any mechanisms to emulate zero-filled payloads on future > > > > > reads. > > > > > > Yes, that's the semantic we want, but FUA already defines specific > > > data integrity behaviour in the storage stack w.r.t. volatile > > > caches. > > > > > > Also, FUA is associated with devices - it's low level storage jargon > > > and so is not really appropriate to call a user interface operation > > > FUA where users have no idea what a "unit" or "access" actually > > > means. > > > > > > Hence we should not overload this name with some other operation > > > that does not have (and should not have) explicit data integrity > > > requirements. That will just cause confusion for everyone. > > > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > > > > already exists at that file range? > > > > > > IMO that doesn't work as a behavioural modifier for ALLOC because > > > the ALLOC semantics are explicitly "don't touch existing user > > > data"... > > > > Well since you can't preallocate /and/ zerorange at the same time... > > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */ > > #define FALLOC_FL_ZERO_EXISTING (0x80) > > Except we also want the newly allocated regions (i.e. where holes > were) in that range being zeroed to have zeroes written to them as > well, yes? Otherwise we end up with a combination of unwritten > extents and physical zeroes, and you can't use > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT > > /* > * For preallocation and zeroing operations, force the filesystem to > * write zeroes rather than use unwritten extents to indicate the > * range contains zeroes. > * > * For filesystems that support unwritten extents, this trades off > * slow fallocate performance for faster first write performance as > * unwritten extent conversion on the first write to each block in > * the range is not needed. > * > * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will > * be much slower overall for large ranges and/or slow storage > * compared to using unwritten extents. > */ > #define FALLOC_FL_ALLOC_INIT_DATA (1 << 7) I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case we're not allocating any new space, merely rewriting existing storage. I also want to expand the description slightly: /* * For preallocation, force the filesystem to write zeroes rather than * use unwritten extents to indicate the range contains zeroes. For * zeroing operations, force the filesystem to write zeroes to existing * written extents. --D > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com
On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote: > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote: > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote: > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote: > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > > > > > I think this wants to be a behavioural modifier for existing > > > > > > > > operations rather than an operation unto itself. i.e. similar to how > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > > > > > the guarantees ALLOC provides userspace. > > > > > > > > > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > > > > > want physical zeros to be written instead of the filesystem > > > > > > > > optimising away the physical zeros by manipulating the layout > > > > > > > > of the file. > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > Then we have and API that looks like: > > > > > > > > > > > > > > > > ALLOC - allocate space efficiently > > > > > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > > > > > ZERO - zero data and preallocate space efficiently > > > > > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > > > > > writing zeros instead of allocating unwritten extents is the > > > > > > > > preferred behaviour of fallocate().... > > > > > > > > > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > > > > > up with a better idea offhand. > > > > > > > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future > > > > > > reads. > > > > > > > > Yes, that's the semantic we want, but FUA already defines specific > > > > data integrity behaviour in the storage stack w.r.t. volatile > > > > caches. > > > > > > > > Also, FUA is associated with devices - it's low level storage jargon > > > > and so is not really appropriate to call a user interface operation > > > > FUA where users have no idea what a "unit" or "access" actually > > > > means. > > > > > > > > Hence we should not overload this name with some other operation > > > > that does not have (and should not have) explicit data integrity > > > > requirements. That will just cause confusion for everyone. > > > > > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > > > > > already exists at that file range? > > > > > > > > IMO that doesn't work as a behavioural modifier for ALLOC because > > > > the ALLOC semantics are explicitly "don't touch existing user > > > > data"... > > > > > > Well since you can't preallocate /and/ zerorange at the same time... > > > > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */ > > > #define FALLOC_FL_ZERO_EXISTING (0x80) > > > > Except we also want the newly allocated regions (i.e. where holes > > were) in that range being zeroed to have zeroes written to them as > > well, yes? Otherwise we end up with a combination of unwritten > > extents and physical zeroes, and you can't use > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT Ooookay. This is drifting further from the original problem of wanting to write a buffer of zeroes to an already-mapped extent. What if part of the region is shared? If the goal is to make a read return zeroes, then the shared extent must be punched, right? Should the new region be preallocated? What if part of the region is unwritten? Should zeroing convert that to written at the same time? This isn't required to solve the problem, but "force the filesystem to write zeroes" implies that's required. Should preallocation start converting unwritten extents too? What if part of the region is sparse? Preallocation should allocate a written extent, but that wasn't the problem I was focusing on. Should zeroing preallocate a written extent? This also isn't required to solve my problem, but this extension of the API definition implies this too. What if part of the region is delalloc? Should preallocation allocate a written extent here too? Should zeroing? For ALLOC|INITDATA, I think it suffices to map new written extents into holes with BMAPI_ZERO and do no more work than that. For ZERO|INITDATA, I /think/ I can solve all of the above by writing zeroes to the page cache and then switching to regular preallocation to fill the holes. --D > > > > /* > > * For preallocation and zeroing operations, force the filesystem to > > * write zeroes rather than use unwritten extents to indicate the > > * range contains zeroes. > > * > > * For filesystems that support unwritten extents, this trades off > > * slow fallocate performance for faster first write performance as > > * unwritten extent conversion on the first write to each block in > > * the range is not needed. > > * > > * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will > > * be much slower overall for large ranges and/or slow storage > > * compared to using unwritten extents. > > */ > > #define FALLOC_FL_ALLOC_INIT_DATA (1 << 7) > > I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case > we're not allocating any new space, merely rewriting existing storage. > I also want to expand the description slightly: > > /* > * For preallocation, force the filesystem to write zeroes rather than > * use unwritten extents to indicate the range contains zeroes. For > * zeroing operations, force the filesystem to write zeroes to existing > * written extents. > > --D > > > > > Cheers, > > > > Dave. > > > > -- > > Dave Chinner > > david@fromorbit.com
On Wed, Sep 22, 2021 at 05:02:55PM -0700, Darrick J. Wong wrote: > On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote: > > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote: > > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote: > > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > > > > > > I think this wants to be a behavioural modifier for existing > > > > > > > > > operations rather than an operation unto itself. i.e. similar to how > > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > > > > > > the guarantees ALLOC provides userspace. > > > > > > > > > > > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > > > > > > want physical zeros to be written instead of the filesystem > > > > > > > > > optimising away the physical zeros by manipulating the layout > > > > > > > > > of the file. > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > Then we have and API that looks like: > > > > > > > > > > > > > > > > > > ALLOC - allocate space efficiently > > > > > > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > > > > > > ZERO - zero data and preallocate space efficiently > > > > > > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > > > > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > > > > > > writing zeros instead of allocating unwritten extents is the > > > > > > > > > preferred behaviour of fallocate().... > > > > > > > > > > > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > > > > > > up with a better idea offhand. > > > > > > > > > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future > > > > > > > reads. > > > > > > > > > > Yes, that's the semantic we want, but FUA already defines specific > > > > > data integrity behaviour in the storage stack w.r.t. volatile > > > > > caches. > > > > > > > > > > Also, FUA is associated with devices - it's low level storage jargon > > > > > and so is not really appropriate to call a user interface operation > > > > > FUA where users have no idea what a "unit" or "access" actually > > > > > means. > > > > > > > > > > Hence we should not overload this name with some other operation > > > > > that does not have (and should not have) explicit data integrity > > > > > requirements. That will just cause confusion for everyone. > > > > > > > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > > > > > > already exists at that file range? > > > > > > > > > > IMO that doesn't work as a behavioural modifier for ALLOC because > > > > > the ALLOC semantics are explicitly "don't touch existing user > > > > > data"... > > > > > > > > Well since you can't preallocate /and/ zerorange at the same time... > > > > > > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */ > > > > #define FALLOC_FL_ZERO_EXISTING (0x80) > > > > > > Except we also want the newly allocated regions (i.e. where holes > > > were) in that range being zeroed to have zeroes written to them as > > > well, yes? Otherwise we end up with a combination of unwritten > > > extents and physical zeroes, and you can't use > > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT > > Ooookay. This is drifting further from the original problem of wanting > to write a buffer of zeroes to an already-mapped extent. > > What if part of the region is shared? If the goal is to make a read > return zeroes, then the shared extent must be punched, right? Should > the new region be preallocated? > > What if part of the region is unwritten? Should zeroing convert that to > written at the same time? This isn't required to solve the problem, but > "force the filesystem to write zeroes" implies that's required. Should > preallocation start converting unwritten extents too? > > What if part of the region is sparse? Preallocation should allocate a > written extent, but that wasn't the problem I was focusing on. Should > zeroing preallocate a written extent? This also isn't required to solve > my problem, but this extension of the API definition implies this too. > > What if part of the region is delalloc? Should preallocation allocate a > written extent here too? Should zeroing? > > For ALLOC|INITDATA, I think it suffices to map new written extents into > holes with BMAPI_ZERO and do no more work than that. > > For ZERO|INITDATA, I /think/ I can solve all of the above by writing > zeroes to the page cache and then switching to regular preallocation to > fill the holes. No, because that will flood the page cache with zeroed pages, which will blow out the cache and is only marginally better than pwrite, which has already been denied. That's right, that's why we wanted all these fancier functions anywayh. Ok so first we flush the page cache, then we walk the extent map to unmap the shared extents and convert the unwritten extents to written extents. Then we can use blkdev_issue_zerooout and dax_zero_page_range to convert the written etents to ... That's too many times through the file mappings. First flush the page cache, then loop through the mappings via xfs_bmapi_read. If the mapping is sparse, replace it with a written/zeroed extent. If the extent is unwritten, convert it to a written/zeroed extent. If the mapping is a shared, punch it and replace it with a written/zeroed extent. If the mapping is written, go call the fancy functions directly, no iomappings needed or required. And now I have to go do the smae for ext4, since we have no shared code anymore. Crazylevel now approaching 130%... --D > > --D > > > > > > > /* > > > * For preallocation and zeroing operations, force the filesystem to > > > * write zeroes rather than use unwritten extents to indicate the > > > * range contains zeroes. > > > * > > > * For filesystems that support unwritten extents, this trades off > > > * slow fallocate performance for faster first write performance as > > > * unwritten extent conversion on the first write to each block in > > > * the range is not needed. > > > * > > > * Care is required when using FALLOC_FL_ALLOC_INIT_DATA as it will > > > * be much slower overall for large ranges and/or slow storage > > > * compared to using unwritten extents. > > > */ > > > #define FALLOC_FL_ALLOC_INIT_DATA (1 << 7) > > > > I prefer FALLOC_FL_ZEROINIT_DATA here, because in the ZERO|INIT case > > we're not allocating any new space, merely rewriting existing storage. > > I also want to expand the description slightly: > > > > /* > > * For preallocation, force the filesystem to write zeroes rather than > > * use unwritten extents to indicate the range contains zeroes. For > > * zeroing operations, force the filesystem to write zeroes to existing > > * written extents. > > > > --D > > > > > > > > Cheers, > > > > > > Dave. > > > > > > -- > > > Dave Chinner > > > david@fromorbit.com
I withdraw the patchset due to spiralling complexity and the need to get back to the other hojillion things. Dave now suggests creating a RWF_CLEAR_POISON/IOCB_CLEAR_POISON flag to clear poison ahead of userspace using a regular write to reset the storage. Honestly I only though of zero writes because I though that would lead to the least amount of back and forth, which was clearly wrong. Jane, could you take over and try that? --D
On Wed, Sep 22, 2021 at 5:52 PM Darrick J. Wong <djwong@kernel.org> wrote: > > I withdraw the patchset due to spiralling complexity and the need to get > back to the other hojillion things. > > Dave now suggests creating a RWF_CLEAR_POISON/IOCB_CLEAR_POISON flag to > clear poison ahead of userspace using a regular write to reset the > storage. Honestly I only though of zero writes because I though that > would lead to the least amount of back and forth, which was clearly > wrong. > Sorry Darrick, you really tried to do right by the feedback. > Jane, could you take over and try that? Let's just solve the immediate problem for DAX in the easiest way possible which is just Jane's original patches, but using the existing dax_zero_page_range() operation in the DAX pwrite() failure path. The only change would be to make sure that the pwrite() is page aligned, and punt on the ability to clear poison on sub-page granularity for now.
On Wed, Sep 22, 2021 at 05:02:55PM -0700, Darrick J. Wong wrote: > On Wed, Sep 22, 2021 at 02:27:25PM -0700, Darrick J. Wong wrote: > > On Wed, Sep 22, 2021 at 03:49:31PM +1000, Dave Chinner wrote: > > > On Tue, Sep 21, 2021 at 09:13:54PM -0700, Darrick J. Wong wrote: > > > > On Wed, Sep 22, 2021 at 01:59:07PM +1000, Dave Chinner wrote: > > > > > On Tue, Sep 21, 2021 at 07:38:01PM -0700, Darrick J. Wong wrote: > > > > > > On Tue, Sep 21, 2021 at 07:16:26PM -0700, Dan Williams wrote: > > > > > > > On Tue, Sep 21, 2021 at 1:32 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > > > > > > > On Tue, Sep 21, 2021 at 10:44:31AM +1000, Dave Chinner wrote: > > > > > > > > > I think this wants to be a behavioural modifier for existing > > > > > > > > > operations rather than an operation unto itself. i.e. similar to how > > > > > > > > > KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter > > > > > > > > > the guarantees ALLOC provides userspace. > > > > > > > > > > > > > > > > > > In this case, the change of behaviour over ZERO_RANGE is that we > > > > > > > > > want physical zeros to be written instead of the filesystem > > > > > > > > > optimising away the physical zeros by manipulating the layout > > > > > > > > > of the file. > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > Then we have and API that looks like: > > > > > > > > > > > > > > > > > > ALLOC - allocate space efficiently > > > > > > > > > ALLOC | INIT - allocate space by writing zeros to it > > > > > > > > > ZERO - zero data and preallocate space efficiently > > > > > > > > > ZERO | INIT - zero range by writing zeros to it > > > > > > > > > > > > > > > > > > Which seems to cater for all the cases I know of where physically > > > > > > > > > writing zeros instead of allocating unwritten extents is the > > > > > > > > > preferred behaviour of fallocate().... > > > > > > > > > > > > > > > > Agreed. I'm not sure INIT is really the right name, but I can't come > > > > > > > > up with a better idea offhand. > > > > > > > > > > > > > > FUA? As in, this is a forced-unit-access zeroing all the way to media > > > > > > > bypassing any mechanisms to emulate zero-filled payloads on future > > > > > > > reads. > > > > > > > > > > Yes, that's the semantic we want, but FUA already defines specific > > > > > data integrity behaviour in the storage stack w.r.t. volatile > > > > > caches. > > > > > > > > > > Also, FUA is associated with devices - it's low level storage jargon > > > > > and so is not really appropriate to call a user interface operation > > > > > FUA where users have no idea what a "unit" or "access" actually > > > > > means. > > > > > > > > > > Hence we should not overload this name with some other operation > > > > > that does not have (and should not have) explicit data integrity > > > > > requirements. That will just cause confusion for everyone. > > > > > > > > > > > FALLOC_FL_ZERO_EXISTING, because you want to zero the storage that > > > > > > already exists at that file range? > > > > > > > > > > IMO that doesn't work as a behavioural modifier for ALLOC because > > > > > the ALLOC semantics are explicitly "don't touch existing user > > > > > data"... > > > > > > > > Well since you can't preallocate /and/ zerorange at the same time... > > > > > > > > /* For FALLOC_FL_ZERO_RANGE, write zeroes to pre-existing mapped storage. */ > > > > #define FALLOC_FL_ZERO_EXISTING (0x80) > > > > > > Except we also want the newly allocated regions (i.e. where holes > > > were) in that range being zeroed to have zeroes written to them as > > > well, yes? Otherwise we end up with a combination of unwritten > > > extents and physical zeroes, and you can't use > > > ZERORANGE|EXISTING as a replacement for PUNCH + ALLOC|INIT > > Ooookay. This is drifting further from the original problem of wanting > to write a buffer of zeroes to an already-mapped extent. Yup, that's because fallocate() is a multiplexed set of generic operations. Adding one-off functionality for a specific use case ends up breaking down the problem into how it maps as a generic operation to the rest of the functionality that fallocate() provides. Please don't get frustrated here - the discussion needs to be had as to why fallocate() is the wrong place to be managing *storage hardware state* - the API and scope creep once the operation is broken down to a generic behaviour demonstrate this clearly. > What if part of the region is shared? If the goal is to make a read > return zeroes, then the shared extent must be punched, right? Should > the new region be preallocated? [....] These are implementation details once the API has been defined. And, yes, I agree, it's nuts that just adding "DAX clear poison" effectively degenerates into "rewrite the entire fallocate() implementation in XFS", but that's what adding "write physical zeroes rather than minimising IO overhead" semantics to generic fallocate() operations results in. Fundamentally, fallocate() is not for managing storage hardware state. It's for optimising away bulk data operations by replacing them with fast filesystem metadata manipulations and/or hardware acceleration of the bulk data operation. So, given that for clearing pmem hardware poison, we already know it requires writing zeros to the poisoned range and where in the file that we need to write zeroes to. So what exactly is the problem with doing this: iov.iov_base = zero_buf; iov.iov_len = zero_len; pwritev2(fd, &iov, 1, zero_offset, RWF_SYNC | RWF_CLEAR_HWERRORS); Where RWF_CLEAR_HWERRORS tells the low level hardware IO code to clear error states before performing the write of the user data provided? The API doesn't get much simpler or explict, and the RWF_CLEAR_HWERRORS flag is trivial to propagate all the way down into the DAX IO code where it can be performed appropriately before performing the write of the new user data into that range. And it's way simpler than plumbing this sort of things into fallocate(). We need an API that provides a one-off, single data write semantic to be defined and implemented to manage pmem hardware state. We already have that in pwritev2(). Hence this discussion leads me to conclude that fallocate() simply isn't the right interface to clear storage hardware poison state and it's much simpler for everyone - kernel and userspace - to provide a pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to clear hardware error state before issuing this user write to the hardware. Cheers, Dave. -- Dave Chinner david@fromorbit.com
On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
[..]
> Hence this discussion leads me to conclude that fallocate() simply
> isn't the right interface to clear storage hardware poison state and
> it's much simpler for everyone - kernel and userspace - to provide a
> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> clear hardware error state before issuing this user write to the
> hardware.
That flag would slot in nicely in dax_iomap_iter() as the gate for
whether dax_direct_access() should allow mapping over error ranges,
and then as a flag to dax_copy_from_iter() to indicate that it should
compare the incoming write to known poison and clear it before
proceeding.
I like the distinction, because there's a chance the application did
not know that the page had experienced data loss and might want the
error behavior. The other service the driver could offer with this
flag is to do a precise check of the incoming write to make sure it
overlaps known poison and then repair the entire page. Repairing whole
pages makes for a cleaner implementation of the code that tries to
keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> [..]
> > Hence this discussion leads me to conclude that fallocate() simply
> > isn't the right interface to clear storage hardware poison state and
> > it's much simpler for everyone - kernel and userspace - to provide a
> > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > clear hardware error state before issuing this user write to the
> > hardware.
>
> That flag would slot in nicely in dax_iomap_iter() as the gate for
> whether dax_direct_access() should allow mapping over error ranges,
> and then as a flag to dax_copy_from_iter() to indicate that it should
> compare the incoming write to known poison and clear it before
> proceeding.
>
> I like the distinction, because there's a chance the application did
> not know that the page had experienced data loss and might want the
> error behavior. The other service the driver could offer with this
> flag is to do a precise check of the incoming write to make sure it
> overlaps known poison and then repair the entire page. Repairing whole
> pages makes for a cleaner implementation of the code that tries to
> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
This flag could also be useful for preadv2() as there is currently no
way to read the good data in a PMEM page with poison via DAX. So the
flag would tell dax_direct_access() to again proceed in the face of
errors, but then the driver's dax_copy_to_iter() operation could
either read up to the precise byte offset of the error in the page, or
autoreplace error data with zero's to try to maximize data recovery.
On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > [..]
> > > Hence this discussion leads me to conclude that fallocate() simply
> > > isn't the right interface to clear storage hardware poison state and
> > > it's much simpler for everyone - kernel and userspace - to provide a
> > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > clear hardware error state before issuing this user write to the
> > > hardware.
> >
> > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > whether dax_direct_access() should allow mapping over error ranges,
> > and then as a flag to dax_copy_from_iter() to indicate that it should
> > compare the incoming write to known poison and clear it before
> > proceeding.
> >
> > I like the distinction, because there's a chance the application did
> > not know that the page had experienced data loss and might want the
> > error behavior. The other service the driver could offer with this
> > flag is to do a precise check of the incoming write to make sure it
> > overlaps known poison and then repair the entire page. Repairing whole
> > pages makes for a cleaner implementation of the code that tries to
> > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
>
> This flag could also be useful for preadv2() as there is currently no
> way to read the good data in a PMEM page with poison via DAX. So the
> flag would tell dax_direct_access() to again proceed in the face of
> errors, but then the driver's dax_copy_to_iter() operation could
> either read up to the precise byte offset of the error in the page, or
> autoreplace error data with zero's to try to maximize data recovery.
Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
everything that can be read from the bad range because it's the
other half of the problem RWF_RESET_HWERROR is trying to address.
That is, the operation we want to perform on a range with an error
state is -data recovery-, not "reinitialisation". Data recovery
requires two steps:
- "try to recover the data from the bad storage"; and
- "reinitialise the data and clear the error state"
These naturally map to read() and write() operations, not
fallocate(). With RWF flags they become explicit data recovery
operations, unlike fallocate() which needs to imply that "writing
zeroes" == "reset hardware error state". While that reset method
may be true for a specific pmem hardware implementation it is not a
requirement for all storage hardware. It's most definitely not a
requirement for future storage hardware, either.
It also means that applications have no choice in what data they can
use to reinitialise the damaged range with because fallocate() only
supports writing zeroes. If we've recovered data via a read() as you
suggest we could, then we can rebuild the data from other redundant
information and immediately write that back to the storage, hence
repairing the fault.
That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
write into an exclusive operation and hence allow the
reinitialisation with the recovered/repaired state to run atomically
w.r.t. all other filesystem operations. i.e. the reset write
completes the recovery operation instead of requiring separate
"reset" and "write recovered data into zeroed range" steps that
cannot be executed atomically by userspace...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > > [..]
> > > > Hence this discussion leads me to conclude that fallocate() simply
> > > > isn't the right interface to clear storage hardware poison state and
> > > > it's much simpler for everyone - kernel and userspace - to provide a
> > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > > clear hardware error state before issuing this user write to the
> > > > hardware.
> > >
> > > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > > whether dax_direct_access() should allow mapping over error ranges,
> > > and then as a flag to dax_copy_from_iter() to indicate that it should
> > > compare the incoming write to known poison and clear it before
> > > proceeding.
> > >
> > > I like the distinction, because there's a chance the application did
> > > not know that the page had experienced data loss and might want the
> > > error behavior. The other service the driver could offer with this
> > > flag is to do a precise check of the incoming write to make sure it
> > > overlaps known poison and then repair the entire page. Repairing whole
> > > pages makes for a cleaner implementation of the code that tries to
> > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> >
> > This flag could also be useful for preadv2() as there is currently no
> > way to read the good data in a PMEM page with poison via DAX. So the
> > flag would tell dax_direct_access() to again proceed in the face of
> > errors, but then the driver's dax_copy_to_iter() operation could
> > either read up to the precise byte offset of the error in the page, or
> > autoreplace error data with zero's to try to maximize data recovery.
>
> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> everything that can be read from the bad range because it's the
> other half of the problem RWF_RESET_HWERROR is trying to address.
> That is, the operation we want to perform on a range with an error
> state is -data recovery-, not "reinitialisation". Data recovery
> requires two steps:
>
> - "try to recover the data from the bad storage"; and
> - "reinitialise the data and clear the error state"
>
> These naturally map to read() and write() operations, not
> fallocate(). With RWF flags they become explicit data recovery
> operations, unlike fallocate() which needs to imply that "writing
> zeroes" == "reset hardware error state". While that reset method
> may be true for a specific pmem hardware implementation it is not a
> requirement for all storage hardware. It's most definitely not a
> requirement for future storage hardware, either.
>
> It also means that applications have no choice in what data they can
> use to reinitialise the damaged range with because fallocate() only
> supports writing zeroes. If we've recovered data via a read() as you
> suggest we could, then we can rebuild the data from other redundant
> information and immediately write that back to the storage, hence
> repairing the fault.
>
> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> write into an exclusive operation and hence allow the
> reinitialisation with the recovered/repaired state to run atomically
> w.r.t. all other filesystem operations. i.e. the reset write
> completes the recovery operation instead of requiring separate
> "reset" and "write recovered data into zeroed range" steps that
> cannot be executed atomically by userspace...
/me nods
Jane, want to take a run at patches for this ^^^?
On 9/23/2021 6:18 PM, Dan Williams wrote:
> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
>>>>
>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
>>>> [..]
>>>>> Hence this discussion leads me to conclude that fallocate() simply
>>>>> isn't the right interface to clear storage hardware poison state and
>>>>> it's much simpler for everyone - kernel and userspace - to provide a
>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
>>>>> clear hardware error state before issuing this user write to the
>>>>> hardware.
>>>>
>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for
>>>> whether dax_direct_access() should allow mapping over error ranges,
>>>> and then as a flag to dax_copy_from_iter() to indicate that it should
>>>> compare the incoming write to known poison and clear it before
>>>> proceeding.
>>>>
>>>> I like the distinction, because there's a chance the application did
>>>> not know that the page had experienced data loss and might want the
>>>> error behavior. The other service the driver could offer with this
>>>> flag is to do a precise check of the incoming write to make sure it
>>>> overlaps known poison and then repair the entire page. Repairing whole
>>>> pages makes for a cleaner implementation of the code that tries to
>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
>>>
>>> This flag could also be useful for preadv2() as there is currently no
>>> way to read the good data in a PMEM page with poison via DAX. So the
>>> flag would tell dax_direct_access() to again proceed in the face of
>>> errors, but then the driver's dax_copy_to_iter() operation could
>>> either read up to the precise byte offset of the error in the page, or
>>> autoreplace error data with zero's to try to maximize data recovery.
>>
>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
>> everything that can be read from the bad range because it's the
>> other half of the problem RWF_RESET_HWERROR is trying to address.
>> That is, the operation we want to perform on a range with an error
>> state is -data recovery-, not "reinitialisation". Data recovery
>> requires two steps:
>>
>> - "try to recover the data from the bad storage"; and
>> - "reinitialise the data and clear the error state"
>>
>> These naturally map to read() and write() operations, not
>> fallocate(). With RWF flags they become explicit data recovery
>> operations, unlike fallocate() which needs to imply that "writing
>> zeroes" == "reset hardware error state". While that reset method
>> may be true for a specific pmem hardware implementation it is not a
>> requirement for all storage hardware. It's most definitely not a
>> requirement for future storage hardware, either.
>>
>> It also means that applications have no choice in what data they can
>> use to reinitialise the damaged range with because fallocate() only
>> supports writing zeroes. If we've recovered data via a read() as you
>> suggest we could, then we can rebuild the data from other redundant
>> information and immediately write that back to the storage, hence
>> repairing the fault.
>>
>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
>> write into an exclusive operation and hence allow the
>> reinitialisation with the recovered/repaired state to run atomically
>> w.r.t. all other filesystem operations. i.e. the reset write
>> completes the recovery operation instead of requiring separate
>> "reset" and "write recovered data into zeroed range" steps that
>> cannot be executed atomically by userspace...
>
> /me nods
>
> Jane, want to take a run at patches for this ^^^?
>
Sure, I'll give it a try.
Thank you all for the discussions!
-jane
On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote: > > On 9/23/2021 6:18 PM, Dan Williams wrote: > > On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote: > > > > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > > > > > > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > [..] > > > > > > Hence this discussion leads me to conclude that fallocate() simply > > > > > > isn't the right interface to clear storage hardware poison state and > > > > > > it's much simpler for everyone - kernel and userspace - to provide a > > > > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to > > > > > > clear hardware error state before issuing this user write to the > > > > > > hardware. > > > > > > > > > > That flag would slot in nicely in dax_iomap_iter() as the gate for > > > > > whether dax_direct_access() should allow mapping over error ranges, > > > > > and then as a flag to dax_copy_from_iter() to indicate that it should > > > > > compare the incoming write to known poison and clear it before > > > > > proceeding. > > > > > > > > > > I like the distinction, because there's a chance the application did > > > > > not know that the page had experienced data loss and might want the > > > > > error behavior. The other service the driver could offer with this > > > > > flag is to do a precise check of the incoming write to make sure it > > > > > overlaps known poison and then repair the entire page. Repairing whole > > > > > pages makes for a cleaner implementation of the code that tries to > > > > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec(). > > > > > > > > This flag could also be useful for preadv2() as there is currently no > > > > way to read the good data in a PMEM page with poison via DAX. So the > > > > flag would tell dax_direct_access() to again proceed in the face of > > > > errors, but then the driver's dax_copy_to_iter() operation could > > > > either read up to the precise byte offset of the error in the page, or > > > > autoreplace error data with zero's to try to maximize data recovery. > > > > > > Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read > > > everything that can be read from the bad range because it's the > > > other half of the problem RWF_RESET_HWERROR is trying to address. > > > That is, the operation we want to perform on a range with an error > > > state is -data recovery-, not "reinitialisation". Data recovery > > > requires two steps: > > > > > > - "try to recover the data from the bad storage"; and > > > - "reinitialise the data and clear the error state" > > > > > > These naturally map to read() and write() operations, not > > > fallocate(). With RWF flags they become explicit data recovery > > > operations, unlike fallocate() which needs to imply that "writing > > > zeroes" == "reset hardware error state". While that reset method > > > may be true for a specific pmem hardware implementation it is not a > > > requirement for all storage hardware. It's most definitely not a > > > requirement for future storage hardware, either. > > > > > > It also means that applications have no choice in what data they can > > > use to reinitialise the damaged range with because fallocate() only > > > supports writing zeroes. If we've recovered data via a read() as you > > > suggest we could, then we can rebuild the data from other redundant > > > information and immediately write that back to the storage, hence > > > repairing the fault. > > > > > > That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR > > > write into an exclusive operation and hence allow the > > > reinitialisation with the recovered/repaired state to run atomically > > > w.r.t. all other filesystem operations. i.e. the reset write > > > completes the recovery operation instead of requiring separate > > > "reset" and "write recovered data into zeroed range" steps that > > > cannot be executed atomically by userspace... > > > > /me nods > > > > Jane, want to take a run at patches for this ^^^? > > > > Sure, I'll give it a try. > > Thank you all for the discussions! Cool, thank you! --D > > -jane > >
On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
> >
> > On 9/23/2021 6:18 PM, Dan Williams wrote:
> > > On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> > > > > On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > >
> > > > > > On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > [..]
> > > > > > > Hence this discussion leads me to conclude that fallocate() simply
> > > > > > > isn't the right interface to clear storage hardware poison state and
> > > > > > > it's much simpler for everyone - kernel and userspace - to provide a
> > > > > > > pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> > > > > > > clear hardware error state before issuing this user write to the
> > > > > > > hardware.
> > > > > >
> > > > > > That flag would slot in nicely in dax_iomap_iter() as the gate for
> > > > > > whether dax_direct_access() should allow mapping over error ranges,
> > > > > > and then as a flag to dax_copy_from_iter() to indicate that it should
> > > > > > compare the incoming write to known poison and clear it before
> > > > > > proceeding.
> > > > > >
> > > > > > I like the distinction, because there's a chance the application did
> > > > > > not know that the page had experienced data loss and might want the
> > > > > > error behavior. The other service the driver could offer with this
> > > > > > flag is to do a precise check of the incoming write to make sure it
> > > > > > overlaps known poison and then repair the entire page. Repairing whole
> > > > > > pages makes for a cleaner implementation of the code that tries to
> > > > > > keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> > > > >
> > > > > This flag could also be useful for preadv2() as there is currently no
> > > > > way to read the good data in a PMEM page with poison via DAX. So the
> > > > > flag would tell dax_direct_access() to again proceed in the face of
> > > > > errors, but then the driver's dax_copy_to_iter() operation could
> > > > > either read up to the precise byte offset of the error in the page, or
> > > > > autoreplace error data with zero's to try to maximize data recovery.
> > > >
> > > > Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> > > > everything that can be read from the bad range because it's the
> > > > other half of the problem RWF_RESET_HWERROR is trying to address.
> > > > That is, the operation we want to perform on a range with an error
> > > > state is -data recovery-, not "reinitialisation". Data recovery
> > > > requires two steps:
> > > >
> > > > - "try to recover the data from the bad storage"; and
> > > > - "reinitialise the data and clear the error state"
> > > >
> > > > These naturally map to read() and write() operations, not
> > > > fallocate(). With RWF flags they become explicit data recovery
> > > > operations, unlike fallocate() which needs to imply that "writing
> > > > zeroes" == "reset hardware error state". While that reset method
> > > > may be true for a specific pmem hardware implementation it is not a
> > > > requirement for all storage hardware. It's most definitely not a
> > > > requirement for future storage hardware, either.
> > > >
> > > > It also means that applications have no choice in what data they can
> > > > use to reinitialise the damaged range with because fallocate() only
> > > > supports writing zeroes. If we've recovered data via a read() as you
> > > > suggest we could, then we can rebuild the data from other redundant
> > > > information and immediately write that back to the storage, hence
> > > > repairing the fault.
> > > >
> > > > That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> > > > write into an exclusive operation and hence allow the
> > > > reinitialisation with the recovered/repaired state to run atomically
> > > > w.r.t. all other filesystem operations. i.e. the reset write
> > > > completes the recovery operation instead of requiring separate
> > > > "reset" and "write recovered data into zeroed range" steps that
> > > > cannot be executed atomically by userspace...
> > >
> > > /me nods
> > >
> > > Jane, want to take a run at patches for this ^^^?
> > >
> >
> > Sure, I'll give it a try.
> >
> > Thank you all for the discussions!
>
> Cool, thank you!
I'd like to propose a slight modification to the API: a single RWF
flag called RWF_RECOVER_DATA. On read, this means the storage tries
to read all the data it can from the range, and for the parts it
can't read data from (cachelines, sectors, whatever) it returns as
zeroes.
On write, this means the errors over the range get cleared and the
user data provided gets written over the top of whatever was there.
Filesystems should perform this as an exclusive operation to that
range of the file.
That way we only need one IOCB_RECOVERY flag, and for communicating
with lower storage layers (e.g. dm/md raid and/or hardware) only one
REQ_RECOVERY flag is needed in the bio.
Thoughts?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
On 9/27/2021 2:07 PM, Dave Chinner wrote: > On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote: >> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote: >>> >>> On 9/23/2021 6:18 PM, Dan Williams wrote: >>>> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote: >>>>> >>>>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote: >>>>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote: >>>>>>> >>>>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote: >>>>>>> [..] >>>>>>>> Hence this discussion leads me to conclude that fallocate() simply >>>>>>>> isn't the right interface to clear storage hardware poison state and >>>>>>>> it's much simpler for everyone - kernel and userspace - to provide a >>>>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to >>>>>>>> clear hardware error state before issuing this user write to the >>>>>>>> hardware. >>>>>>> >>>>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for >>>>>>> whether dax_direct_access() should allow mapping over error ranges, >>>>>>> and then as a flag to dax_copy_from_iter() to indicate that it should >>>>>>> compare the incoming write to known poison and clear it before >>>>>>> proceeding. >>>>>>> >>>>>>> I like the distinction, because there's a chance the application did >>>>>>> not know that the page had experienced data loss and might want the >>>>>>> error behavior. The other service the driver could offer with this >>>>>>> flag is to do a precise check of the incoming write to make sure it >>>>>>> overlaps known poison and then repair the entire page. Repairing whole >>>>>>> pages makes for a cleaner implementation of the code that tries to >>>>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec(). >>>>>> >>>>>> This flag could also be useful for preadv2() as there is currently no >>>>>> way to read the good data in a PMEM page with poison via DAX. So the >>>>>> flag would tell dax_direct_access() to again proceed in the face of >>>>>> errors, but then the driver's dax_copy_to_iter() operation could >>>>>> either read up to the precise byte offset of the error in the page, or >>>>>> autoreplace error data with zero's to try to maximize data recovery. >>>>> >>>>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read >>>>> everything that can be read from the bad range because it's the >>>>> other half of the problem RWF_RESET_HWERROR is trying to address. >>>>> That is, the operation we want to perform on a range with an error >>>>> state is -data recovery-, not "reinitialisation". Data recovery >>>>> requires two steps: >>>>> >>>>> - "try to recover the data from the bad storage"; and >>>>> - "reinitialise the data and clear the error state" >>>>> >>>>> These naturally map to read() and write() operations, not >>>>> fallocate(). With RWF flags they become explicit data recovery >>>>> operations, unlike fallocate() which needs to imply that "writing >>>>> zeroes" == "reset hardware error state". While that reset method >>>>> may be true for a specific pmem hardware implementation it is not a >>>>> requirement for all storage hardware. It's most definitely not a >>>>> requirement for future storage hardware, either. >>>>> >>>>> It also means that applications have no choice in what data they can >>>>> use to reinitialise the damaged range with because fallocate() only >>>>> supports writing zeroes. If we've recovered data via a read() as you >>>>> suggest we could, then we can rebuild the data from other redundant >>>>> information and immediately write that back to the storage, hence >>>>> repairing the fault. >>>>> >>>>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR >>>>> write into an exclusive operation and hence allow the >>>>> reinitialisation with the recovered/repaired state to run atomically >>>>> w.r.t. all other filesystem operations. i.e. the reset write >>>>> completes the recovery operation instead of requiring separate >>>>> "reset" and "write recovered data into zeroed range" steps that >>>>> cannot be executed atomically by userspace... >>>> >>>> /me nods >>>> >>>> Jane, want to take a run at patches for this ^^^? >>>> >>> >>> Sure, I'll give it a try. >>> >>> Thank you all for the discussions! >> >> Cool, thank you! > > I'd like to propose a slight modification to the API: a single RWF > flag called RWF_RECOVER_DATA. On read, this means the storage tries > to read all the data it can from the range, and for the parts it > can't read data from (cachelines, sectors, whatever) it returns as > zeroes. > > On write, this means the errors over the range get cleared and the > user data provided gets written over the top of whatever was there. > Filesystems should perform this as an exclusive operation to that > range of the file. > > That way we only need one IOCB_RECOVERY flag, and for communicating > with lower storage layers (e.g. dm/md raid and/or hardware) only one > REQ_RECOVERY flag is needed in the bio. > > Thoughts? Sounds cleaner. Dan, your thoughts? thanks, -jane > > Cheers, > > Dave. >
On Mon, Sep 27, 2021 at 2:58 PM Jane Chu <jane.chu@oracle.com> wrote:
>
> On 9/27/2021 2:07 PM, Dave Chinner wrote:
> > On Thu, Sep 23, 2021 at 06:35:16PM -0700, Darrick J. Wong wrote:
> >> On Thu, Sep 23, 2021 at 06:21:19PM -0700, Jane Chu wrote:
> >>>
> >>> On 9/23/2021 6:18 PM, Dan Williams wrote:
> >>>> On Thu, Sep 23, 2021 at 3:54 PM Dave Chinner <david@fromorbit.com> wrote:
> >>>>>
> >>>>> On Wed, Sep 22, 2021 at 10:42:11PM -0700, Dan Williams wrote:
> >>>>>> On Wed, Sep 22, 2021 at 7:43 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Sep 22, 2021 at 6:42 PM Dave Chinner <david@fromorbit.com> wrote:
> >>>>>>> [..]
> >>>>>>>> Hence this discussion leads me to conclude that fallocate() simply
> >>>>>>>> isn't the right interface to clear storage hardware poison state and
> >>>>>>>> it's much simpler for everyone - kernel and userspace - to provide a
> >>>>>>>> pwritev2(RWF_CLEAR_HWERROR) flag to directly instruct the IO path to
> >>>>>>>> clear hardware error state before issuing this user write to the
> >>>>>>>> hardware.
> >>>>>>>
> >>>>>>> That flag would slot in nicely in dax_iomap_iter() as the gate for
> >>>>>>> whether dax_direct_access() should allow mapping over error ranges,
> >>>>>>> and then as a flag to dax_copy_from_iter() to indicate that it should
> >>>>>>> compare the incoming write to known poison and clear it before
> >>>>>>> proceeding.
> >>>>>>>
> >>>>>>> I like the distinction, because there's a chance the application did
> >>>>>>> not know that the page had experienced data loss and might want the
> >>>>>>> error behavior. The other service the driver could offer with this
> >>>>>>> flag is to do a precise check of the incoming write to make sure it
> >>>>>>> overlaps known poison and then repair the entire page. Repairing whole
> >>>>>>> pages makes for a cleaner implementation of the code that tries to
> >>>>>>> keep poison out of the CPU speculation path, {set,clear}_mce_nospec().
> >>>>>>
> >>>>>> This flag could also be useful for preadv2() as there is currently no
> >>>>>> way to read the good data in a PMEM page with poison via DAX. So the
> >>>>>> flag would tell dax_direct_access() to again proceed in the face of
> >>>>>> errors, but then the driver's dax_copy_to_iter() operation could
> >>>>>> either read up to the precise byte offset of the error in the page, or
> >>>>>> autoreplace error data with zero's to try to maximize data recovery.
> >>>>>
> >>>>> Yes, it could. I like the idea - say RWF_IGNORE_HWERROR - to read
> >>>>> everything that can be read from the bad range because it's the
> >>>>> other half of the problem RWF_RESET_HWERROR is trying to address.
> >>>>> That is, the operation we want to perform on a range with an error
> >>>>> state is -data recovery-, not "reinitialisation". Data recovery
> >>>>> requires two steps:
> >>>>>
> >>>>> - "try to recover the data from the bad storage"; and
> >>>>> - "reinitialise the data and clear the error state"
> >>>>>
> >>>>> These naturally map to read() and write() operations, not
> >>>>> fallocate(). With RWF flags they become explicit data recovery
> >>>>> operations, unlike fallocate() which needs to imply that "writing
> >>>>> zeroes" == "reset hardware error state". While that reset method
> >>>>> may be true for a specific pmem hardware implementation it is not a
> >>>>> requirement for all storage hardware. It's most definitely not a
> >>>>> requirement for future storage hardware, either.
> >>>>>
> >>>>> It also means that applications have no choice in what data they can
> >>>>> use to reinitialise the damaged range with because fallocate() only
> >>>>> supports writing zeroes. If we've recovered data via a read() as you
> >>>>> suggest we could, then we can rebuild the data from other redundant
> >>>>> information and immediately write that back to the storage, hence
> >>>>> repairing the fault.
> >>>>>
> >>>>> That, in turn, allows the filesystem to turn the RWF_RESET_HWERROR
> >>>>> write into an exclusive operation and hence allow the
> >>>>> reinitialisation with the recovered/repaired state to run atomically
> >>>>> w.r.t. all other filesystem operations. i.e. the reset write
> >>>>> completes the recovery operation instead of requiring separate
> >>>>> "reset" and "write recovered data into zeroed range" steps that
> >>>>> cannot be executed atomically by userspace...
> >>>>
> >>>> /me nods
> >>>>
> >>>> Jane, want to take a run at patches for this ^^^?
> >>>>
> >>>
> >>> Sure, I'll give it a try.
> >>>
> >>> Thank you all for the discussions!
> >>
> >> Cool, thank you!
> >
> > I'd like to propose a slight modification to the API: a single RWF
> > flag called RWF_RECOVER_DATA. On read, this means the storage tries
> > to read all the data it can from the range, and for the parts it
> > can't read data from (cachelines, sectors, whatever) it returns as
> > zeroes.
> >
> > On write, this means the errors over the range get cleared and the
> > user data provided gets written over the top of whatever was there.
> > Filesystems should perform this as an exclusive operation to that
> > range of the file.
> >
> > That way we only need one IOCB_RECOVERY flag, and for communicating
> > with lower storage layers (e.g. dm/md raid and/or hardware) only one
> > REQ_RECOVERY flag is needed in the bio.
> >
> > Thoughts?
>
> Sounds cleaner. Dan, your thoughts?
I like it. I was struggling with a way to unify the flag names, and
"recovery" is a good term for not surprising the caller with zeros.
I.e. don't use this flow to avoid errors, use this flow to maximize
data recovery.