From: Ira Weiny <ira.weiny@intel.com>

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption because they cannot detect whether the kernel will instantiate
page cache for a file, and cases where a global dax enable via a mount
option is too coarse.

The following patch series enables selecting the use of DAX on individual
files and/or directories on xfs, and lays some groundwork to do so in
ext4.  In this scheme the dax mount option can be omitted to allow the
per-file property to take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the mode of a file
with active mappings / page cache.  It was thought the races could be
avoided by limiting mode flips to 0-length files.  However, this turns out
not to be true.[3]  This is because address space operations (a_ops) may
be in use at any time the inode is referenced, and users have expressed a
desire to be able to change the mode on a file with data in it.  For those
reasons this patch set allows changing the mode flag on a file as long as
it is not currently mapped.

Furthermore, DAX is a property of the inode, and as such many operations
other than address space operations need to be protected during a mode
change.  Therefore callbacks are placed within the inode operations and
used to lock the inode as appropriate.

As in V1, users are able to query the effective and physical flags
separately at any time.  Specifically, the addition of the statx attribute
bit allows them to ensure the file is operating in the mode they intend.
The effective and physical flags can differ, for example when the
filesystem is mounted with the dax option.

It should be noted that physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.
The physical DAX flag and its inheritance will need to be added to other
filesystems for user control.

Finally, extensive testing was performed which resulted in a couple of
bug-fix and cleanup patches.  Specifically:

	fs: remove unneeded IS_DAX() check
	fs/xfs: Fix truncate up

'Fix truncate up' deserves specific attention because I'm not 100% sure it
is the correct fix.  Without that patch fsx testing failed within a few
minutes with this error:

	Mapped Write: non-zero data past EOF (0x3b0da) page offset 0xdb is 0x3711

With 'Fix truncate up', fsx runs while changing modes can go for hours,
but I have seen 2 other errors in the same genre after many hours of
continuous testing.  They are:

	READ BAD DATA: offset = 0x22dc, size = 0xcc7e, fname = /mnt/pmem/dax-file

	Mapped Read: non-zero data past EOF (0x3309e) page offset 0x9f is 0x6ab4

After seeing the patches to fix stale data exposure problems[4] I'm more
confident now that all 3 of these errors are a latent bug rather than a
bug in this series itself.  However, because of these failures I'm only
submitting this set as an RFC.

[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/

To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
mode (effective DAX), define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	DAX (cpu direct access) is a file mode that attempts to minimize
	software cache effects for both I/O and memory mappings of this
	file.  It requires a capable device, a compatible filesystem block
	size, and filesystem opt-in.  It generally assumes all accesses
	are via cpu load / store instructions which can minimize overhead
	for small accesses, but adversely affect cpu utilization for large
	transfers.  File I/O is done directly to/from user-space buffers.
	While the DAX property tends to result in data being transferred
	synchronously it does not give the guarantees of synchronous I/O
	that data and necessary metadata are transferred.  Memory mapped
	I/O may be performed with direct mappings that bypass system
	memory buffering.  Again, while memory-mapped I/O tends to result
	in data being transferred synchronously it does not guarantee
	synchronous metadata updates.  A dax file may optionally support
	being mapped with the MAP_SYNC flag which does allow cpu store
	operations to be considered synchronous modulo cpu cache effects.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (IS_DAX(inode))
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks whether the DAX flag
is changing as a quick check.  But the implementation mixes the physical
(XFS_DIFLAG2_DAX) and effective (S_DAX) DAX flags.

Remove the use of the effective flag when determining if a change of the
physical flag is required.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7b35d62ede9f..fe37708cea8f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1195,9 +1195,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	}
 
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect whether the inode can support
DAX, not whether DAX is enabled on it.  Leave that to other helper
functions.

Change the caller of xfs_inode_supports_dax() to call
xfs_inode_use_dax(), which implements the new logic: the effective DAX
flag is set when either the mount option or the physical DAX flag
requests it.  To make the logic clear, create two helper functions, one
for the mount flag and one for the physical flag.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8afe69ca188b..0a0ea90259e9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1234,6 +1234,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static bool
+xfs_inode_mount_is_dax(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
+}
+
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1245,11 +1254,6 @@ xfs_inode_supports_dax(
 	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
-		return false;
-
 	/* Block size must match page size */
 	if (mp->m_sb.sb_blocksize != PAGE_SIZE)
 		return false;
@@ -1258,6 +1262,22 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
+static bool
+xfs_inode_is_dax(
+	struct xfs_inode	*ip)
+{
+	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
+}
+
+static bool
+xfs_inode_use_dax(
+	struct xfs_inode	*ip)
+{
+	return xfs_inode_supports_dax(ip) &&
+	       (xfs_inode_mount_is_dax(ip) ||
+		xfs_inode_is_dax(ip));
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1276,7 +1296,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_use_dax(ip))
 		inode->i_flags |= S_DAX;
 }
-- 
2.21.0
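The decision the patch factors into three predicates can be summarized in
a small userspace model.  This is illustrative only — the struct and
function names are hypothetical, not the kernel's:

```c
#include <stdbool.h>

/* Illustrative model of the xfs_inode_use_dax() decision. */
struct dax_state {
	bool supports_dax;	/* device + block size are capable */
	bool mount_dax;		/* filesystem mounted with -o dax */
	bool inode_flag_dax;	/* physical flag (XFS_DIFLAG2_DAX) set */
};

/*
 * Effective DAX: the inode must be physically capable, and either the
 * global mount option or the per-inode flag must request it.
 */
static bool use_dax(const struct dax_state *s)
{
	return s->supports_dax && (s->mount_dax || s->inode_flag_dax);
}
```

Note that capability gates everything: neither the mount option nor the
per-inode flag can force DAX onto an incapable device.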
From: Ira Weiny <ira.weiny@intel.com>

Rather than open coding xfs_inode_supports_dax() in
xfs_ioctl_setattr_dax_invalidate(), export xfs_inode_supports_dax() and
call it, in preparation for swapping DAX flags.  This also means updating
xfs_inode_supports_dax() to return true for a directory.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 16 +++-------------
 fs/xfs/xfs_iops.c  |  8 ++++++--
 fs/xfs/xfs_iops.h  |  2 ++
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index fe37708cea8f..b5e00b67c297 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1176,23 +1176,13 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
-
-		if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
-			return -EINVAL;
-	}
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip))
+		return -EINVAL;
 
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0a0ea90259e9..d6843cdb51d0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1244,14 +1244,18 @@ xfs_inode_mount_is_dax(
 }
 
 /* Figure out if this file actually supports DAX. */
-static bool
+bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Only supported on regular files and directories. */
+	if (!(S_ISREG(VFS_I(ip)->i_mode) || S_ISDIR(VFS_I(ip)->i_mode)))
 		return false;
 
 	/* Block size must match page size */
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 4d24ff309f59..f24fec8de1d6 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -24,4 +24,6 @@ extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
 extern int xfs_vn_setattr_nonsize(struct dentry *dentry, struct iattr *vap);
 extern int xfs_vn_setattr_size(struct dentry *dentry, struct iattr *vap);
 
+extern bool xfs_inode_supports_dax(struct xfs_inode *ip);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

The IS_DAX() check in io_is_direct() causes a race between changing the
DAX mode and creating the iocb flags.  Remove the check: DAX now emulates
the page cache API, so it does not matter whether the file mode is DAX
when the iocb flags are created.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7584bcef5d3..e11989502eac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file);
 
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	return (filp->f_flags & O_DIRECT);
 }
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

One of the checks for an inode supporting DAX is whether the inode is
reflinked.  During a non-DAX to DAX mode change we could race with the
file being reflinked and end up with a reflinked file in DAX mode.

Prevent this race by checking for DAX support under the MMAP_LOCK.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b5e00b67c297..bc3654fe3b5d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1180,10 +1180,6 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	*join_flags = 0;
 
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
-	    !xfs_inode_supports_dax(ip))
-		return -EINVAL;
-
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
 	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
@@ -1197,6 +1193,13 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip)) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations, but many other functions
check the IS_DAX() mode.  DAX is a property of the inode, so we define an
inode mode lock as an inode operation which filesystems can optionally
define.

This patch defines the core function callbacks as well as puts the
locking calls in place.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/filesystems/vfs.rst | 30 ++++++++++++++++
 fs/ioctl.c                        | 23 +++++++++----
 fs/open.c                         |  4 +++
 include/linux/fs.h                | 57 +++++++++++++++++++++++++++++--
 mm/fadvise.c                      | 10 ++++--
 mm/khugepaged.c                   |  2 ++
 mm/mmap.c                         |  7 ++++
 7 files changed, 123 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..b945aa95f15a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -59,6 +59,28 @@ like open(2) the file, or stat(2) it to peek at the inode data.  The stat(2)
 operation is fairly simple: once the VFS has the dentry, it peeks at the
 inode data and passes some of it back to userspace.
 
+Changing inode 'modes' dynamically
+----------------------------------
+
+Some filesystems support different modes on their inodes which may need
+to change dynamically.  A specific example of this is DAX enabled files
+in XFS and ext4.  To switch the mode safely we lock the inode mode in all
+"normal" filesystem operations and restrict mode changes to those
+operations.
+
+To do this a filesystem must follow these rules:
+
+	1) the direct_IO address_space_operation must be supported in all
+	   potential a_ops vectors for any mode supported by the inode.
+	2) Filesystems must define the lock_mode() and unlock_mode()
+	   operations in struct inode_operations.  These functions are
+	   used by the core vfs layers to ensure that the mode is stable
+	   before allowing the core operations to proceed.
+	3) Mode changes shall not be allowed while the file is mmap'ed.
+	4) While changing modes filesystems should take exclusive locks
+	   which prevent the core vfs layer from proceeding.
+
+
 The File Object
 ---------------
 
@@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
 		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
 				   unsigned open_flag, umode_t create_mode);
 		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+		void (*lock_mode)(struct inode *);
+		void (*unlock_mode)(struct inode *);
 	};
 
 Again, all methods are called without any locks being held, unless
@@ -584,6 +608,12 @@ otherwise noted.
 	atomically creating, opening and unlinking a file in given
 	directory.
 
+``lock_mode``
+	called to prevent operations which depend on the inode's mode from
+	proceeding should a mode change be in progress
+
+``unlock_mode``
+	called when a critical mode-dependent operation is complete
 
 The Address Space Object
 ========================
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7c9a5df5a597..ed6ab5303a24 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
 	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = filp->f_inode;
 	int res, block;
 
+	lock_inode_mode(inode);
+
 	/* do we support this mess? */
-	if (!mapping->a_ops->bmap)
-		return -EINVAL;
-	if (!capable(CAP_SYS_RAWIO))
-		return -EPERM;
+	if (!mapping->a_ops->bmap) {
+		res = -EINVAL;
+		goto out;
+	}
+	if (!capable(CAP_SYS_RAWIO)) {
+		res = -EPERM;
+		goto out;
+	}
 
 	res = get_user(block, p);
 	if (res)
-		return res;
+		goto out;
 
 	res = mapping->a_ops->bmap(mapping, block);
-	return put_user(res, p);
+	res = put_user(res, p);
+
+out:
+	unlock_inode_mode(inode);
+	return res;
 }
 
 /**
diff --git a/fs/open.c b/fs/open.c
index b0be77ea8f1b..c62428bbc525 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	lock_inode_mode(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	unlock_inode_mode(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	lock_inode_mode(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	unlock_inode_mode(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e11989502eac..631f11d6246e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -359,6 +359,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without
+ * first considering how dynamic inode modes can be supported.  See the
+ * comment in struct inode_operations for the lock_mode() and
+ * unlock_mode() callbacks.
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -1817,6 +1822,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic inode modes can be supported.  See the comment
+ * in struct inode_operations for the lock_mode() and unlock_mode()
+ * callbacks.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1859,6 +1869,20 @@ struct file_operations {
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
+/*
+ * Filesystems wishing to support dynamic inode modes must do the
+ * following.
+ *
+ *	1) the direct_IO address_space_operation must be supported in all
+ *	   potential a_ops vectors for any mode supported by the inode.
+ *	2) Filesystems must define the lock_mode() and unlock_mode()
+ *	   operations in struct inode_operations.  These functions are
+ *	   used by the core vfs layers to ensure that the mode is stable
+ *	   before allowing the core operations to proceed.
+ *	3) Mode changes shall not be allowed while the file is mmap'ed
+ *	4) While changing modes filesystems should take exclusive locks
+ *	   which prevent the core vfs layer from proceeding.
+ *
+ */
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
@@ -1887,18 +1911,47 @@ struct inode_operations {
 			   umode_t create_mode);
 	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
+	void (*lock_mode)(struct inode *);
+	void (*unlock_mode)(struct inode *);
 } ____cacheline_aligned;
 
+static inline void lock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->lock_mode && !inode->i_op->unlock_mode);
+	if (inode->i_op->lock_mode)
+		inode->i_op->lock_mode(inode);
+}
+static inline void unlock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->unlock_mode && !inode->i_op->lock_mode);
+	if (inode->i_op->unlock_mode)
+		inode->i_op->unlock_mode(inode);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode *inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode *inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..a4095a5deac8 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	bdi = inode_to_bdi(mapping->host);
 
+	lock_inode_mode(inode);
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		unlock_inode_mode(inode);
+		return ret;
 	}
+	unlock_inode_mode(inode);
 
 	/*
 	 * Careful about overflows. Len == 0 means "as much as possible". Use
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..ff49da065db0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				lock_inode_mode(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				unlock_inode_mode(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/mmap.c b/mm/mmap.c
index 70f67c4515aa..dfaf1130e706 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1542,11 +1542,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (file)
+		lock_inode_mode(file_inode(file));
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
+
+	if (file)
+		unlock_inode_mode(file_inode(file));
+
 	return addr;
 }
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

XFS requires regular files to be locked while changing to/from DAX mode.

Define a new XFS_DAX_* lock type to carry the lock through the
transaction, and implement the [un]lock_mode() inode operation callbacks
with it.  We do not use the IOLOCK for this because that would cause
performance issues with locking of the inode itself.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_icache.c |  2 ++
 fs/xfs/xfs_inode.c  | 37 +++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.h  | 12 ++++++++++--
 fs/xfs/xfs_iops.c   | 24 +++++++++++++++++++++++-
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..0288672e8902 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -74,6 +74,8 @@ xfs_inode_alloc(
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
 
+	percpu_init_rwsem(&ip->i_dax_sem);
+
 	return ip;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 401da197f012..e8fd95b75e5b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  * i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_mmap_lock -> page_lock
+ * mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
@@ -181,6 +181,13 @@ xfs_ilock(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
+	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
+
+	if (lock_flags & XFS_DAX_EXCL)
+		percpu_down_write(&ip->i_dax_sem);
+	else if (lock_flags & XFS_DAX_SHARED)
+		percpu_down_read(&ip->i_dax_sem);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
@@ -224,6 +231,8 @@ xfs_ilock_nowait(
 	 * You can't set both SHARED and EXCL for the same lock,
 	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
 	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
+	 *
+	 * XFS_DAX_* is not allowed
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
@@ -232,6 +241,7 @@ xfs_ilock_nowait(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
@@ -302,6 +312,8 @@ xfs_iunlock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 	ASSERT(lock_flags != 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
+	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
 
 	if (lock_flags & XFS_IOLOCK_EXCL)
 		up_write(&VFS_I(ip)->i_rwsem);
@@ -318,6 +330,11 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		percpu_up_write(&ip->i_dax_sem);
+	else if (lock_flags & XFS_DAX_SHARED)
+		percpu_up_read(&ip->i_dax_sem);
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
@@ -333,6 +350,8 @@ xfs_ilock_demote(
 	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
 	ASSERT((lock_flags &
 		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
 
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrdemote(&ip->i_lock);
@@ -369,6 +388,13 @@ xfs_isilocked(
 		return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
 	}
 
+	if (lock_flags & (XFS_DAX_EXCL|XFS_DAX_SHARED)) {
+		if (!(lock_flags & XFS_DAX_SHARED))
+			return !debug_locks ||
+				percpu_rwsem_is_held(&ip->i_dax_sem, 0);
+		return rwsem_is_locked(&ip->i_dax_sem);
+	}
+
 	ASSERT(0);
 	return 0;
 }
@@ -465,6 +491,9 @@ xfs_lock_inodes(
 	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
 		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+
 	if (lock_mode & XFS_IOLOCK_EXCL) {
 		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
 	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
@@ -566,6 +595,10 @@ xfs_lock_two_inodes(
 	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
 	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((ip0_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+	ASSERT((ip1_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
 	if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..693ca66bd89b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -67,6 +67,9 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	/* protect changing the mode to/from DAX */
+	struct percpu_rw_semaphore i_dax_sem;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
@@ -278,10 +281,13 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAX_EXCL		(1<<6)
+#define	XFS_DAX_SHARED		(1<<7)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAX_EXCL | XFS_DAX_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAX_EXCL,		"DAX_EXCL" }, \
+	{ XFS_DAX_SHARED,	"DAX_SHARED" }
 
 /*
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d6843cdb51d0..a2f2604c3187 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1158,6 +1158,16 @@ xfs_vn_tmpfile(
 	return xfs_generic_create(dir, dentry, mode, 0, true);
 }
 
+static void xfs_lock_mode(struct inode *inode)
+{
+	xfs_ilock(XFS_I(inode), XFS_DAX_SHARED);
+}
+
+static void xfs_unlock_mode(struct inode *inode)
+{
+	xfs_iunlock(XFS_I(inode), XFS_DAX_SHARED);
+}
+
 static const struct inode_operations xfs_inode_operations = {
 	.get_acl		= xfs_get_acl,
 	.set_acl		= xfs_set_acl,
@@ -1168,6 +1178,18 @@ static const struct inode_operations xfs_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static const struct inode_operations xfs_reg_inode_operations = {
+	.get_acl		= xfs_get_acl,
+	.set_acl		= xfs_set_acl,
+	.getattr		= xfs_vn_getattr,
+	.setattr		= xfs_vn_setattr,
+	.listxattr		= xfs_vn_listxattr,
+	.fiemap			= xfs_vn_fiemap,
+	.update_time		= xfs_vn_update_time,
+	.lock_mode		= xfs_lock_mode,
+	.unlock_mode		= xfs_unlock_mode,
+};
+
 static const struct inode_operations xfs_dir_inode_operations = {
 	.create			= xfs_vn_create,
 	.lookup			= xfs_vn_lookup,
@@ -1372,7 +1394,7 @@ xfs_setup_iops(
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
-		inode->i_op = &xfs_inode_operations;
+		inode->i_op = &xfs_reg_inode_operations;
 		inode->i_fop = &xfs_file_operations;
 		if (IS_DAX(inode))
 			inode->i_mapping->a_ops = &xfs_dax_aops;
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com>

Page faults need to ensure the inode mode is correct and consistent with
the vmf information at the time of the fault.  There is no easy way to
ensure the vmf information is correct if a mode change is in progress.
Furthermore, there is no good use case that requires a mode change while
the file is mmap'ed.

Track mmap's of the file and fail the mode change if the file is mmap'ed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/inode.c         |  2 ++
 fs/xfs/xfs_ioctl.c |  8 ++++++++
 include/linux/fs.h |  1 +
 mm/mmap.c          | 19 +++++++++++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 2b0f51161918..944711aed6f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -245,6 +245,8 @@ static struct inode *alloc_inode(struct super_block *sb)
 		return NULL;
 	}
 
+	atomic64_set(&inode->i_mapped, 0);
+
 	return inode;
 }
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index bc3654fe3b5d..1ab0906c6c7f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
 		goto out_unlock;
 	}
 
+	/*
+	 * If there is a mapping in place we must remain in our current mode.
+	 */
+	if (atomic64_read(&inode->i_mapped)) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 631f11d6246e..6e7dc626b657 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -740,6 +740,7 @@ struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+	atomic64_t		i_mapped;
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
diff --git a/mm/mmap.c b/mm/mmap.c
index dfaf1130e706..e6b68924b7ca 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *f = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (f) {
+		struct inode *inode = file_inode(f);
+
+		if (inode)
+			atomic64_dec(&inode->i_mapped);
+		fput(f);
+	}
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -1837,6 +1842,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	/*
+	 * Track if there is a mapping in place such that a mode change
+	 * does not occur on a file which is mapped.
+	 */
+	if (file) {
+		struct inode *inode = file_inode(file);
+
+		atomic64_inc(&inode->i_mapped);
+	}
+
 	return addr;
 
 unmap_and_free_vma:
-- 
2.21.0
From: Ira Weiny <ira.weiny@intel.com> When zeroing the end of a file we must account for bytes contained in the final page which are past EOF. Extend the range passed to iomap_zero_range() to reach LLONG_MAX which will include all bytes of the final page. Signed-off-by: Ira Weiny <ira.weiny@intel.com> --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a2f2604c3187..a34b04e8ac9c 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -910,7 +910,7 @@ xfs_setattr_size( */ if (newsize > oldsize) { trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); - error = iomap_zero_range(inode, oldsize, newsize - oldsize, + error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize, &did_zeroing, &xfs_buffered_write_iomap_ops); } else { error = iomap_truncate_page(inode, newsize, &did_zeroing, -- 2.21.0
From: Ira Weiny <ira.weiny@intel.com> Define a variable to hold the lock flags to ensure that the correct locks are returned or released on error. Signed-off-by: Ira Weiny <ira.weiny@intel.com> --- fs/xfs/xfs_ioctl.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 1ab0906c6c7f..9a35bf83eaa1 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1176,7 +1176,7 @@ xfs_ioctl_setattr_dax_invalidate( int *join_flags) { struct inode *inode = VFS_I(ip); - int error; + int error, flags; *join_flags = 0; @@ -1191,8 +1191,10 @@ xfs_ioctl_setattr_dax_invalidate( if (S_ISDIR(inode->i_mode)) return 0; + flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL; + /* lock, flush and invalidate mapping in preparation for flag change */ - xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL); + xfs_ilock(ip, flags); if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX && !xfs_inode_supports_dax(ip)) { @@ -1215,11 +1217,11 @@ xfs_ioctl_setattr_dax_invalidate( if (error) goto out_unlock; - *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL; + *join_flags = flags; return 0; out_unlock: - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL); + xfs_iunlock(ip, flags); return error; } -- 2.21.0
From: Ira Weiny <ira.weiny@intel.com> Now that locking of the inode is in place we can allow a mode change while under the new lock. Signed-off-by: Ira Weiny <ira.weiny@intel.com> --- fs/xfs/xfs_inode.h | 1 + fs/xfs/xfs_ioctl.c | 9 ++++++--- fs/xfs/xfs_iops.c | 15 +++++++++++---- 3 files changed, 18 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 693ca66bd89b..b0d2e7da88c6 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -474,6 +474,7 @@ int xfs_break_layouts(struct inode *inode, uint *iolock, /* from xfs_iops.c */ extern void xfs_setup_inode(struct xfs_inode *ip); extern void xfs_setup_iops(struct xfs_inode *ip); +extern void xfs_setup_a_ops(struct xfs_inode *ip); /* * When setting up a newly allocated inode, we need to call diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 9a35bf83eaa1..e07559bf70a9 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1109,12 +1109,11 @@ xfs_diflags_to_linux( inode->i_flags |= S_NOATIME; else inode->i_flags &= ~S_NOATIME; -#if 0 /* disabled until the flag switching races are sorted out */ if (xflags & FS_XFLAG_DAX) inode->i_flags |= S_DAX; else inode->i_flags &= ~S_DAX; -#endif + } static int @@ -1191,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate( if (S_ISDIR(inode->i_mode)) return 0; - flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL; + flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL | XFS_DAX_EXCL; /* lock, flush and invalidate mapping in preparation for flag change */ xfs_ilock(ip, flags); @@ -1512,6 +1511,8 @@ xfs_ioctl_setattr( else ip->i_d.di_cowextsize = 0; + xfs_setup_a_ops(ip); + code = xfs_trans_commit(tp); /* @@ -1621,6 +1622,8 @@ xfs_ioc_setxflags( goto out_drop_write; } + xfs_setup_a_ops(ip); + error = xfs_trans_commit(tp); out_drop_write: mnt_drop_write_file(filp); diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a34b04e8ac9c..c70164a0df97 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1386,6 +1386,16 @@ xfs_setup_inode( } } +void 
xfs_setup_a_ops(struct xfs_inode *ip) +{ + struct inode *inode = &ip->i_vnode; + + if (IS_DAX(inode)) + inode->i_mapping->a_ops = &xfs_dax_aops; + else + inode->i_mapping->a_ops = &xfs_address_space_operations; +} + void xfs_setup_iops( struct xfs_inode *ip) @@ -1396,10 +1406,7 @@ xfs_setup_iops( case S_IFREG: inode->i_op = &xfs_reg_inode_operations; inode->i_fop = &xfs_file_operations; - if (IS_DAX(inode)) - inode->i_mapping->a_ops = &xfs_dax_aops; - else - inode->i_mapping->a_ops = &xfs_address_space_operations; + xfs_setup_a_ops(ip); break; case S_IFDIR: if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb)) -- 2.21.0
On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > DAX requires special address space operations but many other functions > check the IS_DAX() mode. > > DAX is a property of the inode thus we define an inode mode lock as an > inode operation which file systems can optionally define. > > This patch defines the core function callbacks as well as puts the > locking calls in place. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > --- > Documentation/filesystems/vfs.rst | 30 ++++++++++++++++ > fs/ioctl.c | 23 +++++++++---- > fs/open.c | 4 +++ > include/linux/fs.h | 57 +++++++++++++++++++++++++++++-- > mm/fadvise.c | 10 ++++-- > mm/khugepaged.c | 2 ++ > mm/mmap.c | 7 ++++ > 7 files changed, 123 insertions(+), 10 deletions(-) > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index 7d4d09dd5e6d..b945aa95f15a 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -59,6 +59,28 @@ like open(2) the file, or stat(2) it to peek at the inode data. The > stat(2) operation is fairly simple: once the VFS has the dentry, it > peeks at the inode data and passes some of it back to userspace. > > +Changing inode 'modes' dynamically > +---------------------------------- > + > +Some file systems may have different modes for their inodes which require > +dynamic changing. A specific example of this is DAX enabled files in XFS and > +ext4. To switch the mode safely we lock the inode mode in all "normal" file > +system operations and restrict mode changes to those operations. > + > +To do this a file system must follow the following rules: > + > + 1) the direct_IO address_space_operation must be supported in all > + potential a_ops vectors for any mode supported by the inode. > + 2) Filesystems must define the lock_mode() and unlock_mode() operations > + in struct inode_operations.
These functions are used by the core > + vfs layers to ensure that the mode is stable before allowing the > + core operations to proceed. > + 3) Mode changes shall not be allowed while the file is mmap'ed > + 4) While changing modes filesystems should take exclusive locks which > + prevent the core vfs layer from proceeding. > + > + > > The File Object > --------------- > @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined: > int (*atomic_open)(struct inode *, struct dentry *, struct file *, > unsigned open_flag, umode_t create_mode); > int (*tmpfile) (struct inode *, struct dentry *, umode_t); > + void (*lock_mode)(struct inode *); > + void (*unlock_mode)(struct inode *); Yikes. "mode" has a specific meaning for inodes, and this lock isn't related to i_mode. This lock protects aops from changing while an address space operation is in use. > }; > > Again, all methods are called without any locks being held, unless > @@ -584,6 +608,12 @@ otherwise noted. > atomically creating, opening and unlinking a file in given > directory. > > +``lock_mode`` > + called to prevent operations which depend on the inode's mode from > + proceeding should a mode change be in progress "Inodes can't change mode, because files do not suddenly become directories". ;) Oh, you meant "lock_XXXX is called to prevent a change in the pagecache mode from proceeding while there are address space operations in progress". So these are really more aops get and put functions... 
> +``unlock_mode`` > + called when critical mode dependent operation is complete > > The Address Space Object > ======================== > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 7c9a5df5a597..ed6ab5303a24 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl); > static int ioctl_fibmap(struct file *filp, int __user *p) > { > struct address_space *mapping = filp->f_mapping; > + struct inode *inode = filp->f_inode; > int res, block; > > + lock_inode_mode(inode); > + > /* do we support this mess? */ > - if (!mapping->a_ops->bmap) > - return -EINVAL; > - if (!capable(CAP_SYS_RAWIO)) > - return -EPERM; > + if (!mapping->a_ops->bmap) { > + res = -EINVAL; > + goto out; > + } > + if (!capable(CAP_SYS_RAWIO)) { > + res = -EPERM; > + goto out; Why does the order of these checks change here? > + } > res = get_user(block, p); > if (res) > - return res; > + goto out; > res = mapping->a_ops->bmap(mapping, block); > - return put_user(res, p); > + res = put_user(res, p); > + > +out: > + unlock_inode_mode(inode); > + return res; > } > > /** > diff --git a/fs/open.c b/fs/open.c > index b0be77ea8f1b..c62428bbc525 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs, > if (ret) > newattrs.ia_valid |= ret | ATTR_FORCE; > > + lock_inode_mode(dentry->d_inode); > inode_lock(dentry->d_inode); > /* Note any delegations or leases have already been broken: */ > ret = notify_change(dentry, &newattrs, NULL); > inode_unlock(dentry->d_inode); > + unlock_inode_mode(dentry->d_inode); > return ret; > } > > @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > return -EOPNOTSUPP; > > file_start_write(file); > + lock_inode_mode(inode); > ret = file->f_op->fallocate(file, mode, offset, len); > + unlock_inode_mode(inode); > > /* > * Create inotify and fanotify events. 
> diff --git a/include/linux/fs.h b/include/linux/fs.h > index e11989502eac..631f11d6246e 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -359,6 +359,11 @@ typedef struct { > typedef int (*read_actor_t)(read_descriptor_t *, struct page *, > unsigned long, unsigned long); > > +/** > + * NOTE: DO NOT define new functions in address_space_operations without first > + * considering how dynamic inode modes can be supported. See the comment in > + * struct inode_operations for the lock_mode() and unlock_mode() callbacks. > + */ > struct address_space_operations { > int (*writepage)(struct page *page, struct writeback_control *wbc); > int (*readpage)(struct file *, struct page *); > @@ -1817,6 +1822,11 @@ struct block_device_operations; > > struct iov_iter; > > +/** > + * NOTE: DO NOT define new functions in file_operations without first > + * considering how dynamic inode modes can be supported. See the comment in > + * struct inode_operations for the lock_mode() and unlock_mode() callbacks. > + */ > struct file_operations { > struct module *owner; > loff_t (*llseek) (struct file *, loff_t, int); > @@ -1859,6 +1869,20 @@ struct file_operations { > int (*fadvise)(struct file *, loff_t, loff_t, int); > } __randomize_layout; > > +/* > + * Filesystems wishing to support dynamic inode modes must do the following. > + * > + * 1) the direct_IO address_space_operation must be supported in all > + * potential a_ops vectors for any mode supported by the inode. > + * 2) Filesystems must define the lock_mode() and unlock_mode() operations > + * in struct inode_operations. These functions are used by the core > + * vfs layers to ensure that the mode is stable before allowing the > + * core operations to proceed. > + * 3) Mode changes shall not be allowed while the file is mmap'ed > + * 4) While changing modes filesystems should take exclusive locks which > + * prevent the core vfs layer from proceeding.
> + * > + */ > struct inode_operations { > struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); > const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); > @@ -1887,18 +1911,47 @@ struct inode_operations { > umode_t create_mode); > int (*tmpfile) (struct inode *, struct dentry *, umode_t); > int (*set_acl)(struct inode *, struct posix_acl *, int); > + void (*lock_mode)(struct inode*); > + void (*unlock_mode)(struct inode*); > } ____cacheline_aligned; > > +static inline void lock_inode_mode(struct inode *inode) inode_aops_get()? > +{ > + WARN_ON_ONCE(inode->i_op->lock_mode && > + !inode->i_op->unlock_mode); > + if (inode->i_op->lock_mode) > + inode->i_op->lock_mode(inode); > +} > +static inline void unlock_inode_mode(struct inode *inode) > +{ > + WARN_ON_ONCE(inode->i_op->unlock_mode && > + !inode->i_op->lock_mode); > + if (inode->i_op->unlock_mode) > + inode->i_op->unlock_mode(inode); > +} > + > static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio, > struct iov_iter *iter) inode_aops_put()? 
--D > { > - return file->f_op->read_iter(kio, iter); > + struct inode *inode = file_inode(kio->ki_filp); > + ssize_t ret; > + > + lock_inode_mode(inode); > + ret = file->f_op->read_iter(kio, iter); > + unlock_inode_mode(inode); > + return ret; > } > > static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, > struct iov_iter *iter) > { > - return file->f_op->write_iter(kio, iter); > + struct inode *inode = file_inode(kio->ki_filp); > + ssize_t ret; > + > + lock_inode_mode(inode); > + ret = file->f_op->write_iter(kio, iter); > + unlock_inode_mode(inode); > + return ret; > } > > static inline int call_mmap(struct file *file, struct vm_area_struct *vma) > diff --git a/mm/fadvise.c b/mm/fadvise.c > index 4f17c83db575..a4095a5deac8 100644 > --- a/mm/fadvise.c > +++ b/mm/fadvise.c > @@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > > bdi = inode_to_bdi(mapping->host); > > + lock_inode_mode(inode); > if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) { > + int ret = 0; > + > switch (advice) { > case POSIX_FADV_NORMAL: > case POSIX_FADV_RANDOM: > @@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice) > /* no bad return value, but ignore advice */ > break; > default: > - return -EINVAL; > + ret = -EINVAL; > } > - return 0; > + > + unlock_inode_mode(inode); > + return ret; > } > + unlock_inode_mode(inode); > > /* > * Careful about overflows. Len == 0 means "as much as possible". 
Use > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index b679908743cb..ff49da065db0 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm, > } else { /* !is_shmem */ > if (!page || xa_is_value(page)) { > xas_unlock_irq(&xas); > + lock_inode_mode(file->f_inode); > page_cache_sync_readahead(mapping, &file->f_ra, > file, index, > PAGE_SIZE); > + unlock_inode_mode(file->f_inode); > /* drain pagevecs to help isolate_lru_page() */ > lru_add_drain(); > page = find_lock_page(mapping, index); > diff --git a/mm/mmap.c b/mm/mmap.c > index 70f67c4515aa..dfaf1130e706 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -1542,11 +1542,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr, > vm_flags |= VM_NORESERVE; > } > > + if (file) > + lock_inode_mode(file_inode(file)); > + > addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); > if (!IS_ERR_VALUE(addr) && > ((vm_flags & VM_LOCKED) || > (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) > *populate = len; > + > + if (file) > + unlock_inode_mode(file_inode(file)); > + > return addr; > } > > -- > 2.21.0 >
On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > XFS requires regular files to be locked while changing to/from DAX mode. > > Define a new DAX lock type and implement the [un]lock_mode() inode > operation callbacks. > > We define a new XFS_DAX_* lock type to carry the lock through the > transaction because we don't want to use IOLOCK as that would cause > performance issues with locking of the inode itself. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > --- > fs/xfs/xfs_icache.c | 2 ++ > fs/xfs/xfs_inode.c | 37 +++++++++++++++++++++++++++++++++++-- > fs/xfs/xfs_inode.h | 12 ++++++++++-- > fs/xfs/xfs_iops.c | 24 +++++++++++++++++++++++- > 4 files changed, 70 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 8dc2e5414276..0288672e8902 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -74,6 +74,8 @@ xfs_inode_alloc( > INIT_LIST_HEAD(&ip->i_ioend_list); > spin_lock_init(&ip->i_ioend_lock); > > + percpu_init_rwsem(&ip->i_dax_sem); > + > return ip; > } > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > index 401da197f012..e8fd95b75e5b 100644 > --- a/fs/xfs/xfs_inode.c > +++ b/fs/xfs/xfs_inode.c > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared( > * > * Basic locking order: > * > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock Mmmmmm, more locks. Can we skip the extra lock if CONFIG_FSDAX=n or if the filesystem devices don't support DAX at all? Also, I don't think we're actually following the i_rwsem -> i_daxsem order in fallocate, and possibly elsewhere too? Does the vfs have to take the i_dax_sem to do remapping things like reflink? 
(Pretend that reflink and dax are compatible for the moment) > * mmap_sem locking order: > * > * i_rwsem -> page lock -> mmap_sem > - * mmap_sem -> i_mmap_lock -> page_lock > + * mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock > * > * The difference in mmap_sem locking order mean that we cannot hold the > * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can > @@ -181,6 +181,13 @@ xfs_ilock( > ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) != > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0); > + ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) != > + (XFS_DAX_SHARED | XFS_DAX_EXCL)); > + > + if (lock_flags & XFS_DAX_EXCL) > + percpu_down_write(&ip->i_dax_sem); > + else if (lock_flags & XFS_DAX_SHARED) > + percpu_down_read(&ip->i_dax_sem); > > if (lock_flags & XFS_IOLOCK_EXCL) { > down_write_nested(&VFS_I(ip)->i_rwsem, > @@ -224,6 +231,8 @@ xfs_ilock_nowait( > * You can't set both SHARED and EXCL for the same lock, > * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED, > * and XFS_ILOCK_EXCL are valid values to set in lock_flags. 
> + * > + * XFS_DAX_* is not allowed > */ > ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) != > (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)); > @@ -232,6 +241,7 @@ xfs_ilock_nowait( > ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) != > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0); > + ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0); > > if (lock_flags & XFS_IOLOCK_EXCL) { > if (!down_write_trylock(&VFS_I(ip)->i_rwsem)) > @@ -302,6 +312,8 @@ xfs_iunlock( > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)); > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0); > ASSERT(lock_flags != 0); > + ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) != > + (XFS_DAX_SHARED | XFS_DAX_EXCL)); > > if (lock_flags & XFS_IOLOCK_EXCL) > up_write(&VFS_I(ip)->i_rwsem); > @@ -318,6 +330,11 @@ xfs_iunlock( > else if (lock_flags & XFS_ILOCK_SHARED) > mrunlock_shared(&ip->i_lock); > > + if (lock_flags & XFS_DAX_EXCL) > + percpu_up_write(&ip->i_dax_sem); > + else if (lock_flags & XFS_DAX_SHARED) > + percpu_up_read(&ip->i_dax_sem); > + > trace_xfs_iunlock(ip, lock_flags, _RET_IP_); > } > > @@ -333,6 +350,8 @@ xfs_ilock_demote( > ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)); > ASSERT((lock_flags & > ~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0); > + /* XFS_DAX_* is not allowed */ > + ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0); > > if (lock_flags & XFS_ILOCK_EXCL) > mrdemote(&ip->i_lock); > @@ -369,6 +388,13 @@ xfs_isilocked( > return rwsem_is_locked(&VFS_I(ip)->i_rwsem); > } > > + if (lock_flags & (XFS_DAX_EXCL|XFS_DAX_SHARED)) { > + if (!(lock_flags & XFS_DAX_SHARED)) > + return !debug_locks || > + percpu_rwsem_is_held(&ip->i_dax_sem, 0); > + return rwsem_is_locked(&ip->i_dax_sem); > + } > + > ASSERT(0); > return 0; > } > @@ -465,6 +491,9 @@ xfs_lock_inodes( > ASSERT(!(lock_mode & XFS_ILOCK_EXCL) || > inodes <= XFS_ILOCK_MAX_SUBCLASS 
+ 1); > > + /* XFS_DAX_* is not allowed */ > + ASSERT((lock_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0); > + > if (lock_mode & XFS_IOLOCK_EXCL) { > ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL))); > } else if (lock_mode & XFS_MMAPLOCK_EXCL) > @@ -566,6 +595,10 @@ xfs_lock_two_inodes( > ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) || > !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))); > > + /* XFS_DAX_* is not allowed */ > + ASSERT((ip0_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0); > + ASSERT((ip1_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0); > + > ASSERT(ip0->i_ino != ip1->i_ino); > > if (ip0->i_ino > ip1->i_ino) { > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index 492e53992fa9..693ca66bd89b 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -67,6 +67,9 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* protect changing the mode to/from DAX */ > + struct percpu_rw_semaphore i_dax_sem; > } xfs_inode_t; > > /* Convert from vfs inode to xfs inode */ > @@ -278,10 +281,13 @@ static inline void xfs_ifunlock(struct xfs_inode *ip) > #define XFS_ILOCK_SHARED (1<<3) > #define XFS_MMAPLOCK_EXCL (1<<4) > #define XFS_MMAPLOCK_SHARED (1<<5) > +#define XFS_DAX_EXCL (1<<6) > +#define XFS_DAX_SHARED (1<<7) > > #define XFS_LOCK_MASK (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \ > | XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \ > - | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED) > + | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \ > + | XFS_DAX_EXCL | XFS_DAX_SHARED) > > #define XFS_LOCK_FLAGS \ > { XFS_IOLOCK_EXCL, "IOLOCK_EXCL" }, \ > @@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip) > { XFS_ILOCK_EXCL, "ILOCK_EXCL" }, \ > { XFS_ILOCK_SHARED, "ILOCK_SHARED" }, \ > { XFS_MMAPLOCK_EXCL, "MMAPLOCK_EXCL" }, \ > - { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" } > + { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }, \ > + { XFS_DAX_EXCL, "DAX_EXCL" }, \ 
Whitespace between the comma & string. > + { XFS_DAX_SHARED, "DAX_SHARED" } > > > /* > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index d6843cdb51d0..a2f2604c3187 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -1158,6 +1158,16 @@ xfs_vn_tmpfile( > return xfs_generic_create(dir, dentry, mode, 0, true); > } > > +static void xfs_lock_mode(struct inode *inode) > +{ > + xfs_ilock(XFS_I(inode), XFS_DAX_SHARED); > +} > + > +static void xfs_unlock_mode(struct inode *inode) > +{ > + xfs_iunlock(XFS_I(inode), XFS_DAX_SHARED); > +} > + > static const struct inode_operations xfs_inode_operations = { > .get_acl = xfs_get_acl, > .set_acl = xfs_set_acl, > @@ -1168,6 +1178,18 @@ static const struct inode_operations xfs_inode_operations = { > .update_time = xfs_vn_update_time, > }; > > +static const struct inode_operations xfs_reg_inode_operations = { > + .get_acl = xfs_get_acl, > + .set_acl = xfs_set_acl, > + .getattr = xfs_vn_getattr, > + .setattr = xfs_vn_setattr, > + .listxattr = xfs_vn_listxattr, > + .fiemap = xfs_vn_fiemap, > + .update_time = xfs_vn_update_time, > + .lock_mode = xfs_lock_mode, > + .unlock_mode = xfs_unlock_mode, > +}; > + > static const struct inode_operations xfs_dir_inode_operations = { > .create = xfs_vn_create, > .lookup = xfs_vn_lookup, > @@ -1372,7 +1394,7 @@ xfs_setup_iops( > > switch (inode->i_mode & S_IFMT) { > case S_IFREG: > - inode->i_op = &xfs_inode_operations; > + inode->i_op = &xfs_reg_inode_operations; xfs_file_inode_operations? --D > inode->i_fop = &xfs_file_operations; > if (IS_DAX(inode)) > inode->i_mapping->a_ops = &xfs_dax_aops; > -- > 2.21.0 >
On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > Page faults need to ensure the inode mode is correct and consistent with > the vmf information at the time of the fault. There is no easy way to > ensure the vmf information is correct if a mode change is in progress. > Furthermore, there is no good use case to require a mode change while > the file is mmap'ed. > > Track mmap's of the file and fail the mode change if the file is > mmap'ed. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > --- > fs/inode.c | 2 ++ > fs/xfs/xfs_ioctl.c | 8 ++++++++ > include/linux/fs.h | 1 + > mm/mmap.c | 19 +++++++++++++++++-- > 4 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index 2b0f51161918..944711aed6f8 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -245,6 +245,8 @@ static struct inode *alloc_inode(struct super_block *sb) > return NULL; > } > > + atomic64_set(&inode->i_mapped, 0); > + > return inode; > } > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > index bc3654fe3b5d..1ab0906c6c7f 100644 > --- a/fs/xfs/xfs_ioctl.c > +++ b/fs/xfs/xfs_ioctl.c > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate( > goto out_unlock; > } > > + /* > + * If there is a mapping in place we must remain in our current mode. > + */ > + if (atomic64_read(&inode->i_mapped)) { Urk, should we really be messing around with the address space internals? > + error = -EBUSY; > + goto out_unlock; > + } > + > error = filemap_write_and_wait(inode->i_mapping); > if (error) > goto out_unlock; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 631f11d6246e..6e7dc626b657 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -740,6 +740,7 @@ struct inode { > #endif > > void *i_private; /* fs or device private pointer */ > + atomic64_t i_mapped; I would have expected to find this in struct address_space since the mapping count is a function of the address space, right? 
--D > } __randomize_layout; > > struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode); > diff --git a/mm/mmap.c b/mm/mmap.c > index dfaf1130e706..e6b68924b7ca 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma) > static struct vm_area_struct *remove_vma(struct vm_area_struct *vma) > { > struct vm_area_struct *next = vma->vm_next; > + struct file *f = vma->vm_file; > > might_sleep(); > if (vma->vm_ops && vma->vm_ops->close) > vma->vm_ops->close(vma); > - if (vma->vm_file) > - fput(vma->vm_file); > + if (f) { > + struct inode *inode = file_inode(f); > + if (inode) > + atomic64_dec(&inode->i_mapped); > + fput(f); > + } > mpol_put(vma_policy(vma)); > vm_area_free(vma); > return next; > @@ -1837,6 +1842,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr, > > vma_set_page_prot(vma); > > + /* > + * Track if there is mapping in place such that a mode change > + * does not occur on a file which is mapped > + */ > + if (file) { > + struct inode *inode = file_inode(file); > + > + atomic64_inc(&inode->i_mapped); > + } > + > return addr; > > unmap_and_free_vma: > -- > 2.21.0 >
On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > When zeroing the end of a file we must account for bytes contained in > the final page which are past EOF. > > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which > will include all bytes of the final page. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > --- > fs/xfs/xfs_iops.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index a2f2604c3187..a34b04e8ac9c 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -910,7 +910,7 @@ xfs_setattr_size( > */ > if (newsize > oldsize) { > trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); > - error = iomap_zero_range(inode, oldsize, newsize - oldsize, > + error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize, Huh? Won't this cause the file size to be set to LLONG_MAX? --D > &did_zeroing, &xfs_buffered_write_iomap_ops); > } else { > error = iomap_truncate_page(inode, newsize, &did_zeroing, > -- > 2.21.0 >
On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote: > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> [snip] > > > > The File Object > > --------------- > > @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined: > > int (*atomic_open)(struct inode *, struct dentry *, struct file *, > > unsigned open_flag, umode_t create_mode); > > int (*tmpfile) (struct inode *, struct dentry *, umode_t); > > + void (*lock_mode)(struct inode *); > > + void (*unlock_mode)(struct inode *); > > Yikes. "mode" has a specific meaning for inodes, and this lock isn't > related to i_mode. This lock protects aops from changing while an > address space operation is in use. Ah... yea ok mode is a bad name. > > > }; > > > > Again, all methods are called without any locks being held, unless > > @@ -584,6 +608,12 @@ otherwise noted. > > atomically creating, opening and unlinking a file in given > > directory. > > > > +``lock_mode`` > > + called to prevent operations which depend on the inode's mode from > > + proceeding should a mode change be in progress > > "Inodes can't change mode, because files do not suddenly become > directories". ;) Yea sorry. > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache > mode from proceeding while there are address space operations in > progress". So these are really more aops get and put functions... At first I actually did have aops get/put functions but this is really protecting more than the aops vector because as Christoph said there are file operations which need to be protected not just address space operations. But I agree "mode" is a bad name... Sorry... 
> > > +``unlock_mode`` > > + called when critical mode dependent operation is complete > > > > The Address Space Object > > ======================== > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > index 7c9a5df5a597..ed6ab5303a24 100644 > > --- a/fs/ioctl.c > > +++ b/fs/ioctl.c > > @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl); > > static int ioctl_fibmap(struct file *filp, int __user *p) > > { > > struct address_space *mapping = filp->f_mapping; > > + struct inode *inode = filp->f_inode; > > int res, block; > > > > + lock_inode_mode(inode); > > + > > /* do we support this mess? */ > > - if (!mapping->a_ops->bmap) > > - return -EINVAL; > > - if (!capable(CAP_SYS_RAWIO)) > > - return -EPERM; > > + if (!mapping->a_ops->bmap) { > > + res = -EINVAL; > > + goto out; > > + } > > + if (!capable(CAP_SYS_RAWIO)) { > > + res = -EPERM; > > + goto out; > > Why does the order of these checks change here? I don't understand? The order does not change we just can't return without releasing the lock. And to protect against bmap changing the lock needs to be taken first. [snip] > > > > +static inline void lock_inode_mode(struct inode *inode) > > inode_aops_get()? Let me think on this. This is not just getting a reference to the aops vector. It is more than that... and inode_get is not right either! ;-P > > > +{ > > + WARN_ON_ONCE(inode->i_op->lock_mode && > > + !inode->i_op->unlock_mode); > > + if (inode->i_op->lock_mode) > > + inode->i_op->lock_mode(inode); > > +} > > +static inline void unlock_inode_mode(struct inode *inode) > > +{ > > + WARN_ON_ONCE(inode->i_op->unlock_mode && > > + !inode->i_op->lock_mode); > > + if (inode->i_op->unlock_mode) > > + inode->i_op->unlock_mode(inode); > > +} > > + > > static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio, > > struct iov_iter *iter) > > inode_aops_put()? ... something like that but not 'aops'... Ira > > --D >
On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote: > On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> [snip] > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index 401da197f012..e8fd95b75e5b 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared( > > * > > * Basic locking order: > > * > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock > > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock > > Mmmmmm, more locks. Can we skip the extra lock if CONFIG_FSDAX=n or if > the filesystem devices don't support DAX at all? I'll look into it. > > Also, I don't think we're actually following the i_rwsem -> i_daxsem > order in fallocate, and possibly elsewhere too? I'll have to verify. It took a lot of iterations to get the order working so I'm not going to claim perfection. > > Does the vfs have to take the i_dax_sem to do remapping things like > reflink? (Pretend that reflink and dax are compatible for the moment) Honestly I can't say for sure. For this series I was careful to exclude reflink from the locking requirement. [snip] > > > > #define XFS_LOCK_FLAGS \ > > { XFS_IOLOCK_EXCL, "IOLOCK_EXCL" }, \ > > @@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip) > > { XFS_ILOCK_EXCL, "ILOCK_EXCL" }, \ > > { XFS_ILOCK_SHARED, "ILOCK_SHARED" }, \ > > { XFS_MMAPLOCK_EXCL, "MMAPLOCK_EXCL" }, \ > > - { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" } > > + { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }, \ > > + { XFS_DAX_EXCL, "DAX_EXCL" }, \ > > Whitespace between the comma & string. Fixed. 
[snip]

> > +
> >  static const struct inode_operations xfs_dir_inode_operations = {
> >  	.create			= xfs_vn_create,
> >  	.lookup			= xfs_vn_lookup,
> > @@ -1372,7 +1394,7 @@ xfs_setup_iops(
> >
> >  	switch (inode->i_mode & S_IFMT) {
> >  	case S_IFREG:
> > -		inode->i_op = &xfs_inode_operations;
> > +		inode->i_op = &xfs_reg_inode_operations;
>
> xfs_file_inode_operations?

Sounds better.  Fixed.

Ira
On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > When zeroing the end of a file we must account for bytes contained in
> > the final page which are past EOF.
> >
> > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
> > will include all bytes of the final page.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/xfs/xfs_iops.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a2f2604c3187..a34b04e8ac9c 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -910,7 +910,7 @@ xfs_setattr_size(
> >  	 */
> >  	if (newsize > oldsize) {
> >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > +		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,
>
> Huh?  Won't this cause the file size to be set to LLONG_MAX?

Not as I understand the code.  But as I said in the cover I am not 100%
sure of this fix.

From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate
the mappings and the page cache, and the traces I have indicate that the
DAX mode is not changing or was properly held off.

Any other suggestions as to the problem are welcome.

Ira

> --D
>
> >  					   &did_zeroing, &xfs_buffered_write_iomap_ops);
> >  	} else {
> >  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > --
> > 2.21.0
> >
On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote: > On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > [snip] > > > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > > index bc3654fe3b5d..1ab0906c6c7f 100644 > > --- a/fs/xfs/xfs_ioctl.c > > +++ b/fs/xfs/xfs_ioctl.c > > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate( > > goto out_unlock; > > } > > > > + /* > > + * If there is a mapping in place we must remain in our current mode. > > + */ > > + if (atomic64_read(&inode->i_mapped)) { > > Urk, should we really be messing around with the address space > internals? I contemplated a function call instead of checking i_mapped directly? Is that what you mean? > > > + error = -EBUSY; > > + goto out_unlock; > > + } > > + > > error = filemap_write_and_wait(inode->i_mapping); > > if (error) > > goto out_unlock; > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 631f11d6246e..6e7dc626b657 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -740,6 +740,7 @@ struct inode { > > #endif > > > > void *i_private; /* fs or device private pointer */ > > + atomic64_t i_mapped; > > I would have expected to find this in struct address_space since the > mapping count is a function of the address space, right? I suppose but the only external call (above) would be passing an inode. So to me it seemed better here. Ira > > --D >
On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote: > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote: > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > [snip] > > > > > > > The File Object > > > --------------- > > > @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined: > > > int (*atomic_open)(struct inode *, struct dentry *, struct file *, > > > unsigned open_flag, umode_t create_mode); > > > int (*tmpfile) (struct inode *, struct dentry *, umode_t); > > > + void (*lock_mode)(struct inode *); > > > + void (*unlock_mode)(struct inode *); > > > > Yikes. "mode" has a specific meaning for inodes, and this lock isn't > > related to i_mode. This lock protects aops from changing while an > > address space operation is in use. > > Ah... yea ok mode is a bad name. > > > > > > }; > > > > > > Again, all methods are called without any locks being held, unless > > > @@ -584,6 +608,12 @@ otherwise noted. > > > atomically creating, opening and unlinking a file in given > > > directory. > > > > > > +``lock_mode`` > > > + called to prevent operations which depend on the inode's mode from > > > + proceeding should a mode change be in progress > > > > "Inodes can't change mode, because files do not suddenly become > > directories". ;) > > Yea sorry. > > > > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache > > mode from proceeding while there are address space operations in > > progress". So these are really more aops get and put functions... > > At first I actually did have aops get/put functions but this is really > protecting more than the aops vector because as Christoph said there are file > operations which need to be protected not just address space operations. > > But I agree "mode" is a bad name... Sorry... inode_fops_{get,set}(), then? inode_start_fileop() inode_end_fileop() ? 
Trying to avoid sounding foppish <COUGH> > > > > > +``unlock_mode`` > > > + called when critical mode dependent operation is complete > > > > > > The Address Space Object > > > ======================== > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > index 7c9a5df5a597..ed6ab5303a24 100644 > > > --- a/fs/ioctl.c > > > +++ b/fs/ioctl.c > > > @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl); > > > static int ioctl_fibmap(struct file *filp, int __user *p) > > > { > > > struct address_space *mapping = filp->f_mapping; > > > + struct inode *inode = filp->f_inode; > > > int res, block; > > > > > > + lock_inode_mode(inode); > > > + > > > /* do we support this mess? */ > > > - if (!mapping->a_ops->bmap) > > > - return -EINVAL; > > > - if (!capable(CAP_SYS_RAWIO)) > > > - return -EPERM; > > > + if (!mapping->a_ops->bmap) { > > > + res = -EINVAL; > > > + goto out; > > > + } > > > + if (!capable(CAP_SYS_RAWIO)) { > > > + res = -EPERM; > > > + goto out; > > > > Why does the order of these checks change here? > > I don't understand? The order does not change we just can't return without > releasing the lock. And to protect against bmap changing the lock needs to be > taken first. Doh. -ENOCOFFEE, I plead. --D > [snip] > > > > > > > +static inline void lock_inode_mode(struct inode *inode) > > > > inode_aops_get()? > > Let me think on this. This is not just getting a reference to the aops vector. > It is more than that... and inode_get is not right either! 
> ;-P
>
> > > +{
> > > +	WARN_ON_ONCE(inode->i_op->lock_mode &&
> > > +		     !inode->i_op->unlock_mode);
> > > +	if (inode->i_op->lock_mode)
> > > +		inode->i_op->lock_mode(inode);
> > > +}
> > > +static inline void unlock_inode_mode(struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(inode->i_op->unlock_mode &&
> > > +		     !inode->i_op->lock_mode);
> > > +	if (inode->i_op->unlock_mode)
> > > +		inode->i_op->unlock_mode(inode);
> > > +}
> > > +
> > >  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
> > >  				     struct iov_iter *iter)
> >
> > inode_aops_put()?
>
> ... something like that but not 'aops'...
>
> Ira
>
> >
> > --D
> >
On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote: > On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote: > > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > When zeroing the end of a file we must account for bytes contained in > > > the final page which are past EOF. > > > > > > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which > > > will include all bytes of the final page. > > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > --- > > > fs/xfs/xfs_iops.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > > index a2f2604c3187..a34b04e8ac9c 100644 > > > --- a/fs/xfs/xfs_iops.c > > > +++ b/fs/xfs/xfs_iops.c > > > @@ -910,7 +910,7 @@ xfs_setattr_size( > > > */ > > > if (newsize > oldsize) { > > > trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); > > > - error = iomap_zero_range(inode, oldsize, newsize - oldsize, > > > + error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize, > > > > Huh? Won't this cause the file size to be set to LLONG_MAX? > > Not as I understand the code. iomap_zero_range uses the standard iomap_write_{begin,end} functions, which means that if you pass it an (offset, length) that extend beyond EOF it will change isize to offset+length. > But as I said in the cover I am not 100% sure of > this fix. > From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the > mappings and the page cache and the traces I have indicate that the DAX mode > is not changing or was properly held off. Hmm, that implies the invalidation didn't work. Can you find a way to report the contents of the page cache after the dax mode change invalidation fails? I wonder if this is something dorky like rounding down such that the EOF page doesn't actually get invalidated? Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the pages... 
do you have a quick reproducer?

--D

> Any other suggestions as to the problem are welcome.
>
> Ira
>
> >
> > --D
> >
> > > 					   &did_zeroing, &xfs_buffered_write_iomap_ops);
> > > 	} else {
> > > 		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > --
> > > 2.21.0
> > >
On Mon, Jan 13, 2020 at 04:46:10PM -0800, Ira Weiny wrote: > On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote: > > On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote: > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > [snip] > > > > > > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > > > index bc3654fe3b5d..1ab0906c6c7f 100644 > > > --- a/fs/xfs/xfs_ioctl.c > > > +++ b/fs/xfs/xfs_ioctl.c > > > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate( > > > goto out_unlock; > > > } > > > > > > + /* > > > + * If there is a mapping in place we must remain in our current mode. > > > + */ > > > + if (atomic64_read(&inode->i_mapped)) { > > > > Urk, should we really be messing around with the address space > > internals? > > I contemplated a function call instead of checking i_mapped directly? Is that > what you mean? Yeah. Abstracting the details just enough that filesystems don't have to know that i_mapped is atomic64 etc. > > > > > > + error = -EBUSY; > > > + goto out_unlock; > > > + } > > > + > > > error = filemap_write_and_wait(inode->i_mapping); > > > if (error) > > > goto out_unlock; > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index 631f11d6246e..6e7dc626b657 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -740,6 +740,7 @@ struct inode { > > > #endif > > > > > > void *i_private; /* fs or device private pointer */ > > > + atomic64_t i_mapped; > > > > I would have expected to find this in struct address_space since the > > mapping count is a function of the address space, right? > > I suppose but the only external call (above) would be passing an inode. So to > me it seemed better here. But the number of memory mappings reflects the state of the address space, not the inode. Or maybe put another way, if I were an mm developer I would not expect to look in struct inode for mm state. 
static inline bool inode_has_mappings(struct inode *inode)
{
	return atomic64_read(&inode->i_mapping->mapcount) > 0;
}

OTOH if there exist other mm developers who /do/ find that storing the
mmap count in struct inode is more logical, please let me know. :)

--D

> Ira
>
> >
> > --D
> >
On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote: > On Mon, Jan 13, 2020 at 04:46:10PM -0800, Ira Weiny wrote: > > On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote: > > > On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote: > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > [snip] > > > > > > > > > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > > > > index bc3654fe3b5d..1ab0906c6c7f 100644 > > > > --- a/fs/xfs/xfs_ioctl.c > > > > +++ b/fs/xfs/xfs_ioctl.c > > > > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate( > > > > goto out_unlock; > > > > } > > > > > > > > + /* > > > > + * If there is a mapping in place we must remain in our current mode. > > > > + */ > > > > + if (atomic64_read(&inode->i_mapped)) { > > > > > > Urk, should we really be messing around with the address space > > > internals? > > > > I contemplated a function call instead of checking i_mapped directly? Is that > > what you mean? > > Yeah. Abstracting the details just enough that filesystems don't have > to know that i_mapped is atomic64 etc. Done. > > > > > > > > > > + error = -EBUSY; > > > > + goto out_unlock; > > > > + } > > > > + > > > > error = filemap_write_and_wait(inode->i_mapping); > > > > if (error) > > > > goto out_unlock; > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > index 631f11d6246e..6e7dc626b657 100644 > > > > --- a/include/linux/fs.h > > > > +++ b/include/linux/fs.h > > > > @@ -740,6 +740,7 @@ struct inode { > > > > #endif > > > > > > > > void *i_private; /* fs or device private pointer */ > > > > + atomic64_t i_mapped; > > > > > > I would have expected to find this in struct address_space since the > > > mapping count is a function of the address space, right? > > > > I suppose but the only external call (above) would be passing an inode. So to > > me it seemed better here. > > But the number of memory mappings reflects the state of the address > space, not the inode. 
> Or maybe put another way, if I were an mm
> developer I would not expect to look in struct inode for mm state.

This is a good point...

>
> static inline bool inode_has_mappings(struct inode *inode)
> {
> 	return atomic64_read(&inode->i_mapping->mapcount) > 0;
> }
>
> OTOH if there exist other mm developers who /do/ find that storing the
> mmap count in struct inode is more logical, please let me know. :)

... My thinking was that the number of mappings does not matter to the mm
system...  However, I'm starting to think you are correct...  ;-)

I've made a note of it and we will see what others think.

Ira

>
> --D
>
> > Ira
> >
> > >
> > > --D
> > >
On Mon, Jan 13, 2020 at 05:14:07PM -0800, Darrick J. Wong wrote: > On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote: > > On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote: > > > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote: > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > When zeroing the end of a file we must account for bytes contained in > > > > the final page which are past EOF. > > > > > > > > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which > > > > will include all bytes of the final page. > > > > > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > --- > > > > fs/xfs/xfs_iops.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > > > > index a2f2604c3187..a34b04e8ac9c 100644 > > > > --- a/fs/xfs/xfs_iops.c > > > > +++ b/fs/xfs/xfs_iops.c > > > > @@ -910,7 +910,7 @@ xfs_setattr_size( > > > > */ > > > > if (newsize > oldsize) { > > > > trace_xfs_zero_eof(ip, oldsize, newsize - oldsize); > > > > - error = iomap_zero_range(inode, oldsize, newsize - oldsize, > > > > + error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize, > > > > > > Huh? Won't this cause the file size to be set to LLONG_MAX? > > > > Not as I understand the code. > > iomap_zero_range uses the standard iomap_write_{begin,end} functions, > which means that if you pass it an (offset, length) that extend beyond > EOF it will change isize to offset+length. I don't see that but I'll take your word for it... That is unfortunate because I would have thought that the full page would have been zero'ed by something already. I found code in xfs_free_file_space() with this comment: /* * If we zeroed right up to EOF and EOF straddles a page boundary we * must make sure that the post-EOF area is also zeroed because the * page could be mmap'd and iomap_zero_range doesn't do that for us. 
* Writeback of the eof page will do this, albeit clumsily. */ But that just calls filemap_write_and_wait_range()... :-/ > > > But as I said in the cover I am not 100% sure of > > this fix. > > > From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the > > mappings and the page cache and the traces I have indicate that the DAX mode > > is not changing or was properly held off. > > Hmm, that implies the invalidation didn't work. Can you find a way to > report the contents of the page cache after the dax mode change > invalidation fails? I wonder if this is something dorky like rounding > down such that the EOF page doesn't actually get invalidated? > > Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the > pages... do you have a quick reproducer? I thought I did... What I have done is take this patch: https://www.spinics.net/lists/fstests/msg13313.html and put [run_fsx ""] in a loop... (diff below) And without this truncate fix patch it was failing in about 5 - 10 iterations. But I'm running it right now and it has gone for 30+... :-( I am 90% confident that this series works for 100% of the use cases we have. I think this is an existing bug which I've just managed to find. And again I'm not comfortable with this patch either. So I'm not trying to argue for it but I just don't know what could be wrong... Ira diff --git a/tests/generic/999 b/tests/generic/999 index 6dd5529dbc65..929c20c6db04 100755 --- a/tests/generic/999 +++ b/tests/generic/999 @@ -274,7 +274,9 @@ function run_fsx { pid="" } -run_fsx "" +while [ 1 ]; do + run_fsx "" +done run_fsx "-A" run_fsx "-Z -r 4096 -w 4096"
[-- Attachment #1: Type: text/plain, Size: 3036 bytes --] On Tue, Jan 14, 2020 at 11:00:57AM -0800, 'Ira Weiny' wrote: > On Mon, Jan 13, 2020 at 05:14:07PM -0800, Darrick J. Wong wrote: > > On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote: > > > On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote: > > > > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote: > > > > > From: Ira Weiny <ira.weiny@intel.com> [snip] > > > > > > But as I said in the cover I am not 100% sure of > > > this fix. > > > > > From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the > > > mappings and the page cache and the traces I have indicate that the DAX mode > > > is not changing or was properly held off. > > > > Hmm, that implies the invalidation didn't work. Can you find a way to > > report the contents of the page cache after the dax mode change > > invalidation fails? I wonder if this is something dorky like rounding > > down such that the EOF page doesn't actually get invalidated? > > > > Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the > > pages... do you have a quick reproducer? > > I thought I did... > > What I have done is take this patch: > > https://www.spinics.net/lists/fstests/msg13313.html > > and put [run_fsx ""] in a loop... (diff below) And without this truncate fix > patch it was failing in about 5 - 10 iterations. But I'm running it right now > and it has gone for 30+... :-( Ok... perhaps this is a qemu problem? In qemu I had a slightly different script; 'test-suite.sh' (attached). This was copied to create the xfstest I sent. Without this 'fix truncate patch I get something like the following after in just a few iterations. ... 
*** run 'fsx ' racing with setting/clearing the DAX flag *** run 'fsx ' racing with setting/clearing the DAX flag FAILED: fsx exited with status : 205 see trace_output zero_range to largest ever: 0x19867 truncating to largest ever: 0x3fb6d zero_range to largest ever: 0x40000 Mapped Write: non-zero data past EOF (0x3ab88) page offset 0xb89 is 0xe71c LOG DUMP (15817 total operations): 15818(202 mod 256): SKIPPED (no operation) 15819(203 mod 256): ZERO 0x2ae9f thru 0x3346c (0x85ce bytes) 15820(204 mod 256): READ 0x3637 thru 0x91e6 (0x5bb0 bytes) 15821(205 mod 256): MAPREAD 0x38a80 thru 0x3d014 (0x4595 bytes) 15822(206 mod 256): INSERT 0x28000 thru 0x29fff (0x2000 bytes) Ira > > I am 90% confident that this series works for 100% of the use cases we have. I > think this is an existing bug which I've just managed to find. And again I'm > not comfortable with this patch either. So I'm not trying to argue for it but > I just don't know what could be wrong... > > Ira > > diff --git a/tests/generic/999 b/tests/generic/999 > index 6dd5529dbc65..929c20c6db04 100755 > --- a/tests/generic/999 > +++ b/tests/generic/999 > @@ -274,7 +274,9 @@ function run_fsx { > pid="" > } > > -run_fsx "" > +while [ 1 ]; do > + run_fsx "" > +done > run_fsx "-A" > run_fsx "-Z -r 4096 -w 4096" > [-- Attachment #2: test-suite.sh --] [-- Type: text/plain, Size: 7344 bytes --] #!/bin/bash # # Run suite of tests for feature # mnt_pt=/mnt/pmem function mount_no_dax { umount $mnt_pt mount /dev/pmem0 $mnt_pt } function mount_dax { umount $mnt_pt mount -o dax /dev/pmem0 $mnt_pt } # test set up if [ ! -d /mnt/pmem ]; then echo "" echo " Setting up FS DAX" echo "" ndctl create-namespace -e namespace0.0 -f --mode=fsdax #mkfs.ext4 /dev/pmem0 mkfs.xfs -f /dev/pmem0 mkdir -p /mnt/pmem mount_no_dax fi #if [ ! 
-c /dev/dax1.0 ]; then # echo "" # echo " Setting up DEV DAX" # echo "" # # ndctl create-namespace -f --mode=devdax --align=4096 -e namespace1.0 #fi if [ "$1" == "--create-fs-only" ]; then exit 0 fi test_file=/mnt/pmem/foo function start_test { rm -f $test_file echo "" echo " ***** START ${FUNCNAME[1]} *****" } function end_test { echo " ***** END ${FUNCNAME[1]} *****" echo "" killall -9 rdma-fsdax } function expect_pass { echo -n " ***** $2 : " if [ "$1" == "0" ]; then echo "PASSED" else echo "FAILED" exit 255 fi } function expect_fail { echo -n " ***** $2 : " if [ "$1" == "0" ]; then echo "FAILED" exit 255 else echo "PASSED" fi } # test below here are not fully written yet. Just ideas of what we should test # for. function check_phys_dax { xfs_io -c 'lsattr' $1 | awk -e '{ print $1 }' | grep 'x' &> /dev/null if [ "$?" != "0" ]; then echo "FAILED: Did NOT find DAX flag on $1" exit 255 fi } function check_effective_dax { attr=`xfs_io -c 'statx -r' $1 | grep 'stat.attributes' | awk -e '{ print $3 }'` masked=$(( $attr & 0x2000 )) if [ "$masked" != "8192" ]; then echo "FAILED: Did NOT find VFS DAX flag on $1" exit 255 fi } function check_phys_no_dax { xfs_io -c 'lsattr' $1 | awk -e '{ print $1 }' | grep 'x' &> /dev/null if [ "$?" == "0" ]; then echo "FAILED: Found DAX flag on $1" exit 255 fi } function check_effective_no_dax { attr=`xfs_io -c 'statx -r' $1 | grep 'stat.attributes' | awk -e '{ print $3 }'` masked=$(( $attr & 0x2000 )) if [ "$masked" == "8192" ]; then echo "FAILED: Found VFS DAX flag on $1" exit 255 fi } XFS_IO_PROG=xfs_io TEST_DIR=/mnt/pmem dax_dir=$TEST_DIR/dax-dir dax_sub_dir=$TEST_DIR/dax-dir/dax-sub-dir dax_inh_file=$dax_dir/dax-inh-file dax_non_inh_file=$dax_dir/dax-non-inh-file non_dax=$TEST_DIR/non-dax dax_file=$TEST_DIR/dax-file dax_file_copy=$TEST_DIR/dax-file-copy dax_file_move=$TEST_DIR/dax-file-move data_file=$TEST_DIR/data-file function clean_up { echo "cleaning up..." 
rm -rf $TEST_DIR/* } clean_up # Do double mount to have these special options. mount_no_dax touch $dax_file if [ "$1" == "--chattr-only" ]; then for i in `seq 1 10000`; do xfs_io -c 'chattr +x' $dax_file xfs_io -c 'chattr -x' $dax_file done exit 0 fi if [ "$1" == "--fsx-only" ]; then ./fsx -R -W -N 100000 $dax_file head $dax_file.fsxlog exit 0 fi if [ "$1" == "--fsx-aio-only" ]; then ./fsx -R -W -A -N 100000 $dax_file head $dax_file.fsxlog exit 0 fi # The below should be kept the same as xfstests echo " *** mount w/o dax flag." mount_no_dax echo " *** mark dax-dir as dax enabled" mkdir $dax_dir xfs_io -c 'chattr +x' $dax_dir check_phys_dax $dax_dir check_effective_dax $dax_dir echo " *** check file inheritance" touch $dax_inh_file check_phys_dax $dax_inh_file check_effective_dax $dax_inh_file echo " *** check directory inheritance" mkdir $dax_sub_dir check_phys_dax $dax_sub_dir check_effective_dax $dax_sub_dir echo " *** check changing directory" xfs_io -c 'chattr -x' $dax_dir check_phys_no_dax $dax_dir check_effective_no_dax $dax_dir echo " *** check non file inheritance" touch $dax_non_inh_file check_phys_no_dax $dax_non_inh_file check_effective_no_dax $dax_non_inh_file echo " *** check that previous file stays enabled" check_phys_dax $dax_inh_file check_effective_dax $dax_inh_file echo " *** Reset the directory" xfs_io -c 'chattr +x' $dax_dir check_phys_dax $dax_dir check_effective_dax $dax_dir # check mount override # ==================== echo " *** Remount fs with mount flag" mount_dax touch $non_dax check_phys_no_dax $non_dax check_effective_dax $non_dax echo " *** Check for non-dax files to be dax with mount option" check_effective_dax $dax_non_inh_file echo " *** check for file dax flag 'sticky-ness' after remount" touch $dax_file xfs_io -c 'chattr +x' $dax_file check_phys_dax $dax_file check_effective_dax $dax_file echo " *** remount w/o mount flag" mount_no_dax check_phys_dax $dax_file check_effective_dax $dax_file check_phys_no_dax $non_dax 
check_effective_no_dax $non_dax # Check non-zero file operations # ============================== echo " *** file should change effective but page cache should be empty" $XFS_IO_PROG -f -c "pwrite 0 10000" $data_file > /dev/null xfs_io -c 'chattr +x' $data_file check_phys_dax $data_file check_effective_dax $data_file # Check inheritance on cp, mv # =========================== echo " *** check inheritance on cp, mv" cp $non_dax $dax_dir/conv-dax check_phys_dax $dax_dir/conv-dax check_effective_dax $dax_dir/conv-dax echo " *** Moved files 'don't inherit'" mv $non_dax $dax_dir/move-dax check_phys_no_dax $dax_dir/move-dax check_effective_no_dax $dax_dir/move-dax # Check preservation of phys on cp, mv # ==================================== mv $dax_file $dax_file_move check_phys_dax $dax_file_move check_effective_dax $dax_file_move cp $dax_file_move $dax_file_copy check_phys_no_dax $dax_file_copy check_effective_no_dax $dax_file_copy # Verify no mode changes on mmap # ============================== echo " *** check no mode change when mmaped" dd if=/dev/zero of=$dax_file bs=4096 count=10 > $tmp.log 2>&1 # set known state. xfs_io -c 'chattr -x' $dax_file check_phys_no_dax $dax_file check_effective_no_dax $dax_file python3 - << EOF > $tmp.log 2>&1 & import mmap import time print ('mmaping "$dax_file"') f=open("$dax_file", "r+b") mm = mmap.mmap(f.fileno(), 0) print ('mmaped "$dax_file"') while True: time.sleep(1) EOF pid=$! 
sleep 1 # attempt to should fail xfs_io -c 'chattr +x' $dax_file > /dev/null 2>&1 check_phys_no_dax $dax_file check_effective_no_dax $dax_file kill -TERM $pid > /dev/null 2>&1 wait $pid > /dev/null 2>&1 # after mmap released should work xfs_io -c 'chattr +x' $dax_file check_phys_dax $dax_file check_effective_dax $dax_file # Finally run the test stolen from Christoph Hellwig to test changing the mode # while performing a series of operations # ============================================================================= function run_fsx { options=$1 echo " *** run 'fsx $options' racing with setting/clearing the DAX flag" ./fsx $options -N 20000 $dax_file > $tmp.log 2>&1 & pid=$! if [ ! -n "$pid" ]; then echo "FAILED to start fsx" exit 255 fi # NOTE: for some for i in `seq 1 500`; do xfs_io -c 'chattr +x' $dax_file > /dev/null 2>&1 xfs_io -c 'chattr -x' $dax_file > /dev/null 2>&1 done wait $pid status=$? if [ "$status" != "0" ]; then cat /sys/kernel/debug/tracing/trace > trace_output echo "FAILED: fsx exited with status : $status" echo " see trace_output" head $dax_file.fsxlog exit $status fi pid="" } while [ 1 ]; do run_fsx "" done run_fsx "-A" run_fsx "-Z -r 4096 -w 4096" echo " *** Check 'dump' and 'xfsdump'" echo " TBD" echo " *** PASSED! All tests PASSED ***" exit 0
On Mon, Jan 13, 2020 at 04:35:21PM -0800, 'Ira Weiny' wrote:
> On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
>
> [snip]
>
> > >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 401da197f012..e8fd95b75e5b 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > >  *
> > >  * Basic locking order:
> > >  *
> > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
> >
> > Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
> > the filesystem devices don't support DAX at all?
>
> I'll look into it.
>
> >
> > Also, I don't think we're actually following the i_rwsem -> i_daxsem
> > order in fallocate, and possibly elsewhere too?
>
> I'll have to verify.  It took a lot of iterations to get the order
> working so I'm not going to claim perfection.

Yes this was inconsistent.  The code was right WRT i_rwsem.  mmap_sem may
have issues:

What about this?

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c5d11b70d067..8808782a085e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
+ * i_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  *   i_rwsem -> page lock -> mmap_sem
- *   mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock
+ *   i_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO.  These IO paths can

diff --git a/mm/mmap.c b/mm/mmap.c
index e6b68924b7ca..b500aef30b27 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1547,18 +1547,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		vm_flags |= VM_NORESERVE;
 	}
 
-	if (file)
-		lock_inode_mode(file_inode(file));
-
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
-	if (file)
-		unlock_inode_mode(file_inode(file));
-
 	return addr;
 }

diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..1cfead8cd1ce 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			lock_inode_mode(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				unlock_inode_mode(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			unlock_inode_mode(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
From ira.weiny@intel.com
> Sent: 10 January 2020 19:30
>
> Page faults need to ensure the inode mode is correct and consistent with
> the vmf information at the time of the fault. There is no easy way to
> ensure the vmf information is correct if a mode change is in progress.
> Furthermore, there is no good use case to require a mode change while
> the file is mmap'ed.
>
> Track mmap's of the file and fail the mode change if the file is
> mmap'ed.
This seems wrong to me.
I presume the 'mode changes' are from things like 'chmod -w ...'.
mmap() should be no different to open().
Only the permissions set when the file is opened count.
Next you'll be stopping unlink() when a file is open :-)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue 14-01-20 09:53:54, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote:
> > > > > + error = -EBUSY;
> > > > > + goto out_unlock;
> > > > > + }
> > > > > +
> > > > > error = filemap_write_and_wait(inode->i_mapping);
> > > > > if (error)
> > > > > goto out_unlock;
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 631f11d6246e..6e7dc626b657 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -740,6 +740,7 @@ struct inode {
> > > > > #endif
> > > > >
> > > > > void *i_private; /* fs or device private pointer */
> > > > > + atomic64_t i_mapped;
> > > >
> > > > I would have expected to find this in struct address_space since the
> > > > mapping count is a function of the address space, right?
> > >
> > > I suppose but the only external call (above) would be passing an inode. So to
> > > me it seemed better here.
> >
> > But the number of memory mappings reflects the state of the address
> > space, not the inode. Or maybe put another way, if I were an mm
> > developer I would not expect to look in struct inode for mm state.
>
> This is a good point...
>
> >
> > static inline bool inode_has_mappings(struct inode *inode)
> > {
> > return atomic64_read(&inode->i_mapping->mapcount) > 0;
> > }
> >
> > OTOH if there exist other mm developers who /do/ find that storing the
> > mmap count in struct inode is more logical, please let me know. :)
>
> ... My thinking was that the number of mappings does not matter to the mm
> system... However, I'm starting to think you are correct... ;-)
>
> I've made a note of it and we will see what others think.
Well, more importantly mapping != inode. There can be multiple inodes
pointing to the same mapping (struct address_space) as is the case for
example for block devices. So this counter definitely belongs into struct
address_space.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
>
> In order for users to determine if a file is currently operating in DAX
> mode (effective DAX). Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
>
> To go along with this we propose the following addition to the statx man
> page:
>
> STATX_ATTR_DAX
>
> 	DAX (cpu direct access) is a file mode that attempts to minimize
> 	software cache effects for both I/O and memory mappings of this
> 	file. It requires a capable device, a compatible filesystem
> 	block size, and filesystem opt-in. It generally assumes all
> 	accesses are via cpu load / store instructions which can
> 	minimize overhead for small accesses, but adversely affect cpu
> 	utilization for large transfers. File I/O is done directly
> 	to/from user-space buffers. While the DAX property tends to
> 	result in data being transferred synchronously it does not give
> 	the guarantees of synchronous I/O that data and necessary
> 	metadata are transferred. Memory mapped I/O may be performed
> 	with direct mappings that bypass system memory buffering. Again
> 	while memory-mapped I/O tends to result in data being
> 	transferred synchronously it does not guarantee synchronous
> 	metadata updates. A dax file may optionally support being mapped
> 	with the MAP_SYNC flag which does allow cpu store operations to
> 	be considered synchronous modulo cpu cache effects.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

This looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/stat.c                 | 3 +++
>  include/uapi/linux/stat.h | 1 +
>  2 files changed, 4 insertions(+)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 030008796479..894699c74dde 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>
> +	if (IS_DAX(inode))
> +		stat->attributes |= STATX_ATTR_DAX;
> +
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
>  					    query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index ad80a5c885d5..e5f9d5517f6b 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
>  #define STATX_ATTR_ENCRYPTED	0x00000800 /* [I] File requires key to decrypt in fs */
>  #define STATX_ATTR_AUTOMOUNT	0x00001000 /* Dir: Automount trigger */
>  #define STATX_ATTR_VERITY	0x00100000 /* [I] Verity protected file */
> +#define STATX_ATTR_DAX		0x00002000 /* [I] File is DAX */
>
>
>  #endif /* _UAPI_LINUX_STAT_H */
> --
> 2.21.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > In order for users to determine if a file is currently operating in DAX
> > mode (effective DAX). Define a statx attribute value and set that
> > attribute if the effective DAX flag is set.
> >
> > To go along with this we propose the following addition to the statx man
> > page:
> >
> > STATX_ATTR_DAX
> >
> > 	DAX (cpu direct access) is a file mode that attempts to minimize

"..is a file I/O mode"?

> > 	software cache effects for both I/O and memory mappings of this
> > 	file. It requires a capable device, a compatible filesystem
> > 	block size, and filesystem opt-in.

"...a capable storage device..."

What does "compatible fs block size" mean?  How does the user figure out
if their fs blocksize is compatible?  Do we tell users to refer their
filesystem's documentation here?

> > 	It generally assumes all
> > 	accesses are via cpu load / store instructions which can
> > 	minimize overhead for small accesses, but adversely affect cpu
> > 	utilization for large transfers.

Will this always be true for persistent memory?

I wasn't even aware that large transfers adversely affected CPU
utilization. ;)

> > 	File I/O is done directly
> > 	to/from user-space buffers. While the DAX property tends to
> > 	result in data being transferred synchronously it does not give

"...transferred synchronously, it does not..."

> > 	the guarantees of synchronous I/O that data and necessary

"...it does not guarantee that I/O or file metadata have been flushed to
the storage device."

> > 	metadata are transferred. Memory mapped I/O may be performed
> > 	with direct mappings that bypass system memory buffering.

"...with direct memory mappings that bypass kernel page cache."

> > 	Again
> > 	while memory-mapped I/O tends to result in data being

I would move the sentence about "Memory mapped I/O..." to directly after
the sentence about file I/O being done directly to and from userspace so
that you don't need to repeat this statement.

> > 	transferred synchronously it does not guarantee synchronous
> > 	metadata updates. A dax file may optionally support being mapped
> > 	with the MAP_SYNC flag which does allow cpu store operations to
> > 	be considered synchronous modulo cpu cache effects.

How does one detect or work around or deal with "cpu cache effects"?  I
assume some sort of CPU cache flush instruction is what is meant here,
but I think we could mention the basics of what has to be done here:

"A DAX file may support being mapped with the MAP_SYNC flag, which
enables a program to use CPU cache flush operations to persist CPU store
operations without an explicit fsync(2).  See mmap(2) for more
information."?

Oof, a paragraph break would be nice. :)

--D

> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> This looks good to me. You can add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> 								Honza
>
> > ---
> >  fs/stat.c                 | 3 +++
> >  include/uapi/linux/stat.h | 1 +
> >  2 files changed, 4 insertions(+)
> >
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 030008796479..894699c74dde 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	if (IS_AUTOMOUNT(inode))
> >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> >
> > +	if (IS_DAX(inode))
> > +		stat->attributes |= STATX_ATTR_DAX;
> > +
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(path, stat, request_mask,
> >  					    query_flags);
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index ad80a5c885d5..e5f9d5517f6b 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -169,6 +169,7 @@ struct statx {
> >  #define STATX_ATTR_ENCRYPTED	0x00000800 /* [I] File requires key to decrypt in fs */
> >  #define STATX_ATTR_AUTOMOUNT	0x00001000 /* Dir: Automount trigger */
> >  #define STATX_ATTR_VERITY	0x00100000 /* [I] Verity protected file */
> > +#define STATX_ATTR_DAX		0x00002000 /* [I] File is DAX */
> >
> >
> >  #endif /* _UAPI_LINUX_STAT_H */
> > --
> > 2.21.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed, Jan 15, 2020 at 10:21:45AM +0000, David Laight wrote:
> From ira.weiny@intel.com
> > Sent: 10 January 2020 19:30
> >
> > Page faults need to ensure the inode mode is correct and consistent with
> > the vmf information at the time of the fault. There is no easy way to
> > ensure the vmf information is correct if a mode change is in progress.
> > Furthermore, there is no good use case to require a mode change while
> > the file is mmap'ed.
> >
> > Track mmap's of the file and fail the mode change if the file is
> > mmap'ed.
>
> This seems wrong to me.
> I presume the 'mode changes' are from things like 'chmod -w ...'.

No... Sorry... "mode" was a _very_ bad name.  In this context "mode" was
the "DAX mode" not the file mode.

> mmap() should be no different to open().
> Only the permissions set when the file is opened count.
>
> Next you'll be stopping unlink() when a file is open :-)

hehehe :-D no ... sorry that was not the meaning.  To be clear, what this is
preventing is a change from non-DAX to DAX or vice versa while a file is
mmap'ed.

I'm looking at a better name for this.  Is this commit message more clear?

<commit>
fs: Prevent DAX state change if file is mmap'ed

Page faults need to ensure the inode DAX configuration is correct and
consistent with the vmf information at the time of the fault.  There is no
easy way to ensure the vmf information is correct if a DAX change is in
progress.  Furthermore, there is no good use case to require changing DAX
configs while the file is mmap'ed.

Track mmap's of the file and fail the DAX change if the file is mmap'ed.
</commit>

Sorry for the confusion,
Ira

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
On Wed, Jan 15, 2020 at 12:34:55PM +0100, Jan Kara wrote:
> On Tue 14-01-20 09:53:54, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote:
> > > > > > + error = -EBUSY;
> > > > > > + goto out_unlock;
> > > > > > + }
> > > > > > +
> > > > > > error = filemap_write_and_wait(inode->i_mapping);
> > > > > > if (error)
> > > > > > goto out_unlock;
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 631f11d6246e..6e7dc626b657 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -740,6 +740,7 @@ struct inode {
> > > > > > #endif
> > > > > >
> > > > > > void *i_private; /* fs or device private pointer */
> > > > > > + atomic64_t i_mapped;
> > > > >
> > > > > I would have expected to find this in struct address_space since the
> > > > > mapping count is a function of the address space, right?
> > > >
> > > > I suppose but the only external call (above) would be passing an inode. So to
> > > > me it seemed better here.
> > >
> > > But the number of memory mappings reflects the state of the address
> > > space, not the inode. Or maybe put another way, if I were an mm
> > > developer I would not expect to look in struct inode for mm state.
> >
> > This is a good point...
> >
> > >
> > > static inline bool inode_has_mappings(struct inode *inode)
> > > {
> > > return atomic64_read(&inode->i_mapping->mapcount) > 0;
> > > }
> > >
> > > OTOH if there exist other mm developers who /do/ find that storing the
> > > mmap count in struct inode is more logical, please let me know. :)
> >
> > ... My thinking was that the number of mappings does not matter to the mm
> > system... However, I'm starting to think you are correct... ;-)
> >
> > I've made a note of it and we will see what others think.
>
> Well, more importantly mapping != inode. There can be multiple inodes
> pointing to the same mapping (struct address_space) as is the case for
> example for block devices. So this counter definitely belongs into struct
> address_space.
Ah Yes, great point. Done.
Ira
On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> > > > +``lock_mode``
> > > > +	called to prevent operations which depend on the inode's mode
> > > > +	from proceeding should a mode change be in progress
> > >
> > > "Inodes can't change mode, because files do not suddenly become
> > > directories". ;)
> >
> > Yea sorry.
> >
> > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> > > mode from proceeding while there are address space operations in
> > > progress".  So these are really more aops get and put functions...
> >
> > At first I actually did have aops get/put functions but this is really
> > protecting more than the aops vector because as Christoph said there are
> > file operations which need to be protected not just address space
> > operations.
> >
> > But I agree "mode" is a bad name... Sorry...
>
> inode_fops_{get,set}(), then?
>
> inode_start_fileop()
> inode_end_fileop() ?
>
> Trying to avoid sounding foppish <COUGH>

What about inode_[un]lock_state()?

Ira
On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > In order for users to determine if a file is currently operating in DAX
> > > mode (effective DAX). Define a statx attribute value and set that
> > > attribute if the effective DAX flag is set.
> > >
> > > To go along with this we propose the following addition to the statx
> > > man page:
> > >
> > > STATX_ATTR_DAX
> > >
> > > 	DAX (cpu direct access) is a file mode that attempts to minimize
>
> "..is a file I/O mode"?

or "... is a file state ..."?

> > > 	software cache effects for both I/O and memory mappings of this
> > > 	file. It requires a capable device, a compatible filesystem
> > > 	block size, and filesystem opt-in.
>
> "...a capable storage device..."

Done

> What does "compatible fs block size" mean?  How does the user figure out
> if their fs blocksize is compatible?  Do we tell users to refer their
> filesystem's documentation here?

Perhaps it is wrong for this to be in the man page at all?  Would it be
better to assume the file system and block device are already configured
properly by the admin?

For which the blocksize restrictions are already well documented.  ie:

	https://www.kernel.org/doc/Documentation/filesystems/dax.txt

?

How about changing the text to:

	It requires a block device and file system which have been
	configured to support DAX.

?

> > > 	It generally assumes all
> > > 	accesses are via cpu load / store instructions which can
> > > 	minimize overhead for small accesses, but adversely affect cpu
> > > 	utilization for large transfers.
>
> Will this always be true for persistent memory?

I'm not clear.  Did you mean; "this" == adverse utilization for large
transfers?

> I wasn't even aware that large transfers adversely affected CPU
> utilization. ;)

Sure vs using a DMA engine for example.

> > > 	File I/O is done directly
> > > 	to/from user-space buffers. While the DAX property tends to
> > > 	result in data being transferred synchronously it does not give
>
> "...transferred synchronously, it does not..."

done.

> > > 	the guarantees of synchronous I/O that data and necessary
>
> "...it does not guarantee that I/O or file metadata have been flushed to
> the storage device."

The lack of guarantee here is mainly regarding metadata.

How about:

	While the DAX property tends to result in data being transferred
	synchronously, it does not give the same guarantees of
	synchronous I/O where data and the necessary metadata are
	transferred together.

> > > 	metadata are transferred. Memory mapped I/O may be performed
> > > 	with direct mappings that bypass system memory buffering.
>
> "...with direct memory mappings that bypass kernel page cache."

Done.

> > > 	Again
> > > 	while memory-mapped I/O tends to result in data being
>
> I would move the sentence about "Memory mapped I/O..." to directly after
> the sentence about file I/O being done directly to and from userspace so
> that you don't need to repeat this statement.

Done.

> > > 	transferred synchronously it does not guarantee synchronous
> > > 	metadata updates. A dax file may optionally support being mapped
> > > 	with the MAP_SYNC flag which does allow cpu store operations to
> > > 	be considered synchronous modulo cpu cache effects.
>
> How does one detect or work around or deal with "cpu cache effects"?  I
> assume some sort of CPU cache flush instruction is what is meant here,
> but I think we could mention the basics of what has to be done here:
>
> "A DAX file may support being mapped with the MAP_SYNC flag, which
> enables a program to use CPU cache flush operations to persist CPU store
> operations without an explicit fsync(2).  See mmap(2) for more
> information."?

That sounds better.  I like the reference to mmap as well.

Ok I changed a couple of things as well.  How does this sound?

	STATX_ATTR_DAX

		DAX (cpu direct access) is a file mode that attempts to
		minimize software cache effects for both I/O and memory
		mappings of this file.  It requires a block device and
		file system which have been configured to support DAX.

		DAX generally assumes all accesses are via cpu load /
		store instructions which can minimize overhead for small
		accesses, but may adversely affect cpu utilization for
		large transfers.

		File I/O is done directly to/from user-space buffers and
		memory mapped I/O may be performed with direct memory
		mappings that bypass kernel page cache.

		While the DAX property tends to result in data being
		transferred synchronously, it does not give the same
		guarantees of synchronous I/O where data and the
		necessary metadata are transferred together.

		A DAX file may support being mapped with the MAP_SYNC
		flag, which enables a program to use CPU cache flush
		operations to persist CPU store operations without an
		explicit fsync(2).  See mmap(2) for more information.

Ira

> Oof, a paragraph break would be nice. :)
>
> --D
>
> > >
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > This looks good to me. You can add:
> >
> > Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > 								Honza
> >
> > > ---
> > >  fs/stat.c                 | 3 +++
> > >  include/uapi/linux/stat.h | 1 +
> > >  2 files changed, 4 insertions(+)
> > >
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index 030008796479..894699c74dde 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> > >  	if (IS_AUTOMOUNT(inode))
> > >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> > >
> > > +	if (IS_DAX(inode))
> > > +		stat->attributes |= STATX_ATTR_DAX;
> > > +
> > >  	if (inode->i_op->getattr)
> > >  		return inode->i_op->getattr(path, stat, request_mask,
> > >  					    query_flags);
> > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > index ad80a5c885d5..e5f9d5517f6b 100644
> > > --- a/include/uapi/linux/stat.h
> > > +++ b/include/uapi/linux/stat.h
> > > @@ -169,6 +169,7 @@ struct statx {
> > >  #define STATX_ATTR_ENCRYPTED	0x00000800 /* [I] File requires key to decrypt in fs */
> > >  #define STATX_ATTR_AUTOMOUNT	0x00001000 /* Dir: Automount trigger */
> > >  #define STATX_ATTR_VERITY	0x00100000 /* [I] Verity protected file */
> > > +#define STATX_ATTR_DAX	0x00002000 /* [I] File is DAX */
> > >
> > >
> > >  #endif /* _UAPI_LINUX_STAT_H */
> > > --
> > > 2.21.0
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > >
> > > > In order for users to determine if a file is currently operating in
> > > > DAX mode (effective DAX). Define a statx attribute value and set that
> > > > attribute if the effective DAX flag is set.
> > > >
> > > > To go along with this we propose the following addition to the statx
> > > > man page:
> > > >
> > > > STATX_ATTR_DAX
> > > >
> > > > 	DAX (cpu direct access) is a file mode that attempts to minimize
> >
> > "..is a file I/O mode"?
>
> or "... is a file state ..."?
>
> > > > 	software cache effects for both I/O and memory mappings of this
> > > > 	file. It requires a capable device, a compatible filesystem
> > > > 	block size, and filesystem opt-in.
> >
> > "...a capable storage device..."
>
> Done
>
> > What does "compatible fs block size" mean?  How does the user figure out
> > if their fs blocksize is compatible?  Do we tell users to refer their
> > filesystem's documentation here?
>
> Perhaps it is wrong for this to be in the man page at all?  Would it be
> better to assume the file system and block device are already configured
> properly by the admin?
>
> For which the blocksize restrictions are already well documented.  ie:
>
> 	https://www.kernel.org/doc/Documentation/filesystems/dax.txt
>
> ?
>
> How about changing the text to:
>
> 	It requires a block device and file system which have been
> 	configured to support DAX.
>
> ?

The goal was to document the gauntlet of checks that
__generic_fsdax_supported() performs so someone could debug "why am I not
able to get dax operation?"

> > > > 	It generally assumes all
> > > > 	accesses are via cpu load / store instructions which can
> > > > 	minimize overhead for small accesses, but adversely affect cpu
> > > > 	utilization for large transfers.
> >
> > Will this always be true for persistent memory?

For direct-mapped pmem there is no opportunity to do dma offload, so it
will always be true that application dax access consumes cpu to do I/O
where something like NVMe does not.  There have been unfruitful-to-date
experiments with the driver using an offload engine for kernel internal
I/O, but if your use case is kernel internal I/O bound then you don't need
dax.

> I'm not clear.  Did you mean; "this" == adverse utilization for large
> transfers?
>
> > I wasn't even aware that large transfers adversely affected CPU
> > utilization. ;)
>
> Sure vs using a DMA engine for example.

Right, this is purely a statement about cpu memcpy vs device-dma.

> > > > 	File I/O is done directly
> > > > 	to/from user-space buffers. While the DAX property tends to
> > > > 	result in data being transferred synchronously it does not give
> >
> > "...transferred synchronously, it does not..."
>
> done.
>
> > > > 	the guarantees of synchronous I/O that data and necessary
> >
> > "...it does not guarantee that I/O or file metadata have been flushed to
> > the storage device."
>
> The lack of guarantee here is mainly regarding metadata.
>
> How about:
>
> 	While the DAX property tends to result in data being transferred
> 	synchronously, it does not give the same guarantees of
> 	synchronous I/O where data and the necessary metadata are
> 	transferred together.
>
> > > > 	metadata are transferred. Memory mapped I/O may be performed
> > > > 	with direct mappings that bypass system memory buffering.
> >
> > "...with direct memory mappings that bypass kernel page cache."
>
> Done.
>
> > > > 	Again
> > > > 	while memory-mapped I/O tends to result in data being
> >
> > I would move the sentence about "Memory mapped I/O..." to directly after
> > the sentence about file I/O being done directly to and from userspace so
> > that you don't need to repeat this statement.
>
> Done.
>
> > > > 	transferred synchronously it does not guarantee synchronous
> > > > 	metadata updates. A dax file may optionally support being mapped
> > > > 	with the MAP_SYNC flag which does allow cpu store operations to
> > > > 	be considered synchronous modulo cpu cache effects.
> >
> > How does one detect or work around or deal with "cpu cache effects"?  I
> > assume some sort of CPU cache flush instruction is what is meant here,
> > but I think we could mention the basics of what has to be done here:
> >
> > "A DAX file may support being mapped with the MAP_SYNC flag, which
> > enables a program to use CPU cache flush operations to persist CPU store
> > operations without an explicit fsync(2).  See mmap(2) for more
> > information."?
>
> That sounds better.  I like the reference to mmap as well.
>
> Ok I changed a couple of things as well.  How does this sound?
>
> 	STATX_ATTR_DAX
>
> 		DAX (cpu direct access) is a file mode that attempts to
> 		minimize

s/mode/state/?

> 		software cache effects for both I/O and memory mappings
> 		of this file.  It requires a block device and file
> 		system which have been configured to support DAX.

It may not require a block device in the future.

> 		DAX generally assumes all accesses are via cpu load /
> 		store instructions which can minimize overhead for small
> 		accesses, but may adversely affect cpu utilization for
> 		large transfers.
>
> 		File I/O is done directly to/from user-space buffers and
> 		memory mapped I/O may be performed with direct memory
> 		mappings that bypass kernel page cache.
>
> 		While the DAX property tends to result in data being
> 		transferred synchronously, it does not give the same
> 		guarantees of synchronous I/O where data and the
> 		necessary metadata are

Maybe use "O_SYNC I/O" explicitly to further differentiate the 2
meanings of "synchronous" in this sentence?

> 		transferred together.
>
> 		A DAX file may support being mapped with the MAP_SYNC
> 		flag, which enables a program to use CPU cache flush
> 		operations to

s/operations/instructions/

> 		persist CPU store operations without an explicit
> 		fsync(2).  See mmap(2) for more information.

I think this also wants a reference to the Linux interpretation of
platform "persistence domains" we were discussing that here [1], but
maybe it should be part of a "pmem" manpage that can be referenced
from this man page.

[1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com
On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> > Ok I changed a couple of things as well.  How does this sound?
> >
> > 	STATX_ATTR_DAX
> >
> > 		DAX (cpu direct access) is a file mode that attempts to
> > 		minimize
>
> s/mode/state/?

DOH!  yes state... ;-)

> > 		software cache effects for both I/O and memory mappings
> > 		of this file.  It requires a block device and file
> > 		system which have been configured to support DAX.
>
> It may not require a block device in the future.

Ok:

	"It requires a file system which has been configured to support
	DAX." ?

I'm trying to separate the user of the individual STATX DAX flag from the
Admin details of configuring the file system and/or devices which supports
it.

Also, I just realized that we should follow the format of the other STATX_*
attributes.  They all read something like "the file is..."

So I'm adding that text as well.

> > 		DAX generally assumes all accesses are via cpu load /
> > 		store instructions which can minimize overhead for small
> > 		accesses, but may adversely affect cpu utilization for
> > 		large transfers.
> >
> > 		File I/O is done directly to/from user-space buffers and
> > 		memory mapped I/O may be performed with direct memory
> > 		mappings that bypass kernel page cache.
> >
> > 		While the DAX property tends to result in data being
> > 		transferred synchronously, it does not give the same
> > 		guarantees of synchronous I/O where data and the
> > 		necessary metadata are
>
> Maybe use "O_SYNC I/O" explicitly to further differentiate the 2
> meanings of "synchronous" in this sentence?

Done.

> > 		transferred together.
> >
> > 		A DAX file may support being mapped with the MAP_SYNC
> > 		flag, which enables a program to use CPU cache flush
> > 		operations to
>
> s/operations/instructions/

Done.

> > 		persist CPU store operations without an explicit
> > 		fsync(2).  See mmap(2) for more information.
>
> I think this also wants a reference to the Linux interpretation of
> platform "persistence domains" we were discussing that here [1], but
> maybe it should be part of a "pmem" manpage that can be referenced
> from this man page.

Sure, but for now I think referencing mmap for details on MAP_SYNC works.

I suspect that we may have some word smithing once I get this series in
and we submit a change to the statx man page itself.  Can I move forward
with the following for this patch?

<quote>
STATX_ATTR_DAX

	The file is in the DAX (cpu direct access) state.  DAX state
	attempts to minimize software cache effects for both I/O and
	memory mappings of this file.  It requires a file system which
	has been configured to support DAX.

	DAX generally assumes all accesses are via cpu load / store
	instructions which can minimize overhead for small accesses, but
	may adversely affect cpu utilization for large transfers.

	File I/O is done directly to/from user-space buffers and memory
	mapped I/O may be performed with direct memory mappings that
	bypass kernel page cache.

	While the DAX property tends to result in data being transferred
	synchronously, it does not give the same guarantees of
	synchronous I/O where data and the necessary metadata are
	transferred together.

	A DAX file may support being mapped with the MAP_SYNC flag,
	which enables a program to use CPU cache flush instructions to
	persist CPU store operations without an explicit fsync(2).  See
	mmap(2) for more information.
</quote>

Ira

> [1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com
On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 401da197f012..e8fd95b75e5b 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> >   *
> >   * Basic locking order:
> >   *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
>
> Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
> the filesystem devices don't support DAX at all?

I've looked into this a bit more and I think skipping on CONFIG_FSDAX=n is
ok, but doing so on individual devices is not going to be possible because
we don't have that information at the vfs layer.

I'll continue to think about how to mitigate this more while I add
CONFIG_FSDAX checks before rolling out a new patch set.

> Also, I don't think we're actually following the i_rwsem -> i_daxsem
> order in fallocate, and possibly elsewhere too?
>
> Does the vfs have to take the i_dax_sem to do remapping things like
> reflink?  (Pretend that reflink and dax are compatible for the moment)

I spoke with Dan today about this and we believe the answer is going to be
yes.  However, not until the code is added to support DAX in reflink,
because currently this is only required for places which use IS_DAX() to
see the state of the inode.  Currently the vfs layer does not have any
IS_DAX() checks in the reflink path.  But when they are added they will be
required to take this sem.

Ira
On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote: > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote: > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote: > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote: > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote: > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > [snip] > > > > Ok I changed a couple of things as well. How does this sound? > > > > > > > > > STATX_ATTR_DAX > > > > > > DAX (cpu direct access) is a file mode that attempts to minimize > > > > s/mode/state/? > > DOH! yes state... ;-) > > > > > > software cache effects for both I/O and memory mappings of this > > > file. It requires a block device and file system which have > > > been configured to support DAX. > > > > It may not require a block device in the future. > > Ok: > > "It requires a file system which has been configured to support DAX." ? > > I'm trying to separate the user of the individual STATX DAX flag from the Admin > details of configuring the file system and/or devices which supports it. > > Also, I just realized that we should follow the format of the other STATX_* > attributes. They all read something like "the file is..." > > So I'm adding that text as well. > > > > > > > > > DAX generally assumes all accesses are via cpu load / store > > > instructions which can minimize overhead for small accesses, but > > > may adversely affect cpu utilization for large transfers. > > > > > > File I/O is done directly to/from user-space buffers and memory > > > mapped I/O may be performed with direct memory mappings that > > > bypass kernel page cache. 
> > > > > > While the DAX property tends to result in data being transferred > > > synchronously, it does not give the same guarantees of > > > synchronous I/O where data and the necessary metadata are > > > > Maybe use "O_SYNC I/O" explicitly to further differentiate the 2 > > meanings of "synchronous" in this sentence? > > Done. > > > > > > transferred together. > > > > > > A DAX file may support being mapped with the MAP_SYNC flag, > > > which enables a program to use CPU cache flush operations to > > > > s/operations/instructions/ > > Done. > > > > > > persist CPU store operations without an explicit fsync(2). See > > > mmap(2) for more information. > > > > I think this also wants a reference to the Linux interpretation of > > platform "persistence domains" we were discussing that here [1], but > > maybe it should be part of a "pmem" manpage that can be referenced > > from this man page. > > Sure, but for now I think referencing mmap for details on MAP_SYNC works. > > I suspect that we may have some word smithing once I get this series in and we > submit a change to the statx man page itself. Can I move forward with the > following for this patch? > > <quote> > STATX_ATTR_DAX > > The file is in the DAX (cpu direct access) state. DAX state Hmm, now that I see it written out, I <cough> kind of like "DAX mode" better now. :/ "The file is in DAX (CPU direct access) mode. DAX mode attempts..." > attempts to minimize software cache effects for both I/O and > memory mappings of this file. It requires a file system which > has been configured to support DAX. > > DAX generally assumes all accesses are via cpu load / store > instructions which can minimize overhead for small accesses, but > may adversely affect cpu utilization for large transfers. > > File I/O is done directly to/from user-space buffers and memory > mapped I/O may be performed with direct memory mappings that > bypass kernel page cache. 
> > While the DAX property tends to result in data being transferred > synchronously, it does not give the same guarantees of > synchronous I/O where data and the necessary metadata are > transferred together. (I'm frankly not sure that synchronous I/O actually guarantees that the metadata has hit stable storage...) --D > A DAX file may support being mapped with the MAP_SYNC flag, > which enables a program to use CPU cache flush instructions to > persist CPU store operations without an explicit fsync(2). See > mmap(2) for more information. > </quote> > > Ira > > > > > [1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com
On Wed, Jan 15, 2020 at 11:08:46AM -0800, Ira Weiny wrote: > On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote: > > On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote: > > > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote: > > > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote: > > > > > From: Ira Weiny <ira.weiny@intel.com> > > [snip] > > > > > > +``lock_mode`` > > > > > + called to prevent operations which depend on the inode's mode from > > > > > + proceeding should a mode change be in progress > > > > > > > > "Inodes can't change mode, because files do not suddenly become > > > > directories". ;) > > > > > > Yea sorry. > > > > > > > > > > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache > > > > mode from proceeding while there are address space operations in > > > > progress". So these are really more aops get and put functions... > > > > > > At first I actually did have aops get/put functions but this is really > > > protecting more than the aops vector because as Christoph said there are file > > > operations which need to be protected not just address space operations. > > > > > > But I agree "mode" is a bad name... Sorry... > > > > inode_fops_{get,set}(), then? > > > > inode_start_fileop() > > inode_end_fileop() ? > > > > Trying to avoid sounding foppish <COUGH> > > What about? > > inode_[un]lock_state()? Kinda vague -- which state? inodes retain a lot of different state. This locking primitive ensures that file operations pointers can't change while any operations are ongoing, right? --D > Ira >
On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
[..]
> > attempts to minimize software cache effects for both I/O and
> > memory mappings of this file. It requires a file system which
> > has been configured to support DAX.
> >
> > DAX generally assumes all accesses are via cpu load / store
> > instructions which can minimize overhead for small accesses, but
> > may adversely affect cpu utilization for large transfers.
> >
> > File I/O is done directly to/from user-space buffers and memory
> > mapped I/O may be performed with direct memory mappings that
> > bypass kernel page cache.
> >
> > While the DAX property tends to result in data being transferred
> > synchronously, it does not give the same guarantees of
> > synchronous I/O where data and the necessary metadata are
> > transferred together.
>
> (I'm frankly not sure that synchronous I/O actually guarantees that the
> metadata has hit stable storage...)
Oh? That text was motivated by the open(2) man page description of O_SYNC.
On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> [..]
> > > attempts to minimize software cache effects for both I/O and
> > > memory mappings of this file. It requires a file system which
> > > has been configured to support DAX.
> > >
> > > DAX generally assumes all accesses are via cpu load / store
> > > instructions which can minimize overhead for small accesses, but
> > > may adversely affect cpu utilization for large transfers.
> > >
> > > File I/O is done directly to/from user-space buffers and memory
> > > mapped I/O may be performed with direct memory mappings that
> > > bypass kernel page cache.
> > >
> > > While the DAX property tends to result in data being transferred
> > > synchronously, it does not give the same guarantees of
> > > synchronous I/O where data and the necessary metadata are
> > > transferred together.
> >
> > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > metadata has hit stable storage...)
>
> Oh? That text was motivated by the open(2) man page description of O_SYNC.
Eh, that's just me being cynical about software. Yes, the O_SYNC docs
say that data+metadata are supposed to happen; that's good enough for
another section in the man pages. :)
--D
On Wed, Jan 15, 2020 at 10:18 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> > On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > [..]
> > > > attempts to minimize software cache effects for both I/O and
> > > > memory mappings of this file. It requires a file system which
> > > > has been configured to support DAX.
> > > >
> > > > DAX generally assumes all accesses are via cpu load / store
> > > > instructions which can minimize overhead for small accesses, but
> > > > may adversely affect cpu utilization for large transfers.
> > > >
> > > > File I/O is done directly to/from user-space buffers and memory
> > > > mapped I/O may be performed with direct memory mappings that
> > > > bypass kernel page cache.
> > > >
> > > > While the DAX property tends to result in data being transferred
> > > > synchronously, it does not give the same guarantees of
> > > > synchronous I/O where data and the necessary metadata are
> > > > transferred together.
> > >
> > > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > > metadata has hit stable storage...)
> >
> > Oh? That text was motivated by the open(2) man page description of O_SYNC.
>
> Eh, that's just me being cynical about software. Yes, the O_SYNC docs
> say that data+metadata are supposed to happen; that's good enough for
> another section in the man pages. :)
>
Ah ok, yes, "all storage is a lie".
On Fri 10-01-20 11:29:38, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > XFS requires regular files to be locked while changing to/from DAX mode. > > Define a new DAX lock type and implement the [un]lock_mode() inode > operation callbacks. > > We define a new XFS_DAX_* lock type to carry the lock through the > transaction because we don't want to use IOLOCK as that would cause > performance issues with locking of the inode itself. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> ... > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index 492e53992fa9..693ca66bd89b 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -67,6 +67,9 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* protect changing the mode to/from DAX */ > + struct percpu_rw_semaphore i_dax_sem; > } xfs_inode_t; This adds overhead of ~32k per inode for typical distro kernel. That's not going to fly. That's why ext4 has similar kind of lock in the superblock shared by all inodes. For read side it does not matter because that's per-cpu and shared lock. For write side we don't care as changing inode access mode should be rare. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Fri 10-01-20 11:29:35, ira.weiny@intel.com wrote: > From: Ira Weiny <ira.weiny@intel.com> > > The IS_DAX() check in io_is_direct() causes a race between changing the > DAX mode and creating the iocb flags. > > Remove the check because DAX now emulates the page cache API and > therefore it does not matter if the file mode is DAX or not when the > iocb flags are created. > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> The patch looks good to me. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > include/linux/fs.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index d7584bcef5d3..e11989502eac 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file); > > static inline bool io_is_direct(struct file *filp) > { > - return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host); > + return (filp->f_flags & O_DIRECT); > } > > static inline bool vma_is_dax(struct vm_area_struct *vma) > -- > 2.21.0 > -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote: > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote: > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote: > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote: > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote: > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote: > > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > > [snip] > > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works. > > > > I suspect that we may have some word smithing once I get this series in and we > > submit a change to the statx man page itself. Can I move forward with the > > following for this patch? > > > > <quote> > > STATX_ATTR_DAX > > > > The file is in the DAX (cpu direct access) state. DAX state > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode" > better now. :/ > > "The file is in DAX (CPU direct access) mode. DAX mode attempts..." Sure... now you tell me... ;-) Seriously, we could use mode here in the man page as this is less confusing to say "DAX mode". But I think the code should still use 'state' because mode is just too overloaded. You were not the only one who was thrown by my use of mode and I don't want that confusion when we look at this code 2 weeks from now... https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/ ;-) > > > attempts to minimize software cache effects for both I/O and > > memory mappings of this file. It requires a file system which > > has been configured to support DAX. > > > > DAX generally assumes all accesses are via cpu load / store > > instructions which can minimize overhead for small accesses, but > > may adversely affect cpu utilization for large transfers. 
> > > > File I/O is done directly to/from user-space buffers and memory > > mapped I/O may be performed with direct memory mappings that > > bypass kernel page cache. > > > > While the DAX property tends to result in data being transferred > > synchronously, it does not give the same guarantees of > > synchronous I/O where data and the necessary metadata are > > transferred together. > > (I'm frankly not sure that synchronous I/O actually guarantees that the > metadata has hit stable storage...) I'll let you and Dan work this one out... ;-) Ira
On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote: > On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote: > > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote: > > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote: > > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote: > > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote: > > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote: > > > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > > > > > > [snip] > > > > > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works. > > > > > > I suspect that we may have some word smithing once I get this series in and we > > > submit a change to the statx man page itself. Can I move forward with the > > > following for this patch? > > > > > > <quote> > > > STATX_ATTR_DAX > > > > > > The file is in the DAX (cpu direct access) state. DAX state > > > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode" > > better now. :/ > > > > "The file is in DAX (CPU direct access) mode. DAX mode attempts..." > > Sure... now you tell me... ;-) > > Seriously, we could use mode here in the man page as this is less confusing to > say "DAX mode". > > But I think the code should still use 'state' because mode is just too > overloaded. You were not the only one who was thrown by my use of mode and I > don't want that confusion when we look at this code 2 weeks from now... > > https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/ > > ;-) Ok, let's leave it alone for now then. I'm not even sure what 'DAX' stands for. Direct Access to ... Professor Xavier? 8-) > > > > > attempts to minimize software cache effects for both I/O and > > > memory mappings of this file. It requires a file system which > > > has been configured to support DAX. 
> > > > > > DAX generally assumes all accesses are via cpu load / store > > > instructions which can minimize overhead for small accesses, but > > > may adversely affect cpu utilization for large transfers. > > > > > > File I/O is done directly to/from user-space buffers and memory > > > mapped I/O may be performed with direct memory mappings that > > > bypass kernel page cache. > > > > > > While the DAX property tends to result in data being transferred > > > synchronously, it does not give the same guarantees of > > > synchronous I/O where data and the necessary metadata are > > > transferred together. > > > > (I'm frankly not sure that synchronous I/O actually guarantees that the > > metadata has hit stable storage...) > > I'll let you and Dan work this one out... ;-) Hehe. I think the wording here is fine. --D > Ira >
On Thu, Jan 16, 2020 at 10:38:07AM +0100, Jan Kara wrote: > On Fri 10-01-20 11:29:35, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > The IS_DAX() check in io_is_direct() causes a race between changing the > > DAX mode and creating the iocb flags. > > > > Remove the check because DAX now emulates the page cache API and > > therefore it does not matter if the file mode is DAX or not when the > > iocb flags are created. > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > The patch looks good to me. You can add: > > Reviewed-by: Jan Kara <jack@suse.cz> Thanks, Ira > > Honza > > > --- > > include/linux/fs.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index d7584bcef5d3..e11989502eac 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file); > > > > static inline bool io_is_direct(struct file *filp) > > { > > - return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host); > > + return (filp->f_flags & O_DIRECT); > > } > > > > static inline bool vma_is_dax(struct vm_area_struct *vma) > > -- > > 2.21.0 > > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Thu, Jan 16, 2020 at 10:04:21AM -0800, Darrick J. Wong wrote: > On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote: > > On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote: > > > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote: > > > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote: > > > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote: > > > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote: > > > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote: > > > > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > > > > > > > > > > [snip] > > > > > > > > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works. > > > > > > > > I suspect that we may have some word smithing once I get this series in and we > > > > submit a change to the statx man page itself. Can I move forward with the > > > > following for this patch? > > > > > > > > <quote> > > > > STATX_ATTR_DAX > > > > > > > > The file is in the DAX (cpu direct access) state. DAX state > > > > > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode" > > > better now. :/ > > > > > > "The file is in DAX (CPU direct access) mode. DAX mode attempts..." > > > > Sure... now you tell me... ;-) > > > > Seriously, we could use mode here in the man page as this is less confusing to > > say "DAX mode". > > > > But I think the code should still use 'state' because mode is just too > > overloaded. You were not the only one who was thrown by my use of mode and I > > don't want that confusion when we look at this code 2 weeks from now... > > > > https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/ > > > > ;-) > > Ok, let's leave it alone for now then. Cool could I get a reviewed by? And Jan is this reword of the man page/commit ok to keep your reviewed by? 
> > I'm not even sure what 'DAX' stands for. Direct Access to ... > Professor Xavier? 8-) That is pronounced 'Direct A'Xes' you know, for chopping wood! Thanks everyone, Ira > > > > > > > > attempts to minimize software cache effects for both I/O and > > > > memory mappings of this file. It requires a file system which > > > > has been configured to support DAX. > > > > > > > > DAX generally assumes all accesses are via cpu load / store > > > > instructions which can minimize overhead for small accesses, but > > > > may adversely affect cpu utilization for large transfers. > > > > > > > > File I/O is done directly to/from user-space buffers and memory > > > > mapped I/O may be performed with direct memory mappings that > > > > bypass kernel page cache. > > > > > > > > While the DAX property tends to result in data being transferred > > > > synchronously, it does not give the same guarantees of > > > > synchronous I/O where data and the necessary metadata are > > > > transferred together. > > > > > > (I'm frankly not sure that synchronous I/O actually guarantees that the > > > metadata has hit stable storage...) > > > > I'll let you and Dan work this one out... ;-) > > Hehe. I think the wording here is fine. > > --D > > > Ira > >
On Wed, Jan 15, 2020 at 09:40:40PM -0800, Darrick J. Wong wrote: > On Wed, Jan 15, 2020 at 11:08:46AM -0800, Ira Weiny wrote: > > On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote: > > > On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote: > > > > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote: > > > > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote: > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > [snip] > > > > > > > > +``lock_mode`` > > > > > > + called to prevent operations which depend on the inode's mode from > > > > > > + proceeding should a mode change be in progress > > > > > > > > > > "Inodes can't change mode, because files do not suddenly become > > > > > directories". ;) > > > > > > > > Yea sorry. > > > > > > > > > > > > > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache > > > > > mode from proceeding while there are address space operations in > > > > > progress". So these are really more aops get and put functions... > > > > > > > > At first I actually did have aops get/put functions but this is really > > > > protecting more than the aops vector because as Christoph said there are file > > > > operations which need to be protected not just address space operations. > > > > > > > > But I agree "mode" is a bad name... Sorry... > > > > > > inode_fops_{get,set}(), then? > > > > > > inode_start_fileop() > > > inode_end_fileop() ? > > > > > > Trying to avoid sounding foppish <COUGH> > > > > What about? > > > > inode_[un]lock_state()? > > Kinda vague -- which state? inodes retain a lot of different state. > > This locking primitive ensures that file operations pointers can't > change while any operations are ongoing, right? file ops Address space ops ... Whatever it needs... With the current patch set we could be more specific and call it. inode_[un]lock_dax_state() Ira > > --D > > > Ira > >
On Thu, Jan 16, 2020 at 10:24:46AM +0100, Jan Kara wrote: > On Fri 10-01-20 11:29:38, ira.weiny@intel.com wrote: > > From: Ira Weiny <ira.weiny@intel.com> > > > > XFS requires regular files to be locked while changing to/from DAX mode. > > > > Define a new DAX lock type and implement the [un]lock_mode() inode > > operation callbacks. > > > > We define a new XFS_DAX_* lock type to carry the lock through the > > transaction because we don't want to use IOLOCK as that would cause > > performance issues with locking of the inode itself. > > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > ... > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index 492e53992fa9..693ca66bd89b 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -67,6 +67,9 @@ typedef struct xfs_inode { > > spinlock_t i_ioend_lock; > > struct work_struct i_ioend_work; > > struct list_head i_ioend_list; > > + > > + /* protect changing the mode to/from DAX */ > > + struct percpu_rw_semaphore i_dax_sem; > > } xfs_inode_t; > > This adds overhead of ~32k per inode for typical distro kernel. Wow! > That's not > going to fly. Probably not... > That's why ext4 has similar kind of lock in the superblock > shared by all inodes. For read side it does not matter because that's > per-cpu and shared lock. For write side we don't care as changing inode > access mode should be rare. Sounds reasonable to me. I'll convert it. Thanks for pointing this out, that would have been bad indeed. Ira
On Thu, Jan 16, 2020 at 10:52:36AM -0800, Ira Weiny wrote: > On Thu, Jan 16, 2020 at 10:04:21AM -0800, Darrick J. Wong wrote: > > On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote: > > > On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote: > > > > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote: > > > > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote: > > > > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote: > > > > > > > > > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote: > > > > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote: > > > > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote: > > > > > > > > > > From: Ira Weiny <ira.weiny@intel.com> > > > > > > > > > > > > > > > > > > > > > [snip] > > > > > > > > > > > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works. > > > > > > > > > > I suspect that we may have some word smithing once I get this series in and we > > > > > submit a change to the statx man page itself. Can I move forward with the > > > > > following for this patch? > > > > > > > > > > <quote> > > > > > STATX_ATTR_DAX > > > > > > > > > > The file is in the DAX (cpu direct access) state. DAX state > > > > > > > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode" > > > > better now. :/ > > > > > > > > "The file is in DAX (CPU direct access) mode. DAX mode attempts..." > > > > > > Sure... now you tell me... ;-) > > > > > > Seriously, we could use mode here in the man page as this is less confusing to > > > say "DAX mode". > > > > > > But I think the code should still use 'state' because mode is just too > > > overloaded. You were not the only one who was thrown by my use of mode and I > > > don't want that confusion when we look at this code 2 weeks from now... 
> > > > > > https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/ > > > > > > ;-) > > > > Ok, let's leave it alone for now then. > > Cool could I get a reviewed by? My bike shed is painted green with purple polka dots, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > And Jan is this reword of the man page/commit ok to keep your reviewed by? > > > > > I'm not even sure what 'DAX' stands for. Direct Access to ... > > Professor Xavier? 8-) > > That is pronounced 'Direct A'Xes' you know, for chopping wood! > > Thanks everyone, > Ira > > > > > > > > > > > > attempts to minimize software cache effects for both I/O and > > > > > memory mappings of this file. It requires a file system which > > > > > has been configured to support DAX. > > > > > > > > > > DAX generally assumes all accesses are via cpu load / store > > > > > instructions which can minimize overhead for small accesses, but > > > > > may adversely affect cpu utilization for large transfers. > > > > > > > > > > File I/O is done directly to/from user-space buffers and memory > > > > > mapped I/O may be performed with direct memory mappings that > > > > > bypass kernel page cache. > > > > > > > > > > While the DAX property tends to result in data being transferred > > > > > synchronously, it does not give the same guarantees of > > > > > synchronous I/O where data and the necessary metadata are > > > > > transferred together. > > > > > > > > (I'm frankly not sure that synchronous I/O actually guarantees that the > > > > metadata has hit stable storage...) > > > > > > I'll let you and Dan work this one out... ;-) > > > > Hehe. I think the wording here is fine. > > > > --D > > > > > Ira > > >
On Thu 16-01-20 10:52:36, Ira Weiny wrote:
> And Jan is this reword of the man page/commit ok to keep your reviewed by?
Yes.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> [..]
> > > attempts to minimize software cache effects for both I/O and
> > > memory mappings of this file. It requires a file system which
> > > has been configured to support DAX.
> > >
> > > DAX generally assumes all accesses are via cpu load / store
> > > instructions which can minimize overhead for small accesses, but
> > > may adversely affect cpu utilization for large transfers.
> > >
> > > File I/O is done directly to/from user-space buffers and memory
> > > mapped I/O may be performed with direct memory mappings that
> > > bypass kernel page cache.
> > >
> > > While the DAX property tends to result in data being transferred
> > > synchronously, it does not give the same guarantees of
> > > synchronous I/O where data and the necessary metadata are
> > > transferred together.
> >
> > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > metadata has hit stable storage...)
>
> Oh? That text was motivated by the open(2) man page description of O_SYNC.
Ugh. "synchronous I/O" means two different things, depending on
context. In the AIO context, it means "the process context waits directly
for operation completion", but in the O_SYNC context, it means "we guarantee
data integrity for each I/O submitted".
Indeed, O_SYNC AIO is a thing. i.e. we can do an "async sync
write" to guarantee data integrity without directly waiting for
it. Now try describing both of those behaviours using only the words
"synchronous write". :)
IOWs, if you are talking about data integrity, you need to
explicitly say "O_SYNC semantics", not "synchronous write", because
"synchronous write" is totally ambiguous without the O_SYNC context
of the open(2) man page...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
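Dave's distinction between the two meanings of "synchronous" can be made concrete with a small userspace sketch (the file path and helper name are illustrative; this only demonstrates the O_SYNC sense, i.e. data-integrity completion rather than merely waiting for the syscall to return):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * With O_SYNC, write(2) does not return until the data *and* the
 * metadata needed to retrieve that data have been transferred to the
 * storage device (per open(2)'s synchronized I/O completion rules).
 * Without O_SYNC the write is still "synchronous" in the AIO sense --
 * the caller waits for the syscall -- but carries no such guarantee.
 */
ssize_t write_o_sync(const char *path, const void *buf, size_t len)
{
	ssize_t ret;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

	if (fd < 0)
		return -1;
	ret = write(fd, buf, len);
	close(fd);
	return ret;
}
```

O_SYNC can also be combined with AIO submission, which is the "async sync write" Dave mentions: the completion is reported asynchronously, yet each I/O still carries the data-integrity guarantee.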