* [PATCH v4 0/3] MAP_DIRECT and block-map sealed files @ 2017-08-15 6:12 Dan Williams 2017-08-15 6:12 ` [PATCH v4 1/3] fs, xfs: introduce S_IOMAP_SEALED Dan Williams ` (3 more replies) 0 siblings, 4 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 6:12 UTC (permalink / raw) To: darrick.wong-QHcLZuEGTsvQT0dZR+AlfA Cc: Jan Kara, Arnd Bergmann, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Alexander Viro, luto-DgEjT+Ai2ygdnm+yROfE0A, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Christoph Hellwig Changes since v3 [1]: * Move from an fallocate(2) interface to a new mmap(2) flag and rename 'immutable' to 'sealed'. * Do not record the sealed state in permanent metadata it is now purely a temporary state for as long as a MAP_DIRECT vma is referencing the inode (Christoph) * Drop the CAP_IMMUTABLE requirement, but do require a PROT_WRITE mapping. [1]: https://lwn.net/Articles/730570/ --- This is the next revision of a patch series that aims to enable applications that otherwise need to resort to DAX mapping a raw device file to instead move to a filesystem. In the course of reviewing a previous posting, Christoph said: That being said I think we absolutely should support RDMA memory registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all the blocks are populated and all ptes are set up. Second we need to make sure get_user_page works, which for now means we'll need a struct page mapping for the region (which will be really annoying for PCIe mappings, like the upcoming NVMe persistent memory region), and we need to guarantee that the extent mapping won't change while the get_user_pages holds the pages inside it. I think that is true due to side effects even with the current DAX code, but we'll need to make it explicit. And maybe that's where we need to converge - "sealing" the extent map makes sense as such a temporary measure that is not persisted on disk, which automatically gets released when the holding process exits, because we sort of already do this implicitly. It might also make sense to have explicitly breakable seals similar to what I do for the pNFS blocks kernel server, as any userspace RDMA file server would also need those semantics. So, this is an attempt to converge on the idea that we need an explicit and process-lifetime-temporary mechanism for a process to be able to make assumptions about the mapping to physical page to dax-file-offset relationship. The "explicitly breakable seals" aspect is not addressed in these patches, but I wonder if it might be a voluntary mechanism that can implemented via userfaultfd. These pass a basic smoke test and are meant to just gauge 'right track' / 'wrong track'. The main question it seems is whether the pinning done in this patchset is too early (applies before get_user_pages()) and too coarse (applies to the whole file). Perhaps this is where I discarded too easily Jan's suggestion to look at Peter Z's mm_mpin() syscall [2]? On the other hand, the coarseness and simple lifetime rules of MAP_DIRECT make it an easy mechanism to implement and explain. Another reason I kept the scope of S_IOMAP_SEALED coarsely defined was to support Dave's desired use case of sealing for operating on reflinked files [3]. Suggested mmap(2) man page edits are included in the changelog of patch 3. [2]: https://lwn.net/Articles/600502/ [3]: https://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg1467677.html --- Dan Williams (3): fs, xfs: introduce S_IOMAP_SEALED mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges fs/attr.c | 10 +++ fs/dax.c | 2 + fs/open.c | 6 ++ fs/read_write.c | 3 + fs/xfs/libxfs/xfs_bmap.c | 5 + fs/xfs/xfs_bmap_util.c | 3 + fs/xfs/xfs_file.c | 107 ++++++++++++++++++++++++++++++++ fs/xfs/xfs_inode.h | 1 fs/xfs/xfs_ioctl.c | 6 ++ fs/xfs/xfs_super.c | 1 include/linux/fs.h | 9 +++ include/linux/mm.h | 2 - include/linux/mm_types.h | 1 include/linux/mman.h | 3 + include/uapi/asm-generic/mman-common.h | 2 + mm/filemap.c | 5 + mm/mmap.c | 22 ++++++- 17 files changed, 183 insertions(+), 5 deletions(-) ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 1/3] fs, xfs: introduce S_IOMAP_SEALED 2017-08-15 6:12 [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dan Williams @ 2017-08-15 6:12 ` Dan Williams [not found] ` <150277752553.23945.13932394738552748440.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> ` (2 subsequent siblings) 3 siblings, 0 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 6:12 UTC (permalink / raw) To: darrick.wong Cc: Jan Kara, linux-nvdimm, linux-api, Dave Chinner, linux-kernel, linux-xfs, linux-mm, Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Ross Zwisler, Christoph Hellwig When a filesystem sees this flag set it will not allow changes to the file-offset to physical-block-offset relationship of any extent in the file. The extent of the extents covered by the global S_IOMAP_SEALED is filesystem specific. In other words it is similar to the inode-wide XFS_DIFLAG2_REFLINK flag where we make the distinction apply globally to the inode even though we could theoretically limit that effect to a sub-range of the file. The interface that sets this flag (mmap(..., MAP_DIRECT, ...)) will be careful to document that it is implementation specific whether the 'sealed' restrictions apply to a sub-range or the whole file. Applications should be prepared for unrelated ranges in the file to be effected. The term 'sealed' is used instead of 'immutable' to better indicate that this is a file property that is temporary and can be undone. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/attr.c | 10 ++++++++++ fs/open.c | 6 ++++++ fs/read_write.c | 3 +++ fs/xfs/libxfs/xfs_bmap.c | 5 +++++ fs/xfs/xfs_bmap_util.c | 3 +++ fs/xfs/xfs_ioctl.c | 6 ++++++ include/linux/fs.h | 2 ++ mm/filemap.c | 5 +++++ 8 files changed, 40 insertions(+) diff --git a/fs/attr.c b/fs/attr.c index 135304146120..d940386e0ca9 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare); */ int inode_newsize_ok(const struct inode *inode, loff_t offset) { + if (IS_IOMAP_SEALED(inode)) { + /* + * Any size change is disallowed. Size increases may + * dirty metadata that an application is not prepared to + * sync, and a size decrease may expose free blocks to + * in-flight DMA. + */ + return -ETXTBSY; + } + if (inode->i_size < offset) { unsigned long limit; diff --git a/fs/open.c b/fs/open.c index 35bb784763a4..92d89ec2d6b3 100644 --- a/fs/open.c +++ b/fs/open.c @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return -ETXTBSY; /* + * We cannot allow any allocation changes on an iomap sealed file + */ + if (IS_IOMAP_SEALED(inode)) + return -ETXTBSY; + + /* * Revalidate the write permissions, in case security policy has * changed since the files were opened. */ diff --git a/fs/read_write.c b/fs/read_write.c index 0cc7033aa413..55700ca85f7e 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out)) return -ETXTBSY; + if (IS_IOMAP_SEALED(inode_in) || IS_IOMAP_SEALED(inode_out)) + return -ETXTBSY; + /* Don't reflink dirs, pipes, sockets... */ if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) return -EISDIR; diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index a2d64666cdd4..84d8ee9f414c 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -4481,6 +4481,11 @@ xfs_bmapi_write( if (XFS_FORCED_SHUTDOWN(mp)) return -EIO; + /* fail any attempts to mutate data extents */ + if (IS_IOMAP_SEALED(VFS_I(ip)) + && !(flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ATTRFORK))) + return -ETXTBSY; + ifp = XFS_IFORK_PTR(ip, whichfork); XFS_STATS_INC(mp, xs_blk_mapw); diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 93e955262d07..ef4c4e8b0f58 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1294,6 +1294,9 @@ xfs_free_file_space( trace_xfs_free_file_space(ip); + if (IS_IOMAP_SEALED(VFS_I(ip))) + return -ETXTBSY; + error = xfs_qm_dqattach(ip, 0); if (error) return error; diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index e75c40a47b7d..b716d184ae9a 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1755,6 +1755,12 @@ xfs_ioc_swapext( goto out_put_tmp_file; } + if (IS_IOMAP_SEALED(file_inode(f.file)) || + IS_IOMAP_SEALED(file_inode(tmp.file))) { + error = -EINVAL; + goto out_put_tmp_file; + } + /* * We need to ensure that the fds passed in point to XFS inodes * before we cast and access them as XFS structures as we have no diff --git a/include/linux/fs.h b/include/linux/fs.h index 6e1fd5d21248..1104e5df39ef 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1829,6 +1829,7 @@ struct super_operations { #else #define S_DAX 0 /* Make all the DAX code disappear */ #endif +#define S_IOMAP_SEALED 16384 /* logical-to-physical extent map is fixed */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1867,6 +1868,7 @@ struct super_operations { #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) +#define IS_IOMAP_SEALED(inode) ((inode)->i_flags & S_IOMAP_SEALED) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) diff --git a/mm/filemap.c b/mm/filemap.c index a49702445ce0..4c929961b538 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2806,6 +2806,11 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) if (unlikely(pos >= inode->i_sb->s_maxbytes)) return -EFBIG; + /* Are we about to mutate the block map on a sealed file? */ + if (IS_IOMAP_SEALED(inode) + && (pos + iov_iter_count(from) > i_size_read(inode))) + return -ETXTBSY; + iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos); return iov_iter_count(from); } ^ permalink raw reply related [flat|nested] 22+ messages in thread
[parent not found: <150277752553.23945.13932394738552748440.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>]
* [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags [not found] ` <150277752553.23945.13932394738552748440.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-08-15 6:12 ` Dan Williams 2017-08-15 12:27 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 6:12 UTC (permalink / raw) To: darrick.wong-QHcLZuEGTsvQT0dZR+AlfA Cc: Jan Kara, Arnd Bergmann, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, luto-DgEjT+Ai2ygdnm+yROfE0A, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Christoph Hellwig The mmap syscall suffers from the ABI anti-pattern of not validating unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a mechanism to define new behavior that is known to fail on older kernels without the feature. Use the fact that specifying MAP_SHARED and MAP_PRIVATE at the same time is invalid as a cute hack to allow a new set of validated flags to be introduced. This also introduces the ->fmmap() file operation that is ->mmap() plus flags. Each ->fmmap() implementation must fail requests when a locally unsupported flag is specified. Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Cc: Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Suggested-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- include/linux/fs.h | 7 +++++++ include/linux/mm.h | 2 +- include/linux/mman.h | 3 +++ include/uapi/asm-generic/mman-common.h | 1 + mm/mmap.c | 20 +++++++++++++++++--- 5 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 1104e5df39ef..bbe755d0caee 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1674,6 +1674,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); + int (*fmmap) (struct file *, struct vm_area_struct *, unsigned long); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); @@ -1748,6 +1749,12 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) return file->f_op->mmap(file, vma); } +static inline int call_fmmap(struct file *file, struct vm_area_struct *vma, + unsigned long flags) +{ + return file->f_op->fmmap(file, vma, flags); +} + ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector, unsigned long nr_segs, unsigned long fast_segs, struct iovec *fast_pointer, diff --git a/include/linux/mm.h b/include/linux/mm.h index 46b9ac5e8569..49eef48da4b7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2090,7 +2090,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf); + struct list_head *uf, unsigned long flags); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, diff --git a/include/linux/mman.h b/include/linux/mman.h index c8367041fafd..73d4ac7e7136 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -7,6 +7,9 @@ #include <linux/atomic.h> #include <uapi/linux/mman.h> +/* the MAP_VALIDATE set of supported flags */ +#define MAP_SUPPORTED_MASK (0) + extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; extern unsigned long sysctl_overcommit_kbytes; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 8c27db0c5c08..8bf8c7828275 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -24,6 +24,7 @@ #else # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */ #endif +#define MAP_VALIDATE (MAP_SHARED|MAP_PRIVATE) /* mechanism to define new shared semantics */ /* * Flags for mlock diff --git a/mm/mmap.c b/mm/mmap.c index f19efcf75418..d2919a9e25bf 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1388,6 +1388,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr, struct inode *inode = file_inode(file); switch (flags & MAP_TYPE) { + case MAP_VALIDATE: + if (flags & ~(MAP_SUPPORTED_MASK | MAP_VALIDATE)) + return -EINVAL; + if (!file->f_op->fmmap) + return -EOPNOTSUPP; + /* fall through */ case MAP_SHARED: if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) return -EACCES; @@ -1464,7 +1470,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; } - addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); + if ((flags & MAP_VALIDATE) == MAP_VALIDATE) + flags &= MAP_SUPPORTED_MASK; + else + flags = 0; + + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags); if (!IS_ERR_VALUE(addr) && ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) @@ -1601,7 +1612,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf) + struct list_head *uf, unsigned long flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1686,7 +1697,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr, * new file must not have been exposed to user-space, yet. */ vma->vm_file = get_file(file); - error = call_mmap(file, vma); + if (flags) + error = call_fmmap(file, vma, flags); + else + error = call_mmap(file, vma); if (error) goto unmap_and_free_vma; ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-08-15 6:12 ` [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags Dan Williams @ 2017-08-15 12:27 ` Jan Kara [not found] ` <20170815122701.GF27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2017-09-17 3:44 ` Dan Williams 2017-08-15 16:28 ` Andy Lutomirski 2017-08-17 8:06 ` kbuild test robot 2 siblings, 2 replies; 22+ messages in thread From: Jan Kara @ 2017-08-15 12:27 UTC (permalink / raw) To: Dan Williams Cc: darrick.wong, Jan Kara, Arnd Bergmann, linux-nvdimm, linux-api, linux-kernel, linux-xfs, linux-mm, luto, linux-fsdevel, Andrew Morton, Christoph Hellwig On Mon 14-08-17 23:12:16, Dan Williams wrote: > The mmap syscall suffers from the ABI anti-pattern of not validating > unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a > mechanism to define new behavior that is known to fail on older kernels > without the feature. Use the fact that specifying MAP_SHARED and > MAP_PRIVATE at the same time is invalid as a cute hack to allow a new > set of validated flags to be introduced. > > This also introduces the ->fmmap() file operation that is ->mmap() plus > flags. Each ->fmmap() implementation must fail requests when a locally > unsupported flag is specified. ... > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 1104e5df39ef..bbe755d0caee 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1674,6 +1674,7 @@ struct file_operations { > long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); > long (*compat_ioctl) (struct file *, unsigned int, unsigned long); > int (*mmap) (struct file *, struct vm_area_struct *); > + int (*fmmap) (struct file *, struct vm_area_struct *, unsigned long); > int (*open) (struct inode *, struct file *); > int (*flush) (struct file *, fl_owner_t id); > int (*release) (struct inode *, struct file *); > @@ -1748,6 +1749,12 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) > return file->f_op->mmap(file, vma); > } > > +static inline int call_fmmap(struct file *file, struct vm_area_struct *vma, > + unsigned long flags) > +{ > + return file->f_op->fmmap(file, vma, flags); > +} > + Hum, I dislike a new file op for this when the only problem with ->mmap is that it misses 'flags' argument. I understand there are lots of ->mmap implementations out there and modifying prototype of them all is painful but is it so bad? Coccinelle patch for this should be rather easy... Also for MAP_SYNC I want the flag to be copied in VMA anyway so for that I don't need additional flags argument anyway. And I wonder how you want to make things work without VMA flag in case of MAP_DIRECT as well - VMAs can be split, partially unmapped etc. and so without VMA flag you are going to have hard time to detect whether there's any mapping left which blocks block mapping changes. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20170815122701.GF27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags [not found] ` <20170815122701.GF27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> @ 2017-08-15 16:24 ` Dan Williams 0 siblings, 0 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 16:24 UTC (permalink / raw) To: Jan Kara Cc: Arnd Bergmann, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Linux API, Darrick J. Wong, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Andy Lutomirski, linux-fsdevel, Andrew Morton, Christoph Hellwig On Tue, Aug 15, 2017 at 5:27 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote: > On Mon 14-08-17 23:12:16, Dan Williams wrote: >> The mmap syscall suffers from the ABI anti-pattern of not validating >> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a >> mechanism to define new behavior that is known to fail on older kernels >> without the feature. Use the fact that specifying MAP_SHARED and >> MAP_PRIVATE at the same time is invalid as a cute hack to allow a new >> set of validated flags to be introduced. >> >> This also introduces the ->fmmap() file operation that is ->mmap() plus >> flags. Each ->fmmap() implementation must fail requests when a locally >> unsupported flag is specified. > ... >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 1104e5df39ef..bbe755d0caee 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -1674,6 +1674,7 @@ struct file_operations { >> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); >> long (*compat_ioctl) (struct file *, unsigned int, unsigned long); >> int (*mmap) (struct file *, struct vm_area_struct *); >> + int (*fmmap) (struct file *, struct vm_area_struct *, unsigned long); >> int (*open) (struct inode *, struct file *); >> int (*flush) (struct file *, fl_owner_t id); >> int (*release) (struct inode *, struct file *); >> @@ -1748,6 +1749,12 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) >> return file->f_op->mmap(file, vma); >> } >> >> +static inline int call_fmmap(struct file *file, struct vm_area_struct *vma, >> + unsigned long flags) >> +{ >> + return file->f_op->fmmap(file, vma, flags); >> +} >> + > > Hum, I dislike a new file op for this when the only problem with ->mmap is > that it misses 'flags' argument. I understand there are lots of ->mmap > implementations out there and modifying prototype of them all is painful > but is it so bad? Coccinelle patch for this should be rather easy... Changing the prototype is relatively easy with Coccinelle, but we still need the code in each ->mmap() implementation to validate a local list of supported flags. How about adding a 'supported mmap flags' field to 'struct file_operations' so that the validation code can be made generic? I'll go with that since it's a bit less surprising than a new operation type, and not as messy as teaching every mmap implementation in the kernel to validate flags that they will likely never care about. > Also for MAP_SYNC I want the flag to be copied in VMA anyway so for that I > don't need additional flags argument anyway. And I wonder how you want to > make things work without VMA flag in case of MAP_DIRECT as well - VMAs can > be split, partially unmapped etc. and so without VMA flag you are going to > have hard time to detect whether there's any mapping left which blocks > block mapping changes. Outside of requiring a 64-bit arch, we're out of vm_flags. Also, the core mm does not really care about MAP_DIRECT or MAP_SYNC so that's why I added a new ->fs_flags field since these are more filesystem properties than core mm. The problem of tracking MAP_DIRECT over vma splits appears to already be handled. __split_vma does: /* most fields are the same, copy all, and then fixup */ *new = *vma; ... if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new); In ->open() I'm checking if 'new' has MAP_DIRECT in ->fs_flags and taking a reference against the S_IOMAP_SEALED flag. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-08-15 12:27 ` Jan Kara [not found] ` <20170815122701.GF27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> @ 2017-09-17 3:44 ` Dan Williams [not found] ` <CAA9_cmc0vejxCsc1NWp5b4C0CSsO5xetF3t6LCoCuEYB6yPiwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-09-18 9:26 ` Jan Kara 1 sibling, 2 replies; 22+ messages in thread From: Dan Williams @ 2017-09-17 3:44 UTC (permalink / raw) To: Jan Kara Cc: Darrick J. Wong, Arnd Bergmann, linux-nvdimm, Linux API, Linux Kernel Mailing List, linux-xfs, linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton, Christoph Hellwig On Tue, Aug 15, 2017 at 5:27 AM, Jan Kara <jack@suse.cz> wrote: > On Mon 14-08-17 23:12:16, Dan Williams wrote: >> The mmap syscall suffers from the ABI anti-pattern of not validating >> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a >> mechanism to define new behavior that is known to fail on older kernels >> without the feature. Use the fact that specifying MAP_SHARED and >> MAP_PRIVATE at the same time is invalid as a cute hack to allow a new >> set of validated flags to be introduced. >> >> This also introduces the ->fmmap() file operation that is ->mmap() plus >> flags. Each ->fmmap() implementation must fail requests when a locally >> unsupported flag is specified. > ... >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 1104e5df39ef..bbe755d0caee 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -1674,6 +1674,7 @@ struct file_operations { >> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); >> long (*compat_ioctl) (struct file *, unsigned int, unsigned long); >> int (*mmap) (struct file *, struct vm_area_struct *); >> + int (*fmmap) (struct file *, struct vm_area_struct *, unsigned long); >> int (*open) (struct inode *, struct file *); >> int (*flush) (struct file *, fl_owner_t id); >> int (*release) (struct inode *, struct file *); >> @@ -1748,6 +1749,12 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) >> return file->f_op->mmap(file, vma); >> } >> >> +static inline int call_fmmap(struct file *file, struct vm_area_struct *vma, >> + unsigned long flags) >> +{ >> + return file->f_op->fmmap(file, vma, flags); >> +} >> + > > Hum, I dislike a new file op for this when the only problem with ->mmap is > that it misses 'flags' argument. I understand there are lots of ->mmap > implementations out there and modifying prototype of them all is painful > but is it so bad? Coccinelle patch for this should be rather easy... So it wasn't all that easy, and Linus declined to take it. I think we should add a new ->mmap_validate() file operation and save the tree-wide cleanup until later. ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <CAA9_cmc0vejxCsc1NWp5b4C0CSsO5xetF3t6LCoCuEYB6yPiwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags [not found] ` <CAA9_cmc0vejxCsc1NWp5b4C0CSsO5xetF3t6LCoCuEYB6yPiwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-09-17 17:39 ` Christoph Hellwig [not found] ` <20170917173945.GA22200-jcswGhMUV9g@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2017-09-17 17:39 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Darrick J. Wong, Arnd Bergmann, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Linux API, Linux Kernel Mailing List, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton, Christoph Hellwig On Sat, Sep 16, 2017 at 08:44:14PM -0700, Dan Williams wrote: > So it wasn't all that easy, and Linus declined to take it. I think we > should add a new ->mmap_validate() file operation and save the > tree-wide cleanup until later. Note that we already have a mmap_capabilities callout for nommu, I wonder if we could generalize that. ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20170917173945.GA22200-jcswGhMUV9g@public.gmane.org>]
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags [not found] ` <20170917173945.GA22200-jcswGhMUV9g@public.gmane.org> @ 2017-09-18 9:31 ` Jan Kara 2017-09-18 15:47 ` Dan Williams 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2017-09-18 9:31 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Arnd Bergmann, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Linux API, Darrick J. Wong, Linux Kernel Mailing List, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton On Sun 17-09-17 19:39:45, Christoph Hellwig wrote: > On Sat, Sep 16, 2017 at 08:44:14PM -0700, Dan Williams wrote: > > So it wasn't all that easy, and Linus declined to take it. I think we > > should add a new ->mmap_validate() file operation and save the > > tree-wide cleanup until later. > > Note that we already have a mmap_capabilities callout for nommu, > I wonder if we could generalize that. So if I understood Dan right, Linus refused to merge the patch which adds 'flags' argument to ->mmap callback. That is actually logically independent change from validating flags passed to mmap(2) syscall. Dan did it just to save himself from adding a VMA flag for MAP_DIRECT. For validating flags passed to mmap(2), I agree we could use ->mmap_capabilities() instead of mmap_supported_mask Dan has added. But I don't have a strong opinion there. Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-09-18 9:31 ` Jan Kara @ 2017-09-18 15:47 ` Dan Williams 0 siblings, 0 replies; 22+ messages in thread From: Dan Williams @ 2017-09-18 15:47 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Darrick J. Wong, Arnd Bergmann, linux-nvdimm, Linux API, Linux Kernel Mailing List, linux-xfs, linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton, david On Mon, Sep 18, 2017 at 2:31 AM, Jan Kara <jack@suse.cz> wrote: > On Sun 17-09-17 19:39:45, Christoph Hellwig wrote: >> On Sat, Sep 16, 2017 at 08:44:14PM -0700, Dan Williams wrote: >> > So it wasn't all that easy, and Linus declined to take it. I think we >> > should add a new ->mmap_validate() file operation and save the >> > tree-wide cleanup until later. >> >> Note that we already have a mmap_capabilities callout for nommu, >> I wonder if we could generalize that. > > So if I understood Dan right, Linus refused to merge the patch which adds > 'flags' argument to ->mmap callback. That is actually logically independent > change from validating flags passed to mmap(2) syscall. Dan did it just to > save himself from adding a VMA flag for MAP_DIRECT. > > For validating flags passed to mmap(2), I agree we could use > ->mmap_capabilities() instead of mmap_supported_mask Dan has added. But I > don't have a strong opinion there. The drawback I see with mmap_capabilities is that it requires all mmap flags to have a corresponding vm_flag. After the cold reaction the VM_DAX flag received I'd want to be sure they were on board with this direction. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-09-17 3:44 ` Dan Williams [not found] ` <CAA9_cmc0vejxCsc1NWp5b4C0CSsO5xetF3t6LCoCuEYB6yPiwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-09-18 9:26 ` Jan Kara 1 sibling, 0 replies; 22+ messages in thread From: Jan Kara @ 2017-09-18 9:26 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, Darrick J. Wong, Arnd Bergmann, linux-nvdimm, Linux API, Linux Kernel Mailing List, linux-xfs, linux-mm, Andy Lutomirski, linux-fsdevel, Andrew Morton, Christoph Hellwig On Sat 16-09-17 20:44:14, Dan Williams wrote: > On Tue, Aug 15, 2017 at 5:27 AM, Jan Kara <jack@suse.cz> wrote: > > On Mon 14-08-17 23:12:16, Dan Williams wrote: > >> The mmap syscall suffers from the ABI anti-pattern of not validating > >> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a > >> mechanism to define new behavior that is known to fail on older kernels > >> without the feature. Use the fact that specifying MAP_SHARED and > >> MAP_PRIVATE at the same time is invalid as a cute hack to allow a new > >> set of validated flags to be introduced. > >> > >> This also introduces the ->fmmap() file operation that is ->mmap() plus > >> flags. Each ->fmmap() implementation must fail requests when a locally > >> unsupported flag is specified. > > ... > >> diff --git a/include/linux/fs.h b/include/linux/fs.h > >> index 1104e5df39ef..bbe755d0caee 100644 > >> --- a/include/linux/fs.h > >> +++ b/include/linux/fs.h > >> @@ -1674,6 +1674,7 @@ struct file_operations { > >> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); > >> long (*compat_ioctl) (struct file *, unsigned int, unsigned long); > >> int (*mmap) (struct file *, struct vm_area_struct *); > >> + int (*fmmap) (struct file *, struct vm_area_struct *, unsigned long); > >> int (*open) (struct inode *, struct file *); > >> int (*flush) (struct file *, fl_owner_t id); > >> int (*release) (struct inode *, struct file *); > >> @@ -1748,6 +1749,12 @@ static inline int call_mmap(struct file *file, struct vm_area_struct *vma) > >> return file->f_op->mmap(file, vma); > >> } > >> > >> +static inline int call_fmmap(struct file *file, struct vm_area_struct *vma, > >> + unsigned long flags) > >> +{ > >> + return file->f_op->fmmap(file, vma, flags); > >> +} > >> + > > > > Hum, I dislike a new file op for this when the only problem with ->mmap is > > that it misses 'flags' argument. I understand there are lots of ->mmap > > implementations out there and modifying prototype of them all is painful > > but is it so bad? Coccinelle patch for this should be rather easy... > > So it wasn't all that easy, and Linus declined to take it. I think we > should add a new ->mmap_validate() file operation and save the > tree-wide cleanup until later. Well, we don't even strictly need the flags passed to ->mmap callback if we are willing to use VMA flags. I want to use it for MAP_SYNC anyway... So bumping vma->flags to u64 and using a flag is also an option (and frankly I'd personally just go for that). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-08-15 6:12 ` [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags Dan Williams 2017-08-15 12:27 ` Jan Kara @ 2017-08-15 16:28 ` Andy Lutomirski 2017-08-15 22:31 ` Dan Williams 2017-08-17 8:06 ` kbuild test robot 2 siblings, 1 reply; 22+ messages in thread From: Andy Lutomirski @ 2017-08-15 16:28 UTC (permalink / raw) To: Dan Williams Cc: Darrick J. Wong, Jan Kara, Arnd Bergmann, linux-nvdimm, Linux API, linux-kernel, linux-xfs, linux-mm, Andrew Lutomirski, Linux FS Devel, Andrew Morton, Christoph Hellwig On Mon, Aug 14, 2017 at 11:12 PM, Dan Williams <dan.j.williams@intel.com> wrote: > The mmap syscall suffers from the ABI anti-pattern of not validating > unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a > mechanism to define new behavior that is known to fail on older kernels > without the feature. Use the fact that specifying MAP_SHARED and > MAP_PRIVATE at the same time is invalid as a cute hack to allow a new > set of validated flags to be introduced. While this is cute, is it actually better than a new syscall? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-08-15 16:28 ` Andy Lutomirski @ 2017-08-15 22:31 ` Dan Williams 0 siblings, 0 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 22:31 UTC (permalink / raw) To: Andy Lutomirski Cc: Darrick J. Wong, Jan Kara, Arnd Bergmann, linux-nvdimm, Linux API, linux-kernel, linux-xfs, linux-mm, Linux FS Devel, Andrew Morton, Christoph Hellwig On Tue, Aug 15, 2017 at 9:28 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Mon, Aug 14, 2017 at 11:12 PM, Dan Williams <dan.j.williams@intel.com> wrote: >> The mmap syscall suffers from the ABI anti-pattern of not validating >> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a >> mechanism to define new behavior that is known to fail on older kernels >> without the feature. Use the fact that specifying MAP_SHARED and >> MAP_PRIVATE at the same time is invalid as a cute hack to allow a new >> set of validated flags to be introduced. > > While this is cute, is it actually better than a new syscall? After playing with MAP_DIRECT defined as (MAP_SHARED|MAP_PRIVATE|0x40) I think a new syscall is better. It's very easy to make the mistake that "MAP_DIRECT" defines a single flag vs representing a multi-bit encoding. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags 2017-08-15 6:12 ` [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags Dan Williams 2017-08-15 12:27 ` Jan Kara 2017-08-15 16:28 ` Andy Lutomirski @ 2017-08-17 8:06 ` kbuild test robot 2 siblings, 0 replies; 22+ messages in thread From: kbuild test robot @ 2017-08-17 8:06 UTC (permalink / raw) To: Dan Williams Cc: kbuild-all, darrick.wong, Jan Kara, Arnd Bergmann, linux-nvdimm, linux-api, linux-kernel, linux-xfs, linux-mm, luto, linux-fsdevel, Andrew Morton, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 6397 bytes --] Hi Dan, [auto build test ERROR on linus/master] [also build test ERROR on v4.13-rc5 next-20170816] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dan-Williams/fs-xfs-introduce-S_IOMAP_SEALED/20170817-114711 config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 4.9.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All errors (new ones prefixed by >>): mm/mmap.c: In function 'do_mmap': >> mm/mmap.c:1391:8: error: 'MAP_VALIDATE' undeclared (first use in this function) case MAP_VALIDATE: ^ mm/mmap.c:1391:8: note: each undeclared identifier is reported only once for each function it appears in vim +/MAP_VALIDATE +1391 mm/mmap.c 1316 1317 /* 1318 * The caller must hold down_write(¤t->mm->mmap_sem). 1319 */ 1320 unsigned long do_mmap(struct file *file, unsigned long addr, 1321 unsigned long len, unsigned long prot, 1322 unsigned long flags, vm_flags_t vm_flags, 1323 unsigned long pgoff, unsigned long *populate, 1324 struct list_head *uf) 1325 { 1326 struct mm_struct *mm = current->mm; 1327 int pkey = 0; 1328 1329 *populate = 0; 1330 1331 if (!len) 1332 return -EINVAL; 1333 1334 /* 1335 * Does the application expect PROT_READ to imply PROT_EXEC? 1336 * 1337 * (the exception is when the underlying filesystem is noexec 1338 * mounted, in which case we dont add PROT_EXEC.) 1339 */ 1340 if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC)) 1341 if (!(file && path_noexec(&file->f_path))) 1342 prot |= PROT_EXEC; 1343 1344 if (!(flags & MAP_FIXED)) 1345 addr = round_hint_to_min(addr); 1346 1347 /* Careful about overflows.. */ 1348 len = PAGE_ALIGN(len); 1349 if (!len) 1350 return -ENOMEM; 1351 1352 /* offset overflow? */ 1353 if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) 1354 return -EOVERFLOW; 1355 1356 /* Too many mappings? */ 1357 if (mm->map_count > sysctl_max_map_count) 1358 return -ENOMEM; 1359 1360 /* Obtain the address to map to. we verify (or select) it and ensure 1361 * that it represents a valid section of the address space. 1362 */ 1363 addr = get_unmapped_area(file, addr, len, pgoff, flags); 1364 if (offset_in_page(addr)) 1365 return addr; 1366 1367 if (prot == PROT_EXEC) { 1368 pkey = execute_only_pkey(mm); 1369 if (pkey < 0) 1370 pkey = 0; 1371 } 1372 1373 /* Do simple checking here so the lower-level routines won't have 1374 * to. we assume access permissions have been handled by the open 1375 * of the memory object, so we don't do any here. 1376 */ 1377 vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | 1378 mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; 1379 1380 if (flags & MAP_LOCKED) 1381 if (!can_do_mlock()) 1382 return -EPERM; 1383 1384 if (mlock_future_check(mm, vm_flags, len)) 1385 return -EAGAIN; 1386 1387 if (file) { 1388 struct inode *inode = file_inode(file); 1389 1390 switch (flags & MAP_TYPE) { > 1391 case MAP_VALIDATE: 1392 if (flags & ~(MAP_SUPPORTED_MASK | MAP_VALIDATE)) 1393 return -EINVAL; 1394 if (!file->f_op->fmmap) 1395 return -EOPNOTSUPP; 1396 /* fall through */ 1397 case MAP_SHARED: 1398 if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) 1399 return -EACCES; 1400 1401 /* 1402 * Make sure we don't allow writing to an append-only 1403 * file.. 1404 */ 1405 if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE)) 1406 return -EACCES; 1407 1408 /* 1409 * Make sure there are no mandatory locks on the file. 1410 */ 1411 if (locks_verify_locked(file)) 1412 return -EAGAIN; 1413 1414 vm_flags |= VM_SHARED | VM_MAYSHARE; 1415 if (!(file->f_mode & FMODE_WRITE)) 1416 vm_flags &= ~(VM_MAYWRITE | VM_SHARED); 1417 1418 /* fall through */ 1419 case MAP_PRIVATE: 1420 if (!(file->f_mode & FMODE_READ)) 1421 return -EACCES; 1422 if (path_noexec(&file->f_path)) { 1423 if (vm_flags & VM_EXEC) 1424 return -EPERM; 1425 vm_flags &= ~VM_MAYEXEC; 1426 } 1427 1428 if (!file->f_op->mmap) 1429 return -ENODEV; 1430 if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) 1431 return -EINVAL; 1432 break; 1433 1434 default: 1435 return -EINVAL; 1436 } 1437 } else { 1438 switch (flags & MAP_TYPE) { 1439 case MAP_SHARED: 1440 if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) 1441 return -EINVAL; 1442 /* 1443 * Ignore pgoff. 1444 */ 1445 pgoff = 0; 1446 vm_flags |= VM_SHARED | VM_MAYSHARE; 1447 break; 1448 case MAP_PRIVATE: 1449 /* 1450 * Set pgoff according to addr for anon_vma. 1451 */ 1452 pgoff = addr >> PAGE_SHIFT; 1453 break; 1454 default: 1455 return -EINVAL; 1456 } 1457 } 1458 1459 /* 1460 * Set 'VM_NORESERVE' if we should not account for the 1461 * memory use of this mapping. 1462 */ 1463 if (flags & MAP_NORESERVE) { 1464 /* We honor MAP_NORESERVE if allowed to overcommit */ 1465 if (sysctl_overcommit_memory != OVERCOMMIT_NEVER) 1466 vm_flags |= VM_NORESERVE; 1467 1468 /* hugetlb applies strict overcommit unless MAP_NORESERVE */ 1469 if (file && is_file_hugepages(file)) 1470 vm_flags |= VM_NORESERVE; 1471 } 1472 1473 if ((flags & MAP_VALIDATE) == MAP_VALIDATE) 1474 flags &= MAP_SUPPORTED_MASK; 1475 else 1476 flags = 0; 1477 1478 addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags); 1479 if (!IS_ERR_VALUE(addr) && 1480 ((vm_flags & VM_LOCKED) || 1481 (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) 1482 *populate = len; 1483 return addr; 1484 } 1485 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 50926 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges 2017-08-15 6:12 [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dan Williams 2017-08-15 6:12 ` [PATCH v4 1/3] fs, xfs: introduce S_IOMAP_SEALED Dan Williams [not found] ` <150277752553.23945.13932394738552748440.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-08-15 6:12 ` Dan Williams 2017-08-15 9:18 ` Kirill A. Shutemov [not found] ` <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-08-15 9:01 ` [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dave Chinner 3 siblings, 2 replies; 22+ messages in thread From: Dan Williams @ 2017-08-15 6:12 UTC (permalink / raw) To: darrick.wong Cc: Jan Kara, linux-nvdimm, linux-api, Dave Chinner, linux-kernel, linux-xfs, linux-mm, Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Ross Zwisler, Christoph Hellwig MAP_DIRECT is an mmap(2) flag with the following semantics: MAP_DIRECT In addition to this mapping having MAP_SHARED semantics, successful faults in this range may assume that the block map (logical-file-offset to physical memory address) is pinned for the lifetime of the mapping. Successful MAP_DIRECT faults establish mappings that bypass any kernel indirections like the page-cache. All updates are carried directly through to the underlying file physical blocks (modulo cpu cache effects). ETXTBSY is returned on attempts to change the block map (allocate blocks / convert unwritten extents / break shared extents) in the mapped range. Some filesystems may extend these same restrictions outside the mapped range and return ETXTBSY to any file operations that might mutate the block map. MAP_DIRECT faults may fail with a SIGSEGV if the filesystem needs to write the block map to satisfy the fault. For example, if the mapping was established over a hole in a sparse file. The kernel ignores attempts to mark a MAP_DIRECT mapping MAP_PRIVATE and will silently fall back to MAP_SHARED semantics. ERRORS EACCES A MAP_DIRECT mapping was requested and PROT_WRITE was not set. EINVAL MAP_ANONYMOUS was specified with MAP_DIRECT. EOPNOTSUPP The filesystem explicitly does not support the flag SIGSEGV Attempted to write a MAP_DIRECT mapping at a file offset that might require block-map updates. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/dax.c | 2 + fs/xfs/xfs_file.c | 109 ++++++++++++++++++++++++++++++++ fs/xfs/xfs_inode.h | 1 fs/xfs/xfs_super.c | 1 include/linux/mm_types.h | 1 include/linux/mman.h | 2 - include/uapi/asm-generic/mman-common.h | 1 mm/mmap.c | 2 + 8 files changed, 117 insertions(+), 2 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 306c2b603fb8..a654b2dd9016 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1121,6 +1121,8 @@ static int dax_fault_return(int error) return VM_FAULT_NOPAGE; if (error == -ENOMEM) return VM_FAULT_OOM; + if (error == -ETXTBSY) + return VM_FAULT_SIGSEGV; return VM_FAULT_SIGBUS; } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index c4893e226fd8..fcdf6d5768aa 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -40,6 +40,7 @@ #include "xfs_iomap.h" #include "xfs_reflink.h" +#include <linux/mman.h> #include <linux/dcache.h> #include <linux/falloc.h> #include <linux/pagevec.h> @@ -1001,6 +1002,23 @@ xfs_file_llseek( return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); } +STATIC int +xfs_vma_checks( + struct vm_area_struct *vma, + struct inode *inode) +{ + if ((vma->fs_flags & MAP_DIRECT) != MAP_DIRECT) + return 0; + + if (xfs_is_reflink_inode(XFS_I(inode))) + return VM_FAULT_SIGSEGV; + + if (!IS_DAX(inode)) + return VM_FAULT_SIGSEGV; + + return 0; +} + /* * Locking for serialisation of IO during page faults. This results in a lock * ordering of: @@ -1031,6 +1049,10 @@ xfs_filemap_page_mkwrite( file_update_time(vmf->vma->vm_file); xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + ret = xfs_vma_checks(vmf->vma, inode); + if (ret) + goto out_unlock; + if (IS_DAX(inode)) { ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops); } else { @@ -1038,6 +1060,7 @@ xfs_filemap_page_mkwrite( ret = block_page_mkwrite_return(ret); } +out_unlock: xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); sb_end_pagefault(inode->i_sb); @@ -1058,10 +1081,15 @@ xfs_filemap_fault( return xfs_filemap_page_mkwrite(vmf); xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + ret = xfs_vma_checks(vmf->vma, inode); + if (ret) + goto out_unlock; + if (IS_DAX(inode)) ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops); else ret = filemap_fault(vmf); +out_unlock: xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); return ret; @@ -1094,7 +1122,9 @@ xfs_filemap_huge_fault( } xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops); + ret = xfs_vma_checks(vmf->vma, inode); + if (ret == 0) + ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops); xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (vmf->flags & FAULT_FLAG_WRITE) @@ -1137,12 +1167,63 @@ xfs_filemap_pfn_mkwrite( } +STATIC void +xfs_filemap_open( + struct vm_area_struct *vma) +{ + struct file *filp = vma->vm_file; + struct inode *inode = file_inode(filp); + struct xfs_inode *ip = XFS_I(inode); + + if ((vma->fs_flags & MAP_DIRECT) != MAP_DIRECT) + return; + atomic_inc(&ip->i_mapdcount); +} + +STATIC int +atomic_dec_and_xfs_ilock( + atomic_t *atomic, + struct xfs_inode *ip, + uint lock_flags) +{ + /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */ + if (atomic_add_unless(atomic, -1, 1)) + return 0; + + /* Otherwise do it the slow way */ + xfs_ilock(ip, lock_flags); + if (atomic_dec_and_test(atomic)) + return 1; + xfs_iunlock(ip, lock_flags); + return 0; +} + +STATIC void +xfs_filemap_close( + struct vm_area_struct *vma) +{ + struct file *filp = vma->vm_file; + struct inode *inode = file_inode(filp); + struct xfs_inode *ip = XFS_I(inode); + + if ((vma->fs_flags & MAP_DIRECT) != MAP_DIRECT) + return; + + if (!atomic_dec_and_xfs_ilock(&ip->i_mapdcount, ip, + XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL)) + return; + inode->i_flags &= ~S_IOMAP_SEALED; + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL); +} + static const struct vm_operations_struct xfs_file_vm_ops = { .fault = xfs_filemap_fault, .huge_fault = xfs_filemap_huge_fault, .map_pages = filemap_map_pages, .page_mkwrite = xfs_filemap_page_mkwrite, .pfn_mkwrite = xfs_filemap_pfn_mkwrite, + .open = xfs_filemap_open, + .close = xfs_filemap_close, }; STATIC int @@ -1157,6 +1238,31 @@ xfs_file_mmap( return 0; } +#define XFS_MAP_SUPPORTED (MAP_DIRECT) + +STATIC int +xfs_file_fmmap( + struct file *filp, + struct vm_area_struct *vma, + unsigned long flags) +{ + struct inode *inode = file_inode(filp); + struct xfs_inode *ip = XFS_I(inode); + + if (flags & ~(XFS_MAP_SUPPORTED)) + return -EOPNOTSUPP; + + xfs_ilock(ip, XFS_MMAPLOCK_EXCL|XFS_IOLOCK_EXCL); + if ((flags & MAP_DIRECT) == MAP_DIRECT) { + vma->fs_flags |= MAP_DIRECT; + inode->i_flags |= S_IOMAP_SEALED; + atomic_inc(&ip->i_mapdcount); + } + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL|XFS_IOLOCK_EXCL); + + return xfs_file_mmap(filp, vma); +} + const struct file_operations xfs_file_operations = { .llseek = xfs_file_llseek, .read_iter = xfs_file_read_iter, @@ -1168,6 +1274,7 @@ const struct file_operations xfs_file_operations = { .compat_ioctl = xfs_file_compat_ioctl, #endif .mmap = xfs_file_mmap, + .fmmap = xfs_file_fmmap, .open = xfs_file_open, .release = xfs_file_release, .fsync = xfs_file_fsync, diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 0ee453de239a..50d3e1bca1a9 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -58,6 +58,7 @@ typedef struct xfs_inode { mrlock_t i_lock; /* inode lock */ mrlock_t i_mmaplock; /* inode mmap IO lock */ atomic_t i_pincount; /* inode pin count */ + atomic_t i_mapdcount; /* inode MAP_DIRECT count */ spinlock_t i_flags_lock; /* inode i_flags lock */ /* Miscellaneous state. */ unsigned long i_flags; /* see defined flags below */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 664db709cd1a..2604568354db 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1011,6 +1011,7 @@ xfs_fs_inode_init_once( /* xfs inode */ atomic_set(&ip->i_pincount, 0); + atomic_set(&ip->i_mapdcount, 0); spin_lock_init(&ip->i_flags_lock); mrlock_init(&ip->i_mmaplock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index ff151814a02d..73fdc0ada9ee 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -306,6 +306,7 @@ struct vm_area_struct { struct mm_struct *vm_mm; /* The address space we belong to. */ pgprot_t vm_page_prot; /* Access permissions of this VMA. */ unsigned long vm_flags; /* Flags, see mm.h. */ + unsigned long fs_flags; /* fs flags, see MAP_DIRECT etc */ /* * For areas with an address space and backing store, diff --git a/include/linux/mman.h b/include/linux/mman.h index 73d4ac7e7136..dc120995f684 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -8,7 +8,7 @@ #include <uapi/linux/mman.h> /* the MAP_VALIDATE set of supported flags */ -#define MAP_SUPPORTED_MASK (0) +#define MAP_SUPPORTED_MASK (MAP_DIRECT) extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 8bf8c7828275..a16184402c45 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -25,6 +25,7 @@ # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */ #endif #define MAP_VALIDATE (MAP_SHARED|MAP_PRIVATE) /* mechanism to define new shared semantics */ +#define MAP_DIRECT (MAP_VALIDATE | 0x40) /* shared, sealed, and no page cache */ /* * Flags for mlock diff --git a/mm/mmap.c b/mm/mmap.c index d2919a9e25bf..f12de3859fec 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1393,6 +1393,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr, return -EINVAL; if (!file->f_op->fmmap) return -EOPNOTSUPP; + if ((flags & MAP_DIRECT) && !(prot & PROT_WRITE)) + return -EACCES; /* fall through */ case MAP_SHARED: if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges 2017-08-15 6:12 ` [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges Dan Williams @ 2017-08-15 9:18 ` Kirill A. Shutemov 2017-08-15 17:11 ` Dan Williams [not found] ` <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 1 sibling, 1 reply; 22+ messages in thread From: Kirill A. Shutemov @ 2017-08-15 9:18 UTC (permalink / raw) To: Dan Williams Cc: darrick.wong, Jan Kara, linux-nvdimm, linux-api, Dave Chinner, linux-kernel, linux-xfs, linux-mm, Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Ross Zwisler, Christoph Hellwig On Mon, Aug 14, 2017 at 11:12:22PM -0700, Dan Williams wrote: > MAP_DIRECT is an mmap(2) flag with the following semantics: > > MAP_DIRECT > In addition to this mapping having MAP_SHARED semantics, successful > faults in this range may assume that the block map (logical-file-offset > to physical memory address) is pinned for the lifetime of the mapping. > Successful MAP_DIRECT faults establish mappings that bypass any kernel > indirections like the page-cache. All updates are carried directly > through to the underlying file physical blocks (modulo cpu cache > effects). > > ETXTBSY is returned on attempts to change the block map (allocate blocks > / convert unwritten extents / break shared extents) in the mapped range. > Some filesystems may extend these same restrictions outside the mapped > range and return ETXTBSY to any file operations that might mutate the > block map. MAP_DIRECT faults may fail with a SIGSEGV if the filesystem > needs to write the block map to satisfy the fault. For example, if the > mapping was established over a hole in a sparse file. We had issues before with user-imposed ETXTBSY. See MAP_DENYWRITE. Are we sure it won't a source of denial-of-service attacks? > The kernel ignores attempts to mark a MAP_DIRECT mapping MAP_PRIVATE and > will silently fall back to MAP_SHARED semantics. Hm.. Any reason for this strage behaviour? Looks just broken to me. > > ERRORS > EACCES A MAP_DIRECT mapping was requested and PROT_WRITE was not set. > > EINVAL MAP_ANONYMOUS was specified with MAP_DIRECT. > > EOPNOTSUPP The filesystem explicitly does not support the flag > > SIGSEGV Attempted to write a MAP_DIRECT mapping at a file offset that > might require block-map updates. I think it should be SIGBUS. -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges 2017-08-15 9:18 ` Kirill A. Shutemov @ 2017-08-15 17:11 ` Dan Williams 2017-08-16 10:25 ` Kirill A. Shutemov 0 siblings, 1 reply; 22+ messages in thread From: Dan Williams @ 2017-08-15 17:11 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Darrick J. Wong, Jan Kara, linux-nvdimm, Linux API, Dave Chinner, linux-kernel, linux-xfs, Linux MM, Jeff Moyer, Alexander Viro, Andy Lutomirski, linux-fsdevel, Ross Zwisler, Christoph Hellwig On Tue, Aug 15, 2017 at 2:18 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Mon, Aug 14, 2017 at 11:12:22PM -0700, Dan Williams wrote: >> MAP_DIRECT is an mmap(2) flag with the following semantics: >> >> MAP_DIRECT >> In addition to this mapping having MAP_SHARED semantics, successful >> faults in this range may assume that the block map (logical-file-offset >> to physical memory address) is pinned for the lifetime of the mapping. >> Successful MAP_DIRECT faults establish mappings that bypass any kernel >> indirections like the page-cache. All updates are carried directly >> through to the underlying file physical blocks (modulo cpu cache >> effects). >> >> ETXTBSY is returned on attempts to change the block map (allocate blocks >> / convert unwritten extents / break shared extents) in the mapped range. >> Some filesystems may extend these same restrictions outside the mapped >> range and return ETXTBSY to any file operations that might mutate the >> block map. MAP_DIRECT faults may fail with a SIGSEGV if the filesystem >> needs to write the block map to satisfy the fault. For example, if the >> mapping was established over a hole in a sparse file. > > We had issues before with user-imposed ETXTBSY. See MAP_DENYWRITE. > > Are we sure it won't a source of denial-of-service attacks? I believe MAP_DENYWRITE allowed any application with read access to be able to deny writes which is obviously problematic. MAP_DIRECT is different. You need write access to the file so you can already destroy data that another application might depend on, and this only blocks allocation and reflink. However, I'm not opposed to adding more safety around this. I think we can address this concern with an fcntl seal as Dave suggests, but the seal only applies to the 'struct file' instance and only gates whether MAP_DIRECT is allowed on that file. The act of setting F_MAY_SEAL_IOMAP requires CAP_IMMUTABLE, but MAP_DIRECT does not. This allows the 'permission to mmap(MAP_DIRECT)' to be passed around with an open file descriptor. > >> The kernel ignores attempts to mark a MAP_DIRECT mapping MAP_PRIVATE and >> will silently fall back to MAP_SHARED semantics. > > Hm.. Any reason for this strage behaviour? Looks just broken to me. > >> >> ERRORS >> EACCES A MAP_DIRECT mapping was requested and PROT_WRITE was not set. >> >> EINVAL MAP_ANONYMOUS was specified with MAP_DIRECT. >> >> EOPNOTSUPP The filesystem explicitly does not support the flag >> >> SIGSEGV Attempted to write a MAP_DIRECT mapping at a file offset that >> might require block-map updates. > > I think it should be SIGBUS. Ok, that does seem to fit this definition from the mmap(2) man page: SIGBUS Attempted access to a portion of the buffer that does not correspond to the file (for example, beyond the end of the file, including the case where another process has truncated the file). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges 2017-08-15 17:11 ` Dan Williams @ 2017-08-16 10:25 ` Kirill A. Shutemov 0 siblings, 0 replies; 22+ messages in thread From: Kirill A. Shutemov @ 2017-08-16 10:25 UTC (permalink / raw) To: Dan Williams Cc: Darrick J. Wong, Jan Kara, linux-nvdimm, Linux API, Dave Chinner, linux-kernel, linux-xfs, Linux MM, Jeff Moyer, Alexander Viro, Andy Lutomirski, linux-fsdevel, Ross Zwisler, Christoph Hellwig On Tue, Aug 15, 2017 at 10:11:27AM -0700, Dan Williams wrote: > > We had issues before with user-imposed ETXTBSY. See MAP_DENYWRITE. > > > > Are we sure it won't a source of denial-of-service attacks? > > I believe MAP_DENYWRITE allowed any application with read access to be > able to deny writes which is obviously problematic. MAP_DIRECT is > different. You need write access to the file so you can already > destroy data that another application might depend on, and this only > blocks allocation and reflink. > > However, I'm not opposed to adding more safety around this. I think we > can address this concern with an fcntl seal as Dave suggests, but the > seal only applies to the 'struct file' instance and only gates whether > MAP_DIRECT is allowed on that file. The act of setting > F_MAY_SEAL_IOMAP requires CAP_IMMUTABLE, but MAP_DIRECT does not. This > allows the 'permission to mmap(MAP_DIRECT)' to be passed around with > an open file descriptor. Sounds like a good approach to me. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges [not found] ` <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> @ 2017-08-15 12:42 ` Jan Kara [not found] ` <20170815124250.GG27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2017-08-17 8:49 ` kbuild test robot 1 sibling, 1 reply; 22+ messages in thread From: Jan Kara @ 2017-08-15 12:42 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Alexander Viro, luto-DgEjT+Ai2ygdnm+yROfE0A, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Mon 14-08-17 23:12:22, Dan Williams wrote: > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index ff151814a02d..73fdc0ada9ee 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -306,6 +306,7 @@ struct vm_area_struct { > struct mm_struct *vm_mm; /* The address space we belong to. */ > pgprot_t vm_page_prot; /* Access permissions of this VMA. */ > unsigned long vm_flags; /* Flags, see mm.h. */ > + unsigned long fs_flags; /* fs flags, see MAP_DIRECT etc */ > > /* > * For areas with an address space and backing store, Ah, OK, here are VMA flags I was missing in the previous patch :) But why did you create separate fs_flags field for this? on 64-bit archs there's still space in vm_flags and frankly I don't see why we should separate MAP_DIRECT or MAP_SYNC from other flags? After all a difference in these flags must also prevent VMA merging (which you forgot to handle I think) and they need to be copied on split (which happens by chance even now). Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <20170815124250.GG27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges [not found] ` <20170815124250.GG27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> @ 2017-08-15 16:29 ` Dan Williams [not found] ` <CAPcyv4h01os0Gc6bYmaGdMXt5q4G4zfirNRPWG3=gQi5POrpmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 22+ messages in thread From: Dan Williams @ 2017-08-15 16:29 UTC (permalink / raw) To: Jan Kara Cc: Darrick J. Wong, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Linux API, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer, Alexander Viro, Andy Lutomirski, linux-fsdevel, Ross Zwisler, Christoph Hellwig On Tue, Aug 15, 2017 at 5:42 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote: > On Mon 14-08-17 23:12:22, Dan Williams wrote: >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index ff151814a02d..73fdc0ada9ee 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -306,6 +306,7 @@ struct vm_area_struct { >> struct mm_struct *vm_mm; /* The address space we belong to. */ >> pgprot_t vm_page_prot; /* Access permissions of this VMA. */ >> unsigned long vm_flags; /* Flags, see mm.h. */ >> + unsigned long fs_flags; /* fs flags, see MAP_DIRECT etc */ >> >> /* >> * For areas with an address space and backing store, > > Ah, OK, here are VMA flags I was missing in the previous patch :) But why > did you create separate fs_flags field for this? on 64-bit archs there's > still space in vm_flags and frankly I don't see why we should separate > MAP_DIRECT or MAP_SYNC from other flags? Where would MAP_DIRECT go in the 32-bit case? > After all a difference in these > flags must also prevent VMA merging (which you forgot to handle I think) > and they need to be copied on split (which happens by chance even now). Ah, yes I did miss blocking the merge of a vma with MAP_DIRECT and one without. However, the vma split path looks ok. ^ permalink raw reply [flat|nested] 22+ messages in thread
[parent not found: <CAPcyv4h01os0Gc6bYmaGdMXt5q4G4zfirNRPWG3=gQi5POrpmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges [not found] ` <CAPcyv4h01os0Gc6bYmaGdMXt5q4G4zfirNRPWG3=gQi5POrpmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2017-08-16 1:15 ` Dan Williams 0 siblings, 0 replies; 22+ messages in thread From: Dan Williams @ 2017-08-16 1:15 UTC (permalink / raw) To: Jan Kara Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Linux API, Darrick J. Wong, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, Linux MM, Alexander Viro, Andy Lutomirski, linux-fsdevel, Christoph Hellwig On Tue, Aug 15, 2017 at 9:29 AM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote: > On Tue, Aug 15, 2017 at 5:42 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote: >> On Mon 14-08-17 23:12:22, Dan Williams wrote: >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index ff151814a02d..73fdc0ada9ee 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -306,6 +306,7 @@ struct vm_area_struct { >>> struct mm_struct *vm_mm; /* The address space we belong to. */ >>> pgprot_t vm_page_prot; /* Access permissions of this VMA. */ >>> unsigned long vm_flags; /* Flags, see mm.h. */ >>> + unsigned long fs_flags; /* fs flags, see MAP_DIRECT etc */ >>> >>> /* >>> * For areas with an address space and backing store, >> >> Ah, OK, here are VMA flags I was missing in the previous patch :) But why >> did you create separate fs_flags field for this? on 64-bit archs there's >> still space in vm_flags and frankly I don't see why we should separate >> MAP_DIRECT or MAP_SYNC from other flags? > > Where would MAP_DIRECT go in the 32-bit case? > >> After all a difference in these >> flags must also prevent VMA merging (which you forgot to handle I think) >> and they need to be copied on split (which happens by chance even now). > > Ah, yes I did miss blocking the merge of a vma with MAP_DIRECT and one > without. However, the vma split path looks ok. The merge path already blocks merging vmas that have the ->close() operation defined in is_mergeable_vma(). ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges [not found] ` <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-08-15 12:42 ` Jan Kara @ 2017-08-17 8:49 ` kbuild test robot 1 sibling, 0 replies; 22+ messages in thread From: kbuild test robot @ 2017-08-17 8:49 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, linux-api-u79uwXL29TY76Z2rM5mHXA, darrick.wong-QHcLZuEGTsvQT0dZR+AlfA, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-xfs-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kbuild-all-JC7UmRfGjtg, luto-DgEjT+Ai2ygdnm+yROfE0A, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Alexander Viro Hi Dan, [auto build test ERROR on linus/master] [also build test ERROR on v4.13-rc5 next-20170816] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Dan-Williams/fs-xfs-introduce-S_IOMAP_SEALED/20170817-114711 config: xtensa-allmodconfig (attached as .config) compiler: xtensa-linux-gcc (GCC) 4.9.0 reproduce: wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=xtensa All error/warnings (new ones prefixed by >>): mm/mmap.c: In function 'do_mmap': mm/mmap.c:1391:8: error: 'MAP_VALIDATE' undeclared (first use in this function) case MAP_VALIDATE: ^ mm/mmap.c:1391:8: note: each undeclared identifier is reported only once for each function it appears in In file included from mm/mmap.c:17:0: >> include/linux/mman.h:11:29: error: 'MAP_DIRECT' undeclared (first use in this function) #define MAP_SUPPORTED_MASK (MAP_DIRECT) ^ >> mm/mmap.c:1392:18: note: in expansion of macro 'MAP_SUPPORTED_MASK' if (flags & ~(MAP_SUPPORTED_MASK | MAP_VALIDATE)) ^ vim +/MAP_DIRECT +11 include/linux/mman.h 9 10 /* the MAP_VALIDATE set of supported flags */ > 11 #define MAP_SUPPORTED_MASK (MAP_DIRECT) 12 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v4 0/3] MAP_DIRECT and block-map sealed files 2017-08-15 6:12 [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dan Williams ` (2 preceding siblings ...) 2017-08-15 6:12 ` [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges Dan Williams @ 2017-08-15 9:01 ` Dave Chinner 3 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2017-08-15 9:01 UTC (permalink / raw) To: Dan Williams Cc: darrick.wong, Jan Kara, Arnd Bergmann, linux-nvdimm, linux-api, linux-kernel, Christoph Hellwig, linux-xfs, linux-mm, Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Andrew Morton, Ross Zwisler On Mon, Aug 14, 2017 at 11:12:05PM -0700, Dan Williams wrote: > Changes since v3 [1]: > * Move from an fallocate(2) interface to a new mmap(2) flag and rename > 'immutable' to 'sealed'. > > * Do not record the sealed state in permanent metadata it is now purely > a temporary state for as long as a MAP_DIRECT vma is referencing the > inode (Christoph) > > * Drop the CAP_IMMUTABLE requirement, but do require a PROT_WRITE > mapping. > > [1]: https://lwn.net/Articles/730570/ > > --- > > This is the next revision of a patch series that aims to enable > applications that otherwise need to resort to DAX mapping a raw device > file to instead move to a filesystem. > > In the course of reviewing a previous posting, Christoph said: > > That being said I think we absolutely should support RDMA memory > registrations for DAX mappings. I'm just not sure how S_IOMAP_IMMUTABLE > helps with that. We'll want a MAP_SYNC | MAP_POPULATE to make sure all > the blocks are populated and all ptes are set up. Second we need to > make sure get_user_page works, which for now means we'll need a struct > page mapping for the region (which will be really annoying for PCIe > mappings, like the upcoming NVMe persistent memory region), and we need > to guarantee that the extent mapping won't change while the > get_user_pages holds the pages inside it. I think that is true due to > side effects even with the current DAX code, but we'll need to make it > explicit. And maybe that's where we need to converge - "sealing" the > extent map makes sense as such a temporary measure that is not persisted > on disk, which automatically gets released when the holding process > exits, because we sort of already do this implicitly. It might also > make sense to have explicitly breakable seals similar to what I do for > the pNFS blocks kernel server, as any userspace RDMA file server would > also need those semantics. > > So, this is an attempt to converge on the idea that we need an explicit > and process-lifetime-temporary mechanism for a process to be able to > make assumptions about the mapping to physical page to dax-file-offset > relationship. The "explicitly breakable seals" aspect is not addressed > in these patches, but I wonder if it might be a voluntary mechanism that > can implemented via userfaultfd. > > These pass a basic smoke test and are meant to just gauge 'right track' > / 'wrong track'. The main question it seems is whether the pinning done > in this patchset is too early (applies before get_user_pages()) and too > coarse (applies to the whole file). Perhaps this is where I discarded > too easily Jan's suggestion to look at Peter Z's mm_mpin() syscall [2]? On > the other hand, the coarseness and simple lifetime rules of MAP_DIRECT > make it an easy mechanism to implement and explain. > > Another reason I kept the scope of S_IOMAP_SEALED coarsely defined was > to support Dave's desired use case of sealing for operating on reflinked > files [3]. Which really needs a fcntl() interface to set/clear iomap seals. Which, now that I look at it, already has a bunch of "file sealing" commands defined which arrived in 3.17. It appears to be a special purpose access control interface for memfd_create() to manage shared access to anonymous tmpfs files and will EINVAL on any fd that points to a real file. Oh, even more problematic: Seals are a property of an inode. [....] Furthermore, seals can never be removed, only added. That seems somewhat difficult to reconcile with how I need F_SEAL_IOMAP to operate. /me calls it a day and goes looking for the hard liquor..... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2017-09-18 15:47 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-08-15 6:12 [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dan Williams 2017-08-15 6:12 ` [PATCH v4 1/3] fs, xfs: introduce S_IOMAP_SEALED Dan Williams [not found] ` <150277752553.23945.13932394738552748440.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-08-15 6:12 ` [PATCH v4 2/3] mm: introduce MAP_VALIDATE a mechanism for adding new mmap flags Dan Williams 2017-08-15 12:27 ` Jan Kara [not found] ` <20170815122701.GF27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2017-08-15 16:24 ` Dan Williams 2017-09-17 3:44 ` Dan Williams [not found] ` <CAA9_cmc0vejxCsc1NWp5b4C0CSsO5xetF3t6LCoCuEYB6yPiwQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-09-17 17:39 ` Christoph Hellwig [not found] ` <20170917173945.GA22200-jcswGhMUV9g@public.gmane.org> 2017-09-18 9:31 ` Jan Kara 2017-09-18 15:47 ` Dan Williams 2017-09-18 9:26 ` Jan Kara 2017-08-15 16:28 ` Andy Lutomirski 2017-08-15 22:31 ` Dan Williams 2017-08-17 8:06 ` kbuild test robot 2017-08-15 6:12 ` [PATCH v4 3/3] fs, xfs: introduce MAP_DIRECT for creating block-map-sealed file ranges Dan Williams 2017-08-15 9:18 ` Kirill A. Shutemov 2017-08-15 17:11 ` Dan Williams 2017-08-16 10:25 ` Kirill A. Shutemov [not found] ` <150277754211.23945.458876600578531019.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org> 2017-08-15 12:42 ` Jan Kara [not found] ` <20170815124250.GG27505-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2017-08-15 16:29 ` Dan Williams [not found] ` <CAPcyv4h01os0Gc6bYmaGdMXt5q4G4zfirNRPWG3=gQi5POrpmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2017-08-16 1:15 ` Dan Williams 2017-08-17 8:49 ` kbuild test robot 2017-08-15 9:01 ` [PATCH v4 0/3] MAP_DIRECT and block-map sealed files Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).