Linux-Fsdevel Archive on lore.kernel.org
From: ira.weiny@intel.com
To: linux-kernel@vger.kernel.org
Cc: Ira Weiny <ira.weiny@intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>,
	"Theodore Y. Ts'o" <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
Date: Fri, 10 Jan 2020 11:29:37 -0800
Message-ID: <20200110192942.25021-8-ira.weiny@intel.com>
In-Reply-To: <20200110192942.25021-1-ira.weiny@intel.com>

From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations, but many other functions
also check the IS_DAX() mode, so the mode must stay stable while they run.

DAX is a property of the inode, so define an inode mode lock as a pair of
inode operations which file systems can optionally implement.

Define the core callbacks and put the locking calls in place.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/filesystems/vfs.rst | 30 ++++++++++++++++
 fs/ioctl.c                        | 23 +++++++++----
 fs/open.c                         |  4 +++
 include/linux/fs.h                | 57 +++++++++++++++++++++++++++++--
 mm/fadvise.c                      | 10 ++++--
 mm/khugepaged.c                   |  2 ++
 mm/mmap.c                         |  7 ++++
 7 files changed, 123 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..b945aa95f15a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -59,6 +59,28 @@ like open(2) the file, or stat(2) it to peek at the inode data.  The
 stat(2) operation is fairly simple: once the VFS has the dentry, it
 peeks at the inode data and passes some of it back to userspace.
 
+Changing inode 'modes' dynamically
+----------------------------------
+
+Some file systems may have different modes for their inodes which need to
+change dynamically.  A specific example of this is DAX-enabled files in XFS
+and ext4.  To switch the mode safely, the VFS locks the inode mode in all
+"normal" file system operations and restricts mode changes to those
+operations.
+
+To support this, a file system must follow these rules:
+
+        1) The direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any mode supported by the inode.
+        2) Filesystems must define the lock_mode() and unlock_mode() operations
+           in struct inode_operations.  These functions are used by the core
+           vfs layers to ensure that the mode is stable before allowing the
+           core operations to proceed.
+        3) Mode changes shall not be allowed while the file is mmap'ed.
+        4) While changing modes, filesystems should take exclusive locks which
+           prevent the core vfs layer from proceeding.
+
+
 
 The File Object
 ---------------
@@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
 		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
 				   unsigned open_flag, umode_t create_mode);
 		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+		void (*lock_mode)(struct inode *);
+		void (*unlock_mode)(struct inode *);
 	};
 
 Again, all methods are called without any locks being held, unless
@@ -584,6 +608,12 @@ otherwise noted.
 	atomically creating, opening and unlinking a file in given
 	directory.
 
+``lock_mode``
+	called to prevent operations which depend on the inode's mode from
+	proceeding while a mode change is in progress
+
+``unlock_mode``
+	called when a critical mode-dependent operation is complete
 
 The Address Space Object
 ========================
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7c9a5df5a597..ed6ab5303a24 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
 	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = filp->f_inode;
 	int res, block;
 
+	lock_inode_mode(inode);
+
 	/* do we support this mess? */
-	if (!mapping->a_ops->bmap)
-		return -EINVAL;
-	if (!capable(CAP_SYS_RAWIO))
-		return -EPERM;
+	if (!mapping->a_ops->bmap) {
+		res = -EINVAL;
+		goto out;
+	}
+	if (!capable(CAP_SYS_RAWIO)) {
+		res = -EPERM;
+		goto out;
+	}
 	res = get_user(block, p);
 	if (res)
-		return res;
+		goto out;
 	res = mapping->a_ops->bmap(mapping, block);
-	return put_user(res, p);
+	res = put_user(res, p);
+
+out:
+	unlock_inode_mode(inode);
+	return res;
 }
 
 /**
diff --git a/fs/open.c b/fs/open.c
index b0be77ea8f1b..c62428bbc525 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	lock_inode_mode(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	unlock_inode_mode(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	lock_inode_mode(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	unlock_inode_mode(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e11989502eac..631f11d6246e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -359,6 +359,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/*
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic inode modes can be supported.  See the comment in
+ * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -1817,6 +1822,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/*
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic inode modes can be supported.  See the comment in
+ * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1859,6 +1869,20 @@ struct file_operations {
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
+/*
+ * Filesystems wishing to support dynamic inode modes must do the following.
+ *
+ * 1) The direct_IO address_space_operation must be supported in all
+ *    potential a_ops vectors for any mode supported by the inode.
+ * 2) Filesystems must define the lock_mode() and unlock_mode() operations
+ *    in struct inode_operations.  These functions are used by the core
+ *    vfs layers to ensure that the mode is stable before allowing the
+ *    core operations to proceed.
+ * 3) Mode changes shall not be allowed while the file is mmap'ed.
+ * 4) While changing modes, filesystems should take exclusive locks which
+ *    prevent the core vfs layer from proceeding.
+ *
+ */
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
@@ -1887,18 +1911,47 @@ struct inode_operations {
 			   umode_t create_mode);
 	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
+	void (*lock_mode)(struct inode *);
+	void (*unlock_mode)(struct inode *);
 } ____cacheline_aligned;
 
+static inline void lock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->lock_mode &&
+		     !inode->i_op->unlock_mode);
+	if (inode->i_op->lock_mode)
+		inode->i_op->lock_mode(inode);
+}
+static inline void unlock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->unlock_mode &&
+		     !inode->i_op->lock_mode);
+	if (inode->i_op->unlock_mode)
+		inode->i_op->unlock_mode(inode);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..a4095a5deac8 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	bdi = inode_to_bdi(mapping->host);
 
+	lock_inode_mode(inode);
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		unlock_inode_mode(inode);
+		return ret;
 	}
+	unlock_inode_mode(inode);
 
 	/*
 	 * Careful about overflows. Len == 0 means "as much as possible".  Use
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..ff49da065db0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				lock_inode_mode(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				unlock_inode_mode(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/mmap.c b/mm/mmap.c
index 70f67c4515aa..dfaf1130e706 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1542,11 +1542,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (file)
+		lock_inode_mode(file_inode(file));
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
+
+	if (file)
+		unlock_inode_mode(file_inode(file));
+
 	return addr;
 }
 
-- 
2.21.0
