linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC v2 0/2] O_NOCMTIME protected by generic mount option
@ 2015-05-15 21:23 Zach Brown
  2015-05-15 21:23 ` [PATCH RFC v2 1/2] vfs: add generic nocmtime mount flag Zach Brown
       [not found] ` <1431725028-24071-1-git-send-email-zab-ugsP4Wv/S6ZeoWH0uzbU5w@public.gmane.org>
  0 siblings, 2 replies; 4+ messages in thread
From: Zach Brown @ 2015-05-15 21:23 UTC
  To: Sage Weil, linux-fsdevel, linux-kernel, linux-api

Here's a current draft of what is now the O_NOCMTIME series.  It
implements the frequent suggestion to gate unprivileged O_NOCMTIME use
with a mount option.

This method has the advantage of being entirely runtime.  There's no
persistence that'd require updating all the tools that deal with each
file system's format.  It's also requested by each writer as it opens
the file.  Writers that know nothing of O_NOCMTIME will behave as
usual.

Another suggested method is to use inode attributes: require root to
set +nocmtime on a dir and inherit it down subdirs to new files.  This
nicely solves the unprivileged use problem without having to fiddle
with mount options but it requires touching every file system that
supports it and would prevent cmtime updates on all writes to the inode.  I
have a patch series that starts on this but haven't taken it very far.

Sage is working on spinning up some hardware to test the various dirty
inode avoidance methods at load and should have numbers soon.  That'll
tell us if lazytime isn't good enough and if any of this is worth the
trouble.

- z
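
To illustrate the per-open behaviour described above, here is a minimal
userspace sketch of a writer that asks for O_NOCMTIME and falls back to
an ordinary open when the kernel refuses with EPERM (wrong owner, or a
mount without the nocmtime option).  The helper name
open_prefer_nocmtime() and its fell_back argument are hypothetical, and
the flag value is simply the one proposed in patch 2/2; no released
header defines it.

```c
#include <errno.h>
#include <fcntl.h>

/* Value proposed by patch 2/2 for asm-generic; an assumption elsewhere. */
#ifndef O_NOCMTIME
#define O_NOCMTIME 040000000
#endif

/*
 * Hypothetical helper: try to open with O_NOCMTIME and retry without it
 * if the kernel answers EPERM.  *fell_back is set to 1 when the retry
 * path was taken and to 0 otherwise.
 *
 * A successful first open does not prove that cmtime updates are
 * suppressed: a kernel without this series may simply ignore the
 * unknown flag bit.
 */
static int open_prefer_nocmtime(const char *path, int flags, int *fell_back)
{
	int fd = open(path, flags | O_NOCMTIME);

	*fell_back = 0;
	if (fd >= 0)
		return fd;

	/* Any error other than EPERM is not about the flag; report it. */
	if (errno != EPERM)
		return -1;

	*fell_back = 1;
	return open(path, flags & ~O_NOCMTIME);
}
```

A caller would pass its usual flags, for example O_WRONLY | O_DIRECT,
and carry on with normal mtime behaviour when fell_back comes back set.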


* [PATCH RFC v2 1/2] vfs: add generic nocmtime mount flag
  2015-05-15 21:23 [PATCH RFC v2 0/2] O_NOCMTIME protected by generic mount option Zach Brown
@ 2015-05-15 21:23 ` Zach Brown
       [not found] ` <1431725028-24071-1-git-send-email-zab-ugsP4Wv/S6ZeoWH0uzbU5w@public.gmane.org>
  1 sibling, 0 replies; 4+ messages in thread
From: Zach Brown @ 2015-05-15 21:23 UTC
  To: Sage Weil, linux-fsdevel, linux-kernel, linux-api

Add the infrastructure to support a generic nocmtime mount flag.  Like
MS_NOATIME/MNT_NOATIME this can be used to support the mount option in
file systems without having to touch each file system.

This will be used to provide a privileged indication that unprivileged
apps can safely use O_NOCMTIME to prevent cmtime updates without harm.

Signed-off-by: Zach Brown <zab@zabbo.net>
---
 fs/namespace.c          | 2 ++
 fs/proc_namespace.c     | 1 +
 fs/statfs.c             | 2 ++
 include/linux/mount.h   | 1 +
 include/linux/statfs.h  | 1 +
 include/uapi/linux/fs.h | 1 +
 6 files changed, 8 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1b9e111..48be1f9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2652,6 +2652,8 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 		mnt_flags |= MNT_NOEXEC;
 	if (flags & MS_NOATIME)
 		mnt_flags |= MNT_NOATIME;
+	if (flags & MS_NOCMTIME)
+		mnt_flags |= MNT_NOCMTIME;
 	if (flags & MS_NODIRATIME)
 		mnt_flags |= MNT_NODIRATIME;
 	if (flags & MS_STRICTATIME)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 8db932d..49d7839 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -64,6 +64,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NODEV, ",nodev" },
 		{ MNT_NOEXEC, ",noexec" },
 		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NOCMTIME, ",nocmtime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
 		{ 0, NULL }
diff --git a/fs/statfs.c b/fs/statfs.c
index 083dc0a..43d3de2 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -23,6 +23,8 @@ static int flags_by_mnt(int mnt_flags)
 		flags |= ST_NOEXEC;
 	if (mnt_flags & MNT_NOATIME)
 		flags |= ST_NOATIME;
+	if (mnt_flags & MNT_NOCMTIME)
+		flags |= ST_NOCMTIME;
 	if (mnt_flags & MNT_NODIRATIME)
 		flags |= ST_NODIRATIME;
 	if (mnt_flags & MNT_RELATIME)
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c..deb458f 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -28,6 +28,7 @@ struct mnt_namespace;
 #define MNT_NODIRATIME	0x10
 #define MNT_RELATIME	0x20
 #define MNT_READONLY	0x40	/* does the user want this to be r/o? */
+#define MNT_NOCMTIME	0x80	/* allow O_NOCMTIME to stop cmtime updates */
 
 #define MNT_SHRINKABLE	0x100
 #define MNT_WRITE_HOLD	0x200
diff --git a/include/linux/statfs.h b/include/linux/statfs.h
index 0166d32..bde224e 100644
--- a/include/linux/statfs.h
+++ b/include/linux/statfs.h
@@ -39,5 +39,6 @@ struct kstatfs {
 #define ST_NOATIME	0x0400	/* do not update access times */
 #define ST_NODIRATIME	0x0800	/* do not update directory access times */
 #define ST_RELATIME	0x1000	/* update atime relative to mtime/ctime */
+#define ST_NOCMTIME	0x2000	/* allow O_NOCMTIME to stop cmtime updates */
 
 #endif
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5..af1131e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -91,6 +91,7 @@ struct inodes_stat_t {
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
+#define MS_NOCMTIME	(1<<26) /* allow O_NOCMTIME to stop cmtime updates */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOSEC	(1<<28)
-- 
2.1.0
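
The hunks above only plumb one bit through three flag namespaces.  As a
rough userspace model of that propagation (not kernel code), the sketch
below copies the three constants from the patch and mirrors the
do_mount() and flags_by_mnt() additions.  The function names
mnt_flags_from_ms() and st_flags_from_mnt() are made up for the
illustration.

```c
/* Values copied from the hunks in this patch. */
#define MS_NOCMTIME	(1 << 26)	/* mount(2) flag */
#define MNT_NOCMTIME	0x80		/* per-vfsmount flag */
#define ST_NOCMTIME	0x2000		/* statfs f_flags bit */

/* Models the do_mount() hunk: mount(2) flag to vfsmount flag. */
static int mnt_flags_from_ms(unsigned long ms_flags)
{
	int mnt_flags = 0;

	if (ms_flags & MS_NOCMTIME)
		mnt_flags |= MNT_NOCMTIME;
	return mnt_flags;
}

/* Models the flags_by_mnt() hunk: vfsmount flag to statfs flag. */
static int st_flags_from_mnt(int mnt_flags)
{
	int flags = 0;

	if (mnt_flags & MNT_NOCMTIME)
		flags |= ST_NOCMTIME;
	return flags;
}
```

Composing the two functions shows that a mount made with MS_NOCMTIME
reports ST_NOCMTIME through statfs, and that a mount made without it
reports nothing.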


* [PATCH RFC v2 2/2] vfs: add O_NOCMTIME
       [not found] ` <1431725028-24071-1-git-send-email-zab-ugsP4Wv/S6ZeoWH0uzbU5w@public.gmane.org>
@ 2015-05-15 21:23   ` Zach Brown
       [not found]     ` <1431725028-24071-3-git-send-email-zab-ugsP4Wv/S6ZeoWH0uzbU5w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Zach Brown @ 2015-05-15 21:23 UTC
  To: Sage Weil, linux-fsdevel, linux-kernel, linux-api

Add an O_NOCMTIME flag which prevents inode time updates on writes and
can greatly reduce the IO overhead of writes to allocated and
initialized regions of files.

ceph servers can have loads where they perform O_DIRECT overwrites of
allocated file data and then sync to make sure that the O_DIRECT writes
are flushed from write caches.  If the writes dirty the inode with mtime
updates then the syncs also write out the metadata needed to track the
inodes which can add significant iop and latency overhead.

The ceph servers don't use mtime at all.  They're using the local file
system as a backing store and any backups would be driven by their upper
level ceph metadata.

In simple tests an O_DIRECT|O_NOCMTIME overwriting write followed by a
sync went from 2 serial write round trips to 1 in XFS and from 4 serial
IO round trips to 1 in ext4.

file_update_time() is changed to call a file_is_nocmtime() helper which
tests the file flag in addition to testing the inode's S_NOCMTIME flag.
It doesn't check FMODE_NOCMTIME because that's only used by XFS to
trigger private flags which do other things.

O_NOCMTIME can only be used if the mount has its MNT_NOCMTIME flag set.
This requires privileged intervention to testify that mtime isn't
critical to, say, backup infrastructure or NFS server consistency
guarantees.

Signed-off-by: Zach Brown <zab@zabbo.net>
---
 fs/fcntl.c                       | 30 +++++++++++++++++++++++-------
 fs/inode.c                       |  2 +-
 fs/namei.c                       |  3 +--
 include/linux/fs.h               |  7 +++++++
 include/uapi/asm-generic/fcntl.h |  4 ++++
 5 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..eaa5d1d 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -22,12 +22,30 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/shmem_fs.h>
+#include <linux/mount.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
 #include <asm/uaccess.h>
 
-#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
+#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
+		    O_NOCMTIME)
+
+/*
+ * O_NOATIME and O_NOCMTIME can only be set by the owner or superuser.
+ * And O_NOCMTIME requires MNT_NOCMTIME.
+ */
+bool forbid_o_notime(struct inode *inode, struct vfsmount *mnt,
+		     unsigned long flags)
+{
+	if ((flags & (O_NOATIME|O_NOCMTIME)) && !inode_owner_or_capable(inode))
+		return true;
+
+	if ((flags & O_NOCMTIME) && !(mnt->mnt_flags & MNT_NOCMTIME))
+		return true;
+
+	return false;
+}
 
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
@@ -41,10 +59,8 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	if (((arg ^ filp->f_flags) & O_APPEND) && IS_APPEND(inode))
 		return -EPERM;
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if ((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME))
-		if (!inode_owner_or_capable(inode))
-			return -EPERM;
+	if (forbid_o_notime(inode, filp->f_path.mnt, arg & ~filp->f_flags))
+		return -EPERM;
 
 	/* required for strict SunOS emulation */
 	if (O_NONBLOCK != O_NDELAY)
@@ -740,7 +756,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,7 +764,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
-		__FMODE_NONOTIFY
+		__FMODE_NONOTIFY| O_NOCMTIME
 		));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/inode.c b/fs/inode.c
index ea37cd1..b643dd0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1721,7 +1721,7 @@ int file_update_time(struct file *file)
 	int ret;
 
 	/* First try to exhaust all avenues to not sync */
-	if (IS_NOCMTIME(inode))
+	if (file_is_nocmtime(file))
 		return 0;
 
 	now = current_fs_time(inode->i_sb);
diff --git a/fs/namei.c b/fs/namei.c
index fe30d3b..8ecebca 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2617,8 +2617,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
 			return -EPERM;
 	}
 
-	/* O_NOATIME can only be set by the owner or superuser */
-	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
+	if (forbid_o_notime(inode, path->mnt, flag))
 		return -EPERM;
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e..dd92eeb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1471,6 +1471,8 @@ static inline void sb_start_intwrite(struct super_block *sb)
 
 
 extern bool inode_owner_or_capable(const struct inode *inode);
+extern bool forbid_o_notime(struct inode *inode, struct vfsmount *mnt,
+			    unsigned long flags);
 
 /*
  * VFS helper functions..
@@ -2950,6 +2952,11 @@ static inline bool is_root_inode(struct inode *inode)
 	return inode == inode->i_sb->s_root->d_inode;
 }
 
+static inline bool file_is_nocmtime(struct file *file)
+{
+	return IS_NOCMTIME(file_inode(file)) || (file->f_flags & O_NOCMTIME);
+}
+
 static inline bool dir_emit(struct dir_context *ctx,
 			    const char *name, int namelen,
 			    u64 ino, unsigned type)
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..ed7b2e1 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_NOCMTIME
+#define O_NOCMTIME	040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
-- 
2.1.0
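
The permission rule in forbid_o_notime() can be read as a small decision
table.  The following is a userspace model of it, not the kernel
function: struct model_ctx and both model_* functions are hypothetical
stand-ins, with two booleans replacing inode_owner_or_capable(inode) and
the mnt->mnt_flags & MNT_NOCMTIME test.  The flag values are the
asm-generic O_NOATIME and the O_NOCMTIME value proposed in this patch.

```c
#include <stdbool.h>

#define M_O_NOATIME	01000000	/* asm-generic O_NOATIME */
#define M_O_NOCMTIME	040000000	/* value proposed in this patch */

/* Hypothetical stand-in for the inode and vfsmount state. */
struct model_ctx {
	bool owner_or_capable;	/* inode_owner_or_capable(inode) */
	bool mnt_nocmtime;	/* mnt->mnt_flags & MNT_NOCMTIME */
};

/* Mirrors the two checks in forbid_o_notime(); true means -EPERM. */
static bool model_forbid_o_notime(const struct model_ctx *ctx,
				  unsigned long flags)
{
	if ((flags & (M_O_NOATIME | M_O_NOCMTIME)) && !ctx->owner_or_capable)
		return true;

	if ((flags & M_O_NOCMTIME) && !ctx->mnt_nocmtime)
		return true;

	return false;
}

/* setfl() only passes the bits being newly set: arg & ~f_flags. */
static bool model_setfl_forbidden(const struct model_ctx *ctx,
				  unsigned long f_flags, unsigned long arg)
{
	return model_forbid_o_notime(ctx, arg & ~f_flags);
}
```

In this model an owner on a mount without nocmtime may still set
O_NOATIME but not O_NOCMTIME, and F_SETFL does not re-check a flag the
file already carries.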


* Re: [PATCH RFC v2 2/2] vfs: add O_NOCMTIME
       [not found]     ` <1431725028-24071-3-git-send-email-zab-ugsP4Wv/S6ZeoWH0uzbU5w@public.gmane.org>
@ 2015-05-16 21:50       ` Azat Khuzhin
  0 siblings, 0 replies; 4+ messages in thread
From: Azat Khuzhin @ 2015-05-16 21:50 UTC
  To: Zach Brown
  Cc: Sage Weil, linux-fsdevel, linux-kernel, linux-api

On Fri, May 15, 2015 at 02:23:48PM -0700, Zach Brown wrote:
> Add a O_NOCMTIME flag which prevents inode time updates on writes and
> can greatly reduce the IO overhead of writes to allocated and
> initialized regions of files.

> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h

You may also want to resolve this:

/*
 * Don't update ctime and mtime.
 *
 * Currently a special hack for the XFS open_by_handle ioctl, but we'll
 * hopefully graduate it to a proper O_CMTIME flag supported by open(2) soon.
 */
#define FMODE_NOCMTIME      ((__force fmode_t)0x800)

