From: "Theodore Ts'o" <tytso@mit.edu>
To: linux-fsdevel@vger.kernel.org
Cc: Ext4 Developers List <linux-ext4@vger.kernel.org>, xfs@oss.sgi.com, linux-btrfs@vger.kernel.org, "Theodore Ts'o" <tytso@mit.edu>
Subject: [PATCH 2/4] vfs: add support for a lazytime mount option
Date: Fri, 21 Nov 2014 14:59:22 -0500
Message-ID: <1416599964-21892-3-git-send-email-tytso@mit.edu>
In-Reply-To: <1416599964-21892-1-git-send-email-tytso@mit.edu>

Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) the inode needs to be updated for some non-time
related change, (b) userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests them via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
 fs/inode.c              | 18 ++++++++++++++++++
 fs/proc_namespace.c     |  1 +
 fs/sync.c               |  7 +++++++
 include/linux/fs.h      |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
 	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		int dirty_time;
+
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		dirty_time = inode->i_state & I_DIRTY_TIME;
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_sb_list_lock);
+
+		iput(old_inode);
+		old_inode = inode;
+
+		if (dirty_time)
+			mark_inode_dirty(inode);
+		cond_resched();
+		spin_lock(&inode_sb_list_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+	iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	flush_sb_dirty_time(sb);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..6e91aca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
+	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+		if (inode->i_op->write_time)
+			inode->i_op->write_time(inode);
+		else if (inode->i_sb->s_op->write_inode) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+			};
+			mark_inode_dirty(inode);
+			inode->i_sb->s_op->write_inode(inode, &wbc);
+		}
+	}
+
 	if (!list_empty(&inode->i_wb_list))
 		inode_wb_list_del(inode);
 
@@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		return 0;
+	}
 	if (inode->i_op->write_time)
 		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 73ca174..f98234a 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 		{ MS_SYNCHRONOUS, ",sync" },
 		{ MS_DIRSYNC, ",dirsync" },
 		{ MS_MANDLOCK, ",mand" },
+		{ MS_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
diff --git a/fs/sync.c b/fs/sync.c
index bdc729d..db7930e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -177,8 +177,15 @@ SYSCALL_DEFINE1(syncfs, int, fd)
  */
 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	struct inode *inode = file->f_mapping->host;
+
 	if (!file->f_op->fsync)
 		return -EINVAL;
+	if (!datasync && inode->i_state & I_DIRTY_TIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_SYNC;
+		spin_unlock(&inode->i_lock);
+	}
 	return file->f_op->fsync(file, start, end, datasync);
 }
 EXPORT_SYMBOL(vfs_fsync_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3633239..489b2f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,6 +1721,7 @@ struct super_operations {
 #define __I_DIO_WAKEUP		9
 #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
+#define I_DIRTY_TIME		(1 << 11)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..cc9713a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -90,6 +90,7 @@ struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOSEC	(1<<28)
-- 
2.1.0
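
Editorial illustration, not part of the patch: the flag handling described
in the commit message can be sketched as a small userspace model.  This is
only a sketch under stated assumptions.  Everything prefixed model_ or M_
is hypothetical and exists only for this example; none of it is kernel
code, and the flag values are not the kernel's.  The model also assumes
that background writeback only visits inodes which carry the "sync" dirty
bit, so an inode with only the time bit set is never written until one of
the three triggers from the commit message fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only (not the kernel's). */
#define M_DIRTY_SYNC (1u << 0)	/* stands in for I_DIRTY_SYNC */
#define M_DIRTY_TIME (1u << 1)	/* stands in for I_DIRTY_TIME */

struct model_inode {
	unsigned int state;		/* dirty bits */
	unsigned int nlink;		/* 0 means the file was deleted */
	long mem_mtime;			/* in-memory timestamp */
	long disk_mtime;		/* timestamp in the on-disk inode */
	unsigned int disk_writes;	/* inode table writes so far */
};

/* Copy the in-memory timestamp to "disk" and count the write. */
static void model_write_inode(struct model_inode *ino)
{
	ino->disk_mtime = ino->mem_mtime;
	ino->disk_writes++;
}

/* Timestamp update: with lazytime only the time bit is set. */
static void model_update_time(struct model_inode *ino, long now, bool lazytime)
{
	ino->mem_mtime = now;
	ino->state |= lazytime ? M_DIRTY_TIME : M_DIRTY_SYNC;
}

/* Writeback: skips inodes that are not marked M_DIRTY_SYNC. */
static void model_writeback(struct model_inode *ino)
{
	if (!(ino->state & M_DIRTY_SYNC))
		return;
	ino->state &= ~(M_DIRTY_SYNC | M_DIRTY_TIME);
	model_write_inode(ino);
}

/* fsync (not fdatasync) promotes a lazily dirtied inode, then writes. */
static void model_fsync(struct model_inode *ino, bool datasync)
{
	if (!datasync && (ino->state & M_DIRTY_TIME))
		ino->state |= M_DIRTY_SYNC;
	model_writeback(ino);
}

/* Eviction flushes lazy timestamps, but only for undeleted inodes. */
static void model_evict(struct model_inode *ino)
{
	if (ino->nlink && (ino->state & M_DIRTY_TIME)) {
		model_write_inode(ino);
		ino->state &= ~M_DIRTY_TIME;
	}
}

/*
 * Random-write workload: nr_writes timestamp updates, a writeback pass
 * after each one, and a final fsync.  Returns the number of inode writes.
 */
static unsigned int model_random_writes(bool lazytime, int nr_writes)
{
	struct model_inode ino = { .nlink = 1 };
	int i;

	for (i = 1; i <= nr_writes; i++) {
		model_update_time(&ino, i, lazytime);
		model_writeback(&ino);
	}
	model_fsync(&ino, false);
	assert(ino.disk_mtime == ino.mem_mtime);
	return ino.disk_writes;
}
```

Calling model_random_writes(false, n) and model_random_writes(true, n)
from a main() contrasts the two modes: in this model every update costs an
inode write without lazytime, while with lazytime the only write is the
one forced by the final fsync.  Real writeback batches updates, so the
model overstates the non-lazytime cost; it only illustrates which events
cause the on-disk timestamps to be updated.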