* [PATCH v4 00/10] VFS hot tracking
@ 2013-08-05 14:49 zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 01/10] VFS hot tracking: Define basic data structures and functions zwu.kernel
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  This patchset introduces a hot tracking function in the VFS
layer, which keeps track of real disk I/O in memory. With it,
you can easily learn more details about disk I/O and detect
where the disk I/O hot spots are. A specific filesystem can
also make use of it for accurate defragmentation, hot
relocation support, etc.

  V4 is now ready for external review; any comments or ideas
are appreciated. Thanks.

NOTE:

  The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For usage instructions, further information and the performance
report, please see hot_tracking.txt in Documentation and the
following links:
  1.) http://lwn.net/Articles/525651/
  2.) https://lkml.org/lkml/2012/12/20/199

Changelog from v3:
 - Removed debugfs support; it is left on the TODO list
 - Killed the HOT_DELETING and HOT_IN_LIST flags
 - Fixed unlink issues
 - Fixed the leak in lookups (both inode and range) when
   racing with unlink
 - Killed hot_comm_item and split the functions which take it
 - Fixed some other issues

v3:
 - Added memory capping function for hot items [Zhiyong]
 - Cleaned up the aging function [Zhiyong]

v2:
 - Refactored to be under RCU [Chandra Seetharaman]
 - Merged some code changes [Chandra Seetharaman]
 - Fixed some issues [Chandra Seetharaman]

v1:
 - Solved the 64-bit inode number issue. [David Sterba]
 - Embedded struct hot_type in struct file_system_type [Darrick J. Wong]
 - Cleaned up some issues [David Sterba]
 - Use a static hot debugfs root [Greg KH]

rfcv4:
 - Introduce hot func registering framework [Zhiyong]
 - Remove global variable for hot tracking [Zhiyong]
 - Add btrfs hot tracking support [Zhiyong]

rfcv3:
 1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Turned some macros into tunables [Zhiyong, Liu Zheng]
       TIME_TO_KICK and HEAT_UPDATE_DELAY
 4.) Cleaned up a lot of other issues [Dave Chinner]


rfcv2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
 1.) Reduced the number of new files and put everything in
     fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
                                        [Marco Stornelli, Dave Chinner]

Dave Chinner (1):
  VFS hot tracking, xfs: Add hot tracking support

Zhi Yong Wu (9):
  VFS hot tracking: Define basic data structures and functions
  VFS hot tracking: Track IO and record heat information
  VFS hot tracking: Add a workqueue to move items between hot maps
  VFS hot tracking: Add shrinker functionality to curtail memory usage
  VFS hot tracking: Add an ioctl to get hot tracking information
  VFS hot tracking: Add a /proc interface to make the interval tunable
  VFS hot tracking: Add two /proc interfaces to control memory usage
  VFS hot tracking: Add documentation
  VFS hot tracking, btrfs: Add hot tracking support

 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 210 +++++++
 fs/Makefile                                |   2 +-
 fs/btrfs/ctree.h                           |   1 +
 fs/btrfs/super.c                           |  22 +-
 fs/compat_ioctl.c                          |   5 +
 fs/dcache.c                                |   2 +
 fs/direct-io.c                             |   6 +
 fs/hot_tracking.c                          | 848 +++++++++++++++++++++++++++++
 fs/hot_tracking.h                          |  66 +++
 fs/ioctl.c                                 |  68 +++
 fs/namei.c                                 |   3 +
 fs/xfs/xfs_mount.h                         |   1 +
 fs/xfs/xfs_super.c                         |  18 +
 include/linux/fs.h                         |   4 +
 include/linux/hot_tracking.h               | 148 +++++
 kernel/sysctl.c                            |  21 +
 mm/filemap.c                               |   7 +
 mm/page-writeback.c                        |  13 +
 mm/readahead.c                             |   9 +
 20 files changed, 1454 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.11.7



* [PATCH v4 01/10] VFS hot tracking: Define basic data structures and functions
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 02/10] VFS hot tracking: Track IO and record heat information zwu.kernel
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patch includes the basic data structures and functions needed for
VFS hot tracking.

It adds hot_inode_tree, an rbtree that keeps track of frequently
accessed files and is keyed by {inode, offset}. The trees contain
hot_inode_items representing those files and hot_range_items
representing ranges within each file.

It defines a data structure hot_info, which is associated with a mounted
filesystem, and will be used to store the inode tree and range tree for
hot items pertaining to that filesystem.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/Makefile                  |   2 +-
 fs/dcache.c                  |   2 +
 fs/hot_tracking.c            | 233 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  20 ++++
 include/linux/fs.h           |   4 +
 include/linux/hot_tracking.h |  84 ++++++++++++++++
 6 files changed, 344 insertions(+), 1 deletion(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..5f9b8f1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o
+		stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 87bdb53..1e84808 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3081,4 +3082,5 @@ void __init vfs_caches_init(unsigned long mempages)
 	mnt_init();
 	bdev_cache_init();
 	chrdev_init();
+	hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..8a65472
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,233 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+#include <linux/list_sort.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+static void hot_range_item_init(struct hot_range_item *hr,
+			struct hot_inode_item *he, loff_t start)
+{
+	kref_init(&hr->refs);
+	hr->start = start;
+	hr->len = hot_bit_shift(1, RANGE_BITS, true);
+	hr->hot_inode = he;
+}
+
+static void hot_range_item_free_cb(struct rcu_head *head)
+{
+	struct hot_range_item *hr = container_of(head,
+				struct hot_range_item, rcu);
+
+	kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+static void hot_range_item_free(struct kref *kref)
+{
+	struct hot_range_item *hr = container_of(kref,
+				struct hot_range_item, refs);
+	struct hot_info *root = hr->hot_inode->hot_root;
+
+	rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+	call_rcu(&hr->rcu, hot_range_item_free_cb);
+}
+
+void hot_range_item_get(struct hot_range_item *hr)
+{
+        kref_get(&hr->refs);
+}
+EXPORT_SYMBOL_GPL(hot_range_item_get);
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_range_item_put(struct hot_range_item *hr)
+{
+        kref_put(&hr->refs, hot_range_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_range_item_put);
+
+/*
+ * Free the entire hot_range_tree.
+ */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+	struct rb_node *node;
+	struct hot_range_item *hr;
+
+	/* Free hot inode and range trees on fs root */
+	spin_lock(&he->i_lock);
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		hr = rb_entry(node, struct hot_range_item, rb_node);
+		node = rb_next(node);
+		hot_range_item_put(hr);
+	}
+	spin_unlock(&he->i_lock);
+}
+
+static void hot_inode_item_init(struct hot_inode_item *he,
+			struct hot_info *root, u64 ino)
+{
+	kref_init(&he->refs);
+	he->i_ino = ino;
+	he->hot_root = root;
+	spin_lock_init(&he->i_lock);
+}
+
+static void hot_inode_item_free_cb(struct rcu_head *head)
+{
+	struct hot_inode_item *he = container_of(head,
+				struct hot_inode_item, rcu);
+
+	kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+	struct hot_inode_item *he = container_of(kref,
+				struct hot_inode_item, refs);
+
+	rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+	hot_range_tree_free(he);
+	call_rcu(&he->rcu, hot_inode_item_free_cb);
+}
+
+void hot_inode_item_get(struct hot_inode_item *he)
+{
+        kref_get(&he->refs);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_get);
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+        kref_put(&he->refs, hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_put);
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+void __init hot_cache_init(void)
+{
+	hot_inode_item_cachep = kmem_cache_create("hot_inode_cache",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_inode_item_cachep)
+		return;
+
+	hot_range_item_cachep = kmem_cache_create("hot_range_cache",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_range_item_cachep)
+		kmem_cache_destroy(hot_inode_item_cachep);
+}
+EXPORT_SYMBOL_GPL(hot_cache_init);
+
+static struct hot_info *hot_tree_init(struct super_block *sb)
+{
+	struct hot_info *root;
+	int i, j;
+
+	root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+	if (!root) {
+		printk(KERN_ERR "%s: Failed to malloc memory for "
+				"hot_info\n", __func__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	root->hot_inode_tree = RB_ROOT;
+	spin_lock_init(&root->t_lock);
+
+	return root;
+}
+
+/*
+ * Frees the entire hot tree.
+ */
+static void hot_tree_exit(struct hot_info *root)
+{
+	struct rb_node *node;
+
+	spin_lock(&root->t_lock);
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		struct hot_inode_item *he = rb_entry(node,
+				struct hot_inode_item, rb_node);
+		node = rb_next(node);
+		hot_inode_item_put(he);
+	}
+	spin_unlock(&root->t_lock);
+}
+
+/*
+ * Initialize the data structures for hot tracking.
+ * This function will be called by *_fill_super()
+ * when filesystem is mounted.
+ */
+int hot_track_init(struct super_block *sb)
+{
+	struct hot_info *root;
+	int ret = 0;
+
+	if (!hot_inode_item_cachep || !hot_range_item_cachep) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	root = hot_tree_init(sb);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		goto err;
+	}
+
+	sb->s_hot_root = root;
+
+	printk(KERN_INFO "VFS: Turning on hot tracking\n");
+
+	return ret;
+
+err:
+	sb->s_hot_root = NULL;
+
+	printk(KERN_ERR "VFS: Fail to turn on hot tracking\n");
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
+/*
+ * This function will be called by *_put_super()
+ * when filesystem is umounted, or also by *_fill_super()
+ * in some exceptional cases.
+ */
+void hot_track_exit(struct super_block *sb)
+{
+	struct hot_info *root = sb->s_hot_root;
+
+	sb->s_hot_root = NULL;
+	hot_tree_exit(root);
+	kfree(root);
+}
+EXPORT_SYMBOL_GPL(hot_track_exit);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..2776092
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,20 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/hot_tracking.h>
+
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..9003733 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -28,6 +28,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1328,6 +1329,9 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/* Hot data tracking*/
+	struct hot_info *s_hot_root;
 };
 
 /* superblock cache pruning functions */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..a7d128d
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,84 @@
+/*
+ *  include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot data tracking
+ * structures etc.
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+#include <linux/slab.h>
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+#define MAP_BITS 8
+#define MAP_SIZE (1 << MAP_BITS)
+
+/* values for hot_freq flags */
+enum {
+	TYPE_INODE = 0,
+	TYPE_RANGE,
+	MAX_TYPES,
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct kref refs;
+	struct rb_node rb_node;         /* rbtree index */
+	struct rcu_head rcu;
+	struct rb_root hot_range_tree;	/* tree of ranges */
+	spinlock_t i_lock;		/* protect above tree */
+	struct hot_info *hot_root;	/* associated hot_info */
+	u64 i_ino;			/* inode number from inode */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+	struct kref refs;
+	struct rb_node rb_node;                 /* rbtree index */
+	struct rcu_head rcu;
+	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
+	loff_t start;				/* offset in bytes */
+	size_t len;				/* length in bytes */
+};
+
+struct hot_info {
+	struct rb_root hot_inode_tree;
+	spinlock_t t_lock;				/* protect above tree */
+};
+
+extern void __init hot_cache_init(void);
+extern int hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+extern void hot_range_item_put(struct hot_range_item *hr);
+extern void hot_inode_item_put(struct hot_inode_item *he);
+extern void hot_range_item_get(struct hot_range_item *hr);
+extern void hot_inode_item_get(struct hot_inode_item *he);
+
+static inline u64 hot_bit_shift(u64 counter, u32 bits, bool dir)
+{
+	if (dir)
+		return counter << bits;
+	else
+		return counter >> bits;
+}
+
+#endif /* __KERNEL__ */
+
+#endif  /* _LINUX_HOTTRACK_H */
-- 
1.7.11.7



* [PATCH v4 02/10] VFS hot tracking: Track IO and record heat information
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 01/10] VFS hot tracking: Define basic data structures and functions zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 03/10] VFS hot tracking: Add a workqueue to move items between hot maps zwu.kernel
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patch hooks the read/write code paths, namely read_pages(),
do_writepages(), do_generic_file_read() and __blockdev_direct_IO(),
to record heat information.

When real disk I/O is done for an inode, its hot_inode_item is created
or updated in the filesystem's RB-tree, and the I/O frequency data for
each of its extents is created or updated in the per-inode RB-tree.

Each of the two structures hot_inode_item and hot_range_item
contains a hot_freq struct with its access frequency metrics
(number of {reads,writes}, last {read,write} time, frequency of
{reads,writes}).
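
For readers who want to experiment with the averaging outside the kernel,
here is a minimal userspace sketch of the arithmetic that hot_freq_calc()
performs in this patch. The name freq_calc_model() and the use of uint64_t
in place of u64 are assumptions made for illustration only; the shift and
subtract steps and the FREQ_POWER value are copied from the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FREQ_POWER 4	/* same value the patch adds to fs/hot_tracking.h */

/*
 * Userspace model of the hot_freq_calc() arithmetic: fold the time
 * since the previous access (delta_ns) into the running value *avg.
 * Unsigned arithmetic wraps modulo 2^64, as it does for u64.
 */
static void freq_calc_model(uint64_t delta_ns, uint64_t *avg)
{
	uint64_t new_delta;

	assert(avg != NULL);
	new_delta = delta_ns >> FREQ_POWER;

	*avg = (*avg << FREQ_POWER) - *avg + new_delta;
	*avg = *avg >> FREQ_POWER;
}
```

With this model, a running value of 16 stays at 16 when accesses arrive
256 ns apart, so under these assumptions the stored value settles at the
access interval scaled down by 2^FREQ_POWER.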

Each hot_inode_item contains one hot_range_tree struct which is keyed by
{inode, offset, length} and used to keep track of all the ranges in this file.
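
To illustrate which range items a single I/O touches, the following
userspace sketch mirrors the index computation in hot_update_freqs().
The helper names first_range() and end_range() are hypothetical and do
not appear in the patch; RANGE_BITS and the rounding expression are taken
from the diff below.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_BITS 20	/* 1 MiB sub-file ranges, as in fs/hot_tracking.h */

/* First range index touched by an I/O starting at byte offset start. */
static uint64_t first_range(uint64_t start)
{
	return start >> RANGE_BITS;
}

/* One past the last range index visited for an I/O of len bytes at start. */
static uint64_t end_range(uint64_t start, uint64_t len)
{
	uint64_t range_size = (uint64_t)1 << RANGE_BITS;

	assert(len > 0);	/* hot_track_enabled() rejects len == 0 */
	return (start + len + range_size - 1) >> RANGE_BITS;
}
```

For example, a 2-byte I/O at offset 1048575 straddles the first 1 MiB
boundary, so first_range() gives 0 and end_range() gives 2, meaning range
items 0 and 1 are both looked up and updated.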

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/direct-io.c               |   6 ++
 fs/hot_tracking.c            | 242 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   1 +
 fs/namei.c                   |   3 +
 include/linux/hot_tracking.h |  27 +++++
 mm/filemap.c                 |   7 ++
 mm/page-writeback.c          |  13 +++
 mm/readahead.c               |   9 ++
 8 files changed, 308 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..46d698d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,7 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 #include <linux/aio.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1295,6 +1296,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	prefetch(bdev->bd_queue);
 	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+	/* Hot tracking */
+	if (hot_track_enabled(inode, iov_length(iov, nr_segs)))
+		hot_update_freqs(inode, offset,
+			iov_length(iov, nr_segs), rw & WRITE);
+
 	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 				     nr_segs, get_block, end_io,
 				     submit_io, flags);
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 8a65472..e2a6e84 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -19,10 +19,23 @@
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
 
+inline bool hot_track_enabled(struct inode *inode, size_t len)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+
+	if (!root || (len == 0) || !S_ISREG(inode->i_mode))
+		return false;
+	else
+		return true;
+}
+EXPORT_SYMBOL_GPL(hot_track_enabled);
+
 static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	kref_init(&hr->refs);
+	hr->freq.avg_delta_reads = (u64) -1;
+	hr->freq.avg_delta_writes = (u64) -1;
 	hr->start = start;
 	hr->len = hot_bit_shift(1, RANGE_BITS, true);
 	hr->hot_inode = he;
@@ -62,6 +75,64 @@ void hot_range_item_put(struct hot_range_item *hr)
 }
 EXPORT_SYMBOL_GPL(hot_range_item_put);
 
+struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he, loff_t start, int alloc)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_range_item *hr, *hr_new = NULL;
+
+	start = hot_bit_shift(start, RANGE_BITS, true);
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&he->i_lock);
+	p = &he->hot_range_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		hr = rb_entry(parent, struct hot_range_item, rb_node);
+		if (start < hr->start)
+			p = &(*p)->rb_left;
+		else if (start > (hr->start + hr->len - 1))
+			p = &(*p)->rb_right;
+		else {
+			hot_range_item_get(hr);
+			if (hr_new) {
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the range. Free the
+				 * newly allocated item.
+				 */
+				hot_range_item_put(hr_new);
+			}
+			spin_unlock(&he->i_lock);
+
+			return hr;
+		}
+	}
+
+	if (hr_new) {
+		rb_link_node(&hr_new->rb_node, parent, p);
+		rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+		hot_range_item_get(hr_new); /* For the caller */
+		spin_unlock(&he->i_lock);
+		return hr_new;
+	}
+        spin_unlock(&he->i_lock);
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	hr_new = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
+	if (!hr_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_range_item_init(hr_new, he, start);
+
+	goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_range_item_lookup);
+
 /*
  * Free the entire hot_range_tree.
  */
@@ -85,6 +156,8 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 			struct hot_info *root, u64 ino)
 {
 	kref_init(&he->refs);
+	he->freq.avg_delta_reads = (u64) -1;
+	he->freq.avg_delta_writes = (u64) -1;
 	he->i_ino = ino;
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
@@ -124,6 +197,126 @@ void hot_inode_item_put(struct hot_inode_item *he)
 }
 EXPORT_SYMBOL_GPL(hot_inode_item_put);
 
+struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root, u64 ino, int alloc)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *he, *he_new = NULL;
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&root->t_lock);
+	p = &root->hot_inode_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		he = rb_entry(parent, struct hot_inode_item, rb_node);
+		if (ino < he->i_ino)
+			p = &(*p)->rb_left;
+		else if (ino > he->i_ino)
+			p = &(*p)->rb_right;
+		else {
+			hot_inode_item_get(he);
+			if (he_new) {
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the inode. Free the
+				 * newly allocated item.
+				 */
+				hot_inode_item_put(he_new);
+			}
+			spin_unlock(&root->t_lock);
+
+			return he;
+		}
+	}
+
+	if (he_new) {
+		rb_link_node(&he_new->rb_node, parent, p);
+		rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+		hot_inode_item_get(he_new); /* For the caller */
+		spin_unlock(&root->t_lock);
+		return he_new;
+	}
+	spin_unlock(&root->t_lock);
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	he_new = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
+	if (!he_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_inode_item_init(he_new, root, ino);
+
+	goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
+
+void hot_inode_item_unlink(struct inode *inode)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+
+	if (!root || !S_ISREG(inode->i_mode))
+		return;
+
+	he = hot_inode_item_lookup(root, inode->i_ino, 0);
+	if (IS_ERR(he))
+                return;
+
+	spin_lock(&root->t_lock);
+	hot_inode_item_put(he); /* For the caller */
+	hot_inode_item_put(he);
+	spin_unlock(&root->t_lock);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_unlink);
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers.
+ *
+ * avg_delta_{reads,writes} are indeed a kind of simple moving
+ * average of the time difference between each of the last
+ * 2^(FREQ_POWER) reads/writes. If there have not yet been that
+ * many reads or writes, it's likely that the values will be very
+ * large; They are initialized to the largest possible value for the
+ * data type. Simply, we don't want a few fast access to a file to
+ * automatically make it appear very hot.
+ */
+static void hot_freq_calc(struct timespec old_atime,
+		struct timespec cur_time, u64 *avg)
+{
+	struct timespec delta_ts;
+	u64 new_delta;
+
+	delta_ts = timespec_sub(cur_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+	*avg = (*avg << FREQ_POWER) - *avg + new_delta;
+	*avg = *avg >> FREQ_POWER;
+}
+
+static void hot_freq_update(struct hot_info *root,
+		struct hot_freq *freq, bool write)
+{
+	struct timespec cur_time = current_kernel_time();
+
+	if (write) {
+		freq->nr_writes += 1;
+		hot_freq_calc(freq->last_write_time,
+				cur_time,
+				&freq->avg_delta_writes);
+		freq->last_write_time = cur_time;
+	} else {
+		freq->nr_reads += 1;
+		hot_freq_calc(freq->last_read_time,
+				cur_time,
+				&freq->avg_delta_reads);
+		freq->last_read_time = cur_time;
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -145,6 +338,55 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
+/*
+ * Main function to update i/o access frequencies, and it will be called
+ * from read/writepages() hooks, which are read_pages(), do_writepages(),
+ * do_generic_file_read(), and __blockdev_direct_IO().
+ */
+void hot_update_freqs(struct inode *inode, loff_t start,
+			size_t len, int rw)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	u64 range_size;
+	loff_t cur, end;
+
+	he = hot_inode_item_lookup(root, inode->i_ino, 1);
+	if (IS_ERR(he))
+		return;
+
+	hot_freq_update(root, &he->freq, rw);
+
+	/*
+	 * Align ranges on range size boundary
+	 * to prevent proliferation of range structs
+	 */
+	range_size  = hot_bit_shift(1, RANGE_BITS, true);
+	end = hot_bit_shift((start + len + range_size - 1),
+			RANGE_BITS, false);
+	cur = hot_bit_shift(start, RANGE_BITS, false);
+	for (; cur < end; cur++) {
+		hr = hot_range_item_lookup(he, cur, 1);
+		if (IS_ERR(hr)) {
+			WARN(1, "hot_range_item_lookup returns %ld\n",
+				PTR_ERR(hr));
+			return;
+		}
+
+		hot_freq_update(root, &hr->freq, rw);
+
+		spin_lock(&he->i_lock);
+		hot_range_item_put(hr);
+		spin_unlock(&he->i_lock);
+	}
+
+	spin_lock(&root->t_lock);
+	hot_inode_item_put(he);
+	spin_unlock(&root->t_lock);
+}
+EXPORT_SYMBOL_GPL(hot_update_freqs);
+
 static struct hot_info *hot_tree_init(struct super_block *sb)
 {
 	struct hot_info *root;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 2776092..bb4cb16 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -16,5 +16,6 @@
 
 /* size of sub-file ranges */
 #define RANGE_BITS 20
+#define FREQ_POWER 4
 
 #endif /* __HOT_TRACKING__ */
diff --git a/fs/namei.c b/fs/namei.c
index 8b61d10..13f073f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3454,6 +3454,9 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
 	}
 	mutex_unlock(&dentry->d_inode->i_mutex);
 
+	if (!error && !dentry->d_inode->i_nlink)
+		hot_inode_item_unlink(dentry->d_inode);
+
 	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
 	if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
 		fsnotify_link_count(dentry->d_inode);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index a7d128d..e2a9d50 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -34,8 +34,24 @@ enum {
 	MAX_TYPES,
 };
 
+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u32 last_temp;
+};
+
 /* An item representing an inode and its access frequency */
 struct hot_inode_item {
+	struct hot_freq freq;           /* frequency data */
 	struct kref refs;
 	struct rb_node rb_node;         /* rbtree index */
 	struct rcu_head rcu;
@@ -50,6 +66,7 @@ struct hot_inode_item {
  * an inode whose frequency is being tracked
  */
 struct hot_range_item {
+	struct hot_freq freq;                   /* frequency data */
 	struct kref refs;
 	struct rb_node rb_node;                 /* rbtree index */
 	struct rcu_head rcu;
@@ -70,6 +87,16 @@ extern void hot_range_item_put(struct hot_range_item *hr);
 extern void hot_inode_item_put(struct hot_inode_item *he);
 extern void hot_range_item_get(struct hot_range_item *hr);
 extern void hot_inode_item_get(struct hot_inode_item *he);
+extern void hot_update_freqs(struct inode *inode,
+			loff_t start, size_t len, int rw);
+extern struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he,
+			loff_t start, int alloc);
+extern struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root,
+			u64 ino, int alloc);
+extern void hot_inode_item_unlink(struct inode *inode);
+extern inline bool hot_track_enabled(struct inode *inode, size_t len);
 
 static inline u64 hot_bit_shift(u64 counter, u32 bits, bool dir)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 4b51ac1..c9f0a99 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1242,6 +1243,12 @@ readpage:
 		 * PG_error will be set again if readpage fails.
 		 */
 		ClearPageError(page);
+
+		/* Hot tracking */
+		if (hot_track_enabled(inode, PAGE_CACHE_SIZE))
+			hot_update_freqs(inode, page->index << PAGE_CACHE_SHIFT,
+				PAGE_CACHE_SIZE, 0);
+
 		/* Start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..0e92e2e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,6 +36,7 @@
 #include <linux/pagevec.h>
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1921,13 +1922,25 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	loff_t start = 0;
+	size_t count = 0, len = 0;
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+	count = wbc->nr_to_write;
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/* Hot tracking */
+	len = (count - wbc->nr_to_write) * PAGE_CACHE_SIZE;
+	if (hot_track_enabled(mapping->host, len))
+		hot_update_freqs(mapping->host, start, len, 1);
+
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 829a77c..1e40015 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -114,6 +115,14 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	struct blk_plug plug;
 	unsigned page_idx;
 	int ret;
+	size_t len = 0;
+
+	/* Hot tracking */
+	len = (size_t)nr_pages * PAGE_CACHE_SIZE;
+	if (hot_track_enabled(mapping->host, len)) {
+		loff_t start = list_to_page(pages)->index << PAGE_CACHE_SHIFT;
+		hot_update_freqs(mapping->host, start, len, 0);
+	}
 
 	blk_start_plug(&plug);
 
-- 
1.7.11.7



* [PATCH v4 03/10] VFS hot tracking: Add a workqueue to move items between hot maps
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 01/10] VFS hot tracking: Define basic data structures and functions zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 02/10] VFS hot tracking: Track IO and record heat information zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 04/10] VFS hot tracking: Add shrinker functionality to curtail memory usage zwu.kernel
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a workqueue per superblock and a delayed_work
to run periodic work to update map info on each superblock.

Two arrays of map lists are defined: one for hot inode
items and the other for hot range (extent) items.

Each hot item in the RB-tree is first distilled into a
single temperature in the range [0, 255]. The item is then
linked into the corresponding array of map lists, which uses
that temperature as its index.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
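The temperature-to-index step described in the changelog can be sketched in user space as below. This is only an illustration, not code from the patch: it assumes MAP_BITS is 8 (so MAP_SIZE is 256, matching the [0, 255] range), and the helper names temp_to_map_index() and needs_relink() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: MAP_BITS is 8, giving 256 map lists indexed 0..255. */
#define MAP_BITS 8
#define MAP_SIZE (1u << MAP_BITS)

/* Keep only the top MAP_BITS bits of the 32-bit temperature. */
static inline uint8_t temp_to_map_index(uint32_t temp)
{
	return (uint8_t)(temp >> (32 - MAP_BITS));
}

/* An item only has to move to another list when its index changes. */
static inline int needs_relink(uint32_t last_temp, uint32_t new_temp)
{
	return temp_to_map_index(last_temp) != temp_to_map_index(new_temp);
}
```

Under this assumption, two temperatures that share the same top eight bits stay on the same list, which is why the update functions in the diff compare temp_cur with temp_prev before relinking.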
 fs/hot_tracking.c            | 264 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  24 ++++
 include/linux/hot_tracking.h |   9 +-
 3 files changed, 296 insertions(+), 1 deletion(-)
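
For readers following hot_temp_calc() in the diff below, here is a simplified user-space sketch of how two of the six metrics are scaled. It assumes hot_bit_shift(x, n, true) is a left shift and hot_bit_shift(x, n, false) is a right shift, which is how the function is used in this series; the helper names read_count_heat() and last_read_heat() are hypothetical and only mirror the arithmetic, not the kernel types.

```c
#include <assert.h>
#include <stdint.h>

#define NRR_MULTIPLIER_POWER 20 /* one read contributes 2^20 before weighting */
#define NRR_COEFF_POWER      0
#define LTR_DIVIDER_POWER    30 /* one age unit = 2^30 ns, roughly 1.07 s */
#define LTR_COEFF_POWER      1

/* Read-count heat: scale up, truncate to 32 bits as the patch does, weight. */
static uint32_t read_count_heat(uint32_t nr_reads)
{
	uint32_t scaled = (uint32_t)((uint64_t)nr_reads << NRR_MULTIPLIER_POWER);

	return scaled >> (3 - NRR_COEFF_POWER);
}

/* Last-read heat: newer accesses give larger values, very old ones give 0. */
static uint32_t last_read_heat(uint64_t now_ns, uint64_t last_read_ns)
{
	uint64_t age_units = (now_ns - last_read_ns) >> LTR_DIVIDER_POWER;
	uint64_t heat;

	if (age_units >= (1ULL << 32))
		heat = 0;
	else
		heat = (1ULL << 32) - age_units;

	return (uint32_t)(heat >> (3 - LTR_COEFF_POWER));
}
```

The final temperature in the patch is the sum of six such terms (reads, writes, last read, last write, and the two average deltas), each weighted by its own *_COEFF_POWER.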

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index e2a6e84..857d423 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -34,6 +34,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	kref_init(&hr->refs);
+	INIT_LIST_HEAD(&hr->track_list);
 	hr->freq.avg_delta_reads = (u64) -1;
 	hr->freq.avg_delta_writes = (u64) -1;
 	hr->start = start;
@@ -56,6 +57,9 @@ static void hot_range_item_free(struct kref *kref)
 	struct hot_info *root = hr->hot_inode->hot_root;
 
 	rb_erase(&hr->rb_node, &hr->hot_inode->hot_range_tree);
+	spin_lock(&root->m_lock);
+	list_del_init(&hr->track_list);
+	spin_unlock(&root->m_lock);
 	call_rcu(&hr->rcu, hot_range_item_free_cb);
 }
 
@@ -81,6 +85,8 @@ struct hot_range_item
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hot_range_item *hr, *hr_new = NULL;
+	u32 temp;
+	u8 temp_cur;
 
 	start = hot_bit_shift(start, RANGE_BITS, true);
 
@@ -114,6 +120,12 @@ redo:
 	if (hr_new) {
 		rb_link_node(&hr_new->rb_node, parent, p);
 		rb_insert_color(&hr_new->rb_node, &he->hot_range_tree);
+		temp = hot_temp_calc(&hr_new->freq);
+		temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+		spin_lock(&he->hot_root->m_lock);
+		list_add_tail(&hr_new->track_list,
+			&he->hot_root->hot_map[TYPE_RANGE][temp_cur]);
+		spin_unlock(&he->hot_root->m_lock);
 		hot_range_item_get(hr_new); /* For the caller */
 		spin_unlock(&he->i_lock);
 		return hr_new;
@@ -152,10 +164,68 @@ static void hot_range_tree_free(struct hot_inode_item *he)
 	spin_unlock(&he->i_lock);
 }
 
+static void hot_range_map_update(struct hot_info *root,
+			struct hot_range_item *hr)
+{
+	u32 temp = hot_temp_calc(&hr->freq);
+	u8 temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+	u8 temp_prev = (u8)hot_bit_shift((u64)hr->freq.last_temp,
+				(32 - MAP_BITS), false);
+
+	hr->freq.last_temp = temp;
+
+	spin_lock(&root->m_lock);
+	if (!list_empty(&hr->track_list)
+		&& (temp_cur != temp_prev)) {
+		list_del_init(&hr->track_list);
+		list_add_tail(&hr->track_list,
+			&root->hot_map[TYPE_RANGE][temp_cur]);
+	}
+	spin_unlock(&root->m_lock);
+}
+
+/*
+ * Update temperatures for each range item for aging purposes.
+ * If one hot range item is old, it will be aged out.
+ */
+static void hot_range_tree_update(struct hot_inode_item *he,
+				struct hot_info *root)
+{
+	struct rb_node *node;
+	struct hot_range_item *hr;
+
+	rcu_read_lock();
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		hr = rb_entry(node, struct hot_range_item, rb_node);
+		node = rb_next(node);
+		hot_range_map_update(root, hr);
+	}
+	rcu_read_unlock();
+}
+
+static int hot_range_temp_cmp(void *priv, struct list_head *a,
+				struct list_head *b)
+{
+	struct hot_range_item *ap = container_of(a,
+			struct hot_range_item, track_list);
+	struct hot_range_item *bp = container_of(b,
+			struct hot_range_item, track_list);
+
+	int diff = ap->freq.last_temp - bp->freq.last_temp;
+	if (diff > 0)
+		return -1;
+	else if (diff < 0)
+		return 1;
+	else
+		return 0;
+}
+
 static void hot_inode_item_init(struct hot_inode_item *he,
 			struct hot_info *root, u64 ino)
 {
 	kref_init(&he->refs);
+	INIT_LIST_HEAD(&he->track_list);
 	he->freq.avg_delta_reads = (u64) -1;
 	he->freq.avg_delta_writes = (u64) -1;
 	he->i_ino = ino;
@@ -177,6 +247,7 @@ static void hot_inode_item_free(struct kref *kref)
 				struct hot_inode_item, refs);
 
 	rb_erase(&he->rb_node, &he->hot_root->hot_inode_tree);
+	list_del_init(&he->track_list);
 	hot_range_tree_free(he);
 	call_rcu(&he->rcu, hot_inode_item_free_cb);
 }
@@ -203,6 +274,8 @@ struct hot_inode_item
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hot_inode_item *he, *he_new = NULL;
+	u32 temp;
+	u8 temp_cur;
 
 	/* walk tree to find insertion point */
 redo:
@@ -234,6 +307,10 @@ redo:
 	if (he_new) {
 		rb_link_node(&he_new->rb_node, parent, p);
 		rb_insert_color(&he_new->rb_node, &root->hot_inode_tree);
+		temp = hot_temp_calc(&he_new->freq);
+		temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+		list_add_tail(&he_new->track_list,
+			&root->hot_map[TYPE_INODE][temp_cur]);
 		hot_inode_item_get(he_new); /* For the caller */
 		spin_unlock(&root->t_lock);
 		return he_new;
@@ -273,6 +350,48 @@ void hot_inode_item_unlink(struct inode *inode)
 EXPORT_SYMBOL_GPL(hot_inode_item_unlink);
 
 /*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature.
+ */
+static void hot_inode_map_update(struct hot_info *root,
+			struct hot_inode_item *he)
+{
+	u32 temp = hot_temp_calc(&he->freq);
+	u8 temp_cur = (u8)hot_bit_shift((u64)temp, (32 - MAP_BITS), false);
+	u8 temp_prev = (u8)hot_bit_shift((u64)he->freq.last_temp,
+				(32 - MAP_BITS), false);
+
+	he->freq.last_temp = temp;
+
+	spin_lock(&root->t_lock);
+	if (!list_empty(&he->track_list)
+		&& (temp_cur != temp_prev)) {
+		list_del_init(&he->track_list);
+		list_add_tail(&he->track_list,
+			&root->hot_map[TYPE_INODE][temp_cur]);
+	}
+	spin_unlock(&root->t_lock);
+}
+
+static int hot_inode_temp_cmp(void *priv, struct list_head *a,
+				struct list_head *b)
+{
+	struct hot_inode_item *ap = container_of(a,
+			struct hot_inode_item, track_list);
+	struct hot_inode_item *bp = container_of(b,
+			struct hot_inode_item, track_list);
+
+	int diff = ap->freq.last_temp - bp->freq.last_temp;
+	if (diff > 0)
+		return -1;
+	else if (diff < 0)
+		return 1;
+	else
+		return 0;
+}
+
+/*
  * This function does the actual work of updating
  * the frequency numbers.
  *
@@ -318,6 +437,128 @@ static void hot_freq_update(struct hot_info *root,
 }
 
 /*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * With the six values, we first do some very rudimentary
+ * "normalizations" to each metric so that each one affects
+ * the final temperature calculation in the intended way.
+ * Note that these six adjustments are not known to be
+ * exactly right; they could use more tweaking and
+ * adjustment, especially in terms of the memory footprint
+ * they consume.
+ *
+ * Next, we take the adjusted values and shift them down to
+ * a manageable size, whereafter they are weighted using the
+ * *_COEFF_POWER values and combined into a single temperature
+ * value.
+ */
+u32 hot_temp_calc(struct hot_freq *freq)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+	u32 nrr_heat, nrw_heat;
+	u64 ltr_heat, ltw_heat, avr_heat, avw_heat;
+
+	nrr_heat = (u32)hot_bit_shift((u64)freq->nr_reads,
+					NRR_MULTIPLIER_POWER, true);
+	nrw_heat = (u32)hot_bit_shift((u64)freq->nr_writes,
+					NRW_MULTIPLIER_POWER, true);
+
+	ltr_heat =
+	hot_bit_shift((cur_time - timespec_to_ns(&freq->last_read_time)),
+			LTR_DIVIDER_POWER, false);
+	ltw_heat =
+	hot_bit_shift((cur_time - timespec_to_ns(&freq->last_write_time)),
+			LTW_DIVIDER_POWER, false);
+
+	avr_heat =
+	hot_bit_shift((((u64) -1) - freq->avg_delta_reads),
+			AVR_DIVIDER_POWER, false);
+	avw_heat =
+	hot_bit_shift((((u64) -1) - freq->avg_delta_writes),
+			AVW_DIVIDER_POWER, false);
+
+	/* ltr_heat is now guaranteed to be u32 safe */
+	if (ltr_heat >= hot_bit_shift((u64) 1, 32, true))
+		ltr_heat = 0;
+	else
+		ltr_heat = hot_bit_shift((u64) 1, 32, true) - ltr_heat;
+
+	/* ltw_heat is now guaranteed to be u32 safe */
+	if (ltw_heat >= hot_bit_shift((u64) 1, 32, true))
+		ltw_heat = 0;
+	else
+		ltw_heat = hot_bit_shift((u64) 1, 32, true) - ltw_heat;
+
+	/* avr_heat is now guaranteed to be u32 safe */
+	if (avr_heat >= hot_bit_shift((u64) 1, 32, true))
+		avr_heat = (u32) -1;
+
+	/* avw_heat is now guaranteed to be u32 safe */
+	if (avw_heat >= hot_bit_shift((u64) 1, 32, true))
+		avw_heat = (u32) -1;
+
+	nrr_heat = (u32)hot_bit_shift((u64)nrr_heat,
+		(3 - NRR_COEFF_POWER), false);
+	nrw_heat = (u32)hot_bit_shift((u64)nrw_heat,
+		(3 - NRW_COEFF_POWER), false);
+	ltr_heat = hot_bit_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+	ltw_heat = hot_bit_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+	avr_heat = hot_bit_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+	avw_heat = hot_bit_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		(u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+	struct hot_info *root = container_of(to_delayed_work(work),
+					struct hot_info, update_work);
+	struct hot_inode_item *he;
+	struct rb_node *node;
+	int i;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		he = rb_entry(node, struct hot_inode_item, rb_node);
+		node = rb_next(node);
+		hot_inode_map_update(root, he);
+		hot_range_tree_update(he, root);
+	}
+	rcu_read_unlock();
+
+	/* Sort temperature map info based on last temperature */
+	for (i = 0; i < MAP_SIZE; i++) {
+		spin_lock(&root->t_lock);
+		list_sort(NULL, &root->hot_map[TYPE_INODE][i],
+			hot_inode_temp_cmp);
+		spin_unlock(&root->t_lock);
+
+		spin_lock(&root->m_lock);
+		list_sort(NULL, &root->hot_map[TYPE_RANGE][i],
+			hot_range_temp_cmp);
+		spin_unlock(&root->m_lock);
+	}
+
+	/* Insert next delayed work */
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 void __init hot_cache_init(void)
@@ -401,6 +642,26 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
+	spin_lock_init(&root->m_lock);
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		for (j = 0; j < MAX_TYPES; j++)
+			INIT_LIST_HEAD(&root->hot_map[j][i]);
+	}
+
+	root->update_wq = alloc_workqueue(
+			"hot_update_wq", WQ_NON_REENTRANT, 0);
+	if (!root->update_wq) {
+		printk(KERN_ERR "%s: Failed to create "
+				"hot update workqueue\n", __func__);
+		kfree(root);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Initialize hot tracking wq and arm one delayed work */
+	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
 	return root;
 }
@@ -412,6 +673,9 @@ static void hot_tree_exit(struct hot_info *root)
 {
 	struct rb_node *node;
 
+	cancel_delayed_work_sync(&root->update_work);
+	destroy_workqueue(root->update_wq);
+
 	spin_lock(&root->t_lock);
 	node = rb_first(&root->hot_inode_tree);
 	while (node) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index bb4cb16..0be7621 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -12,10 +12,34 @@
 #ifndef __HOT_TRACKING__
 #define __HOT_TRACKING__
 
+#include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
+#define HOT_UPDATE_INTERVAL 150
+
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
 
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index e2a9d50..9095859 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -55,6 +55,7 @@ struct hot_inode_item {
 	struct kref refs;
 	struct rb_node rb_node;         /* rbtree index */
 	struct rcu_head rcu;
+	struct list_head track_list;    /* link to *_map[] */
 	struct rb_root hot_range_tree;	/* tree of ranges */
 	spinlock_t i_lock;		/* protect above tree */
 	struct hot_info *hot_root;	/* associated hot_info */
@@ -70,6 +71,7 @@ struct hot_range_item {
 	struct kref refs;
 	struct rb_node rb_node;                 /* rbtree index */
 	struct rcu_head rcu;
+	struct list_head track_list;            /* link to *_map[] */
 	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
 	loff_t start;				/* offset in bytes */
 	size_t len;				/* length in bytes */
@@ -77,7 +79,11 @@ struct hot_range_item {
 
 struct hot_info {
 	struct rb_root hot_inode_tree;
-	spinlock_t t_lock;				/* protect above tree */
+	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
+	spinlock_t t_lock;		/* protect tree and map for inode item */
+	spinlock_t m_lock;		/* protect map for range item */
+	struct workqueue_struct *update_wq;
+	struct delayed_work update_work;
 };
 
 extern void __init hot_cache_init(void);
@@ -96,6 +102,7 @@ extern struct hot_inode_item
 *hot_inode_item_lookup(struct hot_info *root,
 			u64 ino, int alloc);
 extern void hot_inode_item_unlink(struct inode *inode);
+extern u32 hot_temp_calc(struct hot_freq *freq);
 extern inline bool hot_track_enabled(struct inode *inode, size_t len);
 
 static inline u64 hot_bit_shift(u64 counter, u32 bits, bool dir)
-- 
1.7.11.7



* [PATCH v4 04/10] VFS hot tracking: Add shrinker functionality to curtail memory usage
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (2 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 03/10] VFS hot tracking: Add a workqueue to move items between hot maps zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 05/10] VFS hot tracking: Add an ioctl to get hot tracking information zwu.kernel
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Register a shrinker to control the amount of memory that
is used in tracking hot regions. If we are throwing inodes
out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking
code is using, even if it means losing useful information.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
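The calling convention that the shrinker callback in this patch follows can be modelled in user space as below. This is an idealized sketch, not the kernel code: shrink_report() is a hypothetical pure function, count stands in for hot_cnt, fs_allowed stands in for the __GFP_FS check, and it assumes each scanned item frees exactly one object, which the real eviction path does not guarantee.

```c
#include <assert.h>

/*
 * Model of the value the callback reports:
 *  - nr_to_scan == 0 is a query, answered with half of the tracked items;
 *  - without filesystem re-entry permission the callback refuses with -1;
 *  - otherwise up to nr_to_scan items are dropped and the rest reported.
 */
static long shrink_report(long count, unsigned long nr_to_scan, int fs_allowed)
{
	long evicted;

	if (nr_to_scan == 0)
		return count / 2;

	if (!fs_allowed)
		return -1;

	evicted = nr_to_scan < (unsigned long)count ? (long)nr_to_scan : count;
	return count - evicted;
}
```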
 fs/hot_tracking.c            | 74 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h |  2 ++
 2 files changed, 76 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 857d423..037d5db 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -40,13 +40,16 @@ static void hot_range_item_init(struct hot_range_item *hr,
 	hr->start = start;
 	hr->len = hot_bit_shift(1, RANGE_BITS, true);
 	hr->hot_inode = he;
+	atomic_long_inc(&he->hot_root->hot_cnt);
 }
 
 static void hot_range_item_free_cb(struct rcu_head *head)
 {
 	struct hot_range_item *hr = container_of(head,
 				struct hot_range_item, rcu);
+	struct hot_info *root = hr->hot_inode->hot_root;
 
+	atomic_long_dec(&root->hot_cnt);
 	kmem_cache_free(hot_range_item_cachep, hr);
 }
 
@@ -231,13 +234,16 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 	he->i_ino = ino;
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
+	atomic_long_inc(&root->hot_cnt);
 }
 
 static void hot_inode_item_free_cb(struct rcu_head *head)
 {
 	struct hot_inode_item *he = container_of(head,
 				struct hot_inode_item, rcu);
+	struct hot_info *root = he->hot_root;
 
+	atomic_long_dec(&root->hot_cnt);
 	kmem_cache_free(hot_inode_item_cachep, he);
 }
 
@@ -517,6 +523,39 @@ u32 hot_temp_calc(struct hot_freq *freq)
 	return result;
 }
 
+static void hot_item_evict(struct hot_info *root, unsigned long work,
+			unsigned long (*work_get)(struct hot_info *root))
+{
+	int i;
+
+	if (work <= 0)
+		return;
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		struct hot_inode_item *he, *next;
+		unsigned long work_prev;
+
+		spin_lock(&root->t_lock);
+		if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+			spin_unlock(&root->t_lock);
+			continue;
+		}
+
+		list_for_each_entry_safe(he, next,
+			&root->hot_map[TYPE_INODE][i], track_list) {
+			work_prev = work_get(root);
+			hot_inode_item_put(he);
+			work -= (work_prev - work_get(root));
+			if (work <= 0)
+				break;
+		}
+		spin_unlock(&root->t_lock);
+
+		if (work <= 0)
+			break;
+	}
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -579,6 +618,34 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
+static inline unsigned long hot_cnt_get(struct hot_info *root)
+{
+	return (unsigned long)atomic_long_read(&root->hot_cnt);
+}
+
+static void hot_prune_map(struct hot_info *root, unsigned long nr)
+{
+	hot_item_evict(root, nr, hot_cnt_get);
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+
+	if (sc->nr_to_scan == 0)
+		return atomic_long_read(&root->hot_cnt) / 2;
+
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
+
+	hot_prune_map(root, sc->nr_to_scan);
+
+	return atomic_long_read(&root->hot_cnt);
+}
+
 /*
  * Main function to update i/o access frequencies, and it will be called
  * from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -643,6 +710,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
 	spin_lock_init(&root->m_lock);
+	atomic_long_set(&root->hot_cnt, 0);
 
 	for (i = 0; i < MAP_SIZE; i++) {
 		for (j = 0; j < MAX_TYPES; j++)
@@ -663,6 +731,11 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	queue_delayed_work(root->update_wq, &root->update_work,
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
+	/* Register a shrinker callback */
+	root->hot_shrink.shrink = hot_track_prune;
+	root->hot_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&root->hot_shrink);
+
 	return root;
 }
 
@@ -673,6 +746,7 @@ static void hot_tree_exit(struct hot_info *root)
 {
 	struct rb_node *node;
 
+	unregister_shrinker(&root->hot_shrink);
 	cancel_delayed_work_sync(&root->update_work);
 	destroy_workqueue(root->update_wq);
 
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 9095859..adac767 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -82,8 +82,10 @@ struct hot_info {
 	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
 	spinlock_t t_lock;		/* protect tree and map for inode item */
 	spinlock_t m_lock;		/* protect map for range item */
+	atomic_long_t hot_cnt;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct shrinker hot_shrink;
 };
 
 extern void __init hot_cache_init(void);
-- 
1.7.11.7



* [PATCH v4 05/10] VFS hot tracking: Add an ioctl to get hot tracking information
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (3 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 04/10] VFS hot tracking: Add shrinker functionality to curtail memory usage zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 06/10] VFS hot tracking: Add a /proc interface to make the interval tunable zwu.kernel
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in struct hot_freq, and also return a
calculated data temperature based on those metrics.

Optionally, retrieve the last temperature recorded for the hot
map list instead of recalculating it.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
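A user-space caller of the ioctl described above might look roughly like the sketch below. It is an assumption-laden illustration: the struct layout and request number are copied from this patch into a local definition because no installed header provides them, the helper get_heat_info() is hypothetical, and on a kernel without this series the call is expected to fail.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

/* Local copy of the layout added by this patch (not an installed header). */
struct hot_heat_info {
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	uint32_t temp;
	uint8_t live;
	uint8_t resv[3];
	uint64_t future[4];
};

#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Fill *info for the file at path. live != 0 asks for a freshly
 * calculated temperature; live == 0 asks for the last recorded one.
 * Returns 0 on success or a negative errno value.
 */
static int get_heat_info(const char *path, int live, struct hot_heat_info *info)
{
	int fd, ret = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	memset(info, 0, sizeof(*info));
	info->live = live ? 1 : 0;

	if (ioctl(fd, FS_IOC_GET_HEAT_INFO, info) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

With the patch applied, a file that has no tracking item yet would return -ENODATA from this helper, matching the error path in ioctl_heat_info().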
 fs/compat_ioctl.c            |  5 ++++
 fs/ioctl.c                   | 68 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h | 21 ++++++++++++++
 3 files changed, 94 insertions(+)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5d19acf..9026b8a 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1399,6 +1400,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/* Hot data tracking */
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1578,6 +1582,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
 		if (S_ISREG(file_inode(f.file)->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..51a16e1 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,70 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given inode.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct hot_heat_info heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	/* The 'live' field needs to be read from user space */
+	if (copy_from_user((void *)&heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	he = hot_inode_item_lookup(inode->i_sb->s_hot_root, inode->i_ino, 0);
+	if (IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	heat_info.avg_delta_reads =
+		(__u64) he->freq.avg_delta_reads;
+	heat_info.avg_delta_writes =
+		(__u64) he->freq.avg_delta_writes;
+	heat_info.last_read_time =
+	(__u64) timespec_to_ns(&he->freq.last_read_time);
+	heat_info.last_write_time =
+	(__u64) timespec_to_ns(&he->freq.last_write_time);
+	heat_info.num_reads = (__u32) he->freq.nr_reads;
+	heat_info.num_writes = (__u32) he->freq.nr_writes;
+
+	if (heat_info.live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_temp_calc() to recalculate
+		 */
+		heat_info.temp = hot_temp_calc(&he->freq);
+	} else {
+		/* not live temperature, get it from the map list */
+		heat_info.temp = he->freq.last_temp;
+	}
+
+	hot_inode_item_put(he);
+
+	if (copy_to_user(argp, (void *)&heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+err:
+	return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +656,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FS_IOC_GET_HEAT_INFO:
+		return ioctl_heat_info(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index adac767..e480f7d 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -18,6 +18,19 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 
+struct hot_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u32 temp;
+	__u8 live;
+	__u8 resv[3];
+	__u64 future[4]; /* For future expansions */
+};
+
 #ifdef __KERNEL__
 
 #include <linux/rbtree.h>
@@ -88,6 +101,14 @@ struct hot_info {
 	struct shrinker hot_shrink;
 };
 
+/*
+ * Hot data tracking ioctls:
+ *
+ * FS_IOC_GET_HEAT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+			struct hot_heat_info)
+
 extern void __init hot_cache_init(void);
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
-- 
1.7.11.7



* [PATCH v4 06/10] VFS hot tracking: Add a /proc interface to make the interval tunable
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (4 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 05/10] VFS hot tracking: Add an ioctl to get hot tracking information zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 07/10] VFS hot tracking: Add two /proc interfaces to control memory usage zwu.kernel
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a proc interface, hot-update-interval, under
/proc/sys/fs/ in order to turn HOT_UPDATE_INTERVAL into
a tunable parameter. The value is in seconds.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
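Setting the new tunable from a program could be done along the lines of the sketch below. The helper write_sysctl_int() is hypothetical and simply writes a decimal integer to a file; the path /proc/sys/fs/hot-update-interval only exists on a kernel carrying this patch, and writing to it normally requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Write value as decimal text to path. Returns 0 on success, -1 on error. */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, write_sysctl_int("/proc/sys/fs/hot-update-interval", 60) would request a 60 second update period. The worker reads the value when it re-queues itself, so a change should take effect from the next re-arm rather than immediately.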
 fs/hot_tracking.c            | 7 +++++--
 fs/hot_tracking.h            | 2 --
 include/linux/hot_tracking.h | 3 +++
 kernel/sysctl.c              | 7 +++++++
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 037d5db..a3742b7 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,9 @@
 #include <linux/list_sort.h>
 #include "hot_tracking.h"
 
+int sysctl_hot_update_interval __read_mostly = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -594,7 +597,7 @@ static void hot_update_worker(struct work_struct *work)
 
 	/* Insert next delayed work */
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 }
 
 /*
@@ -729,7 +732,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	/* Initialize hot tracking wq and arm one delayed work */
 	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 
 	/* Register a shrinker callback */
 	root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 0be7621..23b1339 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,8 +15,6 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
-#define HOT_UPDATE_INTERVAL 150
-
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index e480f7d..92e3547 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -101,6 +101,9 @@ struct hot_info {
 	struct shrinker hot_shrink;
 };
 
+/* set how often to update temperatures (seconds) */
+extern int sysctl_hot_update_interval;
+
 /*
  * Hot data tracking ioctls:
  *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 07f6fc4..398cc05 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1631,6 +1631,13 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= &pipe_proc_fn,
 		.extra1		= &pipe_min_size,
 	},
+	{
+		.procname	= "hot-update-interval",
+		.data		= &sysctl_hot_update_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.7.11.7



* [PATCH v4 07/10] VFS hot tracking: Add two /proc interfaces to control memory usage
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (5 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 06/10] VFS hot tracking: Add a /proc interface to make the interval tunable zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 08/10] VFS hot tracking: Add documentation zwu.kernel
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce two proc interfaces, hot-mem-high-thresh and
hot-mem-low-thresh, to cap the memory consumed by
hot_inode_item and hot_range_item. Both thresholds are
expressed in units of 1M bytes.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 32 ++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            | 23 +++++++++++++++++++++++
 include/linux/hot_tracking.h |  4 ++++
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 73 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a3742b7..3a08b66 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,6 +15,12 @@
 #include <linux/list_sort.h>
 #include "hot_tracking.h"
 
+int sysctl_hot_mem_high_thresh __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_hot_mem_high_thresh);
+
+int sysctl_hot_mem_low_thresh __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_hot_mem_low_thresh);
+
 int sysctl_hot_update_interval __read_mostly = 150;
 EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
 
@@ -44,6 +50,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 	hr->len = hot_bit_shift(1, RANGE_BITS, true);
 	hr->hot_inode = he;
 	atomic_long_inc(&he->hot_root->hot_cnt);
+	hot_mem_limit_add(he->hot_root, sizeof(struct hot_range_item));
 }
 
 static void hot_range_item_free_cb(struct rcu_head *head)
@@ -53,6 +60,7 @@ static void hot_range_item_free_cb(struct rcu_head *head)
 	struct hot_info *root = hr->hot_inode->hot_root;
 
 	atomic_long_dec(&root->hot_cnt);
+	hot_mem_limit_sub(root, sizeof(struct hot_range_item));
 	kmem_cache_free(hot_range_item_cachep, hr);
 }
 
@@ -238,6 +246,7 @@ static void hot_inode_item_init(struct hot_inode_item *he,
 	he->hot_root = root;
 	spin_lock_init(&he->i_lock);
 	atomic_long_inc(&root->hot_cnt);
+	hot_mem_limit_add(root, sizeof(struct hot_inode_item));
 }
 
 static void hot_inode_item_free_cb(struct rcu_head *head)
@@ -247,6 +256,7 @@ static void hot_inode_item_free_cb(struct rcu_head *head)
 	struct hot_info *root = he->hot_root;
 
 	atomic_long_dec(&root->hot_cnt);
+	hot_mem_limit_sub(root, sizeof(struct hot_inode_item));
 	kmem_cache_free(hot_inode_item_cachep, he);
 }
 
@@ -559,6 +569,25 @@ static void hot_item_evict(struct hot_info *root, unsigned long work,
 	}
 }
 
+static void hot_mem_evict(struct hot_info *root)
+{
+	unsigned long sum, thresh;
+
+	if (sysctl_hot_mem_low_thresh == 0 ||
+		sysctl_hot_mem_high_thresh == 0 ||
+		(sysctl_hot_mem_high_thresh < sysctl_hot_mem_low_thresh))
+		return;
+
+	sum = hot_mem_limit_sum(root);
+	/* Note: sysctl_** is in the unit of 1M bytes */
+	thresh = sysctl_hot_mem_high_thresh;
+	thresh *= 1024 * 1024;
+	if (sum <= thresh)
+		return;
+
+	hot_item_evict(root, sum - thresh, hot_mem_limit_sum);
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -572,6 +601,8 @@ static void hot_update_worker(struct work_struct *work)
 	struct rb_node *node;
 	int i;
 
+	hot_mem_evict(root);
+
 	rcu_read_lock();
 	node = rb_first(&root->hot_inode_tree);
 	while (node) {
@@ -785,6 +816,7 @@ int hot_track_init(struct super_block *sb)
 		goto err;
 	}
 
+	hot_mem_limit_init(root);
 	sb->s_hot_root = root;
 
 	printk(KERN_INFO "VFS: Turning on hot tracking\n");
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 23b1339..c9efa5b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -40,4 +40,27 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
 #define AVW_COEFF_POWER 0
 
+/* Memory Tracking Functions. */
+static inline unsigned long hot_mem_limit_sum(struct hot_info *root)
+{
+	return atomic_long_read(&root->mem);
+}
+
+static inline void hot_mem_limit_sub(struct hot_info *root,
+				unsigned long count)
+{
+	atomic_long_sub(count, &root->mem);
+}
+
+static inline void hot_mem_limit_add(struct hot_info *root,
+				unsigned long count)
+{
+	atomic_long_add(count, &root->mem);
+}
+
+static inline void hot_mem_limit_init(struct hot_info *root)
+{
+	atomic_long_set(&root->mem, 0);
+}
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 92e3547..64e1c8a 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -99,10 +99,14 @@ struct hot_info {
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
 	struct shrinker hot_shrink;
+	atomic_long_t mem;
 };
 
 /* set how often to update temperatures (seconds) */
 extern int sysctl_hot_update_interval;
+/* note: sysctl_** is in the unit of 1M bytes */
+extern int sysctl_hot_mem_high_thresh;
+extern int sysctl_hot_mem_low_thresh;
 
 /*
  * Hot data tracking ioctls:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 398cc05..c56aa34 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1632,6 +1632,20 @@ static struct ctl_table fs_table[] = {
 		.extra1		= &pipe_min_size,
 	},
 	{
+		.procname       = "hot-mem-high-thresh",
+		.data           = &sysctl_hot_mem_high_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "hot-mem-low-thresh",
+		.data           = &sysctl_hot_mem_low_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
 		.procname	= "hot-update-interval",
 		.data		= &sysctl_hot_update_interval,
 		.maxlen		= sizeof(int),
-- 
1.7.11.7



* [PATCH v4 08/10] VFS hot tracking: Add documentation
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (6 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 07/10] VFS hot tracking: Add two /proc interfaces to control memory usage zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:49 ` [PATCH v4 09/10] VFS hot tracking, btrfs: Add hot tracking support zwu.kernel
  2013-08-05 14:50 ` [PATCH v4 10/10] VFS hot tracking, xfs: " zwu.kernel
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add documentation for the VFS hot tracking feature.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
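Note (not part of the patch): section 4 of the documentation added below
describes FREQ_POWER as the weight of the newest access in a shift-based
running average. A minimal sketch of that idea follows, assuming the
average is kept in nanoseconds. hot_avg_update() is a hypothetical name and
this is not claimed to be the exact arithmetic used by hot_freq_calc().

```c
#include <assert.h>
#include <stdint.h>

#define FREQ_POWER 4	/* same value as the define in the series */

/*
 * Hypothetical helper: fold one new access delta (ns) into a running
 * average so that the new sample carries a weight of 2^-FREQ_POWER
 * and the old average a weight of 1 - 2^-FREQ_POWER.
 * Assumes old_avg < 2^(64 - FREQ_POWER) so the shift cannot overflow.
 */
uint64_t hot_avg_update(uint64_t old_avg, uint64_t new_delta_ns)
{
	/* old_avg * (2^P - 1), computed with a shift and a subtraction */
	uint64_t weighted_old = (old_avg << FREQ_POWER) - old_avg;

	return (weighted_old + new_delta_ns) >> FREQ_POWER;
}

/* Usage example: a few cases checked with assert(). */
void hot_avg_update_example(void)
{
	/* A zero delta pulls the average down by 1/16th. */
	assert(hot_avg_update(1600, 0) == 1500);
	/* A first sample contributes 1/16th of its value. */
	assert(hot_avg_update(0, 1600) == 100);
	/* A sample equal to the average leaves it unchanged. */
	assert(hot_avg_update(1000, 1000) == 1000);
}
```

The shift-based form avoids storing a history of deltas, which is the
trade-off the documentation describes as a "kludged-together" weighted
average.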
 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 210 +++++++++++++++++++++++++++++
 2 files changed, 212 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..46b2f6f 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..2f4ad19
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,210 @@
+Hot Data Tracking
+
+April, 2013		Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  This feature adds support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (such as number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects which data is "hot". A filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work divides into two parts: the VFS provides the hot data tracking
+function, while each specific FS provides the hot data relocation function.
+As the first step toward this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+The design includes the following parts:
+
+    * Hooks in existing vfs functions to track data access frequency
+
+    * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+    The relationship between super_block and the rb-trees is:
+super_block.s_hot_root -> hot_info.hot_inode_tree
+    Each FS instance finds its hot tracking info via s_hot_root.
+    hot_info has a hot_inode_tree holding per-inode hot information, and
+each hot_inode_item has a hot_range_tree holding per-range hot information.
+
+    * Lists of hot inodes and hot ranges, indexed by temperature
+
+    * A work queue for updating inode heat info
+
+    * A mount option for enabling temperature tracking (-o hot_track;
+disabled by default)
+    * An ioctl to retrieve the frequency information collected for a certain
+file
+    * Ioctls to enable/disable frequency tracking per inode.
+
+The relationships between these structures are as follows:
+
+    * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+    * hot_inode_item contains access frequency data for that inode
+
+    * hot_inode_item holds a heat list node to link the access frequency
+data for that inode
+
+    * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+    * hot_range_item contains access frequency data for that range
+
+    * hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+    * hot_info.heat_inode_map indexes per-inode heat list nodes
+
+    * hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+                          super_block
+                              |
+                              V
+                           hot_info
+                              |
+    +-------------------------+----------------------------------------+
+    |                         |                                        |
+    |                         |                                        |
+    V                         V                                        V
+heat_inode_map           hot_inode_tree                         heat_range_map
+    |                         |                                        |
+    |                         V                                        |
+    |           +-------hot_comm_item--------+                         |
+    |           |       frequency data       |                         |
++---+           |        list_head           |                         |
+|               V                            V                         |
+| ...<--hot_comm_item-->...      ...<--hot_comm_item-->...             |
+        frequency data                 frequency data                  |
+          list_head                      list_head                     |
+       hot_range_tree                  hot_range_tree                  |
+                                             |                         |
+                                             V                         |
+                               +-------hot_comm_item--------+          |
+                               |       frequency data       |          |
+                               |        list_head           |          +---+
+                               V            ^ |             V		   |
+                    <--hot_comm_item-->...  | |  ...<--hot_comm_item-->... |
+                         frequency data               frequency data
+                           list_head                    list_head
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_freq_calc()
+
+  This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+  FREQ_POWER, defined in hot_tracking.h, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (i.e. 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+  The following explains what exactly comprises a unit of heat.
+Six heat values are calculated and combined in order to form an
+overall temperature for the data:
+
+    * NRR - number of reads since mount
+    * NRW - number of writes since mount
+    * LTR - time elapsed since last read (ns)
+    * LTW - time elapsed since last write (ns)
+    * AVR - average delta between recent reads (ns)
+    * AVW - average delta between recent writes (ns)
+
+  These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined in hot_tracking.h to bring the numbers into a reasonable
+range. You can modify these values to fit your needs. However, each heat
+unit is a u32 and thus maxes out at 2^32 - 1. Therefore, you must choose your
+dividers quite carefully or else they could max out or be stuck at zero quite
+easily. (E.g., if you chose AVR_DIVIDER_POWER = 0, only an atime delta below
+2^32 ns, roughly 4s, could ever bring the temperature above zero.)
+
+  Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+    * AVR/AVW cold unit = 2^X ns of average delta
+    * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+  E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+  This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+  To accomplish this, the raw values from the hot_freq_data structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+  Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+  Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+  This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, mount the filesystem with the hot_track option:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] VFS: Turning on hot tracking
+
+2.) Retrieve hot tracking info for a specific file via ioctl().
-- 
1.7.11.7
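
Note (not part of the patch): the heat-unit rules in section 4 of the
documentation above (right-shift by a *_DIVIDER_POWER, cap at a u32, and
"heat unit = HEAT_MAX_VALUE - cold unit" for the time-based values) can be
sketched as follows. heat_from_count() and heat_from_delta() are
hypothetical names, HEAT_MAX_VALUE is assumed here to be 2^32 - 1, and the
weighting by *_COEFF_POWER is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for this sketch: a heat unit is a u32 capped at 2^32 - 1. */
#define HEAT_MAX_VALUE UINT32_MAX

/*
 * Scale a raw value (e.g. NRR/NRW) by its divider power and saturate
 * it to a u32. divider_power must be less than 64.
 */
uint32_t heat_from_count(uint64_t raw, unsigned int divider_power)
{
	uint64_t v = raw >> divider_power;

	return v > HEAT_MAX_VALUE ? HEAT_MAX_VALUE : (uint32_t)v;
}

/*
 * Time-based values (LTR/LTW/AVR/AVW): the scaled delta is a "cold"
 * unit, and the heat unit is HEAT_MAX_VALUE minus that cold unit.
 */
uint32_t heat_from_delta(uint64_t delta_ns, unsigned int divider_power)
{
	return HEAT_MAX_VALUE - heat_from_count(delta_ns, divider_power);
}

/* Usage example: a few cases checked with assert(). */
void heat_unit_example(void)
{
	/* A delta below 2^40 ns has a cold value of 0: maximum heat. */
	assert(heat_from_delta(1000, 40) == HEAT_MAX_VALUE);
	/* One full cold unit lowers the heat by one. */
	assert(heat_from_delta(1ULL << 40, 40) == HEAT_MAX_VALUE - 1);
	/* With a divider power of 0, a delta of 2^32 ns pins heat at 0. */
	assert(heat_from_delta(1ULL << 32, 0) == 0);
}
```

The last case illustrates why the dividers need care: too small a divider
saturates the cold unit after a few seconds, and too large a divider keeps
it at zero.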



* [PATCH v4 09/10] VFS hot tracking, btrfs: Add hot tracking support
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (7 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 08/10] VFS hot tracking: Add documentation zwu.kernel
@ 2013-08-05 14:49 ` zwu.kernel
  2013-08-05 14:50 ` [PATCH v4 10/10] VFS hot tracking, xfs: " zwu.kernel
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:49 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce a new mount option, '-o hot_track',
and add parsing support for it.

Usage examples:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
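Note (not part of the patch): the usage examples above all rely on
hot_track being recognized as one token of a comma-separated option
string. A minimal userspace sketch of that matching follows. It does not
use the kernel's match_table_t parser that the diff below extends;
parse_hot_track() and MOUNT_HOT_TRACK are hypothetical names chosen for
illustration.

```c
#include <assert.h>
#include <string.h>

/* Mirrors the bit chosen for BTRFS_MOUNT_HOT_TRACK in this patch. */
#define MOUNT_HOT_TRACK (1UL << 23)

/*
 * Hypothetical helper: scan a comma-separated option string and return
 * MOUNT_HOT_TRACK if a token exactly equal to "hot_track" is present.
 */
unsigned long parse_hot_track(const char *options)
{
	const char *p = options;
	unsigned long flags = 0;

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of this token */

		if (len == strlen("hot_track") &&
		    !strncmp(p, "hot_track", len))
			flags |= MOUNT_HOT_TRACK;

		p += len;
		if (*p == ',')
			p++;		/* skip the separator */
	}
	return flags;
}

/* Usage example: the option strings from the commit message. */
void parse_hot_track_example(void)
{
	assert(parse_hot_track("hot_track") == MOUNT_HOT_TRACK);
	assert(parse_hot_track("nouser,hot_track,loop") == MOUNT_HOT_TRACK);
	assert(parse_hot_track("hot_track,nouser") == MOUNT_HOT_TRACK);
	assert(parse_hot_track("nouser,loop") == 0);
}
```

Matching whole tokens keeps a longer option name that merely starts with
"hot_track" from enabling the feature by accident.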
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e795bf1..cabb11a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1967,6 +1967,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8eb6191..ba0f4d9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
 #include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -308,6 +309,10 @@ static void btrfs_put_super(struct super_block *sb)
 	 * last process that kept it busy.  Or segfault in the aforementioned
 	 * process...  Whom would you report that to?
 	 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -320,7 +325,7 @@ enum {
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
-	Opt_check_integrity_print_mask, Opt_fatal_errors,
+	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
 	Opt_err,
 };
 
@@ -361,6 +366,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_err, NULL},
 };
 
@@ -626,6 +632,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -842,11 +851,20 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		err = hot_track_init(sb);
+		if (err)
+			goto fail_hot;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_hot:
+	dput(sb->s_root);
+	sb->s_root = NULL;
 fail_close:
 	close_ctree(fs_info->tree_root);
 	return err;
@@ -942,6 +960,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",skip_balance");
 	if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
 		seq_puts(seq, ",fatal_errors=panic");
+	if (btrfs_test_opt(root, HOT_TRACK))
+		seq_puts(seq, ",hot_track");
 	return 0;
 }
 
-- 
1.7.11.7



* [PATCH v4 10/10] VFS hot tracking, xfs: Add hot tracking support
  2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
                   ` (8 preceding siblings ...)
  2013-08-05 14:49 ` [PATCH v4 09/10] VFS hot tracking, btrfs: Add hot tracking support zwu.kernel
@ 2013-08-05 14:50 ` zwu.kernel
  9 siblings, 0 replies; 11+ messages in thread
From: zwu.kernel @ 2013-08-05 14:50 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, sekharan, Dave Chinner, Zhi Yong Wu

From: Dave Chinner <dchinner@redhat.com>

Connect up the VFS hot tracking support so that XFS filesystems
can make use of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/xfs/xfs_mount.h |  1 +
 fs/xfs/xfs_super.c | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4e374d4..948a070 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -222,6 +222,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_WSYNC		(1ULL << 0)	/* for nfs - all metadata ops
 						   must be synchronous except
 						   for space allocations */
+#define XFS_MOUNT_HOTTRACK      (1ULL << 1)     /* hot tracking */
 #define XFS_MOUNT_WAS_CLEAN	(1ULL << 3)
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1d68ffc..e3ea98b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -62,6 +62,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/parser.h>
+#include <linux/hot_tracking.h>
 
 static const struct super_operations xfs_super_operations;
 static kmem_zone_t *xfs_ioend_zone;
@@ -115,6 +116,7 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed logging disabled */
 #define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
 #define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
+#define MNTOPT_HOTTRACK    "hot_track"  /* hot tracking */
 
 /*
  * Table driven mount option parser.
@@ -381,6 +383,8 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
 			mp->m_flags &= ~XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_HOTTRACK)) {
+			mp->m_flags |= XFS_MOUNT_HOTTRACK;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -510,6 +514,7 @@ xfs_showargs(
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ XFS_MOUNT_SMALL_INUMS,	"," MNTOPT_32BITINODE },
+		{ XFS_MOUNT_HOTTRACK,		"," MNTOPT_HOTTRACK },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
@@ -1053,6 +1058,9 @@ xfs_fs_put_super(
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK)
+		hot_track_exit(sb);
+
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
 
@@ -1500,8 +1508,18 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (mp->m_flags & XFS_MOUNT_HOTTRACK) {
+		error = hot_track_init(sb);
+		if (error)
+			goto out_free_root;
+	}
+
 	return 0;
 
+ out_free_root:
+	dput(sb->s_root);
+	sb->s_root = NULL;
+
  out_filestream_unmount:
 	xfs_filestream_unmount(mp);
  out_free_sb:
-- 
1.7.11.7



end of thread, other threads:[~2013-08-05 15:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-05 14:49 [PATCH v4 00/10] VFS hot tracking zwu.kernel
2013-08-05 14:49 ` [PATCH v4 01/10] VFS hot tracking: Define basic data structures and functions zwu.kernel
2013-08-05 14:49 ` [PATCH v4 02/10] VFS hot tracking: Track IO and record heat information zwu.kernel
2013-08-05 14:49 ` [PATCH v4 03/10] VFS hot tracking: Add a workqueue to move items between hot maps zwu.kernel
2013-08-05 14:49 ` [PATCH v4 04/10] VFS hot tracking: Add shrinker functionality to curtail memory usage zwu.kernel
2013-08-05 14:49 ` [PATCH v4 05/10] VFS hot tracking: Add an ioctl to get hot tracking information zwu.kernel
2013-08-05 14:49 ` [PATCH v4 06/10] VFS hot tracking: Add a /proc interface to make the interval tunable zwu.kernel
2013-08-05 14:49 ` [PATCH v4 07/10] VFS hot tracking: Add two /proc interfaces to control memory usage zwu.kernel
2013-08-05 14:49 ` [PATCH v4 08/10] VFS hot tracking: Add documentation zwu.kernel
2013-08-05 14:49 ` [PATCH v4 09/10] VFS hot tracking, btrfs: Add hot tracking support zwu.kernel
2013-08-05 14:50 ` [PATCH v4 10/10] VFS hot tracking, xfs: " zwu.kernel
