* [PATCH v3 00/13] VFS hot tracking
@ 2013-06-21 12:17 zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 01/13] VFS hot tracking: introduce some data structures zwu.kernel
                   ` (14 more replies)
  0 siblings, 15 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  This patchset introduces a hot tracking function in the
VFS layer, which keeps track of real disk I/O in memory.
With it, you can easily learn more details about disk I/O and
detect where the disk I/O hot spots are. A specific FS can also
make use of it to do accurate defragmentation, hot relocation
support, etc.

  After V1 was sent out, Chandra Seetharaman reviewed it and
made a lot of comments; many thanks to him. Now it's time to
send out V3 for external review. Any comments or ideas are
appreciated, thanks.

NOTE:

  The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
  If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking

  For usage instructions, further information and the performance
report, please see hot_tracking.txt in Documentation and the
following links:
  1.) http://lwn.net/Articles/525651/
  2.) https://lkml.org/lkml/2012/12/20/199

Changelog from v2:
 - Added memory capping function for hot items [Zhiyong]
 - Cleaned up aging function [Zhiyong]

v2:
 - Refactored to be under RCU [Chandra Seetharaman]
 - Merged some code changes [Chandra Seetharaman]
 - Fixed some issues [Chandra Seetharaman]

v1:
 - Solved the 64-bit inode number issue. [David Sterba]
 - Embedded struct hot_type in struct file_system_type [Darrick J. Wong]
 - Cleaned up some issues [David Sterba]
 - Use a static hot debugfs root [Greg KH]

rfcv4:
 - Introduce hot func registering framework [Zhiyong]
 - Remove global variable for hot tracking [Zhiyong]
 - Add btrfs hot tracking support [Zhiyong]

rfcv3:
 1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Turned some macros into tunables [Zhiyong, Liu Zheng]
       TIME_TO_KICK and HEAT_UPDATE_DELAY
 4.) Cleaned up a lot of other issues [Dave Chinner]


rfcv2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one.
                                        [Marco Stornelli, Dave Chinner]

Zhi Yong Wu (13):
  VFS hot tracking: introduce some data structures
  VFS hot tracking: add i/o freq tracking hooks
  VFS hot tracking: add one wq to update hot map
  VFS hot tracking: register one shrinker
  VFS hot tracking, rcu: introduce one rcu macro for list
  VFS hot tracking, seq_file: add seq_list rcu interfaces
  VFS hot tracking: add debugfs support
  VFS hot tracking: add one ioctl interface
  VFS hot tracking, procfs: add one proc interface
  VFS hot tracking: add memory caping function
  VFS hot tracking, btrfs: add hot tracking support
  VFS hot tracking: add documentation
  VFS hot tracking: add fs hot type support

 Documentation/filesystems/00-INDEX         |    2 +
 Documentation/filesystems/hot_tracking.txt |  252 ++++++
 fs/Makefile                                |    2 +-
 fs/btrfs/ctree.h                           |    1 +
 fs/btrfs/super.c                           |   22 +-
 fs/compat_ioctl.c                          |    5 +
 fs/dcache.c                                |    2 +
 fs/direct-io.c                             |    5 +
 fs/hot_tracking.c                          | 1318 ++++++++++++++++++++++++++++
 fs/hot_tracking.h                          |   87 ++
 fs/ioctl.c                                 |   70 ++
 fs/namei.c                                 |    2 +
 fs/seq_file.c                              |   37 +
 include/linux/fs.h                         |    5 +
 include/linux/hot_tracking.h               |  176 ++++
 include/linux/rculist.h                    |    5 +
 include/linux/seq_file.h                   |    7 +
 kernel/sysctl.c                            |   21 +
 mm/filemap.c                               |    6 +
 mm/page-writeback.c                        |   12 +
 mm/readahead.c                             |    6 +
 21 files changed, 2041 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

-- 
1.7.11.7



* [PATCH v3 01/13] VFS hot tracking: introduce some data structures
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 02/13] VFS hot tracking: add i/o freq tracking hooks zwu.kernel
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Define one root structure, hot_info, which is hooked up
in super_block and holds the RB tree root, the hot map
lists and some other information.
  Add hot_inode_tree to keep track of frequently accessed
files, keyed by {inode, offset}. The trees contain
hot_inode_items representing those files and ranges.
  Having these trees means that the VFS can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
  Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and keeps track of
its access frequency and the tree of ranges in this file,
while the latter represents a file range of one inode.
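
  To illustrate the layout described above: both item types embed
a hot_comm_item, and its flags value tells which enclosing type to
recover with container_of(), as hot_comm_item_free_cb() does in
this patch. The following is only a userspace sketch under
simplifying assumptions: container_of() is redefined locally,
hot_comm_item is reduced to its flags, and the demo_ino and
demo_start fields as well as hot_demo_key() are hypothetical names
that are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same values as the enum in include/linux/hot_tracking.h. */
enum { TYPE_INODE = 0, TYPE_RANGE, MAX_TYPES };

/* Reduced to the type flag; stands in for hot_freq_data.flags. */
struct hot_comm_item {
	unsigned int flags;
};

struct hot_inode_item {
	struct hot_comm_item hot_inode;	/* embedded common part */
	unsigned long long demo_ino;	/* hypothetical demo payload */
};

struct hot_range_item {
	struct hot_comm_item hot_range;	/* embedded common part */
	unsigned long long demo_start;	/* hypothetical demo payload */
};

/*
 * Given only the common part, pick the enclosing type from the
 * flags and return that item's demo payload.
 */
unsigned long long hot_demo_key(struct hot_comm_item *ci)
{
	assert(ci->flags < MAX_TYPES);

	if (ci->flags == TYPE_RANGE) {
		struct hot_range_item *hr =
			container_of(ci, struct hot_range_item, hot_range);
		return hr->demo_start;
	} else {
		struct hot_inode_item *he =
			container_of(ci, struct hot_inode_item, hot_inode);
		return he->demo_ino;
	}
}
```

  Passing &he.hot_inode of an item whose flags are TYPE_INODE yields
its demo_ino, and passing &hr.hot_range of a TYPE_RANGE item yields
its demo_start, so one callback can serve both item types.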

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/Makefile                  |   2 +-
 fs/dcache.c                  |   2 +
 fs/hot_tracking.c            | 209 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  17 ++++
 include/linux/fs.h           |   4 +
 include/linux/hot_tracking.h | 103 +++++++++++++++++++++
 6 files changed, 336 insertions(+), 1 deletion(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..5f9b8f1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		attr.o bad_inode.o file.o filesystems.o namespace.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o \
-		stack.o fs_struct.o statfs.o
+		stack.o fs_struct.o statfs.o hot_tracking.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..9d7c2af 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
 #include <linux/rculist_bl.h>
 #include <linux/prefetch.h>
 #include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3094,4 +3095,5 @@ void __init vfs_caches_init(unsigned long mempages)
 	mnt_init();
 	bdev_cache_init();
 	chrdev_init();
+	hot_cache_init();
 }
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..6bf4229
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,209 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/list_sort.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+static void hot_inode_item_free(struct kref *kref);
+
+static void hot_comm_item_free_cb(struct rcu_head *head)
+{
+	struct hot_comm_item *ci = container_of(head,
+				struct hot_comm_item, c_rcu);
+
+	if (ci->hot_freq_data.flags == TYPE_RANGE) {
+		struct hot_range_item *hr = container_of(ci,
+				struct hot_range_item, hot_range);
+		kmem_cache_free(hot_range_item_cachep, hr);
+	} else {
+		struct hot_inode_item *he = container_of(ci,
+				struct hot_inode_item, hot_inode);
+		kmem_cache_free(hot_inode_item_cachep, he);
+	}
+}
+
+static void hot_range_item_free(struct kref *kref)
+{
+	struct hot_comm_item *ci = container_of(kref,
+		struct hot_comm_item, refs);
+	struct hot_range_item *hr = container_of(ci,
+		struct hot_range_item, hot_range);
+
+	hr->hot_inode = NULL;
+
+	call_rcu(&hr->hot_range.c_rcu, hot_comm_item_free_cb);
+}
+
+/*
+ * Drops the reference count on hot_comm_item by one
+ * and frees the structure if the reference count hits zero
+ */
+void hot_comm_item_put(struct hot_comm_item *ci)
+{
+	kref_put(&ci->refs, (ci->hot_freq_data.flags == TYPE_RANGE) ?
+			hot_range_item_free : hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_comm_item_put);
+
+static void hot_comm_item_unlink(struct hot_info *root,
+				struct hot_comm_item *ci)
+{
+	if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+		hot_comm_item_put(ci);
+	}
+}
+
+/*
+ * Frees the entire hot_range_tree.
+ */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+	struct hot_info *root = he->hot_root;
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+
+	/* Unlink every range item hanging off this inode item */
+	rcu_read_lock();
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		node = rb_next(node);
+		hot_comm_item_unlink(root, ci);
+	}
+	rcu_read_unlock();
+
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+	struct hot_comm_item *ci = container_of(kref,
+			struct hot_comm_item, refs);
+	struct hot_inode_item *he = container_of(ci,
+			struct hot_inode_item, hot_inode);
+
+	hot_range_tree_free(he);
+	he->hot_root = NULL;
+
+	call_rcu(&he->hot_inode.c_rcu, hot_comm_item_free_cb);
+}
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+void __init hot_cache_init(void)
+{
+	hot_inode_item_cachep = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_inode_item_cachep)
+		return;
+
+	hot_range_item_cachep = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			NULL);
+	if (!hot_range_item_cachep)
+		kmem_cache_destroy(hot_inode_item_cachep);
+}
+EXPORT_SYMBOL_GPL(hot_cache_init);
+
+static struct hot_info *hot_tree_init(struct super_block *sb)
+{
+	struct hot_info *root;
+	int i, j;
+
+	root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+	if (!root) {
+		printk(KERN_ERR "%s: Failed to malloc memory for "
+				"hot_info\n", __func__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	root->hot_inode_tree = RB_ROOT;
+	spin_lock_init(&root->t_lock);
+	spin_lock_init(&root->m_lock);
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		for (j = 0; j < MAX_TYPES; j++)
+			INIT_LIST_HEAD(&root->hot_map[j][i]);
+	}
+
+	return root;
+}
+
+/*
+ * Frees the entire hot tree.
+ */
+static void hot_tree_exit(struct hot_info *root)
+{
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		struct hot_inode_item *he;
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		he = container_of(ci, struct hot_inode_item, hot_inode);
+		node = rb_next(node);
+		hot_comm_item_unlink(root, &he->hot_inode);
+	}
+	rcu_read_unlock();
+}
+
+/*
+ * Initialize the data structures for hot tracking.
+ * This function will be called by *_fill_super()
+ * when filesystem is mounted.
+ */
+int hot_track_init(struct super_block *sb)
+{
+	struct hot_info *root;
+
+	root = hot_tree_init(sb);
+	if (IS_ERR(root))
+		return PTR_ERR(root);
+
+	sb->s_hot_root = root;
+
+	printk(KERN_INFO "VFS: Turning on hot data tracking\n");
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
+/*
+ * This function will be called by *_put_super()
+ * when filesystem is umounted, or also by *_fill_super()
+ * in some exceptional cases.
+ */
+void hot_track_exit(struct super_block *sb)
+{
+	struct hot_info *root = sb->s_hot_root;
+
+	hot_tree_exit(root);
+	sb->s_hot_root = NULL;
+	kfree(root);
+}
+EXPORT_SYMBOL_GPL(hot_track_exit);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..a2ee95f
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,17 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/hot_tracking.h>
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..ee2c54f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1322,6 +1323,9 @@ struct super_block {
 
 	/* Being remounted read-only */
 	int s_readonly_remount;
+
+	/* Hot data tracking */
+	struct hot_info *s_hot_root;
 };
 
 /* superblock cache pruning functions */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..b57de1f
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,103 @@
+/*
+ *  include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot data tracking
+ * structures etc.
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+#define MAP_BITS 8
+#define MAP_SIZE (1 << MAP_BITS)
+
+/* values for hot_freq_data flags */
+enum {
+	TYPE_INODE = 0,
+	TYPE_RANGE,
+	MAX_TYPES,
+};
+
+enum {
+	HOT_DELETING,
+};
+
+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq_data {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u32 flags;
+	u32 last_temp;
+};
+
+/* The common info for both following structures */
+struct hot_comm_item {
+	struct hot_freq_data hot_freq_data;	/* frequency data */
+	struct kref refs;
+	struct rb_node rb_node;			/* rbtree index */
+	unsigned long delete_flag;
+	struct rcu_head c_rcu;
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct hot_comm_item hot_inode; /* node in hot_inode_tree */
+	struct rb_root hot_range_tree;	/* tree of ranges */
+	spinlock_t i_lock;		/* protect above tree */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+	struct hot_comm_item hot_range;
+	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
+};
+
+struct hot_info {
+	struct rb_root hot_inode_tree;
+	spinlock_t t_lock;				/* protect above tree */
+	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
+	spinlock_t m_lock;
+};
+
+extern void __init hot_cache_init(void);
+extern int hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+extern void hot_comm_item_put(struct hot_comm_item *ci);
+
+static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
+{
+	if (dir)
+		return counter << bits;
+	else
+		return counter >> bits;
+}
+
+#endif /* __KERNEL__ */
+
+#endif  /* _LINUX_HOTTRACK_H */
-- 
1.7.11.7



* [PATCH v3 02/13] VFS hot tracking: add i/o freq tracking hooks
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 01/13] VFS hot tracking: introduce some data structures zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 03/13] VFS hot tracking: add one wq to update hot map zwu.kernel
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add i/o freq tracking hooks to the real read/write code paths:
read_pages(), do_writepages(), do_generic_file_read(),
and __blockdev_direct_IO().
  Currently each FS has one RB tree to track i/o freqs for
all inodes which had real disk i/o, while every inode has its
own RB tree to track i/o freqs for all of its extents.
  When real disk i/o for an inode is done, its own i/o freq is
created or updated in the per-FS RB tree, and the i/o freqs for
the affected extents are created or updated in the per-inode
RB tree.
  Each of the two structures hot_inode_item and hot_range_item
contains a hot_freq_data struct with its access frequency metrics
(number of {reads,writes}, last {read,write} time, frequency of
{reads,writes}).
  Also, each hot_inode_item contains one hot_range_tree,
which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.
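
  As a rough illustration of how one I/O maps onto the per-inode
ranges: hot_update_freqs() in this patch shifts byte offsets by
RANGE_BITS (20, i.e. 1 MiB ranges) and walks every range index that
the I/O touches. The following userspace sketch mirrors only that
arithmetic; the hot_demo_* helper names are hypothetical and do not
exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_BITS 20	/* 1 MiB sub-file ranges, as in fs/hot_tracking.h */

/* Index of the first range touched by an I/O starting at byte 'start'. */
uint64_t hot_demo_first_range(uint64_t start)
{
	return start >> RANGE_BITS;
}

/* One past the index of the last range touched by [start, start + len). */
uint64_t hot_demo_end_range(uint64_t start, uint64_t len)
{
	uint64_t range_size = (uint64_t)1 << RANGE_BITS;

	assert(len > 0);
	return (start + len + range_size - 1) >> RANGE_BITS;
}

/* Number of range items that one I/O would create or update. */
uint64_t hot_demo_nr_ranges(uint64_t start, uint64_t len)
{
	return hot_demo_end_range(start, len) - hot_demo_first_range(start);
}
```

  With these helpers, a 4 KiB I/O at offset 0 touches one range,
while an 8 KiB I/O that starts 4 KiB before the 1 MiB boundary
touches two.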

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/direct-io.c               |   5 +
 fs/hot_tracking.c            | 284 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   4 +
 fs/namei.c                   |   2 +
 include/linux/hot_tracking.h |  17 +++
 mm/filemap.c                 |   6 +
 mm/page-writeback.c          |  12 ++
 mm/readahead.c               |   6 +
 8 files changed, 336 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..6cb0598 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -38,6 +38,7 @@
 #include <linux/atomic.h>
 #include <linux/prefetch.h>
 #include <linux/aio.h>
+#include "hot_tracking.h"
 
 /*
  * How many user pages to map in one call to get_user_pages().  This determines
@@ -1295,6 +1296,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	prefetch(bdev->bd_queue);
 	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
+	/* Hot data tracking */
+	hot_update_freqs(inode, offset, iov_length(iov, nr_segs),
+			rw & WRITE);
+
 	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
 				     nr_segs, get_block, end_io,
 				     submit_io, flags);
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 6bf4229..cc899f4 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -26,6 +26,26 @@ static struct kmem_cache *hot_range_item_cachep __read_mostly;
 
 static void hot_inode_item_free(struct kref *kref);
 
+static void hot_comm_item_init(struct hot_comm_item *ci, int type)
+{
+	kref_init(&ci->refs);
+	clear_bit(HOT_DELETING, &ci->delete_flag);
+	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
+	ci->hot_freq_data.avg_delta_reads = (u64) -1;
+	ci->hot_freq_data.avg_delta_writes = (u64) -1;
+	ci->hot_freq_data.flags = type;
+}
+
+static void hot_range_item_init(struct hot_range_item *hr,
+			struct hot_inode_item *he, loff_t start)
+{
+	hr->start = start;
+	hr->len = hot_shift(1, RANGE_BITS, true);
+	hr->hot_inode = he;
+	hr->storage_type = -1;
+	hot_comm_item_init(&hr->hot_range, TYPE_RANGE);
+}
+
 static void hot_comm_item_free_cb(struct rcu_head *head)
 {
 	struct hot_comm_item *ci = container_of(head,
@@ -65,10 +85,27 @@ void hot_comm_item_put(struct hot_comm_item *ci)
 }
 EXPORT_SYMBOL_GPL(hot_comm_item_put);
 
+/*
+ * root->t_lock or he->i_lock is acquired in this function
+ */
 static void hot_comm_item_unlink(struct hot_info *root,
 				struct hot_comm_item *ci)
 {
 	if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+		if (ci->hot_freq_data.flags == TYPE_RANGE) {
+			struct hot_range_item *hr = container_of(ci,
+					struct hot_range_item, hot_range);
+			struct hot_inode_item *he = hr->hot_inode;
+
+			spin_lock(&he->i_lock);
+			rb_erase(&ci->rb_node, &he->hot_range_tree);
+			spin_unlock(&he->i_lock);
+		} else {
+			spin_lock(&root->t_lock);
+			rb_erase(&ci->rb_node, &root->hot_inode_tree);
+			spin_unlock(&root->t_lock);
+		}
+
 		hot_comm_item_put(ci);
 	}
 }
@@ -94,6 +131,15 @@ static void hot_range_tree_free(struct hot_inode_item *he)
 
 }
 
+static void hot_inode_item_init(struct hot_inode_item *he,
+			struct hot_info *hot_root, u64 ino)
+{
+	he->i_ino = ino;
+	he->hot_root = hot_root;
+	spin_lock_init(&he->i_lock);
+	hot_comm_item_init(&he->hot_inode, TYPE_INODE);
+}
+
 static void hot_inode_item_free(struct kref *kref)
 {
 	struct hot_comm_item *ci = container_of(kref,
@@ -107,6 +153,195 @@ static void hot_inode_item_free(struct kref *kref)
 	call_rcu(&he->hot_inode.c_rcu, hot_comm_item_free_cb);
 }
 
+/* root->t_lock is acquired in this function. */
+struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root, u64 ino, int alloc)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_comm_item *ci;
+	struct hot_inode_item *he, *he_new = NULL;
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&root->t_lock);
+	p = &root->hot_inode_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		ci = rb_entry(parent, struct hot_comm_item, rb_node);
+		he = container_of(ci, struct hot_inode_item, hot_inode);
+		if (ino < he->i_ino)
+			p = &(*p)->rb_left;
+		else if (ino > he->i_ino)
+			p = &(*p)->rb_right;
+		else {
+			hot_comm_item_get(&he->hot_inode);
+			spin_unlock(&root->t_lock);
+			if (he_new)
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the inode. Free the
+				 * newly allocated item.
+				 */
+				kmem_cache_free(hot_inode_item_cachep, he_new);
+
+			if (test_bit(HOT_DELETING, &he->hot_inode.delete_flag))
+				return ERR_PTR(-ENOENT);
+
+			return he;
+		}
+	}
+
+	if (he_new) {
+		rb_link_node(&he_new->hot_inode.rb_node, parent, p);
+		rb_insert_color(&he_new->hot_inode.rb_node,
+				&root->hot_inode_tree);
+		hot_comm_item_get(&he_new->hot_inode);
+		spin_unlock(&root->t_lock);
+		return he_new;
+	}
+	spin_unlock(&root->t_lock);
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	he_new = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
+	if (!he_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_inode_item_init(he_new, root, ino);
+
+	goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
+
+void hot_inode_item_delete(struct inode *inode)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+
+	if (!root || !S_ISREG(inode->i_mode))
+		return;
+
+	he = hot_inode_item_lookup(root, inode->i_ino, 0);
+	if (IS_ERR(he))
+		return;
+
+	hot_comm_item_put(&he->hot_inode); /* for lookup */
+	hot_comm_item_unlink(root, &he->hot_inode);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_delete);
+
+/* he->i_lock is acquired in this function. */
+struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he, loff_t start, int alloc)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct hot_comm_item *ci;
+	struct hot_range_item *hr, *hr_new = NULL;
+
+	start = hot_shift(start, RANGE_BITS, true);
+
+	/* walk tree to find insertion point */
+redo:
+	spin_lock(&he->i_lock);
+	p = &he->hot_range_tree.rb_node;
+	while (*p) {
+		parent = *p;
+		ci = rb_entry(parent, struct hot_comm_item, rb_node);
+		hr = container_of(ci, struct hot_range_item, hot_range);
+		if (start < hr->start)
+			p = &(*p)->rb_left;
+		else if (start > (hr->start + hr->len - 1))
+			p = &(*p)->rb_right;
+		else {
+			hot_comm_item_get(&hr->hot_range);
+			spin_unlock(&he->i_lock);
+			if (hr_new)
+				/*
+				 * Lost the race. Somebody else inserted
+				 * the item for the range. Free the
+				 * newly allocated item.
+				 */
+				kmem_cache_free(hot_range_item_cachep, hr_new);
+
+			if (test_bit(HOT_DELETING, &hr->hot_range.delete_flag))
+				return ERR_PTR(-ENOENT);
+
+			return hr;
+		}
+	}
+
+	if (hr_new) {
+		rb_link_node(&hr_new->hot_range.rb_node, parent, p);
+		rb_insert_color(&hr_new->hot_range.rb_node,
+				&he->hot_range_tree);
+		hot_comm_item_get(&hr_new->hot_range);
+		spin_unlock(&he->i_lock);
+		return hr_new;
+	}
+	spin_unlock(&he->i_lock);
+
+	if (!alloc)
+		return ERR_PTR(-ENOENT);
+
+	hr_new = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
+	if (!hr_new)
+		return ERR_PTR(-ENOMEM);
+
+	hot_range_item_init(hr_new, he, start);
+
+	goto redo;
+}
+EXPORT_SYMBOL_GPL(hot_range_item_lookup);
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers.
+ *
+ * avg_delta_{reads,writes} are a kind of simple moving
+ * average of the time difference between each of the last
+ * 2^(FREQ_POWER) reads/writes. If there have not yet been that
+ * many reads or writes, it's likely that the values will be very
+ * large; they are initialized to the largest possible value for the
+ * data type. Simply, we don't want a few fast accesses to a file to
+ * automatically make it appear very hot.
+ */
+static void hot_freq_calc(struct timespec old_atime,
+		struct timespec cur_time, u64 *avg)
+{
+	struct timespec delta_ts;
+	u64 new_delta;
+
+	delta_ts = timespec_sub(cur_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+	*avg = (*avg << FREQ_POWER) - *avg + new_delta;
+	*avg = *avg >> FREQ_POWER;
+}
+
+static void hot_freq_update(struct hot_info *root,
+		struct hot_comm_item *ci, bool write)
+{
+	struct timespec cur_time = current_kernel_time();
+	struct hot_freq_data *freq_data = &ci->hot_freq_data;
+
+	if (write) {
+		freq_data->nr_writes += 1;
+		hot_freq_calc(freq_data->last_write_time,
+				cur_time,
+				&freq_data->avg_delta_writes);
+		freq_data->last_write_time = cur_time;
+	} else {
+		freq_data->nr_reads += 1;
+		hot_freq_calc(freq_data->last_read_time,
+				cur_time,
+				&freq_data->avg_delta_reads);
+		freq_data->last_read_time = cur_time;
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -128,6 +363,55 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
+/*
+ * Main function to update i/o access frequencies. It is called from
+ * the read/write hooks in read_pages(), do_writepages(),
+ * do_generic_file_read(), and __blockdev_direct_IO().
+ */
+void hot_update_freqs(struct inode *inode, loff_t start,
+			size_t len, int rw)
+{
+	struct hot_info *root = inode->i_sb->s_hot_root;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	u64 range_size;
+	loff_t cur, end;
+
+	if (!root || (len == 0) || !S_ISREG(inode->i_mode))
+		return;
+
+	he = hot_inode_item_lookup(root, inode->i_ino, 1);
+	if (IS_ERR(he))
+		return;
+
+	hot_freq_update(root, &he->hot_inode, rw);
+
+	/*
+	 * Align ranges on range size boundary
+	 * to prevent proliferation of range structs
+	 */
+	range_size = hot_shift(1, RANGE_BITS, true);
+	end = hot_shift((start + len + range_size - 1),
+			RANGE_BITS, false);
+	cur = hot_shift(start, RANGE_BITS, false);
+	for (; cur < end; cur++) {
+		hr = hot_range_item_lookup(he, cur, 1);
+		if (IS_ERR(hr)) {
+			WARN(1, "hot_range_item_lookup returns %ld\n",
+				PTR_ERR(hr));
+			hot_comm_item_put(&he->hot_inode);
+			return;
+		}
+
+		hot_freq_update(root, &hr->hot_range, rw);
+
+		hot_comm_item_put(&hr->hot_range);
+	}
+
+	hot_comm_item_put(&he->hot_inode);
+}
+EXPORT_SYMBOL_GPL(hot_update_freqs);
+
 static struct hot_info *hot_tree_init(struct super_block *sb)
 {
 	struct hot_info *root;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index a2ee95f..bb4cb16 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -14,4 +14,8 @@
 
 #include <linux/hot_tracking.h>
 
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+#define FREQ_POWER 4
+
 #endif /* __HOT_TRACKING__ */
diff --git a/fs/namei.c b/fs/namei.c
index 9ed9361..5685445 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3394,6 +3394,8 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
 	if (!dir->i_op->unlink)
 		return -EPERM;
 
+	hot_inode_item_delete(dentry->d_inode);
+
 	mutex_lock(&dentry->d_inode->i_mutex);
 	if (d_mountpoint(dentry))
 		error = -EBUSY;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index b57de1f..1437248 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -67,6 +67,8 @@ struct hot_inode_item {
 	struct hot_comm_item hot_inode; /* node in hot_inode_tree */
 	struct rb_root hot_range_tree;	/* tree of ranges */
 	spinlock_t i_lock;		/* protect above tree */
+	struct hot_info *hot_root;	/* associated hot_info */
+	u64 i_ino;			/* inode number from inode */
 };
 
 /*
@@ -76,6 +78,9 @@ struct hot_inode_item {
 struct hot_range_item {
 	struct hot_comm_item hot_range;
 	struct hot_inode_item *hot_inode;	/* associated hot_inode_item */
+	loff_t start;				/* offset in bytes */
+	size_t len;				/* length in bytes */
+	int storage_type;			/* type of storage */
 };
 
 struct hot_info {
@@ -89,6 +94,13 @@ extern void __init hot_cache_init(void);
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
 extern void hot_comm_item_put(struct hot_comm_item *ci);
+extern void hot_update_freqs(struct inode *inode, loff_t start,
+				size_t len, int rw);
+extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
+						u64 ino, int alloc);
+extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he,
+						loff_t start, int alloc);
+extern void hot_inode_item_delete(struct inode *inode);
 
 static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 {
@@ -98,6 +110,11 @@ static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 		return counter >> bits;
 }
 
+static inline void hot_comm_item_get(struct hot_comm_item *ci)
+{
+	kref_get(&ci->refs);
+}
+
 #endif /* __KERNEL__ */
 
 #endif  /* _LINUX_HOTTRACK_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7905fe7..eb64c49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -1242,6 +1243,11 @@ readpage:
 		 * PG_error will be set again if readpage fails.
 		 */
 		ClearPageError(page);
+
+		/* Hot data tracking */
+		hot_update_freqs(inode, (loff_t)page->index << PAGE_CACHE_SHIFT,
+				PAGE_CACHE_SIZE, 0);
+
 		/* Start the actual read. The read will unlock the page. */
 		error = mapping->a_ops->readpage(filp, page);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..4bbca3a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,6 +36,7 @@
 #include <linux/pagevec.h>
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
+#include <linux/hot_tracking.h>
 #include <trace/events/writeback.h>
 
 /*
@@ -1921,13 +1922,24 @@ EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	loff_t start = 0;
+	size_t count = 0;
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	start = mapping->writeback_index << PAGE_CACHE_SHIFT;
+	count = wbc->nr_to_write;
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
+
+	/* Hot data tracking */
+	hot_update_freqs(mapping->host, start,
+			(count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index daed28d..901396b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/file.h>
+#include <linux/hot_tracking.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -115,6 +116,11 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	unsigned page_idx;
 	int ret;
 
+	/* Hot data tracking */
+	hot_update_freqs(mapping->host,
+			list_to_page(pages)->index << PAGE_CACHE_SHIFT,
+			(size_t)nr_pages * PAGE_CACHE_SIZE, 0);
+
 	blk_start_plug(&plug);
 
 	if (mapping->a_ops->readpages) {
-- 
1.7.11.7



* [PATCH v3 03/13] VFS hot tracking: add one wq to update hot map
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 01/13] VFS hot tracking: introduce some data structures zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 02/13] VFS hot tracking: add i/o freq tracking hooks zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 04/13] VFS hot tracking: register one shrinker zwu.kernel
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add a workqueue per superblock and a delayed_work that
periodically updates the map info on each superblock.
  Two arrays of map lists are defined: one for hot inode
items and the other for hot extent items.
  Each hot item in the RB-tree is first distilled into a
single temperature in the range [0, 255]. If the item is old,
it is not linked or is aged out; otherwise it is linked into
the corresponding array of map lists, which uses the
temperature as its index.
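
  As a rough illustration of the indexing described above, the
following userspace sketch mirrors how a 32-bit temperature is
reduced to a map index and how the "time since last read" term is
inverted so that newer accesses score hotter. It is not the kernel
code: MAP_BITS = 8 is an assumption inferred from the [0, 255]
range, and temp_to_bucket() and recency_heat() are hypothetical
helper names; the shift amounts follow the LTR_* values in
fs/hot_tracking.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAP_BITS 8 /* assumption: 2^8 = 256 buckets, matching [0, 255] */

/* Userspace copy of hot_shift(): shift left if dir is true, else right. */
static inline uint64_t hot_shift(uint64_t counter, uint32_t bits, bool dir)
{
	assert(bits < 64);
	return dir ? counter << bits : counter >> bits;
}

/* Keep the top MAP_BITS bits of the 32-bit temperature as the list index. */
static uint8_t temp_to_bucket(uint32_t temp)
{
	return (uint8_t)hot_shift(temp, 32 - MAP_BITS, false);
}

/*
 * Recency term modelled on hot_temp_calc(): the age is counted in
 * 2^30 ns units (LTR_DIVIDER_POWER), inverted so that newer means
 * hotter, then scaled down by (3 - LTR_COEFF_POWER) = 2 bits.
 */
static uint64_t recency_heat(uint64_t age_ns)
{
	uint64_t units = hot_shift(age_ns, 30, false);
	uint64_t limit = hot_shift(1, 32, true);
	uint64_t heat = units >= limit ? 0 : limit - units;

	return hot_shift(heat, 2, false);
}
```

  With these assumptions, every temperature below 2^24 lands in
bucket 0, and an item untouched for 2^62 ns or more contributes no
recency heat at all.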

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 241 ++++++++++++++++++++++++++++++++++++++++++-
 fs/hot_tracking.h            |  25 +++++
 include/linux/hot_tracking.h |   4 +
 3 files changed, 269 insertions(+), 1 deletion(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index cc899f4..50c6820 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -29,7 +29,9 @@ static void hot_inode_item_free(struct kref *kref);
 static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 {
 	kref_init(&ci->refs);
+	clear_bit(HOT_IN_LIST, &ci->delete_flag);
 	clear_bit(HOT_DELETING, &ci->delete_flag);
+	INIT_LIST_HEAD(&ci->track_list);
 	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
 	ci->hot_freq_data.avg_delta_reads = (u64) -1;
 	ci->hot_freq_data.avg_delta_writes = (u64) -1;
@@ -86,12 +88,24 @@ void hot_comm_item_put(struct hot_comm_item *ci)
 EXPORT_SYMBOL_GPL(hot_comm_item_put);
 
 /*
- * root->t_lock or he->i_lock is acquired in this function
+ * root->t_lock or he->i_lock, and root->m_lock
+ * are acquired in this function
  */
 static void hot_comm_item_unlink(struct hot_info *root,
 				struct hot_comm_item *ci)
 {
 	if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) {
+		bool flag = false;
+		spin_lock(&root->m_lock);
+		if (test_and_clear_bit(HOT_IN_LIST, &ci->delete_flag)) {
+			list_del_rcu(&ci->track_list);
+			flag = true;
+		}
+		spin_unlock(&root->m_lock);
+
+		if (flag)
+			hot_comm_item_put(ci);
+
 		if (ci->hot_freq_data.flags == TYPE_RANGE) {
 			struct hot_range_item *hr = container_of(ci,
 					struct hot_range_item, hot_range);
@@ -343,6 +357,214 @@ static void hot_freq_update(struct hot_info *root,
 }
 
 /*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ *
+ * With the six values, we first do some very rudimentary
+ * "normalizations" to each metric such that they affect the
+ * final temperature calculation exactly the right way. It's
+ * important to note that we still weren't really sure that
+ * these six adjustments were exactly right.
+ * They could definitely use more tweaking and adjustment,
+ * especially in terms of the memory footprint they consume.
+ *
+ * Next, we take the adjusted values and shift them down to
+ * a manageable size, whereafter they are weighted using the
+ * *_COEFF_POWER values and combined to a single temperature
+ * value.
+ */
+static u32 hot_temp_calc(struct hot_comm_item *ci)
+{
+	u32 result = 0;
+	struct hot_freq_data *freq_data = &ci->hot_freq_data;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+	u32 nrr_heat, nrw_heat;
+	u64 ltr_heat, ltw_heat, avr_heat, avw_heat;
+
+	nrr_heat = (u32)hot_shift((u64)freq_data->nr_reads,
+					NRR_MULTIPLIER_POWER, true);
+	nrw_heat = (u32)hot_shift((u64)freq_data->nr_writes,
+					NRW_MULTIPLIER_POWER, true);
+
+	ltr_heat =
+	hot_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+			LTR_DIVIDER_POWER, false);
+	ltw_heat =
+	hot_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+			LTW_DIVIDER_POWER, false);
+
+	avr_heat =
+	hot_shift((((u64) -1) - freq_data->avg_delta_reads),
+			AVR_DIVIDER_POWER, false);
+	avw_heat =
+	hot_shift((((u64) -1) - freq_data->avg_delta_writes),
+			AVW_DIVIDER_POWER, false);
+
+	/* ltr_heat is now guaranteed to be u32 safe */
+	if (ltr_heat >= hot_shift((u64) 1, 32, true))
+		ltr_heat = 0;
+	else
+		ltr_heat = hot_shift((u64) 1, 32, true) - ltr_heat;
+
+	/* ltw_heat is now guaranteed to be u32 safe */
+	if (ltw_heat >= hot_shift((u64) 1, 32, true))
+		ltw_heat = 0;
+	else
+		ltw_heat = hot_shift((u64) 1, 32, true) - ltw_heat;
+
+	/* avr_heat is now guaranteed to be u32 safe */
+	if (avr_heat >= hot_shift((u64) 1, 32, true))
+		avr_heat = (u32) -1;
+
+	/* avw_heat is now guaranteed to be u32 safe */
+	if (avw_heat >= hot_shift((u64) 1, 32, true))
+		avw_heat = (u32) -1;
+
+	nrr_heat = (u32)hot_shift((u64)nrr_heat,
+		(3 - NRR_COEFF_POWER), false);
+	nrw_heat = (u32)hot_shift((u64)nrw_heat,
+		(3 - NRW_COEFF_POWER), false);
+	ltr_heat = hot_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+	ltw_heat = hot_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+	avr_heat = hot_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+	avw_heat = hot_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		(u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result;
+}
+
+/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature.
+ */
+static bool hot_map_update(struct hot_info *root,
+			struct hot_comm_item *ci)
+{
+	u32 temp = hot_temp_calc(ci);
+	u8 cur_temp, prev_temp;
+	bool flag = false;
+
+	cur_temp = (u8)hot_shift((u64)temp,
+				(32 - MAP_BITS), false);
+	prev_temp = (u8)hot_shift((u64)ci->hot_freq_data.last_temp,
+				(32 - MAP_BITS), false);
+
+	if (cur_temp != prev_temp) {
+		u32 type = ci->hot_freq_data.flags;
+		spin_lock(&root->m_lock);
+		if (test_and_clear_bit(HOT_IN_LIST, &ci->delete_flag)) {
+			list_del_rcu(&ci->track_list);
+			flag = true;
+		}
+		spin_unlock(&root->m_lock);
+
+		if (!flag)
+			hot_comm_item_get(ci);
+
+		spin_lock(&root->m_lock);
+		if (test_bit(HOT_DELETING, &ci->delete_flag)) {
+			spin_unlock(&root->m_lock);
+			return true;
+		}
+		set_bit(HOT_IN_LIST, &ci->delete_flag);
+		list_add_tail_rcu(&ci->track_list,
+				&root->hot_map[type][cur_temp]);
+		spin_unlock(&root->m_lock);
+
+		ci->hot_freq_data.last_temp = temp;
+	}
+
+	return false;
+}
+
+/*
+ * Update temperatures for each range item for aging purposes.
+ * If one hot range item is old, it will be aged out.
+ */
+static void hot_range_update(struct hot_inode_item *he,
+				struct hot_info *root)
+{
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+
+	rcu_read_lock();
+	node = rb_first(&he->hot_range_tree);
+	while (node) {
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		node = rb_next(node);
+		if (test_bit(HOT_DELETING, &ci->delete_flag) ||
+			hot_map_update(root, ci))
+			continue;
+	}
+	rcu_read_unlock();
+}
+
+/* Temperature compare function */
+static int hot_temp_cmp(void *priv, struct list_head *a,
+				struct list_head *b)
+{
+	struct hot_comm_item *ap = container_of(a,
+			struct hot_comm_item, track_list);
+	struct hot_comm_item *bp = container_of(b,
+			struct hot_comm_item, track_list);
+
+	int diff = ap->hot_freq_data.last_temp
+			- bp->hot_freq_data.last_temp;
+	if (diff > 0)
+		return -1;
+	if (diff < 0)
+		return 1;
+	return 0;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+	struct hot_info *root = container_of(to_delayed_work(work),
+					struct hot_info, update_work);
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+	struct hot_inode_item *he;
+	int i, j;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		node = rb_next(node);
+		if (test_bit(HOT_DELETING, &ci->delete_flag) ||
+			hot_map_update(root, ci))
+			continue;
+		he = container_of(ci, struct hot_inode_item, hot_inode);
+		hot_range_update(he, root);
+	}
+	rcu_read_unlock();
+
+	/* Sort temperature map info based on last temperature */
+	for (i = 0; i < MAP_SIZE; i++) {
+		for (j = 0; j < MAX_TYPES; j++) {
+			spin_lock(&root->m_lock);
+			list_sort(NULL, &root->hot_map[j][i], hot_temp_cmp);
+			spin_unlock(&root->m_lock);
+		}
+	}
+
+	/* Insert next delayed work */
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+}
+
+/*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
 void __init hot_cache_init(void)
@@ -433,6 +655,20 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 			INIT_LIST_HEAD(&root->hot_map[j][i]);
 	}
 
+	root->update_wq = alloc_workqueue(
+			"hot_update_wq", WQ_NON_REENTRANT, 0);
+	if (!root->update_wq) {
+		printk(KERN_ERR "%s: Failed to create "
+				"hot update workqueue\n", __func__);
+		kfree(root);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/* Initialize hot tracking wq and arm one delayed work */
+	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+	queue_delayed_work(root->update_wq, &root->update_work,
+		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+
 	return root;
 }
 
@@ -444,6 +680,9 @@ static void hot_tree_exit(struct hot_info *root)
 	struct rb_node *node;
 	struct hot_comm_item *ci;
 
+	cancel_delayed_work_sync(&root->update_work);
+	destroy_workqueue(root->update_wq);
+
 	rcu_read_lock();
 	node = rb_first(&root->hot_inode_tree);
 	while (node) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index bb4cb16..8a53c2d 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -12,10 +12,35 @@
 #ifndef __HOT_TRACKING__
 #define __HOT_TRACKING__
 
+#include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
+#define HOT_UPDATE_INTERVAL 150
+#define HOT_AGE_INTERVAL 300
+
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
 
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1437248..02a9521 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -35,6 +35,7 @@ enum {
 
 enum {
 	HOT_DELETING,
+	HOT_IN_LIST,
 };
 
 /*
@@ -60,6 +61,7 @@ struct hot_comm_item {
 	struct rb_node rb_node;			/* rbtree index */
 	unsigned long delete_flag;
 	struct rcu_head c_rcu;
+	struct list_head track_list;		/* link to *_map[] */
 };
 
 /* An item representing an inode and its access frequency */
@@ -88,6 +90,8 @@ struct hot_info {
 	spinlock_t t_lock;				/* protect above tree */
 	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
 	spinlock_t m_lock;
+	struct workqueue_struct *update_wq;
+	struct delayed_work update_work;
 };
 
 extern void __init hot_cache_init(void);
-- 
1.7.11.7



* [PATCH v3 04/13] VFS hot tracking: register one shrinker
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (2 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 03/13] VFS hot tracking: add one wq to update hot map zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 05/13] VFS hot tracking, rcu: introduce one rcu macro for list zwu.kernel
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Register a shrinker to control the amount of memory that
is used in tracking hot regions. If we are throwing inodes
out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking
code is using, even if it means losing useful information.
That is, the shrinker accelerates the aging process.
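
  The eviction order can be pictured with the simplified userspace
model below. It only assumes what the patch itself does: buckets are
walked from index 0 (coldest) upward and items are dropped until the
requested amount of work is done. The per-bucket counters and the
evict_coldest() name are hypothetical stand-ins for the RCU lists
and hot_item_evictor(); MAP_SIZE = 256 is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_SIZE 256 /* assumption: one bucket per 8-bit temperature */

/*
 * Hypothetical model: bucket_count[i] is the number of tracked items
 * whose temperature bucket is i. Evict up to 'budget' items, starting
 * from the coldest bucket, and return the number actually evicted.
 */
static unsigned long evict_coldest(unsigned long bucket_count[MAP_SIZE],
				   unsigned long budget)
{
	unsigned long evicted = 0;
	size_t i;

	assert(bucket_count != NULL);

	for (i = 0; i < MAP_SIZE && evicted < budget; i++) {
		unsigned long take = bucket_count[i];

		/* Never take more than the remaining budget. */
		if (take > budget - evicted)
			take = budget - evicted;
		bucket_count[i] -= take;
		evicted += take;
	}

	return evicted;
}
```

  Tracking the evicted count separately from the budget keeps the
unsigned arithmetic from wrapping below zero.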

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 76 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/hot_tracking.h |  2 ++
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 50c6820..3f3b656 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -103,8 +103,10 @@ static void hot_comm_item_unlink(struct hot_info *root,
 		}
 		spin_unlock(&root->m_lock);
 
-		if (flag)
+		if (flag) {
+			atomic_dec(&root->hot_map_nr);
 			hot_comm_item_put(ci);
+		}
 
 		if (ci->hot_freq_data.flags == TYPE_RANGE) {
 			struct hot_range_item *hr = container_of(ci,
@@ -464,8 +466,10 @@ static bool hot_map_update(struct hot_info *root,
 		}
 		spin_unlock(&root->m_lock);
 
-		if (!flag)
+		if (!flag) {
+			atomic_inc(&root->hot_map_nr);
 			hot_comm_item_get(ci);
+		}
 
 		spin_lock(&root->m_lock);
 		if (test_bit(HOT_DELETING, &ci->delete_flag)) {
@@ -523,6 +527,39 @@ static int hot_temp_cmp(void *priv, struct list_head *a,
 	return 0;
 }
 
+static void hot_item_evictor(struct hot_info *root, unsigned long work,
+			unsigned long (*work_get)(struct hot_info *root))
+{
+	int i;
+
+	if (work <= 0)
+		return;
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		struct hot_comm_item *ci;
+		unsigned long work_prev;
+
+		rcu_read_lock();
+		if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+			rcu_read_unlock();
+			continue;
+		}
+
+		list_for_each_entry_rcu(ci, &root->hot_map[TYPE_INODE][i],
+					track_list) {
+			work_prev = work_get(root);
+			hot_comm_item_unlink(root, ci);
+			work -= (work_prev - work_get(root));
+			if (work <= 0)
+				break;
+		}
+		rcu_read_unlock();
+
+		if (work <= 0)
+			break;
+	}
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -585,6 +622,34 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);
 
+static inline unsigned long hot_nr_get(struct hot_info *root)
+{
+	return (unsigned long)atomic_read(&root->hot_map_nr);
+}
+
+static void hot_prune_map(struct hot_info *root, unsigned long nr)
+{
+	hot_item_evictor(root, nr, hot_nr_get);
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+
+	if (sc->nr_to_scan == 0)
+		return atomic_read(&root->hot_map_nr) / 2;
+
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
+
+	hot_prune_map(root, sc->nr_to_scan);
+
+	return atomic_read(&root->hot_map_nr);
+}
+
 /*
  * Main function to update i/o access frequencies, and it will be called
  * from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -649,6 +714,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
 	spin_lock_init(&root->m_lock);
+	atomic_set(&root->hot_map_nr, 0);
 
 	for (i = 0; i < MAP_SIZE; i++) {
 		for (j = 0; j < MAX_TYPES; j++)
@@ -669,6 +735,11 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	queue_delayed_work(root->update_wq, &root->update_work,
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 
+	/* Register a shrinker callback */
+	root->hot_shrink.shrink = hot_track_prune;
+	root->hot_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&root->hot_shrink);
+
 	return root;
 }
 
@@ -680,6 +751,7 @@ static void hot_tree_exit(struct hot_info *root)
 	struct rb_node *node;
 	struct hot_comm_item *ci;
 
+	unregister_shrinker(&root->hot_shrink);
 	cancel_delayed_work_sync(&root->update_work);
 	destroy_workqueue(root->update_wq);
 
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 02a9521..8cb7526 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -90,8 +90,10 @@ struct hot_info {
 	spinlock_t t_lock;				/* protect above tree */
 	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
 	spinlock_t m_lock;
+	atomic_t hot_map_nr;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct shrinker hot_shrink;
 };
 
 extern void __init hot_cache_init(void);
-- 
1.7.11.7



* [PATCH v3 05/13] VFS hot tracking, rcu: introduce one rcu macro for list
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (3 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 04/13] VFS hot tracking: register one shrinker zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 06/13] VFS hot tracking, seq_file: add seq_list rcu interfaces zwu.kernel
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce an RCU list-traversal macro, __list_for_each_rcu(),
which will be used by the seq_list RCU interfaces.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 include/linux/rculist.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index f4b1001..380b9be 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -218,6 +218,11 @@ static inline void list_splice_init_rcu(struct list_head *list,
 	at->prev = last;
 }
 
+#define __list_for_each_rcu(pos, head)				\
+	for (pos = rcu_dereference(list_next_rcu(head));	\
+	     pos != head;						\
+	     pos = rcu_dereference(list_next_rcu(pos)))
+
 /**
  * list_entry_rcu - get the struct for this entry
  * @ptr:        the &struct list_head pointer.
-- 
1.7.11.7



* [PATCH v3 06/13] VFS hot tracking, seq_file: add seq_list rcu interfaces
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (4 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 05/13] VFS hot tracking, rcu: introduce one rcu macro for list zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 07/13] VFS hot tracking: add debugfs support zwu.kernel
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce a set of RCU interfaces for seq_list:
seq_list_start_rcu(), seq_list_start_head_rcu(), and
seq_list_next_rcu().

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/seq_file.c            | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/seq_file.h |  7 +++++++
 2 files changed, 44 insertions(+)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 774c1eb..301caa7 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -795,6 +795,43 @@ struct list_head *seq_list_next(void *v, struct list_head *head, loff_t *ppos)
 }
 EXPORT_SYMBOL(seq_list_next);
 
+struct list_head *seq_list_start_rcu(struct list_head *head, loff_t pos)
+{
+	struct list_head *lh;
+
+	__list_for_each_rcu(lh, head)
+		if (pos-- == 0)
+			return lh;
+
+	return NULL;
+}
+EXPORT_SYMBOL(seq_list_start_rcu);
+
+struct list_head *seq_list_start_head_rcu(struct list_head *head, loff_t pos)
+{
+	if (!pos)
+		return head;
+
+	return seq_list_start_rcu(head, pos - 1);
+}
+EXPORT_SYMBOL(seq_list_start_head_rcu);
+
+struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+					loff_t *ppos)
+{
+	struct list_head *lh;
+
+	++*ppos;
+	rcu_read_lock();
+	lh = rcu_dereference(((struct list_head *)v)->next);
+	if (lh == head)
+		lh = NULL;
+	rcu_read_unlock();
+
+	return lh;
+}
+EXPORT_SYMBOL(seq_list_next_rcu);
+
 /**
  * seq_hlist_start - start an iteration of a hlist
  * @head: the head of the hlist
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 2da29ac..7e391c9 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -155,6 +155,13 @@ extern struct list_head *seq_list_start_head(struct list_head *head,
 extern struct list_head *seq_list_next(void *v, struct list_head *head,
 		loff_t *ppos);
 
+extern struct list_head *seq_list_start_rcu(struct list_head *head,
+		loff_t pos);
+extern struct list_head *seq_list_start_head_rcu(struct list_head *head,
+		loff_t pos);
+extern struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+		loff_t *ppos);
+
 /*
  * Helpers for iteration over hlist_head-s in seq_files
  */
-- 
1.7.11.7



* [PATCH v3 07/13] VFS hot tracking: add debugfs support
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (5 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 06/13] VFS hot tracking, seq_file: add seq_list rcu interfaces zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 08/13] VFS hot tracking: add one ioctl interface zwu.kernel
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  For each volume, add a directory '<dev_name>' under
/sys/kernel/debug/hot_track/ containing four files: 'inode_stat',
'extent_stat', 'inode_spot', and 'extent_spot'.
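
  A userspace tool could consume 'inode_stat' along the lines of the
sketch below. The format string is the one used by
hot_inode_seq_show() in this patch; struct inode_stat_rec and
parse_inode_stat() are hypothetical names introduced only for
illustration, and debugfs output is not a stable interface, so any
real parser should expect the format to change.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one parsed line; not part of the patch. */
struct inode_stat_rec {
	unsigned long long ino;
	unsigned int reads;
	unsigned int writes;
	unsigned int temp;
};

/*
 * Parse one line of the form
 *   "inode %llu, reads %u, writes %u, temp %d"
 * Returns 0 on success, -1 if the line does not match.
 */
static int parse_inode_stat(const char *line, struct inode_stat_rec *rec)
{
	assert(line != NULL && rec != NULL);

	if (sscanf(line, "inode %llu, reads %u, writes %u, temp %u",
		   &rec->ino, &rec->reads, &rec->writes, &rec->temp) != 4)
		return -1;

	return 0;
}
```

  Each line read from /sys/kernel/debug/hot_track/<dev_name>/inode_stat
would be passed to parse_inode_stat() in turn.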

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 465 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   5 +
 include/linux/hot_tracking.h |   2 +
 3 files changed, 472 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 3f3b656..51e2e9c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -17,9 +17,12 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/list_sort.h>
+#include <linux/debugfs.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"
 
+static struct dentry *hot_debugfs_root;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -461,6 +464,10 @@ static bool hot_map_update(struct hot_info *root,
 		u32 type = ci->hot_freq_data.flags;
 		spin_lock(&root->m_lock);
 		if (test_and_clear_bit(HOT_IN_LIST, &ci->delete_flag)) {
+			if (atomic_read(&root->run_debugfs)) {
+				spin_unlock(&root->m_lock);
+				return true;
+			}
 			list_del_rcu(&ci->track_list);
 			flag = true;
 		}
@@ -601,6 +608,449 @@ static void hot_update_worker(struct work_struct *work)
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 }
 
+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(rcu)
+{
+	struct hot_info *root = seq->private;
+	struct rb_node *node_he, *node_hr;
+	struct hot_comm_item *ci_he, *ci_hr;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	loff_t l = *pos;
+
+	rcu_read_lock();
+	node_he = rb_first(&root->hot_inode_tree);
+	while (node_he) {
+		ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+		he = container_of(ci_he, struct hot_inode_item, hot_inode);
+		node_hr = rb_first(&he->hot_range_tree);
+		while (node_hr) {
+			if (!l--) {
+				ci_hr = rb_entry(node_hr,
+					struct hot_comm_item, rb_node);
+				hr = container_of(ci_hr,
+					struct hot_range_item, hot_range);
+				return hr;
+			}
+			node_hr = rb_next(node_hr);
+		}
+		node_he = rb_next(node_he);
+	}
+
+	return NULL;
+}
+
+static void *hot_range_seq_next(struct seq_file *seq,
+				void *v, loff_t *pos)
+{
+	struct rb_node *node_he, *node_hr;
+	struct hot_comm_item *ci_he, *ci_hr;
+	struct hot_range_item *hr_next = NULL, *hr = v;
+	struct hot_inode_item *he_next;
+
+	(*pos)++;
+	node_hr = rb_next(&hr->hot_range.rb_node);
+	if (node_hr) {
+		ci_hr = rb_entry(node_hr, struct hot_comm_item, rb_node);
+		hr_next = container_of(ci_hr, struct hot_range_item, hot_range);
+
+		return hr_next;
+	}
+
+	node_he = rb_next(&hr->hot_inode->hot_inode.rb_node);
+loop_he:
+	if (node_he) {
+		ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+		he_next = container_of(ci_he, struct hot_inode_item, hot_inode);
+		node_hr = rb_first(&he_next->hot_range_tree);
+		if (node_hr) {
+			ci_hr = rb_entry(node_hr,
+					struct hot_comm_item, rb_node);
+			hr_next = container_of(ci_hr,
+					struct hot_range_item, hot_range);
+		} else {
+			node_he = rb_next(node_he);
+			goto loop_he;
+		}
+	}
+
+	return hr_next;
+}
+
+static void hot_seq_stop(struct seq_file *seq, void *v)
+	__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+static int hot_range_seq_show(struct seq_file *seq, void *v)
+{
+	struct hot_range_item *hr = v;
+	struct hot_inode_item *he = hr->hot_inode;
+	struct hot_freq_data *freq_data;
+
+	freq_data = &hr->hot_range.hot_freq_data;
+	seq_printf(seq, "inode %llu, extent %llu+%llu, " \
+			"reads %u, writes %u, temp %u, " \
+			"storage %s\n",
+			he->i_ino, (unsigned long long)hr->start,
+			(unsigned long long)hr->len,
+			freq_data->nr_reads,
+			freq_data->nr_writes,
+			(u8)hot_shift((u64)freq_data->last_temp,
+					(32 - MAP_BITS), false),
+			(hr->storage_type == 1) ? "nonrot" : "rot");
+
+	return 0;
+}
+
+static void *hot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires(rcu)
+{
+	struct hot_info *root = seq->private;
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+	struct hot_inode_item *he = NULL;
+	loff_t l = *pos;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		if (!l--) {
+			ci = rb_entry(node, struct hot_comm_item, rb_node);
+			he = container_of(ci, struct hot_inode_item, hot_inode);
+			break;
+		}
+		node = rb_next(node);
+	}
+
+	return he;
+}
+
+static void *hot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_inode_item *he_next = NULL, *he = v;
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+
+	(*pos)++;
+	node = rb_next(&he->hot_inode.rb_node);
+	if (node) {
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		he_next = container_of(ci, struct hot_inode_item, hot_inode);
+	}
+
+	return he_next;
+}
+
+static int hot_inode_seq_show(struct seq_file *seq, void *v)
+{
+	struct hot_inode_item *he = v;
+	struct hot_freq_data *freq_data = &he->hot_inode.hot_freq_data;
+
+	seq_printf(seq, "inode %llu, reads %u, writes %u, temp %d\n",
+		he->i_ino,
+		freq_data->nr_reads,
+		freq_data->nr_writes,
+		(u8)hot_shift((u64)freq_data->last_temp,
+				(32 - MAP_BITS), false));
+
+	return 0;
+}
+
+static struct hot_comm_item *hot_spot_seq_start(struct hot_info *root,
+					loff_t *pos, int type)
+	__acquires(rcu)
+{
+	struct hot_comm_item *ci;
+	struct list_head *track_list;
+	int i;
+
+	atomic_inc(&root->run_debugfs);
+
+	rcu_read_lock();
+	for (i = MAP_SIZE - 1; i >= 0; i--) {
+		track_list = seq_list_start_rcu(&root->hot_map[type][i], *pos);
+		if (track_list) {
+			ci = container_of(track_list,
+				struct hot_comm_item, track_list);
+			return ci;
+		}
+	}
+
+	return NULL;
+}
+
+static struct hot_comm_item *hot_spot_seq_next(struct hot_info *root,
+					struct hot_comm_item *ci,
+					loff_t *pos, int type)
+{
+	struct hot_comm_item *ci_next = NULL;
+	struct list_head *track_list;
+	int i;
+
+	i = (int)hot_shift(ci->hot_freq_data.last_temp,
+			(32 - MAP_BITS), false);
+
+	track_list = seq_list_next_rcu(&ci->track_list,
+				&root->hot_map[type][i], pos);
+next:
+	if (track_list)
+		ci_next = container_of(track_list,
+				struct hot_comm_item, track_list);
+	else if (--i >= 0) {
+		track_list = seq_list_next_rcu(&root->hot_map[type][i],
+					&root->hot_map[type][i], pos);
+		goto next;
+	}
+
+	return ci_next;
+}
+
+static void hot_spot_seq_stop(struct seq_file *seq, void *v)
+	__releases(rcu)
+{
+	struct hot_info *root = seq->private;
+
+	atomic_dec(&root->run_debugfs);
+	rcu_read_unlock();
+}
+
+static void *hot_spot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_range_item *hr = NULL;
+	struct hot_comm_item *ci;
+
+	ci = hot_spot_seq_start(root, pos, TYPE_RANGE);
+	if (ci)
+		hr = container_of(ci, struct hot_range_item, hot_range);
+
+	return hr;
+}
+
+static void *hot_spot_range_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_range_item *hr_next = NULL, *hr = v;
+	struct hot_comm_item *ci_next;
+
+	ci_next = hot_spot_seq_next(root, &hr->hot_range, pos, TYPE_RANGE);
+	if (ci_next)
+		hr_next = container_of(ci_next,
+				struct hot_range_item, hot_range);
+
+	return hr_next;
+}
+
+static void *hot_spot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_inode_item *he = NULL;
+	struct hot_comm_item *ci;
+
+	ci = hot_spot_seq_start(root, pos, TYPE_INODE);
+	if (ci)
+		he = container_of(ci, struct hot_inode_item, hot_inode);
+
+	return he;
+}
+
+static void *hot_spot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_inode_item *he_next = NULL, *he = v;
+	struct hot_comm_item *ci_next;
+
+	ci_next = hot_spot_seq_next(root, &he->hot_inode, pos, TYPE_INODE);
+	if (ci_next)
+		he_next = container_of(ci_next,
+				struct hot_inode_item, hot_inode);
+
+	return he_next;
+}
+
+static const struct seq_operations hot_range_seq_ops = {
+	.start = hot_range_seq_start,
+	.next = hot_range_seq_next,
+	.stop = hot_seq_stop,
+	.show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_inode_seq_ops = {
+	.start = hot_inode_seq_start,
+	.next = hot_inode_seq_next,
+	.stop = hot_seq_stop,
+	.show = hot_inode_seq_show
+};
+
+static const struct seq_operations hot_spot_range_seq_ops = {
+	.start = hot_spot_range_seq_start,
+	.next = hot_spot_range_seq_next,
+	.stop = hot_spot_seq_stop,
+	.show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_spot_inode_seq_ops = {
+	.start = hot_spot_inode_seq_start,
+	.next = hot_spot_inode_seq_next,
+	.stop = hot_spot_seq_stop,
+	.show = hot_inode_seq_show
+};
+
+static int hot_range_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_range_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_inode_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_inode_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_spot_range_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_spot_range_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_spot_inode_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_spot_inode_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+/* fops to override for printing range data */
+static const struct file_operations hot_debugfs_range_fops = {
+	.open = hot_range_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations hot_debugfs_inode_fops = {
+	.open = hot_inode_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/* fops to override for printing temperature data */
+static const struct file_operations hot_debugfs_spot_range_fops = {
+	.open = hot_spot_range_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations hot_debugfs_spot_inode_fops = {
+	.open = hot_spot_inode_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct hot_debugfs hot_debugfs[] = {
+	{
+		.name = "extent_stat",
+		.fops  = &hot_debugfs_range_fops,
+	},
+	{
+		.name = "inode_stat",
+		.fops  = &hot_debugfs_inode_fops,
+	},
+	{
+		.name = "extent_spot",
+		.fops  = &hot_debugfs_spot_range_fops,
+	},
+	{
+		.name = "inode_spot",
+		.fops  = &hot_debugfs_spot_inode_fops,
+	},
+};
+
+/* initialize debugfs */
+static int hot_debugfs_init(struct super_block *sb)
+{
+	static const char hot_name[] = "hot_track";
+	struct dentry *dentry;
+	int i, ret = 0;
+
+	/* Check whether the hot debugfs root already exists */
+	if (!hot_debugfs_root) {
+		hot_debugfs_root = debugfs_create_dir(hot_name, NULL);
+		if (IS_ERR(hot_debugfs_root)) {
+			ret = PTR_ERR(hot_debugfs_root);
+			return ret;
+		}
+	}
+
+	/* create debugfs folder for this volume by mounted dev name */
+	sb->s_hot_root->debugfs_dentry =
+			debugfs_create_dir(sb->s_id, hot_debugfs_root);
+	if (IS_ERR(sb->s_hot_root->debugfs_dentry)) {
+		ret = PTR_ERR(sb->s_hot_root->debugfs_dentry);
+		goto root_err;
+	}
+
+	/* create debugfs hot data files */
+	for (i = 0; i < ARRAY_SIZE(hot_debugfs); i++) {
+		dentry = debugfs_create_file(hot_debugfs[i].name,
+					S_IFREG | S_IRUSR | S_IWUSR,
+					sb->s_hot_root->debugfs_dentry,
+					sb->s_hot_root,
+					hot_debugfs[i].fops);
+		if (IS_ERR(dentry)) {
+			ret = PTR_ERR(dentry);
+			goto err;
+		}
+	}
+
+	return 0;
+
+err:
+	debugfs_remove_recursive(sb->s_hot_root->debugfs_dentry);
+
+root_err:
+	if (list_empty(&hot_debugfs_root->d_subdirs)) {
+		debugfs_remove(hot_debugfs_root);
+		hot_debugfs_root = NULL;
+	}
+
+	return ret;
+}
+
+/* remove dentries for debugfs */
+static void hot_debugfs_exit(struct super_block *sb)
+{
+	/* remove all debugfs entries recursively from the volume root */
+	debugfs_remove_recursive(sb->s_hot_root->debugfs_dentry);
+
+	if (list_empty(&hot_debugfs_root->d_subdirs)) {
+		debugfs_remove(hot_debugfs_root);
+		hot_debugfs_root = NULL;
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -715,6 +1165,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	spin_lock_init(&root->t_lock);
 	spin_lock_init(&root->m_lock);
 	atomic_set(&root->hot_map_nr, 0);
+	atomic_set(&root->run_debugfs, 0);
 
 	for (i = 0; i < MAP_SIZE; i++) {
 		for (j = 0; j < MAX_TYPES; j++)
@@ -775,6 +1226,7 @@ static void hot_tree_exit(struct hot_info *root)
 int hot_track_init(struct super_block *sb)
 {
 	struct hot_info *root;
+	int ret;
 
 	root = hot_tree_init(sb);
 	if (IS_ERR(root))
@@ -782,9 +1234,21 @@ int hot_track_init(struct super_block *sb)
 
 	sb->s_hot_root = root;
 
+	ret = hot_debugfs_init(sb);
+	if (ret) {
+		printk(KERN_ERR "%s: hot_debugfs_init error: %d\n",
+				__func__, ret);
+		goto out;
+	}
+
 	printk(KERN_INFO "VFS: Turning on hot data tracking\n");
 
 	return 0;
+
+out:
+	hot_tree_exit(root);
+	sb->s_hot_root = NULL;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(hot_track_init);
 
@@ -797,6 +1261,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = sb->s_hot_root;
 
+	hot_debugfs_exit(sb);
 	hot_tree_exit(root);
 	sb->s_hot_root = NULL;
 	kfree(root);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 8a53c2d..fcc60ac 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -43,4 +43,9 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
 #define AVW_COEFF_POWER 0
 
+struct hot_debugfs {
+	const char *name;
+	const struct file_operations *fops;
+};
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 8cb7526..9f6cd71 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -94,6 +94,8 @@ struct hot_info {
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
 	struct shrinker hot_shrink;
+	struct dentry *debugfs_dentry;
+	atomic_t run_debugfs;
 };
 
 extern void __init hot_cache_init(void);
-- 
1.7.11.7



* [PATCH v3 08/13] VFS hot tracking: add one ioctl interface
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (6 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 07/13] VFS hot tracking: add debugfs support zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 09/13] VFS hot tracking, procfs: add one proc interface zwu.kernel
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally,
retrieve the temperature from the hot data hash list instead of
recalculating it.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
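Usage note (below the "---", so not part of the commit message): a
minimal userspace sketch of how a tool might call the new ioctl. It
assumes the struct layout and the _IOR('f', 17, ...) encoding are copied
by hand from this patch, since the series does not appear to export them
through a userspace header. get_heat_info() is a hypothetical helper
name, and the call can only succeed on a kernel carrying this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hand-copied mirror of struct hot_heat_info from this patch (assumption:
 * no exported header provides it). */
struct hot_heat_info {
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	uint32_t temp;
	uint8_t live;
	uint8_t resv[3];
};

/* Same encoding as in the patch. */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Hypothetical helper: query heat info for the file open on fd.
 * live != 0 asks for a freshly recalculated temperature; live == 0 asks
 * for the value last stored in the map list.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int get_heat_info(int fd, int live, struct hot_heat_info *info)
{
	assert(info != NULL);
	memset(info, 0, sizeof(*info));
	info->live = live ? 1 : 0;
	return ioctl(fd, FS_IOC_GET_HEAT_INFO, info);
}
```

Per ioctl_heat_info() below, a file with no tracking data yet is
reported as ENODATA. Since the handler reads the "live" field from the
caller before filling in the rest, the sketch zeroes the struct and sets
"live" before every call.
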
 fs/compat_ioctl.c            |  5 ++++
 fs/hot_tracking.c            |  2 +-
 fs/ioctl.c                   | 70 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h | 21 +++++++++++++
 4 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 996cdc5..97bf972 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1402,6 +1403,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/* Hot data tracking */
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1581,6 +1585,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
 		if (S_ISREG(file_inode(f.file)->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 51e2e9c..aa1916d 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -379,7 +379,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-static u32 hot_temp_calc(struct hot_comm_item *ci)
+u32 hot_temp_calc(struct hot_comm_item *ci)
 {
 	u32 result = 0;
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..f9f3497 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,72 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct hot_heat_info heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	if (copy_from_user((void *)&heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	he = hot_inode_item_lookup(inode->i_sb->s_hot_root, inode->i_ino, 0);
+	if (IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	heat_info.avg_delta_reads =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+	heat_info.avg_delta_writes =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+	heat_info.last_read_time =
+	(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+	heat_info.last_write_time =
+	(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+	heat_info.num_reads =
+		(__u32) he->hot_inode.hot_freq_data.nr_reads;
+	heat_info.num_writes =
+		(__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+	if (heat_info.live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_temp_calc() to recalculate
+		 */
+		heat_info.temp = hot_temp_calc(&he->hot_inode);
+	} else {
+		/* not live temperature, get it from the map list */
+		heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
+	}
+
+	hot_comm_item_put(&he->hot_inode);
+
+	if (copy_to_user(argp, (void *)&heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+err:
+	return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +658,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FS_IOC_GET_HEAT_INFO:
+		return ioctl_heat_info(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 9f6cd71..bd683c9 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -17,6 +17,18 @@
 
 #include <linux/types.h>
 
+struct hot_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u32 temp;
+	__u8 live;
+	__u8 resv[3];
+};
+
 #ifdef __KERNEL__
 
 #include <linux/rbtree.h>
@@ -98,6 +110,14 @@ struct hot_info {
 	atomic_t run_debugfs;
 };
 
+/*
+ * Hot data tracking ioctls:
+ *
+ * FS_IOC_GET_HEAT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+			struct hot_heat_info)
+
 extern void __init hot_cache_init(void);
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
@@ -109,6 +129,7 @@ extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
 extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he,
 						loff_t start, int alloc);
 extern void hot_inode_item_delete(struct inode *inode);
+extern u32 hot_temp_calc(struct hot_comm_item *ci);
 
 static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 {
-- 
1.7.11.7



* [PATCH v3 09/13] VFS hot tracking, procfs: add one proc interface
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (7 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 08/13] VFS hot tracking: add one ioctl interface zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 10/13] VFS hot tracking: add memory caping function zwu.kernel
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add one proc interface, hot-update-interval, under
/proc/sys/fs/ in order to make HOT_UPDATE_INTERVAL tunable.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
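Usage note (not part of the commit message): a small sketch of reading
and setting the new tunable from C. The path follows from the procname
added to fs_table in this patch. read_int_sysctl() and
write_int_sysctl() are hypothetical helper names, and writing normally
requires root because the entry is mode 0644.

```c
#include <assert.h>
#include <stdio.h>

/* Path derived from the "hot-update-interval" entry in fs_table. */
#define HOT_UPDATE_INTERVAL_PATH "/proc/sys/fs/hot-update-interval"

/* Read one integer from a sysctl-style file. Returns 0 on success. */
static int read_int_sysctl(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write one integer to a sysctl-style file. Returns 0 on success. */
static int write_int_sysctl(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	/* fclose() flushes; a failed flush means the write did not land. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass HOT_UPDATE_INTERVAL_PATH to either helper. Because
the worker reads sysctl_hot_update_interval only when it requeues
itself, a new value should take effect once the currently armed delay
expires. The table entry uses plain proc_dointvec, so the kernel side
does no range checking on the value.
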
 fs/hot_tracking.c            | 7 +++++--
 fs/hot_tracking.h            | 3 ---
 include/linux/hot_tracking.h | 3 +++
 kernel/sysctl.c              | 7 +++++++
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index aa1916d..0d265b9 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -23,6 +23,9 @@
 
 static struct dentry *hot_debugfs_root;
 
+int sysctl_hot_update_interval __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -605,7 +608,7 @@ static void hot_update_worker(struct work_struct *work)
 
 	/* Instert next delayed work */
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 }
 
 static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
@@ -1184,7 +1187,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	/* Initialize hot tracking wq and arm one delayed work */
 	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 
 	/* Register a shrinker callback */
 	root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index fcc60ac..d1ab48b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,9 +15,6 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
-#define HOT_UPDATE_INTERVAL 150
-#define HOT_AGE_INTERVAL 300
-
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index bd683c9..f5c5769 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -110,6 +110,9 @@ struct hot_info {
 	atomic_t run_debugfs;
 };
 
+/* set how often to update temperatures (seconds) */
+extern int sysctl_hot_update_interval;
+
 /*
  * Hot data tracking ioctls:
  *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..1ba111d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1616,6 +1616,13 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= &pipe_proc_fn,
 		.extra1		= &pipe_min_size,
 	},
+	{
+		.procname	= "hot-update-interval",
+		.data		= &sysctl_hot_update_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.7.11.7



* [PATCH v3 10/13] VFS hot tracking: add memory caping function
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (8 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 09/13] VFS hot tracking, procfs: add one proc interface zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 11/13] VFS hot tracking, btrfs: add hot tracking support zwu.kernel
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce two proc interfaces, hot-mem-high-thresh and
hot-mem-low-thresh, to cap the memory consumed by
hot_inode_item and hot_range_item. Both are expressed in
units of 1 MiB; a high threshold of 0 (the default) disables the cap.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
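Worked example (not part of the commit message): the threshold check in
hot_mem_evictor() can be sketched as a pure function. This illustrates
the arithmetic only and is not the kernel code. hot_mem_evict_bytes() is
a hypothetical name; the sketch widens to 64 bits before multiplying and
adds a guard for low >= used, neither of which is in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * Returns how many bytes should be reclaimed, or 0 when no eviction is
 * needed. 'used' is the tracked memory in bytes; the thresholds are in
 * MiB, as in the sysctl entries.
 */
static uint64_t hot_mem_evict_bytes(uint64_t used, int high_mb, int low_mb)
{
	uint64_t high, low;

	assert(high_mb >= 0 && low_mb >= 0);

	if (high_mb == 0)
		return 0;		/* cap disabled */

	/* Widen first so thresholds of 2048 MiB or more do not overflow. */
	high = (uint64_t)high_mb * MIB;
	low = (uint64_t)low_mb * MIB;

	if (used <= high)
		return 0;		/* still under the high watermark */

	if (low >= used)
		return 0;		/* extra guard, not in the patch */

	return used - low;		/* shrink back down to the low watermark */
}
```

With high = 64 and low = 32, usage of 65 MiB gives an eviction target of
33 MiB, while usage of exactly 64 MiB gives none. In the patch itself
the thresholds are multiplied as int, so values of 2048 and above would
overflow where int is 32 bits.
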
 fs/hot_tracking.c            | 40 ++++++++++++++++++++++++++++++++++++++--
 fs/hot_tracking.h            | 26 ++++++++++++++++++++++++++
 include/linux/hot_tracking.h |  6 ++++++
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 0d265b9..915b48b 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -23,6 +23,12 @@
 
 static struct dentry *hot_debugfs_root;
 
+int sysctl_hot_mem_high_thresh __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_hot_mem_high_thresh);
+
+int sysctl_hot_mem_low_thresh __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_hot_mem_low_thresh);
+
 int sysctl_hot_update_interval __read_mostly = 300;
 EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
 
@@ -122,10 +128,14 @@ static void hot_comm_item_unlink(struct hot_info *root,
 			spin_lock(&he->i_lock);
 			rb_erase(&ci->rb_node, &he->hot_range_tree);
 			spin_unlock(&he->i_lock);
+
+			hot_mem_limit_sub(root, sizeof(struct hot_range_item));
 		} else {
 			spin_lock(&root->t_lock);
 			rb_erase(&ci->rb_node, &root->hot_inode_tree);
 			spin_unlock(&root->t_lock);
+
+			hot_mem_limit_sub(root, sizeof(struct hot_inode_item));
 		}
 
 		hot_comm_item_put(ci);
@@ -199,13 +209,15 @@ redo:
 		else {
 			hot_comm_item_get(&he->hot_inode);
 			spin_unlock(&root->t_lock);
-			if (he_new)
+			if (he_new) {
 				/*
 				 * Lost the race. Somebody else inserted
 				 * the item for the inode. Free the
 				 * newly allocated item.
 				 */
 				kmem_cache_free(hot_inode_item_cachep, he_new);
+				hot_mem_limit_sub(root, sizeof(struct hot_inode_item));
+			}
 
 			if (test_bit(HOT_DELETING, &he->hot_inode.delete_flag))
 				return ERR_PTR(-ENOENT);
@@ -231,6 +243,7 @@ redo:
 	if (!he_new)
 		return ERR_PTR(-ENOMEM);
 
+	hot_mem_limit_add(root, sizeof(struct hot_inode_item));
 	hot_inode_item_init(he_new, root, ino);
 
 	goto redo;
@@ -280,13 +293,15 @@ redo:
 		else {
 			hot_comm_item_get(&hr->hot_range);
 			spin_unlock(&he->i_lock);
-			if(hr_new)
+			if(hr_new) {
 				/*
 				 * Lost the race. Somebody else inserted
 				 * the item for the range. Free the
 				 * newly allocated item.
 				 */
 				kmem_cache_free(hot_range_item_cachep, hr_new);
+				hot_mem_limit_sub(root, sizeof(struct hot_range_item));
+			}
 
 			if (test_bit(HOT_DELETING, &hr->hot_range.delete_flag))
 				return ERR_PTR(-ENOENT);
@@ -312,6 +327,7 @@ redo:
 	if (!hr_new)
 		return ERR_PTR(-ENOMEM);
 
+	hot_mem_limit_add(root, sizeof(struct hot_range_item));
 	hot_range_item_init(hr_new, he, start);
 
 	goto redo;
@@ -570,6 +586,22 @@ static void hot_item_evictor(struct hot_info *root, unsigned long work,
 	}
 }
 
+static void hot_mem_evictor(struct hot_info *root)
+{
+	unsigned long work;
+
+	if (sysctl_hot_mem_high_thresh == 0)
+		return;
+
+	/* note: the sysctl_hot_mem_*_thresh values are in units of 1 MiB */
+	if (hot_mem_limit(root) <= sysctl_hot_mem_high_thresh * 1024 * 1024)
+		return;
+
+	work = hot_mem_limit(root) - sysctl_hot_mem_low_thresh * 1024 * 1024;
+
+	hot_item_evictor(root, work, hot_mem_limit);
+}
+
 /*
  * Every sync period we update temperatures for
  * each hot inode item and hot range item for aging
@@ -584,6 +616,8 @@ static void hot_update_worker(struct work_struct *work)
 	struct hot_inode_item *he;
 	int i, j;
 
+	hot_mem_evictor(root);
+
 	rcu_read_lock();
 	node = rb_first(&root->hot_inode_tree);
 	while (node) {
@@ -1235,6 +1269,7 @@ int hot_track_init(struct super_block *sb)
 	if (IS_ERR(root))
 		return PTR_ERR(root);
 
+	hot_mem_limit_init(root);
 	sb->s_hot_root = root;
 
 	ret = hot_debugfs_init(sb);
@@ -1264,6 +1299,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = sb->s_hot_root;
 
+	hot_mem_limit_exit(root);
 	hot_debugfs_exit(sb);
 	hot_tree_exit(root);
 	sb->s_hot_root = NULL;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d1ab48b..be9f5cd 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -45,4 +45,30 @@ struct hot_debugfs {
 	const struct file_operations *fops;
 };
 
+/* Memory Tracking Functions. */
+static inline unsigned long hot_mem_limit(struct hot_info *root)
+{
+	return percpu_counter_read(&root->mem);
+}
+
+static inline void hot_mem_limit_sub(struct hot_info *root, int i)
+{
+	percpu_counter_add(&root->mem, -i);
+}
+
+static inline void hot_mem_limit_add(struct hot_info *root, int i)
+{
+	percpu_counter_add(&root->mem, i);
+}
+
+static inline void hot_mem_limit_init(struct hot_info *root)
+{
+	percpu_counter_init(&root->mem, 0);
+}
+
+static inline void hot_mem_limit_exit(struct hot_info *root)
+{
+	percpu_counter_destroy(&root->mem);
+}
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index f5c5769..03e5026 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -16,6 +16,7 @@
 #define _LINUX_HOTTRACK_H
 
 #include <linux/types.h>
+#include <linux/percpu_counter.h>
 
 struct hot_heat_info {
 	__u64 avg_delta_reads;
@@ -108,10 +109,15 @@ struct hot_info {
 	struct shrinker hot_shrink;
 	struct dentry *debugfs_dentry;
 	atomic_t run_debugfs;
+
+	struct percpu_counter   mem ____cacheline_aligned_in_smp;
 };
 
 /* set how often to update temperatures (seconds) */
 extern int sysctl_hot_update_interval;
+/* note: the sysctl_hot_mem_*_thresh values are in units of 1 MiB */
+extern int sysctl_hot_mem_high_thresh;
+extern int sysctl_hot_mem_low_thresh;
 
 /*
  * Hot data tracking ioctls:
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1ba111d..753585d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1617,6 +1617,20 @@ static struct ctl_table fs_table[] = {
 		.extra1		= &pipe_min_size,
 	},
 	{
+		.procname       = "hot-mem-high-thresh",
+		.data           = &sysctl_hot_mem_high_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "hot-mem-low-thresh",
+		.data           = &sysctl_hot_mem_low_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
 		.procname	= "hot-update-interval",
 		.data		= &sysctl_hot_update_interval,
 		.maxlen		= sizeof(int),
-- 
1.7.11.7



* [PATCH v3 11/13] VFS hot tracking, btrfs: add hot tracking support
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (9 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 10/13] VFS hot tracking: add memory caping function zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 12/13] VFS hot tracking: add documentation zwu.kernel
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce one new mount option '-o hot_track',
and add its parsing support.
  Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
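Usage note (not part of the commit message): since btrfs_show_options()
emits ",hot_track" when the option is set, a tool can tell whether
tracking is enabled by inspecting the options field of a /proc/mounts
entry. The sketch below checks a comma-separated option string for an
exact token match. has_mount_opt() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if 'name' appears as a whole token in the comma-separated
 * option string 'opts', 0 otherwise. Partial matches such as
 * "nohot_track" or "hot_tracking" do not count.
 */
static int has_mount_opt(const char *opts, const char *name)
{
	size_t len = strlen(name);

	assert(len > 0);
	while (*opts) {
		const char *end = strchr(opts, ',');
		size_t tok = end ? (size_t)(end - opts) : strlen(opts);

		if (tok == len && strncmp(opts, name, len) == 0)
			return 1;
		if (!end)
			break;
		opts = end + 1;	/* skip past the comma to the next token */
	}
	return 0;
}
```

For example, has_mount_opt("rw,relatime,hot_track", "hot_track") is true
and has_mount_opt("rw,relatime", "hot_track") is false.
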
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d6dd49b..745cac4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1927,6 +1927,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f0857e0..f13517b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
 #include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -306,6 +307,10 @@ static void btrfs_put_super(struct super_block *sb)
 	 * last process that kept it busy.  Or segfault in the aforementioned
 	 * process...  Whom would you report that to?
 	 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -318,7 +323,7 @@ enum {
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
-	Opt_check_integrity_print_mask, Opt_fatal_errors,
+	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
 	Opt_err,
 };
 
@@ -359,6 +364,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_err, NULL},
 };
 
@@ -624,6 +630,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -843,11 +852,20 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		err = hot_track_init(sb);
+		if (err)
+			goto fail_hot;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_hot:
+	dput(sb->s_root);
+	sb->s_root = NULL;
 fail_close:
 	close_ctree(fs_info->tree_root);
 	return err;
@@ -943,6 +961,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",skip_balance");
 	if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
 		seq_puts(seq, ",fatal_errors=panic");
+	if (btrfs_test_opt(root, HOT_TRACK))
+		seq_puts(seq, ",hot_track");
 	return 0;
 }
 
-- 
1.7.11.7



* [PATCH v3 12/13] VFS hot tracking: add documentation
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (10 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 11/13] VFS hot tracking, btrfs: add hot tracking support zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-21 12:17 ` [PATCH v3 13/13] VFS hot tracking: add fs hot type support zwu.kernel
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Add documentation for the VFS hot tracking feature.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 252 +++++++++++++++++++++++++++++
 2 files changed, 254 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..2454472 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..b7547f3
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,252 @@
+Hot Data Tracking
+
+April, 2013		Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  The feature adds support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects what data is "hot", so that a filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work is divided into two parts: the VFS provides hot data tracking,
+while each specific filesystem provides hot data relocation.
+As the first step toward this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+The design includes the following parts:
+
+    * Hooks in existing vfs functions to track data access frequency
+
+    * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+    The relationship between super_block and the rb-trees is as follows:
+super_block.s_hot_root -> hot_info.hot_inode_tree
+    Each FS instance can find its hot tracking info via s_hot_root.
+    hot_info has hot_inode_tree, which holds each inode's hot information,
+and each inode item has hot_range_tree, which holds its ranges' hot information.
+
+    * Lists of hot inodes and hot ranges, grouped by temperature
+
+    * A debugfs interface for dumping data from the rb-trees
+
+    * A work queue for updating inode heat info
+
+    * A mount option for enabling temperature tracking (-o hot_track;
+disabled by default)
+    * An ioctl to retrieve the frequency information collected for a certain
+file
+    * Ioctls to enable/disable frequency tracking per inode.
+
+Their relationship is as follows:
+
+    * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+    * hot_inode_item contains access frequency data for that inode
+
+    * hot_inode_item holds a heat list node to link the access frequency
+data for that inode
+
+    * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+    * hot_range_item contains access frequency data for that range
+
+    * hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+    * hot_info.heat_inode_map indexes per-inode heat list nodes
+
+    * hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+                          super_block
+                              |
+                              V
+                           hot_info
+                              |
+    +-------------------------+----------------------------------------+
+    |                         |                                        |
+    |                         |                                        |
+    V                         V                                        V
+heat_inode_map           hot_inode_tree                         heat_range_map
+    |                         |                                        |
+    |                         V                                        |
+    |           +-------hot_comm_item--------+                         |
+    |           |       frequency data       |                         |
++---+           |        list_head           |                         |
+|               V                            V                         |
+| ...<--hot_comm_item-->...      ...<--hot_comm_item-->...             |
+        frequency data                 frequency data                  |
+          list_head                      list_head                     |
+       hot_range_tree                  hot_range_tree                  |
+                                             |                         |
+                                             V                         |
+                               +-------hot_comm_item--------+          |
+                               |       frequency data       |          |
+                               |        list_head           |          +---+
+                               V            ^ |             V		   |
+                    <--hot_comm_item-->...  | |  ...<--hot_comm_item-->... |
+                         frequency data               frequency data
+                           list_head                    list_head
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_freq_calc()
+
+  This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+  FREQ_POWER, defined immediately below, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+  The following explains what exactly comprises a unit of heat.
+Each of six heat values is calculated and combined in order to form an
+overall temperature for the data:
+
+    * NRR - number of reads since mount
+    * NRW - number of writes since mount
+    * LTR - time elapsed since last read (ns)
+    * LTW - time elapsed since last write (ns)
+    * AVR - average delta between recent reads (ns)
+    * AVW - average delta between recent writes (ns)
+
+  These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined in fs/hot_tracking.h to bring the numbers into a reasonable
+range. You can modify these values to fit your needs. However, each heat
+unit is a u32 and thus maxes out at 2^32 - 1. Therefore, you must choose your
+dividers carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+  Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+    * AVR/AVW cold unit = 2^X ns of average delta
+    * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+  E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+  This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+  To accomplish this, the raw values from the hot_freq_data structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+  Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+  Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+  This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+2.) Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/hot_track/
+total 0
+drwxr-xr-x 2 root root 0 Aug  8 04:40 sdb
+$ ls -l /sys/kernel/debug/hot_track/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug  8 04:40 inode_stat
+-rw-r--r-- 1 root root 0 Aug  8 04:40 extent_stat
+
+3.) View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 1, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576, reads 0, writes 1, temp 64
+
+$ echo "hot data tracking test" >> /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 2, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576, reads 0, writes 2, temp 64
+
+4.) Check the temperature-sorted list of some inodes:
+
+$ cat /sys/kernel/debug/hot_track/loop0/inode_spot
+inode 5248773, reads 0, writes 244, temp 111
+inode 878523, reads 0, writes 1, temp 109
+inode 878524, reads 0, writes 1, temp 109
+
+5.) Tune some hot tracking parameters as below:
+
+$ echo 360 > /proc/sys/fs/hot-age-interval
+$ cat /proc/sys/fs/hot-update-interval
+300
+$ echo 360 > /proc/sys/fs/hot-update-interval
+$ cat /proc/sys/fs/hot-update-interval
+360
+
-- 
1.7.11.7
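
Illustration (not part of the patch): the FREQ_POWER weighting described in
section 4 of hot_tracking.txt above can be pictured as a shift-only running
average. The following is a minimal userspace sketch, not the code in
fs/hot_tracking.c; the function name ema_update() and the value
FREQ_POWER = 4 are assumptions chosen to match the "1/16th" example in
the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; 4 matches the "weighted 1/16th" example in the text. */
#define FREQ_POWER 4

/*
 * Fold one new access delta (in ns) into a running average using only
 * shifts: new_avg = (avg * (2^k - 1) + delta) / 2^k, with k = FREQ_POWER.
 * ema_update() is a hypothetical name used for this sketch only.
 */
static uint64_t ema_update(uint64_t avg, uint64_t delta_ns)
{
	uint64_t scaled;

	/* Keep the shift and the sum below within 64 bits. */
	assert(avg <= (UINT64_MAX >> FREQ_POWER));
	assert(delta_ns <= (UINT64_MAX >> FREQ_POWER));

	scaled = (avg << FREQ_POWER) - avg + delta_ns;
	return scaled >> FREQ_POWER;
}
```

With FREQ_POWER = 4, a stream of identical deltas leaves the average
unchanged, while a single outlier moves it by only 1/16 of the difference.
That is the precision-versus-memory trade-off the documentation describes.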



* [PATCH v3 13/13] VFS hot tracking: add fs hot type support
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (11 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 12/13] VFS hot tracking: add documentation zwu.kernel
@ 2013-06-21 12:17 ` zwu.kernel
  2013-06-24 13:41 ` [PATCH v3 00/13] VFS hot tracking Zhi Yong Wu
  2013-06-28 16:03 ` Al Viro
  14 siblings, 0 replies; 21+ messages in thread
From: zwu.kernel @ 2013-06-21 12:17 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: viro, sekharan, linuxram, david, chris.mason, jbacik, Zhi Yong Wu

From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

  Introduce the ability for a specific FS to
register its own hot tracking functions.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 28 +++++++++++++++++++---------
 fs/hot_tracking.h            | 13 +++++++++++++
 fs/ioctl.c                   |  2 +-
 include/linux/fs.h           |  1 +
 include/linux/hot_tracking.h | 20 +++++++++++++++++++-
 5 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 915b48b..dbc90d4 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -54,7 +54,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	hr->start = start;
-	hr->len = hot_shift(1, RANGE_BITS, true);
+	hr->len = hot_shift(1, he->hot_root->hot_type->range_bits, true);
 	hr->hot_inode = he;
 	hr->storage_type = -1;
 	hot_comm_item_init(&hr->hot_range, TYPE_RANGE);
@@ -273,10 +273,11 @@ struct hot_range_item
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
+	struct hot_info *root = he->hot_root;
 	struct hot_comm_item *ci;
 	struct hot_range_item *hr, *hr_new = NULL;
 
-	start = hot_shift(start, RANGE_BITS, true);
+	start = hot_shift(start, root->hot_type->range_bits, true);
 
 	/* walk tree to find insertion point */
 redo:
@@ -367,13 +368,13 @@ static void hot_freq_update(struct hot_info *root,
 
 	if (write) {
 		freq_data->nr_writes += 1;
-		hot_freq_calc(freq_data->last_write_time,
+		HOT_FREQ_CALC(root, freq_data->last_write_time,
 				cur_time,
 				&freq_data->avg_delta_writes);
 		freq_data->last_write_time = cur_time;
 	} else {
 		freq_data->nr_reads += 1;
-		hot_freq_calc(freq_data->last_read_time,
+		HOT_FREQ_CALC(root, freq_data->last_read_time,
 				cur_time,
 				&freq_data->avg_delta_reads);
 		freq_data->last_read_time = cur_time;
@@ -398,7 +399,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-u32 hot_temp_calc(struct hot_comm_item *ci)
+static u32 hot_temp_calc(struct hot_comm_item *ci)
 {
 	u32 result = 0;
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
@@ -470,7 +471,7 @@ u32 hot_temp_calc(struct hot_comm_item *ci)
 static bool hot_map_update(struct hot_info *root,
 			struct hot_comm_item *ci)
 {
-	u32 temp = hot_temp_calc(ci);
+	u32 temp = HOT_TEMP_CALC(root, ci);
 	u8 cur_temp, prev_temp;
 	bool flag = false;
 
@@ -1164,10 +1165,10 @@ void hot_update_freqs(struct inode *inode, loff_t start,
 	 * Align ranges on range size boundary
 	 * to prevent proliferation of range structs
 	 */
-	range_size  = hot_shift(1, RANGE_BITS, true);
+	range_size  = hot_shift(1, root->hot_type->range_bits, true);
 	end = hot_shift((start + len + range_size - 1),
-			RANGE_BITS, false);
-	cur = hot_shift(start, RANGE_BITS, false);
+			root->hot_type->range_bits, false);
+	cur = hot_shift(start, root->hot_type->range_bits, false);
 	for (; cur < end; cur++) {
 		hr = hot_range_item_lookup(he, cur, 1);
 		if (IS_ERR(hr)) {
@@ -1209,6 +1210,15 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 			INIT_LIST_HEAD(&root->hot_map[j][i]);
 	}
 
+	/* Get hot type for specific FS */
+	root->hot_type = &sb->s_type->hot_type;
+	if (!HOT_FREQ_FN_EXIST(root))
+		SET_HOT_FREQ_FN(root, hot_freq_calc);
+	if (!HOT_TEMP_FN_EXIST(root))
+		SET_HOT_TEMP_FN(root, hot_temp_calc);
+	if (root->hot_type->range_bits == 0)
+		root->hot_type->range_bits = RANGE_BITS;
+
 	root->update_wq = alloc_workqueue(
 			"hot_update_wq", WQ_NON_REENTRANT, 0);
 	if (!root->update_wq) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index be9f5cd..b5f043c 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -40,6 +40,19 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
 #define AVW_COEFF_POWER 0
 
+#define HOT_FREQ_FN_EXIST(root) \
+	((root)->hot_type->ops.hot_freq_calc)
+#define HOT_TEMP_FN_EXIST(root) \
+	((root)->hot_type->ops.hot_temp_calc)
+
+#define HOT_FREQ_CALC(root, lt, ct, avg) \
+	((root)->hot_type->ops.hot_freq_calc(lt, ct, avg))
+
+#define SET_HOT_FREQ_FN(root, fn) \
+	(root)->hot_type->ops.hot_freq_calc = fn
+#define SET_HOT_TEMP_FN(root, fn) \
+	(root)->hot_type->ops.hot_temp_calc = fn
+
 struct hot_debugfs {
 	const char *name;
 	const struct file_operations *fops;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index f9f3497..95ec029 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -585,7 +585,7 @@ static int ioctl_heat_info(struct file *file, void __user *argp)
 		 * got a request for live temperature,
 		 * call hot_calc_temp() to recalculate
 		 */
-		heat_info.temp = hot_temp_calc(&he->hot_inode);
+		heat_info.temp = HOT_TEMP_CALC(he->hot_root, &he->hot_inode);
 	} else {
 		/* not live temperature, get it from the map list */
 		heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ee2c54f..dda3b9c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1817,6 +1817,7 @@ struct file_system_type {
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
 	void (*kill_sb) (struct super_block *);
+	struct hot_type hot_type;
 	struct module *owner;
 	struct file_system_type * next;
 	struct hlist_head fs_supers;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 03e5026..1009377 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -98,6 +98,24 @@ struct hot_range_item {
 	int storage_type;			/* type of storage */
 };
 
+typedef void (hot_freq_calc_fn) (struct timespec old_atime,
+				struct timespec cur_time, u64 *avg);
+typedef u32 (hot_temp_calc_fn) (struct hot_comm_item *ci);
+
+struct hot_func_ops {
+	hot_freq_calc_fn *hot_freq_calc;
+	hot_temp_calc_fn *hot_temp_calc;
+};
+
+/* identifies a hot type */
+struct hot_type {
+	u64 range_bits;
+	struct hot_func_ops ops;	/* fields provided by specific FS */
+};
+
+#define HOT_TEMP_CALC(root, ci) \
+	((root)->hot_type->ops.hot_temp_calc(ci))
+
 struct hot_info {
 	struct rb_root hot_inode_tree;
 	spinlock_t t_lock;				/* protect above tree */
@@ -106,6 +124,7 @@ struct hot_info {
 	atomic_t hot_map_nr;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct hot_type *hot_type;
 	struct shrinker hot_shrink;
 	struct dentry *debugfs_dentry;
 	atomic_t run_debugfs;
@@ -138,7 +157,6 @@ extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
 extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he,
 						loff_t start, int alloc);
 extern void hot_inode_item_delete(struct inode *inode);
-extern u32 hot_temp_calc(struct hot_comm_item *ci);
 
 static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 {
-- 
1.7.11.7
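
Illustration (not part of the patch): the fallback logic added to
hot_tree_init() above can be modelled in plain C. This is a sketch under
assumptions, not the kernel code: struct hot_type_model,
hot_type_fill_defaults(), generic_temp_calc(), myfs_temp_calc() and the
default of 20 range bits are hypothetical names and values used only to
show the pattern "use the filesystem's callback if set, otherwise the
generic one".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default for this sketch: 2^20 byte (1 MiB) ranges. */
#define DEFAULT_RANGE_BITS 20

typedef uint32_t (*temp_calc_fn)(uint64_t nr_reads, uint64_t nr_writes);

/* Simplified stand-in for struct hot_type; field names are hypothetical. */
struct hot_type_model {
	uint64_t range_bits;    /* 0 means "use the default" */
	temp_calc_fn temp_calc; /* NULL means "use the generic calculation" */
};

/* Generic calculation used when a filesystem registers nothing. */
static uint32_t generic_temp_calc(uint64_t nr_reads, uint64_t nr_writes)
{
	return (uint32_t)(nr_reads + nr_writes);
}

/* A hypothetical filesystem that only cares about writes. */
static uint32_t myfs_temp_calc(uint64_t nr_reads, uint64_t nr_writes)
{
	(void)nr_reads;
	return (uint32_t)(2 * nr_writes);
}

/* Fill in whatever the filesystem left unset. */
static void hot_type_fill_defaults(struct hot_type_model *t)
{
	assert(t != NULL);
	if (!t->temp_calc)
		t->temp_calc = generic_temp_calc;
	if (t->range_bits == 0)
		t->range_bits = DEFAULT_RANGE_BITS;
}
```

A filesystem that sets nothing ends up with the generic callback and the
default range size; one that sets its own values keeps them untouched.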



* Re: [PATCH v3 00/13] VFS hot tracking
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (12 preceding siblings ...)
  2013-06-21 12:17 ` [PATCH v3 13/13] VFS hot tracking: add fs hot type support zwu.kernel
@ 2013-06-24 13:41 ` Zhi Yong Wu
  2013-06-28 16:03 ` Al Viro
  14 siblings, 0 replies; 21+ messages in thread
From: Zhi Yong Wu @ 2013-06-24 13:41 UTC (permalink / raw)
  To: torvalds
  Cc: viro, sekharan, Ram Pai, Dave Chinner, chris.mason, jbacik,
	Zhi Yong Wu, linux-fsdevel

Hi, Linus

   Do you have any comments or thoughts on this patchset? It has been
blocked for a long time.


On Fri, Jun 21, 2013 at 8:17 PM,  <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
>   The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> By it, you will easily know more details about disk I/O, and
> then detect where disk I/O hot spots are. Also, specific FS
> can take use of it to do accurate defragment, and hot relocation
> support, etc.
>
>   After V1 was sent out, Chandra Seetharaman has reviewed and
> made a lot of comments, thanks a lot to him. Now it's time to
> send out its V3 for external review, any comments or ideas are
> appreciated, thanks.
>
> NOTE:
>
>   The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
>   If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
>   For how to use and more other info and performance report,
> please check hot_tracking.txt in Documentation and following
> links:
>   1.) http://lwn.net/Articles/525651/
>   2.) https://lkml.org/lkml/2012/12/20/199
>
> Changelog from v2:
>  - Added memory caping function for hot items [Zhiyong]
>  - Cleanup aging function [Zhiyong]
>
> v2:
>  - Refactored to be under RCU [Chandra Seetharaman]
>  - Merged some code changes [Chandra Seetharaman]
>  - Fixed some issues [Chandra Seetharaman]
>
> v1:
>  - Solved 64 bits inode number issue. [David Sterba]
>  - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
>  - Cleanup Some issues [David Sterba]
>  - Use a static hot debugfs root [Greg KH]
>
> rfcv4:
>  - Introduce hot func registering framework [Zhiyong]
>  - Remove global variable for hot tracking [Zhiyong]
>  - Add btrfs hot tracking support [Zhiyong]
>
> rfcv3:
>  1.) Rewritten debugfs support based seq_file operation. [Dave Chinner]
>  2.) Refactored workqueue support. [Dave Chinner]
>  3.) Turn some Micro into be tunable [Zhiyong, Liu Zheng]
>        TIME_TO_KICK, and HEAT_UPDATE_DELAY
>  4.) Cleanedup a lot of other issues [Dave Chinner]
>
>
> rfcv2:
>  1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
>  2.) Added memory shrinker [Dave Chinner]
>  3.) Converted to one workqueue to update map info periodically [Dave Chinner]
>  4.) Cleanedup a lot of other issues [Dave Chinner]
>
> rfcv1:
>  1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
>  2.) The first three patches can probably just be flattened into one.
>                                         [Marco Stornelli , Dave Chinner]
>
> Zhi Yong Wu (13):
>   VFS hot tracking: introduce some data structures
>   VFS hot tracking: add i/o freq tracking hooks
>   VFS hot tracking: add one wq to update hot map
>   VFS hot tracking: register one shrinker
>   VFS hot tracking, rcu: introduce one rcu macro for list
>   VFS hot tracking, seq_file: add seq_list rcu interfaces
>   VFS hot tracking: add debugfs support
>   VFS hot tracking: add one ioctl interface
>   VFS hot tracking, procfs: add one proc interface
>   VFS hot tracking: add memory caping function
>   VFS hot tracking, btrfs: add hot tracking support
>   VFS hot tracking: add documentation
>   VFS hot tracking: add fs hot type support
>
>  Documentation/filesystems/00-INDEX         |    2 +
>  Documentation/filesystems/hot_tracking.txt |  252 ++++++
>  fs/Makefile                                |    2 +-
>  fs/btrfs/ctree.h                           |    1 +
>  fs/btrfs/super.c                           |   22 +-
>  fs/compat_ioctl.c                          |    5 +
>  fs/dcache.c                                |    2 +
>  fs/direct-io.c                             |    5 +
>  fs/hot_tracking.c                          | 1318 ++++++++++++++++++++++++++++
>  fs/hot_tracking.h                          |   87 ++
>  fs/ioctl.c                                 |   70 ++
>  fs/namei.c                                 |    2 +
>  fs/seq_file.c                              |   37 +
>  include/linux/fs.h                         |    5 +
>  include/linux/hot_tracking.h               |  176 ++++
>  include/linux/rculist.h                    |    5 +
>  include/linux/seq_file.h                   |    7 +
>  kernel/sysctl.c                            |   21 +
>  mm/filemap.c                               |    6 +
>  mm/page-writeback.c                        |   12 +
>  mm/readahead.c                             |    6 +
>  21 files changed, 2041 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/filesystems/hot_tracking.txt
>  create mode 100644 fs/hot_tracking.c
>  create mode 100644 fs/hot_tracking.h
>  create mode 100644 include/linux/hot_tracking.h
>
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Regards,

Zhi Yong Wu


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
                   ` (13 preceding siblings ...)
  2013-06-24 13:41 ` [PATCH v3 00/13] VFS hot tracking Zhi Yong Wu
@ 2013-06-28 16:03 ` Al Viro
  2013-07-01 13:19   ` Zhi Yong Wu
  2013-07-02 12:45   ` Zhi Yong Wu
  14 siblings, 2 replies; 21+ messages in thread
From: Al Viro @ 2013-06-28 16:03 UTC (permalink / raw)
  To: zwu.kernel
  Cc: linux-fsdevel, sekharan, linuxram, david, chris.mason, jbacik,
	Zhi Yong Wu

On Fri, Jun 21, 2013 at 08:17:09PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
>   The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> By it, you will easily know more details about disk I/O, and
> then detect where disk I/O hot spots are. Also, specific FS
> can take use of it to do accurate defragment, and hot relocation
> support, etc.
> 
>   After V1 was sent out, Chandra Seetharaman has reviewed and
> made a lot of comments, thanks a lot to him. Now it's time to
> send out its V3 for external review, any comments or ideas are
> appreciated, thanks.

First of all, my apologies for obscenely long delay with review.  I've
started doing it several times and dropped these attempts getting mired
in the nightmare of lifetime rules in there.  What I should've done was
sending the obvious low-hanging fruits right then and waiting for the
variant with that stuff sanitized ;-/

First of all, one general observation: please, separate inode and range stuff
clearly; do *not* use the same functions on both, it only makes it harder to read.
I.e. kill hot_comm_item and split the functions that take it.  This code is
trying to be too generic for its own good and ends up being obfuscated to hell
and back...

refcounting:
	* the whole refcount + "DELETING" flag approach is a bad idea.
hot_comm_item_unlink() tries to be idempotent *and* includes dropping the
reference on its first call.  Schemes like that tend to be either pointless
(i.e. we know that it won't be called twice for the same object) or prone
to stepping on dangling pointers.  This one does the latter (at least).
	* lookups (both for inode and range) leak on race with unlink.
You grab a reference, drop the lock, check if HOT_DELETING is set and
return an error if it is; how the hell could the caller possibly undo
that reference grab, when it has no idea what object had been grabbed
in the first place?  Moreover, that check does *not* prevent getting
an object with HOT_DELETING from those functions - unlink coming just
after that check will be unnoticed.

hot_inode_item_delete() mess:
	* hot_inode_item_delete() is done on unlink(2), no matter how many
links are there.  Why?
	* hot_inode_item_delete() is done even if unlink() fails (e.g. with
EBUSY, or on whatever error ->unlink() might return).
	* hot_inode_item_delete() grabs a reference to hot_inode_item,
*drops* *it*, then does hot_comm_item_unlink().  What protects it from being
freed right as we drop the damn reference?

debugfs-related issues - debugfs is completely unsuitable for dynamic
objects and you step into that big way, in addition to races of your
own:
	* creation of hot_debugfs_root is racy - WTF prevents
two hot_track_init() in parallel?
	* if creation fails, we leave hot_debugfs_root ERR_PTR(something);
from that point on, no attempts to create it will be done (check for
hot_debugfs_root being *NULL*)
	* what guarantees that sb->s_id is unique?
	* removal on failure - screwed; first of all, list_empty()
will *not* be true if we have somebody open it (cursors are inserted into
that list).  Moreover, what's to stop another hot_debugfs_init() from
being called just as we are doing debugfs_remove(hot_debugfs_root) and
see that it's non-NULL *before* we get to assigning NULL there?
	* hot_debugfs_exit() - screwed in the same way.
	* debugfs files get ->i_private set to corresponding sb->s_hot_root.
It's copied to seq->private on open() and used by iterators.  WTF prevents
open on debugfs, followed by umount of corresponding btrfs volume, freeing of
sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
in ->start()?
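
Illustration (not part of the original message): the lookup leak described
under "refcounting" above can be shown with a small userspace model. This
is a sketch under assumptions, not the patchset's code: struct item,
table_lock, lookup_leaky() and lookup_checked() are hypothetical names,
and a pthread mutex stands in for the kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracked item. */
struct item {
	int refs;      /* protected by table_lock in this sketch */
	bool deleting; /* set once by unlink, under table_lock */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * The pattern the review objects to: the reference is taken first, the
 * flag is checked after the lock is dropped, and on failure the caller
 * gets NULL and has no way to drop the reference that was taken.
 */
static struct item *lookup_leaky(struct item *it)
{
	assert(it != NULL);
	pthread_mutex_lock(&table_lock);
	it->refs++;
	pthread_mutex_unlock(&table_lock);
	if (it->deleting)
		return NULL; /* the extra reference is leaked here */
	return it;
}

/*
 * One possible fix: decide under the lock and take the reference only
 * when the item is going to be returned.
 */
static struct item *lookup_checked(struct item *it)
{
	struct item *ret = NULL;

	assert(it != NULL);
	pthread_mutex_lock(&table_lock);
	if (!it->deleting) {
		it->refs++;
		ret = it;
	}
	pthread_mutex_unlock(&table_lock);
	return ret;
}
```

Note that even the checked variant can return an item that is unlinked a
moment later; the reference only keeps the memory alive. Callers still have
to cope with items that start being deleted after the lookup returns.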


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-06-28 16:03 ` Al Viro
@ 2013-07-01 13:19   ` Zhi Yong Wu
  2013-07-03 13:30     ` Al Viro
  2013-07-02 12:45   ` Zhi Yong Wu
  1 sibling, 1 reply; 21+ messages in thread
From: Zhi Yong Wu @ 2013-07-01 13:19 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, sekharan, Ram Pai, Dave Chinner, chris.mason,
	jbacik, Zhi Yong Wu

On Sat, Jun 29, 2013 at 12:03 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jun 21, 2013 at 08:17:09PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   The patchset is trying to introduce hot tracking function in
>> VFS layer, which will keep track of real disk I/O in memory.
>> By it, you will easily know more details about disk I/O, and
>> then detect where disk I/O hot spots are. Also, specific FS
>> can take use of it to do accurate defragment, and hot relocation
>> support, etc.
>>
>>   After V1 was sent out, Chandra Seetharaman has reviewed and
>> made a lot of comments, thanks a lot to him. Now it's time to
>> send out its V3 for external review, any comments or ideas are
>> appreciated, thanks.
>
> First of all, my apologies for obscenely long delay with review.  I've
It is exciting to get your comments, thanks a lot.
> started doing it several times and dropped these attempts getting mired
> in the nightmare of lifetime rules in there.  What I should've done was
> sending the obvious low-hanging fruits right then and waiting for the
> variant with that stuff sanitized ;-/
>
> First of all, one general observation: please, separate inode and range stuff
> clearly; do *not* use the same functions on both, it only makes harder to read.
> I.e. kill hot_comm_item and split the functions that take it.  This code is
OK, I will follow up on it.
> trying to be too generic for its own good and ends up being obfuscated to hell
> and back...
>
> refcounting:
>         * the whole refcount + "DELETING" flag approach is a bad idea.
Do you have a better suggestion? I adopted this approach mainly to avoid
an object that has already been deleted being relinked to the hot list
when the hot worker runs.
> hot_comm_item_unlink() tries to be idempotent *and* includes dropping the
> reference on its first call.  Schemes like that tend to be either pointless
> (i.e. we know that it won't be called twice for the same object) or prone
> to stepping on dangling pointers.  This one does the latter (at least).
Do you mean we should move the test_and_set_bit() condition check
out of hot_comm_item_unlink()?
>         * lookups (both for inode and range) leak on race with unlink.
> You grab a reference, drop the lock, check if HOT_DELETING is set and
> return an error if it is; how the hell could the caller possibly undo
> that reference grab, when it has no idea what object had been grabbed
I don't get what you mean; could you elaborate in more detail?
Which caller? In which scenario would this take place?
> in the first place?  Moreover, that check does *not* prevent getting
> an object with HOT_DELETING from those functions - unlink coming just
> after that check will be unnoticed.
Good catch, thanks.
>
> hot_inode_item_delete() mess:
>         * hot_inode_item_delete() is done on unlink(2), no matter how many
> links are there.  Why?
Good catch, will fix it.
>         * hot_inode_item_delete() is done even if unlink() fails (e.g. with
> EBUSY, or on whatever error ->unlink() might return).
Good catch, will fix it.
>         * hot_inode_item_delete() grabs a reference to hot_inode_item,
> *drops* *it*, then does hot_comm_item_unlink().  What protects it from being
> freed right as we drop the damn reference?
Good catch, we should do hot_comm_item_unlink() first, then *drop* this ref.
>
> debugfs-related issues - debugfs is completely unsuitable for dynamic
> objects and you step into that big way, in addition to races of your
**** Is debugfs still completely unsuitable for dynamic objects even
if we fix the issues you list below? Do you have a better
approach in mind?
> own:
>         * creation of hot_debugfs_root is racy - WTF prevents
hot_debugfs_root is shared by all disk volumes, and is the debugfs root
for hot tracking.
Why do you think it is racy?
> two hot_track_init() in parallel?
I think that mount() makes sure hot_track_init() is done once
for the same super block, right? hot_track_init() is done *only*
when mount is issued.
Or do you mean mount() cannot make sure hot_track_init() is done serially?

>         * if creation fails, we leave hot_debugfs_root ERR_PTR(something);
> from that point on, no attempts to create it will be done (check for
> hot_debugfs_root being *NULL*)
Good catch, will fix it.
>         * what guarantees that sb->s_id is unique?
Sorry, can you let me know where sb->s_id would not be unique?
>         * removal on failure - screwed; first of all, list_empty()
On failure, must it be removed? No.
> will *not* be true if we have somebody open it (cursors are inserted into
Yes, in this case it will not be removed. What issue arises here?
> that list).  Moreover, what's to stop another hot_debugfs_init() from
> being called just as we are doing debugfs_remove(hot_debugfs_root) and
> see that it's non-NULL *before* we get to assigning NULL there?
Good catch, it needs to be reentrant.
>         * hot_debugfs_exit() - screwed in the same way.
Ditto.
>         * debugfs files get ->i_private set to corresponding sb->s_hot_root.
> It's copied to seq->private on open() and used by iterators.  WTF prevents
> open on debugfs, followed by umount of corresponding btrfs volume, freeing of
> sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
> in ->start()?
Good catch, will fix it.



--
Regards,

Zhi Yong Wu


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-06-28 16:03 ` Al Viro
  2013-07-01 13:19   ` Zhi Yong Wu
@ 2013-07-02 12:45   ` Zhi Yong Wu
  1 sibling, 0 replies; 21+ messages in thread
From: Zhi Yong Wu @ 2013-07-02 12:45 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, sekharan, Ram Pai, Dave Chinner, chris.mason,
	jbacik, Zhi Yong Wu

Hi, Al,

  Thanks a lot for your review. I still have questions about your
comments and posted them in my previous reply. Could you answer them
when you are available? Thanks.


On Sat, Jun 29, 2013 at 12:03 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jun 21, 2013 at 08:17:09PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>
>>   The patchset is trying to introduce hot tracking function in
>> VFS layer, which will keep track of real disk I/O in memory.
>> By it, you will easily know more details about disk I/O, and
>> then detect where disk I/O hot spots are. Also, specific FS
>> can take use of it to do accurate defragment, and hot relocation
>> support, etc.
>>
>>   After V1 was sent out, Chandra Seetharaman has reviewed and
>> made a lot of comments, thanks a lot to him. Now it's time to
>> send out its V3 for external review, any comments or ideas are
>> appreciated, thanks.
>
> First of all, my apologies for obscenely long delay with review.  I've
> started doing it several times and dropped these attempts getting mired
> in the nightmare of lifetime rules in there.  What I should've done was
> sending the obvious low-hanging fruits right then and waiting for the
> variant with that stuff sanitized ;-/
>
> First of all, one general observation: please, separate inode and range stuff
> clearly; do *not* use the same functions on both, it only makes harder to read.
> I.e. kill hot_comm_item and split the functions that take it.  This code is
> trying to be too generic for its own good and ends up being obfuscated to hell
> and back...
>
> refcounting:
>         * the whole refcount + "DELETING" flag approach is a bad idea.
> hot_comm_item_unlink() tries to be idempotent *and* includes dropping the
> reference on its first call.  Schemes like that tend to be either pointless
> (i.e. we know that it won't be called twice for the same object) or prone
> to stepping on dangling pointers.  This one does the latter (at least).
>         * lookups (both for inode and range) leak on race with unlink.
> You grab a reference, drop the lock, check if HOT_DELETING is set and
> return an error if it is; how the hell could the caller possibly undo
> that reference grab, when it has no idea what object had been grabbed
> in the first place?  Moreover, that check does *not* prevent getting
> an object with HOT_DELETING from those functions - unlink coming just
> after that check will be unnoticed.
>
> hot_inode_item_delete() mess:
>         * hot_inode_item_delete() is done on unlink(2), no matter how many
> links are there.  Why?
>         * hot_inode_item_delete() is done even if unlink() fails (e.g. with
> EBUSY, or on whatever error ->unlink() might return).
>         * hot_inode_item_delete() grabs a reference to hot_inode_item,
> *drops* *it*, then does hot_comm_item_unlink().  What protects it from being
> freed right as we drop the damn reference?
>
> debugfs-related issues - debugfs is completely unsuitable for dynamic
> objects and you step into that big way, in addition to races of your
> own:
>         * creation of hot_debugfs_root is racy - WTF prevents
> two hot_track_init() in parallel?
>         * if creation fails, we leave hot_debugfs_root ERR_PTR(something);
> from that point on, no attempts to create it will be done (check for
> hot_debugfs_root being *NULL*)
>         * what guarantees that sb->s_id is unique?
>         * removal on failure - screwed; first of all, list_empty()
> will *not* be true if we have somebody open it (cursors are inserted into
> that list).  Moreover, what's to stop another hot_debugfs_init() from
> being called just as we are doing debugfs_remove(hot_debugfs_root) and
> see that it's non-NULL *before* we get to assigning NULL there?
>         * hot_debugfs_exit() - screwed in the same way.
>         * debugfs files get ->i_private set to corresponding sb->s_hot_root.
> It's copied to seq->private on open() and used by iterators.  WTF prevents
> open on debugfs, followed by umount of corresponding btrfs volume, freeing of
> sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
> in ->start()?



-- 
Regards,

Zhi Yong Wu


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-07-01 13:19   ` Zhi Yong Wu
@ 2013-07-03 13:30     ` Al Viro
  2013-07-03 15:16       ` Zhi Yong Wu
  0 siblings, 1 reply; 21+ messages in thread
From: Al Viro @ 2013-07-03 13:30 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: linux-fsdevel, sekharan, Ram Pai, Dave Chinner, chris.mason,
	jbacik, Zhi Yong Wu

On Mon, Jul 01, 2013 at 09:19:08PM +0800, Zhi Yong Wu wrote:

> >         * lookups (both for inode and range) leak on race with unlink.
> > You grab a reference, drop the lock, check if HOT_DELETING is set and
> > return an error if it is; how the hell could the caller possibly undo
> > that reference grab, when it has no idea what object had been grabbed
> I don't get what you mean, do you elaborate it with more details?
> which caller? which scenario will it take place in?

Sigh...  Suppose that test has failed and you've returned ERR_PTR(-ENOENT).
If that ever happens (and I don't see what would prevent that, seeing
that e.g. unlink(2) can happen at any point, that you are not holding any
locks at that point and that unlink(2) wouldn't care for any of your locks
anyway) you are going to have a leak - you've grabbed a reference a few
lines above and once you return, there's no way to tell which object had
been grabbed, let alone do the matching kref_put().
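
A minimal userspace sketch of the leak described above, next to a balanced variant. The `struct item` type and the `lookup_*` and `item_*` names are hypothetical stand-ins, not the patchset's code; a plain counter stands in for the kref, a boolean for HOT_DELETING, and a NULL return with an error code for ERR_PTR(-ENOENT). Locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracked item. */
struct item {
	int refs;	/* stand-in for a struct kref */
	bool deleting;	/* stand-in for the HOT_DELETING flag */
};

static void item_get(struct item *it)
{
	it->refs++;
}

static void item_put(struct item *it)
{
	assert(it->refs > 0);
	it->refs--;
}

/*
 * Leaky pattern: the reference is taken first, then the flag check fails
 * and only an error comes back. The caller never learns which item was
 * pinned, so nobody can do the matching put.
 */
static struct item *lookup_leaky(struct item *found, int *err)
{
	item_get(found);
	/* the lock would be dropped here in the kernel code */
	if (found->deleting) {
		*err = -ENOENT;
		return NULL;	/* the reference on 'found' is lost */
	}
	*err = 0;
	return found;
}

/* Balanced variant: undo the grab before reporting the failure. */
static struct item *lookup_balanced(struct item *found, int *err)
{
	item_get(found);
	if (found->deleting) {
		item_put(found);
		*err = -ENOENT;
		return NULL;
	}
	*err = 0;
	return found;
}
```

The balanced variant only plugs the leak. It does nothing about the other problem raised in the review, namely that an unlink arriving just after the check still hands the caller an item that is being deleted.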

> > debugfs-related issues - debugfs is completely unsuitable for dynamic
> > objects and you step into that big way, in addition to races of your
> **** Is debugfs also completely unsuitable for dynamic objects even
> though we fix the following issues listed by you? Do you have any
> better way about this?

The last issue in that list (IO hours after object removal) is just about
unfixable without debugfs overhaul.

> >         * creation of hot_debugfs_root is racy - WTF prevents
> hot_debugfs_root is public for all disk volumes, and is debugfs root
> for hot tracking.
> Why do you think it is racy?
> > two hot_track_init() in parallel?
> I think that mount() will make sure hot_track_init() will be done once
> for the same super block, right? hot_track_init will be done *only*
> when mount is issued.
> That is, mount() can not make sure hot_track_init() is done serially?

What the devil would serialize mount on completely unrelated devices?
And what for?
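
Since nothing serializes mounts of unrelated devices, any shared root has to bring its own serialization. The sketch below is one hypothetical way to do that in userspace C with a pthread mutex and a user count; `root_get()`, `root_put()`, `shared_root` and `root_users` are illustrative names, not functions or fields from the patchset. Creation and removal both happen under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state: one root used by every mounted volume. */
static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;
static int root_storage;	/* stand-in for the root dentry */
static int *shared_root;	/* NULL while no volume uses it */
static int root_users;		/* number of volumes using the root */

/* Called once per mount; mounts of unrelated devices may run in parallel. */
static int *root_get(void)
{
	int *root;

	pthread_mutex_lock(&root_lock);
	if (root_users == 0)
		shared_root = &root_storage;	/* "create" under the lock */
	root_users++;
	root = shared_root;
	pthread_mutex_unlock(&root_lock);
	return root;
}

/* Called once per umount; the last user tears the root down. */
static void root_put(void)
{
	pthread_mutex_lock(&root_lock);
	assert(root_users > 0);
	if (--root_users == 0)
		shared_root = NULL;		/* "remove" under the same lock */
	pthread_mutex_unlock(&root_lock);
}
```

This covers only the creation and removal races. It does not help with a file that is still open when its backing object goes away.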

> >         * what guarantees that sb->s_id is unique?
> sorry, can you let me know where sb->s_id will be not unique?

On any number of filesystem types it isn't; are you making that a restriction
on fs types that can use your stuff?

> >         * removal on failure - screwed; first of all, list_empty()
> On failure, it must be removed? no.

Then what would eventually remove it?  And why bother with lazy creation,
if so?

> >         * hot_debugfs_exit() - screwed in the same way.
> Ditto.

Again, if you don't mind that thing sticking around indefinitely, just
because somebody happened to do ls on it at the moment it would've been
removed otherwise, WTF remove it at all?

> >         * debugfs files get ->i_private set to corresponding sb->s_hot_root.
> > It's copied to seq->private on open() and used by iterators.  WTF prevents
> > open on debugfs, followed by umount of corresponding btrfs volume, freeing of
> > sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
> > in ->start()?
> Good catch, will fix it.

How?

As for the lifetime rules suggestions - depends on what you want to achieve.
How long should those objects live?  In which cases can we get an attempt
of ...unlink... more than once on the same object?


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-07-03 13:30     ` Al Viro
@ 2013-07-03 15:16       ` Zhi Yong Wu
  2013-07-08 12:44         ` Zhi Yong Wu
  0 siblings, 1 reply; 21+ messages in thread
From: Zhi Yong Wu @ 2013-07-03 15:16 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, sekharan, Ram Pai, Dave Chinner, chris.mason,
	jbacik, Zhi Yong Wu

Hi, Al Viro,

   Thanks for your comments and for patiently answering my questions. I
still have a few more:
1.) Since you think the whole refcount + "DELETING" flag approach is
a bad idea, do you have another idea? Or do you want me to get rid
of the ref count entirely?
      The ref count was introduced mainly with hot relocation
support, etc. in mind.

2.) For debugfs, I ran into many races along the way and fixed a lot
of them. Since you think it is completely unsuitable for dynamic
objects, I want to remove debugfs. What do you think?

3.) If the hot_comm_item struct is killed entirely, it will
introduce a lot of duplicated code in the hot_update_worker function,
so I want to keep it but split the functions that take it.
That will make those functions clearer. What do you think?

I will address all your other comments and repost the next version soon.

On Wed, Jul 3, 2013 at 9:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jul 01, 2013 at 09:19:08PM +0800, Zhi Yong Wu wrote:
>
>> >         * lookups (both for inode and range) leak on race with unlink.
>> > You grab a reference, drop the lock, check if HOT_DELETING is set and
>> > return an error if it is; how the hell could the caller possibly undo
>> > that reference grab, when it has no idea what object had been grabbed
>> I don't get what you mean, do you elaborate it with more details?
>> which caller? which scenario will it take place in?
>
> Sigh...  Suppose that test has failed and you've returned ERR_PTR(-ENOENT).
> If that ever happens (and I don't see what would prevent that, seeing
> that e.g. unlink(2) can happen at any point, that you are not holding any
> locks at that point and that unlink(2) wouldn't care for any of your locks
> anyway) you are going to have a leak - you've grabbed a reference a few
> lines above and once you return, there's no way to tell which object had
> been grabbed, let alone do the matching kref_put().
I get it now and will fix it; thanks for the explanation.

Given all these additional issues, I have decided not to push the debugfs part for now.
>
>> > debugfs-related issues - debugfs is completely unsuitable for dynamic
>> > objects and you step into that big way, in addition to races of your
>> **** Is debugfs also completely unsuitable for dynamic objects even
>> though we fix the following issues listed by you? Do you have any
>> better way about this?
>
> The last issue in that list (IO hours after object removal) is just about
> unfixable without debugfs overhaul.
OK, I see. I had fixed a lot of issues, but many of those you raised remain.
>
>> >         * creation of hot_debugfs_root is racy - WTF prevents
>> hot_debugfs_root is public for all disk volumes, and is debugfs root
>> for hot tracking.
>> Why do you think it is racy?
>> > two hot_track_init() in parallel?
>> I think that mount() will make sure hot_track_init() will be done once
>> for the same super block, right? hot_track_init will be done *only*
>> when mount is issued.
>> That is, mount() can not make sure hot_track_init() is done serially?
>
> What the devil would serialize mount on completely unrelated devices?
> And what for?
>
>> >         * what guarantees that sb->s_id is unique?
>> sorry, can you let me know where sb->s_id will be not unique?
>
> On any number of filesystem types it isn't; are you making that a restriction
> on fs types that can use your stuff?
>
>> >         * removal on failure - screwed; first of all, list_empty()
>> On failure, it must be removed? no.
>
> Then what would eventually remove it?  And why bother with lazy creation,
> if so?
>
>> >         * hot_debugfs_exit() - screwed in the same way.
>> Ditto.
>
> Again, if you don't mind that thing sticking around indefinitely, just
> because somebody happened to do ls on it at the moment it would've been
> removed otherwise, WTF remove it at all?
>
>> >         * debugfs files get ->i_private set to corresponding sb->s_hot_root.
>> > It's copied to seq->private on open() and used by iterators.  WTF prevents
>> > open on debugfs, followed by umount of corresponding btrfs volume, freeing of
>> > sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
>> > in ->start()?
>> Good catch, will fix it.
>
> How?
>
> As for the lifetime rules suggestions - depends on what you want to achieve.
> How long should those objects live?  In which cases can we get an attempt
> of ...unlink... more than once on the same object?

--
Regards,

Zhi Yong Wu


* Re: [PATCH v3 00/13] VFS hot tracking
  2013-07-03 15:16       ` Zhi Yong Wu
@ 2013-07-08 12:44         ` Zhi Yong Wu
  0 siblings, 0 replies; 21+ messages in thread
From: Zhi Yong Wu @ 2013-07-08 12:44 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, sekharan, Ram Pai, Zhi Yong Wu

Hi, Al Viro,

  Could you give us some suggestions when you are available, so
that we can agree on how to design the following points?
Thanks.

On Wed, Jul 3, 2013 at 11:16 PM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> HI, Al Viro,
>
>    Thanks for your comments and patiently explaining my questions. I
> still have some questions as below now:
> 1.)  Since you think the whole refcount + "DELETING" flag approach is
> a bad idea, do you have any other idea? or do you want me to get rid
> of ref count totally?
>       the ref count is introduced mainly to take hot relocation
> support, etc. into account.
>
> 2.) For debugfs, I had a lot of racy issue along the way, and had
> fixed a lot, since you think that it is completely unsuitable for
> dynamic objects, i want to remove debugfs, what do you think of it?
>
> 3.) For hot_comm_item struct, if it is killed totally, it will
> introduce a lot of duplicate codes for the hot_update_worker function,
> so i want to keep it there, but split those functions which take it.
> This will make those functions clear. What do you think of it?
>
> I will rework all your other comments and repost the next version soon.
>
> On Wed, Jul 3, 2013 at 9:30 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Jul 01, 2013 at 09:19:08PM +0800, Zhi Yong Wu wrote:
>>
>>> >         * lookups (both for inode and range) leak on race with unlink.
>>> > You grab a reference, drop the lock, check if HOT_DELETING is set and
>>> > return an error if it is; how the hell could the caller possibly undo
>>> > that reference grab, when it has no idea what object had been grabbed
>>> I don't get what you mean, do you elaborate it with more details?
>>> which caller? which scenario will it take place in?
>>
>> Sigh...  Suppose that test has failed and you've returned ERR_PTR(-ENOENT).
>> If that ever happens (and I don't see what would prevent that, seeing
>> that e.g. unlink(2) can happen at any point, that you are not holding any
>> locks at that point and that unlink(2) wouldn't care for any of your locks
>> anyway) you are going to have a leak - you've grabbed a reference a few
>> lines above and once you return, there's no way to tell which object had
>> been grabbed, let alone do the matching kref_put().
> i got it now, and will fix it, thanks for what you explain.
>
> With all these additional issues i decided to not push the debugfs part for now.
>>
>>> > debugfs-related issues - debugfs is completely unsuitable for dynamic
>>> > objects and you step into that big way, in addition to races of your
>>> **** Is debugfs also completely unsuitable for dynamic objects even
>>> though we fix the following issues listed by you? Do you have any
>>> better way about this?
>>
>> The last issue in that list (IO hours after object removal) is just about
>> unfixable without debugfs overhaul.
> OK, i see. I had fixed a lot of issues, but it still has a lot raised by you.
>>
>>> >         * creation of hot_debugfs_root is racy - WTF prevents
>>> hot_debugfs_root is public for all disk volumes, and is debugfs root
>>> for hot tracking.
>>> Why do you think it is racy?
>>> > two hot_track_init() in parallel?
>>> I think that mount() will make sure hot_track_init() will be done once
>>> for the same super block, right? hot_track_init will be done *only*
>>> when mount is issued.
>>> That is, mount() can not make sure hot_track_init() is done serially?
>>
>> What the devil would serialize mount on completely unrelated devices?
>> And what for?
>>
>>> >         * what guarantees that sb->s_id is unique?
>>> sorry, can you let me know where sb->s_id will be not unique?
>>
>> On any number of filesystem types it isn't; are you making that a restriction
>> on fs types that can use your stuff?
>>
>>> >         * removal on failure - screwed; first of all, list_empty()
>>> On failure, it must be removed? no.
>>
>> Then what would eventually remove it?  And why bother with lazy creation,
>> if so?
>>
>>> >         * hot_debugfs_exit() - screwed in the same way.
>>> Ditto.
>>
>> Again, if you don't mind that thing sticking around indefinitely, just
>> because somebody happened to do ls on it at the moment it would've been
>> removed otherwise, WTF remove it at all?
>>
>>> >         * debugfs files get ->i_private set to corresponding sb->s_hot_root.
>>> > It's copied to seq->private on open() and used by iterators.  WTF prevents
>>> > open on debugfs, followed by umount of corresponding btrfs volume, freeing of
>>> > sb->s_hot_root and then read() on our file stepping into kfree'd hot_root
>>> > in ->start()?
>>> Good catch, will fix it.
>>
>> How?
>>
>> As for the lifetime rules suggestions - depends on what you want to achieve.
>> How long should those objects live?  In which cases can we get an attempt
>> of ...unlink... more than once on the same object?
>
> --
> Regards,
>
> Zhi Yong Wu



--
Regards,

Zhi Yong Wu


Thread overview: 21+ messages
2013-06-21 12:17 [PATCH v3 00/13] VFS hot tracking zwu.kernel
2013-06-21 12:17 ` [PATCH v3 01/13] VFS hot tracking: introduce some data structures zwu.kernel
2013-06-21 12:17 ` [PATCH v3 02/13] VFS hot tracking: add i/o freq tracking hooks zwu.kernel
2013-06-21 12:17 ` [PATCH v3 03/13] VFS hot tracking: add one wq to update hot map zwu.kernel
2013-06-21 12:17 ` [PATCH v3 04/13] VFS hot tracking: register one shrinker zwu.kernel
2013-06-21 12:17 ` [PATCH v3 05/13] VFS hot tracking, rcu: introduce one rcu macro for list zwu.kernel
2013-06-21 12:17 ` [PATCH v3 06/13] VFS hot tracking, seq_file: add seq_list rcu interfaces zwu.kernel
2013-06-21 12:17 ` [PATCH v3 07/13] VFS hot tracking: add debugfs support zwu.kernel
2013-06-21 12:17 ` [PATCH v3 08/13] VFS hot tracking: add one ioctl interface zwu.kernel
2013-06-21 12:17 ` [PATCH v3 09/13] VFS hot tracking, procfs: add one proc interface zwu.kernel
2013-06-21 12:17 ` [PATCH v3 10/13] VFS hot tracking: add memory caping function zwu.kernel
2013-06-21 12:17 ` [PATCH v3 11/13] VFS hot tracking, btrfs: add hot tracking support zwu.kernel
2013-06-21 12:17 ` [PATCH v3 12/13] VFS hot tracking: add documentation zwu.kernel
2013-06-21 12:17 ` [PATCH v3 13/13] VFS hot tracking: add fs hot type support zwu.kernel
2013-06-24 13:41 ` [PATCH v3 00/13] VFS hot tracking Zhi Yong Wu
2013-06-28 16:03 ` Al Viro
2013-07-01 13:19   ` Zhi Yong Wu
2013-07-03 13:30     ` Al Viro
2013-07-03 15:16       ` Zhi Yong Wu
2013-07-08 12:44         ` Zhi Yong Wu
2013-07-02 12:45   ` Zhi Yong Wu
